Human In The Loop Software Development Agents
- Jirat Pasuksmit tl;dr: “Recently, we created the ‘Human-in-the-loop LLM-based agents framework’, or HULA. HULA reads a Jira work item, creates a plan, writes code, and even raises a pull request. And it does all of this while keeping the engineer in the driver’s seat. So far, HULA has merged ~900 pull requests for Atlassian software engineers, saving their time and allowing them to focus on other important tasks.”
featured in #610
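The HULA workflow described above (read a Jira work item, propose a plan, generate code, then raise a pull request, with the engineer approving each stage) can be pictured with a minimal sketch like the one below. Every name here, such as WorkItem, propose_plan, and human_approves, is a hypothetical illustration of a human-in-the-loop control flow, not Atlassian's actual API.

```python
# Hypothetical sketch of a human-in-the-loop coding-agent workflow.
# None of these names come from HULA itself; they only illustrate the
# read-issue -> plan -> code -> pull-request loop with human approval gates.
from dataclasses import dataclass


@dataclass
class WorkItem:
    key: str          # e.g. a Jira issue key like "PROJ-123"
    description: str  # the task the agent should tackle


def propose_plan(item: WorkItem) -> list[str]:
    """Ask an LLM for a step-by-step plan (stubbed out here)."""
    return [f"Investigate: {item.description}", "Implement change", "Add tests"]


def generate_changes(plan: list[str]) -> dict[str, str]:
    """Ask an LLM to draft file edits for the approved plan (stubbed)."""
    return {"src/example.py": "# generated patch would go here\n"}


def human_approves(prompt: str) -> bool:
    """The engineer stays in the driver's seat: every stage needs a yes."""
    return input(f"{prompt} [y/N] ").strip().lower() == "y"


def run(item: WorkItem) -> None:
    plan = propose_plan(item)
    if not human_approves(f"Use this plan for {item.key}? {plan}"):
        return  # engineer rejected the plan; stop here
    changes = generate_changes(plan)
    if human_approves(f"Raise a pull request with edits to {list(changes)}?"):
        print(f"PR raised for {item.key}")  # real system would call the VCS API


if __name__ == "__main__":
    run(WorkItem(key="PROJ-123", description="Fix null check in login handler"))
```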
Innovations In Evaluating AI Agent Performance
tl;dr: Just like athletes need more than one drill to win a competition, AI agents require consistent training based on real-world performance metrics to excel in their role. At QA Wolf, we’ve developed weighted “gym scenarios” to simulate real-world challenges and track the agents’ progress over time. How does our AI use these metrics to continuously improve our accuracy?
featured in #609
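One plausible reading of the weighted “gym scenarios” mentioned above is a weighted average of per-scenario pass rates, tracked over time. The scenario names and weights below are invented for illustration; QA Wolf's actual benchmark may be structured differently.

```python
# Hypothetical weighted-scenario scoring; scenario names and weights are
# illustrative, not QA Wolf's actual benchmark.
scenarios = {
    # name: (weight, passes, attempts)
    "flaky-selector recovery": (0.5, 42, 50),
    "multi-step checkout flow": (0.3, 27, 30),
    "login with 2FA": (0.2, 18, 20),
}

# Weighted average of pass rates, tracked release over release.
score = sum(w * (passes / attempts) for w, passes, attempts in scenarios.values())
print(f"Weighted agent score: {score:.2%}")
```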
Claude Code: Best Practices For Agentic Coding
- Boris Cherny tl;dr: “This post outlines general patterns that have proven effective, both for Anthropic's internal teams and for external engineers using Claude Code across various codebases, languages, and environments. Nothing in this list is set in stone nor universally applicable; consider these suggestions as starting points.”
featured in #609
You Make Your Evals, Then Your Evals Make You.
- Tongfei Chen, Yury Zemlyanskiy tl;dr: The post introduces AugmentQA, a benchmark for evaluating code retrieval systems using real-world software development scenarios rather than synthetic problems. AugmentQA uses real codebases, developer questions, and keyword-based evaluation, and shows that open-source models which excel on synthetic benchmarks struggle with these more realistic tasks.
featured in #603
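The keyword-based evaluation mentioned above can be read as checking whether retrieved code contains the keywords a developer's question should surface. The sketch below uses an assumed scoring rule for illustration; it is not AugmentQA's exact metric.

```python
# Hypothetical keyword-recall check for a code-retrieval result; the exact
# metric AugmentQA uses may differ. This only illustrates the idea.
def keyword_recall(retrieved_files: list[str], expected_keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the retrieved code."""
    corpus = "\n".join(retrieved_files).lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in corpus)
    return hits / len(expected_keywords) if expected_keywords else 0.0


retrieved = ["def rotate_api_key(user): ...", "class KeyStore: ..."]
print(keyword_recall(retrieved, ["rotate_api_key", "KeyStore", "expiry"]))  # ~0.67
```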
Tracing The Thoughts Of A Large Language Model
tl;dr: Anthropic presents research on interpreting how Claude "thinks" internally. By developing an "AI microscope," they examine the mechanisms behind Claude's abilities across languages, reasoning, poetry, and mathematics. These insights not only reveal the model's internal strategies but also support efforts to make AI more transparent.
featured in #603