Human In The Loop Software Development Agents
- Jirat Pasuksmit tl;dr: “Recently, we created the ‘Human-in-the-loop LLM-based agents framework’, or HULA. HULA reads a Jira work item, creates a plan, writes code, and even raises a pull request. And it does all of this while keeping the engineer in the driver’s seat. So far, HULA has merged ~900 pull requests for Atlassian software engineers, saving their time and allowing them to focus on other important tasks.”
featured in #610
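The HULA workflow described above (read a Jira work item, propose a plan, generate code, then raise a pull request, with the engineer approving each stage) can be pictured with a minimal sketch like the one below. Every name here, such as WorkItem, propose_plan, and human_approves, is a hypothetical illustration of a human-in-the-loop control flow, not Atlassian's actual API.

```python
# Hypothetical sketch of a human-in-the-loop coding-agent workflow.
# None of these names come from HULA itself; they only illustrate the
# read-issue -> plan -> code -> pull-request loop with human approval gates.
from dataclasses import dataclass


@dataclass
class WorkItem:
    key: str          # e.g. a Jira issue key like "PROJ-123"
    description: str  # the task the agent should tackle


def propose_plan(item: WorkItem) -> list[str]:
    """Ask an LLM for a step-by-step plan (stubbed out here)."""
    return [f"Investigate: {item.description}", "Implement change", "Add tests"]


def generate_changes(plan: list[str]) -> dict[str, str]:
    """Ask an LLM to draft file edits for the approved plan (stubbed)."""
    return {"src/example.py": "# generated patch would go here\n"}


def human_approves(prompt: str) -> bool:
    """The engineer stays in the driver's seat: every stage needs a yes."""
    return input(f"{prompt} [y/N] ").strip().lower() == "y"


def run(item: WorkItem) -> None:
    plan = propose_plan(item)
    if not human_approves(f"Use this plan for {item.key}? {plan}"):
        return  # engineer rejected the plan; stop here
    changes = generate_changes(plan)
    if human_approves(f"Raise a pull request with edits to {list(changes)}?"):
        print(f"PR raised for {item.key}")  # real system would call the VCS API


if __name__ == "__main__":
    run(WorkItem(key="PROJ-123", description="Fix null check in login handler"))
```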
Innovations In Evaluating AI Agent Performance
tl;dr: Just like athletes need more than one drill to win a competition, AI agents require consistent training based on real-world performance metrics to excel in their role. At QA Wolf, we’ve developed weighted “gym scenarios” to simulate real-world challenges and track the agents’ progress over time. How does our AI use these metrics to continuously improve our accuracy?
featured in #609
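One plausible reading of the weighted “gym scenarios” mentioned above is a weighted average of per-scenario pass rates, tracked over time. The scenario names and weights below are invented for illustration; QA Wolf's actual benchmark may be structured differently.

```python
# Hypothetical weighted-scenario scoring; scenario names and weights are
# illustrative, not QA Wolf's actual benchmark.
scenarios = {
    # name: (weight, passes, attempts)
    "flaky-selector recovery": (0.5, 42, 50),
    "multi-step checkout flow": (0.3, 27, 30),
    "login with 2FA": (0.2, 18, 20),
}

# Weighted average of pass rates, tracked release over release.
score = sum(w * (passes / attempts) for w, passes, attempts in scenarios.values())
print(f"Weighted agent score: {score:.2%}")
```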
Claude Code: Best Practices For Agentic Coding
- Boris Cherny tl;dr: “This post outlines general patterns that have proven effective, both for Anthropic's internal teams and for external engineers using Claude Code across various codebases, languages, and environments. Nothing in this list is set in stone nor universally applicable; consider these suggestions as starting points.”
featured in #609
You Make Your Evals, Then Your Evals Make You.
- Tongfei Chen, Yury Zemlyanskiy tl;dr: The post introduces AugmentQA, a benchmark for evaluating code retrieval systems using real-world software development scenarios rather than synthetic problems. AugmentQA uses real codebases, developer questions, and keyword-based evaluation, and shows that open-source models which excel on synthetic benchmarks struggle with these more realistic tasks.
featured in #603
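The keyword-based evaluation mentioned above can be read as checking whether retrieved code contains the keywords a developer's question should surface. The sketch below uses an assumed scoring rule for illustration; it is not AugmentQA's exact metric.

```python
# Hypothetical keyword-recall check for a code-retrieval result; the exact
# metric AugmentQA uses may differ. This only illustrates the idea.
def keyword_recall(retrieved_files: list[str], expected_keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the retrieved code."""
    corpus = "\n".join(retrieved_files).lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in corpus)
    return hits / len(expected_keywords) if expected_keywords else 0.0


retrieved = ["def rotate_api_key(user): ...", "class KeyStore: ..."]
print(keyword_recall(retrieved, ["rotate_api_key", "KeyStore", "expiry"]))  # ~0.67
```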
Tracing The Thoughts Of A Large Language Model
tl;dr: Anthropic presents research on interpreting how Claude "thinks" internally. By developing an "AI microscope," they examine the mechanisms behind Claude's abilities across languages, reasoning, poetry, and mathematics. These insights not only reveal the model's internal strategies but also support efforts to make AI more transparent.
featured in #603