/Tests

6 Hard Lessons We Learned About Automated Testing For GenAI Apps

- John Gluck tl;dr: Testing LLMs is not simple. The probabilistic output makes failures hard to identify, and running the models repeatedly becomes very expensive quickly. In this blog post, QA Wolf engineer John Gluck covers 6 things the team learned about building automated black-box regression tests for genAI applications.

featured in #531


Drata Secured 86% Faster QA Cycles

tl;dr: QA Wolf is delivering QA at DrataSpeed: (1) Regression testing is 90 minutes faster than before and includes 4x more test cases. (2) QA Wolf onboarded quickly and gave Drata’s QA resources space to work on new features, saving more than $500,000/year. (3) Deploys went from overnight to multiple times daily.

featured in #530


6 Hard Lessons We Learned About Automated Testing For GenAI Apps

- John Gluck tl;dr: Testing LLMs is not simple. The probabilistic output makes failures hard to identify, and running the models repeatedly becomes very expensive quickly. In this blog post, QA Wolf engineer John Gluck covers 6 things the team learned about building automated black-box regression tests for genAI applications.

featured in #529


Autotrader Saved $620K/YR Trading In Manual Testing For Automation

tl;dr: Automated testing with cruise control let Autotrader: (1) Offset the need to hire six QA engineers, saving $600K+/year. (2) Return more than 1,000 hours per year to the customer support team, saving $20,000/year. (3) Increase release velocity 15–20%. (4) Reduce QA cycles from 3+ days to 15 minutes.

featured in #528


Getting 100% Code Coverage Doesn't Eliminate Bugs

- Kostis Kapelonis tl;dr: “There are many articles already on the net explaining why this is a fallacy, but I recently discovered that sharing an actual code example goes a long way towards proving why 100% code coverage doesn’t mean zero bugs. These people have their “aha” moment when they look at real code, instead of recycling theoretical arguments over and over.”

featured in #527
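
To make the coverage point concrete, here is a minimal sketch of our own (it is not the example from Kostis's article). The `average` function and its test are hypothetical. The single test executes every line, so a coverage tool would report 100%, yet an empty list still crashes the function.

```python
def average(values):
    """Return the arithmetic mean of a list of numbers (hypothetical example)."""
    # Bug: an empty list makes len(values) zero, so this line raises
    # ZeroDivisionError. No test below exercises that input.
    return sum(values) / len(values)


def test_average():
    # This one test runs every line of average(), giving 100% line coverage.
    assert average([2, 4, 6]) == 4


if __name__ == "__main__":
    test_average()
    print("all tests pass, coverage is 100%, and average([]) still fails")
```

Coverage measures which lines ran, not which inputs were tried, which is why the failing case goes unnoticed here.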


Debugging With Production Neighbors

tl;dr: SLATE is Uber’s E2E testing tool for microservice architectures that allows testing of services alongside production dependencies. It enables developers to generate test requests mimicking production flows while targeting services under test. This blog explores three debugging options in SLATE: remote debugging of deployed instances, local debugging on developer machines, and debugging through filtered monitoring. These features aim to simplify troubleshooting in production-like environments.

featured in #525


Flaky Tests Overhaul At Uber

tl;dr: “A few years ago, we started tackling flaky tests in an effort to stabilize CI experience across our monorepos. The project first debuted in our Java monorepo and received good results in driving down frictions in developers’ workflow. However, as we evolved our CI infrastructure and started onboarding it to our largest repository with the most users, Go Monorepo, the stop-gap solution became increasingly challenging to scale to the scope.” The authors discuss a centralized system to track all tests. 

featured in #521


How To Test

- Alex Kladov tl;dr: “This post describes my current approach to testing. When I started programming professionally, I knew how to write good code, but good tests remained a mystery for a long time. This is not due to the lack of advice — on the contrary, there’s abundance of information & terminology about testing.”

featured in #520


Test-Driving HTML Templates

- Matteo Vaccari tl;dr: “When building a server-side rendered web application, it is valuable to test the HTML that's generated through templates. While these can be tested through end-to-end tests running in the browser, such tests are slow and more work to maintain than unit tests. Unit tests, written in the server-side environment, can check for valid HTML, and extract elements with CSS selectors to test the details of generated HTML. These test cases can be defined in a simple data structure to make them easier to understand and enhance.”

featured in #518
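
A rough sketch of the idea described above, under several assumptions: `render_todo_list` is a hypothetical stand-in for a real template, and to stay within the Python standard library it uses `xml.etree.ElementTree` with simple XPath expressions in place of the CSS selectors the article mentions. Parsing doubles as a well-formedness check, and the test cases live in a plain list of dictionaries.

```python
import xml.etree.ElementTree as ET
from html import escape


def render_todo_list(items):
    """Hypothetical template: render (title, done) pairs as an HTML list."""
    rows = "".join(
        '<li class="{cls}">{title}</li>'.format(
            cls="done" if done else "todo",
            title=escape(title),  # escape user-supplied text
        )
        for title, done in items
    )
    return '<ul id="todos">{}</ul>'.format(rows)


# Test cases as a simple data structure: input, query, expected text.
CASES = [
    {"name": "empty list", "items": [], "xpath": ".//li", "want": []},
    {
        "name": "open item",
        "items": [("Buy milk", False)],
        "xpath": ".//li[@class='todo']",
        "want": ["Buy milk"],
    },
    {
        "name": "title is escaped",
        "items": [("<b>x</b>", True)],
        "xpath": ".//li[@class='done']",
        "want": ["<b>x</b>"],
    },
]


def run_cases():
    for case in CASES:
        # ET.fromstring raises ParseError if the markup is not well formed.
        root = ET.fromstring(render_todo_list(case["items"]))
        got = [li.text for li in root.findall(case["xpath"])]
        assert got == case["want"], case["name"]


if __name__ == "__main__":
    run_cases()
```

Adding a scenario means adding one dictionary to `CASES`, which keeps the checks easy to read and extend.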


Avoid The Long Parameter List

- Gene Volovich tl;dr: “Always try to group data that belongs together and break up long, complicated parameter lists. The result will be code that is easier to read and maintain, and harder to make mistakes with.” Gene shares examples.

featured in #517
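
As an illustration of the advice above (our own sketch, not one of Gene's examples), the hypothetical functions below show the same operation with a flat parameter list and with related values grouped into small dataclasses. All names are made up for the example.

```python
from dataclasses import dataclass


# Before: five positional strings that are easy to pass in the wrong order.
def create_user_flat(first_name, last_name, street, city, postal_code):
    return {
        "name": f"{first_name} {last_name}",
        "address": f"{street}, {city} {postal_code}",
    }


# After: data that belongs together is grouped into two small types.
@dataclass(frozen=True)
class Name:
    first: str
    last: str


@dataclass(frozen=True)
class Address:
    street: str
    city: str
    postal_code: str


def create_user(name: Name, address: Address) -> dict:
    return {
        "name": f"{name.first} {name.last}",
        "address": f"{address.street}, {address.city} {address.postal_code}",
    }


if __name__ == "__main__":
    user = create_user(
        Name(first="Ada", last="Lovelace"),
        Address(street="1 Main St", city="London", postal_code="N1"),
    )
    print(user)
```

With the grouped version, swapping a city for a street at the call site requires naming the wrong field explicitly, which is much harder to do by accident.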