Your AI-written tests prove nothing
A green CI badge is not evidence. A test suite that mocks a function and then asserts the mock was called has proven one thing: that mocks work. That is not news.
LLMs are extraordinarily good at generating code that looks like tests. They produce the imports, the setup, the teardown, the describe blocks, the matchers. What they are not good at is generating tests that would fail if the code under test were wrong. That second property is the whole point of testing, and it is the one that gets quietly lost.
I have spent the last eighteen months auditing test suites in AI-generated codebases. The patterns repeat across languages and frameworks. Here are the four I see most often, and the audit technique that catches them in an afternoon.
1. The mock that asserts itself
The most common pattern. The LLM writes a test, realises the function under test calls a dependency, mocks the dependency, and then - because something has to be in the expect block - asserts that the mock was called.
test('createUser saves to database', async () => {
  const saveMock = jest.fn().mockResolvedValue({ id: 1 });
  const db = { users: { save: saveMock } };
  const service = new UserService(db);
  await service.createUser({ name: 'Alice' });
  expect(saveMock).toHaveBeenCalled();
});
This test will pass if createUser calls db.users.save. It will also pass if createUser calls db.users.save with complete garbage, wraps it in a try-catch that swallows the error, and returns undefined. The mock confirms that the call happened. It confirms nothing about whether the right thing was passed, whether the result was handled, or whether downstream invariants hold.
The fix is not complicated. Assert on the argument shape: expect(saveMock).toHaveBeenCalledWith(expect.objectContaining({ name: 'Alice', createdAt: expect.any(Date) })). Better: assert on the return value of createUser itself. Best: use a real in-memory database (sqlite, mongo-memory-server, testcontainers) and assert on observable state.
2. Shape checks masquerading as behaviour checks
The LLM writes a test for an API endpoint. It hits the endpoint, receives a response, and checks that the response has the right keys.
test('GET /users/:id returns user', async () => {
  const res = await request(app).get('/users/42');
  expect(res.status).toBe(200);
  expect(res.body).toHaveProperty('id');
  expect(res.body).toHaveProperty('name');
  expect(res.body).toHaveProperty('email');
});
This test does not check that the returned user is user 42. It does not check that the email format is valid. It does not check that sensitive fields (password hash, reset tokens) are absent from the response. It checks that the response has three keys. Any endpoint that returns an object with those three keys, even one that ignores the ID parameter entirely, passes.
Real tests need actual values. expect(res.body.id).toBe(42). expect(res.body).not.toHaveProperty('passwordHash'). The negative assertion - what must not be in the response - is the one LLMs almost never write, and it is the one that catches real bugs.
3. Async tests that do not await
This one is brutal because it looks identical to a working test until you read it carefully.
test('sends welcome email on signup', () => {
  const mailer = { send: jest.fn() };
  const service = new SignupService(mailer);
  service.signup({ email: 'a@b.c' });
  expect(mailer.send).toHaveBeenCalled();
});
service.signup is async. The test does not await it. The assertion runs synchronously, before the microtask queue drains. Whether it passes depends on where signup calls mailer.send: if the call happens before signup's first await, the mock has already fired by assertion time and the test passes; if it happens after an await, the assertion runs too early and the test fails or flaps. And because the returned promise is dropped, any rejection inside signup vanishes as an unhandled rejection instead of failing the test.
Most of the time it passes. Which is worse. The test is stamping approval on a code path it is not actually exercising.
The fix is one word: async on the test, await on the call. LLMs get this wrong because their training corpus contains both sync and async variants and they choose at random.
4. Tests that never had a failing state
The subtlest one. You ask an LLM to write a test for a function. It does. The test passes. You commit it. You never actually checked whether the test would fail if the function were broken.
This is a common failure mode around negative cases specifically. A test for "rejects invalid email" that passes both before and after the email validation code is present is proving nothing. A test for "returns 401 when not authenticated" that passes even when the auth middleware is removed is proving nothing.
The audit technique for this is mutation testing - run stryker (JavaScript), mutmut (Python), pitest (Java), Stryker.NET (C#). Mutation tools systematically break your code and check whether your tests notice. If they do not, those tests are cosmetic.
The afternoon audit
If you inherit an AI-generated test suite, here is the protocol. It takes 3-4 hours on a repo up to 50k lines.
- Step 1. Grep for toHaveBeenCalled() and toHaveBeenCalledTimes with no With. Every hit is a candidate for type 1. Read each one. Flag the ones that could pass with garbage input.
- Step 2. Grep for toHaveProperty without a second argument. Every hit is a type 2 candidate. Ask: would this pass if the property value were wrong?
- Step 3. Grep for test\(|it\( followed by an arrow function with no async. Inside each, check for any calls to functions ending in Async, any await that is missing, any Promise return that is not awaited.
- Step 4. Pick five of the most important tests. Comment out the function body they are testing. Run the test. Does it fail? If not, the test was cosmetic.
- Step 5. Run mutation testing on the files with the densest test coverage. Aim for a mutation score above 70% on any file that owns business logic. Lower than that, the tests need rewriting.
None of this is rocket science. It is three greps and a mutation run. But the teams I audit have never done it, because they took the green CI badge as evidence. It is not evidence. It is the absence of a very specific kind of counter-evidence, and AI-generated tests have become extremely good at producing that absence while proving nothing.
Ship boring releases.