AI-Powered Test Report & Release Advisor

An LLM agent that reads every API regression run, classifies failures, tracks trends, and returns a plain-language release go/no-go, replacing manual report triage with a single command.

1000+ tests analyzed per run

Minutes to a release decision, not hours

1 day → minutes for management release review

Python · LLM agent integration · OpenAI API · Robot Framework · SSE / REST · Inline SVG · CI/CD

Problem statement

A large API regression suite produced thousands of pass/fail results on every build. Before each release, someone had to read the raw output, work out what had regressed since the last run, separate genuine failures from flaky noise, and write a summary the release manager could act on. That review cost an engineer an hour or two per run, and its quality varied with whoever did it.

The goal was a tool that reads a full regression run end to end and returns a release-readiness call in plain language, automatically, on every build.

How it works

A Python pipeline compares the current run against a saved benchmark run. It parses both result sets, classifies every failure against a library of error patterns, measures pass-rate and duration deltas, flags duration anomalies, and generates a set of key insights. All of that is assembled into a single structured prompt.
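The comparison step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `CaseResult` type, the three regex patterns, and the `slow_factor` threshold are all assumptions chosen for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern library: category name -> regex matched against the failure message.
ERROR_PATTERNS = {
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
    "auth": re.compile(r"\b(401|403|unauthorized|forbidden)\b", re.IGNORECASE),
    "server_error": re.compile(r"\b5\d\d\b"),
}


@dataclass
class CaseResult:
    name: str
    status: str        # "PASS", "FAIL", or "SKIP"
    duration_s: float
    message: str = ""


def classify(message):
    """Return the first matching category, or "unclassified" if none match."""
    for category, pattern in ERROR_PATTERNS.items():
        if pattern.search(message):
            return category
    return "unclassified"


def pass_rate(results):
    """Percentage of executed (non-skipped) tests that passed."""
    executed = [r for r in results if r.status != "SKIP"]
    if not executed:
        return 0.0
    return 100.0 * sum(r.status == "PASS" for r in executed) / len(executed)


def compare_runs(current, benchmark, slow_factor=2.0):
    """Compare the current run against the benchmark run."""
    baseline = {r.name: r for r in benchmark}
    failures = {r.name: classify(r.message) for r in current if r.status == "FAIL"}
    # A test counts as a duration anomaly if it took more than slow_factor
    # times its benchmark duration (assumed rule, for illustration only).
    slow = [
        r.name
        for r in current
        if r.name in baseline
        and baseline[r.name].duration_s > 0
        and r.duration_s > slow_factor * baseline[r.name].duration_s
    ]
    return {
        "pass_rate_delta": pass_rate(current) - pass_rate(benchmark),
        "failures": failures,
        "slow_tests": slow,
    }
```

The returned dictionary is the kind of structured summary that could then be folded into the prompt.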

The prompt goes to an LLM agent, which returns an assessment split into five fixed sections: execution health, failure analysis, API health signals, a risk assessment with a release recommendation, and key action items. The pipeline renders the whole thing into one self-contained HTML report with inline SVG charts and no external assets, so it can be emailed, stored as a CI artifact, or opened straight in a browser.
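To illustrate what "self-contained with inline SVG" can look like, here is a small sketch that draws a bar chart of failure counts directly into the page markup. The function names, layout constants, and colors are assumptions for the example, not the project's real renderer.

```python
from html import escape


def bar_chart_svg(counts, width=320, bar_height=18, gap=6):
    """Return an inline SVG horizontal bar chart for a {label: count} mapping."""
    peak = max(counts.values(), default=0) or 1
    label_w = 110
    rows = []
    for i, (label, value) in enumerate(counts.items()):
        y = i * (bar_height + gap)
        bar_w = (width - label_w - 30) * value / peak
        rows.append(
            f'<text x="0" y="{y + bar_height - 4}" font-size="12">{escape(label)}</text>'
            f'<rect x="{label_w}" y="{y}" width="{bar_w:.1f}" height="{bar_height}" fill="#c0392b"/>'
            f'<text x="{label_w + bar_w + 4:.1f}" y="{y + bar_height - 4}" font-size="12">{value}</text>'
        )
    height = max(len(counts) * (bar_height + gap), bar_height)
    body = "".join(rows)
    return f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">{body}</svg>'


def render_report(title, assessment, failure_counts):
    """Build one HTML string with inline CSS and SVG, so no external files are needed."""
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<title>{escape(title)}</title>"
        "<style>body{font-family:sans-serif;margin:2em}</style></head><body>"
        f"<h1>{escape(title)}</h1>"
        f"{bar_chart_svg(failure_counts)}"
        f"<pre>{escape(assessment)}</pre>"
        "</body></html>"
    )
```

Because the chart is plain markup inside the page, the resulting file has no images, scripts, or stylesheets to fetch, which is what lets it travel as an email attachment or CI artifact.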

The code is split into focused modules (discovery, parsing, analysis, prompt building, the API client, history, charts, reporting) so each concern changes independently. The AI prompt lives in a plain-text template anyone can edit without touching code.
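One simple way to keep a prompt in an editable text file is the standard library's `string.Template`. The sketch below assumes that approach; the template wording and placeholder names are invented for illustration, and only the five section names come from the description above.

```python
from string import Template

# Stand-in for the contents of a plain-text template file.
# The wording and the placeholder names are assumptions for this example.
PROMPT_TEMPLATE = """\
You are a release advisor reviewing an API regression run.

Pass rate: $pass_rate% (change vs benchmark: $pass_rate_delta points)
Failures by category:
$failure_summary

Respond with these sections, in this order:
$sections
"""

SECTIONS = [
    "Execution health",
    "Failure analysis",
    "API health signals",
    "Risk assessment and release recommendation",
    "Key action items",
]


def build_prompt(template_text, pass_rate, pass_rate_delta, failures):
    """Fill the template; `failures` maps category name -> count."""
    failure_summary = (
        "\n".join(f"- {cat}: {n}" for cat, n in sorted(failures.items())) or "- none"
    )
    sections = "\n".join(f"{i}. {name}" for i, name in enumerate(SECTIONS, 1))
    # substitute() raises KeyError if the template uses a placeholder
    # that is not supplied here, so template typos fail loudly.
    return Template(template_text).substitute(
        pass_rate=f"{pass_rate:.1f}",
        pass_rate_delta=f"{pass_rate_delta:+.1f}",
        failure_summary=failure_summary,
        sections=sections,
    )
```

With this approach, someone editing the template only needs to know the placeholder names; a literal dollar sign in the text is written as `$$`.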

LLM integration and trend tracking

The API client supports three modes: a real LLM agent, a public OpenAI endpoint, and a mock. Responses stream back as server-sent events; the client parses them, retries on failure, and falls back to a mock response if the agent is unreachable, so report generation never breaks the build.
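The parse, retry, and fall-back behavior can be sketched as below. Several things here are assumptions: the transport is passed in as a hypothetical `open_stream` callable so the sketch stays independent of any HTTP library, each event's `data:` payload is treated as a plain text chunk, and `[DONE]` is used as the end-of-stream sentinel in the style of OpenAI's streaming responses. The real agent's payload format may differ.

```python
import time

MOCK_ASSESSMENT = "Mock assessment: agent unreachable, no AI analysis available."


def parse_sse(lines):
    """Yield the data payload of each server-sent event from an iterable of text lines."""
    data = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):   # one leading space after the colon is not part of the value
                value = value[1:]
            data.append(value)
        elif line == "" and data:       # a blank line ends the current event
            yield "\n".join(data)
            data = []
        # comment lines (":...") and other fields (event:, id:, retry:) are ignored here
    if data:                            # lenient: flush an event left open at end of stream
        yield "\n".join(data)


def get_assessment(open_stream, prompt, retries=3, backoff_s=1.0, sleep=time.sleep):
    """Return (text, source); source is "agent" on success or "mock" after all retries fail."""
    for attempt in range(retries):
        try:
            chunks = []
            for payload in parse_sse(open_stream(prompt)):
                if payload == "[DONE]":
                    break
                chunks.append(payload)
            if chunks:
                return "".join(chunks), "agent"
        except (OSError, ValueError):   # connection errors and timeouts are OSError subclasses
            pass
        if attempt < retries - 1:
            sleep(backoff_s * 2 ** attempt)   # exponential backoff between attempts
    return MOCK_ASSESSMENT, "mock"
```

Returning the source alongside the text lets the report state clearly when it contains a mock response rather than a real assessment.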

A history layer records only failed and skipped tests to a CSV, which keeps it small even for suites of 1000+ tests. That history feeds the AI so it can tell a one-off failure from a repeat offender, recognize flaky tests, and weight its recommendation by whether failures are trending up or down.
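A failures-only history might look like the sketch below. The column layout (`run_id, test, status`), the label names, and the thresholds are assumptions for illustration. One consequence of not recording passes is that the list of run IDs has to be known separately: a run missing from a test's rows is taken to mean the test did not fail in that run.

```python
import csv
from collections import defaultdict


def append_run(fh, run_id, results):
    """Write only non-passing results; `results` is an iterable of (test, status) pairs."""
    writer = csv.writer(fh)
    for test, status in results:
        if status in ("FAIL", "SKIP"):
            writer.writerow([run_id, test, status])


def failure_trends(fh, run_ids):
    """Label each test that has failed; `run_ids` lists all runs as strings, oldest first."""
    failed_in = defaultdict(set)
    for run_id, test, status in csv.reader(fh):
        if status == "FAIL":
            failed_in[test].add(run_id)

    labels = {}
    for test, runs in failed_in.items():
        flags = [r in runs for r in run_ids]          # True where the test failed
        streak = 0
        for failed in reversed(flags):                # consecutive failures up to the latest run
            if not failed:
                break
            streak += 1
        flips = sum(a != b for a, b in zip(flags, flags[1:]))
        if streak >= 3:
            labels[test] = "repeat offender"
        elif flips >= 2:                              # alternates between failing and not failing
            labels[test] = "flaky"
        else:
            labels[test] = "isolated"
    return labels
```

Labels like these could be summarized in the prompt so the model sees each failure's history rather than a single run in isolation.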

Value delivered

Manual log triage before every release became a single command and a two-minute report read, with a consistent decision every time instead of one that depended on who ran it.

It also changed the conversation at the management level. Instead of parsing raw test logs, leadership gets a clear, plain-language read on release health backed by historical trend analysis, cutting what had been a full day of management review down to minutes.

Because the report is self-contained HTML, it drops straight into CI as an artifact and into release sign-off with no dashboard to host or maintain. Tracking failures historically, not just per run, means systemic issues surface as patterns instead of getting lost in a wall of red.

Design choices

Keeping the prompt in an editable template, separate from code, lets anyone tune the AI's tone and focus without a redeploy.

A guaranteed fallback path matters more than the model itself. The pipeline always produces a report, even when the agent is down, so it can sit on the critical path of a build without becoming a single point of failure.
