Do AI Coding Agents Break Your CI/CD Pipeline?

Last month a client called me because their GitHub Actions bill had tripled. Three engineers, same repo, same codebase. The only thing that changed: they gave Copilot and Codex commit access. Dozens of small PRs per day, most of them failing, none of them getting fixed. Their CI was churning through builds that would never turn green.

So I went looking for data.

I conducted a study that analyzed 24,560 pull requests across 447 open-source repositories. I linked every PR to its actual GitHub Actions workflow runs, classified job outcomes into six categories, and used propensity score matching to control for confounders like repo size, PR complexity, and timing. The dataset split was 20,463 AI-generated PRs against 4,097 human-authored ones.

higher odds of CI failure for AI-authored PRs

OR = 1.194·p = 0.003·n = 24,560

Observable confounders like repo age, star count, and PR size explain about 54% of the raw gap. The rest persists after adjustment.

But that number is basically useless on its own. The interesting part is where the failures cluster.

Small PRs are the actual problem

The AI penalty exists only for small pull requests.

Under 55 lines of code, AI-generated changes fail at twice the rate of human changes (OR = 2.12). Over 400 lines, the difference disappears. For the very largest PRs, AI actually does slightly better than humans.

CI Failure Odds by PR Size

AI vs. Human (Odds Ratio)

< 55 LOC

2.12x

55 - 200

~1.4x

200 - 400

~1.1x

400+

~0.95x

1.0x baseline (equal failure rates)

The reason is straightforward. Small AI PRs are speculative single-shot attempts. The agent tries something, pushes it, moves on. Nobody reviews a 12-line bot commit the way they review a 600-line feature branch. Large AI PRs tend to get reviewed, edited, tested locally before they ever hit the pipeline. Size is a proxy for how much a human cared about the change before it shipped.

It fails lint, not tests

Here is the part that surprised me. After controlling for within-repo clustering, the test failure difference between AI and human PRs is not statistically significant. Tests pass at roughly the same rate.

What blows up is linting. AI PRs trigger 76% more lint violations than human PRs.

Where AI PRs Fail

Lint Violations

+0%

Statistically significant (p < 0.001)

Test Failures~0%

Not statistically significant

If you have ever watched Copilot generate a perfectly functional function with tabs instead of spaces, single quotes instead of doubles, and imports in the wrong order, that is exactly what is happening at scale across 447 repos. The agent writes code that compiles, runs, passes tests, and then gets rejected by a formatting rule it was never told about.

Nobody fixes the broken ones

This is the finding that should worry you more than the failure rate itself.

When an AI PR fails CI, it almost never gets repaired. 9.3% of failed AI PRs eventually pass on a subsequent run. For human PRs, that number is 23.4%. Failed AI code is 2.5 times more likely to be abandoned than fixed.

Failed PR Repair Rate

Percentage of failed PRs that eventually pass

Human PRs

AI PRs

Failed AI code is 2.5x more likely to be abandoned than fixed

Your pipeline is not just running more failed builds. It is running failed builds that will sit in the queue until someone manually closes them, eating compute the entire time. That client I mentioned? They had 340 open bot PRs, all red, all untouched for weeks. Nobody even looked at them anymore.

What to do about it

The first thing is to stop letting every tiny bot commit trigger your full pipeline. A two-minute pre-check (lint and type check only, no build, no integration tests) catches the 76% lint gap before you burn expensive CI minutes on code that was never going to pass anyway. GitHub Actions supports this natively with path filters and job conditionals. Most teams I work with set this up in an afternoon.

Second, close stale bot PRs automatically. If an AI PR fails and shows no commit activity within 48 hours, it is dead. Close it. A GitHub Action or a simple cron-based script that queries for bot-authored PRs older than two days with failing checks will clean your queue in seconds. The 91% abandonment rate means you are safe to be aggressive here.

Third (and this is the easiest win): pass your linter configuration to the agent. If you use ESLint, Prettier, Ruff, Black, whatever your stack demands, include those config files in the agent's context window. Most AI coding tools support this now. The lint gap is not a fundamental limitation of AI code generation. It is a configuration problem. Fix the config, and a huge chunk of your failures disappear.

The last piece is pipeline architecture. Most CI setups were designed around human rhythms: a handful of PRs per day, each one representing hours of focused work. AI agents submit dozens of PRs per day, each one a quick attempt. Your concurrency limits, your caching strategy, your notification rules, the way you allocate runners - all of it assumed a pattern that no longer holds. An audit of those assumptions takes a few hours and can cut your CI spend by 40-60%.

The real question

AI coding agents are not uniformly bad at CI. They fail on small speculative changes and lint compliance. They do fine on large, well-reviewed contributions. The blanket "AI code is worse" narrative does not match the data.

The real question is whether your infrastructure was designed for the way these tools actually work. For most teams I talk to, the answer is no. And the fix is cheaper than the problem.

This research analyzed 24,560 PRs across 447 GitHub repositories using propensity score matching with eight covariates. The full paper will be published on arXiv.

Share this article

Is your CI/CD ready for AI agents?

Most pipelines weren't built for the way AI coding tools work. A quick audit can save you hours of wasted compute every week.

Book a free consultation