
Your Best Engineer Just Started Vibecoding. Here's What Happened to Your CI Bill.

Simon Doba · April 28, 2026 · 7 min read

Jake was the best hire I'd seen at that company in two years. Joined as a mid-level, ramped fast, shipped clean code. By month three he was closing tickets faster than anyone else on the team. By month four his lead pulled me aside and said something was wrong.

The CI bill had doubled. Not gradually. One month.

I asked her to show me what changed. She opened the PR list and scrolled. And scrolled. Jake had submitted 47 pull requests in the last two weeks. Most were small, under 50 lines. More than half had red badges next to them. He'd moved on to the next thing before the first one finished building.

She called it vibecoding. Jake had started using Cursor full-time, letting the agent write most of his code, pushing PRs the moment the agent finished generating them. His velocity numbers looked incredible in the standup metrics. His actual merge rate was 31%.

I see some version of this about once a month now.

The data behind the pattern

I co-authored a study that analyzed 24,560 pull requests across 447 GitHub repositories. We matched every PR on eight covariates (repo age, star count, PR size, language, timing, a few others) and tracked what happened when each one hit CI.

AI-authored PRs fail 19.4% more often than human ones after adjusting for confounders. That number on its own doesn't tell you much. Where the failures land is the interesting part.

Small PRs, under 55 lines, are where everything falls apart. AI-generated changes at that size fail at twice the rate of equivalent human changes (OR = 2.12, z = 8.23). Go above 400 lines and the difference disappears completely. The very largest AI PRs actually outperform human ones by a small margin.

Jake's PRs averaged 23 lines. He was hitting the worst part of the curve every single time.

Why the small ones break

A 12-line PR from an AI agent is a fundamentally different artifact from a 600-line feature branch built with the same tool. The small one is a single shot. Agent sees a task, generates a fix, opens a PR. No iteration, no review, no awareness of project-specific conventions. It doesn't know your team switched to double quotes last March, or that your import ordering follows a custom ESLint plugin that maybe three people on the team even know about.

Large AI PRs look different because a human was involved. Someone gave the agent a detailed prompt, reviewed the diff before pushing, or the change touched enough files that the agent had to ingest more of the codebase. Human presence acts as a quality filter.

The failure mode is surprisingly specific. Test failures between AI and human PRs show no statistically significant difference after you control for within-repo clustering. Tests pass at roughly the same rate regardless of who wrote the code.

What kills small AI PRs is linting: 76% more lint violations than human-authored ones. The agent writes code that compiles, runs, passes every test, and then gets rejected because it used tabs instead of spaces. Lint rules live in config files the agent was never shown.

The part that actually costs money

Jake's lead assumed the fix was obvious. Tell Jake to review his PRs before pushing. He started doing that. Sort of. He'd glance at the diff, see that it looked reasonable, push it. The linting issues kept happening because Jake didn't run the linter locally either. He trusted the agent's output and figured CI would catch anything wrong.

CI did catch it. Every single time. At $0.008 per minute.

Fifteen failed pipeline runs per day, each one taking 12 to 18 minutes. Roughly 4 hours of wasted compute daily. Over a month, 120 hours of runner time producing nothing; at the metered rate, that's about $58 for a single-job pipeline (7,200 minutes at $0.008 each), and a multiple of that once jobs fan out in parallel.

But the compute cost was honestly the smaller problem.

Every failed run triggers a notification. When the whole team gets a Slack alert for every CI failure, and most of those failures are Jake's abandoned bot PRs, people stop looking. I watched it happen at this company in real time. Within two weeks of Jake going full vibecode, the team's median response time to legitimate CI failures went from about 12 minutes to over 3 hours. The noise trained them to stop paying attention.

Nobody fixes the broken ones

When AI PRs fail CI, only 9.3% eventually get repaired and pass on a later run. For human PRs the number is 23.4%. Ninety-one percent of failed AI pull requests get abandoned. The agent tries once, fails, moves on.

Jake's numbers were worse than the dataset average. He had 34 open PRs with red badges. The oldest was five weeks old. He'd forgotten about every single one of them. They sat in the queue, eating dashboard space, sending notification pings into the void. Every new engineer who joined the project spent their first afternoon scrolling through dead PRs trying to figure out what was real work and what was noise.

What his lead set up (one afternoon)

The fix was not telling Jake to stop using Cursor. That would have been stupid. His good PRs, the ones that actually passed, were genuinely good. Fast, well-structured, often better than what he'd have written by hand. The problem was the pipeline, not the tool.

First thing she did: a fast pre-check workflow. Lint and type check only, no build, no integration tests. Runs in about 30 seconds. If the pre-check fails, the full pipeline never triggers. This alone cut their wasted CI minutes by roughly 60%. She set it up in GitHub Actions as a separate workflow on pull_request, configured as a required status check. Took maybe two hours including testing.
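A minimal sketch of that workflow, assuming a Node project with lint and typecheck npm scripts (the file name and script names are illustrative; swap in whatever your repo actually uses):

```yaml
# .github/workflows/pre-check.yml
name: pre-check

on: pull_request

jobs:
  lint-and-types:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      # Style and type errors only. No build, no integration tests.
      - run: npm run lint
      - run: npm run typecheck
```

One nuance: a required status check blocks merging, but it doesn't by itself stop other workflows from starting. To actually keep the heavy pipeline from running, put the expensive jobs behind a needs: edge on this job in the same workflow, or trigger them via workflow_run gated on this one succeeding.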

Then she put the team's ESLint config and Prettier rules into Jake's .cursorrules file. Cursor reads that file and uses it as context when generating code. The lint failure rate on his PRs dropped by about a third within the first week. Ten minutes of setup.
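The file itself can be a few plain-language lines restating what the configs enforce. A hypothetical example; yours should mirror your actual ESLint and Prettier settings:

```
# .cursorrules (illustrative)
- Strings use double quotes; Prettier is configured with "singleQuote": false.
- Indent with 2 spaces, never tabs.
- Import order: external packages, then internal aliases, then relative
  imports, alphabetized within each group (enforced by a custom ESLint plugin).
- Match the surrounding file's existing style before introducing new patterns.
```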

Auto-close came next. A daily cron action that finds any PR labeled ai-generated with failing checks and no activity for 48 hours, then closes it with a comment explaining why. GitHub's stale action handles the label-and-inactivity part out of the box. Given that 91% of failed AI PRs never get fixed, closing after two days of silence is not aggressive. You're cleaning up work that was already dead.
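A sketch of that sweep, assuming agent PRs get an ai-generated label when opened (the label name, schedule, and messages are illustrative). Note that actions/stale filters on labels and inactivity, not CI status; detecting red badges directly would take an extra scripted step:

```yaml
# .github/workflows/close-stale-ai-prs.yml
name: close-stale-ai-prs

on:
  schedule:
    - cron: "0 6 * * *" # one daily sweep

permissions:
  pull-requests: write

jobs:
  sweep:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/stale@v9
        with:
          only-labels: ai-generated
          days-before-pr-stale: 1 # flag after one quiet day...
          days-before-pr-close: 1 # ...close one day later (~48h total)
          days-before-issue-stale: -1 # leave issues alone
          stale-pr-message: >
            This AI-generated PR has had no activity since its checks
            failed. It will be closed in 24 hours unless someone pushes
            a fix or comments.
          close-pr-message: >
            Closed automatically after 48 hours of inactivity. Reopen if
            this change is still needed.
```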

Last piece: concurrency groups keyed on the PR branch. When Jake's agent pushed three times to the same branch in 20 minutes, only the latest commit got a full pipeline run. The earlier ones got cancelled automatically. Stopped him from monopolizing runners while other engineers waited for their branches to build.
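The concurrency piece is a few lines at the top of the main CI workflow:

```yaml
# In the main CI workflow file: one in-flight run per PR branch.
# A new push to the same branch cancels the run already in progress.
concurrency:
  group: ci-${{ github.workflow }}-${{ github.head_ref || github.ref }}
  cancel-in-progress: true
```

github.head_ref is set for pull_request events (the source branch) and falls back to github.ref for direct pushes, so one group covers both trigger types.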

The whole thing took one afternoon. Their CI bill dropped back to within 15% of what it had been before. Jake kept using Cursor. He still submitted more PRs than anyone on the team. The ones that made it through the pre-check gate merged at a higher rate than the team average.

The conversation nobody wants to have

His lead told me later that the hardest part wasn't the technical setup. It was telling Jake that his velocity numbers were a lie.

He'd been reporting 47 PRs in two weeks at standup. Looked incredible on paper. But subtract the ones that failed CI and were never fixed, the ones closed as stale, the ones that duplicated work someone else had already done because Jake hadn't checked first. His actual shipped contribution was 11 merged PRs. Still respectable. Not the 4x multiplier he thought he was getting.

The vibecoding pitch is that you can move faster. And you can. But only if your pipeline is built for the pattern these tools produce: dozens of small speculative PRs per day, most of which fail on lint rules the agent doesn't know about, almost none of which get fixed when they break. If your CI was designed for five to ten carefully reviewed human PRs per day, it will buckle.

The answer is four changes to your CI setup that take about five hours, and one honest look at what "velocity" means when a bot is doing the typing.

Based on an empirical study of 24,560 PRs across 447 open-source repositories. Names and details have been changed. I help startups and teams set up CI/CD pipelines, DevOps infrastructure, and development workflows that work with AI tools instead of against them.


