AI is making software faster but less stable
AI coding tools are increasing output, but speed without delivery discipline pushes risk into review, testing, operations, and incident response.

AI coding tools changed the rhythm of software development.
A developer can now ask for a component, a migration, a test file, a refactor, a parser, a GitHub Action, a Terraform module, or a bug fix and get something useful in seconds. Not always correct. Not always safe. Not always maintainable. But useful enough to move faster.
That speed is real.
It is also incomplete.
Software does not become stable because code was written quickly. It becomes stable because teams understand requirements, review changes carefully, test behavior, control releases, observe production, respond to incidents, and learn from failures.
AI helps with some of that. It can also overload the rest of it.
That is the real problem.
AI is making software faster at the point of code generation, but many teams have not upgraded the systems around code generation. Review, testing, CI, deployment, observability, security, and rollback are still built for a slower world.
So the bottleneck moves.
The code appears faster. The risk appears later.
The speed is real
It would be dishonest to pretend AI coding tools do not help.
They do.
They are good at scaffolding. They are good at translating examples into code. They can explain unfamiliar APIs. They can generate tests. They can help with repetitive refactors. They can make a developer less blocked on syntax, boilerplate, or first drafts.
GitHub's Octoverse 2025 describes AI, agents, and typed languages as driving one of the biggest shifts in software development in more than a decade. It also reports high activity across GitHub, including millions of issues closed and tens of millions of pull requests merged per month in 2025.
Google's DORA team also reports that AI adoption among software development professionals has surged. In Google's summary of the 2025 DORA report, AI adoption reached 90 percent among software development professionals, and workers reported spending a median of two hours per day with AI tools.
The point is not that every generated line is good.
The point is that AI is now inside the normal development workflow.
This changes engineering economics.
When code is cheaper to produce, teams produce more of it. When drafts are easier to create, more ideas become pull requests. When agents can work in parallel, more branches appear. When test generation is easy, test files multiply. When refactors are easier, old code gets touched more often.
Some of this is excellent.
But every change still creates risk.
A faster authoring loop does not automatically create a safer delivery loop.
That distinction matters.
The stability problem is showing up downstream
The most important pattern in AI-assisted development is not "developers are faster."
It is this:
AI moves work from writing code to validating code.
That is not a small shift.
TechRadar recently summarized research claiming that heavy AI tool users deploy more frequently, but also report more frequent issues and slower recovery. The report said 45 percent of frequent AI users deploy daily compared with 15 percent of occasional users, while 69 percent of heavy AI users reported frequent deployment issues. It also described the downstream burden moving into QA, remediation, and infrastructure work.
Harness describes a similar pattern in its State of AI in Software Engineering research. Its public summary highlights more deployment failures, rising manual toil for QA and operations teams, and unpredictable costs.
DORA's 2025 report gives a more careful version of the same idea. It says AI acts as an amplifier. It can magnify an organization's strengths, but it can also magnify weaknesses. Google's announcement of the report states that AI adoption continues to have a negative relationship with software delivery stability.
That is the center of this article.
AI does not remove the need for engineering discipline.
It increases the need for it.
The system does not fail at the prompt.
It fails at the handoff between generation and production.
That handoff is where most teams are underprepared.
More code is not the same as more progress
There is a quiet assumption behind many AI coding claims:
If developers produce more code, the organization moves faster.
Sometimes that is true.
Often it is not.
Progress is not measured by lines of code. It is measured by working behavior in production. A feature is not done when the code compiles. It is done when users can use it safely, the team can operate it, and the business can trust it.
AI can increase code output without increasing product progress.
This happens when generated code creates hidden work:
Reviewers need more time to understand changes.
QA needs more time to test edge cases.
Security needs more time to inspect dependencies.
DevOps needs more time to fix pipeline failures.
Senior engineers need more time to correct architecture drift.
Incident responders need more time to debug unfamiliar code.
Product teams need more time to clarify behavior that AI guessed.
The pull request is not free because the first draft was cheap.
This is why teams can feel both faster and more tired.
They are writing less boilerplate, but reviewing more uncertainty.
They are shipping more changes, but cleaning up more failures.
They are using better tools, but carrying more coordination load.
The work did not disappear.
It moved.
The trust gap is part of the architecture
Stack Overflow's 2025 Developer Survey shows the trust gap clearly. AI tool usage and planned usage remain high, but trust in AI accuracy is weak. Stack Overflow's public reporting says 46 percent of developers do not trust AI output accuracy, while only 33 percent do.
That is not because developers are anti-AI.
It is because developers have used the tools enough to know their shape.
AI can produce correct-looking code that is wrong. It can use outdated APIs. It can ignore project conventions. It can invent tests that assert implementation details. It can miss security issues. It can overfit to examples. It can confidently explain a bug while missing the real cause.
This creates a new kind of review problem.
Traditional review asks:
Is this logic correct?
Is the design reasonable?
Are the tests sufficient?
Does this fit our codebase?
AI-assisted review adds more questions:
Did the model invent behavior?
Did it copy an outdated pattern?
Did it add a dependency we do not need?
Did it weaken a test?
Did it hide complexity behind generated code?
Did it produce code the author fully understands?
Did it change behavior outside the requested scope?
That last question is huge.
AI tools are often helpful, but they are not careful in the same way a responsible engineer is careful. They may touch too much. They may "clean up" code that should not be changed. They may improve style while altering behavior.
So trust cannot be a feeling.
Trust must be built into the delivery system.
You do not trust AI-generated code because it looks good.
You trust it because it survived the same engineering process as every other change.
Code review becomes a pressure point
AI makes it easier to create pull requests.
It does not make it easier for humans to review all of them.
This is one of the biggest practical problems in AI-assisted software development. Teams celebrate faster code creation, then discover that review capacity did not increase at the same rate.
Senior engineers become the bottleneck.
They now need to review:
More code
Larger diffs
More generated tests
More dependency changes
More unfamiliar patterns
More subtle behavior changes
More code written by people who may not fully understand it
That creates fatigue.
And fatigue is dangerous.
A tired reviewer starts approving changes based on surface quality. The code is formatted. The tests pass. The description sounds plausible. The author says AI helped. The reviewer has ten more PRs waiting.
That is how risk enters production.
A good AI-era review process needs more structure.
| Review area | What to check |
|---|---|
| Scope | Did the change do only what was requested? |
| Ownership | Does the author understand the code? |
| Dependencies | Did it add packages or services unnecessarily? |
| Tests | Do tests check behavior, not just mocks? |
| Security | Are inputs, secrets, auth, and permissions handled correctly? |
| Performance | Did the change add expensive loops, queries, or calls? |
| Operability | Can we log, debug, and roll back this change? |
| Product behavior | Does this match the intended user outcome? |
The reviewer should not need to guess which parts were AI-generated.
Teams should normalize saying:
AI helped draft this. I reviewed the logic, changed these parts, and I need extra attention on the auth flow.
That is not weakness.
That is good engineering communication.
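The "Scope" row in the table is one of the few review questions that can be partly automated. The sketch below assumes a hypothetical team convention where each pull request declares the path patterns it is expected to touch; the function name, patterns, and file paths are all illustrative, not part of any real tool.

```python
from fnmatch import fnmatch


def out_of_scope(changed_files, allowed_patterns):
    """Return changed files that match none of the declared scope patterns.

    Note: fnmatch's `*` also matches path separators, so "billing/*.py"
    covers nested files such as "billing/sub/x.py" as well.
    """
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical PR: the task was "fix the invoice rounding bug".
declared_scope = ["billing/*.py", "tests/billing/*.py"]
changed = [
    "billing/invoice.py",
    "tests/billing/test_invoice.py",
    "auth/session.py",  # not part of the request
]

unexpected = out_of_scope(changed, declared_scope)
print(unexpected)  # ['auth/session.py']
```

A check like this does not judge whether the change is correct. It only tells the reviewer where to look first.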
Tests matter more, not less
AI can generate tests quickly.
That does not mean the tests are good.
Many AI-generated tests are shallow. They check the happy path. They mock too much. They assert implementation details. They duplicate the bug. They pass because they test what the code does, not what the code should do.
This is a serious problem.
If AI writes the code and AI writes weak tests for that code, the pipeline can create a false sense of safety.
The result is green CI with fragile behavior.
Teams need better testing habits, not just more tests.
A useful AI-era testing strategy includes:
Unit tests for core logic
Integration tests for important boundaries
Contract tests for APIs
Regression tests for past bugs
Property-based tests where logic has many input combinations
End-to-end tests for critical user flows
Security tests for authentication and authorization
Load tests for high-traffic paths
Manual exploratory testing for risky product behavior
AI can help write these.
But humans need to decide what matters.
A good test starts with a risk question:
What would be bad if this change were wrong?
Then write tests around that.
For example:
| Change | Bad outcome | Test focus |
|---|---|---|
| Payment logic | Incorrect charge or refund | Edge cases, idempotency, audit trail |
| Auth middleware | Unauthorized access | Role and tenant isolation |
| Search ranking | Bad results | Relevance and regression set |
| Migration | Data loss | Backup, rollback, dry run |
| Email system | Wrong recipient | Template data and recipient rules |
| AI feature | Unsafe or false output | Eval set and policy tests |
AI can draft tests, but the team must define the danger.
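To make the payment row concrete, here is a minimal sketch of a test written from the risk ("a retried request must not refund twice") rather than from the implementation. `RefundLedger` is a toy in-memory stand-in invented for this example; a real system would usually enforce idempotency with a database constraint.

```python
class RefundLedger:
    """Toy in-memory ledger used only for illustration."""

    def __init__(self):
        self._refunds = {}  # idempotency_key -> amount in cents

    def refund(self, idempotency_key, amount_cents):
        if amount_cents <= 0:
            raise ValueError("refund amount must be positive")
        # A repeated key returns the original result instead of refunding again.
        if idempotency_key not in self._refunds:
            self._refunds[idempotency_key] = amount_cents
        return self._refunds[idempotency_key]

    def total_refunded(self):
        return sum(self._refunds.values())


def test_retry_does_not_refund_twice():
    ledger = RefundLedger()
    ledger.refund("order-42-refund", 1500)
    ledger.refund("order-42-refund", 1500)  # simulated client retry
    assert ledger.total_refunded() == 1500


def test_rejects_non_positive_amount():
    ledger = RefundLedger()
    try:
        ledger.refund("order-43-refund", 0)
    except ValueError:
        return
    raise AssertionError("zero refund should be rejected")


test_retry_does_not_refund_twice()
test_rejects_non_positive_amount()
```

Both tests describe outcomes the business cares about. Neither would need to change if the ledger's internals were rewritten.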
CI pipelines were not built for this volume
AI increases the number of changes. That increases load on CI.
More pull requests mean more builds, more test runs, more preview environments, more security scans, more containers, more artifacts, more cache misses, and more flaky failures.
If the CI system was already slow, AI makes it worse.
A slow CI system creates bad behavior:
Developers skip tests locally.
Reviewers approve before checks finish.
Teams rerun flaky tests without fixing them.
Pull requests sit for hours.
Small changes batch into large changes.
Engineers avoid refactoring because feedback is too slow.
AI does not solve slow feedback loops. It makes them more painful.
The answer is not only "buy more runners."
That may help, but CI design matters too.
A modern AI-era CI system should:
Run fast checks first
Split tests by risk and speed
Cache dependencies aggressively
Detect and quarantine flaky tests
Run security checks early
Avoid rebuilding unchanged parts
Use preview environments for risky UI changes
Keep main branch protected
Make failure messages clear
Track CI cost and duration over time
Fast code generation needs fast verification.
Otherwise the team creates code faster than it can prove the code works.
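As one illustration of "detect and quarantine flaky tests", the sketch below flags any test that both passed and failed on the same commit. The input format (tuples of commit, test name, and result) is an assumption made for this example; real CI systems expose this history in their own formats.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return tests that both passed and failed on the same commit.

    `runs` is an iterable of (commit_sha, test_name, passed) tuples.
    """
    outcomes = defaultdict(set)
    for commit_sha, test_name, passed in runs:
        outcomes[(commit_sha, test_name)].add(passed)
    return sorted(
        {test for (_, test), seen in outcomes.items() if seen == {True, False}}
    )


history = [
    ("abc123", "test_checkout", True),
    ("abc123", "test_checkout", False),  # same code, different result
    ("abc123", "test_login", True),
    ("def456", "test_login", False),  # different commit: may be a real regression
    ("def456", "test_search", False),
    ("def456", "test_search", False),  # consistently failing, not flaky
]

print(find_flaky_tests(history))  # ['test_checkout']
```

Comparing results only within a single commit is what separates "flaky" from "broken by a change". A test that fails consistently is a signal, not noise.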
Feature flags become more important
When change volume increases, release control matters more.
Feature flags help separate deployment from release. The code can be deployed while the feature remains off, limited to internal users, or rolled out to a small percentage of traffic.
LaunchDarkly describes feature flags as a way to control production behavior in real time, target releases precisely, and recover quickly. That is exactly what AI-assisted teams need.
The key idea is simple:
Deploy safely before exposing widely.
Feature flags do not make bad code good.
They reduce blast radius.
That matters because AI-assisted development can increase the number of small changes reaching production. If every change is released to everyone at once, instability spreads quickly.
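A percentage rollout can be sketched in a few lines. This is a generic illustration, not the API of LaunchDarkly or any other flag service; the function name and flag name are hypothetical. It hashes the flag name and user ID into a stable bucket, so the same user gets the same answer on every request.

```python
import hashlib


def is_enabled(flag_name, user_id, rollout_percent):
    """Deterministically place a user in or out of a percentage rollout."""
    if rollout_percent <= 0:
        return False
    if rollout_percent >= 100:
        return True
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in 0..99
    return bucket < rollout_percent


# Setting the percentage to 0 is the fast disable path.
users = [f"user-{n}" for n in range(1000)]
at_10 = {u for u in users if is_enabled("new-checkout", u, 10)}
at_50 = {u for u in users if is_enabled("new-checkout", u, 50)}

# Raising the percentage only adds users; nobody is switched back off.
print(at_10 <= at_50)  # True
```

Because each user's bucket is fixed, widening the rollout never flips an already-enabled user back to the old behavior, which keeps the experience consistent during a staged release.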
A good feature flag practice includes:
Flags for risky behavior changes
Clear flag owners
Expiration dates for temporary flags
Monitoring tied to rollout
Fast disable path
Cleanup process after release
No long-lived flags left behind without a reason
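Owners and expiration dates are easy to enforce once they are recorded as data. The sketch below assumes a hypothetical in-house flag registry; the `Flag` fields and the example flags are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Flag:
    name: str
    owner: str
    expires_on: Optional[date]  # None marks a deliberately permanent flag


def expired_flags(flags, today):
    """Temporary flags whose expiration date has passed."""
    return [f for f in flags if f.expires_on is not None and f.expires_on < today]


flags = [
    Flag("new-checkout", "payments-team", date(2025, 3, 1)),
    Flag("beta-search", "search-team", date(2025, 9, 1)),
    Flag("maintenance-mode", "platform-team", None),
]

for flag in expired_flags(flags, today=date(2025, 6, 1)):
    print(f"{flag.name} expired on {flag.expires_on}; owner: {flag.owner}")
# new-checkout expired on 2025-03-01; owner: payments-team
```

A report like this could run in CI or on a schedule, so cleanup becomes a routine task with a named owner.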
Feature flags are not only a product tool.
They are a stability tool.
They let teams move faster without pretending every change is safe.
Rollback is not optional anymore
If AI increases change velocity, rollback becomes a core design requirement.
A rollback is not a failure. A rollback is a safety mechanism.
Teams should be able to answer:
Can we turn this feature off?
Can we revert this deployment quickly?
Can we restore changed data?
Can we undo a migration?
Can we identify which change caused the issue?
Can we reduce traffic to the broken path?
Can we fail over to a simpler behavior?
The harder it is to roll back, the more dangerous frequent deployment becomes.
Different changes need different rollback plans.
| Change type | Rollback plan |
|---|---|
| UI change | Turn off flag or redeploy previous version |
| API behavior | Route old clients to old behavior |
| Database migration | Ship backward-compatible migrations so the previous version still runs |
| Background job | Pause queue or disable worker |
| AI prompt change | Version prompts and restore prior version |
| Model change | Route traffic back to previous model |
| Config change | Revert config with audit trail |
| Data update | Restore from snapshot or audit log |
AI-assisted teams should make rollback part of pull request review.
Ask:
If this goes wrong, how do we safely undo it?
If nobody knows, the change is not ready.
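The database migration row is the one teams most often get wrong, so here is a small sketch of what "backward compatible" can mean in practice. It uses an in-memory SQLite database and a made-up `users` table; the point is only that the writer from the previously deployed version keeps working after the schema change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")


def old_code_insert(conn, email):
    """Writer from the currently deployed version; unaware of new columns."""
    conn.execute("INSERT INTO users (email) VALUES (?)", (email,))


old_code_insert(conn, "a@example.com")

# Additive step: add a nullable column instead of renaming or dropping anything.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# The old writer still works after the migration, so the application
# deployment can be rolled back without touching the schema.
old_code_insert(conn, "b@example.com")

rows = conn.execute("SELECT email, display_name FROM users ORDER BY id").fetchall()
print(rows)  # [('a@example.com', None), ('b@example.com', None)]
```

A rename or a new NOT NULL column without a default would have broken the old writer, which is exactly the situation that turns a simple revert into an incident.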
Observability has to cover the whole path
Production stability depends on knowing what changed and what happened after it changed.
For AI-assisted software, observability needs to connect several layers:
Code changes
CI runs
Deployment events
Feature flag changes
Runtime errors
Latency
Logs
Traces
User behavior
AI model calls
Token usage
Agent tool calls
Cost
OpenTelemetry's GenAI work is important because AI systems introduce new telemetry needs. The OpenTelemetry project has been defining semantic conventions for generative AI operations, including model attributes, token usage, latency, prompts, completions, tool calls, and tool results where teams opt in.
That matters because AI-generated features often fail in ways normal logs do not explain.
An AI feature may be slow because a model call used too many tokens. It may be wrong because retrieval returned stale context. It may be expensive because retries multiplied. It may be unsafe because a tool was called with bad arguments.
You need traces that show the path.
Good observability answers:
Which deployment introduced the problem?
Which feature flag was on?
Which model version was used?
Which prompt version was used?
Which tool was called?
How many retries happened?
Which users were affected?
What was the cost?
Did rollback fix it?
If the team cannot answer these questions, AI will feel unpredictable even when the root cause is ordinary software failure.
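One way to make those questions answerable is to attach the same context to every model call. The sketch below uses plain dataclasses and JSON log lines. The field names are hypothetical and are not the OpenTelemetry GenAI semantic conventions; a team adopting OpenTelemetry would map them to the official attribute names.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelCallEvent:
    deployment_sha: str
    active_flags: tuple
    model_version: str
    prompt_version: str
    tool_name: str
    retries: int
    total_tokens: int
    user_id: str
    error: str = ""


def to_log_line(event):
    """Serialize one event as a JSON log line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)


def affected_users(events, deployment_sha):
    """Users who hit an error on a given deployment."""
    return sorted(
        {e.user_id for e in events if e.deployment_sha == deployment_sha and e.error}
    )


events = [
    ModelCallEvent("9f2c1e7", ("new-summary",), "model-a", "v12", "search_docs", 0, 850, "u-11"),
    ModelCallEvent("9f2c1e7", ("new-summary",), "model-a", "v12", "search_docs", 3, 4200, "u-17", "timeout"),
    ModelCallEvent("4b7d0aa", (), "model-a", "v11", "search_docs", 0, 790, "u-17"),
]

print(affected_users(events, "9f2c1e7"))  # ['u-17']
```

With the deployment, flag, prompt version, and retry count on the same record, "which change caused this?" becomes a query instead of an investigation.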
Security review gets harder
AI-generated code can introduce normal security bugs.
It can also introduce them at a higher volume.
Common risks include:
Missing authorization checks
Weak input validation
Unsafe dependency choices
Secret leakage in logs
SQL injection in generated query code
Insecure file handling
Overly broad cloud permissions
Broken tenant isolation
Dangerous default configurations
Generated code copied from outdated examples
OWASP's LLM application guidance focuses on risks like prompt injection, excessive agency, sensitive information disclosure, supply-chain vulnerabilities, and insecure output handling. Those risks matter when AI is part of the application. But ordinary application security still matters when AI is part of development.
AI does not remove the OWASP Top 10.
It gives teams more code where those risks can appear.
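SQL injection in generated query code is a good example of an ordinary bug that passes a casual review. The sketch below uses an in-memory SQLite database and an invented `accounts` table to show the same lookup written two ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT)")
conn.executemany("INSERT INTO accounts (owner) VALUES (?)", [("alice",), ("bob",)])


def find_accounts_unsafe(conn, owner):
    # Vulnerable: user input is spliced into the SQL text.
    return conn.execute(
        f"SELECT id FROM accounts WHERE owner = '{owner}'"
    ).fetchall()


def find_accounts_safe(conn, owner):
    # Parameter binding keeps the input as data, never as SQL.
    return conn.execute(
        "SELECT id FROM accounts WHERE owner = ?", (owner,)
    ).fetchall()


payload = "nobody' OR '1'='1"
print(len(find_accounts_unsafe(conn, payload)))  # 2: every account leaks
print(len(find_accounts_safe(conn, payload)))    # 0: no owner has that name
```

Both functions return the same results for normal input, and both would pass a happy-path test. Only a test with hostile input shows the difference.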
A practical security rule:
AI can suggest code, but it cannot approve security risk.
Security gates should be automated where possible, but human review is still needed for sensitive paths:
Authentication
Authorization
Payments
Admin features
Data exports
Customer data
Cloud permissions
AI tool execution
Dependency changes
Infrastructure changes
The more AI speeds up code creation, the more important it becomes to make security checks part of the default path.
Security cannot be a meeting at the end.
AI changes the meaning of developer productivity
Traditional productivity measures are already flawed.
AI makes them worse.
Lines of code, number of commits, pull requests created, and tickets closed can all increase while product quality decreases.
A developer who generates five pull requests that need heavy rework is not necessarily more productive than a developer who ships one careful change.
A team that deploys daily with constant incidents is not healthier than a team that deploys weekly with confidence.
Productivity must include stability.
Better measures include:
| Measure | Why it matters |
|---|---|
| Lead time for changes | How quickly safe changes reach users |
| Change failure rate | How often changes break production |
| Time to restore service | How quickly teams recover |
| Review time | Whether review is becoming a bottleneck |
| Escaped defects | Bugs found after release |
| Rollback frequency | Whether releases are unstable |
| Incident count | Operational health |
| Developer satisfaction | Whether speed creates burnout |
| Test signal quality | Whether tests catch real issues |
| Cost per change | Whether AI increases hidden infrastructure cost |
This is where DORA's framing is useful. Software delivery performance is not just deployment frequency. It includes lead time, change failure rate, failed deployment recovery time, and reliability.
AI should improve the whole system.
Not just the typing speed.
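Two of the measures in the table can be computed from a simple deployment log. The sketch below is a simplification under stated assumptions: the `Deployment` record is invented for this example, and recovery time is measured from deployment to restoration, whereas a real measurement would usually start from when the failure was detected.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def change_failure_rate(deployments):
    """Share of deployments that caused a failure in production."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)


def median_time_to_restore(deployments):
    """Median time from a failed deployment to restored service, or None."""
    durations = [
        d.restored_at - d.deployed_at
        for d in deployments
        if d.failed and d.restored_at is not None
    ]
    return median(durations) if durations else None


base = datetime(2025, 1, 6, 9, 0)
deployments = [
    Deployment(base),
    Deployment(base + timedelta(days=1), failed=True,
               restored_at=base + timedelta(days=1, minutes=30)),
    Deployment(base + timedelta(days=2)),
    Deployment(base + timedelta(days=3), failed=True,
               restored_at=base + timedelta(days=3, minutes=90)),
    Deployment(base + timedelta(days=4)),
]

print(change_failure_rate(deployments))     # 0.4
print(median_time_to_restore(deployments))  # 1:00:00
```

Tracking these alongside deployment frequency shows whether more frequent shipping is also safer shipping.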
Platform engineering becomes the safety layer
As AI accelerates development, platform engineering becomes more important.
Not because every company needs a huge internal platform.
Because teams need safe defaults.
A good platform gives developers:
Approved templates
Standard CI pipelines
Secure deployment paths
Built-in observability
Feature flag integration
Secret management
Dependency scanning
Cost visibility
Environment creation
Rollback tools
Documentation
Golden paths
AI can generate code inside those boundaries.
Without boundaries, AI generates variety. Variety can be useful in exploration, but it is dangerous in production systems.
The goal is not to limit developers for no reason.
The goal is to make the safe path the easy path.
If the easiest way to create a service includes logging, metrics, auth, tests, deployment, rollback, and security scanning, AI-generated code has a better chance of landing in a stable system.
If every team invents its own pipeline, AI will amplify inconsistency.
Platform engineering turns AI speed into repeatable delivery.
The new delivery pipeline
The old mental model was simple:
Write code. Review. Test. Deploy.
The AI-era pipeline needs more explicit checkpoints.
This pipeline does not reject AI.
It assumes AI is part of the work.
But it prevents AI from skipping the work that makes software reliable.
A strong AI-era pipeline includes:
Human planning before generation
Small scoped changes
Clear PR descriptions
Automated linting and formatting
Type checks
Unit and integration tests
Dependency review
Secret scanning
Security checks
Code ownership rules
Feature flags for risky changes
Canary or staged rollout
Observability and alerts
Rollback path
Post-release validation
That may sound heavy.
It is lighter than cleaning up production failures caused by unreviewed acceleration.
A practical checklist for teams
Here is a realistic checklist for teams adopting AI coding tools.
Before coding
Write the expected behavior in plain language.
Define what should not change.
Identify risky files and systems.
Decide whether AI should draft, explain, test, or refactor.
Keep the task small enough to review.
During coding
Ask AI for small changes, not giant rewrites.
Review generated code before running it.
Avoid accepting dependency additions casually.
Keep generated code aligned with project conventions.
Make the author explain the final code.
Before merge
Check scope carefully.
Run tests locally or in CI.
Review generated tests for real signal.
Check authorization and data access.
Check logs for secret leakage.
Check performance-sensitive paths.
Add regression tests for bug fixes.
Require owner review for critical systems.
Before release
Use feature flags for risky behavior.
Deploy gradually where possible.
Confirm dashboards and alerts exist.
Confirm rollback path.
Monitor errors, latency, and user impact.
Watch cost changes for AI features.
After release
Add production failures to regression tests.
Update prompts or templates if they caused issues.
Track review load and CI load.
Remove stale feature flags.
Share patterns that worked.
Fix platform gaps instead of blaming individuals.
This is the boring work.
It is also the work that lets teams use AI safely.
What good AI adoption looks like
Good AI adoption does not look like everyone generating as much code as possible.
It looks like better flow.
Developers spend less time on boilerplate and more time on design. Reviewers see smaller changes with clearer intent. Tests become more focused. CI becomes faster. Releases become safer. Observability gets better. Incidents become easier to debug. Teams learn which AI use cases help and which create cleanup work.
Good AI adoption has constraints.
Examples:
AI can draft tests, but humans define test strategy.
AI can write migrations, but migrations must be backward compatible.
AI can draft security-sensitive code, but owners must review it.
AI can suggest dependencies, but dependency review must approve them.
AI can help with incidents, but humans own decisions.
AI can write feature code, but rollout is controlled by flags.
AI can refactor, but behavior must be protected by tests.
The goal is not to make AI powerless.
The goal is to make AI useful inside a system that understands risk.
The future is faster and more disciplined
AI will keep getting better at writing code.
That does not mean software delivery will become automatically stable.
The opposite may be true for teams without strong engineering systems. As code gets easier to produce, the limiting factor becomes everything after code generation.
Review.
Testing.
Security.
Deployment.
Observability.
Rollback.
Ownership.
User trust.
The teams that benefit most from AI will not be the teams that generate the most code. They will be the teams that convert AI speed into reliable product changes.
That requires discipline.
It requires small changes. Strong tests. Fast CI. Feature flags. Good logs. Good traces. Security gates. Rollback paths. Platform defaults. Honest metrics. Human accountability.
None of that sounds as exciting as an AI agent writing a feature from a prompt.
But it is what makes the feature safe to ship.
AI is making software faster.
Whether it makes software better depends on the engineering around it.



