
Redefining "Good PRs" in the AI Era — What to Change, Why, and How


AI has reshaped PR productivity, size, and generation speed. The "small / frequent / line-by-line" review policy can't survive as-is. What to bundle as a PR (unit), what to review (target), and who blocks (gate) — three axes must be redesigned together.

1. The problem — why "good PRs" are breaking now

Primary research and industry reports since AI adoption show a consistent pattern: individual productivity rises, but org-level delivery performance stagnates or declines.

  • Google DORA 2024 Report: a 25% increase in AI adoption is associated with +3.4% code quality, +3.1% code review speed, and +7.5% documentation quality, yet delivery throughput falls 1.5% and stability falls 7.2%. About 75% of respondents rely on AI in their daily work, while 39% report little or no trust in AI-generated code [1][2].
  • Google DORA 2025 (State of AI-assisted Software Development): AI acts as an amplifier; it scales whichever organizational strengths and weaknesses already exist. Strategic investment in foundational systems determines ROI more than tooling does [3].
  • Faros AI "Productivity Paradox" 2025 (10,000+ developers, 1,255 teams): high-AI teams complete 21% more tasks and merge 98% more PRs, but PR review time rises 91%. By Amdahl's law, speeding up one stage (coding) yields little overall gain while an unimproved stage (review) dominates lead time [4][5].
  • Faros AI Engineering Report 2026 ("Acceleration Whiplash"): a "senior engineer tax" emerges. Median time-to-first-review +156.6%, average review time +199.6%, median linger time +441.5%. AI output looks plausible on the surface, hiding defects deeper and making review cognitively more expensive [6].
  • GitClear 2025 AI Copilot Code Quality (211M LOC, 2021–2025): "moved" code, the marker of refactoring and reuse, dropped from 25% to under 10%, while copy/paste rose from 8% to ~18%. For the first time on record, copied lines exceeded moved lines. AI optimizes short-term output at the cost of long-term maintainability [7].
  • Peng et al., arXiv 2302.06590 (GitHub/Microsoft): in a controlled trial the Copilot group completed identical tasks 55.8% faster. Individual coding velocity gains are real [8].
  • SmartBear–Cisco Code Review Study (the largest industry code-review study): defect detection holds at 70–90% when a review covers 200–400 LOC, at ≤500 LOC/hour, within 60–90 minutes. Past that envelope detection drops sharply, meaning the 1,000-line PRs AI generates are structurally beyond human review [9][10].

Individual: code quality +3.4% (DORA 2024) · code review speed +3.1% (DORA 2024) · documentation quality +7.5% (DORA 2024) · task completion +21% (Faros 2025) · Copilot group coding speed +55.8% (Peng 2023)
Organization: delivery throughput −1.5% (DORA 2024) · delivery stability −7.2% (DORA 2024) · PR review time +91% (Faros 2025) · median time-to-first-review +156.6% (Faros 2026) · median review linger time +441.5% (Faros 2026)
The slowest stage (review) governs throughput (Amdahl's Law).

Fig 1. The AI productivity paradox: individual metrics climb while org-level delivery slips
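
The Amdahl's-law reading of these numbers can be worked through as a rough calculation. The 55.8% figure is the one from Peng et al. cited above; the share of end-to-end lead time spent writing code, p = 0.3, is a purely illustrative assumption and not a measured value.

```latex
S_{\text{overall}} = \frac{1}{(1 - p) + \dfrac{p}{s}}, \qquad
s = \frac{1}{1 - 0.558} \approx 2.26, \quad p = 0.3
\;\Rightarrow\;
S_{\text{overall}} \approx \frac{1}{0.7 + 0.133} \approx 1.20
```

Under that assumption, more than doubling coding speed shortens lead time by only about 17%, and any growth in the review share of lead time erodes even that.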

Moved (refactor / reuse), declining: 25% (2021) → <10% (2025)
Copy / paste, surging: 8% (2021) → ~18% (2025)
For the first time on record, copied lines exceeded moved lines.

Fig 2. GitClear, across 211M LOC over four years: reuse fell, copying rose

Human reviewer envelope (sweet spot, 70–90% defect detection): 200–400 LOC · 60–90 min · ≤ 500 LOC/hr. Detection declines beyond it.
AI-generated PR: ~1,000 LOC, structurally beyond human review capacity.

Fig 3. SmartBear–Cisco: per-review cognitive limits (the sweet spot) and where AI-generated PRs land

| Assumption behind today's policy | What's true after AI (source) |
| --- | --- |
| Keep PRs under 200–400 lines | The SmartBear–Cisco threshold reflects human cognition; AI clears it in a single shot |
| Reviewer reads line-by-line and LGTMs | Faros AI: +98% merges, +91% review time; 2026 follow-up shows median linger time +441.5% |
| The author understands intent best | Intent lives in the spec, not in AI-generated code: the rationale behind Spec-Driven Development (arXiv 2602.00180) |
| Quality is maintained by refactor/reuse | GitClear 2025: moved 25% → <10%, copy/paste 8% → ~18% |
| 1–2 reviewers gate the merge | DORA 2024: throughput −1.5%, stability −7.2% (individual speed ↑ vs system performance ↓) |

2. What to change — redesigning the three axes

Axis 1. PR unit policy — one intent, not a line cap

  • Before: "PRs under 400 lines" (the SmartBear–Cisco-derived industry consensus) [10]
  • After: "A PR is one verifiable intent"; the LOC cap is demoted to a secondary signal

Concrete rules:

  • 1 PR = 1 acceptance-criteria unit. Even an 800-line PR is fine if it satisfies a single spec.
  • AI-generated boilerplate (tests, migrations, generated code) gets a separate label so reviewers know what to read closely and what to skim. (GitClear's "moved vs copy/paste" distinction becomes a policy signal.) [7]
  • LOC caps become warning lines, not blockers: when the SmartBear–Cisco threshold (~400 LOC, ~500 LOC/hr) is exceeded, an automated comment asks "Can this intent be split further?" [9]
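
The warning-line rule can be sketched as a small check. This is a minimal illustration rather than a reference implementation: the `ChangedFile` shape, the label names in `SKIM_LABELS`, and the exact cutoff of 400 are assumptions made for the example; only the ~400 LOC threshold and the comment wording come from the rules above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# Assumed values: ~400 LOC is the SmartBear-Cisco threshold cited above;
# the label names are hypothetical.
WARN_LOC = 400
SKIM_LABELS = frozenset({"generated", "boilerplate"})


@dataclass
class ChangedFile:
    path: str
    added: int
    deleted: int
    labels: frozenset[str] = frozenset()


def close_read_loc(files: list[ChangedFile]) -> int:
    """Count changed lines a reviewer is expected to read closely."""
    return sum(
        f.added + f.deleted
        for f in files
        if not (f.labels & SKIM_LABELS)  # skim-labeled files are excluded
    )


def loc_warning(files: list[ChangedFile]) -> Optional[str]:
    """Return a non-blocking comment when the warning line is crossed."""
    loc = close_read_loc(files)
    if loc <= WARN_LOC:
        return None
    return (
        f"This PR has {loc} close-read lines (warning line: {WARN_LOC}). "
        "Can this intent be split further?"
    )
```

The function only returns a comment string; it never fails the build, which is the point of demoting the cap from a gate to a signal.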

Axis 2. Review target policy — review intent, not code

Reviewer → Verifier. Review the spec, acceptance criteria, and constraints, not the diff. Spec-Driven Development (SDD) treats the spec as the primary artifact and code as the secondary one, which directly targets the AI-era code-review bottleneck [11].

SDD-adjacent findings:

  • Constitutional Spec-Driven Development (arXiv 2602.02584): enforcing constraints as a "constitution" at the spec stage cut security defects by 73% in-domain, with no slowdown [12].
  • Red Hat guide: compared to ad-hoc "vibe coding," SDD raises the consistency and verifiability of AI output [13].

What the new PR template should contain:

  1. Intent: what is changing and why (1–3 sentences)
  2. Acceptance Criteria: what must pass for "done" (Given/When/Then or a checklist; Gherkin recommended) [11]
  3. Constraints / Non-goals: what must not be touched, domain contracts
  4. Verification Evidence: test output, screenshots, logs, benchmarks (a human must be able to reproduce)
  5. AI Co-author ratio / Risk zone: annotate which parts AI generated and which a human verified
  6. Rollback Plan: especially required where instant rollback is impossible (e.g. mobile)

PR #1234 · feat(auth): add wallet.read OAuth scope

01 · ## Intent (what is changing and why)
Fix 403 thrown to new users hitting the payments page. Add OAuth scope `wallet.read` so authorization flows match.

02 · ## Acceptance Criteria (pass criteria, Given/When/Then)
- [ ] New signups no longer 403 on payments entry
- [ ] Existing tokens keep working (backward compatible)
- [ ] Missing scope returns explicit `E_SCOPE_MISSING`

03 · ## Constraints / Non-goals (what this PR will not touch)
× OAuth client IDs unchanged
× Session cookie names unchanged
× Payment transaction logic unchanged

04 · ## Verification Evidence (reproducible evidence)
▸ Unit 8/8 passing (CI #482)
▸ Staging E2E recording: link
▸ DB migration dry-run log: link

05 · ## AI Co-author / Risk Zone (where AI-generated and human-verified code meet)
AI 80%: boilerplate, test cases
Human-verified: scope mapping logic, expiration handling

06 · ## Rollback Plan (how to revert)
Toggle feature flag `auth_v2` off → reverts within 30s
DB migration backward-compatible; drop in v+2

Fig 4. The new PR description template: 6 fields from Intent through Rollback that must be filled before merge
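
A template only helps if empty fields are caught before merge. The following is a minimal sketch of such a check, assuming the PR body is Markdown and each field is a level-2 heading named exactly as in Fig 4; the function name and the heading spelling are assumptions for illustration, not an existing tool's API.

```python
from __future__ import annotations

import re

# Section names follow Fig 4; exact spelling is an assumption of this sketch.
REQUIRED_SECTIONS = [
    "Intent",
    "Acceptance Criteria",
    "Constraints / Non-goals",
    "Verification Evidence",
    "AI Co-author / Risk Zone",
    "Rollback Plan",
]


def missing_sections(pr_body: str) -> list[str]:
    """Return required sections that are absent or have no content."""
    # re.split with one capture group yields
    # [preamble, heading1, body1, heading2, body2, ...].
    parts = re.split(r"^##[ \t]+(.+?)[ \t]*$", pr_body, flags=re.MULTILINE)
    found = {
        heading.strip().lower(): body.strip()
        for heading, body in zip(parts[1::2], parts[2::2])
    }
    # A section counts as filled only if its body is non-empty.
    return [s for s in REQUIRED_SECTIONS if not found.get(s.lower())]
```

A CI step could call `missing_sections` on the PR description and block the merge while the returned list is non-empty.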

Axis 3. Gate policy — single approval → multi-layer trust (Swiss Cheese)

A single LGTM gate can no longer hold. Google Engineering Practices still states that "the primary purpose of code review is to make sure that the overall code health of Google's code base is improving over time," but in the AI era that work has to spread across multiple gates to avoid the one-reviewer bottleneck [14].

  • L1 Deterministic Guardrails (auto-block): lint · types · unit/integration tests · security scans
  • L2 AI Code Review (warn): first-pass filter for syntax · style · obvious bugs · security patterns
  • L3 Spec / Intent Review (block): humans review intent + acceptance criteria, before coding starts
  • L4 Human Block Zone (block): tribal knowledge · regulated paths · native critical paths
  • L5 Post-merge Observability (recover): monitoring · feature flags · canary · auto-rollback

A PR must clear L1 → L2 → L3 → L4 to merge; L5 is the post-merge safety net.

Fig 5. Multi-layer gates: L1–L5 each catch different risks, replacing the single LGTM bottleneck

Layer-by-layer rationale:

  • L1 is the "small batch + robust testing" basics DORA 2024 emphasizes [1].
  • L2 exploits the cognitive-load reduction from AI assistance reported by Ziegler et al. (GitHub Research) in CACM [15].
  • L3 is the pre-code intent and acceptance-criteria review SDD recommends [11].
  • L4 is bounded to tribal knowledge, regulated paths, and native critical paths.
  • L5 is post-merge recovery (feature flags, canary, auto-rollback) that reinforces DORA's stability metrics [1].
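
How the layers combine into a merge decision can be sketched as follows. This is an illustration under assumptions: each layer is reduced to a single boolean "cleared" result, the block/warn modes are taken from Fig 5, and the function name and note strings are hypothetical.

```python
from __future__ import annotations

# Mode of each pre-merge layer as labeled in Fig 5: True = blocking,
# False = warn only. L5 runs after merge, so it is not part of this decision.
BLOCKING = {"L1": True, "L2": False, "L3": True, "L4": True}


def merge_decision(results: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return (can_merge, notes) for the pre-merge layers L1-L4.

    `results` maps a layer name to True when that layer cleared the PR.
    """
    can_merge = True
    notes: list[str] = []
    for layer, blocking in BLOCKING.items():
        cleared = results.get(layer)
        if cleared is None:
            # A layer that has not reported yet is treated as not cleared.
            can_merge = False
            notes.append(f"{layer}: no result yet")
        elif cleared:
            continue
        elif blocking:
            can_merge = False
            notes.append(f"{layer}: blocked")
        else:
            notes.append(f"{layer}: warning only")
    return can_merge, notes
```

Keeping L2 in warn mode means an AI reviewer's findings show up as notes for the human gates instead of stopping the merge on their own.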

3. Mobile / app considerations

"Ship fast, revert faster" is comfortable on server / web, but mobile apps are constrained by release cadence and slow rollback. The "small batch + robust testing" principle DORA 2024 highlighted needs to be applied more strictly on mobile [1].

| Deploy surface | Required gates | Rollback time | Note |
| --- | --- | --- | --- |
| Server / Web | L1 · L2 · L5 | < 1 min | Observability-based recovery is enough |
| OTA / CodePush | L1 · L2 · L3 | hours | Partial reliance on L5 |
| Native binary | L1 – L4 | days (store review) | Payments · signing · key mgmt → L4 human block required |

Same PR, different surface: the merge gates must adapt to deploy economics.

Fig 6. Rollback time by deploy surface: the costlier the revert, the heavier the pre-deploy gates

  • Make pre-deploy gates (L1–L4) heavier, and lean less on post-deploy observability (L5).
  • Native critical paths — payments, signing, key management, WalletConnect — require L4 human block. Exclude them from AI-auto-merge.
  • Include UI snapshot tests and UI automation tests in L1. Visual regressions are hard to catch via observability.
  • Split OTA/CodePush-able areas from native code via PR labels to reflect rollback-cost differences in policy.
  • State the force-update policy and revert cost in every PR description.
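
The surface-dependent gating above can be expressed as a lookup plus a critical-path override. This is a sketch under stated assumptions: the gate sets mirror Fig 6, while the path prefixes, the surface keys, and the choice to apply the L4 override on every surface (not only native) are hypothetical.

```python
from __future__ import annotations

# Gate sets per deploy surface, as listed in Fig 6.
REQUIRED_GATES = {
    "server_web": {"L1", "L2", "L5"},
    "ota": {"L1", "L2", "L3"},
    "native": {"L1", "L2", "L3", "L4"},
}

# Hypothetical repository layout for critical paths.
CRITICAL_PREFIXES = ("payments/", "signing/", "keys/", "walletconnect/")


def required_gates(surface: str, changed_paths: list[str]) -> set[str]:
    """Return the gates a PR must clear for its deploy surface."""
    gates = set(REQUIRED_GATES[surface])
    # Assumption: any touch on a critical path forces the L4 human block.
    if any(path.startswith(CRITICAL_PREFIXES) for path in changed_paths):
        gates.add("L4")
    return gates


def auto_merge_allowed(surface: str, changed_paths: list[str]) -> bool:
    """Auto-merge is allowed only when no human block (L4) is required."""
    return "L4" not in required_gates(surface, changed_paths)
```

Driving the surface from PR labels, as suggested above, keeps the lookup independent of how the repository is laid out.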

4. Metrics — what to measure so the policy stays alive

To see whether the policy is working, layer AI-era-specific metrics on top of DORA's four keys (throughput / stability) [1][4].

| Metric | Meaning / source | Goal |
| --- | --- | --- |
| PR Review Lead Time (median / mean) | Faros AI's most-degraded headline metric [6] | |
| Code Churn Rate (% of new code rewritten within 2 weeks) | Tracks GitClear's short-term churn at the team level [7] | |
| Copy/Paste vs Moved Code Ratio | GitClear's central signal: duplication vs reuse [7] | Moved ↑, Copy/Paste ↓ |
| Rubber Stamp Rate (% of PRs with 0–1 review comments) | Signal of review formalization; SmartBear's "active review" principle [16] | |
| Delivery Throughput / Change Failure Rate / MTTR | DORA 4 keys; confirms org-level effect [1] | Throughput ↑, Failure ↓, MTTR ↓ |
| Spec Review Coverage | % of PRs with explicit acceptance criteria; SDD adoption indicator [11] | |
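
Three of these metrics can be computed from basic PR records. The sketch below is illustrative only: the `PullRequest` fields are hypothetical, and review lead time is assumed here to mean opened-to-merged time, which may differ from how Faros AI defines it.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class PullRequest:
    opened_at: datetime
    merged_at: datetime
    review_comments: int
    has_acceptance_criteria: bool


def median_review_lead_time_hours(prs: list[PullRequest]) -> float:
    """Median opened-to-merged time in hours (expects a non-empty list)."""
    return median(
        (pr.merged_at - pr.opened_at).total_seconds() / 3600 for pr in prs
    )


def rubber_stamp_rate(prs: list[PullRequest]) -> float:
    """Share of PRs merged with 0-1 review comments (non-empty list)."""
    return sum(pr.review_comments <= 1 for pr in prs) / len(prs)


def spec_review_coverage(prs: list[PullRequest]) -> float:
    """Share of PRs with explicit acceptance criteria (non-empty list)."""
    return sum(pr.has_acceptance_criteria for pr in prs) / len(prs)
```

Tracking the median alongside the mean matters because a few long-lingering PRs can dominate the mean.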

5. A four-phase transition roadmap

  • Phase 0, Measure (2 weeks): capture baseline (DORA 4 keys + review metrics + churn). Done when: baseline dashboard live.
  • Phase 1, PR template + AI review (1 month): enforce Intent/AC/Evidence template, add L2 AI code review. Done when: syntax/style comments down 80%.
  • Phase 2, Spec / Intent pilot (2 months): pre-code acceptance-criteria review (1–2 teams). Done when: spec-to-code drift measurable.
  • Phase 3, Human Block Zone (ongoing): humans gate only critical paths; rest auto-approved. Done when: review wait time down 50%.

Fig 7. Four-phase transition: measure → AI first-pass review → spec pilot → human-block zone

  • Phase 0 (Measure, 2 weeks): capture DORA 4 keys + Faros-style review metrics + GitClear churn as the baseline [1][7].
  • Phase 1 (PR template + AI first-pass review, 1 month): enforce the new Intent/AC/Evidence template; add GitHub Copilot Code Review (or equivalent) as L2 [15].
  • Phase 2 (Spec / Intent review pilot, 2 months): adopt pre-code acceptance-criteria review on 1–2 teams (SDD) [11].
  • Phase 3 (Human Block Zone separation, ongoing): humans gate only critical paths; the rest auto-approves via L1+L2+L5 while preserving DORA's "basics" principle [1]. Goal: review wait time −50%, production defect rate stable.

6. Open questions — leaving them honest

  • The spec-writing bottleneck. The code-review bottleneck may simply migrate into a spec-review bottleneck. The SDD literature names this limit explicitly [11].
  • Junior growth paths. Traditional code review was also a learning mechanism (Google eng-practices stresses mentoring) [14]. What replaces it in a verifier-centric model?
  • Legal accountability. Who owns failure when AI generates and AI reviews? DORA 2024's finding that 39% of respondents have little or no trust in AI-generated code points to this gap [1].
  • Same-kind verification blind spot. When the same model family generates and verifies, systemic biases reproduce on both sides.
  • Individual vs organization gap. Faros and DORA agree on "individual productivity ↑, org delivery ↓." How do we narrow it? [4][1]

7. Conclusion — in one line

A "good PR" is no longer a small diff. It's one verifiable intent plus the multi-layered evidence proving it.

Demote the LOC cap to a warning line (SmartBear–Cisco threshold); promote intent, acceptance criteria, and multi-layer gates to first-class policy.



Footnotes

  1. Accelerate State of DevOps Report 2024 (DORA)

  2. Announcing the 2024 DORA report (Google Cloud Blog)

  3. 2025 State of AI-assisted Software Development (DORA)

  4. The AI Productivity Paradox Report 2025 (Faros AI)

  5. Are AI coding assistants really saving time, money and effort? (Faros AI)

  6. The AI Engineering Report 2026: The AI Acceleration Whiplash (Faros AI)

  7. AI Copilot Code Quality 2025 (GitClear)

  8. Peng et al., The Impact of AI on Developer Productivity (arXiv 2302.06590)

  9. Code Review at Cisco Systems (SmartBear, PDF)

  10. What Is Code Review? (SmartBear)

  11. Spec-Driven Development (arXiv 2602.00180)

  12. Constitutional Spec-Driven Development (arXiv 2602.02584)

  13. How spec-driven development improves AI coding quality (Red Hat)

  14. The Standard of Code Review (Google eng-practices)

  15. Measuring GitHub Copilot's Impact on Productivity (CACM, Ziegler et al.)

  16. Best Practices for Peer Code Review (SmartBear)

