Redefining "Good PRs" in the AI Era — What to Change, Why, and How

AI has changed how many PRs get opened, how large they are, and how fast they are produced. The "small / frequent / line-by-line" review policy can't survive as-is. Three axes must be redesigned together: what to bundle as a PR (unit), what to review (target), and who blocks (gate).

1. The problem — why "good PRs" are breaking now

Primary research and industry reports since AI adoption show a consistent pattern: individual productivity rises, but org-level delivery performance stagnates or declines.

Google DORA 2024 Report: A 25% increase in AI adoption is associated with +3.4% code quality, +3.1% code review speed, and +7.5% documentation quality, yet delivery throughput falls 1.5% and delivery stability falls 7.2%. About three-quarters of respondents use AI for at least one daily task, while 39% report little or no trust in AI-generated code.¹²
Google DORA 2025 — State of AI-assisted Software Development: AI acts as an amplifier; it scales whichever organizational strengths and weaknesses already exist. Strategic investment in foundational systems determines ROI more than tooling does.³
Faros AI "Productivity Paradox" 2025 (10,000+ developers, 1,255 teams): high-AI teams complete +21% more tasks and merge +98% more PRs, but PR review time rises +91%. By an Amdahl's-law argument, the stage that AI does not speed up (review) caps overall throughput, however fast coding becomes.⁴⁵
Faros AI Engineering Report 2026 ("Acceleration Whiplash"): a "senior engineer tax" emerges. Median time-to-first-review +156.6%, average review time +199.6%, median linger time +441.5%. AI output looks plausible on the surface, hiding defects deeper and making review cognitively more expensive.⁶
GitClear 2025 AI Copilot Code Quality (211M LOC, 2021–2025): "moved" code — the marker of refactoring/reuse — dropped from 25% to under 10%, while copy/paste rose from 8% to ~18%. For the first time on record, copied lines exceeded moved lines. AI optimizes short-term output at the cost of long-term maintainability.⁷
Peng et al., arXiv 2302.06590 (GitHub/Microsoft): in a controlled trial the Copilot group completed identical tasks 55.8% faster. Individual coding velocity gains are real.⁸
SmartBear–Cisco Code Review Study (the largest industry code-review study): defect detection holds at 70–90% when a review covers 200–400 LOC, ≤500 LOC/hour, within 60–90 minutes. Past that envelope detection drops sharply — meaning the 1,000-line PRs AI generates are structurally beyond human review.⁹¹⁰
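
The Amdahl-style argument in the findings above can be made concrete with a small calculation. The sketch below is illustrative only: the split of lead time across stages (`share`) is a hypothetical assumption, not a figure from any of the cited reports, and it loosely combines a per-task speedup (Peng et al.) with a per-PR review slowdown (Faros 2025).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    share: float       # fraction of baseline lead time spent here (assumed, not measured)
    multiplier: float  # new time / old time for this stage after AI adoption


def relative_lead_time(stages: list[Stage]) -> float:
    """Return the new lead time as a multiple of the baseline (1.0 = unchanged)."""
    total_share = sum(s.share for s in stages)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("stage shares must sum to 1.0")
    return sum(s.share * s.multiplier for s in stages)


# Hypothetical shares; only the multipliers are taken from the studies quoted above.
stages = [
    Stage("coding", share=0.30, multiplier=1 - 0.558),  # tasks done 55.8% faster
    Stage("review", share=0.40, multiplier=1 + 0.91),   # review time up 91%
    Stage("other", share=0.30, multiplier=1.0),         # CI, deploy, waiting: unchanged
]

print(round(relative_lead_time(stages), 3))
```

Under these assumed shares the lead time comes out at roughly 1.2 times the baseline: the coding gain is real, but it is more than cancelled by the slower review stage.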

[Figure 1 panels. Individual (up): code quality +3.4% (DORA 2024), code review speed +3.1% (DORA 2024), documentation quality +7.5% (DORA 2024), task completion +21% (Faros 2025), Copilot group coding speed +55.8% (Peng 2023). Organization (down): delivery throughput −1.5% (DORA 2024), delivery stability −7.2% (DORA 2024), PR review time +91% (Faros 2025), median time-to-first-review +156.6% (Faros 2026), median review linger time +441.5% (Faros 2026). Annotation: the slowest stage (review) governs throughput (Amdahl's Law).]

Fig 1. The AI productivity paradox — individual metrics climb while org-level delivery slips

[Figure 2 panels. Moved (refactor / reuse), reuse declining: 25% in 2021, under 10% in 2025. Copy / paste, duplication surging: 8% in 2021 (per the GitClear figures quoted above), ~18% in 2025. Annotation: for the first time on record, copied lines exceeded moved lines.]

Fig 2. GitClear — across 211M LOC over four years: reuse fell, copying rose

[Figure 3 axes. Horizontal axis: PR size in LOC, with marks at 200, 400, and 1,000. Regions along the axis: small, sweet spot (70–90% defect detection), detection declines, AI territory. Human reviewer envelope: 200–400 LOC · 60–90 min · ≤ 500 LOC/hr. AI-generated PR: ~1,000 LOC, structurally beyond human review capacity.]

Fig 3. SmartBear–Cisco — per-review cognitive limits (the sweet spot) and where AI-generated PRs land
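
To see why a 1,000-line PR falls outside the envelope, the thresholds can be turned into simple arithmetic. This is a minimal sketch that treats the SmartBear–Cisco numbers as hard limits, which is a simplification; the function names are hypothetical.

```python
import math

# Thresholds quoted from the SmartBear–Cisco study above.
MAX_LOC_PER_REVIEW = 400    # upper end of the 200–400 LOC sweet spot
MAX_LOC_PER_HOUR = 500      # inspection rate ceiling
MAX_SESSION_MINUTES = 90    # upper end of the 60–90 minute window


def minimum_review_minutes(changed_loc: int) -> float:
    """Fastest review that still respects the 500 LOC/hour ceiling."""
    return changed_loc / MAX_LOC_PER_HOUR * 60


def review_sessions_needed(changed_loc: int) -> int:
    """Separate sittings needed to stay inside the size and time limits."""
    if changed_loc < 0:
        raise ValueError("changed_loc must be non-negative")
    # One sitting is capped by the size limit and by rate x session length.
    loc_by_time = MAX_LOC_PER_HOUR * MAX_SESSION_MINUTES // 60
    loc_per_session = min(MAX_LOC_PER_REVIEW, loc_by_time)
    return math.ceil(changed_loc / loc_per_session)


for loc in (400, 1000):
    print(loc, minimum_review_minutes(loc), review_sessions_needed(loc))
```

A 400-line PR fits in one 48-minute sitting, while a 1,000-line PR needs at least 120 minutes spread over three sittings, which already exceeds the 60–90 minute window for a single review.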

Assumption behind today's policy	What's true after AI (source)
Keep PRs under 200–400 lines	The SmartBear–Cisco threshold reflects human cognition; AI clears it in a single shot
Reviewer reads line-by-line and LGTMs	Faros AI: +98% merges, +91% review time; 2026 follow-up shows median linger time +441.5%
The author understands intent best	Intent lives in the spec, not in AI-generated code — the rationale behind Spec-Driven Development (arXiv 2602.00180)
Quality is maintained by refactor/reuse	GitClear 2025: moved 25% → <10%, copy/paste 8% → ~18%
1–2 reviewers gate the merge	DORA 2024: throughput −1.5%, stability −7.2% — individual speed ↑ vs system performance ↓

2. What to change — redesigning the three axes

Axis 1. PR unit policy — one intent, not a line cap

Before: "PRs under 400 lines" (the SmartBear–Cisco-derived industry consensus)¹⁰
After: "A PR is one verifiable intent" — the LOC cap is demoted to a secondary signal

Concrete rules:

1 PR = 1 acceptance-criteria unit. Even an 800-line PR is fine if it satisfies a single spec.
AI-generated boilerplate (tests, migrations, generated code) gets a separate label so reviewers know what to read closely vs skim. (GitClear's "moved vs copy/paste" distinction becomes a policy signal.)⁷
LOC caps become warning lines, not blocking: when the SmartBear–Cisco threshold (~400 LOC, ~500 LOC/hr) is exceeded, an automated comment asks "Can this intent be split further?"⁹
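
The three rules above could be wired into a non-blocking check roughly as follows. This is a sketch under stated assumptions: the `generated` flag (set from a PR label or path rule), the choice to count only non-generated lines toward the warning, and the `acceptance_criteria_units` input are all hypothetical and not part of the source policy.

```python
from dataclasses import dataclass
from typing import Optional

WARN_LOC = 400  # SmartBear–Cisco-derived threshold, used as a warning line only


@dataclass(frozen=True)
class ChangedFile:
    path: str
    added: int
    deleted: int
    generated: bool  # hypothetical: True for labeled boilerplate (tests, migrations)


def size_comment(files: list[ChangedFile], acceptance_criteria_units: int) -> Optional[str]:
    """Return an advisory comment, or None. This check never blocks a merge."""
    if acceptance_criteria_units > 1:
        return "This PR covers more than one acceptance-criteria unit. Can it be split?"
    # Assumption: only lines a human must read closely count toward the warning.
    read_closely = sum(f.added + f.deleted for f in files if not f.generated)
    if read_closely > WARN_LOC:
        return (
            f"{read_closely} closely-read lines exceed the {WARN_LOC}-line warning. "
            "Can this intent be split further?"
        )
    return None
```

With this shape, an 800-line PR made of 300 hand-written lines and 500 labeled boilerplate lines passes silently, while 500 hand-written lines trigger the question from the policy text.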

Axis 2. Review target policy — review intent, not code

Reviewer → Verifier. Review the spec, acceptance criteria, and constraints, not the diff. Spec-Driven Development (SDD) treats the spec as the primary artifact and code as the secondary one — directly aligned with relieving the AI-era code-review bottleneck.¹¹

SDD-adjacent findings:

Constitutional Spec-Driven Development (arXiv 2602.02584): enforcing constraints as a "constitution" at the spec stage cut security defects by 73% in-domain, with no slowdown.¹²
Red Hat guide: compared to ad-hoc "vibe coding," SDD raises the consistency and verifiability of AI output.¹³

What the new PR template should contain:

Intent — what is changing and why (1–3 sentences)
Acceptance Criteria — what must pass for "done" (Given/When/Then or a checklist; Gherkin recommended)¹¹
Constraints / Non-goals — what must not be touched, domain contracts
Verification Evidence — test output, screenshots, logs, benchmarks (a human must be able to reproduce)
AI Co-author ratio / Risk zone — annotate which parts AI generated and which a human verified
Rollback Plan — especially required where instant rollback is impossible (e.g. mobile)

PR #1234 · feat(auth): add wallet.read OAuth scope

## Intent (what is changing and why)

Fix 403 thrown to new users hitting the payments page. Add OAuth scope `wallet.read` so authorization flows match.

## Acceptance Criteria (pass criteria, Given/When/Then)

- [ ] New signups no longer 403 on payments entry
- [ ] Existing tokens keep working (backward compatible)
- [ ] Missing scope returns explicit `E_SCOPE_MISSING`

## Constraints / Non-goals (what this PR will not touch)

× OAuth client IDs unchanged
× Session cookie names unchanged
× Payment transaction logic unchanged

## Verification Evidence (reproducible evidence)

▸ Unit 8/8 passing (CI #482)
▸ Staging E2E recording: link
▸ DB migration dry-run log: link

## AI Co-author / Risk Zone (where AI-generated and human-verified code meet)

AI 80%: boilerplate, test cases
Human-verified: scope mapping logic, expiration handling

## Rollback Plan (how to revert)

Toggle feature flag `auth_v2` off → reverts within 30s
DB migration backward-compatible; drop in v+2

Fig 4. The new PR description template — 6 fields from Intent through Rollback that must be filled before merge
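
One way to enforce "must be filled before merge" is a small check over the PR description. The sketch below assumes the description uses `## ` headings named exactly after the six template fields; the function name and the heading convention are assumptions for illustration, not an existing tool's behavior.

```python
import re

REQUIRED_SECTIONS = [
    "Intent",
    "Acceptance Criteria",
    "Constraints / Non-goals",
    "Verification Evidence",
    "AI Co-author / Risk Zone",
    "Rollback Plan",
]

# Matches second-level headings such as "## Intent" at the start of a line.
HEADING = re.compile(r"^##[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(description: str) -> list[str]:
    """Return required sections that are absent or have an empty body."""
    matches = list(HEADING.finditer(description))
    bodies: dict[str, str] = {}
    for i, match in enumerate(matches):
        # A section body runs until the next heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(description)
        bodies[match.group(1).strip().lower()] = description[match.end():end].strip()
    return [name for name in REQUIRED_SECTIONS if not bodies.get(name.lower())]
```

A CI job could fail or comment when `missing_sections(...)` returns a non-empty list. Checking only for presence is a deliberate limit: whether the content is meaningful remains a reviewer's judgment.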

Axis 3. Gate policy — single approval → multi-layer trust (Swiss Cheese)

A single LGTM gate can no longer hold. Google Engineering Practices still states that the primary purpose of code review is to make sure the overall code health of the codebase improves over time, but in the AI era that work has to spread across multiple gates to avoid the one-reviewer bottleneck.¹⁴

L1. Deterministic Guardrails: lint · types · unit/integration tests · security scans (auto-block)
L2. AI Code Review: first-pass filter for syntax · style · obvious bugs · security patterns (warn)
L3. Spec / Intent Review: humans review intent + acceptance criteria, before coding starts (block)
L4. Human Block Zone: tribal knowledge · regulated paths · native critical paths (block)
L5. Post-merge Observability: monitoring · feature flags · canary · auto-rollback (recover)

A PR must clear L1 → L2 → L3 → L4 to merge; L5 is the post-merge safety net

Fig 5. Multi-layer gates — L1–L5 each catch different risks, replacing the single LGTM bottleneck

Layer-by-layer rationale:

L1 is the "small batch + robust testing" basics DORA 2024 emphasizes.¹
L2 exploits the "cognitive load reduction from AI" reported in CACM by Ziegler et al. (GitHub Research).¹⁵
L3 is the pre-code intent and acceptance-criteria review SDD recommends.¹¹
L4 is bounded to tribal knowledge, regulated paths, and native critical paths.
L5 is post-merge recovery (feature flags, canary, auto-rollback) that reinforces DORA's stability metrics.¹
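
The layered gates can be read as a small decision rule: failures in blocking layers stop the merge, failures in the warn-only layer are surfaced but do not stop it, and L5 acts only after merge. The sketch below encodes that reading; the names and the rule that a missing result counts as a failure are assumptions.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"


# Pre-merge layer -> whether a failure blocks the merge.
BLOCKING = {
    "L1": True,   # deterministic guardrails: auto-block
    "L2": False,  # AI code review: warn only
    "L3": True,   # spec / intent review: block
    "L4": True,   # human block zone: block
}


def merge_decision(results: dict[str, Outcome]) -> tuple[bool, list[str]]:
    """Return (can_merge, notes). L5 is post-merge and is not evaluated here."""
    can_merge = True
    notes: list[str] = []
    for layer, blocks in BLOCKING.items():
        # Assumption: a layer with no recorded result is treated as failed.
        outcome = results.get(layer, Outcome.FAIL)
        if outcome is Outcome.FAIL:
            if blocks:
                can_merge = False
                notes.append(f"{layer}: blocked")
            else:
                notes.append(f"{layer}: warning")
    return can_merge, notes
```

For example, a PR that passes L1, L3, and L4 but draws AI review findings at L2 still merges, with `"L2: warning"` attached for the author to address.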

3. Mobile / app considerations

"Ship fast, revert faster" works on server and web, but mobile apps are constrained by release cadence and slow rollback. The "small batch + robust testing" principle DORA 2024 highlights therefore applies with more force on mobile.¹

Deploy surface	Required gates	Rollback time	Notes
Server / Web	L1 · L2 · L5	< 1 min	Observability-based recovery is enough
OTA / CodePush	L1 · L2 · L3	hours	Partial reliance on L5
Native binary	L1 – L4	days (store review)	Payments · signing · key mgmt → L4 human block required

Same PR, different surface: the merge gates must adapt to deploy economics

Fig 6. Rollback time by deploy surface — the costlier the revert, the heavier the pre-deploy gates

Make pre-deploy gates (L1–L4) heavier, and lean less on post-deploy observability (L5).
Native critical paths — payments, signing, key management, WalletConnect — require L4 human block. Exclude them from AI-auto-merge.
Include UI snapshot tests and UI automation tests in L1. Visual regressions are hard to catch via observability.
Split OTA/CodePush-able areas from native code via PR labels to reflect rollback-cost differences in policy.
State the force-update policy and revert cost in every PR description.
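
The surface-dependent gates could be expressed as a lookup plus a critical-path override. In this sketch the path prefixes are hypothetical stand-ins for "payments, signing, key management, WalletConnect", and applying the L4 override on every surface (not only native) is an assumption that goes slightly beyond the text.

```python
from enum import Enum


class Surface(Enum):
    SERVER_WEB = "server_web"
    OTA = "ota"
    NATIVE = "native"


# Required gates per deploy surface, as listed for Fig 6.
REQUIRED_GATES = {
    Surface.SERVER_WEB: {"L1", "L2", "L5"},
    Surface.OTA: {"L1", "L2", "L3"},
    Surface.NATIVE: {"L1", "L2", "L3", "L4"},
}

# Hypothetical path prefixes marking critical paths in a repository.
CRITICAL_PREFIXES = ("payments/", "signing/", "keys/", "walletconnect/")


def required_gates(surface: Surface, changed_paths: list[str]) -> set[str]:
    """Gates a PR must clear, given where it deploys and what it touches."""
    gates = set(REQUIRED_GATES[surface])
    # Assumption: touching a critical path forces the human block on any surface.
    if any(path.startswith(CRITICAL_PREFIXES) for path in changed_paths):
        gates.add("L4")
    return gates


def auto_merge_allowed(surface: Surface, changed_paths: list[str]) -> bool:
    """AI-auto-merge is excluded whenever the L4 human block applies."""
    return "L4" not in required_gates(surface, changed_paths)
```

A PR label for the deploy surface (OTA-able vs native) would supply the `surface` argument, which is how the rollback-cost difference reaches the merge policy.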

4. Metrics — what to measure so the policy stays alive

To see whether the policy is working, layer AI-era-specific metrics on top of DORA's four keys (throughput / stability).¹⁴

Metric	Meaning / source	Goal
PR Review Lead Time (median / mean)	Faros AI's most-degraded headline metric.⁶	↓
Code Churn Rate (% of new code rewritten within 2 weeks)	Tracks GitClear's short-term churn at the team level.⁷	↓
Copy/Paste vs Moved Code Ratio	GitClear's central signal — duplication vs reuse.⁷	Moved ↑, Copy/Paste ↓
Rubber Stamp Rate (% of PRs with 0–1 review comments)	Signal that review has become a formality; runs against SmartBear's "active review" principle.¹⁶	↓
Delivery Throughput / Change Failure Rate / MTTR	DORA 4 keys — confirms org-level effect.¹	Throughput ↑, Failure ↓, MTTR ↓
Spec Review Coverage	% of PRs with explicit acceptance criteria — SDD adoption indicator.¹¹	↑
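
Three of the metrics in the table can be computed from basic PR records. The sketch below assumes a hypothetical `PullRequest` record and defines review lead time as open-to-merge, which is one possible definition rather than the one Faros AI uses; all functions assume a non-empty list.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass(frozen=True)
class PullRequest:
    opened_at: datetime
    merged_at: datetime
    review_comments: int
    has_acceptance_criteria: bool


def review_lead_time_hours(prs: list[PullRequest]) -> float:
    """Median hours from PR open to merge (assumed definition)."""
    return median((pr.merged_at - pr.opened_at).total_seconds() / 3600 for pr in prs)


def rubber_stamp_rate(prs: list[PullRequest]) -> float:
    """Share of PRs merged with 0–1 review comments."""
    return sum(pr.review_comments <= 1 for pr in prs) / len(prs)


def spec_review_coverage(prs: list[PullRequest]) -> float:
    """Share of PRs that carry explicit acceptance criteria."""
    return sum(pr.has_acceptance_criteria for pr in prs) / len(prs)
```

Tracking these per team and per week against the Phase 0 baseline shows whether the new policy is moving the numbers in the direction listed in the Goal column.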

5. A four-phase transition roadmap

Phase	Duration	Focus	Action	Done when
Phase 0	2 weeks	Measure	Capture baseline: DORA 4 keys + review metrics + churn	Baseline dashboard live
Phase 1	1 month	PR template + AI review	Enforce Intent/AC/Evidence template, add L2 AI code review	Syntax/style comments down 80%
Phase 2	2 months	Spec / Intent pilot	Pre-code acceptance-criteria review (1–2 teams)	Spec-to-Code drift measurable
Phase 3	ongoing	Human Block Zone	Humans gate only critical paths; rest auto-approved	Review wait time down 50%

Fig 7. Four-phase transition — measure → AI first-pass review → spec pilot → human-block zone

Phase 0 (Measure, 2 weeks) — capture DORA 4 keys + Faros-style review metrics + GitClear churn as the baseline.¹⁷
Phase 1 (PR template + AI first-pass review, 1 month) — enforce the new Intent/AC/Evidence template; add GitHub Copilot Code Review (or equivalent) as L2.¹⁵
Phase 2 (Spec / Intent review pilot, 2 months) — adopt pre-code acceptance-criteria review on 1–2 teams (SDD).¹¹
Phase 3 (Human Block Zone separation, ongoing) — humans gate only critical paths; the rest auto-approves via L1+L2+L5 while preserving DORA's "basics" principle.¹ Goal: review wait time −50%, production defect rate stable.

6. Open questions — leaving them honest

The spec-writing bottleneck. The code-review bottleneck may simply migrate into a spec-review bottleneck. The SDD literature names this limit explicitly.¹¹
Junior growth paths. Traditional code review was also a learning mechanism (Google eng-practices stresses mentoring).¹⁴ What replaces it in a verifier-centric model?
Legal accountability. Who owns failure when AI generates and AI reviews? DORA 2024's finding that 39% of respondents have little or no trust in AI-generated code points to this gap.¹
Same-kind verification blind spot. When the same model family generates and verifies, systemic biases reproduce on both sides.
Individual vs organization gap. Faros and DORA agree on "individual productivity ↑, org delivery ↓." How do we narrow it?⁴¹

7. Conclusion — in one line

A "good PR" is no longer a small diff. It's one verifiable intent plus the multi-layered evidence proving it.

Demote the LOC cap to a warning line (SmartBear–Cisco threshold); promote intent, acceptance criteria, and multi-layer gates to first-class policy.

References

Google DORA Research

Faros AI Research

GitClear

AI Copilot Code Quality: 2025 Data (211M LOC, 2021–2025)

SmartBear / Cisco Code Review Study

Google Engineering Practices

Academic papers (arXiv / CACM)

Industry analysis
