17 Code review

17.1 Your first pull request

You’ve done the engineering. The logic is in a module, it’s tested, it’s configured, and you open a pull request to merge it. Within the hour, twelve comments come back. Some question your approach, some point out a case you didn’t handle, one asks why you named something the way you did. If you’ve worked alone — as most data scientists have — this lands as criticism: the work was put forward and found wanting.

It isn’t. Code review is a normal, expected, routine part of how teams ship software, applied to every change regardless of who wrote it, and the most senior engineer on the team gets the same twelve comments. Reframing it from “my work is being judged” to “my work is getting a second pair of eyes” is the first and most important step, because review is one of the highest-value practices in this book — and one that data science culture, built around solo notebooks, almost entirely lacks.

17.2 What review is for

Review does two jobs at once. The obvious one is quality: a second reader catches what the author cannot see — a missed edge case, a subtle bug, an assumption that won’t hold — precisely because they didn’t write it and aren’t blinded by what they meant. The less obvious, and arguably more valuable, job is knowledge sharing: the reviewer comes away understanding how the code works, so the project no longer lives in one person’s head. A team where every change is reviewed is a team where no single person leaving takes the only understanding of a system with them.

What review is not for is style. Whether the quotes are single or double, whether the line is too long — those are for the formatter and linter to settle automatically (Chapter 5), not for a human to litigate in comments. A review that descends into style nitpicks is one where the automation hasn’t been set up yet.

Data Science Bridge

Code review is peer review, applied to code. You already know and trust peer review: before an analysis goes to a stakeholder or a paper goes out, a colleague reads it, checks the reasoning, questions the method, and catches the error you were too close to see. Code review is the same institution — a knowledgeable peer reads your change, examines the logic, and pushes back where it’s unclear or wrong. The cultural machinery you respect in research already exists for code.

Where the analogy breaks down: peer review of a paper is rare and heavy — a whole study, reviewed in depth, once. Code review is frequent and light — a single change, reviewed quickly, many times a week. That difference inverts an instinct carried over from research, where you polish a large body of work before submitting it. Code review works best in the opposite mode: small, frequent changes reviewed in minutes, not a quarter’s work dropped in one enormous request.

17.3 Small changes, reviewed often

The single biggest determinant of whether review works is the size of the change. A thousand-line pull request gets a rubber stamp — no reviewer can hold that much in their head, so they skim it and approve. A fifty-line pull request gets a real review, because a reviewer can actually understand all of it and reason about whether it’s correct. Small changes are also reversible (the small commits of Chapter 2) and keep the main branch moving rather than blocked behind a giant merge.

This cuts directly against a notebook habit: the instinct to do all the work, then share it once it’s “done”. Reviewable work is the opposite — a stream of small, self-contained changes, each understandable on its own. Splitting work this way is a skill, and it’s worth developing, because it’s what makes everything else about review function.
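One way to make “small” concrete is to measure a change before asking anyone to read it. The sketch below is an illustration under stated assumptions, not a standard tool: it sums the text that git diff --numstat prints (added count, deleted count, and path, separated by tabs) and compares the total against a limit. The names ChangeSize, parse_numstat, and is_reviewable are hypothetical, and the 100-line limit is an arbitrary placeholder rather than a threshold any tool enforces.

```python
from dataclasses import dataclass

# Hypothetical limit for illustration only; pick whatever suits your team.
MAX_REVIEWABLE_LINES = 100


@dataclass
class ChangeSize:
    files: int
    added: int
    deleted: int

    @property
    def total(self) -> int:
        # Lines a reviewer has to read: everything added plus everything removed.
        return self.added + self.deleted


def parse_numstat(numstat: str) -> ChangeSize:
    """Sum `git diff --numstat` text: one 'added<TAB>deleted<TAB>path' per line."""
    files = added = deleted = 0
    for line in numstat.splitlines():
        if not line.strip():
            continue
        add, delete, _path = line.split("\t", 2)
        files += 1
        # Binary files are reported as "-" for both counts; count the file only.
        if add != "-":
            added += int(add)
        if delete != "-":
            deleted += int(delete)
    return ChangeSize(files, added, deleted)


def is_reviewable(size: ChangeSize, limit: int = MAX_REVIEWABLE_LINES) -> bool:
    """True when the total number of changed lines is within the limit."""
    return size.total <= limit


# Made-up sample text in the shape that --numstat prints.
sample = (
    "12\t3\tsrc/features.py\n"
    "30\t0\ttests/test_features.py\n"
    "-\t-\tdocs/plot.png\n"
)
size = parse_numstat(sample)
print(f"{size.files} files, {size.total} changed lines, reviewable: {is_reviewable(size)}")
```

In practice you would feed it the real output of something like git diff --numstat main...HEAD, where main is assumed to be your target branch. Line count is only a crude proxy: a fifty-line change that touches three unrelated things is still hard to review.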

17.4 What automated checks free you to do

Continuous integration (Chapter 13) and linters handle everything mechanical: does it pass the tests, is it formatted, does it type-check. That isn’t a substitute for review — it’s what enables good review, by clearing away the trivia so the human can spend their attention on what only a human can judge. And what only a human can judge is the conceptual correctness that no tool will catch:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise. The target is independent of all 2,000 features, so the only
# honest accuracy is chance: 0.50. Anything above that is leakage, not skill.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2_000))
y = rng.integers(0, 2, 200)

# The leak: choose the 20 "best" features using ALL the data, then cross-validate.
# The selection has already seen every fold's held-out labels.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_selected, y, cv=5).mean()

# The fix: put selection inside the pipeline, so it refits on each fold's
# training split and never touches that fold's held-out data.
honest = cross_val_score(
    make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)),
    X, y, cv=5,
).mean()

print(f"select-then-validate (leaky): {leaky:.2f}")
print(f"select-inside-pipeline:       {honest:.2f}")
print("true accuracy (no signal):    0.50")

select-then-validate (leaky): 0.83
select-inside-pipeline:       0.57
true accuracy (no signal):    0.50

Read those two numbers again, because the gap is the whole argument. There is no signal in that data (none, by construction) and the leaky version reports accuracy in the eighties. It didn’t find a pattern; it selected twenty features that happened to correlate with the labels, including the labels it was later tested on, and then congratulated itself. The honest pipeline lands near chance, which is the correct answer.

Now notice what would have happened to this code on its way to production. Both versions run. Both pass the linter, the type checker, and every test you would plausibly write for them. CI goes green for both, because nothing in CI knows that fit_transform(X, y) before a split means something different from fit_transform after one. The four-line difference between them is invisible to every automated gate in this book, and it inflated the headline number by 26 points over the honest pipeline and 33 over the truth.

That is what human review is for. Not the missing docstring: the change where the code is correct as code and wrong as inference. A reviewer who knows what cross-validation is supposed to protect reads those two versions and sees the problem in seconds. No tool you can buy will.

17.5 Making notebooks reviewable

There’s an obstacle here that a software team never hits, and pretending otherwise would be unhelpful: a notebook is close to unreviewable by default. Open a .ipynb diff on GitHub and you get JSON — cell metadata, execution counts that changed because you re-ran things, and, worst of all, base64-encoded image output. A three-line change to a function can present as four thousand changed lines, most of them a re-rendered plot. No reviewer is reading that, so in practice nobody reviews notebooks, which is a large part of why data science work goes unreviewed at all.

Two tools fix most of it. nbdime understands notebook structure, so nbdiff shows you a change cell by cell — source separated from output, in a form a human can actually read — and nbdiff-web renders it side by side. nbstripout addresses the cause rather than the symptom: installed as a Git filter, it strips outputs and execution counts on the way into a commit, so what’s version-controlled is the code you wrote rather than the pictures it produced.

pip install nbdime nbstripout

nbdime config-git --enable --global   # git diff now understands notebooks
nbstripout --install                  # strip outputs on commit, in this repo
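To see why stripping helps, it is enough to look at what a notebook is on disk. The sketch below is not how nbstripout is implemented; it is a deliberately minimal stand-in, assuming a notebook with a single code cell and using short placeholder strings where a real file would hold a long base64 image. The helper names (make_notebook, strip_outputs, changed_lines) are invented for the illustration.

```python
import copy
import difflib
import json


def make_notebook(source_line: str, execution_count: int, image_payload: str) -> dict:
    """Build a minimal notebook-shaped dict: one code cell with one image output."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {
                        "output_type": "display_data",
                        "data": {"image/png": image_payload},
                        "metadata": {},
                    }
                ],
                "source": [source_line],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }


def strip_outputs(notebook: dict) -> dict:
    """Return a copy with outputs and execution counts cleared from code cells."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


def changed_lines(before: dict, after: dict) -> int:
    """Count added and removed lines in a unified diff of the serialized JSON."""
    a = json.dumps(before, indent=1).splitlines()
    b = json.dumps(after, indent=1).splitlines()
    diff = difflib.unified_diff(a, b, lineterm="")
    # Skip the '---' / '+++' file headers; JSON content never starts that way here.
    return sum(
        1
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )


# One real edit (x = 1 becomes x = 2) plus a re-run that changes the
# execution count and the rendered image. The payloads are placeholders.
before = make_notebook("x = 1", execution_count=1, image_payload="AAAA")
after = make_notebook("x = 2", execution_count=7, image_payload="BBBB")

noisy = changed_lines(before, after)
clean = changed_lines(strip_outputs(before), strip_outputs(after))
print(f"changed lines with outputs committed: {noisy}")
print(f"changed lines after stripping:        {clean}")
```

Re-running the cell and changing one line of source touches three lines of JSON, so the raw diff reports six changed lines, only two of which are the edit you made. After stripping, those two are all that remain. A real plot makes the imbalance far worse, which is the case for installing the filter.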

Stripping outputs is the higher-value move and the one people hesitate over, because the rendered figures are the notebook’s value while you’re working. But committed outputs are stale the moment anyone changes anything upstream, and a stale figure in version control is a wrong figure that looks authoritative — the same argument the reproducible-pipeline chapter makes about pasted numbers. The output belongs to a run, not to the source. If a figure matters enough to keep, it should be regenerated by the pipeline (Chapter 22), not frozen into a commit.

This is also the practical answer to the earlier point about small changes. A notebook is hard to split into self-contained reviewable pieces precisely because it’s one linear document with everything in it. The more logic that lives in src/ and the thinner the notebook gets (Chapter 6 and Chapter 9), the more of your work becomes reviewable in the ordinary way — which is a reason for the thin-notebook discipline that has nothing to do with tidiness.

17.6 Giving and receiving review

Giving good review is a skill. Be specific (point at the line, not the vibe), be kind (it’s the code under discussion, never the person), and distinguish a blocking problem (“this leaks the test set”) from a suggestion (“this might read more clearly as a comprehension”) so the author knows what must change versus what’s optional — and say why, because the reasoning is what teaches. Receiving review is also a skill: read comments as being about the code, respond to each (even if only to disagree, with a reason), and ask when something’s unclear rather than guessing.

A short checklist helps a reviewer be systematic: is it correct, is it tested, is it readable, are the edge cases handled, and, for data science specifically, does it leak, does it hard-code a secret, are the data assumptions stated. The first four are general; the last three are the ones a data science team must add.

Author’s Note

Data science work is overwhelmingly solo and unreviewed. Your notebook is yours; nobody reads it; the first time another human sees the code is often when it has already broken something. Against that backdrop, review feels like exposure — an audit, a vote of no confidence — and the instinct is to be defensive.

The reframe is that being reviewed is a privilege rather than an indictment. Twelve comments means someone has actually read and now understands your work — so it’s no longer trapped in your head, can be maintained by someone else, and has had at least one bug caught before it became your 3am incident. That is the cheapest quality-and-knowledge mechanism a team has, and it’s running for you. The same is true in reverse: reviewing other people’s code is the fastest way to learn a codebase, far quicker than any documentation, because you see how it’s really put together one change at a time. The discomfort of the first few reviews is the price of never again being the only person who understands a critical piece of work.

17.7 Summary

Review is how a team keeps quality high and knowledge shared:

1. Review does two jobs. It catches what the author can’t see, and it spreads understanding so a system doesn’t live in one head. Style is the linter’s job, not the reviewer’s.
2. Small changes get real review. A huge pull request gets rubber-stamped; a small one gets understood. So split work into self-contained changes, against the notebook instinct to share it all at once.
3. Automation frees humans for the conceptual. CI and linters handle the mechanical so reviewers can focus on correctness, design, and the bugs no tool catches, like a data leak that runs perfectly.
4. Notebooks need tooling to be reviewable at all. nbdime diffs them structurally and nbstripout keeps outputs out of commits; without them, a notebook diff is unreadable JSON and the review simply doesn’t happen.
5. Review is a skill, both ways. Give it specifically and kindly, separating blocking issues from suggestions; receive it as being about the code, responding to each comment.

The next chapter is about the other half of sharing understanding — the kind that doesn’t need a reviewer present to read it: documentation.

17.8 Exercises

1. Open a small, focused pull request (aim for under ~100 lines) for a change to one of your own projects, with a description of what it does and why. Then read it back as a reviewer would: is it easy to follow, and what specifically made it so or not?
2. Review someone else’s pull request (a colleague’s, or an open-source one) and leave at least three comments, each clearly marked as a blocking issue or a suggestion, each saying why. Which was harder: finding the issues, or phrasing them constructively?
3. Find (or plant) a conceptual bug that automated checks would miss (a data leak, the wrong metric, an off-by-one in a split) in some code. Explain why a linter and the tests pass it, and what kind of attention from a reviewer would catch it.
4. Conceptual: Some habits of academic peer review would make you a poor code reviewer if you carried them across wholesale: the anonymous reviewer, the accept-or-reject verdict, the expectation that the author mounts a defence. Pick one, describe concretely what it would look like if a colleague brought it to your pull requests, and say what code review does instead and why that difference matters to a team shipping every week.
5. Conceptual: Draft a short code-review checklist for a data science team. What items would you add that a general software checklist wouldn’t have, and what would you deliberately leave off? Why does leaving those off make reviews better?

Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md