Appendix D — Exercise answers

D.1 Chapter 1: From notebook to system

Exercise 1: Restart Kernel and Run All

This exercise is diagnostic — there’s no single “right” answer. Common failures include:

NameError for variables defined in cells that the kernel ran out of order. For example, a variable created in cell 15 but used in cell 8, which only worked because you happened to run cell 15 first during interactive exploration.
FileNotFoundError for data files with hard-coded paths that only exist on your machine or in a specific working directory.
Cells that depend on outputs from cells you’ve since deleted or commented out.

The point isn’t to fix every failure immediately — it’s to see how much of your notebook’s correctness depends on invisible state rather than explicit structure.

Exercise 2: Extract a function

Here’s an example using the chapter’s customer filtering logic:

import pandas as pd

def filter_high_value(customers: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Select customers whose spend exceeds the given threshold."""
    return customers[customers["spend"] > threshold].copy()

# Verify with a small test input
test_data = pd.DataFrame({"spend": [50, 150, 250]})
result = filter_high_value(test_data, threshold=100)
assert len(result) == 2, f"Expected 2 rows, got {len(result)}"
assert list(result["spend"]) == [150, 250], "Should contain only rows above threshold"
print("All assertions passed")

All assertions passed

The key properties gained: the function has a name that describes its purpose, its inputs are explicit (no reliance on global variables), and the assert statements verify the logic independently of the notebook’s broader state.

Exercise 3: Score a notebook

This is a self-assessment: your honest scores matter more than the numbers themselves. In practice, most data science notebooks score highest on modularity (often 2–3, since most have at least some cell-level separation) and lowest on testability (often 1, since few notebooks include any automated verification). Reproducibility varies widely: a notebook that loads from a fixed CSV with pinned dependencies might score 4, while one that relies on a live database connection and an unpinned pip install might score 1.

The value of this exercise is identifying your weakest property and asking whether strengthening it would have saved you time in a recent project. If the answer is yes, that’s where to invest first.

Exercise 4: Holdout set / test suite analogy

Two ways the analogy holds:

Both are verification mechanisms applied after the creative work. You build the model, then validate. You write the code, then test. Neither replaces the work; both catch problems the author missed.
Both require separation — a holdout set must be kept separate from training data, and tests must check behaviour from outside the code, not just re-run it. Contamination in either case undermines the verification.

Two ways the analogy breaks down:

Model validation is probabilistic; software testing is deterministic. A holdout accuracy of 82% might be perfectly acceptable — you’re measuring how well the model generalises. A test that passes 82% of the time is broken. Tests are pass/fail: the code either does what you specified or it doesn’t.
Holdout sets evaluate performance on data drawn from the same distribution. Tests evaluate correctness against cases the developer explicitly constructed, including edge cases and error conditions that may never appear in production data. This means tests can catch failures that no amount of validation data would reveal, but they can also miss failures that real-world data would expose. Each has a blind spot the other doesn’t share.

Exercise 5: Run a colleague’s notebook

This exercise is experiential — the answer is your documentation of the attempt. Common discoveries include:

File paths that assume a specific directory structure or operating system
Environment dependencies not captured anywhere (specific package versions, system libraries, environment variables)
Cells that must be run in a non-obvious order, or cells that must be skipped
Configuration values with no explanation of how they were chosen

Each discovery maps to one of the chapter’s four system properties. Hard-coded paths and undocumented dependencies are reproducibility failures. Monolithic cells that do many things are modularity failures. The absence of any automated checks is a testability failure. And magic numbers without context are readability failures.

D.2 Chapter 2: Version control

Exercise 1: Put a project under version control

Experiential — there’s no single right answer, but a sound initial commit contains only what you authored: code (.py modules, and notebooks with outputs stripped), configuration (requirements.txt, pyproject.toml, and the .gitignore itself), and documentation (a README). What you deliberately exclude, and where each belongs instead:

Data → a data-versioning tool (DVC) or object storage. It’s too large and too volatile for Git’s keep-everything-forever history, and may be sensitive.
Trained models and artefacts (.pkl, .joblib) → a model registry or artefact store. They’re large, binary, and regenerated rather than written by hand.
Secrets (.env, API keys) → a secrets manager, or a local .env that is never committed. A credential committed even once persists in the history after you delete it.
Caches and environments (__pycache__, .ipynb_checkpoints, .venv) → not versioned at all; they’re regenerated locally.

The discipline is the one from the chapter: commit what you author, store what you generate or receive somewhere better suited to it.

Exercise 2: Clean notebook diffs

Two routes achieve this. nbstripout installs a Git filter that strips outputs and execution counts as each notebook is staged, so the committed version holds only code and markdown:

pip install nbstripout
nbstripout --install        # registers the filter for this repository

Or pair the notebook with a script representation using Jupytext, and treat the script as the reviewed artefact:

pip install jupytext
jupytext --set-formats ipynb,py:percent analysis.ipynb

After either, make a one-line change and inspect git diff (or nbdiff if you use nbdime). The diff should now show only your change rather than a wall of metadata. The verification is the exercise: you’ve turned an unreadable JSON diff into a reviewable one.

Exercise 3: Commit messages for past decisions

Self-directed. The instructive part is comparing your messages against what a filename or inline comment could carry. A message such as “Drop signup_channel: 60% missing after the May tracking change, and imputing it was injecting signal” records both the reason and the evidence — information a filename like model_v3 cannot hold and a comment like # dropped signup_channel omits. Because the message is attached to the exact change, attributed, dated, and surfaced by git blame, the “why” survives long after the people who remember the review meeting have moved on. That permanence is what neither the filename nor the comment provides.

Exercise 4: Branch for an experiment

Experiential. A typical flow:

git switch -c experiment/log-spend-features
# ...edit, commit on the branch...
git switch main                              # main is untouched and still runs
git merge experiment/log-spend-features      # if the experiment worked
# or:  git branch -D experiment/log-spend-features   # if it didn't

Keeping main untouched means that at every moment you have a known-good version to fall back to, demo, or hand over — and discarding a failed experiment leaves no residue, no model_v2_BAD.ipynb lingering in the folder. Compared with copying files, the branch makes the experiment both comparable (you can diff it against main) and disposable (deleting the branch erases the dead end cleanly).

Exercise 5: Git versus an experiment tracker

Git versions the code and its history of changes: it answers “what is the code, and how did it come to be this way?”, with branching, merging, and line-by-line history. It records nothing about what a given run produced. An experiment tracker such as MLflow records runs: the metrics, parameters, and artefacts a specific execution generated, so you can compare AUC across fifty hyperparameter settings — something Git has no concept of. Conversely, a tracker offers no line-by-line source history or merge.

The division of labour reflects two different questions. Versioning your code is about the process that produces results; tracking results is about the outputs of running that process on particular data. Reproducibility needs both — the exact code version and the run record — which is why mature projects link each tracked run back to the Git commit that produced it.

D.3 Chapter 3: Environments and dependencies

Exercise 1: Produce a lockfile

Experiential. The flow separates intent from the resolved result:

# requirements.in holds your direct dependencies (the abstract spec)
pip install pip-tools
pip-compile requirements.in        # -> fully pinned requirements.txt (the lock)
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt    # rebuild from the lock

The point of the exercise is what the lock contains that your requirements.in did not: every transitive dependency, pinned to an exact version. Rebuilding in a fresh environment and finding the project still runs confirms you’ve captured the environment, not just your top-level wishes.

Exercise 2: Audit unpinned dependencies

Compare what’s declared (often nothing, or a few >= constraints) against pip freeze. The instructive part is naming a library where a major-version bump could change results and saying how: scikit-learn has changed estimator defaults between releases (a different solver or tie-handling shifts predictions); pandas changed copy-on-write behaviour and default dtypes; NumPy’s random Generator stream can differ across versions. In each case the code is untouched but the numbers move — which is precisely the failure mode pinning prevents.
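The first half of the audit can be sketched in a few lines. This is a minimal illustration rather than a full requirements parser: find_unpinned is a hypothetical helper, it assumes one requirement per line, and it treats anything without an exact == pin as unpinned.

```python
def find_unpinned(requirement_lines: list[str]) -> list[str]:
    """Return requirement lines that lack an exact '==' pin (hypothetical helper)."""
    unpinned = []
    for raw in requirement_lines:
        line = raw.split("#", 1)[0].strip()   # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


# Illustrative declarations, not taken from any real project
declared = ["pandas>=1.5", "scikit-learn", "numpy==1.26.4  # pinned", "", "# comment"]
print(find_unpinned(declared))
```

Each line it reports is a place where the installed version is decided at install time rather than by you.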

Exercise 3: What pinning controls

Pinning your Python-package versions controls library behaviour — a changed default, a re-implemented algorithm, a deprecated parameter. It does not control the Python interpreter version, the operating system, the underlying maths libraries (BLAS/LAPACK), or hardware/GPU floating-point behaviour. To control that second category you reach for a container (Docker), which pins the interpreter, system libraries, and OS alongside the packages; for bit-exact numerics you would additionally fix BLAS threading and any framework determinism flags. Pinning versions is necessary but not sufficient for full reproducibility.
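One low-cost way to make the uncontrolled category visible is to record it alongside each run. The sketch below uses only the standard library; describe_runtime is a hypothetical name, and it captures the interpreter and operating system but not BLAS builds or GPU behaviour.

```python
import platform


def describe_runtime() -> dict[str, str]:
    """Collect facts about the runtime that a package lockfile does not pin."""
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "machine": platform.machine(),
    }


print(describe_runtime())
```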

Exercise 4: Reproduce a colleague’s environment

Experiential. The valuable output is the list of things the lockfile alone didn’t capture, each with its proper home:

The Python version → a .python-version file or pyproject.toml.
A system library a wheel links against (e.g. a C library, a CUDA runtime) → a container image or a documented set of OS packages.
An environment variable the code reads at runtime → a committed .env.example documenting the names (never the values; see Configuration and secrets).

Each gap is a reproducibility failure waiting to happen, and each has a better home than a colleague’s memory.

Exercise 5: When abstract beats locked

Abstract >= requirements are the right choice when your code is a library meant to be installed alongside other packages: over-constraining versions would make it hard for consumers to satisfy everyone’s dependencies at once, so you specify the minimum you need and stay flexible. An exact lock is essential when your code is an endpoint — a deployed service, a scheduled pipeline, a reproducible analysis — where the same versions must reappear every time and nothing downstream depends on your version range. The difference is structural: a library is a dependency of other things (flexibility aids interoperability); an application is the final consumer (exact reproduction is the whole point).

D.4 Chapter 4: The command line

Exercise 1: Capture a workflow

Experiential. A reasonable result is a Makefile with one target per step, run end to end with a single make. The discovery worth noting is the step that “depended on you remembering to do something first” — creating an output directory, setting an environment variable, downloading data before the feature step. Those implicit prerequisites are exactly what a task runner makes explicit: a target that creates the directory, or a dependency declaration (features: data) that enforces the order so no one has to remember it.

Exercise 2: Answer a question with the shell

For example, counting the distinct values in the third column of a CSV (skipping the header):

tail -n +2 data.csv | cut -d, -f3 | sort | uniq | wc -l

or counting rows matching a condition with grep -c. This feels natural for quick, line-oriented filtering and counting on flat text, and it’s faster than starting a Python session. You wish for a DataFrame the moment fields contain commas or quotes (naive cut mis-parses real CSV), when you need typed aggregation, or when a join is involved — that’s the boundary the Data Science Bridge describes.
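The quoting problem is easy to reproduce. The following sketch, on made-up data, compares a naive comma split (which is what cut -d, does) with Python's csv module, which respects quoted fields.

```python
import csv
import io

# Made-up sample: two of the names contain a comma inside quotes
raw = 'id,name,city\n1,"Smith, Anna",Leeds\n2,Jones,York\n3,"Brown, Li",Leeds\n'

# Naive approach: split every data line on commas and take the third piece
naive_third = {line.split(",")[2] for line in raw.splitlines()[1:]}

# CSV-aware approach: the csv module keeps a quoted field together
reader = csv.reader(io.StringIO(raw))
next(reader)  # skip the header row
parsed_third = {row[2] for row in reader}

print(sorted(naive_third))   # fragments of names mixed in with a city
print(sorted(parsed_third))  # the actual distinct cities
```

The naive version reports three distinct values where the real answer is two cities, because the quoted commas shift the columns.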

Exercise 3: Exit codes and chaining

A validation script signals failure by exiting non-zero:

# validate.py
import sys
import pandas as pd

df = pd.read_csv("data/processed.csv")
if df.empty or "target" not in df.columns:
    print("validation failed: empty data or missing target", file=sys.stderr)
    sys.exit(1)          # non-zero: tells the shell something went wrong
print("validation passed")

python validate.py && python train.py   # train runs only if validate exits 0

This matters because automation and CI decide pass/fail from exit codes, not from reading output. A non-zero exit halts the && chain, so bad data never reaches training, and a CI server marks the build red instead of silently continuing. The exit code is the machine-readable verdict the whole pipeline depends on.
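To see the verdict the way an automation tool sees it, you can launch a step from Python and read its return code. This is a minimal sketch: run_step is a hypothetical helper, and the one-line snippets stand in for validate.py.

```python
import subprocess
import sys


def run_step(code: str) -> int:
    """Run a snippet in a fresh Python process and return its exit code."""
    completed = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    )
    return completed.returncode


failing = run_step("import sys; sys.exit(1)")    # stands in for a failed validation
passing = run_step("print('validation passed')")  # a normal exit returns 0
print(failing, passing)
```

Nothing in the caller reads the printed text; the integer alone decides whether the next step runs.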

Exercise 4: Surviving a dropped connection

A command started in a plain SSH session is a child of that session. When the connection drops, the session ends and the job is sent a hang-up signal (SIGHUP), so it dies — unless you took extra steps (nohup, disown). tmux (or screen) changes this by running your shell inside a session that lives on the server, decoupled from your client connection: you detach (or simply lose the connection), the session and everything in it keep running, and you reattach later to find the job still going or finished. For any job measured in hours, that decoupling is the difference between a result and a wasted afternoon.

Exercise 5: When a pipeline counts the wrong thing

The figure can be wrong in several directions at once. grep has no idea which column it is looking at, so a row with a customer reference of INV-2026-118, a product name containing 2026, or a shipping date in 2026 attached to an order placed in December 2025 all count as matches. It has no idea about rows either: a quoted field containing an embedded newline is two lines to wc -l, and the header line counts as an order if it happens to contain the string. And a match is a match once — a line mentioning 2026 twice still contributes one, which is right here but would not be if you were counting occurrences. The pandas version avoids all of this because it asks a question of one typed column: order_date has been parsed into a datetime, .dt.year is a genuine year rather than a substring, and a row is a row regardless of how many newlines the raw text contained.

The culprit is not grep, which does its job perfectly. It is the pipe’s data model: it carries untyped text, split on newlines, with no notion of fields, types, or records. That generality is exactly why every tool on the system can be composed with every other, and exactly why the composition stops being trustworthy once the meaning of your data lives in its structure rather than its characters.
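The difference between matching characters and querying a typed column can be shown on a few invented rows. The sketch below uses the standard library's csv and datetime modules in place of pandas; the column names and values are assumptions made for illustration.

```python
import csv
import io
from datetime import date

# Invented data: row 1 mentions 2026 but was ordered in 2025;
# row 2 has a quoted note containing an embedded newline
raw = (
    "order_id,reference,order_date,note\n"
    "1,INV-2026-118,2025-12-30,ships 2026\n"
    '2,INV-0007,2026-01-04,"gift wrap\nplease"\n'
    "3,INV-0008,2025-11-02,none\n"
)

# What grep and wc -l see: lines of text
text_lines = raw.splitlines()
substring_count = sum("2026" in line for line in text_lines)

# What a typed query sees: records with a parsed date column
rows = list(csv.DictReader(io.StringIO(raw)))
typed_count = sum(date.fromisoformat(r["order_date"]).year == 2026 for r in rows)

print(len(text_lines) - 1, substring_count)  # data lines and substring matches
print(len(rows), typed_count)                # records and orders placed in 2026
```

The text view finds four data lines and two matches; the record view finds three orders, one of them placed in 2026.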

D.5 Chapter 5: Readable code

Exercise 1: Refactor for readability

Experiential. The change that surprises people most is that renaming alone surfaces confusion: if you can’t think of a good name for a variable, that’s often a sign you don’t fully understand what it holds, or that it holds two different things at different points. Replacing magic numbers with named constants does the same — naming 0.73 forces you to articulate what it is. Adding a type signature and a one-line docstring then makes the contract explicit without touching the logic. If the refactor was purely cosmetic and nothing became clearer, the original was already readable; usually something does.
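As an illustration of how naming forces the articulation, here is a before-and-after sketch. The meaning given to 0.73 and 30 is invented for the example; in your own code the point is that you have to decide what they mean.

```python
# Before: the logic works, but what are 0.73 and 30?
def flag_before(p, d):
    return p > 0.73 and d < 30


# After: hypothetical meanings, chosen only for this illustration
CHURN_PROBABILITY_THRESHOLD = 0.73
RECENT_ACTIVITY_WINDOW_DAYS = 30


def is_at_risk_recent_customer(
    churn_probability: float, days_since_last_order: int
) -> bool:
    """Flag recently active customers whose predicted churn risk is high."""
    return (
        churn_probability > CHURN_PROBABILITY_THRESHOLD
        and days_since_last_order < RECENT_ACTIVITY_WINDOW_DAYS
    )
```

The logic is untouched; only the contract has become visible.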

Exercise 2: Formatter and linter

Experiential. The instructive step is sorting the linter’s output into real defects and pure style. Real defects include unused imports, variables assigned but never used, names that shadow a builtin (list, dict), bare except: clauses that swallow errors, and mutable default arguments. Pure style includes line length, quote style, and spacing — exactly the things the formatter fixes automatically. The lesson is the division of labour: automate the style pile entirely so that human attention in review goes to the defect pile and to the logic, which no tool can check.
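The mutable default argument is worth seeing once, because it shows why the defect pile deserves attention. The function names below are hypothetical.

```python
from typing import Optional


def add_tag_buggy(tag: str, tags: list[str] = []) -> list[str]:
    """Defect: the default list is created once and shared across calls."""
    tags.append(tag)
    return tags


def add_tag_fixed(tag: str, tags: Optional[list[str]] = None) -> list[str]:
    """Fix: build a fresh list on each call when none is supplied."""
    if tags is None:
        tags = []
    tags.append(tag)
    return tags


buggy_first = add_tag_buggy("a")
buggy_second = add_tag_buggy("b")   # also contains "a": state leaked between calls
fixed_first = add_tag_fixed("a")
fixed_second = add_tag_fixed("b")   # contains only "b"
print(buggy_second, fixed_second)
```

No formatter would change this code, which is why it belongs with the defects rather than the style issues.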

Exercise 3: Split a function

Experiential. The test of a good decomposition is to read just the new function names in sequence and ask whether they narrate what the original function did. If they do — load_raw, drop_invalid_rows, add_spend_per_day, aggregate_by_cohort — the names are carrying the structure, and a reader can understand the whole from them. If a step needs a comment to explain what it does, the name isn’t doing its job yet.

Exercise 4: Rewrite it, or explain it?

For most inherited code the rewrite wins, and the reason is what the next reader does rather than what they want to know. Someone opening a sampling correction is usually there to change it — a new weighting scheme, a different edge case — and a comment cannot help them do that safely. Prose sits beside the code, drifts out of date the first time someone edits the body without editing the paragraph, and offers no protection against misreading the line they are about to modify. Names, decomposition and a signature live in the thing being changed, so they cannot fall out of step with it.

The opposite choice is right when the difficulty is not in the code but in the reasoning behind it. A function whose body is five clear lines implementing a non-obvious statistical correction — where the mystery is why this adjustment, under which assumptions, from which paper — gains nothing from further decomposition and everything from a paragraph recording the justification. That is the honest boundary of the methods-section analogy: prose is the right tool for the reasoning a reader could never recover from the code, and the wrong tool for the intent they need in order to change it.

Exercise 5: When readability isn’t worth it

Throwaway names are the right call for genuinely scratch code that will be deleted within the session — a quick check of a distribution, a one-off plot to settle a question, a snippet you’re using to understand an API. The specific signal that it has crossed the line is promotion: you copy it into another notebook, you find yourself relying on its output days later, or you hand it to someone else. At that moment the code has become “kept”, will be read many times, and earns the few minutes of naming and documentation. The skill is noticing the promotion and cleaning up then, rather than writing every scratch cell as if it were production.

D.6 Chapter 6: Functions, modules, packages

Exercise 1: Extract a copy-pasted function

Experiential. The payoff is visible the moment you make the follow-up change: with the logic in one imported module, you edit one place and every caller gets the fix; with copies scattered across notebooks, you had to find and edit each one — and missing one is exactly how the copies drift out of sync. “How many places did I have to change?” going from several to one is the single-source-of-truth principle made concrete.

Exercise 2: Global to pure function

Experiential. Once every input arrives as an argument, the function’s result depends only on those arguments, so it returns the same answer no matter what cells ran before it. The verification — calling it after deliberately changing some unrelated global or re-running cells out of order, and getting the same result — is the property that also makes it testable in the next chapter.
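The verification can be sketched as follows, with a global standing in for notebook state. The names and values are illustrative.

```python
threshold = 100  # notebook-style global (illustrative)


def count_high_spenders_global(spend):
    """Reads the global: the answer depends on whatever ran before."""
    return sum(1 for s in spend if s > threshold)


def count_high_spenders(spend: list[float], threshold: float) -> int:
    """Pure version: every input arrives as an argument."""
    return sum(1 for s in spend if s > threshold)


spend = [50, 150, 250]
before = count_high_spenders(spend, threshold=100)

threshold = 200  # some unrelated cell reassigns the global

after = count_high_spenders(spend, threshold=100)   # unchanged
global_after = count_high_spenders_global(spend)    # silently different
print(before, after, global_after)
```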

Exercise 3: Make a project installable

Experiential. The error you were previously working around is ModuleNotFoundError (or the sys.path.append("..") hack used to dodge it), which works only from the directory you happened to launch from. After a minimal pyproject.toml and pip install -e ., the package is importable by name from anywhere in the environment, and because the install is editable, changes to the source take effect immediately without reinstalling. The fresh-notebook test confirms the import no longer depends on where you started.

Exercise 4: Library-author disciplines

A discipline you do not need for code only you use: a stable public API with backwards compatibility, semantic versioning, and deprecation cycles. While you’re the only user you can rename functions, change signatures, and restructure freely. A discipline you should adopt the moment a colleague imports your code: a stable interface (don’t change function names or argument meanings out from under them without warning) and a documented public surface so they can use it without reading the implementation. The trigger for the switch is precisely “someone else now depends on this”.
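Giving a colleague warning before a rename can be as light as keeping the old name as a thin alias for a while. The sketch below uses the standard warnings module; filter_big_spenders is a hypothetical old name invented for the example.

```python
import warnings


def filter_high_value(spend: list[float], threshold: float) -> list[float]:
    """Current public name."""
    return [s for s in spend if s > threshold]


def filter_big_spenders(spend: list[float], threshold: float) -> list[float]:
    """Deprecated alias (hypothetical), kept so existing callers keep working."""
    warnings.warn(
        "filter_big_spenders is deprecated; use filter_high_value",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return filter_high_value(spend, threshold)


# A colleague's old call still works, and they are told what to change
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = filter_big_spenders([50, 150, 250], threshold=100)

warned = any(issubclass(w.category, DeprecationWarning) for w in caught)
print(result, warned)
```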

Exercise 5: What belongs in a package

Code that should stay in the notebook is exploratory, one-off, or presentation-specific: the narrative of a particular analysis, plots tailored to one report, throwaway checks. Code that has earned a place in a module is reusable logic — data cleaning, feature engineering, model training and evaluation — that you’ll run more than once or in more than one place. The signal that logic has crossed the line is reuse: you’ve used it (or want to) a second time, you need to test it, or someone else needs it. “I’m about to copy this” is the clearest possible prompt to extract it instead.

D.7 Chapter 7: Testing stochastic code

Exercise 1: Test a deterministic transform

Experiential. The instructive part is usually the edge cases: writing test_on_empty_input or test_with_all_zeros forces you to decide what the function should do in those situations — return an empty result, raise a clear error, propagate NaNs — when the original code never made that decision explicit. A test you can’t write because you don’t know the expected answer is a sign the contract is underspecified, which is a finding in itself.

Exercise 2: Make a stochastic function testable

Experiential. Having the function accept an explicit rng argument turns hidden global randomness into an injected dependency you control. The exact test then fixes the seed and asserts a specific result; the tolerance test asserts a statistical property — say, that the mean of many draws is within some band of the expected value. The tolerance should be justified: wide enough that it won’t fail by chance (a few standard errors of the quantity you’re checking), tight enough that a real defect would breach it. Stating why you chose the band is part of the answer.
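Both kinds of test can be sketched with the standard library. This example assumes random.Random as a stand-in for whatever generator your code uses (with NumPy it would be a Generator), and mean_of_draws is a hypothetical function; the band of five standard errors is one defensible choice, not a rule.

```python
import math
import random


def mean_of_draws(rng: random.Random, n: int) -> float:
    """Average of n uniform(0, 1) draws from the injected generator."""
    return sum(rng.random() for _ in range(n)) / n


# Exact test: the same seed must give the same result
exact_a = mean_of_draws(random.Random(42), n=1_000)
exact_b = mean_of_draws(random.Random(42), n=1_000)

# Tolerance test: the mean should sit near 0.5.
# The sd of uniform(0, 1) is sqrt(1/12), so the standard error is sqrt(1/12/n).
n = 10_000
standard_error = math.sqrt(1 / 12 / n)
estimate = mean_of_draws(random.Random(7), n)
within_band = abs(estimate - 0.5) < 5 * standard_error

print(exact_a == exact_b, within_band)
```

Five standard errors is wide enough that a chance failure is vanishingly rare, yet a defect that shifted the mean by even 0.05 would breach it many times over.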

Exercise 3: Test an invariant

Experiential. An invariant is a property that must hold for every input — preserved row count, no new missing values, output bounded in a range, mean zero after standardising. Checking it across many random inputs is property-based testing done by hand; hypothesis automates the input generation and, when a property fails, shrinks the counterexample to the smallest input that triggers it, which is often the fastest route to understanding the bug.
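Done by hand, the check looks something like this. standardise is a hypothetical function under test, and the sketch deliberately leaves out the zero-variance input, which is exactly the kind of edge case Exercise 1 asks you to decide on.

```python
import math
import random
import statistics


def standardise(values: list[float]) -> list[float]:
    """Centre on the mean and scale by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


rng = random.Random(0)  # seeded so any failure is reproducible
checked = 0
for _ in range(200):
    size = rng.randint(2, 50)
    values = [rng.uniform(-1_000, 1_000) for _ in range(size)]
    result = standardise(values)
    assert len(result) == len(values)                                  # row count preserved
    assert math.isclose(statistics.fmean(result), 0.0, abs_tol=1e-9)   # mean zero
    checked += 1

print(f"invariants held on {checked} random inputs")
```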

Exercise 4: Why model.score(...) > 0.85 is a poor unit test

It conflates evaluation with testing. The assertion is really trying to answer “is the model good enough?”, which is an evaluation question — answered on a continuum, against a baseline, and monitored over time — not a pass/fail property of the code. As a unit test it fails on three counts: it’s fragile (it breaks the first time the data shifts, with no code defect), uninformative (a failure doesn’t localise any bug), and slow. What you should test about the pipeline is the deterministic machinery around the model: that data validation rejects malformed input, that transforms produce the expected columns and leak nothing, that the pipeline runs end to end on a tiny sample, and that a saved model round-trips to identical predictions.

Exercise 5: Why a flaky test is worse than none

A test that fails one run in ten and is habitually re-run until green is dishonestly noisy: it trains the team to treat failures as background noise to be cleared by re-running, which is exactly the habit that lets a real failure slip through unnoticed. It also blocks or destabilises CI and erodes trust in the whole suite. No test is at least honestly silent; a flaky test actively degrades everyone’s response to failure. Two fixes that keep the test: make it deterministic by fixing the seed so the stochastic element is pinned; or, if it’s genuinely checking a statistical property, replace the brittle assertion with a principled tolerance (several standard errors wide) or an invariant that must always hold. Re-running until it passes, or deleting it, are the two non-answers.

D.8 Chapter 8: Debugging and profiling

Exercise 1: Read a traceback

Experiential. Read bottom-up: the final line is the exception type and message (what went wrong), and the deepest frame in your own code is where to look — third-party frames below it are usually just the machinery that surfaced your mistake. The fact you’d check first follows directly from those two (a ZeroDivisionError in a line dividing by active_days says: look for a zero). People are routinely surprised how often the traceback alone, read properly, fully explains the bug — the panic that makes us skim it is the real obstacle.
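The two facts worth extracting can also be pulled out programmatically, which makes the reading order concrete. This sketch uses the standard traceback module; spend_per_day is a hypothetical function built around the active_days example.

```python
import traceback


def spend_per_day(total_spend: float, active_days: int) -> float:
    return total_spend / active_days


try:
    spend_per_day(120.0, 0)
except ZeroDivisionError as exc:
    # The final line of a traceback: exception type and message
    last_line = traceback.format_exception_only(type(exc), exc)[-1].strip()
    # The deepest frame: where the exception was actually raised
    deepest_frame = traceback.extract_tb(exc.__traceback__)[-1]

print(last_line)
print(deepest_frame.name, deepest_frame.lineno)
```

The last line says what went wrong, and the frame name says where to look for the zero.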

Exercise 2: Use a debugger instead of print

Experiential. The thing a debugger shows that a print does not is the entire live state at the moment of failure — every variable, not just the one you anticipated printing. That’s how you spot the cause you weren’t looking for: the column that’s unexpectedly all zeros, the frame with the wrong shape, the value that’s a string where you assumed a float. Print debugging can only show what you already suspected; the debugger shows what you didn’t.

Exercise 3: Replace print with logging

Experiential. A reasonable mapping is INFO for milestones (“loaded N rows”, “training complete”), WARNING for recoverable oddities (“clipped 12 negative values”), and DEBUG for fine detail. The payoff is the final step: flipping a single level setting switches between a quiet production run and a verbose diagnostic one without editing — or later removing — any of the statements, which is precisely what makes logging persist where scattered prints get deleted.
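The level switch can be demonstrated without touching any of the log statements. In this sketch the logger name, the messages, and the capture_run helper are all illustrative; output is captured into a string so the two runs can be compared.

```python
import io
import logging


def run_pipeline(logger: logging.Logger) -> None:
    """The statements stay the same whichever level is configured."""
    logger.debug("first spend values: [50, 150, 250]")
    logger.info("loaded 3 rows")
    logger.warning("clipped 12 negative values")
    logger.info("training complete")


def capture_run(level: int) -> list[str]:
    """Run the pipeline at the given level and return the emitted lines."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger = logging.getLogger("pipeline_demo")  # hypothetical logger name
    logger.propagate = False
    logger.setLevel(level)
    logger.addHandler(handler)
    try:
        run_pipeline(logger)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue().splitlines()


quiet = capture_run(logging.WARNING)   # production-style run
verbose = capture_run(logging.DEBUG)   # diagnostic run
print(quiet)
print(len(verbose))
```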

Exercise 4: Profile and fix

Experiential. The lesson lands hardest when the hot spot is not where you expected — the slow step is often an innocent-looking apply or a repeated recomputation, not the obviously heavy model fit. Fix the dominant one (vectorise the loop, cache the repeated work) and measure the change rather than assuming it helped. The discipline is to optimise what the profiler points at, and only once something is actually too slow.

Exercise 5: Four questions, four tools

A variable’s current value → print (or a quick inspect). Adequate, because the question is narrow and you already know what you want to see.
The full state at a failure → a debugger (pdb or an IDE). print can’t show everything at once, and it forces you to guess in advance which variables will matter.
What happened in a run you weren’t watching → logging. print has no levels, timestamps, or persistence, and you weren’t there to read it scroll past.
Where the time went → a profiler. print can’t attribute time, and hand-timed guesses are biased toward the parts you already suspect.

print is the right tool only for the first; for the other three it’s a poor stand-in because it answers “what is this value now?” and nothing else — it cannot capture full state, persist a structured record, or measure performance.

D.9 Chapter 9: Project structure

Exercise 1: Reorganise a flat project

Experiential. The instructive discovery is usually something being mutated in place — a raw CSV edited to fix a typo, a column renamed in the source file, a row dropped by hand. Once raw data is read-only, that edit has to become a transformation step whose output lands in data/interim/ or data/processed/, leaving the original untouched. Finding the in-place edit is finding the point where your work stopped being reproducible from source.

Exercise 2: Write a README

Experiential. Every question the colleague still has to ask is a gap in either the README or the structure, and the common ones are revealing: an undocumented environment variable, a data source that needs credentials or special access, a setup step that “everyone knows”, or the order in which things must run. The exercise works precisely because you can’t see your own assumptions — the colleague’s questions surface them.

Exercise 3: A single source of truth for paths

Experiential. The original would break on a colleague’s machine because an absolute path like /Users/you/project/data.csv simply doesn’t exist there. Deriving every path from one project root (resolved from __file__ or a config value) makes the project portable: it runs unchanged on a laptop, a server, or inside a container, because only the root differs. The hard-coded path is one of the most common reasons “it works on my machine” and nowhere else.
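A minimal sketch of the idea, assuming a layout with data/raw, data/processed, and models directories: build_paths is a hypothetical helper, and in a real module the root would typically come from `Path(__file__).resolve()` or a config value rather than being passed in by hand.

```python
from pathlib import Path


def build_paths(project_root: Path) -> dict[str, Path]:
    """Derive every project path from a single root."""
    return {
        "raw": project_root / "data" / "raw",
        "processed": project_root / "data" / "processed",
        "models": project_root / "models",
    }


# Two machines, two roots, the same relative layout
laptop = build_paths(Path("/Users/you/project"))
server = build_paths(Path("/srv/project"))
print(laptop["raw"])
print(server["raw"])
```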

Exercise 4: Write the rule for your own layout

A workable rule usually turns on kind and lifecycle rather than topic: “every file belongs to exactly one of authored-and-version-controlled, received-and-immutable, or generated-and-regenerable, and lives in the directory for that category.” That is enough for a colleague to place a new file unaided, and it is the sentence a layout is really encoding — src/ and configs/ are authored, data/raw/ is received, data/processed/ and models/ are generated.

The interesting part is the file that resists it. A notebook that both explores and produces the figures for a report is authored code and a generated artefact at once; a lookup table that someone hand-edits is authored data that arrived as raw. Sometimes the file is genuinely misplaced and should be split — the notebook writes its figures to a generated directory rather than beside itself. Sometimes the rule is too crude and needs an extra category, which is a real answer rather than a failure. Working out which is which is the maintenance work a schema would have done for you automatically, and it is the price of a convention that nothing enforces.

Exercise 5: When structure is overkill

A genuinely one-off analysis — a quick answer for a meeting, a teaching example, a throwaway exploration — should stay a single notebook in a single folder; imposing src/, tests/, and a structured data/ on it is pure overhead. The signal that it has earned the scaffolding is longevity and dependence: it will run again (on new data, or on a schedule), someone else needs to run or maintain it, it needs tests, or helper files are starting to accumulate at the top level. “This is going to outlive the week” is the trigger.

D.10 Chapter 10: Data pipelines

Exercise 1: Break a monolith into stages

Experiential. The payoff usually shows up as a stage that proves reusable in a context you hadn’t anticipated — the cleaning function reused by a different analysis, or a feature transform reused at serving time. That unplanned reuse is exactly what the monolithic cell made impossible, because the useful part was welded to everything around it.

Exercise 2: Add a validation gate

Experiential. The bad data a gate would have caught at the boundary includes an upstream schema change (a renamed or dropped column), an unexpected null where the next stage assumes completeness, an out-of-range value (negative spend, a date in the future), a duplicated key, or a target column leaking into the features. The point is where the failure surfaces: a gate turns a cryptic error three stages downstream into a precise message at the moment the bad data entered.
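
A minimal sketch of such a gate, written over plain row dictionaries so that it runs without extra dependencies. The function name validate_orders, the column names, and the particular rules are illustrative assumptions rather than the chapter's code; the part that matters is that the error names the row and the rule at the boundary.

```python
from datetime import date

# Hypothetical schema for illustration: the columns the next stage relies on.
REQUIRED_COLUMNS = {"customer_id", "spend", "order_date"}


def validate_orders(rows: list[dict], today: date) -> list[dict]:
    """Raise ValueError naming the first problem found; otherwise return rows unchanged."""
    seen_ids = set()
    for i, row in enumerate(rows):
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            raise ValueError(f"row {i}: missing columns {sorted(missing)}")
        if row["spend"] is None:
            raise ValueError(f"row {i}: spend is null")
        if row["spend"] < 0:
            raise ValueError(f"row {i}: negative spend {row['spend']}")
        if row["order_date"] > today:
            raise ValueError(f"row {i}: order_date {row['order_date']} is in the future")
        if row["customer_id"] in seen_ids:
            raise ValueError(f"row {i}: duplicate customer_id {row['customer_id']}")
        seen_ids.add(row["customer_id"])
    return rows


good = [{"customer_id": 1, "spend": 50.0, "order_date": date(2024, 1, 5)}]
bad = good + [{"customer_id": 2, "spend": -3.0, "order_date": date(2024, 1, 6)}]

validate_orders(good, today=date(2024, 2, 1))  # passes silently
try:
    validate_orders(bad, today=date(2024, 2, 1))
except ValueError as err:
    gate_message = str(err)
    print(gate_message)
```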

Exercise 3: An idempotent, cached stage

Experiential. Persist the stage’s output to data/interim/ and skip the computation when the artefact already exists; the second run should report the stage skipped. The lesson to carry forward is that a real orchestrator does this for you and invalidates the cache when a stage’s inputs or code change — caching is only safe when staleness is handled, which is why “re-run if the inputs changed” is the rule, not “re-run never”.
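
One way to sketch this, assuming a hypothetical helper run_cached and a JSON artefact per stage. Here the cache key is a hash of the stage's inputs only; a real orchestrator would also account for changes to the stage's code, which this sketch does not attempt.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def run_cached(stage_name: str, inputs: dict, compute, cache_dir: Path) -> tuple[dict, bool]:
    """Return (result, skipped). Recompute only when the inputs' fingerprint changes."""
    # Fingerprint the inputs so that changed inputs map to a different artefact.
    fingerprint = hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest()
    artefact = cache_dir / f"{stage_name}-{fingerprint[:12]}.json"
    if artefact.exists():
        return json.loads(artefact.read_text()), True
    result = compute(inputs)
    cache_dir.mkdir(parents=True, exist_ok=True)
    artefact.write_text(json.dumps(result))
    return result, False


def total_spend(inputs: dict) -> dict:
    """Illustrative stage: any pure function of its inputs would do."""
    return {"total": sum(inputs["spend"])}


with tempfile.TemporaryDirectory() as tmp:
    interim = Path(tmp) / "data" / "interim"
    first = run_cached("total_spend", {"spend": [50, 150]}, total_spend, interim)
    second = run_cached("total_spend", {"spend": [50, 150]}, total_spend, interim)
    changed = run_cached("total_spend", {"spend": [50, 150, 250]}, total_spend, interim)

print(first, second, changed)
```

The second call reports the stage skipped; the third recomputes because the inputs, and therefore the fingerprint, changed.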

Exercise 4: The guarantee the cache quietly removed

An sklearn Pipeline that is cross-validated as a whole prevents leakage by construction: .fit() is called separately on each training fold, so the scaler only ever sees training rows and validation statistics cannot influence it. A scale_features stage that fits across the whole dataset before the split has already broken that, and the caching makes it worse in two ways. The fitted statistics now sit in a persisted artefact that every downstream run reads, so the leak is baked into a file rather than recreated (and possibly noticed) each time; and because the stage is skipped whenever the artefact exists, a later split, a new fold, or a fresh evaluation silently reuses statistics computed over rows it was never supposed to have seen. The evaluation looks fine, which is the whole problem: leakage flatters your metrics rather than raising an error.

The fix is not to abandon caching but to move the split upstream of the fitting. Split first, and let the pipeline persist the split as its cached artefact; then fit the scaler inside the training path only, persist the fitted scaler as a versioned artefact of that run, and have the validation and serving paths apply it rather than refit. Caching a transformation is safe; caching a quantity that was learned from data is only safe if the artefact records which rows it learned from — which is precisely the bookkeeping Pipeline did for you inside a single .fit() call, and which nobody does for you once the stages span processes and files.
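
The bookkeeping can be sketched in a few lines. This is a standard-library stand-in for a scaler, not sklearn's implementation, and the names fit_scaler, apply_scaler, and the fitted_on field are assumptions made for illustration.

```python
import statistics

# Illustrative data: (row_id, value) pairs. The split is decided first and would be
# persisted as its own cached artefact.
rows = [(0, 10.0), (1, 20.0), (2, 30.0), (3, 1000.0)]
train_ids = [0, 1, 2]
valid_ids = [3]


def fit_scaler(rows, fit_ids):
    """Learn mean and standard deviation from the named rows only, and record which."""
    wanted = set(fit_ids)
    values = [v for row_id, v in rows if row_id in wanted]
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "fitted_on": sorted(fit_ids),  # the record that makes caching this artefact safe
    }


def apply_scaler(scaler, rows, apply_ids):
    """Apply an already-fitted scaler; never refits."""
    wanted = set(apply_ids)
    return {row_id: (v - scaler["mean"]) / scaler["std"] for row_id, v in rows if row_id in wanted}


scaler = fit_scaler(rows, train_ids)

# Before evaluating, check that the cached scaler did not learn from the rows being scored.
leaked = set(scaler["fitted_on"]) & set(valid_ids)
if leaked:
    raise RuntimeError(f"scaler was fitted on validation rows: {sorted(leaked)}")

scaled_valid = apply_scaler(scaler, rows, valid_ids)
print(scaler, scaled_valid)
```

Because the artefact carries fitted_on, a later run with a different split can detect the overlap instead of silently reusing the statistics.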

Exercise 5: When a pipeline framework is overkill

A single script or notebook is the right tool for a one-off analysis, or a workflow of one or two steps that you run interactively and watch. The signal that it has outgrown this is when re-running everything becomes too costly or too risky: the workflow runs repeatedly or on a schedule, some stages are expensive enough that you want to re-run only what changed, several people or systems depend on intermediate outputs, or failures need to be isolated and retried rather than restarting from scratch. At that point the explicit stages and orchestration earn their keep.

D.11 Chapter 11: Configuration and secrets

Exercise 1: Lift hard-coded values into config

Experiential. The values that turn out to differ between your machine and where the code really runs are the telling ones: absolute file paths almost always, plus database and table names, output locations, resource settings (number of workers), and debug flags. These are exactly the things configuration is for — the same logic, different values per environment — and finding them is finding everything implicitly tied to your laptop.

Exercise 2: Move a secret out of the repository

Experiential. Add .env to .gitignore, commit a .env.example template with placeholder values, and load the real secret from the environment. The reason moving it is not sufficient on its own, if it was ever committed, is that version control keeps history: the secret remains in past commits even after you delete it from the current files, so anyone with a clone still has it. It must be rotated — the credential changed at its source — not merely removed.
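
A small sketch of the loading side, assuming a hypothetical helper require_secret and a hypothetical variable name DB_PASSWORD. It covers only reading the value; it does nothing about rotation, which has to happen at the credential's source.

```python
import os


def require_secret(name: str) -> str:
    """Read a secret from the environment, failing loudly if it is absent or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"{name} is not set; copy .env.example to .env and fill it in")
    return value


# Demonstration only: a real value would come from the shell or a local .env loader,
# never from source code.
os.environ["DB_PASSWORD"] = "placeholder-for-demo"
password = require_secret("DB_PASSWORD")

os.environ.pop("MISSING_SECRET", None)
try:
    require_secret("MISSING_SECRET")
except RuntimeError as err:
    missing_message = str(err)
    print(missing_message)
```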

Exercise 3: Typed, validated config

Experiential. The contrast is the point: a bare dictionary read with .get() lets a mistyped key (raw_paht) return a silent None that surfaces as a confusing failure much later, whereas a pydantic model with a constraint rejects a bad value the instant it loads, naming the offending field. Feeding it an out-of-range value and getting an immediate, specific error is the behaviour you’re buying.
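
The same contrast can be seen without installing anything. The sketch below uses a standard-library dataclass as a stand-in for a pydantic model, so the error messages differ from pydantic's; PipelineConfig and its fields are hypothetical names.

```python
from dataclasses import dataclass

raw = {"raw_path": "data/raw/customers.csv", "test_fraction": 0.2}

# Bare dictionary: .get() on a mistyped key returns None with no complaint.
silent = raw.get("raw_paht")


@dataclass(frozen=True)
class PipelineConfig:
    raw_path: str
    test_fraction: float

    def __post_init__(self):
        # Validate at load time so a bad value never reaches the pipeline.
        if not isinstance(self.raw_path, str) or not self.raw_path:
            raise ValueError("raw_path must be a non-empty string")
        if not 0.0 < self.test_fraction < 1.0:
            raise ValueError(f"test_fraction must be between 0 and 1, got {self.test_fraction}")


config = PipelineConfig(**raw)

try:
    PipelineConfig(raw_path="data/raw/customers.csv", test_fraction=1.5)
except ValueError as err:
    range_error = str(err)

try:
    # A mistyped key is rejected at construction instead of becoming a silent None.
    PipelineConfig(**{"raw_paht": "data/raw/customers.csv", "test_fraction": 0.2})
except TypeError as err:
    typo_error = str(err)

print(silent, range_error, typo_error, sep="\n")
```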

Exercise 4: One config file per run, copied by hand

The copy and the original drift. Two files that are meant to agree on everything except a connection string won’t, within a few weeks: someone tunes a threshold in one and forgets the other, and production ends up running settings nobody deliberately chose. The failure is silent, and it destroys the thing config was supposed to give you — a trustworthy record of what actually ran. The split to make is between the schema, which should be identical everywhere and validated identically everywhere, and the values, only a few of which legitimately vary: connection strings, bucket names, output locations, worker counts, debug flags. Model settings that differ between development and production are almost always an accident rather than a decision, and an environment setting that’s identical everywhere is usually a hard-coded value waiting to bite. Keep one schema, layer environment-specific values over a shared base, and let validation catch the drift.
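
A sketch of that layering, with every name and value invented for illustration. The design choice it shows is an explicit list of keys that are allowed to vary, so a model setting overridden in one environment is an error and not a quiet divergence.

```python
# Hypothetical settings for illustration.
BASE = {"churn_threshold": 0.35, "n_workers": 2, "db_url": None, "debug": False}

# Only the keys that legitimately vary per environment may be overridden.
ALLOWED_OVERRIDES = {"n_workers", "db_url", "debug"}

ENVIRONMENTS = {
    "dev": {"db_url": "sqlite:///dev.db", "debug": True},
    "prod": {"db_url": "postgresql://db.internal/churn", "n_workers": 8},
}


def load_config(env: str) -> dict:
    """Layer an environment's values over the shared base, rejecting unexpected keys."""
    overrides = ENVIRONMENTS[env]
    unexpected = set(overrides) - ALLOWED_OVERRIDES
    if unexpected:
        raise ValueError(f"{env} overrides keys that should not vary: {sorted(unexpected)}")
    config = {**BASE, **overrides}
    if config["db_url"] is None:
        raise ValueError(f"{env} must set db_url")
    return config


dev, prod = load_config("dev"), load_config("prod")
print(dev["churn_threshold"] == prod["churn_threshold"])  # the model setting is shared

# An environment that tries to tune a model setting is caught at load time.
ENVIRONMENTS["staging"] = {"db_url": "sqlite:///staging.db", "churn_threshold": 0.5}
try:
    load_config("staging")
except ValueError as err:
    drift_error = str(err)
    print(drift_error)
```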

Exercise 5: When hard-coding is acceptable

A genuine constant — a value that is part of the logic and does not vary by run or environment, such as a mathematical constant or a truly fixed business rule — is perfectly fine as a named constant in the code. A value becomes configuration when it varies between environments, changes between runs, or is something you tune. The signal that a hard-coded value has become a liability is any of: you find yourself editing code to change it, it differs between dev and prod, or you can’t tell from the code why it has the value it does. Secrets are the absolute case — always a liability when hard-coded, regardless of anything else.

D.12 Chapter 12: API design

Exercise 1: Wrap a model in an endpoint

Experiential/applied. What the endpoint forces you to decide, and a notebook predict let you ignore, is the contract: the exact request format (field names, types, units), what the response contains (a label, a probability, both), and what counts as a valid input. In the notebook you built the feature array yourself and trusted it; the endpoint has to state, explicitly and in advance, what a caller must send and will receive.

Exercise 2: Validation returns a clear 422

Experiential. Constraining the request fields means a malformed request is rejected at the door, before the model runs, with a message that says what was wrong — rather than reaching the model and producing a confidently wrong prediction from garbage, or crashing deep inside and returning an opaque 500. The lesson is that validation converts an unpredictable internal failure into a precise, early, client-facing error.

Exercise 3: Response schema and live docs

Experiential. The documentation is generated from the schema and code (as OpenAPI), so it cannot drift out of sync — change a field and the docs change with it. This matters because callers integrate against the documentation: hand-written API docs inevitably fall behind the implementation, and documentation that lies is worse than none, because it sends integrators down paths that no longer work.

Exercise 4: Adding a feature to a live endpoint

Publish /v2/predict. Adding a required field to /v1 breaks every existing caller the moment you deploy — their requests start failing validation with a 422, and the first they hear of it is their own error rate. You never agreed a migration window with them, so there’s no point at which that break is acceptable. The notebook instinct is safe precisely because you’re the only caller and you change both sides at once; an endpoint separates those sides across teams and across time. A new version lets both schemas run side by side while callers migrate on their own schedule, and gives you a signal — /v1 traffic falling to zero — that tells you when the old one can be retired. The tempting middle road, making the field optional with a default, is worth naming: it keeps callers working, but silently scores them on a value you invented, which is usually worse than a clear failure.

Exercise 5: Batch versus real-time

A scheduled batch job is the right mechanism when predictions are consumed in bulk on a known cadence and latency doesn’t matter — nightly churn scores feeding a dashboard, weekly demand forecasts written to a table. A real-time API is necessary when predictions are needed on demand, one at a time, with low latency, in response to a user action or another system’s request — a fraud check at checkout, a recommendation as a page loads. The deciding property is how and when the prediction is consumed: in bulk on a schedule points to batch; on demand with a latency requirement points to an API.

D.13 Chapter 13: Continuous integration

Exercise 1: Add a CI workflow

Experiential. The payoff is the moment the status goes red on a deliberately broken test before you’d have noticed any other way — that’s the days-to-minutes gap from the chapter’s opening, closed. A workflow that installs the locked dependencies (rather than whatever the runner happens to have) is also what makes the run a faithful check rather than a coincidence of the runner’s environment.

Exercise 2: Lint and format gate

Experiential. The first run typically flags the same mix Chapter 5 described — unused imports, unformatted files, the occasional shadowed name or undefined reference — and sorting them into “real defect” and “pure style” is the exercise. The style pile is exactly what the formatter fixes automatically, so once ruff format is in the gate it stops recurring, and the gate’s signal becomes mostly about real problems.

Exercise 3: Set up pre-commit

Experiential. The hook stops the commit locally, before anything reaches CI, which is the point: the cheapest place to catch a formatting slip or a stray large file is before it’s even recorded. Pre-commit and CI are complementary — the local hook gives instant feedback on the trivial things, and CI remains the authoritative shared gate that everyone’s changes must pass.

Exercise 4: The change CI cannot see

The data changes. An upstream system starts sending a column in a different unit, a category gets renamed, the customer population shifts after a marketing push. Nothing in the repository changed, so every check stays green — CI watches commits, not the world. That’s the real limit of the analogy: you re-run a holdout because anything might have invalidated the verdict, whereas CI fires only on a change to the code. Catching the rest takes the mechanisms from elsewhere in the book — validation gates on the data as it enters the pipeline (Chapter 10), a scheduled retrain that records metrics over time rather than gating a merge, and monitoring of the live service (Chapter 16). Green CI means the code still does what you asked it to. It never meant the model is still right.

Exercise 5: What belongs in CI

Run on every change the checks that are fast and deterministic: unit tests, linting, type checks, and a small end-to-end smoke test on sample data. Push to an occasional or nightly job the things that are slow, expensive, or non-deterministic: full model training, integration tests against real external services, validation over large datasets, performance benchmarks. The principle is that every-push checks must be quick and reliable enough that developers never feel the urge to route around them — a gate that is slow or flaky gets disabled or ignored, at which point it protects nothing. Speed and trustworthiness are what make the gate worth having.

D.14 Chapter 14: Containerisation

Exercise 1: Write a Dockerfile

Experiential. A sound first Dockerfile starts from a slim base, installs the locked requirements, copies the code, and sets the run command; building and running it confirms the service starts in a clean, sealed environment rather than relying on anything on your machine. If it runs in the container but not on a colleague’s bare machine, the container has done its job — it carried the environment with it.

Exercise 2: Improve the image

Experiential. Ordering the instructions so dependencies are installed before the code is copied means a code edit reuses the cached dependency layer instead of reinstalling everything — rebuilds drop from minutes to seconds. A slimmer base, a multi-stage build, or removing build tools afterwards shrinks the image substantially, often from north of a gigabyte to a few hundred megabytes. The exercise is to measure the before and after, because the gains are larger than most people expect.

Exercise 3: Keep data and secrets out

Experiential. Baking data into the image bloats every copy of it, ties the image to a single snapshot of the data, and can push sensitive records into registries. Baking a secret in is the Chapter 11 mistake reincarnated: the credential ends up inside an artefact that gets pushed to registries and shared, so it leaks and must be rotated, not merely removed. The fix is to keep the image a generic definition of how to run — mount the data as a volume and inject the secret as an environment variable at run time.

Exercise 4: Two builds of the same Dockerfile

They differ everywhere the recipe is vague. FROM python:3.12-slim is a moving tag that will have been rebuilt on a newer base with newer system libraries; pip install -r requirements.txt resolves against PyPI as it stands on build day, not six months ago; any apt-get install pulls whatever the distribution currently ships. So the honest statement is that a Dockerfile is a recipe, not a pin — the resulting image is frozen, but the instructions that produce it are not, and the analogy quietly assumes those are the same thing. Making it true means pinning at every layer: reference the base image by digest rather than tag, install from a fully locked requirements file, pin system package versions, and then treat the built image, stored in a registry, as the artefact you promote from CI to staging to production — rather than rebuilding at each stage and trusting that you’ll get the same thing twice.

Exercise 5: When to containerise

A container is clearly worth it whenever the code must run reliably on a machine that isn’t yours: a deployed service, a job on a shared cluster, a pipeline that has to reproduce exactly across environments, or onboarding a team to one consistent setup. It’s overkill for a one-off local analysis or an exploration only you will ever run on your own machine, where a virtual environment and a lockfile already give you everything you need. The deciding property is exactly that: does it need to run identically somewhere other than your machine? If yes, containerise; if it lives and dies on your laptop, the lockfile is enough.

D.15 Chapter 15: Deployment

Exercise 1: Deploy the service

Experiential. The revealing part is what the platform makes you supply explicitly that your laptop quietly provided: the environment variables and secrets (Chapter 11), the port to expose, persistent storage for any data, the exact run command, and resource limits. On your machine all of that was implicit context; deployment turns it into configuration you have to state, which is itself why the container and config work of the previous chapters pays off here.

Exercise 2: Staging with pass criteria

Experiential. The discipline is to run the same artefact as production with different configuration, and to decide what “passing staging” means before you look at the result — a latency ceiling, an error-rate ceiling, a smoke test that must pass. Deciding the bar in advance is what stops the very human temptation to wave a release through because it “seems fine”, which is the operational version of moving the goalposts.

Exercise 3: Schedule a batch job

Experiential. The key behaviour is that a failed run is detectable: the job exits non-zero (Chapter 4), so the scheduler can alert, rather than failing silently and leaving you to discover days later that the table was never updated. A batch job that fails quietly is worse than one that fails loudly, because the missing output often looks just like stale-but-present output.
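
The behaviour can be sketched as a main function that returns an exit code. The job body score_customers is a placeholder, and the sketch assumes the scheduler treats any non-zero exit as a failure, which is the usual convention.

```python
import sys


def score_customers(rows: list[dict]) -> int:
    """Stand-in for the real job: returns the number of rows written."""
    if not rows:
        raise RuntimeError("no input rows; upstream extract may have failed")
    return len(rows)


def main(rows: list[dict]) -> int:
    """Return a process exit code: 0 on success, 1 on any failure."""
    try:
        written = score_customers(rows)
    except Exception as err:
        # Report the failure on stderr and signal it through the exit code.
        print(f"batch job failed: {err}", file=sys.stderr)
        return 1
    print(f"wrote {written} rows")
    return 0


ok_code = main([{"customer_id": 1}])
fail_code = main([])
print(ok_code, fail_code)

# In a real entry point the script would end by passing main's result to sys.exit.
```

Treating an empty input as an error is the design choice worth noting: it is what distinguishes "nothing to do" from "the upstream extract silently produced nothing".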

Exercise 4: Does overfitting apply to staging?

The worry carries across, but it attaches to different things. You are not overfitting staging by running against it repeatedly: that’s the point of it, and unlike a holdout the environment isn’t consumed by being looked at. What you can overfit to is the fixture: a staging dataset that never changes, so the smoke tests come to encode the quirks of that one sample; a synthetic load profile that flatters your caching; or a set of pass criteria quietly relaxed each time a release failed them. In every case staging keeps going green while telling you less and less about production. The defences are to refresh staging data periodically (or sample it from production, suitably anonymised), to set the thresholds from observed production behaviour rather than from what staging happened to achieve last time, and to treat a criterion you have loosened twice as evidence that the criterion, not the release, needs examining.

Exercise 5: Batch versus always-on, and rollback

A nightly churn-scoring job feeding a report is naturally batch; a real-time fraud-scoring API is naturally always-on. Their rollbacks differ accordingly. For the batch job, a rollback is usually re-running the previous version (or simply retaining yesterday’s output) — the failure is recoverable because nothing depended on the run in real time. For the online service, a rollback means switching live traffic back to the previous container version immediately — a blue-green flip or redeploying the prior tag — because every second on the bad version affects real requests. Batch buys you time; online demands a fast switch.

D.16 Chapter 16: Monitoring and observability

Exercise 1: Logging and health

Experiential. To investigate “a strange answer last Tuesday” you need, at minimum, the timestamped request inputs, the prediction returned, and the model version that served it — ideally tied together by a request ID so you can correlate across log lines. The common failure is logging only that “a prediction was made”: that confirms the service ran but lets you reconstruct nothing. The test of your logging is whether you could replay last Tuesday’s prediction from it alone.
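
A sketch of a log line that passes the replay test, using only the standard library. The model, the version label, and the field names are placeholders; an in-memory stream stands in for wherever the logs would really be shipped.

```python
import io
import json
import logging
import uuid
from datetime import datetime, timezone

MODEL_VERSION = "2024-05-01"  # hypothetical version label

stream = io.StringIO()  # stands in for a log file or log shipper
logger = logging.getLogger("predictions")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False


def predict(features: dict) -> float:
    """Placeholder model: the real one would be loaded from an artefact."""
    return 0.9 if features["amount"] > 1000 else 0.1


def predict_and_log(features: dict) -> float:
    """Serve a prediction and write one JSON log line describing it."""
    prediction = predict(features)
    logger.info(json.dumps({
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": MODEL_VERSION,
        "inputs": features,
        "prediction": prediction,
    }))
    return prediction


predict_and_log({"amount": 2500, "country": "GB"})

# Replay test: rebuild the prediction from the log line alone.
record = json.loads(stream.getvalue().splitlines()[0])
replayed = predict(record["inputs"])
print(replayed == record["prediction"])
```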

Exercise 2: A drift check

Experiential. Store a reference sample from training, compare each live batch to it with a KS test or population stability index, and alert when the statistic crosses a threshold. As for which feature drifts first, the usual culprits are externally driven or time-sensitive ones — a monetary feature exposed to inflation, a feature fed by an upstream source that changes format or coverage, or anything seasonal — because the world moves those independently of your model.
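
A minimal population stability index, sketched with the standard library. It sums (actual − expected) × ln(actual / expected) over bins, where the proportions come from the live batch and the reference sample respectively. The equal-width binning, the floor for empty bins, and the 0.25 alert threshold (a commonly quoted rule of thumb, not a value from the chapter) are all assumptions.

```python
import math


def psi(reference: list[float], live: list[float], n_bins: int = 10, floor: float = 1e-4) -> float:
    """Population stability index of `live` against `reference`, using equal-width bins."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample is constant; cannot form bins")
    width = (hi - lo) / n_bins

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range fall into the first or last bin.
            index = int((v - lo) / width)
            counts[min(max(index, 0), n_bins - 1)] += 1
        # Floor each proportion so an empty bin does not produce log(0).
        return [max(c / len(values), floor) for c in counts]

    expected, actual = proportions(reference), proportions(live)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


reference = [i / 100 for i in range(100)]   # stored at training time
unchanged = list(reference)
shifted = [v + 0.3 for v in reference]      # the whole feature has moved upwards

ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb value
for name, batch in [("unchanged", unchanged), ("shifted", shifted)]:
    score = psi(reference, batch)
    print(name, round(score, 3), "ALERT" if score > ALERT_THRESHOLD else "ok")
```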

Exercise 3: A useful alert

Experiential. Choose a condition and a threshold, then defend it against fatigue: set the threshold from observed normal variation rather than a round number, alert on a sustained breach rather than a single spike, deduplicate repeated firings, and route only actionable alerts to a human. This is the flaky-test lesson from Chapter 7 transplanted to operations — an alert that cries wolf trains the team to ignore it, and an ignored alert protects nothing, including on the day it’s right.
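
Two of those defences, the sustained breach and the deduplication, fit in a short function. This is a sketch under assumed names and numbers; choosing the threshold from observed variation and routing the alert are left out.

```python
def sustained_breach(values: list[float], threshold: float, window: int) -> list[int]:
    """Return the indices at which an alert fires.

    An alert fires once when `window` consecutive values exceed `threshold`,
    and does not fire again until the series has dropped back below it.
    """
    alerts, run, firing = [], 0, False
    for i, value in enumerate(values):
        if value > threshold:
            run += 1
            if run >= window and not firing:
                alerts.append(i)
                firing = True
        else:
            run, firing = 0, False
    return alerts


# Illustrative error rates: one isolated spike, then two sustained breaches.
error_rate = [0.01, 0.09, 0.01, 0.08, 0.09, 0.10, 0.11, 0.01, 0.09, 0.09, 0.09]
alerts = sustained_breach(error_rate, threshold=0.05, window=3)
print(alerts)
```

The single spike at index 1 never fires, and the fourth consecutive breach at index 6 does not fire a second time.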

Exercise 4: Drift check or accuracy report?

Either can be defended, and the justification matters more than the choice — but the drift check is usually the better first build, because it is the only one that can tell you something tomorrow. With a ninety-day label lag, the accuracy report’s first useful signal is a quarter old: it would confirm decay that has already been served to every customer since. The drift check trades certainty for latency — it cannot tell you the model is worse, only that it is now operating on data it wasn’t trained for, which is a warning you can act on the same week.

What you give up is real. The accuracy report catches concept drift, and the drift check structurally cannot: if fraud tactics change while the input distribution stays put, every feature looks stable and the drift check stays silent while the model is quietly wrong. It also catches the case where inputs shifted and the model coped fine — which the drift check would have flagged as an alert nobody needed. So the honest answer is that the drift check buys speed on one failure mode and is blind to another, and the plan should be to add the label-based report as soon as there is time, not to treat drift as a substitute for ground truth.

Exercise 5: Data drift versus concept drift

Data drift is a change in the input distribution P(X): the feature values coming in look different from training — for example, a new customer demographic produces values the model rarely saw. Concept drift is a change in the relationship P(Y|X): the same inputs now map to different outcomes — for example, fraudsters change tactics, so transaction features that meant “safe” last year no longer do. Detecting input drift without labels warns you the model is operating on unfamiliar data, where its learned assumptions may no longer hold — but it can’t confirm a real accuracy drop, because the model might still perform well on the shifted inputs, or the relationship might have changed while the inputs looked unchanged. Only ground-truth labels, when they eventually arrive, can confirm the model has genuinely become less accurate.

D.17 Chapter 17: Code review

Exercise 1: A small, focused pull request

Experiential. The thing to notice is why a small, well-described PR is easy to follow: the reviewer can hold the whole change in their head at once, and a description of what-and-why spares them reconstructing your intent from the diff. The contrast with a sprawling change is the lesson — reviewability is mostly a property of size and framing, not of how clever the code is.

Exercise 2: Review someone else’s PR

Experiential. Most people find that finding issues is easier than phrasing them well. The discipline is to mark each comment as blocking or a suggestion (so the author knows what must change), to keep it about the code rather than the person, and to say why — because the reasoning is what teaches and what makes the comment land as help rather than criticism.

Exercise 3: A bug that automated checks miss

A leak, a wrong metric, or an off-by-one in a split passes the linter and the tests because it is syntactically valid and the tests only check what the author already thought to check; a linter inspects form, never domain correctness. Catching it needs a reviewer who reads the logic and the data flow with domain knowledge — someone who knows the scaler must be fit on training data only, or that this metric is wrong for an imbalanced problem. That is exactly the attention automation cannot provide and review exists to supply.

Exercise 4: Peer review habits that don’t transfer

Any of the three works; the accept-or-reject verdict is the most damaging. Brought to a pull request, it looks like a reviewer who reads the change, decides it isn’t the design they would have chosen, and rejects it with a paragraph explaining what they’d have done — a verdict on the whole, rather than comments on lines. The author now has no route forward short of rewriting, and the change stalls.

Code review isn’t a verdict, it’s a conversation with a default of yes. The reviewer’s job is to get the change merged in good shape, which means specific comments the author can act on, an explicit split between what blocks the merge and what’s merely a preference, and approval once the blockers are addressed rather than once the reviewer would have written it the same way. On a team shipping weekly this is structural, not just courtesy: reviewer and author will swap roles a dozen times a month, and every day a change sits unmerged is a day the branch drifts and the review gets harder. Peer review can afford to be a gate because it happens once; code review has to be a fast, repeatable exchange, because it happens constantly.

Anonymity fails for the same underlying reason — you know exactly who wrote this and you will need them to review yours on Thursday — and the expectation of a defence turns that exchange adversarial, teaching authors to justify rather than to ask.

Exercise 5: A data science review checklist

Items worth adding that a general checklist wouldn’t have: does it leak (preprocessing fit on test data, a target in the features)? are the data assumptions stated and checked? is it reproducible (seed and config captured)? is the metric appropriate to the problem? are there hard-coded secrets? What to leave off is style and formatting — not because it doesn’t matter, but because the formatter and linter settle it automatically (Chapter 5). Leaving it off makes reviews better, because litigating whitespace in comments consumes the human attention that should go to logic and trains a team to nitpick form instead of reasoning about correctness.

D.18 Chapter 18: Documentation

Exercise 1: Write a README

Experiential. The point at which your reader first gets stuck is the most valuable output: it’s almost always an undocumented environment variable, a data source that needs access you forgot to mention, or a setup step so habitual you didn’t know you were doing it. Timing the run from clone to running surfaces the assumptions you can’t see precisely because they’re yours.

Exercise 2: Add docstrings

Experiential. The test is whether help() on a function tells a reader enough to use it without reading the body. If it doesn’t, the docstring is missing part of the contract — usually the parameters, the return value, the exceptions it raises, or a worked example. A docstring that only restates the function name has documented nothing.

Exercise 3: Write a model card

Experiential. The hardest section is almost always “known limitations / where it should not be used”, because it forces you to articulate the model’s failure modes and the populations it was not validated on — exactly the questions exploratory work leaves implicit. That difficulty is itself informative: where the model card is hard to write is where your understanding of the model’s boundaries is thinnest, and therefore where the risk lives.

Exercise 4: Classify documents with Diátaxis

A docstring is reference (look up what a function takes and returns). A tutorial notebook is a tutorial (learning by doing, led by the hand). A model card is mostly reference plus explanation (facts about the model, and the why behind its limits). A README is a deliberate blend: at its best a brief tutorial/how-to that orients a newcomer and points onward to the rest. Mixing the jobs makes a document worse because a reader arrives with one need (to learn, to look up, or to understand), and a document trying to serve two serves neither: a reference padded with teaching is slow to search, and a tutorial listing every option is impossible to follow.

Exercise 5: Keeping documentation in sync

Two structural practices: co-locate documentation with the code (docstrings), so a change to the code sits right beside the text that describes it; and generate reference documentation from the code and make examples executable (doctest, or a tested snippet), so a changed signature or a stale example becomes a build or test failure rather than a silent lie. “Remember to update the docs” is not a third practice because it relies on human discipline under deadline pressure with no feedback when it’s forgotten — the documentation rots quietly and you only discover it when it has already misled someone. Structural defences make drift either impossible or loud; a reminder makes it neither.
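
Both practices can be seen in one small sketch: the docstring sits beside the code it describes, and its example is executed with doctest so that a stale example fails instead of misleading. The function is a list-based variant of the Chapter 1 example, invented here for illustration; a project would more usually run doctests through its test runner than by hand like this.

```python
import doctest


def filter_high_value(spend: list[float], threshold: float) -> list[float]:
    """Return the spend values strictly greater than ``threshold``.

    Parameters
    ----------
    spend : list of float
        Customer spend, in the same currency unit as ``threshold``.
    threshold : float
        Exclusive lower bound. Must be non-negative.

    Returns
    -------
    list of float
        The qualifying values, in their original order.

    Raises
    ------
    ValueError
        If ``threshold`` is negative.

    Examples
    --------
    >>> filter_high_value([50, 150, 250], threshold=100)
    [150, 250]
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    return [s for s in spend if s > threshold]


# Parse the docstring's example and run it as a test.
test = doctest.DocTestParser().get_doctest(
    filter_high_value.__doc__,
    {"filter_high_value": filter_high_value},
    "filter_high_value",
    None,
    0,
)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results.failed, results.attempted)
```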

D.19 Chapter 19: Technical debt

Exercise 1: Audit a project for debt

Experiential. Sorting each item into deliberate (you knew you were cutting the corner) and inadvertent (you’ve only just noticed) is the instructive part, and the inadvertent pile is usually the larger and more alarming one. The item that surprises people most is almost always a piece of “temporary” code — a hard-coded value, a quick script — that turned out to be load-bearing and has been quietly holding production together for months.

Exercise 2: The boy-scout rule

Experiential. Paying down one item while you’re already in the file — adding a test, extracting a function, naming a constant — is typically quick relative to the change you came to make, and that’s exactly the point: opportunistic repayment is cheap because you’ve already paid the cost of understanding the code. Debt repaid this way never has to be scheduled.

Exercise 3: A debt log

Experiential. A debt item is worth writing down, rather than fixing on the spot, when the fix is larger than the time you have, when the code might be discarded anyway, or when stopping to repay it now would derail the task in hand — but you still want it visible so it isn’t silently forgotten. Trivial fixes don’t go in the log; they go in the boy-scout pass. The log’s whole purpose is to make deferred debt a deliberate, tracked decision rather than a thing you rediscover at 3am.

Exercise 4: Ordering the repayments

Experiential, but the proxy you land on is the instructive part. Since interest is only charged when you touch the code, the best available proxy is expected rate of change — how often you or anyone else is likely to modify that code in the next few months — weighted by what a mistake there would cost. A tangle in a module nobody has opened in a year is charging you nothing, however ugly it is; a hard-coded threshold in the transform every new feature passes through is charging you on every change. Blast radius and silence matter too: debt that fails loudly is cheaper than debt that returns inf into a report, which is the shortcut from earlier in this chapter.

The item to delete rather than repay is usually a dead experiment, an abandoned branch of a pipeline, or a helper with exactly one caller that no longer needs it. This is where the financial metaphor genuinely misleads: a loan must be settled, so the metaphor frames every debt as something owed and eventually payable, and refactoring as the only currency. Code has a third option the metaphor has no word for — you can make the obligation cease to exist by removing the code. Deletion is not repayment; it is discovering the debt was never worth carrying. Ask of each item whether anything would break if it vanished, and be suspicious of how often the honest answer is “nothing”.

Exercise 5: When debt is the right call

A shortcut is the correct decision for code with a short or uncertain life — a prototype that may be discarded, a hypothesis you’re testing, a genuine deadline where shipping now matters more than polish. It’s reckless when taken in code you already know will be load-bearing, when taken without recording it, or when the resulting failure would be silent and high-consequence. The distinguishing property is the code’s expected lifetime and criticality, combined with whether the debt is acknowledged: debt on disposable, low-stakes code is a tool; unrecorded debt on code others will depend on is a liability waiting to come due.

D.20 Chapter 20: Cross-discipline collaboration

Exercise 1: Map the vocabulary gaps

Experiential. The classic three: test (a data scientist means evaluation metrics; an engineer means pass/fail assertions on code), validation (DS: holding out data to measure generalisation; SE: checking inputs against a schema), and model (DS: a learned predictive function; SE: an abstraction of a domain, like a class diagram). The gap that has usually caused a real misunderstanding is “is it tested/validated?” — where both parties said yes, meaning entirely different things, and discovered the mismatch only later.

Exercise 2: Write a handoff document

Experiential. The revealing part is what you find yourself making explicit for the first time: the failure modes, the edge cases of the input contract, the caveats on performance (where the model is weak, the populations it wasn’t validated on), how it should be monitored, and who owns it when it misbehaves. All of that typically lived only in your head, which is exactly why the handoff is where things go wrong.

Exercise 3: An interface as a contract

Experiential. Agreeing the schema, latency budget, and bad-input behaviour in advance is cheaper because the alternative — discovering in production that you returned a label where the service expected a probability, or the wrong units, or an unhandled null — is an incident with real cost and a paging at a bad hour. The contract converts an integration surprise into a build-time check: a conversation now versus an outage later.

Exercise 4: The failure the schema can’t catch

Almost any failure of meaning rather than form gets through. The model returns 0.03 for every customer because an upstream feature silently went null and the classifier fell back to its base rate: a valid float in [0, 1], a valid version string, a green CI run, and a retention campaign that quietly stops firing. Or the team retrains on a new population and the probabilities are still well-formed but no longer calibrated, so the service’s “high risk” threshold now means something different from what it meant when it was chosen. The schema validated the bytes; nobody had written down what the number means, what “good enough” looks like for this use, or who is expected to notice when it stops holding. What was missing is the human half of the contract: the data scientists stating the calibration assumptions, the known failure modes, and the monitoring that would surface them (Chapter 16); the engineers stating who is paged, and what they are allowed to do about it at 3am. Both halves are writable — they just aren’t writable as a schema.
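The gap between form and meaning can be shown in a few lines. In this sketch, `passes_schema` stands in for the contract's form check and `looks_degenerate` for one possible meaning check; both names, the `min_spread` value, and the example batches are assumptions chosen for illustration, and a real monitor would need a threshold set from observed behaviour.

```python
import statistics


def passes_schema(probabilities: list[float]) -> bool:
    """The form check: every value is a float in [0, 1]."""
    return all(isinstance(p, float) and 0.0 <= p <= 1.0 for p in probabilities)


def looks_degenerate(probabilities: list[float], min_spread: float = 0.01) -> bool:
    """One possible meaning check: flag a batch whose scores barely vary."""
    if len(probabilities) < 2:
        return True  # too little data to judge, so treat it as suspicious
    return statistics.pstdev(probabilities) < min_spread


stuck_batch = [0.03] * 200  # every customer scored at the base rate
healthy_batch = [0.02, 0.10, 0.45, 0.81, 0.07]

for name, batch in [("stuck", stuck_batch), ("healthy", healthy_batch)]:
    print(name, "schema ok:", passes_schema(batch), "degenerate:", looks_degenerate(batch))
```

Both batches pass the schema; only the second check separates them. It covers just one failure of meaning (collapsed scores) and says nothing about calibration, which is why the written assumptions and named owners still matter.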

Exercise 5: Two rigours pulling opposite ways

A clear case: a model that needs frequent retraining and experimentation (the data scientist wants fast, loose iteration) while it serves live production traffic (the engineer wants stability, tests, and controlled releases). The instincts genuinely conflict. A team holding both resolves it not by one side winning but by engineering the boundary tightly so that exploration can stay loose safely — an automated retraining pipeline with validation gates and canary releases, behind a stable contract and CI the engineer trusts, so the data scientist iterates freely without putting production at risk. The general principle is that the engineering instinct and the data science instinct are reconciled by tightly engineering the interface so that what happens behind it can remain appropriately loose.

D.21 Chapter 21: Notebook to production API

Exercise 1: Carry a model one stage further

Experiential. For most readers the next stage is extracting the feature and training logic into an importable module of pure functions. What you have to change to make it importable is everything that tied the code to the notebook: replace reliance on notebook globals with explicit function arguments (Chapter 6), separate the logic from the cell that happened to run it, and give each function a clear input and return. The pure-function discipline from Chapter 6 is precisely what makes the code importable — a function that depends on whatever is in the kernel can’t be lifted out of it.

Exercise 2: The train–serve safeguard

Experiential. The test feeds a single raw record through both the training feature path and the serving feature path and asserts the resulting features are identical. It is worth more than a test of the model’s accuracy because train–serve skew is a silent, high-impact bug: if serving computes a feature even slightly differently from training, the model receives inputs unlike anything it learned on and degrades quietly, with nothing failing. That has a definite right answer a test can pin, whereas accuracy is a moving statistical quantity that doesn’t belong in a pass/fail gate (Chapter 7). The skew test catches a real deployment defect; an accuracy assertion catches noise.
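A minimal sketch of such a test follows. The two feature functions and their feature names are hypothetical stand-ins for whatever your training and serving code actually compute; only the shape of the test is the point.

```python
import math


def training_features(record: dict) -> dict:
    """Feature path used when building the training set (hypothetical)."""
    return {
        "log_spend": math.log1p(record["spend"]),
        "tenure_years": record["tenure_months"] / 12,
    }


def serving_features(record: dict) -> dict:
    """Feature path used inside the serving code (hypothetical)."""
    return {
        "log_spend": math.log1p(record["spend"]),
        "tenure_years": record["tenure_months"] / 12,
    }


def test_train_and_serve_features_match() -> None:
    # One raw record goes through both paths; the outputs must be identical.
    raw_record = {"spend": 120.0, "tenure_months": 18}
    assert serving_features(raw_record) == training_features(raw_record)


test_train_and_serve_features_match()
print("train and serve features match")
```

If the two paths can share one imported function, the test becomes trivially true, which is the stronger fix. The test earns its keep where the paths are forced to differ, for example batch SQL for training and per-request Python for serving.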

Exercise 3: Wrap the model in an API

Experiential. What the endpoint forces you to make explicit, and the notebook let you leave implicit, is the contract: the exact request schema (field names, types, and valid ranges), what the response carries (a probability rather than a label, plus a model_version for traceability), and what counts as a valid input. In the notebook you built the feature array yourself and trusted it; the endpoint has to state, in advance, what a caller must send and will receive — and the malformed request returning a clean 422 is that contract doing its job.
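The same contract can be sketched without any web framework, which makes the explicit parts easy to see. Everything named here is an assumption for illustration: the `ChurnRequest` fields, the `MODEL_VERSION` label, and the placeholder scoring rule standing in for a real model. A web framework would typically turn the `InvalidRequest` case into the 422 response.

```python
from dataclasses import dataclass

MODEL_VERSION = "example-0.1"  # hypothetical version label


class InvalidRequest(ValueError):
    """Raised when a payload breaks the contract (an API would map this to 422)."""


@dataclass(frozen=True)
class ChurnRequest:
    tenure_months: int
    monthly_spend: float

    @classmethod
    def from_payload(cls, payload: dict) -> "ChurnRequest":
        try:
            tenure = payload["tenure_months"]
            spend = payload["monthly_spend"]
        except KeyError as exc:
            raise InvalidRequest(f"missing field: {exc.args[0]}") from exc
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(tenure, bool) or not isinstance(tenure, int) or tenure < 0:
            raise InvalidRequest("tenure_months must be a non-negative integer")
        if isinstance(spend, bool) or not isinstance(spend, (int, float)) or spend < 0:
            raise InvalidRequest("monthly_spend must be a non-negative number")
        return cls(tenure_months=tenure, monthly_spend=float(spend))


def predict(request: ChurnRequest) -> dict:
    """Return a probability plus the model version, never a bare label."""
    # Placeholder scoring rule standing in for a trained model.
    probability = 1.0 / (1.0 + request.tenure_months / 12.0)
    return {"churn_probability": probability, "model_version": MODEL_VERSION}


print(predict(ChurnRequest.from_payload({"tenure_months": 12, "monthly_spend": 40})))
```

Nothing in the notebook version forced a decision about what happens to a string where an integer was expected; here the decision is written down and enforced before the model is ever called.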

Exercise 4: Extending the publication mapping

The abstract maps cleanly. It is the README and the API documentation (Chapter 18): the short statement of what this thing is, what it takes, what it returns, and who should use it. It is read far more often than anything else, and it is the only part most consumers will ever read. The retraction maps onto rollback (Chapter 15), but the fit is worth pushing on, because rollback is the better mechanism. A retraction is slow, public, and cannot recall the copies already circulating or the work built on top of them; a rollback is a version swap that takes effect on the next request, and the versioned response field from this chapter means you can identify exactly which predictions came from the withdrawn model. The analogy runs out in the other direction: a paper has citations, a record of who relied on the result. A service has nothing so honest. Its consumers are whoever happened to call the endpoint, and unless you deliberately build for it (logged clients, versioned routes, deprecation notices) you cannot tell who depends on the behaviour you are about to change. Publication makes dependence visible; deployment hides it.

Exercise 5: How far to walk the route

A throwaway analysis should stop at the notebook (version-controlled at most); an internal tool typically warrants a package, a few tests, and externalised config, but rarely a full container-and-CD pipeline; a model real users depend on needs the whole route — package, tests, API, container, CI, deployment, and monitoring. The signal to go further is always the same kind of thing: someone else needs to run it, it runs repeatedly or unattended, real decisions now depend on its output, or it must reproduce exactly. “It has outlived its expected life, or someone now depends on it” is the trigger to take it one stage further down the path.

D.22 Chapter 22: Reproducible research pipeline

Exercise 1: One command, from raw data to result

Experiential. The valuable discovery is the hidden dependency the Makefile flushes out — the step that only worked because of something on your machine: a file in your home directory, a package you installed once and forgot, an environment variable set months ago, or a manual “and then I clicked export” step. Declaring every stage’s inputs and outputs forces those implicit dependencies into the open, which is exactly why a one-command rebuild is a stronger guarantee than “it ran when I did it”.

Exercise 2: Version a dataset

Experiential. Before versioning, reconstructing the exact data behind an old result was usually impossible because the raw file had been overwritten or changed in place, with no record of which version produced the figure. DVC (or, at a minimum, a dated immutable copy plus a checksum committed alongside the code) makes the input recoverable, so checking out the commit behind a result also restores the dataset that produced it — the missing fourth input from the chapter.
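For the minimal route, the checksum half might look like the sketch below. The function names are hypothetical; `hashlib.sha256` is the standard-library hash. The recorded digest would be committed alongside the code, and `verify_dataset` run at the start of the pipeline.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(path: Path, expected_sha256: str) -> None:
    """Fail loudly if the file is not the one the result was built from."""
    actual = sha256_of_file(path)
    if actual != expected_sha256:
        raise ValueError(f"{path} has changed: expected {expected_sha256}, got {actual}")
```

A checksum detects that the data changed but cannot bring the old version back, which is why it needs the immutable copy beside it, and why a tool such as DVC is the fuller answer.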

Exercise 3: Generate the number, don’t paste it

Experiential. A hand-pasted number becomes wrong because nothing updates it when the data or code changes — it is a snapshot frozen at the moment you copied it, with no link back to its source, so the day the analysis changes the figure in the slide silently disagrees with the figure in the code. A generated number is recomputed from the data every time the report is rendered, so it cannot drift away from the result it claims to report; the worst that can happen is the build fails, which is loud rather than silent.
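In its simplest form, generating the number means the sentence is built from the computed value rather than typed. The function name and the example rates below are made up for illustration; a Quarto or Jupyter report would do the same thing with inline code.

```python
def render_summary(conversion_rates: list[float]) -> str:
    """Build the reported sentence from the data, so the two cannot disagree."""
    mean_rate = sum(conversion_rates) / len(conversion_rates)
    return f"Mean conversion rate: {mean_rate:.1%} across {len(conversion_rates)} cohorts."


print(render_summary([0.10, 0.20, 0.30]))
```

Passing an empty list raises `ZeroDivisionError`, which is the loud failure the answer describes: the report refuses to render rather than showing a stale figure.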

Exercise 4: Giving a live table a seed-like handle

Both options are defensible, and the trade is storage cost against trust in the warehouse. The snapshot genuinely pins the data — the bytes you analysed are the bytes you keep — but you pay for the storage, you pay again every time the analysis reruns on a new window, and a wide table snapshotted weekly gets expensive fast. It also leaves one failure open: the snapshot is only as good as the moment you took it, so if the extraction itself was wrong, you have faithfully preserved the wrong data. The query plus as-of timestamp is nearly free and stays readable, but it only works if the warehouse can actually answer a historical question — it assumes the underlying table is append-only or has genuine time-travel. Point it at a table that gets restated, backfilled, or hard-deleted for retention, and the “same” query returns different rows a year later while looking entirely reproducible. The honest test for either choice is the same one from the chapter: rebuild an old result from scratch and check it still comes out. A handle you have never exercised is a handle you do not know you have.

Exercise 5: What a notebook doesn’t pin

A single notebook, even under version control, pins the code but not the other three inputs. It does not pin the environment — the packages it imports are whatever happens to be installed in the kernel, so a colleague with a different pandas version can get a different result from identical code. It does not pin the data — it reads whatever the file path points to, and that file can be overwritten or updated without the notebook changing at all. (And it pins randomness only if you remembered to set seeds.) “It’s all in one notebook” addresses code organisation, not reproducibility: the notebook is necessary but nowhere near sufficient, because the result depends on three things living entirely outside it.
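A quick way to see what the notebook leaves unpinned is to record what the kernel actually has installed. This sketch uses the standard-library `importlib.metadata`; the function name and the package list passed to it are assumptions. It records the environment but does not pin it, so a lockfile is still needed to rebuild it.

```python
import sys
from importlib import metadata


def environment_snapshot(packages: list[str]) -> dict[str, str]:
    """Record the interpreter and package versions the current kernel is using."""
    snapshot = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot[name] = "not installed"
    return snapshot


# Compare this output with a colleague's to see where two kernels differ.
print(environment_snapshot(["pandas", "numpy", "scikit-learn"]))
```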

D.23 Chapter 23: MLOps pipeline

Exercise 1: Sketch the loop

Experiential. Naming each stage for your own model — the training pipeline, the registry, the deployment, the monitoring signal, the retraining trigger — usually reveals that the missing or manual stage is the return arrow: most teams have a way to train and a way to deploy, but monitoring is thin and retraining is ad hoc, done when someone happens to notice a problem. Automating it means adding drift monitoring that emits a signal, a triggered training pipeline, and a promotion gate — closing the loop so the cycle runs on a signal rather than on someone’s memory.

Exercise 2: The retraining trigger

Experiential. The false-alarm rate matters because a trigger that fires too often is the flaky test of MLOps (Chapter 7 and Chapter 16): each false alarm causes a needless retrain, which costs compute and — worse — risks promoting a model trained on a blip. A trigger becomes more trouble than it’s worth once its false positives are frequent enough that the team disables it or ignores its output, at which point it protects nothing. The defences are the same as for alerts: set the threshold from observed normal variation rather than a round number, and require sustained drift rather than a single noisy batch.
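Both defences fit in a few lines. This is a sketch under stated assumptions: the drift score is taken to be a single number per batch where higher means more drift, and the three-sigma rule, the three-batch window, and the function names are illustrative choices, not recommendations from the chapter.

```python
import statistics


def drift_threshold(baseline_scores: list[float], n_sigmas: float = 3.0) -> float:
    """Set the threshold from observed normal variation (needs at least two scores)."""
    return statistics.mean(baseline_scores) + n_sigmas * statistics.stdev(baseline_scores)


def should_retrain(recent_scores: list[float], threshold: float, sustained_batches: int = 3) -> bool:
    """Fire only when the last few batches all exceed the threshold."""
    if len(recent_scores) < sustained_batches:
        return False
    return all(score > threshold for score in recent_scores[-sustained_batches:])


baseline = [0.10, 0.12, 0.11, 0.09, 0.10, 0.11, 0.10, 0.12]
threshold = drift_threshold(baseline)

print(should_retrain([0.11, 0.30, 0.10], threshold))  # a single noisy batch
print(should_retrain([0.20, 0.22, 0.25], threshold))  # drift that persists
```

The single spike does not fire the trigger and the sustained shift does. The price is latency: a real shift is acted on a few batches late, which is usually a fair trade against retraining on a blip.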

Exercise 3: The promotion gate

Experiential. The comparison must use the same evaluation data because scoring two models on different datasets confounds “the candidate is better” with “the candidate’s test set was easier” — you could not tell skill from luck of the draw. And you require a margin rather than strict improvement because a tiny difference in a metric like AUC is within its own run-to-run and sampling variability; promoting on a hair’s-breadth win means swapping the production model on noise, which adds risk and churn for no real gain. The candidate should have to beat the incumbent by more than the metric’s own wobble before it earns promotion.
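The two requirements can be made structural rather than remembered. In this sketch the gate takes one evaluation set and scores both models on it, so a mismatched comparison is not expressible. The toy threshold models, the accuracy metric, and the 0.02 margin are assumptions for illustration; in practice the margin would come from the metric's measured variability.

```python
from collections.abc import Callable, Sequence

Model = Callable[[float], int]  # toy model type: one feature in, one label out


def accuracy(model: Model, features: Sequence[float], labels: Sequence[int]) -> float:
    correct = sum(model(x) == y for x, y in zip(features, labels))
    return correct / len(labels)


def should_promote(
    candidate: Model,
    incumbent: Model,
    features: Sequence[float],
    labels: Sequence[int],
    margin: float = 0.02,
) -> bool:
    """Score both models on the same data; promote only on a win larger than the margin."""
    gain = accuracy(candidate, features, labels) - accuracy(incumbent, features, labels)
    return gain > margin


eval_features = [0.1, 0.45, 0.6, 0.9, 0.3, 0.7, 0.42, 0.2, 0.8, 0.55]
eval_labels = [0, 1, 1, 1, 0, 1, 1, 0, 1, 1]


def incumbent(x: float) -> int:
    return int(x > 0.5)


def candidate(x: float) -> int:
    return int(x > 0.4)


print(should_promote(candidate, incumbent, eval_features, eval_labels))
```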

Exercise 4: The judgement you won’t hand to a threshold

The judgements people are least willing to automate are the ones about why something changed rather than whether it changed. A metric holds up overall while quietly collapsing on one segment. Drift appears in a feature and you recognise it as a known upstream release rather than a real shift in customers. A candidate wins on AUC while getting worse on the errors that actually cost money. No threshold sees any of that, because each requires knowing something about the world that the numbers do not carry.

All three responses are legitimate, and the right one depends on the cost of being wrong and how often the loop turns. A cruder proxy — segment-wise metrics with their own gates, a cost-weighted score instead of AUC — works when you can name the thing you’re worried about in advance; it will still miss the case you didn’t anticipate. A human in one step is the usual answer for a genuinely consequential model: automate the trigger, the retrain, and the evaluation, and let a person approve the promotion, which is a few minutes of attention rather than a day of work. Leaving the loop open is the honest choice for a model that retrains twice a year or where a bad promotion is expensive and hard to detect — automation you don’t need is a system you now have to maintain.
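The segment-wise proxy might be sketched as a gate that names the segments where a candidate regresses, whatever the overall metric says. The function name, the segment labels, and the `max_drop` tolerance are hypothetical, and the metric is assumed to be one where higher is better.

```python
def regressed_segments(
    candidate_scores: dict[str, float],
    incumbent_scores: dict[str, float],
    max_drop: float = 0.02,
) -> list[str]:
    """Return segments where the candidate is worse than the incumbent by more than max_drop."""
    failing = []
    for segment, incumbent_score in incumbent_scores.items():
        # A segment the candidate was never scored on is treated as a failure.
        candidate_score = candidate_scores.get(segment)
        if candidate_score is None or incumbent_score - candidate_score > max_drop:
            failing.append(segment)
    return failing


incumbent_by_segment = {"new customers": 0.80, "loyal customers": 0.88}
candidate_by_segment = {"new customers": 0.70, "loyal customers": 0.91}

print(regressed_segments(candidate_by_segment, incumbent_by_segment))
```

The gate only sees segments someone thought to list, which is exactly the limit described above: it encodes the worry you could name in advance.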

What should move you between them is evidence, not ambition. If the human approval becomes a rubber stamp (say, nobody has rejected a candidate in six months), the judgement has effectively been encoded already and you should write it down as a rule. If requests for retraining start arriving faster than you can serve them, close the loop. And if you find yourself unable to state what the human is checking for, that is the signal that you don’t yet understand the decision well enough to automate or delegate it. Knowing which of these a given model deserves is the judgement the whole book has been building towards.

Exercise 5: The weakest practice

Take rollback (Chapter 15). Without it, the loop’s promotion gate is a one-way door: the moment a candidate is promoted — perhaps trained on a corrupted batch, or scoring well on a test set that didn’t catch a regression — it serves production traffic with no fast way back, and an automated loop that can promote but not un-promote has merely automated the act of shipping a bad model. The same argument lands on any link: without reproducibility (Chapter 22) you can’t retrain to a comparable result, so the candidate can’t be trusted or traced; without monitoring (Chapter 16) nothing triggers the loop and the model decays in silence; without testing (Chapter 7) a broken transform propagates into every retrain. The cycle only runs safely if every one of these holds, which is why automating it is the last thing you do, not the first.
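The one-way door is easy to see in a toy registry. `ModelRegistry` below is a hypothetical in-memory sketch, not the API of any real registry; the point is only that promotion keeps the history that makes un-promotion possible.

```python
class ModelRegistry:
    """Toy registry that remembers what was live before each promotion."""

    def __init__(self, initial_version: str) -> None:
        self._history = [initial_version]

    @property
    def live(self) -> str:
        return self._history[-1]

    def promote(self, version: str) -> None:
        self._history.append(version)

    def rollback(self) -> str:
        """Restore the previously live version and return it."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self.live


registry = ModelRegistry("v1")
registry.promote("v2")         # the automated loop ships a candidate
print(registry.live)
print(registry.rollback())     # the way back, taken when v2 misbehaves
```

A registry that overwrote the live version instead of appending to a history would still support `promote`, and that is the design the answer warns against.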

--- # Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md title: "Exercise answers" --- ## Chapter 1: From notebook to system {#sec-answers-notebook-to-system} **Exercise 1: Restart Kernel and Run All** This exercise is diagnostic — there's no single "right" answer. Common failures include: - `NameError` for variables defined in cells that the kernel ran out of order. For example, a variable created in cell 15 but used in cell 8, which only worked because you happened to run cell 15 first during interactive exploration. - `FileNotFoundError` for data files with hard-coded paths that only exist on your machine or in a specific working directory. - Cells that depend on outputs from cells you've since deleted or commented out. The point isn't to fix every failure immediately — it's to see how much of your notebook's correctness depends on invisible state rather than explicit structure. **Exercise 2: Extract a function** Here's an example using the chapter's customer filtering logic: ```{python} #| label: answer-ch1-ex2 #| echo: true import pandas as pd import numpy as np def filter_high_value(customers: pd.DataFrame, threshold: float) -> pd.DataFrame: """Select customers whose spend exceeds the given threshold.""" return customers[customers["spend"] > threshold].copy() # Verify with a small test input test_data = pd.DataFrame({"spend": [50, 150, 250]}) result = filter_high_value(test_data, threshold=100) assert len(result) == 2, f"Expected 2 rows, got {len(result)}" assert list(result["spend"]) == [150, 250], "Should contain only rows above threshold" print("All assertions passed") ``` The key properties gained: the function has a name that describes its purpose, its inputs are explicit (no reliance on global variables), and the `assert` statements verify the logic independently of the notebook's broader state. **Exercise 3: Score a notebook** This is a self-assessment — your honest scores matter more than the numbers themselves. 
In practice, most data science notebooks score highest on modularity (often 2–3, since most have at least some cell-level separation) and lowest on testability (often 1, since few notebooks include any automated verification). Reproducibility varies widely: a notebook that loads from a fixed CSV with pinned dependencies might score 4, while one that relies on a live database connection and `pip install` scores 1. The value of this exercise is identifying your weakest property and asking whether strengthening it would have saved you time in a recent project. If the answer is yes, that's where to invest first. **Exercise 4: Holdout set / test suite analogy** Two ways the analogy **holds**: 1. Both are verification mechanisms applied *after* the creative work. You build the model, then validate. You write the code, then test. Neither replaces the work; both catch problems the author missed. 2. Both require separation — a holdout set must be kept separate from training data, and tests must check behaviour from outside the code, not just re-run it. Contamination in either case undermines the verification. Two ways the analogy **breaks down**: 1. Model validation is probabilistic; software testing is deterministic. A holdout accuracy of 82% might be perfectly acceptable — you're measuring how well the model generalises. A test that passes 82% of the time is broken. Tests are pass/fail: the code either does what you specified or it doesn't. 2. Holdout sets evaluate performance on data drawn from the same distribution. Tests evaluate correctness against cases the developer explicitly constructed, including edge cases and error conditions that may never appear in production data. This means tests can catch failures that no amount of validation data would reveal, but they can also miss failures that real-world data would expose. Each has a blind spot the other doesn't share. 
**Exercise 5: Run a colleague's notebook** This exercise is experiential — the answer is your documentation of the attempt. Common discoveries include: - File paths that assume a specific directory structure or operating system - Environment dependencies not captured anywhere (specific package versions, system libraries, environment variables) - Cells that must be run in a non-obvious order, or cells that must be skipped - Configuration values with no explanation of how they were chosen Each discovery maps to one of the chapter's four system properties. Hard-coded paths and undocumented dependencies are reproducibility failures. Monolithic cells that do many things are modularity failures. The absence of any automated checks is a testability failure. And magic numbers without context are readability failures. ## Chapter 2: Version control {#sec-answers-version-control} **Exercise 1: Put a project under version control** Experiential — there's no single right answer, but a sound initial commit contains only what you *authored*: code (`.py` modules, and notebooks with outputs stripped), configuration (`requirements.txt`, `pyproject.toml`, and the `.gitignore` itself), and documentation (a README). What you deliberately exclude, and where each belongs instead: - **Data** → a data-versioning tool (DVC) or object storage. It's too large and too volatile for Git's keep-everything-forever history, and may be sensitive. - **Trained models and artefacts** (`.pkl`, `.joblib`) → a model registry or artefact store. They're large, binary, and regenerated rather than written by hand. - **Secrets** (`.env`, API keys) → a secrets manager, or a local `.env` that is never committed. A credential committed even once persists in the history after you delete it. - **Caches and environments** (`__pycache__`, `.ipynb_checkpoints`, `.venv`) → not versioned at all; they're regenerated locally. 
The discipline is the one from the chapter: commit what you author, store what you generate or receive somewhere better suited to it. **Exercise 2: Clean notebook diffs** Two routes achieve this. `nbstripout` installs a Git filter that strips outputs and execution counts on commit, so the tracked version holds only code and markdown: ```bash pip install nbstripout nbstripout --install # registers the filter for this repository ``` Or pair the notebook with a script representation using Jupytext, and treat the script as the reviewed artefact: ```bash pip install jupytext jupytext --set-formats ipynb,py:percent analysis.ipynb ``` After either, make a one-line change and inspect `git diff` (or `nbdiff` if you use nbdime). The diff should now show only your change rather than a wall of metadata. The verification *is* the exercise: you've turned an unreadable JSON diff into a reviewable one. **Exercise 3: Commit messages for past decisions** Self-directed. The instructive part is comparing your messages against what a filename or inline comment could carry. A message such as *"Drop `signup_channel`: 60% missing after the May tracking change, and imputing it was injecting signal"* records both the reason and the evidence — information a filename like `model_v3` cannot hold and a comment like `# dropped signup_channel` omits. Because the message is attached to the exact change, attributed, dated, and surfaced by `git blame`, the "why" survives long after the people who remember the review meeting have moved on. That permanence is what neither the filename nor the comment provides. **Exercise 4: Branch for an experiment** Experiential. A typical flow: ```bash git switch -c experiment/log-spend-features # ...edit, commit on the branch... 
git switch main # main is untouched and still runs git merge experiment/log-spend-features # if the experiment worked # or: git branch -D experiment/log-spend-features # if it didn't ``` Keeping `main` untouched means that at every moment you have a known-good version to fall back to, demo, or hand over — and discarding a failed experiment leaves no residue, no `model_v2_BAD.ipynb` lingering in the folder. Compared with copying files, the branch makes the experiment both *comparable* (you can diff it against `main`) and *disposable* (deleting the branch erases the dead end cleanly). **Exercise 5: Git versus an experiment tracker** Git versions the *code and its history of changes*: it answers "what is the code, and how did it come to be this way?", with branching, merging, and line-by-line history. It records nothing about what a given run produced. An experiment tracker such as MLflow records *runs*: the metrics, parameters, and artefacts a specific execution generated, so you can compare AUC across fifty hyperparameter settings — something Git has no concept of. Conversely, a tracker offers no line-by-line source history or merge. The division of labour reflects two different questions. Versioning your code is about the *process* that produces results; tracking results is about the *outputs* of running that process on particular data. Reproducibility needs both — the exact code version *and* the run record — which is why mature projects link each tracked run back to the Git commit that produced it. ## Chapter 3: Environments and dependencies {#sec-answers-environments} **Exercise 1: Produce a lockfile** Experiential. 
The flow separates intent from the resolved result: ```bash # requirements.in holds your direct dependencies (the abstract spec) pip install pip-tools pip-compile requirements.in # -> fully pinned requirements.txt (the lock) python -m venv .venv && source .venv/bin/activate pip install -r requirements.txt # rebuild from the lock ``` The point of the exercise is what the lock contains that your `requirements.in` did not: every *transitive* dependency, pinned to an exact version. Rebuilding in a fresh environment and finding the project still runs confirms you've captured the environment, not just your top-level wishes. **Exercise 2: Audit unpinned dependencies** Compare what's declared (often nothing, or a few `>=` constraints) against `pip freeze`. The instructive part is naming a library where a major-version bump could change results and saying *how*: scikit-learn has changed estimator defaults between releases (a different solver or tie-handling shifts predictions); pandas changed copy-on-write behaviour and default dtypes; NumPy's random `Generator` stream can differ across versions. In each case the code is untouched but the numbers move — which is precisely the failure mode pinning prevents. **Exercise 3: What pinning controls** Pinning your Python-package versions controls *library behaviour* — a changed default, a re-implemented algorithm, a deprecated parameter. It does **not** control the Python interpreter version, the operating system, the underlying maths libraries (BLAS/LAPACK), or hardware/GPU floating-point behaviour. To control that second category you reach for a container (Docker), which pins the interpreter, system libraries, and OS alongside the packages; for bit-exact numerics you would additionally fix BLAS threading and any framework determinism flags. Pinning versions is necessary but not sufficient for full reproducibility. **Exercise 4: Reproduce a colleague's environment** Experiential. 
The valuable output is the list of things the lockfile alone didn't capture, each with its proper home: - The **Python version** → a `.python-version` file or `pyproject.toml`. - A **system library** a wheel links against (e.g. a C library, a CUDA runtime) → a container image or a documented set of OS packages. - An **environment variable** the code reads at runtime → a committed `.env.example` documenting the names (never the values; see *Configuration and secrets*). Each gap is a reproducibility failure waiting to happen, and each has a better home than a colleague's memory. **Exercise 5: When abstract beats locked** Abstract `>=` requirements are the right choice when your code is a *library* meant to be installed alongside other packages: over-constraining versions would make it hard for consumers to satisfy everyone's dependencies at once, so you specify the minimum you need and stay flexible. An exact lock is essential when your code is an *endpoint* — a deployed service, a scheduled pipeline, a reproducible analysis — where the same versions must reappear every time and nothing downstream depends on your version range. The difference is structural: a library is a dependency *of* other things (flexibility aids interoperability); an application is the final consumer (exact reproduction is the whole point). ## Chapter 4: The command line {#sec-answers-command-line} **Exercise 1: Capture a workflow** Experiential. A reasonable result is a `Makefile` with one target per step, run end to end with a single `make`. The discovery worth noting is the step that "depended on you remembering to do something first" — creating an output directory, setting an environment variable, downloading data before the feature step. Those implicit prerequisites are exactly what a task runner makes explicit: a target that creates the directory, or a dependency declaration (`features: data`) that enforces the order so no one has to remember it. 
**Exercise 2: Answer a question with the shell** For example, counting the distinct values in the third column of a CSV (skipping the header): ```bash tail -n +2 data.csv | cut -d, -f3 | sort | uniq | wc -l ``` or counting rows matching a condition with `grep -c`. This feels natural for quick, line-oriented filtering and counting on flat text, and it's faster than starting a Python session. You wish for a DataFrame the moment fields contain commas or quotes (naive `cut` mis-parses real CSV), when you need typed aggregation, or when a join is involved — that's the boundary the Data Science Bridge describes. **Exercise 3: Exit codes and chaining** A validation script signals failure by exiting non-zero: ```python # validate.py import sys import pandas as pd df = pd.read_csv("data/processed.csv") if df.empty or "target" not in df.columns: print("validation failed: empty data or missing target", file=sys.stderr) sys.exit(1) # non-zero: tells the shell something went wrong print("validation passed") ``` ```bash python validate.py && python train.py # train runs only if validate exits 0 ``` This matters because automation and CI decide pass/fail from exit codes, not from reading output. A non-zero exit halts the `&&` chain, so bad data never reaches training, and a CI server marks the build red instead of silently continuing. The exit code is the machine-readable verdict the whole pipeline depends on. **Exercise 4: Surviving a dropped connection** A command started in a plain SSH session is a child of that session. When the connection drops, the session ends and the job is sent a hang-up signal (`SIGHUP`), so it dies — unless you took extra steps (`nohup`, `disown`). `tmux` (or `screen`) changes this by running your shell inside a session that lives on the *server*, decoupled from your client connection: you detach (or simply lose the connection), the session and everything in it keep running, and you reattach later to find the job still going or finished. 
For any job measured in hours, that decoupling is the difference between a result and a wasted afternoon. **Exercise 5: When a pipeline counts the wrong thing** The figure can be wrong in several directions at once. `grep` has no idea which column it is looking at, so a row with a customer reference of `INV-2026-118`, a product name containing `2026`, or a shipping date in 2026 attached to an order placed in December 2025 all count as matches. It has no idea about rows either: a quoted field containing an embedded newline is two lines to `wc -l`, and the header line counts as an order if it happens to contain the string. And a match is a match once — a line mentioning 2026 twice still contributes one, which is right here but would not be if you were counting occurrences. The pandas version avoids all of this because it asks a question of one *typed* column: `order_date` has been parsed into a datetime, `.dt.year` is a genuine year rather than a substring, and a row is a row regardless of how many newlines the raw text contained. The culprit is not `grep`, which does its job perfectly. It is the pipe's data model: it carries untyped text, split on newlines, with no notion of fields, types, or records. That generality is exactly why every tool on the system can be composed with every other, and exactly why the composition stops being trustworthy once the meaning of your data lives in its structure rather than its characters. ## Chapter 5: Readable code {#sec-answers-readable-code} **Exercise 1: Refactor for readability** Experiential. The change that surprises people most is that renaming alone surfaces confusion: if you can't think of a good name for a variable, that's often a sign you don't fully understand what it holds, or that it holds two different things at different points. Replacing magic numbers with named constants does the same — naming `0.73` forces you to articulate what it *is*. 
Adding a type signature and a one-line docstring then makes the contract explicit without touching the logic. If the refactor was purely cosmetic and nothing became clearer, the original was already readable; usually something does.

**Exercise 2: Formatter and linter**

Experiential. The instructive step is sorting the linter's output into real defects and pure style. Real defects include unused imports, variables assigned but never used, names that shadow a builtin (`list`, `dict`), bare `except:` clauses that swallow errors, and mutable default arguments. Pure style includes line length, quote style, and spacing — exactly the things the formatter fixes automatically. The lesson is the division of labour: automate the style pile entirely so that human attention in review goes to the defect pile and to the logic, which no tool can check.

**Exercise 3: Split a function**

Experiential. The test of a good decomposition is to read just the new function names in sequence and ask whether they narrate what the original function did. If they do — `load_raw`, `drop_invalid_rows`, `add_spend_per_day`, `aggregate_by_cohort` — the names are carrying the structure, and a reader can understand the whole from them. If a step needs a comment to explain what it does, the name isn't doing its job yet.

**Exercise 4: Rewrite it, or explain it?**

For most inherited code the rewrite wins, and the reason is what the next reader does rather than what they want to know. Someone opening a sampling correction is usually there to change it — a new weighting scheme, a different edge case — and a comment cannot help them do that safely. Prose sits beside the code, drifts out of date the first time someone edits the body without editing the paragraph, and offers no protection against misreading the line they are about to modify. Names, decomposition and a signature live *in* the thing being changed, so they cannot fall out of step with it.

The opposite choice is right when the difficulty is not in the code but in the reasoning behind it. A function whose body is five clear lines implementing a non-obvious statistical correction — where the mystery is *why* this adjustment, under which assumptions, from which paper — gains nothing from further decomposition and everything from a paragraph recording the justification. That is the honest boundary of the methods-section analogy: prose is the right tool for the reasoning a reader could never recover from the code, and the wrong tool for the intent they need in order to change it.

**Exercise 5: When readability isn't worth it**

Throwaway names are the right call for genuinely scratch code that will be deleted within the session — a quick check of a distribution, a one-off plot to settle a question, a snippet you're using to understand an API. The specific signal that it has crossed the line is *promotion*: you copy it into another notebook, you find yourself relying on its output days later, or you hand it to someone else. At that moment the code has become "kept", will be read many times, and earns the few minutes of naming and documentation. The skill is noticing the promotion and cleaning up *then*, rather than writing every scratch cell as if it were production.

## Chapter 6: Functions, modules, packages {#sec-answers-functions-modules-packages}

**Exercise 1: Extract a copy-pasted function**

Experiential. The payoff is visible the moment you make the follow-up change: with the logic in one imported module, you edit one place and every caller gets the fix; with copies scattered across notebooks, you had to find and edit each one — and missing one is exactly how the copies drift out of sync. "How many places did I have to change?" going from several to one *is* the single-source-of-truth principle made concrete.

**Exercise 2: Global to pure function**

Experiential.
Once every input arrives as an argument, the function's result depends only on those arguments, so it returns the same answer no matter what cells ran before it. The verification — calling it after deliberately changing some unrelated global or re-running cells out of order, and getting the same result — is the property that also makes it testable in the next chapter.

**Exercise 3: Make a project installable**

Experiential. The error you were previously working around is `ModuleNotFoundError` (or the `sys.path.append("..")` hack used to dodge it), which works only from the directory you happened to launch from. After a minimal `pyproject.toml` and `pip install -e .`, the package is importable by name from anywhere in the environment, and because the install is *editable*, changes to the source take effect immediately without reinstalling. The fresh-notebook test confirms the import no longer depends on where you started.

**Exercise 4: Library-author disciplines**

A discipline you do **not** need for code only you use: a stable public API with backwards compatibility, semantic versioning, and deprecation cycles. While you're the only user you can rename, re-signature, and restructure freely. A discipline you **should** adopt the moment a colleague imports your code: a stable interface — don't change function names or argument meanings out from under them without warning — and a documented public surface so they can use it without reading the implementation. The trigger for the switch is precisely "someone else now depends on this".

**Exercise 5: What belongs in a package**

Code that should stay in the notebook is exploratory, one-off, or presentation-specific: the narrative of a particular analysis, plots tailored to one report, throwaway checks. Code that has earned a place in a module is reusable logic — data cleaning, feature engineering, model training and evaluation — that you'll run more than once or in more than one place.
The signal that logic has crossed the line is reuse: you've used it (or want to) a second time, you need to test it, or someone else needs it. "I'm about to copy this" is the clearest possible prompt to extract it instead.

## Chapter 7: Testing stochastic code {#sec-answers-testing}

**Exercise 1: Test a deterministic transform**

Experiential. The instructive part is usually the edge cases: writing `test_on_empty_input` or `test_with_all_zeros` forces you to *decide* what the function should do in those situations — return an empty result, raise a clear error, propagate NaNs — when the original code never made that decision explicit. A test you can't write because you don't know the expected answer is a sign the contract is underspecified, which is a finding in itself.

**Exercise 2: Make a stochastic function testable**

Experiential. Having the function accept an explicit `rng` argument turns hidden global randomness into an injected dependency you control. The exact test then fixes the seed and asserts a specific result; the tolerance test asserts a statistical property — say, that the mean of many draws is within some band of the expected value. The tolerance should be *justified*: wide enough that it won't fail by chance (a few standard errors of the quantity you're checking), tight enough that a real defect would breach it. Stating why you chose the band is part of the answer.

**Exercise 3: Test an invariant**

Experiential. An invariant is a property that must hold for *every* input — preserved row count, no new missing values, output bounded in a range, mean zero after standardising. Checking it across many random inputs is property-based testing done by hand; `hypothesis` automates the input generation and, when a property fails, shrinks the counterexample to the smallest input that triggers it, which is often the fastest route to understanding the bug.

**Exercise 4: Why `model.score(...) > 0.85` is a poor unit test**

It conflates evaluation with testing.
The assertion is really trying to answer "is the model good enough?", which is an evaluation question — answered on a continuum, against a baseline, and monitored over time — not a pass/fail property of the code. As a unit test it fails on three counts: it's fragile (it breaks the first time the data shifts, with no code defect), uninformative (a failure doesn't localise any bug), and slow. What you *should* test about the pipeline is the deterministic machinery around the model: that data validation rejects malformed input, that transforms produce the expected columns and leak nothing, that the pipeline runs end to end on a tiny sample, and that a saved model round-trips to identical predictions.

**Exercise 5: Why a flaky test is worse than none**

A test that fails one run in ten and is habitually re-run until green is *dishonestly noisy*: it trains the team to treat failures as background noise to be cleared by re-running, which is exactly the habit that lets a real failure slip through unnoticed. It also blocks or destabilises CI and erodes trust in the whole suite. No test is at least honestly silent; a flaky test actively degrades everyone's response to failure. Two fixes that keep the test: make it deterministic by fixing the seed so the stochastic element is pinned; or, if it's genuinely checking a statistical property, replace the brittle assertion with a principled tolerance (several standard errors wide) or an invariant that must always hold. Re-running until it passes, or deleting it, are the two non-answers.

## Chapter 8: Debugging and profiling {#sec-answers-debugging}

**Exercise 1: Read a traceback**

Experiential. Read bottom-up: the final line is the exception type and message (*what* went wrong), and the deepest frame in your own code is *where* to look — third-party frames below it are usually just the machinery that surfaced your mistake.
The fact you'd check first follows directly from those two (a `ZeroDivisionError` in a line dividing by `active_days` says: look for a zero). People are routinely surprised how often the traceback alone, read properly, fully explains the bug — the panic that makes us skim it is the real obstacle.

**Exercise 2: Use a debugger instead of print**

Experiential. The thing a debugger shows that a print does not is the *entire* live state at the moment of failure — every variable, not just the one you anticipated printing. That's how you spot the cause you weren't looking for: the column that's unexpectedly all zeros, the frame with the wrong shape, the value that's a string where you assumed a float. Print debugging can only show what you already suspected; the debugger shows what you didn't.

**Exercise 3: Replace print with logging**

Experiential. A reasonable mapping is `INFO` for milestones ("loaded N rows", "training complete"), `WARNING` for recoverable oddities ("clipped 12 negative values"), and `DEBUG` for fine detail. The payoff is the final step: flipping a single level setting switches between a quiet production run and a verbose diagnostic one without editing — or later removing — any of the statements, which is precisely what makes logging persist where scattered prints get deleted.

**Exercise 4: Profile and fix**

Experiential. The lesson lands hardest when the hot spot is *not* where you expected — the slow step is often an innocent-looking apply or a repeated recomputation, not the obviously heavy model fit. Fix the dominant one (vectorise the loop, cache the repeated work) and measure the change rather than assuming it helped. The discipline is to optimise what the profiler points at, and only once something is actually too slow.

**Exercise 5: Four questions, four tools**

- *A variable's current value* → `print` (or a quick inspect). Adequate, because the question is narrow and you already know what you want to see.
- *The full state at a failure* → a debugger (`pdb` or an IDE). `print` can't show everything at once, and it forces you to guess in advance which variables will matter.
- *What happened in a run you weren't watching* → `logging`. `print` has no levels, timestamps, or persistence, and you weren't there to read it scroll past.
- *Where the time went* → a profiler. `print` can't attribute time, and hand-timed guesses are biased toward the parts you already suspect.

`print` is the right tool only for the first; for the other three it's a poor stand-in because it answers "what is this value now?" and nothing else — it cannot capture full state, persist a structured record, or measure performance.

## Chapter 9: Project structure {#sec-answers-project-structure}

**Exercise 1: Reorganise a flat project**

Experiential. The instructive discovery is usually something being *mutated in place* — a raw CSV edited to fix a typo, a column renamed in the source file, a row dropped by hand. Once raw data is read-only, that edit has to become a transformation step whose output lands in `data/interim/` or `data/processed/`, leaving the original untouched. Finding the in-place edit is finding the point where your work stopped being reproducible from source.

**Exercise 2: Write a README**

Experiential. Every question the colleague still has to ask is a gap in either the README or the structure, and the common ones are revealing: an undocumented environment variable, a data source that needs credentials or special access, a setup step that "everyone knows", or the order in which things must run. The exercise works precisely because you can't see your own assumptions — the colleague's questions surface them.

**Exercise 3: A single source of truth for paths**

Experiential. The original would break on a colleague's machine because an absolute path like `/Users/you/project/data.csv` simply doesn't exist there.
Deriving every path from one project root (resolved from `__file__` or a config value) makes the project portable: it runs unchanged on a laptop, a server, or inside a container, because only the root differs. The hard-coded path is one of the most common reasons "it works on my machine" and nowhere else.

**Exercise 4: Write the rule for your own layout**

A workable rule usually turns on kind and lifecycle rather than topic: "every file belongs to exactly one of authored-and-version-controlled, received-and-immutable, or generated-and-regenerable, and lives in the directory for that category." That is enough for a colleague to place a new file unaided, and it is the sentence a layout is really encoding — `src/` and `configs/` are authored, `data/raw/` is received, `data/processed/` and `models/` are generated.

The interesting part is the file that resists it. A notebook that both explores and produces the figures for a report is authored code and a generated artefact at once; a lookup table that someone hand-edits is authored data that arrived as raw. Sometimes the file is genuinely misplaced and should be split — the notebook writes its figures to a generated directory rather than beside itself. Sometimes the rule is too crude and needs an extra category, which is a real answer rather than a failure. Working out which is which is the maintenance work a schema would have done for you automatically, and it is the price of a convention that nothing enforces.

**Exercise 5: When structure is overkill**

A genuinely one-off analysis — a quick answer for a meeting, a teaching example, a throwaway exploration — should stay a single notebook in a single folder; imposing `src/`, `tests/`, and a structured `data/` on it is pure overhead.
The signal that it has earned the scaffolding is longevity and dependence: it will run again (on new data, or on a schedule), someone else needs to run or maintain it, it needs tests, or helper files are starting to accumulate at the top level. "This is going to outlive the week" is the trigger.

## Chapter 10: Data pipelines {#sec-answers-data-pipelines}

**Exercise 1: Break a monolith into stages**

Experiential. The payoff usually shows up as a stage that proves reusable in a context you hadn't anticipated — the cleaning function reused by a different analysis, or a feature transform reused at serving time. That unplanned reuse is exactly what the monolithic cell made impossible, because the useful part was welded to everything around it.

**Exercise 2: Add a validation gate**

Experiential. The bad data a gate would have caught at the boundary includes an upstream schema change (a renamed or dropped column), an unexpected null where the next stage assumes completeness, an out-of-range value (negative spend, a date in the future), a duplicated key, or a target column leaking into the features. The point is *where* the failure surfaces: a gate turns a cryptic error three stages downstream into a precise message at the moment the bad data entered.

**Exercise 3: An idempotent, cached stage**

Experiential. Persist the stage's output to `data/interim/` and skip the computation when the artefact already exists; the second run should report the stage skipped. The lesson to carry forward is that a real orchestrator does this for you *and* invalidates the cache when a stage's inputs or code change — caching is only safe when staleness is handled, which is why "re-run if the inputs changed" is the rule, not "re-run never".

**Exercise 4: The guarantee the cache quietly removed**

An `sklearn` `Pipeline` prevents leakage by construction: `.fit()` is called separately on each training fold, so the scaler only ever sees training rows and validation statistics cannot influence it.
A `scale_features` stage that fits across the whole dataset before the split has already broken that, and the caching makes it worse in two ways. The fitted statistics now sit in a persisted artefact that every downstream run reads, so the leak is baked into a file rather than recreated (and possibly noticed) each time; and because the stage is skipped whenever the artefact exists, a later split, a new fold, or a fresh evaluation silently reuses statistics computed over rows it was supposed to have never seen. The evaluation looks fine, which is the whole problem — leakage flatters your metrics rather than raising an error.

The fix is not to abandon caching but to move the split upstream of the fitting. Split first, and let the pipeline persist the *split* as its cached artefact; then fit the scaler inside the training path only, persist the fitted scaler as a versioned artefact of that run, and have the validation and serving paths *apply* it rather than refit. Caching a transformation is safe; caching a quantity that was learned from data is only safe if the artefact records which rows it learned from — which is precisely the bookkeeping `Pipeline` did for you inside a single `.fit()` call, and which nobody does for you once the stages span processes and files.

**Exercise 5: When a pipeline framework is overkill**

A single script or notebook is the right tool for a one-off analysis, or a workflow of one or two steps that you run interactively and watch. The signal that it has outgrown this is when re-running everything becomes too costly or too risky: the workflow runs repeatedly or on a schedule, some stages are expensive enough that you want to re-run only what changed, several people or systems depend on intermediate outputs, or failures need to be isolated and retried rather than restarting from scratch. At that point the explicit stages and orchestration earn their keep.
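
Looking back at Exercise 4, the split-first ordering can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the chapter's implementation: it uses only the standard library in place of `sklearn`, the `spend` values are made up, and the helper names (`split_indices`, `fit_scaler`, `apply_scaler`) and the `fitted_on` key are hypothetical.

```python
import random
import statistics


def split_indices(n_rows, validation_fraction=0.25, seed=0):
    """Return (train_idx, validation_idx); this split is what gets cached."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_validation = int(n_rows * validation_fraction)
    return sorted(indices[n_validation:]), sorted(indices[:n_validation])


def fit_scaler(values, train_idx):
    """Learn the statistics from training rows only, and record which rows."""
    train_values = [values[i] for i in train_idx]
    return {
        "mean": statistics.fmean(train_values),
        "std": statistics.pstdev(train_values),
        "fitted_on": list(train_idx),  # the bookkeeping the artefact must carry
    }


def apply_scaler(scaler, values, idx):
    """Apply already-fitted statistics to the given rows; never refits."""
    return [(values[i] - scaler["mean"]) / scaler["std"] for i in idx]


spend = [10.0, 12.0, 11.0, 13.0, 9.0, 14.0, 10.5, 500.0]  # illustrative data
train_idx, validation_idx = split_indices(len(spend))
scaler = fit_scaler(spend, train_idx)
train_scaled = apply_scaler(scaler, spend, train_idx)
validation_scaled = apply_scaler(scaler, spend, validation_idx)
```

Because the scaler records `fitted_on`, a later run can check that the rows it is about to evaluate were not among those the statistics were learned from, before trusting a cached artefact.
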

## Chapter 11: Configuration and secrets {#sec-answers-config-secrets}

**Exercise 1: Lift hard-coded values into config**

Experiential. The values that turn out to differ between your machine and where the code really runs are the telling ones: absolute file paths almost always, plus database and table names, output locations, resource settings (number of workers), and debug flags. These are exactly the things configuration is for — the same logic, different values per environment — and finding them is finding everything implicitly tied to your laptop.

**Exercise 2: Move a secret out of the repository**

Experiential. Add `.env` to `.gitignore`, commit a `.env.example` template with placeholder values, and load the real secret from the environment. The reason moving it is not sufficient on its own, if it was ever committed, is that version control keeps history: the secret remains in past commits even after you delete it from the current files, so anyone with a clone still has it. It must be *rotated* — the credential changed at its source — not merely removed.

**Exercise 3: Typed, validated config**

Experiential. The contrast is the point: a bare dictionary lets a mistyped key (`raw_paht`) return a silent `None` that surfaces as a confusing failure much later, whereas a `pydantic` model with a constraint rejects a bad value the instant it loads, naming the offending field. Feeding it an out-of-range value and getting an immediate, specific error is the behaviour you're buying.

**Exercise 4: One config file per run, copied by hand**

The copy and the original drift. Two files that are meant to agree on everything except a connection string won't, within a few weeks: someone tunes a threshold in one and forgets the other, and production ends up running settings nobody deliberately chose. The failure is silent, and it destroys the thing config was supposed to give you — a trustworthy record of what actually ran.

The split to make is between the *schema*, which should be identical everywhere and validated identically everywhere, and the *values*, only a few of which legitimately vary: connection strings, bucket names, output locations, worker counts, debug flags. Model settings that differ between development and production are almost always an accident rather than a decision, and an environment setting that's identical everywhere is usually a hard-coded value waiting to bite. Keep one schema, layer environment-specific values over a shared base, and let validation catch the drift.

**Exercise 5: When hard-coding is acceptable**

A genuine constant — a value that is part of the logic and does not vary by run or environment, such as a mathematical constant or a truly fixed business rule — is perfectly fine as a *named* constant in the code. A value becomes configuration when it varies between environments, changes between runs, or is something you tune. The signal that a hard-coded value has become a liability is any of: you find yourself editing code to change it, it differs between dev and prod, or you can't tell from the code why it has the value it does. Secrets are the absolute case — always a liability when hard-coded, regardless of anything else.

## Chapter 12: API design {#sec-answers-api-design}

**Exercise 1: Wrap a model in an endpoint**

Experiential/applied. What the endpoint forces you to decide, and a notebook `predict` let you ignore, is the *contract*: the exact request format (field names, types, units), what the response contains (a label, a probability, both), and what counts as a valid input. In the notebook you built the feature array yourself and trusted it; the endpoint has to state, explicitly and in advance, what a caller must send and will receive.

**Exercise 2: Validation returns a clear 422**

Experiential.
Constraining the request fields means a malformed request is rejected at the door, before the model runs, with a message that says what was wrong — rather than reaching the model and producing a confidently wrong prediction from garbage, or crashing deep inside and returning an opaque `500`. The lesson is that validation converts an unpredictable internal failure into a precise, early, client-facing error.

**Exercise 3: Response schema and live docs**

Experiential. The documentation is generated *from* the schema and code (as OpenAPI), so it cannot drift out of sync — change a field and the docs change with it. This matters because callers integrate against the documentation: hand-written API docs inevitably fall behind the implementation, and documentation that lies is worse than none, because it sends integrators down paths that no longer work.

**Exercise 4: Adding a feature to a live endpoint**

Publish `/v2/predict`. Adding a required field to `/v1` breaks every existing caller the moment you deploy — their requests start failing validation with a `422`, and the first they hear of it is their own error rate. You never agreed a migration window with them, so there's no point at which that break is acceptable. The notebook instinct is safe precisely because you're the only caller and you change both sides at once; an endpoint separates those sides across teams and across time. A new version lets both schemas run side by side while callers migrate on their own schedule, and gives you a signal — `/v1` traffic falling to zero — that tells you when the old one can be retired. The tempting middle road, making the field optional with a default, is worth naming: it keeps callers working, but silently scores them on a value you invented, which is usually worse than a clear failure.
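
The side-by-side arrangement in Exercise 4 can be sketched without any web framework. The following is a rough illustration only: the field names (`tenure_months`, `monthly_spend`, `support_tickets`) and the `validate` helper are hypothetical, and it checks field presence rather than types, which a real service would delegate to its framework's validation layer.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PredictRequestV1:
    tenure_months: int
    monthly_spend: float


@dataclass(frozen=True)
class PredictRequestV2:
    tenure_months: int
    monthly_spend: float
    support_tickets: int  # the new required field lives only in v2


# Both schemas stay registered, so existing callers keep working.
SCHEMAS = {"/v1/predict": PredictRequestV1, "/v2/predict": PredictRequestV2}


def validate(path, payload):
    """Return (status, body): 200 with the parsed request, or 422 naming the problem."""
    schema = SCHEMAS[path]
    expected = {f.name for f in fields(schema)}
    missing = sorted(expected - set(payload))
    unexpected = sorted(set(payload) - expected)
    if missing or unexpected:
        return 422, {"missing": missing, "unexpected": unexpected}
    return 200, schema(**payload)


old_caller = {"tenure_months": 14, "monthly_spend": 42.5}
v1_status, v1_body = validate("/v1/predict", old_caller)
v2_status, v2_body = validate("/v2/predict", old_caller)
print(v1_status, v2_status)
```

The old payload still passes on `/v1/predict` and is rejected on `/v2/predict` with the missing field named, which is the clear failure the answer prefers over a silently invented default.
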

**Exercise 5: Batch versus real-time**

A scheduled batch job is the right mechanism when predictions are consumed in bulk on a known cadence and latency doesn't matter — nightly churn scores feeding a dashboard, weekly demand forecasts written to a table. A real-time API is necessary when predictions are needed on demand, one at a time, with low latency, in response to a user action or another system's request — a fraud check at checkout, a recommendation as a page loads. The deciding property is *how and when the prediction is consumed*: in bulk on a schedule points to batch; on demand with a latency requirement points to an API.

## Chapter 13: Continuous integration {#sec-answers-ci}

**Exercise 1: Add a CI workflow**

Experiential. The payoff is the moment the status goes red on a deliberately broken test before you'd have noticed any other way — that's the days-to-minutes gap from the chapter's opening, closed. A workflow that installs the *locked* dependencies (rather than whatever the runner happens to have) is also what makes the run a faithful check rather than a coincidence of the runner's environment.

**Exercise 2: Lint and format gate**

Experiential. The first run typically flags the same mix @sec-readable-code described — unused imports, unformatted files, the occasional shadowed name or undefined reference — and sorting them into "real defect" and "pure style" is the exercise. The style pile is exactly what the formatter fixes automatically, so once `ruff format` is in the gate it stops recurring, and the gate's signal becomes mostly about real problems.

**Exercise 3: Set up pre-commit**

Experiential. The hook stops the commit locally, before anything reaches CI, which is the point: the cheapest place to catch a formatting slip or a stray large file is before it's even recorded. Pre-commit and CI are complementary — the local hook gives instant feedback on the trivial things, and CI remains the authoritative shared gate that everyone's changes must pass.

**Exercise 4: The change CI cannot see**

The data changes. An upstream system starts sending a column in a different unit, a category gets renamed, the customer population shifts after a marketing push. Nothing in the repository changed, so every check stays green — CI watches commits, not the world. That's the real limit of the analogy: you re-run a holdout because *anything* might have invalidated the verdict, whereas CI fires only on a change to the code. Catching the rest takes the mechanisms from elsewhere in the book — validation gates on the data as it enters the pipeline (@sec-data-pipelines), a scheduled retrain that records metrics over time rather than gating a merge, and monitoring of the live service (@sec-monitoring). Green CI means the code still does what you asked it to. It never meant the model is still right.

**Exercise 5: What belongs in CI**

Run on every change the checks that are *fast and deterministic*: unit tests, linting, type checks, and a small end-to-end smoke test on sample data. Push to an occasional or nightly job the things that are slow, expensive, or non-deterministic: full model training, integration tests against real external services, validation over large datasets, performance benchmarks. The principle is that every-push checks must be quick and reliable enough that developers never feel the urge to route around them — a gate that is slow or flaky gets disabled or ignored, at which point it protects nothing. Speed and trustworthiness are what make the gate worth having.

## Chapter 14: Containerisation {#sec-answers-containerisation}

**Exercise 1: Write a Dockerfile**

Experiential. A sound first Dockerfile starts from a slim base, installs the locked requirements, copies the code, and sets the run command; building and running it confirms the service starts in a clean, sealed environment rather than relying on anything on your machine.
If it runs in the container but not on a colleague's bare machine, the container has done its job — it carried the environment with it.

**Exercise 2: Improve the image**

Experiential. Ordering the instructions so dependencies are installed before the code is copied means a code edit reuses the cached dependency layer instead of reinstalling everything — rebuilds drop from minutes to seconds. A slimmer base, a multi-stage build, or removing build tools afterwards shrinks the image substantially, often from north of a gigabyte to a few hundred megabytes. The exercise is to measure the before and after, because the gains are larger than most people expect.

**Exercise 3: Keep data and secrets out**

Experiential. Baking *data* into the image bloats every copy of it, ties the image to a single snapshot of the data, and can push sensitive records into registries. Baking a *secret* in is the @sec-config-secrets mistake reincarnated: the credential ends up inside an artefact that gets pushed to registries and shared, so it leaks and must be rotated, not merely removed. The fix is to keep the image a generic definition of *how to run* — mount the data as a volume and inject the secret as an environment variable at run time.

**Exercise 4: Two builds of the same Dockerfile**

They differ everywhere the recipe is vague. `FROM python:3.12-slim` is a moving tag that will have been rebuilt on a newer base with newer system libraries; `pip install -r requirements.txt` resolves against PyPI as it stands on build day, not six months ago; any `apt-get install` pulls whatever the distribution currently ships. So the honest statement is that a `Dockerfile` is a *recipe*, not a pin — the resulting image is frozen, but the instructions that produce it are not, and the analogy quietly assumes those are the same thing.
Making it true means pinning at every layer: reference the base image by digest rather than tag, install from a fully locked requirements file, pin system package versions, and then treat the *built image*, stored in a registry, as the artefact you promote from CI to staging to production — rather than rebuilding at each stage and trusting that you'll get the same thing twice.

**Exercise 5: When to containerise**

A container is clearly worth it whenever the code must run *reliably on a machine that isn't yours*: a deployed service, a job on a shared cluster, a pipeline that has to reproduce exactly across environments, or onboarding a team to one consistent setup. It's overkill for a one-off local analysis or an exploration only you will ever run on your own machine, where a virtual environment and a lockfile already give you everything you need. The deciding property is exactly that: does it need to run identically somewhere other than your machine? If yes, containerise; if it lives and dies on your laptop, the lockfile is enough.

## Chapter 15: Deployment {#sec-answers-deployment}

**Exercise 1: Deploy the service**

Experiential. The revealing part is what the platform makes you supply explicitly that your laptop quietly provided: the environment variables and secrets (@sec-config-secrets), the port to expose, persistent storage for any data, the exact run command, and resource limits. On your machine all of that was implicit context; deployment turns it into configuration you have to state, which is itself why the container and config work of the previous chapters pays off here.

**Exercise 2: Staging with pass criteria**

Experiential. The discipline is to run the *same* artefact as production with different configuration, and to decide what "passing staging" means *before* you look at the result — a latency ceiling, an error-rate ceiling, a smoke test that must pass.
Deciding the bar in advance is what stops the very human temptation to wave a release through because it "seems fine", which is the operational version of moving the goalposts.

**Exercise 3: Schedule a batch job**

Experiential. The key behaviour is that a failed run is *detectable*: the job exits non-zero (@sec-command-line), so the scheduler can alert, rather than failing silently and leaving you to discover days later that the table was never updated. A batch job that fails quietly is worse than one that fails loudly, because the missing output often looks just like stale-but-present output.

**Exercise 4: Does overfitting apply to staging?**

The worry carries across, but it attaches to different things. You are not overfitting staging by *running* against it repeatedly — that's the point of it, and unlike a holdout the environment isn't consumed by being looked at. What you can overfit to is the *fixture*: a staging dataset that never changes, so the smoke tests come to encode the quirks of that one sample; a synthetic load profile that flatters your caching; or a set of pass criteria quietly relaxed each time a release failed them. In every case staging keeps going green while telling you less and less about production.

The defences are to refresh staging data periodically (or sample it from production, suitably anonymised), to set the thresholds from observed production behaviour rather than from what staging happened to achieve last time, and to treat a criterion you have loosened twice as evidence that the criterion, not the release, needs examining.

**Exercise 5: Batch versus always-on, and rollback**

A nightly churn-scoring job feeding a report is naturally batch; a real-time fraud-scoring API is naturally always-on. Their rollbacks differ accordingly. For the batch job, a rollback is usually re-running the previous version (or simply retaining yesterday's output) — the failure is recoverable because nothing depended on the run in real time.
For the online service, a rollback means switching live traffic back to the previous container version immediately — a blue-green flip or redeploying the prior tag — because every second on the bad version affects real requests. Batch buys you time; online demands a fast switch.

## Chapter 16: Monitoring and observability {#sec-answers-monitoring}

**Exercise 1: Logging and health**

Experiential. To investigate "a strange answer last Tuesday" you need, at minimum, the timestamped request *inputs*, the *prediction* returned, and the *model version* that served it — ideally tied together by a request ID so you can correlate across log lines. The common failure is logging only that "a prediction was made": that confirms the service ran but lets you reconstruct nothing. The test of your logging is whether you could replay last Tuesday's prediction from it alone.

**Exercise 2: A drift check**

Experiential. Store a reference sample from training, compare each live batch to it with a KS test or population stability index, and alert when the statistic crosses a threshold. As for which feature drifts first, the usual culprits are externally driven or time-sensitive ones — a monetary feature exposed to inflation, a feature fed by an upstream source that changes format or coverage, or anything seasonal — because the world moves those independently of your model.

**Exercise 3: A useful alert**

Experiential. Choose a condition and a threshold, then defend it against fatigue: set the threshold from observed normal variation rather than a round number, alert on a sustained breach rather than a single spike, deduplicate repeated firings, and route only actionable alerts to a human. This is the flaky-test lesson from @sec-testing transplanted to operations — an alert that cries wolf trains the team to ignore it, and an ignored alert protects nothing, including on the day it's right.
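
The defences in this answer can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production alerting rule: the function names, the three-standard-deviation multiplier, the three-in-a-row requirement, and the error-rate figures are all assumptions invented for the example.

```python
from statistics import mean, stdev


def threshold_from_baseline(baseline, k=3.0):
    """Derive the threshold from observed normal variation (mean + k * sd)
    instead of picking a round number. The value of k is an assumption."""
    return mean(baseline) + k * stdev(baseline)


def sustained_breaches(values, threshold, min_consecutive=3):
    """Return the indices at which an alert fires.

    An alert fires only when `min_consecutive` values in a row exceed the
    threshold, and only once per run, so a long breach is not re-sent.
    """
    alerts = []
    run = 0
    for i, value in enumerate(values):
        if value > threshold:
            run += 1
            if run == min_consecutive:
                alerts.append(i)
        else:
            run = 0
    return alerts


# Hypothetical error rates from a quiet week, used as the baseline.
baseline = [0.010, 0.012, 0.011, 0.009, 0.013, 0.010, 0.011, 0.012]
# Hypothetical live readings: one isolated spike, then a sustained rise.
live = [0.011, 0.030, 0.012, 0.016, 0.017, 0.018, 0.019, 0.011]

threshold = threshold_from_baseline(baseline)
alerts = sustained_breaches(live, threshold)
print(f"threshold={threshold:.4f}, alerts fired at indices {alerts}")
```

With these made-up numbers the isolated spike at index 1 never fires, and the four-reading breach fires once rather than four times.
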

**Exercise 4: Drift check or accuracy report?**

Either can be defended, and the justification matters more than the choice — but the drift check is usually the better first build, because it is the only one that can tell you something *tomorrow*. With a ninety-day label lag, the accuracy report's first useful signal is a quarter old: it would confirm decay that has already been served to every customer since. The drift check trades certainty for latency — it cannot tell you the model is worse, only that it is now operating on data it wasn't trained for, which is a warning you can act on the same week.

What you give up is real. The accuracy report catches concept drift, and the drift check structurally cannot: if fraud tactics change while the input distribution stays put, every feature looks stable and the drift check stays silent while the model is quietly wrong. It also catches the case where inputs shifted and the model coped fine — which the drift check would have flagged as an alert nobody needed. So the honest answer is that the drift check buys speed on one failure mode and is blind to another, and the plan should be to add the label-based report as soon as there is time, not to treat drift as a substitute for ground truth.

**Exercise 5: Data drift versus concept drift**

Data drift is a change in the input distribution P(X): the feature values coming in look different from training — for example, a new customer demographic produces values the model rarely saw. Concept drift is a change in the relationship P(Y|X): the same inputs now map to different outcomes — for example, fraudsters change tactics, so transaction features that meant "safe" last year no longer do.
Detecting input drift without labels warns you the model is operating on unfamiliar data, where its learned assumptions may no longer hold — but it can't confirm a real accuracy drop, because the model might still perform well on the shifted inputs, or the relationship might have changed while the inputs looked unchanged. Only ground-truth labels, when they eventually arrive, can confirm the model has genuinely become less accurate.

## Chapter 17: Code review {#sec-answers-code-review}

**Exercise 1: A small, focused pull request**

Experiential. The thing to notice is *why* a small, well-described PR is easy to follow: the reviewer can hold the whole change in their head at once, and a description of what-and-why spares them reconstructing your intent from the diff. The contrast with a sprawling change is the lesson — reviewability is mostly a property of size and framing, not of how clever the code is.

**Exercise 2: Review someone else's PR**

Experiential. Most people find that *finding* issues is easier than *phrasing* them well. The discipline is to mark each comment as blocking or a suggestion (so the author knows what must change), to keep it about the code rather than the person, and to say why — because the reasoning is what teaches and what makes the comment land as help rather than criticism.

**Exercise 3: A bug that automated checks miss**

A leak, a wrong metric, or an off-by-one in a split passes the linter and the tests because it is syntactically valid and the tests only check what the author already thought to check; a linter inspects *form*, never domain correctness. Catching it needs a reviewer who reads the logic and the data flow with domain knowledge — someone who knows the scaler must be fit on training data only, or that this metric is wrong for an imbalanced problem. That is exactly the attention automation cannot provide and review exists to supply.
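
To make the scaler case concrete, here is a minimal sketch of that leak using only the standard library. The helper names and the toy numbers are hypothetical; the point is only that both versions are valid code and both satisfy the kind of shape check an author typically writes, so nothing automated distinguishes them.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn (mean, standard deviation) from the values passed in, and nothing else."""
    return mean(values), pstdev(values)


def transform(values, params):
    """Standardise values using previously fitted parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


train = [10.0, 12.0, 14.0, 16.0]
test = [100.0]  # an extreme value that only appears in the held-out data

# Leaky version: the scaler has seen the test row before evaluation.
leaky_params = fit_scaler(train + test)
# Correct version: the scaler is fit on training data only.
clean_params = fit_scaler(train)

leaky_test = transform(test, leaky_params)
clean_test = transform(test, clean_params)

# The kind of check an author tends to write passes for both versions.
assert len(leaky_test) == len(test)
assert len(clean_test) == len(test)

print(f"leaky mean={leaky_params[0]:.1f}, clean mean={clean_params[0]:.1f}")
print(f"scaled test value: leaky={leaky_test[0]:.2f}, clean={clean_test[0]:.2f}")
```

In the leaky version the held-out value has pulled the mean towards itself, so it looks far less extreme than it really is. Only a reader who knows where `fit_scaler` is allowed to look would flag the `train + test` argument.
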

**Exercise 4: Peer review habits that don't transfer**

Any of the three works; the accept-or-reject verdict is the most damaging. Brought to a pull request, it looks like a reviewer who reads the change, decides it isn't the design they would have chosen, and rejects it with a paragraph explaining what they'd have done — a verdict on the whole, rather than comments on lines. The author now has no route forward short of rewriting, and the change stalls.

Code review isn't a verdict, it's a conversation with a default of *yes*. The reviewer's job is to get the change merged in good shape, which means specific comments the author can act on, an explicit split between what blocks the merge and what's merely a preference, and approval once the blockers are addressed rather than once the reviewer would have written it the same way. On a team shipping weekly this is structural, not just courtesy: reviewer and author will swap roles a dozen times a month, and every day a change sits unmerged is a day the branch drifts and the review gets harder. Peer review can afford to be a gate because it happens once; code review has to be a fast, repeatable exchange, because it happens constantly.

Anonymity fails for the same underlying reason — you know exactly who wrote this and you will need them to review yours on Thursday — and the expectation of a defence turns that exchange adversarial, teaching authors to justify rather than to ask.

**Exercise 5: A data science review checklist**

Items worth adding that a general checklist wouldn't have: does it leak (preprocessing fit on test data, a target in the features)? are the data assumptions stated and checked? is it reproducible (seed and config captured)? is the metric appropriate to the problem? are there hard-coded secrets? What to leave off is style and formatting — not because it doesn't matter, but because the formatter and linter settle it automatically (@sec-readable-code).
Leaving it off makes reviews *better*, because litigating whitespace in comments consumes the human attention that should go to logic and trains a team to nitpick form instead of reasoning about correctness.

## Chapter 18: Documentation {#sec-answers-documentation}

**Exercise 1: Write a README**

Experiential. The point at which your reader first gets stuck is the most valuable output: it's almost always an undocumented environment variable, a data source that needs access you forgot to mention, or a setup step so habitual you didn't know you were doing it. Timing the run from clone to running surfaces the assumptions you can't see precisely because they're yours.

**Exercise 2: Add docstrings**

Experiential. The test is whether `help()` on a function tells a reader enough to *use* it without reading the body. If it doesn't, the docstring is missing part of the contract — usually the parameters, the return value, the exceptions it raises, or a worked example. A docstring that only restates the function name has documented nothing.

**Exercise 3: Write a model card**

Experiential. The hardest section is almost always "known limitations / where it should not be used", because it forces you to articulate the model's failure modes and the populations it was *not* validated on — exactly the questions exploratory work leaves implicit. That difficulty is itself informative: where the model card is hard to write is where your understanding of the model's boundaries is thinnest, and therefore where the risk lives.

**Exercise 4: Classify documents with Diátaxis**

A docstring is *reference* (look up what a function takes and returns). A tutorial notebook is a *tutorial* (learning by doing, led by the hand). A model card is mostly *reference* plus *explanation* (facts about the model, and the why behind its limits). A README is a deliberate blend — at its best a brief *tutorial/how-to* that orients a newcomer and points onward to the rest.
Mixing the jobs makes a document worse because a reader arrives with one need — to learn, to look up, or to understand — and a document trying to serve two serves neither: a reference padded with teaching is slow to search, and a tutorial listing every option is impossible to follow.

**Exercise 5: Keeping documentation in sync**

Two structural practices: co-locate documentation with the code (docstrings), so a change to the code sits right beside the text that describes it; and generate reference documentation from the code and make examples executable (doctest, or a tested snippet), so a changed signature or a stale example becomes a build or test *failure* rather than a silent lie. "Remember to update the docs" is not a third practice because it relies on human discipline under deadline pressure with no feedback when it's forgotten — the documentation rots quietly and you only discover it when it has already misled someone. Structural defences make drift either impossible or loud; a reminder makes it neither.

## Chapter 19: Technical debt {#sec-answers-technical-debt}

**Exercise 1: Audit a project for debt**

Experiential. Sorting each item into deliberate (you knew you were cutting the corner) and inadvertent (you've only just noticed) is the instructive part, and the inadvertent pile is usually the larger and more alarming one. The item that surprises people most is almost always a piece of "temporary" code — a hard-coded value, a quick script — that turned out to be load-bearing and has been quietly holding production together for months.

**Exercise 2: The boy-scout rule**

Experiential. Paying down one item while you're already in the file — adding a test, extracting a function, naming a constant — is typically quick relative to the change you came to make, and that's exactly the point: opportunistic repayment is cheap because you've already paid the cost of understanding the code. Debt repaid this way never has to be scheduled.

**Exercise 3: A debt log**

Experiential. A debt item is worth writing down, rather than fixing on the spot, when the fix is larger than the time you have, when the code might be discarded anyway, or when stopping to repay it now would derail the task in hand — but you still want it *visible* so it isn't silently forgotten. Trivial fixes don't go in the log; they go in the boy-scout pass. The log's whole purpose is to make deferred debt a deliberate, tracked decision rather than a thing you rediscover at 3am.

**Exercise 4: Ordering the repayments**

Experiential, but the proxy you land on is the instructive part. Since interest is only charged when you touch the code, the best available proxy is *expected rate of change* — how often you or anyone else is likely to modify that code in the next few months — weighted by what a mistake there would cost. A tangle in a module nobody has opened in a year is charging you nothing, however ugly it is; a hard-coded threshold in the transform every new feature passes through is charging you on every change. Blast radius and silence matter too: debt that fails loudly is cheaper than debt that returns `inf` into a report, which is the shortcut from earlier in this chapter.

The item to delete rather than repay is usually a dead experiment, an abandoned branch of a pipeline, or a helper with exactly one caller that no longer needs it. This is where the financial metaphor genuinely misleads: a loan must be settled, so the metaphor frames every debt as something owed and eventually payable, and refactoring as the only currency. Code has a third option the metaphor has no word for — you can make the obligation cease to exist by removing the code. Deletion is not repayment; it is discovering the debt was never worth carrying. Ask of each item whether anything would break if it vanished, and be suspicious of how often the honest answer is "nothing".
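
One way to picture the ordering this answer describes is as a small scoring pass over the debt log. The sketch below is only an illustration under stated assumptions: the `DebtItem` fields, the one-to-five cost rating, the multiplication used as the weighting, and the example items are all hypothetical, and real estimates will be rougher than integers suggest.

```python
from dataclasses import dataclass


@dataclass
class DebtItem:
    name: str
    expected_changes: int  # hypothetical estimate: edits expected in the next few months
    mistake_cost: int      # hypothetical rating from 1 (minor) to 5 (silent and expensive)
    still_used: bool       # would anything break if this code vanished?


def plan(items):
    """Split items into delete candidates and a repayment order.

    Repayment is ordered by expected change rate weighted by the cost of a
    mistake; code nothing depends on is listed for deletion instead.
    """
    to_delete = [item.name for item in items if not item.still_used]
    used = [item for item in items if item.still_used]
    used.sort(key=lambda item: item.expected_changes * item.mistake_cost, reverse=True)
    return to_delete, [item.name for item in used]


debt_log = [
    DebtItem("hard-coded threshold in shared transform", 12, 4, True),
    DebtItem("tangled legacy module", 0, 3, True),
    DebtItem("dead experiment branch", 0, 1, False),
    DebtItem("untested report export", 3, 5, True),
]

to_delete, to_repay = plan(debt_log)
print("delete:", to_delete)
print("repay in order:", to_repay)
```

The tangled module scores zero and falls to the bottom however ugly it is, and the dead experiment never enters the repayment queue at all.
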

**Exercise 5: When debt is the right call**

A shortcut is the correct decision for code with a short or uncertain life — a prototype that may be discarded, a hypothesis you're testing, a genuine deadline where shipping now matters more than polish. It's reckless when taken in code you already know will be load-bearing, when taken without recording it, or when the resulting failure would be silent and high-consequence. The distinguishing property is the code's expected lifetime and criticality, combined with whether the debt is acknowledged: debt on disposable, low-stakes code is a tool; unrecorded debt on code others will depend on is a liability waiting to come due.

## Chapter 20: Cross-discipline collaboration {#sec-answers-cross-discipline}

**Exercise 1: Map the vocabulary gaps**

Experiential. The classic three: *test* (a data scientist means evaluation metrics; an engineer means pass/fail assertions on code), *validation* (DS: holding out data to measure generalisation; SE: checking inputs against a schema), and *model* (DS: a learned predictive function; SE: an abstraction of a domain, like a class diagram). The gap that has usually caused a real misunderstanding is "is it tested/validated?" — where both parties said yes, meaning entirely different things, and discovered the mismatch only later.

**Exercise 2: Write a handoff document**

Experiential. The revealing part is what you find yourself making explicit for the first time: the failure modes, the edge cases of the input contract, the *caveats* on performance (where the model is weak, the populations it wasn't validated on), how it should be monitored, and who owns it when it misbehaves. All of that typically lived only in your head, which is exactly why the handoff is where things go wrong.

**Exercise 3: An interface as a contract**

Experiential.
Agreeing the schema, latency budget, and bad-input behaviour in advance is cheaper because the alternative — discovering in production that you returned a label where the service expected a probability, or the wrong units, or an unhandled null — is an incident with real cost and a page at a bad hour. The contract converts an integration surprise into a build-time check: a conversation now versus an outage later.

**Exercise 4: The failure the schema can't catch**

Almost any failure of *meaning* rather than *form* gets through. The model returns `0.03` for every customer because an upstream feature silently went null and the classifier fell back to its base rate: a valid float in `[0, 1]`, a valid version string, a green CI run, and a retention campaign that quietly stops firing. Or the team retrains on a new population and the probabilities are still well-formed but no longer calibrated, so the service's "high risk" threshold now means something different from what it meant when it was chosen. The schema validated the bytes; nobody had written down what the number *means*, what "good enough" looks like for this use, or who is expected to notice when it stops holding.

What was missing is the human half of the contract: the data scientists stating the calibration assumptions, the known failure modes, and the monitoring that would surface them (@sec-monitoring); the engineers stating who is paged, and what they are allowed to do about it at 3am. Both halves are writable — they just aren't writable as a schema.

**Exercise 5: Two rigours pulling opposite ways**

A clear case: a model that needs frequent retraining and experimentation (the data scientist wants fast, loose iteration) while it serves live production traffic (the engineer wants stability, tests, and controlled releases). The instincts genuinely conflict.
A team holding both resolves it not by one side winning but by engineering the *boundary* tightly so that exploration can stay loose safely — an automated retraining pipeline with validation gates and canary releases, behind a stable contract and CI the engineer trusts, so the data scientist iterates freely without putting production at risk. The general principle is that the engineering instinct and the data science instinct are reconciled by tightly engineering the interface so that what happens behind it can remain appropriately loose.

## Chapter 21: Notebook to production API {#sec-answers-notebook-to-api}

**Exercise 1: Carry a model one stage further**

Experiential. For most readers the next stage is extracting the feature and training logic into an importable module of pure functions. What you have to change to make it importable is everything that tied the code to the notebook: replace reliance on notebook globals with explicit function arguments (@sec-functions-modules), separate the logic from the cell that happened to run it, and give each function a clear input and return. The pure-function discipline from @sec-functions-modules is precisely what makes the code importable — a function that depends on whatever is in the kernel can't be lifted out of it.

**Exercise 2: The train–serve safeguard**

Experiential. The test feeds a single raw record through both the training feature path and the serving feature path and asserts the resulting features are identical. It is worth more than a test of the model's accuracy because train–serve skew is a *silent, high-impact* bug: if serving computes a feature even slightly differently from training, the model receives inputs unlike anything it learned on and degrades quietly, with nothing failing. That has a definite right answer a test can pin, whereas accuracy is a moving statistical quantity that doesn't belong in a pass/fail gate (@sec-testing).
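
A parity test of the kind described here might look like the sketch below. The two feature functions, the field names, and the record are hypothetical stand-ins for whatever your training and serving paths actually compute; only the shape of the assertion is the point.

```python
import math


def training_features(rows):
    """Batch path used when building the training set (hypothetical)."""
    return [
        {
            "log_spend": math.log1p(row["spend"]),
            "is_new": int(row["tenure_days"] < 30),
        }
        for row in rows
    ]


def serving_features(request):
    """Single-record path used by the API at request time (hypothetical)."""
    return {
        "log_spend": math.log1p(request["spend"]),
        "is_new": int(request["tenure_days"] < 30),
    }


def test_train_serve_parity():
    """One raw record through both paths must yield identical features."""
    raw = {"spend": 120.0, "tenure_days": 12}
    assert training_features([raw])[0] == serving_features(raw)
```

If someone later changes the serving path to `math.log(request["spend"])`, or moves the tenure cut-off to 31 days on one side only, this test fails immediately.
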
The skew test catches a real deployment defect; an accuracy assertion catches noise.

**Exercise 3: Wrap the model in an API**

Experiential. What the endpoint forces you to make explicit, and the notebook let you leave implicit, is the contract: the exact request schema (field names, types, and valid ranges), what the response carries (a probability rather than a label, plus a `model_version` for traceability), and what counts as a valid input. In the notebook you built the feature array yourself and trusted it; the endpoint has to state, in advance, what a caller must send and will receive — and the malformed request returning a clean 422 is that contract doing its job.

**Exercise 4: Extending the publication mapping**

The abstract maps cleanly. It is the README and the API documentation (@sec-documentation): the short statement of what this thing is, what it takes, what it returns, and who should use it — read far more often than anything else, and the only part most consumers will ever read.

The retraction maps onto rollback (@sec-deployment), but the fit is worth pushing on, because rollback is the *better* mechanism. A retraction is slow, public, and cannot recall the copies already circulating or the work built on top of them; a rollback is a version swap that takes effect on the next request, and the versioned response field from this chapter means you can identify exactly which predictions came from the withdrawn model.

The place where the analogy runs out is the other direction: a paper has citation, a record of who relied on the result. A service has nothing so honest. Its consumers are whoever happened to call the endpoint, and unless you deliberately build for it — logged clients, versioned routes, deprecation notices — you cannot tell who depends on the behaviour you are about to change. Publication makes dependence visible; deployment hides it.
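
As a rough illustration of what the versioned response and a logged client buy you after a rollback, the sketch below filters a prediction log by model version. The log structure, the `client_id` field, and the version strings are assumptions for the example; a `client_id` in particular exists only if you deliberately recorded it.

```python
from collections import Counter

# Hypothetical prediction log: one entry per served request.
prediction_log = [
    {"request_id": "r1", "client_id": "billing", "model_version": "1.3.0"},
    {"request_id": "r2", "client_id": "crm", "model_version": "1.4.0"},
    {"request_id": "r3", "client_id": "crm", "model_version": "1.4.0"},
    {"request_id": "r4", "client_id": "billing", "model_version": "1.4.0"},
    {"request_id": "r5", "client_id": "crm", "model_version": "1.3.0"},
]


def affected_by(log, withdrawn_version):
    """Return the request IDs served by the withdrawn model and a count of
    how many of them each client received."""
    hits = [entry for entry in log if entry["model_version"] == withdrawn_version]
    request_ids = [entry["request_id"] for entry in hits]
    clients = Counter(entry["client_id"] for entry in hits)
    return request_ids, clients


request_ids, clients = affected_by(prediction_log, "1.4.0")
print("predictions to revisit:", request_ids)
print("clients to notify:", dict(clients))
```

Without the version field the first list cannot be built, and without the logged client the second cannot, which is the asymmetry with citation that this answer points at.
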

**Exercise 5: How far to walk the route**

A throwaway analysis should stop at the notebook (version-controlled at most); an internal tool typically warrants a package, a few tests, and externalised config, but rarely a full container-and-CD pipeline; a model real users depend on needs the whole route — package, tests, API, container, CI, deployment, and monitoring. The signal to go further is always the same kind of thing: someone else needs to run it, it runs repeatedly or unattended, real decisions now depend on its output, or it must reproduce exactly. "It has outlived its expected life, or someone now depends on it" is the trigger to take it one stage further down the path.

## Chapter 22: Reproducible research pipeline {#sec-answers-reproducible-pipeline}

**Exercise 1: One command, from raw data to result**

Experiential. The valuable discovery is the hidden dependency the `Makefile` flushes out — the step that only worked because of something on *your* machine: a file in your home directory, a package you installed once and forgot, an environment variable set months ago, or a manual "and then I clicked export" step. Declaring every stage's inputs and outputs forces those implicit dependencies into the open, which is exactly why a one-command rebuild is a stronger guarantee than "it ran when I did it".

**Exercise 2: Version a dataset**

Experiential. Before versioning, reconstructing the exact data behind an old result was usually impossible because the raw file had been overwritten or changed in place, with no record of which version produced the figure. DVC (or, at a minimum, a dated immutable copy plus a checksum committed alongside the code) makes the input recoverable, so checking out the commit behind a result also restores the dataset that produced it — the missing fourth input from the chapter.

**Exercise 3: Generate the number, don't paste it**

Experiential.
A hand-pasted number becomes wrong because nothing updates it when the data or code changes — it is a snapshot frozen at the moment you copied it, with no link back to its source, so the day the analysis changes the figure in the slide silently disagrees with the figure in the code. A generated number is recomputed from the data every time the report is rendered, so it cannot drift away from the result it claims to report; the worst that can happen is the build fails, which is loud rather than silent.

**Exercise 4: Giving a live table a seed-like handle**

Both options are defensible, and the trade is storage cost against trust in the warehouse.

The **snapshot** genuinely pins the data — the bytes you analysed are the bytes you keep — but you pay for the storage, you pay again every time the analysis reruns on a new window, and a wide table snapshotted weekly gets expensive fast. It also leaves one failure open: the snapshot is only as good as the moment you took it, so if the extraction itself was wrong, you have faithfully preserved the wrong data.

The **query plus as-of timestamp** is nearly free and stays readable, but it only works if the warehouse can actually answer a historical question — it assumes the underlying table is append-only or has genuine time-travel. Point it at a table that gets restated, backfilled, or hard-deleted for retention, and the "same" query returns different rows a year later while looking entirely reproducible.

The honest test for either choice is the same one from the chapter: rebuild an old result from scratch and check it still comes out. A handle you have never exercised is a handle you do not know you have.

**Exercise 5: What a notebook doesn't pin**

A single notebook, even under version control, pins the *code* but not the other three inputs.
It does not pin the **environment** — the packages it imports are whatever happens to be installed in the kernel, so a colleague with a different pandas version can get a different result from identical code. It does not pin the **data** — it reads whatever the file path points to, and that file can be overwritten or updated without the notebook changing at all. (And it pins **randomness** only if you remembered to set seeds.) "It's all in one notebook" addresses code organisation, not reproducibility: the notebook is necessary but nowhere near sufficient, because the result depends on three things living entirely outside it.

## Chapter 23: MLOps pipeline {#sec-answers-mlops}

**Exercise 1: Sketch the loop**

Experiential. Naming each stage for your own model — the training pipeline, the registry, the deployment, the monitoring signal, the retraining trigger — usually reveals that the missing or manual stage is the *return arrow*: most teams have a way to train and a way to deploy, but monitoring is thin and retraining is ad hoc, done when someone happens to notice a problem. Automating it means adding drift monitoring that emits a signal, a triggered training pipeline, and a promotion gate — closing the loop so the cycle runs on a signal rather than on someone's memory.

**Exercise 2: The retraining trigger**

Experiential. The false-alarm rate matters because a trigger that fires too often is the flaky test of MLOps (@sec-testing and @sec-monitoring): each false alarm causes a needless retrain, which costs compute and — worse — risks promoting a model trained on a blip. A trigger becomes more trouble than it's worth once its false positives are frequent enough that the team disables it or ignores its output, at which point it protects nothing. The defences are the same as for alerts: set the threshold from observed normal variation rather than a round number, and require sustained drift rather than a single noisy batch.
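
One hedged way to put a number on a trigger's false-alarm rate is to replay it over a period in which you believe nothing really changed, so that every firing counts as a false alarm. In the sketch below the drift statistics, the 0.10 threshold, and the function name are invented for illustration.

```python
def count_firings(stats, threshold, min_consecutive):
    """Count how often a trigger would fire over a series of drift statistics.

    The trigger fires once per run of at least `min_consecutive` batches in a
    row above `threshold`.
    """
    firings = 0
    run = 0
    for value in stats:
        run = run + 1 if value > threshold else 0
        if run == min_consecutive:
            firings += 1
    return firings


# Hypothetical drift statistic per daily batch from a period with no known
# real drift, so every firing here is a false alarm.
stable_period = [0.03, 0.05, 0.12, 0.04, 0.02, 0.11, 0.03, 0.04, 0.13, 0.14, 0.03, 0.02]
threshold = 0.10

single_batch = count_firings(stable_period, threshold, min_consecutive=1)
sustained = count_firings(stable_period, threshold, min_consecutive=2)

print(f"single-batch rule: {single_batch} needless retrains in {len(stable_period)} days")
print(f"two-in-a-row rule: {sustained} needless retrains in {len(stable_period)} days")
```

On this made-up history the single-batch rule would have retrained three times on noise and the sustained rule once, which is the comparison to make on your own data before trusting either.
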

**Exercise 3: The promotion gate**

Experiential. The comparison must use the *same* evaluation data because scoring two models on different datasets confounds "the candidate is better" with "the candidate's test set was easier" — you could not tell skill from luck of the draw. And you require a *margin* rather than strict improvement because a tiny difference in a metric like AUC is within its own run-to-run and sampling variability; promoting on a hair's-breadth win means swapping the production model on noise, which adds risk and churn for no real gain. The candidate should have to beat the incumbent by more than the metric's own wobble before it earns promotion.

**Exercise 4: The judgement you won't hand to a threshold**

The judgements people are least willing to automate are the ones about *why* something changed rather than *whether* it changed. A metric holds up overall while quietly collapsing on one segment. Drift appears in a feature and you recognise it as a known upstream release rather than a real shift in customers. A candidate wins on AUC while getting worse on the errors that actually cost money. No threshold sees any of that, because each requires knowing something about the world that the numbers do not carry.

All three responses are legitimate, and the right one depends on the cost of being wrong and how often the loop turns. **A cruder proxy** — segment-wise metrics with their own gates, a cost-weighted score instead of AUC — works when you can name the thing you're worried about in advance; it will still miss the case you didn't anticipate. **A human in one step** is the usual answer for a genuinely consequential model: automate the trigger, the retrain, and the evaluation, and let a person approve the promotion, which is a few minutes of attention rather than a day of work.
**Leaving the loop open** is the honest choice for a model that retrains twice a year or where a bad promotion is expensive and hard to detect — automation you don't need is a system you now have to maintain.

What should move you between them is evidence, not ambition. If the human approval becomes a rubber stamp — if nobody has rejected a candidate in six months — the judgement has effectively been encoded already and you should write it down. If retraining requests start arriving faster than you can serve them, close the loop. And if you find yourself unable to state what the human is checking for, that is the signal that you don't yet understand the decision well enough to automate *or* delegate it. Knowing which of these a given model deserves is the judgement the whole book has been building towards.

**Exercise 5: The weakest practice**

Take rollback (@sec-deployment). Without it, the loop's promotion gate is a one-way door: the moment a candidate is promoted — perhaps trained on a corrupted batch, or scoring well on a test set that didn't catch a regression — it serves production traffic with no fast way back, and an automated loop that can promote but not un-promote has merely automated the act of shipping a bad model. The same argument lands on any link: without reproducibility (@sec-repro-pipeline) you can't retrain to a comparable result, so the candidate can't be trusted or traced; without monitoring (@sec-monitoring) nothing triggers the loop and the model decays in silence; without testing (@sec-testing) a broken transform propagates into every retrain. The cycle only runs safely if every one of these holds, which is why automating it is the last thing you do, not the first.