This exercise is diagnostic — there’s no single “right” answer. Common failures include:
NameError for variables defined in cells that the kernel ran out of order. For example, a variable created in cell 15 but used in cell 8, which only worked because you happened to run cell 15 first during interactive exploration.
FileNotFoundError for data files with hard-coded paths that only exist on your machine or in a specific working directory.
Cells that depend on outputs from cells you’ve since deleted or commented out.
The point isn’t to fix every failure immediately — it’s to see how much of your notebook’s correctness depends on invisible state rather than explicit structure.
Exercise 2: Extract a function
Here’s an example using the chapter’s customer filtering logic:
import pandas as pd
import numpy as np


def filter_high_value(customers: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Select customers whose spend exceeds the given threshold."""
    return customers[customers["spend"] > threshold].copy()


# Verify with a small test input
test_data = pd.DataFrame({"spend": [50, 150, 250]})
result = filter_high_value(test_data, threshold=100)
assert len(result) == 2, f"Expected 2 rows, got {len(result)}"
assert list(result["spend"]) == [150, 250], "Should contain only rows above threshold"
print("All assertions passed")
All assertions passed
The key properties gained: the function has a name that describes its purpose, its inputs are explicit (no reliance on global variables), and the assert statements verify the logic independently of the notebook’s broader state.
Exercise 3: Score a notebook
This is a self-assessment: your honest scores matter more than the numbers themselves. In practice, most data science notebooks score highest on modularity (often 2–3, since most have at least some cell-level separation) and lowest on testability (often 1, since few notebooks include any automated verification). Reproducibility varies widely: a notebook that loads from a fixed CSV with pinned dependencies might score 4, while one that relies on a live database connection and an unpinned pip install might score 1.
The value of this exercise is identifying your weakest property and asking whether strengthening it would have saved you time in a recent project. If the answer is yes, that’s where to invest first.
Exercise 4: Holdout set / test suite analogy
Two ways the analogy holds:
Both are verification mechanisms applied after the creative work. You build the model, then validate. You write the code, then test. Neither replaces the work; both catch problems the author missed.
Both require separation — a holdout set must be kept separate from training data, and tests must check behaviour from outside the code, not just re-run it. Contamination in either case undermines the verification.
Two ways the analogy breaks down:
Model validation is probabilistic; software testing is deterministic. A holdout accuracy of 82% might be perfectly acceptable — you’re measuring how well the model generalises. A test that passes 82% of the time is broken. Tests are pass/fail: the code either does what you specified or it doesn’t.
Holdout sets evaluate performance on data drawn from the same distribution. Tests evaluate correctness against cases the developer explicitly constructed, including edge cases and error conditions that may never appear in production data. This means tests can catch failures that no amount of validation data would reveal, but they can also miss failures that real-world data would expose. Each has a blind spot the other doesn’t share.
Exercise 5: Run a colleague’s notebook
This exercise is experiential — the answer is your documentation of the attempt. Common discoveries include:
File paths that assume a specific directory structure or operating system
Environment dependencies not captured anywhere (specific package versions, system libraries, environment variables)
Cells that must be run in a non-obvious order, or cells that must be skipped
Configuration values with no explanation of how they were chosen
Each discovery maps to one of the chapter’s four system properties. Hard-coded paths and undocumented dependencies are reproducibility failures. Monolithic cells that do many things are modularity failures. The absence of any automated checks is a testability failure. And magic numbers without context are readability failures.
D.2 Chapter 2: Version control
Exercise 1: Put a project under version control
Experiential — there’s no single right answer, but a sound initial commit contains only what you authored: code (.py modules, and notebooks with outputs stripped), configuration (requirements.txt, pyproject.toml, and the .gitignore itself), and documentation (a README). What you deliberately exclude, and where each belongs instead:
Data → a data-versioning tool (DVC) or object storage. It’s too large and too volatile for Git’s keep-everything-forever history, and may be sensitive.
Trained models and artefacts (.pkl, .joblib) → a model registry or artefact store. They’re large, binary, and regenerated rather than written by hand.
Secrets (.env, API keys) → a secrets manager, or a local .env that is never committed. A credential committed even once persists in the history after you delete it.
Caches and environments (__pycache__, .ipynb_checkpoints, .venv) → not versioned at all; they’re regenerated locally.
The discipline is the one from the chapter: commit what you author, store what you generate or receive somewhere better suited to it.
Exercise 2: Clean notebook diffs
Two routes achieve this. nbstripout installs a Git filter that strips outputs and execution counts on commit, so the tracked version holds only code and markdown:
pip install nbstripout
nbstripout --install  # registers the filter for this repository
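Or pair the notebook with a script representation using Jupytext (for example, jupytext --set-formats ipynb,py:percent notebook.ipynb, where notebook.ipynb stands in for your own file), and treat the script as the reviewed artefact.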
After either, make a one-line change and inspect git diff (or nbdiff if you use nbdime). The diff should now show only your change rather than a wall of metadata. The verification is the exercise: you’ve turned an unreadable JSON diff into a reviewable one.
Exercise 3: Commit messages for past decisions
Self-directed. The instructive part is comparing your messages against what a filename or inline comment could carry. A message such as “Drop signup_channel: 60% missing after the May tracking change, and imputing it was injecting signal” records both the reason and the evidence — information a filename like model_v3 cannot hold and a comment like # dropped signup_channel omits. Because the message is attached to the exact change, attributed, dated, and surfaced by git blame, the “why” survives long after the people who remember the review meeting have moved on. That permanence is what neither the filename nor the comment provides.
Exercise 4: Branch for an experiment
Experiential. A typical flow:
git switch -c experiment/log-spend-features
# ...edit, commit on the branch...
git switch main  # main is untouched and still runs
git merge experiment/log-spend-features  # if the experiment worked
# or: git branch -D experiment/log-spend-features  # if it didn't
Keeping main untouched means that at every moment you have a known-good version to fall back to, demo, or hand over — and discarding a failed experiment leaves no residue, no model_v2_BAD.ipynb lingering in the folder. Compared with copying files, the branch makes the experiment both comparable (you can diff it against main) and disposable (deleting the branch erases the dead end cleanly).
Exercise 5: Git versus an experiment tracker
Git versions the code and its history of changes: it answers “what is the code, and how did it come to be this way?”, with branching, merging, and line-by-line history. It records nothing about what a given run produced. An experiment tracker such as MLflow records runs: the metrics, parameters, and artefacts a specific execution generated, so you can compare AUC across fifty hyperparameter settings — something Git has no concept of. Conversely, a tracker offers no line-by-line source history or merge.
The division of labour reflects two different questions. Versioning your code is about the process that produces results; tracking results is about the outputs of running that process on particular data. Reproducibility needs both — the exact code version and the run record — which is why mature projects link each tracked run back to the Git commit that produced it.
D.3 Chapter 3: Environments and dependencies
Exercise 1: Produce a lockfile
Experiential. The flow separates intent from the resolved result:
# requirements.in holds your direct dependencies (the abstract spec)
pip install pip-tools
pip-compile requirements.in  # -> fully pinned requirements.txt (the lock)
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt  # rebuild from the lock
The point of the exercise is what the lock contains that your requirements.in did not: every transitive dependency, pinned to an exact version. Rebuilding in a fresh environment and finding the project still runs confirms you’ve captured the environment, not just your top-level wishes.
Exercise 2: Audit unpinned dependencies
Compare what’s declared (often nothing, or a few >= constraints) against pip freeze. The instructive part is naming a library where a major-version bump could change results and saying how: scikit-learn has changed estimator defaults between releases (a different solver or tie-handling shifts predictions); pandas changed copy-on-write behaviour and default dtypes; NumPy’s random Generator stream can differ across versions. In each case the code is untouched but the numbers move — which is precisely the failure mode pinning prevents.
Exercise 3: What pinning controls
Pinning your Python-package versions controls library behaviour — a changed default, a re-implemented algorithm, a deprecated parameter. It does not control the Python interpreter version, the operating system, the underlying maths libraries (BLAS/LAPACK), or hardware/GPU floating-point behaviour. To control that second category you reach for a container (Docker), which pins the interpreter, system libraries, and OS alongside the packages; for bit-exact numerics you would additionally fix BLAS threading and any framework determinism flags. Pinning versions is necessary but not sufficient for full reproducibility.
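Short of a container, one lightweight complement is to record the facts that package pins do not cover alongside each run. The sketch below uses only the standard library; the function name describe_runtime and the choice of fields are illustrative assumptions, not something the chapter prescribes.

```python
import platform


def describe_runtime() -> dict[str, str]:
    """Collect facts about the runtime that a package lockfile does not pin."""
    return {
        "python_version": platform.python_version(),  # interpreter version
        "implementation": platform.python_implementation(),  # e.g. CPython
        "os": platform.system(),  # e.g. Linux, Darwin, Windows
        "machine": platform.machine(),  # CPU architecture, e.g. x86_64
    }


if __name__ == "__main__":
    for key, value in describe_runtime().items():
        print(f"{key}: {value}")
```

Saving this record next to a run's outputs does not control the interpreter or OS, but it does make a mismatch visible when results differ between machines.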
Exercise 4: Reproduce a colleague’s environment
Experiential. The valuable output is the list of things the lockfile alone didn’t capture, each with its proper home:
The Python version → a .python-version file or pyproject.toml.
A system library a wheel links against (e.g. a C library, a CUDA runtime) → a container image or a documented set of OS packages.
An environment variable the code reads at runtime → a committed .env.example documenting the names (never the values; see Configuration and secrets).
Each gap is a reproducibility failure waiting to happen, and each has a better home than a colleague’s memory.
Exercise 5: When abstract beats locked
Abstract >= requirements are the right choice when your code is a library meant to be installed alongside other packages: over-constraining versions would make it hard for consumers to satisfy everyone’s dependencies at once, so you specify the minimum you need and stay flexible. An exact lock is essential when your code is an endpoint — a deployed service, a scheduled pipeline, a reproducible analysis — where the same versions must reappear every time and nothing downstream depends on your version range. The difference is structural: a library is a dependency of other things (flexibility aids interoperability); an application is the final consumer (exact reproduction is the whole point).
D.4 Chapter 4: The command line
Exercise 1: Capture a workflow
Experiential. A reasonable result is a Makefile with one target per step, run end to end with a single make. The discovery worth noting is the step that “depended on you remembering to do something first” — creating an output directory, setting an environment variable, downloading data before the feature step. Those implicit prerequisites are exactly what a task runner makes explicit: a target that creates the directory, or a dependency declaration (features: data) that enforces the order so no one has to remember it.
Exercise 2: Answer a question with the shell
For example, counting the distinct values in the third column of a CSV (skipping the header):
tail -n +2 data.csv | cut -d, -f3 | sort | uniq | wc -l
or counting rows matching a condition with grep -c. This feels natural for quick, line-oriented filtering and counting on flat text, and it’s faster than starting a Python session. You wish for a DataFrame the moment fields contain commas or quotes (naive cut mis-parses real CSV), when you need typed aggregation, or when a join is involved — that’s the boundary the Data Science Bridge describes.
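The quoting problem is easy to reproduce. In this sketch (the sample rows are made up for illustration), splitting on every comma mis-reads a quoted field, while Python's csv module parses it according to the CSV quoting rules.

```python
import csv
import io

# Hypothetical sample: the second data row has a comma inside a quoted field.
raw = 'id,name,city\n1,Ada,London\n2,"Smith, John",Leeds\n'

# Naive approach, equivalent to cut -d, -f3: split each line on every comma.
naive_third = [line.split(",")[2] for line in raw.splitlines()[1:]]

# csv.reader honours the quoting, so the embedded comma stays inside its field.
rows = list(csv.reader(io.StringIO(raw)))
parsed_third = [row[2] for row in rows[1:]]

print(naive_third)   # ['London', ' John"']
print(parsed_third)  # ['London', 'Leeds']
```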
Exercise 3: Exit codes and chaining
A validation script signals failure by exiting non-zero:
# validate.py
import sys

import pandas as pd

df = pd.read_csv("data/processed.csv")
if df.empty or "target" not in df.columns:
    print("validation failed: empty data or missing target", file=sys.stderr)
    sys.exit(1)  # non-zero: tells the shell something went wrong
print("validation passed")

python validate.py && python train.py  # train runs only if validate exits 0
This matters because automation and CI decide pass/fail from exit codes, not from reading output. A non-zero exit halts the && chain, so bad data never reaches training, and a CI server marks the build red instead of silently continuing. The exit code is the machine-readable verdict the whole pipeline depends on.
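The same short-circuit behaviour can be sketched in Python, which is roughly what a task runner or CI step does with exit codes. The helper name run_chain and the inline stand-in scripts are assumptions for illustration; they take the place of validate.py and train.py.

```python
import subprocess
import sys


def run_chain(commands: list[list[str]]) -> int:
    """Run commands in order, stopping at the first non-zero exit code.

    Returns 0 if every command succeeded, otherwise the failing exit code.
    """
    for command in commands:
        completed = subprocess.run(command)
        if completed.returncode != 0:
            return completed.returncode  # halt: later commands never run
    return 0


# Stand-ins for validate.py and train.py: tiny inline scripts.
failing_validate = [sys.executable, "-c", "import sys; sys.exit(1)"]
passing_validate = [sys.executable, "-c", "import sys; sys.exit(0)"]
train = [sys.executable, "-c", "print('training')"]

print(run_chain([failing_validate, train]))  # 1, and 'training' is never printed
print(run_chain([passing_validate, train]))  # 'training' is printed, then 0
```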
Exercise 4: Surviving a dropped connection
A command started in a plain SSH session is a child of that session. When the connection drops, the session ends and the job is sent a hang-up signal (SIGHUP), so it dies — unless you took extra steps (nohup, disown). tmux (or screen) changes this by running your shell inside a session that lives on the server, decoupled from your client connection: you detach (or simply lose the connection), the session and everything in it keep running, and you reattach later to find the job still going or finished. For any job measured in hours, that decoupling is the difference between a result and a wasted afternoon.
Exercise 5: Pipes versus pandas
The analogy holds in composition: both build a larger operation from small, single-purpose steps arranged left to right, each transforming what the previous one produced. It breaks down in the data model — shell pipes pass untyped text, line by line, between separate programs, whereas pandas passes a typed DataFrame within a single process. The rule of thumb that falls out: use a shell pipeline for quick, line-oriented work on text and files (filtering, counting, gluing tools together), and reach for a Python script the moment the data has real structure — typed columns, joins, or anything where correctly parsing the text is itself the hard part.
D.5 Chapter 5: Readable code
Exercise 1: Refactor for readability
Experiential. The change that surprises people most is that renaming alone surfaces confusion: if you can’t think of a good name for a variable, that’s often a sign you don’t fully understand what it holds, or that it holds two different things at different points. Replacing magic numbers with named constants does the same — naming 0.73 forces you to articulate what it is. Adding a type signature and a one-line docstring then makes the contract explicit without touching the logic. If the refactor was purely cosmetic and nothing became clearer, the original was already readable; usually something does.
Exercise 2: Formatter and linter
Experiential. The instructive step is sorting the linter’s output into real defects and pure style. Real defects include unused imports, variables assigned but never used, names that shadow a builtin (list, dict), bare except: clauses that swallow errors, and mutable default arguments. Pure style includes line length, quote style, and spacing — exactly the things the formatter fixes automatically. The lesson is the division of labour: automate the style pile entirely so that human attention in review goes to the defect pile and to the logic, which no tool can check.
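Of the real defects listed, the mutable default argument is the one that most often surprises people, so here is a minimal sketch of it. The function names are hypothetical; the behaviour shown is standard Python.

```python
from typing import Optional


def add_tag_buggy(tag: str, tags: list = []) -> list:
    """Defect: the default list is created once and shared across calls."""
    tags.append(tag)
    return tags


def add_tag_fixed(tag: str, tags: Optional[list] = None) -> list:
    """Fix: use None as the sentinel and build a fresh list per call."""
    if tags is None:
        tags = []
    tags.append(tag)
    return tags


first = add_tag_buggy("a")
second = add_tag_buggy("b")  # same list object as `first`
print(second)  # ['a', 'b']

print(add_tag_fixed("a"), add_tag_fixed("b"))  # ['a'] ['b']
```

No formatter would change this code, which is why it belongs in the defect pile rather than the style pile.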
Exercise 3: Split a function
Experiential. The test of a good decomposition is to read just the new function names in sequence and ask whether they narrate what the original function did. If they do — load_raw, drop_invalid_rows, add_spend_per_day, aggregate_by_cohort — the names are carrying the structure, and a reader can understand the whole from them. If a step needs a comment to explain what it does, the name isn’t doing its job yet.
Exercise 4: Code versus a methods section
The analogy holds in that both name things meaningfully, present steps in a followable order, and leave out the dead ends, because both exist so a reader can follow the reasoning rather than reconstruct it. It breaks down in how they’re consumed: a methods section is read once, for understanding, and a reader forgives small ambiguities because they won’t re-execute your prose. Code is run repeatedly and is read specifically in order to change it, so an ambiguity a human would gloss over becomes the spot where someone misreads the intent and introduces a bug. Code has to be clearer than prose because misreading it has consequences a prose reader never faces.
Exercise 5: When readability isn’t worth it
Throwaway names are the right call for genuinely scratch code that will be deleted within the session — a quick check of a distribution, a one-off plot to settle a question, a snippet you’re using to understand an API. The specific signal that it has crossed the line is promotion: you copy it into another notebook, you find yourself relying on its output days later, or you hand it to someone else. At that moment the code has become “kept”, will be read many times, and earns the few minutes of naming and documentation. The skill is noticing the promotion and cleaning up then, rather than writing every scratch cell as if it were production.
D.6 Chapter 6: Functions, modules, packages
Exercise 1: Extract a copy-pasted function
Experiential. The payoff is visible the moment you make the follow-up change: with the logic in one imported module, you edit one place and every caller gets the fix; with copies scattered across notebooks, you had to find and edit each one — and missing one is exactly how the copies drift out of sync. “How many places did I have to change?” going from several to one is the single-source-of-truth principle made concrete.
Exercise 2: Global to pure function
Experiential. Once every input arrives as an argument, the function’s result depends only on those arguments, so it returns the same answer no matter what cells ran before it. The verification — calling it after deliberately changing some unrelated global or re-running cells out of order, and getting the same result — is the property that also makes it testable in the next chapter.
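A minimal sketch of that verification, using a made-up discount calculation (the names DISCOUNT_RATE and discounted_total are illustrative, not from the chapter):

```python
DISCOUNT_RATE = 0.1  # notebook-style global that any cell could reassign


def discounted_total_global(prices: list[float]) -> float:
    """Impure: the result depends on whatever DISCOUNT_RATE is right now."""
    return sum(prices) * (1 - DISCOUNT_RATE)


def discounted_total(prices: list[float], discount_rate: float) -> float:
    """Pure: every input is an argument, so the result depends only on them."""
    return sum(prices) * (1 - discount_rate)


prices = [100.0, 50.0]
before = discounted_total(prices, discount_rate=0.1)

DISCOUNT_RATE = 0.5  # simulate another cell changing the global

after = discounted_total(prices, discount_rate=0.1)
assert before == after  # the pure version is unaffected by the global
print(discounted_total_global(prices))  # the impure version now uses 0.5
```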
Exercise 3: Make a project installable
Experiential. The error you were previously working around is ModuleNotFoundError (or the sys.path.append("..") hack used to dodge it), which works only from the directory you happened to launch from. After a minimal pyproject.toml and pip install -e ., the package is importable by name from anywhere in the environment, and because the install is editable, changes to the source take effect immediately without reinstalling. The fresh-notebook test confirms the import no longer depends on where you started.
Exercise 4: Library-author disciplines
A discipline you do not need for code only you use: a stable public API with backwards compatibility, semantic versioning, and deprecation cycles. While you’re the only user you can rename, re-signature, and restructure freely. A discipline you should adopt the moment a colleague imports your code: a stable interface — don’t change function names or argument meanings out from under them without warning — and a documented public surface so they can use it without reading the implementation. The trigger for the switch is precisely “someone else now depends on this”.
Exercise 5: What belongs in a package
Code that should stay in the notebook is exploratory, one-off, or presentation-specific: the narrative of a particular analysis, plots tailored to one report, throwaway checks. Code that has earned a place in a module is reusable logic — data cleaning, feature engineering, model training and evaluation — that you’ll run more than once or in more than one place. The signal that logic has crossed the line is reuse: you’ve used it (or want to) a second time, you need to test it, or someone else needs it. “I’m about to copy this” is the clearest possible prompt to extract it instead.
D.7 Chapter 7: Testing stochastic code
Exercise 1: Test a deterministic transform
Experiential. The instructive part is usually the edge cases: writing test_on_empty_input or test_with_all_zeros forces you to decide what the function should do in those situations — return an empty result, raise a clear error, propagate NaNs — when the original code never made that decision explicit. A test you can’t write because you don’t know the expected answer is a sign the contract is underspecified, which is a finding in itself.
Exercise 2: Make a stochastic function testable
Experiential. Having the function accept an explicit rng argument turns hidden global randomness into an injected dependency you control. The exact test then fixes the seed and asserts a specific result; the tolerance test asserts a statistical property — say, that the mean of many draws is within some band of the expected value. The tolerance should be justified: wide enough that it won’t fail by chance (a few standard errors of the quantity you’re checking), tight enough that a real defect would breach it. Stating why you chose the band is part of the answer.
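The two kinds of test can be sketched with the standard library's random.Random as the injected generator (the chapter's own examples may use a NumPy Generator instead; the function simulate_daily_spend and its parameters are hypothetical). The exact test here compares two same-seed runs rather than hard-coding specific draws, which is a slightly weaker but more portable form of the same idea.

```python
import random
import statistics


def simulate_daily_spend(n_days: int, rng: random.Random) -> list[float]:
    """Draw n_days spend values from a normal distribution (mean 100, sd 15).

    The generator is injected, so the caller controls the randomness.
    """
    return [rng.gauss(100.0, 15.0) for _ in range(n_days)]


def test_same_seed_gives_same_draws() -> None:
    # Exact test: two generators with the same seed must agree draw for draw.
    first = simulate_daily_spend(50, random.Random(42))
    second = simulate_daily_spend(50, random.Random(42))
    assert first == second


def test_mean_within_tolerance() -> None:
    # Tolerance test: the standard error of the mean is 15 / sqrt(10_000) = 0.15,
    # so a band of 5 standard errors (0.75) should almost never fail by chance,
    # yet a mean that was off by a few units would still breach it.
    draws = simulate_daily_spend(10_000, random.Random(0))
    assert abs(statistics.fmean(draws) - 100.0) < 5 * 0.15


test_same_seed_gives_same_draws()
test_mean_within_tolerance()
```

The comment in the tolerance test is the justification the exercise asks for: it states the standard error and how many of them the band spans.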
Exercise 3: Test an invariant
Experiential. An invariant is a property that must hold for every input — preserved row count, no new missing values, output bounded in a range, mean zero after standardising. Checking it across many random inputs is property-based testing done by hand; hypothesis automates the input generation and, when a property fails, shrinks the counterexample to the smallest input that triggers it, which is often the fastest route to understanding the bug.
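Done by hand, that check might look like the sketch below. The standardise function is a hypothetical stand-in for whatever transform you are testing, and the tolerance on the mean is an assumption chosen to absorb floating-point rounding.

```python
import math
import random
import statistics


def standardise(values: list[float]) -> list[float]:
    """Centre to mean zero and scale to unit standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardise a constant column")
    return [(v - mean) / sd for v in values]


rng = random.Random(0)
for _ in range(200):
    # Random length and random values: the invariants must hold for all of them.
    n = rng.randint(2, 50)
    values = [rng.uniform(-1_000, 1_000) for _ in range(n)]
    result = standardise(values)
    assert len(result) == len(values)  # row count preserved
    assert not any(math.isnan(v) for v in result)  # no new missing values
    assert abs(statistics.fmean(result)) < 1e-9  # mean zero, up to rounding
print("invariants held on 200 random inputs")
```

Writing the loop yourself also forces the edge-case decision: here a constant column raises a clear error rather than dividing by zero.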
Exercise 4: Why model.score(...) > 0.85 is a poor unit test
It conflates evaluation with testing. The assertion is really trying to answer “is the model good enough?”, which is an evaluation question — answered on a continuum, against a baseline, and monitored over time — not a pass/fail property of the code. As a unit test it fails on three counts: it’s fragile (it breaks the first time the data shifts, with no code defect), uninformative (a failure doesn’t localise any bug), and slow. What you should test about the pipeline is the deterministic machinery around the model: that data validation rejects malformed input, that transforms produce the expected columns and leak nothing, that the pipeline runs end to end on a tiny sample, and that a saved model round-trips to identical predictions.
Exercise 5: Why a flaky test is worse than none
A test that fails one run in ten and is habitually re-run until green is dishonestly noisy: it trains the team to treat failures as background noise to be cleared by re-running, which is exactly the habit that lets a real failure slip through unnoticed. It also blocks or destabilises CI and erodes trust in the whole suite. No test is at least honestly silent; a flaky test actively degrades everyone’s response to failure. Two fixes that keep the test: make it deterministic by fixing the seed so the stochastic element is pinned; or, if it’s genuinely checking a statistical property, replace the brittle assertion with a principled tolerance (several standard errors wide) or an invariant that must always hold. Re-running until it passes, or deleting it, are the two non-answers.
D.8 Chapter 8: Debugging and profiling
Exercise 1: Read a traceback
Experiential. Read bottom-up: the final line is the exception type and message (what went wrong), and the deepest frame in your own code is where to look — third-party frames below it are usually just the machinery that surfaced your mistake. The fact you’d check first follows directly from those two (a ZeroDivisionError in a line dividing by active_days says: look for a zero). People are routinely surprised how often the traceback alone, read properly, fully explains the bug — the panic that makes us skim it is the real obstacle.
Exercise 2: Use a debugger instead of print
Experiential. The thing a debugger shows that a print does not is the entire live state at the moment of failure — every variable, not just the one you anticipated printing. That’s how you spot the cause you weren’t looking for: the column that’s unexpectedly all zeros, the frame with the wrong shape, the value that’s a string where you assumed a float. Print debugging can only show what you already suspected; the debugger shows what you didn’t.
Exercise 3: Replace print with logging
Experiential. A reasonable mapping is INFO for milestones (“loaded N rows”, “training complete”), WARNING for recoverable oddities (“clipped 12 negative values”), and DEBUG for fine detail. The payoff is the final step: flipping a single level setting switches between a quiet production run and a verbose diagnostic one without editing — or later removing — any of the statements, which is precisely what makes logging persist where scattered prints get deleted.
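As a sketch of that mapping with the standard logging module (the logger name pipeline_demo and the function run_step are hypothetical):

```python
import logging

logger = logging.getLogger("pipeline_demo")  # hypothetical logger name


def run_step(values: list[float]) -> list[float]:
    """Clip negatives to zero, logging at the levels suggested above."""
    logger.info("loaded %d rows", len(values))  # milestone
    negatives = sum(1 for v in values if v < 0)
    if negatives:
        logger.warning("clipped %d negative values", negatives)  # recoverable oddity
    clipped = [max(v, 0.0) for v in values]
    logger.debug("clipped values: %s", clipped)  # fine detail
    return clipped


if __name__ == "__main__":
    logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")
    logger.setLevel(logging.WARNING)  # quiet run: only the warning appears
    run_step([3.0, -1.0, 2.0])
    logger.setLevel(logging.DEBUG)  # verbose run: same code, all three appear
    run_step([3.0, -1.0, 2.0])
```

The only line that changes between the two runs is the setLevel call; none of the log statements is edited or removed.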
Exercise 4: Profile and fix
Experiential. The lesson lands hardest when the hot spot is not where you expected — the slow step is often an innocent-looking apply or a repeated recomputation, not the obviously heavy model fit. Fix the dominant one (vectorise the loop, cache the repeated work) and measure the change rather than assuming it helped. The discipline is to optimise what the profiler points at, and only once something is actually too slow.
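A minimal profiling session with the standard library's cProfile and pstats might look like this. The two steps are contrived stand-ins (an assumption for illustration) so that the report has an obvious hot spot to point at.

```python
import cProfile
import io
import pstats


def slow_step(n: int) -> int:
    """Deliberately heavy: sums squares in a Python-level loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_step(n: int) -> int:
    """Cheap by comparison: a closed-form result."""
    return n * (n - 1) // 2


def pipeline() -> None:
    slow_step(200_000)
    fast_step(200_000)


profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Sort by cumulative time so the dominant call path appears first.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

After a fix, re-running the same session and comparing the two reports is the "measure the change" step.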
Exercise 5: Four questions, four tools
A variable’s current value → print (or a quick inspect). Adequate, because the question is narrow and you already know what you want to see.
The full state at a failure → a debugger (pdb or an IDE). print can’t show everything at once, and it forces you to guess in advance which variables will matter.
What happened in a run you weren’t watching → logging. print has no levels, timestamps, or persistence, and you weren’t there to read it scroll past.
Where the time went → a profiler. print can’t attribute time, and hand-timed guesses are biased toward the parts you already suspect.
print is the right tool only for the first; for the other three it’s a poor stand-in because it answers “what is this value now?” and nothing else — it cannot capture full state, persist a structured record, or measure performance.
D.9 Chapter 9: Project structure
Exercise 1: Reorganise a flat project
Experiential. The instructive discovery is usually something being mutated in place — a raw CSV edited to fix a typo, a column renamed in the source file, a row dropped by hand. Once raw data is read-only, that edit has to become a transformation step whose output lands in data/interim/ or data/processed/, leaving the original untouched. Finding the in-place edit is finding the point where your work stopped being reproducible from source.
Exercise 2: Write a README
Experiential. Every question the colleague still has to ask is a gap in either the README or the structure, and the common ones are revealing: an undocumented environment variable, a data source that needs credentials or special access, a setup step that “everyone knows”, or the order in which things must run. The exercise works precisely because you can’t see your own assumptions — the colleague’s questions surface them.
Exercise 3: A single source of truth for paths
Experiential. The original would break on a colleague’s machine because an absolute path like /Users/you/project/data.csv simply doesn’t exist there. Deriving every path from one project root (resolved from __file__ or a config value) makes the project portable: it runs unchanged on a laptop, a server, or inside a container, because only the root differs. The hard-coded path is one of the most common reasons “it works on my machine” and nowhere else.
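One way to sketch the single source of truth is a small function that derives every location from one root. The directory names (including data/raw) and the example root are assumptions for illustration.

```python
from pathlib import Path


def project_paths(root: Path) -> dict[str, Path]:
    """Derive every project path from a single root directory."""
    data = root / "data"
    return {
        "raw": data / "raw",
        "interim": data / "interim",
        "processed": data / "processed",
    }


# In a real module the root would typically come from the file's own location,
# e.g. Path(__file__).resolve().parents[1]; here it is passed in explicitly.
paths = project_paths(Path("/srv/project"))  # hypothetical root
print(paths["processed"] / "customers.csv")
```

Moving the project to another machine then means changing one value, the root, and nothing else.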
Exercise 4: Layout versus schema
The analogy holds in that both give each thing a known place, so a person or a tool can navigate without a guided tour — columns tell you where a variable lives, directories tell you where a kind of file lives. It breaks down in enforcement: a DataFrame’s schema is enforced by the runtime (the wrong columns cause a failure), whereas a directory convention is enforced only by discipline and tooling — a project template, a linter, code review. The consequence is that structure must be actively maintained: it drifts the moment someone drops a stray file at the top level, where a schema would simply have refused.
Exercise 5: When structure is overkill
A genuinely one-off analysis — a quick answer for a meeting, a teaching example, a throwaway exploration — should stay a single notebook in a single folder; imposing src/, tests/, and a structured data/ on it is pure overhead. The signal that it has earned the scaffolding is longevity and dependence: it will run again (on new data, or on a schedule), someone else needs to run or maintain it, it needs tests, or helper files are starting to accumulate at the top level. “This is going to outlive the week” is the trigger.
D.10 Chapter 10: Data pipelines
Exercise 1: Break a monolith into stages
Experiential. The payoff usually shows up as a stage that proves reusable in a context you hadn’t anticipated — the cleaning function reused by a different analysis, or a feature transform reused at serving time. That unplanned reuse is exactly what the monolithic cell made impossible, because the useful part was welded to everything around it.
Exercise 2: Add a validation gate
Experiential. The bad data a gate would have caught at the boundary includes an upstream schema change (a renamed or dropped column), an unexpected null where the next stage assumes completeness, an out-of-range value (negative spend, a date in the future), a duplicated key, or a target column leaking into the features. The point is where the failure surfaces: a gate turns a cryptic error three stages downstream into a precise message at the moment the bad data entered.
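A gate covering several of those cases can be sketched in plain Python over a list of row dictionaries. The column names and the ValidationError class are hypothetical; a real project might use a schema library for the same job.

```python
from datetime import date


class ValidationError(Exception):
    """Raised when a batch fails the gate; the message names the problem."""


def validate_customers(rows: list[dict], today: date) -> None:
    """Reject a batch at the stage boundary instead of failing downstream."""
    required = {"customer_id", "spend", "signup_date"}
    seen_ids = set()
    for i, row in enumerate(rows):
        missing = required - set(row)
        if missing:
            raise ValidationError(f"row {i}: missing columns {sorted(missing)}")
        if row["spend"] is None:
            raise ValidationError(f"row {i}: spend is null")
        if row["spend"] < 0:
            raise ValidationError(f"row {i}: negative spend {row['spend']}")
        if row["signup_date"] > today:
            raise ValidationError(f"row {i}: signup_date is in the future")
        if row["customer_id"] in seen_ids:
            raise ValidationError(f"row {i}: duplicate customer_id {row['customer_id']}")
        seen_ids.add(row["customer_id"])


good = [{"customer_id": 1, "spend": 20.0, "signup_date": date(2024, 1, 5)}]
validate_customers(good, today=date(2024, 6, 1))  # passes silently

bad = good + [{"customer_id": 1, "spend": -5.0, "signup_date": date(2024, 2, 1)}]
try:
    validate_customers(bad, today=date(2024, 6, 1))
except ValidationError as err:
    print(err)  # row 1: negative spend -5.0
```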
Exercise 3: An idempotent, cached stage
Experiential. Persist the stage’s output to data/interim/ and skip the computation when the artefact already exists; the second run should report the stage skipped. The lesson to carry forward is that a real orchestrator does this for you and invalidates the cache when a stage’s inputs or code change — caching is only safe when staleness is handled, which is why “re-run if the inputs changed” is the rule, not “re-run never”.
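The skip-if-present behaviour can be sketched as follows. The function name, the artefact filename, and the use of a temporary directory are assumptions for illustration, and the sketch deliberately checks existence only, so it shows the staleness gap rather than solving it.

```python
import json
import tempfile
from pathlib import Path


def run_cached_stage(output_path: Path, values: list[int]) -> str:
    """Compute and persist the stage output, or skip if it already exists.

    Returns "computed" or "skipped" so the caller can report what happened.
    Note: this checks existence only. It does NOT detect changed inputs or
    code, which is the staleness problem a real orchestrator handles.
    """
    if output_path.exists():
        return "skipped"
    output_path.parent.mkdir(parents=True, exist_ok=True)
    result = {"total": sum(values), "count": len(values)}
    output_path.write_text(json.dumps(result))
    return "computed"


with tempfile.TemporaryDirectory() as tmp:
    artefact = Path(tmp) / "data" / "interim" / "summary.json"  # hypothetical name
    print(run_cached_stage(artefact, [1, 2, 3]))  # computed
    print(run_cached_stage(artefact, [1, 2, 3]))  # skipped
```

Calling it a second time with different values would still report "skipped", which is exactly why input-aware invalidation matters.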
Exercise 4: Workflow pipeline versus sklearn Pipeline
The analogy holds in that both compose single-responsibility steps into a whole that moves together, and both enforce order so that, for example, preprocessing can’t leak across a boundary. It breaks down in scope and mechanism: an sklearn Pipeline lives in one process and one .fit()/.predict() call, holding everything in memory, whereas a workflow pipeline spans processes, persisted artefacts, and schedules. The workflow pipeline therefore needs things the sklearn one doesn’t: explicit intermediate storage, and an orchestrator that handles dependencies, caching, retries, and scheduling.
Exercise 5: When a pipeline framework is overkill
A single script or notebook is the right tool for a one-off analysis, or a workflow of one or two steps that you run interactively and watch. The signal that it has outgrown this is when re-running everything becomes too costly or too risky: the workflow runs repeatedly or on a schedule, some stages are expensive enough that you want to re-run only what changed, several people or systems depend on intermediate outputs, or failures need to be isolated and retried rather than restarting from scratch. At that point the explicit stages and orchestration earn their keep.
D.11 Chapter 11: Configuration and secrets
Exercise 1: Lift hard-coded values into config
Experiential. The values that turn out to differ between your machine and where the code really runs are the telling ones: absolute file paths almost always, plus database and table names, output locations, resource settings (number of workers), and debug flags. These are exactly the things configuration is for — the same logic, different values per environment — and finding them is finding everything implicitly tied to your laptop.
Exercise 2: Move a secret out of the repository
Experiential. Add .env to .gitignore, commit a .env.example template with placeholder values, and load the real secret from the environment. The reason moving it is not sufficient on its own, if it was ever committed, is that version control keeps history: the secret remains in past commits even after you delete it from the current files, so anyone with a clone still has it. It must be rotated — the credential changed at its source — not merely removed.
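A minimal sketch of loading the secret from the environment and failing loudly when it is missing. The helper name `load_secret` and the variable name `DB_PASSWORD` in the usage note are hypothetical; use whatever names your .env.example documents.

```python
import os


def load_secret(name: str) -> str:
    """Read a required secret from the environment, failing loudly if absent.

    `name` is whatever variable your deployment defines (an assumption here).
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Missing required environment variable {name!r}; "
            "see .env.example for the expected names."
        )
    return value
```

Usage would look like `password = load_secret("DB_PASSWORD")`. Raising at start-up, rather than passing an empty string onward, keeps the failure next to its cause.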
Exercise 3: Typed, validated config
Experiential. The contrast is the point: a bare dictionary lets a mistyped key (raw_paht) return a silent None that surfaces as a confusing failure much later, whereas a pydantic model with a constraint rejects a bad value the instant it loads, naming the offending field. Feeding it an out-of-range value and getting an immediate, specific error is the behaviour you’re buying.
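The contrast can be sketched with the standard library alone. This uses a frozen dataclass as a stand-in for the pydantic model the exercise asks for, so it shows the behaviour rather than pydantic's API. The field names `raw_path` and `test_fraction` are assumptions made for illustration.

```python
from dataclasses import dataclass

raw = {"raw_path": "data/raw/customers.csv", "test_fraction": 0.2}

# Bare dictionary: the mistyped key returns None silently.
assert raw.get("raw_paht") is None


@dataclass(frozen=True)
class PipelineConfig:
    """Typed config (stdlib stand-in for a pydantic model; fields are hypothetical)."""

    raw_path: str
    test_fraction: float

    def __post_init__(self) -> None:
        # Reject out-of-range values at load time, naming the field.
        if not 0.0 < self.test_fraction < 1.0:
            raise ValueError(
                f"test_fraction must be between 0 and 1, got {self.test_fraction}"
            )


# A valid dictionary loads cleanly; a mistyped key would raise TypeError here.
config = PipelineConfig(**raw)
```

Passing `test_fraction=1.5` raises a `ValueError` that names the field, and passing `raw_paht=...` raises a `TypeError` at construction, which is the immediate, specific failure the exercise is after.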
Exercise 4: Config versus a hyperparameter dictionary
The analogy holds in that both pull the adjustable knobs out of the logic and into a single place you can change without editing code. It breaks down in scope: a hyperparameter dictionary is consumed once, in one process, to fit one model, whereas application configuration also selects behaviour across environments — which database to connect to, which bucket to write to, whether to run in debug mode. Config therefore carries a dimension the hyperparameter dict doesn’t: the same schema with different values in development, staging, and production.
Exercise 5: When hard-coding is acceptable
A genuine constant — a value that is part of the logic and does not vary by run or environment, such as a mathematical constant or a truly fixed business rule — is perfectly fine as a named constant in the code. A value becomes configuration when it varies between environments, changes between runs, or is something you tune. The signal that a hard-coded value has become a liability is any of: you find yourself editing code to change it, it differs between dev and prod, or you can’t tell from the code why it has the value it does. Secrets are the absolute case — always a liability when hard-coded, regardless of anything else.
D.12 Chapter 12: API design
Exercise 1: Wrap a model in an endpoint
Experiential/applied. What the endpoint forces you to decide, and a notebook predict let you ignore, is the contract: the exact request format (field names, types, units), what the response contains (a label, a probability, both), and what counts as a valid input. In the notebook you built the feature array yourself and trusted it; the endpoint has to state, explicitly and in advance, what a caller must send and will receive.
Exercise 2: Validation returns a clear 422
Experiential. Constraining the request fields means a malformed request is rejected at the door, before the model runs, with a message that says what was wrong — rather than reaching the model and producing a confidently wrong prediction from garbage, or crashing deep inside and returning an opaque 500. The lesson is that validation converts an unpredictable internal failure into a precise, early, client-facing error.
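A framework-free sketch of the same idea: validate first, and return a 422 with a specific message before the model is ever called. In a real service the web framework and its schema layer would do this for you; the `spend` field and the scoring rule below are hypothetical placeholders.

```python
from typing import Any


def validate_request(payload: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    spend = payload.get("spend")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(spend, bool) or not isinstance(spend, (int, float)):
        errors.append("spend: must be a number")
    elif spend < 0:
        errors.append("spend: must be non-negative")
    return errors


def handle_request(payload: dict[str, Any]) -> tuple[int, dict[str, Any]]:
    """Validate first; only a clean request ever reaches the model."""
    errors = validate_request(payload)
    if errors:
        return 422, {"detail": errors}
    # Placeholder for the real model call (hypothetical scoring rule).
    return 200, {"high_value": payload["spend"] > 100}
```

A request with `{"spend": -5}` comes back as status 422 with the offending field named, rather than producing a prediction from an impossible input.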
Exercise 3: Response schema and live docs
Experiential. The documentation is generated from the schema and code (as OpenAPI), so it cannot drift out of sync — change a field and the docs change with it. This matters because callers integrate against the documentation: hand-written API docs inevitably fall behind the implementation, and documentation that lies is worse than none, because it sends integrators down paths that no longer work.
Exercise 4: Endpoint versus model.predict()
The analogy holds in that both take features and return a prediction, with the request and response schemas playing the role of the input and output. It breaks down because an endpoint receives input from untrusted strangers across a network, so it must validate that input, handle errors without leaking internals, and version itself so the model can change without breaking existing callers — none of which an in-notebook predict call, with its single trusting user, ever has to do. The endpoint handles the adversarial, multi-caller, evolving reality the notebook call is insulated from.
Exercise 5: Batch versus real-time
A scheduled batch job is the right mechanism when predictions are consumed in bulk on a known cadence and latency doesn’t matter — nightly churn scores feeding a dashboard, weekly demand forecasts written to a table. A real-time API is necessary when predictions are needed on demand, one at a time, with low latency, in response to a user action or another system’s request — a fraud check at checkout, a recommendation as a page loads. The deciding property is how and when the prediction is consumed: in bulk on a schedule points to batch; on demand with a latency requirement points to an API.
D.13 Chapter 13: Continuous integration
Exercise 1: Add a CI workflow
Experiential. The payoff is the moment the status goes red on a deliberately broken test before you’d have noticed any other way — that’s the days-to-minutes gap from the chapter’s opening, closed. A workflow that installs the locked dependencies (rather than whatever the runner happens to have) is also what makes the run a faithful check rather than a coincidence of the runner’s environment.
Exercise 2: Lint and format gate
Experiential. The first run typically flags the same mix Chapter 5 described — unused imports, unformatted files, the occasional shadowed name or undefined reference — and sorting them into “real defect” and “pure style” is the exercise. The style pile is exactly what the formatter fixes automatically, so once ruff format is in the gate it stops recurring, and the gate’s signal becomes mostly about real problems.
Exercise 3: Set up pre-commit
Experiential. The hook stops the commit locally, before anything reaches CI, which is the point: the cheapest place to catch a formatting slip or a stray large file is before it’s even recorded. Pre-commit and CI are complementary — the local hook gives instant feedback on the trivial things, and CI remains the authoritative shared gate that everyone’s changes must pass.
Exercise 4: CI versus model evaluation
The analogy holds in the trigger: both say “something changed, so the previous verdict is no longer trustworthy — re-verify”. You re-run a holdout evaluation after changing features; CI re-runs the checks after changing code. It breaks down in the verdict. A model evaluation yields a graded score you weigh with judgement (is 0.82 good enough?); CI yields a binary gate that is open or shut, with no “82% of the tests passed, ship it”. The same reflex produces a number in one case and a door in the other.
Exercise 5: What belongs in CI
Run on every change the checks that are fast and deterministic: unit tests, linting, type checks, and a small end-to-end smoke test on sample data. Push to an occasional or nightly job the things that are slow, expensive, or non-deterministic: full model training, integration tests against real external services, validation over large datasets, performance benchmarks. The principle is that every-push checks must be quick and reliable enough that developers never feel the urge to route around them — a gate that is slow or flaky gets disabled or ignored, at which point it protects nothing. Speed and trustworthiness are what make the gate worth having.
D.14 Chapter 14: Containerisation
Exercise 1: Write a Dockerfile
Experiential. A sound first Dockerfile starts from a slim base, installs the locked requirements, copies the code, and sets the run command; building and running it confirms the service starts in a clean, sealed environment rather than relying on anything on your machine. If it runs in the container but not on a colleague’s bare machine, the container has done its job — it carried the environment with it.
Exercise 2: Improve the image
Experiential. Ordering the instructions so dependencies are installed before the code is copied means a code edit reuses the cached dependency layer instead of reinstalling everything — rebuilds drop from minutes to seconds. A slimmer base, a multi-stage build, or removing build tools afterwards shrinks the image substantially, often from north of a gigabyte to a few hundred megabytes. The exercise is to measure the before and after, because the gains are larger than most people expect.
Exercise 3: Keep data and secrets out
Experiential. Baking data into the image bloats every copy of it, ties the image to a single snapshot of the data, and can push sensitive records into registries. Baking a secret in is the Chapter 11 mistake reincarnated: the credential ends up inside an artefact that gets pushed to registries and shared, so it leaks and must be rotated, not merely removed. The fix is to keep the image a generic definition of how to run — mount the data as a volume and inject the secret as an environment variable at run time.
Exercise 4: Container versus lockfile
The analogy holds in principle: both pin the things that affect behaviour so they can’t drift between machines — the same “control your inputs” move. It breaks down in reach and form. A container pins what the lockfile cannot — the interpreter, the system libraries, and the operating system — sealing the whole tower. In exchange you give up transparency: a lockfile is a small text file you can read, diff, and review, whereas an image is an opaque binary blob you manage through versioning and a registry rather than by reading it.
Exercise 5: When to containerise
A container is clearly worth it whenever the code must run reliably on a machine that isn’t yours: a deployed service, a job on a shared cluster, a pipeline that has to reproduce exactly across environments, or onboarding a team to one consistent setup. It’s overkill for a one-off local analysis or an exploration only you will ever run on your own machine, where a virtual environment and a lockfile already give you everything you need. The deciding property is exactly that: does it need to run identically somewhere other than your machine? If yes, containerise; if it lives and dies on your laptop, the lockfile is enough.
D.15 Chapter 15: Deployment
Exercise 1: Deploy the service
Experiential. The revealing part is what the platform makes you supply explicitly that your laptop quietly provided: the environment variables and secrets (Chapter 11), the port to expose, persistent storage for any data, the exact run command, and resource limits. On your machine all of that was implicit context; deployment turns it into configuration you have to state, which is itself why the container and config work of the previous chapters pays off here.
Exercise 2: Staging with pass criteria
Experiential. The discipline is to run the same artefact as production with different configuration, and to decide what “passing staging” means before you look at the result — a latency ceiling, an error-rate ceiling, a smoke test that must pass. Deciding the bar in advance is what stops the very human temptation to wave a release through because it “seems fine”, which is the operational version of moving the goalposts.
Exercise 3: Schedule a batch job
Experiential. The key behaviour is that a failed run is detectable: the job exits non-zero (Chapter 4), so the scheduler can alert, rather than failing silently and leaving you to discover days later that the table was never updated. A batch job that fails quietly is worse than one that fails loudly, because the missing output often looks just like stale-but-present output.
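A minimal sketch of a job wrapper that turns any failure into a non-zero exit code the scheduler can see. The wrapper name `run_job` and the job `score_customers` mentioned in the comment are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger("batch_job")


def run_job(job: Callable[[], None]) -> int:
    """Run the job and translate the outcome into a process exit code."""
    try:
        job()
    except Exception:
        # Log the traceback, then signal failure to the scheduler.
        logger.exception("Batch job failed")
        return 1
    return 0


# In the script's entry point (assuming a job function named score_customers):
#     import sys
#     sys.exit(run_job(score_customers))
```

Returning the code from a function, and calling `sys.exit` only at the entry point, also makes the failure path easy to test.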
Exercise 4: Staging versus a holdout
The analogy holds in the principle: both are “try it somewhere safe, on conditions it wasn’t built on, before it counts”. It breaks down in what you measure and how you judge. A holdout is a fixed sample you score once for a single accuracy number; staging is an environment you run continuously, and “passing” it is a judgement about operational behaviour — latency, error rate, resource use under load — rather than one metric. Passing a holdout is a number clearing a bar; passing staging is a system behaving acceptably.
Exercise 5: Batch versus always-on, and rollback
A nightly churn-scoring job feeding a report is naturally batch; a real-time fraud-scoring API is naturally always-on. Their rollbacks differ accordingly. For the batch job, a rollback is usually re-running the previous version (or simply retaining yesterday’s output) — the failure is recoverable because nothing depended on the run in real time. For the online service, a rollback means switching live traffic back to the previous container version immediately — a blue-green flip or redeploying the prior tag — because every second on the bad version affects real requests. Batch buys you time; online demands a fast switch.
D.16 Chapter 16: Monitoring and observability
Exercise 1: Logging and health
Experiential. To investigate “a strange answer last Tuesday” you need, at minimum, the timestamped request inputs, the prediction returned, and the model version that served it — ideally tied together by a request ID so you can correlate across log lines. The common failure is logging only that “a prediction was made”: that confirms the service ran but lets you reconstruct nothing. The test of your logging is whether you could replay last Tuesday’s prediction from it alone.
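The minimum record described above might be sketched as one structured log line per prediction. The function name `log_prediction` and the field names are illustrative assumptions; the point is that inputs, output, model version, timestamp, and a request ID travel together.

```python
import json
import logging
import uuid
from datetime import datetime, timezone

logger = logging.getLogger("predictions")


def log_prediction(features: dict, prediction: float, model_version: str) -> dict:
    """Emit one structured record holding everything needed to replay a call."""
    record = {
        "request_id": str(uuid.uuid4()),  # correlates related log lines
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    }
    logger.info(json.dumps(record))
    return record
```

Given such a record, replaying last Tuesday's prediction means loading the logged `model_version` and feeding it the logged `features`.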
Exercise 2: A drift check
Experiential. Store a reference sample from training, compare each live batch to it with a KS test or population stability index, and alert when the statistic crosses a threshold. As for which feature drifts first, the usual culprits are externally driven or time-sensitive ones — a monetary feature exposed to inflation, a feature fed by an upstream source that changes format or coverage, or anything seasonal — because the world moves those independently of your model.
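A pure-Python sketch of the population stability index for one numeric feature, assuming equal-width bins over the reference range and a small floor for empty bins. Both choices are assumptions; quantile bins are an equally common alternative. The alert threshold is left to the caller.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    live: Sequence[float],
    n_bins: int = 10,
) -> float:
    """PSI between two samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    # Fall back to a width of 1.0 if the reference sample is constant.
    width = (hi - lo) / n_bins or 1.0

    def proportions(values: Sequence[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            # Clamp values outside the reference range into the edge bins.
            index = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[index] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p, live_p = proportions(reference), proportions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_p, live_p))
```

Identical samples give a PSI of zero, and the value grows as the live distribution moves away from the reference. Calibrate the alert threshold against the PSI you observe between batches you know to be healthy.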
Exercise 3: A useful alert
Experiential. Choose a condition and a threshold, then defend it against fatigue: set the threshold from observed normal variation rather than a round number, alert on a sustained breach rather than a single spike, deduplicate repeated firings, and route only actionable alerts to a human. This is the flaky-test lesson from Chapter 7 transplanted to operations — an alert that cries wolf trains the team to ignore it, and an ignored alert protects nothing, including on the day it’s right.
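The "sustained breach rather than a single spike" rule can be sketched in a few lines. The function name `should_alert` and the default of three consecutive observations are assumptions for illustration.

```python
from typing import Sequence


def should_alert(values: Sequence[float], threshold: float, sustain: int = 3) -> bool:
    """Fire only when the last `sustain` observations all breach the threshold."""
    if len(values) < sustain:
        return False
    return all(v > threshold for v in values[-sustain:])
```

A lone spike such as `[0.1, 0.9, 0.1]` against a threshold of 0.5 stays quiet, while three consecutive breaches fire.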
Exercise 4: Monitoring versus validation
The analogy holds in that both check the model against data it didn’t train on — monitoring simply continues that check indefinitely rather than once. It breaks down over labels: a holdout has ground truth, so you measure accuracy directly, whereas in production the labels usually lag (you learn who actually churned months later) or never arrive. So you generally cannot measure live accuracy as directly as holdout accuracy, and you fall back on input and prediction drift as proxies that hint at trouble without confirming it.
Exercise 5: Data drift versus concept drift
Data drift is a change in the input distribution P(X): the feature values coming in look different from training — for example, a new customer demographic produces values the model rarely saw. Concept drift is a change in the relationship P(Y|X): the same inputs now map to different outcomes — for example, fraudsters change tactics, so transaction features that meant “safe” last year no longer do. Detecting input drift without labels warns you the model is operating on unfamiliar data, where its learned assumptions may no longer hold — but it can’t confirm a real accuracy drop, because the model might still perform well on the shifted inputs, or the relationship might have changed while the inputs looked unchanged. Only ground-truth labels, when they eventually arrive, can confirm the model has genuinely become less accurate.
D.17 Chapter 17: Code review
Exercise 1: A small, focused pull request
Experiential. The thing to notice is why a small, well-described PR is easy to follow: the reviewer can hold the whole change in their head at once, and a description of what-and-why spares them reconstructing your intent from the diff. The contrast with a sprawling change is the lesson — reviewability is mostly a property of size and framing, not of how clever the code is.
Exercise 2: Review someone else’s PR
Experiential. Most people find that finding issues is easier than phrasing them well. The discipline is to mark each comment as blocking or a suggestion (so the author knows what must change), to keep it about the code rather than the person, and to say why — because the reasoning is what teaches and what makes the comment land as help rather than criticism.
Exercise 3: A bug that automated checks miss
A leak, a wrong metric, or an off-by-one in a split passes the linter and the tests because it is syntactically valid and the tests only check what the author already thought to check; a linter inspects form, never domain correctness. Catching it needs a reviewer who reads the logic and the data flow with domain knowledge — someone who knows the scaler must be fit on training data only, or that this metric is wrong for an imbalanced problem. That is exactly the attention automation cannot provide and review exists to supply.
Exercise 4: Code review versus peer review of a paper
The analogy holds: both are a knowledgeable peer examining the reasoning and method before the work counts, catching what the author is too close to see. It breaks down in cadence and size — peer review of a paper is rare, heavy, and large (a whole study, reviewed once), whereas code review is frequent, light, and small (one change, many times a week). The implication is to submit work for review in small, frequent pieces rather than saving up a quarter’s work for one enormous request that can only be rubber-stamped.
Exercise 5: A data science review checklist
Items worth adding that a general checklist wouldn’t have: does it leak (preprocessing fit on test data, a target in the features)? are the data assumptions stated and checked? is it reproducible (seed and config captured)? is the metric appropriate to the problem? are there hard-coded secrets? What to leave off is style and formatting — not because it doesn’t matter, but because the formatter and linter settle it automatically (Chapter 5). Leaving it off makes reviews better, because litigating whitespace in comments consumes the human attention that should go to logic and trains a team to nitpick form instead of reasoning about correctness.
D.18 Chapter 18: Documentation
Exercise 1: Write a README
Experiential. The point at which your reader first gets stuck is the most valuable output: it’s almost always an undocumented environment variable, a data source that needs access you forgot to mention, or a setup step so habitual you didn’t know you were doing it. Timing the run from clone to running surfaces the assumptions you can’t see precisely because they’re yours.
Exercise 2: Add docstrings
Experiential. The test is whether help() on a function tells a reader enough to use it without reading the body. If it doesn’t, the docstring is missing part of the contract — usually the parameters, the return value, the exceptions it raises, or a worked example. A docstring that only restates the function name has documented nothing.
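As a sketch of a docstring that covers the whole contract, here is a small hypothetical function, `clip_spend`, documented in NumPy style (one of several common conventions), with examples written so that doctest can execute them.

```python
def clip_spend(spend: float, ceiling: float) -> float:
    """Cap a spend value at a ceiling.

    Parameters
    ----------
    spend : float
        Customer spend; must be non-negative.
    ceiling : float
        Largest value to return.

    Returns
    -------
    float
        ``spend`` if it is at most ``ceiling``, otherwise ``ceiling``.

    Raises
    ------
    ValueError
        If ``spend`` is negative.

    Examples
    --------
    >>> clip_spend(250.0, ceiling=200.0)
    200.0
    >>> clip_spend(50.0, ceiling=200.0)
    50.0
    """
    if spend < 0:
        raise ValueError(f"spend must be non-negative, got {spend}")
    return min(spend, ceiling)
```

Running `python -m doctest` on the module executes the examples, so a docstring that has fallen out of step with the code fails a check instead of quietly misleading a reader.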
Exercise 3: Write a model card
Experiential. The hardest section is almost always “known limitations / where it should not be used”, because it forces you to articulate the model’s failure modes and the populations it was not validated on — exactly the questions exploratory work leaves implicit. That difficulty is itself informative: where the model card is hard to write is where your understanding of the model’s boundaries is thinnest, and therefore where the risk lives.
Exercise 4: Classify documents with Diátaxis
A docstring is reference (look up what a function takes and returns). A tutorial notebook is a tutorial (learning by doing, with the reader led by the hand). A model card is mostly reference plus explanation (facts about the model, and the why behind its limits). A README is a deliberate blend: at its best a brief tutorial/how-to that orients a newcomer and points onward to the rest. Mixing the jobs makes a document worse because a reader arrives with one need (to learn, to look up, or to understand), and a document trying to serve two serves neither: a reference padded with teaching is slow to search, and a tutorial listing every option is impossible to follow.
Exercise 5: Keeping documentation in sync
Two structural practices: co-locate documentation with the code (docstrings), so a change to the code sits right beside the text that describes it; and generate reference documentation from the code and make examples executable (doctest, or a tested snippet), so a changed signature or a stale example becomes a build or test failure rather than a silent lie. “Remember to update the docs” is not a third practice because it relies on human discipline under deadline pressure with no feedback when it’s forgotten — the documentation rots quietly and you only discover it when it has already misled someone. Structural defences make drift either impossible or loud; a reminder makes it neither.
D.19 Chapter 19: Technical debt
Exercise 1: Audit a project for debt
Experiential. Sorting each item into deliberate (you knew you were cutting the corner) and inadvertent (you’ve only just noticed) is the instructive part, and the inadvertent pile is usually the larger and more alarming one. The item that surprises people most is almost always a piece of “temporary” code — a hard-coded value, a quick script — that turned out to be load-bearing and has been quietly holding production together for months.
Exercise 2: The boy-scout rule
Experiential. Paying down one item while you’re already in the file — adding a test, extracting a function, naming a constant — is typically quick relative to the change you came to make, and that’s exactly the point: opportunistic repayment is cheap because you’ve already paid the cost of understanding the code. Debt repaid this way never has to be scheduled.
Exercise 3: A debt log
Experiential. A debt item is worth writing down, rather than fixing on the spot, when the fix is larger than the time you have, when the code might be discarded anyway, or when stopping to repay it now would derail the task in hand — but you still want it visible so it isn’t silently forgotten. Trivial fixes don’t go in the log; they go in the boy-scout pass. The log’s whole purpose is to make deferred debt a deliberate, tracked decision rather than a thing you rediscover at 3am.
Exercise 4: Debt versus financial debt
The analogy holds: a shortcut borrows time now and charges interest later, in that every future change to that code is slower and riskier. It breaks down in the shape of the interest. A loan has a known rate and a schedule, so you can plan around it; technical debt’s interest is unpredictable and lumpy — it costs nothing until the day you have to touch the code, then it can cost an enormous amount at once. With no monthly statement to remind you it exists, it’s easy to defer indefinitely, which is precisely why it accumulates.
Exercise 5: When debt is the right call
A shortcut is the correct decision for code with a short or uncertain life — a prototype that may be discarded, a hypothesis you’re testing, a genuine deadline where shipping now matters more than polish. It’s reckless when taken in code you already know will be load-bearing, when taken without recording it, or when the resulting failure would be silent and high-consequence. The distinguishing property is the code’s expected lifetime and criticality, combined with whether the debt is acknowledged: debt on disposable, low-stakes code is a tool; unrecorded debt on code others will depend on is a liability waiting to come due.
D.20 Chapter 20: Cross-discipline collaboration
Exercise 1: Map the vocabulary gaps
Experiential. The classic three: test (a data scientist means evaluation metrics; an engineer means pass/fail assertions on code), validation (DS: holding out data to measure generalisation; SE: checking inputs against a schema), and model (DS: a learned predictive function; SE: an abstraction of a domain, like a class diagram). The gap that has usually caused a real misunderstanding is “is it tested/validated?” — where both parties said yes, meaning entirely different things, and discovered the mismatch only later.
Exercise 2: Write a handoff document
Experiential. The revealing part is what you find yourself making explicit for the first time: the failure modes, the edge cases of the input contract, the caveats on performance (where the model is weak, the populations it wasn’t validated on), how it should be monitored, and who owns it when it misbehaves. All of that typically lived only in your head, which is exactly why the handoff is where things go wrong.
Exercise 3: An interface as a contract
Experiential. Agreeing the schema, latency budget, and bad-input behaviour in advance is cheaper because the alternative — discovering in production that you returned a label where the service expected a probability, or the wrong units, or an unhandled null — is an incident with real cost and a paging at a bad hour. The contract converts an integration surprise into a build-time check: a conversation now versus an outage later.
Exercise 4: Team interface versus data contract
The analogy holds in that both define exactly what crosses the boundary, so neither side has to guess or reverse-engineer the other. It breaks down because a data contract is enforced by code — the schema validates or the request fails — whereas a team contract also carries a human dimension no schema captures: the shared understanding of intent, of what “good enough” means for this use, and of who is responsible when the model misbehaves. The schema pins the bytes; it cannot pin the agreement about ownership and intent, and that is where collaboration actually succeeds or fails.
Exercise 5: Two rigours pulling opposite ways
A clear case: a model that needs frequent retraining and experimentation (the data scientist wants fast, loose iteration) while it serves live production traffic (the engineer wants stability, tests, and controlled releases). The instincts genuinely conflict. A team holding both resolves it not by one side winning but by engineering the boundary tightly so that exploration can stay loose safely — an automated retraining pipeline with validation gates and canary releases, behind a stable contract and CI the engineer trusts, so the data scientist iterates freely without putting production at risk. The general principle is that the engineering instinct and the data science instinct are reconciled by tightly engineering the interface so that what happens behind it can remain appropriately loose.
D.21 Chapter 21: Notebook to production API
Exercise 1: Carry a model one stage further
Experiential. For most readers the next stage is extracting the feature and training logic into an importable module of pure functions. What you have to change to make it importable is everything that tied the code to the notebook: replace reliance on notebook globals with explicit function arguments (Chapter 6), separate the logic from the cell that happened to run it, and give each function a clear input and return. The pure-function discipline from Chapter 6 is precisely what makes the code importable — a function that depends on whatever is in the kernel can’t be lifted out of it.
Exercise 2: The train–serve safeguard
Experiential. The test feeds a single raw record through both the training feature path and the serving feature path and asserts the resulting features are identical. It is worth more than a test of the model’s accuracy because train–serve skew is a silent, high-impact bug: if serving computes a feature even slightly differently from training, the model receives inputs unlike anything it learned on and degrades quietly, with nothing failing. That has a definite right answer a test can pin, whereas accuracy is a moving statistical quantity that doesn’t belong in a pass/fail gate (Chapter 7). The skew test catches a real deployment defect; an accuracy assertion catches noise.
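A minimal sketch of the parity test, assuming both paths call one shared feature function. All names and features here (`build_features`, `spend_per_order`, `is_new`) are hypothetical. In a real project the two paths often live in different modules, which is exactly why the test is worth having.

```python
def build_features(record: dict) -> dict:
    """Single shared feature function (hypothetical), imported by both paths."""
    return {
        "spend_per_order": record["spend"] / max(record["orders"], 1),
        "is_new": record["tenure_days"] < 30,
    }


def training_features(records: list[dict]) -> list[dict]:
    """Training path: featurise a batch of records."""
    return [build_features(r) for r in records]


def serving_features(payload: dict) -> dict:
    """Serving path: featurise one request payload."""
    return build_features(payload)


def test_train_serve_parity() -> None:
    """One raw record must yield identical features on both paths."""
    record = {"spend": 120.0, "orders": 4, "tenure_days": 12}
    assert training_features([record])[0] == serving_features(record)
```

If someone later reimplements a feature on the serving side, for example by dividing by `orders` without the guard, this test fails immediately instead of the model degrading quietly.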
Exercise 3: Wrap the model in an API
Experiential. What the endpoint forces you to make explicit, and the notebook let you leave implicit, is the contract: the exact request schema (field names, types, and valid ranges), what the response carries (a probability rather than a label, plus a model_version for traceability), and what counts as a valid input. In the notebook you built the feature array yourself and trusted it; the endpoint has to state, in advance, what a caller must send and will receive — and the malformed request returning a clean 422 is that contract doing its job.
Exercise 4: Productionising versus publication
The analogy holds in that both turn an exploratory finding into a rigorous, reproducible, reviewable artefact others can rely on — the package is the methods section, the lockfile and config are the reproducibility statement, the tests are the peer review you run on yourself. It breaks down because a paper, once published, is finished and frozen, whereas a deployed service runs continuously against data that keeps changing, so it needs something a paper never does: monitoring, to tell you when the result it embodies has stopped being true. That is the subject of Chapter 16, Monitoring and observability.
Exercise 5: How far to walk the route
A throwaway analysis should stop at the notebook (version-controlled at most); an internal tool typically warrants a package, a few tests, and externalised config, but rarely a full container-and-CD pipeline; a model real users depend on needs the whole route — package, tests, API, container, CI, deployment, and monitoring. The signal to go further is always the same kind of thing: someone else needs to run it, it runs repeatedly or unattended, real decisions now depend on its output, or it must reproduce exactly. “It has outlived its expected life, or someone now depends on it” is the trigger to take it one stage further down the path.
D.22 Chapter 22: Reproducible research pipeline
Exercise 1: One command, from raw data to result
Experiential. The valuable discovery is the hidden dependency the Makefile flushes out — the step that only worked because of something on your machine: a file in your home directory, a package you installed once and forgot, an environment variable set months ago, or a manual “and then I clicked export” step. Declaring every stage’s inputs and outputs forces those implicit dependencies into the open, which is exactly why a one-command rebuild is a stronger guarantee than “it ran when I did it”.
Exercise 2: Version a dataset
Experiential. Before versioning, reconstructing the exact data behind an old result was usually impossible because the raw file had been overwritten or changed in place, with no record of which version produced the figure. DVC (or, at a minimum, a dated immutable copy plus a checksum committed alongside the code) makes the input recoverable, so checking out the commit behind a result also restores the dataset that produced it — the missing fourth input from the chapter.
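The "dated immutable copy plus a checksum" fallback might be sketched as follows. The helper names `file_sha256` and `verify_dataset` are assumptions, and the sketch does not replace DVC; it only detects that the file on disk is not the version a result was computed from.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(path: Path, expected: str) -> None:
    """Fail loudly if the data on disk is not the version the result used."""
    actual = file_sha256(path)
    if actual != expected:
        raise ValueError(f"{path} has checksum {actual}, expected {expected}")
```

Committing the expected checksum alongside the code, and calling `verify_dataset` at the start of the pipeline, turns a silently changed input file into an immediate error.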
Exercise 3: Generate the number, don’t paste it
Experiential. A hand-pasted number becomes wrong because nothing updates it when the data or code changes — it is a snapshot frozen at the moment you copied it, with no link back to its source, so the day the analysis changes the figure in the slide silently disagrees with the figure in the code. A generated number is recomputed from the data every time the report is rendered, so it cannot drift away from the result it claims to report; the worst that can happen is the build fails, which is loud rather than silent.
Exercise 4: Data versioning versus a seed
The analogy holds in that both pin an input so the result is repeatable — the seed pins the randomness, data versioning pins the dataset, and in each case you’re removing a way the result could change without your intending it. It breaks down in mechanism: a seed is a single integer you drop into the code, whereas data is large and lives outside the code, so it can’t be a line in a script. It needs its own external storage and a small, committable pointer — which is precisely what DVC provides. The thing being pinned differs in size and location, so the handle must differ too.
Exercise 5: What a notebook doesn’t pin
A single notebook, even under version control, pins the code but not the other three inputs. It does not pin the environment: the packages it imports are whatever happens to be installed in the kernel, so a colleague with a different pandas version can get a different result from identical code. It does not pin the data: it reads whatever the file path points to, and that file can be overwritten or updated without the notebook changing at all. And it pins randomness only if you remembered to set seeds. “It’s all in one notebook” addresses code organisation, not reproducibility: the notebook is necessary but nowhere near sufficient, because the result also depends on the environment and the data, which live entirely outside it, and on randomness, which it controls only when seeded.
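To make the environment gap visible, a notebook can at least record what it ran against. The sketch below uses only the standard library (`importlib.metadata`); it is a record, not a lock, so it documents the environment without recreating it.

```python
import platform
import sys
from importlib import metadata


def environment_snapshot():
    """Record the interpreter and every installed distribution with its version."""
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # skip malformed installs that have no name
            packages[name.lower()] = dist.version
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": dict(sorted(packages.items())),
    }


snapshot = environment_snapshot()
print(snapshot["python"], len(snapshot["packages"]), "packages recorded")
```

Saving this dictionary next to a result tells a colleague which versions produced it; a lockfile or container is still what lets them rebuild it.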
D.23 Chapter 23: MLOps pipeline
Exercise 1: Sketch the loop
Experiential. Naming each stage for your own model — the training pipeline, the registry, the deployment, the monitoring signal, the retraining trigger — usually reveals that the missing or manual stage is the return arrow: most teams have a way to train and a way to deploy, but monitoring is thin and retraining is ad hoc, done when someone happens to notice a problem. Automating it means adding drift monitoring that emits a signal, a triggered training pipeline, and a promotion gate — closing the loop so the cycle runs on a signal rather than on someone’s memory.
Exercise 2: The retraining trigger
Experiential. The false-alarm rate matters because a trigger that fires too often is the flaky test of MLOps (Chapters 7 and 16): each false alarm causes a needless retrain, which costs compute and — worse — risks promoting a model trained on a blip. A trigger becomes more trouble than it’s worth once its false positives are frequent enough that the team disables it or ignores its output, at which point it protects nothing. The defences are the same as for alerts: set the threshold from observed normal variation rather than a round number, and require sustained drift rather than a single noisy batch.
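Both defences fit in a few lines. This is a sketch under stated assumptions: the drift scores are made up, the threshold rule (mean plus `k` standard deviations of a baseline period) is one reasonable choice among several, and `sustained=3` is an arbitrary illustration rather than a recommendation.

```python
from statistics import mean, stdev


def drift_threshold(baseline_scores, k=3.0):
    """Set the alarm level from observed normal variation: mean + k standard deviations."""
    return mean(baseline_scores) + k * stdev(baseline_scores)


def should_retrain(recent_scores, threshold, sustained=3):
    """Fire only if the last `sustained` batches all exceed the threshold."""
    if len(recent_scores) < sustained:
        return False
    return all(score > threshold for score in recent_scores[-sustained:])


# Made-up drift scores from a period when the model was known to be healthy.
baseline = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.08, 0.11]
threshold = drift_threshold(baseline)

blip = [0.10, 0.11, 0.40, 0.10, 0.12]   # one noisy batch, then back to normal
shift = [0.10, 0.11, 0.30, 0.32, 0.35]  # drift that persists

print(should_retrain(blip, threshold), should_retrain(shift, threshold))
```

The single spike in `blip` does not fire the trigger, while the persistent rise in `shift` does, which is the behaviour that keeps the false-alarm rate low enough for the team to keep trusting it.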
Exercise 3: The promotion gate
Experiential. The comparison must use the same evaluation data because scoring two models on different datasets confounds “the candidate is better” with “the candidate’s test set was easier” — you could not tell skill from luck of the draw. And you require a margin rather than strict improvement because a tiny difference in a metric like AUC is within its own run-to-run and sampling variability; promoting on a hair’s-breadth win means swapping the production model on noise, which adds risk and churn for no real gain. The candidate should have to beat the incumbent by more than the metric’s own wobble before it earns promotion.
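One way to estimate the metric's wobble is a paired bootstrap over the shared evaluation rows. The sketch below is an assumption-laden illustration, not the chapter's gate: it uses accuracy rather than AUC to stay within the standard library, the predictions are synthetic, and the margin of 0.02 is arbitrary.

```python
import random


def paired_bootstrap_gain(y_true, incumbent_pred, candidate_pred, n_resamples=2000, seed=0):
    """Bootstrap the accuracy difference (candidate minus incumbent) on the SAME rows."""
    rng = random.Random(seed)
    n = len(y_true)
    # Per-row gain: +1 if only the candidate is right, -1 if only the incumbent is.
    gains = [
        int(c == y) - int(i == y)
        for y, i, c in zip(y_true, incumbent_pred, candidate_pred)
    ]
    diffs = []
    for _ in range(n_resamples):
        sample = [gains[rng.randrange(n)] for _ in range(n)]
        diffs.append(sum(sample) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]  # pessimistic end of a 95% interval
    return sum(gains) / n, lower


def should_promote(y_true, incumbent_pred, candidate_pred, margin=0.02):
    """Promote only if even the pessimistic estimate of the gain clears the margin."""
    _, lower = paired_bootstrap_gain(y_true, incumbent_pred, candidate_pred)
    return lower > margin


# Synthetic evaluation set: 200 rows, all scored by every model.
y_true = [1] * 200
incumbent = [1] * 140 + [0] * 60   # 70% accurate
hairs_breadth = [1] * 142 + [0] * 58  # 71% accurate
clear_win = [1] * 180 + [0] * 20   # 90% accurate

print(should_promote(y_true, incumbent, hairs_breadth))
print(should_promote(y_true, incumbent, clear_win))
```

Resampling the per-row gains, rather than each model's score separately, is what enforces the "same evaluation data" requirement: both models are always compared on identical rows.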
Exercise 4: MLOps loop versus the experiment–iterate cycle
The analogy holds: both are a cycle of train, evaluate, adjust, and retrain. It breaks down in who drives it and what it optimises for. Your exploratory loop optimises for discovery, and you are inside it applying judgement at every turn; the production loop optimises for staying current and must run with you out of it most of the time. So the judgement you’d apply by eye during exploration — is this drifting enough to act on, is this new model actually better — has to be made explicit in the production loop, as a drift threshold and a promotion gate, because nobody is watching each iteration.
Exercise 5: The weakest practice
Take rollback (Chapter 15). Without it, the loop’s promotion gate is a one-way door: the moment a candidate is promoted — perhaps trained on a corrupted batch, or scoring well on a test set that didn’t catch a regression — it serves production traffic with no fast way back, and an automated loop that can promote but not un-promote has merely automated the act of shipping a bad model. The same argument lands on any link: without reproducibility (Chapter 22) you can’t retrain to a comparable result, so the candidate can’t be trusted or traced; without monitoring (Chapter 16) nothing triggers the loop and the model decays in silence; without testing (Chapter 7) a broken transform propagates into every retrain. The cycle only runs safely if every one of these holds, which is why automating it is the last thing you do, not the first.
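The "one-way door" is easy to see in a toy registry. The class below is hypothetical and in-memory only; a real registry would persist this history, but the point is just that promotion must leave the previous version one call away.

```python
class ModelRegistry:
    """Toy registry: keeps the promotion history so the previous model can be restored."""

    def __init__(self, initial_version: str):
        self._history = [initial_version]

    @property
    def production(self) -> str:
        """The version currently serving traffic."""
        return self._history[-1]

    def promote(self, version: str) -> None:
        self._history.append(version)

    def rollback(self) -> str:
        """Un-promote the current model and return the version now in production."""
        if len(self._history) < 2:
            raise RuntimeError("nothing to roll back to")
        self._history.pop()
        return self.production


registry = ModelRegistry("model-v1")
registry.promote("model-v2")    # the candidate passes the gate
restored = registry.rollback()  # monitoring then flags a regression
print(restored)
```

A registry with `promote` but no `rollback` is the automated loop that can ship a bad model and not retract it.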
---# Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.mdtitle: "Exercise answers"---## Chapter 1: From notebook to system {#sec-answers-notebook-to-system}**Exercise 1: Restart Kernel and Run All**This exercise is diagnostic — there's no single "right" answer. Common failures include:- `NameError` for variables defined in cells that the kernel ran out of order. For example, a variable created in cell 15 but used in cell 8, which only worked because you happened to run cell 15 first during interactive exploration.- `FileNotFoundError` for data files with hard-coded paths that only exist on your machine or in a specific working directory.- Cells that depend on outputs from cells you've since deleted or commented out.The point isn't to fix every failure immediately — it's to see how much of your notebook's correctness depends on invisible state rather than explicit structure.**Exercise 2: Extract a function**Here's an example using the chapter's customer filtering logic:```{python}#| label: answer-ch1-ex2#| echo: trueimport pandas as pdimport numpy as npdef filter_high_value(customers: pd.DataFrame, threshold: float) -> pd.DataFrame:"""Select customers whose spend exceeds the given threshold."""return customers[customers["spend"] > threshold].copy()# Verify with a small test inputtest_data = pd.DataFrame({"spend": [50, 150, 250]})result = filter_high_value(test_data, threshold=100)assertlen(result) ==2, f"Expected 2 rows, got {len(result)}"assertlist(result["spend"]) == [150, 250], "Should contain only rows above threshold"print("All assertions passed")```The key properties gained: the function has a name that describes its purpose, its inputs are explicit (no reliance on global variables), and the `assert` statements verify the logic independently of the notebook's broader state.**Exercise 3: Score a notebook**This is a self-assessment — your honest scores matter more than the numbers themselves. 
In practice, most data science notebooks score highest on modularity (often 2–3, since most have at least some cell-level separation) and lowest on testability (often 1, since few notebooks include any automated verification). Reproducibility varies widely: a notebook that loads from a fixed CSV with pinned dependencies might score 4, while one that relies on a live database connection and `pip install` scores 1.The value of this exercise is identifying your weakest property and asking whether strengthening it would have saved you time in a recent project. If the answer is yes, that's where to invest first.**Exercise 4: Holdout set / test suite analogy**Two ways the analogy **holds**:1. Both are verification mechanisms applied *after* the creative work. You build the model, then validate. You write the code, then test. Neither replaces the work; both catch problems the author missed.2. Both require separation — a holdout set must be kept separate from training data, and tests must check behaviour from outside the code, not just re-run it. Contamination in either case undermines the verification.Two ways the analogy **breaks down**:1. Model validation is probabilistic; software testing is deterministic. A holdout accuracy of 82% might be perfectly acceptable — you're measuring how well the model generalises. A test that passes 82% of the time is broken. Tests are pass/fail: the code either does what you specified or it doesn't.2. Holdout sets evaluate performance on data drawn from the same distribution. Tests evaluate correctness against cases the developer explicitly constructed, including edge cases and error conditions that may never appear in production data. This means tests can catch failures that no amount of validation data would reveal, but they can also miss failures that real-world data would expose. 
Each has a blind spot the other doesn't share.**Exercise 5: Run a colleague's notebook**This exercise is experiential — the answer is your documentation of the attempt. Common discoveries include:- File paths that assume a specific directory structure or operating system- Environment dependencies not captured anywhere (specific package versions, system libraries, environment variables)- Cells that must be run in a non-obvious order, or cells that must be skipped- Configuration values with no explanation of how they were chosenEach discovery maps to one of the chapter's four system properties. Hard-coded paths and undocumented dependencies are reproducibility failures. Monolithic cells that do many things are modularity failures. The absence of any automated checks is a testability failure. And magic numbers without context are readability failures.## Chapter 2: Version control {#sec-answers-version-control}**Exercise 1: Put a project under version control**Experiential — there's no single right answer, but a sound initial commit contains only what you *authored*: code (`.py` modules, and notebooks with outputs stripped), configuration (`requirements.txt`, `pyproject.toml`, and the `.gitignore` itself), and documentation (a README). What you deliberately exclude, and where each belongs instead:- **Data** → a data-versioning tool (DVC) or object storage. It's too large and too volatile for Git's keep-everything-forever history, and may be sensitive.- **Trained models and artefacts** (`.pkl`, `.joblib`) → a model registry or artefact store. They're large, binary, and regenerated rather than written by hand.- **Secrets** (`.env`, API keys) → a secrets manager, or a local `.env` that is never committed. 
A credential committed even once persists in the history after you delete it.- **Caches and environments** (`__pycache__`, `.ipynb_checkpoints`, `.venv`) → not versioned at all; they're regenerated locally.The discipline is the one from the chapter: commit what you author, store what you generate or receive somewhere better suited to it.**Exercise 2: Clean notebook diffs**Two routes achieve this. `nbstripout` installs a Git filter that strips outputs and execution counts on commit, so the tracked version holds only code and markdown:```bashpip install nbstripoutnbstripout--install# registers the filter for this repository```Or pair the notebook with a script representation using Jupytext, and treat the script as the reviewed artefact:```bashpip install jupytextjupytext--set-formats ipynb,py:percent analysis.ipynb```After either, make a one-line change and inspect `git diff` (or `nbdiff` if you use nbdime). The diff should now show only your change rather than a wall of metadata. The verification *is* the exercise: you've turned an unreadable JSON diff into a reviewable one.**Exercise 3: Commit messages for past decisions**Self-directed. The instructive part is comparing your messages against what a filename or inline comment could carry. A message such as *"Drop `signup_channel`: 60% missing after the May tracking change, and imputing it was injecting signal"* records both the reason and the evidence — information a filename like `model_v3` cannot hold and a comment like `# dropped signup_channel` omits. Because the message is attached to the exact change, attributed, dated, and surfaced by `git blame`, the "why" survives long after the people who remember the review meeting have moved on. That permanence is what neither the filename nor the comment provides.**Exercise 4: Branch for an experiment**Experiential. 
A typical flow:```bashgit switch -c experiment/log-spend-features# ...edit, commit on the branch...git switch main # main is untouched and still runsgit merge experiment/log-spend-features # if the experiment worked# or: git branch -D experiment/log-spend-features # if it didn't```Keeping `main` untouched means that at every moment you have a known-good version to fall back to, demo, or hand over — and discarding a failed experiment leaves no residue, no `model_v2_BAD.ipynb` lingering in the folder. Compared with copying files, the branch makes the experiment both *comparable* (you can diff it against `main`) and *disposable* (deleting the branch erases the dead end cleanly).**Exercise 5: Git versus an experiment tracker**Git versions the *code and its history of changes*: it answers "what is the code, and how did it come to be this way?", with branching, merging, and line-by-line history. It records nothing about what a given run produced. An experiment tracker such as MLflow records *runs*: the metrics, parameters, and artefacts a specific execution generated, so you can compare AUC across fifty hyperparameter settings — something Git has no concept of. Conversely, a tracker offers no line-by-line source history or merge.The division of labour reflects two different questions. Versioning your code is about the *process* that produces results; tracking results is about the *outputs* of running that process on particular data. Reproducibility needs both — the exact code version *and* the run record — which is why mature projects link each tracked run back to the Git commit that produced it.## Chapter 3: Environments and dependencies {#sec-answers-environments}**Exercise 1: Produce a lockfile**Experiential. 
The flow separates intent from the resolved result:```bash# requirements.in holds your direct dependencies (the abstract spec)pip install pip-toolspip-compile requirements.in # -> fully pinned requirements.txt (the lock)python-m venv .venv &&source .venv/bin/activatepip install -r requirements.txt # rebuild from the lock```The point of the exercise is what the lock contains that your `requirements.in` did not: every *transitive* dependency, pinned to an exact version. Rebuilding in a fresh environment and finding the project still runs confirms you've captured the environment, not just your top-level wishes.**Exercise 2: Audit unpinned dependencies**Compare what's declared (often nothing, or a few `>=` constraints) against `pip freeze`. The instructive part is naming a library where a major-version bump could change results and saying *how*: scikit-learn has changed estimator defaults between releases (a different solver or tie-handling shifts predictions); pandas changed copy-on-write behaviour and default dtypes; NumPy's random `Generator` stream can differ across versions. In each case the code is untouched but the numbers move — which is precisely the failure mode pinning prevents.**Exercise 3: What pinning controls**Pinning your Python-package versions controls *library behaviour* — a changed default, a re-implemented algorithm, a deprecated parameter. It does **not** control the Python interpreter version, the operating system, the underlying maths libraries (BLAS/LAPACK), or hardware/GPU floating-point behaviour. To control that second category you reach for a container (Docker), which pins the interpreter, system libraries, and OS alongside the packages; for bit-exact numerics you would additionally fix BLAS threading and any framework determinism flags. Pinning versions is necessary but not sufficient for full reproducibility.**Exercise 4: Reproduce a colleague's environment**Experiential. 
The valuable output is the list of things the lockfile alone didn't capture, each with its proper home:- The **Python version** → a `.python-version` file or `pyproject.toml`.- A **system library** a wheel links against (e.g. a C library, a CUDA runtime) → a container image or a documented set of OS packages.- An **environment variable** the code reads at runtime → a committed `.env.example` documenting the names (never the values; see *Configuration and secrets*).Each gap is a reproducibility failure waiting to happen, and each has a better home than a colleague's memory.**Exercise 5: When abstract beats locked**Abstract `>=` requirements are the right choice when your code is a *library* meant to be installed alongside other packages: over-constraining versions would make it hard for consumers to satisfy everyone's dependencies at once, so you specify the minimum you need and stay flexible. An exact lock is essential when your code is an *endpoint* — a deployed service, a scheduled pipeline, a reproducible analysis — where the same versions must reappear every time and nothing downstream depends on your version range. The difference is structural: a library is a dependency *of* other things (flexibility aids interoperability); an application is the final consumer (exact reproduction is the whole point).## Chapter 4: The command line {#sec-answers-command-line}**Exercise 1: Capture a workflow**Experiential. A reasonable result is a `Makefile` with one target per step, run end to end with a single `make`. The discovery worth noting is the step that "depended on you remembering to do something first" — creating an output directory, setting an environment variable, downloading data before the feature step. 
Those implicit prerequisites are exactly what a task runner makes explicit: a target that creates the directory, or a dependency declaration (`features: data`) that enforces the order so no one has to remember it.**Exercise 2: Answer a question with the shell**For example, counting the distinct values in the third column of a CSV (skipping the header):```bashtail-n +2 data.csv |cut-d,-f3|sort|uniq|wc-l```or counting rows matching a condition with `grep -c`. This feels natural for quick, line-oriented filtering and counting on flat text, and it's faster than starting a Python session. You wish for a DataFrame the moment fields contain commas or quotes (naive `cut` mis-parses real CSV), when you need typed aggregation, or when a join is involved — that's the boundary the Data Science Bridge describes.**Exercise 3: Exit codes and chaining**A validation script signals failure by exiting non-zero:```python# validate.pyimport sysimport pandas as pddf = pd.read_csv("data/processed.csv")if df.empty or"target"notin df.columns:print("validation failed: empty data or missing target", file=sys.stderr) sys.exit(1) # non-zero: tells the shell something went wrongprint("validation passed")``````bashpython validate.py &&python train.py # train runs only if validate exits 0```This matters because automation and CI decide pass/fail from exit codes, not from reading output. A non-zero exit halts the `&&` chain, so bad data never reaches training, and a CI server marks the build red instead of silently continuing. The exit code is the machine-readable verdict the whole pipeline depends on.**Exercise 4: Surviving a dropped connection**A command started in a plain SSH session is a child of that session. When the connection drops, the session ends and the job is sent a hang-up signal (`SIGHUP`), so it dies — unless you took extra steps (`nohup`, `disown`). 
`tmux` (or `screen`) changes this by running your shell inside a session that lives on the *server*, decoupled from your client connection: you detach (or simply lose the connection), the session and everything in it keep running, and you reattach later to find the job still going or finished. For any job measured in hours, that decoupling is the difference between a result and a wasted afternoon.**Exercise 5: Pipes versus pandas**The analogy **holds** in composition: both build a larger operation from small, single-purpose steps arranged left to right, each transforming what the previous one produced. It **breaks down** in the data model — shell pipes pass untyped text, line by line, between separate programs, whereas pandas passes a typed DataFrame within a single process. The rule of thumb that falls out: use a shell pipeline for quick, line-oriented work on text and files (filtering, counting, gluing tools together), and reach for a Python script the moment the data has real structure — typed columns, joins, or anything where correctly parsing the text is itself the hard part.## Chapter 5: Readable code {#sec-answers-readable-code}**Exercise 1: Refactor for readability**Experiential. The change that surprises people most is that renaming alone surfaces confusion: if you can't think of a good name for a variable, that's often a sign you don't fully understand what it holds, or that it holds two different things at different points. Replacing magic numbers with named constants does the same — naming `0.73` forces you to articulate what it *is*. Adding a type signature and a one-line docstring then makes the contract explicit without touching the logic. If the refactor was purely cosmetic and nothing became clearer, the original was already readable; usually something does.**Exercise 2: Formatter and linter**Experiential. The instructive step is sorting the linter's output into real defects and pure style. 
Real defects include unused imports, variables assigned but never used, names that shadow a builtin (`list`, `dict`), bare `except:` clauses that swallow errors, and mutable default arguments. Pure style includes line length, quote style, and spacing — exactly the things the formatter fixes automatically. The lesson is the division of labour: automate the style pile entirely so that human attention in review goes to the defect pile and to the logic, which no tool can check.**Exercise 3: Split a function**Experiential. The test of a good decomposition is to read just the new function names in sequence and ask whether they narrate what the original function did. If they do — `load_raw`, `drop_invalid_rows`, `add_spend_per_day`, `aggregate_by_cohort` — the names are carrying the structure, and a reader can understand the whole from them. If a step needs a comment to explain what it does, the name isn't doing its job yet.**Exercise 4: Code versus a methods section**The analogy **holds** in that both name things meaningfully, present steps in a followable order, and leave out the dead ends, because both exist so a reader can follow the reasoning rather than reconstruct it. It **breaks down** in how they're consumed: a methods section is read once, for understanding, and a reader forgives small ambiguities because they won't re-execute your prose. Code is run repeatedly and is read specifically *in order to change it*, so an ambiguity a human would gloss over becomes the spot where someone misreads the intent and introduces a bug. Code has to be clearer than prose because misreading it has consequences a prose reader never faces.**Exercise 5: When readability isn't worth it**Throwaway names are the right call for genuinely scratch code that will be deleted within the session — a quick check of a distribution, a one-off plot to settle a question, a snippet you're using to understand an API. 
The specific signal that it has crossed the line is *promotion*: you copy it into another notebook, you find yourself relying on its output days later, or you hand it to someone else. At that moment the code has become "kept", will be read many times, and earns the few minutes of naming and documentation. The skill is noticing the promotion and cleaning up *then*, rather than writing every scratch cell as if it were production.## Chapter 6: Functions, modules, packages {#sec-answers-functions-modules-packages}**Exercise 1: Extract a copy-pasted function**Experiential. The payoff is visible the moment you make the follow-up change: with the logic in one imported module, you edit one place and every caller gets the fix; with copies scattered across notebooks, you had to find and edit each one — and missing one is exactly how the copies drift out of sync. "How many places did I have to change?" going from several to one *is* the single-source-of-truth principle made concrete.**Exercise 2: Global to pure function**Experiential. Once every input arrives as an argument, the function's result depends only on those arguments, so it returns the same answer no matter what cells ran before it. The verification — calling it after deliberately changing some unrelated global or re-running cells out of order, and getting the same result — is the property that also makes it testable in the next chapter.**Exercise 3: Make a project installable**Experiential. The error you were previously working around is `ModuleNotFoundError` (or the `sys.path.append("..")` hack used to dodge it), which works only from the directory you happened to launch from. After a minimal `pyproject.toml` and `pip install -e .`, the package is importable by name from anywhere in the environment, and because the install is *editable*, changes to the source take effect immediately without reinstalling. 
The fresh-notebook test confirms the import no longer depends on where you started.**Exercise 4: Library-author disciplines**A discipline you do **not** need for code only you use: a stable public API with backwards compatibility, semantic versioning, and deprecation cycles. While you're the only user you can rename, re-signature, and restructure freely. A discipline you **should** adopt the moment a colleague imports your code: a stable interface — don't change function names or argument meanings out from under them without warning — and a documented public surface so they can use it without reading the implementation. The trigger for the switch is precisely "someone else now depends on this".**Exercise 5: What belongs in a package**Code that should stay in the notebook is exploratory, one-off, or presentation-specific: the narrative of a particular analysis, plots tailored to one report, throwaway checks. Code that has earned a place in a module is reusable logic — data cleaning, feature engineering, model training and evaluation — that you'll run more than once or in more than one place. The signal that logic has crossed the line is reuse: you've used it (or want to) a second time, you need to test it, or someone else needs it. "I'm about to copy this" is the clearest possible prompt to extract it instead.## Chapter 7: Testing stochastic code {#sec-answers-testing}**Exercise 1: Test a deterministic transform**Experiential. The instructive part is usually the edge cases: writing `test_on_empty_input` or `test_with_all_zeros` forces you to *decide* what the function should do in those situations — return an empty result, raise a clear error, propagate NaNs — when the original code never made that decision explicit. A test you can't write because you don't know the expected answer is a sign the contract is underspecified, which is a finding in itself.**Exercise 2: Make a stochastic function testable**Experiential. 
Having the function accept an explicit `rng` argument turns hidden global randomness into an injected dependency you control. The exact test then fixes the seed and asserts a specific result; the tolerance test asserts a statistical property — say, that the mean of many draws is within some band of the expected value. The tolerance should be *justified*: wide enough that it won't fail by chance (a few standard errors of the quantity you're checking), tight enough that a real defect would breach it. Stating why you chose the band is part of the answer.**Exercise 3: Test an invariant**Experiential. An invariant is a property that must hold for *every* input — preserved row count, no new missing values, output bounded in a range, mean zero after standardising. Checking it across many random inputs is property-based testing done by hand; `hypothesis` automates the input generation and, when a property fails, shrinks the counterexample to the smallest input that triggers it, which is often the fastest route to understanding the bug.**Exercise 4: Why `model.score(...) > 0.85` is a poor unit test**It conflates evaluation with testing. The assertion is really trying to answer "is the model good enough?", which is an evaluation question — answered on a continuum, against a baseline, and monitored over time — not a pass/fail property of the code. As a unit test it fails on three counts: it's fragile (it breaks the first time the data shifts, with no code defect), uninformative (a failure doesn't localise any bug), and slow. 
What you *should* test about the pipeline is the deterministic machinery around the model: that data validation rejects malformed input, that transforms produce the expected columns and leak nothing, that the pipeline runs end to end on a tiny sample, and that a saved model round-trips to identical predictions.**Exercise 5: Why a flaky test is worse than none**A test that fails one run in ten and is habitually re-run until green is *dishonestly noisy*: it trains the team to treat failures as background noise to be cleared by re-running, which is exactly the habit that lets a real failure slip through unnoticed. It also blocks or destabilises CI and erodes trust in the whole suite. No test is at least honestly silent; a flaky test actively degrades everyone's response to failure. Two fixes that keep the test: make it deterministic by fixing the seed so the stochastic element is pinned; or, if it's genuinely checking a statistical property, replace the brittle assertion with a principled tolerance (several standard errors wide) or an invariant that must always hold. Re-running until it passes, or deleting it, are the two non-answers.## Chapter 8: Debugging and profiling {#sec-answers-debugging}**Exercise 1: Read a traceback**Experiential. Read bottom-up: the final line is the exception type and message (*what* went wrong), and the deepest frame in your own code is *where* to look — third-party frames below it are usually just the machinery that surfaced your mistake. The fact you'd check first follows directly from those two (a `ZeroDivisionError` in a line dividing by `active_days` says: look for a zero). People are routinely surprised how often the traceback alone, read properly, fully explains the bug — the panic that makes us skim it is the real obstacle.**Exercise 2: Use a debugger instead of print**Experiential. 
The thing a debugger shows that a print does not is the *entire* live state at the moment of failure — every variable, not just the one you anticipated printing. That's how you spot the cause you weren't looking for: the column that's unexpectedly all zeros, the frame with the wrong shape, the value that's a string where you assumed a float. Print debugging can only show what you already suspected; the debugger shows what you didn't.**Exercise 3: Replace print with logging**Experiential. A reasonable mapping is `INFO` for milestones ("loaded N rows", "training complete"), `WARNING` for recoverable oddities ("clipped 12 negative values"), and `DEBUG` for fine detail. The payoff is the final step: flipping a single level setting switches between a quiet production run and a verbose diagnostic one without editing — or later removing — any of the statements, which is precisely what makes logging persist where scattered prints get deleted.**Exercise 4: Profile and fix**Experiential. The lesson lands hardest when the hot spot is *not* where you expected — the slow step is often an innocent-looking apply or a repeated recomputation, not the obviously heavy model fit. Fix the dominant one (vectorise the loop, cache the repeated work) and measure the change rather than assuming it helped. The discipline is to optimise what the profiler points at, and only once something is actually too slow.**Exercise 5: Four questions, four tools**- *A variable's current value* → `print` (or a quick inspect). Adequate, because the question is narrow and you already know what you want to see.- *The full state at a failure* → a debugger (`pdb` or an IDE). `print` can't show everything at once, and it forces you to guess in advance which variables will matter.- *What happened in a run you weren't watching* → `logging`. `print` has no levels, timestamps, or persistence, and you weren't there to read it scroll past.- *Where the time went* → a profiler. 
`print` can't attribute time, and hand-timed guesses are biased toward the parts you already suspect.

`print` is the right tool only for the first; for the other three it's a poor stand-in because it answers "what is this value now?" and nothing else — it cannot capture full state, persist a structured record, or measure performance.

## Chapter 9: Project structure {#sec-answers-project-structure}

**Exercise 1: Reorganise a flat project**

Experiential. The instructive discovery is usually something being *mutated in place* — a raw CSV edited to fix a typo, a column renamed in the source file, a row dropped by hand. Once raw data is read-only, that edit has to become a transformation step whose output lands in `data/interim/` or `data/processed/`, leaving the original untouched. Finding the in-place edit is finding the point where your work stopped being reproducible from source.

**Exercise 2: Write a README**

Experiential. Every question the colleague still has to ask is a gap in either the README or the structure, and the common ones are revealing: an undocumented environment variable, a data source that needs credentials or special access, a setup step that "everyone knows", or the order in which things must run. The exercise works precisely because you can't see your own assumptions — the colleague's questions surface them.

**Exercise 3: A single source of truth for paths**

Experiential. The original would break on a colleague's machine because an absolute path like `/Users/you/project/data.csv` simply doesn't exist there. Deriving every path from one project root (resolved from `__file__` or a config value) makes the project portable: it runs unchanged on a laptop, a server, or inside a container, because only the root differs.
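
As a rough illustration of deriving every path from one root, the sketch below assumes a hypothetical layout in which a `paths.py` module sits one directory below the project root (for example `src/paths.py`). The class name `ProjectPaths`, the helper `paths_from_file`, and the `data/raw/` directory are assumptions made for the example; only `data/interim/` and `data/processed/` come from the chapter.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ProjectPaths:
    """Every project path, derived from a single root (hypothetical layout)."""

    root: Path

    @property
    def raw(self) -> Path:
        # Assumed location of read-only source data.
        return self.root / "data" / "raw"

    @property
    def interim(self) -> Path:
        return self.root / "data" / "interim"

    @property
    def processed(self) -> Path:
        return self.root / "data" / "processed"


def paths_from_file(module_file: str) -> ProjectPaths:
    """Resolve the root as the directory two levels above `module_file`.

    Assumes the calling module lives at <root>/src/<module>.py; adjust the
    number of `.parent` hops if your layout differs.
    """
    return ProjectPaths(root=Path(module_file).resolve().parent.parent)


# Inside src/paths.py you would typically write: PATHS = paths_from_file(__file__)
example = paths_from_file("/srv/app/src/paths.py")
print(example.processed)
```

Nothing in the rest of the code needs to know where the project lives: moving it to another machine changes only what `root` resolves to.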
The hard-coded path is one of the most common reasons "it works on my machine" and nowhere else.

**Exercise 4: Layout versus schema**

The analogy **holds** in that both give each thing a known place, so a person or a tool can navigate without a guided tour — columns tell you where a variable lives, directories tell you where a kind of file lives. It **breaks down** in enforcement: a DataFrame's schema is enforced by the runtime (the wrong columns cause a failure), whereas a directory convention is enforced only by discipline and tooling — a project template, a linter, code review. The consequence is that structure must be *actively maintained*: it drifts the moment someone drops a stray file at the top level, where a schema would simply have refused.

**Exercise 5: When structure is overkill**

A genuinely one-off analysis — a quick answer for a meeting, a teaching example, a throwaway exploration — should stay a single notebook in a single folder; imposing `src/`, `tests/`, and a structured `data/` on it is pure overhead. The signal that it has earned the scaffolding is longevity and dependence: it will run again (on new data, or on a schedule), someone else needs to run or maintain it, it needs tests, or helper files are starting to accumulate at the top level. "This is going to outlive the week" is the trigger.

## Chapter 10: Data pipelines {#sec-answers-data-pipelines}

**Exercise 1: Break a monolith into stages**

Experiential. The payoff usually shows up as a stage that proves reusable in a context you hadn't anticipated — the cleaning function reused by a different analysis, or a feature transform reused at serving time. That unplanned reuse is exactly what the monolithic cell made impossible, because the useful part was welded to everything around it.

**Exercise 2: Add a validation gate**

Experiential.
The bad data a gate would have caught at the boundary includes an upstream schema change (a renamed or dropped column), an unexpected null where the next stage assumes completeness, an out-of-range value (negative spend, a date in the future), a duplicated key, or a target column leaking into the features. The point is *where* the failure surfaces: a gate turns a cryptic error three stages downstream into a precise message at the moment the bad data entered.

**Exercise 3: An idempotent, cached stage**

Experiential. Persist the stage's output to `data/interim/` and skip the computation when the artefact already exists; the second run should report the stage skipped. The lesson to carry forward is that a real orchestrator does this for you *and* invalidates the cache when a stage's inputs or code change — caching is only safe when staleness is handled, which is why "re-run if the inputs changed" is the rule, not "re-run never".

**Exercise 4: Workflow pipeline versus `sklearn` `Pipeline`**

The analogy **holds** in that both compose single-responsibility steps into a whole that moves together, and both enforce order so that, for example, preprocessing can't leak across a boundary. It **breaks down** in scope and mechanism: an `sklearn` `Pipeline` lives in one process and one `.fit()`/`.predict()` call, holding everything in memory, whereas a workflow pipeline spans processes, persisted artefacts, and schedules. The workflow pipeline therefore needs things the `sklearn` one doesn't — explicit intermediate storage, and an orchestrator that handles dependencies, caching, retries, and scheduling.

**Exercise 5: When a pipeline framework is overkill**

A single script or notebook is the right tool for a one-off analysis, or a workflow of one or two steps that you run interactively and watch.
The signal that it has outgrown this is when re-running everything becomes too costly or too risky: the workflow runs repeatedly or on a schedule, some stages are expensive enough that you want to re-run only what changed, several people or systems depend on intermediate outputs, or failures need to be isolated and retried rather than restarting from scratch. At that point the explicit stages and orchestration earn their keep.

## Chapter 11: Configuration and secrets {#sec-answers-config-secrets}

**Exercise 1: Lift hard-coded values into config**

Experiential. The values that turn out to differ between your machine and where the code really runs are the telling ones: absolute file paths almost always, plus database and table names, output locations, resource settings (number of workers), and debug flags. These are exactly the things configuration is for — the same logic, different values per environment — and finding them is finding everything implicitly tied to your laptop.

**Exercise 2: Move a secret out of the repository**

Experiential. Add `.env` to `.gitignore`, commit a `.env.example` template with placeholder values, and load the real secret from the environment. The reason moving it is not sufficient on its own, if it was ever committed, is that version control keeps history: the secret remains in past commits even after you delete it from the current files, so anyone with a clone still has it. It must be *rotated* — the credential changed at its source — not merely removed.

**Exercise 3: Typed, validated config**

Experiential. The contrast is the point: a bare dictionary lets a mistyped key (`raw_paht`) return a silent `None` that surfaces as a confusing failure much later, whereas a `pydantic` model with a constraint rejects a bad value the instant it loads, naming the offending field.
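
The contrast can be sketched with the standard library alone. The example below is not the `pydantic` model the exercise asks for: it swaps in a frozen `dataclass` with `__post_init__` checks as a dependency-free stand-in, so the error types and messages differ from what `pydantic` would produce. The names `TrainingConfig`, `load_config`, `raw_path`, and `test_size` are hypothetical.

```python
from dataclasses import dataclass

# A bare dictionary: a mistyped key is not an error, just a silent None.
raw_config = {"raw_path": "data/raw/customers.csv", "test_size": 0.2}
mistyped = raw_config.get("raw_paht")  # typo in the key; evaluates to None


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical config schema; the field names are illustrative only."""

    raw_path: str
    test_size: float

    def __post_init__(self) -> None:
        # Validate at load time and name the offending field in the message.
        if not isinstance(self.raw_path, str) or not self.raw_path:
            raise ValueError("raw_path: must be a non-empty string")
        if not 0.0 < self.test_size < 1.0:
            raise ValueError(
                f"test_size: must be strictly between 0 and 1, got {self.test_size}"
            )


def load_config(values: dict) -> TrainingConfig:
    """Build the typed config; unknown or missing keys raise TypeError."""
    return TrainingConfig(**values)


config = load_config(raw_config)
```

With this in place, `load_config({"raw_paht": "x", "test_size": 0.2})` fails immediately with a `TypeError` about the unexpected key, and a `test_size` of `1.5` raises a `ValueError` that names the field, instead of either problem surfacing several steps later.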
Feeding it an out-of-range value and getting an immediate, specific error is the behaviour you're buying.

**Exercise 4: Config versus a hyperparameter dictionary**

The analogy **holds** in that both pull the adjustable knobs out of the logic and into a single place you can change without editing code. It **breaks down** in scope: a hyperparameter dictionary is consumed once, in one process, to fit one model, whereas application configuration also selects *behaviour across environments* — which database to connect to, which bucket to write to, whether to run in debug mode. Config therefore carries a dimension the hyperparameter dict doesn't: the same schema with different values in development, staging, and production.

**Exercise 5: When hard-coding is acceptable**

A genuine constant — a value that is part of the logic and does not vary by run or environment, such as a mathematical constant or a truly fixed business rule — is perfectly fine as a *named* constant in the code. A value becomes configuration when it varies between environments, changes between runs, or is something you tune. The signal that a hard-coded value has become a liability is any of: you find yourself editing code to change it, it differs between dev and prod, or you can't tell from the code why it has the value it does. Secrets are the absolute case — always a liability when hard-coded, regardless of anything else.

## Chapter 12: API design {#sec-answers-api-design}

**Exercise 1: Wrap a model in an endpoint**

Experiential/applied. What the endpoint forces you to decide, and a notebook `predict` let you ignore, is the *contract*: the exact request format (field names, types, units), what the response contains (a label, a probability, both), and what counts as a valid input. In the notebook you built the feature array yourself and trusted it; the endpoint has to state, explicitly and in advance, what a caller must send and will receive.

**Exercise 2: Validation returns a clear 422**

Experiential.
Constraining the request fields means a malformed request is rejected at the door, before the model runs, with a message that says what was wrong — rather than reaching the model and producing a confidently wrong prediction from garbage, or crashing deep inside and returning an opaque `500`. The lesson is that validation converts an unpredictable internal failure into a precise, early, client-facing error.

**Exercise 3: Response schema and live docs**

Experiential. The documentation is generated *from* the schema and code (as OpenAPI), so it cannot drift out of sync — change a field and the docs change with it. This matters because callers integrate against the documentation: hand-written API docs inevitably fall behind the implementation, and documentation that lies is worse than none, because it sends integrators down paths that no longer work.

**Exercise 4: Endpoint versus `model.predict()`**

The analogy **holds** in that both take features and return a prediction, with the request and response schemas playing the role of the input and output. It **breaks down** because an endpoint receives input from untrusted strangers across a network, so it must validate that input, handle errors without leaking internals, and version itself so the model can change without breaking existing callers — none of which an in-notebook `predict` call, with its single trusting user, ever has to do. The endpoint handles the adversarial, multi-caller, evolving reality the notebook call is insulated from.

**Exercise 5: Batch versus real-time**

A scheduled batch job is the right mechanism when predictions are consumed in bulk on a known cadence and latency doesn't matter — nightly churn scores feeding a dashboard, weekly demand forecasts written to a table. A real-time API is necessary when predictions are needed on demand, one at a time, with low latency, in response to a user action or another system's request — a fraud check at checkout, a recommendation as a page loads.
The deciding property is *how and when the prediction is consumed*: in bulk on a schedule points to batch; on demand with a latency requirement points to an API.

## Chapter 13: Continuous integration {#sec-answers-ci}

**Exercise 1: Add a CI workflow**

Experiential. The payoff is the moment the status goes red on a deliberately broken test before you'd have noticed any other way — that's the days-to-minutes gap from the chapter's opening, closed. A workflow that installs the *locked* dependencies (rather than whatever the runner happens to have) is also what makes the run a faithful check rather than a coincidence of the runner's environment.

**Exercise 2: Lint and format gate**

Experiential. The first run typically flags the same mix Chapter 5 described — unused imports, unformatted files, the occasional shadowed name or undefined reference — and sorting them into "real defect" and "pure style" is the exercise. The style pile is exactly what the formatter fixes automatically, so once `ruff format` is in the gate it stops recurring, and the gate's signal becomes mostly about real problems.

**Exercise 3: Set up pre-commit**

Experiential. The hook stops the commit locally, before anything reaches CI, which is the point: the cheapest place to catch a formatting slip or a stray large file is before it's even recorded. Pre-commit and CI are complementary — the local hook gives instant feedback on the trivial things, and CI remains the authoritative shared gate that everyone's changes must pass.

**Exercise 4: CI versus model evaluation**

The analogy **holds** in the trigger: both say "something changed, so the previous verdict is no longer trustworthy — re-verify". You re-run a holdout evaluation after changing features; CI re-runs the checks after changing code. It **breaks down** in the verdict. A model evaluation yields a graded score you weigh with judgement (is 0.82 good enough?); CI yields a binary gate that is open or shut, with no "82% of the tests passed, ship it".
The same reflex produces a number in one case and a door in the other.

**Exercise 5: What belongs in CI**

Run on every change the checks that are *fast and deterministic*: unit tests, linting, type checks, and a small end-to-end smoke test on sample data. Push to an occasional or nightly job the things that are slow, expensive, or non-deterministic: full model training, integration tests against real external services, validation over large datasets, performance benchmarks. The principle is that every-push checks must be quick and reliable enough that developers never feel the urge to route around them — a gate that is slow or flaky gets disabled or ignored, at which point it protects nothing. Speed and trustworthiness are what make the gate worth having.

## Chapter 14: Containerisation {#sec-answers-containerisation}

**Exercise 1: Write a Dockerfile**

Experiential. A sound first Dockerfile starts from a slim base, installs the locked requirements, copies the code, and sets the run command; building and running it confirms the service starts in a clean, sealed environment rather than relying on anything on your machine. If it runs in the container but not on a colleague's bare machine, the container has done its job — it carried the environment with it.

**Exercise 2: Improve the image**

Experiential. Ordering the instructions so dependencies are installed before the code is copied means a code edit reuses the cached dependency layer instead of reinstalling everything — rebuilds drop from minutes to seconds. A slimmer base, a multi-stage build, or removing build tools afterwards shrinks the image substantially, often from north of a gigabyte to a few hundred megabytes. The exercise is to measure the before and after, because the gains are larger than most people expect.

**Exercise 3: Keep data and secrets out**

Experiential. Baking *data* into the image bloats every copy of it, ties the image to a single snapshot of the data, and can push sensitive records into registries.
Baking a *secret* in is the Chapter 11 mistake reincarnated: the credential ends up inside an artefact that gets pushed to registries and shared, so it leaks and must be rotated, not merely removed. The fix is to keep the image a generic definition of *how to run* — mount the data as a volume and inject the secret as an environment variable at run time.

**Exercise 4: Container versus lockfile**

The analogy **holds** in principle: both pin the things that affect behaviour so they can't drift between machines — the same "control your inputs" move. It **breaks down** in reach and form. A container pins what the lockfile cannot — the interpreter, the system libraries, and the operating system — sealing the whole tower. In exchange you give up transparency: a lockfile is a small text file you can read, diff, and review, whereas an image is an opaque binary blob you manage through versioning and a registry rather than by reading it.

**Exercise 5: When to containerise**

A container is clearly worth it whenever the code must run *reliably on a machine that isn't yours*: a deployed service, a job on a shared cluster, a pipeline that has to reproduce exactly across environments, or onboarding a team to one consistent setup. It's overkill for a one-off local analysis or an exploration only you will ever run on your own machine, where a virtual environment and a lockfile already give you everything you need. The deciding property is exactly that: does it need to run identically somewhere other than your machine? If yes, containerise; if it lives and dies on your laptop, the lockfile is enough.

## Chapter 15: Deployment {#sec-answers-deployment}

**Exercise 1: Deploy the service**

Experiential. The revealing part is what the platform makes you supply explicitly that your laptop quietly provided: the environment variables and secrets (Chapter 11), the port to expose, persistent storage for any data, the exact run command, and resource limits.
On your machine all of that was implicit context; deployment turns it into configuration you have to state, which is itself why the container and config work of the previous chapters pays off here.

**Exercise 2: Staging with pass criteria**

Experiential. The discipline is to run the *same* artefact as production with different configuration, and to decide what "passing staging" means *before* you look at the result — a latency ceiling, an error-rate ceiling, a smoke test that must pass. Deciding the bar in advance is what stops the very human temptation to wave a release through because it "seems fine", which is the operational version of moving the goalposts.

**Exercise 3: Schedule a batch job**

Experiential. The key behaviour is that a failed run is *detectable*: the job exits non-zero (Chapter 4), so the scheduler can alert, rather than failing silently and leaving you to discover days later that the table was never updated. A batch job that fails quietly is worse than one that fails loudly, because the missing output often looks just like stale-but-present output.

**Exercise 4: Staging versus a holdout**

The analogy **holds** in the principle: both are "try it somewhere safe, on conditions it wasn't built on, before it counts". It **breaks down** in what you measure and how you judge. A holdout is a fixed sample you score once for a single accuracy number; staging is an environment you run continuously, and "passing" it is a judgement about operational behaviour — latency, error rate, resource use under load — rather than one metric. Passing a holdout is a number clearing a bar; passing staging is a system behaving acceptably.

**Exercise 5: Batch versus always-on, and rollback**

A nightly churn-scoring job feeding a report is naturally batch; a real-time fraud-scoring API is naturally always-on. Their rollbacks differ accordingly.
For the batch job, a rollback is usually re-running the previous version (or simply retaining yesterday's output) — the failure is recoverable because nothing depended on the run in real time. For the online service, a rollback means switching live traffic back to the previous container version immediately — a blue-green flip or redeploying the prior tag — because every second on the bad version affects real requests. Batch buys you time; online demands a fast switch.

## Chapter 16: Monitoring and observability {#sec-answers-monitoring}

**Exercise 1: Logging and health**

Experiential. To investigate "a strange answer last Tuesday" you need, at minimum, the timestamped request *inputs*, the *prediction* returned, and the *model version* that served it — ideally tied together by a request ID so you can correlate across log lines. The common failure is logging only that "a prediction was made": that confirms the service ran but lets you reconstruct nothing. The test of your logging is whether you could replay last Tuesday's prediction from it alone.

**Exercise 2: A drift check**

Experiential. Store a reference sample from training, compare each live batch to it with a KS test or population stability index, and alert when the statistic crosses a threshold. As for which feature drifts first, the usual culprits are externally driven or time-sensitive ones — a monetary feature exposed to inflation, a feature fed by an upstream source that changes format or coverage, or anything seasonal — because the world moves those independently of your model.

**Exercise 3: A useful alert**

Experiential. Choose a condition and a threshold, then defend it against fatigue: set the threshold from observed normal variation rather than a round number, alert on a sustained breach rather than a single spike, deduplicate repeated firings, and route only actionable alerts to a human.
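
Three of those defences (a threshold taken from observed variation, a sustained-breach condition, and deduplication) can be sketched in a few lines. This is one possible shape, not a prescribed implementation: the function names, the three-standard-deviation rule, the default of three consecutive breaches, and the sample error rates are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def threshold_from_baseline(baseline: list[float], n_sigmas: float = 3.0) -> float:
    """Derive the alert threshold from observed normal variation."""
    return mean(baseline) + n_sigmas * stdev(baseline)


def sustained_breach_alerts(
    values: list[float], threshold: float, min_consecutive: int = 3
) -> list[int]:
    """Return the indices at which an alert fires.

    An alert fires once, when `min_consecutive` values in a row exceed the
    threshold, and does not fire again until the series has dropped back to
    or below the threshold (a simple form of deduplication).
    """
    alerts = []
    run = 0  # length of the current streak of breaching values
    for i, value in enumerate(values):
        if value > threshold:
            run += 1
            if run == min_consecutive:
                alerts.append(i)
        else:
            run = 0
    return alerts


# Made-up error rates from a quiet period, used only to set the threshold.
baseline = [0.010, 0.012, 0.011, 0.009, 0.013, 0.010, 0.011, 0.012]
threshold = threshold_from_baseline(baseline)

# One isolated spike (index 1), then a sustained breach (indices 3 to 6).
live = [0.011, 0.050, 0.012, 0.020, 0.021, 0.022, 0.023, 0.010, 0.030]
alerts = sustained_breach_alerts(live, threshold)
```

On this sample the isolated spike is ignored and the sustained breach raises a single alert at its third consecutive reading, rather than one per breaching value.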
This is the flaky-test lesson from Chapter 7 transplanted to operations — an alert that cries wolf trains the team to ignore it, and an ignored alert protects nothing, including on the day it's right.

**Exercise 4: Monitoring versus validation**

The analogy **holds** in that both check the model against data it didn't train on — monitoring simply continues that check indefinitely rather than once. It **breaks down** over labels: a holdout has ground truth, so you measure accuracy directly, whereas in production the labels usually lag (you learn who actually churned months later) or never arrive. So you generally *cannot* measure live accuracy as directly as holdout accuracy, and you fall back on input and prediction drift as proxies that hint at trouble without confirming it.

**Exercise 5: Data drift versus concept drift**

Data drift is a change in the input distribution P(X): the feature values coming in look different from training — for example, a new customer demographic produces values the model rarely saw. Concept drift is a change in the relationship P(Y|X): the same inputs now map to different outcomes — for example, fraudsters change tactics, so transaction features that meant "safe" last year no longer do. Detecting input drift without labels warns you the model is operating on unfamiliar data, where its learned assumptions may no longer hold — but it can't confirm a real accuracy drop, because the model might still perform well on the shifted inputs, or the relationship might have changed while the inputs looked unchanged. Only ground-truth labels, when they eventually arrive, can confirm the model has genuinely become less accurate.

## Chapter 17: Code review {#sec-answers-code-review}

**Exercise 1: A small, focused pull request**

Experiential.
The thing to notice is *why* a small, well-described PR is easy to follow: the reviewer can hold the whole change in their head at once, and a description of what-and-why spares them reconstructing your intent from the diff. The contrast with a sprawling change is the lesson — reviewability is mostly a property of size and framing, not of how clever the code is.

**Exercise 2: Review someone else's PR**

Experiential. Most people find that *finding* issues is easier than *phrasing* them well. The discipline is to mark each comment as blocking or a suggestion (so the author knows what must change), to keep it about the code rather than the person, and to say why — because the reasoning is what teaches and what makes the comment land as help rather than criticism.

**Exercise 3: A bug that automated checks miss**

A leak, a wrong metric, or an off-by-one in a split passes the linter and the tests because it is syntactically valid and the tests only check what the author already thought to check; a linter inspects *form*, never domain correctness. Catching it needs a reviewer who reads the logic and the data flow with domain knowledge — someone who knows the scaler must be fit on training data only, or that this metric is wrong for an imbalanced problem. That is exactly the attention automation cannot provide and review exists to supply.

**Exercise 4: Code review versus peer review of a paper**

The analogy **holds**: both are a knowledgeable peer examining the reasoning and method before the work counts, catching what the author is too close to see. It **breaks down** in cadence and size — peer review of a paper is rare, heavy, and large (a whole study, reviewed once), whereas code review is frequent, light, and small (one change, many times a week).
The implication is to submit work for review in small, frequent pieces rather than saving up a quarter's work for one enormous request that can only be rubber-stamped.

**Exercise 5: A data science review checklist**

Items worth adding that a general checklist wouldn't have: does it leak (preprocessing fit on test data, a target in the features)? are the data assumptions stated and checked? is it reproducible (seed and config captured)? is the metric appropriate to the problem? are there hard-coded secrets? What to leave off is style and formatting — not because it doesn't matter, but because the formatter and linter settle it automatically (Chapter 5). Leaving it off makes reviews *better*, because litigating whitespace in comments consumes the human attention that should go to logic and trains a team to nitpick form instead of reasoning about correctness.

## Chapter 18: Documentation {#sec-answers-documentation}

**Exercise 1: Write a README**

Experiential. The point at which your reader first gets stuck is the most valuable output: it's almost always an undocumented environment variable, a data source that needs access you forgot to mention, or a setup step so habitual you didn't know you were doing it. Timing the run from clone to running surfaces the assumptions you can't see precisely because they're yours.

**Exercise 2: Add docstrings**

Experiential. The test is whether `help()` on a function tells a reader enough to *use* it without reading the body. If it doesn't, the docstring is missing part of the contract — usually the parameters, the return value, the exceptions it raises, or a worked example. A docstring that only restates the function name has documented nothing.

**Exercise 3: Write a model card**

Experiential. The hardest section is almost always "known limitations / where it should not be used", because it forces you to articulate the model's failure modes and the populations it was *not* validated on — exactly the questions exploratory work leaves implicit.
That difficulty is itself informative: where the model card is hard to write is where your understanding of the model's boundaries is thinnest, and therefore where the risk lives.

**Exercise 4: Classify documents with Diátaxis**

A docstring is *reference* (look up what a function takes and returns). A tutorial notebook is a *tutorial* (learning while being led by the hand). A model card is mostly *reference* plus *explanation* (facts about the model, and the why behind its limits). A README is a deliberate blend — at its best a brief *tutorial/how-to* that orients a newcomer and points onward to the rest. Mixing the jobs makes a document worse because a reader arrives with one need — to learn, to look up, or to understand — and a document trying to serve two serves neither: a reference padded with teaching is slow to search, and a tutorial listing every option is impossible to follow.

**Exercise 5: Keeping documentation in sync**

Two structural practices: co-locate documentation with the code (docstrings), so a change to the code sits right beside the text that describes it; and generate reference documentation from the code and make examples executable (doctest, or a tested snippet), so a changed signature or a stale example becomes a build or test *failure* rather than a silent lie. "Remember to update the docs" is not a third practice because it relies on human discipline under deadline pressure with no feedback when it's forgotten — the documentation rots quietly and you only discover it when it has already misled someone. Structural defences make drift either impossible or loud; a reminder makes it neither.

## Chapter 19: Technical debt {#sec-answers-technical-debt}

**Exercise 1: Audit a project for debt**

Experiential. Sorting each item into deliberate (you knew you were cutting the corner) and inadvertent (you've only just noticed) is the instructive part, and the inadvertent pile is usually the larger and more alarming one.
The item that surprises people most is almost always a piece of "temporary" code — a hard-coded value, a quick script — that turned out to be load-bearing and has been quietly holding production together for months.

**Exercise 2: The boy-scout rule**

Experiential. Paying down one item while you're already in the file — adding a test, extracting a function, naming a constant — is typically quick relative to the change you came to make, and that's exactly the point: opportunistic repayment is cheap because you've already paid the cost of understanding the code. Debt repaid this way never has to be scheduled.

**Exercise 3: A debt log**

Experiential. A debt item is worth writing down, rather than fixing on the spot, when the fix is larger than the time you have, when the code might be discarded anyway, or when stopping to repay it now would derail the task in hand — but you still want it *visible* so it isn't silently forgotten. Trivial fixes don't go in the log; they go in the boy-scout pass. The log's whole purpose is to make deferred debt a deliberate, tracked decision rather than a thing you rediscover at 3am.

**Exercise 4: Debt versus financial debt**

The analogy **holds**: a shortcut borrows time now and charges interest later, in that every future change to that code is slower and riskier. It **breaks down** in the shape of the interest. A loan has a known rate and a schedule, so you can plan around it; technical debt's interest is unpredictable and lumpy — it costs nothing until the day you have to touch the code, then it can cost an enormous amount at once. With no monthly statement to remind you it exists, it's easy to defer indefinitely, which is precisely why it accumulates.

**Exercise 5: When debt is the right call**

A shortcut is the correct decision for code with a short or uncertain life — a prototype that may be discarded, a hypothesis you're testing, a genuine deadline where shipping now matters more than polish.
It's reckless when taken in code you already know will be load-bearing, when taken without recording it, or when the resulting failure would be silent and high-consequence. The distinguishing property is the code's expected lifetime and criticality, combined with whether the debt is acknowledged: debt on disposable, low-stakes code is a tool; unrecorded debt on code others will depend on is a liability waiting to come due.

## Chapter 20: Cross-discipline collaboration {#sec-answers-cross-discipline}

**Exercise 1: Map the vocabulary gaps**

Experiential. The classic three: *test* (a data scientist means evaluation metrics; an engineer means pass/fail assertions on code), *validation* (DS: holding out data to measure generalisation; SE: checking inputs against a schema), and *model* (DS: a learned predictive function; SE: an abstraction of a domain, like a class diagram). The gap that has usually caused a real misunderstanding is "is it tested/validated?" — where both parties said yes, meaning entirely different things, and discovered the mismatch only later.

**Exercise 2: Write a handoff document**

Experiential. The revealing part is what you find yourself making explicit for the first time: the failure modes, the edge cases of the input contract, the *caveats* on performance (where the model is weak, the populations it wasn't validated on), how it should be monitored, and who owns it when it misbehaves. All of that typically lived only in your head, which is exactly why the handoff is where things go wrong.

**Exercise 3: An interface as a contract**

Experiential. Agreeing the schema, latency budget, and bad-input behaviour in advance is cheaper because the alternative — discovering in production that you returned a label where the service expected a probability, or the wrong units, or an unhandled null — is an incident with real cost and a paging at a bad hour.
The contract converts an integration surprise into a build-time check: a conversation now versus an outage later.

**Exercise 4: Team interface versus data contract**

The analogy **holds** in that both define exactly what crosses the boundary, so neither side has to guess or reverse-engineer the other. It **breaks down** because a data contract is enforced by code — the schema validates or the request fails — whereas a team contract also carries a human dimension no schema captures: the shared understanding of intent, of what "good enough" means for this use, and of who is responsible when the model misbehaves. The schema pins the bytes; it cannot pin the agreement about ownership and intent, and that is where collaboration actually succeeds or fails.

**Exercise 5: Two rigours pulling opposite ways**

A clear case: a model that needs frequent retraining and experimentation (the data scientist wants fast, loose iteration) while it serves live production traffic (the engineer wants stability, tests, and controlled releases). The instincts genuinely conflict. A team holding both resolves it not by one side winning but by engineering the *boundary* tightly so that exploration can stay loose safely — an automated retraining pipeline with validation gates and canary releases, behind a stable contract and CI the engineer trusts, so the data scientist iterates freely without putting production at risk. The general principle is that the engineering instinct and the data science instinct are reconciled by tightly engineering the interface so that what happens behind it can remain appropriately loose.

## Chapter 21: Notebook to production API {#sec-answers-notebook-to-api}

**Exercise 1: Carry a model one stage further**

Experiential. For most readers the next stage is extracting the feature and training logic into an importable module of pure functions.
What you have to change to make it importable is everything that tied the code to the notebook: replace reliance on notebook globals with explicit function arguments (Chapter 6), separate the logic from the cell that happened to run it, and give each function a clear input and return. The pure-function discipline from Chapter 6 is precisely what makes the code importable — a function that depends on whatever is in the kernel can't be lifted out of it.

**Exercise 2: The train–serve safeguard**

Experiential. The test feeds a single raw record through both the training feature path and the serving feature path and asserts the resulting features are identical. It is worth more than a test of the model's accuracy because train–serve skew is a *silent, high-impact* bug: if serving computes a feature even slightly differently from training, the model receives inputs unlike anything it learned on and degrades quietly, with nothing failing. That has a definite right answer a test can pin, whereas accuracy is a moving statistical quantity that doesn't belong in a pass/fail gate (Chapter 7). The skew test catches a real deployment defect; an accuracy assertion catches noise.

**Exercise 3: Wrap the model in an API**

Experiential. What the endpoint forces you to make explicit, and the notebook let you leave implicit, is the contract: the exact request schema (field names, types, and valid ranges), what the response carries (a probability rather than a label, plus a `model_version` for traceability), and what counts as a valid input.
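
A minimal sketch of that contract, using only the standard library so it runs anywhere. The field names (`age`, `monthly_spend`), their ranges, and the `MODEL_VERSION` string are hypothetical, chosen purely for illustration. A real service would more likely declare the schema with a validation library and let the web framework produce the 422 response.

```python
from __future__ import annotations

from dataclasses import dataclass

MODEL_VERSION = "example-v1"  # hypothetical version label


@dataclass(frozen=True)
class PredictionRequest:
    """What a caller must send (hypothetical fields)."""

    age: int
    monthly_spend: float


def parse_request(payload: dict) -> PredictionRequest:
    """Validate a raw payload against the contract, or raise ValueError."""
    expected = {"age", "monthly_spend"}
    if set(payload) != expected:
        raise ValueError(f"expected fields {sorted(expected)}, got {sorted(payload)}")
    age, spend = payload["age"], payload["monthly_spend"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(age, bool) or not isinstance(age, int) or not 18 <= age <= 120:
        raise ValueError("age must be an integer between 18 and 120")
    if isinstance(spend, bool) or not isinstance(spend, (int, float)) or spend < 0:
        raise ValueError("monthly_spend must be a non-negative number")
    return PredictionRequest(age=age, monthly_spend=float(spend))


def handle(payload: dict, predict_proba) -> tuple[int, dict]:
    """Return (status_code, body); a malformed request maps to 422."""
    try:
        request = parse_request(payload)
    except ValueError as err:
        return 422, {"detail": str(err)}
    # The response carries a probability, not a label, plus the model version.
    probability = float(predict_proba(request))
    return 200, {"probability": probability, "model_version": MODEL_VERSION}
```

Passing `predict_proba` as an argument keeps the sketch independent of any particular model: in a test it can be a stub, in the service it is the loaded model's scoring function.
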
In the notebook you built the feature array yourself and trusted it; the endpoint has to state, in advance, what a caller must send and will receive — and the malformed request returning a clean 422 is that contract doing its job.

**Exercise 4: Productionising versus publication**

The analogy **holds** in that both turn an exploratory finding into a rigorous, reproducible, reviewable artefact others can rely on — the package is the methods section, the lockfile and config are the reproducibility statement, the tests are the peer review you run on yourself. It **breaks down** because a paper, once published, is finished and frozen, whereas a deployed service runs continuously against data that keeps changing, so it needs something a paper never does: monitoring, to tell you when the result it embodies has stopped being true. That is the subject of Chapter 16, *Monitoring and observability*.

**Exercise 5: How far to walk the route**

A throwaway analysis should stop at the notebook (version-controlled at most); an internal tool typically warrants a package, a few tests, and externalised config, but rarely a full container-and-CD pipeline; a model real users depend on needs the whole route — package, tests, API, container, CI, deployment, and monitoring. The signal to go further is always the same kind of thing: someone else needs to run it, it runs repeatedly or unattended, real decisions now depend on its output, or it must reproduce exactly. "It has outlived its expected life, or someone now depends on it" is the trigger to take it one stage further down the path.

## Chapter 22: Reproducible research pipeline {#sec-answers-reproducible-pipeline}

**Exercise 1: One command, from raw data to result**

Experiential.
The valuable discovery is the hidden dependency the `Makefile` flushes out — the step that only worked because of something on *your* machine: a file in your home directory, a package you installed once and forgot, an environment variable set months ago, or a manual "and then I clicked export" step. Declaring every stage's inputs and outputs forces those implicit dependencies into the open, which is exactly why a one-command rebuild is a stronger guarantee than "it ran when I did it".

**Exercise 2: Version a dataset**

Experiential. Before versioning, reconstructing the exact data behind an old result was usually impossible because the raw file had been overwritten or changed in place, with no record of which version produced the figure. DVC (or, at a minimum, a dated immutable copy plus a checksum committed alongside the code) makes the input recoverable, so checking out the commit behind a result also restores the dataset that produced it — the missing fourth input from the chapter.

**Exercise 3: Generate the number, don't paste it**

Experiential. A hand-pasted number becomes wrong because nothing updates it when the data or code changes — it is a snapshot frozen at the moment you copied it, with no link back to its source, so the day the analysis changes the figure in the slide silently disagrees with the figure in the code. A generated number is recomputed from the data every time the report is rendered, so it cannot drift away from the result it claims to report; the worst that can happen is the build fails, which is loud rather than silent.

**Exercise 4: Data versioning versus a seed**

The analogy **holds** in that both pin an input so the result is repeatable — the seed pins the randomness, data versioning pins the dataset, and in each case you're removing a way the result could change without your intending it.
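
Both kinds of pinning can be sketched with the standard library. This is an illustration under stated assumptions, not a description of how DVC works internally: the file name and contents are made up, `noisy_mean` is a hypothetical stand-in for any randomised analysis step, and a SHA-256 digest stands in for the small pointer you would commit alongside the code.

```python
from __future__ import annotations

import hashlib
import random
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of a file's bytes: a small, committable fingerprint."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def noisy_mean(values: list[float], seed: int) -> float:
    """A result that depends on randomness, pinned by the seed."""
    rng = random.Random(seed)
    return sum(v + rng.gauss(0, 1) for v in values) / len(values)


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "raw.csv"  # hypothetical dataset
    data_file.write_text("spend\n50\n150\n250\n")
    pinned_digest = sha256_of(data_file)  # recorded alongside the code

    # Same seed, same result: the randomness is pinned.
    first = noisy_mean([50, 150, 250], seed=42)
    second = noisy_mean([50, 150, 250], seed=42)
    same_seed_repeats = first == second

    # Overwriting the file in place changes the digest, so the change is detectable.
    data_file.write_text("spend\n50\n150\n999\n")
    data_changed = sha256_of(data_file) != pinned_digest
```

Note that the digest only detects that the data changed; recovering the old version still requires that a copy was stored somewhere. That covers the half of the analogy that holds.
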
It **breaks down** in mechanism: a seed is a single integer you drop into the code, whereas data is large and lives outside the code, so it can't be a line in a script. It needs its own external storage and a small, committable pointer — which is precisely what DVC provides. The thing being pinned differs in size and location, so the handle must differ too.

**Exercise 5: What a notebook doesn't pin**

A single notebook, even under version control, pins the *code* but not the other three inputs. It does not pin the **environment** — the packages it imports are whatever happens to be installed in the kernel, so a colleague with a different pandas version can get a different result from identical code. It does not pin the **data** — it reads whatever the file path points to, and that file can be overwritten or updated without the notebook changing at all. (And it pins **randomness** only if you remembered to set seeds.) "It's all in one notebook" addresses code organisation, not reproducibility: the notebook is necessary but nowhere near sufficient, because the result depends on three things living entirely outside it.

## Chapter 23: MLOps pipeline {#sec-answers-mlops}

**Exercise 1: Sketch the loop**

Experiential. Naming each stage for your own model — the training pipeline, the registry, the deployment, the monitoring signal, the retraining trigger — usually reveals that the missing or manual stage is the *return arrow*: most teams have a way to train and a way to deploy, but monitoring is thin and retraining is ad hoc, done when someone happens to notice a problem. Automating it means adding drift monitoring that emits a signal, a triggered training pipeline, and a promotion gate — closing the loop so the cycle runs on a signal rather than on someone's memory.

**Exercise 2: The retraining trigger**

Experiential.
The false-alarm rate matters because a trigger that fires too often is the flaky test of MLOps (Chapters 7 and 16): each false alarm causes a needless retrain, which costs compute and — worse — risks promoting a model trained on a blip. A trigger becomes more trouble than it's worth once its false positives are frequent enough that the team disables it or ignores its output, at which point it protects nothing. The defences are the same as for alerts: set the threshold from observed normal variation rather than a round number, and require sustained drift rather than a single noisy batch.

**Exercise 3: The promotion gate**

Experiential. The comparison must use the *same* evaluation data because scoring two models on different datasets confounds "the candidate is better" with "the candidate's test set was easier" — you could not tell skill from luck of the draw. And you require a *margin* rather than strict improvement because a tiny difference in a metric like AUC is within its own run-to-run and sampling variability; promoting on a hair's-breadth win means swapping the production model on noise, which adds risk and churn for no real gain. The candidate should have to beat the incumbent by more than the metric's own wobble before it earns promotion.

**Exercise 4: MLOps loop versus the experiment–iterate cycle**

The analogy **holds**: both are a cycle of train, evaluate, adjust, and retrain. It **breaks down** in who drives it and what it optimises for. Your exploratory loop optimises for discovery, and you are inside it applying judgement at every turn; the production loop optimises for staying current and must run with you out of it most of the time.
So the judgement you'd apply by eye during exploration — is this drifting enough to act on, is this new model actually better — has to be made *explicit* in the production loop, as a drift threshold and a promotion gate, because nobody is watching each iteration.

**Exercise 5: The weakest practice**

Take rollback (Chapter 15). Without it, the loop's promotion gate is a one-way door: the moment a candidate is promoted — perhaps trained on a corrupted batch, or scoring well on a test set that didn't catch a regression — it serves production traffic with no fast way back, and an automated loop that can promote but not un-promote has merely automated the act of shipping a bad model. The same argument lands on any link: without reproducibility (Chapter 22) you can't retrain to a comparable result, so the candidate can't be trusted or traced; without monitoring (Chapter 16) nothing triggers the loop and the model decays in silence; without testing (Chapter 7) a broken transform propagates into every retrain. The cycle only runs safely if every one of these holds, which is why automating it is the last thing you do, not the first.
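
The judgement calls that this chapter's answers say must be made explicit can be sketched in a few lines. This is an illustration, not a production design: the function names, the in-memory `Registry`, and every threshold and margin value are hypothetical, and the drift scores are assumed to come from whatever monitoring signal you already have.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def should_retrain(drift_scores: list[float], threshold: float, sustained: int) -> bool:
    """Fire only when the last `sustained` batches all exceed the threshold."""
    if sustained < 1:
        raise ValueError("sustained must be at least 1")
    recent = drift_scores[-sustained:]
    return len(recent) == sustained and all(score > threshold for score in recent)


def should_promote(candidate_auc: float, incumbent_auc: float, margin: float) -> bool:
    """Promote only if the candidate wins by more than the chosen margin.

    Both scores are assumed to come from the same evaluation data.
    """
    return candidate_auc - incumbent_auc > margin


@dataclass
class Registry:
    """Hypothetical in-memory registry that remembers what it replaced."""

    live: str
    history: list[str] = field(default_factory=list)

    def promote(self, version: str) -> None:
        self.history.append(self.live)
        self.live = version

    def rollback(self) -> None:
        if not self.history:
            raise RuntimeError("nothing to roll back to")
        self.live = self.history.pop()


# A single noisy batch does not trigger a retrain; sustained drift does.
single_spike = should_retrain([0.05, 0.05, 0.90], threshold=0.3, sustained=3)
sustained_drift = should_retrain([0.05, 0.31, 0.33, 0.35], threshold=0.3, sustained=3)

# A hair's-breadth win is not enough to replace the incumbent.
narrow_win = should_promote(0.842, 0.840, margin=0.005)
clear_win = should_promote(0.860, 0.840, margin=0.005)
```

Keeping the previous version in the registry is what turns the promotion gate from a one-way door into a reversible step.
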