---
# Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md
title: "Documentation"
---
## The README that says "run the notebook" {#sec-run-the-notebook}
You inherit a project from someone who has left. The README, if there is one, says "run `analysis.ipynb`". There are no docstrings. The data comes from "the usual place". The model in `models/` was trained by a script nobody can find, with hyperparameters nobody recorded. Everything needed to use this work lived in one person's head, and that person is gone.
Documentation is what stands between a project that someone else can pick up and one that dies with its author. For data scientists this is a particularly acute risk, because the notebook *feels* self-documenting — it has the code, the outputs, even prose between the cells — right up until you're not there to narrate it. This chapter is about the documentation that actually survives the handover: not more of it, but the right kinds, kept honest.
## Different documents for different questions {#sec-diataxis}
The reason so much documentation is bad is that it tries to do several incompatible jobs at once. The Diátaxis framework [@procida_diataxis] — the structure this book itself follows — separates documentation by what the reader needs at the moment they're reading. A *tutorial* teaches a newcomer by the hand. A *how-to guide* walks an already-competent reader through a specific task. *Reference* lets someone look up a precise fact — what this function takes, what that column means. *Explanation* gives the why behind a design. A reference page that tries to teach is bloated; a tutorial that lists every option is unfollowable. Knowing which kind you're writing is most of the battle.
::: {.callout-note}
## Data Science Bridge
You already produce documentation — you just don't always call it that. A *model card* is documentation: a statement of what a model does, the data it was trained on, how it was evaluated, its known limitations, and where it should and shouldn't be used. A *data dictionary* is reference documentation: each column with its type, meaning, and units. The shift is to recognise these as documentation, with the discipline that implies — kept current, written for a reader, treated as part of the deliverable — rather than as one-off artefacts you make when asked.
Where the analogy breaks down: a model card has sections that general software documentation doesn't — intended use, the populations the model was and wasn't validated on, fairness and ethical considerations — because a model carries risks that ordinary code doesn't. Documenting a model is documenting a thing that makes consequential decisions about people, so it answers questions ("who might this harm, and where is it not safe to use?") that a function's reference page never has to.
:::
## Docstrings: documentation that lives in the code {#sec-docstrings}
The documentation least likely to rot is the documentation closest to the code, and nothing is closer than a docstring. A docstring sits inside the function it describes, is accessible at runtime and in every editor, and travels with the code wherever it goes:
```{python}
#| label: docstring-as-documentation
#| echo: true
import inspect
def spend_per_active_day(spend: float, active_days: int) -> float:
"""Average daily spend over a customer's active period.
Parameters
----------
spend : float
Total spend over the period, in pounds.
active_days : int
Number of days the customer was active; must be positive.
Returns
-------
float
Spend divided by active days.
Examples
--------
>>> spend_per_active_day(100.0, 4)
25.0
"""
return spend / active_days
# The docstring is live documentation — available wherever the code is.
print(inspect.getdoc(spend_per_active_day).splitlines()[0])
print(f"example checks out: {spend_per_active_day(100.0, 4)}")
```
The docstring is retrievable at runtime (`inspect.getdoc`, or `help()`), pops up in the IDE as you type the call, and — because it states the contract right where the contract lives — is far harder to leave out of date than a separate document. A consistent convention (the NumPy style shown here, or Google's) covers what the function does, its parameters, what it returns, what it raises, and an example.
## Documentation that stays true {#sec-docs-stay-true}
The chronic disease of documentation is rot: prose that confidently describes code which has since changed. The defences are structural, not motivational. Keep documentation as close to the code as possible, so a change to one prompts a change to the other — docstrings over separate documents. Generate reference documentation *from* the code (tools like MkDocs, Sphinx, and Quarto read your docstrings), so the reference cannot describe a function signature that no longer exists. And make examples executable — a `doctest` or a tested snippet turns a stale example into a failing test, so the example in the docstring above can't silently lie. At the project level, the highest-leverage document of all is a README that covers the few things a newcomer most needs: what this is, how to set up the environment, how to run it, and where the data comes from (the orientation promised back in Chapter 9).
::: {.callout-tip}
## Author's Note
Documentation feels like pure overhead in the moment. It's writing *about* the work instead of doing the work, for a reader who is hypothetical and may never come. The data science instinct sharpens this: the notebook, with its inline outputs and prose, seems to document itself, so why write more? The answer is that the notebook documents itself only while you're standing next to it to explain which cell to skip and what the magic number meant.
The reframe is that every document is a letter to a specific future reader — and that reader is, more often than not, you, six months from now, with no memory of today's context. Failing that, it's the colleague who inherits the project when you move on. Neither can ask you. Writing the README is the difference between your work becoming a foundation others build on and a dead end they quietly rewrite from scratch because they couldn't understand it. And of all the documentation you could write, the README earns its keep fastest: the single page that takes a newcomer from `git clone` to a running system in minutes is worth more than any volume of prose nobody can find.
:::
## Summary {#sec-documentation-summary}
Documentation is what lets work outlive the person who wrote it:
1. **Different documents answer different questions.** Tutorials teach, how-to guides walk through a task, reference lets you look things up, explanation gives the why — and mixing them is why documentation fails.
2. **You already write documentation.** A model card and a data dictionary are documentation; treat them as such, with the discipline of being kept current and written for a reader.
3. **Docstrings live in the code.** Co-located with the function, accessible at runtime and in the IDE, they're the documentation least likely to drift — write them to a consistent convention.
4. **Fight rot structurally.** Keep docs near the code, generate reference from the code, and make examples executable — and prioritise the README, the cheapest, highest-leverage document there is.
The next chapter confronts what accumulates when these practices are skipped under deadline — and how to manage it: *technical debt*.
## Exercises {#sec-documentation-exercises}
1. Write or rewrite the README for one of your projects so that a newcomer can go from `git clone` to a running system in minutes — what it does, how to set up the environment, how to run it, where the data comes from. Hand it to someone unfamiliar and note the first point at which they get stuck.
2. Add proper docstrings to the public functions of one of your modules — what it does, parameters, returns, and an example. Then call `help()` on one and ask whether the rendered documentation is enough to *use* the function without reading its body.
3. Write a model card for a model you've built: intended use, training data, evaluation, known limitations, and where it should not be used. Which section was hardest to write, and what does that difficulty tell you about the model?
4. **Conceptual:** Using the Diátaxis categories (tutorial, how-to, reference, explanation), classify a README, a docstring, a tutorial notebook, and a model card. Where does each fit, and why does mixing two of these jobs into one document make it worse at both?
5. **Conceptual:** Documentation rots. Describe two practices that structurally keep documentation in sync with the code, and explain why "remember to update the docs" is not one of them.