5 Readable code

5.1 The code you’ll have to read again

The first chapter listed df2, tmp, and final_final as a gift to no one. It’s worth dwelling on why, because readability is the engineering practice whose payoff is least obvious and whose neglect carries the highest hidden cost. Unreadable code isn’t a tidiness problem; it’s a change problem. You can only safely modify code you understand, and the person who most often has to understand your code is you, six months from now, with no memory of why df2 was different from df or what 0.73 meant.

Code is read far more often than it is written. A cell you wrote once will be read a dozen times — when you debug it, when you extend it, when a colleague reviews it, when you copy it into the next project. Every one of those readings pays a tax if the code is cryptic, and the tax compounds. Readability is the practice of paying that cost down once, at writing time, so that every later reading is cheap.

This is not about aesthetics or pleasing a linter. It’s about making the logic visible, so that understanding the code doesn’t require reconstructing the mental state you were in when you wrote it.

5.2 Names are the cheapest documentation

The single highest-leverage readability habit is naming things for what they are. A variable called high_value_customers tells the reader what it holds; df2 tells them only that there was a df1. A function called filter_high_value announces its job; proc announces nothing. Good names turn code into something close to prose: high_value = filter_high_value(customers, threshold=200) reads as a sentence, and needs no comment.

The same applies to the magic numbers that litter exploratory code. The MAGIC_NUMBER = 0.73 from Chapter 1 — a threshold tuned by hand during some forgotten sprint — is unreadable not because 0.73 is wrong but because the number carries no meaning at the point it’s used. A named constant, CHURN_PROBABILITY_THRESHOLD = 0.73, says what the number is, and gives you one place to change it. Naming is documentation that can’t fall out of date, because it lives in the code rather than alongside it.
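
As a minimal sketch of the difference, the snippet below filters the same rows twice, once with a bare number and once with a named constant. The scores table, its column names, and its values are invented for illustration; only the constant's name and value come from the text.

```python
import pandas as pd

# Hypothetical data: the columns and values are assumptions for this sketch.
scores = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "churn_probability": [0.12, 0.73, 0.81, 0.40],
})

# Before: the reader has to guess what 0.73 is and what df2 holds.
df2 = scores[scores["churn_probability"] > 0.73]

# After: the constant names the number, and the variable names the result.
CHURN_PROBABILITY_THRESHOLD = 0.73
likely_churners = scores[
    scores["churn_probability"] > CHURN_PROBABILITY_THRESHOLD
]

print(likely_churners["customer_id"].tolist())
```

Both filters select the same rows. The second also gives you a single place to change the threshold if it is ever retuned.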

5.3 Functions that do one thing

The 400-line cell from Chapter 1 — load, clean, engineer features, train, plot — is unreadable for a structural reason as much as a naming one: it does five things, so to understand any one of them you have to read all five. Breaking it into functions, each doing a single job named for that job, lets a reader understand the shape of the whole from the function names alone, and dive into the detail only where they need to.
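
One way such a split might look is sketched below for the first three of the five jobs (training and plotting are left out to keep it short). The function names, the column names, and the tiny in-memory dataset are all assumptions made for illustration, not the Chapter 1 code.

```python
import numpy as np
import pandas as pd


def load_customers() -> pd.DataFrame:
    """Stand-in for a real loader; returns a tiny hypothetical dataset."""
    return pd.DataFrame({
        "spend": [120.0, 300.0, None, 450.0],
        "active_days": [30, 0, 12, 90],
    })


def drop_missing_spend(customers: pd.DataFrame) -> pd.DataFrame:
    """Remove rows where spend was not recorded."""
    return customers.dropna(subset=["spend"])


def add_spend_per_active_day(customers: pd.DataFrame) -> pd.DataFrame:
    """Add daily spend; zero active days become NaN instead of dividing by zero."""
    active_days = customers["active_days"].replace(0, np.nan)
    return customers.assign(
        spend_per_active_day=customers["spend"] / active_days
    )


def build_features() -> pd.DataFrame:
    """Top level: each line names one step, in the order it happens."""
    customers = load_customers()
    customers = drop_missing_spend(customers)
    return add_spend_per_active_day(customers)


features = build_features()
print(features)
```

Reading build_features alone gives the outline of the work. A reader who only cares about how missing spend is handled can open that one function and ignore the rest.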

A useful discipline is to keep the nesting flat. Deeply indented code — a loop inside a condition inside a try inside another condition — forces the reader to hold several contexts in their head at once. A guard clause that returns early (“if the input is empty, return an empty result”) handles the edge case and gets it out of the way, so the main logic reads at a single level of indentation.
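
The sketch below contrasts a nested function with a flat one that uses guard clauses. The function share_of_total and its behaviour for empty or zero-sum input are hypothetical choices made for this example.

```python
from typing import List, Sequence


def share_of_total_nested(amounts: Sequence[float]) -> List[float]:
    """Nested version: the real work sits three levels deep."""
    result = []
    if amounts:
        total = sum(amounts)
        if total != 0:
            for amount in amounts:
                result.append(amount / total)
    return result


def share_of_total(amounts: Sequence[float]) -> List[float]:
    """Flat version: edge cases return early, main logic stays at one level."""
    if not amounts:
        return []  # guard: nothing to share out
    total = sum(amounts)
    if total == 0:
        return []  # guard: shares are undefined when the total is zero
    return [amount / total for amount in amounts]


print(share_of_total([1.0, 3.0]))
print(share_of_total([]))
```

The two versions return the same results. In the flat one, each edge case is stated and dismissed before the main line of the function begins.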

Data Science Bridge

Readable code is the methods section of your analysis, made executable. When you write up a piece of work, you name your variables meaningfully, explain each step in order, and leave out the dead ends — because a reader needs to follow the reasoning, not reconstruct it. Readable code does the same job for the same reason: the next person to touch it (often you) needs to follow the logic without rerunning it in their head.

Where the analogy breaks down: a methods section describes what you did once, and a reader will forgive a little ambiguity because they’re not going to re-execute your prose. Code is run repeatedly, by machines and by people making changes, so the ambiguity a human reader would gloss over becomes a latent bug — the place where someone misreads the intent and “fixes” something that wasn’t broken. Code has to be clearer than prose precisely because it’s read in order to be changed, not just understood.

5.4 Type hints and docstrings

Two lightweight tools make a function’s contract explicit. Type hints declare what goes in and what comes out — def filter_high_value(customers: pd.DataFrame, threshold: float) -> pd.DataFrame tells the reader (and the IDE, and a type checker like mypy) that this takes a DataFrame and a number and returns a DataFrame, without their having to read the body. Docstrings explain the why and the what for: the intent, the assumptions, the edge cases. Comments inside the body should be reserved for the genuinely non-obvious — the why behind a surprising line — not a running narration of what each line does, which the code already says.

The difference this makes is easiest to see directly. The two functions below compute exactly the same thing:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
customers = pd.DataFrame({
    "spend": rng.lognormal(4.8, 0.85, 200),
    "active_days": rng.integers(0, 365, 200),
    "signup_year": rng.choice([2021, 2022, 2023], 200),
})

# Before: what does this do, and what does it return?
def proc(d, t=200):
    x = d[d["spend"] > t].copy()
    x["s"] = x["spend"] / x["active_days"].replace(0, np.nan)
    return x.groupby("signup_year")["s"].median()

# After: the name, the types, and the docstring answer those questions
# before you read a single line of the body.
def median_daily_spend_by_cohort(
    customers: pd.DataFrame, spend_threshold: float = 200
) -> pd.Series:
    """Median daily spend of high-value customers, by signup year.

    High-value customers are those whose total spend exceeds
    `spend_threshold`. Daily spend is total spend over active days;
    customers with zero active days are excluded to avoid dividing by zero.
    """
    high_value = customers[customers["spend"] > spend_threshold].copy()
    high_value["spend_per_active_day"] = (
        high_value["spend"] / high_value["active_days"].replace(0, np.nan)
    )
    return high_value.groupby("signup_year")["spend_per_active_day"].median()

# Same logic, same result — the second is just legible.
assert np.allclose(proc(customers).values,
                   median_daily_spend_by_cohort(customers).values,
                   equal_nan=True)
print("Identical results; only the second can be read without decoding it.")

Identical results; only the second can be read without decoding it.

Not a single line of logic changed. What changed is that the second version can be understood from its signature and docstring, reviewed without the author present, and reused with confidence — because its contract is written down.
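
A written-down contract is also available to tools. The sketch below uses the standard library's inspect module to read a function's signature and docstring without opening its body. The body given to filter_high_value here, and the spend column it relies on, are assumptions for illustration; only its signature appears in the text.

```python
import inspect

import pandas as pd


def filter_high_value(customers: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Return the customers whose total spend exceeds `threshold`.

    Assumes a numeric `spend` column. Rows with missing spend are dropped,
    because a comparison against NaN is always False.
    """
    return customers[customers["spend"] > threshold]


# The contract can be read without looking at the implementation.
signature = inspect.signature(filter_high_value)
print(list(signature.parameters))
print(inspect.getdoc(filter_high_value))
```

This is the same information an IDE tooltip or help(filter_high_value) shows, which is why a good signature and docstring often save the reader from opening the function at all.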

5.5 Let tools handle the formatting

A great deal of what passes for “readability” — indentation, spacing, where the commas go — is not worth a single minute of human attention, because a tool can do it perfectly and identically every time. Python has a style standard (PEP 8) and formatters that enforce it automatically: black, and the formatter built into ruff. Run one across your project and the whole question of layout disappears; everyone’s code looks the same, and diffs show only real changes rather than someone’s reformatting.

The linter is the more valuable half. ruff check reads your code and flags genuine problems: an unused import, a variable assigned but never used, a bare except that will swallow errors, or (once the flake8-builtins rules are enabled) a name that shadows a builtin. These are the small defects that hide in exploratory code, and a linter surfaces them in seconds. The point of automating both is to remove a whole category of distraction from code review, so that human attention goes to the logic, which is the one thing these tools can’t judge.
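
To make those defect types concrete, here is a small hypothetical function that contains each of them, followed by a cleaned-up version. The comments give the rule codes Ruff uses for each finding; the shadowing rule (A001) belongs to the flake8-builtins set and is assumed to have been enabled, since it is not part of Ruff's default selection. The function itself is invented for illustration.

```python
import json
import os  # F401: imported but never used


def parse_record_before(raw):
    list = raw.split(",")  # A001: the name shadows the builtin `list`
    count = len(list)      # F841: assigned but never used
    try:
        return json.loads(list[0])
    except:                # E722: bare except hides every kind of error
        return None


def parse_first_field(raw: str):
    """Decode the first comma-separated field as JSON, or None if it is invalid."""
    fields = raw.split(",")
    try:
        return json.loads(fields[0])
    except json.JSONDecodeError:  # catch only the failure we expect
        return None


print(parse_first_field("42,abc"))
print(parse_first_field("oops,1"))
```

Both functions behave the same on these inputs, which is exactly why such defects survive in exploratory code: nothing fails until the bare except swallows an error you needed to see.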

Author’s Note

The reasonable objection from a data scientist is that exploration is no place for this. When you’re three cells deep in checking whether a transformation helps, stopping to write a docstring and rename tmp sounds absurd, and it would be. Throwaway code should be throwaway: cryptic names are fine for code that will be deleted within the hour.

The shift is recognising the moment code stops being throwaway. The instant you find yourself copying a cell into the next notebook, or relying on a result a week later, or handing the work to someone else, the code has graduated from scratch to kept, and the economics invert: now it will be read many times, and the few minutes of naming and documenting pay for themselves on the first re-reading. The skill isn’t writing every cell as production code; it’s noticing when a cell has earned the investment, and cleaning it up then. Most of the readability debt in data science comes not from writing scratch code, but from scratch code quietly being promoted to load-bearing without anyone cleaning it up on the way.

5.6 Summary

Readability is what lets code be changed safely, by you and by others:

1. Code is read far more than it’s written. The cost of cryptic code is paid on every reading, and the most frequent reader is future-you.
2. Names are documentation that can’t go stale. Name variables and functions for what they are; replace magic numbers with named constants.
3. One function, one job. Small, single-purpose functions and flat nesting let a reader understand the whole from the parts.
4. Make the contract explicit, and automate the rest. Type hints and docstrings document what a function takes, what it returns, and why; formatters and linters handle layout and catch small defects, freeing human attention for the logic.

In the next chapter we take the natural next step from well-named functions to reusable ones: functions, modules, and packages.

5.7 Exercises

1. Take a cryptic cell or function from one of your own projects and refactor it purely for readability: rename variables to reveal intent, replace any magic numbers with named constants, and add type hints and a docstring, without changing the logic. Did renaming alone surface anything you’d been unsure about?
2. Run a formatter and a linter over a project (ruff format then ruff check, or black and flake8). Read the linter’s warnings and sort them into two piles: real defects (unused imports, shadowed names, bare excepts) and pure style. How many were real?
3. Find a function that does several things and split it into smaller functions, each named for the single job it does. Afterwards, read just the names in sequence: do they tell the story of what the original function did?
4. Conceptual: You inherit a dense but correct thirty-line function implementing a sampling correction, and you have an hour. You can either rewrite it with meaningful names and smaller functions, or leave the code untouched and write a careful paragraph above it explaining the method (the methods-section approach). Choose one, and justify the choice in terms of who reads this function next and what they will do with it. Then describe the kind of function for which the opposite choice would be right.
5. Conceptual: When is investing in readability not worth it? Describe a concrete piece of code where throwaway names are the right call, and name the specific signal that tells you it has crossed the line into code worth cleaning up.

Content: CC BY-NC-SA 4.0 | Code: MIT (see /LICENSE.md)