19 Technical debt

19.1 The notebook graveyard

Every data scientist has one: a folder of dead notebooks, half-finished experiments, a utils.py that everything imports and nobody dares touch, a model in models/ trained by a script that no longer runs. It accumulated without anyone deciding it should. Each piece was a reasonable shortcut at the time — a quick experiment, a value hard-coded to ship before a deadline, a function copied rather than shared because copying was faster.

This is technical debt: the accumulated cost of the shortcuts and deferred cleanups that every real project takes on. The term is not an insult. Debt, used deliberately, is a legitimate tool: you borrow time now and agree to pay it back later. The danger is the debt nobody acknowledged, which compounds quietly until the day a small change takes a week because the code around it is a tangle no one understands. Every project carries debt; the only question is whether it’s managed or merely accreting.

19.2 What technical debt is

The financial metaphor is exact enough to be useful. A shortcut borrows time (you ship faster now) and charges interest: every future change to that code is slower and riskier than it would have been. There are two kinds. Deliberate debt is a conscious choice: “I’ll hard-code this for the demo and clean it up after”, taken knowingly and (ideally) recorded. Inadvertent debt is the debt you didn’t know you were taking on: a design that seemed fine and turned out wrong, or an assumption that the data later violated. Deliberate debt is a tool; inadvertent debt, along with deliberate debt you forgot about, is the liability.

What makes technical debt insidious is that its interest is invisible until you have to change the code. A shortcut can run perfectly for months:

import numpy as np

# A shortcut taken under deadline: skip the zero-active-days guard,
# because "the data never has zeros". It runs fine — for now.
def mean_spend_per_day(spend, active_days):
    return np.mean(spend / active_days)

clean = mean_spend_per_day(np.array([100.0, 200.0]), np.array([4, 8]))
print(f"on today's clean data:  {clean:.1f}")

# Months later, the upstream source starts emitting a zero. The shortcut
# doesn't crash: NumPy emits at most a RuntimeWarning and returns inf,
# which is worse than a clean failure.
with_zero = mean_spend_per_day(np.array([100.0, 200.0]), np.array([4, 0]))
print(f"the day a zero arrives: {with_zero}")

on today's clean data:  25.0
the day a zero arrives: inf

The shortcut took no time to write and worked flawlessly until the input changed. Even then it didn’t fail loudly: it returned inf, the kind of silent wrongness that propagates into a report before anyone notices. That delay between taking the shortcut and paying for it is exactly why debt is so easy to accumulate and so dangerous to ignore.
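For contrast, here is a minimal sketch of what repaying this particular debt might look like. The name mean_spend_per_day_guarded is hypothetical, and raising an error is only one possible policy; whether zero-day rows should be rejected, dropped, or reported separately is a decision the example above does not make.

```python
import numpy as np

def mean_spend_per_day_guarded(spend, active_days):
    """Mean of spend / active_days, refusing inputs with zero active days.

    Hypothetical repayment of the shortcut: the guard is an assumption
    about the desired policy, not the only reasonable one.
    """
    spend = np.asarray(spend, dtype=float)
    active_days = np.asarray(active_days, dtype=float)
    if np.any(active_days == 0):
        # Fail loudly at the boundary instead of letting inf reach a report.
        raise ValueError("active_days contains zeros; cannot compute spend per day")
    return float(np.mean(spend / active_days))

print(mean_spend_per_day_guarded([100.0, 200.0], [4, 8]))

try:
    mean_spend_per_day_guarded([100.0, 200.0], [4, 0])
except ValueError as err:
    print(f"rejected: {err}")
```

The guard costs a few lines, and it turns the delayed, quiet failure into an immediate one that names its cause.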

Note: Data Science Bridge

Technical debt is the un-cleaned-up analysis you already know intimately. Every data scientist has the notebook that “works” but is a thicket of out-of-order cells, hard-coded paths, and variables named df2 — and you already pay its interest every time you reopen it and have to reconstruct how it works before you can change anything. Refactoring is simply tidying that thicket so the next change is cheap instead of frightening.

Where the analogy to financial debt breaks down: a loan has a known interest rate and a repayment schedule, so you can plan around it. Technical debt’s interest is unpredictable and lumpy — it costs nothing at all until the day you have to touch the code, and then it can cost an enormous amount at once. That irregularity is what makes it so easy to defer: there’s no monthly statement reminding you it’s there, so it stays invisible until a deadline collides with it.

19.3 The debt data science accumulates

Some forms of debt are particular to data science work. Untested transformations (Chapter 7) are debt — every one is a change you can’t make safely. Copy-pasted logic (Chapter 6) is debt that compounds, because a fix to one copy leaves the others wrong. Exploratory notebooks promoted to production without cleanup (Chapter 1) are debt with the principal still outstanding. And then there is the debt of dead experiments and abandoned features cluttering the repository, and the glue code holding a pipeline together that everyone is afraid to remove.
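As a rough illustration of why copy-pasted logic compounds, the sketch below keeps one shared definition that two pipeline stages call, with a single small test covering both. Every name here (normalise_country, prepare_training_rows, prepare_scoring_rows) is hypothetical and chosen only for the example.

```python
def normalise_country(code):
    """Single shared definition: trim whitespace and upper-case a country code."""
    return code.strip().upper()

def prepare_training_rows(rows):
    # Training path calls the shared helper rather than its own copy.
    return [{**row, "country": normalise_country(row["country"])} for row in rows]

def prepare_scoring_rows(rows):
    # Scoring path calls the same helper, so a fix lands in both places at once.
    return [{**row, "country": normalise_country(row["country"])} for row in rows]

def test_normalise_country():
    # One test now protects every caller of the helper.
    assert normalise_country(" gb ") == "GB"
    assert normalise_country("us") == "US"

test_normalise_country()
train = prepare_training_rows([{"country": " gb ", "spend": 10.0}])
score = prepare_scoring_rows([{"country": "gb", "spend": 12.0}])
print(train[0]["country"] == score[0]["country"])
```

Had each stage carried its own pasted copy of the normalisation, a fix to the training copy would leave scoring quietly inconsistent, which is the compounding the paragraph describes.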

Machine learning adds its own categories, catalogued in a widely cited paper by Sculley and colleagues (Sculley et al. 2015): configuration that sprawls until no one knows which settings are live, data dependencies that change silently upstream, and “pipeline jungles” of accreted transformation steps. Their central point is that in machine learning, debt hides not only in the code but in the data and the configuration, the parts data scientists are least likely to treat as engineering artefacts and therefore least likely to keep clean.
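One way to make a silent upstream change visible is to check incoming data against an explicit expectation at the pipeline boundary. The sketch below is illustrative only: the field names, the EXPECTED_FIELDS mapping, and check_upstream_rows are assumptions made for the example, not something taken from Sculley et al.

```python
# Hypothetical contract for rows arriving from an upstream source.
EXPECTED_FIELDS = {"user_id": int, "spend": float, "active_days": int}

def check_upstream_rows(rows):
    """Return a list of human-readable problems; an empty list means the contract holds."""
    problems = []
    for i, row in enumerate(rows):
        missing = EXPECTED_FIELDS.keys() - row.keys()
        if missing:
            problems.append(f"row {i}: missing fields {sorted(missing)}")
            continue
        for name, expected_type in EXPECTED_FIELDS.items():
            if not isinstance(row[name], expected_type):
                actual = type(row[name]).__name__
                problems.append(f"row {i}: {name} is {actual}, expected {expected_type.__name__}")
        days = row["active_days"]
        # Only compare when the type is right, so a string never reaches "<=".
        if isinstance(days, int) and days <= 0:
            problems.append(f"row {i}: active_days must be positive, got {days}")
    return problems

good = [{"user_id": 1, "spend": 100.0, "active_days": 4}]
drifted = [{"user_id": 2, "spend": "200", "active_days": 0}]

print(check_upstream_rows(good))
for problem in check_upstream_rows(drifted):
    print(problem)
```

A check like this treats the data as an engineering artefact with a stated contract, so a change upstream surfaces as a listed problem rather than as a wrong number downstream.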

19.4 Managing debt

You don’t eliminate technical debt; you manage it. Three practices do most of the work. First, track it: keep a debt log, or issues in your tracker, so a known shortcut is visible and deliberate rather than forgotten. Second, follow the boy-scout rule: leave code a little better than you found it, paying debt down opportunistically with a test added or a constant named while you’re in the file for another reason. Third, schedule paydown: deliberate refactoring time, not squeezed in around feature work, keeps the larger debts from growing without bound.
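A debt log needs very little structure to be useful. The sketch below assumes one possible shape for an entry: where the shortcut lives, what it is, what leaving it costs, and what event should trigger repayment. The DebtItem class, its fields, and the sample entries are all hypothetical; a plain text file or tracker issues would serve equally well.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DebtItem:
    location: str     # where the shortcut lives
    shortcut: str     # what was skipped or hard-coded
    cost: str         # what leaving it in place risks
    repay_when: str   # the event that should trigger repayment
    deliberate: bool  # True if taken knowingly, False if discovered later

# Illustrative entries only; the second location is made up for the example.
debt_log = [
    DebtItem(
        location="mean_spend_per_day",
        shortcut="no guard against zero active_days",
        cost="returns inf instead of failing if a zero arrives",
        repay_when="before the function feeds a scheduled report",
        deliberate=True,
    ),
    DebtItem(
        location="load_data",
        shortcut="hard-coded input path",
        cost="breaks on any other machine",
        repay_when="before anyone else runs it",
        deliberate=False,
    ),
]

def format_log(log):
    """Render each entry as one line, labelled deliberate or inadvertent."""
    lines = []
    for item in log:
        kind = "deliberate" if item.deliberate else "inadvertent"
        lines.append(f"[{kind}] {item.location}: {item.shortcut} (repay {item.repay_when})")
    return lines

for line in format_log(debt_log):
    print(line)
```

Writing down the repayment trigger is the part that matters most: it turns "clean this up someday" into a condition you can recognise when it arrives.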

The judgement is about timing. Taking on debt is right for a prototype that might be discarded, a hypothesis you’re testing, or a genuine deadline — there’s no sense gold-plating code you may throw away tomorrow. Repaying it is right before the code becomes load-bearing: before others depend on it, before it’s scheduled in production, before the shortcut you took for a throwaway becomes a foundation.

Tip: Author’s Note

Data science intentionally incurs debt, and that instinct is correct. Exploration is supposed to be fast, messy, and disposable; wrapping a hypothesis you might abandon tomorrow in tests and abstractions is waste, not virtue. The problem is almost never that data scientists take on too much debt during exploration. It’s that the messy exploratory code silently becomes the production system without anyone deciding it should — so debt taken on for a prototype that was meant to live an afternoon is now load-bearing, unacknowledged, and overdue.

The skill, then, is not avoiding debt but tracking it — keeping a record of the shortcuts you’ve taken, so that when a piece of code graduates from scratch to kept, you can see what you owe and choose to repay it deliberately, rather than rediscovering it at 3am when the zero finally arrives. Debt you chose, wrote down, and can repay on your terms is a tool. Debt you forgot you took is the thing that turns a small change into a lost week.

19.5 Summary

Technical debt is inevitable; managing it deliberately is the skill:

  1. Debt is borrowed time with interest. A shortcut is fast now and costlier on every later change; deliberate, acknowledged debt is a tool, while forgotten debt is the liability.

  2. Its interest is invisible until you pay. A shortcut runs fine until the input or the requirement changes, then charges all at once — often as silent wrongness rather than a clean failure.

  3. Data science has its own debts. Untested transforms, copy-paste, promoted notebooks, and — uniquely — debt hiding in data and configuration, not just code.

  4. Manage it: track, tidy, and schedule. Record known shortcuts, leave code better than you found it, and set aside real time for repayment — repaying before the code becomes load-bearing.

The final chapter of this part turns from the debt within a project to the people across it: cross-discipline collaboration.

19.6 Exercises

  1. Audit one of your own projects for technical debt: list the shortcuts, untested code, copy-pasted logic, dead experiments, and hard-coded values you find. Mark each as deliberate (you knew you were taking it) or inadvertent (you’ve only just noticed). Which category was larger, and which item surprised you?

  2. Apply the boy-scout rule: while you’re in a file for some other reason, pay down one debt item — add a test, extract a function, name a constant. How long did it take relative to the change you were originally making?

  3. Start a debt log — a file or a set of issues — recording known shortcuts, the cost of leaving each one, and the event that should trigger its repayment. What makes a debt item worth writing down rather than simply fixing on the spot?

  4. Conceptual: The Data Science Bridge compares technical debt to financial debt. Give one way the analogy holds and one way it breaks down. What is unusual about technical debt’s “interest”, and why does that make it easy to ignore?

  5. Conceptual: Taking on debt is sometimes the right call. Describe a situation where a shortcut is the correct decision and one where it’s reckless, and name the property that distinguishes the two.