7 Testing stochastic code

7.1 “How do you test something random?”

This is the question every data scientist asks the first time someone suggests they write tests, and it’s a fair one. Testing, as software engineers describe it, sounds built for a world of deterministic functions — given this input, assert exactly that output — and a model that involves a random train/test split, stochastic gradient descent, and sampling seems to live in a different universe. If the output is different every run, what is there to assert?

The question contains a hidden assumption worth dismantling. It assumes your code is mostly random. It isn’t. The overwhelming majority of a data science codebase is deterministic data transformation: cleaning, joining, feature engineering, reshaping, aggregating. Given the same input, add_spend_per_day returns the same output every single time, and that is exactly what a test checks. The genuinely stochastic parts are a small minority, and — as we’ll see — they’re testable too, just with different techniques. The resistance comes from imagining the hard case (testing a model’s accuracy) and concluding the whole practice is impossible, when the everyday case (testing a transform) is as straightforward as it gets.

This is also where the work of the last two chapters pays off. A pure function whose output depends only on its inputs (Chapter 6) is trivially testable; a function that reaches out to global state is barely testable at all. You made your code testable without knowing it.

7.2 What a test actually is

A test is just a function that calls your code and asserts something about the result. The testing framework (pytest is the de facto standard) looks in files named test_*.py or *_test.py, finds every function whose name starts with test_, runs it, and reports which tests passed and which failed. There is no magic; a test file is ordinary Python.

# tests/test_features.py — discovered and run by `pytest`
import numpy as np
import pytest
from customer_value.features import standardise

def test_standardise_centres_and_scales():
    z = standardise(np.array([1.0, 2.0, 3.0, 4.0]))
    assert z.mean() == pytest.approx(0.0)
    assert z.std() == pytest.approx(1.0)

def test_standardise_on_empty_input_returns_empty():
    assert len(standardise(np.array([]))) == 0

You run pytest from the command line (the fluency from Chapter 4 earning its keep), and it tells you, in seconds, whether every checked behaviour still holds. pytest.approx handles the floating-point reality that 0.1 + 0.2 is not exactly 0.3 — you assert close enough, with a tolerance, rather than bit-exact equality. The other construct you’ll use constantly is the fixture: a function that builds a small, reproducible dataset shared across tests, so each test starts from a known state rather than reconstructing one.
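A minimal sketch of both constructs may help. The dataset, the helper make_daily_spend, and the test names are assumptions invented for illustration; only pytest.fixture and pytest.approx come from pytest itself.

```python
import numpy as np
import pytest


def make_daily_spend():
    """Hypothetical helper: a tiny hand-written dataset with a known total."""
    return np.array([12.0, 0.0, 7.5, 30.0])


@pytest.fixture
def daily_spend():
    # With the default scope, pytest calls this once for each test that names
    # `daily_spend` as a parameter, so every test gets a fresh, known input.
    return make_daily_spend()


def test_total_spend(daily_spend):
    assert daily_spend.sum() == pytest.approx(49.5)


def test_spend_is_never_negative(daily_spend):
    assert (daily_spend >= 0).all()


def test_float_sums_need_a_tolerance():
    # Bit-exact equality fails for this sum; approx compares within a tolerance.
    assert 0.1 + 0.2 != 0.3
    assert 0.1 + 0.2 == pytest.approx(0.3)
```

Keeping the data in a plain helper and wrapping it in the fixture is one possible design: the same dataset stays usable outside pytest, for example in a notebook.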

Data Science Bridge

The first chapter compared a test suite to a holdout set, and it’s worth completing the thought. A fixture is the test’s equivalent of a fixed, version-controlled evaluation dataset: a known input you control completely, so that when a test fails you know the code changed, not the data. The discipline is identical to the one you already apply when you freeze a holdout set so that this week’s model is comparable to last week’s.

But the crucial difference, the one this whole chapter turns on, is the verdict. A holdout set yields a score on a continuum — 0.82 AUC, which you judge against a baseline and call good or not. A test yields a boolean — it passed or it failed, and a failure is a defect, not a number to weigh. The skill a data scientist has to add is not a new mechanism (you already separate evaluation data from training data); it’s accepting that for code, “mostly works” is a contradiction in terms.

7.3 Testing the deterministic core

Start where testing is easy, because that’s where most of the value is. For every transformation function, a good test states the contract with a representative input and an expected output, then adds the edge cases that exploratory work skips: an empty frame, a column of all zeros, a missing value, a single row. These are precisely the inputs that crash a pipeline at 3am, and a handful of assertions catches them before they ship.

The pattern is always the same — construct a small known input, call the function, assert on the result — and it costs a few minutes per function. The return is that you can change that function later (optimise it, extend it, refactor it) and know within seconds whether you broke its contract. That confidence is what makes a codebase safe to evolve rather than frozen by fear.
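As a sketch of that pattern, the tests below exercise add_spend_per_day. The chapter names this function but does not show it here, so the implementation, the column names total_spend and days_active, and the decision that zero days yields a missing value are all assumptions made for illustration.

```python
import pandas as pd


def add_spend_per_day(df):
    """Hypothetical implementation, assumed for this sketch only."""
    out = df.copy()
    # Zero days becomes a missing value, so the division gives NaN, not infinity.
    days = out["days_active"].where(out["days_active"] != 0)
    out["spend_per_day"] = out["total_spend"] / days
    return out


def test_representative_input():
    df = pd.DataFrame({"total_spend": [100.0, 30.0], "days_active": [10, 3]})
    result = add_spend_per_day(df)
    assert result["spend_per_day"].tolist() == [10.0, 10.0]


def test_empty_frame_still_gets_the_new_column():
    df = pd.DataFrame({
        "total_spend": pd.Series(dtype="float64"),
        "days_active": pd.Series(dtype="int64"),
    })
    result = add_spend_per_day(df)
    assert len(result) == 0
    assert "spend_per_day" in result.columns


def test_zero_days_gives_missing_not_infinity():
    df = pd.DataFrame({"total_spend": [50.0], "days_active": [0]})
    result = add_spend_per_day(df)
    assert result["spend_per_day"].isna().all()


def test_input_frame_is_not_modified():
    df = pd.DataFrame({"total_spend": [50.0], "days_active": [5]})
    add_spend_per_day(df)
    assert "spend_per_day" not in df.columns
```

Writing the zero-days test forces a decision the exploratory version probably never made: whether that case should produce a missing value, an infinity, or an error.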

7.4 Testing the stochastic parts

Now the part that prompted the question. Genuinely stochastic code is testable with three techniques, in rough order of how often you’ll reach for them.

The first is to fix the seed, which turns a stochastic function into a deterministic one for the duration of the test. If you pass an explicit random generator, the same seed produces the same output, and you can assert exact equality.

import numpy as np

def jitter(values, rng, scale=1.0):
    """Add Gaussian noise to each value."""
    return values + rng.normal(0, scale, size=len(values))

# Technique 1 — fix the seed: the stochastic function becomes reproducible,
# so an exact assertion is valid.
base = np.array([10.0, 20.0, 30.0])
first = jitter(base, np.random.default_rng(0))
again = jitter(base, np.random.default_rng(0))
assert np.array_equal(first, again)

# Technique 2 — assert a statistical property within tolerance, over many draws.
rng = np.random.default_rng(0)
draws = np.concatenate([jitter(np.zeros(1_000), rng, scale=2.0) for _ in range(50)])
assert abs(draws.mean() - 0.0) < 0.05      # noise is centred on zero
assert abs(draws.std() - 2.0) < 0.05       # spread matches the scale

print("Seeded output reproducible; statistical properties hold within tolerance.")

Seeded output reproducible; statistical properties hold within tolerance.

The second, shown above, is to assert a statistical property within a tolerance: you can’t predict an individual noisy value, but you can assert that the mean of many draws is near zero and the spread near the scale you asked for. The tolerance is a deliberate, documented choice — wide enough not to fail by chance, tight enough to catch a real defect.
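One way to make the tolerance principled is to derive it from the standard error of the mean, scale / sqrt(n), instead of picking a round number. The sketch below assumes a threshold of five standard errors; that multiplier and the helper mean_tolerance are illustrative choices, not something the chapter prescribes.

```python
import math

import numpy as np


def jitter(values, rng, scale=1.0):
    """Add Gaussian noise to each value (same definition as in the listing above)."""
    return values + rng.normal(0, scale, size=len(values))


def mean_tolerance(scale, n_draws, n_standard_errors=5.0):
    """Hypothetical helper: tolerance as a multiple of the standard error of the mean."""
    return n_standard_errors * scale / math.sqrt(n_draws)


def test_jitter_noise_is_centred_on_zero():
    n_draws, scale = 50_000, 2.0
    tolerance = mean_tolerance(scale, n_draws)  # about 0.045 for these values
    rng = np.random.default_rng(0)              # seeded, so the verdict is repeatable
    noise = jitter(np.zeros(n_draws), rng, scale=scale)
    assert abs(noise.mean()) < tolerance
```

For 50,000 draws at scale 2.0 the standard error is about 0.009, so the 0.05 used in the listing above sits at roughly 5.6 standard errors: wide enough that a chance failure is very unlikely even without a fixed seed, yet tight enough to catch a shifted mean.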

The third is to assert invariants: properties that must hold for any valid input, whatever the random draw. A standardising function must always produce output with mean zero and unit variance (for any input that is not constant, since a constant column has zero spread and cannot be scaled); a shuffle must preserve the elements and how often each occurs; a sampler must never return a value it wasn’t given. You test these across many random inputs, which is the essence of property-based testing.

def standardise(x):
    """Scale an array to zero mean and unit variance."""
    return (x - x.mean()) / x.std()

# The invariant (mean 0, std 1) must hold for any non-constant input. Check it
# across many randomly generated inputs rather than one hand-picked case.
rng = np.random.default_rng(0)
for _ in range(100):
    centre, spread = rng.uniform(-50, 50), rng.uniform(1, 20)
    z = standardise(rng.normal(centre, spread, size=200))
    assert abs(z.mean()) < 1e-9
    assert abs(z.std() - 1.0) < 1e-9

print("Invariant holds across 100 random inputs.")

Invariant holds across 100 random inputs.

Writing the loop by hand makes the idea concrete, but the library hypothesis automates exactly this: you declare the property and the shape of valid inputs, and it generates the cases for you (100 per test by default), including the nasty edge cases you’d never think to write, and then shrinks any failure towards a minimal input that still triggers it.
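As a hedged sketch of what that looks like, here is the shuffle invariant expressed with hypothesis. The test name, the bounds on the generated integers, and the use of NumPy's Generator.permutation as the shuffle under test are choices made for this example.

```python
import numpy as np
from hypothesis import given, settings, strategies as st


@settings(max_examples=50, database=None)  # keep the sketch quick; no example database
@given(
    values=st.lists(st.integers(min_value=-1000, max_value=1000), max_size=30),
    seed=st.integers(min_value=0, max_value=2**32 - 1),
)
def test_shuffle_preserves_elements(values, seed):
    # hypothesis supplies both the input list and the seed, so the property
    # is checked across many inputs and many random draws.
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(values)
    # Sorting both sides compares the elements and their counts, ignoring order.
    assert sorted(shuffled.tolist()) == sorted(values)
```

The generated lists include the empty list and lists with repeated values, which are exactly the cases a hand-picked example tends to leave out.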

7.5 What to test in a model pipeline

The judgement that matters most is knowing what not to test. Do not write a unit test that asserts your model achieves 85% accuracy. Model performance is a statistical quantity that depends on data, sampling, and randomness; it belongs in evaluation, monitored over time, not pinned to a threshold that will fail the first time the data shifts and tell you nothing about whether the code is correct.

What you test instead is everything around the model that is deterministic and does have a right answer: that your data validation rejects a malformed input, that a feature transform produces the expected columns and introduces no leakage, that the pipeline runs end to end on a tiny sample without error, and that a saved model round-trips — serialise it, load it, and confirm it gives identical predictions. These catch the failures that actually break model pipelines in production, none of which is about accuracy. The companion volume, Thinking in Uncertainty, approaches the same boundary from the statistician’s side, distinguishing what evaluation can tell you from what testing can.
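The sketch below shows three of those checks on a deliberately tiny pipeline. Everything in it is a stand-in assumed for illustration: the validation rule, the least-squares model stored as a plain dict, and the use of pickle for serialisation. A real project would substitute its own estimator and persistence format.

```python
import pickle

import numpy as np
import pytest


def validate(X):
    """Hypothetical validation rule: reject any missing value."""
    if np.isnan(X).any():
        raise ValueError("input contains missing values")
    return X


def fit(X, y):
    """Least-squares fit; the stand-in 'model' is just a dict of coefficients."""
    coef, *_ = np.linalg.lstsq(validate(X), y, rcond=None)
    return {"coef": coef}


def predict(model, X):
    return validate(X) @ model["coef"]


def tiny_sample():
    """A small seeded dataset, so every test sees the same rows."""
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    y = X @ np.array([1.0, -2.0, 0.5])
    return X, y


def test_validation_rejects_missing_values():
    with pytest.raises(ValueError):
        validate(np.array([[1.0, np.nan, 3.0]]))


def test_pipeline_runs_end_to_end_on_tiny_sample():
    X, y = tiny_sample()
    predictions = predict(fit(X, y), X)
    assert predictions.shape == (20,)
    assert np.isfinite(predictions).all()


def test_saved_model_round_trips():
    X, y = tiny_sample()
    model = fit(X, y)
    restored = pickle.loads(pickle.dumps(model))
    # Identical, not merely close: serialisation must not change predictions.
    assert np.array_equal(predict(model, X), predict(restored, X))
```

None of these tests mentions accuracy. Each one has a single right answer, which is what makes it a test and not an evaluation.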

Author’s Note

The “good enough” mindset is the deepest source of friction here, and it’s worth naming precisely because it’s a strength misapplied. A data scientist’s professional judgement is calibrated to a continuum: a model is never perfect, 82% might be excellent, and insisting on more is often the wrong call. That instinct serves model evaluation beautifully. Carried into testing, it’s corrosive, because it produces the flaky test — one that “passes most of the time” — which is worse than no test at all: it cries wolf until everyone ignores it, including when it’s right.

The resolution is to keep the two questions firmly apart. “Is the model good enough?” is an evaluation question, answered on a continuum with judgement. “Does this code do what I specified?” is a testing question, answered pass or fail. A test that fails one run in twenty hasn’t found a “good enough” answer; it’s a broken test, and the fix is to make its verdict repeatable: fix the seed, set the tolerance from the sampling error instead of guessing, or assert an invariant instead of a value. Tests are the one place in your work where the continuum thinking has to be switched off.

7.6 Summary

Testing is not only possible for data science code; it’s where modular, pure code pays off:

1. Most of your code is deterministic. Cleaning, joining, and feature engineering have exact right answers and test trivially; the supposed difficulty is the rare case, not the common one.
2. A test is a function that asserts on your code’s output. pytest runs them; pytest.approx handles floating point; fixtures give every test a known, reproducible starting point.
3. Stochastic code is testable three ways. Fix the seed for exact assertions, assert statistical properties within a tolerance, or assert invariants that must hold for any input, with hypothesis to generate the cases.
4. Test the code, not the model’s accuracy. Validate data, transforms, end-to-end runs, and serialisation round-trips; leave performance to evaluation. And never tolerate a flaky test: a test that “mostly passes” is broken.

In the next chapter we turn from preventing defects to finding them once they’ve slipped through: debugging and profiling.

7.7 Exercises

1. Take a deterministic transformation function from one of your own projects and write three tests for it: one with a representative input and expected output, and two for edge cases (an empty input, a column of zeros, or a missing value). Did writing the edge-case tests reveal a behaviour you hadn’t decided on?
2. Take a function with a stochastic element (anything using a random generator) and make it testable by having it accept an explicit rng argument. Write one test that fixes the seed and asserts an exact result, and one that asserts a statistical property of many draws within a tolerance you justify.
3. Identify an invariant in one of your transforms, that is, a property that must hold for every input (output shape, no new missing values, a bounded range, preserved row count), and write a test that checks it across a range of random inputs. If you have hypothesis available, express the same property with it and see what inputs it generates.
4. Conceptual: Explain why asserting model.score(X_test, y_test) > 0.85 is a poor unit test. What question is it actually trying to answer, where does that question belong instead, and what should you test about the model pipeline?
5. Conceptual: A colleague’s test suite has one test that fails roughly one run in ten and is usually re-run until it passes. Explain why this is worse than not having the test, and describe two concrete ways to fix it without simply deleting it.

Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md