---
# Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md
title: "Configuration and secrets"
---
## The password in the notebook {#sec-password-in-notebook}
Two lines from a real notebook, both committed to a shared repository. The first: `THRESHOLD = 0.73`, a value tuned by hand one afternoon, sitting in the middle of a cell with no explanation. The second, a few cells down: `conn = connect("postgresql://admin:hunter2@prod-db:5432/customers")`, a production database password in plain text. Each is a different kind of mistake, and both are endemic to data science code.
The first is a *configuration* problem: a value that ought to be adjustable is baked into the logic, so changing it means editing (and re-reviewing) code, and there's no single place that records what the run's settings actually were. The second is a *secrets* problem, and a more serious one: a credential that should never have entered the repository is now in its history permanently — recall from Chapter 2 that deleting the file doesn't remove it from past commits. This chapter is about getting both kinds of value out of your code: configuration into files you can change without touching logic, and secrets out of the repository entirely.
## Configuration is not code {#sec-config-not-code}
The principle is to separate the *what* from the *how* — the values a run depends on (file paths, thresholds, hyperparameters, table names, model versions) from the logic that uses them. The values live in a configuration file, loaded at runtime; the code reads them by name. A small `config.yaml` might hold:
```yaml
data:
raw_path: data/raw/customers.csv
min_active_days: 1
model:
test_size: 0.2
random_state: 42
n_estimators: 300
```
This buys three things. The same code runs in development and in production with different config rather than different code. A threshold can change without anyone editing a line of logic, so the change is low-risk and doesn't need a code review of the algorithm. And the config file is a readable, version-controllable record of the settings a given run used — far better than reconstructing them from scattered literals.
## Typed, validated configuration {#sec-typed-config}
Loading config into a bare dictionary trades one problem for another: a typo in a key (`raw_paht`) gives you a silent `None` at 2am rather than a clear error. The fix is to load configuration into a *typed* object that validates itself at startup, so a missing key, a wrong type, or an out-of-range value fails immediately, with a message that says what's wrong. `pydantic` is the standard tool.
```{python}
#| label: typed-config
#| echo: true
from pydantic import BaseModel, Field, ValidationError
class ModelConfig(BaseModel):
test_size: float = Field(gt=0, lt=1) # must be a proportion
random_state: int
n_estimators: int = Field(gt=0)
# A valid config (as if parsed from the YAML above) loads cleanly.
config = ModelConfig(test_size=0.2, random_state=42, n_estimators=300)
print(f"loaded: test_size={config.test_size}, n_estimators={config.n_estimators}")
# An invalid one fails at startup with a precise message — not a silent None.
try:
ModelConfig(test_size=1.5, random_state=42, n_estimators=300)
except ValidationError as exc:
print(f"rejected: {exc.errors()[0]['loc'][0]} — {exc.errors()[0]['msg']}")
```
The valid config loads into an object whose fields are typed and autocomplete in your editor; the invalid one — a `test_size` of 1.5, which is not a proportion — is rejected the moment it's loaded, naming the offending field. Catching a bad configuration at startup, rather than three stages into a pipeline, is the same boundary-validation idea as the pipeline gates of the previous chapter, applied to the run's settings.
::: {.callout-note}
## Data Science Bridge
You already externalise configuration, informally. The dictionary of hyperparameters you pass to a model, or the block of `PARAMS` at the top of a notebook that everything below reads from, is exactly this instinct: pull the knobs out of the logic so they live in one place you can adjust. Moving them into a validated config file is that same move made robust — one place for the settings, checked for sanity before the run starts.
Where it breaks down: a hyperparameter dictionary is consumed once, in one process, to fit one model. Application configuration often selects *behaviour across environments* — which database to connect to, which storage bucket to write to, whether to run in debug mode — a concern a hyperparameter sweep never has. So config grows a dimension the hyperparameter dict doesn't: the same code, the same config *schema*, but different *values* in dev, staging, and production.
:::
## Secrets never live in the repository {#sec-secrets}
A secret — a database password, an API key, an access token — is configuration that must never be committed. The rule is absolute because the cost is asymmetric: the convenience of pasting a key inline is small, and the consequence of it entering the repository's permanent history is large. Secrets are loaded from the *environment* instead (the environment variables from Chapter 4), kept locally in a `.env` file that is git-ignored, and injected by a secret manager in production.
```{python}
#| label: secret-from-env
#| echo: true
import os
# In production the platform sets this; locally it comes from an
# untracked .env file. Either way, the value is never in the code.
os.environ.setdefault("DATABASE_URL", "postgresql://localhost/dev") # demo default
database_url = os.environ["DATABASE_URL"]
print(f"connecting to: {database_url}")
print("The code names the variable; the value lives outside the repository.")
```
The pattern that keeps this honest is to commit a `.env.example` listing the *names* of the variables the application needs, with placeholder values, while the real `.env` stays untracked:
```bash
# .env.example (committed — a template, no real values)
DATABASE_URL=postgresql://user:password@host:5432/dbname
MODEL_API_KEY=your-key-here
```
A new colleague copies `.env.example` to `.env`, fills in the real values, and is running — without any secret ever touching the repository. (And if a secret *was* ever committed, removing it from the code is not enough: it must also be rotated, because it lives in the history.)
## Configuration for experiments {#sec-config-experiments}
There's a bonus for data science specifically. Once hyperparameters live in config files rather than inline, each experiment's settings become a version-controlled artefact — a record of exactly what you ran, which is the experiment-log discipline from Chapter 2 made concrete. A config file per experiment, committed alongside the code, lets you reproduce or compare runs precisely, and tools like Hydra extend this to composing and sweeping over configurations. Configuration stops being plumbing and becomes part of how your results are made reproducible.
::: {.callout-tip}
## Author's Note
Hard-coding is genuinely faster in the moment, and pretending otherwise would be dishonest. Typing `0.73` directly into the cell beats stopping to set up a config file when you're iterating, and during exploration that trade is often correct — the value is changing every few minutes and isn't worth externalising yet. The cost is invisible right up until the work has to run somewhere else, or again: the value you tuned by hand is now buried in logic with no record of why it's `0.73`, and anyone running the code elsewhere has to find and change it in place.
The reframe is that externalising configuration is the point at which your code stops being tied to one machine and one run. Pull the values out when the code graduates from scratch to kept, the same threshold as every other practice in this part. Secrets are the one exception to the "later is fine" rule: pull those out *immediately*, because a hard-coded threshold is a small cleanup deferred, but a committed credential is an expensive mistake that the version history makes permanent.
:::
## Summary {#sec-config-secrets-summary}
Getting values out of code makes it portable, reproducible, and safe to share:
1. **Configuration is not code.** Lift the values a run depends on — paths, thresholds, hyperparameters — into a file loaded at runtime, so the same code runs anywhere and changing a setting doesn't mean editing logic.
2. **Load config into a typed, validated object.** A `pydantic` model catches a typo, a wrong type, or an out-of-range value at startup, instead of failing silently deep in a run.
3. **Secrets never enter the repository.** Load them from the environment, keep a git-ignored `.env` locally and a committed `.env.example` template, and use a secret manager in production — a committed secret is permanent and must be rotated.
4. **Configured experiments are reproducible experiments.** Hyperparameters in version-controlled config files become a precise record of what each run used.
This closes Part 3. With code that is structured, pipelined, and configured, Part 4 takes it to production, beginning with the safety net that makes frequent change safe: *continuous integration*.
## Exercises {#sec-config-secrets-exercises}
1. Take a script or notebook of your own with hard-coded values — paths, thresholds, table names — and lift them into a config file loaded at runtime. Which of those values turned out to differ between your machine and where the code actually needs to run?
2. Find a secret in your code or, worse, in your repository's history (a password, key, or token) and move it to an environment variable loaded at runtime; add `.env` to `.gitignore` and commit a `.env.example` template instead. If the secret was ever committed, note why moving it is not sufficient on its own.
3. Load your configuration into a typed `pydantic` object rather than a bare dictionary, and add at least one constraint (a numeric range, or an allowed set of values). Feed it a deliberately invalid configuration and confirm it fails at startup with a message that names the problem.
4. **Conceptual:** The Data Science Bridge compares externalised configuration to a hyperparameter dictionary. Give one way the analogy holds and one way it breaks down. What does application configuration have to handle that a hyperparameter dictionary never does?
5. **Conceptual:** Hard-coding is not always wrong. Distinguish a value that should be configuration from one that's perfectly fine as a named constant in the code, and state the signal that tells you a hard-coded value has become a liability.