14 Containerisation

14.1 Shipping the environment

Chapter 3 got your dependencies under control: a lockfile pins every package to an exact version, so a colleague who installs it gets the same packages you have. But the lockfile stops at the Python layer. It doesn’t capture the Python interpreter itself, the system libraries your packages link against (the C library a wheel was built for, a CUDA runtime), or the operating system underneath. So “works in my environment” can still fail when your environment and mine differ below the packages — a different Python, a missing system library, a Linux-versus-macOS discrepancy in how some dependency behaves.

A container closes that last gap by shipping the whole environment as a single artefact: your code, your locked dependencies, the interpreter, the system libraries, and a minimal operating system, all sealed together. The promise is that the container runs identically on your laptop, your colleague’s, the CI runner, and the production server, because it isn’t relying on any of those machines for anything but the ability to run containers.

14.2 Images and containers

Two words do most of the work. An image is the frozen, layered filesystem, built once from a recipe: your code plus everything it needs to run, down to the OS userland (the system libraries and command-line tools a program expects to find around it; everything in an operating system except the kernel itself). A container is a running instance of an image. The relationship is the one between a class and an object, or between a saved model artefact and the loaded model serving predictions: one is the immutable definition, the other is the live thing.

What the image actually pins is the entire tower from Chapter 3, the layers your code silently sits on:

import platform
import sys
from importlib.metadata import version

# The slice of "my environment" that determines how the code behaves.
# A lockfile pins the packages; a container image also freezes the interpreter
# and the OS userland. The kernel release printed on the OS line is the host's.
print(f"OS:           {platform.system()} {platform.release()}")
print(f"Python:       {sys.version.split()[0]}")
for package in ("numpy", "pandas", "scikit-learn"):
    # Width 14 fits the longest label ("scikit-learn:") plus one space.
    print(f"{package + ':':14}{version(package)}")

OS:           Linux 6.17.0-1020-azure
Python:       3.12.13
numpy:        2.5.1
pandas:       3.0.3
scikit-learn: 1.9.0

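A lockfile records only the bottom three lines: the package versions. The image freezes the interpreter and the operating system’s userland as well, so none of that can drift between machines. The one item in the printout it does not carry is the kernel: a container shares the host’s kernel, so the release number on the OS line belongs to whichever machine is running the container. That near-completeness is the whole point: there’s very little left for the host to get wrong.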
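To make drift between two machines visible, the same fingerprint can be captured as a dictionary and compared entry by entry. This is a minimal sketch: the function names `fingerprint` and `drift` are illustrative rather than part of any library, and the two dictionaries at the bottom are hand-written example data, not measurements.

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version


def fingerprint(packages=()):
    """Collect the parts of the environment that can differ between machines."""
    found = {
        "os": platform.system(),
        "kernel": platform.release(),  # comes from the host, even inside a container
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


def drift(mine, theirs):
    """Return {key: (mine, theirs)} for every entry that differs."""
    keys = sorted(set(mine) | set(theirs))
    return {k: (mine.get(k), theirs.get(k)) for k in keys if mine.get(k) != theirs.get(k)}


# Hypothetical example: a colleague on a different interpreter and pandas.
mine = {"os": "Linux", "python": "3.12.13", "pandas": "3.0.3"}
theirs = {"os": "Linux", "python": "3.11.9", "pandas": "2.2.3"}
print(drift(mine, theirs))
# {'pandas': ('3.0.3', '2.2.3'), 'python': ('3.12.13', '3.11.9')}
```

Two containers started from the same image should report no drift except, possibly, in the `kernel` entry.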

Data Science Bridge

A container is the logical end of the lockfile from Chapter 3. The lockfile was the move from “install whatever’s newest” to “install exactly these versions”; the container is the same move taken all the way down — from “exactly these packages” to “exactly this entire machine”. You already accepted the principle (pin the things that affect your results so they can’t change underneath you); the container just applies it to the layers the lockfile couldn’t reach.

Where it breaks down: a lockfile is a small text file you can read, diff, and review in a pull request — you can see at a glance that pandas went from 2.2.1 to 2.2.3. An image is an opaque binary blob, often hundreds of megabytes, that you can’t meaningfully read. You trade transparency for completeness, and the management shifts accordingly: instead of reviewing a diff, you version images, store them in a registry — a server that hosts built images the way PyPI hosts packages, so a machine elsewhere can pull the exact image by name and version — and rebuild them from the Dockerfile when something changes.

14.3 A Dockerfile for a model service

A Dockerfile is the recipe an image is built from. For the prediction service from Chapter 12, it reads as a short sequence of steps — start from a base, install the dependencies, copy the code, say how to run it:

# A minimal base: the interpreter and a thin OS, nothing else.
FROM python:3.12-slim

WORKDIR /app

# Install dependencies first, as their own layer, so this step is cached
# and only re-runs when requirements change — not on every code edit.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Then copy and install the application code (changes often, so it comes last).
# The install step is what puts the src-layout package on the import path.
COPY pyproject.toml .
COPY src/ ./src/
RUN pip install --no-cache-dir --no-deps .

# How to run the service (the FastAPI app from the API chapter).
CMD ["uvicorn", "customer_value.api:app", "--host", "0.0.0.0", "--port", "80"]

Two details carry most of the craft. The ordering, dependencies before code, exploits Docker’s layer caching. Each instruction produces a layer that Docker reuses until the instruction or the files it copies change; from that point on, every later layer is rebuilt. Copying and installing requirements before copying the code therefore means a code edit doesn’t trigger a full reinstall, the file-level version of the caching idea from Chapter 10. And the base image matters: -slim keeps the image small, and a multi-stage build (compiling in a fat builder image, then copying only the result into a slim final image) keeps it smaller still.
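The cascade is easier to see in a toy model. The sketch below is not how Docker computes its cache keys; it assumes a simplified rule (a layer's key hashes its parent's key, the instruction text, and the contents of any copied files) that reproduces the behaviour described above. The file names and contents are made up for illustration.

```python
import hashlib


def layer_keys(steps, files):
    """Return one cache key per step.

    Each key hashes the previous key, the instruction text, and the contents
    of any files the step copies, so a change invalidates that layer and
    every layer after it.
    """
    keys = []
    parent = ""
    for instruction, copied in steps:
        h = hashlib.sha256()
        h.update(parent.encode())
        h.update(instruction.encode())
        for path in copied:
            h.update(path.encode())
            h.update(files[path].encode())
        parent = h.hexdigest()
        keys.append(parent)
    return keys


def rebuilt(before, after):
    """Indices of the steps whose cached layer can no longer be reused."""
    return [i for i, (old, new) in enumerate(zip(before, after)) if old != new]


# The Dockerfile above, reduced to (instruction, files it copies).
steps = [
    ("FROM python:3.12-slim", []),
    ("COPY requirements.txt .", ["requirements.txt"]),
    ("RUN pip install --no-cache-dir -r requirements.txt", []),
    ("COPY src/ ./src/", ["src/api.py"]),
    ("RUN pip install --no-cache-dir --no-deps .", []),
]
files = {"requirements.txt": "pandas==3.0.3\n", "src/api.py": "print('v1')\n"}

edited_code = dict(files, **{"src/api.py": "print('v2')\n"})
edited_deps = dict(files, **{"requirements.txt": "pandas==3.0.4\n"})

print(rebuilt(layer_keys(steps, files), layer_keys(steps, edited_code)))  # [3, 4]
print(rebuilt(layer_keys(steps, files), layer_keys(steps, edited_deps)))  # [1, 2, 3, 4]
```

A code edit rebuilds only the last two steps, while a dependency change rebuilds everything from the requirements copy onwards. That asymmetry is the reason for putting the slow, rarely changing step first.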

Equally important is what does not go in the image: data and secrets. Baking a dataset in bloats every copy of the image and ties it to one snapshot; baking in a secret puts a credential into an artefact that gets pushed to registries and shared (the Chapter 11 mistake, reincarnated). Instead, data is mounted as a volume at run time: a directory on the host machine (or a managed storage area) grafted into the container’s filesystem at a chosen path, so the container sees it as an ordinary folder while the data itself lives, and survives, outside the container. Secrets are injected as environment variables. That way the image stays a generic, shareable definition of how to run, not a container for what to run on or which credentials to use.
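On the application side, keeping data and secrets out of the image means reading both at start-up rather than hard-coding them. A minimal sketch follows; the variable names `MODEL_API_KEY` and `DATA_DIR`, the `/data` default, and the `load_settings` helper are all hypothetical and not taken from the prediction service itself.

```python
import os
from pathlib import Path


def load_settings(env=os.environ):
    """Read run-time configuration; fail loudly if the secret is missing."""
    try:
        api_key = env["MODEL_API_KEY"]  # hypothetical secret name
    except KeyError:
        raise RuntimeError("MODEL_API_KEY is not set; inject it at run time") from None
    # Hypothetical mount point for the data volume, with a default.
    data_dir = Path(env.get("DATA_DIR", "/data"))
    return {"api_key": api_key, "data_dir": data_dir}


# A plain dict stands in for the container's environment in this example.
settings = load_settings({"MODEL_API_KEY": "example-only", "DATA_DIR": "/mnt/customers"})
print(settings["data_dir"])
```

Failing at start-up when the secret is absent is a deliberate choice here: a container that refuses to start is easier to diagnose than one that starts and then fails on its first request.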

14.4 Size, and the architecture trap

That Dockerfile builds and runs, and two things about it will bite you later. Both are invisible on your own machine, which is precisely why they bite.

The first is size. An image isn’t built once and forgotten; it is pulled — downloaded onto a host — before anything can run. Every deployment pulls it, and so does every autoscale event, so when traffic spikes on Monday morning and the platform starts three more instances, each one waits on that download before it can serve a single request. A two-gigabyte image, most of it compilers you needed for ninety seconds at build time, turns “scale up now” into “scale up shortly”, which is exactly the wrong latency for a spike. Size also enlarges the attack surface: every system package in the image is a package that may turn out to have a vulnerability, and the ones you never use are the ones nobody thinks to patch.
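The cost is simple arithmetic. The sketch below estimates a lower bound on pull time from image size and download speed; the 100 MB/s rate and the 250 MB figure for a slimmed-down image are assumptions chosen for illustration, and the estimate ignores decompression and any layers the host already has cached.

```python
def pull_seconds(image_megabytes, megabytes_per_second):
    """Lower bound on pull time: size divided by sustained download speed."""
    return image_megabytes / megabytes_per_second


# Assumed figures for illustration: a 100 MB/s sustained pull rate, a
# 2,000 MB single-stage image and a 250 MB multi-stage one.
for label, size in (("single-stage", 2000), ("multi-stage", 250)):
    print(f"{label:13}{pull_seconds(size, 100):5.1f} s before the instance can serve")
```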

A multi-stage build is the standard fix, and it’s simpler than the name suggests. You build in one image and ship a different one:

# Stage 1 — the builder. Compilers, headers and build tools live here.
FROM python:3.12-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2 — the image you actually ship. Nothing from the builder comes with it
# except the installed packages copied on the line below.
FROM python:3.12-slim
COPY --from=builder /install /usr/local
WORKDIR /app
COPY pyproject.toml .
COPY src/ ./src/
RUN pip install --no-cache-dir --no-deps .
CMD ["uvicorn", "customer_value.api:app", "--host", "0.0.0.0", "--port", "80"]

Only the final stage becomes the image; the builder is discarded once it has produced the installed packages. You keep the ability to compile a dependency that has no pre-built wheel, without paying for the compiler on every pull, forever.
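One way to confirm that the builder really stayed behind is to look for its tools from inside the shipped image. The following is a rough check rather than a complete audit: the tool list is illustrative (these three are among the programs `build-essential` installs on Debian-based images), and the `leftover_build_tools` helper is a hypothetical name.

```python
import shutil

BUILD_TOOLS = ("gcc", "g++", "make")  # illustrative list, not exhaustive


def leftover_build_tools(which=shutil.which):
    """Return the build tools that are still on PATH in this environment."""
    return [tool for tool in BUILD_TOOLS if which(tool) is not None]


if __name__ == "__main__":
    found = leftover_build_tools()
    print("build tools present:", found or "none")
```

Run inside a container started from the final stage, this should report none; run inside the builder stage, it should list all three.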

The second trap is architecture, and it catches nearly everyone once. An image is built for a specific CPU architecture, and by default that’s the architecture of the machine doing the building. Build on an Apple Silicon Mac and you get an arm64 image; most cloud Linux hosts are amd64, the x86-64 architecture that has run servers for two decades. The image runs beautifully on your laptop and then dies on the server with exec format error. That is the kinder failure, because it is loud. The unkind version is a host that quietly runs your arm64 image under emulation, several times slower, which presents as an unexplained performance regression rather than as a build mistake: you go looking for a slow query when the answer is that the container is pretending to be a different computer. Name the target when you build:

# Build for the deployment target, not for the laptop doing the building.
docker buildx build --platform linux/amd64 -t customer-value:1.0.0 .

Better still, build the image in CI (Chapter 13), where the runner’s architecture matches production and nobody has to remember the flag. That’s the general lesson underneath both traps: an image is only as portable as the assumptions frozen into it, and the ones that hurt are the assumptions you froze without noticing you’d made them.
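A frozen assumption can also be checked at start-up. The sketch below translates what the interpreter reports into Docker's architecture names and compares it with the intended target. The helper names are hypothetical, and one point is an assumption to verify on your own platform: under emulation the interpreter should report the architecture the image was built for, not the host's, which is what lets the check catch the quiet case as well as the loud one.

```python
import platform

# Docker's platform names differ from what the interpreter reports.
DOCKER_ARCH = {"x86_64": "amd64", "amd64": "amd64", "aarch64": "arm64", "arm64": "arm64"}


def image_arch(machine=None):
    """Translate platform.machine() into Docker's architecture name."""
    machine = (machine or platform.machine()).lower()
    return DOCKER_ARCH.get(machine, machine)


def check_arch(expected, machine=None):
    """Raise if the running image was not built for the expected architecture."""
    actual = image_arch(machine)
    if actual != expected:
        raise RuntimeError(f"image is {actual}, deployment target is {expected}")
    return actual


print(check_arch("amd64", machine="x86_64"))  # amd64
```

Calling `check_arch("amd64")` with no `machine` argument inspects the running interpreter, which is the form a service would use at start-up.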

14.5 Composing services

Real systems are rarely a single service. The prediction API might sit alongside a database and a cache, and you want to run them together, wired up, with one command. Docker Compose declares the set in a YAML file, here the API and its database:

# compose.yaml
services:
  api:
    build: .
    ports: ["8000:80"]
    env_file: .env                                    # the git-ignored file from Chapter 11
    environment:
      DATABASE_URL: postgresql://app:${POSTGRES_PASSWORD}@db:5432/customers
  db:
    image: postgres:16
    env_file: .env
    environment:
      POSTGRES_USER: app
      POSTGRES_DB: customers
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}         # required; the image refuses to start without it
    volumes: ["pgdata:/var/lib/postgresql/data"]      # data lives in a volume
volumes:
  pgdata:

docker compose up starts the whole stack. Note where the password comes from: not the YAML file, which is committed, but .env, which is not — the same split between a committed .env.example template and a git-ignored .env that Chapter 11 established. The compose file names the secret; it never contains it. This is the unit that the next chapter deploys and that the end-to-end project in MLOps pipeline assembles in full; for now the point is that a container is composable — services declared together, each a sealed environment, wired by configuration rather than by hope.
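The wiring in `DATABASE_URL` can be traced in a few lines. This sketch uses `string.Template` as a stand-in for Compose's `${...}` substitution (it models only the simple form used above, not Compose's full interpolation syntax) and then parses the result the way the API service might. The password is a made-up placeholder.

```python
from string import Template
from urllib.parse import urlsplit

# The value as written in compose.yaml, with the secret still a placeholder.
template = Template("postgresql://app:${POSTGRES_PASSWORD}@db:5432/customers")

# Stand-in for the git-ignored .env file; the password here is made up.
dotenv = {"POSTGRES_PASSWORD": "example-only"}

database_url = template.substitute(dotenv)
parts = urlsplit(database_url)

print(parts.hostname)   # db  (the service name doubles as the hostname)
print(parts.port)       # 5432
print(parts.username)   # app
print(parts.path)       # /customers
```

The hostname is the detail worth noticing: the API reaches the database at `db` because that is the service's name in the compose file, not because of any address configured by hand.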

Author’s Note

Most data scientists meet Docker as an obstacle: an incantation Ops insists on, a Dockerfile copied from a colleague and tweaked until it builds, a layer of mystery between “my code works” and “it’s running in production”. Approached that way it’s pure friction, learned by superstition.

The reframe is to notice that the problem Docker solves is one you’ve hit your entire career, and hit hard in Chapter 3: it works on my machine. A container is “my machine”, packaged — so that the sentence stops being an excuse for an irreproducible result and becomes a guarantee that the thing which ran for you will run identically for everyone else. Seen as environment-as-code taken to its conclusion — not a new platform to master, but the lockfile finished off — the incantation turns into a recipe you can read, reason about, and write. You’re not learning Docker for its own sake; you’re closing the last gap in a problem you already understand.

14.6 Summary

A container ships the whole environment, not just the packages:

1. It seals the layers a lockfile can’t reach. Interpreter, system libraries, and OS userland are frozen alongside your pinned packages, so nothing drifts between machines: the completion of Chapter 3.
2. An image is the frozen definition; a container is a running instance. The image is built once from a Dockerfile and runs identically anywhere Docker runs on the CPU architecture it was built for.
3. Order layers for caching, and keep data and secrets out. Install dependencies before copying code so edits don’t trigger reinstalls; mount data as volumes and inject secrets as environment variables rather than baking either into the image.
4. Compose multiple services. Docker Compose declares an API, a database, and more as one wired-together stack started with a single command.

With the service verified by CI and packaged into an image, the next chapter puts it somewhere it can actually serve traffic: deployment.

14.7 Exercises

1. Write a Dockerfile for a small service (the FastAPI model from Chapter 12, or one of your own): a slim base image, install your locked requirements, copy the code, and set the run command. Build the image and run a container from it.
2. Improve the image. Order the instructions so that editing your code does not trigger a dependency reinstall, then reduce the image size (a slimmer base, a multi-stage build, or removing build tools afterwards). By how much did the image shrink?
3. Keep data and secrets out of the image: mount a data directory as a volume and pass a secret as an environment variable at run time, rather than baking either into the image. Explain concretely what goes wrong if you bake them in instead.
4. Conceptual: The Data Science Bridge calls a container the logical end of the Chapter 3 lockfile. Stress-test that: two colleagues build the same Dockerfile, from the same commit, six months apart. Explain how they can end up with images that behave differently, and what you would change in the Dockerfile to make the analogy honest.
5. Conceptual: Not everything needs to be containerised. Describe a piece of work for which a container is clearly worth the effort and one for which it is overkill, and name the property of the situation that decides between them.

Content: CC BY-NC-SA 4.0 | Code: MIT (see /LICENSE.md)