
Optimize your Docker build

written by Ricky Lim on 2025-02-05

Why Docker Size Matters for Data Scientists

Data scientists often rely on powerful computational libraries to drive innovative research and analysis. These essential libraries, when packaged into a Docker container, can significantly increase its size. A typical Python data science container can easily exceed 1GB. Shrinking these containers is an opportunity to speed up development and deployment cycles across your data science workflow.

While dependencies are essential for any project, not all dependencies are created equal. There are intended dependencies, like NumPy, that help data scientists do data analysis. Unfortunately, there are also unintended dependencies, such as the build tools gcc and build-essential, that sneak into your final Docker image. Though these dependencies are necessary for building the application, they are not required for running it. In this blog, we will learn how to create a Docker image using Docker multi-stage builds, keeping only the dependencies you need to run your application.

The Car Factory Analogy

Think of Docker multi-stage builds like car manufacturing and delivery.

Just as we don't need a whole factory to drive a car, we don't need build tools to run our application.

Data Science Container Example

Let's build a container for our matrix operations tool, matrixops.py.

🐘 Traditional Build Approach

This approach includes all build dependencies in the final image, making it unnecessarily large.

Here's a typical unoptimized Dockerfile:

FROM python:3.12

WORKDIR /app

# Build dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    gcc \
    && rm -rf /var/lib/apt/lists/*

RUN pip install numpy

COPY matrixops.py .

ENTRYPOINT [ "python", "matrixops.py" ]
CMD ["--help"]

🚀 The Multi-stage Build Approach

With a multi-stage build, we can keep the unintended dependencies in a separate build stage and out of the final image, resulting in a much smaller image.

Here's an optimized Dockerfile:

# Build stage
FROM python:3.12 AS builder

WORKDIR /app

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    gcc \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --prefix=/install numpy

# Final stage
FROM python:3.12-slim

WORKDIR /app
COPY --from=builder /install /usr/local

COPY matrixops.py .

ENTRYPOINT [ "python", "matrixops.py" ]
CMD ["--help"]
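Why does copying /install onto /usr/local work? The official python images install Python under /usr/local, and pip's --prefix option lays packages out in the same directory structure below the prefix you give it. A small sketch with the standard library's sysconfig module shows the two paths lining up. The helper name `site_packages_under` is made up for this example, and the "posix_prefix" scheme is assumed because it is the standard layout on Linux builds of Python:

```python
import sysconfig


def site_packages_under(prefix):
    """Return the site-packages directory for the standard POSIX layout rooted at prefix."""
    return sysconfig.get_path(
        "purelib",
        scheme="posix_prefix",  # assumption: the usual Linux layout
        vars={"base": prefix, "platbase": prefix},
    )


# Where `pip install --prefix=/install` places packages in the builder stage.
print(site_packages_under("/install"))

# Where the interpreter in the final stage looks for them.
print(site_packages_under("/usr/local"))
```

Both paths share the same suffix below the prefix (lib/pythonX.Y/site-packages, where X.Y is the version of the interpreter running the snippet), so the COPY in the final stage drops NumPy exactly where that stage's Python expects it. This only holds when both stages use the same Python minor version, as they do here with 3.12.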

Hands-on Demo

You need three files in one directory: matrixops.py, unoptimized.dockerfile, and optimized.dockerfile.

To build both images, run:

$ docker build -f unoptimized.dockerfile -t matrixops:unoptimized .
$ docker build -f optimized.dockerfile -t matrixops:optimized .

Example run:

$ docker run -it matrixops:optimized identity 3 3

🔢 Generated IDENTITY Matrix (3x3):
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

📊 Matrix Analysis:
Mean: 0.3333333333333333
Std Dev: 0.4714045207910317
Min: 0.0
Max: 1.0
Sum: 3.0
Determinant: 1.0
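The script itself is not reproduced in this post. For readers who want to follow along without it, here is a minimal, hypothetical stand-in that computes the same kind of statistics as the run above. The function names `make_matrix` and `analyze` and the supported matrix kinds are assumptions of this sketch, not the original tool, and the command-line parsing is left out:

```python
import numpy as np


def make_matrix(kind, rows, cols):
    """Build a rows x cols matrix of the requested kind (hypothetical helper)."""
    if kind == "identity":
        return np.eye(rows, cols)
    if kind == "ones":
        return np.ones((rows, cols))
    raise ValueError(f"unsupported matrix kind: {kind!r}")


def analyze(matrix):
    """Return summary statistics for a matrix as a dict of plain floats."""
    stats = {
        "Mean": float(matrix.mean()),
        "Std Dev": float(matrix.std()),
        "Min": float(matrix.min()),
        "Max": float(matrix.max()),
        "Sum": float(matrix.sum()),
    }
    # The determinant is only defined for square matrices.
    if matrix.shape[0] == matrix.shape[1]:
        stats["Determinant"] = float(np.linalg.det(matrix))
    return stats


matrix = make_matrix("identity", 3, 3)
print(matrix)
for name, value in analyze(matrix).items():
    print(f"{name}: {value}")
```

NumPy is the only third-party import, which is why it is the only package the Dockerfiles install with pip.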

Image Size Comparison

Let's compare the image sizes:

$ docker images | grep matrixops

matrixops                  optimized     d8a7716ca91c   6 minutes ago    276MB
matrixops                  unoptimized   54f928a6525b   10 minutes ago   1.61GB

Impact by Numbers
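The two sizes in the `docker images` listing above translate into a relative saving. Here is a small sketch of the arithmetic, assuming the sizes are in decimal units (1GB = 1000MB), which is how the Docker CLI prints them. The helper name `to_megabytes` is made up for this example:

```python
# Sizes copied from the `docker images` output above.
SIZES = {"optimized": "276MB", "unoptimized": "1.61GB"}

# Assumption: decimal units (1GB = 1000MB).
UNITS_IN_MB = {"MB": 1.0, "GB": 1000.0}


def to_megabytes(size):
    """Convert a string such as '1.61GB' to a number of megabytes."""
    for suffix, factor in UNITS_IN_MB.items():
        if size.endswith(suffix):
            return float(size[: -len(suffix)]) * factor
    raise ValueError(f"unrecognized size: {size!r}")


optimized = to_megabytes(SIZES["optimized"])
unoptimized = to_megabytes(SIZES["unoptimized"])

saved = unoptimized - optimized
reduction = saved / unoptimized
ratio = unoptimized / optimized

print(f"Saved: {saved:.0f}MB")         # Saved: 1334MB
print(f"Reduction: {reduction:.1%}")   # Reduction: 82.9%
print(f"Ratio: {ratio:.1f}x smaller")  # Ratio: 5.8x smaller
```

In other words, the multi-stage image is roughly one sixth the size of the single-stage one for this example.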

Key Takeaways

  1. Multi-stage builds separate intended and unintended dependencies. The unintended dependencies stay in the builder stage, reducing the final image size.

  2. Leaner images can lead to faster container startup. The smaller the Docker image, the less data there is to pull and unpack before a container can start, which matters most on machines that do not have the image cached yet.

  3. Efficient development and deployment. Smaller images are quicker to push to a registry and to pull from it, which suits CI/CD pipelines.
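The first takeaway can be checked directly: if the build tools really stayed behind in the builder stage, they should not be on the PATH of the final image. A minimal sketch using only the standard library follows. The function name `find_build_tools` and the list of tool names are choices made for this example:

```python
import shutil


def find_build_tools(names=("gcc", "g++", "make")):
    """Map each tool name to its path on PATH, or None when it is not installed."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_build_tools().items():
    print(f"{tool}: {path or 'not found'}")
```

To try the same check inside a container, the image's entrypoint has to be overridden, for example with `docker run --rm --entrypoint python matrixops:optimized -c "import shutil; print(shutil.which('gcc'))"`. On the optimized image this should print None, while the unoptimized image should print a path to gcc.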

Next Steps:

Remember: Leaner Docker images = faster iteration. KISS - Keep It Simple & Small!