Data scientists rely on powerful computational libraries to drive their research and analysis. When packaged into a Docker container, these essential libraries can balloon the image size: a typical Python data science container easily exceeds 1GB. Trimming these images speeds up development and deployment cycles and streamlines your overall data science workflow.
While dependencies are essential for any project, not all dependencies are created equal. There are intended dependencies, like NumPy, that help data scientists do their analysis. Unfortunately, there are also unintended dependencies, like the build tools gcc and build-essential, that sneak into your final Docker image. These dependencies are necessary for building the application, but they are not required for running it.
In this blog, we will learn how to create a Docker image using multi-stage builds, keeping only the dependencies your application actually needs.
Think of Docker multi-stage builds like car manufacturing and delivery.
Traditional build: shipping the entire car factory, with all its machinery and raw materials, along with the finished car to your customer.
Multi-stage build: shipping only the finished car.
Just as we don't need a whole factory to drive a car, we don't need build tools to run our application.
Let's build a container for our matrix operations tool, matrixops.py.
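Before diving into the Dockerfiles, here's a minimal sketch of what matrixops.py might look like. The real script is in the downloadable files below; this version is reconstructed from the example output later in the post, so treat the random operation and other details as assumptions:

import argparse

import numpy as np


def build_matrix(kind: str, rows: int, cols: int) -> np.ndarray:
    # Only "identity" is demonstrated later in this post; "random" is an assumed extra.
    if kind == "identity":
        return np.eye(rows, cols)
    return np.random.rand(rows, cols)


def main() -> None:
    parser = argparse.ArgumentParser(description="Generate and analyze matrices with NumPy")
    parser.add_argument("kind", choices=["identity", "random"], help="matrix to generate")
    parser.add_argument("rows", type=int)
    parser.add_argument("cols", type=int)
    args = parser.parse_args()

    matrix = build_matrix(args.kind, args.rows, args.cols)
    print(f"🔢 Generated {args.kind.upper()} Matrix ({args.rows}x{args.cols}):")
    print(matrix)
    print("📊 Matrix Analysis:")
    print(f"Mean: {matrix.mean()}")
    print(f"Std Dev: {matrix.std()}")
    print(f"Min: {matrix.min()}")
    print(f"Max: {matrix.max()}")
    print(f"Sum: {matrix.sum()}")
    if args.rows == args.cols:
        print(f"Determinant: {np.linalg.det(matrix)}")


if __name__ == "__main__":
    main()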
The straightforward approach includes all build dependencies in the final image, making it unnecessarily large. Here's a typical unoptimized Dockerfile:
FROM python:3.12
WORKDIR /app
# Build dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential \
        gcc \
    && rm -rf /var/lib/apt/lists/*
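# NumPy installs on top of the build tools, and all of it ends up in the final image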
RUN pip install numpy
COPY matrixops.py .
ENTRYPOINT [ "python", "matrixops.py" ]
CMD ["--help"]
With a multi-stage build, we can keep those unintended dependencies out of the final image, resulting in a much smaller image. Here's the optimized Dockerfile:
# Build stage
FROM python:3.12 AS builder
WORKDIR /app
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential \
        gcc \
    && rm -rf /var/lib/apt/lists/*
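# Install NumPy into a standalone prefix so the final stage can copy it out as a unit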
RUN pip install --prefix=/install numpy
# Final stage
FROM python:3.12-slim
WORKDIR /app
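# Bring over only the installed packages; the compilers stay behind in the builder stage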
COPY --from=builder /install /usr/local
COPY matrixops.py .
ENTRYPOINT [ "python", "matrixops.py" ]
CMD ["--help"]
Download the required files:
To build both images, run:
$ docker build -f unoptimized.dockerfile -t matrixops:unoptimized .
$ docker build -f optimized.dockerfile -t matrixops:optimized .
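Once the optimized image is built, a quick sanity check confirms that NumPy made it in while the compilers stayed behind (the --entrypoint override here is standard docker run usage, nothing specific to this image):

$ docker run --rm --entrypoint python matrixops:optimized -c "import numpy; print(numpy.__version__)"
$ docker run --rm --entrypoint gcc matrixops:optimized --version

The first command prints the NumPy version; the second should fail, because gcc was never copied into the slim final image.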
Example run:
$ docker run -it matrixops:optimized identity 3 3
🔢 Generated IDENTITY Matrix (3x3):
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
📊 Matrix Analysis:
Mean: 0.3333333333333333
Std Dev: 0.4714045207910317
Min: 0.0
Max: 1.0
Sum: 3.0
Determinant: 1.0
Let's compare the image sizes:
$ docker images | grep matrixops
matrixops optimized d8a7716ca91c 6 minutes ago 276MB
matrixops unoptimized 54f928a6525b 10 minutes ago 1.61GB
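If you're curious where those 1.61GB went, docker history breaks an image down layer by layer; in the unoptimized image you should see the apt-get and pip install layers dominating:

$ docker history matrixops:unoptimized
$ docker history matrixops:optimized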
Multi-stage builds separate intended from unintended dependencies. The unintended ones stay behind in the builder stage, shrinking the final image.
Leaner images also lead to faster container startup. The smaller the image, the less data there is to pull and extract before your application runs, giving you a more responsive experience.
They also make development and deployment more efficient: pushing to and pulling from a registry is quicker, which is perfect for CI/CD pipelines.
Remember: leaner Docker images = faster iteration. KISS: Keep It Simple & Small!