Docker image layers and cache
In this post, we'll walk through Docker image layers and the caching around them from the point of view of a Docker user. I'll assume you're already familiar with Dockerfiles and Docker concepts in general.
✌️ The two axioms of Docker layers
There are two key concepts to understand, from which everything else is deduced. Let's call them our axioms.
- Axiom 1: Every instruction in a Dockerfile results in a layer [1]. Each layer is stacked onto the previous one and depends upon it.
- Axiom 2: Layers are cached, and a layer's cache is invalidated whenever that layer or its parent changes. The cache is reused on subsequent builds.
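One way to picture the two axioms together is a chain of hashes, where each layer's cache key is derived from its parent's key plus its own instruction. The sketch below is only a toy model in Python: Docker computes its real cache keys differently, and the instruction strings are assumptions based on the example used in this post.

```python
import hashlib


def layer_keys(instructions):
    """Return one cache key per instruction, each chained to its parent's key."""
    keys = []
    parent = ""  # the first layer has no parent
    for instruction in instructions:
        digest = hashlib.sha256((parent + "\n" + instruction).encode()).hexdigest()
        keys.append(digest)
        parent = digest
    return keys


# Hypothetical instructions; only the WORKDIR differs between the two lists.
before = layer_keys([
    "FROM ubuntu",
    "WORKDIR /app",
    "COPY somefile .",
    "RUN md5sum somefile > somefile.md5",
])
after = layer_keys([
    "FROM ubuntu",
    "WORKDIR /data",
    "COPY somefile .",
    "RUN md5sum somefile > somefile.md5",
])

# Only the first layer keeps its key: changing WORKDIR changes every key after it.
unchanged = [a == b for a, b in zip(before, after)]
print(unchanged)  # [True, False, False, False]
```

Because every key includes the parent's key, a change anywhere in the chain ripples down to all the layers stacked on top of it.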
So, what happens when we build a small Docker image?
[Listing: Dockerfile]
[Output: first docker build]
- Docker first downloads our base image since it isn't available locally yet.
- It creates the /app directory. Subsequent commands will run inside this directory.
- It copies the file from our local directory to the image.
- It stores the MD5 hash of our file inside a file named somefile.md5.
Now if we try to build the image again, without changing anything, here's what happens:
[Output: second docker build]
For every step, Docker says it's "using cache." Remember our axioms? Well, each step of our first build generated a layer which is cached locally and was reused for our second build.
🔄 Cache invalidation
We can get some information about the layers of our image using docker history:
[Output: docker history]
This output should be read as a stack: the first layer is at the bottom and the last layer of the image is at the top. This illustrates the dependencies between layers: if a "foundation" layer changes, Docker has to rebuild it and all the layers that were built upon it.
That's natural: layers 2 and 3 may depend on the output of layer 1, so they should be rebuilt when layer 1 changes.
In our example:
[Listing: Dockerfile]
- The COPY instruction depends on the previous layer because if the working directory were to change, we would need to change the location of the file.
- The RUN instruction must be replayed if the file changes or if the working directory changes because then the output file would be placed elsewhere. It also depends on the presence of the md5sum command, which exists in the ubuntu image but might not exist in another one.
So if we change the content of somefile, the COPY will be replayed as well as the RUN. If after that we change the WORKDIR, it will be replayed as well as the other two.
Let's try this:
[Output: docker build after changing somefile]
See, Docker detected that our file had changed, so it ran the copy again as well as the md5sum but used the WORKDIR from the cache.
This mechanism is especially useful for builds that take time, like installing your app's dependencies.
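The three builds we just went through can be replayed with a small Python simulation. This is a rough sketch, not how Docker is implemented: the instruction strings and file contents are assumptions, and the "cache" is just a set of hashes that survives between builds. The one real behavior it mirrors is that a COPY step is keyed on the content of the copied file, while the other steps are keyed on their text and their parent.

```python
import hashlib


def build(steps, cache):
    """Simulate a build and report "run" or "cached" for each step.

    `steps` is a list of (instruction, input_content) pairs. `cache` is a set
    of known layer keys that persists between builds.
    """
    report = []
    parent = ""
    for instruction, content in steps:
        key = hashlib.sha256(f"{parent}|{instruction}|{content}".encode()).hexdigest()
        if key in cache:
            report.append("cached")
        else:
            report.append("run")
            cache.add(key)
        parent = key
    return report


def dockerfile(somefile_content):
    # Hypothetical instructions matching the steps described in this post.
    return [
        ("FROM ubuntu", ""),
        ("WORKDIR /app", ""),
        ("COPY somefile .", somefile_content),  # key depends on the file content
        ("RUN md5sum somefile > somefile.md5", ""),
    ]


cache = set()
first = build(dockerfile("hello"), cache)          # nothing is cached yet
second = build(dockerfile("hello"), cache)         # nothing changed
third = build(dockerfile("hello, world"), cache)   # somefile was edited

print(first)   # ['run', 'run', 'run', 'run']
print(second)  # ['cached', 'cached', 'cached', 'cached']
print(third)   # ['cached', 'cached', 'run', 'run']
```

In the third build the RUN step is replayed even though its own text did not change, because its parent (the COPY layer) has a new key.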
🏃♂️ Speed up your builds
Let's consider another example:
[Listings: requirements.txt, main.py and Dockerfile]
Let's build this.
[Output: docker build]
Running this image gives us:
[Output: docker run]
That's OK, but we'd prefer a nicer output. What about using pprint? Easy! We just need to edit our main.py and rebuild.
[Listing: updated main.py]
[Output: docker build after editing main.py]
See? Because we chose to add all of our files in one command, whenever we modify our source code, Docker has to invalidate all the subsequent layers, including the dependency installation.
To speed up our local builds, we may want to skip the dependency installation when the dependencies haven't changed. It's quite easy: add the requirements.txt first, install the dependencies and then add our source code.
[Listing: reordered Dockerfile]
After a first successful build, changing the source code will not trigger the dependency installation again. Dependencies will only be re-installed if:
- You pull a newer version of python:3.8.6-buster
- The requirements.txt file is modified
- You change any instruction in the Dockerfile from the FROM to the RUN pip install (included). For example, if you change the working directory, or if you decide to copy another file with the requirements, or if you change the base image.
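To compare the two orderings side by side, here is a small Python model that lists which steps have to run again after a source-only change. It is a sketch under assumptions: the instruction strings, the file contents and the hashing scheme are all illustrative, not taken from the original Dockerfiles or from Docker's internals.

```python
import hashlib


def rebuilt_steps(steps, old_files, new_files):
    """Return the instructions that must run again after files change.

    `steps` is a list of (instruction, copied_file_names) pairs. A step's key
    covers its parent's key, its own text and the content of the files it copies.
    """
    def keys(files):
        parent, out = "", []
        for instruction, names in steps:
            payload = "\n".join([parent, instruction] + [files[n] for n in names])
            parent = hashlib.sha256(payload.encode()).hexdigest()
            out.append(parent)
        return out

    old, new = keys(old_files), keys(new_files)
    return [step[0] for step, a, b in zip(steps, old, new) if a != b]


# Hypothetical Dockerfiles: everything copied at once vs. requirements first.
copy_all_first = [
    ("FROM python:3.8.6-buster", []),
    ("WORKDIR /app", []),
    ("COPY . .", ["requirements.txt", "main.py"]),
    ("RUN pip install -r requirements.txt", []),
]
requirements_first = [
    ("FROM python:3.8.6-buster", []),
    ("WORKDIR /app", []),
    ("COPY requirements.txt .", ["requirements.txt"]),
    ("RUN pip install -r requirements.txt", []),
    ("COPY main.py .", ["main.py"]),
]

# Made-up file contents; only main.py differs between the two builds.
old_files = {"requirements.txt": "requests", "main.py": "print(data)"}
new_files = {"requirements.txt": "requests", "main.py": "pprint(data)"}

print(rebuilt_steps(copy_all_first, old_files, new_files))
# ['COPY . .', 'RUN pip install -r requirements.txt']
print(rebuilt_steps(requirements_first, old_files, new_files))
# ['COPY main.py .']
```

With the requirements copied first, the slow pip install step sits above anything that depends on main.py, so a source edit only replays the final COPY.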
⏬ Reduce your final image size
Now you may also want to keep your images small. Since an image's size is the sum of the sizes of its layers, if you create some files in one layer and delete them in a subsequent layer, these files will still count toward the total image size, even if they are not present in the final filesystem.
Let's consider a last example:
[Listing: Dockerfile]
Pop quiz! Given the following:
- The ubuntu image I'm using weighs 73MB
- The file created by fallocate is actually 104857600 bytes, or about 105MB
- The MD5 sum file size is negligible
What will be the final size of the image?
- 73MB
- 105MB
- 178MB
- zzZZZzz... Sorry, you were saying?
Well, I'd like the answer to be 73MB, but instead the image will weigh the full 178MB. Because we created the big file in its own layer, it counts toward the total image size even if it's deleted afterwards.
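The arithmetic behind the answer is short enough to check in a few lines of Python. The sizes come from the quiz above; the layer order (create the file, write the checksum, delete the file) is an assumption about the Dockerfile, and the tiny bookkeeping cost of the deletion layer is treated as zero.

```python
BASE_IMAGE_MB = 73            # size of the ubuntu image quoted in the quiz
BIG_FILE_BYTES = 104_857_600  # file created by fallocate

big_file_mb = BIG_FILE_BYTES / 1_000_000       # decimal megabytes
big_file_mib = BIG_FILE_BYTES / (1024 * 1024)  # binary mebibytes

# Layer sizes in MB, bottom to top: base image, layer creating the big file,
# layer writing the (negligible) MD5 file, layer deleting the big file.
layer_sizes_mb = [BASE_IMAGE_MB, big_file_mb, 0, 0]

image_size_mb = sum(layer_sizes_mb)
print(round(big_file_mb), big_file_mib, round(image_size_mb))  # 105 100.0 178
```

Note that 104857600 bytes is exactly 100 MiB, which is about 105MB in decimal units, and that the deletion layer contributes nothing that would bring the total back down.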
What we could have done instead is combine the three RUN instructions into one, like so:
[Listing: Dockerfile with a single RUN instruction]
This Dockerfile produces a final image whose filesystem looks exactly the same as the previous one, but without the extra 105MB. Of course, this has the downside of making you recreate the big file every time this layer is invalidated, which could be annoying if creating this file is a costly operation.
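Why combining helps can be sketched with a simplified model in which a layer only stores the files that still exist when its instruction finishes. This is an illustration in Python, not Docker's storage format; the file names and the 50-byte checksum size are made up.

```python
def layer_sizes(run_instructions):
    """Size in bytes added by each RUN layer.

    Each instruction is a list of ("create", name, size) or ("delete", name)
    operations. A layer only stores files that still exist when its
    instruction finishes; deleting a file from an earlier layer frees nothing.
    """
    sizes = []
    for operations in run_instructions:
        created = {}
        for op in operations:
            if op[0] == "create":
                created[op[1]] = op[2]
            elif op[0] == "delete":
                created.pop(op[1], None)  # only helps if created in this same layer
        sizes.append(sum(created.values()))
    return sizes


BIG = 104_857_600  # size of the big file, from the quiz
MD5 = 50           # hypothetical, negligible size for the checksum file

separate = [
    [("create", "bigfile", BIG)],
    [("create", "bigfile.md5", MD5)],
    [("delete", "bigfile")],
]
combined = [separate[0] + separate[1] + separate[2]]

print(sum(layer_sizes(separate)))  # 104857650
print(sum(layer_sizes(combined)))  # 50
```

In the combined version the big file is created and removed before the layer is recorded, so it never makes it into any layer.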
This pattern is often used in official base images that try to stay small whenever they can. For example, consider this snippet from the python:3.8.7-buster image (MIT License):
[Listing: snippet from the python:3.8.7-buster Dockerfile]
See how python.tar.xz is downloaded and then deleted all in the same step? That's to prevent it from adding weight to the final image. It's quite useful! But don't overuse it or your Dockerfiles might become unreadable.
🗒 Key takeaways
- Every instruction in a Dockerfile results in a layer [1]. Each layer is stacked onto the previous one and depends upon it.
- Layers are cached, and a layer's cache is invalidated whenever that layer or its parent changes. The cache is reused on subsequent builds.
- Use docker history to learn more about your image's layers.
- Reduce your build duration by adding only the files you need when you need them. Push files that might change a lot to the bottom of your Dockerfile (dependencies installation example).
- Reduce your image size by combining multiple RUN instructions into one if you create files and delete them shortly after (big file deletion example).
Well that wraps it up for today! It was quite technical but I hope you learned something along the way 🙂
As always, please contact me if you have comments or questions!
📚 Further reading
[1] Well, that's not true anymore: see "Best practices for writing Dockerfiles: Minimize the number of layers" (Docker docs). But since it's easier to understand this way, I'm willing to make this compromise for this article.