How we avoid Docker Hub rate limits for hundreds of GitLab CI/CD jobs - with minimal impact

by Jan Koppe
Like probably most companies today, BurdaForward makes heavy use of containerization when running our products and applications. We build many different images in CI/CD, and these are most often based on images pulled from Docker Hub. During a regular work day there are many hundreds of image pulls from Docker Hub, and this has recently become an issue for us.
Docker introduced rate limits for image pulls from Docker Hub some years ago. Because we use a lot of GitLab runner instances, the issue was not apparent right away, but as usage increased over time, with more and more teams adopting containerization and shortening their deployment cadence, we started hitting these limits quite often.
With builds failing over and over again, we had to quickly find a solution for this.
Previous solution attempts
Our first attempt at a solution, some years ago, was to switch our setup from AWS EC2 instances behind a NAT gateway to EC2 instances with public IPs. This was a quick fix, but not a proper solution: it effectively circumvented the IP-based rate limits by simply using different IPs after a number of jobs had run. Cost-wise, this approach was initially even beneficial, as there were no more charges for NAT gateway traffic, but with the public IPv4 address charge that AWS introduced on February 1, 2024, this is no longer the case.
Another issue is the reduced security of having instances with public IPs. While incoming connection attempts can be effectively mitigated by properly configuring the security groups, we were unable to provide teams with a set of IPs that they could allowlist on external services, as the IPs were changing all the time. IP allowlisting in itself is not a great security measure, but it’s an additional layer of security that we could not make use of with this setup.
The biggest practical issue with this approach is that we were effectively hoping to get “clean” IP addresses from the EC2 public IP pool, which is not guaranteed. Most of the time this worked very well, but this band-aid solution also started to show cracks over time.
A better solution
To fix this for good, we decided to set up a machine user in Docker Hub with a paid subscription. As outlined in the Docker Hub rate limits documentation, starting on April 1, 2025, all Docker Hub users with a paid subscription will be able to pull images without any rate limits. This seems like the perfect solution for us.
But, how do you set this up in practice, with hundreds of GitLab CI/CD configurations running on tens of GitLab runners across many development teams?
We did not want to touch every GitLab CI/CD configuration file and distribute login credentials across all teams, so we have come up with a different solution.
Docker registry
We are using a self-hosted Docker registry instance as a pull-through cache, which allows us to pull images from Docker Hub once and then cache them in our registry. All pulls done through this registry instance will be using the Docker Hub credentials that we configure in this single place, and thus we can make use of unlimited pulls from Docker Hub for this instance.
In our case, we are running the registry on an ECS Fargate task within the same private subnet as the GitLab runners. This way, we can make sure that the registry is only accessible from the GitLab runners and not from the public internet. Also, no traffic needs to leave or enter the VPC for this function to work, reducing traffic costs further.
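As a rough sketch of how that isolation can be expressed (the security group IDs below are placeholders, not from our actual setup), the registry task's security group only needs to allow ingress on the registry port from the runners' security group:
```
# Hypothetical IDs; allow only the GitLab runner security group to reach
# the registry task on its listening port (5000).
aws ec2 authorize-security-group-ingress \
  --group-id sg-0registrytask00000 \
  --protocol tcp \
  --port 5000 \
  --source-group sg-0gitlabrunners0000
```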
For the ECS task, we configure the Docker registry through the following environment variables (a quick local smoke test with a similar configuration is sketched after the list):
- REGISTRY_PROXY_REMOTEURL: registry-1.docker.io
- REGISTRY_PROXY_USERNAME: <username>
- REGISTRY_PROXY_PASSWORD: <password>
- REGISTRY_PROXY_TTL: 168h (7 days, to make sure that we are not caching mutable tags for too long)
- REGISTRY_STORAGE: s3
- REGISTRY_STORAGE_S3_REGION: <region>
- REGISTRY_STORAGE_S3_BUCKET: <bucket>
- REGISTRY_STORAGE_S3_DELETE_ENABLED: true
- OTEL_TRACES_EXPORTER: none (We’re not using OpenTelemetry in this setup, this disables unnecessary log lines)
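Before wiring this into ECS, the pull-through cache behaviour can be smoke-tested with a throwaway container. This is only a sketch: it uses the default filesystem storage instead of S3, the credentials are placeholders, and the remote URL is written with an explicit https:// scheme as in the upstream registry documentation:
```
# Run a throwaway pull-through cache locally (filesystem storage, no S3).
docker run -d --name hub-mirror -p 5000:5000 \
  -e "REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io" \
  -e "REGISTRY_PROXY_USERNAME=<username>" \
  -e "REGISTRY_PROXY_PASSWORD=<password>" \
  registry:2

# Pull an official image through the cache by prefixing the mirror host.
docker pull localhost:5000/library/alpine:latest
```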
We are storing cached images in an S3 bucket, which is a good fit for this use case: ECS Fargate tasks only have small, ephemeral storage, which is not enough to hold all the images we pull, while S3 lets us cache as much as we need.

One difficulty that comes with using ECS tasks is that each new task gets a random IP address, which is not ideal for configuring clients. To make sure that our registry is always reachable at a known address, we are using AWS service discovery to create a Route 53 record for the registry. Make sure that DNS resolution is enabled for your VPC so that the internal hostname resolves. This way, we can always access the registry via registry.<internal-zone>. It saves us from running an expensive load balancer in front of the registry, or from constantly updating IP addresses in configuration files or DNS records. It’s a great way to connect to ECS services internally when a full load balancer is not needed.
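A quick way to verify both the service discovery record and the registry itself from one of the runner instances might look like this (registry.<internal-zone> being the placeholder hostname used throughout this post):
```
# Does the service-discovery DNS record resolve inside the VPC?
nslookup registry.<internal-zone>

# Does the registry answer on the Docker Registry HTTP API v2 base endpoint?
curl -i http://registry.<internal-zone>:5000/v2/
```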
One of the issues that we ran into when first building the setup was that, no matter how often we double-checked the configuration, the registry was unable to authenticate with S3. After lots of head-scratching, reading the source code (one of the superpowers you gain by choosing FLOSS software!) and some GitHub issues, it became clear that our ECS tasks get their IAM credentials via IMDSv2, which was not yet supported in the registry version 2 that we were working with.
Luckily, the registry project has merged changes that update its dependencies to a version that supports IMDSv2. These changes are only available in registry version 3, which is currently still in a release candidate phase. We decided to go forward with the release candidates, as this seemed like the most future-proof and secure option. As of writing this, we have not encountered any issues with the original RC1 version that we are using, and we plan to update to the final 3.0.0 version as soon as it is released.
With this setup, we can now pull Docker images through this registry instance, and we will not hit any rate limits. But how do we make sure that all GitLab CI/CD jobs are using this registry instance?
GitLab configuration
In our setup, we still use the docker-machine executor for GitLab runners. This allows us to autoscale the number of runner instances based on the number of jobs running at any given time. Even though some CI/CD jobs run during nights and weekends, we still save significant costs by scaling down the number of runners outside of regular working hours.
To make sure that all GitLab CI/CD jobs are using the registry, we are configuring the GitLab runner instances to use the registry as a proxy for Docker Hub. This is usually done by configuring the daemon.json file on the runner instances to use the registry as a proxy, and then making sure that the Docker daemon is started with this configuration.
Luckily, docker-machine provides an option to configure the registry-mirror directly. We are using this option to directly point the runner instances to the registry instance like so:
```
[[runners]]
  ...
  [runners.machine]
    ...
    MachineOptions = [
      ...,
      "engine-registry-mirror=http://registry.<internal-zone>:5000",
    ]
```
This is pretty good! It configures all runners to use the registry as a proxy for Docker Hub, so all jobs pull their images through the registry. This way, we are not hitting any rate limits, and all jobs use the same central registry instance.
We did not need to touch any of our CI/CD configuration files, and we did not need to distribute any credentials to the teams. This is a great solution for us, and we are happy to have it in place.
Docker-in-Docker
Sadly, we can’t stop here. Many of our teams use Docker in GitLab CI/CD to build container images, and this is usually done with Docker-in-Docker. Another use case for Docker-in-Docker is running tests with docker-compose, allowing developers to use the same setup locally and in CI/CD.
When using Docker-in-Docker, the Docker daemon running inside the container is not aware of the daemon.json configuration that we set on the host. This means that the Docker daemon inside the container will not use the registry as a proxy for Docker Hub, and we would still hit the rate limits.
We could, again, modify all of the CI/CD configuration files to use the registry as a proxy for Docker Hub, but that is still not a great solution.
Luckily, GitLab runners give us the option to define volumes for any job containers. We can make use of that to mount a daemon.json file from the host into the job containers, and thus make sure that the Docker daemon inside any container uses the registry as a proxy for Docker Hub.
While it might be easiest to re-use the daemon.json file from the host, we decided to go with a more secure approach. We create a separate daemon.json file that only contains the registry configuration, and we mount this file into the job containers. This way, the job containers only receive the registry mirror setting, and not any other configuration that might be set on the host.
```
[[runners]]
  ...
  [runners.docker]
    ...
    volumes = [
      "/opt/dind/daemon.json:/etc/docker/daemon.json",
    ]
```
Now, we just need to figure out how to actually create this daemon.json file on the autoscaled runner hosts. The simplest way to do this is to make use of cloud-init, which is being used anyway. Docker-machine allows us to pass a user-data script to the runner instances, which we can use to create the daemon.json file:
```
[[runners]]
  ...
  [runners.machine]
    ...
    MachineOptions = [
      ...,
      "amazonec2-userdata=/etc/gitlab-runner-userdata.txt",
    ]
```
This is the content of the gitlab-runner-userdata.txt file:
```
#!/bin/sh
sysctl -w vm.max_map_count=262144
mkdir -p /opt/dind && echo '{"registry-mirrors": ["http://registry.<internal-zone>:5000"]}' > /opt/dind/daemon.json
```
Small extra tip: We are also setting the vm.max_map_count kernel parameter in this script, as some of our CI/CD jobs run Elasticsearch, which requires this setting. Putting it into the user-data script is an easy way to apply it to every runner instance.
In our case, we are bootstrapping the main GitLab runner instance, which then autoscales the worker instances, via cloud-init as well. This means that we use cloud-init on the main instance to write a user-data file which instructs cloud-init on the worker instances to write the daemon.json file, which then gets mounted into the job containers. It’s turtles all the way down!
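To spot-check that this whole chain works, a job can print the mirror configuration that its inner Docker daemon actually picked up. This is only a sketch of a debugging step, not something from our pipelines:
```
# Inside a Docker-in-Docker based CI job:
docker info --format '{{.RegistryConfig.Mirrors}}'
# Should list http://registry.<internal-zone>:5000 (possibly with a trailing slash).

# Any plain pull is now served through the mirror instead of Docker Hub directly.
docker pull alpine:latest
```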
Conclusion
To recap: We now have one central registry instance that caches images from Docker Hub, reducing our NAT gateway traffic and costs. We are not hitting any rate limits anymore, and we don’t need to modify CI/CD configuration files or distribute credentials to teams. We can also get rid of the public IPs on our GitLab runners, which allows teams to allowlist our NAT gateway IPs on external services.
There are still some improvements that could be made here, and we might look into them in the future. For one, GitLab is slowly moving to its own fleeting abstraction to phase out the deprecated docker-machine executor. We have already experimented with fleeting, and it seems to work well, so a switch might be coming soon. Once that is finished, there will surely be an update to this blog post.
Another interesting option that this setup opens up is using a more advanced registry to gain tight control over the images that are being pulled. Especially in a security-sensitive environment, controlling which images are pulled and gathering information on their SBOMs and vulnerabilities is a great way to increase security. That requires a different registry, but this setup can be a good starting point.
Overall, we are pretty happy with this solution, and it is working very well for us.