
Docker container logs fill up disk space on production servers #1014

Open
regisb opened this issue Mar 13, 2024 · 12 comments
Labels
documentation In order to close this issue, some documentation should be improved

Comments

@regisb (Contributor) commented Mar 13, 2024

Bug description

In Open edX platforms that run for a long time (with tutor local), the Docker containers write logs that gradually fill up the disk.

How to reproduce

Run Open edX for a long while, then run df -lh.
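One quick way to see which container logs are the biggest (assuming the default json-file logging driver and the default /var/lib/docker data root):

sudo du -ah /var/lib/docker/containers/*/*-json.log | sort -h | tail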

Additional context

I'm not sure if this is an issue or a feature that we want to keep in Tutor. We could configure all containers to log-rotate their output with Docker's logging options (as sketched below). But it might be important for some platforms to preserve logs for a very long time. Moreover, users who want to enable log rotation can do so by configuring the Docker daemon (see the Docker logging documentation).

At the very least, we should document how platform administrators can enable log rotation on their server.
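As an illustration, per-container rotation in a Compose file would look something like this (the service name is only illustrative, and Tutor generates its own docker-compose files, so this is a sketch of the kind of options involved, not an actual Tutor setting):

services:
  lms:
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "5"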

regisb added the "documentation" label on Mar 13, 2024
@dagg commented Mar 14, 2024

Indeed, our deployment of the platform seems to produce quite large amounts of logs, both platform logs and Docker container logs.

In particular, our container logs total about 48.5G and have grown by about 4.5G over the last 3 days.

The 4 containers with the largest log files are:

  • The container with the biggest log (currently 25.2G) is "/tutor_local-lms-1", based on the image "overhangio/openedx:16.1.7"

  • The second biggest (currently 18.9G) is "/tutor_local-caddy-1", based on the image "caddy:2.6.4"

  • The 3rd (1.4G) is "/tutor_local-mongodb-1", based on the image "mongo:4.4.22"

  • The 4th (1.2G) is "/tutor_local-lms-worker-1", based on the image "overhangio/openedx:16.1.7"

We currently use the Palm.4 release, and our databases are about 56G for MySQL and 21G (about 53G of actual data size) for MongoDB.

I should mention that we started using the Palm release after migrating from Lilac (a native installation, no Tutor/Docker) a few days before Christmas. This was our first time using Tutor with Docker, so the almost 50G of container logs were created within the last couple of months.

I can keep an eye on the rate at which the logs grow daily and post it here if that helps.

@regisb (Contributor, Author) commented Mar 15, 2024

Thank you for these details, Dimitris. I believe you also have an issue with the "all.log" and "tracking.log" files, right? In your email to me, you estimated that they grow by about 1GB/day. This is a data point that will be useful in crafting an adequate solution.

Finally, can you give us a sense of your number of daily active users (DAU), so that we can make a recommendation that makes sense for most people?

@dagg commented Mar 15, 2024

> I believe you also have an issue with the "all.log" and "tracking.log" files, right?

Yes, exactly. I didn't include those because the issue mentions Docker container logs.
Those are actually growing at a rate of about 1 to 1.5G per day, depending on the traffic on our platform.

The platform's (our Open edX, palm.4 release) log files totaled about 62G (32G and 30G for "all.log" and "tracking.log" respectively) from a couple of days before Christmas, when we first set up our Palm release, until about 4 days ago, when I moved them elsewhere and they restarted from zero bytes.
Over the last 4 days both files have reached 4G in total (2.1G and 1.9G for "all.log" and "tracking.log" respectively), so they grow by about 1G per day.

> Finally, can you give us a sense of your number of daily active users (DAU), so that we can make a recommendation that makes sense for most people?

On our platform we have about 127,000 active users (meaning they have enrolled in one or more courses and often interact with the platform).
I don't have a very accurate way to measure our daily active users yet, but from some crude metrics I can say that we have about 2,000 to 2,500 unique daily users interacting with the platform, and during exam deadline periods (just like last weekend) this number increases to about 3,500 to 4,000 users.

@dagg commented Mar 15, 2024

A quick update concerning the Docker container logs!

A few days before we discovered the problem with the disk space, I had installed and enabled Cairn, but then disabled it just to gain a bit of disk space, thinking it might help.

A few minutes ago I re-enabled the Cairn plugin (to get some more accurate metrics) and ran "tutor local launch".
Right after that, some of the biggest container logs I described in my first post disappeared!

The only big one that remained the way it was before is the log of the "/tutor_local-caddy-1" container, with a current size of 19.1G.
All the other containers' logs are less than 70MB each, so about 30G of disk space was freed!
I noticed, though, that although the "/tutor_local-lms-1" container's log was about 50MB after running "tutor local launch", it is growing rapidly (about 1 to 2 MB per minute, which is more than 2G per day at that rate), so it's only a matter of time before it becomes huge again!

I am not very familiar with how Docker works, and I don't know whether this is related to re-enabling the Cairn plugin or whether it was just "tutor local launch" that helped (although I had run it before and it didn't change much in terms of disk space and container logs), but I thought it was an important development that might help a lot!

@regisb (Contributor, Author) commented Mar 26, 2024

> I am not very familiar with how Docker works, and I don't know whether this is related to re-enabling the Cairn plugin or whether it was just "tutor local launch" that helped

launch should definitely have helped, because it stops and removes the existing containers before recreating them. A container's logs are deleted when the container is removed, as far as I understand.

Dimitris, given that you have a large platform with many active users, I strongly suggest that you rotate Docker logs automatically by configuring the Docker daemon, as documented here: https://docs.docker.com/config/containers/logging/configure/#configure-the-default-logging-driver

It should be as simple as modifying the /etc/docker/daemon.json file. If this works for you, then we should definitely add these instructions to the Tutor docs -- e.g. in the scaling tutorial.
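For reference, a minimal /etc/docker/daemon.json that enables rotation with the default json-file driver might look like this (the size and file-count values are just a starting point):

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "5"
  }
}

After editing the file, restart the Docker daemon (e.g. sudo systemctl restart docker) so that the new defaults apply to containers created afterwards.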

@dagg commented Mar 26, 2024

@regisb thank you so much! I will try this, leave it in place for a few days, and let you know how it goes.
I have included a screenshot of some graphs that show the daily increase in log size (and decrease in free disk space) for about the last 10 days.

[Screenshot "LogsSizeDiff": graphs of daily log size growth and free disk space over the last ~10 days]

@dagg commented Apr 2, 2024

> It should be as simple as modifying the /etc/docker/daemon.json file. If this works for you, then we should definitely add these instructions to the Tutor docs -- e.g. in the scaling tutorial.

While the configuration is indeed very simple, it doesn't seem to apply to containers that were already created...
I tried it on our test platform for several days (of course I restarted the Docker service, as suggested in the docs, and even ran tutor local launch), but it didn't change anything.
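One way to check whether a given container has actually picked up the new settings is to inspect its log configuration, for example (container name taken from earlier in this thread):

docker inspect --format '{{ json .HostConfig.LogConfig }}' tutor_local-lms-1

A container created before the daemon change will still show its old LogConfig (typically json-file with an empty Config).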

I also found a comment on docker's forums that confirms it:
https://forums.docker.com/t/how-to-limit-or-rotate-docker-container-logs/112378/9

That person, though, proposes emptying the log file manually.
I haven't tried that yet to see what happens, but I will soon and let you know.
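For anyone trying that route, the usual approach is to truncate the file in place rather than delete it, since the daemon keeps the file handle open; assuming the default json-file driver and data root, something like:

sudo truncate -s 0 /var/lib/docker/containers/<container-id>/<container-id>-json.log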

@regisb (Contributor, Author) commented Apr 2, 2024

I'm guessing you should try to delete existing containers, with something like:

tutor local stop
docker container prune
tutor local start -d
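(Note that docker container prune removes all stopped containers on the host, not just Tutor's, so double-check before confirming.)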

Or alternatively:

tutor local dc down
tutor local start -d

@dagg commented Apr 2, 2024

I was actually a bit afraid of doing that; thank you for the suggestion, though. I will try it now :)

@dagg commented Apr 2, 2024

> tutor local dc down
> tutor local start -d

Thank you @regisb, this actually did the job: the container logs are now starting from scratch. I will keep an eye on them to see whether they rotate properly and let you know, so this method can go into the Tutor docs. I bet it will be useful for a lot of people!

@dagg commented Apr 15, 2024

The logs are rotated correctly on my test platform after setting up the /etc/docker/daemon.json file and running tutor local dc down and tutor local start -d.
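One way to confirm that rotation is active (assuming the default json-file driver and data root) is to list a container's log directory and look for the numbered files that the max-file option produces:

sudo ls -lh /var/lib/docker/containers/<container-id>/

Besides <container-id>-json.log, you should see rotated files such as <container-id>-json.log.1.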

In production, though, after I ran tutor local dc down, the tutor_local_default network was not stopped/removed:

! Network tutor_local_default Resource is still in use

Nevertheless, after I ran tutor local start -d, everything seems to be working properly. I hope there are no side effects because of this (thankfully there are none so far as I can see).

That helped me reclaim about 50G of disk space, which is quite helpful considering these particular logs were growing by about 1 to 1.2G per day!

> It should be as simple as modifying the /etc/docker/daemon.json file. If this works for you, then we should definitely add these instructions to the Tutor docs -- e.g. in the scaling tutorial.

I think it's now safe and useful to include these instructions in the Tutor docs. I am sure many people running the platform will use them!

@regisb (Contributor, Author) commented Apr 15, 2024

Let's keep this issue open, then. To close it, we will have to add some instructions to the scaling tutorial.
