Ninja consumes all available memory #1441

Closed

Alexander-- opened this issue May 27, 2018 · 33 comments

Comments

@Alexander--

I have been investigating causes of swapping on my system, and stumbled upon this fragment in ninja code:

ninja/src/ninja.cc

Lines 223 to 233 in 03df526

int GuessParallelism() {
  switch (int processors = GetProcessorCount()) {
  case 0:
  case 1:
    return 2;
  case 2:
    return 3;
  default:
    return processors + 2;
  }
}

Ninja is used by the Android build system, and since I compile a lot of Android code, its performance strongly affects the usability of my system.

My work PC has a 4-core CPU with up to 8 threads, and my home PC has an 8-core CPU with up to 16 (!!) threads. Both have 8 GB of RAM.

Needless to say, ninja compilations quickly hoard all available memory and cause heavy swapping.

Right now ninja defaults to running CPU+2 parallel jobs, which can easily exhaust OS resources if the amount of available memory does not "match" the number of CPUs. There are a few other programs with this kind of default, but most of those are games, which are optimized to handle fixed assets and conserve memory. Ninja processes external data (software source code), some of which is very memory-heavy (e.g. C++). This is definitely NOT ok. If the current CPU trend continues, we will soon see consumer-targeted computers with 64+ cores. If the current RAM trend continues, most of those computers won't have a matching amount of RAM.

I have seen some discussions about conserving memory during compilation by dynamically monitoring memory usage. I don't personally care about that; most of my projects have a predictable compilation footprint.

Instead, I'd like ninja to perform some basic sanity checks and limit its maximum parallelism based on available system memory. If some of the installed CPUs don't have at least 1 GB of RAM each, don't count those CPUs towards the default parallelism setting. This would keep the number of parallel jobs roughly the same for most systems with <4 CPUs as well as for enterprise Xeon build servers, while providing a more reasonable default for systems with a subpar amount of RAM.
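
Roughly, something like the following (a hypothetical sketch, not actual Ninja code; it reuses Ninja's GetProcessorCount() and assumes a 64-bit Linux system, since _SC_PHYS_PAGES is a glibc extension):

#include <unistd.h>
#include <algorithm>

// Hypothetical: cap the CPU-based default by physical RAM,
// assuming each job needs roughly 1 GiB.
int GuessParallelismCappedByRam() {
  int by_cpu = GetProcessorCount() + 2;
  long pages = sysconf(_SC_PHYS_PAGES);      // glibc extension, not POSIX
  long page_size = sysconf(_SC_PAGE_SIZE);
  long ram_gib = (pages * page_size) >> 30;  // total physical RAM in GiB
  int by_ram = (int)std::max(1L, ram_gib);   // always allow at least one job
  return std::min(by_cpu, by_ram);           // one job per GiB, at most CPU+2
}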

@atetubou
Contributor

Ninja does not know anything about the processes it runs, and I think the model that each process uses one CPU core is the best fit for most builds.
But memory consumption differs greatly between commands (e.g. a simple Python script vs. linking a large object), and it is difficult to make a common assumption about memory usage for such commands.

If you want to control the parallelism of memory-consuming processes, it is better to specify a lower -j or use the pool feature when you know the memory footprint of your build well.
https://ninja-build.org/manual.html#ref_pool
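
For illustration, a minimal build.ninja sketch (the pool and rule names here are made up):

# Allow at most two link jobs at a time, regardless of -j.
pool link_pool
  depth = 2

rule link
  command = g++ -o $out $in
  pool = link_pool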

@Alexander--
Author

I think the model that each process uses one CPU core is the best fit for most builds.

I believe that the total amount of RAM should also be considered. Swapping to the hard drive certainly does not help real-world performance.

If you want to control the parallelism of memory-consuming processes, it is better to specify a lower -j or use the pool feature

I am not a developer of the Android NDK build system, so I don't pass command-line arguments to the ninja command; the Android build system does.

I have set the environment variable MAKEFLAGS="-j4" (which most make-based build systems honor), but ninja does not appear to use it.

@jimon

jimon commented May 27, 2018

I speculate that you can provide arguments to ninja directly through CMake from Gradle. From what I can see, the CMake integration in Gradle has an arguments field. Maybe this will work?

@Alexander--
Author

Alexander-- commented May 27, 2018

@jimon

I understand that you want to help, but that's not the point.

My point:

  1. make has safe defaults and allows you to switch to greater parallelism via an environment variable.
  2. ninja has unsafe defaults. I am constantly afraid that building some program will make my system hang (I don't want to check which build system each random AUR package uses before building it!), and I don't know how to globally change ninja's defaults.

@jimon

jimon commented May 27, 2018

@Alexander--, imagine you have a single executable (a compiler or similar) that simply allocates more RAM than you have (via the swap file or other means). You have a build system that just executes that app, and voilà: your system is unresponsive. This mental example makes it obvious that the problem is not solvable in the absolute. Now imagine a case where the executable allocates memory gradually over time: ninja has no upfront knowledge of its memory usage, and it only becomes apparent that RAM is gone after it is actually gone, at which point there is nothing to be done. The problem is not trivially solvable, and I don't believe it should be in the scope of this project.

As for make having safer defaults: I can't imagine how much developer time is wasted simply because devs either are not aware of, or can't be bothered with, setting parallelism in make. I constantly see my colleagues wasting minutes or even hours because they are not aware that make builds with one thread by default.

I don't want to oppose fixing your problem, and I don't want to sound harsh :) But I think the best solution is just to override the Gradle behavior and move on, because the problem is very specific to your project/computer and probably doesn't show up at a bigger scale.

@danw
Contributor

danw commented May 27, 2018

Which Android build system are you using? Building Android apps with the NDK, or building the entire platform (ROM)?

@av930

av930 commented Jun 21, 2018

Ninja + Gradle + Jack = hell (it takes at least 16 GB of RAM).
I don't know exactly where the problem lies, but I guess the major part of it is the Java build using jack-server. More importantly, Ninja (from Soong) replaces all the legacy machinery and introduces so many small pieces of software that we get lost in the build system and cannot find the problem or the solution.
I miss old Make. Please get me out of this build hell.
Now I should get back to solving "No Jack server running. Try 'jack-admin start-server'".

@evmar
Collaborator

evmar commented Jun 21, 2018

To rephrase @atetubou: if your CPU has 64 cores, it is reasonable to assume it is capable of executing 64 programs in parallel. The reason this assumption is a problem for you is that your compilation tasks appear to be larger than an ordinary program. Ninja has a mechanism for communicating this information, via the 'pool' feature. If you read that docs section you'll see it's designed for exactly this problem. https://ninja-build.org/manual.html#ref_pool

I am sympathetic to your problem but I don't see an easy way for Ninja to solve it. You say it should "consider" the total amount of RAM, but what sort of formula could we use?

@mydongistiny

mydongistiny commented Jun 28, 2018

@av930 jack is going away in P and can be disabled in O builds. I removed the use of jack from O builds, and my 8-core machine with 12 GB of memory has no problem running with 16+ threads. You can also change the arguments ninja uses during the builds. The problem is jack, not ninja.

Also this change might help you: #1399

@av930

av930 commented Jun 29, 2018

Oh my god, thanks! It works. I finally finished a full AOSP build on a 16 GB RAM machine.
Jack is disabled with ANDROID_COMPILE_WITH_JACK := false in build/make/core/javac.mk.

This is the log:
[ 99% 101694/101695] Install system fs image: out/target/product/taimen/system.img
out/target/product/taimen/system.img+ maxsize=2740531200 blocksize=135168 total=1076289888 reserve=27709440
[100% 101695/101695] Target vbmeta image: out/target/product/taimen/vbmeta.img

build completed successfully (04:43:31 (hh:mm:ss))

@jhasse
Collaborator

jhasse commented Nov 2, 2018

I think this might be solvable by telling the OS that, for a group of n running processes, it's okay to suspend up to n-1 of them in an out-of-memory situation. Does someone know if this can be achieved with cgroups on Linux?
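
Something along these lines might be a starting point (an untested sketch; it assumes systemd with cgroup v2, and MemoryHigh= throttles and reclaims rather than suspends):

systemd-run --user --scope -p MemoryHigh=75% ninja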

@jhasse
Collaborator

jhasse commented Nov 2, 2018

There are actually two PRs which implement a memory limit: #1354 and #660. The discussion in the latter is interesting.

@nico
Collaborator

nico commented Nov 15, 2018

This mostly sounds like the Android build was holding ninja wrong by not putting jack in a pool.

@nico nico closed this as completed Nov 15, 2018
@Alexander--
Author

@nico

Could you point at the specific Ninja version that fixed this issue? I am still seeing Ninja use too many parallel tasks by default whenever I build something from the AUR. It looks like Ninja is still attempting to create 18 threads on a machine with 16 cores, regardless of the amount of available RAM.

@nico
Collaborator

nico commented Nov 16, 2018

Ninja doesn't do anything here; it's up to the generator to make sure things that need lots of RAM (or similar) are in an appropriately-sized pool.

@Alexander--
Author

Alexander-- commented Nov 17, 2018

it's up to the generator to make sure things that need lots of RAM (or similar) are in an appropriately-sized pool

Is this your personal opinion, or the official position of the Ninja developers? The documentation never speaks of such a responsibility, and I haven't heard of generators that actually meddle with Ninja parallelism.

Either way, generator writers aren't in a better position than you to choose how many parallel processes to run. That's best decided by the build server administrator.

You are saying that there is no (and should never be) a generic way to lower Ninja's resource usage that works regardless of the generator used. I disagree. There are build tools (e.g. GNU Make) that have this functionality, so there is already a precedent for such a feature.

@jhasse
Collaborator

jhasse commented Nov 30, 2018

The documentation never speaks of such a responsibility, and I haven't heard of generators that actually meddle with Ninja parallelism.

Not explicitly about RAM, but:

"build-time customization of the build. Options belong in the program that generates the ninja files."

"Ninja has almost no features; just those necessary to get builds correct while punting most complexity to generation of the ninja input files."

You are saying that there is no (and should never be) a generic way to lower Ninja's resource usage that works regardless of the generator used.

There are two ways: the -j flag and the -l flag.
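
For example:

ninja -j4    # run at most 4 jobs in parallel
ninja -l8    # don't start new jobs while the load average is above 8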

I am still seeing Ninja use too many parallel tasks by default whenever I build something from the AUR.

I don't know exactly how the AUR works, but would putting the following script into /usr/local/bin and making it executable via chmod +x /usr/local/bin/ninja work?

#!/usr/bin/env python3

import subprocess
import sys

# Forward all arguments to the real ninja, forcing -j1.
try:
    subprocess.check_call(['/usr/bin/ninja', '-j1'] + sys.argv[1:])
except subprocess.CalledProcessError as e:
    sys.exit(e.returncode)

@d3x0r

d3x0r commented Sep 4, 2019

-j1 could have been mentioned in the first response.
I have this project which has, over time, acquired a lot of internal external_projects. That is, they build statically instead of as part of the dynamic build that the common makefile uses, so they each have their own flags and sources that get built.
Today, on this older laptop (I haven't previously used ninja on this system), which is only a 4 GB Windows 7 machine with a Core i7 (8 threads), starting the build brought the whole system to a crawl for half an hour while it swapped to and from disk.
It resulted in bizarre errors:

C:/general/build/mingw64-x64/sack/RelWithDebInfo_out/core/include/SACK/stdhdrs.h:259:24: fatal error: C:/general/build/mingw64-x64/sack/RelWithDebInfo_out/core/include/SACK/loadsock.h: Invalid argument
compilation terminated.

Why is including a file an invalid argument? The file exists....
So the whole build finished, but didn't build very many targets successfully...
Even when I now use

ninja -j1 -v install 

the inner ninja processes still use -j8.

The build ended up being -j8 at the top level, which launched 8 external projects, each with -j8, so it peaked at -j64 and consumed memory until there was none...

@jhasse
Collaborator

jhasse commented Sep 4, 2019

Jobserver support might help in that case: #1139

@eli-schwartz

The independent Ninja build language reimplementation https://github.com/michaelforney/samurai supports this via the environment variable SAMUFLAGS. Simply use samu anywhere you'd otherwise use ninja, or install samu as your system /usr/bin/ninja (some Linux distros have options to install this competing implementation as the default, or only, ninja).
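
For example, to cap parallelism at four jobs (assuming samu is installed):

SAMUFLAGS="-j4" samu

or export SAMUFLAGS from your shell profile so that every build picks it up.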

I advise updating AUR packages that build with ninja to use community/samurai instead.

@ericoporto

https://komodor.com/learn/how-to-fix-oomkilled-exit-code-137/

I have a problem where ninja is a bit of a gamble in a continuous integration system using GKE: it can make builds faster, but sometimes it gets the container killed by the OOM killer. Any ideas?

@LeCmnGend

Oh my god, thanks! It works. I finally finished a full AOSP build on a 16 GB RAM machine. Jack is disabled with ANDROID_COMPILE_WITH_JACK := false in build/make/core/javac.mk.

build completed successfully (04:43:31 (hh:mm:ss))

thank you

@isarkis

isarkis commented Apr 6, 2022

@ericoporto, we are having a similar problem with Ninja on GKE. No matter how much memory we allocate to our containers, once in a while Ninja consumes all of it, causing OOM events in the pods and even starving the GKE node's resources, which results in a runtime crash. We tried changing the link pool setting, but it didn't help. It seems like the only option is to set the -j option to some small number, but that affects build turnaround time quite a bit.

Have you been able to find a better solution?

@ericoporto

@isarkis I am using the same solution as you.

I noticed that on the beefier VMs I am using, which are macOS-based, this never happens, but I guess that's because the amount of RAM is really big for my case.

@ericoporto

@isarkis I don't know if you know, but there's now https://github.com/evmar/n2

It's by @evmar too.

@rubyFeedback

rubyFeedback commented Feb 12, 2023

Just a quick fly-by comment from me.

I purchased a new computer; it's fairly fast now. It's my main work computer, running Linux. I compile most things from source now. It has 64 GB RAM and 32 cores in total (AMD Ryzen 5 5500).

Most things that can be compiled via meson + ninja, or even cmake + ninja, work well.

But when I tried to compile webkitgtk, I ran into issues. In fact, my computer froze and I had to restart it.

I don't know what the issue is exactly, but it happens in the later stages of webkitgtk, often during some linking step (for some reason, the later steps take longer to compile; I get the same behaviour when I try to compile the game called wesnoth, by the way, but that one compiles cleanly and the resulting binary works, I can play the game fine).

I somewhat suspect that ninja is in part responsible for it, or perhaps the combination of webkitgtk and ninja. It seems to consume more and more, and then suddenly it is "too much" and things slow down/break down.

Having explained that use case, I am kind of hoping for more fine-tuned control, so that I can look at where the issue is and then, once I have found out, find some ways to mitigate it. I am fine with a slower compile speed in such a case; slower is still better than my computer crashing (the build runs in the background, so I don't mind anyway; I do other things while things are compiling).

I know there are some ways on Linux to handle process priority and whatnot (renice, and probably some container/cgroups things), but anything ninja could do to show more transparency here would be SUPER helpful too. So I am just +1'ing this thread; I can kind of relate to the issue proposed and wanted to expand on it a bit. (I'll try to compile the unstable webkitgtk; perhaps it is a combination of ninja + webkitgtk.)

@memchr

memchr commented Aug 18, 2023

If you are using systemd, the CPUQuota property can be used to limit maximum CPU usage, although this will not change the number of cores visible to Ninja.

systemd-run --user -G --pty -p CPUQuota=400% -p WorkingDirectory=$PWD <command>

@UffeJakobsen

If you are using systemd, the CPUQuota property can be used to limit maximum CPU usage, although this will not change the number of cores visible to Ninja.

Even with (s)lower CPU usage, each core will still do its allocations and eventually exhaust memory...

@memchr

memchr commented Aug 18, 2023

If you are using systemd, the CPUQuota property can be used to limit maximum CPU usage, although this will not change the number of cores visible to Ninja.

Even with (s)lower CPU usage, each core will still do its allocations and eventually exhaust memory...

Well, there is an easier way, using taskset, if you don't care about limiting CPU quotas:

taskset -c 0-3 ninja

This will restrict ninja to CPUs 0 through 3.
taskset should be available on most distributions; it is part of the util-linux project.

This can also be done with CPUAffinity= in systemd.

And of course, the above assumes that you cannot pass the -j (limit number of parallel jobs) or -l (limit load average, effectively) flag to Ninja itself.

@UffeJakobsen

UffeJakobsen commented Aug 18, 2023

Well, there is an easier way, using taskset, if you don't care about limiting CPU quotas:

taskset -c 0-3 ninja

This will restrict ninja to CPUs 0 through 3. taskset should be available on most distributions; it is part of the util-linux project.

This can also be done with CPUAffinity= in systemd.

If you are able to prefix your command with taskset, then why not just run ninja -j 4?

I acknowledge that this thread has become so long that the points that were made in the past have become lost...

Recap:
There are programs/setups that indirectly invoke various build systems (ninja being one of them) where it is not possible to dynamically define a limit for ninja.

The Arch Linux AUR package build system is just one example where parallelization cannot be dynamically adjusted per build instance...

What most of us are asking for is a way to pass the parallel CPU limit in an environment variable, just as is possible with make, pmake, cmake and similar.
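
For comparison, both of these are honored today (CMAKE_BUILD_PARALLEL_LEVEL needs CMake 3.12 or newer):

MAKEFLAGS="-j4" make
CMAKE_BUILD_PARALLEL_LEVEL=4 cmake --build .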

@memchr

memchr commented Aug 18, 2023

points that were made in the past have become lost...
where it is not possible to dynamically define a limit for ninja.

Sorry for the confusion. Actually, the workaround I am talking about is specific to this kind of problem: the restrictions imposed by taskset or systemd-run are inherited by all child processes.

Although you cannot call ninja directly with taskset or systemd-run, you can still call the parent command with them.

The Arch Linux AUR package build system is just one example where parallelization cannot be dynamically adjusted per build instance...

For example

taskset -c 0-3 makepkg

Or, if you can't do that either because you're using a helper that calls makepkg indirectly (for example paru), then run paru with taskset.

taskset -c 8-11 paru -S linux-clang

Or just start a new shell with taskset

taskset -c 0-3 $SHELL

@memchr

memchr commented Aug 18, 2023

If you have no control over the parents, there are always the grandparents. It's turtles all the way down.

@UffeJakobsen

Sorry for the confusion. Actually, the workaround I am talking about is specific to this kind of problem: the restrictions imposed by taskset or systemd-run are inherited by all child processes.

Although you cannot call ninja directly with taskset or systemd-run, you can still call the parent command with them.

The Arch Linux AUR package build system is just one example where parallelization cannot be dynamically adjusted per build instance...

For example

taskset -c 0-3 makepkg

It seems a little counterproductive that I need to do specific core pinning/allocation/planning for such simple operations, just because my system has many cores and not that much memory.

Also, to be honest, I would prefer a solution internal to ninja, so that it is neutral/identical across all platforms that can run ninja (FreeBSD and similar).
