
releasemanager (with EDP, JIRA) . memory used/consumed during run is not reclaimed / freed up after run #857

Open
clemensutschig opened this issue Feb 24, 2022 · 110 comments
Labels
bug Something isn't working

Comments

@clemensutschig
Member

clemensutschig commented Feb 24, 2022

We run Jenkins with EDP - for document generation - and can see that memory used during the run is not reclaimed / freed up afterwards.

image

the run started at 5:44pm and was done at 11pm CET - as you can see, there is no GC / reclaim

@braisvq1996 @albertpuente

@clemensutschig clemensutschig added the bug Something isn't working label Feb 24, 2022
@clemensutschig clemensutschig changed the title releasemanager (with EDP) . memory usage not reclaimed after run releasemanager (with EDP) . memory used not reclaimed / freed up after run Feb 24, 2022
@clemensutschig clemensutschig changed the title releasemanager (with EDP) . memory used not reclaimed / freed up after run releasemanager (with EDP, JIRA) . memory used/consumed during run is not reclaimed / freed up after run Feb 24, 2022
@metmajer
Member

metmajer commented Mar 2, 2022

@clemensutschig we looked into this. What we detected is that the available memory is consumed immediately upon start, starting at 10 GB and quickly going up to 14 GB. We also see regular garbage collector activity, and that Jenkins' underlying Hudson implementation keeps up to 5 M references to ArrayList$Itr objects, which are small though and only sum up to roughly 150 MB.

We do see regular suspension times induced by the GC, which is related to promoting objects from the Eden Space into the Old Gen. I assume that our configuration has something to do with this, but it came as a recommendation from CloudBees, see opendevstack/ods-core#412.
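For what it's worth, GC behavior can be made directly visible with the JVM's unified logging - a minimal sketch, assuming JDK 11+ and that the flag can be appended to JAVA_TOOL_OPTIONS (the log path is only an example):

    -Xlog:gc*:file=/var/lib/jenkins/gc.log:time,uptime:filecount=5,filesize=20m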

@metmajer
Member

metmajer commented Mar 2, 2022

Screenshot 2022-03-02 at 08 25 47

Screenshot 2022-03-02 at 08 24 10

Screenshot 2022-03-02 at 08 26 42

Screenshot 2022-03-02 at 08 26 27

Screenshot 2022-03-02 at 08 25 17

Screenshot 2022-03-02 at 08 24 56

@metmajer
Member

metmajer commented Mar 2, 2022

@s2oBCN @victorpablosceruelo @angelmp01 will the ongoing refactorings of the LeVA DocGen service affect memory consumption?

@clemensutschig
Member Author

hey @metmajer this is interesting - do we have a view of what is stored in these extensionLists ... ? I took a look at what the docs say about this .. maybe here is our culprit .. ?!

@metmajer
Member

metmajer commented Mar 2, 2022

@clemensutschig yes, might make a difference, but could depend on the GC. Let's not take for granted what CloudBees recommended some time ago. The following article might be interesting: https://developers.redhat.com/articles/2021/11/02/how-choose-best-java-garbage-collector

@victorpablosceruelo
Contributor

@s2oBCN @victorpablosceruelo @angelmp01 will the ongoing refactorings of the LeVA DocGen service affect memory consumption?

Hi there.
Yes, it should affect memory consumption, because we will stop using some SharedLibrary functionality (which will be removed later), but we do not know yet whether it will really be the improvement we are looking for.

We are requesting access to the monitoring tools you show in this issue. The more details we have, the more precise our answer will be.

@clemensutschig
Member Author

clemensutschig commented Mar 8, 2022

image

4.x with a super simple test case (RM + Jira + one (failing) test component) - it's not a lot per run .. but it's definitely accumulating .. the start of the graph is a restart of Jenkins - to have a clean baseline .. and then one run after another ..

20 runs - one gig of RAM that is not reclaimed ..

@metmajer
Member

metmajer commented Mar 8, 2022

Here is an overview of the Jenkins configuration we are currently using: opendevstack/ods-core#412 (comment)

@clemensutschig
Member Author

clemensutschig commented Mar 8, 2022

even with keeping only 2 failures (via build config, and hence deleting the rest) - there is no difference:
image

we see the same growth .. if you zoom in - you see GC kicking in - and reclaiming 10 MB or less after every run - but overall we see a (close to) linear increase. This leads me (for now) to believe that there is a memory leak.

@clemensutschig
Member Author

even with a whole night - no real GC reclaim happened ... I am now trying to get this thing to an OOM

image

@clemensutschig
Member Author

hurray :)

Terminated at Mar 9, 2022 1:40:55 PM with exit code 137 (OOMKilled) (5103megs)
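(the kill reason can also be read off the pod status - a sketch, the pod name is just a placeholder:)

    oc get pod jenkins-1-xyz -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
    # prints: OOMKilled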

@metmajer
Member

metmajer commented Mar 9, 2022

@clemensutschig you're the first person that's happy about an OOM :) Were you able to get better insights? Thinking the GC options might need a re-evaluation.

@clemensutschig
Member Author

I am trying to get better insights - and am cleaning up all sorts of things now in my branch - to see if that impacts anything ..

what's interesting - I have turned on cleanup (through build config) of failed builds .. take a look below:

org.jenkinsci.plugins.workflow.job.WorkflowRun finish testbug-cd/testbug-cd-release-master #185 completed: FAILURE
--
  | 2022-03-09 13:03:05 INFO    io.fabric8.jenkins.openshiftsync.BuildSyncRunListener onCompleted onCompleted job/testbug-cd/job/testbug-cd-release-master/185/
  | 2022-03-09 13:03:05 INFO    io.fabric8.jenkins.openshiftsync.JenkinsUtils deleteRun Deleting run: testbug-cd/testbug-cd-release-master #183
  | 2022-03-09 13:03:05 INFO    io.fabric8.jenkins.openshiftsync.BuildSyncRunListener onFinalized onFinalized job/testbug-cd/job/testbug-cd-release-master/185/
  | 2022-03-09 13:03:05 INFO    io.fabric8.jenkins.openshiftsync.BuildSyncRunListener onDeleted onDeleted job/testbug-cd/job/testbug-cd-release-master/183/
  | 2022-03-09 13:03:05 INFO    org.csanchez.jenkins.plugins.kubernetes.KubernetesClientProvider gracefulClose Not closing io.fabric8.kubernetes.client.DefaultKubernetesClient@76b3c497: there are still running (1) or queued (0) calls
  | 2022-03-09 13:03:05 WARNING io.fabric8.jenkins.openshiftsync.JenkinsUtils deleteRun Unable to delete run testbug-cd/testbug-cd-release-master #183:Unable to delete '/var/lib/jenkins/jobs/testbug-cd/jobs/testbug-cd-release-master/builds/.183'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
  | 2022-03-09 13:05:35 INFO    org.csanchez.jenkins.plugins.kubernetes.KubernetesClientProvider gracefulClose Not closing io.fabric8.kubernetes.client.DefaultKubernetesClient@76b3c497: there are still running (1) or queued (0) calls

@clemensutschig
Member Author

clemensutschig commented Mar 9, 2022

ok - more tries (only keeping 2 failed runs, which the ones below are):
All data is container memory as reported by the OCP console:

restart: 3246m mem used
17:27: start FIRST no JIRA NO DOCS rm run
17:32: run done - 3440m mem used (+155m)

17:43: start SECOND no JIRA NO DOCS rm run (agent pods were removed)
17:49: run done - 3515m mem used (+75m)
17:58: 3510m mem used (GC: - 4m)

17:59: start THIRD no JIRA NO DOCS rm run (agent pods were removed)
18:06: run done - 3565m mem used (+50m)
18:15: 3561m mem used (GC: - 4m)

18:15: start FOURTH no JIRA NO DOCS rm run (agent pods were removed)
18:21: run done - 3634 m mem used (+70m)

I am trying to understand why this is steadily growing - and I really CANNOT ...

@clemensutschig
Member Author

ps - in case someone is curious - I added

    } finally {
      logger.debug('-- SHUTTING DOWN RM (..) --')
      logger.resetStopwatch()
      // clear cached state explicitly ...
      project.clear()
      ServiceRegistry.instance.clear()
      UnirestConfig.shutdown()
      // ... and null out the references held by the pipeline script,
      // so the run's classloader and its object graph become collectable
      project = null
      git = null
      logger = null
      repos = null
      steps = null
    }

to the odsOrchestrationStage - at almost the very root ... :( and still nothing changes

@clemensutschig
Member Author

I am really, completely, utterly confused ... : xmx = 1g (top shows Java at 1 GB mem) and I am happily running component jobs ..

however my full container memory is still growing ... ?! (container memory set to min 4g / max 5g) - 1 pod and ONE container only

image

@clemensutschig
Member Author

image

how come Java goes to 1.3 GB?

@clemensutschig
Member Author

clemensutschig commented Mar 10, 2022

ok - now I used max 3g (limits 3Gi) - with 0.5 heap percentage and -Xmx1g.

Top: 1.3 GB Java, and overall pod memory at 1600 MB and still growing ...

from the logs:

exec java -server -Xmx1g -Dfile.encoding=UTF8 -Djavamelody.displayed-counters=log,error -Djava.util.logging.config.file=/var/lib/jenkins/logging.properties -Duser.home=/var/lib/jenkins -Djavamelody.application-name=JENKINS -Dhudson.security.csrf.GlobalCrumbIssuerConfiguration.DISABLE_CSRF_PROTECTION=true -Djenkins.install.runSetupWizard=false -Dhudson.tasks.MailSender.SEND_TO_UNKNOWN_USERS=true -Dhudson.tasks.MailSender.SEND_TO_USERS_WITHOUT_READ=true -jar /usr/lib/jenkins/jenkins.war
--
  | Picked up JAVA_TOOL_OPTIONS: -XX:+UnlockExperimentalVMOptions -Dsun.zip.disableMemoryMapping=true
  | Running from: /usr/lib/jenkins/jenkins.war
  | webroot: EnvVars.masterEnvVars.get("JENKINS_HOME")

@clemensutschig
Member Author

VM flags (based on xms / xmx = 1g and metaspace 1g):

-XX:CICompilerCount=2 -XX:CompressedClassSpaceSize=1040187392 -XX:InitialHeapSize=1073741824 -XX:MaxHeapSize=1073741824 -XX:MaxMetaspaceSize=1048576000 -XX:MaxNewSize=357892096 -XX:MinHeapDeltaBytes=196608 -XX:NewSize=357892096 -XX:NonNMethodCodeHeapSize=5825164 -XX:NonProfiledCodeHeapSize=122916538 -XX:OldSize=715849728 -XX:ProfiledCodeHeapSize=122916538 -XX:ReservedCodeCacheSize=251658240 -XX:+SegmentedCodeCache -XX:+UnlockExperimentalVMOptions -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseFastUnorderedTimeStamps -XX:+UseSerialGC 
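(side note: the flags above can be read back from the running JVM, and a class histogram gives a quick view of what is accumulating - a sketch, with <pid> being the Jenkins master process id:)

    jcmd <pid> VM.flags
    jcmd <pid> GC.class_histogram | head -30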

more diagnosis .. (I have one test component ODS job that runs over and over again)

image

vs a couple minutes later

image

vs a couple mins later

image

@stitakis
Member

stitakis commented Mar 11, 2022

@clemensutschig I'd like to do an analysis to identify memory leaks. How can I get the heap dump file? Did you generate a heap dump?

@clemensutschig
Member Author

clemensutschig commented Mar 11, 2022

native memory tracking:

12:21 UTC
Total: reserved=2773758KB +72012KB, committed=1508758KB +131592KB

12:30 UTC
Total: reserved=2759677KB +57931KB, committed=1506845KB +129679KB

I am now forcing a maxOfRam of 90% .. let's see where that brings us .. see below - class is going up, reserved and also committed - although it's the exact same job running, from the exact same commit!

-                     Class (reserved=1287679KB +79246KB, committed=267303KB +87822KB)
                            (classes #31186 +4625)
                            (  instance classes #29931 +4546, array classes #1255 +79)
                            (malloc=7679KB +1422KB #178934 +31441)
                            (mmap: reserved=1280000KB +77824KB, committed=259624KB +86400KB)
                            (  Metadata:   )
                            (    reserved=231424KB +77824KB, committed=230184KB +78080KB)
                            (    used=188192KB +46996KB)
                            (    free=41992KB +31084KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=1048576KB, committed=29440KB +8320KB)
                            (    used=20731KB +3677KB)
                            (    free=8709KB +4643KB)
                            (    waste=0KB =0.00%)

the same 10 mins later:

-                     Class (reserved=1287869KB +79436KB, committed=267493KB +88012KB)
                            (classes #32246 +5685)
                            (  instance classes #30974 +5589, array classes #1272 +96)
                            (malloc=7869KB +1612KB #182833 +35340)
                            (mmap: reserved=1280000KB +77824KB, committed=259624KB +86400KB)
                            (  Metadata:   )
                            (    reserved=231424KB +77824KB, committed=230184KB +78080KB)
                            (    used=197988KB +56792KB)
                            (    free=32196KB +21288KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=1048576KB, committed=29440KB +8320KB)
                            (    used=21590KB +4536KB)
                            (    free=7850KB +3784KB)
                            (    waste=0KB =0.00%)

10 mins later:

                     Class (reserved=1296679KB +88246KB, committed=276303KB +96822KB)
                            (classes #35240 +8679)
                            (  instance classes #33927 +8542, array classes #1313 +137)
                            (malloc=8487KB +2230KB #193209 +45716)
                            (mmap: reserved=1288192KB +86016KB, committed=267816KB +94592KB)
                            (  Metadata:   )
                            (    reserved=239616KB +86016KB, committed=238376KB +86272KB)
                            (    used=223608KB +82412KB)
                            (    free=14768KB +3860KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=1048576KB, committed=29440KB +8320KB)
                            (    used=24045KB +6991KB)
                            (    free=5395KB +1329KB)
                            (    waste=0KB =0.00%)

and now 20 mins later - something starts to collect

-                     Class (reserved=1299973KB +91541KB, committed=279981KB +100501KB)
                            (classes #30457 +3896)
                            (  instance classes #29216 +3831, array classes #1241 +65)
                            (malloc=7685KB +1429KB #178768 +31275)
                            (mmap: reserved=1292288KB +90112KB, committed=272296KB +99072KB)
                            (  Metadata:   )
                            (    reserved=243712KB +90112KB, committed=242216KB +90112KB)
                            (    used=180605KB +39409KB)
                            (    free=61611KB +50703KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=1048576KB, committed=30080KB +8960KB)
                            (    used=20172KB +3118KB)
                            (    free=9908KB +5842KB)
                            (    waste=0KB =0.00%)

take a look at the class increase ... !!!!
( instance classes #29931 +4546, array classes #1255 +79) vs:
( instance classes #30974 +5589, array classes #1272 +96) vs:
( instance classes #33927 +8542, array classes #1313 +137) vs AFTER COLLECT:
( instance classes #29216 +3831, array classes #1241 +65)

what's interesting:
metadata reserved=243712KB +90112KB grows steadily ...
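(a rough way to watch this between runs - a sketch, assuming NMT was enabled via -XX:NativeMemoryTracking=summary and <pid> is the Jenkins master:)

    watch -n 60 "jcmd <pid> VM.native_memory summary | grep -A 4 'Class ('"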

@clemensutschig
Member Author

                     Class (reserved=818996KB +191857KB, committed=352564KB +213617KB)
                            (classes #42887 +21410)
                            (  instance classes #41126 +20650, array classes #1761 +760)
                            (malloc=12084KB +7537KB #243483 +135258)
                            (mmap: reserved=806912KB +184320KB, committed=340480KB +206080KB)
                            (  Metadata:   )
                            (    reserved=303104KB +184320KB, committed=301568KB +184064KB)
                            (    used=253997KB +145475KB)
                            (    free=47571KB +38589KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=503808KB, committed=38912KB +22016KB)
                            (    used=28052KB +14529KB)
                            (    free=10860KB +7487KB)
                            (    waste=0KB =0.00%)

running classloader stats: VM.classloader_stats

Total = 6316                                                 43487  317526016  293961896  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

with now another run (with shared lib)

Total = 6749                                                 45205  331855872  307339496  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

and now withOUT shared lib import (just echo ...)

Total = 6763                                                 45212  331902976  307372080  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

and now with shared lib import (and echo .... )

Total = 6787                                                 45219  331950080  307417400  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

sooo this is a classloader reference leak - something hangs on to classloaders - and it reproduces W/O any shared lib - it just gets worse with shared libs ... :/
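(for anyone who wants to dig into the same data: a heap dump for offline classloader analysis can be taken with jcmd - the target path is just an example:)

    jcmd <pid> GC.heap_dump /var/lib/jenkins/jenkins-heap.hprof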

@clemensutschig
Member Author

ok - some more testing with classloader stats

after start (NO running jobs):

Total = 897                                                  21312  135653376  123512712  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

after running a naked job (NO shared lib - just echo)

Total = 1010                                                 22726  144394240  132021864  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

wait a few mins ....

Total = 1038                                                 22744  144835584  132366424  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

after running a job (with shared lib import and echo)

Total = 1139                                                 22870  145810432  133105960  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

running the same job again

Total = 1183                                                 22907  146402304  133509704  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

this is really scary ... :(

@clemensutschig
Member Author

clemensutschig commented Mar 12, 2022

I started to play with the heap dump - and classloaders .... and all the big classloaders have a common ancestor:

image

if I look into the classloader - this is definitely ours ... (with grapes being there ...)

image

but there is a second guy holding on to it (the first is the finalizer of the VM that tries to clean it) - and that starts to spark my interest ... (via incoming references)

image

this is an enum .. hmmmm

@FlavioCampanaBIExt
Contributor

Hi,
right now in the DEV environment Jenkins memory is behaving better, even without forcing a GC run:
classcount2022-10-05 110215
nonheapMem2022-10-05 110154
I'll double check against the stable, production branch of the shared library to spot any differences, as discussed with @jafarre-bi

@FlavioCampanaBIExt
Contributor

I ran the tests again with the shared library, both with and without the changes for the next release, using the standard Jenkins master and agent images, and compared the results: after the build ends, memory gets freed thanks to Clemens' cleanup code.
During testing we found another problem: sometimes the job gets stuck waiting for a pod from OpenShift. Without the final cleanup, memory wasn't freed at all, and with high enough concurrency this might lead to memory problems. It seems to be a problem completely independent from the memory leak and happens both with and without the code for the new release. We didn't notice a clear difference with and without the latest changes, which also implies we saw no negative effect.

classCountNextStable

@clemensutschig
Member Author

@oalyman - here are the two nice commands

baseline:

jcmd  $(jps 2>&1 | grep -iv '^[0-9]*\s*Jps\s*$' | cut -d ' ' -f1) VM.native_memory baseline=true

and to track metrics:

echo -n "lsof: " ; lsof -p $(jps 2>&1 | grep -iv '^[0-9]*\s*Jps\s*$' | cut -d ' ' -f1) 2>/dev/null | wc -l ; echo -n "classloader amount: "; jcmd  $(jps 2>&1 | grep -iv '^[0-9]*\s*Jps\s*$' | cut -d ' ' -f1) VM.classloader_stats | wc -l ; echo "CleanGroovyClassLoader live: "; jcmd  $(jps 2>&1 | grep -iv '^[0-9]*\s*Jps\s*$' | cut -d ' ' -f1) VM.classloader_stats | grep CleanGroovyClassLoader ; echo "Native memory summary (and diff to baseline): "; jcmd  $(jps 2>&1 | grep -iv '^[0-9]*\s*Jps\s*$' | cut -d ' ' -f1) VM.native_memory summary.diff

and what's helpful is to trigger GC as well - with GC.run
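i.e. something along the lines of (same PID lookup as above):

    jcmd $(jps 2>&1 | grep -iv '^[0-9]*\s*Jps\s*$' | cut -d ' ' -f1) GC.run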

@clemensutschig
Member Author

clemensutschig commented Mar 12, 2023

I stumbled across a few things though while looking through the heap dump (with the dangling loaders) - and that sort of brings me back to my original thoughts (that it must be something else than the green-path EDP RM execution)

  1. My cleanup code is only implemented in the RM (odsOrchestrationStage.finally) - and not in the component / quickstarter pipeline. As the dangling classloaders don't have a name (the cleanup code sets it to the job name + run id), they MUST come either from provisioning or from component pipelines.

  2. These danglers consume a LOT of memory, through various means (including Groovy AST type tables ....) beyond the basic classloader-loaded libs and other junk

image

So there are two HOT topics that are the smoking guns ... (a) the component pipeline and (b) the component pipeline ESPECIALLY in the context of a [ci skip] - which is triggered from the RM commit(s)

Quickly verified - in neither of the two cases is the cleanup code called (and I would also guess that it's incomplete in any case)

[gol] ***** Finished ODS Pipeline ***** (took 176811 ms)
[Pipeline] }
[Pipeline] // stage
[Pipeline] echo
Collecting test results, if available ...
[...]
Recording test results
[Checks API] No suitable checks publisher found.
[Pipeline] stash
Stashed 1 file(s)
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // podTemplate
[Pipeline] End of Pipeline

I will provide a fix for this in a branch for people to try out - now that we can repro it :) (it's the recurring promote-to-dev builds that create all sorts of commits, which cause downstream [ci skip] runs) ..

@clemensutschig
Member Author

clemensutschig commented Mar 12, 2023

with the fix I see less class etc. growth ... which is good .. interestingly (at full swing, and close to the class space limit) - I see:

  1. heap is not fully consumed (kind of cool!) - but it is growing if I leave it over time (mem leak on heap?)
  2. a lot of thread mem, growing (see the thread-dump sketch at the end of this comment)
  3. class space stays super stable - so it seems we don't have a classloader leak

Total: reserved=6215755KB **+83832KB**, committed=4571171KB **+605732KB**

-                 Java Heap (reserved=4194304KB, committed=3817472KB **+543744KB**)
                            (mmap: reserved=4194304KB, committed=3817472KB +543744KB)
 
-                     Class (reserved=1267231KB +1335KB, committed=255099KB +1335KB)
                            (classes #33030 +4437)
                            (  instance classes #31638 +4293, array classes #1392 +144)
                            (malloc=9759KB +1335KB #210437 +21350)
                            (mmap: reserved=1257472KB, committed=245340KB)
                            (  Metadata:   )
                            (    reserved=217088KB, committed=216156KB)
                            (    used=186447KB +27830KB)
                            (    free=29709KB -27830KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=1040384KB, committed=29184KB)
                            (    used=20897KB +3019KB)
                            (    free=8287KB -3019KB)
                            (    waste=0KB =0.00%)
 
-                    Thread (reserved=168199KB **+67107KB**, committed=15119KB +6191KB)
                            (thread #163 +65)
                            (stack: reserved=167428KB +66796KB, committed=14348KB +5880KB)
                            (malloc=582KB +235KB #980 +390)
                            (arena=189KB +76 #324 +130)
 
-                      Code (reserved=260716KB +2245KB, committed=172160KB **+21145KB**)
                            (malloc=13032KB +2245KB #43135 +7134)
                            (mmap: reserved=247684KB, committed=159128KB +18900KB)
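(the thread growth can be cross-checked with a quick thread count from a thread dump - a sketch, <pid> as before:)

    jcmd <pid> Thread.print | grep -c 'java.lang.Thread.State'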

@clemensutschig
Member Author

clemensutschig commented Mar 12, 2023

what's kind of interesting is how long pods stay alive (even after the Jenkins run is done and gone) - which may be the reason for the thread mem size growing ..

image

and secondly - although JVM mem is pretty stable - I still see pod mem consumption growing ..
image
image

@oalyman - maybe you can take a look into that ..

@clemensutschig
Member Author

clemensutschig commented Mar 13, 2023

Courtesy of @oalyman - simple is beautiful.
-XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90

Rerunning my test suite. One thing that people who debug / test with this should also add is -XX:NativeMemoryTracking=summary
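(i.e. the combined value - assuming it ends up in JAVA_GC_OPTS like the other GC flags:)

    JAVA_GC_OPTS="-XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:NativeMemoryTracking=summary"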

@oalyman
Contributor

oalyman commented Mar 13, 2023

The settings @clemensutschig mentioned above originate from
https://github.com/openshift/jenkins/blob/master/2/contrib/s2i/run#L346

  # See https://developers.redhat.com/blog/2014/07/22/dude-wheres-my-paas-memory-tuning-javas-footprint-in-openshift-part-2/ .
  # The values are aggressively set with the intention of relaxing GC CPU time
  # restrictions to enable it to free as much as possible, as well as
  # encouraging the GC to free unused heap memory back to the OS.
  JAVA_GC_OPTS="-XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90"

Which refers to
https://developers.redhat.com/blog/2014/07/22/dude-wheres-my-paas-memory-tuning-javas-footprint-in-openshift-part-2

@clemensutschig
Member Author

latest settings based on jan's - I added
-XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:MaxMetaspaceSize=1g -XX:MetaspaceSize=256M

@albertpuente
Contributor

Executed a test on a real-sized project with many components, e2e tests and large docs:
Memory appears to converge to 1.5Gi for a container with request 4Gi and limit 6Gi.

GC OPTS:

-XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:NativeMemoryTracking=summary -XX:MaxMetaspaceSize=1g -XX:MetaspaceSize=256M

JAVA_MAX_HEAP_PARAM:

-Xms1024m -Xmx4g

image
I computed the baseline to compare with further executions overnight.

@clemensutschig
Member Author

That's kind of interesting - looks like most collectors don't give back to the OS (at least not a lot) .. :(

The other piece is to figure out if we leak somewhere else - as documented here .. https://stackoverflow.com/questions/64593968/why-does-res-memory-keep-slowly-growing-for-java-processes-even-for-out-of-the-b

I leave my testcase running now overnight - to see what's going on .. and if native memory does increase.

@albertpuente
Contributor

Native memory summary (and diff to baseline): 
444:

Native Memory Tracking:

Total: reserved=6029549KB -3496KB, committed=908405KB +73504KB

-                 Java Heap (reserved=4194304KB, committed=272384KB +67072KB)
                            (mmap: reserved=4194304KB, committed=272384KB +67072KB)
 
-                     Class (reserved=1278319KB -3098KB, committed=265895KB -3098KB)
                            (classes #30318 -840)
                            (  instance classes #28967 -843, array classes #1351 +3)
                            (malloc=8559KB -1050KB #191136 -12321)
                            (mmap: reserved=1269760KB -2048KB, committed=257336KB -2048KB)
                            (  Metadata:   )
                            (    reserved=229376KB -2048KB, committed=228152KB -2048KB)
                            (    used=170471KB -421KB)
                            (    free=57681KB -1627KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=1040384KB, committed=29184KB)
                            (    used=19004KB -571KB)
                            (    free=10180KB +571KB)
                            (    waste=0KB =0.00%)
 
-                    Thread (reserved=73347KB -1033KB, committed=10279KB +1659KB)
                            (thread #71 -1)
                            (stack: reserved=72976KB -1028KB, committed=9908KB +1664KB)
                            (malloc=258KB -4KB #428 -6)
                            (arena=113KB -1 #140 -2)
 
-                      Code (reserved=259715KB +739KB, committed=148583KB +7707KB)
                            (malloc=12027KB +739KB #36747 +1463)
                            (mmap: reserved=247688KB, committed=136556KB +6968KB)
 
-                        GC (reserved=161035KB +39KB, committed=148435KB +307KB)
                            (malloc=7791KB +39KB #5789 -687)
                            (mmap: reserved=153244KB, committed=140644KB +268KB)
 
-                  Compiler (reserved=715KB +19KB, committed=715KB +19KB)
                            (malloc=582KB +19KB #2946 +146)
                            (arena=133KB #5)
 
-                  Internal (reserved=3700KB +781KB, committed=3700KB +781KB)
                            (malloc=3668KB +781KB #3071 +98)
                            (mmap: reserved=32KB, committed=32KB)
 
-                     Other (reserved=543KB +36KB, committed=543KB +36KB)
                            (malloc=543KB +36KB #86 +3)
 
-                    Symbol (reserved=26754KB +3KB, committed=26754KB +3KB)
                            (malloc=24540KB +3KB #385785 +3924)
                            (arena=2214KB #1)
 
-    Native Memory Tracking (reserved=10070KB -173KB, committed=10070KB -173KB)
                            (malloc=35KB +4KB #457 +62)
                            (tracking overhead=10036KB -177KB)
 
-        Shared class space (reserved=17412KB, committed=17412KB)
                            (mmap: reserved=17412KB, committed=17412KB)
 
-               Arena Chunk (reserved=467KB +289KB, committed=467KB +289KB)
                            (malloc=467KB +289KB)
 
-                   Logging (reserved=4KB, committed=4KB)
                            (malloc=4KB #192)
 
-                 Arguments (reserved=19KB, committed=19KB)
                            (malloc=19KB #522)
 
-                    Module (reserved=2614KB -984KB, committed=2614KB -984KB)
                            (malloc=2614KB -984KB #10507 -3035)
 
-              Synchronizer (reserved=523KB -114KB, committed=523KB -114KB)
                            (malloc=523KB -114KB #4388 -974)
 
-                 Safepoint (reserved=8KB, committed=8KB)
                            (mmap: reserved=8KB, committed=8KB)

Successive run increased memory to 2Gi
image

@clemensutschig
Member Author

clemensutschig commented Mar 14, 2023

so news from the night .. I let the testcase run overnight ..

image

and the VM is super stable (with heap growth)

21:42 -> run #132
												
top: 1.8						
lsof: 736
classloader amount: 4305

Total: reserved=3938477KB +7042KB, committed=937293KB -24666KB

-                 Java Heap (reserved=2097152KB, committed=320512KB -46080KB)
                            (mmap: reserved=2097152KB, committed=320512KB -46080KB)
 
-                     Class (reserved=1292484KB +156KB, committed=282984KB +156KB)
                            (classes #34586 -17)
                            (  instance classes #33130 -21, array classes #1456 +4)
                            (malloc=10436KB +156KB #219353 +5742)
							
08:27 -> run 209

top:2.6
lsof: 744
classloader amount: 4284

Total: reserved=3934908KB +3473KB, committed=1291532KB +329573KB

-                 Java Heap (reserved=2097152KB, committed=663040KB +296448KB)
                            (mmap: reserved=2097152KB, committed=663040KB +296448KB)
 
-                     Class (reserved=1292564KB +236KB, committed=283064KB +236KB)
                            (classes #34460 -143)
                            (  instance classes #32998 -153, array classes #1462 +10)

@clemensutschig
Member Author

We deleted all the "pending" CI-skip builds that are triggered because of the commits (assemble to dev).. immediately heap is reduced by ~350MB which also transpires in OC container mem metrics to go down by 200m?! ... interesting

@oalyman
Contributor

oalyman commented Mar 14, 2023

Here is a pretty comprehensive source of information on the memory footprint of the JVM.
Maybe this will be helpful to further track down the problem with off-heap memory / native memory etc.
Enjoy!
Memory footprint of a Java process by Andrei Pangin:
https://www.youtube.com/watch?v=c755fFv1Rnk&list=PLdJHXvFCih9ZkK6xw8Pi-wmhRzYkAp6yI&index=3

@albertpuente
Contributor

Here is a summary of the current state:

  • Changing JVM parameters has been shown to massively improve memory consumption. Currently using
            - name: JAVA_GC_OPTS
              value: >-
                -XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10
                -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90
                -XX:NativeMemoryTracking=summary -XX:MaxMetaspaceSize=500m
                -XX:MetaspaceSize=256M
            - name: JAVA_MAX_HEAP_PARAM
              value: '-Xms256m -Xmx1g'
    
    In a container with 2-3Gi (request-limit).
  • Forcing the cleanup of resources after running component pipelines with ci-skip (done on all releases, as ODS commits to all repos) has reduced the amount of memory leaked, and the GC seems to provide semi-stable memory consumption (if given enough time to clean up).
  • Memory fragmentation seems to have been reduced (top and JVM native memory are much more similar) by restricting the number of memory arenas via MALLOC_ARENA_MAX=4 in the Jenkins master (see the env sketch below). Testing on a large project.
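    For reference, a sketch of how that limit can be set on the Jenkins master deployment config, mirroring the env entries above (the variable is read by glibc at process start):
            - name: MALLOC_ARENA_MAX
              value: '4'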

To investigate: why OpenShift reports a larger memory usage than anything which can be observed in the container (+50-80%).
Potentially related to https://stackoverflow.com/questions/61839318/kubectl-top-nodes-reporting-more-memory-utilisation-than-linux-system-commands

@clemensutschig
Member Author

With xmx2g & malloc_arena=4 and various code fixes (shared lib and ods core one)
it's now 100s of runs, component and RM

image

13:07pm -> top java 1.2g, container 1.76g  
Total: reserved=4021736KB +572KB, committed=1159624KB -1836KB

-                 Java Heap (reserved=2097152KB, committed=492544KB -1024KB)
                            (mmap: reserved=2097152KB, committed=492544KB -1024KB)
 
-                     Class (reserved=1358506KB +8KB, committed=356522KB +520KB)
                            (classes #50777 +70)
                            (  instance classes #48643 +69, array classes #2134 +1)
  
14:09 -> top java 1.5g, container 2g
Total: reserved=4193651KB +172487KB, committed=1404671KB +243211KB

-                 Java Heap (reserved=2097152KB, committed=536064KB +42496KB)
                            (mmap: reserved=2097152KB, committed=536064KB +42496KB)
 
-                     Class (reserved=1493774KB +135276KB, committed=514574KB +158572KB)
                            (classes #49290 -1417)
                            (  instance classes #47214 -1360, array classes #2076 -57)
							
15:07 -> top java 1.6g, container 2.09g
Total: reserved=4156835KB +135671KB, committed=1459119KB +297659KB

-                 Java Heap (reserved=2097152KB, committed=598016KB +104448KB)
                            (mmap: reserved=2097152KB, committed=598016KB +104448KB)
 
-                     Class (reserved=1494222KB +135725KB, committed=515022KB +159021KB)
                            (classes #51594 +887)
                            (  instance classes #49412 +838, array classes #2182 +49)
							
16:07 -> top java 1.5, container 2.15g
Total: reserved=4147528KB +126364KB, committed=1410132KB +248672KB

-                 Java Heap (reserved=2097152KB, committed=539648KB +46080KB)
                            (mmap: reserved=2097152KB, committed=539648KB +46080KB)
 
-                     Class (reserved=1496097KB +137600KB, committed=516897KB +160896KB)
                            (classes #60270 +9563)
                            (  instance classes #57729 +9155, array classes #2541 +408)
							
17:07 -> top java 1.5, container 2.05g
Total: reserved=4143960KB +122796KB, committed=1358088KB +196628KB

-                 Java Heap (reserved=2097152KB, committed=488960KB -4608KB)
                            (mmap: reserved=2097152KB, committed=488960KB -4608KB)
 
-                     Class (reserved=1496290KB +137792KB, committed=515354KB +159352KB)
                            (classes #49404 -1303)
                            (  instance classes #47347 -1227, array classes #2057 -76)

18:07 -> top java 1.5, container 2.17g
Total: reserved=4150144KB +128980KB, committed=1431712KB +270252KB

-                 Java Heap (reserved=2097152KB, committed=557056KB +63488KB)
                            (mmap: reserved=2097152KB, committed=557056KB +63488KB)
 
-                     Class (reserved=1496934KB +138436KB, committed=515998KB +159996KB)
                            (classes #53835 +3128)
                            (  instance classes #51580 +3006, array classes #2255 +122)

19:07 -> top java 1.6, container 2.27g
Total: reserved=4187618KB +166454KB, committed=1529634KB +368174KB

-                 Java Heap (reserved=2097152KB, committed=642560KB +148992KB)
                            (mmap: reserved=2097152KB, committed=642560KB +148992KB)
 
-                     Class (reserved=1497052KB +138555KB, committed=516116KB +160115KB)
                            (classes #53913 +3206)
                            (  instance classes #51640 +3066, array classes #2273 +140)
							
20:02 -> top java 1.6, container 2.19g
Total: reserved=4157916KB +136752KB, committed=1476348KB +314888KB

-                 Java Heap (reserved=2097152KB, committed=595456KB +101888KB)
                            (mmap: reserved=2097152KB, committed=595456KB +101888KB)
 
-                     Class (reserved=1497216KB +138718KB, committed=516280KB +160278KB)
                            (classes #54131 +3424)
                            (  instance classes #51861 +3287, array classes #2270 +137)							
							
21:02 -> top java 1.5, container 2.12g
Total: reserved=4144397KB +123233KB, committed=1381501KB +220041KB

-                 Java Heap (reserved=2097152KB, committed=504832KB +11264KB)
                            (mmap: reserved=2097152KB, committed=504832KB +11264KB)
 
-                     Class (reserved=1495869KB +137371KB, committed=514933KB +158931KB)
                            (classes #46314 -4393)
                            (  instance classes #44378 -4196, array classes #1936 -197)

21:02 -> top java 1.5, container 2.25g
Total: reserved=4145257KB +124093KB, committed=1447309KB +285849KB

-                 Java Heap (reserved=2097152KB, committed=572416KB +78848KB)
                            (mmap: reserved=2097152KB, committed=572416KB +78848KB)
 
-                     Class (reserved=1493248KB +134751KB, committed=512312KB +156311KB)
                            (classes #32559 -18148)
                            (  instance classes #31191 -17383, array classes #1368 -765)

@metmajer
Member

@oalyman @albertpuente @clemensutschig this is huge, absolutely insane. fantastic job!

@albertpuente
Contributor

I did a bit more digging into the difference between the memory used by Java and what OpenShift reports.

The metric we see in OpenShift corresponds to container_memory_working_set_bytes, and it is what the OOM killer looks at (source).

This metric is understood as a heuristic for the minimum size of memory the app needs to work and corresponds to: memory.usage_in_bytes - inactive_file (from memory.stat). Source.

This pretty much matches what I see in my example:

$ oc exec jenkins-mem-test-6-gg8tt -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
2535804928
$ oc exec jenkins-mem-test-6-gg8tt -- cat /sys/fs/cgroup/memory/memory.stat | grep inactive_file 
inactive_file 334565376
total_inactive_file 334565376

$ expr 2535804928 - 334565376
2201239552
# This is 2099.265625 Mi

$ kubectl top pod jenkins-mem-test-6-gg8tt  # Same as seen in OpenShift metrics
NAME                       CPU(cores)   MEMORY(bytes)
jenkins-mem-test-6-gg8tt   48m          2099Mi

Now, memory.usage_in_bytes seems to be the resident set size + cache (source, 5.5), and then we subtract the inactive_file from it, which is the pagecache memory that can be reclaimed without a huge performance impact (source).
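(for completeness, the rss / cache / inactive_file split can be read from the same memory.stat - a sketch reusing the pod above:)

    oc exec jenkins-mem-test-6-gg8tt -- cat /sys/fs/cgroup/memory/memory.stat | grep -E '^(rss|cache|inactive_file) '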

@clemensutschig
Member Author

the question is whether this will ever be reclaimed .. or does that indicate a file leak, or something? sorry for the stupid beginner question

@albertpuente
Contributor

albertpuente commented Mar 16, 2023

image
For reference, this image summarizes the memory excess which cannot be attributed to the container processes (as seen when using top).

@clemensutschig
Member Author

clemensutschig commented Mar 18, 2023

as we are using a LOT of jar and zip files (grape! and doc gen) - maybe https://stackoverflow.com/questions/44273737/how-to-use-ulimit-with-java-correctly applies and we have to go -Dsun.zip.disableMemoryMapping=true. Trying it now on master (as we know everything else is stable :)
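(a quick way to see how many jar / zip files the master process actually holds open - a sketch, <pid> being the Jenkins master:)

    lsof -p <pid> 2>/dev/null | grep -c -E '\.(jar|zip)$'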

@clemensutschig
Member Author

clemensutschig commented Mar 19, 2023

ok, the -Dsun.zip.disableMemoryMapping=true does not seem to bring much, but on the way I have found something else - which I baked into the shared lib codebase now - and that is the "full" cleanup of CI skip / RM commits in a given component (in case the commit message is coming from the RM) .. this seems to bring quite a bit .. (more data on this to come tomorrow).

But I found yet another Jenkins bug (changeset related) - for some reason the "dummy" commits in a test component don't show up - so these builds will dangle around for sure (one build following every RM build) :(

@clemensutschig
Member Author

clemensutschig commented Mar 20, 2023

ok another night - and I guess we made it as good as we can :)

image

below is the "naked" increase of the overhead - which has gotten now much better, in fact almost flag .. :)

image

@albertpuente
Contributor

Here are the metrics for 48h of continuously building a relatively complex app (Mkp):
OCP:
image

VM:
image

Naked "increase", surprisingly goes down with time:
image

@oalyman
Contributor

oalyman commented Mar 20, 2023

as we are using a LOT of jar and zip files (grape! and doc gen) - maybe https://stackoverflow.com/questions/44273737/how-to-use-ulimit-with-java-correctly applies and we have to go -Dsun.zip.disableMemoryMapping=true. Trying it now on master (as we know everything else is stable :)

This is already set by default if one does not set JAVA_TOOL_OPTIONS.
See:
https://github.com/openshift/jenkins/blob/master/2/contrib/s2i/run#L302

@clemensutschig
Member Author

if I read https://faun.pub/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d?gi=f77e14a09937 correctly, we may now really be on the safe side ... I will try to force it to its limits and see what's happening ... :D

@clemensutschig
Member Author

clemensutschig commented Apr 18, 2023

I am down to xmx=1g, maxMeta=512m and requestLimit=2 :) with 2 components and an ods-test following - after 24h, it's still happily running.

image

Now going after 3 components (+ 1 test) - to see how this actually scales
