Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some data gaps in collector and graphs, NaNs in rrd files #1157

Open
EvgenyKislyy opened this issue Apr 28, 2023 · 5 comments
Open

Some data gaps in collector and graphs, NaNs in rrd files #1157

EvgenyKislyy opened this issue Apr 28, 2023 · 5 comments

Comments

@EvgenyKislyy
Copy link

EvgenyKislyy commented Apr 28, 2023

Hello, we have JavaMelody 1.92, a few apps, and a collector
Very often, we see some data gaps in graphs

https://ibb.co/7ghth6X

https://ibb.co/ZWgXwvs

In the logs collector gathers the logs

Apr 27 09:18:42 ip-10-4-142-70 javamelody-collector[29092]: 20230427 09:18:42.863 INFO  [pool-2-thread-5] n.b.javamelody  - http call done in 488 ms with 459 KB read for https://javamelody:password@app2.com/fs/javamelody?collector=stop&format=serialized
Apr 27 09:26:53 ip-10-4-142-70 javamelody-collector[29092]: 20230427 09:26:53.998 INFO  [pool-2-thread-6] n.b.javamelody  - http call done in 447 ms with 334 KB read for https://javamelody:password@app2.com/fs/javamelody?collector=stop&format=serialized
Apr 27 09:33:47 ip-10-4-142-70 javamelody-collector[29092]: 20230427 09:33:47.881 INFO  [pool-2-thread-1] n.b.javamelody  - http call done in 622 ms with 586 KB read for https://javamelody:password@app2.com/fs/javamelody?collector=stop&format=serialized
Apr 27 09:39:52 ip-10-4-142-70 javamelody-collector[29092]: 20230427 09:39:52.343 INFO  [pool-2-thread-6] n.b.javamelody  - http call done in 501 ms with 130 KB read for https://javamelody:password@app2.com/fs/javamelody?collector=stop&format=serialized
Apr 27 09:46:23 ip-10-4-142-70 javamelody-collector[29092]: 20230427 09:46:23.150 INFO  [pool-2-thread-3] n.b.javamelody  - http call done in 516 ms with 127 KB read for https://javamelody:password@app2.com/fs/javamelody?collector=stop&format=serialized
Apr 27 09:51:59 ip-10-4-142-70 javamelody-collector[29092]: 20230427 09:51:59.483 INFO  [pool-2-thread-5] n.b.javamelody  - http call done in 274 ms with 131 KB read for https://javamelody:password@app2.com/fs/javamelody?collector=stop&format=serialized

And everything looks fine, with no exceptions (or very rare)

But when I am trying to transform the rrd files into XML (with help of JRobin libs)

Is see NaNs


         <!-- Thu Apr 27 08:21:00 EEST 2023 / 1682572860 -->
         <row>
            <v>NaN</v>
         </row>
         <!-- Thu Apr 27 08:22:00 EEST 2023 / 1682572920 -->
         <row>
            <v>NaN</v>
         </row>
         <!-- Thu Apr 27 08:23:00 EEST 2023 / 1682572980 -->
         <row>
            <v>NaN</v>
         </row>

Could you please give me the advice how/where can I locate the problem?

@evernat
Copy link
Member

evernat commented May 1, 2023

The logs above show http calls to fetch data which are sometimes 8 minutes after the previous call or sometimes 5 minutes after the previous calls.
It is supposed to have more values over time in order to have enough data without gaps in graphs, in particular for week graphs which aggregates values only when there are enough values to be representative.

javamelody uses jrobin for graphs which are stored as RRD files or Round Robin Databases. And Round Robin Databases use a xff or “x-files factor”: If we are trying to consolidate a number of samples into one, there is a chance that there could be gaps where a value wasn’t collected (the device was down, etc.). In that case, the value would be UNKNOWN. This factor determines how many of the samples can be UNKNOWN for the consolidated sample is considered UNKNOWN.

So there should be more http calls to have more values to aggregate without gaps. Default in collector server is one http call per monitored app every 60 seconds. It can be configured with the parameter -Djavamelody.resolution-seconds=60 in the collector server.

Have you changed that parameter ?
If you know, why do you have a http call sometimes after 8 minutes, sometimes after 5 minutes on app2 ?
Where are the logs of the few apps monitored by the collector other than app2 and are the http calls on these other apps much too long (you may look at the logs long before that extract) ?
Do you know that you could use the javamelody monitoring page from each app without using a collector server at all ?

@EvgenyKislyy
Copy link
Author

EvgenyKislyy commented May 2, 2023

Thanks for your answer @evernat

/bin/java -Xms3g -server -Djavamelody.storage-directory=/mnt/javamelody -Djavamelody.monitoring-path=/javamelody -Djavamelody.resolution-seconds=60 -Djavamelody.authorized-users=admin:password -jar /home/ec2-user/bin/javamelody-collector-server.war --httpPort=8080

We use this command for running the collector instance, and we have it here
-Djavamelody.resolution-seconds=60
But we have 8 minutes (and no errors) for some reason. We do not know why.

Yes, we know that we can see the javamelody monitoring page from each app, but we want to use javamelody collector because we want use some autoscaling in our app, so using one centralized point is much more comfortable.

Now we see if we go the app instance
Do not use this monitoring page if you use the optional collector server: either use the monitoring page from the webapp, without a collector server, or use the page from the collector server message

Few other app logs are collected together at the same time, same interval and there are lines about them in the logs, we just skipped them to show that the logs are collected with the different intervals than we set it with '-Djavamelody.resolution-seconds=60'
The gaps are the same for all our apps, so looks like all of them at the same time are a presented (or not)

One more interesting/strange issue - if we open an empty graph
and we click (bottom right corner)
last value - always correct number, always changing
text value - all NaNs

1682416800:  NaN  
1682420400:  NaN  
1682424000:  NaN  
1682427600:  NaN  
1682431200:  NaN  
1682434800:  NaN  

XML value - no NaNs at all

So I suspect maybe it's a conversion/formatting issue?

Java version OpenJDK Runtime Environment (Red_Hat-11.0.18.0.10-1.amzn2.0.1) (build 11.0.18+10-LTS)
locale en_US.UTF-8.

BTW how to enable DEBUG log for the collector? To see a possible reasons for such behavior
I couldn't find it in the docs (only debug log in my java app where javamelody is working and collecting)

/WEB-INF/classes/log4j2.xml

As I understand, I should modify this file in the war, right? No env variables?

@evernat
Copy link
Member

evernat commented May 5, 2023

Yes, we know that we can see the javamelody monitoring page from each app, but we want to use javamelody collector because we want use some autoscaling in our app, so using one centralized point is much more comfortable.

ok.

Now we see if we go the app instance
Do not use this monitoring page if you use the optional collector server: either use the monitoring page from the webapp, without a collector server, or use the page from the collector server message

That's normal with a collector server.

BTW how to enable DEBUG log for the collector? To see a possible reasons for such behavior

You can copy /WEB-INF/classes/log4j2.xml from the war file to the file system. Then modify it and add the following to your java command line :
-Dlog4j.configurationFile=/path/to/log4j2.xml

Log4J2 manual : https://logging.apache.org/log4j/2.x/manual/configuration.html

@evernat
Copy link
Member

evernat commented May 5, 2023

So I suspect maybe it's a conversion/formatting issue?

I don't think so, because we see in the logs that there are not enough http calls.

Do you have checked cpu usage on the collector server ? Maybe the cpu cores (or memory) are just too weak to run faster.

How many other apps are monitored in the collector server ? The http calls in the collector server for the other apps take about 500 ms like for the first app ? (The collector server has 10 parallel threads max to do the http calls.)

Otherwise, stop and restart the collect server and then copy all logs for a few minutes in this issue (all http calls are fast every 60 seconds ? no app is missing in the http calls ?)

@evernat
Copy link
Member

evernat commented Aug 2, 2023

any news?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants