Create LDMS stream publish for phase data #2183

Open
lifflander opened this issue Aug 10, 2023 · 12 comments · May be fixed by #2184
@lifflander
Collaborator

https://ovis-hpcreadthedocs.readthedocs.io/en/latest/ldms-streams.html#how-to-make-a-data-connector

You'll need to include/import the following files: ldms.h, ldmsd_stream.h, util.h
For example, in C++ code, add the following:

#include <ldms/ldms.h> 
#include <ldms/ldmsd_stream.h>
#include <ovis_util/util.h>

You'll also need the object:
ldms_t* ldms

The function you'll need to add for publishing messages is:

ldmsd_stream_publish( (*ldms), <NAME_OF_SCHEMA>, <TYPE_OF_MSG>, <MSG_OBJECT>)

So, for example, if someone wanted to send Kokkos data as a JSON to their database, the function would look like this:

ldmsd_stream_publish( (*ldms), "kokkos-perf-data", LDMSD_STREAM_JSON,

@lifflander lifflander self-assigned this Aug 10, 2023
@lifflander lifflander linked a pull request Aug 10, 2023 that will close this issue
@PhilMiller
Member

Title should read LDMS?

@PhilMiller
Member

Also, vt's internal diagnostics seem to be perfect for feeding out to LDMS.

@nlslatt nlslatt changed the title Create LDMA stream publish for phase data Create LDMS stream publish for phase data Aug 11, 2023
@Snell1224

The URL for the LDMS documentation has recently been updated to: https://ovis-hpc.readthedocs.io/en/latest/ldms/ldms-streams.html#how-to-make-a-data-connector

@lifflander
Collaborator Author

lifflander commented Aug 21, 2023

I'm getting this compile-time error when I just include the three files listed above:

root@b86898199925:/build/vt# ninja
[1/2] Building CXX object examples/hello_world/CMakeFiles/hello_world.dir/hello_world.cc.o
FAILED: examples/hello_world/CMakeFiles/hello_world.dir/hello_world.cc.o
/usr/bin/ccache /usr/lib/ccache/g++ -DJSON_USE_IMPLICIT_CONVERSIONS=1 -I/vt/lib/CLI -I/vt/lib/json/include -I/vt/lib/brotli/c/include -I/vt/lib/libfort/lib -I/build/vt/release -I/vt/src -I/build/vt/lib/checkpoint/src -I/vt/lib/checkpoint/src -isystem /vt/lib/fmt/include -isystem /vt/lib/EngFormat-Cpp/include -O3 -DNDEBUG -fdiagnostics-color=always -Wall -pedantic -Wshadow -Wno-unknown-pragmas -Wsign-compare -ftemplate-backtrace-limit=100 -Werror -std=c++17 -MD -MT examples/hello_world/CMakeFiles/hello_world.dir/hello_world.cc.o -MF examples/hello_world/CMakeFiles/hello_world.dir/hello_world.cc.o.d -o examples/hello_world/CMakeFiles/hello_world.dir/hello_world.cc.o -c /vt/examples/hello_world/hello_world.cc
In file included from /usr/local/include/ldms/ldmsd_stream.h:6,
                 from /vt/examples/hello_world/hello_world.cc:47:
/usr/local/include/ldms/ldms_xprt.h:401:26: error: declaration of 'void (* ldms_xprt::app_ctxt_free_fn)(void*)' changes meaning of 'app_ctxt_free_fn' [-fpermissive]
  401 |         app_ctxt_free_fn app_ctxt_free_fn;
      |                          ^~~~~~~~~~~~~~~~
In file included from /vt/examples/hello_world/hello_world.cc:46:
/usr/local/include/ldms/ldms.h:649:16: note: 'app_ctxt_free_fn' declared here as 'typedef void (* app_ctxt_free_fn)(void*)'
  649 | typedef void (*app_ctxt_free_fn)(void *ctxt);
      |                ^~~~~~~~~~~~~~~~
ninja: build stopped: subcommand failed.

This is the script I used to install LDMS in the container:

https://github.com/DARMA-tasking/vt/blob/2183-create-ldma-stream-publish-for-phase-data/ci/deps/ldms.sh
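Editorial note on the error above: `app_ctxt_free_fn app_ctxt_free_fn;` declares a struct member with the same name as the typedef used for its type. That is valid C, but g++ rejects it because the member name changes what `app_ctxt_free_fn` means inside the class scope. The sketch below shows one pattern that compiles as C++: giving the member a different name. The struct `xprt_like`, the member `free_fn`, and the helper `release_ctxt` are hypothetical stand-ins, not LDMS declarations, and this is not necessarily how the OVIS-4 branch resolved it.

```cpp
#include <cassert>
#include <cstdlib>

// Namespace-scope typedef, mirroring the one the compiler points at in ldms.h.
typedef void (*app_ctxt_free_fn)(void* ctxt);

// Hypothetical stand-in for the struct in ldms_xprt.h. Writing the member as
//   app_ctxt_free_fn app_ctxt_free_fn;
// is accepted by a C compiler, but g++ reports "changes meaning of
// 'app_ctxt_free_fn'" because the member hides the typedef just used.
struct xprt_like {
  void* app_ctxt;
  app_ctxt_free_fn free_fn; // distinct member name, so the typedef keeps its meaning
};

// Releases the stored context through the callback, if both are set.
inline void release_ctxt(xprt_like& x) {
  if (x.free_fn != nullptr && x.app_ctxt != nullptr) {
    x.free_fn(x.app_ctxt);
  }
  x.app_ctxt = nullptr;
}
```

Since the header belongs to LDMS, a change like this would have to land upstream; on the vt side the practical options are to build against an LDMS version whose headers compile as C++ or to relax the diagnostic for that include.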

@Snell1224

Snell1224 commented Aug 22, 2023

Can you please try to run the following and see if the issue still occurs?
cd ovis
./autogen.sh
./packaging/make-all-top.sh

I've never encountered this kind of error and usually use the "make-all-top.sh" to build LDMS. This script automatically configures LDMS with the common flags that our team uses (build is under .../ovis/LDMS_install).

In the meantime, I'm going to reach out to others who are more experienced with this kind of error.

@Snell1224

UPDATE: What version of LDMS is being installed and what is the output of g++ --version of the container?

@JacobDomagala
Contributor

I was able to successfully build that LDMS and test it with vt (locally). Next I'll try to do the same within our Docker containers.

@JacobDomagala
Contributor

I get the same error with LDMS version 4.3.11 (or older). The issue is no longer present when building from the OVIS-4 branch source code.

@Snell1224

Snell1224 commented Oct 5, 2023

@JacobDomagala Thank you for catching this and letting me know. The LDMS team and I will look into it. Feel free to reach out if you come across any more issues!

@lifflander
Collaborator Author

@Snell1224 @vsurjadidjaja

Here is a screenshot of the form the data will take from our current JSON statistics file. This data will be incrementally submitted phase-by-phase as the data is computed. A phase is roughly equivalent to a timestep in an application. After a phase runs, the load balancer might be run depending on the configuration. Thus, we always have pre-LB statistics and we might have a migration count and post-LB statistics depending on whether it ran or not.

[Image: Screenshot 2023-11-08 at 12 45 07]

So after each phase completes, we will submit this:

{
  "id": 4,    // A unique phase ID
  "ts": 40.0, // The timestamp
  "migration count": 1, // number of migrations [optional]
  "pre-LB": {
     "Object_comm": { },
     "Object_load_modeled": { },
     "Object_load_raw": { },
     "Rank_comm": { },
     "Rank_load_modeled": { },
     "Rank_load_raw": { }
  },
  "post-LB": { // [optional]
     // Same as pre-LB
  }
}

Each one of the keys (Object_comm, Object_load_modeled, ...) in pre- and post-LB will include the following statistics:

"avg": 7190.222222222223, // mean
"car": 9.0, // cardinality
"imb": 0.2739522808752626, // imbalance (max/avg-1)
"kur": -1.7815486080524885, // kurtosis
"max": 9160.0, // maximum 
"min": 5880.0, // minimum
"npr": 9.0,
"skw": 0.515228148796637, // skewness
"std": 1310.1329733467003, // standard deviation
"sum": 64712.0, // sum
"var": 1716448.4078502655 // variance
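Editorial note: several of these statistics follow directly from the list of per-object or per-rank values. The sketch below computes the ones whose definitions are stated or unambiguous (`avg`, `car`, `max`, `min`, `sum`, `var`, `std`, `imb`). The struct `LoadStats` and the function `summarize` are hypothetical names, and using the population variance (dividing by the cardinality) is an assumption; the thread does not say which variant vt uses. Skewness, kurtosis, and `npr` are left out because their exact definitions are not given here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical holder for a subset of the statistics listed above.
struct LoadStats {
  double avg, car, imb, max, min, std_dev, sum, var;
};

// Summarizes a list of load values.
// Assumes a non-empty input with a non-zero mean.
inline LoadStats summarize(const std::vector<double>& values) {
  LoadStats s{};
  s.car = static_cast<double>(values.size());
  s.sum = std::accumulate(values.begin(), values.end(), 0.0);
  s.avg = s.sum / s.car;
  s.max = *std::max_element(values.begin(), values.end());
  s.min = *std::min_element(values.begin(), values.end());

  double squared_dev = 0.0;
  for (double x : values) {
    squared_dev += (x - s.avg) * (x - s.avg);
  }
  s.var = squared_dev / s.car;  // population variance (assumption)
  s.std_dev = std::sqrt(s.var);
  s.imb = s.max / s.avg - 1.0;  // imbalance: max/avg - 1
  return s;
}
```

As a consistency check against the sample values above: a sum of 64712.0 over a cardinality of 9 gives the listed mean of 7190.22, and 9160.0 / 7190.22 - 1 gives the listed imbalance of about 0.274.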

For the stream publish key, I propose "vtLBStats".
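Editorial note: a minimal sketch of how the per-phase message could be assembled before it is handed to the stream publish call, with the optional fields emitted only when the load balancer ran. `PhaseRecord` and `to_json` are hypothetical names, and the statistics objects are assumed to arrive as already-serialized JSON text. The `//` annotations in the example message above are explanatory only and are not emitted, since strict JSON has no comments.

```cpp
#include <cassert>
#include <iomanip>
#include <optional>
#include <sstream>
#include <string>

// Stream name proposed in this thread.
constexpr const char* kStreamName = "vtLBStats";

// Hypothetical container for one phase's data.
struct PhaseRecord {
  int id;                              // unique phase ID
  double ts;                           // timestamp
  std::string pre_lb;                  // serialized pre-LB statistics, always present
  std::optional<int> migration_count;  // set only when the load balancer ran
  std::optional<std::string> post_lb;  // set only when the load balancer ran
};

// Serializes a phase record, skipping the optional fields that are unset.
inline std::string to_json(const PhaseRecord& p) {
  std::ostringstream out;
  // Fixed notation keeps large epoch timestamps out of scientific notation.
  out << std::fixed << std::setprecision(6);
  out << "{\"id\":" << p.id << ",\"ts\":" << p.ts;
  if (p.migration_count) {
    out << ",\"migration count\":" << *p.migration_count;
  }
  out << ",\"pre-LB\":" << p.pre_lb;
  if (p.post_lb) {
    out << ",\"post-LB\":" << *p.post_lb;
  }
  out << "}";
  return out.str();
}
```

The resulting string, together with `kStreamName`, is what would be passed to the publish function; that call is omitted here so the sketch does not depend on the LDMS headers.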

@lifflander
Collaborator Author

@Snell1224 @vsurjadidjaja I'm a little confused as to how I should convert the output of gettime() to be consistent with what you need.

@Snell1224

Snell1224 commented Nov 9, 2023

We use epoch time for analyzing streams data so we send this in the JSON message. As for when to record/get the time, that's more of a preference thing. I'm not too familiar with VT but if you don't need to monitor the start/duration/end time of each phase, then getting the time whenever you send the JSON message will work.

The example below shows what we do for Darshan and how we collect the end time of an I/O event (again this is just a preference):

#include <time.h>

/* Return the current wall-clock (epoch) time. */
static inline struct timespec abs_timespec(void)
{
    struct timespec tp;
    clock_gettime(CLOCK_REALTIME, &tp);
    return tp;
}

struct timespec tspec_start, tspec_end;

tspec_start = abs_timespec();
// IO stuff happening here
tspec_end = abs_timespec();
// Do other stuff and send message

/* Convert nanoseconds to whole microseconds so the value matches %lu. */
unsigned long micro_s = tspec_end.tv_nsec / 1000;
sprintf(jb11,"{.....,\"timestamp\":%lu.%.6lu}]}", ....., (unsigned long)tspec_end.tv_sec, micro_s);
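Editorial note: for the vt side, the same epoch timestamp can be produced in C++ along these lines. This is a sketch under assumptions, not vt or LDMS code: `epoch_micros` and `now_epoch_micros` are hypothetical helper names, and the output layout (seconds, a dot, then six digits of microseconds) mirrors the format string in the Darshan example.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <time.h>

// Formats a timespec as "<seconds>.<microseconds>" with six fractional digits.
inline std::string epoch_micros(const timespec& ts) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%lld.%06ld",
                static_cast<long long>(ts.tv_sec),
                static_cast<long>(ts.tv_nsec / 1000));
  return std::string(buf);
}

// Reads the wall clock with POSIX clock_gettime and formats it.
inline std::string now_epoch_micros() {
  timespec ts{};
  clock_gettime(CLOCK_REALTIME, &ts);
  return epoch_micros(ts);
}
```

Calling `now_epoch_micros()` at the point where the JSON message is sent matches the "time of send" option described above; recording a start and end time per phase would need two calls to `clock_gettime` instead.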
