
Why should I choose timemory instead of X for performance analysis?

Does X have an API? If so, don't think of it as a choice between timemory and X; think of it as timemory being able to provide X in addition to a whole lot of other things. If X is not already provided, create a feature request or submit a pull request with the implementation.

How do I compare the difference between two runs?

  • If you don't have the first set of results yet, it is often convenient to place them in a known directory, e.g. TIMEMORY_OUTPUT_PATH=baseline
  • Once you have the first set of results, set the environment variable TIMEMORY_INPUT_PATH to the directory containing the results
  • Enable TIMEMORY_TIME_OUTPUT -- this will create another sub-folder with the time-stamp of the run
  • Enable TIMEMORY_DIFF_OUTPUT -- this will instruct timemory to search the input path, load any results found, and compute the difference
  • Run the executable again; when output is generated, additional files reporting the differences will be written.
$ export TIMEMORY_OUTPUT_PATH=baseline
$ ./myexe
$ ls -1 baseline/
wall.txt  
wall.json
wall.jpeg
$ export TIMEMORY_INPUT_PATH=baseline
$ export TIMEMORY_TIME_OUTPUT=ON
$ export TIMEMORY_DIFF_OUTPUT=ON
$ ./myexe
$ ls -1 baseline/
wall.txt  
wall.json 
wall.jpeg
2020-08-27_04.12_PM/
$ ls -1 baseline/2020-08-27_04.12_PM/
wall.txt
wall.json
wall.jpeg
wall.diff.txt
wall.diff.json
wall.diff.jpeg

Wall-clock Baseline Results

baseline/wall.jpeg

Wall-clock Results from 2020-08-27_04.12_PM

baseline/2020-08-27_04.12_PM/wall.jpeg

Wall-clock Difference between Baseline and 2020-08-27_04.12_PM

baseline/2020-08-27_04.12_PM/wall.diff.jpeg

How Do I Develop a New Component For Timemory?

It will probably be most productive if you don't modify the timemory source code initially. With the C++ template API, there is no significant difference between creating a component in an external project and defining the component within the timemory source. The only thing gained by building a component within the source is that the component can be assigned an enumeration ID (timemory/enums.h) and can then be mapped into C, Python, and Fortran, but that is not critical in the early stages. The most important thing starting out is writing the contents of the component definition, and doing that in a stand-alone executable keeps things clean, straightforward, and productive.

Recommended Steps:

  1. Create a folder, e.g. component-dev
  2. Create a simple component-dev/CMakeLists.txt
  3. Create a component-dev/ex_component_dev.cpp file which contains just:
    • your main
    • your component definition
    • some code in main that uses your component around code that will produce data for it to collect

component-dev/CMakeLists.txt

cmake_minimum_required(VERSION 3.11 FATAL_ERROR)

project(component-dev LANGUAGES CXX)

set(timemory_FIND_COMPONENTS_INTERFACE timemory-component-dev)
find_package(timemory REQUIRED COMPONENTS headers cxx)

add_executable(ex_component_dev ex_component_dev.cpp)
target_link_libraries(ex_component_dev timemory-component-dev)

# find/add any include-dirs, libs, etc. that you need for your component
find_library(TPL_LIBRARY
     NAMES some-library
     # ... etc.
)
target_link_libraries(ex_component_dev ${TPL_LIBRARY})

component-dev/ex_component_dev.cpp

#include "timemory/timemory.hpp"

using namespace tim::component;

TIMEMORY_DECLARE_COMPONENT(component_dev)   // forward declare your component
using dev_bundle_t = tim::component_tuple<component_dev>; // use an alias like this to call the component

int main(int argc, char** argv)
{
    tim::timemory_init(argc, argv);

    // create a measurement instance and give it the label "work"
    dev_bundle_t _obj("work");
    _obj.start();    // start recording

    // ... do something that produces data your component can collect ...

    _obj.stop();      // stop recording

    tim::timemory_finalize();
    return EXIT_SUCCESS;
}

namespace tim
{
namespace component
{
struct component_dev : public base<component_dev, some_data_type>
{
      static std::string label() { return "component_dev"; }
      static std::string description() { return "collects some component data"; }

      auto record() { /* ... take a raw measurement ... */ }
      auto start() { /* ... store the starting measurement ... */ }
      auto stop() { /* ... compute the difference and accumulate ... */ }
};
}
}

// this is only really necessary if threads are used 
TIMEMORY_INITIALIZE_STORAGE(component_dev)

Once you have this set up, just try to encapsulate taking one measurement as one instance of component_dev, where:

  • void start() starts the measurement and stores the measurement data as member data for that instance
  • void stop() stops the measurement and computes the difference between the current measurement and the measurement in start()

A couple things to note:

  • The some_data_type in base<component_dev, some_data_type> above doesn't have to be the "final" data type reported by the component. You should set that data type to whatever type is optimal for the raw measurements taken between start/stop
    • The base class will provide some_data_type value and some_data_type accum. In general, it is recommended to record into value in start() and then, in stop(), update value as value = (record() - value) and then accum += value (see the sketch after this list). This way, start() and stop() can be called multiple times without issue: value represents the most recent measurement or delta and accum is the sum/max/etc. of one or more phases.
  • The "final" data type is what is returned by the get() const member function, e.g. for the wall-clock timer:
    • some_data_type is int64_t and the values are always in nanoseconds
    • the get() function is double get() const and it takes accum and converts it to seconds.
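Putting those notes together, a hypothetical fleshed-out version of the component above might look like the sketch below, where int64_t stands in for some_data_type and read_some_counter() is a made-up raw-measurement function (not part of timemory):

struct component_dev : public base<component_dev, int64_t>
{
    static std::string label() { return "component_dev"; }
    static std::string description() { return "collects some component data"; }

    static int64_t record() { return read_some_counter(); }  // hypothetical raw measurement

    void start() { value = record(); }      // cache the measurement at start
    void stop()
    {
        value = record() - value;           // delta for this start/stop phase
        accum += value;                     // accumulate across phases
    }

    // the "final" reported type; convert from raw units to reporting units here
    double get() const { return static_cast<double>(accum); }
};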

For the most part, I would just recommend looking at the existing components in the components.hpp files under source/timemory/components (in particular, timing/components.hpp, rusage/components.hpp, and io/components.hpp) and modeling your component definition after the component which is most similar. Those component definitions are really the gist of what needs to be done; all the other material in */types.hpp, etc. is just window dressing and enhancements (adding support for statistics, unit conversion, mapping the type to strings and enums, etc.) which is not at all necessary for an initial implementation.

When you have a simple stand-alone implementation like ex_component_dev.cpp above, you can open a PR and we can tell you how to migrate it into the actual source code so that it then becomes universally available in C, C++, Python, and Fortran.

Also, this is a good minimalistic example of writing a new component which leverages other components.

How do I use timemory to count user defined events?

NOTE: Although data_tracker is recommended below, you can always create a custom component which will accept any arguments desired and store the data however necessary.

There is a templated tim::component::data_tracker<Tp, Tag> where Tp is the data type you want to track and Tag is just an arbitrary struct used to differentiate, for example, component X that tracks ints from component Y that also tracks ints. You basically put the data_tracker in a bundle, call the store(...) function, and pass in any data that you want to store (but beware of implicit conversions when using multiple data trackers). Three concrete (i.e. non-templated) implementations are provided:

  1. data_tracker_integer
  2. data_tracker_unsigned
  3. data_tracker_floating

It can be quite useful to create a dedicated auto_tuple bundle for just the components which track data because you can append in a single line: the auto_* bundles call start() and stop() at construction/destruction.

This is probably best demonstrated with an example.

    using namespace tim::component;

    // component_tuple requires explicit start, stop is optional (will call stop if started when it gets destroyed)
    using bundle_t  = tim::component_tuple<wall_clock, data_tracker_integer>;

    // auto_tuple automatically calls start and stop
    using tracker_t = tim::auto_tuple<data_tracker_integer>;

    // auto_* bundles will write to stdout when destroyed
    tim::settings::destructor_report() = true;

    bundle_t _obj{ "example" };
    _obj.start();

    for(int i = 0; i < 10; ++i)
    {
        long ans = fibonacci(10) + fibonacci(10 + (i % 3));

        // store a single iteration
        tracker_t{ TIMEMORY_JOIN("", _obj.key(), "#", i % 3) }.store(ans);
        // the TIMEMORY_JOIN macro is like Python's "#".join(...)

        // accumulate into parent bundle and demonstrate using lambda
        // specifying how to update variable
        _obj.store([](long cur, long upd) { return cur + upd; }, ans);
    }

The result:

>>>  example#0 :          110 data_integer [laps: 1]
>>>  example#1 :          144 data_integer [laps: 1]
>>>  example#2 :          199 data_integer [laps: 1]
>>>  example#0 :          110 data_integer [laps: 1]
>>>  example#1 :          144 data_integer [laps: 1]
>>>  example#2 :          199 data_integer [laps: 1]
>>>  example#0 :          110 data_integer [laps: 1]
>>>  example#1 :          144 data_integer [laps: 1]
>>>  example#2 :          199 data_integer [laps: 1]
>>>  example#0 :          110 data_integer [laps: 1]
[data_integer]|0> Outputting 'timemory-ex-derived-output/data_integer.json'...
[data_integer]|0> Outputting 'timemory-ex-derived-output/data_integer.tree.json'...
[data_integer]|0> Outputting 'timemory-ex-derived-output/data_integer.txt'...

|---------------------------------------------------------------------------------------------------------------------------------------------------------|
|                                                       STORES SIGNED INTEGER DATA W.R.T. CALL-GRAPH                                                      |
|---------------------------------------------------------------------------------------------------------------------------------------------------------|
|       LABEL         |   COUNT    |   DEPTH    |    METRIC    |   UNITS    |    SUM     |    MEAN    |    MIN     |    MAX     |   STDDEV   |   % SELF   |
|---------------------|------------|------------|--------------|------------|------------|------------|------------|------------|------------|------------|
| >>> example         |          1 |          0 | data_integer |            |       1469 |       1469 |       1469 |       1469 |          0 |          0 |
| >>> |_example#0     |          4 |          1 | data_integer |            |        440 |        110 |        110 |        110 |          0 |        100 |
| >>> |_example#1     |          3 |          1 | data_integer |            |        432 |        144 |        144 |        144 |          0 |        100 |
| >>> |_example#2     |          3 |          1 | data_integer |            |        597 |        199 |        199 |        199 |          0 |        100 |
|---------------------------------------------------------------------------------------------------------------------------------------------------------|

The default behavior is to add entries together, but the data tracker's store member function also accepts a binary function or lambda specifying how to combine the current and incoming values:

using namespace tim::component;

struct A_data {}; // differentiator type for data of type "A"
struct B_data {}; // differentiator type for data of type "B"

using A_tracker = data_tracker<int, A_data>;
using B_tracker = data_tracker<double, B_data>;
using bundle_t  = tim::auto_tuple<A_tracker, B_tracker>;

void foo(int A, double B)
{
    // update lambda for B
    auto B_update = [](double current, double incoming) { 
        return std::max(current, incoming); 
    };

    bundle_t obj{ "foo" };
    obj.store(A); // adds A instances
    obj.store(B_update, B); // records max 
}

Can I use timemory to profile how timing changes over the duration of a long-running process?

Timemory is supported by a Python project called Hatchet which converts the *.tree.json output files into Pandas data frames for extended analysis. In the longer term, we will be developing a framework for performance unit testing which will automate historical comparisons, but at present the roll-your-own solution is to import multiple JSON files into Python and do the comparison there. The regular .json files (without .tree) provide the results in a flat JSON array, so they are relatively easy to traverse (the JSON trees require recursion to process).

Also, you can set TIMEMORY_INPUT_PATH=/path/to/some/folder/of/results and TIMEMORY_DIFF_OUTPUT=ON in the environment (or the corresponding tim::settings::input_path(), etc.); when the results are finalized, timemory will search that input folder, try to find corresponding inputs, and generate additional *.diff.* files in the output folder containing the difference between any matches it finds.
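The same configuration can also be done programmatically. The following is a minimal sketch which assumes the setting accessors mirror the environment variables (input_path() is named above; diff_output() is assumed to correspond to TIMEMORY_DIFF_OUTPUT):

#include "timemory/timemory.hpp"

void configure_diff_against_baseline()
{
    // assumed accessors mirroring TIMEMORY_INPUT_PATH and TIMEMORY_DIFF_OUTPUT
    tim::settings::input_path()  = "baseline";
    tim::settings::diff_output() = true;
}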

I need to run multiple processes simultaneously, how do I give each process a unique output path?

One option is to prefix the files with some key (such as the PID):

tim::settings::output_prefix() = std::to_string(tim::process::get_id()) + "-";

Another option is to set TIMEMORY_TIME_OUTPUT=ON in the environment (tim::settings::time_output()), which will create time-stamped subdirectories. The time-stamp gets fixed for the process the first time settings::get_global_output_prefix() is called while time-output is enabled. There is also a time_format() setting which allows you to customize the time-stamp using strftime format specifiers, as in the sketch below.
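A minimal sketch (the strftime string is only illustrative; it approximates directory names like the 2020-08-27_04.12_PM folder shown earlier):

// assumed accessors mirroring TIMEMORY_TIME_OUTPUT and the time_format() setting
tim::settings::time_output() = true;
tim::settings::time_format() = "%Y-%m-%d_%I.%M_%p";  // e.g. 2020-08-27_04.12_PM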

How do I output during a running process?

First off, do not call timemory_finalize() and then try to use timemory after that point: that routine is designed to delete a lot of things, and seg-faults are basically a certainty. However, you can dump the current state of the call-stack storage at any point, for any thread, on any process, as long as you are fine with the side effects. The only known issue is flushing the output on the primary thread while secondary threads are attempting to update their data. Using this approach, you have full control over where the files get written. The side effects are:

  • Any of the components that have been pushed onto the call-stack will be popped off the call-stack unless you set settings::stack_clearing() = false
    • Setting this to false may cause issues: components might report nonsensical results because a single sample may be treated as a phase measurement, the laps column might show zero, etc.
  • If you use the higher level storage<Tp>::instance()->get() on the master thread, it will merge in the child threads and clear their storage
  • If you use the higher level storage<Tp>::instance()->dmp_get() (distributed memory parallelism, i.e. MPI or UPC++) on rank zero it will return the storage from all processes.
  • You will likely artificially inflate the values for components measuring memory usage if those are still collecting when you retrieve the storage of, say, the timers.

The best reference for how to dump output is the array-of-bundles example.
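In C++, the gist is along the lines of the following sketch, which uses only the storage accessors mentioned above (the array-of-bundles example is the authoritative reference, and the exact return types may vary between versions):

namespace comp = tim::component;

// optionally keep entries on the call-stack (see the caveats above)
// tim::settings::stack_clearing() = false;

// on the master thread: merge child-thread storage and retrieve wall-clock results
auto thread_results = tim::storage<comp::wall_clock>::instance()->get();

// on rank zero: aggregate results across MPI/UPC++ processes as well
auto dmp_results = tim::storage<comp::wall_clock>::instance()->dmp_get();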

Within Python, the process is quite easy:

import timemory

def write_json(fname, specific_types = [], hierarchy = False):
    # retrieve the current call-stack storage, optionally restricted to specific
    # component types and/or as a hierarchical (tree) layout
    data = timemory.get(hierarchy=hierarchy, components=specific_types)
    with open(fname, "w") as f:
        f.write(data)