
Synapse - SYNthetic APplicationS Emulator

Goal: emulate an application's runtime behavior as realistically as possible

  • emulate the application's execution structure (components and relations)
  • consume the same amount of resources (CPU, memory, disk, network)

Once a Synapse instance is configured to emulate a specific application instance, it remains parametrizable, so that experiments can be performed under various fine-tuned application loads, without needing to tweak the actual application code.

Synapse parametrization can be static, dynamic, according to some distribution (tbi[*]), etc.

Initial parameters are obtained by profiling applications (synapse.utils.profile_function). Synapse emulation runs are themselves profiled as well, to verify that the emulation is correct -- see Figure 1.

Figure 1: Mandelbrot as master-worker implementation -- TTC is measured on a single worker instance with varying problem size (sub-image size), and compared to a synapse emulation of the same worker. The synapse data include times for the individually contributing load types (disk, mem, cpu). For small problem sizes, noise in the load generation, startup overhead and self-profiling overhead are clearly visible -- for larger problem sizes (>10 seconds TTC) that constant overhead becomes negligible.

Profiling

Synapse uses Linux command line tools for profiling -- some of which need kernel support, so profiling is not generally portable to arbitrary machines. The resulting parameters, however, are machine independent, and thus apply to application emulation on machines other than the one the profile was taken on.

The profiling mechanism differs between code which can be instrumented (or at least wrapped) and code which has to be taken as-is and watched while executing. Instrumentation is usually more detailed, exact and portable -- but ultimately both methods should provide comparable results.

  • /usr/bin/time -v for max memory consumption:

    $ /usr/bin/time -v      python -c 'for i in range (1,10000000): j = i*3.1415926'
    	Command being timed: "python -c  for i in range (1,10000000): j = i*3.1415926"
    	User time (seconds): 1.84
    	System time (seconds): 0.46
    	Percent of CPU this job got: 46%
    	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.00
        ...
    	Maximum resident set size (kbytes): 322072
        ...
    
    • the memory consumption report seems correct, but does not detail the distribution over time -- memory appears as one chunk.
    • the reported wall clock time includes the profiling overhead -- better to measure wall clock time separately.
    • on instrumentable code, use synapse.utils.get_mem_usage, which evaluates /proc/[pid]/status (see the sketch after this list).
  • /usr/bin/time -f %e for TTC:

    $ /usr/bin/time -f %e python -c 'for i in range (1,10000000): j = i*3.1415926'
    1.82
    
  • /usr/bin/perf stat for CPU utilization (needs kernel support):

    $ /usr/bin/perf stat            python -c 'for i in range (1,10000000): j = i*3.1415926'
     Performance counter stats for 'python -c  for i in range (1,10000000): j = i*3.1415926':
    
           1928.356169 task-clock                #    0.993 CPUs utilized          
                   185 context-switches          #    0.096 K/sec                  
                    64 CPU-migrations            #    0.033 K/sec                  
                80,648 page-faults               #    0.042 M/sec                  
         6,158,591,568 cycles                    #    3.194 GHz                     [83.25%]
         2,427,203,057 stalled-cycles-frontend   #   39.41% frontend cycles idle    [83.25%]
         1,758,381,453 stalled-cycles-backend    #   28.55% backend  cycles idle    [66.65%]
         8,898,332,744 instructions              #    1.44  insns per cycle        
                                                 #    0.27  stalled cycles per insn [83.26%]
         2,037,169,952 branches                  # 1056.428 M/sec                   [83.44%]
            28,412,079 branch-misses             #    1.39% of all branches         [83.51%]
    
           1.941766011 seconds time elapsed
    
    • perf is quick (only reads kernel counters)
    • 8 instructions ~~ 1 FLOP (architecture dependent) -- for the run above, 8.9e9 instructions / 8 ~ 1.1 GFLOP, which matches the 1.1 GFLOP consumed in the emulation example below
    • CPU efficiency is also estimated, based on the idle cycle counters, but that estimate is very dependent on the system load -- profiling should happen on an otherwise idle system.
    • emulation of the exact CPU consumption structure is difficult (branching, cache misses, idle cycles) -- we use assembler instead of C, to have a little more control...
  • cat /proc/[pid]/io for disk I/O counters:

    $ python -c 'for i in range (1,10000000): j = i*3.1415926' &  cat /proc/$!/io
    [3] 2110
    rchar: 7004
    wchar: 0
    syscr: 13
    syscw: 0
    read_bytes: 0
    write_bytes: 0
    cancelled_write_bytes: 0
    
    • timing is problematic: the counters need constant watching, as /proc/[pid]/io disappears with the process
    • timing is less critical if the code can be instrumented (synapse.utils.get_io_usage -- see the sketch after this list)
  • complete profile command:

    sh -c '/usr/bin/time -v /usr/bin/perf stat /usr/bin/time -f %e python mandelbrot.py'

  • For self-profiling, we use getrusage(2), which is embedded in the synapse atoms (a Python equivalent is sketched below).
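
The instrumented-profiling helpers mentioned above (synapse.utils.get_mem_usage, synapse.utils.get_io_usage) essentially parse the /proc counters shown in the examples. The following is a minimal, self-contained sketch of that idea -- the actual helper signatures and return formats in synapse.utils may differ:

    import os

    def get_mem_usage (pid=None):
        """peak resident set size in kByte, parsed from /proc/[pid]/status"""
        pid = pid or os.getpid ()
        with open ('/proc/%d/status' % pid) as f:
            for line in f:
                if line.startswith ('VmHWM:'):    # peak RSS ("high water mark")
                    return int (line.split ()[1]) # value is reported in kByte
        return None

    def get_io_usage (pid=None):
        """I/O counters from /proc/[pid]/io, as a dict of ints"""
        pid = pid or os.getpid ()
        usage = {}
        with open ('/proc/%d/io' % pid) as f:
            for line in f:
                key, val = line.split (':')
                usage[key.strip ()] = int (val)
        return usage

    print (get_mem_usage ())                   # peak RSS of this process (kByte)
    print (get_io_usage ()['write_bytes'])     # bytes written to disk so far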
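
The atoms embed the corresponding getrusage(2) call in C; from Python, the same numbers are available via the standard resource module -- a minimal sketch:

    import resource

    # getrusage(2) on the calling process itself -- the C equivalent of this
    # call is what the atom cores embed for self-profiling
    ru = resource.getrusage (resource.RUSAGE_SELF)

    print ('user   time : %.2f s'   % ru.ru_utime)   # CPU time in user mode
    print ('system time : %.2f s'   % ru.ru_stime)   # CPU time in the kernel
    print ('max RSS     : %d kByte' % ru.ru_maxrss)  # peak memory (kByte on Linux)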

Emulation

An instance of the synapse emulator looks like this:

import synapse.atoms
import pprint
import time

start = time.time ()
sa_c  = synapse.atoms.Compute ()
sa_m  = synapse.atoms.Memory  ()
sa_s  = synapse.atoms.Storage ()

sa_c.run (info={'n'   : 1100})   # consume  1.1 GFLOP
sa_m.run (info={'n'   :  322})   # allocate 0.3 GByte memory
sa_s.run (info={'n'   :    0,    # write    0.0 GByte to disk
          'tgt' : '%(tmp)s/synapse_storage.tmp.%(pid)s'})

# atoms are now working in separate threads

info_c = sa_c.wait ()
info_m = sa_m.wait ()
info_s = sa_s.wait ()

stop = time.time ()

# info now contains self-profiled information for the atoms

print "t_c: %.2f" % info_c['timer']
print "t_m: %.2f" % info_m['timer']
print "t_s: %.2f" % info_s['timer']
print "ttc: %.2f" % (stop - start)

which will result in

t_c: 1.84
t_m: 1.38
t_s: 0.03
ttc: 1.85

Atom Implementation

  • framework / controller in Python (see the example above)

  • atom cores as small snippets of C and Assembler code

    Python code has significant overhead, and it is hard to predict how many instructions a given operation results in. Controlling memory consumption is even more difficult -- hence the decision for C/ASM.

  • code is ANSI-C, and is compiled on the fly -- that results in a tiny overhead on first invocation for each atom type:

    $ /usr/bin/time -f %e cc -O0 synapse/atoms/synapse_storage.c 
    0.06
    

    (same for all atoms; the time is dominated by compiler startup and parsing -- a minimal sketch of the on-the-fly compilation follows after this list)

  • for the actual code, see synapse/atoms/synapse_{compute,memory,storage}.c -- very small and accessible (IMHO); the rusage report is about 30% of it, about 60 lines of code in total for each atom:

    $ sloccount synapse/atoms/synapse_{compute,memory,storage}.c | grep ansic
    ansic:          170 (100.00%)
    
  • an alternative assembler-based compute atom can better reproduce CPU utilization -- we are still working on making CPU consumption patterns more tunable (e.g. in terms of CPU efficiency, branch misses, etc.).

  • the code may grow to allow better tuning (memory and disk I/O chunk size, CPU instruction types, etc.)
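
For illustration, the on-the-fly compilation amounts to little more than invoking the compiler and timing it. This is a minimal sketch, not the actual synapse build path -- it assumes cc is in $PATH and uses the storage atom source named above:

    import subprocess
    import time

    # compile one atom core on the fly, mirroring the timed 'cc -O0 ...' call above
    t_0 = time.time ()
    subprocess.check_call (['cc', '-O0', '-o', '/tmp/synapse_storage',
                            'synapse/atoms/synapse_storage.c'])
    print ('compile time: %.2f s' % (time.time () - t_0))

    # the framework then invokes the compiled core with the requested load
    # parameters -- the exact calling convention is atom specific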

Future Plans

  • complete the network atom (currently it only covers point-to-point communication)
  • add network profiling
  • add MPI atom
  • improve composability, via control files
  • add support for statistical load distributions (simple, on the Python level)

[*] tbi: to be implemented :P