Home
Goal: emulate an application's runtime behavior as realistically as possible
- emulate the application's execution structure (components and relations)
- consume the same amount of resources (CPU, Mem, Disk, Network)
Once a Synapse instance is configured to emulate a specific application instance, it remains parametrizable, so that experiments can be performed under various, fine-tuned application loads, without needing to tweak actual application code.
Synapse parametrization can be static, dynamic, follow some statistical distribution (tbi[*]), etc.
Initial parameters are obtained by profiling applications (`synapse.utils.profile_function`). Synapse emulation runs are also profiled themselves, to verify correct emulation -- see Figure 1.
Figure 1: Mandelbrot as master-worker implementation -- measure TTC on a single worker instance with varying problem size (sub-image size), and compare it to a synapse emulation of the same worker. The synapse data include times for the individually contributing load types (disk, mem, cpu). For small problem sizes, noise in the load generation, startup overhead and self-profiling overhead are clearly visible -- for larger problem sizes (>10 seconds TTC), that constant overhead becomes negligible.
Synapse uses Linux command line tools for profiling -- some of which need kernel support, so this is not generally portable to arbitrary machines. The resulting parameters, however, are machine independent, and thus apply to application emulation on machines other than the one the profile was taken on.
The profiling mechanism often differs between code which can be instrumented (or at least wrapped), and code which has to be taken as-is and watched while executing. Instrumentation is often more detailed, exact and portable -- but ultimately both methods should provide comparable results.
- `/usr/bin/time -v` for maximum memory consumption:

```
$ /usr/bin/time -v python -c 'for i in range (1,10000000): j = i*3.1415926'
        Command being timed: "python -c for i in range (1,10000000): j = i*3.1415926"
        User time (seconds): 1.84
        System time (seconds): 0.46
        Percent of CPU this job got: 46%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.00
        ...
        Maximum resident set size (kbytes): 322072
        ...
```
- memory consumption reports seem correct, but do not detail the distribution over time -- memory appears as one single chunk
- wall clock time includes the profiling time; better to measure wall clock separately
- on instrumentable code, use `synapse.utils.get_mem_usage`, which evaluates `/proc/[pid]/status`.
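The `/proc/[pid]/status` approach can be sketched in a few lines of Python (a minimal illustration -- the helper names are ours, not the actual `synapse.utils.get_mem_usage` implementation):

```python
def parse_status(status_text):
    """Parse memory fields (reported in kB) from a /proc/[pid]/status dump."""
    mem = {}
    for line in status_text.splitlines():
        if line.startswith(('VmPeak:', 'VmRSS:')):
            key, value = line.split(':', 1)
            mem[key] = int(value.split()[0])  # value looks like "  322072 kB"
    return mem

def get_mem_usage(pid='self'):
    """Read current and peak memory usage of a process (Linux only)."""
    with open('/proc/%s/status' % pid) as f:
        return parse_status(f.read())
```

Note that, as discussed above, this only yields snapshot and peak values -- to see the distribution over time, the status file has to be sampled repeatedly while the process runs.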
- `/usr/bin/time -f %e` for TTC (time to completion):

```
$ /usr/bin/time -f %e python -c 'for i in range (1,10000000): j = i*3.1415926'
1.82
```
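Measuring wall clock time separately, without any profiling overhead, is straightforward (a sketch, using the same toy loop as above):

```python
import subprocess
import sys
import time

# run the workload as a child process and time it from the outside
start = time.time()
subprocess.call([sys.executable, '-c',
                 'for i in range(1, 1000000): j = i*3.1415926'])
ttc = time.time() - start  # wall clock TTC, free of profiling overhead
```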
- `/usr/bin/perf stat` for CPU utilization (needs kernel support):

```
$ /usr/bin/perf stat python -c 'for i in range (1,10000000): j = i*3.1415926'

 Performance counter stats for 'python -c for i in range (1,10000000): j = i*3.1415926':

       1928.356169 task-clock              #    0.993 CPUs utilized
               185 context-switches        #    0.096 K/sec
                64 CPU-migrations          #    0.033 K/sec
            80,648 page-faults             #    0.042 M/sec
     6,158,591,568 cycles                  #    3.194 GHz                     [83.25%]
     2,427,203,057 stalled-cycles-frontend #   39.41% frontend cycles idle    [83.25%]
     1,758,381,453 stalled-cycles-backend  #   28.55% backend  cycles idle    [66.65%]
     8,898,332,744 instructions            #    1.44  insns per cycle
                                           #    0.27  stalled cycles per insn [83.26%]
     2,037,169,952 branches                # 1056.428 M/sec                   [83.44%]
        28,412,079 branch-misses           #    1.39% of all branches         [83.51%]

       1.941766011 seconds time elapsed
```
- `perf` is quick (it only reads kernel counters); 8 instructions ~ 1 FLOP (architecture dependent)
- CPU efficiency is also estimated, based on the idle cycle counters, but that estimate depends strongly on system load -- profiling should happen on an otherwise idle system.
- emulating the exact CPU consumption structure is difficult (branching, cache misses, idle cycles) -- we use assembler instead of C, to have a little more control...
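Assuming the 8-instructions-per-FLOP rule of thumb from above, the instruction count of the perf run translates directly into a compute load parameter (the helper function is illustrative only):

```python
def instructions_to_mflop(instructions, insns_per_flop=8):
    """Convert a perf instruction count into an approximate MFlop figure,
    using an architecture-dependent instructions-per-FLOP factor."""
    return instructions / insns_per_flop / 1e6

# 8,898,332,744 instructions from the perf run above
mflop = instructions_to_mflop(8898332744)
print("%.0f MFlop" % mflop)  # prints "1112 MFlop"
```

This is close to the `n=1100` compute parameter used in the emulator example below, which expresses the load in MFlop.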
- `cat /proc/[pid]/io` for disk I/O counters:

```
$ python -c 'for i in range (1,10000000): j = i*3.1415926' & cat /proc/$!/io
[3] 2110
rchar: 7004
wchar: 0
syscr: 13
syscw: 0
read_bytes: 0
write_bytes: 0
cancelled_write_bytes: 0
```

- timing is problematic and needs constant watching, as the file disappears with the process
- timing is less critical if the code can be instrumented (`synapse.utils.get_io_usage`)
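Reading `/proc/[pid]/io` from instrumented code can be sketched as follows (illustrative helpers, not the actual `synapse.utils.get_io_usage` code):

```python
def parse_io(io_text):
    """Parse the counters from a /proc/[pid]/io dump into a dict."""
    counters = {}
    for line in io_text.splitlines():
        key, value = line.split(':', 1)
        counters[key.strip()] = int(value)
    return counters

def get_io_usage(pid='self'):
    """Read I/O counters of a process (Linux only, needs permissions)."""
    with open('/proc/%s/io' % pid) as f:
        return parse_io(f.read())
```

When instrumenting, the counters can simply be read before the process exits, which avoids the race of watching `/proc` from the outside.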
- complete profile command:

```
$ sh -c '/usr/bin/time -v /usr/bin/perf stat /usr/bin/time -f %e python mandelbrot.py'
```
- For self-profiling, we use `getrusage(2)`, which is embedded into the synapse atoms.
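In Python, the information provided by `getrusage(2)` is available via the stdlib `resource` module -- a sketch of the self-profiling idea (not the synapse C implementation; the helper is ours):

```python
import resource
import time

def self_profile(func, *args):
    """Run func and report wall clock time plus getrusage(2) deltas."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    start = time.time()
    result = func(*args)
    wallclock = time.time() - start
    after = resource.getrusage(resource.RUSAGE_SELF)
    return result, {
        'wallclock': wallclock,
        'utime': after.ru_utime - before.ru_utime,  # user CPU seconds
        'stime': after.ru_stime - before.ru_stime,  # system CPU seconds
        'maxrss': after.ru_maxrss,                  # peak RSS (kB on Linux)
    }

_, prof = self_profile(lambda: sum(i * i for i in range(100000)))
```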
The synapse emulator incarnation looks like this:

```python
import synapse.atoms
import pprint
import time

start = time.time ()

sa_c = synapse.atoms.Compute ()
sa_m = synapse.atoms.Memory  ()
sa_s = synapse.atoms.Storage ()

sa_c.run (info={'n' : 1100})   # consume 1.1 GFlop Cycles
sa_m.run (info={'n' :  322})   # allocate 0.3 GByte memory
sa_s.run (info={'n' :    0,    # write 0.0 GByte to disk
                'tgt' : '%(tmp)s/synapse_storage.tmp.%(pid)s'})

# atoms are now working in separate threads
info_c = sa_c.wait ()
info_m = sa_m.wait ()
info_s = sa_s.wait ()

stop = time.time ()

# info now contains self-profiled information for the atoms
print "t_c: %.2f" % info_c['timer']
print "t_m: %.2f" % info_m['timer']
print "t_s: %.2f" % info_s['timer']
print "ttc: %.2f" % (stop - start)
```
which will result in

```
t_c: 1.84
t_m: 1.38
t_s: 0.03
ttc: 1.85
```
- framework / controller in Python (see example above)
- atom cores as small snippets of C and assembler code -- Python code has significant overhead, it is hard to predict how many instructions an operation results in, and controlling memory consumption is even more difficult; hence the decision for C/ASM
- code is ANSI C, and is compiled on the fly -- this results in a tiny overhead on the first invocation of each atom type:

```
$ /usr/bin/time -f %e cc -O0 synapse/atoms/synapse_storage.c
0.06
```

(same for all atoms, dominated by compiler startup and parsing)
- for actual code, see `synapse/atoms/synapse_{compute,memory,storage}.c` -- very small and accessible (IMHO); the `rusage` report is about 30% of it, about 60 lines of code per atom:

```
$ sloccount synapse/atoms/synapse_{compute,memory,storage}.c | grep ansic
ansic:          170 (100.00%)
```
- an alternative assembler-based compute atom can better reproduce CPU utilization -- we are still working on making CPU consumption patterns more tunable (e.g. in terms of CPU efficiency, branch misses, etc.)
- code may grow to allow better tuning (memory and disk I/O chunk size, CPU instruction types, etc.)
- complete the network atom (currently it only covers point-to-point communication)
- add network profiling
- add an MPI atom
- improve composability, via control files
- add support for statistical load distributions (simple, on the python level)
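Statistical load distributions could look like this (a sketch of the to-be-implemented feature -- the function name and the choice of a normal distribution are illustrative only):

```python
import random

def sample_load(mean_n, sigma=0.1, rng=random):
    """Draw an atom parameter from a normal distribution around mean_n,
    with a relative standard deviation sigma; clamp at zero."""
    n = rng.gauss(mean_n, sigma * mean_n)
    return max(0, int(round(n)))

# e.g. vary the compute load around the profiled value of n=1100
samples = [sample_load(1100) for _ in range(1000)]
```

Each emulation run would then draw its atom parameters from such a distribution, instead of replaying one fixed profile.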
[*] tbi: to be implemented :P