
Threads

Randall O'Reilly edited this page Aug 22, 2020 · 2 revisions


The Go language natively supports a convenient form of parallel processing, using goroutines. Emergent uses these to distribute computation across multiple processors. Threading is algorithm-specific; the following applies to the leabra model.

To simplify the implementation, processing is distributed at the Layer level, which avoids the need for locks: everything is local to a given layer and its sending projections. (Note: receiving projections cannot be processed in parallel, because everything is sender-based in leabra!)

Configuration is simple: just call ly.SetThread(n) to assign the given layer to thread number n.
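The overall scheme can be sketched as follows. This is an illustrative stand-in, not the actual leabra code: the Layer type and CycleAll function here are hypothetical, with the Thread field playing the role of the value set by ly.SetThread(n). It shows why no locks are needed: each goroutine only touches the layers assigned to its own thread.

```go
package main

import (
	"fmt"
	"sync"
)

// Layer is a stand-in for a leabra layer; the Thread field mirrors the
// value set by ly.SetThread(n). Names here are illustrative, not the real API.
type Layer struct {
	Name   string
	Thread int
}

// CycleAll runs one processing pass over all layers, using one goroutine per
// thread; layers assigned to the same thread are processed sequentially.
// It returns the layer names processed by each thread.
func CycleAll(layers []*Layer, nThreads int) [][]string {
	done := make([][]string, nThreads)
	var wg sync.WaitGroup
	for th := 0; th < nThreads; th++ {
		wg.Add(1)
		go func(th int) {
			defer wg.Done()
			for _, ly := range layers {
				if ly.Thread == th {
					// the real code would update neurons and sending
					// projections here; only this goroutine writes done[th],
					// so no lock is needed
					done[th] = append(done[th], ly.Name)
				}
			}
		}(th)
	}
	wg.Wait() // synchronize all threads before the next processing step
	return done
}

func main() {
	layers := []*Layer{{"V1", 0}, {"V2", 1}, {"TEO", 1}}
	fmt.Println(CycleAll(layers, 2))
}
```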

The challenge is figuring out how many threads n to use, and which layers to assign to which threads!

ThreadAlloc: automated attempt

The Network.ThreadAlloc method generates a number of random candidate assignments of layers to a given number n of threads, and selects the one with the most even distribution of estimated computational load (i.e., the lowest variance across threads). This is easy but may not produce optimal results, because static estimates of computational load are imperfect: many variables affect real-world processing time in complex ways, and these vary over the course of training. The algorithm just makes its best guess based on the numbers of synapses and neurons. It is almost certainly better than pure guessing, but tuning based on actual compute time will likely work even better.

Manual tuning using TimingReport and ThreadReport

To do manual tuning, start with a given configuration (e.g., as generated by ThreadAlloc, or your own best guess) and run a few epochs, calling Network.TimingReport after every epoch. Also call Network.ThreadReport in ConfigNet, after net.Build(). Together, these two reports give you the information you need to tune the thread allocations.

  • At the end of TimingReport, you'll see a report of how much time each thread spent, e.g.:
  Thr Total Secs  Pct
    0    87.27   27.41
    1    79.23   24.89
    2    75.77    23.8
    3    76.08    23.9
  • This is an already-tuned network, with each of 4 threads taking roughly 25% of the total load -- that's the goal.

  • ThreadReport shows the estimated computational load of each layer within a thread, as in the example below. Note that each thread's summary line comes after the layers in that thread. This is the same network as the TimingReport above, so you can see that the cost estimates are imperfect -- e.g., many layers in thread 0 are clamped input layers and thus less expensive in practice:

Network: WWI3D Auto Thread Allocation for 4 threads:
	           V1m: cost: 1537 K 	 neur: 384 K 	 syn: 1153 K
	           V1h: cost: 3584 K 	 neur: 1536 K 	 syn: 2048 K
	           LIP: cost: 659 K 	 neur: 307 K 	 syn: 352 K
	         LIPCT: cost: 577 K 	 neur: 307 K 	 syn: 270 K
	          LIPP: cost: 21 K 	 neur: 19 K 	 syn: 2 K
	         MTPos: cost: 20 K 	 neur: 19 K 	 syn: 1 K
	        EyePos: cost: 1035 K 	 neur: 132 K 	 syn: 903 K
	       SacPlan: cost: 160 K 	 neur: 36 K 	 syn: 123 K
	       Saccade: cost: 160 K 	 neur: 36 K 	 syn: 123 K
	        ObjVel: cost: 284 K 	 neur: 36 K 	 syn: 247 K
	         TEOCT: cost: 18944 K 	 neur: 480 K 	 syn: 18464 K
	          TEOP: cost: 1920 K 	 neur: 1152 K 	 syn: 768 K
	          TECT: cost: 6080 K 	 neur: 480 K 	 syn: 5600 K
	           TEP: cost: 1600 K 	 neur: 960 K 	 syn: 640 K
Thread: 0 	 cost: 36583 K 	 neur: 5886 K 	 syn: 30697 K
	            V2: cost: 12902 K 	 neur: 1920 K 	 syn: 10982 K
	          V2CT: cost: 2278 K 	 neur: 1920 K 	 syn: 358 K
	           V2P: cost: 1280 K 	 neur: 768 K 	 syn: 512 K
	          V3CT: cost: 3366 K 	 neur: 480 K 	 syn: 2886 K
	            DP: cost: 360 K 	 neur: 30 K 	 syn: 330 K
	          DPCT: cost: 200 K 	 neur: 30 K 	 syn: 170 K
	           DPP: cost: 50 K 	 neur: 30 K 	 syn: 20 K
Thread: 1 	 cost: 20437 K 	 neur: 5178 K 	 syn: 15259 K
	            V3: cost: 6502 K 	 neur: 480 K 	 syn: 6022 K
	           V3P: cost: 1120 K 	 neur: 672 K 	 syn: 448 K
	            V4: cost: 9601 K 	 neur: 480 K 	 syn: 9121 K
	          V4CT: cost: 4704 K 	 neur: 480 K 	 syn: 4224 K
	           V4P: cost: 1120 K 	 neur: 672 K 	 syn: 448 K
	            TE: cost: 3041 K 	 neur: 480 K 	 syn: 2561 K
Thread: 2 	 cost: 26089 K 	 neur: 3264 K 	 syn: 22825 K
	           TEO: cost: 23521 K 	 neur: 480 K 	 syn: 23041 K
Thread: 3 	 cost: 23521 K 	 neur: 480 K 	 syn: 23041 K

Basically, you just look at the imbalances in the per-thread timing, and move appropriately-sized layers across threads to bring it into balance. If the imbalances are big, move higher-cost layers, etc. Be sure to run a few epochs, as timing will change over the course of training.
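When eyeballing the timing report, it helps to convert raw per-thread seconds into shares of the total, which is what the Pct column above appears to show. A minimal sketch (the pcts helper is hypothetical, and the numbers are taken from the report above):

```go
package main

import "fmt"

// pcts converts per-thread seconds (as printed by TimingReport) into each
// thread's percentage share of total time; an even allocation across n
// threads would put every entry near 100/n.
func pcts(secs []float64) []float64 {
	total := 0.0
	for _, s := range secs {
		total += s
	}
	out := make([]float64, len(secs))
	for i, s := range secs {
		out[i] = 100 * s / total
	}
	return out
}

func main() {
	// per-thread seconds from the TimingReport shown above
	secs := []float64{87.27, 79.23, 75.77, 76.08}
	for i, p := range pcts(secs) {
		fmt.Printf("thread %d: %.2f%%\n", i, p)
	}
}
```

For 4 threads the target is 25% each; thread 0 at roughly 27% suggests moving a modest-cost layer off thread 0 onto the lightest thread.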