Store computational throughput and latency figures in repository #59

felker commented Jan 7, 2020

Related to #58, #52, and #51.

We should add a continually-updated record of the examples/second, seconds/batch, and other statistics discussed in #51 to a new file docs/Benchmarking.md (or ComputationalEfficiency.md, etc.).

AFAIK, neither Kates-Harbeck et al. (2019) nor Svyatkovskiy (2017) discussed single-node or single-GPU computational efficiency, since they focused on the scaling of multi-node parallelism (CUDA-aware MPI).

Given that we have multiple active users of the software distributed across the country (world?), it would be good for collaboration to provide easily accessible metrics of performance expectations. The absence of these figures has already caused some confusion when we got access to V100 GPUs on the Princeton Traverse cluster.

We need to establish a benchmark or set of benchmarks for FRNN in order to measure and communicate consistent and useful metrics. For example, we could store measurements from only a single benchmark consisting of 0D and 1D d3d signal data with our LSTM architecture on a single GPU/device with batch_size=256. A user would then have to extrapolate the examples/second to the simpler network but the longer average pulse lengths on JET if using jet_data_0d. A rough sketch of how such per-batch figures could be collected is shown below.
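A minimal sketch (not existing FRNN code) of how sec/batch and examples/second could be recorded during training with a standard Keras callback; the batch size is passed in explicitly because the value reported in `logs` differs across Keras versions:

```python
# Hypothetical helper, not part of FRNN: records per-batch wall time and
# reports median sec/batch and examples/second at the end of each epoch.
import time
import numpy as np
import keras  # standalone Keras; tf.keras would work the same way


class ThroughputCallback(keras.callbacks.Callback):
    def __init__(self, batch_size):
        super(ThroughputCallback, self).__init__()
        self.batch_size = batch_size
        self.batch_times = []

    def on_batch_begin(self, batch, logs=None):
        self._t0 = time.perf_counter()

    def on_batch_end(self, batch, logs=None):
        self.batch_times.append(time.perf_counter() - self._t0)

    def on_epoch_end(self, epoch, logs=None):
        sec_per_batch = float(np.median(self.batch_times))
        print("epoch %d: %.4f sec/batch, %.1f examples/sec"
              % (epoch, sec_per_batch, self.batch_size / sec_per_batch))
        self.batch_times = []
```

Attaching it via `model.fit(..., callbacks=[ThroughputCallback(batch_size=256)])` would produce the single-GPU numbers proposed above.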

The conf.yaml configuration choices that have first-order effects on performance include (a sketch of capturing these fields follows the list):

  • Network architecture (LSTM vs. TCN vs. Transformer, inclusion of 1D data via convolutional layers, etc.)
  • Hyperparameters (number of layers, hidden units per layer, LSTM length, batch size, etc.)
  • Data set: pulse length of shots, number of features per timestep in the input vector, etc.
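As a rough illustration, the relevant settings could be extracted into a small dictionary and stored next to each measurement. The key names below are hypothetical and would need to match the actual conf.yaml layout:

```python
# Illustrative only: the key names are placeholders, not the real
# conf.yaml schema.
import yaml


def performance_fingerprint(conf_path="conf.yaml"):
    with open(conf_path) as f:
        conf = yaml.safe_load(f)
    model = conf.get("model", {})
    training = conf.get("training", {})
    return {
        "architecture": model.get("rnn_type"),   # LSTM vs. TCN, etc.
        "rnn_layers": model.get("rnn_layers"),
        "rnn_size": model.get("rnn_size"),
        "rnn_length": model.get("length"),
        "batch_size": training.get("batch_size"),
        "dataset": conf.get("paths", {}).get("data"),
    }
```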

Similar to #41, these figures will be useless in the long run unless we store details of their context (a sketch of collecting this automatically follows the list), including:

  • Git commit SHA1 of the repository version
  • Conda environment
  • CUDA, MPI, etc. libraries (Apex?)
  • Specific hardware details, including computer name, interconnect, specific model of GPU
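A rough sketch of how that context could be captured at benchmark time; this assumes git, conda, and nvidia-smi are available on the node, and none of it exists in the repository today:

```python
# Hypothetical provenance capture: collect repository, environment, and
# hardware details to store alongside each measurement.
import json
import os
import platform
import subprocess


def run(cmd):
    try:
        return subprocess.check_output(cmd, text=True).strip()
    except (OSError, subprocess.CalledProcessError):
        return None


context = {
    "git_sha1": run(["git", "rev-parse", "HEAD"]),
    "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    "gpu_model": run(["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"]),
    "driver": run(["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]),
    "hostname": platform.node(),
}
print(json.dumps(context, indent=2))
```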

Summary of hardware we have/had/will have access to for computational performance measurements:

  • K80 (OLCF Titan, ALCF Cooley, Princeton Tiger 1)
  • P100 (Princeton Tiger 2)
  • V100 (Princeton Traverse, OLCF Summit)
  • Intel KNL 7230 (ALCF Theta)

Even when hardware is retired (e.g. OLCF Titan), it would be good to keep those figures for posterity.

  • Also store MPI scaling metrics as discussed in the above papers?
  • Track memory usage on the GPU?
  • examples/second and sec/batch should already be independent of the pulse length, since the size of an example depends only on the sampling frequency dt and the LSTM length (T_RNN in the Nature paper). But gross throughput statistics such as seconds/epoch could be normalized by pulse length (see the sketch after this list).
  • This whole issue is focused on training speed, but should we track and store inference time? See #60 (Resolve differences in shot counts from Nature paper; improve storage of details of the shot and signal sets).
    • Also real-time inference via ONNX, keras2c, etc.?
    • What about guarantee_preprocessed.py runtimes? We should at least quote a single approximate expected runtime.
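For the normalization mentioned above, a rough sketch with assumed definitions (not FRNN code):

```python
# Illustrative normalization: divide seconds/epoch by the total number of
# training timesteps so figures from machines with different average pulse
# lengths (e.g. D3D vs. JET) become comparable.
def seconds_per_timestep(seconds_per_epoch, num_shots, mean_pulse_length_s, dt):
    total_timesteps = num_shots * (mean_pulse_length_s / dt)
    return seconds_per_epoch / total_timesteps
```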