User experience and performance improvements for pipeline demonstrator #64

Open
7 of 36 tasks
alexander-held opened this issue Apr 27, 2022 · 1 comment
Labels
bug Something isn't working enhancement New feature or request help wanted Extra attention is needed implementation concerns analysis implementation

alexander-held commented Apr 27, 2022

This issue collects user experience and performance concerns revealed by the CMS Open Data pipeline demonstration at the AGC 2022 workshop.

Completeness of pipeline

  • add a machine learning component (e.g. ttbar reconstruction); this is frequently requested and relevant for many analyses done in practice

User experience

ServiceX+coffea

ServiceX

coffea

coffea-casa

  • manual Dask scaling settings do not seem to be accepted
  • ServiceX dashboard

func_adl

  • find ways to format queries that make clear the "layer" at which a given operation acts

processor design

  • avoid stacking masks of different shapes (which arise when a mask is built after initial filtering), since the shapes are hard to keep track of (perhaps use keepdims=True, or mask with None)
  • improve the systematics loop: potentially streamline everything to use the same pattern, or find a way to automatically track which columns change when and automatically expand observables with systematics dimensions; avoid scaling jet properties via a helper array
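
The mask-shape problem in the first item above can be illustrated with a small sketch. It uses plain NumPy and made-up per-event values (`n_jets`, `n_btags`, `lepton_pt` are hypothetical names, not taken from the demonstrator's processor), and it shows only one possible convention: build every mask on the unfiltered events so that all masks share one shape, then filter once.

```python
import numpy as np

# Toy per-event quantities (hypothetical values, one entry per event).
n_jets = np.array([4, 2, 5, 6, 3, 4])
n_btags = np.array([1, 0, 2, 2, 1, 0])
lepton_pt = np.array([30.0, 45.0, 20.0, 50.0, 28.0, 60.0])

# Error-prone pattern: the second mask is built after filtering, so it has
# the length of the filtered array and no longer lines up with the first.
jet_mask = n_jets >= 4                        # shape (6,)
btag_mask_filtered = n_btags[jet_mask] >= 1   # shape (4,)

# Alternative: build every mask on the unfiltered events so that all masks
# share one shape, combine them, and filter once at the end.
btag_mask = n_btags >= 1          # shape (6,)
lepton_mask = lepton_pt > 25.0    # shape (6,)
selection = jet_mask & btag_mask & lepton_mask

selected_lepton_pt = lepton_pt[selection]
print(jet_mask.shape, btag_mask_filtered.shape, selection.shape)
print(selected_lepton_pt)
```

For jagged Awkward arrays, the suggestion of masking with None aims at the same goal: entries that fail a cut are replaced rather than dropped, so the array keeps its original length.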

Performance

ServiceX+coffea

ServiceX

  • DID finder becomes a bottleneck when running over a large number of files

coffea

servicex-databinder approach

  • avoid bottleneck with file conversion / copying (feed data straight to Skyhook?)

coffea-casa

  • understand issues showing up in dask task stream (file access?)
  • possibility of guaranteeing a fixed number of workers for performance benchmarking

func_adl

cabinetry

  • cabinetry.templates.collect takes a long time as more channels are introduced (e.g. 45.3 seconds for 20 channels)
  • cabinetry.model_utils.prediction(model, fit_results=fit_results) crashes the notebook due to memory issues for a model with many channels. Potentially related: "Memory requirement of ak.sum vs np.sum" (scikit-hep/awkward#2480)
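
To quantify how the runtime and memory of the two calls above grow with the number of channels, a small standard-library helper may be enough as a first step. The sketch below is not part of cabinetry: `profile_call` and `build_templates` are hypothetical names, and `build_templates` is only a stand-in workload to be replaced by the real call under study. Note that `tracemalloc` only sees allocations made through Python's memory allocator, so memory allocated natively by compiled libraries may be undercounted.

```python
import time
import tracemalloc


def profile_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if the profiled call raises.
        tracemalloc.stop()
    return result, elapsed, peak


def build_templates(n_channels, n_bins=1000):
    # Hypothetical stand-in workload; replace with the real call under study.
    return [[0.0] * n_bins for _ in range(n_channels)]


# Scan the number of channels to see how time and peak memory scale.
for n_channels in (5, 10, 20):
    _, elapsed, peak = profile_call(build_templates, n_channels)
    print(f"{n_channels:3d} channels: {elapsed:.4f} s, peak {peak / 1e6:.2f} MB")
```

Scanning a few small channel counts first gives a scaling trend without running into the crash itself, since an out-of-memory kill of the notebook kernel would not return any measurement.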

alexander-held added the bug, enhancement, and help wanted labels Oct 5, 2022
@ekauffma
Collaborator

Validation model that causes the notebook to crash when getting the post-fit prediction (cabinetry.model_utils.prediction(model, fit_results=fit_results)): https://gist.github.com/ekauffma/b9fcbba5bb6f1ba411b6be37d8586db6

alexander-held added the implementation concerns analysis implementation label May 1, 2023
alexander-held pinned this issue May 1, 2023