
Project Goals and Running Notes #4

Open · matthewfeickert opened this issue Jun 8, 2020 · 3 comments
Labels: documentation, question

@matthewfeickert (Member)

This issue can serve as a running list of discussion on project goals and needs, as well as a place to hold discussions for future reference.

@matthewfeickert (Member, Author)

> What is the test dataset?

The dataset is arbitrary and should be defined by the user at runtime. We want to be able both to benchmark the performance of the different backends on published ATLAS likelihoods, as is done in the example, and to create arbitrarily large/difficult workspaces to fit in order to find edge cases.
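For the published-likelihood route, a hedged sketch using the download helper from pyhf's contrib extras (requires installing pyhf with the contrib extra; the HEPData URL below is a placeholder, not a specific record):

```python
from pyhf.contrib.utils import download

# Hypothetical example: fetch a published ATLAS likelihood archive from
# HEPData into a local directory. The URL is a placeholder; substitute the
# resource URL of an actual published-likelihood record.
download(
    "https://www.hepdata.net/record/resource/<resource-id>?view=true",
    "published-likelihood",
)
```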

> What computation is needed?

In general, we need to be able to perform a hypothesis test (pyhf.infer.hypotest).
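A minimal sketch of that computation, using a simple one-channel model from pyhf.simplemodels (the yields are illustrative; recent pyhf releases name this helper uncorrelated_background, older ones hepdata_like):

```python
import pyhf

# Build a toy one-channel model: two bins of signal over background with
# uncorrelated background uncertainties (all numbers are made up).
model = pyhf.simplemodels.uncorrelated_background(
    signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
)
# Observed main-channel counts plus the model's auxiliary data.
data = [51.0, 48.0] + model.config.auxdata

# Observed CLs for the signal-strength hypothesis mu = 1.
cls_observed = pyhf.infer.hypotest(1.0, data, model)
print(f"Observed CLs: {cls_observed}")
```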

> For the benchmark, are we just expected to record the CPU or GPU time used for the computation? What else?

We care primarily about benchmarking the fit time, but if there are other quantities that are important for benchmarking then those should be added too.
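A rough sketch of what a timing harness could look like, assuming we wall-clock pyhf.infer.hypotest per backend (the function name time_hypotest and the best-of-n choice are ours for illustration, not anything in pyhf):

```python
import time

import pyhf

def time_hypotest(model, data, backend="numpy", n_repeats=5):
    """Best-of-n wall-clock time for one hypothesis test on a given backend."""
    pyhf.set_backend(backend)  # "numpy" always works; others need extras installed
    times = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        pyhf.infer.hypotest(1.0, data, model)
        times.append(time.perf_counter() - start)
    return min(times)  # min reduces noise from OS scheduling and warm-up
```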

@coolalexzb (Contributor)

From example.py and the tutorial I read, I found two ways to generate data:

1. Generate data from a workspace (the workspace file might be local or downloaded online)
2. Generate data using Python scripts

Question: Is there any other approach to generating data? (This relates to the first part of the data pipeline.)

Question: When I read the pyhf documentation I am still confused by some terms, such as channel, modifier, signal, and background. These all seem related to physics; can you help me understand them? I noticed that some variables in the Python scripts might correspond to these terms. If I understand them, I think I can work more efficiently.

@matthewfeickert (Member, Author) commented Jun 9, 2020

> From example.py and the tutorial I read, I found two ways to generate data:
> 1. Generate data from a workspace (the workspace file might be local or downloaded online)
> 2. Generate data using Python scripts
>
> Question: Is there any other approach to generating data? (This relates to the first part of the data pipeline.)

If "data" here means the workspace, then realistically no. No one is really going to make them by hand, and while you can create them using the pyhf xml2json CLI tool from a binary file type called .root we aren't going to be doing that here. So it is safe to assume that we will either be using pre-existing files or generating pathological cases with Python.

If you want a model that has many bins, that is quite easy to do with Python (see the [pyhf] Stack Overflow tag for examples). If you want something that is difficult to fit, that will depend on what sorts of systematics you put in the model spec. We can discuss this more later.
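For scale, a sketch of the many-bins direction (yields are placeholders; the simplemodels helper is called uncorrelated_background in recent releases, hepdata_like in older ones):

```python
import pyhf

# Grow n_bins to generate arbitrarily large workspaces for stress testing.
n_bins = 1000
model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0] * n_bins,
    bkg=[50.0] * n_bins,
    bkg_uncertainty=[5.0] * n_bins,
)
data = [52.0] * n_bins + model.config.auxdata
```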

> Question: When I read the pyhf documentation I am still confused by some terms, such as channel, modifier, signal, and background. These all seem related to physics; can you help me understand them? I noticed that some variables in the Python scripts might correspond to these terms. If I understand them, I think I can work more efficiently.

Everything should be covered in the Intro documentation. Can you tell us what parts of that section aren't clear (this would be good to know in general)? Background and Signal just mean "the processes we already know should happen due to known physics" (background) and "processes that are predicted to happen that we are looking for statistical evidence for" (signal) — they are just different components of a statistical model.
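To make the terms concrete, here is roughly the single-channel "hello world" spec from the pyhf documentation, annotated with where channel, sample, and modifier appear (yields are illustrative):

```python
import pyhf

spec = {
    "channels": [  # a "channel" is one region of binned observations
        {
            "name": "singlechannel",
            "samples": [
                {
                    "name": "signal",  # the process we seek evidence for
                    "data": [12.0, 11.0],
                    "modifiers": [
                        # a "modifier" parameterizes how a sample can vary;
                        # "mu" is the signal-strength normalization factor
                        {"name": "mu", "type": "normfactor", "data": None}
                    ],
                },
                {
                    "name": "background",  # processes known physics predicts
                    "data": [50.0, 52.0],
                    "modifiers": [
                        {
                            "name": "uncorr_bkguncrt",
                            "type": "shapesys",  # uncorrelated per-bin uncertainty
                            "data": [3.0, 7.0],
                        }
                    ],
                },
            ],
        }
    ]
}
model = pyhf.Model(spec)
```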
