
Datagen bns changes #466

Open
wants to merge 34 commits into base: bns
Conversation

rafia17 (Contributor) commented Nov 27, 2023

The following changes are implemented:

scripts/waveforms.py:

  • a signal_type variable is added; based on its value, we call generate_gw() for BBH and generate_gw_bns() for BNS (see the dispatch sketch after this list)
  • the call to generate_gw_bns() is made through concurrent.futures.ProcessPoolExecutor to speed up waveform generation
  • some logging is added to report when waveform generation is finished, etc.

utils/injection.py:

  • generate_gw_bns() is added. Its similar to generate_gw() except that it gets rid of the wraparound of the coalescence to the front of the signal.

projects/sandbox/pyproject.toml:

  • added signal_type variable with default set to "bbh"
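
A minimal sketch of the dispatch described above, assuming illustrative argument names (the real signatures live in scripts/waveforms.py and utils/injection.py):

from concurrent.futures import ProcessPoolExecutor

from utils.injection import generate_gw, generate_gw_bns

if signal_type == "bbh":
    signals = generate_gw(
        sample_params,
        sample_rate,
        waveform_duration,
        waveform_approximant,
    )
elif signal_type == "bns":
    # as submitted, the BNS path goes through a process pool;
    # see the review discussion below
    with ProcessPoolExecutor() as exe:
        future = exe.submit(
            generate_gw_bns,
            sample_params,
            sample_rate,
            waveform_duration,
            waveform_approximant,
            detector_frame_prior,
        )
        signals = future.result()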

@rafia17 rafia17 changed the base branch from main to bns November 27, 2023 23:05
@rafia17 rafia17 self-assigned this Nov 27, 2023
    detector_frame_prior,
)
if signal_type == "bns":
    with ProcessPoolExecutor(140) as exe:
Collaborator

I like the idea of parallelizing this, but I don't think this is doing what you expect: This will submit a single job that will generate all the requested waveforms in one process.
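
A sketch of the distinction; the stand-in generator, parameter array, and chunk count below are all illustrative, assuming the parameters can be split along the sample axis:

from concurrent.futures import ProcessPoolExecutor

import numpy as np

def generate_gw_bns(params):
    # stand-in for the real waveform generator
    return np.stack([np.full(8, p) for p in params])

if __name__ == "__main__":
    sample_params = np.arange(64.0)

    # as submitted: one submit() call, so all waveforms are still
    # generated serially inside a single worker process
    # (140 is the pool size from the diff)
    with ProcessPoolExecutor(140) as exe:
        signals = exe.submit(generate_gw_bns, sample_params).result()

    # what actually parallelizes: one job per chunk of parameters
    with ProcessPoolExecutor() as exe:
        chunks = np.array_split(sample_params, 16)
        futures = [exe.submit(generate_gw_bns, c) for c in chunks]
        signals = np.concatenate([f.result() for f in futures])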

rafia17 (Contributor, Author) commented Nov 28, 2023

So this was a long discussion with Alec on a thread that incidentally did not include you. I am going to forward that thread to you, and hopefully you can see the entire conversation.
The crux was that using concurrent.futures did reduce the waveform generation from days to under 46 minutes on the Hanford box.
Let me know if you can access the following thread on Slack:
https://fastml.slack.com/archives/C05EHNRU8AK/p1695772374684639

Collaborator

Ethan's right. This is only helpful if you submit a job for each choice of parameters; submitting one job that generates waveforms for all the parameters won't multiprocess anything, it will just generate all the waveforms serially in one process.

rafia17 (Contributor, Author)

Yes, so I get your point. Not sure why it reduced the waveform generation time so drastically; or could it be that the Hanford box just behaved rather nicely on that particular run? It's kind of puzzling.
So I will remove the concurrent.futures for now. If we run into bottlenecks generating BNS waveforms in the future, we can revisit at that time.

    waveform_approximant: str,
    detector_frame_prior: bool = False,
):
    padding = 1
Collaborator

What is the motivation for this? In general we should avoid "magic numbers" - maybe make this a parameter with a default value.
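
A sketch of the suggested change; the signature here is hypothetical apart from the two arguments visible in the diff above:

def generate_gw_bns(
    sample_params,
    sample_rate: float,
    waveform_duration: float,
    waveform_approximant: str,
    detector_frame_prior: bool = False,
    padding: float = 1.0,  # extra seconds generated to absorb the roll-and-crop
):
    ...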

rafia17 (Contributor, Author) commented Nov 28, 2023

So there is the issue where the bilby-generated waveforms have the coalescence at timestamp = 0 s, and I think that's on purpose. We need to move the coalescence to the end of the waveform, at timestamp = 16 s (let's say). To do this I played with several waveforms and found that if we roll the waveforms 200 datapoints to the left, we get most of the coalescence at the very end. There are some ringdown remnants in some cases, and to deal with those we chop off the first second of the waveform. So the padding is set to 1 s and the waveform is generated "longer" by an amount equal to padding. After rolling and chopping off the first second, the resultant waveform is of the intended length and the coalescence is nicely at the end of it (see the sketch below).
I will add some comments to the code to explain this properly.
In the specific context where it is used, I don't think its value will be changed or can be used elsewhere. Given its limited scope, I don't think we need to put this in pyproject.toml.
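
A minimal sketch of the roll-and-crop described above; the sample rate and dummy waveform are illustrative:

import numpy as np

sample_rate = 2048  # Hz (illustrative)
padding = 1  # extra second of waveform, as in the diff
waveform = np.arange((16 + padding) * sample_rate, dtype=float)  # dummy 17 s signal

# ~200 samples of ringdown wrap around to the front of the array;
# rolling left moves them back to the end, after the merger
dt = -200
waveform = np.roll(waveform, dt, axis=-1)
# crop the first `padding` seconds to restore the target 16 s length
waveform = waveform[..., int(padding * sample_rate):]
assert waveform.size == 16 * sample_rate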


# shift the coalescence 200 datapoints to the left
# to cancel the wraparound at the beginning
dt = -200
Collaborator

Again, another magic number: why is this 200? Is there a first-principles motivation for it?

rafia17 (Contributor, Author)

Please see above

Comment on lines 150 to 153
if i == (n_samples / 4):
    logging.info("Generated polarizations : {}".format(i))
elif i == (n_samples / 2):
    logging.info("Generated polarizations : {}".format(i))
Collaborator

I would argue this reduces readability more than it helps with logging. If you want to keep track, maybe something cleaner would be:

# every 10th waveform
if not i % 10:
    # note the logging.debug, so that it's only emitted if verbose=True
    logging.debug(f"{i + 1} polarizations generated")

rafia17 (Contributor, Author)

Agree, will change.

elif i == (n_samples / 2):
    logging.info("Generated polarizations : {}".format(i))

logging.info("Finished Generated polarizations")
Collaborator

"Finished Generated polarizations" --> "Finished generating polarizations"

rafia17 (Contributor, Author)

Agree, will change.

responses, swap_indices = self.swapper(responses)
responses, mute_indices = self.muter(responses)
X[mask] += responses
if N > 0:
Collaborator

Remind me when this edge case would be reached? In what instance would we not want to inject waveforms?

rafia17 (Contributor, Author)

This was also something that Alec proposed. We ran into this issue when we had to decrease the batch size to ~8 for BNS, which would lead to N coming out to 0 in some instances with waveform_prob set to 0.277. For smaller batch sizes, the chance of N being zero, even for reasonable values of waveform_prob, is pretty high (see the sketch below). So Alec proposed adding this check to aframe.
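
A back-of-the-envelope sketch of why the guard is needed; the sampling scheme and names here are assumptions (each batch element independently selected for injection with probability waveform_prob):

import torch

batch_size = 8
waveform_prob = 0.277
mask = torch.rand(batch_size) < waveform_prob
N = int(mask.sum())
# P(N == 0) = (1 - 0.277) ** 8 ≈ 0.075, i.e. roughly 1 in 13 batches
# gets no injections at all, hence the `if N > 0` guard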

            detector_frame_prior,
        )
        signals = future.result()
else:
Collaborator

Since this is going into a dedicated bns branch, I think it might make sense to just assume we are generating BNS waveforms and not have this if/else. We can move toward generalizing everything down the line.

rafia17 (Contributor, Author)

If we don't want to create a separate repo for BNS, it will be most efficient if we take care of this now rather than later, and write the code keeping in mind that it should work seamlessly with the main pipeline. Otherwise it will be a huge issue later to merge the bns branch into the main aframe branch.

@EthanMarx (Collaborator) left a comment

Nice stuff, left some comments and questions!
