
Datagen bns changes #466

Open
wants to merge 34 commits into base: bns
Conversation

rafia17 (Contributor) commented Nov 27, 2023

The following changes are implemented:

scripts/waveforms.py:

  • a signal_type variable is added; based on its value, we call generate_gw() for BBH and generate_gw_bns() for BNS (see the dispatch sketch after this list)
  • the call to generate_gw_bns() is made through concurrent.futures.ProcessPoolExecutor to speed up waveform generation
  • some logging is added to report when waveform generation is finished, etc.

utils/injection.py:

  • generate_gw_bns() is added. Its similar to generate_gw() except that it gets rid of the wraparound of the coalescence to the front of the signal.

projects/sandbox/pyproject.toml:

  • added signal_type variable with default set to "bbh"
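
A minimal sketch of the dispatch described above, assuming illustrative argument names (the real signatures live in scripts/waveforms.py and utils/injection.py):

from concurrent.futures import ProcessPoolExecutor

from utils.injection import generate_gw, generate_gw_bns

if signal_type == "bbh":
    signals = generate_gw(
        sample_params,
        sample_rate,
        waveform_duration,
        waveform_approximant,
    )
elif signal_type == "bns":
    # as submitted, the BNS path goes through a process pool;
    # see the review discussion below
    with ProcessPoolExecutor() as exe:
        future = exe.submit(
            generate_gw_bns,
            sample_params,
            sample_rate,
            waveform_duration,
            waveform_approximant,
            detector_frame_prior,
        )
        signals = future.result()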

@rafia17 rafia17 changed the base branch from main to bns November 27, 2023 23:05
@rafia17 rafia17 self-assigned this Nov 27, 2023
    detector_frame_prior,
)
if signal_type == "bns":
    with ProcessPoolExecutor(140) as exe:
Collaborator

I like the idea of parallelizing this, but I don't think this is doing what you expect: This will submit a single job that will generate all the requested waveforms in one process.
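
A sketch of the distinction; the stand-in generator, parameter array, and chunk count below are all illustrative, assuming the parameters can be split along the sample axis:

from concurrent.futures import ProcessPoolExecutor

import numpy as np

def generate_gw_bns(params):
    # stand-in for the real waveform generator
    return np.stack([np.full(8, p) for p in params])

if __name__ == "__main__":
    sample_params = np.arange(64.0)

    # as submitted: one submit() call, so all waveforms are still
    # generated serially inside a single worker process
    # (140 is the pool size from the diff)
    with ProcessPoolExecutor(140) as exe:
        signals = exe.submit(generate_gw_bns, sample_params).result()

    # what actually parallelizes: one job per chunk of parameters
    with ProcessPoolExecutor() as exe:
        chunks = np.array_split(sample_params, 16)
        futures = [exe.submit(generate_gw_bns, c) for c in chunks]
        signals = np.concatenate([f.result() for f in futures])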

rafia17 (Contributor, Author) commented Nov 28, 2023

So this was a long discussion with Alec on a thread that incidentally did not include you. I am going to forward that thread to you, and hopefully you can see the entire conversation.
The crux was that using concurrent.futures did reduce the waveform generation from days to under 46 minutes on the Hanford box.
Let me know if you can access the following thread on Slack:
https://fastml.slack.com/archives/C05EHNRU8AK/p1695772374684639

Collaborator

Ethan's right. This is only helpful if you submit a job for each choice of parameters; submitting one job that generates waveforms for all the parameters won't multiprocess anything, it will just generate all the waveforms serially in one process.

rafia17 (Contributor, Author)

Yes, so I get your point. Not sure why it reduced the waveform generation time so drastically; or could it be that the Hanford box just behaved rather nicely on that particular run? It's kind of puzzling.
So I will remove the concurrent.futures for now. If we run into bottlenecks generating BNS waveforms in the future, we can revisit at that time.

    waveform_approximant: str,
    detector_frame_prior: bool = False,
):
    padding = 1
Collaborator

What is the motivation for this? In general we should avoid "magic numbers" - maybe make this a parameter with a default value.
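
A sketch of the suggested change; the signature here is hypothetical apart from the two arguments visible in the diff above:

def generate_gw_bns(
    sample_params,
    sample_rate: float,
    waveform_duration: float,
    waveform_approximant: str,
    detector_frame_prior: bool = False,
    padding: float = 1.0,  # extra seconds generated to absorb the roll-and-crop
):
    ...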

rafia17 (Contributor, Author) commented Nov 28, 2023

So there is the issue where the bilby-generated waveforms have the coalescence at timestamp = 0 s, and I think that's on purpose. We need to move the coalescence to the end of the waveform, at timestamp = 16 s (let's say). To do this I played with several waveforms and found that if we roll the waveforms 200 datapoints to the left, we get most of the coalescence at the very end. There are some ringdown remnants in some cases, and to deal with those we chop off the first second of the waveform. So the padding is set to 1 s and the waveform is generated "longer" by an amount equal to padding. After rolling and chopping off the first second, the resultant waveform is of the intended length and the coalescence is nicely at the end of it (see the sketch below).
I will add some comments to the code to explain this properly.
In the specific context where it is used, I don't think its value will be changed or can be used elsewhere. Given its limited scope, I don't think we need to put this in pyproject.toml.
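
A minimal sketch of the roll-and-crop described above; the sample rate and dummy waveform are illustrative:

import numpy as np

sample_rate = 2048  # Hz (illustrative)
padding = 1  # extra second of waveform, as in the diff
waveform = np.arange((16 + padding) * sample_rate, dtype=float)  # dummy 17 s signal

# ~200 samples of ringdown wrap around to the front of the array;
# rolling left moves them back to the end, after the merger
dt = -200
waveform = np.roll(waveform, dt, axis=-1)
# crop the first `padding` seconds to restore the target 16 s length
waveform = waveform[..., int(padding * sample_rate):]
assert waveform.size == 16 * sample_rate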


# shift the coalescence 200 datapoints to the left
# to cancel the wraparound at the beginning
dt = -200
Collaborator

Again, another magic number: why is this 200? Is there a first-principles motivation for it?

rafia17 (Contributor, Author)

Please see above

Comment on lines 150 to 153
if i == (n_samples / 4):
    logging.info("Generated polarizations : {}".format(i))
elif i == (n_samples / 2):
    logging.info("Generated polarizations : {}".format(i))
Collaborator

I would argue this reduces readability more than it helps with logging. If you want to keep track, maybe something cleaner would be:

# every 10th waveform
if not i % 10:
    # note the logging.debug, so that it's only emitted if verbose=True
    logging.debug(f"{i + 1} polarizations generated")

rafia17 (Contributor, Author)

Agree, will change.

elif i == (n_samples / 2):
    logging.info("Generated polarizations : {}".format(i))

logging.info("Finished Generated polarizations")
Collaborator

"Finished Generated polarizations" --> "Finished generating polarizations"

rafia17 (Contributor, Author)

Agree, will change.

responses, swap_indices = self.swapper(responses)
responses, mute_indices = self.muter(responses)
X[mask] += responses
if N > 0:
Collaborator

Remind me when this edge case would be reached? In what instance would we not want to inject waveforms?

rafia17 (Contributor, Author)

This was also something that Alec proposed. We ran into this issue when we had to decrease the batch size to ~8 for BNS, which would lead to N coming out to 0 in some instances with waveform_prob set to 0.277. For smaller batch sizes, the chance of N being zero, even for reasonable values of waveform_prob, is pretty high (see the sketch below). So Alec proposed adding this check to aframe.
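
A back-of-the-envelope sketch of why the guard is needed; the sampling scheme and names here are assumptions (each batch element independently selected for injection with probability waveform_prob):

import torch

batch_size = 8
waveform_prob = 0.277
mask = torch.rand(batch_size) < waveform_prob
N = int(mask.sum())
# P(N == 0) = (1 - 0.277) ** 8 ≈ 0.075, i.e. roughly 1 in 13 batches
# gets no injections at all, hence the `if N > 0` guard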

            detector_frame_prior,
        )
        signals = future.result()
else:
Collaborator

Since this is going into a dedicated bns branch, I think it might make sense to just assume we are generating BNS waveforms and not have this if/else. We can move toward generalizing everything down the line.

rafia17 (Contributor, Author)

If we don't want to create a separate repo for BNS, it will be most efficient if we take care of this now rather than later, and write the code keeping in mind that it should work seamlessly with the main pipeline. Otherwise it will be a huge issue later to merge the bns branch into the main aframe branch.

@EthanMarx (Collaborator) left a comment

Nice stuff, left some comments and questions!
