
[Question] How to efficiently find long period signals in data with long gaps? #88

Open
jpdeleon opened this issue Jan 14, 2021 · 20 comments
Labels
bug Something isn't working

Comments

@jpdeleon

Can you suggest a strategy for efficiently finding transits in light curves with a long gap, e.g. 1 year, as in the case of EPICs observed in K2 Campaign 16 and again in C18? I could perhaps limit the period search space using period_min and period_max, but I wonder if you have a better solution.

@jpdeleon added the bug label on Jan 14, 2021
@hippke
Owner

hippke commented Jan 14, 2021

Do you have a good example to test? Code + data

@martindevora

martindevora commented Mar 16, 2021

Hi. I was coming here to ask something similar. The point is that grid.py uses time_span to determine the sampling of the period_grid. When long time gaps are present (e.g. TICs observed in sector 1 and sector 27), the period_grid becomes extremely dense (something like 400,000 points) and the computational effort increases significantly.

Do you think it would make sense to calculate the time_span in a different way? I know we can reduce the period_grid density by lowering the oversampling, but the minimum value for this input is 1, which can still be insufficient to prevent the increase in computational cost. The sketch below illustrates the scaling.
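
To make the scaling concrete, here is a minimal sketch using the period_grid helper that TLS exposes (the time_span values are rough placeholders I am assuming for two back-to-back sectors versus sectors 1 and 27):

from transitleastsquares import period_grid

# Contiguous baseline: two back-to-back TESS sectors, roughly 55 days
contiguous = period_grid(R_star=1.0, M_star=1.0, time_span=55)

# Gapped baseline: sectors 1 and 27 span roughly 700 days of calendar time,
# even though only ~55 days of that span contain data
gapped = period_grid(R_star=1.0, M_star=1.0, time_span=700)

# The second grid is far denser, although the amount of data is the same
print(len(contiguous), len(gapped))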

@martindevora

> Can you suggest a strategy for efficiently finding transits in light curves with a long gap, e.g. 1 year, as in the case of EPICs observed in K2 Campaign 16 and again in C18? I could perhaps limit the period search space using period_min and period_max, but I wonder if you have a better solution.

Just a suggestion, and maybe @hippke has a better idea or can refute it: you could set the oversampling_factor parameter to 1 (it defaults to 3 in tls_constants.py). That would reduce the density of your period grid and make your computations lighter, along the lines of the sketch below.
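
For concreteness, a minimal sketch (the time and flux arrays here are dummy placeholders for a real detrended light curve):

import numpy as np
from transitleastsquares import transitleastsquares

# Dummy stand-ins for a real light curve
time = np.linspace(0, 27, 10000)
flux = np.ones_like(time)

model = transitleastsquares(time, flux)
results = model.power(
    period_min=0.5,          # limit the search space explicitly
    period_max=20,
    oversampling_factor=1,   # default is 3; lower gives a coarser period grid
)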

@hippke
Owner

hippke commented Mar 16, 2021

Ah, I see. At the time I tested TLS, I used a lot of Kepler and early TESS data with only minor gaps, so that's a use case I had not considered. Unfortunately, there is an inevitable computational burden when you go to a longer time baseline (with a corresponding gain in sensitivity to longer-period planets). So there's no easy solution, i.e. no free lunch: making TLS run faster will make it less sensitive. However, that may still be OK or even desired, depending on the use case.
Can you provide a full working example that I can check out myself?

@martindevora

martindevora commented Mar 16, 2021

I think it would be feasible, without much effort, to add an option for the user to provide the period grid. What do you think, @hippke? If not provided, TLS would fall back to the default grid strategy used nowadays. A hypothetical sketch follows.
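
Purely as a hypothetical sketch of the proposed API (a periods keyword does not exist in TLS; it is only meant to illustrate the idea):

import numpy as np

# A custom, log-spaced period grid built by the user
custom_periods = np.geomspace(0.5, 20.0, 5000)

# Proposed, not implemented: pass the grid straight to power()
# results = model.power(periods=custom_periods)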

@hippke
Owner

hippke commented Mar 16, 2021

Sure, but what's it good for? The user can already request a grid with any [from, to] bounds in whatever log-spacing they want.

@martindevora

Do you mean by using the oversampling factor? I thought the lower limit for that value was 1, because numpy gave me an error saying that only integers were allowed. I will try to reproduce that situation, but if true, the grid density would be restricted.

@hippke
Owner

hippke commented Mar 16, 2021

Yes, you should be able to use any floating-point value for oversampling_factor, e.g. 0.5.
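
For instance, using the exposed period_grid helper (the time_span value is a placeholder):

from transitleastsquares import period_grid

coarse = period_grid(R_star=1.0, M_star=1.0, time_span=700, oversampling_factor=0.5)
default = period_grid(R_star=1.0, M_star=1.0, time_span=700, oversampling_factor=3)
# The grid size scales roughly linearly with the oversampling factor
print(len(coarse), len(default))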

@martindevora

> Do you mean by using the oversampling factor? I thought the lower limit for that value was 1, because numpy gave me an error saying that only integers were allowed. I will try to reproduce that situation, but if true, the grid density would be restricted.

I'm unable to reproduce it. Therefore I agree with you that period_min, period_max, and oversampling_factor (given that it can be lower than 1) are enough for us users to tune the period grid. I hope that satisfies the OP's needs.

@martindevora

martindevora commented Mar 16, 2021

Hi. I reproduced it. See the attached error traceback with oversampling_factor = 3.75 as an example:

arrays used as indices must be of integer (or boolean) type
Traceback (most recent call last):
    results = model.power(**power_args)
  File "/home/martin/.local/lib/python3.8/site-packages/transitleastsquares/main.py", line 263, in power
    SR, power_raw, power, SDE_raw, SDE = spectra(chi2, self.oversampling_factor)
  File "/home/martin/.local/lib/python3.8/site-packages/transitleastsquares/stats.py", line 119, in spectra
    my_median = running_median(power_raw, kernel)
  File "/home/martin/.local/lib/python3.8/site-packages/transitleastsquares/helpers.py", line 96, in running_median
    med = numpy.median(data[idx], axis=1)
IndexError: arrays used as indices must be of integer (or boolean) type
arrays used as indices must be of integer (or boolean) type

Process finished with exit code 0

With this in mind, we cannot go below oversampling_factor = 1, so we'd be restricted.
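
If it helps, the mechanism is plausibly that a non-integer oversampling_factor leaks into the running-median kernel as a float, and numpy refuses float arrays as indices. A minimal numpy reproduction of that behaviour (my assumption about the internals, not a confirmed diagnosis):

import numpy as np

data = np.arange(10.0)
idx = np.arange(7.5)             # a float stop yields a float-dtype index array
# data[idx]                      # IndexError: arrays used as indices must be
                                 # of integer (or boolean) type
print(data[idx.astype(int)])     # casting to int restores valid indexing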

@martindevora

martindevora commented Mar 18, 2021

Do you think the error found above is a bug, or is it expected?

@hippke
Owner

hippke commented Mar 20, 2021

Looks like a bug. Can you share the code to reproduce? I will look into it and make a bugfix if possible.

@martindevora

martindevora commented Mar 20, 2021

Sure. Here it is for transitleastsquares version 1.0.25 and using lightkurve v2:

import lightkurve
import transitleastsquares as tls
import multiprocessing

# Download all SPOC short-cadence light curves for sector 2 and stitch them
lcf = lightkurve.search_lightcurve("TIC 305048087", mission="TESS", cadence="short", sector=[2], author="SPOC")\
    .download_all()
lc = lcf.stitch().remove_nans()
model = tls.transitleastsquares(lc.time.value, lc.flux.value)
# The non-integer oversampling_factor is what triggers the IndexError below
results = model.power(period_min=0.5, period_max=20, use_threads=multiprocessing.cpu_count(), oversampling_factor=3.5)

And the entire output:

Transit Least Squares TLS 1.0.25 (04 June 2020)
Creating model cache for 38 durations
/home/martin/.local/lib/python3.8/site-packages/transitleastsquares/transit.py:157: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  lc_arr = numpy.array(lc_arr)
  0%|          | 0/2919 periods | 00:00<?Searching 18317 data points, 2919 periods from 0.601 to 13.702 days
Using all 8 CPU threads
100%|██████████| 2919/2919 periods | 01:14<00:00
Traceback (most recent call last):
  File "/home/martin/git_repositories/sherlockpipe/test.py", line 9, in <module>
    results = model.power(period_min=0.5, period_max=20, use_threads=multiprocessing.cpu_count(), oversampling_factor=3.5)
  File "/home/martin/.local/lib/python3.8/site-packages/transitleastsquares/main.py", line 263, in power
    SR, power_raw, power, SDE_raw, SDE = spectra(chi2, self.oversampling_factor)
  File "/home/martin/.local/lib/python3.8/site-packages/transitleastsquares/stats.py", line 119, in spectra
    my_median = running_median(power_raw, kernel)
  File "/home/martin/.local/lib/python3.8/site-packages/transitleastsquares/helpers.py", line 96, in running_median
    med = numpy.median(data[idx], axis=1)
IndexError: arrays used as indices must be of integer (or boolean) type

Kind regards.

@hippke
Owner

hippke commented Mar 20, 2021

Fixed in version 1.0.26, which is also available via pip.
oversampling_factor now accepts any floating-point value.

@martindevora

Great! I would say that we can now explore data with long gaps at the same computational cost as consecutive sectors. To me this is clear enough and answers the question. However, I am not the OP.

@martindevora

martindevora commented Mar 21, 2021

Hi. I've already tried the new version. Sorry for bringing this up again: it works sometimes, but I found a case where it fails with a new error. My code is:

import lightkurve
import transitleastsquares as tls
import multiprocessing

# Two sectors separated by a long gap: sector 13 and sector 27
lcf = lightkurve.search_lightcurve("TIC 352315023", mission="TESS", cadence="short", sector=[13, 27], author="SPOC")\
    .download_all()
lc = lcf.stitch().remove_nans()
lc = lc.remove_outliers(sigma_lower=float('inf'), sigma_upper=3)  # clip only upward outliers
model = tls.transitleastsquares(lc.time.value, lc.flux.value)
results = model.power(period_min=0.45, period_max=40, use_threads=multiprocessing.cpu_count() - 1,
                      oversampling_factor=1.1119355997446583, T0_fit_margin=0.05, duration_grid_step=1.1)

And the error:

Transit Least Squares TLS 1.0.26 (20 March 2021)
Creating model cache for 45 durations
/home/martin/.local/lib/python3.8/site-packages/transitleastsquares/transit.py:157: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  lc_arr = numpy.array(lc_arr)
  0%|          | 0/15929 periods | 00:00<?Searching 33168 data points, 15929 periods from 0.602 to 39.985 days
Using all 8 CPU threads
100%|██████████| 15929/15929 periods | 13:26<00:00
Traceback (most recent call last):
  File "/home/martin/git_repositories/sherlockpipe/test.py", line 10, in <module>
    results = model.power(period_min=0.45, period_max=40, use_threads=multiprocessing.cpu_count(),
  File "/home/martin/.local/lib/python3.8/site-packages/transitleastsquares/main.py", line 263, in power
    SR, power_raw, power, SDE_raw, SDE = spectra(chi2, self.oversampling_factor)
  File "/home/martin/.local/lib/python3.8/site-packages/transitleastsquares/stats.py", line 119, in spectra
    my_median = running_median(power_raw, kernel)
  File "/home/martin/.local/lib/python3.8/site-packages/transitleastsquares/helpers.py", line 97, in running_median
    med = numpy.median(data[idx], axis=1)
IndexError: index 15929 is out of bounds for axis 0 with size 15929

Process finished with exit code 1
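
The new failure looks like an off-by-one: with a non-integer factor, the last running-median window apparently indexes one element past the end of the power array. A sketch of the kind of bounds guard that would avoid it (my illustration, not the actual helpers.py code):

import numpy as np

def running_median_clipped(data, kernel):
    # Running median whose window indices are clipped to the array bounds.
    # Illustrative only; not the TLS implementation.
    kernel = int(kernel)
    starts = np.arange(len(data) - kernel + 1)
    idx = starts[:, None] + np.arange(kernel)[None, :]  # integer window indices
    idx = np.clip(idx, 0, len(data) - 1)                # guard against overruns
    return np.median(data[idx], axis=1)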

@martindevora

Hi. Sorry to bother you again; I just wanted to keep the issue alive so we can get a fix as soon as you find it possible.
Kind regards.

@hippke
Owner

hippke commented Mar 25, 2021

Sorry you are having trouble. Looking into it right now.

The runtime of your script is ~1 minute on my Core-i7 7700k using 7 threads (and considerable background load). Your runtime shows as 13 minutes. Are you using a slow machine, or is there something else going on?

Why not use oversampling_factor=1?

I have made a change to the code which fixes the issue, but it raises some nasty warnings. Before I release a new version, can you test the fix? You need to replace the file helpers.py in your folder /home/martin/.local/lib/python3.8/site-packages/transitleastsquares/ with the attached version.
helpers.zip

@martindevora

martindevora commented Mar 25, 2021

Hi. Because oversampling_factor=1 is also still suboptimal: the period_grid remains denser than desired. The test I was running used an automatically calculated oversampling factor. That value will probably be lower than 1 for other targets (or whenever we change the equation that calculates it), so I would like to stick to float values instead of integers.
The runtime I reported was very long, probably because I paused the code at a breakpoint for a while and did some debugging.
I will try your suggested file and come back with the output.

@martindevora

Hi, here are the results (the execution seems to have worked fine, i.e. no errors):

Transit Least Squares TLS 1.0.26 (20 March 2021)
Creating model cache for 45 durations
/home/martin/.local/lib/python3.8/site-packages/transitleastsquares/transit.py:157: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  lc_arr = numpy.array(lc_arr)
  0%|          | 0/15929 periods | 00:00<?Searching 33168 data points, 15929 periods from 0.602 to 39.985 days
Using all 8 CPU threads
100%|██████████| 15929/15929 periods | 03:49<00:00
/home/martin/.local/lib/python3.8/site-packages/numpy/core/_methods.py:261: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/home/martin/.local/lib/python3.8/site-packages/numpy/core/_methods.py:221: RuntimeWarning: invalid value encountered in true_divide
  arrmean = um.true_divide(arrmean, div, out=arrmean, casting='unsafe',
/home/martin/.local/lib/python3.8/site-packages/numpy/core/_methods.py:253: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/home/martin/.local/lib/python3.8/site-packages/transitleastsquares/main.py:404: UserWarning: 292 of 329 transits without data. The true period may be twice the given period.
  warnings.warn(text)

Those ugly warnings were already appearing in a fresh installation I was running from another project that also uses TLS. I don't know what causes them, but they probably need a fix too. Thanks for working on this! And please keep me posted whenever you consider it fixed.
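
For what it's worth, the RuntimeWarnings match what numpy emits when a variance is taken over an empty slice, which could plausibly happen for trial periods whose windows fall entirely inside the data gap (my assumption, not a confirmed diagnosis):

import numpy as np

# Reproduces "Degrees of freedom <= 0 for slice" and the
# "invalid value encountered" warnings seen in the log above
print(np.var(np.array([])))  # -> nan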
