Experiment with generalizing tests beyond GCRCatalogs #234

Open
wmwv opened this issue Jun 2, 2022 · 3 comments

wmwv commented Jun 2, 2022

GCRCatalogs is somewhat tightly integrated into DESCQA, at least by assumption, and sometimes in the code.

In this Issue I (@wmwv) plan to explore how much current DESCQA tests can be separated into

  1. a part that loads the data, and
  2. a part that does the computation.

In brief, this is easy for data that fits in memory, but becomes complicated when the data must be loaded through some process, such as chunked iteration.

There are some more detailed thoughts in the SRV planning issue:
LSSTDESC/SRV-planning#14
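The easy in-memory case above can be sketched as follows. This is a hypothetical illustration, not DESCQA code: `FakeCatalog` is a stub standing in for a GCRCatalogs reader, and only the `get_quantities` call mirrors the real GCR interface.

```python
import numpy as np

class FakeCatalog:
    """Stub standing in for a GCRCatalogs reader (illustration only)."""
    def get_quantities(self, quantities):
        rng = np.random.default_rng(0)
        return {q: rng.uniform(0.0, 10.0, size=100) for q in quantities}

def load_radec(catalog, ra_col="ra", dec_col="dec"):
    """Part 1: loading -- the only place that touches the catalog."""
    data = catalog.get_quantities([ra_col, dec_col])
    return np.asarray(data[ra_col]), np.asarray(data[dec_col])

def bounding_box(ra, dec):
    """Part 2: pure computation on in-memory arrays, no catalog access."""
    return ra.min(), ra.max(), dec.min(), dec.max()

ra, dec = load_radec(FakeCatalog())
box = bounding_box(ra, dec)
```

Once the data is in memory, part 2 is a plain function that can be tested against any array source; the complication arises only when part 1 cannot hand back full arrays.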

I'm currently pursuing this in the branch `u/wmwv/test-logic-separation`.

wmwv commented Jun 2, 2022

I think @yymao summarized the key issue well in the SRV thread:

> how we can minimally change the function so that it accepts both a GCRCatalogs iterator and a pandas/dask dataframe. The test would supposedly need very different code to be applied to these two cases. And I think that's the main issue we are facing.
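One hedged sketch of a way around that split: an adapter that normalizes either input into an iterator of dict-of-columns chunks, so the test body is written once against chunks. `as_chunks` and `count_rows` are hypothetical names, not existing DESCQA or GCRCatalogs functions, and a dask dataframe would likely need its own per-partition branch rather than the single-chunk path shown here.

```python
import pandas as pd

def as_chunks(data):
    """Yield dict-of-arrays chunks from either kind of input."""
    if hasattr(data, "columns"):  # pandas-like; dask would need a per-partition variant
        yield {col: data[col].to_numpy() for col in data.columns}
    else:                         # assumed: already an iterator of chunk dicts
        yield from data

def count_rows(data, col):
    """Example test body written purely against chunks."""
    return sum(len(chunk[col]) for chunk in as_chunks(data))

df = pd.DataFrame({"ra": [10.0, 20.0, 30.0]})
n_df = count_rows(df, "ra")                                            # dataframe path
n_it = count_rows(iter([{"ra": [10.0]}, {"ra": [20.0, 30.0]}]), "ra")  # iterator path
```

The design choice here is to push the "very different code" into one small adapter at the boundary, instead of branching inside every test.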


wmwv commented Jun 2, 2022

I've pulled out this specific example from `descqa.basic_tests.SkyArea` to help think this through. It is a simple function that takes all of the RA, Dec values, maps them to healpixels, and then keeps the set of filled pixels. The code goes through the data in chunks using a GCRCatalogs `get_quantities` iterator.

    def calc_healpix_set(self, catalog_instance, nside, ra_col, dec_col):
        """
        Calculating the healpixel for all of the data is the I/O-intensive
        step, so we separate it out here into its own function.
        """
        pixels = set()
        for d in catalog_instance.get_quantities([ra_col, dec_col], return_iterator=True):
            pixels.update(hp.ang2pix(nside, d[ra_col], d[dec_col], lonlat=True))

        return pixels

In my branch I broke this out a bit into its own (notionally free) function to isolate the core details.

The rest of the `run_on_single_catalog` function takes the pixels, computes the fraction of sky area covered, and plots a map. These operations act on the aggregated set from above, so they aren't central to the question of how to access the data.
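Under that framing, the chunk loop might become a free function that takes any iterable of column-dict chunks plus the per-chunk mapping. This is a sketch of the idea, not the code in the branch, and the exact signature there may differ; a toy integer-degree binner stands in for `hp.ang2pix` so the example has no healpy dependency.

```python
def collect_unique(chunks, ra_col, dec_col, to_pixel):
    """Union the to_pixel results over all chunks.

    In the SkyArea test, to_pixel would be something like
        lambda ra, dec: hp.ang2pix(nside, ra, dec, lonlat=True)
    but this function no longer knows about GCRCatalogs or healpy.
    """
    pixels = set()
    for d in chunks:
        pixels.update(to_pixel(d[ra_col], d[dec_col]))
    return pixels

# Toy stand-in for ang2pix: bin coordinates to integer degrees.
to_pixel = lambda ra, dec: {(int(r), int(d)) for r, d in zip(ra, dec)}

chunks = [{"ra": [0.2, 1.7], "dec": [0.1, 0.9]}]
pix = collect_unique(chunks, "ra", "dec", to_pixel)  # {(0, 0), (1, 0)}
```

The caller is then responsible for producing `chunks`, whether from a `get_quantities` iterator, a single in-memory dataframe, or anything else.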

@wmwv
Copy link
Contributor Author

wmwv commented Jun 2, 2022

This makes me think of a scatter-gather pattern, but the scatter part (going through each chunk and identifying the unique healpixel numbers) is done serially.

I'm starting to wonder how the question of how to access data in chunks relates to the question of how to process data in parallel.
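A minimal sketch of that scatter-gather idea, under the big assumption that chunks can be handed out independently, which a serial `get_quantities` iterator does not guarantee; that gap is exactly the open question above. The per-chunk function again uses the toy integer-degree binner in place of `hp.ang2pix`.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def pixels_in_chunk(chunk):
    """Pure per-chunk step (stand-in for hp.ang2pix binning)."""
    return {(int(ra), int(dec)) for ra, dec in zip(chunk["ra"], chunk["dec"])}

def scatter_gather(chunks):
    with ThreadPoolExecutor() as pool:
        partials = pool.map(pixels_in_chunk, chunks)  # scatter: map over chunks
        return reduce(set.union, partials, set())     # gather: union partial sets

chunks = [
    {"ra": [0.1, 1.5], "dec": [0.2, 0.8]},
    {"ra": [1.5, 2.9], "dec": [0.8, 2.1]},
]
filled = scatter_gather(chunks)  # {(0, 0), (1, 0), (2, 2)}
```

Because the gather step is a commutative, associative set union, the chunks can be processed in any order or in parallel; the serial iterator is the only thing forcing the current one-at-a-time structure.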
