Add global discovery #470

msdemlei · 2023-08-07T15:05:01Z

This PR should eventually add a facility for global dataset discovery to pyVO. Specifically, people should be able to say "give me all images in the VO showing M42 around 42nm on MJD 54125."

Realistically, we ought to allow intervals on input and probably multiple positions (though we should refuse SIA1 for multiple objects).

At the moment, none of this will work; I'm missing two very basic building
blocks that I won't have time to locate before I'm going on holiday, both related to the way pyvo treats query results:

I need to remove rows from registry query results. That's nasty because RegistryResults reads from the astropy.VOTable instance, and that doesn't support dropping rows so far for all I can see.
I need to turn whatever comes back from SIA1 and TAP services to sia2.ObsCoreRecord-s. Again, as implemented so far, this wants a VOTable to pick stuff from.

If there is no better solution for both, we probably need to turn things into astropy tables, manipulate them, and then turn them back to VOTables. It'd be great if there were a less convoluted way to do this. Is there?

Even if that's there, this won't be ready for prime time; we will at least need defined timeouts on all services involved, because there's always some services in the VO hanging clients for almost arbitrary times.

Once we have the image case, a similar modules for spectrum discovery (in SSA and Obscore for now) should be simple.

We probably should talk to the IVOA to make throwing out capabilities leading to the same data possible/more robust; I think we ought to suggest that SIAv2 services over data that's also there in Obscore services should have an IsDerivedFrom set to the respective TAP service (or Obscore resource).

So, there's a metric ton of TODOs – I'm grateful for commits fixing any of them.

Ah – and before you ask: Testing code like this that talks to services all over the internet is of course a particular challenge; I think I have a plan, so don't worry about that particular issue too much at this point.

codecov · 2023-08-07T15:07:30Z

Codecov Report

Attention: Patch coverage is 37.90323% with 77 lines in your changes are missing coverage. Please review.

Project coverage is 79.13%. Comparing base (befc7bc) to head (bb716b4).
Report is 73 commits behind head on main.

❗ Current head bb716b4 differs from pull request most recent head 5acf4b0. Consider uploading reports for the commit 5acf4b0 to get more accurate results

Files	Patch %	Lines
pyvo/discover/image.py	23.00%	77 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #470      +/-   ##
==========================================
- Coverage   81.24%   79.13%   -2.11%     
==========================================
  Files          69       54      -15     
  Lines        7129     6155     -974     
==========================================
- Hits         5792     4871     -921     
+ Misses       1337     1284      -53

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

docs/utils/testing.rst

This attempts to address astropy#470 (review)

service. Also, improving discovery logging a bit Also, adding an admonition to not use utils.testing outside of pyVO. (which attemtps to address astropy#470 (review)) Also, removing stale test inputs for global discovery.

msdemlei · 2024-02-26T12:09:06Z

So... I think this could now be worth a look. See also my blog post on this: https://blog.g-vo.org/global-dataset-discovery-in-pyvo.html.

The CI failures look interesting – I half fear they are related to implementation details changing the request hashes that utils.testing generates. Perhaps we'll have to properly scaffold these things after all? Anyway, I'll look at this some other day; it shouldn't impact discovery as such.

Meanwhile, on the usual codestyle failures I have to say I'm fairly unhappy that I'll now have to say foo / 2 rather than foo/2 – does that really help someone? And do people here in pyVO actually prefer the former?

If we have to have these checks in place: Can someone recommend a formatter/pretty-printer that I can run halfway blindly on the sources? Frankly, I don't think I'd like to get used to typing foo / 2 or 1 + 2, so if this kind of thing can happen in a pre-commit hook, that'd make my life easier...

bsipocz · 2024-03-06T16:11:46Z

f we have to have these checks in place: Can someone recommend a formatter/pretty-printer that I can run halfway blindly on the sources? Frankly, I don't think I'd like to get used to typing foo / 2 or 1 + 2, so if this kind of thing can happen in a pre-commit hook, that'd make my life easier...

even old basic tools could fix this, autopep8, or I have flymake set up on my emacs to check on the codestyle (and it should pick up the setting from the repo)

msdemlei · 2024-04-15T11:29:36Z

So... I think modulo review points and perhaps a few style matters, this could be merged, with a caveat that the API is probably not stable yet. It does something interesting, and although data providers still can do quite a bit to make this work better, it's a lot more likely they will do that if there is merged code in pyVO.

From my perspective, the main problem is testing. Since there is so much network activity going on here, my plan has been to ship frozen network answers as part of the source (see utils.testing for the general scheme).

It turns out that that is nowhere near as simple as I had hoped. The test failures with old versions of python are almost certainly due to some subtle difference in the serialisation of query parameters; they might be fixable by normalising these a bit more aggressively, but perhaps they are not; I'd have to pull up systems with these old versions to look.

But: Is this the right way to go about this? Perhaps the instrumentation should sit in TAPService, where the parameters are much less dependent on versions than where requests.requests has written its payload? Or should we, especially after the xz disaster, try harder to avoid what looks like binary blobs in the VCS in the first place? But then how do we do the scaffolding for these tests without going insane?

Thoughts and/or ideas?

ManonMarchand · 2024-04-16T13:46:12Z

Hi Markus,
With which python version do the tests pass without access to remote data? Maybe there was a change in the way hashs are calculated that we could identify in python's changelogs?

msdemlei · 2024-04-16T14:14:41Z

On Tue, Apr 16, 2024 at 06:46:34AM -0700, Manon Marchand wrote: With which python version do the tests pass without access to remote data? Maybe there was a change in the way hashs are calculated that we could identify in python changelogs?

I'm rather sure it's not related to hash computations as such; these are fairly stable. In the current checks, things already fail for python 3.9. But good thing you ask, because it made me realise that for python 3.8, things already fail because there is no functools.cache there. At least that I can fix. Perhaps we can learn something from the emerging pattern.

msdemlei · 2024-04-16T14:28:34Z

Ah well... failures up and down. I'll dig a bit deeper tomorrow, but this does look as if we're going to need a giant scaffold. Oh dang.

msdemlei · 2024-04-18T06:59:39Z

Oh yikes. I've tried this on two different Debian relases (so, I'm not reproducing github's environment). The first instance for differing hashes is because astropy includes something like "Produced with astropy.io.votable version 5.2.1" in its VOTables (which get uploaded in the course of the tests). Updating the cached files depending on the underlying astropy release obviously is not an option.

I think the testing plan with intercepting and caching https requests and responses and recovering the responses based on hashes of the requests is fundamentally broken for something as complex as the global discovery.

I suppose we'll have to mock at a higher level. If someone has good ideas how to go about this: I'd be most grateful.

bsipocz · 2024-04-18T16:25:04Z

I suppose we'll have to mock at a higher level. If someone has good ideas how to go about this: I'd be most grateful.

I don't have any suggestions without diving into this properly, but will try to find time to do so next week.

service. Also, improving discovery logging a bit Also, adding an admonition to not use utils.testing outside of pyVO. (which attemtps to address astropy#470 (review)) Also, removing stale test inputs for global discovery.

This will keep records that don't have STC coverage declared. We need this for global dataset discovery.

Also, new method set_services on _ImageDiscoverer to pass in a custom resource list (for now, for testability) Also, exposing RegistryResource and RegistryResults in the registry API; they're really central in using the stuff, and we'll need them in type annotations a lot.

service. Also, improving discovery logging a bit Also, adding an admonition to not use utils.testing outside of pyVO. (which attemtps to address astropy#470 (review)) Also, removing stale test inputs for global discovery.

Also, some changes to make registry requests more predictable (i.e., sorting the extra tables to have a deterministic order).

This is a temporary measure until we have moved the VO to a saner obscore registration (i.e., obscore_new). When the Registry shows everyone has moved, obscore_new should become obscore, and the current obscore probably just dropped.

In global image discovery, using this to elide TAP services. Updating test data.

access URLs.

Basically, you can now clear the queue of services to query. Once the currently running query finishes, that will also end query_services. I've thought about canceling the currently running query, too, but that seems to be hard.

Also, fixing a crash in relationship discovery when there are no records to discover relationships for.

This is relevant when, for one reason or other, multiple capabilities of a type are present in a registry record. For our purposes here (sia, ssap, tap...), it is highly unlikely that these different capabilities correspond to different data (that's as different from VizieR SCS, where different capabilities correspond to different tables; but that's a pain and we don't want anything like that here).

SIA1 has all the various bandpass definitions optional (whether we like it or not). This commit makes our result construction resilient against services that don't provide it.

Also, we're sneaking in a change in SIA1 bandpass behaviour: We now accept the standard strings for bandpass units and assume m if the bandpass declaration is missing (that's against the standard, but in line with acutal practice).

This is mainly about using 1=contains rather than distance in the obscore queries; we also had to update the RegTAP capability.xml in the registry test to bring it in line with what the discover tests do. A better isolation of these test suites would be desirable; but then we don't want forsake capabilities caching altogether either... Also, sneaking in a few formatting problems.

It turns out that hashing actual network requests is too fragile -- due to version strings in headers and payloads, it is probably impossible to pull that off across python versions. I'm also dropping utils.testing; I think it might be useful to retain this in pyvo's history -- perhaps it will be useful as a basis for something comparable. Meanwhile, I have made the discovery tests remote_data. More unit tests ought to mock on a higher level, presumably by monkeypatching registry.search and the dal classes involved (yikes!).

When resetting the services and hence emptying the queue, the iterator currently active would keep returning the queriables that were in there before. That's no longer the case now. Also, adding a few tests to improve coverage.

msdemlei · 2024-05-15T06:54:10Z

After some deliberation I have decided that doing local tests properly for many of the features is really hard (and would hide a lot of interesting bugs). Hence, I have, with a solid helping of regret, made the more interesting tests remote. They're still relatively quick.

So... Volunteers for reviewing this? It'd be lovely to have it merged into main for improved visibility, even though there will be quite a bit of refactoring once we also do global spectrum discovery.

The codecov analysis is way off (I suspect they don't do remote tests). In my local pytest --cov, I only have 14 uncovered lines, and most of these are odd-ish (well, the obscore vs. tap-obscore thing should eventually be covered, but that's not easy and I hope this code will be obsolete rather soon anyway).

The doctest failures... I think that's a separate matter independent of this PR. We really should think about dropping a few of them or at least figure out why the doctests take so long in the first place.

Documentation building has some fairly odd messages, some of which may be introduced with the PR. I'd appreciate if someone with more automodapi-fu than me could have a look at that and figure out how to silence/fix these diagnostics.

bsipocz · 2024-05-16T03:58:12Z

So... Volunteers for reviewing this? It'd be lovely to have it merged into main for improved visibility, even though there will be quite a bit of refactoring once we also do global spectrum discovery.

I know that there are some disagreements about this approach, maybe we should clear that up first next week, then I can put this on my list to review.

The codecov analysis is way off (I suspect they don't do remote tests). In my local pytest --cov, I only have 14 uncovered lines, and most of these are odd-ish (well, the obscore vs. tap-obscore thing should eventually be covered, but that's not easy and I hope this code will be obsolete rather soon anyway).

I'll check this once we do the code review, but recently there were some codecov issues elsewhere, too, so it may very well be a red herring.

The doctest failures... I think that's a separate matter independent of this PR. We really should think about dropping a few of them or at least figure out why the doctests take so long in the first place.

separate, I'm opening the fix for it.

Documentation building has some fairly odd messages, some of which may be introduced with the PR. I'd appreciate if someone with more automodapi-fu than me could have a look at that and figure out how to silence/fix these diagnostics.

I volunteer for fixing these, once we go ahead with the code review.

bsipocz mentioned this pull request Aug 16, 2023

REFACTOR: revisit registry examples to explore other ways for servicetype aliases NASA-NAVO/navo-workshop#130

Open

bsipocz reviewed Oct 24, 2023

View reviewed changes

docs/utils/testing.rst Outdated Show resolved Hide resolved

msdemlei added a commit to msdemlei/pyvo that referenced this pull request Oct 25, 2023

Adding an admonition to not use utils.testing outside of pyVO.

4735346

This attempts to address astropy#470 (review)

msdemlei mentioned this pull request Nov 3, 2023

New Concept: Data Discovery astrothesaurus/UAT#365

Open

msdemlei force-pushed the add-global-discovery branch from 4735346 to 159aa5c Compare February 5, 2024 13:58

bsipocz added enhancement component: registry labels Feb 8, 2024

msdemlei changed the title ~~[PRE-DRAFT] Add global discovery~~ [DRAFT] Add global discovery Feb 26, 2024

msdemlei force-pushed the add-global-discovery branch 2 times, most recently from 2142430 to e6ca7bc Compare February 26, 2024 11:57

msdemlei force-pushed the add-global-discovery branch from e6ca7bc to bba06d9 Compare February 26, 2024 12:09

bsipocz mentioned this pull request Mar 6, 2024

MAINT: do fewer style checks #531

Merged

msdemlei changed the title ~~[DRAFT] Add global discovery~~ Add global discovery Apr 15, 2024

msdemlei force-pushed the add-global-discovery branch 2 times, most recently from 5f89f41 to d143384 Compare May 6, 2024 14:14

msdemlei force-pushed the add-global-discovery branch from d143384 to a2f7d8a Compare May 7, 2024 11:02

Adding inclusive keyword arguments to the STC constraints.

d02d9e2

This will keep records that don't have STC coverage declared. We need this for global dataset discovery.

msdemlei added 25 commits May 15, 2024 08:14

Updating global discovery test data

c3bd920

Also, some changes to make registry requests more predictable (i.e., sorting the extra tables to have a deterministic order).

Minor fix in SIA1 branch of global image discovery

403c663

New datamodel obscore-new to locate obscore tables.

b91efce

In global image discovery, using this to elide TAP services. Updating test data.

Global discovery now elides duplicate images found based on their

10cac07

access URLs.

Adding timeouts to global discovery

1a20017

Including utils.testing docs in the main doc

25a4d5d

Adding some documentation on global dataset discovery.

7066cc3

You can now pass a result of registry.search to discover.image_globally.

322914e

global discovery: the time constraint can now be an interval, too.

ccf03d0

Adding a method for service stats to the discoverer

e6a711c

Adding basic cancelability to global discovery.

a9188a5

Basically, you can now clear the queue of services to query. Once the currently running query finishes, that will also end query_services. I've thought about canceling the currently running query, too, but that seems to be hard.

Global discovery records now record the service of origin.

c30fbe2

Global discovery: Downgrading obscore query to ADQL 2.0

811cb0f

Also, fixing a crash in relationship discovery when there are no records to discover relationships for.

SIA1: warding against NULL-s in the bandpass definition

1552ffd

SIA1 has all the various bandpass definitions optional (whether we like it or not). This commit makes our result construction resilient against services that don't provide it.

Whitespace edits to by-and-large placate flake8

7778f73

functools.cache is 3.9+; replace it

d60b905

global discovery: Making query cancelling work.

8fcdde6

When resetting the services and hence emptying the queue, the iterator currently active would keep returning the queriables that were in there before. That's no longer the case now. Also, adding a few tests to improve coverage.

global discovery: watchers now receive a discoverer instance, too.

b37ed04

msdemlei force-pushed the add-global-discovery branch from 1ced6d5 to d2769e4 Compare May 15, 2024 06:15

global discovery: more test coverage.

5acf4b0

msdemlei force-pushed the add-global-discovery branch from d2769e4 to 5acf4b0 Compare May 15, 2024 06:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add global discovery #470

Add global discovery #470

msdemlei commented Aug 7, 2023 •

edited

codecov bot commented Aug 7, 2023 •

edited

msdemlei commented Feb 26, 2024

bsipocz commented Mar 6, 2024

msdemlei commented Apr 15, 2024

ManonMarchand commented Apr 16, 2024 •

edited

msdemlei commented Apr 16, 2024 via email

msdemlei commented Apr 16, 2024

msdemlei commented Apr 18, 2024

bsipocz commented Apr 18, 2024

msdemlei commented May 15, 2024

bsipocz commented May 16, 2024

Add global discovery #470

Are you sure you want to change the base?

Add global discovery #470

Conversation

msdemlei commented Aug 7, 2023 • edited

codecov bot commented Aug 7, 2023 • edited

Codecov Report

msdemlei commented Feb 26, 2024

bsipocz commented Mar 6, 2024

msdemlei commented Apr 15, 2024

ManonMarchand commented Apr 16, 2024 • edited

msdemlei commented Apr 16, 2024 via email

msdemlei commented Apr 16, 2024

msdemlei commented Apr 18, 2024

bsipocz commented Apr 18, 2024

msdemlei commented May 15, 2024

bsipocz commented May 16, 2024

msdemlei commented Aug 7, 2023 •

edited

codecov bot commented Aug 7, 2023 •

edited

ManonMarchand commented Apr 16, 2024 •

edited