Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/BUG: only output user-requested columns in votable.parse #15959

Merged

Conversation

neutrinoceros
Copy link
Contributor

@neutrinoceros neutrinoceros commented Jan 27, 2024

Description

Fixes #14943
I took me a couple hours to pass my test and all other tests locally, so I'm opening the PR while iron is hot, but I expect I'll probably want to refactor a bit before I open to review.

Since I needed to adjust existing tests which were actively checking for previous behaviour, I'm marking this as an API change, but in my opinion it's borderline a bug fix. In other words, I'm aiming astropy 6.1.0, not 6.0.1.

  • By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

Copy link

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • Is a milestone set? Milestone must be set but we cannot check for it on Actions; do not let the green checkmark fool you.
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

Copy link

👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

@neutrinoceros neutrinoceros changed the title BUG: only parse user-requested columns in votable.parse BUG: only output user-requested columns in votable.parse Jan 27, 2024
@neutrinoceros neutrinoceros force-pushed the io/votable/bug/only_parse_requested_columns branch from 7b5722c to b1b0ee1 Compare January 27, 2024 18:22
@neutrinoceros neutrinoceros changed the title BUG: only output user-requested columns in votable.parse API/BUG: only output user-requested columns in votable.parse Jan 27, 2024
@neutrinoceros neutrinoceros force-pushed the io/votable/bug/only_parse_requested_columns branch from b1b0ee1 to b1e192c Compare January 28, 2024 09:56
@neutrinoceros
Copy link
Contributor Author

I'm not convinced that failures are related. In anywise I don't reproduce them locally. Might have something to do with pytest 8.0.0...

@neutrinoceros neutrinoceros force-pushed the io/votable/bug/only_parse_requested_columns branch from b1e192c to 370c2e1 Compare January 29, 2024 10:06
Copy link
Member

@eerovaher eerovaher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not doing a thorough review, I'm just commenting on a few things that stood out to me.

# Deal with a subset of the columns, if requested.
if not columns:
colnumbers = list(range(len(fields)))
colnumbers = list(range(len(all_fields)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the conversion from range to list necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because range() produces a 1-time consumable iterator, but this variable is definitely used more than once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no limit to how many times a range instance can be iterated over:

>>> one_to_three = range(1, 4)
>>> for i in one_to_three:
...     for j in one_to_three:
...             print(i, j)
... 
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I didn't know that. Was it always true ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know this was true even for xrange in Python 2 (but I can't claim to know Python 2 very well).

I am willing to guess that quite a few people might believe a range to be single-use because they know that range does lazy evaluation and from that they conclude (mistakenly) that range must be a generator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would make sense, I probably fell in that trap too.

astropy/io/votable/tree.py Outdated Show resolved Hide resolved
Comment on lines 1 to 3
Only output user-requested columns in ``votable.parse``.
Previously, unselected columns would just be masked (and unallocated).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recommended way to read VOTables is through the unified I/O interface, so I'd expect the change log entry to describe how that is affected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rephrased

astropy/io/votable/tests/test_vo.py Outdated Show resolved Hide resolved
@pllim pllim added this to the v6.1.0 milestone Jan 29, 2024
@neutrinoceros
Copy link
Contributor Author

@eerovaher This is far from redu for review (I actually left it in the middle of a refactor). I'll answer your questions tomorrow when I'm back at my desk !

@neutrinoceros neutrinoceros force-pushed the io/votable/bug/only_parse_requested_columns branch 4 times, most recently from 63fd1ee to cab61cf Compare February 1, 2024 09:35
@neutrinoceros
Copy link
Contributor Author

It took me a while to complete, but I'm almost done with refactoring my initial wonky patch. After I'm sure that the refactor still survives CI, I'll squash my history. Next thing on the list is to make sure that TableElement._parse_binary also works correctly when colnumbers is not None.

@neutrinoceros neutrinoceros force-pushed the io/votable/bug/only_parse_requested_columns branch 5 times, most recently from 19cddb9 to 5a26dbd Compare February 1, 2024 15:18
@neutrinoceros
Copy link
Contributor Author

Okay I think this should now be ready for review.

@neutrinoceros neutrinoceros marked this pull request as ready for review February 1, 2024 15:19
astropy/io/votable/tests/test_vo.py Outdated Show resolved Hide resolved
astropy/io/votable/tree.py Show resolved Hide resolved
@neutrinoceros neutrinoceros force-pushed the io/votable/bug/only_parse_requested_columns branch from 5a26dbd to 3bbb6d2 Compare February 1, 2024 18:00
@neutrinoceros
Copy link
Contributor Author

@eerovaher, do you have any other suggestions or requests ? also pinging @pllim as a package expert !

@pllim
Copy link
Member

pllim commented Feb 22, 2024

This would affect PyVO world the most, so I am asking a review from Tom D and maybe ping @lmichel too as a stakeholder.

@lmichel
Copy link
Contributor

lmichel commented Feb 22, 2024

The MIVOT annotation readout in Astropy does not care about the table columns at all. It is a simple extension to the resource parsing that extracts a piece of XML.
The connection between the MIVOT annotations and the data rows (table readout) is done in PyVO (PR #497).
Currently our PyVO implementation does not do any column filtering (although it should).
This fix you are proposing shouldn't affect my PyVO code, and if it does, I'll fix it as the PR is not merged.

# To remedy this issue, we sync back the public property upon access.
for field in self._fields:
if field not in self._all_fields:
self._all_fields.append(field)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, one more thought... Does the order matter, @tomdonaldson ? Just appending might not preserve original order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, order matters.
I'll admit that it's not evident that this bit works properly because it relies on other parts of the module to be correct too. The intention is that all_fields follows the order of on-disk fields and then the insertion order of manually added fields. In principle, self.fields can only contain a similarly ordered subset of self.all_fields, but a bug in how self.fields is constructed would cascade into an incorrect ordering in self.all_fields too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am starting to ponder if we should just bite the bullet and fix this properly as Tom pointed out in #15959 (comment) ... 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we agree to explore it in a follow up PR so this patch doesn't get too bloated with deprecation logic ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am worried that the non-deterministic order would come back and bite us as a even more subtle bug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if @tomdonaldson can tell me that my worry is unfounded, then I am okay with merging this and have the other idea as follow-up, though if we do that anyway, then this change would be mostly reverted, so is it worth merging and then reverting? 🤪

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it is deterministic ? To be clear, I'm proposing to get the follow up deprecation patch going as soon as this one gets merged, not in an undetermined future, so there won't be much opportunities for this to break or even to stay brittle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't see your second message

though if we do that anyway, then this change would be mostly reverted, so is it worth merging and then reverting? 🤪

It can't be reverted until table.fields.append/extend is disallowed, which requires a deprecation period

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am looking at this and finding it non-trivial to decide how concerned to be. It's pretty clear that someone just parsing VOTables, even with a subset of columns selected via this feature, should not have a problem with ordering. This is the primary (and only) use case for almost all users.

The potential problems come for users building tables with the API, where they may have figured out a variety of ways to get things to work and we don't want to break those. It's in these scenarios that trying to keep all_fields and fields in sync. Ironically it's in those cases that they are extremely unlikely to be selecting a subset of columns somehow, but that doesn't protect us from getting these columns out of sync.

That said, in that latter use case it's hard to imagine how one would get both out of sync and out of order. When constructing a table, one would create the list of fields, then add the data, not bounce around creating some data, then inserting fields, etc.

So I'm not really concerned about the ordering. My main concern is about the complexity of keeping these things in sync and that that could be causing us to miss a real concern. FWIW I'm adding another comment to a potential syncing issue in create_arrays(), but I have a similar low concern level due to how this is likely being used.

nrows=0,
config=None,
*,
colnumbers=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this count as API change? If so, should mention this new keyword in the change log too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. It also made me realize that the docstring is slightly wrong. Thanks !

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done !

@pllim
Copy link
Member

pllim commented Apr 18, 2024

Also need a rebase to pick up RTD fix. Thanks!

@neutrinoceros neutrinoceros force-pushed the io/votable/bug/only_parse_requested_columns branch from 84a5c05 to 8865d51 Compare April 18, 2024 14:38
@neutrinoceros
Copy link
Contributor Author

rebased !

fields = self.fields
for i, field in enumerate(self.all_fields):
if i in colnumbers and field not in self.fields:
self.fields.append(field)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this method is called repeatedly with different colnumbers, self.fields could end up out of order and with fields not in colnumbers. Also, for future protection if we lock down public access to self.fields, the append should probably be on self._fields (e.g., if later we do something like replace .append() with .add_field()).

If we do anything to fix this now, it should probably be to totally rebuild self.fields from colnumbers, but I'm uncomfortable with these side effects. Ideally creating arrays shouldn't modify any attributes of the table but also ideally the encapsulation (public vs. private) aspects of constructing a table could be cleaned up quite a bit, so I'm hesitant to do anything radical in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. Reconstructing self._fields entirely seems safer and isn't really more complicated. How about this ?

diff --git a/astropy/io/votable/tree.py b/astropy/io/votable/tree.py
index 08dade28db..5df03cdf0e 100644
--- a/astropy/io/votable/tree.py
+++ b/astropy/io/votable/tree.py
@@ -2607,9 +2607,12 @@ class TableElement(
         if colnumbers is None:
             colnumbers = list(range(len(self.all_fields)))
 
-        for i, field in enumerate(self.all_fields):
-            if i in colnumbers and field not in self.fields:
-                self.fields.append(field)
+        selected_fields = HomogeneousList(
+            Field,
+            values=[f for i, f in enumerate(self.all_fields) if i in colnumbers],
+        )
+        if self._fields != selected_fields:
+            self._fields = selected_fields
 
         fields = self.all_fields

We could also add a warning if reconstruction is triggered.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That implementation looks reasonable to me. It's interesting (and fine) that references to all_fields ensure that all_fields is caught up with anything in fields just prior to syncing in the other direction. (I'm not a huge fan of properties since they make such side-effects non-obvious, but that's my own hangup.)

I can see the motivation for a warning in that it would be an usual and possibly unexpected circumstance, but I also wonder if we could come up with succinct wording that a user could understand and act on. Unless the reconstruction happens due to a logic error we haven't noticed, the user would be doing some fairly unusual things to trigger this (as we'll see if you add a test case for this), so I think a warning is probably not worth the effort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed it as is. Thank you so much for your thoughtful review !

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing the work and being so responsive to questions and comments! Could you go ahead and create an issue for the proposed follow-on deprecation patch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's getting late, but this will be top of my task list for tomorrow !

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries. I opened #16314 for you. Feel free to add details there tomorrow as you see fit. Thanks!

Copy link
Contributor

@tomdonaldson tomdonaldson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the last flurry of activity, I think this is ready again. Thanks everyone for being so thorough!

@pllim pllim added the API change PRs and issues that change an existing API, possibly requiring a deprecation period label Apr 18, 2024
@pllim pllim merged commit 4ab38bc into astropy:main Apr 18, 2024
32 checks passed
@pllim
Copy link
Member

pllim commented Apr 18, 2024

Thanks, all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API change PRs and issues that change an existing API, possibly requiring a deprecation period io.votable
Projects
None yet
Development

Successfully merging this pull request may close these issues.

columns parameter not working as expected in parse_single_table/parse
6 participants