-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API/BUG: only output user-requested columns in votable.parse #15959
API/BUG: only output user-requested columns in votable.parse #15959
Conversation
Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.
|
👋 Thank you for your draft pull request! Do you know that you can use |
7b5722c
to
b1b0ee1
Compare
b1b0ee1
to
b1e192c
Compare
I'm not convinced that failures are related. In anywise I don't reproduce them locally. Might have something to do with pytest 8.0.0... |
b1e192c
to
370c2e1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not doing a thorough review, I'm just commenting on a few things that stood out to me.
astropy/io/votable/tree.py
Outdated
# Deal with a subset of the columns, if requested. | ||
if not columns: | ||
colnumbers = list(range(len(fields))) | ||
colnumbers = list(range(len(all_fields))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the conversion from range
to list
necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because range()
produces a 1-time consumable iterator, but this variable is definitely used more than once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no limit to how many times a range
instance can be iterated over:
>>> one_to_three = range(1, 4)
>>> for i in one_to_three:
... for j in one_to_three:
... print(i, j)
...
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I didn't know that. Was it always true ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know this was true even for xrange
in Python 2 (but I can't claim to know Python 2 very well).
I am willing to guess that quite a few people might believe a range
to be single-use because they know that range
does lazy evaluation and from that they conclude (mistakenly) that range
must be a generator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that would make sense, I probably fell in that trap too.
Only output user-requested columns in ``votable.parse``. | ||
Previously, unselected columns would just be masked (and unallocated). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The recommended way to read VOTables is through the unified I/O interface, so I'd expect the change log entry to describe how that is affected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rephrased
@eerovaher This is far from redu for review (I actually left it in the middle of a refactor). I'll answer your questions tomorrow when I'm back at my desk ! |
63fd1ee
to
cab61cf
Compare
It took me a while to complete, but I'm almost done with refactoring my initial wonky patch. After I'm sure that the refactor still survives CI, I'll squash my history. Next thing on the list is to make sure that |
19cddb9
to
5a26dbd
Compare
Okay I think this should now be ready for review. |
5a26dbd
to
3bbb6d2
Compare
@eerovaher, do you have any other suggestions or requests ? also pinging @pllim as a package expert ! |
This would affect PyVO world the most, so I am asking a review from Tom D and maybe ping @lmichel too as a stakeholder. |
The MIVOT annotation readout in Astropy does not care about the table columns at all. It is a simple extension to the resource parsing that extracts a piece of XML. |
# To remedy this issue, we sync back the public property upon access. | ||
for field in self._fields: | ||
if field not in self._all_fields: | ||
self._all_fields.append(field) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, one more thought... Does the order matter, @tomdonaldson ? Just appending might not preserve original order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, order matters.
I'll admit that it's not evident that this bit works properly because it relies on other parts of the module to be correct too. The intention is that all_fields
follows the order of on-disk fields and then the insertion order of manually added fields. In principle, self.fields
can only contain a similarly ordered subset of self.all_fields
, but a bug in how self.fields
is constructed would cascade into an incorrect ordering in self.all_fields
too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am starting to ponder if we should just bite the bullet and fix this properly as Tom pointed out in #15959 (comment) ... 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we agree to explore it in a follow up PR so this patch doesn't get too bloated with deprecation logic ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am worried that the non-deterministic order would come back and bite us as a even more subtle bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if @tomdonaldson can tell me that my worry is unfounded, then I am okay with merging this and have the other idea as follow-up, though if we do that anyway, then this change would be mostly reverted, so is it worth merging and then reverting? 🤪
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it is deterministic ? To be clear, I'm proposing to get the follow up deprecation patch going as soon as this one gets merged, not in an undetermined future, so there won't be much opportunities for this to break or even to stay brittle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I didn't see your second message
though if we do that anyway, then this change would be mostly reverted, so is it worth merging and then reverting? 🤪
It can't be reverted until table.fields.append/extend
is disallowed, which requires a deprecation period
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am looking at this and finding it non-trivial to decide how concerned to be. It's pretty clear that someone just parsing VOTables, even with a subset of columns selected via this feature, should not have a problem with ordering. This is the primary (and only) use case for almost all users.
The potential problems come for users building tables with the API, where they may have figured out a variety of ways to get things to work and we don't want to break those. It's in these scenarios that trying to keep all_fields
and fields
in sync. Ironically it's in those cases that they are extremely unlikely to be selecting a subset of columns somehow, but that doesn't protect us from getting these columns out of sync.
That said, in that latter use case it's hard to imagine how one would get both out of sync and out of order. When constructing a table, one would create the list of fields, then add the data, not bounce around creating some data, then inserting fields, etc.
So I'm not really concerned about the ordering. My main concern is about the complexity of keeping these things in sync and that that could be causing us to miss a real concern. FWIW I'm adding another comment to a potential syncing issue in create_arrays()
, but I have a similar low concern level due to how this is likely being used.
nrows=0, | ||
config=None, | ||
*, | ||
colnumbers=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this count as API change? If so, should mention this new keyword in the change log too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. It also made me realize that the docstring is slightly wrong. Thanks !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done !
Also need a rebase to pick up RTD fix. Thanks! |
…ific columns (unsupported case)
…ods (use the internal state instead)
Co-authored-by: P. L. Lim <2090236+pllim@users.noreply.github.com>
84a5c05
to
8865d51
Compare
rebased ! |
astropy/io/votable/tree.py
Outdated
fields = self.fields | ||
for i, field in enumerate(self.all_fields): | ||
if i in colnumbers and field not in self.fields: | ||
self.fields.append(field) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this method is called repeatedly with different colnumbers
, self.fields
could end up out of order and with fields not in colnumbers
. Also, for future protection if we lock down public access to self.fields
, the append should probably be on self._fields
(e.g., if later we do something like replace .append()
with .add_field()
).
If we do anything to fix this now, it should probably be to totally rebuild self.fields
from colnumbers
, but I'm uncomfortable with these side effects. Ideally creating arrays shouldn't modify any attributes of the table but also ideally the encapsulation (public vs. private) aspects of constructing a table could be cleaned up quite a bit, so I'm hesitant to do anything radical in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. Reconstructing self._fields
entirely seems safer and isn't really more complicated. How about this ?
diff --git a/astropy/io/votable/tree.py b/astropy/io/votable/tree.py
index 08dade28db..5df03cdf0e 100644
--- a/astropy/io/votable/tree.py
+++ b/astropy/io/votable/tree.py
@@ -2607,9 +2607,12 @@ class TableElement(
if colnumbers is None:
colnumbers = list(range(len(self.all_fields)))
- for i, field in enumerate(self.all_fields):
- if i in colnumbers and field not in self.fields:
- self.fields.append(field)
+ selected_fields = HomogeneousList(
+ Field,
+ values=[f for i, f in enumerate(self.all_fields) if i in colnumbers],
+ )
+ if self._fields != selected_fields:
+ self._fields = selected_fields
fields = self.all_fields
We could also add a warning if reconstruction is triggered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That implementation looks reasonable to me. It's interesting (and fine) that references to all_fields
ensure that all_fields
is caught up with anything in fields
just prior to syncing in the other direction. (I'm not a huge fan of properties since they make such side-effects non-obvious, but that's my own hangup.)
I can see the motivation for a warning in that it would be an usual and possibly unexpected circumstance, but I also wonder if we could come up with succinct wording that a user could understand and act on. Unless the reconstruction happens due to a logic error we haven't noticed, the user would be doing some fairly unusual things to trigger this (as we'll see if you add a test case for this), so I think a warning is probably not worth the effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed it as is. Thank you so much for your thoughtful review !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for doing the work and being so responsive to questions and comments! Could you go ahead and create an issue for the proposed follow-on deprecation patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's getting late, but this will be top of my task list for tomorrow !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries. I opened #16314 for you. Feel free to add details there tomorrow as you see fit. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the last flurry of activity, I think this is ready again. Thanks everyone for being so thorough!
Thanks, all! |
Description
Fixes #14943
I took me a couple hours to pass my test and all other tests locally, so I'm opening the PR while iron is hot, but I expect I'll probably want to refactor a bit before I open to review.
Since I needed to adjust existing tests which were actively checking for previous behaviour, I'm marking this as an API change, but in my opinion it's borderline a bug fix. In other words, I'm aiming astropy 6.1.0, not 6.0.1.