Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement nsplit argument in Array.split_axes with default="auto" (raise if variable) #1078

Open
gdementen opened this issue Sep 15, 2023 · 5 comments

Comments

@gdementen
Copy link
Contributor

gdementen commented Sep 15, 2023

When the number of separators is not the same in all labels, labels which have more separators have their last part dropped silently and any label which becomes a duplicate because of this is dropped silently too (it seems like the last duplicate wins).

>>> arr = la.ndtest("a_b=a0_b0,a0_b1_1,a0_b1_2")
>>> arr
a_b  a0_b0  a0_b1_1  a0_b1_2
         0        1        2
>>> arr.split_axes()
a\b   b0   b1
 a0  0.0  2.0

Also, when there is a label with fewer separators than expected, it raises a very weird error.

>>> arr = la.ndtest("a_b=a0_b0,a0b1,a0_b2")
>>> arr
a_b  a0_b0  a0b1  a0_b2
         0     1      2
>>> arr.split_axes()
ValueError: Value {a_b} axis is not present in target subset {a*}. A value can only have the same axes or fewer axes than the subset being targeted

The two issues are because the zip(*sequences) idiom that we use happily ignores sequences longer than the shortest of the sequences.

The code of Axis.split is roughly like this:

>>> split_labels = np.char.split(arr.a_b.labels, '_')
>>> indexing_labels = zip(*split_labels)
>>> list(indexing_labels)
[('a0', 'a0', 'a0'), ('b0', 'b1', 'b1')]
>>> split_axes = [Axis(unique_list(ax_labels), name) for ax_labels, name in zip(indexing_labels, names)]

Our options are :

  1. raise an error if fewer or too many separators detected. Safest (zero chance to have bad labels pass unnoticed) but not practical (it leaves the problem to the user)
  2. use maxsplit=len(names) - 1 in np.char.split. This would solve the extra separators issue but not the fewer separators issues. Also it could mask a problem if there are extra seps in some labels that the user does not know about. I can live with the later problem though.
  3. or configurable maxsplit, defaulting to the above. More flexible, but same problems as option 1.
  4. or force splitting using the maximum number of separators (i.e. complete the sequences with as many '' as necessary). In the presence of too many labels, it will probably not be the result the user wants in most cases.
  5. force splitting using len(names) - 1 (i.e use maxsplit but then complete the too short sequences)
  6. configurable force splitting (e.g. nsplit argument), using len(names) - 1 as default.
  7. configurable force splitting , but with a default which raises if not all sequence have the same length. I hesitate with this option. The selected option
@gdementen gdementen added bug question difficulty: low priority: high priority: BLOCKER cannot make a release without these issues resolved labels Sep 15, 2023
@gdementen gdementen added this to the 0.35 milestone Sep 15, 2023
@gdementen
Copy link
Contributor Author

@alixdamman : what do you think?

@alixdamman
Copy link
Collaborator

  1. raise an error if fewer or too many separators detected. Safest (zero chance to have bad labels pass unnoticed) but not practical (it leaves the problem to the user)
  2. use maxsplit=len(names) - 1 in np.char.split. This would solve the extra separators issue but not the fewer separators issues. Also it could mask a problem if there are extra seps in some labels that the user does not know about. I can live with the later problem though.
  3. or configurable maxsplit, defaulting to the above. More flexible, but same problems as option 1.
  4. or force splitting using the maximum number of separators (i.e. complete the sequences with as many '' as necessary). In the presence of too many labels, it will probably not be the result the user wants in most cases.
  5. force splitting using len(names) - 1 (i.e use maxsplit but then complete the too short sequences)
  6. configurable force splitting (e.g. nsplit argument), using len(names) - 1 as default. My preferred solution at this point.
  7. configurable force splitting , but with a default which raises if not all sequence have the same length. I hesitate with this option.

Honestly, without an example of the resulting axes for each option, it is difficult to me to evaluate what would be the best solution. I am usually in favor of somewhat "forcing" the users to clean their code (i.e. check the labels, dimensions of arrays, ...), so, by default, my answer is option 1. But if you think that there are potentially many cases in which option 1 could not be practical, I am OK to choose the option 6.

@gdementen
Copy link
Contributor Author

In that case, I'd go for option 7, which is basically option 1 with an easy way to deal with the exception when it happens. I will try to implement that.

@gdementen gdementen modified the milestones: 0.35, 0.34.2 Oct 4, 2023
@gdementen gdementen changed the title Array.split_axes silently drops parts of labels implement nsplit argument in Array.split_axes with default="auto" (raise if variable) Oct 5, 2023
@gdementen gdementen added priority: high and removed difficulty: low priority: BLOCKER cannot make a release without these issues resolved labels Oct 18, 2023
@gdementen gdementen removed this from the 0.34.2 milestone Oct 18, 2023
@gdementen
Copy link
Contributor Author

Implementing nsplit is not that urgent. Avoiding the silent failure is the urgent part so I extracted #1089 out of this and scaled down the priority.

@gdementen
Copy link
Contributor Author

Besides, to be complete, we should implement a way to split right (rsplit boolean argument)?

@gdementen gdementen added enhancement and removed bug labels Oct 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants