Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: genfromtxt now ignores names=('column1','column2') when unpack=True #20855

Open
seberg opened this issue Jan 19, 2022 · 6 comments
Open
Labels
06 - Regression 57 - Close? Issues which may be closable unless discussion continued

Comments

@seberg
Copy link
Member

seberg commented Jan 19, 2022

This change means that genfromtxt now ignores names=('column1','column2') when unpack=True is passed. Instead of giving a named/rec array as before, a list of arrays with no names is given. Is this the correct behavior? It means that passing the names explicitly into genfromtxt does nothing and the output is not an array of arrays/columns, which was very useful. Lists can't be indexed the same way. Can I be guaranteed that the returned list is in the right order so I can manually assign the names? Should I open an issue about this?

The example I am thinking of is, say we have a simple csv or space delimited document with 4 columns:

val1 val2 valignore1 valignore2
val3 val4 valignore3 valignore4
val5 val6 valignore5 valignore6

I read this in using np.genfromtxt(data.dat, unpack=True, names=('column1', 'column2'), usecols=(0,1)), the result is a list and the names are ignored. Previously, the names would be attached and the result would be a two column array that can be indexed with the given names. The list obviously can't be.

Now we get [array(val1, val3, val5), array(val1, val3, val5)], previously it would be an array of arrays with names (used to be called recarray I think but maybe that name has changed).

Originally posted by @emirkmo in #16650 (comment)

@seberg seberg added this to the 1.22.2 release milestone Jan 19, 2022
@emirkmo
Copy link

emirkmo commented Jan 19, 2022

Background and identification

To be clear, setting unpack=False gives the correct output as a structured array, it's just opposite from what it did before when unpack is True.

Looking at #16650 and #4341 this seems intended to copy loadtxt. The reason it's a bit unfortunate is that when dynamically creating and reading files, one would hope that the arrays keep their dtype name field.

What I would expect is that the list of arrays should have the dtype include the name, so that they can be indexed by name (at a minimum).
The change that leads to this behavior is from:

https://github.com/AndrewEckart/numpy/blob/6b81583634c29ed9dd93d2bd654ae27862a68187/numpy/lib/npyio.py,

starting at line 2252:

    if unpack:
        if names is None:
            return output.T
        elif len(names) == 1:
            # squeeze single-name dtypes too
            return output[names[0]]
        else:
            # For structured arrays with multiple fields,
            # return an array for each field.
            return [output[field] for field in names]
    return output

Previously the last line would return a structured array instead of a list of arrays without names set. However, setting unpack to false does correct this and it does match np.loadtxt. It's just that, on top of being breaking and doing the opposite of previous behavior (as noted in the PR), it doesn't preserve the names.

Proposed solution

However, it is possible to return 1D arrays whose dtypes are named and so preserve the names (pseudocode follows)

[output[field].view(dtype=<correct_dtype_with_name>) for field in names]

where <correct_dtype_with_name> needs to include the given field name and otherwise preserve the dtype. Probably, there's a better way of doing this actually, since we already have this info in output (it's a structured array), we simply want to extract a 1D column without losing the info of the name.

Comments

Now I don't quite understand why it's returned as a list instead of a structured array, as I can't imagine the benefits of a scenario where numpy returns lists instead of efficient arrays. As far as I know, whether the output is a list of arrays or an array of arrays (2+D array), the resulting uses would behave the same. Parity with loadfromtxt doesn't seem to be a good enough reason to break the usual numpy paradigm of operations generally returning numpy arrays unless you're getting a single scalar output. However, even if parity is to be kept, I would expect the names to stick around so that after the unpacking, , the array can be repacked using the names.

It's possible to fix the bug with dtype=None while also still returning a structured array, is it not? Perhaps I am wrong and the way I am thinking about it in terms of rows and columns is the problem, as there is a transpose step.

@seberg
Copy link
Member Author

seberg commented Jan 20, 2022

Hmm, looking closer I doubt we should reconsider this change. We can't preserve names on dtypes with unpacking (We can, but then the array returned becomes annoying to work with, since it is not a "numeric" array, but still structured one – and it would be a change in behaviour).
dtypes simply don't have names unless they are "structured dtypes" with multiple columns ("multiple" can be a single one, but such a dtype does not work the same). What would make sense to me if we returned a dict or similar. But that is a too big change and would require some new API like unpack=dict or something :).

unpack=True has a clear meaning that it returns a list of arrays that can be directly unpacked into values, such as:

col1, col2, col2 = np.genfromtxt(filename)

and you can work with col1 without first indexing with the name. The branch that changed broke that contract, while the default unpack=False always did what you want.

@seberg
Copy link
Member Author

seberg commented Jan 20, 2022

That said, just because right now I think this is a valid bug fix, that doesn't mean that we can't pull it and/or do a transition via FutureWarning if this is causing problems downstream.

@emirkmo
Copy link

emirkmo commented Jan 27, 2022

The only counter I have is that the behavior of unpack=True would not change UNLESS you explicitly gave names (or maybe they were detected in the file), which you expect to be associated with a structured dtype. And if you are taking the effort of explicitly giving names, you want those names to be preserved. To be honest, I don't understand why the output arrays cannot be structured ones if names have been given, as that allows easily combining them into a single structured array in the future with different names (multiple structured dtypes). I haven't found major differences working with them over plain numpy arrays (which don't have to be numeric anyway), but I am sure they exist.

The behavior sort of mimics pandas Series vs. DataFrame, and it would be odd if Pandas data frame columns suddenly became lists when unpacking it instead of Series, just like it's odd that a structured array loses its structured dtype name when you unpack it. It would make more sense to ensure that structured arrays behave properly as 1D arrays instead of avoiding them when the user explicitly tries to create them. Is there a good reason to avoid them that we could start some issues on and start fixing? (You said they are annoying to work with).

Finally, the previous fix to genfromtxt was already quite breaking compared to previous behavior, and wasn't complained about, so maybe that's not such a huge worry. (Seems like it's not majorly relied upon in either way).

I do agree that a change like making it a dict does make sense, but would be very disruptive, so would need futurewarnings and more discussion. I'm not in favor of it to be honest. I just want the structured dtype names to not be lost whether I am reading writing from files or passing into numpy functions. Currently, this is the one place where the name is just shed without even a warning. Perhaps, at the very least, we can raise a warning when genfromtxt is used with unpack=True and names is not None, saying that the resulting output will not be a structured array.

@seberg
Copy link
Member Author

seberg commented Feb 1, 2022

Marking as close and will do so soon unless someone else chimes in, I think the situation is the following:

  • The old behaviour ignored the unpack=True request by the user.
  • unpack=True might be slightly awkward in the situation (as header is auto-detected), but I am unconvinced that it is helpful to ignore it sometimes. And there is no alternative.

I agree that this change should maybe have went through a FutureWarning, but at this point I am not sure anyone would be better off with such a transition? Anyone who runs into the issue is expected to run into some kind of error that is fixable by unpack=True.

If you/anyone would like to argue for reverting the change, we can do that however.

@seberg seberg added the 57 - Close? Issues which may be closable unless discussion continued label Feb 1, 2022
@seberg seberg removed this from the 1.22.3 release milestone Feb 1, 2022
@emirkmo
Copy link

emirkmo commented Feb 3, 2022

Could we create an issue to track adding a warning to genfromtxt reading when names is not None but unpack=True? (About header names being ignored). Or can I create an issue to ask for the arrays inside the list to be structured 1D ones that respect the names (dtype with a name)? Either is preferable to the silent dropping.

If either is not wanted, however, I am fine with closing. Do not want a revert as that is more broken.

Finally, At the least, a documentation update is warranted about the silent dropping, if nothing else is to be done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
06 - Regression 57 - Close? Issues which may be closable unless discussion continued
Projects
None yet
Development

No branches or pull requests

3 participants