read_table fails with MultiIndex input and delim_whitespace=True #6893

Closed
mcwitt opened this issue Apr 16, 2014 · 18 comments · Fixed by #7591

@mcwitt
Contributor

mcwitt commented Apr 16, 2014

Related #6889

Example:

In [4]: text = """                      A       B       C       D        E
one two three   four
a   b   10.0032 5    -0.5109 -2.3358 -0.4645  0.05076  0.3640
a   q   20      4     0.4473  1.4152  0.2834  1.00661  0.1744
x   q   30      3    -0.6662 -0.5243 -0.3580  0.89145  2.5838"""

In [5]: pd.read_table(StringIO(text), delim_whitespace=True)
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
. . .
CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 9

This (partially) works if delim_whitespace=True is replaced with sep='\s+', engine='python' (although columns A-D are lost):

In [6]: pd.read_table(StringIO(text), sep='\s+', engine='python')
Out[6]: 
                           E
one two three   four        
a   b   10.0032 5     0.3640
    q   20.0000 4     0.1744
x   q   30.0000 3     2.5838

[3 rows x 1 columns]
@jreback jreback added this to the 0.14.0 milestone Apr 16, 2014
@jreback
Contributor

jreback commented Apr 28, 2014

@mcwitt can you address for 0.14?

@mcwitt
Contributor Author

mcwitt commented Apr 28, 2014

Yep, I've made some progress on this. Should have a PR soon...

@mcwitt mcwitt changed the title from "read_table fails with multi-index columns and delim_whitespace=True" to "read_table fails with implicit MultiIndex and delim_whitespace=True" Apr 29, 2014
@mcwitt
Contributor Author

mcwitt commented Apr 29, 2014

OK, I've fixed the bug in PythonParser. The C parser seems to have its own code for this, and I'm not sure yet whether it had been designed to support specifying a MultiIndex in this way.

@mcwitt mcwitt changed the title from "read_table fails with implicit MultiIndex and delim_whitespace=True" to "read_table fails with MultiIndex input and delim_whitespace=True" Apr 29, 2014
@jreback
Contributor

jreback commented Apr 29, 2014

hmm. its only the header detection code which is pretty straightforward (e.g. you have to read the header line-by-line based on the spec header=[0,1] whatever, one odd thing is you may or may not have an extra row for the row names depending on how it was written), then turning that to a MI is all the same for both parsers (and is python code)

@mcwitt
Contributor Author

mcwitt commented Apr 30, 2014

hmm. its only the header detection code

I think this is a separate issue from header detection. The original title of this issue was wrong; this doesn't have to do with MI columns. It has to do with automatically setting the index columns from the 2nd row when the number of fields in the 1st and 2nd rows adds up to the number of fields in the 3rd, i.e.

In [8]: text = """                      A       B       C       D        E
one two three   four
a   b   10.0032 5    -0.5109 -2.3358 -0.4645  0.05076  0.3640
a   q   20      4     0.4473  1.4152  0.2834  1.00661  0.1744
x   q   30      3    -0.6662 -0.5243 -0.3580  0.89145  2.5838"""

In [9]: pd.read_table(StringIO(text), sep='\s+', engine='python')
Out[9]: 
                           A       B       C        D       E
one two three   four                                         
a   b   10.0032 5    -0.5109 -2.3358 -0.4645  0.05076  0.3640
    q   20.0000 4     0.4473  1.4152  0.2834  1.00661  0.1744
x   q   30.0000 3    -0.6662 -0.5243 -0.3580  0.89145  2.5838

[3 rows x 5 columns]

The code for dealing with this in the python parser is in _get_index_name, but the C parser doesn't seem to have anything similar yet.
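
As a rough illustration of the rule described above, the check amounts to comparing field counts on the first three lines. This is a sketch only, not the code in PythonParser._get_index_name; the function name `infer_implicit_index` is made up for this example and it uses plain string splitting rather than pandas:

```python
def infer_implicit_index(lines):
    """Return (column_names, index_names) for whitespace-delimited lines.

    Hypothetical helper: treats the 2nd line as index names when the
    field counts of lines 1 and 2 add up to the field count of line 3.
    """
    rows = [line.split() for line in lines if line.strip()]
    if len(rows) < 3:
        # Not enough lines to apply the rule; report no implicit index.
        return (rows[0] if rows else []), []
    header, second, first_data = rows[0], rows[1], rows[2]
    if len(header) + len(second) == len(first_data):
        return header, second
    return header, []


text = """                      A       B       C       D        E
one two three   four
a   b   10.0032 5    -0.5109 -2.3358 -0.4645  0.05076  0.3640
a   q   20      4     0.4473  1.4152  0.2834  1.00661  0.1744
x   q   30      3    -0.6662 -0.5243 -0.3580  0.89145  2.5838"""

columns, index_names = infer_implicit_index(text.splitlines())
print(columns)      # ['A', 'B', 'C', 'D', 'E']
print(index_names)  # ['one', 'two', 'three', 'four']
```

Here the header has 5 fields, the second row has 4, and the first data row has 9, so the second row is taken as the index names.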

@mcwitt
Contributor Author

mcwitt commented May 3, 2014

Made a PR for the fix for the python parser. The fix for the C parser will be a bit more involved so I think it would be good to do this in a separate PR. Essentially I think we need to duplicate the functionality in PythonParser._get_index_name in CParserWrapper initially, and then merge as much of the code as possible into a common function for both parsers.

@jreback
Contributor

jreback commented May 5, 2014

can you detect the condition in c-parser and raise?

@jreback jreback modified the milestones: 0.14.1, 0.14.0 May 5, 2014
@jreback
Contributor

jreback commented May 5, 2014

moving this to 0.14.1 (though we'll merge your pythonparse fix after release notes).

and if you can detect the condition in the c-parser would be great (and just raise for now)

@jreback jreback modified the milestones: 0.14.0, 0.14.1 May 6, 2014
@jreback
Contributor

jreback commented Jun 5, 2014

@mcwitt any thoughts on fixing the c-parser for this?

@mcwitt
Contributor Author

mcwitt commented Jun 5, 2014

I'm still thinking about this. I think I will try to implement the fix described above, but I will be too busy to look at it for the next two weeks (traveling to a conference). I should be able to get it done for 0.14.1 though.

@jreback
Contributor

jreback commented Jun 5, 2014

gr8!

@mcwitt
Contributor Author

mcwitt commented Jun 26, 2014

Update: I'm close to a fix for this! Hopefully can make a PR tomorrow.

@jreback
Contributor

jreback commented Jul 2, 2014

Reverted #7591 and moving this to 0.15; for the failing tests, see #7623

@jreback jreback reopened this Jul 2, 2014
@jreback jreback added this to the 0.15.0 milestone Jul 2, 2014
@jreback jreback removed this from the 0.14.1 milestone Jul 2, 2014
@jreback
Contributor

jreback commented Jul 2, 2014

thanks.....easier to revert this PR, can address in 0.15.0 when you have time!

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@rmsilva1973

Hey @jreback . Just received this bug for triage. It's still a reproducible issue as of '1.1.0.dev0+1247.g9c317322f'

@jreback
Contributor

jreback commented Apr 13, 2020

and it’s an open issue if you want to submit a PR

@rmsilva1973

rmsilva1973 commented Apr 24, 2020

So, I've spent the last 4 days trying to figure out how this tokenizer function works. I've initially narrowed down the test case to a simpler one:

import pandas as pd
from io import StringIO
text = """ A
one
c -13"""
pd.read_table(StringIO(text), delim_whitespace=True)

@jreback it doesn't seem related to MultiIndex specifically. Any number of index columns included in the data (is it called "implicit indexes"?) hits the bug. Do you think it's reasonable to change the bug title to something more appropriate?
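
One way to see that the reduced case has the same shape as the original report is to count whitespace-separated fields per line. This is only an illustrative sketch in plain Python; the helper `field_counts` is hypothetical and does not use pandas or its tokenizer:

```python
def field_counts(text):
    """Count whitespace-separated fields on each line of text."""
    return [len(line.split()) for line in text.splitlines()]


minimal = """ A
one
c -13"""

counts = field_counts(minimal)
print(counts)  # [1, 1, 2]
# One column name plus one index name gives the two fields of the data row,
# so a single index level is enough to produce this layout.
print(counts[0] + counts[1] == counts[2])  # True
```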

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@mroeschke
Member

delim_whitespace is deprecated and will be removed in pandas 3.0 so closing as wont fix
