read_table fails with MultiIndex input and delim_whitespace=True #6893

mcwitt · 2014-04-16T21:13:08Z

Related #6889

Example:

In [4]: text = """                      A       B       C       D        E
one two three   four
a   b   10.0032 5    -0.5109 -2.3358 -0.4645  0.05076  0.3640
a   q   20      4     0.4473  1.4152  0.2834  1.00661  0.1744
x   q   30      3    -0.6662 -0.5243 -0.3580  0.89145  2.5838"""

In [5]: pd.read_table(StringIO(text), delim_whitespace=True)
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
. . .
CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 9

This (partially) works if delim_whitespace=True is replaced with sep='\s+', engine='python' (although columns A-D are lost):

In [6]: pd.read_table(StringIO(text), sep='\s+', engine='python')
Out[6]: 
                           E
one two three   four        
a   b   10.0032 5     0.3640
    q   20.0000 4     0.1744
x   q   30.0000 3     2.5838

[3 rows x 1 columns]

jreback · 2014-04-28T00:43:58Z

@mcwitt can you address for 0.14?

mcwitt · 2014-04-28T15:54:31Z

Yep, I've made some progress on this. Should have a PR soon...

mcwitt · 2014-04-29T23:35:17Z

OK, I've fixed the bug in PythonParser. The C parser seems to have its own code for this, and I'm not sure yet whether it had been designed to support specifying a MultiIndex in this way.

jreback · 2014-04-29T23:46:58Z

hmm. it's only the header detection code, which is pretty straightforward: you have to read the header line by line based on the spec (header=[0,1] or whatever). One odd thing is that you may or may not have an extra row for the row names, depending on how the file was written. Turning that into a MI is then the same for both parsers (and is Python code).

mcwitt · 2014-04-30T06:00:41Z

> hmm. it's only the header detection code

I think this is a separate issue from header detection. The original title of this issue was wrong; this doesn't have to do with MI columns. It has to do with automatically treating the 2nd row as the index names when the number of fields in the 1st row plus the number of fields in the 2nd row equals the number of fields in the 3rd, i.e.

In [8]: text = """                      A       B       C       D        E
one two three   four
a   b   10.0032 5    -0.5109 -2.3358 -0.4645  0.05076  0.3640
a   q   20      4     0.4473  1.4152  0.2834  1.00661  0.1744
x   q   30      3    -0.6662 -0.5243 -0.3580  0.89145  2.5838"""

In [9]: pd.read_table(StringIO(text), sep='\s+', engine='python')
Out[9]: 
                           A       B       C        D       E
one two three   four                                         
a   b   10.0032 5    -0.5109 -2.3358 -0.4645  0.05076  0.3640
    q   20.0000 4     0.4473  1.4152  0.2834  1.00661  0.1744
x   q   30.0000 3    -0.6662 -0.5243 -0.3580  0.89145  2.5838

[3 rows x 5 columns]

The code for dealing with this in the python parser is in _get_index_name, but the C parser doesn't seem to have anything similar yet.
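
The field-count rule described in this comment can be sketched in plain Python. This is only an illustration and not the code in PythonParser._get_index_name: the helper name infer_implicit_index is hypothetical, and it assumes fields are separated by runs of whitespace.

```python
def infer_implicit_index(text):
    """Return (index_names, column_names) for whitespace-delimited text.

    Hypothetical helper, not part of pandas. It applies the rule from this
    thread: if the field counts of row 1 and row 2 add up to the field
    count of row 3, row 2 is taken to hold the index names.
    """
    # Split every non-blank line on runs of whitespace.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) < 3:
        return [], (rows[0] if rows else [])
    header, second, first_data = rows[0], rows[1], rows[2]
    if len(header) + len(second) == len(first_data):
        return second, header
    # No implicit index detected: only the header row names columns.
    return [], header


text = """                      A       B       C       D        E
one two three   four
a   b   10.0032 5    -0.5109 -2.3358 -0.4645  0.05076  0.3640
a   q   20      4     0.4473  1.4152  0.2834  1.00661  0.1744
x   q   30      3    -0.6662 -0.5243 -0.3580  0.89145  2.5838"""

index_names, columns = infer_implicit_index(text)
print(index_names)  # ['one', 'two', 'three', 'four']
print(columns)      # ['A', 'B', 'C', 'D', 'E']
```

For the example text the counts are 5 (header), 4 (second row), and 9 (first data row), so the second row is read as four index names.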

mcwitt · 2014-05-03T20:15:00Z

Made a PR for the fix for the python parser. The fix for the C parser will be a bit more involved so I think it would be good to do this in a separate PR. Essentially I think we need to duplicate the functionality in PythonParser._get_index_name in CParserWrapper initially, and then merge as much of the code as possible into a common function for both parsers.

jreback · 2014-05-05T00:04:14Z

can you detect the condition in c-parser and raise?

jreback · 2014-05-05T12:09:48Z

moving this to 0.14.1 (though we'll merge your pythonparse fix after release notes).

and if you can detect the condition in the c-parser would be great (and just raise for now)
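
One way to read "detect the condition and raise" is a pre-check on the first three non-blank rows before the text reaches the C engine. The sketch below is plain Python, not pandas code: the name check_no_implicit_index and the error message are hypothetical, and it assumes whitespace-delimited fields.

```python
def check_no_implicit_index(lines):
    """Raise ValueError if the first three rows look like an implicit index.

    Hypothetical guard, not part of pandas. The layout is flagged when the
    field counts of rows 1 and 2 add up to the field count of row 3.
    """
    # Count whitespace-separated fields in the first three non-blank lines.
    counts = [len(line.split()) for line in lines if line.strip()][:3]
    if len(counts) == 3 and counts[0] + counts[1] == counts[2]:
        raise ValueError(
            "index names appear on the row after the header "
            f"({counts[0]} + {counts[1]} == {counts[2]} fields); "
            "this layout is not supported here"
        )


sample = ["    A  B", "idx", "x   1  2"]
try:
    check_no_implicit_index(sample)
except ValueError as exc:
    print("rejected:", exc)
```

A guard like this gives the user an explicit message instead of a tokenizer error, at the cost of rejecting input that the Python engine can already read.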

jreback · 2014-06-05T15:15:17Z

@mcwitt any thoughts on fixing the c-parser for this?

mcwitt · 2014-06-05T17:23:49Z

I'm still thinking about this. I think I will try to implement the fix described above, but I will be too busy to look at it for the next two weeks (traveling to a conference). I should be able to get it done for 0.14.1 though.

jreback · 2014-06-05T17:28:19Z

gr8!

mcwitt · 2014-06-26T00:47:31Z

Update: I'm close to a fix for this! Hopefully can make a PR tomorrow.

jreback · 2014-07-02T14:19:03Z

Reverted #7591, moving to 0.15 as for the failing tests: #7623

jreback · 2014-07-02T14:20:02Z

thanks.....easier to revert this PR, can address in 0.15.0 when you have time!

rmsilva1973 · 2020-04-13T01:28:21Z

Hey @jreback. I just received this bug for triage. It's still reproducible as of '1.1.0.dev0+1247.g9c317322f'.

jreback · 2020-04-13T01:48:33Z

and it’s an open issue if you want to submit a PR

rmsilva1973 · 2020-04-24T10:29:35Z

So, I've spent the last 4 days trying to figure out how this tokenizer function works. I've initially narrowed down the test case to a simpler one:

import pandas as pd
from io import StringIO
text = """ A
one
c -13"""
pd.read_table(StringIO(text),delim_whitespace=True)

@jreback it doesn't seem related to MultiIndex specifically. Any number of index levels included in the data (is it called "implicit indexes"?) hits the bug. Do you think it's reasonable to change the bug title to something more appropriate?

mroeschke · 2024-05-10T17:49:23Z

delim_whitespace is deprecated and will be removed in pandas 3.0, so closing as won't fix

jreback added this to the 0.14.0 milestone Apr 16, 2014

jreback added Bug labels Apr 21, 2014

mcwitt changed the title ~~read_table fails with multi-index columns and delim_whitespace=True~~ read_table fails with implicit MultiIndex and delim_whitespace=True Apr 29, 2014

mcwitt changed the title ~~read_table fails with implicit MultiIndex and delim_whitespace=True~~ read_table fails with MultiIndex input and delim_whitespace=True Apr 29, 2014

mcwitt mentioned this issue May 3, 2014

BUG: fix reading multi-index data in python parser #7029

Merged

jreback modified the milestones: 0.14.1, 0.14.0 May 5, 2014

jreback modified the milestones: 0.14.0, 0.14.1 May 6, 2014

mcwitt mentioned this issue Jun 27, 2014

ENH: read_{csv,table} look for index columns in row after header with C engine #7591

Merged

jreback closed this as completed in #7591 Jun 30, 2014

jreback reopened this Jul 2, 2014

jreback added this to the 0.15.0 milestone Jul 2, 2014

jreback removed this from the 0.14.1 milestone Jul 2, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

mroeschke closed this as completed May 10, 2024
