
The column path is always overridden by the last modification #98

Open
csimplestring opened this issue Jan 20, 2023 · 0 comments
csimplestring commented Jan 20, 2023

Describe the bug
I have a schema:

message test {
	optional group a {
		optional group foo (MAP) {
			repeated group key_value {
				required binary key (STRING);
				optional binary value (STRING);
			}
		}
	}
}

The problem is that when I write the data into a file, there is no error and everything seems OK. But when I use 'parquet-tools' to cat the parquet file, it gives this error:

java.lang.IllegalArgumentException: [a, foo, key_value, key] required binary key (STRING) is not in the store: [[a, foo, key_value, value] optional binary value (STRING)] 1
	at org.apache.parquet.hadoop.ColumnChunkPageReadStore.getPageReader(ColumnChunkPageReadStore.java:272)
	at org.apache.parquet.tools.command.DumpCommand.dump(DumpCommand.java:246)
	at org.apache.parquet.tools.command.DumpCommand.dump(DumpCommand.java:195)
	at org.apache.parquet.tools.command.DumpCommand.execute(DumpCommand.java:148)
	at org.apache.parquet.tools.Main.main(Main.java:223)
java.lang.IllegalArgumentException: [a, foo, key_value, key] required binary key (STRING) is not in the store: [[a, foo, key_value, value] optional binary value (STRING)] 1

Unit test to reproduce
Described as above.
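
Roughly, a reproducer sketch (assuming the standard goparquet.NewFileWriter / WithSchemaDefinition / AddData API; the nested-map layout I pass for the MAP data is only illustrative, not taken verbatim from my code):

// Reproducer sketch: write one row with the schema above, then run
// parquet-tools cat/dump on the resulting file.
package main

import (
	"log"
	"os"

	goparquet "github.com/fraugster/parquet-go"
	"github.com/fraugster/parquet-go/parquetschema"
)

func main() {
	sd, err := parquetschema.ParseSchemaDefinition(`
		message test {
			optional group a {
				optional group foo (MAP) {
					repeated group key_value {
						required binary key (STRING);
						optional binary value (STRING);
					}
				}
			}
		}`)
	if err != nil {
		log.Fatalf("parse schema: %v", err)
	}

	f, err := os.Create("test.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	fw := goparquet.NewFileWriter(f, goparquet.WithSchemaDefinition(sd))

	// Illustrative data layout for the MAP group: writing reports no error,
	// but parquet-tools then fails with the IllegalArgumentException above.
	err = fw.AddData(map[string]interface{}{
		"a": map[string]interface{}{
			"foo": map[string]interface{}{
				"key_value": []map[string]interface{}{
					{"key": []byte("k1"), "value": []byte("v1")},
				},
			},
		},
	})
	if err != nil {
		log.Fatalf("add data: %v", err)
	}
	if err := fw.Close(); err != nil {
		log.Fatalf("close: %v", err)
	}
}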

I guess the root cause is in schema.go:

func recursiveFix(col *Column, colPath ColumnPath, maxR, maxD uint16, alloc *allocTracker) {
	// ...
	col.maxR = maxR
	col.maxD = maxD
	// at line 684, the append call always updates colPath's underlying array in place
	col.path = append(colPath, col.name)
	if col.data != nil {
		col.data.reset(col.rep, col.maxR, col.maxD)
		return
	}

	for i := range col.children {
		// so no matter how many children there are, colPath always ends up holding
		// the last child's path, because of the bug at line 684
		recursiveFix(col.children[i], col.path, maxR, maxD, alloc)
	}
}
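
To make the aliasing concrete, here is a small standalone example (not library code): when the parent slice has spare capacity, append for each sibling writes into the same backing array, so the second child overwrites the first.

package main

import "fmt"

func main() {
	// A parent path with spare capacity, like a slice grown by earlier appends.
	parent := make([]string, 1, 4)
	parent[0] = "a"

	// Appending each child's name to the same parent slice reuses the same
	// backing array, so the second append overwrites the first child's entry.
	keyPath := append(parent, "key")
	valuePath := append(parent, "value")

	fmt.Println(keyPath)   // [a value]  <- "key" has been overwritten
	fmt.Println(valuePath) // [a value]

	// Copying the parent first gives each child its own backing array.
	fixedKeyPath := append(append([]string(nil), parent...), "key")
	fmt.Println(fixedKeyPath) // [a key]
}

The same thing can happen in recursiveFix whenever the parent's col.path has spare capacity: every child receives the same colPath, so each child's append can write its name into the same backing slot.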

so the quick fix should be

	// copy the parent path first
	col.path = append([]string(nil), colPath...)
	col.path = append(col.path, col.name)

parquet-go specific details

  • What version are you using?
    0.12.0
  • Can this be reproduced in earlier versions?
    not sure.

Misc Details

  • Are you using AWS Athena, Google BigQuery, presto... ? No, just normal parquet file.
  • Any other relevant details... how big are the files / rowgroups you're trying to read/write? A very small file.
  • Does this behavior exist in other implementations? (link to spec/implementation please)
  • Do you have memory stats to share?
  • Can you provide a stacktrace?
  • Can you upload a test file?