Error to read parquet with latest parquet-go #61

Open
tanyaofei opened this issue Mar 30, 2022 · 21 comments

Comments

@tanyaofei

tanyaofei commented Mar 30, 2022

  1. Create a file with Python pandas:
import pandas

dataframe = pandas.DataFrame({
    "A": ["a", "b", "c", "d"],
    "B": [2, 3, 4, 1],
    "C": [10, 20, None, None],
})

dataframe.to_parquet("1.parquet")

This file looks like:
image

  2. Read this file:
func main() {
    ctx := context.Background()
    fr, _ := local.NewLocalFileReader("1.parquet")
    df, err := imports.LoadFromParquet(ctx, fr)
    if err != nil {
        panic(err)
    }
    fmt.Println(df)
}
  3. Got a unique name error:
panic: names of series must be unique: 

goroutine 1 [running]:
github.com/rocketlaunchr/dataframe-go.NewDataFrame({0xc0001f8000, 0x3, 0xc000149a10?})
        .../rocketlaunchr/dataframe-go@v0.0.0-20211025052708-a1030444159b/dataframe.go:41 +0x33c
github.com/rocketlaunchr/dataframe-go/imports.LoadFromParquet({0x1497868, 0xc000020080}, {0x1498150?, 0xc00000e798?}, {0xc0000021a0?, 0xc000149f70?, 0x1007599?})
        .../go/pkg/mod/github.com/rocketlaunchr/dataframe-go@v0.0.0-20211025052708-a1030444159b/imports/parquet.go:110 +0x8ae
main.main()
        .../main.go:13 +0x78
  4. Following the stack trace, I found some useful information:
  • All series in imports.LoadFromParquet have empty names

image

  • goFieldNameToActual
    Every key in this map has the prefix "Scheme", but goName does not. That may be why no name can be found in the map.

image

image

This is the first time I have used Go to read Parquet files. Is this error caused by breaking changes in parquet-go, or by something else?

@pjebs
Collaborator

pjebs commented Mar 30, 2022

Can you send me the file?

@tanyaofei
Author

> Can you send me the file?

1.parquet.zip

@pjebs
Collaborator

pjebs commented Mar 30, 2022

Can you create the DataFrame from this package, export it to Parquet, and then try to import it back?

@tanyaofei
Author

> Can you create the DataFrame from this package, export it to Parquet, and then try to import it back?

I tried that first. It seems to produce an invalid Parquet file with the content "PAR1":

func main() {
    df := dataframe.NewDataFrame(dataframe.NewSeriesString("A", nil, []string{"1", "2", "3"}))
    file, _ := os.Create("1.parquet")
    _ = exports.ExportToParquet(context.Background(), file, df)
}

image

@pjebs
Collaborator

pjebs commented Mar 30, 2022

A Parquet file is not text-based. Can you try importing the file back?
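
For context, Parquet is a binary format: per the format specification, a file starts and ends with the 4-byte magic `PAR1`, and the footer metadata plus its 4-byte length sit just before the trailing magic. Seeing "PAR1" in a text editor is therefore expected, but a file that contains nothing else is too short to be valid. A minimal sketch of that sanity check using only the standard library; the helper name `looksLikeParquet` is made up for illustration and is not part of dataframe-go or parquet-go:

```go
package main

import (
	"bytes"
	"fmt"
)

// parquetMagic is the 4-byte marker at the start and end of a Parquet file.
var parquetMagic = []byte("PAR1")

// looksLikeParquet is a hypothetical helper. It only checks the framing
// (leading magic, trailing magic, room for the 4-byte footer length).
// It does not validate the footer or any data pages.
func looksLikeParquet(data []byte) bool {
	// Smallest possible framing: magic + footer length + magic = 12 bytes.
	if len(data) < 12 {
		return false
	}
	return bytes.HasPrefix(data, parquetMagic) && bytes.HasSuffix(data, parquetMagic)
}

func main() {
	// A file holding only the leading magic, like the one described above.
	fmt.Println(looksLikeParquet([]byte("PAR1")))

	// Framing only: magic, 8 placeholder bytes, magic.
	framed := append([]byte("PAR1"), append(make([]byte, 8), []byte("PAR1")...)...)
	fmt.Println(looksLikeParquet(framed))
}
```

Passing this check does not mean the file is readable; it only rules out the obviously truncated case.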

@tanyaofei
Author

> A Parquet file is not text based. Can you try importing the file back.

    df := dataframe.NewDataFrame(dataframe.NewSeriesString("A", nil, []string{"1", "2", "3"}))
    file, _ := os.Create("1.parquet")
    _ = exports.ExportToParquet(context.Background(), file, df)

    fr, _ := local.NewLocalFileReader("1.parquet")
    df, err := imports.LoadFromParquet(context.Background(), fr)
    if err != nil {
        panic(err)
    }
    fmt.Println(df)
panic: seek 1.parquet: invalid argument

goroutine 1 [running]:
main.main()
        .../main.go:21 +0x465
Exiting.

Error at imports/parquet.go, line 40: pr, err := reader.NewParquetReader(src, nil, int64(runtime.NumCPU()))

@tanyaofei
Author

> A Parquet file is not text based. Can you try importing the file back.

My parquet-go version is v1.6.2: github.com/xitongsys/parquet-go v1.6.2

@pjebs
Collaborator

pjebs commented Mar 30, 2022

I tried opening your file and it worked:

package main

import (
	"context"
	"fmt"

	"github.com/rocketlaunchr/dataframe-go/imports"
	"github.com/xitongsys/parquet-go-source/local"
)

var ctx = context.Background()

func main() {
	fr, _ := local.NewLocalFileReader("1.parquet")
	defer fr.Close()

	df, err := imports.LoadFromParquet(ctx, fr)
	if err != nil {
		panic(err)
	}

	fmt.Println(df)
}

OUTPUT:

+-----+--------+-------+---------+
|     |   A    |   B   |    C    |
+-----+--------+-------+---------+
| 0:  |   a    |   2   |   10    |
| 1:  |   b    |   3   |   20    |
| 2:  |   c    |   4   |   NaN   |
| 3:  |   d    |   1   |   NaN   |
+-----+--------+-------+---------+
| 4X3 | STRING | INT64 | FLOAT64 |
+-----+--------+-------+---------+

@tanyaofei
Author

> I tried opening your file and it worked:

Can you tell me your parquet-go version?

@pjebs
Collaborator

pjebs commented Mar 30, 2022

module main

go 1.18

require (
	github.com/rocketlaunchr/dataframe-go v0.0.0-00010101000000-000000000000
	github.com/xitongsys/parquet-go-source v0.0.0-20200509081216-8db33acb0acf
)

require (
	github.com/apache/thrift v0.0.0-20181112125854-24918abba929 // indirect
	github.com/goccy/go-json v0.7.6 // indirect
	github.com/golang/snappy v0.0.0-20180518054509-2e65f85255db // indirect
	github.com/google/go-cmp v0.4.0 // indirect
	github.com/guptarohit/asciigraph v0.5.1 // indirect
	github.com/juju/clock v0.0.0-20190205081909-9c5c9712527c // indirect
	github.com/juju/errors v0.0.0-20200330140219-3fe23663418f // indirect
	github.com/juju/loggo v0.0.0-20200526014432-9ce3a2e09b5e // indirect
	github.com/juju/utils/v2 v2.0.0-20200923005554-4646bfea2ef1 // indirect
	github.com/klauspost/compress v1.9.7 // indirect
	github.com/mattn/go-runewidth v0.0.7 // indirect
	github.com/olekukonko/tablewriter v0.0.4 // indirect
	github.com/rocketlaunchr/mysql-go v1.1.3 // indirect
	github.com/xitongsys/parquet-go v1.5.2 // indirect
	golang.org/x/crypto v0.0.0-20200820211705-5c72a883971a // indirect
	golang.org/x/exp v0.0.0-20200331195152-e8c3332aa8e5 // indirect
	golang.org/x/net v0.0.0-20200904194848-62affa334b73 // indirect
	golang.org/x/sync v0.0.0-20200317015054-43a5402ce75a // indirect
	gopkg.in/yaml.v2 v2.3.0 // indirect
)

@tanyaofei
Author

With github.com/apache/thrift v0.0.0-20181112125854-24918abba929 and github.com/xitongsys/parquet-go v1.5.2, it works.

@pjebs
Collaborator

pjebs commented Mar 30, 2022

In the release notes:

[v1.6.0](https://github.com/xitongsys/parquet-go/releases/tag/v1.6.0)
Big changes in the type. Not compatiable with before.

I may need to update the package to use v1.6+ instead of v1.5.

No idea why it is not using v1.5 for you since it's registered in the go.mod file.

@tanyaofei
Author

> No idea why it is not using v1.5 for you since it's registered in the go.mod file.

v1.5 works fine. Maybe I installed parquet-go before installing dataframe-go; I'm not sure.

@tanyaofei
Author

It seems the problem is solved, so I should close this issue.

@pjebs
Collaborator

pjebs commented Mar 30, 2022

Maybe you directly imported "github.com/rocketlaunchr/dataframe-go/imports" without importing "github.com/rocketlaunchr/dataframe-go". Since there is no go.mod file inside github.com/rocketlaunchr/dataframe-go/imports directory, it just downloaded and used the latest version of parquet-go

pjebs reopened this Mar 30, 2022

@tanyaofei
Author

> Maybe you directly imported "github.com/rocketlaunchr/dataframe-go/imports" without importing "github.com/rocketlaunchr/dataframe-go". Since there is no go.mod file inside github.com/rocketlaunchr/dataframe-go/imports directory, it just downloaded and used the latest version of parquet-go

Here is my shell history:

➜  go get -u github.com/rocketlaunchr/dataframe-go
go: downloading github.com/rocketlaunchr/dataframe-go v0.0.0-20211025052708-a1030444159b
go: downloading golang.org/x/exp v0.0.0-20200331195152-e8c3332aa8e5
go: downloading github.com/google/go-cmp v0.4.0
go: downloading github.com/guptarohit/asciigraph v0.5.1
go: downloading github.com/olekukonko/tablewriter v0.0.4
go: downloading golang.org/x/sync v0.0.0-20200317015054-43a5402ce75a
go: downloading github.com/olekukonko/tablewriter v0.0.5
go: downloading github.com/google/go-cmp v0.5.7
go: downloading github.com/mattn/go-runewidth v0.0.7
go: downloading github.com/mattn/go-runewidth v0.0.13
go: downloading golang.org/x/exp v0.0.0-20220328175248-053ad81199eb
go: downloading github.com/guptarohit/asciigraph v0.5.3
go: downloading github.com/rivo/uniseg v0.2.0
go: added github.com/google/go-cmp v0.5.7
go: added github.com/guptarohit/asciigraph v0.5.3
go: added github.com/mattn/go-runewidth v0.0.13
go: added github.com/olekukonko/tablewriter v0.0.5
go: added github.com/rivo/uniseg v0.2.0
go: added github.com/rocketlaunchr/dataframe-go v0.0.0-20211025052708-a1030444159b
go: added golang.org/x/exp v0.0.0-20220328175248-053ad81199eb
go: added golang.org/x/sync v0.0.0-20210220032951-036812b2e83c
➜  go get -u github.com/xitongsys/parquet-go/parquet                                     
go: downloading github.com/apache/thrift v0.16.0
go: upgraded github.com/apache/thrift v0.0.0-20181112125854-24918abba929 => v0.16.0
go: upgraded github.com/xitongsys/parquet-go v1.5.2 => v1.6.2
➜  go get -u github.com/xitongsys/parquet-go-source                                       
go: downloading github.com/xitongsys/parquet-go-source v0.0.0-20220315005136-aec0fe3e777c
go: upgraded github.com/xitongsys/parquet-go-source v0.0.0-20200817004010-026bad9b25d0 => v0.0.0-20220315005136-aec0fe3e777c

@pjebs
Collaborator

pjebs commented Mar 30, 2022

You shouldn't have done the last 2 go gets since they don't have a go.mod file so it just assumed the latest version hence: go: upgraded github.com/xitongsys/parquet-go v1.5.2 => v1.6.2

@pjebs
Collaborator

pjebs commented Mar 30, 2022

From Go's point of view, when you do that, it's an unrelated package.

@tanyaofei
Author

> You shouldn't have done the last 2 go gets since they don't have a go.mod file so it just assumed the latest version hence: go: upgraded github.com/xitongsys/parquet-go v1.5.2 => v1.6.2

Got it, thanks a lot.

@chippyash

Hi, when is this lib going to be upgraded to use parquet-go >= v1.6.2, please? Having to pin v1.5.4 just broke all the tagging I was using, which assumed v1.6.2 :-(

@pjebs
Collaborator

pjebs commented Jun 4, 2022

There is a backward-incompatible change in v1.6.2. Therefore I will need to explore it more deeply.

This package's go.mod is set to github.com/xitongsys/parquet-go v1.5.2, so it should work for you provided you don't independently go get the "github.com/rocketlaunchr/dataframe-go/imports" package.

Let the main package dictate the dependencies for the sub-packages.
