Add Performant Low-level Operation For Adding Row #34

Open
AnthonyMBonafide opened this issue Jun 28, 2021 · 1 comment


@AnthonyMBonafide

The SchemaWriter interface supports adding row-level data via the AddData method. The method accepts a row as a map[string]interface{}: the key (a string) is the column name, and the value is in whatever form is appropriate for the underlying data (string, int64, []byte, bool, etc.). However, this has a performance cost in the form of heap allocations and more memory for the garbage collector to manage. The cause is the values of type interface{}: storing a concrete value in an interface generally requires a pointer to it, so the value escapes to the heap. To improve performance, could a new method be added that accepts row data in a way that reduces heap allocations while still letting the caller handle dynamic data, as the CSV to Parquet tool does?
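To make the allocation claim concrete, here is a minimal sketch that counts heap allocations per row with `testing.AllocsPerRun` from the standard library. It does not call the real AddData; `boxedRow`, `typedRow`, and the two column names are hypothetical and only imitate the shape of the input. Exact counts depend on the Go version and compiler, so none are quoted here.

```go
package main

import (
	"fmt"
	"testing"
)

// Package-level sinks keep results reachable so the compiler
// cannot optimize the work away.
var (
	boxedSink map[string]interface{}
	typedSink typedRow
)

// typedRow is a hypothetical statically typed row used for comparison.
type typedRow struct {
	Name string
	ID   int64
}

// boxedRow builds a row in the shape AddData accepts: each value is
// boxed in an interface{}. noinline keeps the arguments from being
// treated as compile-time constants.
//
//go:noinline
func boxedRow(name string, id int64) map[string]interface{} {
	return map[string]interface{}{"name": name, "id": id}
}

// allocsBoxed reports the average heap allocations per boxed row.
func allocsBoxed() float64 {
	return testing.AllocsPerRun(1000, func() {
		boxedSink = boxedRow("alice", 1<<40)
	})
}

// allocsTyped reports the average heap allocations per typed row.
func allocsTyped() float64 {
	return testing.AllocsPerRun(1000, func() {
		typedSink = typedRow{Name: "alice", ID: 1 << 40}
	})
}

func main() {
	fmt.Printf("map[string]interface{}: %.0f allocs/row\n", allocsBoxed())
	fmt.Printf("typed struct:           %.0f allocs/row\n", allocsTyped())
}
```

Running the same comparison against the real writer under `go test -bench . -benchmem` would be the more reliable way to confirm the assumption.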

From a quick scan of the code base, and to my untrained eye, one way to achieve this may be to create a generic struct that encapsulates the data and to use that rather than a map.

// RowData represents a row of data in a CSV file, and can be provided to the `SchemaWriter`
type RowData struct {
	Values []ColumnData
}

// ColumnData represents each field/column for a row in a CSV file
// (renamed from RowData so the two types do not collide)
type ColumnData struct {
	DataName string

	/*
		Different data types.
		Only one of these should be populated at a time
	*/
	StringValue string
	IntValue    int
	Int16Value  int16
	Int32Value  int32
	Int64Value  int64
	BoolValue   bool
}

The SchemaWriter interface can be updated to accept these new types. For example:

// AddDataRow writes a row of data to the underlying writer using the specified data and metadata
func AddDataRow(data RowData) error
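As a rough sketch of how a caller might use such a method, the example below fills one reusable row per CSV record so the backing slice is allocated once. Everything except the `AddDataRow(data RowData) error` signature is an assumption: `countingWriter` is a hypothetical stand-in for a SchemaWriter implementation, `fillRow` and `writeAll` are illustrative helpers, the per-column type is named `ColumnData` here to keep it distinct from the row type, and the struct is trimmed to three value fields.

```go
package main

import (
	"fmt"
	"strconv"
)

// ColumnData holds one column value (trimmed copy of the proposed struct).
type ColumnData struct {
	DataName    string
	StringValue string
	Int64Value  int64
	BoolValue   bool
}

// RowData is one row: a slice of column values.
type RowData struct {
	Values []ColumnData
}

// countingWriter is a hypothetical stand-in for a SchemaWriter
// implementation; it only counts what it receives.
type countingWriter struct {
	rows, columns int
}

// AddDataRow uses the signature proposed in the issue.
func (w *countingWriter) AddDataRow(data RowData) error {
	w.rows++
	w.columns += len(data.Values)
	return nil
}

// fillRow converts one CSV record (name, id) into row, reusing the
// backing array of row.Values instead of allocating a new one.
func fillRow(row *RowData, record []string) error {
	if len(record) < 2 {
		return fmt.Errorf("record has %d fields, want 2", len(record))
	}
	id, err := strconv.ParseInt(record[1], 10, 64)
	if err != nil {
		return fmt.Errorf("parse id %q: %w", record[1], err)
	}
	row.Values = append(row.Values[:0],
		ColumnData{DataName: "name", StringValue: record[0]},
		ColumnData{DataName: "id", Int64Value: id},
	)
	return nil
}

// writeAll writes every record through a single reusable row.
func writeAll(w *countingWriter, records [][]string) error {
	row := RowData{Values: make([]ColumnData, 0, 2)}
	for _, rec := range records {
		if err := fillRow(&row, rec); err != nil {
			return err
		}
		if err := w.AddDataRow(row); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	w := &countingWriter{}
	records := [][]string{{"alice", "1"}, {"bob", "2"}}
	if err := writeAll(w, records); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("rows:", w.rows, "columns:", w.columns)
}
```

Reusing the row this way only works if the writer copies or consumes the values before AddDataRow returns. Note also that with one field per type, a zero value (0, "", false) cannot be told apart from an unpopulated field, so a real design would likely need an explicit type tag.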

I would like to know whether my assumptions about performance are correct, whether there are any known workarounds that avoid adding new functionality, whether this is something the project wants, and what the preferred way to achieve it would be.

@panamafrancis
Contributor

We won't consider union types. However, with the advent of Go generics, we will take another look at this topic when we have the chance.
