Richer Support for Struct Tags #92

Open
julianedwards opened this issue Jul 27, 2022 · 0 comments

julianedwards commented Jul 27, 2022

It would be neat to have richer support for struct tags for auto-generated schema definitions. I added this feature to a branch off my forked repo and am happy to put up a PR if you guys think this is a good idea! I added documentation on what this would look like (I just copied the updates I made to the README on my branch).

Object Schema Definitions

The sub-package parquetschema/autoschema supports auto-generating schema
definitions for a provided object's type using reflection and struct tags. The
generated schema is meant to be compatible with the reflection-based
marshalling/unmarshalling in the floor sub-package.

Supported Parquet Types

| Parquet Type | Go Types | Note |
|---|---|---|
| BOOLEAN | bool | |
| INT32 | int{8,16,32}, uint{,8,16,32} | |
| INT64 | int{,64}, uint64 | |
| INT96 | [12]byte | Must specify type=INT96 in the parquet struct tag. |
| FLOAT | float32 | |
| DOUBLE | float64 | |
| BYTE_ARRAY | string, []byte | |
| FIXED_LEN_BYTE_ARRAY | []byte, [N]byte | |

Supported Logical Types

| Logical Type | Go Types | Note |
|---|---|---|
| STRING | string, []byte | |
| MAP | map[T1]T2 | Maps with any key and value types. |
| LIST | []T, [N]T | Slices and arrays of any type except for byte. |
| ENUM | string, []byte | |
| DECIMAL | int32, int64, []byte, [N]byte | |
| DATE | int32, time.Time | |
| TIME | int32, int64, goparquet.Time | int32: TIME(MILLIS, {false,true}), int64: TIME({MICROS,NANOS}, {false,true}) |
| TIMESTAMP | int64, time.Time | |
| INTEGER | {,u}int{,8,16,32,64} | |
| JSON | string, []byte | |
| BSON | string, []byte | |
| UUID | [16]byte | |

Pointers are automatically mapped to optional fields. Unsupported Go types
include funcs, interfaces, unsafe pointers, unsigned int pointers, and complex
numbers.
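
As a rough illustration of the pointer and unsupported-type rules above, the following sketch classifies a Go value with the standard reflect package. The fieldInfo type and the classify function are hypothetical names made up for this example, not part of autoschema, and treating "unsigned int pointers" as Go's uintptr is an assumption.

```go
package main

import (
	"fmt"
	"reflect"
)

// fieldInfo is a hypothetical summary type used only for this sketch.
type fieldInfo struct {
	Optional  bool // true when the Go type is a pointer
	Supported bool // false for kinds the text lists as unsupported
}

// classify inspects a value's type with reflection. A pointer is unwrapped
// once and marks the field optional; the remaining kind is then checked
// against the unsupported list.
func classify(v interface{}) fieldInfo {
	t := reflect.TypeOf(v)
	if t == nil {
		// An untyped nil carries no type information.
		return fieldInfo{}
	}
	info := fieldInfo{Supported: true}
	if t.Kind() == reflect.Ptr {
		info.Optional = true
		t = t.Elem()
	}
	switch t.Kind() {
	case reflect.Func, reflect.Interface, reflect.UnsafePointer,
		reflect.Uintptr, // assumption: "unsigned int pointers" means uintptr
		reflect.Complex64, reflect.Complex128:
		info.Supported = false
	}
	return info
}

func main() {
	var optional *int64
	fmt.Printf("*int64:     %+v\n", classify(optional))
	fmt.Printf("int32:      %+v\n", classify(int32(1)))
	fmt.Printf("complex128: %+v\n", classify(complex(1, 2)))
	fmt.Printf("func:       %+v\n", classify(func() {}))
}
```

The sketch only unwraps a single pointer level; how nested pointers should be handled is not stated in the text above.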

Default Type Mappings

By default, Go types are mapped to Parquet types and in some cases logical
types as well. More specific mappings can be achieved by the use of struct
tags (see below).

| Go Type | Default Parquet Type | Default Logical Type |
|---|---|---|
| bool | BOOLEAN | |
| int{,8,16,32,64} | INT{64,32,32,32,64} | INTEGER({64,8,16,32,64}, true) |
| uint{,8,16,32,64} | INT{32,32,32,32,64} | INTEGER({32,8,16,32,64}, false) |
| string | BYTE_ARRAY | STRING |
| []byte | BYTE_ARRAY | |
| [N]byte | FIXED_LEN_BYTE_ARRAY | |
| time.Time | INT64 | TIMESTAMP(NANOS, true) |
| goparquet.Time | INT64 | TIME(NANOS, true) |
| map | group | MAP |
| slice, array | group | LIST |
| struct | group | |
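
To make the default mappings concrete, here is a minimal sketch that looks up the defaults for a Go value using reflection. The mapping type and defaultMapping function are hypothetical and written only for this example. The float cases come from the Supported Parquet Types table rather than the defaults table, and goparquet.Time is left out so that the example depends only on the standard library.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// mapping is a hypothetical pair of type names used only in this sketch.
type mapping struct {
	Physical string // Parquet type, or "group" for nested types
	Logical  string // logical type annotation, empty if none
}

// defaultMapping returns the default Parquet and logical type for v.
func defaultMapping(v interface{}) (mapping, error) {
	t := reflect.TypeOf(v)
	if t == nil {
		return mapping{}, fmt.Errorf("cannot map untyped nil")
	}
	// time.Time is a struct, so it must be checked before the kind switch.
	if t == reflect.TypeOf(time.Time{}) {
		return mapping{"INT64", "TIMESTAMP(NANOS, true)"}, nil
	}
	switch t.Kind() {
	case reflect.Bool:
		return mapping{"BOOLEAN", ""}, nil
	case reflect.Int, reflect.Int64:
		return mapping{"INT64", "INTEGER(64, true)"}, nil
	case reflect.Int8, reflect.Int16, reflect.Int32:
		return mapping{"INT32", fmt.Sprintf("INTEGER(%d, true)", t.Bits())}, nil
	case reflect.Uint: // plain uint defaults to 32 bits in the table
		return mapping{"INT32", "INTEGER(32, false)"}, nil
	case reflect.Uint8, reflect.Uint16, reflect.Uint32:
		return mapping{"INT32", fmt.Sprintf("INTEGER(%d, false)", t.Bits())}, nil
	case reflect.Uint64:
		return mapping{"INT64", "INTEGER(64, false)"}, nil
	case reflect.Float32: // from the Supported Parquet Types table
		return mapping{"FLOAT", ""}, nil
	case reflect.Float64: // from the Supported Parquet Types table
		return mapping{"DOUBLE", ""}, nil
	case reflect.String:
		return mapping{"BYTE_ARRAY", "STRING"}, nil
	case reflect.Slice:
		if t.Elem().Kind() == reflect.Uint8 {
			return mapping{"BYTE_ARRAY", ""}, nil
		}
		return mapping{"group", "LIST"}, nil
	case reflect.Array:
		if t.Elem().Kind() == reflect.Uint8 {
			return mapping{"FIXED_LEN_BYTE_ARRAY", ""}, nil
		}
		return mapping{"group", "LIST"}, nil
	case reflect.Map:
		return mapping{"group", "MAP"}, nil
	case reflect.Struct:
		return mapping{"group", ""}, nil
	}
	return mapping{}, fmt.Errorf("no default mapping for %s", t)
}

func main() {
	for _, v := range []interface{}{uint(1), int8(1), "s", []byte{}, [12]byte{}, time.Now()} {
		m, err := defaultMapping(v)
		fmt.Printf("%T -> %+v (err: %v)\n", v, m, err)
	}
}
```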

Struct Tags

Automatic schema definition generation supports the use of the parquet struct
tag for further schema specification beyond the default mappings. Tag fields
have the format key=value and are comma separated. The tags do not support
converted types as these are now deprecated by Parquet. Since converted types
are still required to support backward compatibility, they are automatically
set based on a field's logical type.

| Tag Field | Type | Values | Notes |
|---|---|---|---|
| name | string | ANY | Defaults to the lower-case struct field name. |
| type | string | INT96 | Unless using a [12]byte field for INT96, this does not ever need to be specified. |
| logicaltype | string | STRING, ENUM, DECIMAL, DATE, TIME, TIMESTAMP, JSON, BSON, UUID | Maps and non-byte slices and arrays are always mapped to MAP and LIST logical types, respectively. |
| timeunit | string | MILLIS, MICROS, NANOS | Only used when the logical type is TIME or TIMESTAMP, defaults to NANOS. |
| isadjustedtoutc | bool | ANY | Only used when the logical type is TIME or TIMESTAMP, defaults to true. |
| scale | int32 | N >= 0 | Only used when the logical type is DECIMAL, defaults to 0. |
| precision | int32 | N >= 0 | Only used when the logical type is DECIMAL, required. |
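
The comma-separated key=value format can be parsed with a few lines of standard-library Go. The sketch below is one possible approach, not the code from the branch: parseTag and decimalParams are hypothetical helpers, and the only real API relied on is reflect.StructTag.Get, strings, and strconv.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
	"strings"
)

// parseTag splits the value of a parquet struct tag into its key=value fields.
func parseTag(tag string) (map[string]string, error) {
	fields := map[string]string{}
	if strings.TrimSpace(tag) == "" {
		return fields, nil
	}
	for _, part := range strings.Split(tag, ",") {
		kv := strings.SplitN(part, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("tag field %q is not in key=value form", strings.TrimSpace(part))
		}
		key, value := strings.TrimSpace(kv[0]), strings.TrimSpace(kv[1])
		if key == "" || value == "" {
			return nil, fmt.Errorf("tag field %q has an empty key or value", strings.TrimSpace(part))
		}
		if _, dup := fields[key]; dup {
			return nil, fmt.Errorf("tag field %q is specified twice", key)
		}
		fields[key] = value
	}
	return fields, nil
}

// decimalParams applies the DECIMAL rules from the table: precision is
// required, scale defaults to 0, and both must be non-negative int32 values.
func decimalParams(fields map[string]string) (int32, int32, error) {
	p, ok := fields["precision"]
	if !ok {
		return 0, 0, fmt.Errorf("precision is required for DECIMAL")
	}
	precision, err := strconv.ParseInt(p, 10, 32)
	if err != nil || precision < 0 {
		return 0, 0, fmt.Errorf("invalid precision %q", p)
	}
	scale := int64(0)
	if s, ok := fields["scale"]; ok {
		scale, err = strconv.ParseInt(s, 10, 32)
		if err != nil || scale < 0 {
			return 0, 0, fmt.Errorf("invalid scale %q", s)
		}
	}
	return int32(scale), int32(precision), nil
}

// row is a hypothetical struct used to show reading the tag via reflection.
type row struct {
	Amount *int32 `parquet:"name=decimal, logicaltype=DECIMAL, scale=5, precision=10"`
}

func main() {
	field, _ := reflect.TypeOf(row{}).FieldByName("Amount")
	fields, err := parseTag(field.Tag.Get("parquet"))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	scale, precision, err := decimalParams(fields)
	fmt.Println(fields["name"], fields["logicaltype"], scale, precision, err)
}
```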

Tag fields must be prefixed with key. or value. when they apply to the key or
value type of a map, respectively, and with element. when they apply to the
element type of a slice or array. It is invalid to prefix name, since it can
only apply to the field itself.
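
One way to picture the prefix rule is to sort already-parsed tag fields into scopes. The sketch below assumes a single prefix level and uses hypothetical names (scopedTags, splitScopes); how the branch handles nested containers such as a list of maps is not described here.

```go
package main

import (
	"fmt"
	"strings"
)

// scopedTags groups tag fields by what they apply to (hypothetical type).
type scopedTags struct {
	Field   map[string]string // unprefixed: the struct field itself
	Key     map[string]string // "key." prefix: map key type
	Value   map[string]string // "value." prefix: map value type
	Element map[string]string // "element." prefix: slice or array element type
}

// splitScopes sorts tag fields into scopes and rejects a prefixed name.
func splitScopes(fields map[string]string) (scopedTags, error) {
	out := scopedTags{
		Field:   map[string]string{},
		Key:     map[string]string{},
		Value:   map[string]string{},
		Element: map[string]string{},
	}
	prefixes := map[string]map[string]string{
		"key.":     out.Key,
		"value.":   out.Value,
		"element.": out.Element,
	}
	for name, value := range fields {
		target, bare := out.Field, name
		for prefix, scope := range prefixes {
			if strings.HasPrefix(name, prefix) {
				target, bare = scope, strings.TrimPrefix(name, prefix)
				break
			}
		}
		// name may only describe the struct field itself.
		if bare == "name" && bare != name {
			return scopedTags{}, fmt.Errorf("tag field %q: name cannot be prefixed", name)
		}
		target[bare] = value
	}
	return out, nil
}

func main() {
	scopes, err := splitScopes(map[string]string{
		"name":            "decimal_time_map",
		"key.logicaltype": "DECIMAL",
		"key.precision":   "15",
		"value.timeunit":  "MILLIS",
	})
	fmt.Println(scopes.Field, scopes.Key, scopes.Value, err)
}
```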

Object Schema Example

type example struct {
        ByteSlice          []byte
        String             string
        ByteString         []byte          `parquet:"name=byte_string, logicaltype=STRING"`
        Int64              int64           `parquet:"name=int_64"`
        Uint8              uint8           `parquet:"name=u_int_8"`
        Int96              [12]byte        `parquet:"name=int_96, type=INT96"`
        DefaultTS          time.Time       `parquet:"name=default_ts"`
        Timestamp          int64           `parquet:"name=ts, logicaltype=TIMESTAMP, timeunit=MILLIS, isadjustedtoutc=false"`
        Date               time.Time       `parquet:"name=date, logicaltype=DATE"`
        OptionalDecimal    *int32          `parquet:"name=decimal, logicaltype=DECIMAL, scale=5, precision=10"`
        TimeList           []int32         `parquet:"name=time_list, element.logicaltype=TIME, element.timeunit=MILLIS"`
        DecimalTimeMap     map[int64]int32 `parquet:"name=decimal_time_map, key.logicaltype=DECIMAL, key.scale=5, key.precision=15, value.logicaltype=TIME, value.timeunit=MILLIS, value.isadjustedtoutc=true"`
        Struct             struct {
                OptionalInt64 *int64   `parquet:"name=int_64"`
                Time          int64    `parquet:"name=time, logicaltype=TIME, isadjustedtoutc=false"`
                StringList    []string `parquet:"name=string_list"`
        } `parquet:"name=struct"`
}

The above struct is equivalent to the following schema definition:

message autogen_schema {
    required binary byteslice;
    required binary string (STRING);
    required binary byte_string (STRING);
    required int64 int_64 (INTEGER(64,true));
    required int32 u_int_8 (INTEGER(8,false));
    required int96 int_96;
    required int64 default_ts (TIMESTAMP(NANOS,true));
    required int64 ts (TIMESTAMP(MILLIS,false));
    required int32 date (DATE);
    optional int32 decimal (DECIMAL(10,5));
    required group time_list (LIST) {
        repeated group list {
            required int32 element (TIME(MILLIS,true));
        }
    }
    optional group decimal_time_map (MAP) {
        repeated group key_value (MAP_KEY_VALUE) {
            required int64 key (DECIMAL(15,5));
            required int32 value (TIME(MILLIS,true));
        }
    }
    required group struct {
        optional int64 int_64 (INTEGER(64,true));
        required int64 time (TIME(NANOS,false));
        required group string_list (LIST) {
            repeated group list {
                required binary element (STRING);
            }
        }
    }
}