Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: disallow empty struct #8876

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

chloeh13q
Copy link
Contributor

@chloeh13q chloeh13q commented Apr 3, 2024

Description of changes

Disallow empty struct in type inference. Pyarrow allows the construction of empty structs but duckdb doesn't.

I think what the user in #8460 was attempting to create was an empty value in a struct column that has defined fields. You can accomplish this with the following:

>>> import ibis
>>>
>>> ibis.options.interactive = True
>>>
>>> t = ibis.memtable(
...             [
...                 {
...                     "year": "2024",
...                     "period": "M01",
...                     "periodName": "January",
...                     "latest": "true",
...                     "value": "3.7",
...                     "footnotes": None,
...                 }
...             ],
...             schema={"year": "string", "period": "string", "periodName": "string", "latest": "string", "value": "string", "footnotes": "struct<key: string>"}
...         )
>>> t
┏━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┓
┃ yearperiodperiodNamelatestvaluefootnotes ┃
┡━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━┩
│ stringstringstringstringstringstruct<k… │
├────────┼────────┼────────────┼────────┼────────┼───────────┤
│ 2024M01Januarytrue3.7NULL      │
└────────┴────────┴────────────┴────────┴────────┴───────────┘

I do want to note here that ibis interprets {} as a struct whose keys point to empty values rather than a struct that is empty:

>>> t = ibis.memtable(
...             [
...                 {
...                     "year": "2024",
...                     "period": "M01",
...                     "periodName": "January",
...                     "latest": "true",
...                     "value": "3.7",
...                     "footnotes": {},
...                 }
...             ],
...             schema={"year": "string", "period": "string", "periodName": "string", "latest": "string", "value": "string", "footnotes": "struct<key: string>"}
...         )
>>> t
┏━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ yearperiodperiodNamelatestvaluefootnotes            ┃
┡━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringstringstringstringstruct<key: string>  │
├────────┼────────┼────────────┼────────┼────────┼──────────────────────┤
│ 2024M01Januarytrue3.7    │ {'key': None}        │
└────────┴────────┴────────────┴────────┴────────┴──────────────────────┘

Issues closed

#8460

@NickCrews
Copy link
Contributor

It seems to me like {} does not properly follow the schema of "struct<key: string>", and so ibis.literal({}, type="struct<key: string>") should just fail. The user's input is malformed, and they will need to deal with it external to ibis. Or, perhaps they could use a map<string, string> type instead, which properly represents a datatype with a variable number of keys.

IDK, maybe we could make ibis lenient, but it feels like a slippery slope. If we allow this, then it feels like we should make it so that ibis.struct({"a": "foo"}).cast("struct<a: string, b:int>") to succeed, and result in {"a": "foo", "b": None} (ie if you cast something to a struct, anytime a field isn't present, fill it in with NULL). This feels OK to me, but I don't think backends have good builtin support for this, we would have to add workarounds all over the place.

@chloeh13q chloeh13q force-pushed the fix/empty-struct branch 2 times, most recently from 6bef829 to 1a4d1d0 Compare April 10, 2024 06:39
Comment on lines +173 to +175
if isinstance(df, dd.DataFrame):
df = df.compute()
self.schemas[table_name] = schema = PandasData.infer_table(df)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code was breaking a bunch of tests because in one of the datasets we're using there is a struct column that looks like [{}, None, {1: "a", 2: "b"}]. We need all of the data in that column in order to infer the struct type there...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants