Protocol RFC for collations #3068

olaky · 2024-05-08T07:45:57Z

Which Delta project/connector is this regarding?

Description

Protocol RFC for adding collation support to Delta

stefankandic · 2024-05-08T09:38:36Z

protocol_rfcs/collated-string-type.md

+
+Part | Description
+-|-
+Provider | Name of the provider. Must not contain dots


Since neither the provider nor the name can contain dots, should the provider perhaps be optional?

olaky · 2024-05-13

How would it work with an optional provider when we allow dots in versions? For example, if we have `provider.name.version` and `name2.version.1`, how do we create a parsing rule for this?
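The parsing concern can be made concrete with a small sketch. This is an illustration only, not the RFC's parser: it assumes identifiers of the form `provider.name.version` where neither the provider nor the name contains a dot, and the helper name `parse_collation_id` is hypothetical.

```python
from typing import NamedTuple


class CollationId(NamedTuple):
    provider: str
    name: str
    version: str


def parse_collation_id(identifier: str) -> CollationId:
    """Split 'provider.name.version' on the first two dots only.

    Assumes provider and name never contain dots, while the version may.
    Hypothetical helper for illustration; not part of Delta.
    """
    parts = identifier.split(".", 2)  # at most 3 pieces; extra dots stay in the version
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected provider.name.version, got {identifier!r}")
    return CollationId(*parts)
```

With a mandatory provider, `provider.name.1.2` parses unambiguously and the version is `1.2`. If the provider were optional, `name2.version.1` could not be told apart from a full provider/name/version triple, which is the ambiguity raised in the question.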

felipepessoto · 2024-05-08T15:53:05Z

Hi @olaky. I have some questions please:

1: Are we introducing a new writerFeature? I don't see any mention of that.
2: The design doc says no readerFeature will be added. In that case, a reader that doesn't understand collations could return incorrect results, for example with `SELECT COUNT(*) FROM TABLE GROUP BY CaseInsensitiveColumn`, or when filtering. Is this expected?
3: Do you know how Spark handles the backward-compatibility issue described above for Parquet tables?

Thanks.

olaky · 2024-05-08T18:52:49Z

Hi @felipepessoto,

thanks for taking an interest in this project.

1: The idea is indeed to add a writer feature. Because we want to keep collation information in the field metadata of the table schema, this might not actually be required, though. The protocol is designed so that writers can provide statistics for UTF8_BINARY (the collation currently used), and so that clients that do not understand collations do not break. I will need some time to explore whether we can get away without a writer feature, because that would be nice.
2: This is indeed a difficult decision, and yes, results can differ between clients that respect collations and clients that ignore them. However, many engines that can read and write Delta do not support collations and are unlikely to do so in the medium term. This is why I prefer not to have a reader feature: it would make tables with collations unreadable by many clients for a very long time, which is not desirable.
3: Spark will also not require clients to know about collations for the Parquet tables it writes (this is work in progress).

One more thing to point out here is that Hive tables, and HMS by extension, do not support collations. So if we used a schema that forces clients to know about collations, such tables would no longer be Hive-compliant. The same goes for Delta UniForm, because Iceberg has no way to specify collations.
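To make the trade-off in point 2 concrete, here is a minimal sketch of how a grouping result can differ between a reader that ignores collations and one that respects a case-insensitive collation. The sample values are made up, and `str.casefold()` is only a stand-in for a real collation (ICU collations define their own comparison rules).

```python
from collections import Counter

# Made-up values of a string column declared with a case-insensitive collation.
rows = ["Delta", "delta", "DELTA", "Spark"]

# Reader that ignores collations: groups by exact (binary) string value.
binary_groups = Counter(rows)

# Reader that respects the collation. casefold() approximates case-insensitive
# equality here purely for illustration.
collated_groups = Counter(value.casefold() for value in rows)

print(len(binary_groups))    # 4 groups
print(len(collated_groups))  # 2 groups
```

Both readers can read the table, but they disagree on the number of groups, which is the behavior being accepted by not adding a reader feature.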

cstavr

Looks good to me. Left some nits.

…3117)

## Description

This refactoring adds support for nested statistics columns. So far, all statistics are keys in the stats struct in AddFiles. This PR adds support for statistics that are part of nested structs. This is a prerequisite for file skipping on collated string columns ([Protocol RFC](#3068)). Statistics for collated string columns will be wrapped in a struct keyed by the versioned collation that was used to generate them. For example:

```
"stats": {
  "statsWithCollation": {
    "icu.en_US.72": {
      "minValues": { ...}
    }
  }
}
```

This PR replaces statType in StatsColumn with pathToStatType, which can be used to represent a path. This way we can re-use all of the existing data skipping code without changes.

## How was this patch tested?

It is not possible to test this change without altering [statsSchema](https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/stats/StatisticsCollection.scala#L285). I would still like to ship this PR separately because the change is big enough in itself. There is existing test coverage for stats parsing and file skipping, but none of them uses nested statistics yet.

## Does this PR introduce _any_ user-facing changes?

No
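The idea of addressing a statistic by a path rather than by a single key can be sketched as follows. The actual change lives in the Scala code (`StatsColumn` and `pathToStatType`); the Python below is only an illustration, the helper `resolve_stat` is hypothetical, and the concrete stat values are made up.

```python
from typing import Any, Optional, Sequence

# Stats shaped like the example in the PR description; values are invented.
stats = {
    "numRecords": 3,
    "minValues": {"name": "Alice"},
    "statsWithCollation": {
        "icu.en_US.72": {"minValues": {"name": "alice"}},
    },
}


def resolve_stat(stats: dict, path: Sequence[str]) -> Optional[Any]:
    """Walk `path` through nested dicts; return None if any key is missing."""
    current: Any = stats
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# A top-level stat and a collation-specific stat use the same lookup.
top_level = resolve_stat(stats, ["minValues", "name"])
collated = resolve_stat(
    stats, ["statsWithCollation", "icu.en_US.72", "minValues", "name"]
)
```

Because a top-level statistic is just a path of length two here, the same lookup serves both cases, which mirrors the PR's point that the existing data skipping code can be reused.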

vkorukanti

A few naive questions (apologies in advance).

Protocol RFC for collations

2aad573

stefankandic reviewed May 8, 2024

stefankandic reviewed May 8, 2024

Use the same schema as iceberg compat v2. Add domain metadata

2cab504

olaky requested a review from stefankandic May 13, 2024 07:05

stefankandic mentioned this pull request May 13, 2024

[SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE apache/spark#46280

Closed

cstavr reviewed May 16, 2024

olaky added 2 commits May 16, 2024 13:50

Merge remote-tracking branch 'delta/master' into collations-rfc

53a69c9

Extend the example and fix some grammar

c024a84

olaky force-pushed the collations-rfc branch from ab04cd9 to c024a84 May 16, 2024 11:58

cstavr approved these changes May 16, 2024

stefankandic approved these changes May 16, 2024

This was referenced May 17, 2024

[PROTOCOL RFC] Support for collated strings in the schema and statistics #2894

Open

[SPARK] Support predicates for stats that are not at the top level #3117

Merged

vkorukanti reviewed May 23, 2024

vkorukanti reviewed May 23, 2024

olaky added 2 commits May 24, 2024 15:46

Clarify writer requirements

de07bbf

Merge remote-tracking branch 'delta/master' into collations-rfc

f038944

olaky requested review from stefankandic, cstavr and vkorukanti May 24, 2024 13:46

vkorukanti approved these changes May 24, 2024

stefankandic approved these changes May 27, 2024

vkorukanti merged commit 6c0137b into delta-io:master May 27, 2024
10 checks passed
