[Bug]: Doris returns abnormal results when querying Decimal data on HDFS #1004

Open
wgzhao opened this issue Feb 4, 2024 · 2 comments

wgzhao commented Feb 4, 2024

What happened?

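  1. Install the latest Doris, then create a catalog that connects to Hive.
  2. Use the latest version of Addax to write an ORC file containing a Decimal column to HDFS.
  3. Query the table through Doris. The Decimal column shows wrong values, as follows: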
mysql> switch hive;
Query OK, 0 rows affected (0.01 sec)

mysql> select * from `default`.addax_test ;
+------+---------------+
| id   | fee           |
+------+---------------+
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 |       123.120 |
|   10 |       123.120 |
|   10 |       123.120 |
|   10 |       123.120 |
|   10 |       123.120 |
|   10 |       123.120 |
|   10 |       123.120 |
|   10 |       123.120 |
|   10 |       123.120 |
|   10 |       123.120 |
+------+---------------+
20 rows in set (0.12 sec)

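The first 10 rows in the table were written by Addax. They query correctly in the Hive CLI and in Trino, but Doris returns wrong values for them.
The last 10 rows were written from the Hive CLI with insert into addax_test select * from addax_test, and these 10 rows query correctly.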

Version

4.1.3 (Default)

OS Type

Linux (Default)

Java JDK Version

Oracle JDK 1.8.0

Relevant log output

No response

@wgzhao wgzhao added the bug Something isn't working label Feb 4, 2024
@wgzhao wgzhao self-assigned this Feb 4, 2024

wgzhao commented Feb 4, 2024

The metadata of the ORC file that reads correctly is as follows:

File Version: 0.12 with ORC_135
Rows: 10
Compression: ZLIB
Compression size: 262144
Type: struct<id:int,fee:decimal(20,3)>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 10 hasNull: false
    Column 1: count: 10 hasNull: false bytesOnDisk: 5 min: 10 max: 10 sum: 100
    Column 2: count: 10 hasNull: false bytesOnDisk: 16 min: 123.12 max: 123.12 sum: 1231.2

File Statistics:
  Column 0: count: 10 hasNull: false
  Column 1: count: 10 hasNull: false bytesOnDisk: 5 min: 10 max: 10 sum: 100
  Column 2: count: 10 hasNull: false bytesOnDisk: 16 min: 123.12 max: 123.12 sum: 1231.2

Stripes:
  Stripe: offset: 3 data: 21 rows: 10 tail: 44 index: 71
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 25
    Stream: column 2 section ROW_INDEX start: 39 length 35
    Stream: column 1 section DATA start: 74 length 5
    Stream: column 2 section DATA start: 79 length 11
    Stream: column 2 section SECONDARY start: 90 length 5
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2

File length: 324 bytes
Padding length: 0 bytes
Padding ratio: 0%

The metadata of the ORC file that reads incorrectly is as follows:

File Version: 0.12 with FUTURE
Rows: 10
Compression: LZ4
Compression size: 262144
Type: struct<id:int,fee:decimal(38,18)>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 10 hasNull: false
    Column 1: count: 10 hasNull: false bytesOnDisk: 5 min: 10 max: 10 sum: 100
    Column 2: count: 10 hasNull: false bytesOnDisk: 21 min: 123.12 max: 123.12 sum: 1231.2

File Statistics:
  Column 0: count: 10 hasNull: false
  Column 1: count: 10 hasNull: false bytesOnDisk: 5 min: 10 max: 10 sum: 100
  Column 2: count: 10 hasNull: false bytesOnDisk: 21 min: 123.12 max: 123.12 sum: 1231.2

Stripes:
  Stripe: offset: 3 data: 26 rows: 10 tail: 59 index: 78
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 25
    Stream: column 2 section ROW_INDEX start: 39 length 42
    Stream: column 1 section DATA start: 81 length 5
    Stream: column 2 section DATA start: 86 length 16
    Stream: column 2 section SECONDARY start: 102 length 5
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2

File length: 371 bytes
Padding length: 0 bytes
Padding ratio: 0%
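
The two dumps differ in the declared type of fee: decimal(20,3) in the file that reads correctly, decimal(38,18) in the one that does not. A minimal sketch for spotting this kind of difference from the `Type:` lines is below. The helper names `decimal_columns` and `diff_decimal_types` are made up for illustration, and the regex only handles flat structs like the ones shown here, not nested types.

```python
import re

# Matches "name:decimal(p,s)" inside an ORC struct type string.
DECIMAL_RE = re.compile(r"(\w+):decimal\((\d+),(\d+)\)")


def decimal_columns(type_line):
    """Return {column: (precision, scale)} for every decimal field found."""
    return {name: (int(p), int(s)) for name, p, s in DECIMAL_RE.findall(type_line)}


def diff_decimal_types(expected, actual):
    """Return the decimal columns whose (precision, scale) differ between two types."""
    exp = decimal_columns(expected)
    act = decimal_columns(actual)
    return {col: (exp[col], act[col]) for col in exp if col in act and exp[col] != act[col]}


# Type lines copied from the two metadata dumps above.
good = "struct<id:int,fee:decimal(20,3)>"
bad = "struct<id:int,fee:decimal(38,18)>"

print(diff_decimal_types(good, bad))  # {'fee': ((20, 3), (38, 18))}
```

An empty result means the decimal declarations agree, so any remaining difference would have to come from somewhere else (for example the compression codec or writer version, which also differ between these two files).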


wgzhao commented Feb 6, 2024

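Further testing suggests the cause may be a precision mismatch. Doris can only read the column correctly when the precision defined for the column in the table matches the precision defined for the field in the ORC file. Otherwise the values come back wrong.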
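
If that is the cause, one way to avoid it is to make sure values are written with the same precision and scale the table declares (decimal(20,3) here). The sketch below only illustrates that normalization step in Python. `fit_decimal` is a hypothetical helper, not an Addax or Doris API, and the ROUND_HALF_UP rounding mode is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP


def fit_decimal(value, precision, scale):
    """Round value to `scale` fractional digits and check it fits decimal(precision, scale)."""
    quantum = Decimal(1).scaleb(-scale)  # scale=3 gives a quantum of 0.001
    fitted = Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)
    digits = len(fitted.as_tuple().digits)
    if digits > precision:
        raise ValueError(f"{value} needs {digits} digits, exceeds decimal({precision},{scale})")
    return fitted


# The table column is decimal(20,3), so 123.12 is normalized to three fractional digits.
print(fit_decimal("123.12", 20, 3))  # 123.120
```

This does not explain how Doris arrives at the wrong number. It only shows the writer-side check that keeps the value's shape aligned with the table definition.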
