Column names are not read for ORC tables #1556

Open
lucharo opened this issue Jun 1, 2021 · 5 comments

Comments

@lucharo

lucharo commented Jun 1, 2021

Is your feature request related to a problem? Please describe.
Hi! I mostly work with Parquet and CSV files, but there are also some ORC files in the database I use. I've noticed that the column names of my ORC tables are not inferred; instead they default to _col0, _col1, _col2, and so on. Additionally, the names argument is only available when reading (or creating) CSV tables with create_table, so I am not able to set the column names manually. My tables look like this when read with BlazingSQL:
image

Describe the solution you'd like
I would like a names argument when file_format is set to orc in create_table: (https://github.com/BlazingDB/blazingsql/blob/92ed45f5af438fedc8cad82e4ef8ed3f3fb7eed6/docsrc/source/reference/python/tables/apache-orc.rst)

----For BlazingSQL Developers----
How and where should this be implemented?
In what part of the code should the feature be implemented? What should the APIs and/or classes look like?

Other design considerations
What components of the engine could be affected by this? What functions should we make sure we use/reuse?

Testing considerations?
What sort of unit tests and/or end-to-end tests should be implemented to test this?

@lucharo lucharo added the ? - Needs Triage needs team to review and classify label Jun 1, 2021
@lucharo
Author

lucharo commented Jun 1, 2021

According to this Hive issue, this is a problem with ORC tables created through Hive. Given that the issue was reported in 2016 and is still open, I think it would be great to be able to assign column names manually, on the fly, through the names argument for ORC files.

@felipeblazing felipeblazing removed the ? - Needs Triage needs team to review and classify label Jun 4, 2021
@felipeblazing
Contributor

This isn't a high-priority issue for us right now, but you could accomplish this yourself by doing something like:

bc.create_table("table_name", bc.sql("select _col0 as name1, _col1 as name2 from hive_table"))
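
For tables with many columns, writing the aliases by hand gets tedious. The following is a minimal sketch of how the query string could be generated, assuming the columns arrive as _col0, _col1, ... as described in the issue. The helper build_rename_query is hypothetical and not part of BlazingSQL; it only builds the SQL text.

```python
def build_rename_query(table, new_names, prefix="_col"):
    """Build a SELECT that aliases <prefix>0, <prefix>1, ... to new_names.

    Hypothetical helper (not a BlazingSQL API). Assumes the default
    column names follow the _col0, _col1, ... pattern from the issue.
    """
    aliases = ", ".join(
        f"{prefix}{i} as {name}" for i, name in enumerate(new_names)
    )
    return f"select {aliases} from {table}"


query = build_rename_query("hive_table", ["name1", "name2"])
print(query)

# With an existing BlazingContext `bc`, the result could then be
# registered as a new table, as in the one-liner above:
# bc.create_table("table_name", bc.sql(query))
```

The names passed in are interpolated directly into the SQL string, so this sketch is only suitable for trusted, simple identifiers.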

@wmalpica
Contributor

wmalpica commented Jun 4, 2021

You could also use our Hive connection API. I believe that when we implemented this, we took into consideration the Hive issue you mention.
https://docs.blazingsql.com/reference/python/tables/apache-hive.html

@lucharo
Author

lucharo commented Jun 7, 2021

Thanks for the swift replies as always! Regarding the hive connection, have you seen any performance difference between using the HDFS connector vs the Hive connector? @williamBlazing

@lucharo
Author

lucharo commented Nov 23, 2021

Thanks for the swift replies as always! Regarding the hive connection, have you seen any performance difference between using the HDFS connector vs the Hive connector? @williamBlazing

@wmalpica could you please follow up on this?
