Stocator should create folder names with a trailing '/' in IBM COS #210

Open
mariobriggs opened this issue Jun 10, 2019 · 10 comments

@mariobriggs

I am using Stocator via Spark to write a DataFrame to IBM COS:

df.write.parquet("cos://mybucket.service/tpcds/call_center")

In the above call, Stocator creates the folder 'call_center' in IBM COS. However, Stocator does not create the folder name with a trailing '/', which breaks reading these IBM COS folders with other tools such as Alluxio and CyberDuck.

Below is an example from the CyberDuck UI. Notice that the folder 'call_center' is also listed as a 0-byte file.

[image: CyberDuck listing in which 'call_center' also appears as a 0-byte file]

Browsing through the Stocator code, I see that the code to create the folder name with a trailing '/' is commented out. Using a build where it is uncommented solved the issue.

Looking forward to a fix.
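
To illustrate why the trailing '/' matters to tools that browse with a delimiter, here is a minimal sketch. It is a simplified model of S3-style delimiter listing, not Stocator or CyberDuck code; the function name `list_with_delimiter` and the object names under `call_center` are hypothetical.

```python
def list_with_delimiter(keys, prefix="", delimiter="/"):
    """Simplified model of a delimiter-based listing: returns (files, folders)."""
    files, folders = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything below the next delimiter is rolled up into one "folder".
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            # No further delimiter: the key is shown as a plain file.
            files.append(key)
    return files, sorted(folders)


# Hypothetical data objects written under the output path.
data = ["tpcds/call_center/_SUCCESS", "tpcds/call_center/part-1-xx"]

# Marker without a trailing slash: shows up as a file AND as a folder.
no_slash = list_with_delimiter(["tpcds/call_center"] + data, prefix="tpcds/")

# Marker with a trailing slash: folded into the folder entry only.
with_slash = list_with_delimiter(["tpcds/call_center/"] + data, prefix="tpcds/")

print(no_slash)
print(with_slash)
```

Under this model, the marker without the slash is reported both as a file and as a folder of the same name, which matches the duplicate entry described above.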

@mariobriggs mariobriggs changed the title Stocator should create folder names with a trailing '/' Stocator should create folder names with a trailing '/' in COS Jun 10, 2019
@mariobriggs mariobriggs changed the title Stocator should create folder names with a trailing '/' in COS Stocator should create folder names with a trailing '/' in IBM COS Jun 10, 2019
@gilv gilv self-assigned this Jun 12, 2019
@gilv
Contributor

gilv commented Jun 12, 2019

@mariobriggs I will handle this. Thanks

@kozchris

This issue is also breaking our Apache Spark reads of part files. When Apache Spark writes the part files, it creates a 0-byte directory file with no trailing slash. When we add the trailing slash to the directory file that gets created, the reads work again.

@kozchris

@gilv how is the progress coming on a fix?

@rpatel17

I am also seeing this as a problem in our project. Thanks @gilv for looking into it.

@robin-sun

Is this issue fixed now after 16 months? I am still seeing an empty file being created.

@gilv
Contributor

gilv commented Dec 15, 2020

@robin-sun why is an empty file a problem? If you write a "foo" file with Stocator via Spark, the result will be:

foo
foo/_SUCCESS
foo/part-1-xx
foo/part-2-xx
etc.

You can then use Spark to read "foo" again and it all works. If you list the object storage via the CLI, you will see the empty files "foo" and "foo/_SUCCESS". Why is this a problem?

@robin-sun

Hi Gil,
This is causing errors when downloading the whole parent folder to a Windows OS, as Windows doesn't support a file and a folder with the same name. I have to download the output folders one by one.

But I guess the real question is: why do we need an empty file if it is not used or useful at all?
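
The name collision described here can be detected before a download starts. The following is a rough sketch, assuming the bucket listing is already available as `(key, size)` pairs; the helper name `find_colliding_markers` and the object sizes are made up for illustration.

```python
def find_colliding_markers(entries):
    """Return zero-byte keys that are also the 'folder' prefix of other keys."""
    keys = [key for key, _size in entries]
    colliding = []
    for key, size in entries:
        # A marker collides when some other key lives "inside" it.
        if size == 0 and any(other.startswith(key + "/") for other in keys):
            colliding.append(key)
    return colliding


# Listing shaped like the "foo" example earlier in the thread (sizes are hypothetical).
listing = [
    ("foo", 0),
    ("foo/_SUCCESS", 0),
    ("foo/part-1-xx", 1024),
    ("foo/part-2-xx", 2048),
]

print(find_colliding_markers(listing))
```

Only "foo" is reported: "foo/_SUCCESS" is also empty, but nothing is stored beneath it, so it would not clash with a folder name on the local file system.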

@mariobriggs
Author

mariobriggs commented Dec 16, 2020 via email

@robin-sun

Hi Mario/Gil

Could you help me understand, why do we need an empty file there?

@gilv
Contributor

gilv commented Dec 22, 2020

@mariobriggs @robin-sun using an empty object to simulate a folder in object storage was not invented by Stocator; it is used in other Big Data systems as well. It is the easiest way for the Hadoop ecosystem to mark a "folder". So this approach does indeed have compatibility issues with Windows. We need the empty object because it carries Stocator-specific metadata. If you just need to download all the data created by Stocator to Windows, write a script that ignores empty objects.
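
A script along the lines suggested above might start from something like this sketch. It only builds the download plan from a listing of `(key, size)` pairs; fetching the objects would need a COS/S3 client, which is not shown. The name `plan_download`, the destination `C:/data`, and the sizes are assumptions for illustration.

```python
from pathlib import PureWindowsPath


def plan_download(entries, dest):
    """Map each non-empty object key to a Windows-style local path."""
    plan = []
    for key, size in entries:
        if size == 0:
            # Skip empty objects, including the "folder" marker itself.
            continue
        local_path = PureWindowsPath(dest, *key.split("/"))
        plan.append((key, str(local_path)))
    return plan


# Hypothetical listing matching the "foo" layout from the thread.
listing = [
    ("foo", 0),
    ("foo/_SUCCESS", 0),
    ("foo/part-1-xx", 1024),
    ("foo/part-2-xx", 2048),
]

plan = plan_download(listing, "C:/data")
for key, path in plan:
    print(key, "->", path)
```

Note that skipping every empty object also skips "foo/_SUCCESS", which the listing above shows as empty. If that file is needed locally, the filter would have to be narrowed to markers that collide with a folder name.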
