upgrade opening netcdf files method #78

Open
kwhitehall opened this issue Jul 20, 2016 · 2 comments

@kwhitehall

There was a suggestion to extend the NetCDF file read methods, e.g. NetcdfDFSFile, so that each partition holds the longest possible time series after the read. Initially, a flag could be added that would (1) ensure the URI list is sorted by the time encoded in the filenames, and (2) invoke a private method to sort the list on each partition ahead of any other transformations.
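As a rough illustration of step (1), here is a minimal sketch of sorting a URI list by a timestamp embedded in the filename. `UriTimeSort`, `timeKey`, and `sortByTime` are hypothetical names, not part of the existing read methods, and the naming convention (the first run of 8 to 14 digits in the filename is a timestamp such as yyyyMMddHH) is an assumption. It works on a plain `Seq[String]` and does not touch Spark.

```scala
// Hypothetical helper: not part of the existing NetcdfDFSFile code.
object UriTimeSort {
  // Assumed naming convention: the first run of 8 to 14 digits in the
  // file name is a timestamp such as yyyyMMdd or yyyyMMddHHmm.
  private val Stamp = """\d{8,14}""".r

  // Extract the timestamp digits from the last path segment, if any.
  def timeKey(uri: String): Option[String] = {
    val name = uri.substring(uri.lastIndexOf('/') + 1)
    Stamp.findFirstIn(name)
  }

  // Sort by timestamp; URIs with no recognizable timestamp go last.
  // Keys are right-padded with zeros so that string order matches
  // chronological order even when timestamps differ in precision.
  def sortByTime(uris: Seq[String]): Seq[String] =
    uris.sortBy { u =>
      timeKey(u) match {
        case Some(k) => (0, k.padTo(14, '0'))
        case None    => (1, u)
      }
    }
}
```

Since only the URI strings are sorted, this could run on the driver before the list is parallelized, or be applied per partition as in step (2).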

@rahulpalamuttam

How about not just having it sorted, but also paired?
We could then go from RDD[(URI, Next URI)] => RDD[(SciTensor, Next SciTensor)]

We would only be shuffling small URI strings when we sort and pair, and would not be shuffling the array bundle.
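A minimal sketch of the pairing idea, assuming the URI list has already been sorted by time and is small enough to pair locally before it is turned into an RDD. `UriPairing` and `pairWithNext` are hypothetical names used only for illustration.

```scala
// Hypothetical helper for illustration only.
object UriPairing {
  // Pair each URI with its successor in an already time-sorted list.
  // A list of n URIs yields n - 1 pairs; the last URI has no successor.
  def pairWithNext(sorted: Seq[String]): Seq[(String, String)] =
    sorted.zip(sorted.drop(1))
}
```

Only the URI strings are moved at this stage; the arrays themselves would be loaded later, when each pair is mapped to its tensors.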

@rahulpalamuttam

Actually, pairing up the URIs brings in the extra overhead of reading each HDFS file twice.
Let's just sort the URIs and read each file once.
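To make the overhead concrete, here is a small sketch that counts reads under both approaches. `CountingReader` is a hypothetical stand-in for the real HDFS read and only counts calls. It assumes the pairs are loaded naively, with one read per URI in each pair and no caching.

```scala
// Hypothetical stand-ins for illustration; no HDFS access happens here.
object ReadCount {
  // Stand-in for an expensive HDFS read; it only counts invocations.
  final class CountingReader {
    var reads = 0
    def read(uri: String): String = { reads += 1; s"tensor($uri)" }
  }

  // Reads triggered by mapping a loader over (URI, next URI) pairs.
  def readsWhenPaired(sorted: Seq[String]): Int = {
    val r = new CountingReader
    sorted.zip(sorted.drop(1)).foreach { case (a, b) => r.read(a); r.read(b) }
    r.reads
  }

  // Reads triggered by loading each sorted URI exactly once.
  def readsWhenSorted(sorted: Seq[String]): Int = {
    val r = new CountingReader
    sorted.foreach(u => r.read(u))
    r.reads
  }
}
```

For n sorted URIs, the paired approach performs 2(n - 1) reads, because every file except the first and last appears in two pairs, while the sort-only approach performs n.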
