Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add kwarg vsi to RasterDataset to support GDAL VSI #1399

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

adriantre
Copy link
Contributor

@adriantre adriantre commented Jun 5, 2023

Fix #1398
Fix #1165

This PR remove blockers for reading raster files directly from Cloud Storage (buckets) and other GDAL Virtual File Systems.

@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Jun 5, 2023
@adriantre
Copy link
Contributor Author

@microsoft-github-policy-service agree company="vake"

@adamjstewart
Copy link
Collaborator

Hey, sorry it took us so long to review this! We're normally much more responsive to issues/PRs but we just finished a paper deadline.

This looks great! We've been wanting better support for cloud (AWS/GCP/Azure) for a long time. The biggest thing holding that up is my own perfectionism (what about VectorDataset or NonGeoDataset?). There isn't an easy way to add support for cloud for every dataset without refactoring the entire library. I was considering doing this at the same time as porting to TorchData, but we're still on the fence about when or if that will actually happen. So in my opinion, starting with Raster/VectorDataset for now and worrying about the others should be fine.

In terms of the current implementation, here are my preliminary thoughts:

  1. Is it possible to use the same logic for VectorDataset?
  2. Can VSI vs. non-VSI be detected based on whether or not root starts with /vsi? Or would it start with https://? We can check for /vsizip/, /vsitar/, etc. if we need to be more explicit. I can't imagine too many users will happen to have a /vsi* directory on their root filesystem, that's the only issue I can think of.
  3. The listdir stuff seems fragile, especially the error handling. What happens if the URL/path is not actually a real VSI filesystem? Does it return an invalid root directory? Is this only a fiona feature or does rasterio have the same feature?
  4. Does this work for all cloud providers? Some of us may have access to others if you need help checking this.
  5. Testing this will be fun. For the normal tests that run on all PRs, we try to avoid any tests that require internet access. So if we can create a local VSI file to test this that would be ideal. We could potentially add tests that only get run during the release process if we want to check remote cloud providers, but idk how stable those would be.

@calebrob6 will likely want to check this to make sure this works for him on Azure/Planetary Computer.

P.S. Thanks for the PR! This would be a fantastic first contribution.

@adamjstewart adamjstewart added this to the 0.5.0 milestone Jun 9, 2023
@adriantre
Copy link
Contributor Author

adriantre commented Jun 19, 2023

There isn't an easy way to add support for cloud for every dataset without refactoring the entire library.

As Sean mentioned in my Fiona feature request maybe libcloud is a good option. But we could start with the simple fiona.listdir and if users report incompatibilities we can put more efforts into the "perfect" solution?

  1. Is it possible to use the same logic for VectorDataset?

Looks like the behaviour will be the same. Everything that can be opened using fiona/rasterio/gdal will support vsi.

  1. Can VSI vs. non-VSI be detected based on whether or not root starts with /vsi? Or would it start with https://? We can check for /vsizip/, /vsitar/, etc. if we need to be more explicit. I can't imagine too many users will happen to have a /vsi* directory on their root filesystem, that's the only issue I can think of.

fiona.listdir will work for local filesystems as well. Root will start with /vsi. If it is a problem that users for some reason create a local folder /vsi then TorchGeo datasets already have this problem as they read files using fiona and rasterio. We can use is_remote or valid_vsi from fiona instead of the vsi kwarg in my suggestion.

  1. The listdir stuff seems fragile, especially the error handling. What happens if the URL/path is not actually a real VSI filesystem? Does it return an invalid root directory? Is this only a fiona feature or does rasterio have the same feature?

Looks like rasterio does not have this method. Do you think the above mentioned valid_vsi will be sufficient regarding teh fragility?

  1. Does this work for all cloud providers? Some of us may have access to others if you need help checking this.

This (should) work with the cloud providers listen in the gdal docs. fiona.listdir use method VSIReadDir from gdal/ogr internally, which I assume is thoroughly tested. I have tested my implementation on Azure and GS.

  1. Testing this will be fun. For the normal tests that run on all PRs, we try to avoid any tests that require internet access.

Looks like moto can mock virtual file systems.

@adamjstewart
Copy link
Collaborator

We're trying to limit additional dependencies but I'll keep libcloud and moto in mind.

I agree that we should start with a simple implementation and worry about making it "perfect" later. Especially since it wouldn't be an API change if we modify the implementation later.

valid_vsi looks good to me, let's use that and remove the user-defined parameter.

@isaaccorley
Copy link
Collaborator

isaaccorley commented Jun 19, 2023

We had previously discussed possibly making a from_s3 or from_files method instead so a user could instantiate a dataset like dataset = RasterDataset.from_files(...). This would also reduce code in the constructor. Do we want to take this approach instead?

This also has the benefit of making the dependencies optional and only imported inside the method.

@adamjstewart
Copy link
Collaborator

We should definitely have a RasterDataset.from_files classmethod constructor, @calebrob6 has been asking for that for years.

This is somewhat orthogonal to this PR. We should add from_files in a separate PR. Then we can decide whether we want root to support both local and remote file systems or just tell users to use from_files.

@adriantre would you be interested in implementing from_files? It would share most of the code with __init__ (we can move it to a subfunction) but accept a list of files instead of a single directory. I'll open an issue so we can track this, it should be relatively easy. Also let me know what you think about requiring from_files for remote file systems or not.

@adriantre
Copy link
Contributor Author

I will give it a shot. I'm initially thinking that, analogously to rasterio.open, we shouldn't need to have different constructors for remote or local files. We may look to e.g. pandas.read_file to see how they handle imports and drivers for different file types.

@adriantre
Copy link
Contributor Author

The vsi can be passed in two ways. E.g. for azure:

  1. /vsiaz/my/root is OGR VSI
  2. az://my/root is Apache` Common VFS

Looks like 2 is preferred, at least somewhat documented, although internally 2. is converted to 1. before it is passed to the reader.

For either we would need to use fiona.listdir (or alternatives) to list the contents.

I realised that we cannot use valid_vsi from fiona. It only supports 2. and is not up-to-date with all supported vsi.

Lastly, another vsi complicating the matter is zip-archives, which fiona.listdir supports to a lesser degree. It works only if the root is the actual zip-archive, so my implementation listdir_vsi_recursive won't be sufficient if one wants all tif's in all zip-archives within the root.

I'm starting to feel like RasterDataset.from_files is the most flexible way of solving this. But at the same time it feels like it should be possible to make it work for root-directory discovery as well, but we would have to narrow the scope somehow.

Thoughts?

@adamjstewart
Copy link
Collaborator

It's certainly possible, but as you mention, the scope makes it difficult. I think from_files should be added regardless. If someone asks for native support in __init__, we can consider adding it, but let's see if from_files is sufficient for most people for now. We'll also want to document it so people know it actually exists. Maybe add a new tutorial. Not required for the initial PR unless you're really gung ho about it.

@calebrob6
Copy link
Member

I think .from_uris is more what I was thinking about, then the user could pass vsi strings, https links to COGs, local paths, etc.

@adriantre adriantre force-pushed the feature/support_gdal_virtual_file_systems branch from 3237d5f to 6624dee Compare June 22, 2023 14:15
@adriantre
Copy link
Contributor Author

adriantre commented Jun 22, 2023

I have now refactored this and rebased on #1442, so all changes from that PR is reflected here until that is merged.

@adriantre adriantre force-pushed the feature/support_gdal_virtual_file_systems branch from 6624dee to 7ca3cb7 Compare June 23, 2023 12:21
@adriantre
Copy link
Contributor Author

Is now outdated wrt. to #1442 but still works as proof of concept. Any feedback?

from rasterio._path import SCHEMES

prefix = path.split("://")[0]
schemes = prefix.split("+")
Copy link
Contributor Author

@adriantre adriantre Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This, because vsi can be chained, e.g. zip+az://... for zip-files in azure.

@adamjstewart
Copy link
Collaborator

Any feedback?

I'm hesitant to add too much complexity if #1442 supports something similar. I would say let's keep pushing on #1442 and once that's done we can decide whether this is functionality we need or whether the user should be responsible for calling this function themselves and passing the input as a list of paths to the dataset.

@adriantre
Copy link
Contributor Author

adriantre commented Jun 29, 2023

All right, I have an idea. In #1442 I have an else statement in list_files that hits if the path is not a local path. There we can call a method that raises until the user overrides it with their own logic.

def handle_nonlocal_path(self, path: str) -> set[str]:

This at least remove the blockers / need of overriding the whole __init__.

@adamjstewart
Copy link
Collaborator

Is that logic needed for remote files or only for remote directories? I was under the impression that if a user implemented a method to find remote files, then GDAL could open those without issue.

@adriantre
Copy link
Contributor Author

Indeed, gdal can open them if they reach rasterio.open. But with my implementation of allowing list of files, the code checks os.exists, which it does not. So we need to handle them separately, or not check if it exists. Thoughts?

@adamjstewart
Copy link
Collaborator

We don't need to check os.exists. If it exists, we can check if it's a directory. If it's a directory, we recursively search. If it's a file or doesn't exist, we assume it's a path and proceed as normal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support reading from GDAL virtual file systems (e.g. cloud storage) Reading S3File using RasterDataset
4 participants