
Add an API method to give us a streaming file object #29

Closed
dmsolow opened this issue Jan 29, 2019 · 21 comments · Fixed by #385
Labels: api: storage, type: feature request


@dmsolow

dmsolow commented Jan 29, 2019

It doesn't look like there's a way to get a streaming download from Google Cloud Storage in the Python API. We have download_to_file, download_as_string, and download_to_filename, but I don't see anything that returns a file-like object that can be streamed. This is a disadvantage for many file types, which can usefully be processed as they download.

Can a method like this be added?

@tseaver

tseaver commented Jan 29, 2019

@dmsolow Hmm, Blob.download_to_file takes a file object -- does that not suit your use case?

@dmsolow

dmsolow commented Jan 30, 2019

I don't think so. The situation is that it's often useful to start processing a file as it downloads instead of waiting until it's finished. For example, if there's a 1 GB CSV file in Google Cloud Storage, it should be possible to parse it line by line as it's downloaded.

It's fairly common for network libraries to offer this kind of functionality. For example in the standard urllib.request HTTP library:

import urllib.request
import csv
from io import TextIOWrapper

with urllib.request.urlopen('http://test.com/big.csv') as f:
    wrapped = TextIOWrapper(f) # decode from bytes to str
    reader = csv.reader(wrapped)
    for row in reader:
        print(row[0])

This parses the CSV as it's downloaded. I'd like to get the same functionality from google storage. If there's already a good way to do this with the current library, please let me know.

@tseaver

tseaver commented Jan 30, 2019

Hmm, you'll need to have the "stream consumer" running in a separate thread / process to do much good. You can make this work using Python's os.pipe: see this gist, which produces the following output:

$ bin/python pipe_test.py 
reader: start
reader: read one chunk
reader: read one chunk
...
reader: read one chunk
reader: read 800000 bytes
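
For reference, here's a minimal sketch of that pipe-plus-thread pattern (not the gist itself; the bucket and blob names are made up):

import os
import threading

from google.cloud.storage import Client

def stream_blob(blob, chunk_size=64 * 1024):
    read_fd, write_fd = os.pipe()

    def _writer():
        # download_to_file() calls write() on the pipe as each chunk arrives
        with os.fdopen(write_fd, "wb") as write_end:
            blob.download_to_file(write_end)

    thread = threading.Thread(target=_writer)
    thread.start()

    total = 0
    with os.fdopen(read_fd, "rb") as read_end:
        for chunk in iter(lambda: read_end.read(chunk_size), b""):
            total += len(chunk)
            print("reader: read one chunk")
    thread.join()
    print("reader: read %d bytes" % total)

blob = Client().bucket("my_bucket_name").blob("big.csv")
stream_blob(blob)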

@dmsolow

dmsolow commented Jan 30, 2019

Using a separate thread kind of feels like a hack to me, but it is surely one way to do it. I think the ability to do this without using extra threads would be widely useful, but idk how hard it would be to implement.

@tseaver

tseaver commented Jan 31, 2019

OK, looking at the underlying implementation in google-resumable-media, all that we actually expect of the file object is that it has a write method, which is then passed each chunk as it is downloaded.

You could therefore pass in an instance of your own class which wraps the underlying stream, e.g.:

from google.cloud.storage import Client

class ChunkParser(object):

    def __init__(self, fileobj):
        self._fileobj = fileobj

    def write(self, chunk):
        self._fileobj.write(chunk)
        self._do_something_with(chunk)

client = Client()
bucket = client.get_bucket('my_bucket_name')
blob = bucket.blob('my_blob.xml')

with open('my_blob.xml', 'wb') as blob_file:
    parser = ChunkParser(blob_file)
    blob.download_to_file(parser)
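
The wrapper doesn't have to write to disk at all; for instance (a hedged sketch along the same lines, with made-up bucket and object names), chunks can be fed straight to an in-memory consumer:

from google.cloud.storage import Client

class LineCounter(object):
    """Write-only sink that counts newlines as chunks are downloaded."""

    def __init__(self):
        self.lines = 0

    def write(self, chunk):
        self.lines += chunk.count(b"\n")

client = Client()
blob = client.get_bucket('my_bucket_name').blob('big.csv')

sink = LineCounter()
blob.download_to_file(sink)
print(sink.lines)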

@yan-hic

yan-hic commented Feb 18, 2019

This was requested many times but was at some point turned down (googleapis/google-cloud-python#3903)

As an alternative, one can use the gcsfs library, which supports file-like objects for both read and write.
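
E.g. (a quick sketch; the project, bucket, and object names are made up):

import gcsfs

fs = gcsfs.GCSFileSystem(project="my-project")

# stream-read an object through a file-like interface
with fs.open("my_bucket/big.csv", "r") as f:
    for line in f:
        print(line, end="")

# stream-write an object the same way
with fs.open("my_bucket/out.txt", "wb") as f:
    f.write(b"hello world\n")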

@dmsolow

dmsolow commented Feb 19, 2019

It's a shame that this was turned down. It's a feature that every Python dev is going to expect from a library like this, as evidenced by the fact that it keeps coming up.

@akuzminsky

> Hmm, you'll need to have the "stream consumer" running in a separate thread / process to do much good.

Unfortunately this doesn't work with uploading streams.
https://github.com/googleapis/google-cloud-python/blob/master/storage/google/cloud/storage/blob.py#L1160 reports the size of a pipe as zero. As a result the pipe never empties, and the child process eventually blocks writing to it.

Are there known workarounds?

@tseaver

tseaver commented Apr 16, 2019

@akuzminsky The line you've linked to is in the implementation of Blob.upload_from_filename. This issue is about being able to process downloaded chunks before the download completes.

@dmsolow Does my file-emulating wrapper class solution work for you?

@dmsolow

dmsolow commented Apr 16, 2019

@tseaver No. I would like something that is a "file-like object." This means something that supports standard Python io methods like readline, next, and read. Maybe that object buffers chunks under the hood, but it should essentially be indistinguishable from the file object returned by the built-in open function.
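
Something like this hypothetical usage is what I have in mind (the open() method here is made up; it does not exist in the library today):

from google.cloud.storage import Client

blob = Client().bucket("my_bucket").blob("big.csv")

with blob.open("rt") as f:      # hypothetical file-like handle
    header = f.readline()       # readline() just works
    for line in f:              # and so does plain iteration
        print(line, end="")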

@thnee

thnee commented Jun 11, 2019

I was really surprised to see that not only is this feature not available, but it also has been brought up and closed in the past. It seems like an obvious and important feature to have.

Fortunately, gcsfs works really well as a substitute, but it's a little awkward to have to use a second library for such core functionality.

But gcsfs does not support setting Content-Type, so I end up having to first upload the file using gcsfs, and then call gsutil setmeta via subprocess to set it after the file has been uploaded. This takes extra time and is brittle; it is more of a workaround than a solution.

@yan-hic

yan-hic commented Jun 11, 2019

@thnee you should check back: gcsfs has the setxattrs() method to set metadata, including content-type.

@ElliotSilver

The lack of a simple streaming interface is a challenge when implementing a cloud function that reads/writes large files. I need the ability to read an object in from Cloud Storage, manipulate it, and write it out to another object. Since the only filestore available to GCF is /tmp, which lives in the function's memory space, you are limited to files smaller than 2 GB.

@IlyaFaer

IlyaFaer commented Jun 25, 2019

Well, if this new method is so much wanted, I'd propose a solution: a class that inherits from FileIO. It initializes a ChunkedDownload in an instance attribute, and then on every read() call it consumes the next chunk and returns it (with some variations, as seek() will work in that class, and so will flush()). A new Blob method would initialize this object and return it to the user.

Looks like it'll work, because (as far as I know) most file methods work through read(), so overriding it should do the trick. I've already rough-coded this and tried some tests, and it worked. And it's compact.
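
A rough sketch of the idea (not the eventual implementation; BlobReader is a made-up name, and instead of ChunkedDownload this sketch uses ranged downloads, assuming a library version whose Blob.download_as_bytes() accepts start/end offsets):

import io


class BlobReader(io.RawIOBase):
    """Pull-style, file-like reader that fetches byte ranges on demand."""

    def __init__(self, blob):
        self._blob = blob
        self._pos = 0
        self._size = blob.size  # requires blob metadata (e.g. via get_blob/reload)

    def readable(self):
        return True

    def readinto(self, b):
        if self._pos >= self._size:
            return 0  # EOF
        end = min(self._pos + len(b), self._size) - 1  # end offset is inclusive
        data = self._blob.download_as_bytes(start=self._pos, end=end)
        n = len(data)
        b[:n] = data
        self._pos += n
        return n


# read(), readline(), and iteration then come for free from the io machinery:
# reader = io.TextIOWrapper(io.BufferedReader(BlobReader(blob)))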

@IlyaFaer IlyaFaer self-assigned this Aug 2, 2019
@olejorgenb

TensorFlow has an implementation that gives a file-like object for GCS blobs: https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile

Not sure whether it actually streams, though.

@petedannemann

petedannemann commented Jan 27, 2020

smart_open now has support for streaming files to/from GCS.

from smart_open import open

# stream from GCS
for line in open('gs://my_bucket/my_file.txt'):
    print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

@rocketbitz

@petedannemann great work - any ETA for an official release?

@petedannemann

@rocketbitz No idea, but for now you could install from GitHub:

pip install git+https://github.com/RaRe-Technologies/smart_open

@crwilcox crwilcox transferred this issue from googleapis/google-cloud-python Jan 31, 2020
@xbrianh

xbrianh commented Feb 25, 2020

I've implemented gs-chunked-io to satisfy my own needs for GCS read/write streams. It's designed to complement the Google Python API.

import gs_chunked_io as gscio
from google.cloud.storage import Client

bucket = Client().bucket("my-bucket")
blob = bucket.get_blob("my-key")

# read
with gscio.Reader(blob) as fh:
    fh.read(size)

# read in background
with gscio.AsyncReader(blob) as fh:
    fh.read(size)

# write
with gscio.Writer("my_new_key", bucket) as fh:
    fh.write(data)

justindujardin added a commit to justindujardin/pathy that referenced this issue Mar 13, 2020
 - annoyingly GCS doesn't support file-like objects: googleapis/python-storage#29
 - use a small library for doing file-like object support for GCS: https://github.com/xbrianh/gs-chunked-io
@petedannemann

> @petedannemann great work - any ETA for an official release?

Release 1.10 last night included GCS functionality

@tseaver tseaver changed the title Storage: add an API method to give us a streaming file object Add an API method to give us a streaming file object Aug 17, 2020
@abhipn

abhipn commented Nov 13, 2020

Any update on this?
