fix: allow files in directories to be downloaded onto local machine #2199

Open
wants to merge 11 commits into
base: main

@hochoy hochoy commented May 18, 2023

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary) (n/a)

Fixes #2200 🦕

@product-auto-label product-auto-label bot added size: s Pull request size is small. api: storage Issues related to the googleapis/nodejs-storage API. labels May 18, 2023
@ddelgrosso1 ddelgrosso1 added the owlbot:run Add this label to trigger the Owlbot post processor. label May 18, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label May 18, 2023
@ddelgrosso1 ddelgrosso1 added kokoro:force-run Add this label to force Kokoro to re-run the tests. owlbot:run Add this label to trigger the Owlbot post processor. labels May 18, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label May 18, 2023
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 18, 2023
@hochoy hochoy marked this pull request as ready for review May 18, 2023 20:20
@hochoy hochoy requested review from a team as code owners May 18, 2023 20:20
@hochoy
Author

hochoy commented May 18, 2023

@ddelgrosso1 , I tried running samples-test and system-test but based on the logs, it appears those require an active project in order to do an e2e/live test against real GCP infrastructure.

Is that accurate?
How expensive is that if I run it on my own project? Should I be triggering it here in the repo instead (and leverage your repo's cloudbuild/github actions infra)?

@hochoy
Author

hochoy commented May 18, 2023

Also, I would like to add a test to check that the files are truly downloaded into the local machine or memory.
Would you recommend I give that a shot? Or is that overkill?

src/transfer-manager.ts (outdated, resolved)
@ddelgrosso1
Contributor

> @ddelgrosso1 , I tried running samples-test and system-test but based on the logs, it appears those require an active project in order to do an e2e/live test against real GCP infrastructure.
>
> Is that accurate? How expensive is that if I run it on my own project? Should I be triggering it here in the repo instead (and leverage your repo's cloudbuild/github actions infra)?

That is accurate. I wouldn't worry about running these on your own, they get run in the CI pipeline each time a commit is pushed. However, unit tests should work without issue locally.

> Also, I would like to add a test to check that the files are truly downloaded into the local machine or memory. Would you recommend I give that a shot? Or is that overkill?

If you feel up to it, a unit test can probably be created to test this. I can look to see if we have any similar tests elsewhere that might serve as a guide.

One thing I will do is clean up the JSDoc comments to make it abundantly clear that not supplying a prefix will result in the files being downloaded to memory.

@product-auto-label product-auto-label bot added size: xs Pull request size is extra small. and removed size: s Pull request size is small. labels May 28, 2023
src/file.ts (resolved)
@product-auto-label product-auto-label bot added size: s Pull request size is small. and removed size: xs Pull request size is extra small. labels May 29, 2023
@product-auto-label product-auto-label bot added size: m Pull request size is medium. and removed size: s Pull request size is small. labels May 29, 2023
Comment on lines +2756 to +2758
    done();
  } catch (e) {
    done(e);
Author

Any assert(false) will throw an Error, meaning done() never gets called and the test times out. Adding the try/catch guarantees that done() gets called. This is similar to what is described in this Stack Overflow discussion: https://stackoverflow.com/questions/66461468/mocha-test-false-assert-timeouts

I'm a bit surprised that this doesn't cause issues in the fs.readFile callback assertions in the other tests.
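
The failure mode described above can be reproduced without Mocha. The following is a minimal sketch, assuming only a Mocha-style `done` callback; `withDone` is a hypothetical helper name and is not part of Mocha or of this PR. Unlike the diff, it calls `done()` outside the `try` block so that `done` can never be invoked twice.

```typescript
import {strictEqual} from 'node:assert';

// Mocha-style completion callback: calling it with no argument means success.
type Done = (err?: unknown) => void;

// Hypothetical helper (not part of Mocha or nodejs-storage): runs the
// assertions in `body` and always calls `done`, forwarding a thrown
// assertion error instead of letting it escape the callback.
function withDone<T extends unknown[]>(
  done: Done,
  body: (...args: T) => void
): (...args: T) => void {
  return (...args: T) => {
    try {
      body(...args);
    } catch (e) {
      done(e); // report the failure right away rather than timing out
      return;
    }
    done();
  };
}

// Simulated callback result: the assertion fails, yet `done` still
// receives the error.
let reported: unknown = undefined;
const callback = withDone(
  err => {
    reported = err;
  },
  (contents: string) => {
    strictEqual(contents, 'expected');
  }
);
callback('actual');
console.log(reported instanceof Error); // true
```

In a real test the wrapped function would be passed wherever the callback goes, with Mocha's own `done` as the first argument.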

Contributor

Interesting find, let me dig into this a little bit more.

test/file.ts (outdated, resolved)
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label May 31, 2023
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 31, 2023
@hochoy hochoy requested a review from ddelgrosso1 June 3, 2023 16:44
@ddelgrosso1 ddelgrosso1 added the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 5, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 5, 2023
@ddelgrosso1 ddelgrosso1 added kokoro:force-run Add this label to force Kokoro to re-run the tests. owlbot:run Add this label to trigger the Owlbot post processor. labels Jun 5, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 5, 2023
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jun 5, 2023
Contributor

@ddelgrosso1 ddelgrosso1 left a comment

LGTM

@ddelgrosso1
Contributor

@danielbankhead would you mind just giving this a second set of eyes?

Member

@danielbankhead danielbankhead left a comment

Looks great! I think we can make it better for File#download customers by moving the logic to TransferManager#downloadManyFiles.

Comment on lines +2103 to +2107
if (
  destination &&
  (destination.endsWith('/') || destination.endsWith('\\'))
) {
  callback?.(null, Buffer.alloc(0));
Member

I think this could be problematic as / could be a valid ending character for an object name: https://cloud.google.com/storage/docs/objects

Customers, namely outside of the Transfer Manager flow, could face issues.

Author

@hochoy hochoy Jun 9, 2023

Hey @danielbankhead, thanks for the review. The issue with object names ending with / is that while they may be valid on GCS, they can't actually be written on a Linux-based filesystem. I don't have a Windows machine to try it on. Perhaps you know of an edge case where names ending with / can actually be written.

[Screenshot: Screen Shot 2023-06-09 at 10 26 27 AM]

Author

@hochoy hochoy Jun 9, 2023

The scenario you described is valid when destination is unset/falsy: a Buffer gets created and returned for the GCS object ending with /. That behaviour remains unchanged in this PR.
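
To make the three cases in this thread concrete, here is a small sketch that restates the check from the diff as a pure function. The function and type names are illustrative only and do not exist in nodejs-storage; the sketch assumes the behaviour is exactly what the diff and the comment above describe.

```typescript
// What a single download call would do, given its `destination` option.
type DownloadTarget = 'memory' | 'skip-directory-placeholder' | 'file';

// Illustrative restatement of the check in this PR's diff (names are
// hypothetical, not part of the library).
function classifyDestination(destination?: string): DownloadTarget {
  if (!destination) {
    // No destination: the contents are returned as a Buffer, even for an
    // object whose name ends with "/".
    return 'memory';
  }
  if (destination.endsWith('/') || destination.endsWith('\\')) {
    // A trailing separator cannot be written as a local file.
    return 'skip-directory-placeholder';
  }
  return 'file';
}

console.log(classifyDestination()); // memory
console.log(classifyDestination('out/folder/')); // skip-directory-placeholder
console.log(classifyDestination('out/folder/a.txt')); // file
```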

Member

For this case, I think allowing the file system to error would be clearer and easier to understand than returning a Buffer.

Contributor

@vishwarajanand I think we want to address this per @danielbankhead feedback.

Comment on lines +2101 to +2109

// Skip directory objects as they cannot be written to local filesystem
if (
  destination &&
  (destination.endsWith('/') || destination.endsWith('\\'))
) {
  callback?.(null, Buffer.alloc(0));
} else if (destination) {
  fs.mkdirSync(path.dirname(destination), {recursive: true});
Member

I think this logic should live out of File#download for a few reasons:

  • The mkdirSync would make this method slower for non-TM customers:
    • where the directory already exists (unnecessary I/O)
    • with slower filesystems (blocking I/O)
  • The logic for determining if a directory should be created can be handled for multiple files at once (rather than each file in the same directory doing the same work).
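
One way to read the last point is to compute the set of parent directories once for the whole batch and create each directory a single time, asynchronously. The sketch below assumes the local destination paths are already known; `uniqueParentDirs` and `ensureParentDirs` are hypothetical names, not existing TransferManager methods.

```typescript
import {promises as fsPromises} from 'node:fs';
import * as path from 'node:path';

// Collect each distinct parent directory once, so N files in the same
// folder lead to one mkdir call instead of N.
function uniqueParentDirs(destinations: string[]): string[] {
  const dirs = new Set<string>();
  for (const destination of destinations) {
    dirs.add(path.dirname(destination));
  }
  return Array.from(dirs);
}

// Create the directories up front with non-blocking I/O, before any
// download starts. `recursive: true` makes an existing directory a no-op.
async function ensureParentDirs(destinations: string[]): Promise<void> {
  for (const dir of uniqueParentDirs(destinations)) {
    await fsPromises.mkdir(dir, {recursive: true});
  }
}

console.log(uniqueParentDirs(['a/b/1.txt', 'a/b/2.txt', 'a/c/3.txt']));
// [ 'a/b', 'a/c' ]
```

A batch caller would await `ensureParentDirs(destinations)` once and then start the individual downloads.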

Author

@hochoy hochoy Jun 9, 2023

Coincidentally, that was my original design (creating the directory structure in TransferManager). However, I made the change in File#download because the root issue is that File#download itself cannot write nested paths, and solving it at the lower level would solve the problem for both File and TransferManager.

[Screenshot: Screen Shot 2023-06-09 at 10 39 25 AM]

That being said, it's not hard to optimize at the TransferManager level; I can create another branch. What do we do then about the limitation of File#download above? Do we:

  1. warn users not to provide nested destination paths?
  2. or warn users that they need to ensure that the folder hierarchy exists before providing nested destination paths?

Member

I think a simple solution would be to add a flag to gate this functionality and have Transfer Manager use this flag.
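
A rough sketch of what such gating could look like, written as a standalone function rather than as a change to File#download. The option name `createParentDirs` is invented purely for illustration and is not an existing nodejs-storage option; the thread does not name the flag.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical option bag: `createParentDirs` is an invented name used only
// for this illustration.
interface PrepareOptions {
  createParentDirs?: boolean;
}

// Creates the parent directory only when the caller opts in, so callers who
// never set the flag pay no extra filesystem cost.
function prepareDestination(
  destination: string,
  options: PrepareOptions = {}
): void {
  if (options.createParentDirs) {
    fs.mkdirSync(path.dirname(destination), {recursive: true});
  }
}

// Demonstration inside a throwaway temp directory.
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'download-sketch-'));
const nested = path.join(root, 'a', 'b', 'file.txt');

prepareDestination(nested); // default: nothing is created
console.log(fs.existsSync(path.dirname(nested))); // false

prepareDestination(nested, {createParentDirs: true}); // opt-in
console.log(fs.existsSync(path.dirname(nested))); // true

fs.rmSync(root, {recursive: true, force: true}); // clean up
```

With this shape, only a batch caller such as Transfer Manager would pass the flag, and a plain download into a missing directory would still surface the filesystem error.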

Contributor

@vishwarajanand same comment here, I think we want to address @danielbankhead idea about flagging / gating this.

@hochoy
Author

hochoy commented Jul 8, 2023

Hey @danielbankhead @ddelgrosso1 , just wanted to provide a heads up that I'll be revisiting this in about 1-2 months due to project priorities. Thank you for all the feedback and recommendations, they are all valid points for a user like myself.

I do need something like TransferManager for a project at work, so I'll very likely be back (no promises of course). But in its current design, it would be more practical for me to build my own version of the TransferManager to reflect the interface I'm looking for. I will try to wrap the existing File.download method during my experimentation, similar to the current design, but might go a different way if needed.

Couple things I would design for, after both your feedback:

  • downloads should always be concurrent
  • empty objects (like "folders") on GCP are not writable locally and need special handling
  • directory creation should be optimized at the TransferManager level
  • unix vs windows-based file handling
  • read and write failures should not fail silently. In the event of a partial directory/folder download, we should make it easy for the user to identify which files succeeded vs failed.
  • if files with the same name already exist in the write destination, we could default to overwriting, throwing an error, or prompting the user
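
As a sketch of the "should not fail silently" point, the following runs all downloads concurrently and records a per-file outcome instead of rejecting on the first error. It assumes a caller-supplied `downloadOne` function that returns a promise; every name here is hypothetical and none of it is TransferManager API.

```typescript
// Per-file result: either success, or failure with the captured error.
type Outcome =
  | {name: string; ok: true}
  | {name: string; ok: false; error: unknown};

// Starts every download at once and never rejects as a whole: each file's
// success or failure is captured individually.
function downloadAll(
  names: string[],
  downloadOne: (name: string) => Promise<void>
): Promise<Outcome[]> {
  return Promise.all(
    names.map(name =>
      downloadOne(name).then(
        (): Outcome => ({name, ok: true}),
        (error: unknown): Outcome => ({name, ok: false, error})
      )
    )
  );
}

// Splits the outcomes so a caller can see which files to retry.
function summarize(outcomes: Outcome[]): {
  succeeded: string[];
  failed: string[];
} {
  const succeeded: string[] = [];
  const failed: string[] = [];
  for (const outcome of outcomes) {
    (outcome.ok ? succeeded : failed).push(outcome.name);
  }
  return {succeeded, failed};
}
```

After a partial folder download, `summarize(await downloadAll(names, downloadOne))` would give the caller the two lists directly.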

I envision users wanting to use TransferManager to download entire buckets or "sub-directories". So, those are some of the considerations I could think of.

Something else the Storage service should consider is whether this should be done at the API level instead, allowing users to use more advanced queries. I know adding any logic around bucket objects is likely terrible for performance, but it would make this cross-language compatible. Trying to implement TransferManager across all supported languages would mean a lot of SDK work.

@ddelgrosso1 ddelgrosso1 added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Jul 19, 2023
@ddelgrosso1 ddelgrosso1 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 7, 2023
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 6, 2024
@vishwarajanand vishwarajanand removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label May 9, 2024
@ddelgrosso1 ddelgrosso1 added the owlbot:run Add this label to trigger the Owlbot post processor. label May 9, 2024
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label May 9, 2024
@ddelgrosso1 ddelgrosso1 added kokoro:force-run Add this label to force Kokoro to re-run the tests. owlbot:run Add this label to trigger the Owlbot post processor. labels May 9, 2024
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label May 9, 2024
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 9, 2024
Labels
api: storage Issues related to the googleapis/nodejs-storage API. size: m Pull request size is medium.
Development

Successfully merging this pull request may close these issues.

transferManager.downloadManyFiles not writing files locally
5 participants