Support HTTP Range Requests in MNRead.get #1709

Open
robyngit opened this issue Oct 2, 2023 · 3 comments
Labels
enhancement New feature or request

@robyngit
Member

robyngit commented Oct 2, 2023

Detect and handle HTTP range requests to enable clients to retrieve a portion of a file without the need to download the entire content. This feature would allow MetacatUI and other clients to preview data files before downloading them. It would also allow clients to resume downloads in the event of a network interruption.

  • @mbjones mentioned the possibility of implementing this without making changes to the DataONE API: A range request could be made via HTTP headers, leaving the request body unchanged and having Metacat only handle the range request headers.
  • It would be the client's responsibility to generate the range request, for example:
curl -H "Range: bytes=0-1000" https://dev.nceas.ucsb.edu/knb/d1/mn/v2/object/urn%3Auuid%3A24b85258-3e86-40cb-accc-28153513dea8
  • The feature could be a non-mandatory enhancement, such that the existing behavior remains consistent for repositories not making use of range requests.
  • Apache Tomcat and the Servlet API might provide built-in support for HTTP range requests.
  • A discussion is needed on how this feature interacts with event metrics:
    • Is a range request categorized as a download or a view/read?
    • Does this require a new event type, e.g. "partial read", "preview"?

Note: https://dev.nceas.ucsb.edu/knb/d1/mn/v2/object/urn%3Auuid%3A24b85258-3e86-40cb-accc-28153513dea8 returns a 100,000-line CSV file that could be useful for testing.
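
To make the header-only approach concrete, here is a minimal sketch of how a server could interpret a single-range `Range` header such as `bytes=0-1000`. It is not Metacat's implementation; the function name `parse_byte_range` and the choice to ignore multi-range or malformed headers (and serve the full object) are assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# Matches a single byte range: "bytes=0-1000", "bytes=4000-", or "bytes=-500".
_RANGE_RE = re.compile(r"^bytes=(\d*)-(\d*)$")


def parse_byte_range(header: Optional[str], size: int) -> Optional[Tuple[int, int]]:
    """Return inclusive (start, end) byte offsets for an object of `size` bytes.

    Returns None when the whole object should be served (no header, or a
    header this sketch does not understand). Raises ValueError when the
    range cannot be satisfied, which would map to an HTTP 416 response.
    """
    if not header:
        return None
    match = _RANGE_RE.match(header.strip())
    if not match:
        return None  # hypothetical policy: ignore unsupported headers
    first, last = match.group(1), match.group(2)
    if first == "" and last == "":
        return None
    if first == "":
        # Suffix form: the final N bytes of the object.
        length = int(last)
        if length == 0 or size == 0:
            raise ValueError("unsatisfiable range")
        return (max(size - length, 0), size - 1)
    start = int(first)
    if start >= size:
        raise ValueError("unsatisfiable range")
    end = int(last) if last else size - 1
    if end < start:
        return None  # invalid range spec, treated as absent
    # Clamp the end to the last byte of the object.
    return (start, min(end, size - 1))
```

Note that the range is inclusive, so `bytes=0-1000` covers 1001 bytes.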

@robyngit robyngit added the enhancement New feature or request label Oct 2, 2023
@taojing2002
Contributor

@robyngit Two questions:

  1. Does this feature only support text data files (e.g., CSV)? How about Excel files?
  2. What are the units of the range? Lines or bytes or both?

@mbjones
Member

mbjones commented Oct 2, 2023

@taojing2002 good questions. Range requests are byte-based: the client specifies the byte range it wants. They are application-agnostic and assume that the client knows what to do with the bytes. Tools like curl use range requests to resume downloads if a network connection is interrupted. Data systems use range requests to retrieve chunks of data from inside a data file, but that is of course only useful if the data files are organized in such a way that contiguous byte ranges produce meaningful chunks. So, for text files, getting the first few KB is a good way to get a preview, but the client would need to be aware that the byte boundary is unlikely to correspond with the end-of-line delimiter used in that format. In contrast, netCDF, HDF5, and Zarr are binary formats in which byte range requests can retrieve segments of data that correspond to scientifically meaningful chunks (e.g., a single image out of a time series, or a specific spatial window out of a larger extent). Hope that's all helpful.
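
As an illustration of the text-preview caveat, a client could drop the trailing partial line before displaying a preview. This is only a sketch: the helper names `trim_to_complete_lines` and `preview_lines` are hypothetical, and it assumes the range starts at byte 0, the file uses `\n` line endings, and the text is UTF-8.

```python
def trim_to_complete_lines(chunk: bytes, newline: bytes = b"\n") -> bytes:
    """Drop whatever follows the last newline in a byte range.

    Assumes the chunk starts at the beginning of the file, so only the
    final line can be incomplete.
    """
    cut = chunk.rfind(newline)
    if cut == -1:
        return b""  # no complete line in this chunk
    return chunk[: cut + len(newline)]


def preview_lines(chunk: bytes) -> list:
    """Decode the complete lines of a chunk as UTF-8 text (an assumption)."""
    text = trim_to_complete_lines(chunk).decode("utf-8", errors="replace")
    return text.splitlines()
```

Cutting at a newline byte is also safe for UTF-8, because the byte `0x0A` never occurs inside a multi-byte character.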

@taojing2002
Contributor

@mbjones Thanks! So I think we will use bytes as the range unit for all formats. Clients are responsible for parsing the bytes.
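
On the client side, the same header covers both previews and resumed downloads. The sketch below builds the request with Python's standard `urllib.request`; the helper name `build_range_request` is hypothetical, and nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import urllib.request
from typing import Optional


def build_range_request(
    url: str, start: int, end: Optional[int] = None
) -> urllib.request.Request:
    """Build a GET request for bytes start..end (inclusive).

    With end=None the range is open-ended ("bytes=N-"), which is the form
    a client would use to resume after N bytes have already been saved.
    """
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    value = f"bytes={start}-" if end is None else f"bytes={start}-{end}"
    return urllib.request.Request(url, headers={"Range": value})


def is_partial_response(status: int) -> bool:
    """True for 206 Partial Content; 200 means the Range header was ignored."""
    return status == 206
```

A client resuming a download should check the status code: if the server answers 200 rather than 206, the body is the full object and should replace, not be appended to, the bytes already on disk.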

@mbjones mbjones added this to the 3.1.0 milestone Nov 3, 2023