Missing scroll API docs? #225

Open
kylebarron opened this issue Apr 2, 2020 · 10 comments

Comments

@kylebarron

kylebarron commented Apr 2, 2020

I'm trying to create a seamless cloudless Landsat basemap using MosaicJSON, so I'm looping over all cloudless Landsat imagery to record it in the MosaicJSON file. When I attempt that, I get an error telling me to use the "scroll api" instead.

{'code': 500,
 'description': '[illegal_argument_exception] Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.'}

I've searched the code, searched the API docs, searched issues, and I can't find any reference to a scroll API. Does it exist?

Separately, I tried sat-search, but it doesn't return the same number of results as the HTTP API for the same query: search.found() gives 3859 results, instead of the 29773 that the meta key of the HTTP API response reports.

Repro code:

from satsearch import Search
import json
import requests

query_str = '{"bbox": [-127.64, 23.92, -64.82, 52.72], "time": "2013-01-01T00:00:00Z/2020-04-01T23:59:59Z", "query": {"eo:sun_elevation": {"gt": 0}, "landsat:tier": {"eq": "T1"}, "collection": {"eq": "landsat-8-l1"}, "eo:cloud_cover": {"gte": 0, "lt": 10}, "eo:platform": {"eq": "landsat-8"}}, "sort": [{"field": "eo:cloud_cover", "direction": "asc"}]}'
query = json.loads(query_str)

url = 'https://sat-api.developmentseed.org/stac/search'
headers = {
    "Content-Type": "application/json",
    "Accept-Encoding": "gzip",
    "Accept": "application/geo+json", }

data = requests.post(url, headers=headers, json=query).json()
data['meta']['found']
# 29773

search = Search(**query)
search.found()
# 3859

Am I missing something? Why do these identical queries return different numbers of results?

@matthewhanson
Member

matthewhanson commented Apr 2, 2020

Hello @kylebarron ,

The scroll API must be a reference to Elasticsearch, where the Scroll API is an alternate way to "page" through large responses. I've not seen this error, but it indicates that the paging mechanism in sat-api is not working as expected.

Your AOI is pretty large. I would recommend dividing your query into smaller queries, such as one year at a time for that AOI, or dividing your AOI into smaller AOIs.
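As a rough illustration of the "one year at a time" suggestion, the sketch below splits a query into per-year sub-queries by rewriting its time field. The helper names (`yearly_time_ranges`, `split_query_by_year`) are hypothetical and not part of sat-api or sat-search, and a single year can still exceed 10,000 results for a large enough AOI, in which case the AOI would need splitting too.

```python
from datetime import datetime

def yearly_time_ranges(start_year, end):
    """Yield 'start/stop' time strings, one per calendar year, up to `end`."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    for year in range(start_year, end.year + 1):
        start = datetime(year, 1, 1, 0, 0, 0)
        # Clamp the final year to the requested end of the search window.
        stop = min(datetime(year, 12, 31, 23, 59, 59), end)
        yield f"{start.strftime(fmt)}/{stop.strftime(fmt)}"

def split_query_by_year(query, start_year, end):
    """Return copies of `query`, each restricted to a single year."""
    return [{**query, "time": t} for t in yearly_time_ranges(start_year, end)]

# Shortened version of the query from the original post.
base_query = {
    "bbox": [-127.64, 23.92, -64.82, 52.72],
    "query": {"eo:cloud_cover": {"gte": 0, "lt": 10}},
}
sub_queries = split_query_by_year(base_query, 2013, datetime(2020, 4, 1, 23, 59, 59))
```

Each entry in `sub_queries` can then be posted to the search endpoint separately and the results concatenated.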

Note also that the deployed DevSeed sat-api you are using is a little out of date. STAC is now on version 0.9, and there is a new forked and refactored version of sat-api called stac-api, along with a beta version of sat-search. However, as of right now there isn't a deployed version of stac-api containing the same public datasets the DevSeed API does. Within the next 2 months there will be one for Sentinel-2 in the new version.

Are you interested in Sentinel-2 data, Landsat-8, or both?

@kylebarron
Author

Thanks for your response. I'm guessing that there's a default Elasticsearch setting (the index.max_result_window named in the error) that caps the result window at 10,000, and that it wasn't modified...

If it's not any worse for performance on the backend to retrieve items 29,500-30,000 than it is to retrieve items 0-500, it would be nice to restrict usage by rate limiting rather than a max number of results, so that a user could (slowly) page through as many results as they desired.

I think the best workaround is to split it up by year as you mentioned.

I'm only interested in Landsat 8, since Sentinel 2 isn't stored on AWS in COG.

@drewbo
Member

drewbo commented Apr 2, 2020

@kylebarron I'm a little rusty on how it works exactly, but I think you can combine the page and limit parameters to page through as many results as you need. For example:

https://sat-api.developmentseed.org/stac/search?page=1&limit=100
https://sat-api.developmentseed.org/stac/search?page=2&limit=100
https://sat-api.developmentseed.org/stac/search?page=3&limit=100

Let me know if that helps
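A minimal sketch of that page/limit loop follows. `fetch_page` is a hypothetical callable standing in for the HTTP request, and the sketch assumes the response is a GeoJSON FeatureCollection with a "features" list. It also stops before page * limit exceeds 10,000, since the error quoted at the top of this issue shows the backend rejects anything past that window.

```python
def iter_pages(fetch_page, limit=100, max_window=10_000):
    """Yield features one page at a time using 1-indexed page numbers.

    `fetch_page(page, limit)` is a hypothetical callable that returns the
    decoded JSON body of one search response.
    """
    page = 1
    # Stop before requesting a page whose end offset would exceed the window.
    while page * limit <= max_window:
        body = fetch_page(page, limit)
        features = body.get("features", [])  # assumed response shape
        yield from features
        if len(features) < limit:
            break  # a short page means there are no more results
        page += 1

# One possible fetch_page against the endpoint used in this thread:
#
#   import requests
#   def fetch_page(page, limit):
#       url = "https://sat-api.developmentseed.org/stac/search"
#       return requests.get(url, params={"page": page, "limit": limit}).json()
```

Passing the request in as a callable keeps the loop testable without network access.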

@vincentsarago
Member

@kylebarron
Author

kylebarron commented Apr 2, 2020

@drewbo

I think you can combine the page and limit parameters to page through as many results as you need

That's what I'm attempting to do in the original post.

That is, if I first find the total number of results:

import json
import requests

query_str = '{"bbox": [-127.64, 23.92, -64.82, 52.72], "time": "2013-01-01T00:00:00Z/2020-04-01T23:59:59Z", "query": {"eo:sun_elevation": {"gt": 0}, "landsat:tier": {"eq": "T1"}, "collection": {"eq": "landsat-8-l1"}, "eo:cloud_cover": {"gte": 0, "lt": 10}, "eo:platform": {"eq": "landsat-8"}}, "sort": [{"field": "eo:cloud_cover", "direction": "asc"}]}'
query = json.loads(query_str)

url = 'https://sat-api.developmentseed.org/stac/search'
headers = {
    "Content-Type": "application/json",
    "Accept-Encoding": "gzip",
    "Accept": "application/geo+json", }

data = requests.post(url, headers=headers, json={**query, **{'limit': 0}}).json()
data['meta']['found']
# 29773

But then if I try to retrieve a high enough page, it fails. For example, page 58 with a limit of 500 should cover results 28,500-29,000 (if page numbering starts at 1), which matches the [29000] in the error:

data = requests.post(url, headers=headers, json={**query, **{'limit': 500, 'page': 58}}).json()
data
# {'code': 500,
#  'description': '[illegal_argument_exception] Result window is too large, from + size must be less than or equal to: [10000] but was [29000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.'}
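The arithmetic behind that error can be sketched as follows, assuming sat-api maps the 1-indexed page to an Elasticsearch offset of from = (page - 1) * limit with size = limit. That mapping is an inference from the [29000] in the error message, not something confirmed from the sat-api source; the function names are hypothetical.

```python
MAX_RESULT_WINDOW = 10_000  # Elasticsearch default for index.max_result_window

def result_window(page, limit):
    """Return (from, from + size) for a 1-indexed page (assumed mapping)."""
    offset = (page - 1) * limit
    return offset, offset + limit

def last_reachable_page(limit, max_window=MAX_RESULT_WINDOW):
    """Highest page whose from + size still fits inside the result window."""
    return max_window // limit

start, end = result_window(58, 500)
print(start, end)                # 28500 29000
print(end <= MAX_RESULT_WINDOW)  # False, so the request is rejected
print(last_reachable_page(500))  # 20
```

With a limit of 500, the last page that fits under the window is page 20.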

@vincentsarago

Yes, I'm using essentially the exact same code, ported to a CLI and Python package that I can run locally instead of on Lambda. Regardless, it never gets past 10,000 results returned from the API.

You can test with:

git clone https://github.com/kylebarron/landsat-cogeo-mosaic
cd landsat-cogeo-mosaic
pip install -e .
landsat-cogeo-mosaic create \
    --bounds '-127.64,23.92,-64.82,52.72' \
    --max-cloud 10 \
    --stac-collection-limit 500 \
    --season summer > mosaic.json

It logs

{"page": 1, "limit": 500, "found": 29773, "returned": 500}

But if you wait a few minutes you'll see it cuts off at page 20.

@matthewhanson
Member

@kylebarron Looks like 10000 is a limit within Elasticsearch. And while there is a way around it, it's not recommended.

https://discuss.elastic.co/t/how-to-increase-the-default-size-limit-from-10000-to-1000000-in-elasticsearch/208807/4

It says to use the scroll API to do paging (would have to be implemented in sat-api), although last I read the scroll API was not recommended for production.

I think your best bet is to ensure that your queries don't have so many responses and just to divide up the queries.

On the API side, though, it should at least throw a more meaningful error if the number of results exceeds 10,000.

Thanks for bringing this up.

@kylebarron
Author

Good to know. I don't want to suggest a change that makes backend performance worse.

I think for my own use I'll first find the total number of results with limit=0, and present an error to the user if >10,000. That way there's no confusion in the future about missing entries.
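A minimal sketch of that pre-check, assuming the response carries the total under meta.found as in the snippets above. `post_json` and `TooManyResultsError` are hypothetical names introduced for illustration; `post_json` stands in for the requests.post(...).json() call.

```python
class TooManyResultsError(Exception):
    """Raised when a query matches more results than can be paged through."""

def check_result_count(post_json, query, max_results=10_000):
    """Return the total number of matches, or raise if it exceeds the limit.

    `post_json(payload)` is a hypothetical callable that posts `payload` to
    the search endpoint and returns the decoded JSON body.
    """
    # limit=0 asks only for the metadata, without returning any items.
    body = post_json({**query, "limit": 0})
    found = body["meta"]["found"]
    if found > max_results:
        raise TooManyResultsError(
            f"Query matches {found} results, more than the {max_results} "
            "that can be retrieved; split it by time range or bounding box."
        )
    return found
```

Failing up front this way avoids silently building a mosaic from only the first 10,000 items.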

@drewbo
Member

drewbo commented Apr 2, 2020

My mistake @kylebarron, I should have read through the full error first 🤦‍♂. Do you think it would support your use case if we passed through the parameters necessary for the Elasticsearch scroll API?

@matthewhanson do you know why this isn't recommended for production (maybe performance reasons)?

@matthewhanson
Member

I'm not really sure why it's not recommended for general paging. From the documentation:
"Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration."

It's quite a bit more complicated since it's not stateless: you'd either need to use session tokens or, as you suggest @drewbo, pass back the parameters, which means adding new query parameters to the API so users can hand back the scroll state.
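To make the statefulness concrete, here is a rough sketch of the Elasticsearch scroll protocol itself, talking to the cluster directly rather than through sat-api, which does not expose it. `request` is a hypothetical callable taking (method, path, params, body) and returning the decoded JSON, and the index name is whatever the deployment uses. The point is that every follow-up request must carry the scroll_id returned by the previous one.

```python
def scroll_all(request, index, query, size=500, keep_alive="1m"):
    """Yield every hit for `query` using the Elasticsearch scroll API.

    `request(method, path, params, body)` is a hypothetical HTTP helper.
    """
    # The first request opens a scroll context that lives for `keep_alive`.
    body = request("POST", f"/{index}/_search", {"scroll": keep_alive},
                   {"size": size, "query": query})
    scroll_id = body["_scroll_id"]
    try:
        while body["hits"]["hits"]:
            yield from body["hits"]["hits"]
            # Each follow-up hands the scroll_id back: this is the state a
            # stateless HTTP API would have to expose to its clients.
            body = request("POST", "/_search/scroll", None,
                           {"scroll": keep_alive, "scroll_id": scroll_id})
            scroll_id = body["_scroll_id"]
    finally:
        # Release the server-side scroll context when done.
        request("DELETE", "/_search/scroll", None, {"scroll_id": scroll_id})
```

Exposing this through an HTTP API would mean returning the scroll_id to the client and accepting it back as a parameter, which is the extra surface area described above.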

@kylebarron
Author

It's fine. I don't intend to ask for a ton of work when the workaround isn't that bad. You can close this if you want, or leave it open if you want to update the API to throw a more meaningful error.
