Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too slow query fetch #150

Closed
umax opened this issue Jan 29, 2021 · 13 comments · Fixed by #155 or #190
Closed

Too slow query fetch #150

umax opened this issue Jan 29, 2021 · 13 comments · Fixed by #155 or #190
Assignees
Labels
api: datastore Issues related to the googleapis/python-datastore API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: question Request for information or clarification. Not an issue.

Comments

@umax
Copy link

umax commented Jan 29, 2021

Hi, team!

Imagine we have an entity Order with customer_id (str, uuid4), created_at (datetime) and other fields. I want to get all orders for specified customer and time range.

The query returns about 1500 records.
The problem: it takes about 7-8 seconds!

Environment details

  • OS type and version: Docker image python:3.8-slim
  • Python version: Python 3.8.6
  • pip version: 20.2.2
  • google-cloud-datastore version: 2.0.1

Datastore indexes:

  • kind: Order
    properties:

    • name: customer_id
    • name: created_at
      direction: asc
  • kind: Order
    properties:

    • name: customer_id
    • name: created_at
      direction: desc

Code example

query = db_client.query(
    kind='Order',
    filters=(
        ('customer_id', '=', customer_id),
        ('created_at', '>=', datetime_from),
        ('created_at', '<=', datetime_to),
    ),
)
result = list(query.fetch())

What I do wrong or can you suggest how to speed up this query?
Thanks!

@product-auto-label product-auto-label bot added the api: datastore Issues related to the googleapis/python-datastore API. label Jan 29, 2021
@dmahugh dmahugh added priority: p2 Moderately-important priority. Fix may not be included in next release. type: question Request for information or clarification. Not an issue. labels Jan 29, 2021
@tseaver tseaver self-assigned this Feb 2, 2021
@tseaver
Copy link
Contributor

tseaver commented Feb 2, 2021

@umax In order to optimize query performance, you need to create composite indexes for the various queries your application makes.

Per the Datastore docs on index configuration:

Firestore in Datastore mode uses indexes for every query your application makes. These indexes are updated whenever an entity changes, so the results can be returned quickly when the application makes a query. Datastore mode provides built-in indexes automatically, but needs to know in advance which composite indexes the application will require. You specify which composite indexes your application needs in a configuration file.

@umax
Copy link
Author

umax commented Feb 3, 2021

@tseaver does provided composite indexes is not enough for my query?

@tseaver
Copy link
Contributor

tseaver commented Feb 3, 2021

@umax Firestore doesn't auto-provide composite indexes: it only has indexes on individual fields. You have to define composite indexes manually.

@umax
Copy link
Author

umax commented Feb 3, 2021

@tseaver in the description of this issue I provided list of composite indexes from index.yaml file. Without these indexes the query does not work at all.

@BinarSkugga
Copy link

Possibly related to #145

@umax
Copy link
Author

umax commented Feb 15, 2021

What I did: I created composite index and use projection field to fetch only custom fields of entity. But this not solve problem in general. I think that 7-8 seconds to fetch 1500 is not acceptable for official library for Google services.

@tseaver
Copy link
Contributor

tseaver commented Mar 23, 2021

@umax I'm sorry I missed seeing the index definition in your initial description.

I wonder if the second index (the "descending" one) is causing the back-end to choke?

@tsteinruecken
Copy link

Hi,

I guess we’ve ran into the same issue (tested on 2.1.0 and several older versions). We have a query that returns 30 entities (no filter, no order). The combined size of them is less than 200kb. Fetching them from an Appengine F2 instance takes close to 0.8 Seconds. To debug this I’ve split the fetch by running the query keys-only (which is fast!) and then calling get_multi separately. As far as i can tell, the time is wasted in client.py:547. Copying these 30 protobufs into entities takes 0.6 seconds of said 0.8 seconds total. I’ve also iterated the list of these protobufs before and had each PB cast to string to ensure there’s no RPC in flight that's being waited on inside entity_from_protobuf. For me it seems to be entirely CPU-Hog there. Our entities contain several lists and nested entities, maybe this is contributing to this slowdown as i can’t see anything obviously wrong with that entity_from_protobuf function?

@craiglabenz
Copy link
Contributor

Relatedly, I've been exploring some query slowness here. I'm going to add @tsteinruecken's findings to my search for the culprit.

@tseaver
Copy link
Contributor

tseaver commented Apr 14, 2021

I have written a script to test google.cloud.datastore.helpers.entity_from_protobuf, which is the workhorse used in marshaling query results into Entity objects.

$ python3.8 -m venv /tmp/datastore-1.15.3

$ /tmp/datastore-1.15.3/bin/pip install --upgrade setuptools pip wheel
...
Successfully installed pip-21.0.1 setuptools-56.0.0 wheel-0.36.2

$ /tmp/datastore-1.15.3/bin/pip install "google.cloud-datastore==1.15.3"
...
Successfully installed cachetools-4.2.1 certifi-2020.12.5 chardet-4.0.0 google-api-core-1.26.3 google-auth-1.28.1 google-cloud-core-1.6.0 google.cloud-datastore googleapis-common-protos-1.53.0 grpcio-1.37.0 idna-2.10 packaging-20.9 protobuf-3.15.8 pyasn1-0.4.8 pyasn1-modules-0.2.8 pyparsing-2.4.7 pytz-2021.1 requests-2.25.1 rsa-4.7.2 six-1.15.0 urllib3-1.26.4

$ python3.8 -m venv /tmp/datastore-2.1.0

$ /tmp/datastore-2.1.0/bin/pip install --upgrade setuptools pip wheel
...
Successfully installed pip-21.0.1 setuptools-56.0.0 wheel-0.36.2

$ /tmp/datastore-2.1.0/bin/pip install "google-cloud-datastore==2.1.0"
...
Successfully installed cachetools-4.2.1 certifi-2020.12.5 chardet-4.0.0 google-api-core-1.26.3 google-auth-1.28.1 google-cloud-core-1.6.0 google-cloud-datastore-2.1.0 googleapis-common-protos-1.53.0 grpcio-1.37.0 idna-2.10 libcst-0.3.18 mypy-extensions-0.4.3 packaging-20.9 proto-plus-1.18.1 protobuf-3.15.8 pyasn1-0.4.8 pyasn1-modules-0.2.8 pyparsing-2.4.7 pytz-2021.1 pyyaml-5.4.1 requests-2.25.1 rsa-4.7.2 six-1.15.0 typing-extensions-3.7.4.3 typing-inspect-0.6.0 urllib3-1.26.4

$ /tmp/datastore-1.15.3/bin/python compare_perf_issue_1_15_3.py 
Time: 1.3033227920532227

$ /tmp/datastore-2.1.0/bin/python compare_perf_issue_1_15_3.py 
Time: 13.733032941818237

Those times are pretty repeatable.

tseaver pushed a commit that referenced this issue Apr 20, 2021
More efficiently uses proto-plus wrappers, as well as inner protobuf attribute access, to greatly reduce the performance costs seen in version 2.0.0 (which stemmed from the introduction of proto-plus).

The size of the performance improvement scales with the number of attributes on each Entity, but in general, speeds once again closely approximate those from 1.15.

Fixes #145
Fixes #150
@tseaver
Copy link
Contributor

tseaver commented Apr 20, 2021

Note that the merge of @craiglabenz's PR #155 improves the time for Datastore 2.x on my benchmark, from being a full 10x slower than Datastore 1.15.3 to about 3x slower.

@tsteinruecken
Copy link

@tseaver could you please reopen this issue? I don't consider it resolved at all.
I've attached a sample appengine application. It contains a large entity to test.
Deploy it, call /initDB first (it will write the contained entity 30 times into "test_kind").
Then call /testFetch or /testBaseline. The baseline is what i consider should be the time it should take to fetch these entities (it does a roundtrip to the datastore keys-only, then deserializes the locally stored entity for each key received). It's about ~1.3 seconds on the configured F2 instance. Python-datastore==1.15.3 is already 5(!) times slower. 2.1.0 is fiveteen times slower than the baseline, and while there's a small improvement on 2.1.2 it's still fourteen times slower than given baseline (that's ~15 Seconds to fetch these 30 entities).
appengine-datastore-bench.zip

@tseaver
Copy link
Contributor

tseaver commented May 4, 2021

/cc @craiglabenz, @crwilcox

@tseaver tseaver reopened this May 4, 2021
tseaver added a commit that referenced this issue Jul 9, 2021
tseaver added a commit that referenced this issue Jul 9, 2021
…ies (#190)

Always unwrap to get the raw protobuf message, rather than the proto-plus wrapper.

We are back to within a few percent of the older/faster version on my comparison test.

Closes #150.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: datastore Issues related to the googleapis/python-datastore API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: question Request for information or clarification. Not an issue.
Projects
None yet
6 participants