
Very poor performance in Firestore reads due to proto-plus sequence unmarshaling #224

Closed
GabrieleMazzola opened this issue Apr 10, 2021 · 4 comments · Fixed by #226
Labels: P2 (a nice-to-fix bug) · priority: p2 (moderately-important) · 🚨 (this issue needs some love) · type: bug

Comments

@GabrieleMazzola
Contributor

Environment details

  • OS type and version: Windows 10 Pro
  • Python version: 3.9
  • pip version: 20.2.2
  • google-cloud-firestore: 2.1.0

Problem

We have been using Firestore with the Python Admin SDK for some time now, and in some cases the latency of fetching data was far higher than expected. For example, reading 500 (very small) documents from a collection could take up to 30 seconds, depending on which fields the documents had. After some experiments, we found that this poor performance was directly related to the presence of empty arrays in the documents.

Seemingly related issues

Steps to reproduce

  • Create two collections of 500 documents each: one whose documents contain an empty array, and one whose documents contain a non-empty array.
  • Retrieve all documents from each collection and compare execution times, using the script below.
from google.cloud import firestore
import time
import os

N_DOCS = 500

if __name__ == '__main__':

    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "./gcp_sa_creds-dev.json"
    firestore_client = firestore.Client()

    for coll_name in ["performance_test_empty_array", "performance_test_not_empty_array"]:
        coll_ref = firestore_client.collection(coll_name)

        print(f"\n{coll_name}")
        
        # Just uncomment this if you need to clean up the previous test
        # batch = firestore_client.batch()
        # for doc in coll_ref.stream():
        #     batch.delete(doc.reference)
        # batch.commit()

        # Create data
        print("Creating data")
        start = time.time()
        batch = firestore_client.batch()
        for _ in range(N_DOCS):
            batch.set(
                coll_ref.document(),
                {
                    "field": [] if coll_name == "performance_test_empty_array" else ["aa"],
                }
            )
        batch.commit()
        end = time.time()
        elapsed = round(end-start, 2)
        print(f"{elapsed} seconds to create {N_DOCS} documents")

        # Read data
        print("Fetching data")
        start = time.time()
        docs = coll_ref.get()
        end = time.time()
        elapsed = round(end-start, 2)
        print(f"{elapsed} seconds to fetch {N_DOCS} documents")

Running the script produces the following results:

performance_test_empty_array
Creating data
0.67 seconds to create 500 documents
Fetching data
10.9 seconds to fetch 500 documents

performance_test_not_empty_array
Creating data
0.65 seconds to create 500 documents
Fetching data
0.28 seconds to fetch 500 documents

Issue analysis

To demonstrate the issue with empty arrays, we created a collection of 500 documents, each with a single empty-array field:

[Screenshot: Firestore console showing documents with a single empty-array field]

We then fetched the data with the following code:

from google.cloud import firestore
import time
import os

N_DOCS = 500

if __name__ == '__main__':

    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "./gcp_sa_creds-dev.json"
    firestore_client = firestore.Client()

    coll_ref = firestore_client.collection("performance_test_empty_array")

    # Read data
    print("Fetching data")
    start = time.time()
    docs = coll_ref.get()
    end = time.time()
    elapsed = round(end-start, 2)
    print(f"{elapsed} seconds to fetch {N_DOCS} documents")

This required 9.76 seconds. By profiling the Python code we found that the slowdown was caused by an extremely high number of calls (15,295,500) to the deepcopy function from the Python standard library, as can be seen in the following profiling output:

[Screenshot: profiler output for fetching 500 documents with empty arrays]

[Screenshot: call graph showing deepcopy dominating the runtime]
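
For reference, here is a minimal sketch of how such a profile can be captured with the standard library's cProfile (the collection name matches the reproduction script above):

import cProfile
import pstats

from google.cloud import firestore

client = firestore.Client()
coll_ref = client.collection("performance_test_empty_array")

# Profile only the fetch, which is where the slowdown shows up.
with cProfile.Profile() as profiler:
    coll_ref.get()

# Restrict the report to deepcopy to surface the hot spot.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats("deepcopy")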

The call graph showed that most of the deepcopy calls originated from the _pb_type cached property defined on the RepeatedComposite class (in "proto.marshal.collections.repeated"), which is used to infer the protocol buffer type of a sequence. The culprit is the following snippet from _pb_type:

proto.marshal.collections.repeated:RepeatedComposite._pb_type

[...]
# We have no members in the list.
# In order to get the type, we create a throw-away copy and add a
# blank member to it.
canary = copy.deepcopy(self.pb).add()
return type(canary)

In other words, it deep-copies the entire underlying protocol buffer container just to append a blank member and read its type.
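
To see the two approaches side by side, here is a small illustrative sketch on a raw Firestore ArrayValue protobuf. It relies on the same protected protobuf internals as the proposal below, so treat it as a demonstration rather than supported API:

import copy

from google.cloud.firestore_v1.types import document

# An empty repeated field of Value messages, as produced by an empty array.
pb = document.ArrayValue()._pb.values

# Current approach: deep-copy the container and add a blank member.
canary = copy.deepcopy(pb).add()
print(type(canary))

# Descriptor lookup: the same class, with no copying.
print(pb._message_descriptor._concrete_class)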

Proposed solution

Replacing the snippet above with the following bypasses the deepcopy entirely:

[...]
# We have no members in the list.
return self.pb._message_descriptor._concrete_class

That is, we read the protocol buffer type directly from the container's protected attributes. With this change, fetching the 500 documents took 0.24 seconds instead of 9.76 seconds. Profiling the code with the change in place shows the number of calls to the deepcopy function dropping from 15,295,500 to 1,500.
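
A rough micro-benchmark sketch of the per-call cost difference (absolute numbers will vary by machine and protobuf runtime, and the gap is larger still in the real workload, where the containers are attached to full document protos):

import copy
import timeit

from google.cloud.firestore_v1.types import document

pb = document.ArrayValue()._pb.values  # empty repeated Value container

deepcopy_path = timeit.timeit(lambda: type(copy.deepcopy(pb).add()), number=10_000)
descriptor_path = timeit.timeit(lambda: pb._message_descriptor._concrete_class, number=10_000)

print(f"deepcopy path:   {deepcopy_path:.3f}s")
print(f"descriptor path: {descriptor_path:.3f}s")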

Tests

All tests pass with the proposed solution in place.

[Screenshot: full test suite passing]

Next steps

Once the maintainers have validated (or improved upon) the proposed solution, we are ready to submit a PR addressing this issue.

@zoncrd @lucastanzani

@yoshi-automation yoshi-automation added the 'triage me' label Apr 12, 2021
@software-dov
Contributor

Very well researched. Yes, this looks both safe and correct. The entire cached-property logic can be removed and replaced with the _concrete_class lookup, provided it passes all the tests.
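
A hypothetical illustration of that simplification as a runtime patch (attribute names follow the snippet proposed above; the actual fix landed separately in #226):

from proto.marshal.collections import repeated

def _pb_type(self):
    # The concrete message class is available on the field descriptor even
    # when the sequence is empty, so no throw-away deepcopy is needed.
    return self.pb._message_descriptor._concrete_class

# Replace the cached property wholesale with a plain property.
repeated.RepeatedComposite._pb_type = property(_pb_type)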

@software-dov software-dov added the 'P2', 'priority: p2', and 'type: bug' labels and removed the 'triage me' label Apr 12, 2021
@yoshi-automation yoshi-automation added and then removed the '🚨 This issue needs some love.' label Jul 13, 2021
@tseaver
Contributor

tseaver commented Sep 24, 2021

I addressed the two-orders-of-magnitude slowdown in Firestore yesterday: googleapis/python-firestore#458

@yoshi-automation yoshi-automation added the '🚨 This issue needs some love.' label Nov 3, 2021
@hkdevandla
Member

@software-dov, is there anything to be done here in proto-plus besides the fix that was submitted to address the perf issue? If not, can we close this bug?

@hkdevandla
Member

@vam-google (for visibility)

@vchudnov-g vchudnov-g assigned atulep and unassigned software-dov Mar 15, 2022
@parthea parthea assigned parthea and unassigned atulep May 1, 2022