
Improve read performance by using stale reads #1994

Open
IchordeDionysos opened this issue Feb 2, 2024 · 8 comments
Assignees: tom-andersen
Labels: api: firestore (Issues related to the googleapis/nodejs-firestore API) · priority: p3 (Desirable enhancement or fix; may not be included in next release) · type: question (Request for information or clarification; not an issue)

Comments

@IchordeDionysos
Contributor

The documentation mentions that stale reads may improve the performance of reading from Firestore, as data can be fetched from the nearest replica without having to reconfirm with the leader replica:
https://firebase.google.com/docs/firestore/understand-reads-writes-scale#stale_reads

I'm using the following code to perform a stale read:

```ts
// How stale a read may be (moved out of the if-block; the original
// `export const` inside the block was a syntax error).
const STALE_READ_STALENESS = 60 * 1000; // 1 minute

const random = Math.random();
const useStaleReads = random < USE_STALE_READ_PERCENTAGE;

logger.profile(`stale-read-${random}`);

let snap: DocumentSnapshot<FirebaseFirestore.DocumentData>;
if (useStaleReads) {
  // Read the document as it existed one minute ago, via a
  // read-only transaction with an explicit readTime.
  const maxDataStaleness: Date = new Date(
    new Date().getTime() - STALE_READ_STALENESS
  );
  snap = await firestore.runTransaction(
    async t => {
      return t.get(ref);
    },
    {
      readOnly: true,
      readTime: Timestamp.fromDate(maxDataStaleness),
    }
  );
} else {
  snap = await ref.get();
}

logger.profile(`stale-read-${random}`, {
  level: 'info',
  message: 'Read from Firestore',
  meta: {
    useStaleReads,
  },
});
```

As the data does not change very often, one minute (or even longer) of staleness is acceptable for us.
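One caveat worth noting (an editor's addition, hedged): Firestore documents that a transaction's readTime must lie within the past hour, or up to seven days when point-in-time recovery is enabled. A small guard, using hypothetical names (`MAX_STALENESS_MS`, `staleReadTime`), could keep the staleness within that window:

```typescript
// Sketch: clamp the requested staleness so the resulting readTime stays
// within the window Firestore accepts (assumed here: one hour, the
// documented limit without point-in-time recovery).
const MAX_STALENESS_MS = 60 * 60 * 1000; // 1 hour

function staleReadTime(stalenessMs: number): Date {
  const clamped = Math.min(stalenessMs, MAX_STALENESS_MS);
  return new Date(Date.now() - clamped);
}
```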

But what we are seeing is that the strong reads are faster than the stale reads:
*(two screenshots: latency percentile comparisons for strong vs. stale reads)*

Query used for analysing the logs

```sql
WITH latencies AS (
  SELECT
    timestamp,
    JSON_VALUE(json_payload.metadata.useStaleReads) AS uses_stale_reads,
    JSON_VALUE(json_payload.metadata.profile.durationMs) AS duration_in_ms,
  FROM `simpleclub.global._Default._AllLogs` AS logs
  WHERE NORMALIZE_AND_CASEFOLD(logs.resource.type, NFKC) = "cloud_run_revision"
    AND NORMALIZE_AND_CASEFOLD(SAFE.STRING(logs.resource.labels["revision_name"]), NFKC) = "cloud-run-revision"
    AND NORMALIZE_AND_CASEFOLD(SAFE.STRING(logs.resource.labels["service_name"]), NFKC) = "cloud-run-service"
    AND REGEXP_CONTAINS(SAFE.STRING(logs.json_payload["metadata"]["profile"]["id"]), "stale")
    AND JSON_VALUE(json_payload.metadata.useStaleReads) = "true"
  ORDER BY timestamp DESC
)
SELECT
  STRUCT(
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(5000)] AS percentile_50,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(7500)] AS percentile_75,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9000)] AS percentile_90,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9500)] AS percentile_95,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9900)] AS percentile_99,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9950)] AS percentile_99_5,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9990)] AS percentile_99_9,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9995)] AS percentile_99_95,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9999)] AS percentile_99_99
  ) AS duration_in_ms,
  uses_stale_reads,
  COUNT(*) AS request_count
FROM latencies
GROUP BY uses_stale_reads
```

I wanted to share this experience with you; maybe I'm doing something wrong here.
I'm also not sure whether increasing the staleness to 60s (instead of 15s) breaks it?

Interesting data:

  • We are using Firestore via GRPC (not REST)
  • @google-cloud/firestore: v6.8.0
  • Firestore database is hosted in eur3 (multi-region)
  • Deployed on Cloud Run
    • Always on CPU
    • CPU start-up boost
    • max 40 requests / instance
    • 1st gen execution environment
    • 1 CPU
    • 4GiB memory
@IchordeDionysos IchordeDionysos added priority: p3 Desirable enhancement or fix. May not be included in next release. type: question Request for information or clarification. Not an issue. labels Feb 2, 2024
@product-auto-label product-auto-label bot added the api: firestore Issues related to the googleapis/nodejs-firestore API. label Feb 2, 2024
@IchordeDionysos
Contributor Author

IchordeDionysos commented Feb 2, 2024

A quick test with the 15s staleness shows very similar numbers ...

@tom-andersen tom-andersen self-assigned this Feb 2, 2024
@tom-andersen
Contributor

tom-andersen commented Feb 2, 2024

There is an unfortunate implementation detail: transactions send a BeginTransaction request, followed by your get-document requests. Effectively, that means a transaction sends multiple requests where a regular get sends one.

We are looking to improve this.

The v1 FirestoreClient allows complete access to the communication protocol, including the ability to set readTime on get-document requests. With this, you could achieve improved performance. However, it means taking responsibility for many of the things the regular API surface handles for you. Unless you really need this, I suggest you wait until we improve the regular API surface and/or optimize our handling of transactions with readTime.
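To make the above concrete, here is a rough sketch (an editor's illustration, not code from this thread) of a stale point read via the v1 surface. The `batchGetDocuments` stream and its `readTime` field exist on the v1 client; the helper names (`toProtoTimestamp`, `staleGet`) and the untyped `client` parameter are this sketch's own assumptions:

```typescript
// readTime on the v1 BatchGetDocumentsRequest is a protobuf Timestamp
// ({seconds, nanos}), not a JS Date, so convert explicitly.
function toProtoTimestamp(date: Date): {seconds: number; nanos: number} {
  const ms = date.getTime();
  return {seconds: Math.floor(ms / 1000), nanos: (ms % 1000) * 1e6};
}

// Hypothetical helper: fetch one document at a stale readTime using the
// low-level v1 client (no transaction, so a single streaming RPC).
// `client` is assumed to be a `new v1.FirestoreClient()` from
// '@google-cloud/firestore'; it is typed `any` to keep the sketch
// self-contained.
function staleGet(client: any, projectId: string, docPath: string, stalenessMs: number) {
  const database = `projects/${projectId}/databases/(default)`;
  const stream = client.batchGetDocuments({
    database,
    documents: [`${database}/documents/${docPath}`],
    readTime: toProtoTimestamp(new Date(Date.now() - stalenessMs)),
  });
  return new Promise((resolve, reject) => {
    stream.on('data', (resp: any) => resolve(resp.found ?? null));
    stream.on('error', reject);
  });
}
```

Going this route means handling retries, decoding raw protobuf Value fields (there is no DocumentSnapshot), and connection settings yourself, which is the trade-off described above.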

Thank you for the question.

Interest in features like this from the developer community helps inform priorities for SDK development. I will be sure to pass this on. Feel free to tell us why this is important.

@IchordeDionysos
Contributor Author

@tom-andersen Thanks for the provided details 👌

The reason I'm asking is that we are looking into this technique for a latency-sensitive service whose latency we want to reduce even further.

We have already looked into and adopted techniques like caching, optimizing business logic, etc.

--

I could imagine the following designs for such a native read-time feature:

```ts
const firestore = getFirestore();
firestore.settings({
  // someFixedDate: the point in time at which all reads should be served
  readTime: Timestamp.fromDate(someFixedDate),
});
```

(For use cases where you'd want all requests to read at a particular point in time. This would be useful for data-recovery scripts, so you don't have to redefine the read time every time.)

and/or:

```ts
getFirestore()
  .doc('foo/bar')
  .get({
    readTime: Timestamp.fromDate(maxDataStaleness),
  });

getFirestore()
  .collection('foo')
  .where('bar', '==', true)
  .get({
    readTime: Timestamp.fromDate(maxDataStaleness),
  });
```

@IchordeDionysos
Contributor Author

IchordeDionysos commented Feb 4, 2024

I've quickly implemented a version of this and ran some tests (10k requests) in a Cloud Shell:
main...simpleclub-extended:nodejs-firestore:feat/support-read-time-on-get

| Metric | With readTime | Without readTime | Improvement |
| --- | --- | --- | --- |
| 50th percentile | 16 ⭐ | 17 | -5.88% |
| 75th percentile | 18 | 18 | - |
| 87.5th percentile | 19 ⭐ | 20 | -5% |
| 93.75th percentile | 21 | 21 | - |
| 96.88th percentile | 23 | 23 | - |
| 98.44th percentile | 25 ⭐ | 27 | -7.41% |
| 99.22th percentile | 35 | 32 ⭐ | +8.57% |
| 99.61th percentile | 48 | 45 ⭐ | +6.25% |
| 99.80th percentile | 77 | 70 ⭐ | +9.09% |
| 99.90th percentile | 101 | 86 ⭐ | +14.85% |
| 99.95th percentile | 110 ⭐ | 112 | -1.79% |
| 99.98th percentile | 115 ⭐ | 359 | -67.97% |
| 99.99th percentile | 125 ⭐ | 565 | -77.88% |
| 99.99th percentile | 512 ⭐ | 1326 | -61.39% |
Test script

```ts
import {Firestore, Timestamp} from '@google-cloud/firestore';
import {createHistogram, performance} from 'perf_hooks';

async function run() {
  const firestore = new Firestore({
    projectId: '<project>',
  });

  const histogram = createHistogram();
  for (let i = 0; i < 10000; i++) {
    const start = performance.now();
    const maxDataStaleness: Date = new Date(
      new Date().getTime() - 15 * 1000
    );
    // Note: get() with a readTime option only exists on the linked
    // feat/support-read-time-on-get branch, not the released SDK.
    await firestore
      .doc('always/the/same/document')
      .get({
        readTime: Timestamp.fromDate(maxDataStaleness),
      });
    const end = performance.now();
    histogram.record(Math.round(end - start));
  }
  console.log('min', histogram.min);
  console.log('max', histogram.max);
  console.log('mean', histogram.mean);
  console.log('stddev', histogram.stddev);
  console.log('exceeds', histogram.exceeds);
  console.log('percentiles', histogram.percentiles);
}
run();
```

@IchordeDionysos
Contributor Author

IchordeDionysos commented Feb 4, 2024

Okay, I quickly ran another test that randomly picks a document instead of reading the same document every time (as repeatedly reading one document may behave differently).

| Metric | With readTime | Without readTime | Improvement |
| --- | --- | --- | --- |
| 50th percentile | 10 ⭐ | 12 | -16.99% |
| 75th percentile | 12 ⭐ | 13 | -7.69% |
| 87.5th percentile | 13 ⭐ | 14 | -7.14% |
| 93.75th percentile | 14 ⭐ | 15 | -6.67% |
| 96.88th percentile | 16 ⭐ | 17 | -5.88% |
| 98.44th percentile | 18 ⭐ | 20 | -10% |
| 99.22th percentile | 20 ⭐ | 26 | -23% |
| 99.61th percentile | 25 ⭐ | 48 | -47.92% |
| 99.80th percentile | 54 ⭐ | 79 | -31.65% |
| 99.90th percentile | 73 ⭐ | 96 | -23.96% |
| 99.95th percentile | 96 ⭐ | 129 | -25.58% |
| 99.98th percentile | 110 ⭐ | 150 | -26.67% |
| 99.99th percentile | 138 ⭐ | 202 | -31.68% |
| 99.99th percentile | 145 ⭐ | 218 | -33.49% |
Test script

```ts
import {Firestore, Timestamp} from '@google-cloud/firestore';
import {createHistogram, performance} from 'perf_hooks';

async function run() {
  const firestore = new Firestore({
    projectId: '<project>',
  });

  const documentIds = await firestore
    .collection('the/test/collection')
    .listDocuments();
  console.log(documentIds.length);

  const histogram = createHistogram();
  for (let i = 0; i < 10000; i++) {
    const start = performance.now();
    const maxDataStaleness: Date = new Date(
      new Date().getTime() - 15 * 1000
    );
    const randomDocument =
      documentIds[Math.floor(Math.random() * documentIds.length)];
    // get() with readTime is from the linked branch, as above.
    await randomDocument.get({
      readTime: Timestamp.fromDate(maxDataStaleness),
    });
    const end = performance.now();
    histogram.record(Math.round(end - start));
  }
  console.log('min', histogram.min);
  console.log('max', histogram.max);
  console.log('mean', histogram.mean);
  console.log('stddev', histogram.stddev);
  console.log('exceeds', histogram.exceeds);
  console.log('percentiles', histogram.percentiles);
}
run();
```

Note: I don't get those numbers consistently 🤔
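Tail percentiles over only 10k samples are inherently noisy, so some run-to-run inconsistency is expected. As a cross-check (an editor's sketch; `percentile` is a hypothetical helper, not part of the scripts above), an exact nearest-rank percentile over the raw latencies can be compared against the bucketed values `createHistogram` reports:

```typescript
// Exact nearest-rank percentile over raw samples; useful for sanity-
// checking approximate or bucketed quantiles at the tail.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}
```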

@tom-andersen
Contributor

tom-andersen commented Feb 5, 2024

Looks like you were able to implement the optimization. This is a good test case, where the only difference is readTime.

Understanding why you see these latencies is a little beyond SDK support. I am sure there are other customer-specific factors in play, such as database size, concurrent writes, and warmup.

You may want to use Firebase support to get answers specific to your use case:

https://firebase.google.com/support/troubleshooter/firestore/queries

Can I help you with anything else?

@tom-andersen
Contributor

tom-andersen commented Feb 6, 2024

Follow-up for @IchordeDionysos. I asked internally and was given some explanation:

Stale reads have two main values:

  1. Avoiding any waits for pending writes. So if they are comparing strong vs. stale reads on a read-only workload, there is likely little difference.
  2. Using the non-primary region for reads. If they are using a regional instance, then this one isn't applicable.

In your case, (2) is applicable.

You should run the workload (a) without transactions and (b) from europe-west4 instead of europe-west1.

@tom-andersen
Contributor

@IchordeDionysos The next release of the SDK will include an optimization for transactions with readTime. It will reduce the number of requests required, and thereby the latency. Feel free to run your test again with version 7.3.1 or newer.

#2002
