
Improve read performance by using stale reads #1994

Open
IchordeDionysos opened this issue Feb 2, 2024 · 8 comments
Assignees: tom-andersen
Labels: api: firestore (Issues related to the googleapis/nodejs-firestore API) · priority: p3 (Desirable enhancement or fix; may not be included in next release) · type: question (Request for information or clarification; not an issue)

Comments

@IchordeDionysos
Contributor

The documentation mentions that stale reads may improve the performance of reading from Firestore, as data can be fetched from the nearest replica without having to reconfirm with the leader replica:
https://firebase.google.com/docs/firestore/understand-reads-writes-scale#stale_reads

I'm using the following code to perform a stale read:

```ts
// How stale a read may be (moved out of the if-block; the original
// `export const` inside the block was a syntax error).
const STALE_READ_STALENESS = 60 * 1000; // 1 minute

const random = Math.random();
const useStaleReads = random < USE_STALE_READ_PERCENTAGE;

logger.profile(`stale-read-${random}`);

let snap: DocumentSnapshot<FirebaseFirestore.DocumentData>;
if (useStaleReads) {
  // Read the document as it existed one minute ago, via a
  // read-only transaction with an explicit readTime.
  const maxDataStaleness: Date = new Date(
    new Date().getTime() - STALE_READ_STALENESS
  );
  snap = await firestore.runTransaction(
    async t => {
      return t.get(ref);
    },
    {
      readOnly: true,
      readTime: Timestamp.fromDate(maxDataStaleness),
    }
  );
} else {
  snap = await ref.get();
}

logger.profile(`stale-read-${random}`, {
  level: 'info',
  message: 'Read from Firestore',
  meta: {
    useStaleReads,
  },
});
```

As the data does not change very often, one minute (or even longer) of staleness is acceptable for us.
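One caveat worth noting (an editor's addition, hedged): Firestore documents that a transaction's readTime must lie within the past hour, or up to seven days when point-in-time recovery is enabled. A small guard, using hypothetical names (`MAX_STALENESS_MS`, `staleReadTime`), could keep the staleness within that window:

```typescript
// Sketch: clamp the requested staleness so the resulting readTime stays
// within the window Firestore accepts (assumed here: one hour, the
// documented limit without point-in-time recovery).
const MAX_STALENESS_MS = 60 * 60 * 1000; // 1 hour

function staleReadTime(stalenessMs: number): Date {
  const clamped = Math.min(stalenessMs, MAX_STALENESS_MS);
  return new Date(Date.now() - clamped);
}
```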

But what we are seeing is that the strong reads are faster than the stale reads:
*(two screenshots: latency percentile comparisons for strong vs. stale reads)*

Query used for analysing the logs

```sql
WITH latencies AS (
  SELECT
    timestamp,
    JSON_VALUE(json_payload.metadata.useStaleReads) AS uses_stale_reads,
    JSON_VALUE(json_payload.metadata.profile.durationMs) AS duration_in_ms,
  FROM `simpleclub.global._Default._AllLogs` AS logs
  WHERE NORMALIZE_AND_CASEFOLD(logs.resource.type, NFKC) = "cloud_run_revision"
    AND NORMALIZE_AND_CASEFOLD(SAFE.STRING(logs.resource.labels["revision_name"]), NFKC) = "cloud-run-revision"
    AND NORMALIZE_AND_CASEFOLD(SAFE.STRING(logs.resource.labels["service_name"]), NFKC) = "cloud-run-service"
    AND REGEXP_CONTAINS(SAFE.STRING(logs.json_payload["metadata"]["profile"]["id"]), "stale")
    AND JSON_VALUE(json_payload.metadata.useStaleReads) = "true"
  ORDER BY timestamp DESC
)
SELECT
  STRUCT(
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(5000)] AS percentile_50,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(7500)] AS percentile_75,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9000)] AS percentile_90,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9500)] AS percentile_95,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9900)] AS percentile_99,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9950)] AS percentile_99_5,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9990)] AS percentile_99_9,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9995)] AS percentile_99_95,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9999)] AS percentile_99_99
  ) AS duration_in_ms,
  uses_stale_reads,
  COUNT(*) AS request_count
FROM latencies
GROUP BY uses_stale_reads
```

I wanted to share this experience with you; maybe I'm doing something wrong here.
I'm also not sure whether increasing the staleness to 60s (instead of 15s) breaks it?

Interesting data:

  • We are using Firestore via GRPC (not REST)
  • @google-cloud/firestore: v6.8.0
  • Firestore database is hosted in eur3 (multi-region)
  • Deployed on Cloud Run
    • Always on CPU
    • CPU start-up boost
    • max 40 requests / instance
    • 1st gen execution environment
    • 1 CPU
    • 4GiB memory
@IchordeDionysos IchordeDionysos added priority: p3 Desirable enhancement or fix. May not be included in next release. type: question Request for information or clarification. Not an issue. labels Feb 2, 2024
@product-auto-label product-auto-label bot added the api: firestore Issues related to the googleapis/nodejs-firestore API. label Feb 2, 2024
@IchordeDionysos
Contributor Author

IchordeDionysos commented Feb 2, 2024

A quick test with the 15s staleness shows very similar numbers ...

@tom-andersen tom-andersen self-assigned this Feb 2, 2024
@tom-andersen
Contributor

tom-andersen commented Feb 2, 2024

There is an unfortunate implementation detail: transactions send a BeginTransaction request, followed by your get-document requests. Effectively, that means a transaction sends multiple requests where a regular get sends one.

We are looking to improve this.

The v1 FirestoreClient allows complete access to the communication protocol, including the ability to set readTime on get-document requests. With this, you could achieve improved performance. However, it means taking responsibility for many of the things the regular API surface handles for you. Unless you really need this, I suggest you wait until we improve the regular API surface and/or optimize our handling of transactions with readTime.
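To make the above concrete, here is a rough sketch (an editor's illustration, not code from this thread) of a stale point read via the v1 surface. The `batchGetDocuments` stream and its `readTime` field exist on the v1 client; the helper names (`toProtoTimestamp`, `staleGet`) and the untyped `client` parameter are this sketch's own assumptions:

```typescript
// readTime on the v1 BatchGetDocumentsRequest is a protobuf Timestamp
// ({seconds, nanos}), not a JS Date, so convert explicitly.
function toProtoTimestamp(date: Date): {seconds: number; nanos: number} {
  const ms = date.getTime();
  return {seconds: Math.floor(ms / 1000), nanos: (ms % 1000) * 1e6};
}

// Hypothetical helper: fetch one document at a stale readTime using the
// low-level v1 client (no transaction, so a single streaming RPC).
// `client` is assumed to be a `new v1.FirestoreClient()` from
// '@google-cloud/firestore'; it is typed `any` to keep the sketch
// self-contained.
function staleGet(client: any, projectId: string, docPath: string, stalenessMs: number) {
  const database = `projects/${projectId}/databases/(default)`;
  const stream = client.batchGetDocuments({
    database,
    documents: [`${database}/documents/${docPath}`],
    readTime: toProtoTimestamp(new Date(Date.now() - stalenessMs)),
  });
  return new Promise((resolve, reject) => {
    stream.on('data', (resp: any) => resolve(resp.found ?? null));
    stream.on('error', reject);
  });
}
```

Going this route means handling retries, decoding raw protobuf Value fields (there is no DocumentSnapshot), and connection settings yourself, which is the trade-off described above.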

Thank you for the question.

Interest in features like this from the developer community helps inform priorities for SDK development. I will be sure to pass this on. Feel free to tell us why this is important.

@IchordeDionysos
Contributor Author

@tom-andersen Thanks for the provided details 👌

The reason I'm asking is that we are looking into this technique for a latency-sensitive service whose latency we want to reduce even further.

We have already looked into and adopted techniques like caching, optimizing business logic, etc.

--

I could imagine the following designs for such a native read-time feature:

```ts
const firestore = getFirestore();
firestore.settings({
  // someFixedDate: the point in time at which all reads should be served
  readTime: Timestamp.fromDate(someFixedDate),
});
```

(For use cases where you'd want all requests to read at a particular point in time. This would be useful for data-recovery scripts, so you don't have to redefine the read time every time.)

and/or:

```ts
getFirestore()
  .doc('foo/bar')
  .get({
    readTime: Timestamp.fromDate(maxDataStaleness),
  });

getFirestore()
  .collection('foo')
  .where('bar', '==', true)
  .get({
    readTime: Timestamp.fromDate(maxDataStaleness),
  });
```

@IchordeDionysos
Contributor Author

IchordeDionysos commented Feb 4, 2024

I've quickly implemented a version of this and ran some tests (10k requests) in a Cloud Shell:
main...simpleclub-extended:nodejs-firestore:feat/support-read-time-on-get

| Metric | With readTime | Without readTime | Improvement |
| --- | --- | --- | --- |
| 50th percentile | 16 ⭐ | 17 | -5.88% |
| 75th percentile | 18 | 18 | - |
| 87.5th percentile | 19 ⭐ | 20 | -5% |
| 93.75th percentile | 21 | 21 | - |
| 96.88th percentile | 23 | 23 | - |
| 98.44th percentile | 25 ⭐ | 27 | -7.41% |
| 99.22th percentile | 35 | 32 ⭐ | +8.57% |
| 99.61th percentile | 48 | 45 ⭐ | +6.25% |
| 99.80th percentile | 77 | 70 ⭐ | +9.09% |
| 99.90th percentile | 101 | 86 ⭐ | +14.85% |
| 99.95th percentile | 110 ⭐ | 112 | -1.79% |
| 99.98th percentile | 115 ⭐ | 359 | -67.97% |
| 99.99th percentile | 125 ⭐ | 565 | -77.88% |
| 99.99th percentile | 512 ⭐ | 1326 | -61.39% |
Test script

```ts
import {Firestore, Timestamp} from '@google-cloud/firestore';
import {createHistogram, performance} from 'perf_hooks';

async function run() {
  const firestore = new Firestore({
    projectId: '<project>',
  });

  const histogram = createHistogram();
  for (let i = 0; i < 10000; i++) {
    const start = performance.now();
    const maxDataStaleness: Date = new Date(
      new Date().getTime() - 15 * 1000
    );
    // Note: get() with a readTime option only exists on the linked
    // feat/support-read-time-on-get branch, not the released SDK.
    await firestore
      .doc('always/the/same/document')
      .get({
        readTime: Timestamp.fromDate(maxDataStaleness),
      });
    const end = performance.now();
    histogram.record(Math.round(end - start));
  }
  console.log('min', histogram.min);
  console.log('max', histogram.max);
  console.log('mean', histogram.mean);
  console.log('stddev', histogram.stddev);
  console.log('exceeds', histogram.exceeds);
  console.log('percentiles', histogram.percentiles);
}
run();
```

@IchordeDionysos
Contributor Author

IchordeDionysos commented Feb 4, 2024

Okay, I quickly ran another test that randomly picks a document instead of reading the same document every time (as repeatedly reading one document may behave differently).

| Metric | With readTime | Without readTime | Improvement |
| --- | --- | --- | --- |
| 50th percentile | 10 ⭐ | 12 | -16.99% |
| 75th percentile | 12 ⭐ | 13 | -7.69% |
| 87.5th percentile | 13 ⭐ | 14 | -7.14% |
| 93.75th percentile | 14 ⭐ | 15 | -6.67% |
| 96.88th percentile | 16 ⭐ | 17 | -5.88% |
| 98.44th percentile | 18 ⭐ | 20 | -10% |
| 99.22th percentile | 20 ⭐ | 26 | -23% |
| 99.61th percentile | 25 ⭐ | 48 | -47.92% |
| 99.80th percentile | 54 ⭐ | 79 | -31.65% |
| 99.90th percentile | 73 ⭐ | 96 | -23.96% |
| 99.95th percentile | 96 ⭐ | 129 | -25.58% |
| 99.98th percentile | 110 ⭐ | 150 | -26.67% |
| 99.99th percentile | 138 ⭐ | 202 | -31.68% |
| 99.99th percentile | 145 ⭐ | 218 | -33.49% |
Test script

```ts
import {Firestore, Timestamp} from '@google-cloud/firestore';
import {createHistogram, performance} from 'perf_hooks';

async function run() {
  const firestore = new Firestore({
    projectId: '<project>',
  });

  const documentIds = await firestore
    .collection('the/test/collection')
    .listDocuments();
  console.log(documentIds.length);

  const histogram = createHistogram();
  for (let i = 0; i < 10000; i++) {
    const start = performance.now();
    const maxDataStaleness: Date = new Date(
      new Date().getTime() - 15 * 1000
    );
    const randomDocument =
      documentIds[Math.floor(Math.random() * documentIds.length)];
    // get() with readTime is from the linked branch, as above.
    await randomDocument.get({
      readTime: Timestamp.fromDate(maxDataStaleness),
    });
    const end = performance.now();
    histogram.record(Math.round(end - start));
  }
  console.log('min', histogram.min);
  console.log('max', histogram.max);
  console.log('mean', histogram.mean);
  console.log('stddev', histogram.stddev);
  console.log('exceeds', histogram.exceeds);
  console.log('percentiles', histogram.percentiles);
}
run();
```

Note: I don't get those numbers consistently 🤔
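Tail percentiles over only 10k samples are inherently noisy, so some run-to-run inconsistency is expected. As a cross-check (an editor's sketch; `percentile` is a hypothetical helper, not part of the scripts above), an exact nearest-rank percentile over the raw latencies can be compared against the bucketed values `createHistogram` reports:

```typescript
// Exact nearest-rank percentile over raw samples; useful for sanity-
// checking approximate or bucketed quantiles at the tail.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}
```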

@tom-andersen
Contributor

tom-andersen commented Feb 5, 2024

Looks like you were able to implement the optimization. This is a good test case, where the only difference is readTime.

Understanding why you see these latencies is a little beyond SDK support. I am sure there are other customer-specific factors in play, such as database size, concurrent writes, and warmup.

You may want to use Firebase support to get answers specific to your use case:

https://firebase.google.com/support/troubleshooter/firestore/queries

Can I help you with anything else?

@tom-andersen
Contributor

tom-andersen commented Feb 6, 2024

Follow-up for @IchordeDionysos. I asked internally and was given some explanation:

Stale reads have two main values:

  1. Avoiding any waits for pending writes. So if they are comparing strong vs. stale reads on a read-only workload, there is likely little difference.
  2. Using the non-primary region for reads. If they are using a regional instance, then this one isn't applicable.

In your case, (2) is applicable.

You should run the workload (a) without transactions and (b) from europe-west4 instead of europe-west1.

@tom-andersen
Contributor

@IchordeDionysos The next release of the SDK will include an optimization for transactions with readTime. It will reduce the number of requests required, and thereby the latency. Feel free to run your test again with version 7.3.1 or newer.

#2002
