
Store cache in S3 storage #27

Open · wants to merge 2 commits into master
Conversation

@infomiho (Contributor) commented on May 6, 2024:

If the cache doesn't exist locally, we fetch it from S3.

When we save the cache locally, we upload it to S3 as well.

@infomiho requested a review from @Martinsos on May 6, 2024, 14:23
@Martinsos (Member) left a comment:

Nice, this will be great! Left some questions.

src/analytics/events.ts (outdated, resolved)
allOldEventsFetched = !isThereMore;
}
await saveCachedEvents(events);
@Martinsos (Member) commented:

So before, we saved the cached events after every successful fetch of events.

Now, you save them only once all the old events are fetched, which can take minutes (there are roughly 22 MB of them currently when starting from scratch), and the process can fail while doing this. Meaning it might fetch a decent amount of events, let's say 10 MB of them, only to die before saving them to disk/S3, and then next time it will start from scratch.

Maybe that is ok? Most often it will already have most of the events fetched, and it will only need to fetch a small amount of events to add to that total.

But if we ever need to build the cache of events from scratch (a cold start), this could be an issue. That shouldn't be happening often though, hm.

Some possible solutions:

  1. Save to disk after every successful fetch, and store to the cloud only once all are fetched. Hmm, but that won't actually work, because we first load from the cloud, where there will be no events or only older ones (fewer events), and then we'd be starting from scratch again. Unless we were smart when downloading events and didn't use the downloaded events when the ones we already have on disk are fresher (checking which set has more events could be a good enough check).
  2. Save (to disk and S3 at the same time) every so often -> not too often, but also not too late. So maybe after every X events fetched, or after every X amount of time elapsed? (See the sketch below.)
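To make option 2 concrete, here is a rough sketch of what periodic checkpointing could look like; only `saveCachedEvents`, `isThereMore`, and `PosthogEvent` come from this PR, while the batch-fetching helper and the threshold are made-up illustrations:

```ts
// Sketch of option 2: checkpoint the cache every N fetched batches instead of
// only once at the very end. `fetchNextBatch` and the threshold are
// hypothetical; `saveCachedEvents` and `PosthogEvent` are from this PR.
async function fetchAllOldEventsWithCheckpoints(
  fetchNextBatch: () => Promise<{ events: PosthogEvent[]; isThereMore: boolean }>,
  initialEvents: PosthogEvent[],
): Promise<PosthogEvent[]> {
  const CHECKPOINT_EVERY_N_BATCHES = 10;
  let events = initialEvents;
  let batchesSinceCheckpoint = 0;
  let allOldEventsFetched = false;

  while (!allOldEventsFetched) {
    const { events: olderEvents, isThereMore } = await fetchNextBatch();
    // Newest events stay first; older batches get appended at the end.
    events = [...events, ...olderEvents];
    allOldEventsFetched = !isThereMore;

    batchesSinceCheckpoint += 1;
    if (batchesSinceCheckpoint >= CHECKPOINT_EVERY_N_BATCHES || allOldEventsFetched) {
      // Save to disk (and, if configured, S3) so a crash loses at most the
      // last few batches instead of the whole download.
      await saveCachedEvents(events);
      batchesSinceCheckpoint = 0;
    }
  }
  return events;
}
```

With something like this, a crash mid-download loses at most the last few batches rather than everything fetched so far.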

Btw, is there anything we have to worry about regarding the dev vs prod environment? I do sometimes run analytics locally, because I am experimenting with them, extracting some novel analytics, or some more custom analytics than we can get from the Discord bot. So I will want access to the actual prod S3 file then. Ok. But could there be some conflicts? What if both me and production are saving the file at the same time? Could that cause any issues? I guess not, hm, since the file is handled in such an atomic way (always loaded whole, always saved whole).

@infomiho (Contributor, author) replied:

Local vs remote cache: we could split the process of saving the file locally and uploading it. Since we first download the remote cache, the new data we fetch is additive. Whatever we upload is going to have more data than what we downloaded, if I'm not mistaken? So if we upload only after we've fetched everything, that's okay. If the process fails somewhere, oh well, we'll retry and fetch again.

I assume the biggest fear is that we'll corrupt the remote cache? Then we'd delete it manually from S3 and refetch everything (not a big problem, since we have been doing exactly that this whole time!).

Dev vs. prod concerns: you just use different buckets for dev and prod; this is how people usually set up S3 buckets :)
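For illustration only, the storage helpers could look roughly like this with the AWS SDK v3, with the bucket name coming from the environment so dev and prod can point at different buckets. The env variable names and the download signature are assumptions, not necessarily what this PR's `storage.ts` and `env.example` actually use:

```ts
import * as fs from "node:fs/promises";
import {
  S3Client,
  GetObjectCommand,
  PutObjectCommand,
} from "@aws-sdk/client-s3";

// Assumed env var names; the real ones are whatever env.example defines.
const s3 = new S3Client({ region: process.env.S3_REGION });
const bucket = process.env.S3_BUCKET;

export async function uploadFileToStorage(
  remotePath: string,
  localPath: string,
): Promise<void> {
  const body = await fs.readFile(localPath);
  await s3.send(
    new PutObjectCommand({ Bucket: bucket, Key: remotePath, Body: body }),
  );
}

export async function downloadFileFromStorage(
  remotePath: string,
  localPath: string,
): Promise<void> {
  const response = await s3.send(
    new GetObjectCommand({ Bucket: bucket, Key: remotePath }),
  );
  const bytes = await response.Body!.transformToByteArray();
  await fs.writeFile(localPath, bytes);
}
```

With that shape, switching between a dev and a prod bucket is just a matter of what `S3_BUCKET` is set to in each environment.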

@Martinsos (Member) replied:

You are right, we are uploading more than we downloaded, so that is not the issue. What I was concerned with is that we can spend quite some time downloading data only to have a single error ruin it all before we upload anything, forcing us to start from scratch.
But ok, that was more of an issue before I did the PR that made the process less flaky, so now it is less likely to die in the middle of downloading. It might become an issue again once we have bigger amounts of events, so we should be aware of it, but I think we can postpone worrying about it for now.

No, corrupting the remote cache is not much of a fear I think; we are doing it all atomically, so it can't really happen.

Different buckets for dev and prod: right, but this is not the typical use case, and I explained above how it is different. I might be running it locally, but I do want the actual real events; there are no "dev" events. So both in prod and in dev I want to work with the same events, which is atypical. We could have this split between dev and prod just to be safe, but that actually sucks, because if I haven't run it locally for a while (and I probably haven't), I will have quite a lot to catch up on, a lot of events to download. Ok, so it's probably best to use the same bucket for both by default, and use a dev bucket only when we are doing actual development on that part of the program and want to check it works correctly without corrupting the production file. We should document this suggestion in the README and tell people how to go about it.

src/analytics/events.ts (outdated, resolved)
src/analytics/events.ts (resolved)
src/analytics/storage.ts (outdated, resolved)
env.example (resolved)
@@ -115,13 +116,24 @@ async function fetchEvents({

// NOTE: This file is gitignored. If you change its name, update it also in gitignore.
const cachedEventsFilePath = "wasp-analytics-cached-events.json";
const cacheedEventsRemotePath = "wasp-analytics-cached-events.json";
@Martinsos (Member) commented:

So it is the same name, but you decided to keep it as a separate constant, just to make it clear it doesn't have to be the same name? That is all right, I guess! Although even if you used the same name all the way through, I think it would be fine.

@@ -115,13 +116,24 @@ async function fetchEvents({

// NOTE: This file is gitignored. If you change its name, update it also in gitignore.
const cachedEventsFilePath = "wasp-analytics-cached-events.json";
const cacheedEventsRemotePath = "wasp-analytics-cached-events.json";
@Martinsos (Member) commented:

Suggested change
const cacheedEventsRemotePath = "wasp-analytics-cached-events.json";
const cachedEventsRemotePath = "wasp-analytics-cached-events.json";

Will also require updating below.

@@ -13,6 +13,9 @@ Copy env.example to .env and fill it with your PostHog API key and Discord bot t

Both are needed to run the bot locally, but only the PostHog API key is needed to run just the analytics locally.

Also, if you want to use the S3 storage to store the cache file, fill in the S3 credentials as well. This is useful if you want to
test the S3 storage locally.
@Martinsos (Member) commented:

Not only that, but also so you don't have to wait a long time to fetch all the analytics locally, if you don't have any fetched yet.
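As a side note, one way the code could detect whether those credentials have been filled in is a small helper along these lines; the env variable names are placeholders, not necessarily the ones in this PR's `env.example`:

```ts
// Hypothetical helper: treat S3 as configured only when all of the assumed
// credential env vars are present; the real names are in env.example.
export function isS3Used(): boolean {
  const requiredEnvVars = [
    "S3_REGION",
    "S3_BUCKET",
    "S3_ACCESS_KEY_ID",
    "S3_SECRET_ACCESS_KEY",
  ];
  return requiredEnvVars.every((name) => !!process.env[name]);
}
```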

cachedEventsFilePath,
);
} catch (e: unknown) {
// Log and continue
@Martinsos (Member) commented:

Redundant comment

try {
await uploadFileToStorage(cacheedEventsRemotePath, cachedEventsFilePath);
} catch (e: unknown) {
// Log and continue
@Martinsos (Member) commented:

Redundant comment.


// Returns: [PosthogEvent]
// where events are guaranteed to be continuous, with no missing events between the cached events.
// Newest event is first (index 0), and oldest event is last, and cached events are continuous,
// in the sense that there are no events missing between the oldest and newest.
// There might be missing events before or after though.
async function loadCachedEvents(): Promise<PosthogEvent[]> {
try {
await downloadFileFromStorage(
@Martinsos (Member) commented:

But you try to download regardless of whether S3 is being used or not, from what I see? So even if I set S3 to not being used, this will fail. Aha, I will just get an error message, but that is it. It might be nicer to check here whether S3 is used: if yes, download the file; if not, do nothing. It's a bit weird otherwise that I say S3 is not used and then still get an error about downloading the cache from S3.
If you don't like this direction, maybe rather drop the notion of IS_S3_USED and just emit warnings when the S3 config is missing, and that is it.
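A rough sketch of the first direction, assuming a hypothetical `isS3Used()` helper (like the one sketched earlier) and the renamed `cachedEventsRemotePath` constant; the local-file loading itself stays as in the PR and is only hinted at here:

```ts
// Sketch only: skip the S3 download entirely when S3 isn't configured,
// instead of attempting it and logging an error on every run.
// `isS3Used` and `loadCachedEventsFromLocalFile` are hypothetical names;
// the other identifiers come from this PR's diff.
async function loadCachedEvents(): Promise<PosthogEvent[]> {
  if (isS3Used()) {
    try {
      await downloadFileFromStorage(cachedEventsRemotePath, cachedEventsFilePath);
    } catch (e: unknown) {
      console.error("Failed to download cached events from S3:", e);
    }
  }
  // Loading and parsing the local cache file stays exactly as in the PR.
  return loadCachedEventsFromLocalFile();
}
```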
