
Store cache in S3 storage #27

Open · wants to merge 2 commits into master
Conversation

@infomiho (Contributor) commented on May 6, 2024:

If the cache doesn't exist locally, we fetch it from S3.

When we save the cache locally, we upload it to S3 as well.

@infomiho requested a review from @Martinsos on May 6, 2024, 14:23
@Martinsos (Member) left a comment:

Nice, this will be great! Left some questions.

src/analytics/events.ts (outdated, resolved)
allOldEventsFetched = !isThereMore;
}
await saveCachedEvents(events);
@Martinsos (Member) commented:

So before, we saved the cached events after every successful fetch of events.

Now, you save them only once all the old events are fetched, which can take minutes (there are roughly 22 MB of them currently when starting from scratch), and the process can fail while doing this. Meaning it might fetch a decent amount of events, let's say 10 MB of them, only to die before saving them to disk/S3, and then next time it will start from scratch.

Maybe that is ok? Most often it will already have most of the events fetched, and it will only need to fetch a small amount of events to add to that total.

But if we ever need to build the cache of events from scratch (a cold start), this could be an issue. That shouldn't be happening often though, hm.

Some possible solutions:

  1. Save to disk after every successful fetch, and store to the cloud only once all are fetched. Hmm, but that won't actually work, because we first load from the cloud, where there will be no events or only older ones (fewer events), and then we'd be starting from scratch again. Unless we were smart when downloading events and didn't use the downloaded events when the ones we already have on disk are fresher (checking which set has more events could be a good enough check).
  2. Save (to disk and S3 at the same time) every so often -> not too often, but also not too late. So maybe after every X events fetched, or after every X amount of time elapsed? (See the sketch below.)
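To make option 2 concrete, here is a rough sketch of what periodic checkpointing could look like; only `saveCachedEvents`, `isThereMore`, and `PosthogEvent` come from this PR, while the batch-fetching helper and the threshold are made-up illustrations:

```ts
// Sketch of option 2: checkpoint the cache every N fetched batches instead of
// only once at the very end. `fetchNextBatch` and the threshold are
// hypothetical; `saveCachedEvents` and `PosthogEvent` are from this PR.
async function fetchAllOldEventsWithCheckpoints(
  fetchNextBatch: () => Promise<{ events: PosthogEvent[]; isThereMore: boolean }>,
  initialEvents: PosthogEvent[],
): Promise<PosthogEvent[]> {
  const CHECKPOINT_EVERY_N_BATCHES = 10;
  let events = initialEvents;
  let batchesSinceCheckpoint = 0;
  let allOldEventsFetched = false;

  while (!allOldEventsFetched) {
    const { events: olderEvents, isThereMore } = await fetchNextBatch();
    // Newest events stay first; older batches get appended at the end.
    events = [...events, ...olderEvents];
    allOldEventsFetched = !isThereMore;

    batchesSinceCheckpoint += 1;
    if (batchesSinceCheckpoint >= CHECKPOINT_EVERY_N_BATCHES || allOldEventsFetched) {
      // Save to disk (and, if configured, S3) so a crash loses at most the
      // last few batches instead of the whole download.
      await saveCachedEvents(events);
      batchesSinceCheckpoint = 0;
    }
  }
  return events;
}
```

With something like this, a crash mid-download loses at most the last few batches rather than everything fetched so far.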

Btw, is there anything we have to worry about regarding the dev vs prod environment? I do sometimes run analytics locally, because I am experimenting with them, extracting some novel analytics, or some more custom analytics than we can get from the Discord bot. So I will want access to the actual prod S3 file then. Ok. But could there be some conflicts? What if both me and production are saving the file at the same time? Could that cause any issues? I guess not, hm, since the file is handled in such an atomic way (always loaded whole, always saved whole).

@infomiho (Contributor, author) replied:

Local vs remote cache: we could split the process of saving the file locally and uploading it. Since we first download the remote cache, the new data we fetch is additive. Whatever we upload is going to have more data than what we downloaded, if I'm not mistaken? So if we upload only after we've fetched everything, that's okay. If the process fails somewhere, oh well, we'll retry and fetch again.

I assume the biggest fear is that we'll corrupt the remote cache? Then we'd delete it manually from S3 and refetch everything (not a big problem, since we have been doing exactly that this whole time!).

Dev vs. prod concerns: you just use different buckets for dev and prod; this is how people usually set up S3 buckets :)
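For illustration only, the storage helpers could look roughly like this with the AWS SDK v3, with the bucket name coming from the environment so dev and prod can point at different buckets. The env variable names and the download signature are assumptions, not necessarily what this PR's `storage.ts` and `env.example` actually use:

```ts
import * as fs from "node:fs/promises";
import {
  S3Client,
  GetObjectCommand,
  PutObjectCommand,
} from "@aws-sdk/client-s3";

// Assumed env var names; the real ones are whatever env.example defines.
const s3 = new S3Client({ region: process.env.S3_REGION });
const bucket = process.env.S3_BUCKET;

export async function uploadFileToStorage(
  remotePath: string,
  localPath: string,
): Promise<void> {
  const body = await fs.readFile(localPath);
  await s3.send(
    new PutObjectCommand({ Bucket: bucket, Key: remotePath, Body: body }),
  );
}

export async function downloadFileFromStorage(
  remotePath: string,
  localPath: string,
): Promise<void> {
  const response = await s3.send(
    new GetObjectCommand({ Bucket: bucket, Key: remotePath }),
  );
  const bytes = await response.Body!.transformToByteArray();
  await fs.writeFile(localPath, bytes);
}
```

With that shape, switching between a dev and a prod bucket is just a matter of what `S3_BUCKET` is set to in each environment.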

@Martinsos (Member) replied:

You are right, we are uploading more than we downloaded, so that is not the issue. What I was concerned with is that we can spend quite some time downloading data only to have a single error ruin it all before we upload anything, forcing us to start from scratch.
But ok, that was more of an issue before I did the PR that made the process less flaky, so now it is less likely to die in the middle of downloading. It might become an issue again once we have bigger amounts of events, so we should be aware of it, but I think we can postpone worrying about it for now.

No, corrupting the remote cache is not much of a fear I think; we are doing it all atomically, so it can't really happen.

Different buckets for dev and prod: right, but this is not the typical use case, and I explained above how it is different. I might be running it locally, but I do want the actual real events; there are no "dev" events. So both in prod and in dev I want to work with the same events, which is atypical. We could have this split between dev and prod just to be safe, but that actually sucks, because if I haven't run it locally for a while (and I probably haven't), I will have quite a lot to catch up on, a lot of events to download. Ok, so it's probably best to use the same bucket for both by default, and use a dev bucket only when we are doing actual development on that part of the program and want to check it works correctly without corrupting the production file. We should document this suggestion in the README and tell people how to go about it.

src/analytics/events.ts (outdated, resolved)
src/analytics/events.ts (resolved)
src/analytics/storage.ts (outdated, resolved)
env.example (resolved)
@@ -115,13 +116,24 @@ async function fetchEvents({

// NOTE: This file is gitignored. If you change its name, update it also in gitignore.
const cachedEventsFilePath = "wasp-analytics-cached-events.json";
const cacheedEventsRemotePath = "wasp-analytics-cached-events.json";
@Martinsos (Member) commented:

So it is the same name, but you decided to keep it as a separate constant, just to make it clear it doesn't have to be the same name? That is all right, I guess! Although even if you used the same name all the way through, I think it would be fine.

@@ -115,13 +116,24 @@ async function fetchEvents({

// NOTE: This file is gitignored. If you change its name, update it also in gitignore.
const cachedEventsFilePath = "wasp-analytics-cached-events.json";
const cacheedEventsRemotePath = "wasp-analytics-cached-events.json";
@Martinsos (Member) commented:

Suggested change
const cacheedEventsRemotePath = "wasp-analytics-cached-events.json";
const cachedEventsRemotePath = "wasp-analytics-cached-events.json";

Will also require updating below.

@@ -13,6 +13,9 @@ Copy env.example to .env and fill it with your PostHog API key and Discord bot t

Both are needed to run the bot locally, but only the PostHog API key is needed to run just the analytics locally.

Also, if you want to use the S3 storage to store the cache file, fill in the S3 credentials as well. This is useful if you want to
test the S3 storage locally.
@Martinsos (Member) commented:

Not only that, but also so you don't have to wait a long time to fetch all the analytics locally, if you don't have any fetched yet.
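As a side note, one way the code could detect whether those credentials have been filled in is a small helper along these lines; the env variable names are placeholders, not necessarily the ones in this PR's `env.example`:

```ts
// Hypothetical helper: treat S3 as configured only when all of the assumed
// credential env vars are present; the real names are in env.example.
export function isS3Used(): boolean {
  const requiredEnvVars = [
    "S3_REGION",
    "S3_BUCKET",
    "S3_ACCESS_KEY_ID",
    "S3_SECRET_ACCESS_KEY",
  ];
  return requiredEnvVars.every((name) => !!process.env[name]);
}
```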

cachedEventsFilePath,
);
} catch (e: unknown) {
// Log and continue
@Martinsos (Member) commented:

Redundant comment

try {
await uploadFileToStorage(cacheedEventsRemotePath, cachedEventsFilePath);
} catch (e: unknown) {
// Log and continue
@Martinsos (Member) commented:

Redundant comment.


// Returns: [PosthogEvent]
// where events are guaranteed to be continuous, with no missing events between the cached events.
// Newest event is first (index 0), and oldest event is last, and cached events are continuous,
// in the sense that there are no events missing between the oldest and newest.
// There might be missing events before or after though.
async function loadCachedEvents(): Promise<PosthogEvent[]> {
try {
await downloadFileFromStorage(
@Martinsos (Member) commented:

But you try to download regardless of whether S3 is being used or not, from what I see? So even if I set S3 to not being used, this will fail. Aha, I will just get an error message, but that is it. It might be nicer to check here whether S3 is used: if yes, download the file; if not, do nothing. It's a bit weird otherwise that I say S3 is not used and then still get an error about downloading the cache from S3.
If you don't like this direction, maybe rather drop the notion of IS_S3_USED and just emit warnings when the S3 config is missing, and that is it.
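A rough sketch of the first direction, assuming a hypothetical `isS3Used()` helper (like the one sketched earlier) and the renamed `cachedEventsRemotePath` constant; the local-file loading itself stays as in the PR and is only hinted at here:

```ts
// Sketch only: skip the S3 download entirely when S3 isn't configured,
// instead of attempting it and logging an error on every run.
// `isS3Used` and `loadCachedEventsFromLocalFile` are hypothetical names;
// the other identifiers come from this PR's diff.
async function loadCachedEvents(): Promise<PosthogEvent[]> {
  if (isS3Used()) {
    try {
      await downloadFileFromStorage(cachedEventsRemotePath, cachedEventsFilePath);
    } catch (e: unknown) {
      console.error("Failed to download cached events from S3:", e);
    }
  }
  // Loading and parsing the local cache file stays exactly as in the PR.
  return loadCachedEventsFromLocalFile();
}
```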
