
Smart caching #6

Open
SwiftWinds opened this issue Dec 22, 2021 · 3 comments

Comments


SwiftWinds commented Dec 22, 2021

Implement smart caching of Reddit threads and such to drastically improve performance (don't forget to invalidate caches intelligently too)

Related to Caching: Have a Quick Search vs Extensive Search, where quick search only searches through and returns cached results and extensive search tries the process from the start (pulling newer Reddit comments)

@SwiftWinds SwiftWinds created this issue from a note in Recommeddit Project Board (In progress) Dec 22, 2021

SwiftWinds commented Dec 22, 2021

database structure:

  • queries
    • name: string
    • category: string
    • googleResults: list<string>
    • lastValidated: datetime
    • results: list<{entity: reference, score: number}>
  • entities
    • name: string
    • validCategories: list<string>
    • invalidCategories: list<string>
    • description: string
    • imageUrl: string
  • threads
    • id: string
    • subreddit: string
    • url: string
    • title: string
    • selftext: string
    • votes: number
    • percentUpvoted: number
    • cacheEntryCreated: datetime
    • cacheEntryLastUpdated: datetime
    • comments:
      • id: string
      • url: string
      • text: string
      • votes: number
      • cacheEntryCreated: datetime
      • cacheEntryLastUpdated: datetime
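The structure above could be sketched as Python `TypedDict`s (a minimal sketch; field names are taken from the list, while `EntityResult` is a hypothetical helper name, and the `entity` reference is modeled as a string ID rather than a real document reference):

```python
from datetime import datetime
from typing import TypedDict

class EntityResult(TypedDict):
    entity: str   # reference to an entities document, modeled here as its name/ID
    score: float

class QueryDoc(TypedDict):
    name: str
    category: str
    googleResults: list[str]
    lastValidated: datetime
    results: list[EntityResult]

class EntityDoc(TypedDict):
    name: str
    validCategories: list[str]
    invalidCategories: list[str]
    description: str
    imageUrl: str

class CommentDoc(TypedDict):
    id: str
    url: str
    text: str
    votes: int
    cacheEntryCreated: datetime
    cacheEntryLastUpdated: datetime

class ThreadDoc(TypedDict):
    id: str
    subreddit: str
    url: str
    title: str
    selftext: str
    votes: int
    percentUpvoted: float
    cacheEntryCreated: datetime
    cacheEntryLastUpdated: datetime
    comments: list[CommentDoc]
```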

The validate function runs daily and revalidates any queries whose lastValidated is more than 1 month old. It deletes a query from the database if it is invalid (i.e., the top Reddit results off Google have changed).


SwiftWinds commented Feb 16, 2022

I added threads to the database structure. This is for later, when @123generic will help store large recommendation threads in our database for faster querying.


SwiftWinds commented Feb 23, 2022

Copying from the Notion:


Project Reddit (database cache)

part 1

async get_query_results(string query)

  • return None if the cache doesn’t have the query
  • otherwise return list of results

async get_entity(string name)

  • returns None if there is no entity
  • otherwise return entity in the form of a dict

async get_thread(string url)

  • return None if there is no thread
  • otherwise return thread in the form of a dict

async set_thread(dict thread)

async merge_entity(string name, optional string[] validCategories, optional string[] invalidCategories, optional string description, optional string imageUrl)

  • Merge this data with any existing entity in database
  • Or if it doesn’t exist, create it

async store_query_to_cache(string query, string category, string[] googleResults, (entity: string, score: double)[] results)

  • first param is query string
  • second param is category of query
  • third param is the list of reddit URLs off of google
  • fourth param is a list of tuples
    • first elem of each tuple is the string of the entity
      • take the entity string → convert to reference to the entity document
    • second elem is the double that represents the score of the entity
  • returns nothing

Future ideas:

  • we can store validCategories and invalidCategories as subcollections instead of lists if those lists become really big (to save on query time; we don’t want to unnecessarily query a large list)
    • check_if_entity_is_in_category(string entity_name, string category) would then manually query the validCategories and invalidCategories subcollections (those subcollections won’t be fetched by default, to save time)

part 2

Every day, check for expired queries (lastValidated more than 30 days ago) and, for each, check whether the Google results are different. If so, delete the query entry from the database; otherwise simply update lastValidated to the current time.
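That daily job could be sketched as follows (a minimal sketch: `fetch_google_results` is a hypothetical hook that re-runs the Google search for a query, and `queries` is an in-memory stand-in for the queries collection; the 30-day cutoff comes from the description above):

```python
from datetime import datetime, timedelta, timezone

EXPIRY = timedelta(days=30)

def revalidate_queries(queries: dict, fetch_google_results, now=None):
    # `queries` maps query string -> cached doc; `fetch_google_results(query)`
    # re-runs the Google search and returns the current list of Reddit URLs.
    now = now or datetime.now(timezone.utc)
    for name, doc in list(queries.items()):
        if now - doc["lastValidated"] <= EXPIRY:
            continue  # still fresh, nothing to do
        if fetch_google_results(name) != doc["googleResults"]:
            del queries[name]           # results drifted: invalidate the entry
        else:
            doc["lastValidated"] = now  # still accurate: just bump the timestamp
```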

Future ideas:

  • In the future, we might also store in the database when the Google result was posted and when it was last queried. If the thread was posted fewer than 180 days ago, it is not yet archived and could have been modified.
    • We would then parse those unarchived results and check whether any new comments were posted since the last check, or any comments were edited
      • if so, we rerun the queries on these results too
      • else, we don’t rerun
  • Also in the future, we can check whether a query is used very often. In that case, we don’t delete the query; we just re-run it.
