Persistently store scraped tweets #23

Open
laurin opened this issue Feb 27, 2022 · 14 comments
Labels
enhancement New feature or request
Milestone

Comments

@laurin
Contributor

laurin commented Feb 27, 2022

As discussed in #16, the current storage of scraped tweets is not optimal: newly scraped tweets are simply appended to the existing tweets.txt file, which creates a lot of duplicates.
Integrating a database is probably not necessary at this point. We could store the scraped tweets with their IDs in a JSON file and add only new ones on each run of the application.
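A minimal sketch of the ID-keyed JSON file idea, assuming each scraped tweet is a dict with an `id` field. The function names and the tweet shape are hypothetical and not taken from the repository:

```python
import json
from pathlib import Path


def load_tweets(path):
    """Return the stored tweets as a dict keyed by tweet ID (empty if no file yet)."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def merge_tweets(stored, scraped):
    """Add only tweets whose ID is not stored yet; return how many were added."""
    added = 0
    for tweet in scraped:
        tweet_id = str(tweet["id"])  # JSON object keys are always strings
        if tweet_id not in stored:
            stored[tweet_id] = tweet
            added += 1
    return added


def save_tweets(path, stored):
    """Write the whole store back to disk, replacing the previous file."""
    with Path(path).open("w", encoding="utf-8") as f:
        json.dump(stored, f, ensure_ascii=False, indent=2)
```

On each run the application would call `load_tweets`, then `merge_tweets` with the freshly scraped batch, then `save_tweets`. Keying by ID makes the duplicate check a dictionary lookup instead of a scan over the file.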

@laurin
Contributor Author

laurin commented Feb 27, 2022

We should also store the time each tweet was created, and either discard tweets after a certain time or allow the user to select a time range. The latter would probably require the map to be generated client-side.

@kinshukdua
Owner

I agree a JSON file is probably the best option. I don't think we should generate things client-side, because that might add unnecessary lag, especially in places where the internet might be very slow because of the current circumstances. I want to serve static HTML to keep load times as low as possible. Let's just keep the tweet discard time as a server-side parameter.
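A rough sketch of a server-side discard parameter, assuming the store is a dict keyed by tweet ID and each tweet carries a `created_at` ISO 8601 timestamp with a UTC offset. `MAX_TWEET_AGE`, `discard_old_tweets`, and the field name are hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical server-side parameter: how long a tweet stays on the map.
MAX_TWEET_AGE = timedelta(hours=24)


def discard_old_tweets(stored, max_age=MAX_TWEET_AGE, now=None):
    """Return a new dict containing only tweets created within max_age of now."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - max_age
    return {
        tweet_id: tweet
        for tweet_id, tweet in stored.items()
        # Assumes timestamps like "2022-02-27T10:00:00+00:00" (offset included).
        if datetime.fromisoformat(tweet["created_at"]) >= cutoff
    }
```

Running this before the static HTML is generated keeps the filtering on the server, so the page itself stays static.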

@Krishna-Sivakumar
Contributor

Krishna-Sivakumar commented Feb 27, 2022

We can consider SQLite here too, since it's simple and file-based. It sounds like we're performing some conditional manipulation, and this will help us cut down on time complexity.
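For comparison, a sketch of what the SQLite variant might look like using Python's built-in `sqlite3` module. The table layout and function names are assumptions for illustration, and the primary key on `id` is what rejects duplicates:

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the tweet store; the schema here is only an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        " id TEXT PRIMARY KEY,"
        " text TEXT NOT NULL,"
        " created_at TEXT NOT NULL)"
    )
    return conn


def add_tweets(conn, scraped):
    """Insert scraped tweets, skipping IDs already stored; return rows added."""
    before = conn.total_changes
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO tweets (id, text, created_at) VALUES (?, ?, ?)",
            [(str(t["id"]), t["text"], t["created_at"]) for t in scraped],
        )
    return conn.total_changes - before
```

With `created_at` stored in a consistent ISO 8601 format, discarding old tweets could become a single `DELETE ... WHERE created_at < ?` statement, which is the kind of conditional manipulation mentioned above.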

@Krishna-Sivakumar
Contributor

@DomiiBunn mentioned Firebase, which would work here.

@DomiiBunn
Collaborator

> @DomiiBunn mentioned firebase, which would work here.

It depends on the complexity you're looking for. Firebase is a nice balance between file storage (JSON files, SQLite, etc.) and standalone databases: it's almost as flexible as a standalone database, and it handles security, hosting, and high availability. At the usage we'd be expecting, it should be fully free, as long as DB reads are cached.

@kinshukdua
Owner

The reason I'm a little hesitant about Firebase is that it adds another step for developers looking to reproduce the repo and contribute. The simpler the project, the easier it is to contribute (as long as it doesn't impact performance or features).

@DomiiBunn
Collaborator

Use a config file and specify

useDatabaseCache: false

That way, a larger deployment can turn caching on where it's worth it, and a personal deployment still works fine without the added complexity.
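One way the flag could be read, sketched under the assumption that the config lives in a JSON file. The `config.json` filename, `load_config`, and `choose_backend` are hypothetical, and only the `useDatabaseCache` key comes from the suggestion above:

```python
import json
from pathlib import Path

# Defaults keep a personal deployment working with no config file at all.
DEFAULT_CONFIG = {"useDatabaseCache": False}


def load_config(path="config.json"):
    """Return the defaults, overridden by whatever the config file sets."""
    config = dict(DEFAULT_CONFIG)
    path = Path(path)
    if path.exists():
        with path.open(encoding="utf-8") as f:
            config.update(json.load(f))
    return config


def choose_backend(config):
    """Pick the storage backend name based on the flag."""
    return "database" if config["useDatabaseCache"] else "json-file"
```

Defaulting the flag to `false` means contributors can clone and run the project without setting up any external service.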

@DomiiBunn
Collaborator

DomiiBunn commented Feb 28, 2022

Or use Redis, but idk how painful it is to implement with Python.

And I think it would be a bit of an overkill.

@sahal-mulki
Contributor

I am working on a fix for duplicate tweets.

@DomiiBunn DomiiBunn added this to the Beta 0.2.0 milestone Feb 28, 2022
@DomiiBunn DomiiBunn added the enhancement New feature or request label Feb 28, 2022
@Krishna-Sivakumar
Contributor

Krishna-Sivakumar commented Mar 1, 2022

Let's just go with a JSON file.

@DomiiBunn
Collaborator

Sounds good to me

@sahal-mulki
Contributor

Nvm, I failed miserably at it.

@DomiiBunn
Collaborator

I'd love to help but Python ain't my cup of tea

@sahal-mulki
Contributor

Sure-a-mundo

5 participants