Duplicates tweets (by tweet_id) & Integrity. #31

Open
EvanCarroll opened this issue Aug 27, 2018 · 0 comments

"Love the postgres loader, but we're going to try to keep this repo clean and simple, so I won't merge it into master. Glad to see some cool forks with great new features and tools for analysis though. Thank you for your work."

But this is the problem: your repo isn't clean. It's massive, and it's very hard to account for integrity errors. Take duplicate tweets, for instance. I cleaned them all up, and #29 has the commit fb59797.

GitHub won't render it, but try:

git diff fb5979762dca592109f919e4c805d0fb985aa9a9 fb5979762dca592109f919e4c805d0fb985aa9a9^

Of course, it's easy to tell you where the integrity violations are when you're self-hosting and have a simple system to ensure integrity. Now you still have to remove the duplicates again, and after you do that you'll have to push up a totally new copy of the data files.

If you need examples, look for these tweet IDs:

psql:load.psql:11: ERROR:  duplicate key value violates unique constraint "tweets_pkey"
DETAIL:  Key (tweet_id)=(612233279027064832) already exists.
CONTEXT:  COPY tweets, line 99245
COPY 233540
psql:load.psql:13: ERROR:  duplicate key value violates unique constraint "tweets_pkey"
DETAIL:  Key (tweet_id)=(626060785005953025) already exists.
CONTEXT:  COPY tweets, line 20505
psql:load.psql:14: ERROR:  duplicate key value violates unique constraint "tweets_pkey"
DETAIL:  Key (tweet_id)=(669263743784583168) already exists.
CONTEXT:  COPY tweets, line 200111
psql:load.psql:15: ERROR:  duplicate key value violates unique constraint "tweets_pkey"
DETAIL:  Key (tweet_id)=(614858663417802752) already exists.
CONTEXT:  COPY tweets, line 103929
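
Duplicates like the ones reported above can also be located before loading, straight from the data files. The following is only a minimal sketch: it assumes the data files are CSV with a header row containing a `tweet_id` column, and the function name `find_duplicate_ids` and the in-memory sample files are hypothetical, not part of the repo.

```python
import csv
import io
from collections import defaultdict


def find_duplicate_ids(sources, id_column="tweet_id"):
    """Return {tweet_id: [(source_name, line_number), ...]} for ids seen more than once.

    `sources` is an iterable of (name, text_file) pairs. Each file is assumed
    to be CSV with a header row that includes `id_column`.
    """
    seen = defaultdict(list)
    for name, handle in sources:
        reader = csv.DictReader(handle)
        for row in reader:
            # reader.line_num is the physical line of the row just read,
            # counting the header as line 1.
            seen[row[id_column]].append((name, reader.line_num))
    return {tid: locs for tid, locs in seen.items() if len(locs) > 1}


if __name__ == "__main__":
    # Tiny in-memory stand-ins for two data files (hypothetical content).
    file_a = io.StringIO("tweet_id,content\n612233279027064832,first\n1,other\n")
    file_b = io.StringIO("tweet_id,content\n612233279027064832,again\n2,another\n")
    dupes = find_duplicate_ids([("a.csv", file_a), ("b.csv", file_b)])
    for tweet_id, locations in dupes.items():
        print(tweet_id, locations)
```

For real files, pass open file handles (opened with `newline=""`, as the `csv` module recommends) instead of `io.StringIO` objects. Holding every ID in memory is a deliberate simplification; it is fine for a few million rows but is not the only way to do this.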

The schema that ensures this never happens is tucked away in a little folder for anyone who wants to use it, and it is 40k.
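
To illustrate how a primary key on `tweet_id` rejects duplicates at load time, here is a small sketch. It is not the schema from the fork: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, the table name `tweets` and key column `tweet_id` are taken from the error log above, and the `content` column and the helper names `make_db` and `load_rows` are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory SQLite database with a primary key on tweet_id."""
    conn = sqlite3.connect(":memory:")
    # `content` is a hypothetical column; only tweet_id matters here.
    conn.execute("CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, content TEXT)")
    return conn


def load_rows(conn, rows):
    """Insert (tweet_id, content) rows; return the tweet_ids rejected as duplicates."""
    rejected = []
    for tweet_id, content in rows:
        try:
            with conn:  # commits on success, rolls back if the insert fails
                conn.execute(
                    "INSERT INTO tweets (tweet_id, content) VALUES (?, ?)",
                    (tweet_id, content),
                )
        except sqlite3.IntegrityError:
            # The primary key constraint refused a repeated tweet_id.
            rejected.append(tweet_id)
    return rejected


if __name__ == "__main__":
    conn = make_db()
    sample = [
        (612233279027064832, "first copy"),
        (626060785005953025, "another tweet"),
        (612233279027064832, "second copy"),
    ]
    print("rejected:", load_rows(conn, sample))
```

Unlike this row-by-row sketch, the `COPY` commands in the log above each failed as a whole on the first duplicate, which is why one error is reported per file.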

EvanCarroll changed the title from "Duplicates & Integrity." to "Duplicates tweets (by tweet_id) & Integrity." on Aug 28, 2018