Code review notes #26

nialldcms · 2020-07-21T12:48:43Z

Correct data being collected

The only issue I've noticed is that, in many of the collected rows, the text fields for the subreddit submissions and the comments do not seem to contain the agreed search terms, e.g. '5g coronavirus'. Why is this? We should only be collecting data on the relevant search terms.

For example, this submission appears in the table. Is this just data pertaining to a test run of the code?
On the submissions data, I think we want to collect upvotes and downvotes instead of score. Score is the net of these (upvotes minus downvotes) but doesn't give you a sense of how controversial a post is, and these data separated out might provide better insight into how any disinformation is regarded on Reddit. They might also be useful if we want to do any bespoke ML training later.
Further to this point, we should be careful about how we use this data if we are collecting just "newest possible", and about how we handle duplicates, as this data changes as a function of time. Do we want to change any functionality to account for this?
Data in the comments table can be related back to the original submission by removing the comment ID from the URL string.
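
As a rough illustration of the last point, here is a minimal sketch of how the submission URL could be derived from a comment URL. It assumes the URL column holds a full comment permalink of the form /r/<subreddit>/comments/<submission_id>/<slug>/<comment_id>/; the function name is hypothetical and not taken from the repo.

```python
from urllib.parse import urlsplit, urlunsplit


def submission_url_from_comment_url(comment_url: str) -> str:
    """Drop the trailing comment ID from a comment permalink.

    Assumed (not confirmed) path layout:
    /r/<subreddit>/comments/<submission_id>/<slug>/<comment_id>/
    """
    parts = urlsplit(comment_url)
    # Split the path into non-empty segments, ignoring leading/trailing slashes.
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 6 or segments[0] != "r" or segments[2] != "comments":
        raise ValueError(f"Not a recognised comment permalink: {comment_url}")
    # Everything except the last segment (the comment ID) is the submission path.
    path = "/" + "/".join(segments[:-1]) + "/"
    # Query string and fragment are dropped so the result can be used as a join key.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    url = "https://www.reddit.com/r/test/comments/abc123/some_title/def456/"
    print(submission_url_from_comment_url(url))
```

If the comments table stores the URL in a different shape, the segment check will raise rather than silently return a wrong join key.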

Duplicates code

This looks to be functioning properly, but we might want to consider whether the old data should be updated with newer data, e.g. if a row with id X appears in the bq table already, should this old row in the bq table be replaced with the newly sourced data?
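
To make the "replace old with new" option concrete, here is a minimal in-memory sketch of that policy: one row per id, keeping whichever copy was collected most recently. The field names `id` and `collected_at` and the function name are assumptions for illustration only; the real table schema and the existing duplicates code may differ, and in practice this would more likely be done in the table itself than in Python.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def merge_keep_newest(existing: Iterable[Row], incoming: Iterable[Row]) -> List[Row]:
    """Return one row per id, preferring the most recently collected copy.

    Assumes every row has an 'id' and a 'collected_at' value, and that
    'collected_at' values compare correctly (e.g. ISO 8601 UTC strings
    in a single consistent format). Both field names are hypothetical.
    """
    newest: Dict[object, Row] = {}
    # Existing rows are visited first, so on a tie the incoming row wins.
    for row in list(existing) + list(incoming):
        key = row["id"]
        current = newest.get(key)
        if current is None or row["collected_at"] >= current["collected_at"]:
            newest[key] = row
    return list(newest.values())


if __name__ == "__main__":
    old = [{"id": "x", "score": 1, "collected_at": "2020-07-20T00:00:00Z"}]
    new = [
        {"id": "x", "score": 5, "collected_at": "2020-07-21T00:00:00Z"},
        {"id": "y", "score": 2, "collected_at": "2020-07-21T00:00:00Z"},
    ]
    for row in merge_keep_newest(old, new):
        print(row)
```

The alternative policy, keeping the first copy seen and discarding later ones, would freeze values such as score at the time of first collection, which matters given that this data changes over time.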

nialldcms assigned Kiki-Jiji and anthonye93 Jul 21, 2020
