Correct data being collected
The only issue I've noticed is that, in many of the collected rows, the text fields for the subreddit submissions and the comments do not seem to contain the agreed search terms, e.g. '5g coronavirus'. Why is this? We should only be collecting data on the relevant search terms.
For example, this submission appears in the table. Is this just data pertaining to a test run of the code?
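To get a feel for how widespread this is, something like the following could flag rows whose text does not contain any agreed search term. This is only a rough sketch: the `rows` list, the `id` and `text` field names, and the `SEARCH_TERMS` list are made up for illustration and would need to match the real table schema and the agreed term list.

```python
import re

# Assumption: only the example term quoted in this issue; the real list is longer.
SEARCH_TERMS = ["5g coronavirus"]


def matches_search_terms(text, terms=SEARCH_TERMS):
    """Return True if every word of at least one search term appears in the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return any(all(w in words for w in term.lower().split()) for term in terms)


# Hypothetical rows standing in for the submissions/comments tables.
rows = [
    {"id": "a1", "text": "Is 5G linked to coronavirus? Discussion thread"},
    {"id": "b2", "text": "Best pizza places in town"},
]

# IDs of rows that match none of the search terms.
off_topic = [r["id"] for r in rows if not matches_search_terms(r["text"])]
print(off_topic)
```

This matches on whole words in any order, so it is stricter than a substring check on single words but looser than an exact phrase match. Which of those we want depends on how the search terms were meant to be applied.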
On the submissions data, I think we want to collect upvotes and downvotes instead of score. Score is the net of these (upvotes minus downvotes), so it doesn't give you a sense of how controversial a post is. Having these data separated out might provide better insight into how any disinformation is regarded on Reddit, and might be useful if we want to do any bespoke ML training later.
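One caveat: as far as I know, Reddit's API does not expose reliable raw downvote counts for submissions, but it does return `score` and `upvote_ratio`. If that holds for our collector (an assumption, I have not checked which fields we currently pull), the separate counts can be approximated from those two values. A minimal sketch, where `estimate_votes` is a hypothetical helper name:

```python
def estimate_votes(score, upvote_ratio):
    """Approximate (upvotes, downvotes) from score and upvote ratio.

    Uses score = ups - downs and upvote_ratio = ups / (ups + downs).
    Returns None when the ratio is 0.5, where the total cannot be recovered.
    """
    denominator = 2 * upvote_ratio - 1
    if abs(denominator) < 1e-9:
        return None
    total_votes = score / denominator
    ups = round(upvote_ratio * total_votes)
    downs = ups - score
    return ups, downs


print(estimate_votes(60, 0.8))
```

Since Reddit fuzzes vote counts and rounds the ratio, these numbers are estimates only. Storing `score` and `upvote_ratio` as collected and deriving the rest later is probably the safer option.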
Further to this point, we should be careful about how we use this data if we are collecting just the "newest possible" values, and about how we handle duplicates, as these values change over time. Do we want to change any functionality to account for this?
Data in the comments table can be related back to the original submission by removing the comment ID from the URL string.
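A small sketch of that join key, assuming comment URLs follow the usual Reddit permalink layout `.../comments/<submission_id>/<slug>/<comment_id>/`. The function names and the example URL with its IDs are placeholders, not real data from our table.

```python
def submission_url_from_comment_url(comment_url):
    """Drop the trailing comment ID segment from a comment permalink."""
    parts = comment_url.rstrip("/").split("/")
    return "/".join(parts[:-1]) + "/"


def submission_id_from_url(url):
    """Return the path segment that follows 'comments', i.e. the submission ID."""
    parts = url.rstrip("/").split("/")
    return parts[parts.index("comments") + 1]


# Placeholder URL for illustration only.
comment_url = "https://www.reddit.com/r/example/comments/abc123/some_title/def456/"
print(submission_url_from_comment_url(comment_url))
print(submission_id_from_url(comment_url))
```

Joining on the submission ID rather than the full URL string may be more robust, since it does not depend on the slug or a trailing slash being identical in both tables.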
Duplicates code
This looks to be functioning properly, but we might want to consider whether old data should be updated with newer data. For example, if a row with id X already appears in the bq table, should this old row be replaced with the newly sourced data?
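If we do decide to replace old rows, the logic could look roughly like the sketch below, which keeps only the most recently retrieved row per id. The `retrieved_at` field is an assumption (I don't know whether we store a collection timestamp), and `merge_keep_newest` is a hypothetical helper, not something in the current code.

```python
def merge_keep_newest(existing_rows, new_rows, key="id", ts="retrieved_at"):
    """Combine two row lists, keeping the row with the latest timestamp per key."""
    latest = {}
    for row in list(existing_rows) + list(new_rows):
        current = latest.get(row[key])
        # On equal timestamps the later row in the sequence wins, so new data is preferred.
        if current is None or row[ts] >= current[ts]:
            latest[row[key]] = row
    return list(latest.values())


# Hypothetical rows; ISO 8601 timestamps in one format compare correctly as strings.
existing = [{"id": "X", "score": 5, "retrieved_at": "2020-05-01T10:00:00"}]
incoming = [
    {"id": "X", "score": 42, "retrieved_at": "2020-05-02T10:00:00"},
    {"id": "Y", "score": 1, "retrieved_at": "2020-05-02T10:00:00"},
]
merged = merge_keep_newest(existing, incoming)
print(merged)
```

In BigQuery itself the same effect could be achieved with a `MERGE` statement keyed on id. The alternative is to keep every snapshot and append rather than replace, which would let us track how scores change over time at the cost of a larger table.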