wrangle_report.txt
This project explored several datasets relating to the Twitter account WeRateDogs. The main dataset covered 2356 tweets and included the text of each tweet, a timestamp, the rating values given within the tweet text, and columns for the various stages a dog could be labelled as. The second dataset contained the output of a neural network that made three predictions about the breed of dog in each picture provided to it, along with its confidence in each prediction. The final dataset held further information about the individual tweets, namely the retweet and favorite counts. Gathering this third dataset required the Twitter API, which involved applying for API access and receiving several keys and tokens.
During assessing and cleaning, it was found that the third dataset could be joined into the first without sacrificing quality standards, so this join was one of the first issues addressed. Another issue was the removal of columns in the first dataset that were clearly not going to be used in the analysis. For tidiness, the doggo, floofer, pupper, and puppo columns in the first dataset were combined into a single stage column. Several columns, such as stage, name, and rating_numerator, contained erroneous data; depending on the situation, these incorrect values were corrected, set to empty cells, or removed by deleting the row so as not to throw off later calculations. In addition, many columns had the wrong data type: timestamp was not in datetime format, tweet_id made more sense as a string than an int because it is an identifier rather than a numerical value of use, and retweet_count and favorite_count needed to be ints rather than floats because they represent whole values. After this wrangling, the datasets could be used with greater ease and clarity during the analysis.
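
The cleaning steps described above (joining the count data, combining the stage columns, and fixing data types) might look roughly like the following pandas sketch. It is an illustration only, not the project's actual code: the two small DataFrames are made-up stand-ins for the real data, the variable names archive, api, and merged are hypothetical, and it assumes that a missing stage is stored as the string "None" and that the datasets share a tweet_id column.

```python
import pandas as pd

# Made-up stand-ins for the main archive and the API data (not the real data).
archive = pd.DataFrame({
    "tweet_id": [1001, 1002, 1003],
    "timestamp": ["2017-08-01 16:23:56 +0000",
                  "2017-07-31 00:18:03 +0000",
                  "2017-07-30 15:58:51 +0000"],
    "rating_numerator": [13, 12, 11],
    # Assumption: a missing stage is stored as the string "None".
    "doggo": ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
api = pd.DataFrame({
    "tweet_id": [1001, 1002, 1003],
    "retweet_count": [8853.0, 6514.0, 4328.0],
    "favorite_count": [39467.0, 33819.0, 25461.0],
})

# 1. Join the retweet and favorite counts into the main table.
#    Assumption: an inner join, so tweets without counts are dropped.
merged = archive.merge(api, on="tweet_id", how="inner")

# 2. Combine the four stage columns into a single stage column.
stage_cols = ["doggo", "floofer", "pupper", "puppo"]

def combine_stages(row):
    """Join every stage present in the row; return None if there is none."""
    found = [value for value in row if value != "None"]
    return ",".join(found) if found else None

merged["stage"] = merged[stage_cols].apply(combine_stages, axis=1)
merged = merged.drop(columns=stage_cols)

# 3. Fix the data types.
merged["timestamp"] = pd.to_datetime(merged["timestamp"], utc=True)
merged["tweet_id"] = merged["tweet_id"].astype(str)  # identifier, not a quantity
merged["retweet_count"] = merged["retweet_count"].astype(int)
merged["favorite_count"] = merged["favorite_count"].astype(int)

print(merged.dtypes)
```

The cast to int works here only because the inner join leaves no missing counts; if a left join were used instead, rows with missing counts would need to be dropped first, or a nullable integer type such as "Int64" used.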