feature request - code to get reviews in different languages #13

Open
MrRaghav opened this issue Jul 19, 2020 · 2 comments

Comments

@MrRaghav

Hello, I want to thank you for this code. I've used it for my university project.

As an additional feature, I've written a small Python script, "reviewsInLanguages.py", that collects the reviews written in languages other than English. If you find it okay, please add it to your code base. Run the script after the following command, which appends the scraper's output to a text file:

python3 amazon-reviews-scraper/amazon_comments_scraper.py -s "text to be searched" \
                                             &>> ../outputfiles/input.txt
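
For repeated or automated runs, the same command can be driven from Python. The following is only a minimal sketch: `run_scraper` and its `base_command` parameter are hypothetical names, not part of the scraper, and the script path is simply the one used in the command above. It appends both stdout and stderr to the output file, which is what `&>>` does in the shell.

```python
import subprocess
import sys


def run_scraper(search_text, output_path, base_command=None):
    """Run the scraper for one search term and append its output to a file."""
    if base_command is None:
        # Assumption: script path copied from the command above; adjust to your checkout.
        base_command = [sys.executable, "amazon-reviews-scraper/amazon_comments_scraper.py"]
    command = list(base_command) + ["-s", search_text]
    # Append mode plus stderr=STDOUT mirrors the shell's `&>>` redirection.
    with open(output_path, "ab") as output_file:
        completed = subprocess.run(command, stdout=output_file, stderr=subprocess.STDOUT)
    return completed.returncode
```

A call such as `run_scraper("text to be searched", "../outputfiles/input.txt")` would then correspond to the shell command above; a non-zero return value suggests the scraper exited with an error.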

import sys
import langid
import pandas as pd

# a code by Raghvendra Pratap Singh
# M.Sc. student, Dublin City University, Ireland, 2019-20
#
#usage:
#python3 reviewsInLanguages.py <inputfile> <two letter language> <output.csv>
#
#example:
#python3 reviewsInLanguages.py inputs_dir/God_Talks_with_Arjuna_01012017.txt hi outputs_dir/God_Talks_with_Arjuna.csv

if len(sys.argv) != 4:
    sys.exit("usage: python3 reviewsInLanguages.py <inputfile> <two letter language> <output.csv>")

input_path, language, output_path = sys.argv[1:4]

# Two-letter language codes that langid can detect
ListOfLanguages = ['af','am','an','ar','as','az','be','bg','bn','br','bs','ca','cs','cy','da','de','dz','el','en','eo','es','et','eu','fa','fi','fo','fr','ga','gl','gu','he','hi','hr','ht','hu','hy','id','is','it','ja','jv','ka','kk','km','kn','ko','ku','ky','la','lb','lo','lt','lv','mg','mk','ml','mn','mr','ms','mt','nb','ne','nl','nn','no','oc','or','pa','pl','ps','pt','qu','ro','ru','rw','se','si','sk','sl','sq','sr','sv','sw','ta','te','th','tl','tr','ug','uk','ur','vi','vo','wa','xh','zh','zu']

# Validate the language before doing any work, and stop on bad input
if len(language) != 2:
    sys.exit("Please enter the language as a two-letter code")
if language not in ListOfLanguages:
    sys.exit("Please check https://pypi.org/project/langid/1.1dev/ and if your input language is available there, add it to ListOfLanguages")

# Read the scraper output, one line at a time (assumes the file is UTF-8)
with open(input_path, 'r', encoding='utf-8') as input_file:
    lines = input_file.readlines()

# Keep only the lines that langid classifies as the requested language
reviews = []
for line in lines:
    detected_language, _score = langid.classify(line)
    if detected_language == language:
        # Strip the trailing newline before storing the review
        reviews.append(line.strip())

df = pd.DataFrame(reviews)
df.to_csv(output_path, encoding='utf-8')

Note: A better approach would be to run the scraper command through a scheduler in Linux (cron, for example).
It worked well for me and I collected 2900+ reviews.
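
The filtering step of the script can also be sketched as a small reusable function that does not depend on pandas. This is an illustration under stated assumptions rather than the script's actual structure: `filter_reviews` and `write_reviews` are hypothetical names, the `review` column header is made up for the example, and the classifier is passed in as a parameter so the logic can be tried without langid installed. Any callable returning a `(language_code, score)` tuple works, which is the shape `langid.classify` returns.

```python
import csv


def filter_reviews(lines, language, classify):
    """Return the stripped lines that `classify` labels as `language`.

    `classify` is any callable that takes a string and returns a
    (language_code, score) tuple.
    """
    matches = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # blank lines carry no review text
        detected_language, _score = classify(text)
        if detected_language == language:
            matches.append(text)
    return matches


def write_reviews(reviews, output_path):
    """Write one review per row to a UTF-8 CSV file with a header row."""
    with open(output_path, "w", encoding="utf-8", newline="") as output_file:
        writer = csv.writer(output_file)
        writer.writerow(["review"])  # hypothetical column name
        writer.writerows([review] for review in reviews)
```

With langid installed, `filter_reviews(lines, "hi", langid.classify)` should select the same lines as the script, apart from skipping blank ones.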

@wanghaisheng

Where is the collection code?

@philipperemy
Owner

Thanks! Don't forget to open a PR if you can!
