This project is for the BigVis seminar held by National Sun Yat-sen University in 2021.
In this project, we analyzed a 600 million-article text network from the MakeUp board on PTT and developed an interactive front-end interface for effective data visualization with SparkR and Shiny.
- R, Shiny, SparkR
We use Hadoop to preprocess the large collection of documents on a cloud server. In this step, we exclude documents that are too short and calculate the importance of each word with TF-IDF.
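The preprocessing step above can be illustrated with a minimal sketch. It is written in plain Python for brevity (the project itself runs on Hadoop with R and SparkR) and is not the project's actual code. The function name `tfidf_scores`, the `min_tokens` cutoff, and the TF-IDF variant (relative term frequency times `log(N / df)`) are assumptions for illustration; the input is assumed to be already segmented into words.

```python
import math
from collections import Counter


def tfidf_scores(docs, min_tokens=3):
    """Drop short documents, then score each remaining word with TF-IDF.

    docs: list of token lists (word segmentation is assumed to be done already).
    min_tokens: hypothetical cutoff for "too short"; the project's value is not stated.
    """
    kept = [d for d in docs if len(d) >= min_tokens]
    n_docs = len(kept)

    # Document frequency: in how many kept documents each word appears.
    doc_freq = Counter()
    for tokens in kept:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in kept:
        counts = Counter(tokens)
        # Assumed variant: relative term frequency times log(N / df).
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return kept, scores
```

Under this variant, a word that appears in every kept document gets a score of 0, so very common words drop out of the high-scoring set.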
In this step, once we have the importance of each word, we:
- Choose words with high TF-IDF scores and sort them into six categories; the categories are explained below.
- Build word-sentence and word-document matrices to create the brand-centric network. The nodes are words, and the edges are of two types: how often the word pair co-occurs, or the correlation between the two words.
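As a rough illustration of the two edge types, the sketch below computes a co-occurrence count and a correlation for every pair of selected words. It is plain Python rather than the project's R code, and it makes assumptions the text does not confirm: co-occurrence is counted as the number of units (sentences or documents) containing both words, and correlation is the Pearson correlation of the two words' binary presence vectors. The function name `build_edges` is hypothetical.

```python
import math
from itertools import combinations


def build_edges(units, vocab):
    """Compute both edge types for every pair of words in vocab.

    units: list of token lists, one per sentence (or per document).
    vocab: the words kept as nodes.
    """
    unit_sets = [set(u) for u in units]
    n = len(unit_sets)
    if n == 0:
        return {}

    # For each word, the indices of the units that contain it
    # (a sparse view of the binary word-by-unit matrix).
    occurs = {w: {i for i, u in enumerate(unit_sets) if w in u} for w in set(vocab)}

    edges = {}
    for a, b in combinations(sorted(occurs), 2):
        both = len(occurs[a] & occurs[b])
        p_a, p_b = len(occurs[a]) / n, len(occurs[b]) / n
        # Pearson correlation of two binary indicators (assumed definition).
        cov = both / n - p_a * p_b
        denom = math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
        edges[(a, b)] = {
            "cooccurrence": both,
            "correlation": cov / denom if denom > 0 else 0.0,
        }
    return edges
```

Passing sentences as `units` gives sentence-level edges, and passing whole documents gives document-level edges, mirroring the two matrices mentioned above.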
- Choose the brand in the left sidebar.
- Choose the number of nodes displayed on the web page (4 to 32 nodes).
- Choose the relation type of the edges (correlation or co-occurrence).
- Choose the threshold for the word relation; only word pairs whose relation is higher than the threshold are displayed on the page.
- Click the edge of a word pair you are interested in, and the sentences containing that word pair are displayed in the bottom-right sidebar.
- The sentences are sorted by post date; clicking a sentence shows the full article in the bottom-left sidebar.
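The filtering behind these controls can be sketched as follows. This is an illustrative Python sketch, not the app's Shiny code. It assumes that the displayed network is the chosen brand plus its strongest neighbors, that edges are stored as a dictionary keyed by word pair, and that sentences carry an ISO-formatted `date` and are listed oldest first; the names `select_subgraph` and `sentences_for_pair` are hypothetical.

```python
def select_subgraph(edges, brand, relation, threshold, max_nodes):
    """Return the brand's edges above the threshold, strongest first.

    edges: {(word_a, word_b): {"cooccurrence": int, "correlation": float}}
    relation: "cooccurrence" or "correlation".
    max_nodes: total nodes to show, counting the brand itself.
    """
    candidates = [
        (pair, weights[relation])
        for pair, weights in edges.items()
        if brand in pair and weights[relation] > threshold  # strictly higher
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    # The brand takes one node, so keep at most max_nodes - 1 neighbors.
    return candidates[: max(max_nodes - 1, 0)]


def sentences_for_pair(sentences, word_a, word_b):
    """Return the sentences containing both words, sorted by post date.

    sentences: list of dicts with "date" (ISO string), "tokens", and "text".
    """
    hits = [s for s in sentences if word_a in s["tokens"] and word_b in s["tokens"]]
    return sorted(hits, key=lambda s: s["date"])
```

Raising the threshold or lowering the node count both shrink the result, which matches how the two controls thin out the displayed network.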
- brand: Name of the makeup brand, like m.a.c, benifit, dior
- feature: Features of the product, like 持久力 (staying power), 廣感, 自然光 (natural glow)
- product: Name of the makeup product, like 染眉膏 (brow mascara), 眉筆 (eyebrow pencil), 口紅 (lipstick)
- condition: Effect observed when trying the product, like 致痘 (causes acne), 偏乾 (on the dry side), 卡粉 (caking)
- problem: The user's concerns when applying makeup, like 黑眼圈 (dark circles), 乾肌 (dry skin), 油肌 (oily skin)
- emotion: Emotion of the user, like 心動 (tempted), 必買 (must-buy), 燒到 (talked into buying)
The article and the sentences containing the word pair are displayed in this format.