This is the project for the paper "Weakly Supervised Aspect-Based Sentiment Analysis with Tensor Graph Convolutional Network".
We propose a new framework called ASSA-TG, which improves the generation process of aspect-specific sentiment seeds. The original method considers only sequential relations between words; in our approach, TensorGCN is used to extract dependency-relation and semantic-similarity information to improve the quality of the generated keywords.
## About this paper

### Reference papers
- An Integration of TextGCN and Autoencoder into Aspect-based Sentiment Analysis (Tsai, 2022)
- Tensor Graph Convolutional Networks for Text Classification (Liu et al., 2020)
- Graph Convolutional Networks for Text Classification (Yao et al., 2019)
- Multiple Instance Learning Networks for Fine-Grained Sentiment Analysis (Angelidis and Lapata, 2018)
### Reference repos

## Usage

### 1. Install the dependencies

```
pip install -r requirements.txt
```
### 2. Download the data

For training data, we use restaurant reviews from Yelp and laptop reviews from Amazon; for testing data, we use the SemEval-2016 dataset. The data can be downloaded from the links below:

- Training Data
- Testing Data
  - SemEval-2016

Place the testing data in the folder `data/test`.
### 3. Preprocess the data

Use `data/train/preprocess_data.ipynb` to preprocess the training & testing data (lowercasing, replacing URL strings, etc.) and to create the word2vec model for the ASSA model. The generated files include:

- `{dataset}_sent_5w.csv`: preprocessed training text
- `{dataset}_sent_5w_wv.model`: word2vec model
- `test/test_{dataset}_sent.csv`: preprocessed testing text
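As a rough illustration, the text cleaning described above (lowercasing and replacing URL strings) can be sketched as follows. The `<url>` placeholder token and the regular expression are assumptions for illustration, not the notebook's actual implementation:

```python
import re

# Hypothetical URL pattern; the actual notebook may use a different one.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def preprocess(text: str) -> str:
    """Lowercase the review and replace URL strings with a placeholder token."""
    text = text.lower()
    text = URL_RE.sub("<url>", text)   # assumed placeholder token
    return " ".join(text.split())      # normalize whitespace

print(preprocess("Great FOOD! See https://example.com/menu now"))
# → great food! see <url> now
```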
The commands below convert the data into the format required for ASSA model training; the results are saved in the folder `data/preprocessed/{dataset}`, including:

- `{dataset}_sent_test.hdf5`: test data in `.hdf5` format
- `{dataset}_sent_train.hdf5`: training data in `.hdf5` format
- `{dataset}_sent_word_counts.txt`: word frequencies of the training data
- `{dataset}_sent_word_mapping.txt`: word-to-ID mapping of the training data
```
cd data
# Training data
python prep_hdf5_train.py --dataset="{dataset}"
# Testing data
python prep_hdf5_test.py --dataset="{dataset}"
# The dataset variable can be `YELP` or `AMAZON`
```
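Conceptually, the word-count and word-mapping files are just a frequency table and an integer vocabulary built from the training text. A minimal sketch (the ID scheme shown here is an assumption; the actual scripts may assign IDs differently):

```python
from collections import Counter

def build_vocab(sentences):
    """Count word frequencies and assign integer IDs, mirroring the
    *_word_counts.txt and *_word_mapping.txt outputs."""
    counts = Counter(w for s in sentences for w in s.split())
    # Assumed scheme: most frequent word gets the smallest ID, starting at 1.
    mapping = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}
    # Encode each sentence as a sequence of word IDs (what the .hdf5 files store).
    ids = [[mapping[w] for w in s.split()] for s in sentences]
    return counts, mapping, ids

counts, mapping, ids = build_vocab(["the pasta was great", "the laptop was slow"])
print(counts["the"], mapping["the"])  # → 2 1
```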
### 4. Train the ASSA model

Train the ASSA model; the results are saved in `model_result`, including the performance of each iteration and the predicted results for the text reviews. For the first round of training, we use the general sentiment seeds in the folder `{dataset}_sent_baseline`.
```
cd ..
python ./model/MATE.py \
    --mver="{model_version}" \
    --sver="{seed_version}" \
    --JASA_seed_num=10 \
    --dataset="{dataset}" \
    --round=1 \
    --epochs=5 \
    --sseed="baseline"
```
Description of variables:

- `mver`: model version
- `sver`: sentiment seed version
- `JASA_seed_num`: number of seeds
- `dataset`: name of the dataset (`YELP`/`AMAZON`)
- `round`: number of iterations
- `epochs`: number of epochs in each iteration
- `sseed`: whether to use the baseline sentiment seeds (`baseline`/`other`)
### 5. Build the GCN graphs and generate aspect-specific sentiment seeds

Build the graphs and generate the aspect-specific sentiment seeds of TensorGCN and TextGCN; the generated seeds are saved in the folder `model_result/{dataset_model_version}`.
- TensorGCN

  The generated graph structure is saved in the folder `TGCN/data_tgcn`.

  ```
  python TGCN.py \
      --mver="{model_version}" \
      --dataset="{dataset}" \
      --round=1 \
      --graph="original" \
      --train_seed_num=5 \
      --thres=0.3 \
      --seed_type="GCNonly"
  ```
- TextGCN

  The generated graph structure is saved in the folder `TextGCN/data_textgcn`.

  ```
  python textGCN.py \
      --mver="{model_version}" \
      --dataset="{dataset}" \
      --round=1 \
      --graph="original" \
      --train_seed_num=5 \
      --seed_type="GCNonly"
  ```
Description of variables:

- `mver`: model version
- `dataset`: name of the dataset (`YELP`/`AMAZON`)
- `round`: iteration round of the seed generation
- `graph`: the type of edges used to construct the graph (`original`, `DP`, or `DP+`)
- `train_seed_num`: number of general sentiment seeds used for training
- `thres`: threshold on word similarity for the semantic graph
- `seed_type`: whether to add the general sentiment seeds to the final generated sentiment seeds (`GCNonly`/`add`)
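For intuition, the `thres` cutoff works like this: the semantic graph connects two words only when the similarity of their word embeddings exceeds the threshold. A minimal sketch with toy 2-d vectors (the cosine-similarity measure and the vectors are illustrative assumptions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_edges(vectors, thres=0.3):
    """Connect every word pair whose embedding similarity exceeds `thres`."""
    words = list(vectors)
    edges = []
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            if cosine(vectors[w1], vectors[w2]) > thres:
                edges.append((w1, w2))
    return edges

vecs = {"tasty": [0.9, 0.1], "delicious": [0.8, 0.2], "slow": [-0.7, 0.6]}
print(semantic_edges(vecs, thres=0.3))  # → [('tasty', 'delicious')]
```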
#### Type of edges in the graph

In the original graph, we only add edges for term pairs with positive PMI. For further experiments, we also exclude more edges to test whether a simpler graph can generate better seeds: based on the paper of Qiu (2016), we keep only word pairs with specific dependency types. However, the results show that the seeds from the original graph improve the ASSA model the most. The edges between word pairs must satisfy the constraints below:

- `original`: word pairs with positive PMI
- `DP`: word pairs with positive PMI and one of the following dependency types: `amod`, `case`, `nsubj`, `csubj`, `dobj`, `iobj`, `conj`
- `DP+`: word pairs with positive PMI and one of the following dependency types: `amod`, `case`, `nsubj`, `csubj`, `dobj`, `iobj`, `conj`, `advmod`, `dep`, `cop`, `mark`, `nsubjpass`, `nmod`, `xcomp`, `csubjpass`, `poss`
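To make the positive-PMI rule concrete, here is a sketch of edge construction over sliding windows; the window size and the exact PMI estimate are assumptions, and the repository's scripts may differ:

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(sentences, window=3):
    """Keep only word pairs with positive PMI, estimated from
    co-occurrence counts in sliding windows over the corpus."""
    word_cnt, pair_cnt, n_windows = Counter(), Counter(), 0
    for sent in sentences:
        toks = sent.split()
        for i in range(max(1, len(toks) - window + 1)):
            win = set(toks[i:i + window])
            n_windows += 1
            word_cnt.update(win)
            pair_cnt.update(frozenset(p) for p in combinations(sorted(win), 2))
    edges = {}
    for pair, c in pair_cnt.items():
        a, b = tuple(pair)
        # PMI = log( p(a, b) / (p(a) * p(b)) ), with window-based estimates
        pmi = math.log(c * n_windows / (word_cnt[a] * word_cnt[b]))
        if pmi > 0:  # the `original` graph keeps only positive-PMI edges
            edges[pair] = pmi
    return edges

edges = pmi_edges(["good food good service", "bad food"])
print(sorted(tuple(sorted(p)) for p in edges))  # → [('good', 'service')]
```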
### 6. Update the seeds and retrain

- Move the generated seeds from the last step to the folder `seed/sen/{dataset}/{seed_version}`.
- Repeat step 4 to train the ASSA model again, but change the `sver` variable to the name of the new seed version.