Fake News Classification for the WSDM 2019 Fake News Classification Challenge on Kaggle

Requirements

Python Version: Python 3.6+

Install Required Packages:

pip install -r requirements.txt

Input

Before running the program, place train.csv and test.csv in ./data
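
A quick sanity check before running anything, as a minimal sketch: the helper name `missing_inputs` is hypothetical and not part of the repository; only the `./data` directory and the two file names come from the instructions above.

```python
from pathlib import Path

# File names taken from the Input section; the helper itself is hypothetical.
REQUIRED_FILES = ("train.csv", "test.csv")


def missing_inputs(data_dir="./data"):
    """Return the required input files that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files in ./data:", ", ".join(missing))
    else:
        print("All input files found.")
```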

Execution

Execute the following Python scripts in numeric order:

|-- 1-Preprocessing:
|   |-- word segmentation (Chinese and English), word2vec embedding
|-- 2-K_Means:
|   |-- 1-calculate_sentence_vector.py: 		# sentence vector
|   |-- 2-k_means_on_sentence_vectors.py: 		# K-means scoring on sentence vector
|-- 3-Ensemble:
|   |-- 1-Calculate_Features:
|   |-- |-- 1-calc_sentence_embedding.py: 		# sentence embedding
|   |-- |-- 2-calc_english_noun_similarity.py: 		# calculate English noun similarity
|   |-- |-- 3-calc_sentiment_score.py: 			# calculate sentiment score
|   |-- |-- 4-string_matching.py: 			# calculate string matching ratio
|   |-- |-- 5-get_label.py: 				# transform category labels to integers
|   |-- |-- 6-combine.py: 				# combine all the features in a csv file
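
The sentence vector and sentence embedding steps listed above can be pictured with a minimal sketch. It assumes a sentence vector is the mean of its word vectors and that two titles are compared by cosine similarity; this README does not state the exact pooling, so that part is an assumption, and the toy `word_vectors` dictionary is a hypothetical stand-in for the trained word2vec model.

```python
import math


def sentence_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors standing in for a trained word2vec model (hypothetical values).
word_vectors = {"fake": [1.0, 0.0], "news": [0.0, 1.0], "rumor": [1.0, 1.0]}

title_a = sentence_vector(["fake", "news"], word_vectors, dim=2)
title_b = sentence_vector(["rumor"], word_vectors, dim=2)
score = cosine_similarity(title_a, title_b)
```

Tokens missing from the vocabulary are skipped here, so a title made only of unknown words maps to the zero vector and gets a similarity of 0.0 with everything.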

Run the chosen aggregation models

Tree based:
classifier.py: 			# predicts on the testing data and generates `submission.csv`
	used models:
		decision tree
		random forest
		xgboost
	needed files: 
		feature_train.csv: 	# training data feature
		feature_test.csv: 	# testing data feature
		label.npy: 		# training data label
	used features:
		overlap_ratio: 		# overlap ratio of two titles
		partial_ratio: 		# partial overlap ratio of two titles
		tokenset_ratio: 	# token set ratio of two titles
		rumor: 			# occurrence of the words '谣' (rumor), '官方' (official), '假' (fake), '真相' (truth)
		EnglishNounScore: 	# similarity of English noun
		EmbeddingScore: 	# sentence embedding
		kmeans: 		# K-means score
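
To make a few of the features above concrete, here is a minimal sketch using only the standard library. It is not the repository's implementation: `difflib.SequenceMatcher` stands in for the overlap ratio, a Jaccard overlap of token sets stands in for the token set ratio, and the function names are hypothetical. Only the keyword list is taken from this README.

```python
from difflib import SequenceMatcher

# Keywords listed for the `rumor` feature in this README.
RUMOR_KEYWORDS = ("谣", "官方", "假", "真相")


def overlap_ratio(title_a, title_b):
    """Similarity in [0, 1] based on matching character blocks (difflib)."""
    return SequenceMatcher(None, title_a, title_b).ratio()


def token_set_jaccard(tokens_a, tokens_b):
    """Jaccard overlap of the two token sets; 0.0 when both are empty."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def rumor_flag(title):
    """1 if the title contains any rumor-related keyword, else 0."""
    return int(any(keyword in title for keyword in RUMOR_KEYWORDS))
```

Each function returns a single number per title pair (or per title), so the results can be written as columns of a feature CSV alongside the other scores.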

RNN based:

used models: GRU-1, GRU-bi, GRU-multi, GRU-multi-2, GRU-w2v, LSTM-1, LSTM-bi, LSTM-w2v
needed files: 
	cut_training.csv: 		# segmented (tokenized) training data
	cut_testing.csv: 		# segmented (tokenized) testing data
	label.npy: 			# training data label
	word2vec_model

9-Tools:
  compareCorrect.py: 		# calculate K-means accuracy
  testing_embedding_model.py: 	# calculate cosine similarity accuracy
  visualize_english.py: 	# calculate English noun similarity accuracy
  Visualizer.py: 		# visualization

Submission

The generated submission.csv can be submitted to Kaggle.

Citations and References

@ONLINE{wsdm,
	URL = "www.kaggle.com/c/fake-news-pair-classification-challenge/overview",
	AUTHOR = "{WSDM}",
	TITLE = "{WSDM} - Fake News Classification",
	YEAR = "2019",
}
@ONLINE{jieba,
	URL = "github.com/fxsjy/jieba",
	AUTHOR = "fxsjy",
	TITLE = "Jieba Github Repository",
	YEAR = "2018",
}
@ONLINE{nltk,
	URL = "www.nltk.org/api/nltk.tokenize.html",
	AUTHOR = "Steven Bird, Edward Loper, Ewan Klein",
	TITLE = "Natural Language Toolkit",
	YEAR = "2018",
}

@ONLINE{gensim,
	URL = "radimrehurek.com/gensim/models/word2vec.html",
	AUTHOR = "Radim Řehůřek",
	TITLE = "Gensim Model",
	YEAR = "2018",
}

@ONLINE{percTag,
	URL = "www.nltk.org/_modules/nltk/tag/perceptron.html",
	AUTHOR = "Steven Bird, Edward Loper, Ewan Klein",
	TITLE = "{NLTK - Perceptron Tagger}",
	YEAR = "2018",
}

EricChen1248/NLP-Project-1-Fake-News-Classification-WSDM-2019
