billpwchan/DeepTrust API Reference Documentation

DeepTrust Description

Different from existing works, this project proposes a reliable information extraction framework named DeepTrust. DeepTrust enables financial data providers to precisely locate correlated information on Twitter when a financial anomaly occurs, and to apply information retrieval and validation techniques that preserve only reliable knowledge carrying a high degree of trust. The prime novelty of DeepTrust is its integration of a series of state-of-the-art NLP techniques for retrieving information from a noisy Twitter data stream and assessing information reliability from various aspects, including argumentation structure, evidence validity, traces of neural-generated text, and text subjectivity.

DeepTrust comprises three interconnected modules:

  • Anomaly Detection module
  • Information Retrieval module
  • Reliability Assessment module

All modules run in sequential order within the DeepTrust framework and jointly contribute to an overall high level of precision in retrieving information from Twitter, building a collection of trusted knowledge that explains financial anomalies. Solution effectiveness is evaluated both module-wise and framework-wise to empirically establish the practicality of the DeepTrust framework in fulfilling its objective.

How to Install

Open Anaconda Prompt on your computer and run the following command to create the environment:

conda env create -f environment.yml

To export the current environment, use the following command:

conda env export > environment.yml

To update the current environment with the latest dependencies, use the following command:

conda env update --name DeepTrust --file environment.yml --prune

Prerequisites

  1. Refinitiv Eikon: https://eikon.refinitiv.com/index.html
  2. Twitter Developer V2 Access: https://developer.twitter.com/en/portal/dashboard
  3. Microsoft Visual C++ 14.0 or greater: Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/

Configuration File Format

Please create a file config.ini in the root folder before executing any of the following commands.

[Eikon.Config]
ek_api_key = <Refinitiv Eikon>
open_permid = <Refinitiv Eikon>

[Twitter.Config]
consumer_key = <Twitter API V2>
consumer_secret = <Twitter API V2>
bearer_token = <Twitter API V2>
access_token_key = <Twitter API V2>
access_token_secret = <Twitter API V2>

[MongoDB.Config]
database = <MongoDB Atlas>
username = <MongoDB Atlas>
password = <MongoDB Atlas>

[RA.Feature.Config]
min_tweet_retweet = 0
min_tweet_reply = 0
min_tweet_like = 0
min_tweet_quote = 0
max_tweet_tags = 15
min_author_followers = 0
min_author_following = 0
min_author_tweet = 0
min_author_listed = 0
max_profanity_prob = 0.2

[RA.Neural.Config]
roberta_threshold = 0.7
classifier_threshold = 0.9
gpt2_weight = 0.54
bert_weight = 0.46
neural_mode = precision

[RA.Subj.Config]
textblob_threshold = 0.5
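
For reference, a config.ini in this format can be read with Python's standard configparser module (a minimal sketch; DeepTrust's own loading code may differ):

import configparser

# Load the root-level config.ini described above.
config = configparser.ConfigParser()
config.read("config.ini")

# Credentials are plain strings.
ek_api_key = config["Eikon.Config"]["ek_api_key"]
bearer_token = config["Twitter.Config"]["bearer_token"]

# Reliability-assessment thresholds need explicit type conversion.
min_retweets = config.getint("RA.Feature.Config", "min_tweet_retweet")
max_profanity = config.getfloat("RA.Feature.Config", "max_profanity_prob")
roberta_threshold = config.getfloat("RA.Neural.Config", "roberta_threshold")

print(min_retweets, max_profanity, roberta_threshold)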

Command-line Interface Usage

Anomaly Detection Module Examples

Retrieve a list of anomalies in TWTR (Twitter Inc.) pricing data between 04/01/2021 and 20/05/2021 using the ARIMA-based detection method.

python main.py -m AD -t TWTR -sd 04/01/2021 -ed 20/05/2021 --ad_method arima

The date format for both the -sd and -ed parameters follows the UK convention (DD/MM/YYYY).

Available --ad_method values are ['arima', 'lof', 'if'], which stand for AUTO-ARIMA, Local Outlier Factor, and Isolation Forest respectively.
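
As a rough illustration of the latter two methods, anomalies can be flagged from daily returns with scikit-learn (a minimal sketch under assumed parameters; the repository's actual detection logic and features may differ):

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def flag_anomalous_days(close_prices, method="if"):
    """Return a boolean mask marking anomalous trading days based on daily returns."""
    returns = np.diff(close_prices) / close_prices[:-1]
    features = returns.reshape(-1, 1)
    if method == "if":
        labels = IsolationForest(contamination=0.05, random_state=42).fit_predict(features)
    elif method == "lof":
        labels = LocalOutlierFactor(n_neighbors=5, contamination=0.05).fit_predict(features)
    else:
        raise ValueError("Only 'if' and 'lof' are sketched here.")
    return labels == -1  # scikit-learn labels outliers as -1

# Example: a synthetic price series with one abrupt drop.
prices = np.array([100, 101, 102, 101, 70, 71, 72, 73, 74, 75,
                   76, 77, 78, 79, 80, 81, 82, 83, 84, 85], dtype=float)
print(flag_anomalous_days(prices, method="if"))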

Information Retrieval Module Examples

General Tweet Retrieval

Collect correlated tweets from the Twitter data stream for TWTR (Twitter Inc.) regarding a financial anomaly detected on 30 April 2021. Data are uploaded to the MongoDB database specified in the config.ini file.

python main.py -m IR -t TWTR -ad 30/04/2021 -irt tweet-search

Tweet Updates (Geo-Data + Tweet Sensitivity)

For outdated tweets missing the possible_sensitive and geo fields, update those tweets in the MongoDB database.

python main.py -m IR -t TWTR -ad 30/04/2021 -irt tweet-update
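
Conceptually, this step finds stored tweets that lack the missing fields and patches them in place, along the lines of the pymongo sketch below (the connection string, collection name, and patched values are placeholders; the actual schema is defined by the IR module):

from pymongo import MongoClient

# Placeholder connection string and collection name, not the project's actual values.
client = MongoClient("mongodb+srv://<username>:<password>@<cluster>/<database>")
collection = client["deeptrust"]["TWTR_2021-04-30"]

# Find tweets stored without the geo field ...
for doc in collection.find({"geo": {"$exists": False}}):
    # ... re-fetch the tweet via the Twitter API v2 (omitted here) and patch the document.
    collection.update_one(
        {"_id": doc["_id"]},
        {"$set": {"geo": None, "possible_sensitive": False}},  # placeholder values
    )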

Reliability Assessment Module Examples

  1. Feature-based Filtering

Apply feature-based filtering to the retrieved collection of tweets (e.g., remove tweets with no public metrics: retweets, likes, or quotes). Rules can be specified in config.ini under RA.Feature.Config. Verified results are updated to the MongoDB database in the field feature-filter.

python main.py -m RA -ad 30/04/2021 -t TWTR -rat feature-filter
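
A minimal sketch of such a rule-based check against a Twitter API v2 tweet/author object is shown below (threshold names mirror RA.Feature.Config; the repository's actual filter logic may differ):

def passes_feature_filter(tweet, author, cfg):
    """Illustrative rule-based filter on tweet and author engagement metrics."""
    metrics = tweet.get("public_metrics", {})
    author_metrics = author.get("public_metrics", {})
    checks = [
        metrics.get("retweet_count", 0) >= cfg["min_tweet_retweet"],
        metrics.get("reply_count", 0) >= cfg["min_tweet_reply"],
        metrics.get("like_count", 0) >= cfg["min_tweet_like"],
        metrics.get("quote_count", 0) >= cfg["min_tweet_quote"],
        len(tweet.get("entities", {}).get("hashtags", [])) <= cfg["max_tweet_tags"],
        author_metrics.get("followers_count", 0) >= cfg["min_author_followers"],
    ]
    return all(checks)

# Example with the default thresholds from the config section above.
cfg = {"min_tweet_retweet": 0, "min_tweet_reply": 0, "min_tweet_like": 0,
       "min_tweet_quote": 0, "max_tweet_tags": 15, "min_author_followers": 0}
tweet = {"public_metrics": {"retweet_count": 2, "reply_count": 0, "like_count": 5, "quote_count": 0}}
author = {"public_metrics": {"followers_count": 120}}
print(passes_feature_filter(tweet, author, cfg))  # True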

  2. Synthetic Text Filtering

(Note: Synthetic text filtering only applies to tweets with Feature-Filter = True.)

First, update the RoBERTa-based detector, GLTR-BERT, and GLTR-GPT2 detector results in the MongoDB collection. With a powerful GPU (tested on a 1080 Ti), the total time is approximately 3 days for the TWTR example, and shorter for other financial anomalies.

python main.py -m RA -ad 30/04/2021 -t TWTR -rat neural-update --models roberta gltr-bert gltr-gpt2
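
For intuition, the RoBERTa-based detection idea can be sketched with the publicly available GPT-2 output detector checkpoint on Hugging Face (whether this is the exact checkpoint and label convention used by DeepTrust is an assumption):

from transformers import pipeline

# Public GPT-2 output detector checkpoint; the label names ("Real"/"Fake") follow its model card.
detector = pipeline("text-classification", model="roberta-base-openai-detector")

tweet_text = "TWTR shares tumbled today after a disappointing earnings report."
result = detector(tweet_text)[0]

# Treat a tweet as machine-generated when the detector's confidence in the synthetic
# class exceeds roberta_threshold from config.ini (0.7 by default).
roberta_threshold = 0.7
is_synthetic = result["label"] == "Fake" and result["score"] >= roberta_threshold
print(result, is_synthetic)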

Fine-tune a GPT-2-medium generator model and generate fake tweets for training. This may take several hours on a single 1080 Ti GPU. The fine-tuned model is saved by default to ./reliability_assessment/neural_filter/gpt_generator. WandB is suggested for monitoring training progress.

python main.py -m RA -ad 30/04/2021 -t TWTR -rat neural-generate

Update the detector results on the generated fake tweets. These results are used to train an SVM classifier for classifying synthetic tweets.

python main.py -m RA -ad 30/04/2021 -t TWTR -rat neural-update-fake --models roberta gltr-bert gltr-gpt2

Train an SVM classifier and use it to generate the final decision on tweets.

python main.py -m RA -ad 30/04/2021 -t TWTR -rat neural-train --models gltr-bert gltr-gpt2
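
Conceptually, the classifier separates human-written from generated tweets using the detector outputs as features, roughly as in the scikit-learn sketch below (the feature vectors here are made up; the real pipeline derives them from the GLTR-BERT/GLTR-GPT2 results stored in MongoDB):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy feature matrix: one row per tweet; columns could be GLTR statistics such as the
# fraction of tokens falling into top-10 / top-100 / top-1000 rank buckets.
X_real = np.random.default_rng(0).uniform(0.0, 0.5, size=(50, 3))  # human-written (label 0)
X_fake = np.random.default_rng(1).uniform(0.4, 1.0, size=(50, 3))  # GPT-2-generated (label 1)
X = np.vstack([X_real, X_fake])
y = np.array([0] * 50 + [1] * 50)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(X, y)

# Probability that a new tweet's feature vector corresponds to a synthetic tweet.
print(clf.predict_proba([[0.45, 0.5, 0.55]])[0][1])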

The SVM classification results should also be updated to the tweet collection.

python main.py -m RA -ad 30/04/2021 -t TWTR -rat neural-update --models svm

Finally, verify all tweets based on the RoBERTa-based detector, GLTR-BERT-SVM, and GLTR-GPT2-SVM detectors, and update them to the MongoDB database in the field neural-filter.

python main.py -m RA -ad 30/04/2021 -t TWTR -rat neural-verify

  3. Argument Detection and Filtering

Update TARGER sequence labeling results to the MongoDB collection.

python main.py -m RA -ad 30/04/2021 -t TWTR -rat arg-update

Update argument detection results to the MongoDB collection using the sequence tags.

python main.py -m RA -ad 30/04/2021 -t TWTR -rat arg-verify

  4. Subjectivity Analysis and Filtering

Fine-tune the InferSent model on the SUBJ dataset and store the model checkpoint to ./reliability_assessment/subj_filter/infersent/models.

python main.py -m RA -ad 30/04/2021 -t TWTR -rat subj-train

Update InferSent, WordEmb and TextBlob evaluation results to the MongoDB database.

python main.py -m RA -ad 30/04/2021 -t TWTR -rat subj-update --models infersent wordemb textblob
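
For instance, the TextBlob component scores subjectivity on a 0-1 scale, which can then be compared against textblob_threshold from config.ini (a minimal sketch; the InferSent and WordEmb scorers are model-based and not shown here):

from textblob import TextBlob

textblob_threshold = 0.5  # from [RA.Subj.Config]

tweets = [
    "Twitter Inc. released its quarterly earnings report after market close.",  # mostly objective
    "I absolutely love TWTR, best stock ever, going to the moon!",              # highly subjective
]

for text in tweets:
    subjectivity = TextBlob(text).sentiment.subjectivity  # 0.0 = objective, 1.0 = subjective
    label = "subjective" if subjectivity >= textblob_threshold else "objective"
    print(f"{subjectivity:.2f} {label} - {text}")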

Update subjectivity analysis results to the MongoDB collection using the fine-tuned MLP model. Results are stored in the MongoDB database in the field subj-filter.

python main.py -m RA -ad 30/04/2021 -t TWTR -rat subj-verify

  5. Sentiment Analysis

Update FinBERT evaluation results to the MongoDB database in the field sentiment-filter.

python main.py -m RA -ad 30/04/2021 -t TWTR -rat sentiment-verify
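
As an illustration of what the FinBERT scoring looks like, the publicly available ProsusAI/finbert checkpoint on Hugging Face can be queried as below (whether this is the exact checkpoint used by DeepTrust is an assumption):

from transformers import pipeline

# Public FinBERT checkpoint fine-tuned for financial sentiment (positive / negative / neutral).
finbert = pipeline("text-classification", model="ProsusAI/finbert")

tweets = [
    "TWTR plunges after weak user-growth guidance.",
    "Twitter beats revenue estimates for the quarter.",
]
for text in tweets:
    print(finbert(text)[0], "-", text)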

Evaluation Module Examples (For Annotators Only)

Annotate a subset of the original tweet collection using a customized search query to extract the maximum number of reliable tweets.

python main.py -m RA -ad 30/04/2021 -t TWTR -rat label

Evaluate performance metrics, both per-class and weighted, on the annotated subset.

python main.py -m RA -ad 30/04/2021 -t TWTR -rat eval
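
Conceptually, the per-class and weighted metrics correspond to a standard scikit-learn classification report (a minimal sketch with toy labels; the actual evaluation reads annotations and framework decisions from MongoDB):

from sklearn.metrics import classification_report

# Toy ground-truth annotations vs. framework decisions (1 = reliable, 0 = unreliable).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

# Per-class precision/recall/F1 plus macro and weighted averages.
print(classification_report(y_true, y_pred, target_names=["unreliable", "reliable"], digits=3))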

Evaluate the sensitivity of the synthetic text filter to changes in the RoBERTa threshold.

python main.py -m RA -ad 30/04/2021 -t TWTR -rat neural-eval --models roberta_threshold

Important Notes

Change the following code in modeling_gpt2.py in the pytorch-pretrained-bert package to include GPT-2 Large and XL capabilities:

PRETRAINED_MODEL_ARCHIVE_MAP = {
    "gpt2":        "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin",
    "gpt2-medium": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-pytorch_model.bin",
    "gpt2-large":  "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-pytorch_model.bin",
    "gpt2-xl":     "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-pytorch_model.bin"
}
PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "gpt2":        "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json",
    "gpt2-medium": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-config.json",
    "gpt2-large":  "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-config.json",
    "gpt2-xl":     "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-config.json"
}

Do the same for tokenization_gpt2.py in the pytorch-pretrained-bert package:

PRETRAINED_VOCAB_ARCHIVE_MAP = {
    'gpt2':        "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json",
    "gpt2-medium": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json",
    "gpt2-large":  "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-vocab.json",
    "gpt2-xl":     "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-vocab.json"
}
PRETRAINED_MERGES_ARCHIVE_MAP = {
    'gpt2':        "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt",
    "gpt2-medium": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-merges.txt",
    "gpt2-large":  "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-merges.txt",
    "gpt2-xl":     "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-merges.txt"
}
PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP = {
    'gpt2':        1024,
    'gpt2-medium': 1024,
    'gpt2-large':  1024,
    'gpt2-xl':     1024
}

Update trainer.py used by the run_clm.py script to handle NaN loss values:

# Skip NaN training losses instead of propagating them into the running total
# (assumes numpy is imported as np in trainer.py).
training_loss = self._training_step(model, inputs, optimizer)
tr_loss += 0 if np.isnan(training_loss) else training_loss

In the _prediction_loop function:

# Likewise, replace NaN evaluation losses with 0 before accumulating.
temp_eval_loss = step_eval_loss.mean().item()
eval_losses += [0 if np.isnan(temp_eval_loss) else temp_eval_loss]

To fine-tune GPT-2-medium on tweets:

python run_clm.py --model_name_or_path gpt2-medium --model_type gpt2 --train_data_file ./detector_dataset/TWTR_2021-04-30_train.txt --eval_data_file ./detector_dataset/TWTR_2021-04-30_test.txt --line_by_line --do_train --do_eval --output_dir ./tmp --overwrite_output_dir --per_gpu_train_batch_size 1 --per_gpu_eval_batch_size 1 --learning_rate 5e-5 --save_steps 20000 --logging_steps 50 --num_train_epochs 1

Future Plans

  • Anomaly Detection module
  • Information Retrieval Module
  • Reliability Assessment Module

Acknowledgement

The list below acknowledges direct references to published code repositories, in the form of copy-and-paste or slight modification. The main skeleton of the DeepTrust framework is entirely implemented by the author; only pre-trained model configurations and training scripts are referenced. All code listed below is open-sourced and protected under the MIT or Apache 2.0 license.

Citation

@misc{https://doi.org/10.48550/arxiv.2203.08144,
  doi = {10.48550/ARXIV.2203.08144},
  url = {https://arxiv.org/abs/2203.08144},
  author = {Chan, Pok Wah},
  keywords = {Statistical Finance (q-fin.ST), Computation and Language (cs.CL), Machine Learning (cs.LG), Social and Information Networks (cs.SI), FOS: Economics and business, FOS: Computer and information sciences},
  title = {DeepTrust: A Reliable Financial Knowledge Retrieval Framework For Explaining Extreme Pricing Anomalies},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

Contributor

Bill Chan -- Main Developer