SEC EDGAR Topic Modelling and relevance forecasting of Risk Factors

Initial Project Statement - Third Year Project - University of Manchester

Topic modelling of the financial domain

Background

A large amount of financial information is available through corporate textual disclosures. The volume of data that has become available, however, makes a manual discovery of topics and the interpretation of the texts infeasible. One particular type of information that is of high interest to investors is the corporate disclosure of risk exposures. Using an automated text algorithm could facilitate and speed up the identification of risks that companies are exposed to.

Description

Companies in the U.S. must disclose their risk factors in a specific section of their annual reports. While identifying this section is relatively easy, the description of the risk factors within this section is unstructured and encompasses several pages. Instead of letting humans read through these sections to identify particular risk factors, the goal of this project is to employ an automated term recognition method to identify the most common risk factors across firms and years. One contribution in this area has been made by Bao and Datta (2014) who employ an LDA model to discover and quantify risk topics. However, individual words that the LDA identifies may not be as meaningful as domain-specific (multi-word) terms. This project will aim to look at how to combine these two approaches.

Deliverables

The main goal of the project is topic modellig (see (Blei and Lafferty, 2009)) for the financial risk domain. The model will aim to employ a term recognition method (e.g. (Spasic et al. 2013)) to examine whether an accurate automated discovery of risk factors from US firms’ annual report disclosures can be achieved and how it can best be presented. The student will have the opportunity to compare the various approaches to topic discovery, e.g., LDA models and term recognition methods.

References

Bao, Y., Datta, A., 2014. Simultaneously discovering and quantifying risk types from textual risk disclosures. Management Science 60, 1371–1391. Blei, David M., and John D. Lafferty, Topic models, Chapter 4 in Srivastava, Ashok N., and Mehran Sahami, eds. Text mining: Classification, clustering, and applications. CRC Press, 2009. Spasic, Irena et al., 2013, FlexiTerm: a flexible term recognition method, Journal of Biomedical Semantics, 1—15.

Supervisor: Goran Nenadic

Functionality implemented:

Extract risk factors section from SEC EDGAR and store them into a MongoDB instance
Dataset analysis for the extracted documents
BERTopic model - utilises DTM to generate a view over the evolution of topics throughout time
Regression model - forecast topic frequency
Classifier model - evaluate if a topic is going to increase in importance
Timeseries model - forecast topic frequency (utilising a timeseries model)

Collab version with models, datasets and pretrained BERTopic can be found here

User guide:

(It is recommended to be used in a powerful environment comparable to the ones provided by Google Colab Pro+)

Prerequirements:

Git
(optional) Git LFS
Python 3.8
(optional) MongoDB

Create a MongoDB instance locally - or restore the one provided here
Clone the repository
(optional) Create a virtual environment
Get the data - either by running data_gathering.ipynb or by downloading the datasets provided on Google Drive
Run BERTopic_Model (Dynamic Topic Modelling) in order to transform document data into topics based timeseries
Run Increasing_Classifier: if you want to determine if a keyword is part of a topic which is going to increase in importance
Run Regression_Model: if you want to predict evolution of topics past their limit in the dataset (or on future timestamps)

User guide:

The analysis was developed on the following platforms:

AMD Ryzen 5 4600h | 32GB RAM | NVIDIA GeForce® GTX 1650 4GB GDDR6
- Windows 11
- openSUSE Tumbleweed
Apple M1 | 16GB RAM
- macOS Monterey 12.3
1.7GHz quad-core Intel Core i7, Turbo Boost up to 4.5GHz | 16GB RAM | 128MB of eDRAM
- macOS BigSur 11.6.5
Google Colab Pro+ | NVIDIA K80 | 2 x Intel Xeon @ 2.199GHz | 52GB RAM
- Ubuntu 18.04.5 LTS

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
collab		collab
latex		latex
logs		logs
models		models
output		output
static		static
.gitattributes		.gitattributes
.gitignore		.gitignore
BERTopic_Model.ipynb		BERTopic_Model.ipynb
LICENSE		LICENSE
README.md		README.md
Regression_model.ipynb		Regression_model.ipynb
data_gathering.ipynb		data_gathering.ipynb
dataset_stats.ipynb		dataset_stats.ipynb
increasing_classifier.ipynb		increasing_classifier.ipynb
requirements.txt		requirements.txt
timeseries_model.ipynb		timeseries_model.ipynb

License

armandanusca/topic-modelling-risk-factors

Folders and files

Latest commit

History

Repository files navigation

SEC EDGAR Topic Modelling and relevance forecasting of Risk Factors

Initial Project Statement - Third Year Project - University of Manchester

Topic modelling of the financial domain

Background

Description

Deliverables

References

Supervisor: Goran Nenadic

Functionality implemented:

Collab version with models, datasets and pretrained BERTopic can be found here

User guide:

User guide:

About

Topics

Resources

License

Stars

Watchers

Forks

Languages