Find correlation between earnings call transcripts and stock price movements (V1)

Vxtr10/Earnings-Call-NLP-Strategy-V1

Earnings Call NLP Strategy V1

ABSTRACT

Publicly traded companies are prohibited from fabricating information or deceiving investors in earnings calls, which makes the calls a useful tool for stock valuation. Earnings calls may contain patterns that a machine learning algorithm can identify and use to extrapolate the direction of future stock movements. Various feature extraction techniques convert earnings call transcripts (text) into machine-readable formats (vectors). The main feature extraction methods are TF-IDF with cosine similarity; sentiment analysis using the Loughran-McDonald dictionary; and various text complexity metrics. The features are then passed to a Random Forest classifier, implemented with both binary (one-vs-rest) and multi-class approaches. The former achieved an average accuracy of 83%, whereas the multi-class methods achieved 45% and 74% accuracy, depending on the label range.
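To make the TF-IDF and cosine similarity step concrete, here is a minimal, dependency-free sketch. It is not the repository's code: the function names, the tokenizer, the smoothed IDF formula, and the three toy "transcripts" are all assumptions made for illustration. In practice a library implementation such as scikit-learn's `TfidfVectorizer` would be the more usual choice.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (raw count x smoothed IDF)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy stand-ins for transcripts (invented text, not real earnings calls).
transcripts = [
    "revenue grew and margins improved this quarter",
    "revenue grew while margins improved again this quarter",
    "litigation risk and impairment charges hurt results",
]
vectors = tfidf_vectors(transcripts)
similar = cosine_similarity(vectors[0], vectors[1])
different = cosine_similarity(vectors[0], vectors[2])
print(f"call 1 vs call 2: {similar:.3f}")
print(f"call 1 vs call 3: {different:.3f}")
```

A similarity score like this could serve as one feature per transcript, for example by comparing a company's current call with its previous one, though how the scores are actually used in this project is not specified here.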

CONCLUSION

This model has several weaknesses. First, its dataset is not large (18,280). Second, it trains on a limited set of features (only various sentiment, prose, and text complexity measures), so it does not consider other elements that may affect a stock's price movement. Because the existing features are limited and qualitative, the model cannot give a precise quantitative percentage estimate, which explains the difficulty with a regression model. Since share price movements do not depend solely on earnings calls but also on external information (press releases, financials, sentiment, etc.), improvements are needed to achieve higher accuracy.

The model may improve by reading financial statements and spotting irregularities relative to past statements (e.g. new footnotes and new risk factors in 10-K and 10-Q filings). Another improvement is to include analyst ratings as well as the general sentiment towards a specific ticker (and its changes over time) by scraping content from Twitter, Reddit, and StockTwits. Technical analysis can also be applied (e.g. data on volume, MACD, RSI) by pulling data from Yahoo Finance with the yfinance library, to gain higher accuracy for multi-class classification algorithms and potentially even enough information for regression models.
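As an illustration of the kind of technical feature mentioned above, the sketch below computes a Relative Strength Index from a list of closing prices. It is a hypothetical helper, not part of this repository: it uses plain averages of gains and losses rather than Wilder's smoothing, and the prices are hard-coded invented values rather than data downloaded from Yahoo Finance.

```python
def simple_rsi(closes, period=14):
    """RSI over the last `period` price changes, using plain averages.

    Assumption: this is the simple-average variant, not Wilder's smoothed RSI.
    """
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")
    # Pair each of the last `period` closes with the close before it.
    changes = [curr - prev
               for prev, curr in zip(closes[-period - 1:-1], closes[-period:])]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_gain == 0 and avg_loss == 0:
        return 50.0  # flat prices: treat as neutral
    if avg_loss == 0:
        return 100.0  # only gains in the window
    relative_strength = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + relative_strength)


# Invented closing prices for illustration only.
closes = [100.0, 101.5, 101.0, 102.2, 103.0, 102.4, 104.1, 105.0]
rsi_value = simple_rsi(closes, period=5)
print(f"5-period RSI: {rsi_value:.1f}")
```

A value like this could be appended to each transcript's feature vector alongside the text features, using prices from the days before the call.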

Nonetheless, at present, an effective strategy is to go long on shares predicted to overperform (labelled 1.0) and to short those predicted to underperform (by reversing the labels, i.e. switching ‘<’ to ‘>=’ for the binary classification model), while placing emphasis on companies that operate in the same industry. However, it is prudent to expand the current dataset and to address class imbalance using methods such as over- or under-sampling, or a classification model with a weighted loss function.
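The label reversal and the weighting idea can be sketched as follows. This is an assumption-laden illustration: the threshold value, the function names, and the exact form of the comparison are hypothetical, since the repository's labelling code is not shown here. The weight formula is the common "balanced" heuristic, n_samples / (n_classes * class_count), which is also what scikit-learn applies when a classifier is given `class_weight="balanced"`.

```python
from collections import Counter


def binary_label(pct_change, threshold, target_underperform=False):
    """Return 1.0 if the sample belongs to the target class, else 0.0.

    Flipping `target_underperform` swaps the comparison ('>=' becomes '<'),
    which turns an overperformance detector into an underperformance one.
    """
    if target_underperform:
        return 1.0 if pct_change < threshold else 0.0
    return 1.0 if pct_change >= threshold else 0.0


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Invented percentage changes and a hypothetical 5% threshold.
changes = [12.0, -3.5, 1.2, 7.9, -8.0, 0.4, 2.1, -1.0]
long_labels = [binary_label(c, threshold=5.0) for c in changes]
short_labels = [binary_label(c, threshold=5.0, target_underperform=True)
                for c in changes]
weights = balanced_class_weights(long_labels)
print(long_labels)
print(weights)
```

With an imbalanced label set like this one, the minority class receives the larger weight, so misclassifying it costs more during training.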

Furthermore, my conclusions may not apply in the real world, since the label I selected for classification measures the percentage change from Day 0 to Day 50. This is unrealistic, as stocks often move in pre-market or after-hours trading almost immediately after earnings are announced. Because transcripts often take 1-2 days to be published, we cannot capitalise on the Day 0 bounce. Nonetheless, it is a good starting point.
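A small worked example shows why the entry day matters. The price series below is invented, and the helper name and the two-day publication lag are assumptions chosen to match the 1-2 day delay described above; the point is only that a return measured from Day 0 can be much larger than what is attainable when entering after the transcript is published.

```python
def pct_change(prices, start_day, end_day):
    """Percentage change between two days of a price series indexed by day."""
    start = prices[start_day]
    end = prices[end_day]
    return (end - start) / start * 100.0


# Invented series: a sharp jump right after the call, then a slow drift.
# Index 0 is the day of the earnings call; index 50 is Day 50.
prices = [100.0, 108.0, 110.0] + [110.0 + 0.1 * i for i in range(1, 49)]

label_return = pct_change(prices, start_day=0, end_day=50)
tradable_return = pct_change(prices, start_day=2, end_day=50)
print(f"Day 0 to Day 50: {label_return:.2f}%")
print(f"Day 2 to Day 50: {tradable_return:.2f}%")
```

Measuring the label from the first day a transcript is actually available would be one way to bring the backtest closer to what can be traded.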
