The "footnews-detection-API" is a machine learning project for detecting misinformation in football news. It is built on DistilBERT, a distilled version of BERT. The primary goal is to fine-tune this encoder-only (auto-encoding) Transformer model to perform the task accurately.
The data were cleaned using regular expressions, and the most common words for each label were visualized with matplotlib and wordcloud. The data were then split into three distinct sets (training, validation, and testing) and formatted into a DatasetDict structure.
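The cleaning and splitting steps can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the cleaning rules, the 80/10/10 ratios, the seed, and the helper names `clean_text` and `split_three_way` are all assumptions. The returned dictionary only mirrors the split names a DatasetDict would typically hold.

```python
import random
import re


def clean_text(text):
    """Lowercase, drop URLs and non-alphanumeric characters, collapse spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs (assumed rule)
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove punctuation and symbols (assumed rule)
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace


def split_three_way(rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle once, then cut into train / validation / test (ratios assumed)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    return {
        "train": rows[n_test + n_val:],
        "validation": rows[n_test:n_test + n_val],
        "test": rows[:n_test],
    }


# Made-up headline, used only to show the effect of the cleaning rules.
cleaned = clean_text("BREAKING: Star striker SIGNS!!! https://example.com/x")
splits = split_three_way(range(100))
```

A fixed seed keeps the split reproducible, so validation and test examples never leak into training between runs.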
The datasets were loaded into the "Dataset" format from Hugging Face. They were then tokenized with DistilBERT's associated tokenizer to prepare them as model input.
The training corpus was encoded and pooled by taking the hidden state of the [CLS] token, which gives one vector per example for dimensionality reduction and visualization. These vectors were standardized, and PCA was applied to look for potential clustering trends in 2D.
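The standardize-then-PCA step can be sketched with NumPy alone. This is an illustration under assumptions: random vectors stand in for the pooled [CLS] embeddings (768 is DistilBERT's hidden size), PCA is computed through an SVD rather than whatever library the project used, and the function names are hypothetical.

```python
import numpy as np


def standardize(x):
    """Scale each feature to zero mean and unit variance."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return (x - mean) / std


def pca_2d(x):
    """Project rows of x onto the first two principal components."""
    z = standardize(x)
    # z is centered, so the rows of vt are the principal axes,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[:2].T


rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 768))  # synthetic stand-in for [CLS] vectors
points = pca_2d(hidden)               # one 2D point per example
```

Plotting `points` colored by label is what would reveal whether the two classes separate before any fine-tuning.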
DistilBERT was fine-tuned for 3 epochs, with accuracy and F1 score logged during training. This phase focused on optimizing the model to ensure reliable predictions.
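For reference, the two logged metrics can be computed as follows. This is a plain-Python sketch: the source does not say how F1 was averaged, so binary F1 with class 1 as the positive class is an assumption, and the labels below are toy values, not project results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)
f1 = f1_binary(y_true, y_pred)
```

Logging F1 alongside accuracy matters when the classes are imbalanced, since accuracy alone can look high for a model that rarely predicts the minority class.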
Error analysis involved examining the confusion matrix and the per-example loss on the validation set, in order to identify and analyze the phrases on which the model erred most. Additionally, the model was tested on a practical example to check its response in a real-world scenario.
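Both parts of the error analysis can be sketched in plain Python. The example records, their probabilities, and the helper names are made up for illustration. Cross-entropy is assumed as the per-example loss, which is the usual choice for a classification head but is not confirmed by the source.

```python
import math


def confusion_matrix(y_true, y_pred, n_classes=2):
    """Rows are true labels, columns are predicted labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def cross_entropy(probs, label):
    """Loss for one example: negative log-probability of the true class."""
    return -math.log(probs[label])


# Made-up validation records: true label plus predicted class probabilities.
examples = [
    {"text": "headline A", "label": 1, "probs": [0.1, 0.9]},
    {"text": "headline B", "label": 0, "probs": [0.2, 0.8]},
    {"text": "headline C", "label": 1, "probs": [0.6, 0.4]},
    {"text": "headline D", "label": 0, "probs": [0.7, 0.3]},
]
for ex in examples:
    ex["pred"] = max(range(len(ex["probs"])), key=ex["probs"].__getitem__)
    ex["loss"] = cross_entropy(ex["probs"], ex["label"])

matrix = confusion_matrix(
    [ex["label"] for ex in examples], [ex["pred"] for ex in examples]
)
# Highest loss first: these are the examples worth reading by hand.
worst_first = sorted(examples, key=lambda ex: ex["loss"], reverse=True)
```

Sorting by loss rather than just listing misclassifications also surfaces correct but low-confidence predictions, which often point to ambiguous or mislabeled texts.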