
SparkDLTrigger - Particle Classifier using Deep Learning and Spark

Welcome to the SparkDLTrigger repository!
This project is about building a machine learning pipeline for a high-energy physics particle classifier using mainstream tools and techniques from open source for ML and Data Engineering.

Open in SWAN | Open in Colab

Related articles and presentations

This project is supported by several articles and presentations that provide further insights and details. Check them out:

Physics Use Case and Implementation

This work builds a particle classifier to improve the accuracy of event selection in High Energy Physics.
Neural networks are used to classify the event topologies of interest, improving on the state-of-the-art accuracy. The work reproduces the findings of the research article Topology classification with deep learning to improve real-time event selection at the LHC and implements them at scale using mainstream open-source tools and techniques for ML and Data Engineering, such as Apache Spark, TensorFlow/Keras, and Jupyter notebooks. The workload has been deployed on different computing resources at CERN and in the cloud, including CERN's Hadoop and Spark service clusters and GPU resources on Kubernetes.

Physics use case for the particle classifier

Authors

Project Structure

The project repository is organized into the following sections:

1. Download datasets

Location: Download datasets
Description: Contains datasets required for the project.

2. Data preparation using Apache Spark

Location: Data ingestion and feature preparation
Description: Covers the process of data ingestion and feature preparation using Apache Spark.

3. Preparation of the datasets in Parquet and TFRecord formats

Location: Preparation of the datasets in Parquet and TFRecord formats
Description: Provides instructions for preparing the datasets in Parquet and TFRecord formats.

4. Model tuning

Location: Hyperparameter tuning
Description: Explains the process of hyperparameter tuning to optimize the model's performance.

5. Model training

  • Location: HLF classifier with Keras

    • Description: Demonstrates the training of a High-Level Features (HLF) classifier using a simple model and a small dataset. The notebooks also showcase several methods for feeding Parquet data to TensorFlow, including in-memory arrays, Pandas, TFRecords, and tf.data (a minimal sketch of this feeding approach appears after this list).
  • Location: Inclusive classifier

    • Description: This classifier uses a Recurrent Neural Network and is data-intensive; it illustrates the case where the training data cannot fit into memory (see the streaming sketch after this list).
  • Location: Methods for distributed training

    • Description: Discusses methods for distributed training (a minimal tf.distribute sketch follows this list).
  • Location: Training_Spark_ML

    • Description: Covers training tree-based models in parallel with Spark MLlib Random Forest, XGBoost, and LightGBM (see the sketch after this list).
  • Location: Saved models

    • Description: Contains saved models.
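
As referenced in the HLF classifier item above, the sketch below illustrates one of the simpler feeding approaches: loading Parquet data into memory with Pandas and training a small Keras model through tf.data. The file path, column names, and layer sizes are illustrative placeholders, not the repository's actual layout.

```python
# Minimal sketch: feed Parquet data to a Keras HLF classifier via Pandas + tf.data.
# The path and column names below are hypothetical placeholders.
import numpy as np
import pandas as pd
import tensorflow as tf

df = pd.read_parquet("hlf_features.parquet")                 # hypothetical path
X = np.stack(df["HLF_input"].values).astype("float32")       # per-event feature vectors
y = np.stack(df["encoded_label"].values).astype("float32")   # one-hot labels

dataset = (tf.data.Dataset.from_tensor_slices((X, y))
           .shuffle(buffer_size=10_000)
           .batch(128)
           .prefetch(tf.data.AUTOTUNE))

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X.shape[1],)),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(y.shape[1], activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(dataset, epochs=5)
```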
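
For the data-intensive inclusive classifier, where the training set does not fit in memory, one option is to stream TFRecord files with tf.data, as sketched below. The file pattern and feature specification are assumptions for illustration only.

```python
# Minimal sketch: stream training data from TFRecord files so it never has to fit in memory.
# The file pattern and feature specification are hypothetical placeholders.
import tensorflow as tf

feature_spec = {
    "HLF_input": tf.io.FixedLenFeature([14], tf.float32),     # assumed feature vector size
    "encoded_label": tf.io.FixedLenFeature([3], tf.float32),  # assumed one-hot label
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    return parsed["HLF_input"], parsed["encoded_label"]

files = tf.data.Dataset.list_files("train_data/part-*.tfrecord")
dataset = (files.interleave(tf.data.TFRecordDataset,
                            num_parallel_calls=tf.data.AUTOTUNE)
                .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
                .shuffle(10_000)
                .batch(128)
                .prefetch(tf.data.AUTOTUNE))
# The resulting dataset can be passed directly to model.fit(dataset, epochs=...).
```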
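
As a minimal sketch of one possible distributed-training setup (the repository discusses several methods; this example uses TensorFlow's built-in tf.distribute, and the model and dataset are placeholders):

```python
# Minimal sketch: synchronous data-parallel training on multiple GPUs with tf.distribute.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()        # one replica per visible GPU
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # The model and optimizer must be created inside the strategy scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(14,)),        # hypothetical input size
        tf.keras.layers.Dense(50, activation="relu"),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# 'dataset' is any tf.data.Dataset (e.g. built as in the sketches above);
# the strategy shards each batch across the replicas automatically.
# model.fit(dataset, epochs=5)
```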
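
For the tree-based models in Training_Spark_ML, a minimal Spark MLlib Random Forest sketch is shown below; the dataset paths and column names are hypothetical, and XGBoost or LightGBM can be used through their Spark integrations in a similar way.

```python
# Minimal sketch: train a Random Forest classifier in parallel with Spark MLlib.
# Assumes Parquet datasets with a vector column "features" and a numeric column "label".
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("rf-training").getOrCreate()

train_df = spark.read.parquet("train_features.parquet")   # hypothetical paths
test_df = spark.read.parquet("test_features.parquet")

rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                            numTrees=100, maxDepth=10)
model = rf.fit(train_df)

predictions = model.transform(test_df)
accuracy = MulticlassClassificationEvaluator(labelCol="label",
                                             metricName="accuracy").evaluate(predictions)
print(f"Test accuracy: {accuracy:.3f}")
```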

Additionally, you can explore the archived work in the article_2020 branch.

Note: Each section includes detailed instructions and examples to guide you through the process.

Data Pipelines for Deep Learning

Data pipelines play a crucial role in the success of machine learning projects. They integrate the components and APIs needed to process data seamlessly across the entire data chain. An efficient data pipeline significantly speeds up the core machine learning tasks and improves productivity. In this project, we have developed a data pipeline consisting of the following four steps:

  1. Data Ingestion

    • Description: In this step, we read data from the ROOT format and the CERN-EOS storage system into a Spark DataFrame. The resulting data is then saved as a table stored in Apache Parquet files (a minimal sketch of this step follows the list).
    • Objective: The data ingestion step ensures that the necessary data is accessible for further processing.
  2. Feature Engineering and Event Selection

    • Description: This step focuses on processing the Parquet files generated during the data ingestion phase. The files contain detailed event information, which is further filtered and transformed to produce datasets with new features (see the sketch after this list).
    • Objective: Feature engineering and event selection enhance the data representation, making it suitable for training machine learning models.
  3. Parameter Tuning

    • Description: In this step, we perform hyperparameter tuning to identify the best set of hyperparameters for each model architecture. This is achieved through a grid search approach, where different combinations of hyperparameters are tested and evaluated (a minimal grid-search sketch follows this list).
    • Objective: Parameter tuning ensures that the models are optimized for performance and accuracy.
  4. Training

    • Description: The best models identified during the parameter tuning phase are trained on the entire dataset. This step leverages the selected hyperparameters and the processed data to train the models effectively.
    • Objective: Training the models on the entire dataset enables them to learn and make accurate predictions.
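
The sketch below illustrates the data ingestion step: reading ROOT files into a Spark DataFrame and saving them as Parquet. It assumes a ROOT data-source connector (for example, Laurelin) is available on the Spark classpath; the format name, TTree name, and paths are illustrative placeholders.

```python
# Minimal sketch: read ROOT files into a Spark DataFrame and persist them as Parquet.
# Assumes a ROOT data-source connector (e.g. Laurelin) is on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-ingestion").getOrCreate()

events_df = (spark.read
             .format("root")                    # short name registered by the connector
             .option("tree", "Events")          # hypothetical TTree name
             .load("root://eos.cern.ch//eos/path/to/input/*.root"))   # placeholder path

events_df.write.mode("overwrite").parquet("events_raw.parquet")
```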
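
A minimal sketch of the feature engineering and event selection step with Spark follows; the column names and cut values are hypothetical and do not reflect the project's actual schema.

```python
# Minimal sketch: event selection and feature engineering on the ingested Parquet data.
# Column names and cut values are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feature-preparation").getOrCreate()

events = spark.read.parquet("events_raw.parquet")

selected = (events
            # event selection: keep only events passing simple kinematic cuts
            .filter((F.col("lepton_pt") > 23.0) & (F.col("missing_et") > 10.0))
            # feature engineering: derive new columns from the raw event information
            .withColumn("lepton_pt_ratio", F.col("lepton_pt") / F.col("missing_et")))

selected.write.mode("overwrite").parquet("events_features.parquet")
```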
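
The grid-search approach of the parameter tuning step can be sketched as below; the hyperparameter grid, model architecture, and the train_dataset/val_dataset variables are placeholders assumed to exist (for example, tf.data datasets built as in the earlier sketches).

```python
# Minimal sketch: grid search over a few hyperparameters of a small Keras model.
# train_dataset and val_dataset are assumed to be pre-built tf.data datasets.
import itertools
import tensorflow as tf

def build_model(hidden_units, learning_rate, n_features=14, n_classes=3):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

grid = {"hidden_units": [20, 50, 100], "learning_rate": [1e-2, 1e-3]}
best = None
for hidden_units, lr in itertools.product(grid["hidden_units"], grid["learning_rate"]):
    model = build_model(hidden_units, lr)
    history = model.fit(train_dataset, validation_data=val_dataset, epochs=5, verbose=0)
    val_acc = history.history["val_accuracy"][-1]
    if best is None or val_acc > best[0]:
        best = (val_acc, {"hidden_units": hidden_units, "learning_rate": lr})

print("Best configuration:", best[1], "with validation accuracy:", best[0])
```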

By following this data pipeline, we ensure a well-structured and efficient workflow for deep learning tasks. Each step builds upon the results of the previous one, ultimately leading to the development of high-performance machine learning models.

Machine learning data pipeline

Results and Model Performance

The training of DL models has yielded satisfactory results that align with the findings of the original research paper. The performance of the models can be evaluated through various metrics, including loss convergence, ROC curves, and AUC (Area Under the Curve) analysis.

Loss convergence, ROC curves and AUC

The provided visualization demonstrates the convergence of the loss function during training and the corresponding ROC curves, which illustrate the trade-off between the true positive rate and the false positive rate for different classification thresholds. The AUC metric provides a quantitative measure of the model's performance, with higher AUC values indicating better classification accuracy.
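
For readers who want to reproduce this kind of evaluation, a minimal sketch of per-class ROC curves and AUC values is shown below; y_true and y_score are placeholders for the one-hot test labels and the model's softmax outputs.

```python
# Minimal sketch: per-class ROC curves and AUC for a trained classifier.
# y_true and y_score are placeholders: arrays of shape (n_events, n_classes).
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

n_classes = y_true.shape[1]
for i in range(n_classes):
    fpr, tpr, _ = roc_curve(y_true[:, i], y_score[:, i])
    plt.plot(fpr, tpr, label=f"class {i} (AUC = {auc(fpr, tpr):.3f})")

plt.plot([0, 1], [0, 1], linestyle="--", color="grey")   # random-guess baseline
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```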

By achieving results consistent with the original research paper, we validate the effectiveness of our DL models and the reliability of our implementation. These results contribute to advancing the field of high-energy physics and event classification at the LHC (Large Hadron Collider).

For more detailed insights into the experimental setup, methodology, and performance evaluation, please refer to the associated documentation and research article.

Additional Info and References