Will be ongoing project to implement machine learning techniques for peakfinding in crystallography experimental imaging data.
For use at CXLS beamline, this software aims to identify the Bragg peaks, camera length (m), and photon energy (keV), of an image based on the training data. This software will be continuing development, so please check the most recent update to main
branch for the most recent updates.
- Clone the repository:
# local
git clone --recurse-submodules https://github.com/adamkurth/cxls_hitfinder.git
or using GitLab:
# local
git clone --recurse-submodules https://gitlab.com/adamkurth/cxls_hitfinder.git
Or if you forget to add --recurse-submodules
, then run the following command:
git submodule update --init --recursive
- Change into the directory:
# local
cd cxls_hitfinder
To install Apptainer, run the following command:
- Run the Docker file to generate the Docker image:
# local
docker buildx build -t cxls_hitfinder .
- Save the Docker image:
# local
docker save cxls_hitfinder > cxls_hitfinder.tar
To transfer the Docker image to a remote machine, use scp
:
Note: Please ensure that the remote machine has Docker installed, and has cloned the repository on the remote machine.
On remote machine:
# On SOL
git clone https://github.com/adamkurth/cxls_hitfinder.git
cd cxls_hitfinder
Then, on the local machine, transfer the Docker image to the remote machine using scp
:
# local
scp cxls_hitfinder.tar <ASURITE>@sol.asu.edu:/home/<ASURITE>/cxls_hitfinder/
- Load the Docker image on the respective machine:
# on SOL
docker load < cxls_hitfinder.tar
- Run the Docker image:
# On SOL
docker run -d --name my_cxls_hitfinder_instance cxls_hitfinder
- Access the Docker container:
docker exec -it my_cxls_hitfinder_instance bash
For Mac with (Apple Silicon or Intel), you can't install Apptainer (formerly Singularity) directly because it's designed to run on Linux. However, you can use a Linux virtual machine on your Mac and then install Apptainer within that VM. Lima is a tool that sets up a Linux VM for you on a Mac, and once it's running, you can install Linux-compatible tools like Apptainer inside it.
- Install Lima:
brew install lima
- Start the VM:
limactl start default
- Access the Lima VM:
lima start default
- Install Apptainer:
sudo apt-get update
sudo apt-get install singularity-container
- Run Apptainer:
singularity shell default
The checks.sh
script will ensure the proper directories are made in the images/
directory. The images/
directory is where the images will be stored. The images/
directory will have the following structure:
cnn_hitfinder/
├── cnn/
│ ├── src/
│ │ ├── pkg/
│ │ │ ├── __init__.py
│ │ │ ├── util.py
│ │ │ ├── models.py
│ │ │ ├── functions.py
│ │ ├── cnn.py
│ │ ├── test.ipynb
│ │ ├── etc.
├── images/
│ ├── peaks/
│ ├── labels/
│ ├── peaks_water_overlay/
│ └── water/
├── scripts/
│ └──
├── requirements.txt
└── README.md
Note that every folder in images
contains 01
through 09
corresponding to the camlen/keV combination
Parameter matrix for camlen
and keV
:
Dataset (01-09) | camlen (m) | photon energy (keV) |
---|---|---|
01 |
0.15 | 6 |
02 |
0.15 | 7 |
03 |
0.15 | 8 |
04 |
0.25 | 6 |
05 |
0.25 | 7 |
06 |
0.25 | 8 |
07 |
0.35 | 6 |
08 |
0.35 | 7 |
09 |
0.35 | 8 |
Contents of images/
:
images/peaks
contains the Bragg peaks.images/labels
contains the labels for the Bragg peaks, (i.e. above a threshold of 5).images/peaks_water_overlay
contains the Bragg peaks overlayed with the respective keV water image.images/water
contains the different keV water images.
The following bash script will make the needed directories in images/
:
cd cnn/src/
bash checks.sh
-
Inside of the
cnn/src/pkg/util.py
file, the following classes are used:PathManager
: This class is used to manage the paths of the datasets.DatasetManager
: This class is used to manage the datasets themselves.Processor
: This class is used to process the selected dataset.
The
select_dataset
function of thePathManager
class is used to select the dataset with the identifier '01'. TheDatasetManager
class is then used to manage the dataset, with thetransform
parameter set toNone
in this case. Please note that for dataset '01', theclen
andphoton_energy
parameters are set to 1.5 meters and 6000 eV respectively. -
The
Processor
class is responsible for processing the dataset. It includes functions for loading the data, detecting peaks, and generating labels. Theprocess_directory
function is used to process all the images in theimages/peaks/
directory of the selected dataset (e.g., '01'). It uses the correspondingimages/water/water{0*}.h5
files to generate overlay images. For example, it generatesimages/water/01/water01.h5
for all images inimages/peaks/01/*
. -
The
Processor
class is responsible for processing the dataset, using pure signal images inimages/peaks/
(of respective dataset), then generating labeled images (heatmaps), and overlay images with the corresponding keV water image.- Namely, the
process_directory
function uses all the images inimages/peaks/
of the selected dataset (e.g. '01'), and uses the respectiveimages/water/water{0*}.h5
to generate the overlay images (e.g.images/water/01/water01.h5
to all images inimages/peaks/01/*.h5
).
- Namely, the
-
Long term goal: We will simulate diffraction intensity images of a protein, and then use a CNN to predict the protein structure given the diffraction intensity images.
-
Short term goal: Will simulate diffraction intensity images of protein 1IC6.pdb (at first), then apply water background noise to all the "peak only" images, then use both peak-only and peak + water noise images to train a CNN to predict the protein structure. Each 10,000 (20,000 including water-noise images) images will be generated at a given
clen
or camera length away from the interaction point. -
Given the 20,000 images, we want predict (a) camera length away from the interaction point, and (b) the protein structure from the peak patterns on the image.
-
ResNet: a Residual Network (ResNet), could serve as the backbone for feature extraction, thanks to its deep architecture and ability to combat vanishing gradients, allowing it to learn rich features even from complex images.
Overview: renowned for its ability to train very deep networks by usng skip connections or shortcuts to jump over some layers. These connections help mitigate the vanishing gradient problem, which is the problem of the gradients becoming increasingly small as the network becomes deeper, making it hard to train. Thus, this allows for deeper network architectures without degredation in performance.
Why: The ResNet variants (ResNet-50, ResNet-101, ResNet-152) enables it to learn a wide variety of features from diffraction images which is crucial for identifying subtle patterns indicitive of the protein structure. Its ability to effectively propogate gradients through many layers makes it ideal for learning from complex, high-dimensional data images typically in crystallagraphy.
-
U-Net: Originally for biomedical imaging segmentation, this architecture could be adapted to peak detection. Its ability to capture context at different resulutions (image sense of the word) and precisely localize the features makes it suitable for identifying the peaks in in noisy backgrounds.
Overview: Originally designed for biomedical image segmentation, U-Net's architecture is unique for its use of a contracting path to capture context and a symmetric expanding path that enables precise localization.
Why: The U-Net can be particularly effective if the task involves not just classifying the entire image but also identifying the specific regions within the image that correspond to peaks or other features relevant to protein structure prediction. Its structure is advantageous for tasks requiring precise spacial localization of features, which could be useful for peak detection with noisy backgrounds.
-
Inception Networks (GoogLeNet): The Inception Network is known for its efficient use of computational resources, thanks to its inception module. This module is a combination of 1x1, 3x3, and 5x5 convolutions, as well as max pooling, which allows the network to learn from features at different scales.
Overview: Inception networks utilize inception modules, which allow the network to choose from different kernel sizes at each block, enabling it to adaptively capture information at various scales.
Why: The varying sizes of the convolutional kernels in the inception modules allow the network to be efficient in terms of computation and parameters while being powerful in capturing features at different scales, which is critical for analyzing diffraction patterns that may vary significantly in size and shape across different protein structures.
-
Attention Mechanisms (e.g., SENet): Overview: Squeeze-and-Excitation Networks (SENet) introduce a mechanism to recalibrate channel-wise feature responses by explicitly modelling interdependencies between channels. This attention mechanism enhances the representational power of the network.
Why: Given the variability in the significance of different features in the diffraction images, especially with the introduction of noise, the ability of SENet to focus on the most relevant features could improve the accuracy of predicting protein structures from the images.
-
DenseNet (Densely Connected Convolution Networks):
Overview: DenseNet improves upon the idea of skip connections in ResNet by connecting each layer to every other layer in a feed-forward fashion. This architecture ensures maximum information flow between layers in the network.
Why: The feature reuse in DenseNet makes it particularly efficient for learning from limited data, which might be beneficial given the specific nature of the protein structures and the complexity introduced by noise in the images. DenseNet's efficiency and compactness could lead to better performance on tasks with high computational complexity like protein structure prediction from crystallography images.
-
ResNet + Attention Mechanisms: Quickly: (See ResNet above) ResNet architectures, especially deeper versions like ResNet-101 or ResNet-152, are highly capable of capturing complex patterns in images due to their deep structure supported by residual connections. These connections help in training very deep networks by addressing the vanishing gradient problem, making them highly effective for tasks requiring detailed feature extraction from images, such as identifying specific patterns in diffraction intensity images indicative of protein structures.
Enhancements: To further boost the performance of a ResNet-based model, you could integrate attention mechanisms, such as the Squeeze-and-Excitation blocks or non-local blocks, which would allow the model to focus more on relevant features within the diffraction images. This is particularly useful when dealing with images with variable noise levels, as it helps in distinguishing between noise and meaningful signal.
-
Transformer-based Models for Vision (ViT): Quickly: Given the recent success of Transformer models in various domains, including natural language processing and image recognition tasks, a Vision Transformer (ViT) model could be another excellent choice. Transformers treat the image as a sequence of patches and learn global dependencies between them, which could be particularly beneficial for understanding the global structure of protein based on local diffraction patterns.
Advantages: The Transformer architecture's ability to capture long-range interactions between different parts of the image could provide a significant advantage in predicting the protein structure from diffraction intensity images. This aspect is crucial when the relationship between different peaks (or features) in the image directly influences the inferred protein structure.
-
- Given the computational resouces at your disposal, and the goal of pushing the boundaries of what's achievable with current deep learning techniques in crystallography, starting with a ResNet-based architecture enhanced with attention mechanisms offers a solid foundation with proven capabilities in image processing tasks. This approach combines the strength of deep convolutional networks with the ability to focus on the most informative parts of an image, making it well-suited for the complexities of your task.
- Simultaneously, exploring Transformer-based models tailored for vision tasks could offer groundbreaking insights, especially in terms of learning global dependencies and handling the complex patterns found in diffraction intensity images.
- Prototyping: Given the novelty of the task and the resources available, an iterative approach that involves prototyping with both architectures, evaluating performance, and then refining or combining models based on empirical results would be the most effective strategy. This process allows for the discovery of the most suitable model architecture or combination thereof for predicting protein structures from diffraction intensity images accurately.
- Please ensure that you're logged into AGAVE first.
- Use AGAVE to run
ccn_test.py
using therun_ccn_SLURM.sh
script.
- Use gpu if available for significantly faster computation.
./run_ccn_SLURM.sh <RUN> <TASKS> <PARTITION> <QOS> <HOURS> <PATH> <TAG>
# example
./run_ccn_SLURM.sh test1 8 publicgpu wildfire 4
- Then use this command to watch the job:
- Copy the given
JOB_ID
and paste this
watch -n 2 squeue -j <JOB_ID>
# exmaple
watch "squeue -u amkurth"
- The output files are:
- errors: (
.err
) - slurm: (
.slurm
) - output: (
.out
)
- errors: (
The script ccn_test.py
is under development, but designed for peak detection and classification in 2D image data, utilizing deep learning techniques with PyTorch. Below is a detailed breakdown of the script's components, functionalities, and the mathematical concepts they incorporate:
The script aims to:
- Load and Preprocess Data: It reads 2D image data from
.h5
files. - Detect Peaks: It identifies significant points or peaks in the images which could represent important features or areas of interest.
- Classify Peaks: It uses a Convolutional Neural Network (CNN) to classify these peaks into categories (binary in this case).
import os
import glob
import h5py as h5
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from collections import Counter
import sys
from label_finder import(
load_data,
load_file_h5,
display_peak_regions,
# validate,
is_peak,
view_neighborhood,
generate_labeled_image,
main,
)
- Purpose: To process each image and identify potential peaks.
-
Methods:
-
set_threshold_value
: Sets a threshold. Any value above this is considered a potential peak. -
get_coordinates_above_threshold
: Returns coordinates of all points above the threshold, indicating potential peaks. Mathematically, for an image (I) and a threshold (T), the mask (M) is defined as:$$M(x, y) = \begin{cases} 1 & \text{if } I(x, y) > T \ 0 & \text{otherwise} \end{cases}$$
-
get_local_maxima
: Uses thefind_peaks
method to identify actual peaks from the potential ones in a flattened version of the image. -
flat_to_2d
: Converts flat indices to 2D coordinates. -
patch_method
: Normalizes the image data to a range of [0,1].
-
- Purpose: Manages a specific region around a peak in an image.
-
Methods:
-
set_peak_coordinate
&set_region_size
: Define the region around a peak. -
get_region
&extract_region
: Extract the specified region from the image, mathematically represented as:$$\text{Region} = I[x_c-s:x_c+s, y_c-s:y_c+s]$$ where
$(x_c, y_c)$ are the center coordinates and (s) is the size of the region.
-
- Purpose: Defines a Convolutional Neural Network for binary classification of peaks.
- Details: Includes layers like Convolutional, MaxPooling, Dropout, and Fully Connected layers. The forward pass can be represented as a series of mathematical operations: convolutions, ReLU activations, pooling, and linear transformations.
- Objective: To load
.h5
image data, detect peaks, and prepare labels. - Process:
load_tensor
: Loads image data from files and converts them into tensors suitable for PyTorch.is_peak
: Determines whether a given point is a peak based on its neighborhood.find_coordinates
: UsesPeakThresholdProcessor
to find peaks.validate
: Confirms peaks by comparing manual and script-detected ones.generate_label_tensor
: Creates a label tensor marking 1 at peaks and 0 elsewhere.
- Objective: To split the dataset into training and testing sets and create DataLoader objects for model training.
- Process: Reshapes tensors, splits the data using
train_test_split
, and prepares DataLoader objects for training and testing.
-
Objective: To train the CNN model with the training data.
-
Process: Initializes the model, sets up loss function and optimizer, and iteratively updates the model weights over multiple epochs based on the loss gradient. The loss function used is the Cross-Entropy Loss, mathematically expressed as:
$$\text{Loss} = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$$ where (M) is the number of classes (binary in this case), (y) is the binary indicator (0 or 1) if class label (c) is the correct classification for observation (o), and (p) is the predicted probability observation (o) is of class (c).
- Objective: To evaluate the trained model's performance on test data.
- Process: Tests the model on unseen data and calculates accuracy based on the number of correctly classified instances.
- Preprocessing: The script starts with preprocessing the data. It loads the
.h5
files, detects peaks, and generates labels. - Data Preparation: Next, it prepares the data for training by splitting it into training and test sets and creating DataLoader objects.
- Model Training: The CNN model is then trained on the training set.
- Evaluation: Finally, the trained model is evaluated on the test set to determine its accuracy.
This script provides an end-to-end solution for detecting and classifying peaks in 2D image data using deep learning. It encapsulates the entire process from data loading and preprocessing to model training and evaluation, making it a comprehensive tool for image analysis tasks involving peak detection and classification. The use of PyTorch and other scientific libraries in Python allows for efficient processing and flexibility in handling various data formats and model architectures.
The Support Vector Machine (SVM) is a supervised machine learning algorithm capable of performing classification, regression, and outlier detection. This implementation focuses on using SVM for classification purposes, particularly for classifying peaks in crystallography diffraction images.
The data is first flattened and standardized. Flattening is necessary as SVMs take in 1D feature vectors. Standardization (z-score normalization) ensures that each feature contributes equally to the distance calculations.
We use LinearSVC
from scikit-learn's svm
module for large datasets and support explicit setting of dual=False
when the number of samples is greater than the number of features for better performance.
- The
LinearSVC
model is defined with a maximum number of iterationsmax_iter=1000
. This value might need adjustment based on the specifics of the dataset. - The
dual
parameter is set toFalse
based on the warning suggesting its future change and to improve performance.
Hyperparameter tyning is performed using GridSearchCV
. This exhaustively searches over the specified parameter values for an estimator, in this case, the SVM classifier, to find the combination of paramters that results in the best performance.
Cross-Validation is used to ensure that the model's performance is robust and not dependent on the specific way the data is split. The model is trained and evaluated several times, each time with a different split of the data into training and testing sets.
Feature engineering involves creating or transforming features to improve the model's performance. This might include applying transformations like PCA to reduce dimensionality to capture more complex relationships in the data.
Post-training, the model is evaluated using a test set. Key metrics provided include a classification report (precision, recall, f1-score) and a confusion matrix.
-
Precision: The ratio of correctly predicted positive observations to the total predicted positives. High precision relates to a low false positive rate.
$$Precision = \frac{TP}{TP + FP}$$ -
Recall (Sensitivity): The ratio of correctly predicted positive observations to all observations in the actual class.
$$Recall = \frac{TP}{TP + FN}$$ -
F1 Score: The weighted average of Precision and Recall. Therefore, this score takes both false positives and false negatives into account.
$$F1 Score = 2*\frac{Precision * Recall}{Precision + Recall}$$ -
Confusion Matrix: A table used to describe the performance of the classification model on a set of test data for which the true values are known. It allows visualization of the performance of the algorithm.
Confusion Matrix:
- True Negative (top left) negative samples correctly identified of not peak
- False Positive (top right) negative samples incorrectly identified as peak
- False Negative (bottom left) positive samples incorrectly identified as not peak
- True Positive (bottom right) positive samples correctly identified as peak
To use this SVM implementation:
- Load your data: Ensure your data is in the correct format. For image data, each pixel or region can be a feature.
- Call
svm_classification
: Pass your features and labels to the function. Optionally, you can downsample your data for quicker testing and development. - Interpret the results: Use the provided metrics and confusion matrix to understand the model's performance and make any necessary adjustments.
PeakThresholdProcessor
: Processes image arrays based on a threshold to identify potential peak regions.ArrayRegion
: Extracts and handles specific regions from the array, typically centered around peaks.load_data
: Loads image data from specified paths.svm_hyperparameter_tuning
: Performs grid search to find the best parameters for the SVM model.svm_cross_validation
: Evaluates the SVM model using cross-validation.apply_pca
: Applies Principal Component Analysis (PCA) for dimensionality reduction.svm_classification
: The main function for SVM classification, incorporating data preparation, model training, hyperparameter tuning, and evaluation.downsample_data
: Optionally downsamples the data to make the model training faster and more manageable.
- Expand Hyperparameter Space: Explore a wider range of parameters and kernels in hyperparameter tuning.
- Automated Feature Selection: Implement automated feature selection techniques to identify the most informative features.
- Model Interpretability: Enhance model interpretability with techniques like feature importance scores or SHAP values.
- Scalability: Optimize the code for scalability and performance, especially for very large datasets.