diff --git a/README.md b/README.md
Lines changed: 74 additions & 24 deletions
diff --git a/model_evaluation/checkpoint_list.txt b/model_evaluation/checkpoint_list.txt
Lines changed: 3 additions & 0 deletions
diff --git a/model_evaluation/model_evaluation.py b/model_evaluation/model_evaluation.py
Lines changed: 137 additions & 0 deletions
diff --git a/model_evaluation/model_evaluation_bulk.sh b/model_evaluation/model_evaluation_bulk.sh
Lines changed: 101 additions & 0 deletions
@@ -1,33 +1,83 @@
-# Project
+# Hierarchical cross-entropy loss improves atlas-scale single-cell annotation models
 
-> This repo has been populated by an initial template to help get you started. Please
-> make sure to update the content to build a great experience for community-building.
+[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
 
-As the maintainer of this project, please make a few updates:
+This repository contains the code used for "Hierarchical cross-entropy loss improves atlas-scale single-cell annotation models". The paper is available on [bioRxiv](https://doi.org/10.1101/2025.04.22.1234567).
 
-- Improving this README.MD file to provide a great experience
-- Updating SUPPORT.MD with content about this project's support experience
-- Understanding the security reporting process in SECURITY.MD
-- Remove this section from the README
+## Repository Information
+This repository is partially derived from the [scTab study](https://github.com/theislab/scTab). We have extended and modified the original codebase to implement the hierarchical cross-entropy loss and the experiments described in the paper.
 
-## Contributing
+## Training Data
+Model training uses CELLxGENE census version "2023-05-15" as preprocessed by [scTab](https://github.com/theislab/scTab). The preprocessed data must be downloaded manually from [this link](https://pklab.med.harvard.edu/felix/data/merlin_cxg_2023_05_15_sf-log1p.tar.gz).
 
-This project welcomes contributions and suggestions.  Most contributions require you to agree to a
-Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
-the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
+## Evaluation Data
+For model evaluation, we use the CELLxGENE census version "2023-12-15" as referenced in the paper. This census version is automatically fetched by the code directly from the [CELLxGENE](https://cellxgene.cziscience.com/) portal when needed.
 
-When you submit a pull request, a CLA bot will automatically determine whether you need to provide
-a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
-provided by the bot. You will only need to do this once across all repos using our CLA.
+## Hierarchical Cross-Entropy Loss
+The hierarchical cross-entropy loss leverages inherent hierarchical structures within classification problems to improve model performance. Unlike standard cross-entropy, which treats each class independently, this loss function accounts for inclusion relationships between classes. Here we provide a standalone implementation that can be applied to any hierarchical classification task, regardless of the domain or model architecture.
 
-This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
-For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
-contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
+### Reachability Matrix
+The loss relies on a **reachability matrix** that encodes the hierarchical structure, represented as a directed acyclic graph (DAG). In this matrix:
+- Element (i,j) equals 1 if class j is reachable from class i (meaning j is either i itself or j is a subclass of i in the hierarchy)
+- Element (i,j) equals 0 otherwise
 
-## Trademarks
+For example, consider this simple hierarchical structure:
+```
+    A
+   ↙ ↘
+  B   C
+ ↙ ↘ ↙
+D   E
+```
 
-This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft 
-trademarks or logos is subject to and must follow 
-[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
-Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
-Any use of third-party trademarks or logos are subject to those third-party's policies.
+The corresponding reachability matrix would be:
+```
+    A B C D E
+A | 1 1 1 1 1
+B | 0 1 0 1 1
+C | 0 0 1 0 1
+D | 0 0 0 1 0
+E | 0 0 0 0 1
+```
+
+The reachability relation encoded in this matrix is a partial order and has the following mathematical properties:
+- **Reflexive**: Every class is reachable from itself (diagonal elements are 1)
+- **Antisymmetric**: If class i can reach j and j can reach i, then i equals j
+- **Transitive**: If class i can reach j and j can reach k, then i can reach k
+
+### Implementation
+```python
+import torch
+import torch.nn.functional as F
+
+def hierarchical_cross_entropy_loss(logits, targets, reachability_matrix, weight=None):
+    """
+    Hierarchical Cross-Entropy loss
+    
+    Args:
+        logits: Raw model predictions, shape (batch_size, num_classes)
+        targets: Ground-truth class indices, shape (batch_size,)
+        reachability_matrix: Float tensor of shape (num_classes, num_classes);
+            entry (i, j) is 1 if class j is reachable from class i, else 0
+        weight: Optional per-class weights, shape (num_classes,)
+    
+    Returns:
+        Hierarchical Cross-Entropy loss value
+    """
+    # Convert logits to probabilities using softmax
+    cell_type_probs = torch.softmax(logits, dim=-1)
+    
+    # Propagate probabilities through the hierarchy: each class accumulates its own
+    # probability plus the probabilities of all classes reachable from it
+    cell_type_probs = torch.matmul(cell_type_probs, reachability_matrix.T)
+    
+    # Take the log (with a small epsilon for numerical stability) for the NLL loss
+    cell_type_probs = torch.log(cell_type_probs + 1e-6)
+    
+    # Calculate negative log-likelihood loss with optional class weights
+    hce_loss = F.nll_loss(cell_type_probs, targets, weight=weight)
+    return hce_loss
+```
+
+## Contact
+For questions or issues, please contact davide.dascenzo.work@gmail.com or davide.dascenzo@unimi.it (the latter address will likely be inactive from 2026).
+
+## Citation
+If you use this code or method in your research, please consider citing the following [paper](https://doi.org/10.1101/2025.04.22.1234567).
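Note on the README hunk above: the reachability matrix shown there can be derived mechanically from the parent-to-child edges of the example hierarchy. The following is a minimal sketch in plain Python (no PyTorch), not the repository's code. The helper names `reachability_matrix`, `softmax`, and `hce_loss_single` are hypothetical, and the logits are made-up numbers. It also works through what the loss computes for a single sample: the negative log of the probability mass on the target class and everything reachable from it.

```python
import math

# Example hierarchy from the README: A -> B, A -> C, B -> D, B -> E, C -> E
CLASSES = ["A", "B", "C", "D", "E"]
EDGES = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E"), ("C", "E")]


def reachability_matrix(classes, edges):
    """Return R with R[i][j] = 1 if class j is class i or a descendant of i."""
    index = {name: k for k, name in enumerate(classes)}
    n = len(classes)
    # Start from the identity (reflexivity) plus the direct parent -> child edges
    reach = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for parent, child in edges:
        reach[index[parent]][index[child]] = 1
    # Floyd-Warshall-style transitive closure
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = 1
    return reach


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hce_loss_single(logits, target, reach, eps=1e-6):
    """Hierarchical cross-entropy for one sample (pure-Python illustration)."""
    probs = softmax(logits)
    # Probability of the target class plus all classes reachable from it;
    # this is entry `target` of the vector probs @ R.T
    mass = sum(p * r for p, r in zip(probs, reach[target]))
    return -math.log(mass + eps)


R = reachability_matrix(CLASSES, EDGES)
for name, row in zip(CLASSES, R):
    print(name, row)

logits = [0.0, 1.0, 0.5, 2.0, 0.2]  # made-up scores for A..E
loss_root = hce_loss_single(logits, target=0, reach=R)  # target A (root)
loss_leaf = hce_loss_single(logits, target=3, reach=R)  # target D (leaf)
print(loss_root, loss_leaf)
```

For a root target the loss is essentially zero, because all probability mass is reachable from the root. For a leaf target it reduces to ordinary cross-entropy, up to the epsilon term.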
@@ -0,0 +1,3 @@
+/data/sebacultrera/merlin_cxg_2023_05_15_sf-log1p/tb_logs/cxg_2023_05_15_linear_hierarchical_loss/default/version_0/checkpoints/val_f1_macro_epoch=1_val_f1_macro=0.802.ckpt
+/data/sebacultrera/merlin_cxg_2023_05_15_sf-log1p/tb_logs/cxg_2023_05_15_mlp_hierarchical_loss/default/version_0/checkpoints/val_f1_macro_epoch=1_val_f1_macro=0.798.ckpt
+/data/sebacultrera/merlin_cxg_2023_05_15_sf-log1p/tb_logs/cxg_2023_05_15_tabnet_hierarchical_loss/default/version_0/checkpoints/val_f1_macro_epoch=0_val_f1_macro=0.789.ckpt
@@ -0,0 +1,137 @@
+import argparse
+import sys
+import os
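+# Make modules under ../scTab and ../model_evaluation importable
+# (relative paths are resolved against the current working directory)
+sys.path.append('../scTab')
+sys.path.append('../model_evaluation')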
+import pandas as pd
+import matplotlib.pyplot as plt
+import numpy as np
+from utils import (
+    data_preparation, 
+    run_model, 
+    print_clf_report_per_class
+)
+
+def parse_args():
+    """Parse command line arguments."""
+    parser = argparse.ArgumentParser(description='Evaluate cell type classification models')
+    
+    # Data paths
+    parser.add_argument('--dataset_ids', type=str, help='Dataset ID or "diff_2023-05-15"')
+    parser.add_argument('--features_file', type=str, required=True,
+                      help='Path to features.parquet')
+    parser.add_argument('--var_file', type=str, required=True,
+                      help='Path to var.parquet')
+    parser.add_argument('--cell_type_mapping_file', type=str, required=True,
+                      help='Path to cell_type.parquet')
+    parser.add_argument('--cell_type_hierarchy_file', type=str, required=True,
+                      help='Path to child_matrix.npy')
+    
+    # Model paths and configuration
+    parser.add_argument('--model_type', type=str, required=True,
+                      choices=['tabnet', 'linear', 'mlp', 'celltypist'],
+                      help='Type of model to evaluate')
+    parser.add_argument('--checkpoint_path', type=str, required=True,
+                      help='Path to model checkpoint')
+    parser.add_argument('--hparams_file', type=str,
+                      help='Path to hyperparameters file (not needed for CellTypist)')
+    
+    # Output configuration
+    parser.add_argument('--output_dir', type=str, default='evaluation_results',
+                      help='Directory to save evaluation results')
+    parser.add_argument('--census_version', type=str, default='2023-05-15',
+                      help='CELLxGENE census version')
+    parser.add_argument('--force_download', action='store_true',
+                      help='Force re-download of data')
+    
+    # Storage root
+    parser.add_argument('--output_root', type=str, required=True,
+                      help='Root directory for storing AnnData chunks and results')
+    
+    return parser.parse_args()
+
+def save_results(clf_report_overall, clf_report_per_class, y_probs, y_pred, y_true, metadata, cell_type_mapping, args):
+    """Save evaluation results to files."""
+    os.makedirs(args.output_dir, exist_ok=True)
+    
+    # Save overall metrics
+    overall_path = os.path.join(args.output_dir, f'{args.model_type}_overall_metrics.csv')
+    clf_report_overall.to_csv(overall_path)
+    print(f"\nOverall metrics saved to: {overall_path}")
+    print("\nOverall Results:")
+    print(clf_report_overall)
+    
+    # Save per-class metrics
+    per_class_path = os.path.join(args.output_dir, f'{args.model_type}_per_class_metrics.csv')
+    clf_report_per_class.to_csv(per_class_path)
+    print(f"\nPer-class metrics saved to: {per_class_path}")
+    
+    # Generate and save visualization
+    plt.figure(figsize=(20, 10))
+    print_clf_report_per_class(
+        clf_report_per_class, 
+        args.cell_type_mapping_file, 
+        title=f'{args.model_type.capitalize()} Performance by Cell Type'
+    )
+    plot_path = os.path.join(args.output_dir, f'{args.model_type}_performance_plot.png')
+    plt.savefig(plot_path, bbox_inches='tight', dpi=300)
+    plt.close()
+    print(f"\nPerformance plot saved to: {plot_path}")
+    
+    # Create and save detailed results dataframe
+    print("\nCreating detailed results dataframe...")
+    
+    # Convert numeric indices to cell type labels
+    cell_type_mapping_df = pd.read_parquet(args.cell_type_mapping_file)
+    cell_type_mapping_dict = dict(zip(range(len(cell_type_mapping_df)), cell_type_mapping_df['label']))
+    
+    # Create the detailed results dataframe
+    detailed_df = pd.DataFrame({
+        'y_true': [cell_type_mapping_dict[idx] for idx in y_true],
+        'y_pred': [cell_type_mapping_dict[idx] for idx in y_pred]
+    })
+    
+    # Add other columns from metadata
+    detailed_df = pd.concat([detailed_df, metadata.reset_index(drop=True)], axis=1)
+    
+    # Add probabilities as a column
+    detailed_df['y_probs'] = list(y_probs)
+    
+    # Save the detailed results
+    detailed_path = os.path.join(args.output_dir, f'{args.model_type}_detailed_results.parquet')
+    detailed_df.to_parquet(detailed_path, index=False)
+    print(f"\nDetailed results saved to: {detailed_path}")
+
+def main():
+    """Main execution function."""
+    args = parse_args()
+    
+    print(f"\nPreparing data for {args.model_type} model evaluation...")
+    output_folder, genes, cell_mapping = data_preparation(
+        args.dataset_ids,
+        args.features_file,
+        args.var_file,
+        args.cell_type_mapping_file,
+        census_version=args.census_version,
+        force_download=args.force_download,
+        output_root=args.output_root
+    )
+    
+    print(f"\nRunning evaluation for {args.model_type} model...")
+    results = run_model(
+        args.model_type,
+        args.checkpoint_path,
+        args.hparams_file,
+        args.cell_type_hierarchy_file,
+        genes,
+        cell_mapping,
+        output_folder
+    )
+    
+    # Unpack results (now including probabilities and additional metadata)
+    clf_report_overall, clf_report_per_class, y_probs, y_pred, y_true, metadata = results
+    
+    save_results(clf_report_overall, clf_report_per_class, y_probs, y_pred, y_true, metadata, cell_mapping, args)
+
+if __name__ == '__main__':
+    main()
@@ -0,0 +1,101 @@
+#!/bin/bash
+
+# Common data paths
+DATA_ROOT="/data/sebacultrera/merlin_cxg_2023_05_15_sf-log1p"
+
+# Function to evaluate a single checkpoint
+evaluate_checkpoint() {
+    CHECKPOINT_PATH="$1"
+    
+    # Dynamically extract log and model folders from the checkpoint file path
+    REL_PATH=${CHECKPOINT_PATH#${DATA_ROOT}/}
+    LOG_DIR=$(echo "$REL_PATH" | cut -d'/' -f1)
+    MODEL_NAME=$(echo "$REL_PATH" | cut -d'/' -f2)
+    DEFAULT_DIR=$(echo "$REL_PATH" | cut -d'/' -f3)
+    VERSION_DIR=$(echo "$REL_PATH" | cut -d'/' -f4)
+    MODEL_PATH="${DATA_ROOT}/${LOG_DIR}/${MODEL_NAME}/${DEFAULT_DIR}/${VERSION_DIR}"
+    
+    CHECKPOINT_NAME=$(basename "$CHECKPOINT_PATH" .ckpt)
+    
+    # Determine model type based on checkpoint path
+    if [[ $CHECKPOINT_PATH == *"mlp"* ]]; then
+        MODEL_TYPE="mlp"
+    elif [[ $CHECKPOINT_PATH == *"tabnet"* ]]; then
+        MODEL_TYPE="tabnet"
+    elif [[ $CHECKPOINT_PATH == *"linear"* ]]; then
+        MODEL_TYPE="linear"
+    else
+        echo "Unknown model type in path: $CHECKPOINT_PATH"
+        return 1
+    fi
+    
+    HPARAMS_PATH="${MODEL_PATH}/hparams.yaml"
+    OUTPUT_DIR="${MODEL_PATH}/checkpoints/${CHECKPOINT_NAME}"
+    
+    # Create output directory
+    mkdir -p "$OUTPUT_DIR"
+    
+    # Run evaluation
+    python model_evaluation.py \
+        --dataset_ids "diff_2023-05-15" \
+        --features_file "/home/sebacultrera/label_smoothing_celltype/label_smoothing_celltype/scTab/notebooks/store_creation/features.parquet" \
+        --var_file "${DATA_ROOT}/var.parquet" \
+        --cell_type_mapping_file "${DATA_ROOT}/categorical_lookup/cell_type.parquet" \
+        --cell_type_hierarchy_file "${DATA_ROOT}/cell_type_hierarchy/child_matrix.npy" \
+        --model_type "${MODEL_TYPE}" \
+        --checkpoint_path "${CHECKPOINT_PATH}" \
+        --hparams_file "${HPARAMS_PATH}" \
+        --output_dir "${OUTPUT_DIR}" \
+        --output_root "${DATA_ROOT}" \
+        --census_version "2023-12-15" \
+        > "${OUTPUT_DIR}/eval.out" 2> "${OUTPUT_DIR}/eval.err"
+}
+
+export -f evaluate_checkpoint
+
+find_checkpoints() {
+    local dir="$1"
+    find "$dir" -name "*.ckpt"
+}
+
+# Check if checkpoint list file is provided
+if [ $# -ne 1 ]; then
+    echo "Usage: $0 <checkpoint_list_file>"
+    exit 1
+fi
+
+CHECKPOINT_LIST="$1"
+MAX_PARALLEL=8  # Maximum number of parallel processes
+running=0       # Counter for running processes (throttled below with `wait -n`, which requires bash >= 4.3)
+
+# Read checkpoints and run evaluations
+while IFS= read -r path; do
+    # Skip empty lines
+    [ -z "$path" ] && continue
+    
+    if [ -d "$path" ]; then
+        # If path is a directory, find all checkpoints
+        while IFS= read -r checkpoint; do
+            # Wait if we've reached max parallel processes
+            if [ $running -ge $MAX_PARALLEL ]; then
+                wait -n
+                running=$((running - 1))
+            fi
+            
+            evaluate_checkpoint "$checkpoint" &
+            running=$((running + 1))
+        done < <(find_checkpoints "$path")
+    else
+        # Handle single checkpoint file
+        if [ $running -ge $MAX_PARALLEL ]; then
+            wait -n
+            running=$((running - 1))
+        fi
+        
+        evaluate_checkpoint "$path" &
+        running=$((running + 1))
+    fi
+done < "$CHECKPOINT_LIST"
+
+# Wait for remaining processes to finish
+wait
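Note on the shell hunk above: `evaluate_checkpoint` derives the model type, the `hparams.yaml` location, and the output directory purely from the checkpoint path. The same derivation can be sketched in Python, which may be easier to unit-test. This is an illustrative sketch, not part of the commit; the function name `describe_checkpoint` is hypothetical. One difference from the script: `relative_to` raises `ValueError` when the checkpoint is not under `DATA_ROOT`, whereas the shell prefix-strip leaves such a path unchanged.

```python
from pathlib import PurePosixPath

# Same root as DATA_ROOT in model_evaluation_bulk.sh
DATA_ROOT = PurePosixPath("/data/sebacultrera/merlin_cxg_2023_05_15_sf-log1p")


def describe_checkpoint(checkpoint_path, data_root=DATA_ROOT):
    """Derive model type, hparams path, and output dir from a checkpoint path."""
    path = PurePosixPath(checkpoint_path)
    rel = path.relative_to(data_root)
    # First four components: log dir / model name / "default" / version dir
    model_path = data_root.joinpath(*rel.parts[:4])
    # Same substring test and precedence as the shell script: mlp, tabnet, linear
    model_type = next(
        (m for m in ("mlp", "tabnet", "linear") if m in str(path)), None
    )
    if model_type is None:
        raise ValueError(f"Unknown model type in path: {checkpoint_path}")
    return {
        "model_type": model_type,
        "hparams_path": model_path / "hparams.yaml",
        # path.stem drops the trailing ".ckpt", like `basename "$1" .ckpt`
        "output_dir": model_path / "checkpoints" / path.stem,
    }


# First entry of checkpoint_list.txt
example = (
    "/data/sebacultrera/merlin_cxg_2023_05_15_sf-log1p/tb_logs/"
    "cxg_2023_05_15_linear_hierarchical_loss/default/version_0/checkpoints/"
    "val_f1_macro_epoch=1_val_f1_macro=0.802.ckpt"
)
info = describe_checkpoint(example)
for key, value in info.items():
    print(f"{key}: {value}")
```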