

# AI for Astrophysics: Advanced Neural Networks

• David Cornu

*LUX, Observatoire de Paris, PSL*

**Doctoral course - ED AAIF  
2025/2026**

# Lessons materials

Slides, exercises, codes, corrections and datasets are **available on GitHub** and will be updated regularly:

[http://github.com/Deyht/ML\\_OSAE\\_M2](http://github.com/Deyht/ML_OSAE_M2)

```
git clone https://github.com/Deyht/ML_OSAE_M2  
git pull
```

Or download the repository in zip file

Avoid losing your work on forced pull updates by copying all files from the cloned repository into a working directory!

Do not copy and past content from git-hub pages (lead to format errors).  
Use python up to 3.10 but not more recent.

# Neural Networks for images

Fully connected networks has shown one weakness

→ **They are inefficient for handling images !**

- Images are highly dimensional (lots of pixels!)
- They have a very high degree of invariance  
(mainly translation but also luminosity, color, rotation, ...)

*Classical ANN can deal with images by considering each pixel of an image as an individual input but it is STRONGLY inefficient.*



A **highly dimensional** “dog”  
with ~0.5 Million pixels.  
**Quite difficult to classify ...**

Driven by the computer vision and pattern recognition community these issues have found a solution in the 90s with:

→ **Convolutional Neural Networks !**

# Spatially coherent information

Classical ANN can deal with images by considering *each pixel* of an image *as an individual input* but it is **STRONGLY inefficient**.



Simple digit representation as a  $6 \times 7$  binary pixel image



Representation of a simple cross pattern on a  $5 \times 5$  image as input, and the corresponding localization prediction on an equivalent size output image



# Spatially coherent information : Pattern recognition



Looking for specific patterns can be automatized by **scanning all the possible positions** in the image.

In contrast, training a fully connected network to do the same task would require learning the presence or non-presence of the pattern at every possible position instead of learning the pattern once and only checking its presence at every position.

**How to circumvent this behavior ?** → Use **Convolutional layers** !

# Convolution filter



# Filter effect examples

No filter



Sharpen

$$\begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix}$$



Gaussian blur

$$\frac{1}{16} \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}$$



Edge detector

$$\begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}$$



Axis elevation

$$\begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}$$



# Pattern recognition with several filters

Input images



Superimposed output images



Filters



Each filter will produce its own activation map.

Combining the information from different activation maps allows to **construct more complex patterns**.

\*Here the different activation maps are superimposed using color coding per filter

# Convolutional Neural Networks



$$g \left( \sum_i X_i \circ W_i \right) = a$$



Each filter can be seen as a **single neuron** with one weight per input dimension in the filter.

BUT, the same weights are used at every position and the outputs are independent.

→ **Translational equivariance !**

A network made of stacked convolutional layers can be tuned for **Translation invariance**.



# Examples of filter maximization



Edges (layer conv2d0)

Patterns (layer mixed4a)

Objects (layers mixed4d & mixed4e)

Example of input images that maximize specific filters activation at different depth in a classification network. 11 / 83

# Vanishing Gradient Issue

From He et al. 2015



In principle, the deeper the network, the higher its expressivity should be as long as it is trained with enough data. However, it is not the case in practice due to the gradient slowly getting smaller and smaller as it goes through more layers.

Still many approaches can mitigate this issue to construct network with hundreds of layers (e.g., changing the activation or having skip connections between layers that are far away in the network).

# The Rectified Linear Unit (ReLU)

The **ReLU** (or its variance, the leaky-ReLU) has proven **more efficient for CNN**. It preserves a form of non-linearity and its constant derivative reduces **vanishing gradient problems**.

It is scale-invariant, and much faster to compute than other activation functions.

Using this activation, it becomes possible to construct **much deeper networks**.

$$a_j = g(h_j) = \begin{cases} h_j & \text{if } h_j \geq 0 \\ 0 & \text{if } h_j < 0 \end{cases}$$

or

$$a_j = g(h_j) = \max(0, h_j)$$



$$a_j = g(h_j) = \begin{cases} h_j & \text{if } h_j \geq 0 \\ \lambda h_j & \text{if } h_j < 0 \end{cases}$$

or

$$a_j = g(h_j) = \max(0, h_j) + \min(0, \lambda h_j)$$



# Dimensionality reduction: Pooling

A classical convolution operation is tuned to preserve the spatial dimensionality.

Still, it is most of the time necessary to **reduce the “image” size progressively**.

For classification tasks, the output layer is often reduced to a dense layer with a few neurons.

One way to reduce the spatial dimensionality it to use **Pooling layers** !



Different pooling methods exist, the most common being **Max-Pooling** and **Average-Pooling**. The pooling size can be modified, but most of the network architectures reduce each spatial dimension by a factor of two.

# Learning the filters

Like fully connected layers, the **convolutional filters** can be learned using **backpropagation** of the error measured at the output layer.

The error is propagated using a **transposed convolution operation**, which can be expressed with a classical convolution operation using simple transformations on specific layer elements.



Learning the convolutional filters is often considered to be the definition of “**Deep Learning**”.

# The Im2col transformation



im2col



Filter 1



Matrix multiply!



# Convolutional layer complete matrix formalism



$X$  Batched Inputs

$W$  Weight filters

$a$  Activation maps

$E_a$  Activation errors

$E_x$  Propagated Input errors

# What about computing on GPU?

**GPU (Graphical Processing Unit) are massively parallel computing chips dedicated to SIMD like operations (thousands of cores).** Most image processing algorithm apply the same transformation to millions of pixels, hence the SIMD formalism.

GPU have the same form factor than CPU but usually come as a dedicated daughter board with their own large cooling system as they can have a much higher power draw than CPU!

**GPUs are not suited for all tasks, but for those they were designed for, they pack a huge amount of computing power, which include matrix multiplication!**



# Nvidia H100 GPU spec-sheet (AI dedicated)

| Graphics Processor |                         |
|--------------------|-------------------------|
| GPU Name:          | <a href="#">GH100</a>   |
| Architecture:      | Hopper                  |
| Foundry:           | TSMC                    |
| Process Size:      | 4 nm                    |
| Transistors:       | 80,000 million          |
| Density:           | 98.3M / mm <sup>2</sup> |
| Die Size:          | 814 mm <sup>2</sup>     |
| Graphics Features  |                         |
| DirectX:           | N/A                     |
| OpenGL:            | N/A                     |
| OpenCL:            | 3.0                     |
| Vulkan:            | N/A                     |
| CUDA:              | 9.0                     |
| Shader Model:      | N/A                     |

| Graphics Card     |                           |
|-------------------|---------------------------|
| Release Date:     | Mar 21st, 2023            |
| Generation:       | Tesla Hopper (Hxx)        |
| Predecessor:      | <a href="#">Tesla Ada</a> |
| Production:       | Active                    |
| Bus Interface:    | PCIe 5.0 x16              |
| Board Design      |                           |
| Slot Width:       | Dual-slot                 |
| Length:           | 268 mm<br>10.6 inches     |
| Width:            | 111 mm<br>4.4 inches      |
| TDP:              | 350 W                     |
| Suggested PSU:    | 750 W                     |
| Outputs:          | No outputs                |
| Power Connectors: | 1x 16-pin                 |
| Board Number:     | P1010 SKU 200             |

| Clock Speeds  |                                |
|---------------|--------------------------------|
| Base Clock:   | 1095 MHz                       |
| Boost Clock:  | 1755 MHz                       |
| Memory Clock: | 1593 MHz<br>3.2 Gbps effective |

| Memory       |            |
|--------------|------------|
| Memory Size: | 80 GB      |
| Memory Type: | HBM2e      |
| Memory Bus:  | 5120 bit   |
| Bandwidth:   | 2,039 GB/s |

| Render Config  |                 |
|----------------|-----------------|
| Shading Units: | 14592           |
| TMUs:          | 456             |
| ROPs:          | 24              |
| SM Count:      | 114             |
| Tensor Cores:  | 456             |
| L1 Cache:      | 256 KB (per SM) |
| L2 Cache:      | 50 MB           |

| Theoretical Performance |                    |
|-------------------------|--------------------|
| Pixel Rate:             | 42.12 GPixel/s     |
| Texture Rate:           | 800.3 GTexel/s     |
| FP16 (half):            | 204.9 TFLOPS (4:1) |
| FP32 (float):           | 51.22 TFLOPS       |
| FP64 (double):          | 25.61 TFLOPS (1:2) |



From Nvidia

# GPU architecture

The GPU is specialized for highly parallel computations and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control. The schematic [Figure 1](#) shows an example distribution of chip resources for a CPU versus a GPU.



Figure 1: *The GPU Devotes More Transistors to Data Processing*

# GPU architecture



Ex. of a GP104 - Tesla P100

The GPU is equipped with a shared “on board” memory. All SM share the same large L2 cache.

Each SM has a dedicated cache and can launch multiple instruction blocs through multiple warp schedulers.

Inside each SM, there are several CUDA cores and other dedicated compute units.

# Distribution in GPU clusters

## Summit Node

(2) IBM Power9 + (6) NVIDIA Volta V100



# Nvidia Tensor Cores

Optimized Warp Matrix Multiply Add (WMMA) instructions !

CuBLAS can be set to used tensor core through the gemmEX function.



$$D = \begin{pmatrix} A_{0,0} & A_{0,1} & A_{0,2} & A_{0,3} \\ A_{1,0} & A_{1,1} & A_{1,2} & A_{1,3} \\ A_{2,0} & A_{2,1} & A_{2,2} & A_{2,3} \\ A_{3,0} & A_{3,1} & A_{3,2} & A_{3,3} \end{pmatrix}_{\text{FP16}} \begin{pmatrix} B_{0,0} & B_{0,1} & B_{0,2} & B_{0,3} \\ B_{1,0} & B_{1,1} & B_{1,2} & B_{1,3} \\ B_{2,0} & B_{2,1} & B_{2,2} & B_{2,3} \\ B_{3,0} & B_{3,1} & B_{3,2} & B_{3,3} \end{pmatrix}_{\text{FP16}} + \begin{pmatrix} C_{0,0} & C_{0,1} & C_{0,2} & C_{0,3} \\ C_{1,0} & C_{1,1} & C_{1,2} & C_{1,3} \\ C_{2,0} & C_{2,1} & C_{2,2} & C_{2,3} \\ C_{3,0} & C_{3,1} & C_{3,2} & C_{3,3} \end{pmatrix}_{\text{FP16 or FP32}}$$

FP16 storage/input      Full precision product      Sum with FP32 accumulator      Convert to FP32 result



# Nvidia Tensor Cores reduced quantization



# Quantization effect on AI model accuracy

All models quantized to INT8, accuracy for ImageNET-2012



PTQ = Post training quantization

QAT = Quantization aware training

# Mixed precision training



To further reduce this issue, a scaling is applied on the output loss. All propagated values are naturally scaled so they are more likely to be in the proper range.

At weight update time the correction is scaled down by the same factor.

Reduced bit count variables have smaller representable ranges. This can lead to strong gradient vanishing problems.

This problem can first be mitigated by preserving an FP32 copy of the weights for accumulating the updates.



From Nvidia « Mixed precision training »



## Development team



D. Cornu

G. Sainton

*Convolutional Interactive Artificial Neural Networks by/for Astrophysicists*

**General purpose framework BUT** developed for **astronomical applications**



*Work on a wide variety of hardware from IoT to super-computing facilities*



[github.com/Deyht/CIANNA](https://github.com/Deyht/CIANNA)

Open source - Apache 2 license

July 24, 2024 (V-1.0.0.0)

Software

Open

Deyht/CIANNA: CIANNA V-1.0

# Simple examples: MNIST

The well-known **MNIST** (Modified NIST's special dataset) dataset consists of handwritten digits from 500 different writers expressed as 28x28 grayscale images.

It is freely accessible in the form of a 60000 image training set, 10000 images validation, set and a 10000 image test set.

Very simple CNN architectures can achieve over **99.3 classification accuracy** on this dataset.

LeNet 5 network architecture (from LeCun et al. 1998):



# How to use CIANNA for MNIST ?

## Example script over MNIST, with an LeNet5 *like* network.

Interface is similar to widely adopted frameworks.

The **full interface documentation** is available on the github repo as a Wiki page listing all available functions with their descriptions.

Several example scripts are provided as Google Colab notebooks.

```
1 #CIANNA initialization
2 cnn.init(in_dim=i_ar([28,28]), in_nb_ch=1, out_dim=10,
3           bias=0.1, b_size=16, comp_meth="C_CUDA",
4           dynamic_load=1, mixed_precision="FP32C_FP32A")
5
6 #Create data subsets (from numpy arrays)
7 cnn.create_dataset("TRAIN", size=60000, input=data_train, target=target_train)
8 cnn.create_dataset("VALID", size=10000, input=data_valid, target=target_valid)
9 cnn.create_dataset("TEST", size=10000, input=data_test, target=target_test)
10
11 #Define the network structure sequentially
12 cnn.conv(f_size=i_ar([5,5]), nb_filters=8 , padding=i_ar([2,2]), activation="RELU")
13 cnn.pool(p_size=i_ar([2,2]), p_type="MAX")
14 cnn.conv(f_size=i_ar([5,5]), nb_filters=16, padding=i_ar([2,2]), activation="RELU")
15 cnn.pool(p_size=i_ar([2,2]), p_type="MAX")
16 cnn.dense(nb_neurons=256, activation="RELU", drop_rate=0.5)
17 cnn.dense(nb_neurons=128, activation="RELU", drop_rate=0.2)
18 cnn.dense(nb_neurons=10, strict_size=1, activation="SMAX")
19
20 #Training loop configuration and launch
21 cnn.train(nb_iter=20, learning_rate=0.004, momentum=0.8, confmat=1, save_every=0)
22
23 #Evaluate network prediction after training
24 cnn.forward(repeat=1, drop_mode="AVG_MODEL")
```

# Example results on MNIST

| Actual    | Class | Predicted |       |       |       |       |       |       |       |       |       | Recall |
|-----------|-------|-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|--------|
|           |       | C0        | C1    | C2    | C3    | C4    | C5    | C6    | C7    | C8    | C9    |        |
|           | C0    | 976       | 0     | 1     | 0     | 0     | 0     | 1     | 1     | 1     | 0     | 99.6%  |
|           | C1    | 0         | 1132  | 1     | 0     | 1     | 0     | 0     | 0     | 1     | 0     | 99.7%  |
|           | C2    | 1         | 1     | 1027  | 0     | 1     | 0     | 0     | 1     | 1     | 0     | 99.5%  |
|           | C3    | 0         | 0     | 1     | 1004  | 0     | 3     | 0     | 1     | 1     | 0     | 99.4%  |
|           | C4    | 0         | 0     | 1     | 0     | 972   | 0     | 1     | 0     | 1     | 7     | 99.0%  |
|           | C5    | 0         | 0     | 0     | 4     | 0     | 886   | 1     | 0     | 0     | 1     | 99.3%  |
|           | C6    | 3         | 2     | 0     | 0     | 1     | 2     | 949   | 0     | 1     | 0     | 99.1%  |
|           | C7    | 0         | 2     | 3     | 0     | 0     | 0     | 0     | 1020  | 1     | 2     | 99.2%  |
|           | C8    | 0         | 0     | 1     | 1     | 0     | 1     | 1     | 1     | 968   | 1     | 99.4%  |
|           | C9    | 0         | 0     | 0     | 0     | 3     | 1     | 0     | 4     | 0     | 1001  | 99.2%  |
| Precision |       | 99.6%     | 99.6% | 99.2% | 99.5% | 99.4% | 99.2% | 99.6% | 99.2% | 99.3% | 98.9% | 99.35% |

## Practical work:

Reproduce this result using the provided notebooks. Try to modify the architecture (*filter size, #filters, #neurons, batch size, learning rate, etc.*) and estimate the impact of individual changes on the training / inference time and final accuracy.