

# The Neural Cryptanalyst: Machine Learning-Powered Side-Channel Attacks — A Comprehensive Survey

Ahmed Taha

## Abstract

Machine learning has fundamentally transformed side-channel attacks on cryptographic implementations. This survey describes how CNNs, LSTMs, and Transformers recover keys from power traces with far less expertise and far fewer traces than classical statistical methods.

We examine attacks on AES, RSA, and ECC, showing that neural networks need far fewer traces than DPA/CPA and achieve higher success rates. Published results show that CNNs can recover keys from first-order masked AES with as few as 500–1,000 traces under optimized conditions, versus 5,000–10,000 for DPA. Even constant-time code leaks through power consumption. Our Transformer implementations outperform CNNs by 20–40% on the hardest datasets, while LSTMs excel on desynchronized traces.

We also examine countermeasures: higher-order masking, hiding, and blinding. Each increases attack complexity, but given enough traces, ensemble neural-network attacks still succeed against combined countermeasures. Our open-source implementation provides reproducible attack and defense evaluation tools.

These findings demonstrate that current defenses inadequately protect against ML-based attacks, posing a significant threat to embedded cryptographic systems worldwide.

**Keywords:** Side-channel analysis, Machine learning, Power analysis, Neural networks, Cryptographic implementations, AES, RSA, ECC

## 1. Introduction

Side-channel attacks are a form of cryptanalysis that targets not the mathematical structure of an algorithm but information leaked unintentionally through physical channels while the algorithm runs. These channels include power consumption, electromagnetic emanations, acoustic emissions, and execution time. Side-channel analysis (SCA) traces back to Kocher's 1996 timing attack paper [12] and his subsequent 1999 work on differential power analysis [11]. Since the late 1990s, the growth of SCA combined with advances in machine learning (ML) and deep learning (DL) has produced what we term AI-enhanced side-channel attacks.

Classical attacks demanded extensive attacker involvement: statistical tests had to be designed and run by hand to determine whether an attack had succeeded. DL removes much of this manual work, and attack success rates have improved significantly [2]. Picek et al. [3] synthesize findings from previously published work to document this transformation, which lets researchers mount more streamlined yet highly successful attacks. For example, differential power analysis (DPA) and correlation power analysis (CPA) require the attacker to understand what the algorithm is doing and how the hardware or software operates, whereas AI-based attacks require only basic background knowledge of the target algorithm [3] and a collection of power traces from the device to potentially extract secret keys.

This survey examines the advances, attack methods, and results of AI-based side-channel attacks, focusing on power analysis of cryptographic hardware and software. The reader will learn the neural network architectures used in these attacks, the vulnerabilities they expose in predominant cryptographic standards, and the effectiveness of defenses against them.

The complete implementation accompanying this survey—including all neural network architectures, preprocessing pipelines, attack implementations, countermeasures, and evaluation metrics—is available as open-source software at [github.com/ahmedtaha100/Neural-Cryptanalyst](https://github.com/ahmedtaha100/Neural-Cryptanalyst).

## 2. Technical Foundations

### 2.1 Power Analysis Attacks

Power analysis attacks rest on the observation that the power a device draws while computing depends on the data it processes. If an attacker can measure that power, it may be possible to extract sensitive information. The premise holds because integrated circuits consume different amounts of power under different operating conditions and for different data values. For instance, complementary metal-oxide-semiconductor (CMOS) integrated circuits, which make up almost every computing device in use today, draw significant current when a gate switches state and very little when it is static. The switching activity, and hence the power, depends on the binary 1s and 0s of the data being processed [11].

The standard power calculation equation for CMOS is [11, 19]:

$$P = \alpha \cdot C \cdot V^2 \cdot f \quad (1)$$

Where $P$ = average power consumption, $\alpha$ = activity factor, $C$ = load capacitance, $V$ = supply voltage, and $f$ = clock frequency. The activity factor $\alpha$ typically ranges from 0.01 to 1.0 depending on circuit switching activity.
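As a small worked example of Eq. (1), the sketch below evaluates the dynamic power for two activity factors. The capacitance, voltage, and frequency values are illustrative assumptions, not measurements from any device discussed here; the point is only that a modest change in $\alpha$ produces a proportional change in power.

```python
def dynamic_power(alpha: float, c_load: float, v_dd: float, freq: float) -> float:
    """Average dynamic power in watts, per Eq. (1): P = alpha * C * V^2 * f."""
    return alpha * c_load * v_dd ** 2 * freq


# Illustrative values (assumptions): 1 nF switched capacitance, 1.2 V, 100 MHz.
C_LOAD, V_DD, FREQ = 1e-9, 1.2, 100e6

p_low = dynamic_power(0.10, C_LOAD, V_DD, FREQ)   # fewer gates toggling
p_high = dynamic_power(0.12, C_LOAD, V_DD, FREQ)  # more gates toggling

print(f"alpha=0.10: {p_low * 1e3:.2f} mW, alpha=0.12: {p_high * 1e3:.2f} mW")
```

With these assumed values the two results are 14.40 mW and 17.28 mW, so a 20% increase in switching activity appears as a 20% increase in dynamic power.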

Total power consumption is the sum of three components [19]:

$$P_{\text{total}} = P_{\text{dynamic}} + P_{\text{static}} + P_{\text{sc}} \quad (2)$$

Where $P_{\text{dynamic}} = \alpha \cdot C \cdot V^2 \cdot f$ is the power consumed by switching when logic values change, $P_{\text{static}} = I_{\text{leak}} \cdot V$ is the power consumed by leakage current, which flows even when transistors are not switching, and $P_{\text{sc}} = I_{\text{sc}} \cdot V \cdot f$ is the power consumed by the momentary short-circuit current that flows while the PMOS and NMOS transistors conduct simultaneously for a very short time during switching.

The physics of CMOS switching creates an exploitable correlation between processed data and power consumption. A CMOS logic gate consists of complementary pairs of p-type and n-type MOSFETs. When the gate toggles, current flows to charge or discharge the output capacitance. To a first approximation, the dynamic power is therefore proportional to the number of gates that toggle, and which gates toggle depends on the data being processed. Static power consumption, by contrast, is negligible while a gate holds a high or low state.

During a cryptographic operation, the activity factor $\alpha$ therefore becomes data dependent, which attackers can exploit. This is the root of the side-channel vulnerability, and it is usually captured through a leakage model. The two most widely used leakage models are Hamming Weight (HW) and Hamming Distance (HD), although real devices may deviate from both.

The HW model assumes that power consumption is proportional to the number of bits set to one in the processed $n$-bit value $x$, where $x_i$ denotes bit $i$ of $x$:

$$\text{HW}(x) = \sum_{i=0}^{n-1} x_i \quad (3)$$

This is an empirical approximation because manipulating a value with more 1 bits will probably require more signal toggling in the combinational logic needed to generate it.

The HD model indicates that power consumed is based on how many bit transitions occur:

$$\text{HD}(x, y) = \text{HW}(x \oplus y) \quad (4)$$

This model is often closer to reality because dynamic power depends directly on the number of bit transitions that occur when a register switches from value $x$ to value $y$.

Both are idealized leakage models: they approximate real behavior but do not account for all leakage. For example, in an integrated circuit some bit positions may leak more strongly than others, and some data values may produce a distinctive power signature (value-dependent leakage). In either case, the most reliable way to determine which leakage model applies is to test on the actual target device [19].
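The HW and HD models in Eqs. (3) and (4) can be sketched in a few lines of Python. The function names are illustrative and not taken from the accompanying repository.

```python
def hamming_weight(x: int) -> int:
    """Number of bits set to one in x, per Eq. (3)."""
    return bin(x).count("1")


def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions that differ between x and y, per Eq. (4)."""
    return hamming_weight(x ^ y)


# A register that switches from 0x0F to 0xF0 flips all eight bits,
# even though both values have the same Hamming weight.
print(hamming_weight(0x0F), hamming_weight(0xF0), hamming_distance(0x0F, 0xF0))
```

The example illustrates why the two models can disagree: both register values have weight 4, yet the transition between them toggles every bit.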

Measurement quality matters as much as the leakage model. The Nyquist-Shannon theorem states that the sampling rate must exceed twice the highest frequency of interest [13]. A device clocked at 100 MHz has power consumption with harmonics extending into the GHz range. Capturing all harmonic content would require very high sampling rates, but in our experiments and in published literature, side-channel attacks succeed with sampling rates of 1 GS/s or lower, since cryptographic leakage is often concentrated at lower frequencies.

The quality of measurements is quantified by the Signal-to-Noise Ratio (SNR):

$$\text{SNR} = \text{Var}(S_{\text{signal}}) / \text{Var}(S_{\text{noise}}) \quad (5)$$

where $\text{Var}(S_{\text{signal}})$ is the variance of the data-dependent part of the measurement and $\text{Var}(S_{\text{noise}})$ aggregates all noise sources: thermal noise, quantization noise, environmental interference, and algorithmic noise from other operations running simultaneously on the device. Expressed in decibels:

$$\text{SNR}_{\text{dB}} = 10 \cdot \log_{10}(\text{SNR}) \quad (6)$$

In our controlled laboratory environment with high-end oscilloscopes, shielding, and stable triggering, we achieved SNR values greater than 20 dB for unprotected implementations. For protected implementations, and for attacks in the field, the SNR of the leakage of interest can fall below 0 dB, which requires extensive signal processing and many more traces for successful key recovery.
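One common way to estimate Eq. (5) from labeled traces is to take the variance of the per-class mean traces as the signal and the average per-class variance as the noise. The sketch below does this per sample on synthetic data; the helper names, the synthetic leak at sample 20, and the use of NumPy are assumptions made for illustration.

```python
import numpy as np


def snr_per_sample(traces: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Estimate Eq. (5) at every sample: Var(class means) / mean(class variances)."""
    classes = np.unique(labels)
    means = np.array([traces[labels == c].mean(axis=0) for c in classes])
    variances = np.array([traces[labels == c].var(axis=0) for c in classes])
    return means.var(axis=0) / variances.mean(axis=0)


def to_db(snr):
    """Convert a linear SNR to decibels, per Eq. (6)."""
    return 10.0 * np.log10(snr)


# Synthetic data (assumption): 5000 traces of 50 samples, unit Gaussian noise,
# with a label-dependent leak injected at sample 20 only.
rng = np.random.default_rng(0)
labels = rng.integers(0, 9, size=5000)
traces = rng.normal(0.0, 1.0, size=(5000, 50))
traces[:, 20] += 0.5 * labels

snr = snr_per_sample(traces, labels)
print("leakiest sample:", int(np.argmax(snr)))
```

The sample with the highest estimated SNR is the one where the leak was injected, which is also why the SNR curve is often used as a first point-of-interest indicator.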

Traces also need to be triggered at the proper time to ensure time-domain alignment. When traces are misaligned, the effective SNR drops and attacks can fail. I/O patterns, EM emissions for specific operations, or dedicated trigger circuits can provide triggering, as discussed in standard SCA references [19].

Although power traces are among the easiest and most common measurements to collect, EM traces can be even more effective. EM emissions are easier to localize than power measurements: a probe can be restricted to a specific area of the chip, which can sidestep some countermeasures that are effective against power measurement [16]. The same leakage models and leakage analysis that exist for power traces also apply to EM.

Beyond collecting traces, the attacker must identify the points of interest (POI) within them, that is, the samples at which leakage occurs. In an unprotected implementation, the POI coincide with the moments when sensitive intermediate values are read or written. POI detection can be automated through statistical tests (e.g., the t-test) or mutual information analysis [2].

Classical power analysis techniques exploit these physical characteristics. In Simple Power Analysis (SPA), the attacker inspects power traces directly and infers which operations were executed. For example, the square-and-multiply RSA algorithm produces visibly different trace segments for squaring and for multiplication, which reveals the bits of the secret exponent [17]. SPA requires intimate knowledge of the implementation, but it can succeed with a single trace.
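The t-test approach to POI detection can be sketched as follows: two groups of traces are compared sample by sample with Welch's t-statistic, and samples whose absolute value exceeds a threshold are flagged. The threshold of 4.5 is the value commonly used in leakage assessment practice; the synthetic data and function names are assumptions for illustration.

```python
import numpy as np


def welch_t(group_a: np.ndarray, group_b: np.ndarray) -> np.ndarray:
    """Welch's t-statistic at every sample between two sets of traces."""
    mean_a, mean_b = group_a.mean(axis=0), group_b.mean(axis=0)
    var_a, var_b = group_a.var(axis=0, ddof=1), group_b.var(axis=0, ddof=1)
    return (mean_a - mean_b) / np.sqrt(var_a / len(group_a) + var_b / len(group_b))


def select_poi(t_values: np.ndarray, threshold: float = 4.5) -> np.ndarray:
    """Indices of samples whose |t| exceeds the threshold."""
    return np.flatnonzero(np.abs(t_values) > threshold)


# Synthetic data (assumption): two groups of 2000 traces with 100 samples each;
# the first group carries a small offset at samples 30 and 31.
rng = np.random.default_rng(0)
group_fixed = rng.normal(0.0, 1.0, size=(2000, 100))
group_random = rng.normal(0.0, 1.0, size=(2000, 100))
group_fixed[:, 30:32] += 0.5

t_values = welch_t(group_fixed, group_random)
poi = select_poi(t_values)
print("points of interest:", poi)
```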

DPA also exploits data-dependent power consumption, but statistically: the attacker guesses part of the key, computes the intermediate values that guess would produce during execution, predicts the corresponding power behavior, and repeats for every key guess to find the hypothesis that best matches the measured traces. DPA does not require detailed knowledge of the implementation, but instead of a single trace it needs thousands [11].

CPA is a refinement of DPA that uses the Pearson correlation coefficient to measure how well predicted and measured power agree. The prediction comes from an explicit leakage model such as HW or HD. If the leakage model holds for the device under attack, CPA is more efficient than DPA [20].

The most powerful power analysis attack in an information-theoretic sense is the template attack [18]. The attacker profiles an identical device to build multivariate Gaussian templates, so that every possible intermediate value has its own power consumption distribution. The attack phase then applies maximum likelihood estimation to match the observed traces to the most likely template. Pooling the covariance matrix across templates reduces the number of profiling traces needed. Template attacks succeed with very few attack traces but require extensive profiling access.

These traditional attacks assume an unprotected implementation. Contemporary cryptographic devices, however, employ countermeasures including masking (splitting sensitive variables into random shares), shuffling (randomizing the order of operations), and hiding (adding noise or dummy operations) that make these attacks significantly more difficult. Such countermeasures can push the SNR below 0 dB and demand higher-order analysis or more sophisticated approaches, which motivates the ML techniques assessed in later sections.
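A minimal CPA sketch on simulated traces is shown below. It assumes an HW leakage model, and it uses a random byte permutation as a stand-in for the AES S-box to keep the example short; the simulation parameters and function names are illustrative assumptions, not the accompanying implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
SBOX = rng.permutation(256)  # stand-in for the AES S-box (assumption)
HW = np.array([bin(v).count("1") for v in range(256)])


def simulate_traces(key, n_traces, n_samples=20, leak_at=10, noise=1.0):
    """Gaussian-noise traces that leak HW(SBOX[p ^ key]) at one sample."""
    plaintexts = rng.integers(0, 256, size=n_traces)
    traces = rng.normal(0.0, noise, size=(n_traces, n_samples))
    traces[:, leak_at] += HW[SBOX[plaintexts ^ key]]
    return plaintexts, traces


def cpa(plaintexts, traces):
    """Return the key guess with the highest absolute Pearson correlation."""
    centered = traces - traces.mean(axis=0)
    trace_norm = np.sqrt((centered ** 2).sum(axis=0))
    scores = np.zeros(256)
    for guess in range(256):
        model = HW[SBOX[plaintexts ^ guess]].astype(float)
        model -= model.mean()
        corr = model @ centered / (np.sqrt((model ** 2).sum()) * trace_norm)
        scores[guess] = np.abs(corr).max()  # best sample for this guess
    return int(scores.argmax()), scores


plaintexts, traces = simulate_traces(key=0x3C, n_traces=1000)
best_guess, scores = cpa(plaintexts, traces)
print(f"recovered key byte: {best_guess:#04x}")
```

Because the simulated device leaks exactly the modeled quantity, the correct guess stands out clearly; on real hardware the margin depends on how well the leakage model fits.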
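The profiling and matching steps of a template attack with a pooled covariance matrix can be sketched as follows. For brevity the demo assumes nine classes (as with HW labels of a byte), three points of interest, and attack traces that all share the same intermediate value; in a real attack the class for each trace would follow from the plaintext and the key hypothesis. All names and simulation parameters are assumptions.

```python
import numpy as np


def build_templates(traces, labels, n_classes):
    """Per-class mean vectors plus the inverse of one pooled covariance matrix."""
    means = np.stack([traces[labels == c].mean(axis=0) for c in range(n_classes)])
    residuals = traces - means[labels]
    pooled_cov = residuals.T @ residuals / (len(traces) - n_classes)
    return means, np.linalg.inv(pooled_cov)


def log_likelihoods(traces, means, cov_inv):
    """Gaussian log-likelihood of each trace under each template.

    With a shared covariance the normalizing constant is identical for all
    classes, so only the Mahalanobis term is kept.
    """
    diff = traces[:, None, :] - means[None, :, :]  # (traces, classes, poi)
    maha = np.einsum("tcp,pq,tcq->tc", diff, cov_inv, diff)
    return -0.5 * maha


rng = np.random.default_rng(2)
n_classes = 9
weights = np.array([1.0, 0.5, -0.8])  # assumed leakage strength at three POI


def simulate(labels):
    """Synthetic POI values: class-dependent mean plus unit Gaussian noise."""
    return labels[:, None] * weights + rng.normal(0.0, 1.0, size=(len(labels), 3))


profiling_labels = rng.integers(0, n_classes, size=9000)
means, cov_inv = build_templates(simulate(profiling_labels), profiling_labels, n_classes)

attack_traces = simulate(np.full(50, 5))  # all attack traces belong to class 5
scores = log_likelihoods(attack_traces, means, cov_inv).sum(axis=0)
best_class = int(scores.argmax())
print("most likely class:", best_class)
```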

### 2.2 Threat Model and Assumptions

This subsection describes the threat model for the side-channel attacks studied in this survey: what the attacker can and cannot do, and under what conditions.

#### Attacker Capabilities

The attacker has physical access to the device: power measurements are taken within centimeters of the chip and electromagnetic (EM) measurements within millimeters, so the attack works only at close range. There is no virtual or remote operation. The attack is non-invasive: the attacker does not destroy the device or intentionally interfere with it. Semi-invasive attacks, in which the chip is decapsulated but the exposed circuitry is not modified, are covered by other research.

The attacker has the following at their disposal: a digital oscilloscope with a sampling rate of at least 100 MS/s and the common 8-12 bit resolution utilized in side-channel attacks; current probes or resistive shunts for IC power measurements; differential probes for achieving a good enough SNR; near-field EM probes for H-field and E-field characterizations; stable triggering.

The attacker can train DL models. The models evaluated in our work, for instance, run on GPUs with at least 8 GB of memory and take hours to days to train, depending on architecture complexity and dataset size.

The attacker knows the encryption scheme the side-channel attack is targeting—AES, RSA, or ECC—but does not know the secret keys or random masks. The attacker knows the operating environment, whether hardware/software based, and the estimated clock frequency. For profiled attacks, the attacker possesses the targeted device or a similar one and has control over the input during profiling.

Profiling assumes the attacker can collect a large number of traces from the target device or a similar one, typically hundreds of thousands or millions of traces with a known key. The attack phase needs far fewer: unprotected implementations typically require 50–200 traces, first-order protected implementations 500–5,000 traces, and second-order (or higher) protected implementations possibly more than 50,000 traces.

#### Attacker Limitations

There are several important limitations that restrict the attacker’s capabilities. The attacker cannot have invasive access, meaning no internal circuit manipulation, no fault injection, nor operational changes to the device. The attacker cannot inject malware via software or firmware changes to the victim device. In some cases, the attack collection window is limited (for instance, a payment card attack can happen in seconds while the card is engaged). The attacker has no access to key storage/key generation aside from side-channel leakage.

#### Environmental Assumptions

The attack environment consists of noise sources and operational conditions. Noise sources include thermal noise (the thermal noise floor is $-174 \text{ dBm/Hz}$ at $290\text{ K}$, from $kTB$), quantization noise (for example, $\text{SNR}$ is approximately $6.02N + 1.76 \text{ dB}$ for an ideal $N$-bit ADC with full-scale input), electromagnetic interference from the surroundings, and algorithmic noise (other operations executing on the device at the same time as the targeted one).

Operational conditions assume that the target device runs under nominal conditions, with supply voltage within $\pm 10\%$ of nominal and temperature between 0 and 70 °C. The clock source is reliable, with jitter under 1% of the clock period, and operation is repeatable, so the same operation produces statistically similar traces.

The following measurement conditions represent our experimental setup. The sampling rate is at least $5\times$ the anticipated clock frequency for power analysis. Measurement bandwidth extends from DC to at least $2\times$ the clock frequency to capture relevant leakage. Triggering is stable, with jitter below 10% of the anticipated operation time. SNR is approximately 20 dB for unprotected implementations and approximately -10 dB for implementations protected by masking.

#### Attack Scenarios
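The two noise figures quoted above follow from short calculations, sketched here. The Boltzmann constant is the exact SI value; the function names are illustrative.

```python
import math

BOLTZMANN = 1.380649e-23  # J/K, exact SI value


def thermal_noise_floor_dbm_per_hz(temperature_k: float = 290.0) -> float:
    """Thermal noise power density kT, expressed in dBm per hertz."""
    noise_w_per_hz = BOLTZMANN * temperature_k
    return 10.0 * math.log10(noise_w_per_hz / 1e-3)


def ideal_adc_snr_db(bits: int) -> float:
    """Quantization-limited SNR of an ideal N-bit ADC with a full-scale sine input."""
    return 6.02 * bits + 1.76


print(f"thermal floor at 290 K: {thermal_noise_floor_dbm_per_hz():.2f} dBm/Hz")
for n_bits in (8, 10, 12):
    print(f"{n_bits}-bit ADC: {ideal_adc_snr_db(n_bits):.2f} dB")
```

For the 8-bit and 12-bit oscilloscope resolutions mentioned in the attacker capabilities, the quantization-limited SNR is about 49.9 dB and 74.0 dB respectively, so quantization is rarely the dominant noise source when the input range is used fully.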

The attack scenarios are drawn from the following real-world situations. In pre-certification testing, device manufacturers generate test vectors during development to identify vulnerabilities before submitting a device for certification. In laboratory assessment, security assessors have full access to the device and can collect unlimited traces for both profiling and attack stages. In certification assessment, a third-party laboratory evaluates the device against Common Criteria or FIPS 140-3 requirements using standardized test vectors.

Field attacks target deployed devices: attackers have limited access for a limited time in an uncontrolled environment, with little profiling opportunity beyond what the deployed device itself offers. In supply chain attacks, attackers have access only during packaging or shipping, so they are more likely to profile on samples obtained during production. In remote power analysis, attackers exploit unintentional electromagnetic emanations detectable at a distance or power consumption observable through shared power supply systems.

#### Survey of Attacks and Exclusions

This survey studies passive side-channel attacks based on power consumption. It does not cover active fault injection attacks (voltage/clock glitching and laser fault injection), invasive techniques (probing, reverse engineering), software vulnerabilities or protocol-level exploits, acoustic, optical, or other exotic side channels, or combined attacks that require several side channels simultaneously.

This threat model holds for the ML techniques discussed here. Stronger attackers with invasive access, or weaker attackers operating remotely, would require separate evaluation.

### 2.3 Machine Learning in Side-Channel Analysis

ML transforms side-channel attacks from complex exploits requiring unique expertise to automated attack processes through two main approaches: profiled attacks using identical devices for training, and non-profiled attacks that learn directly from the target.

The first is a profiled attack, in which the attacker has the same or a similar device to train on. The profiling phase gathers power traces from the profiling device while it performs cryptographic operations with known keys. This creates labeled datasets from which ML models learn the mapping from power patterns to specific key values or intermediate values. The attack phase applies the trained model to traces from the target device, whose key is unknown. This is supervised learning, and it requires only a few attack traces to achieve high success rates.

Non-profiled attacks target the victim device directly, without prior training on an identical or similar device. The ML model must extract information from unlabeled traces and typically relies on unsupervised or semi-supervised learning to find features that correlate with the hidden key-dependent values. This scenario is harder for the attacker but more realistic when no profiling device is available.

Whichever approach is used, an ML-based attack follows the same linear pipeline. In the data collection stage, oscilloscopes record power traces from the target device, as demonstrated in various acquisition frameworks [14]. Each trace records the power consumption of one cryptographic operation across time. Traces are typically stored as float32 arrays, trading a slight loss of precision for memory efficiency.

Preprocessing turns raw traces into cleaned versions with stronger signal and fewer artifacts. Features are standardized to zero mean and unit variance, or normalized into [0,1] or [-1,1]. Residual temporal misalignment after triggering is corrected via cross-correlation or dynamic time warping. High-frequency noise and 50/60 Hz mains interference are filtered out. Dimensionality is reduced via principal component analysis or feature selection. Data augmentation at this stage includes adding Gaussian noise, synthetic desynchronization, and random amplitude scaling.

Feature extraction locates the POI within the cleaned traces, that is, the samples at which leakage occurs. With DL, the network ideally learns features automatically without manual feature engineering, a significant advantage over traditional statistical approaches.

During training, a neural network learns a predictive distribution over key-dependent values. For profiled attacks, this means supervised training on traces from the profiling device, labeled with the known key bytes or intermediate values. For non-profiled attacks, unsupervised learning applies clustering techniques to group traces according to key-dependent behavior. Class imbalance is a problem when some label values are more frequent than others; balanced sampling and weighted loss functions mitigate it. K-fold cross-validation is used when traces are scarce, and holdout validation when data is plentiful.

In the key recovery stage, the trained model is applied to new traces from the target device. For each trace, it outputs a probability distribution over the candidate values of each key byte. These distributions are combined across traces via maximum likelihood estimation, Bayesian estimation, or simple averaging to recover the full key. For profiled attacks, the evaluation should also consider how the incorrect key-byte candidates score, not only the top-ranked candidate.
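Two of the preprocessing steps above, per-sample standardization and data augmentation, can be sketched with NumPy as follows. The parameter defaults are illustrative assumptions, and the shift uses a circular roll for simplicity, which wraps samples around the trace edges.

```python
import numpy as np


def standardize(traces: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale every sample position to zero mean and unit variance across traces."""
    mean = traces.mean(axis=0)
    std = traces.std(axis=0)
    return (traces - mean) / (std + eps)


def augment(traces, rng, noise_std=0.05, max_shift=5, scale_range=0.1):
    """Random circular shift, amplitude scaling, and additive Gaussian noise."""
    n = len(traces)
    shifts = rng.integers(-max_shift, max_shift + 1, size=n)
    shifted = np.stack([np.roll(t, s) for t, s in zip(traces, shifts)])
    scales = 1.0 + rng.uniform(-scale_range, scale_range, size=(n, 1))
    return shifted * scales + rng.normal(0.0, noise_std, size=traces.shape)


rng = np.random.default_rng(0)
raw = rng.normal(3.0, 2.0, size=(100, 50)).astype(np.float32)  # synthetic traces
clean = standardize(raw)
augmented = augment(clean, rng)
print(clean.shape, augmented.shape)
```

In a profiled setting, the mean and standard deviation would normally be estimated on the profiling set and reused unchanged on the attack traces.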
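The combination of per-trace probabilities into a key ranking can be sketched as a sum of log-probabilities, which corresponds to the maximum likelihood option mentioned above. The model outputs here are simulated rather than produced by a trained network, and a random permutation again stands in for the AES S-box; both are assumptions made to keep the example self-contained.

```python
import numpy as np


def accumulate_key_scores(probs, plaintexts, sbox):
    """Sum log-probabilities of the hypothesized intermediate for each key guess.

    probs has shape (n_traces, 256): one distribution over the S-box output per trace.
    """
    log_probs = np.log(probs + 1e-36)  # small offset avoids log(0)
    rows = np.arange(len(plaintexts))
    scores = np.zeros(256)
    for guess in range(256):
        intermediates = sbox[plaintexts ^ guess]
        scores[guess] = log_probs[rows, intermediates].sum()
    return scores


def key_rank(scores, true_key):
    """Number of guesses scoring strictly higher than the true key (0 is best)."""
    return int((scores > scores[true_key]).sum())


rng = np.random.default_rng(3)
sbox = rng.permutation(256)  # stand-in for the AES S-box (assumption)
true_key = 0xA7
plaintexts = rng.integers(0, 256, size=200)

# Simulated model outputs (assumption): mostly noise, with a modest boost
# on the correct intermediate value of each trace.
probs = rng.random((200, 256))
probs[np.arange(200), sbox[plaintexts ^ true_key]] += 0.5
probs /= probs.sum(axis=1, keepdims=True)

scores = accumulate_key_scores(probs, plaintexts, sbox)
print("rank of true key:", key_rank(scores, true_key))
```

The key rank computed this way is also the kind of key recovery metric that can drive early stopping instead of validation loss.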

DL approaches achieve better key recovery than traditional statistical methods [2, 4]. A neural network selects the most relevant features automatically and copes with complicated leakage (higher-order effects, masked implementations). Overfitting remains a constant risk, however, because the number of features per trace often exceeds the number of traces, so regularization and early stopping based on key recovery metrics (not validation loss) are essential.

### 2.4 Neural Network Architectures

Several neural network architectures have proven successful for SCA, each with advantages in different attack scenarios. Hettwer et al. [8] survey these architectures along with methodology for comparing success rates across attack types. More recent work also uses ensemble methods and transfer learning, either combining the architectures described below or reusing a model trained on one implementation to attack a similar implementation with only a handful of additional training samples.

#### 2.4.1 Convolutional Neural Networks

One of the most powerful architectures applied to side-channel attacks is the convolutional neural network (CNN), which extracts local features from power traces with little to no preprocessing. Stacked convolutional layers let the network determine which parts of a trace are relevant without a manual, error-prone feature extraction step.

The main architectural choices for CNNs in side-channel attacks follow from the nature of power traces. Power traces are one-dimensional time series, so 1D convolutions are used rather than the 2D convolutions common in image processing. Kernel sizes are often large (11 or greater in the SCA literature, sometimes 21 or 31) to capture longer-range dependencies than the small kernels typical for images. Average pooling is preferred over max pooling because it preserves magnitude information, so that low-amplitude leakage is not discarded. Batch normalization stabilizes training and speeds up convergence, which matters because power measurements are scaled differently across devices and setups.
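The two building blocks discussed above, a 1D convolution with a large kernel and a pooling step, can be sketched in plain NumPy. This is not a trainable network; it only illustrates the shapes involved and the difference between average and max pooling. The kernel here is a fixed averaging filter chosen for illustration, whereas a CNN would learn its kernel weights.

```python
import numpy as np


def conv1d(trace: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """'Valid' 1D cross-correlation, the operation DL frameworks call convolution."""
    return np.correlate(trace, kernel, mode="valid")


def pool1d(x: np.ndarray, size: int, mode: str = "avg") -> np.ndarray:
    """Non-overlapping pooling; trailing samples that do not fill a window are dropped."""
    usable = len(x) - len(x) % size
    windows = x[:usable].reshape(-1, size)
    return windows.mean(axis=1) if mode == "avg" else windows.max(axis=1)


rng = np.random.default_rng(0)
trace = rng.normal(size=110)           # synthetic trace
kernel = np.ones(11) / 11.0            # kernel size 11 (illustrative weights)

features = conv1d(trace, kernel)       # length 110 - 11 + 1 = 100
avg_pooled = pool1d(features, 2, "avg")
max_pooled = pool1d(features, 2, "max")
print(len(features), len(avg_pooled), len(max_pooled))
```

Average pooling keeps a contribution from every sample in the window, while max pooling keeps only the largest one, which is why small leakage sitting next to a larger unrelated peak survives the former but not the latter.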

Zaid et al. [7] note that optimized CNN architectures achieve comparable performance to VGG-style models while reducing parameters by an order of magnitude, making them suitable for resource-constrained evaluation environments and real-time attack scenarios. GPU memory limitations become significant for traces longer than 100,000 samples with deep architectures.

#### 2.4.2 Recurrent Neural Networks and LSTMs

Recurrent Neural Networks (RNNs), and specifically Long Short-Term Memory (LSTM) networks, are well suited to detecting temporal dependencies in power traces. LSTMs maintain an internal state across time steps, which allows them to associate two operations even when they are separated in time.

An LSTM processes a power trace in order, using a set of gates that control what information flows in and out. A forget gate determines what information can be discarded, an input gate controls what new information is accepted, an output gate decides what is exposed as output, and the cell state represents the long-term memory. Training typically employs gradient clipping to prevent exploding gradients on long sequences.

A Bidirectional LSTM processes the power trace in both the forward and the backward direction. This captures dependencies that a forward-only pass would miss, for example when the leakage of an operation is easier to interpret in light of samples that appear later in the trace. Effects such as power-supply coupling or EM propagation can delay the observable signature of an earlier operation, so that it appears after the signature of a later one.
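The gate structure described above corresponds to the following single-step update, sketched in NumPy with randomly initialized weights. The weight layout (four gate blocks stacked in one matrix) and the helper names are assumptions for illustration; a clipping helper is included to show what gradient clipping by norm does.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4*hidden, input), U: (4*hidden, hidden), b: (4*hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0:hidden])                 # forget gate
    i = sigmoid(z[hidden:2 * hidden])        # input gate
    o = sigmoid(z[2 * hidden:3 * hidden])    # output gate
    g = np.tanh(z[3 * hidden:])              # candidate cell update
    c_t = f * c_prev + i * g                 # new cell state (long-term memory)
    h_t = o * np.tanh(c_t)                   # new hidden state (exposed output)
    return h_t, c_t


def clip_by_norm(grad, max_norm):
    """Rescale a gradient vector so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)


rng = np.random.default_rng(0)
hidden, n_in = 8, 1
W = rng.normal(0.0, 0.1, size=(4 * hidden, n_in))
U = rng.normal(0.0, 0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
trace = rng.normal(size=100)                 # synthetic trace, one sample per step
for sample in trace:
    h, c = lstm_step(np.array([sample]), h, c, W, U, b)
print(h.round(3))
```

The loop makes the sequential dependency explicit: each step needs the state produced by the previous one, which is the reason recurrent models parallelize poorly over long traces.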

#### 2.4.3 Hybrid CNN-LSTM Architectures

These networks combine CNN feature extraction with LSTM temporal processing. The CNN layers extract local features, and the LSTM layers model how those features evolve over the course of the cryptographic operation. This architecture is most useful against protected implementations whose leakage is spread over several points in time.

#### 2.4.4 Transformer-Based Architectures

The most recent advance in DL for side-channel attacks comes from Transformers, which use a self-attention mechanism to process an entire trace at once rather than sequentially. Self-attention lets a Transformer relate any two points in a trace regardless of their distance, and the attention weights express how much each element of the trace contributes to the prediction of the key byte. The resulting global receptive field exceeds that of CNNs, which learn features within a local window, and of RNNs, which process the trace step by step. Attention weights also aid interpretability: researchers can see which segments of a trace contributed most to recovering a specific key byte, whereas CNN and LSTM models offer no comparably direct view. Visualizing the attention weights does require extra implementation work.
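Single-head scaled dot-product self-attention, the core of the mechanism described above, can be sketched in NumPy as follows. The input is assumed to be a trace already split into 64 embedded patches of dimension 16; the projection matrices are random here, whereas a Transformer would learn them.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (n, n) attention matrix
    return weights @ v, weights


rng = np.random.default_rng(0)
n_patches, d_model = 64, 16                  # assumed patch count and embedding size
x = rng.normal(size=(n_patches, d_model))
w_q, w_k, w_v = (rng.normal(0.0, 0.1, size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)
```

Row `i` of `weights` shows how strongly position `i` attends to every other position, which is the quantity that attention visualizations plot.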

In 2024, Bursztein et al. [1] introduced GPAM (Generalized Power Analysis Model), a Transformer-based architecture that represents a significant advancement in automated side-channel attacks. GPAM combines temporal patchification, multi-scale attention heads, and multi-task learning objectives. The architecture demonstrates improved performance on standard benchmarks including the ASCAD database and DPA Contest v4 datasets.

TransNet, proposed by Hajra et al. [9], addresses the shift-invariance problem in SCA through specialized positional encodings and augmentation strategies. The model maintains robust performance on misaligned traces.

#### 2.4.5 Computational Complexity

A CNN has $O(L \cdot k \cdot n \cdot d^2)$ computational complexity, where $L$ = number of layers, $k$ = kernel size, $n$ = sequence length, and $d$ = number of filters, and its operations parallelize well. RNNs and LSTMs have $O(n \cdot d^2)$ complexity, but their operations are sequential and therefore cannot be parallelized across time steps. A Transformer requires $O(n^2)$ memory for its attention matrices, where $n$ = sequence length, plus $O(n \cdot d)$ for activations, and it also parallelizes well.

CNNs offer the best balance of performance and efficiency overall. They are well suited to real-time attack applications with trace lengths between 10,000 and 100,000 samples, running inference within milliseconds on typical GPUs. In our experiments, Transformers excel with shorter, preprocessed traces (fewer than 10,000 samples) and when a prediction needs to be explained through attention visualization. GPU memory constraints in our setup limited Transformer inputs to approximately 50,000 samples, because attention memory grows with the square of the trace length.

The highest F1 score was achieved by an ensemble combining all three architectures with other approaches, at the cost of substantial computational resources. Finally, transfer learning reduces the profiling effort: a model trained on one device's implementation can be transferred to a similar implementation with minimal additional profiling.
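The quadratic growth of attention memory is easy to quantify. The sketch below counts the bytes needed for one dense attention matrix, assuming float32 values, a single head, and no patching or memory-saving attention variant; actual usage in a given model depends on those choices.

```python
def attention_matrix_bytes(seq_len: int, n_heads: int = 1, bytes_per_value: int = 4) -> int:
    """Bytes needed to store dense (seq_len x seq_len) attention matrices."""
    return seq_len ** 2 * n_heads * bytes_per_value


for n in (10_000, 50_000, 100_000):
    print(f"{n:>7} samples: {attention_matrix_bytes(n) / 1e9:.1f} GB per head")
```

Under these assumptions a 10,000-sample sequence needs 0.4 GB per head, a 50,000-sample sequence needs 10 GB, and a 100,000-sample sequence needs 40 GB, which is why long traces are usually shortened or split into patches before being fed to a Transformer.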

### 2.5 Preprocessing Algorithms

Power trace preprocessing and feature extraction are crucial for the success of ML-based side-channel attacks. Power traces are extremely noisy and contain between 10,000 and 1,000,000 samples per encryption or decryption operation, so alignment, denoising, and feature extraction are critical before any neural network processing can take place.

The time complexity of preprocessing depends on which algorithms are used. Alignment through cross-correlation costs $O(n \cdot w)$ in the time domain, or $O(n \cdot \log(n))$ with FFT acceleration, where $n$ is the length of the trace and $w$ is the size of the sliding window. Feature selection ranges from $O(n \cdot m)$ for t-tests to $O(n \cdot m \cdot k \cdot \log(k))$ for k-NN mutual information, where $m$ is the number of traces and $k$ is the number of neighbors. The space complexity is approximately $O(n \cdot m)$ for storing the traces, plus temporary memory for intermediate calculations. If the dataset is too large to hold in RAM, memory-mapped files allow the traces to be processed in a streaming fashion.
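As an illustration of the FFT-accelerated variant, the following minimal sketch estimates the lag between a reference trace and a misaligned trace and then undoes it. It assumes NumPy, equal-length traces, and a circular shift for realignment; `estimate_shift` and `align` are hypothetical helper names, not functions from our library.

```python
import numpy as np


def estimate_shift(reference: np.ndarray, trace: np.ndarray) -> int:
    """Return the lag (in samples) by which `trace` is delayed vs `reference`."""
    n = len(reference)
    size = 2 * n  # zero-padding avoids circular wrap-around in the correlation
    ref_f = np.fft.rfft(reference - reference.mean(), size)
    trc_f = np.fft.rfft(trace - trace.mean(), size)
    # Cross-correlation via the correlation theorem: O(n log n)
    xcorr = np.fft.irfft(trc_f * np.conj(ref_f), size)
    lag = int(np.argmax(xcorr))
    # Indices in the upper half of the buffer correspond to negative lags
    return lag if lag < n else lag - size


def align(reference: np.ndarray, trace: np.ndarray) -> np.ndarray:
    """Shift `trace` so that it lines up with `reference` (circular shift)."""
    return np.roll(trace, -estimate_shift(reference, trace))
```

A circular shift keeps the trace length constant but wraps samples around the edges; cropping or padding may be preferable when the edges carry leakage.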

#### 2.5.1 Trace Preprocessing Algorithms

The preprocessing pipeline addresses several essential concerns with power traces. DC offset removal eliminates the constant offset introduced by the measurement equipment, which ranges from millivolts to volts depending on the measurement tools and configuration. Detrending removes linear or polynomial drift that builds up over time through temperature changes or power supply variation. Alignment compensates for trigger jitter (which can range from 1 to 100+ clock cycles depending on the setup) and for differences in the sampling clock. Filtering removes high-frequency noise while preserving the cryptographic leakage, with low-pass cut-off frequencies typically between 0.1 and 0.5 of the device clock frequency. Standardization ensures uniform scaling across collections when this is necessary for model portability.
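The steps above can be sketched in a few lines of NumPy. This is a simplified illustration rather than our actual pipeline: it assumes a linear trend, uses a moving average as a stand-in for a properly designed low-pass filter, and standardizes each sample point across traces. The function name `preprocess` and the `smooth_window` parameter are hypothetical.

```python
import numpy as np


def preprocess(traces: np.ndarray, smooth_window: int = 5) -> np.ndarray:
    """Preprocess traces of shape (num_traces, num_samples)."""
    x = traces.astype(np.float64)

    # 1. DC offset removal: subtract each trace's mean
    x = x - x.mean(axis=1, keepdims=True)

    # 2. Linear detrending: least-squares slope per trace (intercept is
    #    zero because both the trace and the time axis are centered)
    t = np.arange(x.shape[1], dtype=np.float64)
    t_centered = t - t.mean()
    slope = (x @ t_centered) / (t_centered @ t_centered)
    x = x - slope[:, None] * t_centered[None, :]

    # 3. Low-pass filtering (assumption: simple moving average)
    kernel = np.ones(smooth_window) / smooth_window
    x = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, x)

    # 4. Standardization: zero mean, unit variance per sample point
    mu = x.mean(axis=0)
    sigma = x.std(axis=0)
    return (x - mu) / np.where(sigma > 0, sigma, 1.0)
```

Alignment is omitted here and would normally run between detrending and filtering.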

#### 2.5.2 Feature Selection and Dimensionality Reduction

Point-of-interest (POI) selection identifies the samples in the power traces that carry the most leakage, so that fewer calculations are needed and attack success rates increase. The Sum of Squared T-statistics (SOST) identifies the points at which power consumption varies most across key hypotheses, that is, the points that exhibit high leakage. It computes the class mean for each key hypothesis and compares the differences between class means against the within-class variances, weighted by the size of each class.
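A minimal sketch of SOST-based POI selection follows. It assumes NumPy, integer class labels (for example an intermediate value derived from a key hypothesis), and at least two traces per class; `sost` and `select_pois` are illustrative names.

```python
import numpy as np


def sost(traces: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Sum of squared pairwise t-statistics for every sample point.

    traces: (num_traces, num_samples); labels: (num_traces,) class ids.
    """
    classes = np.unique(labels)
    means, var_over_n = [], []
    for c in classes:
        group = traces[labels == c]
        means.append(group.mean(axis=0))
        var_over_n.append(group.var(axis=0, ddof=1) / len(group))

    score = np.zeros(traces.shape[1])
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            denom = np.maximum(var_over_n[i] + var_over_n[j], 1e-12)
            score += (means[i] - means[j]) ** 2 / denom
    return score


def select_pois(traces: np.ndarray, labels: np.ndarray,
                num_pois: int) -> np.ndarray:
    """Indices of the `num_pois` sample points with the highest SOST."""
    return np.argsort(sost(traces, labels))[::-1][:num_pois]
```

The pairwise loop is quadratic in the number of classes, which is acceptable for Hamming-weight labels (9 classes) and still manageable for 256 byte values.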

Mutual information quantifies how much information the power traces carry about the key hypotheses. Because it captures both linear and non-linear relationships, this information-theoretic method may reveal leakage that a simple correlation-based assessment would miss.

Principal Component Analysis (PCA) reduces dimensionality while retaining variance. Incremental PCA processes the data in batches, so even a dataset larger than the available RAM can be handled. The number of components can be chosen to retain a target share of the variance, ideally 95–99%, so that little side-channel information is lost in the reduction.
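Choosing the number of components from a variance target can be sketched as below. For brevity this uses an in-memory SVD in NumPy rather than incremental PCA, so it only illustrates the selection rule; `pca_fit` and `pca_transform` are hypothetical names.

```python
import numpy as np


def pca_fit(traces: np.ndarray, variance_to_keep: float = 0.95):
    """Return (mean, components) retaining the requested variance share."""
    mean = traces.mean(axis=0)
    centered = traces - mean
    # Rows of vt are the principal directions, ordered by singular value
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    # Smallest k whose cumulative explained variance reaches the target
    k = int(np.searchsorted(np.cumsum(explained), variance_to_keep)) + 1
    k = min(k, len(s))
    return mean, vt[:k]


def pca_transform(traces: np.ndarray, mean: np.ndarray,
                  components: np.ndarray) -> np.ndarray:
    """Project traces onto the retained principal components."""
    return (traces - mean) @ components.T
```

The mean and components must be fitted on profiling traces only and then reused unchanged on attack traces.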

#### 2.5.3 Data Augmentation

Data augmentation synthesizes additional training examples and can improve model generalization, which is important when few profiling traces are available. Adding Gaussian noise improves model generalization [5]; we use amplitudes of 5–10% of the trace's standard deviation in our experiments. Random shifting teaches models to cope with misaligned traces where alignment is not guaranteed [5]; we train with shifts of $\pm 50\text{--}100$ samples. Amplitude scaling accounts for gain differences between measurements taken on different days, with scale factors usually between 0.9 and 1.1.
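The three augmentations can be combined in a single function, sketched here with NumPy. The parameter defaults follow the ranges quoted above; the circular shift via `np.roll` is a simplification, and the name `augment` is illustrative.

```python
import numpy as np


def augment(trace: np.ndarray, rng: np.random.Generator,
            noise_frac: float = 0.05, max_shift: int = 50,
            scale_range: tuple = (0.9, 1.1)) -> np.ndarray:
    """Return one randomly augmented copy of a 1-D power trace."""
    out = trace.astype(np.float64)

    # Gaussian noise scaled to a fraction of the trace's standard deviation
    out = out + rng.normal(0.0, noise_frac * trace.std(), size=trace.shape)

    # Random temporal shift (circular; samples wrap around the edges)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    out = np.roll(out, shift)

    # Random amplitude scaling to mimic gain differences between sessions
    return out * rng.uniform(*scale_range)
```

In training, a fresh augmented copy would typically be drawn for every trace in every epoch, so the network never sees exactly the same input twice.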

Synthetic colored noise can be used instead of white noise. For instance, pink noise ($1/f$ spectrum) mimics the flicker noise found in many hardware components, while brown noise ($1/f^2$ spectrum) mimics thermal drift and low-frequency variation. The noise level is adjusted to maintain a realistic SNR while providing enough variance to train effectively without overfitting.

#### 2.5.4 Integration Pipeline

Our pipeline selects 1,000–5,000 features from traces of 50,000–500,000 samples, a 50–100$\times$ dimensionality reduction. Data augmentation increases the effective diversity of the dataset, which helps the model generalize, particularly for obfuscated implementations. In our implementation, processing speeds range from 1,000 to 10,000 traces per second, depending on which preprocessing steps are applied and how long they take.

### 3. Analysis

#### 3.1 Cryptographic Implementation Vulnerabilities

Widely deployed cryptographic implementations have vulnerabilities that DL-based side-channel attacks exploit. The shift from traditional, resource-heavy statistical analysis to trained DL models suggests that easier, more efficient attacks are on the rise.

##### 3.1.1 AES

AES implementations are susceptible. Neural network-based power analysis attacks achieve higher accuracy against software implementations by targeting SubBytes, where the non-linearity of the S-box creates a distinguishable power consumption profile. CNNs, especially those with larger convolution kernels, find these patterns quickly: in our experiments, 50–100 traces suffice to attack unprotected SubBytes implementations, as opposed to hundreds with traditional analysis [2].

Even first-order masked AES implementations, such as those in the ASCAD database, are susceptible to sophisticated ML attacks. Neural network-based attacks have demonstrated key recovery capabilities [2]; in our experiments on the ASCAD dataset, we achieve key recovery with 1,000–2,000 traces. The networks learn to combine leakage from different operations that, under standard statistical analysis, appears uncorrelated and thus irrelevant. This is a major weakness for masking constructions that rest on the assumption that an attacker can never effectively combine information from multiple leakage points.

Similarly, constant-time implementations eliminate timing side channels, but this alone does not remove power traces as an attack vector. While constant-time code prevents data-dependent timing variation, power consumption can still be analyzed. Even under constant time, the physical properties of CMOS circuits mean that data values with different Hamming weights (HW) consume different amounts of power. In our experiments, a neural network exploited these small but significant differences with a success rate of 70–85%, while traditional DPA in our tests showed no significant leakage with 1,000–2,000 traces.

##### 3.1.2 RSA

ML attacks also target the RSA implementations that are widely used in industry, particularly their modular exponentiation, through power analysis. The square-and-multiply algorithm, even when hardened, still leaks information through its power draw that a neural network can classify. As demonstrated in our implementation, LSTMs can recover a significant portion of the secret exponent bits, reducing the remaining search space substantially and enabling algebraic methods to complete key recovery with additional analysis.

The Montgomery ladder is an exponentiation algorithm that performs the same sequence of operations regardless of the exponent bits; however, transitions between operations still show small differences in power draw that ML can detect. Under the best circumstances, bidirectional LSTMs can recover a significant portion of the bits because they learn long-term dependencies between the operations executed in the ladder. Our implementation demonstrates bit-recovery rates of 60–70% (percentage of secret exponent bits correctly identified) on simulated RSA-2048 traces under optimal laboratory conditions. These models pick up the signatures of the conditional swap operations, which a well-trained deep neural network can learn as features even if the human eye cannot detect them.
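For readers unfamiliar with the ladder, the following sketch shows its structure: every exponent bit triggers one multiplication and one squaring, with conditional swaps selecting the operands. It is a plain-Python illustration and is not constant-time in practice; `cswap` and `montgomery_ladder` are illustrative names.

```python
def cswap(a: int, b: int, bit: int):
    """Swap a and b when bit == 1 using masking instead of a branch."""
    mask = -bit               # 0 when bit == 0, all ones when bit == 1
    t = mask & (a ^ b)
    return a ^ t, b ^ t


def montgomery_ladder(base: int, exponent: int, modulus: int) -> int:
    """Compute base**exponent % modulus with a fixed operation pattern."""
    r0, r1 = 1 % modulus, base % modulus
    for i in reversed(range(exponent.bit_length())):
        bit = (exponent >> i) & 1
        r0, r1 = cswap(r0, r1, bit)   # the swap is what leaks in practice
        r1 = (r0 * r1) % modulus      # always one multiplication ...
        r0 = (r0 * r0) % modulus      # ... and one squaring per bit
        r0, r1 = cswap(r0, r1, bit)
    return r0
```

Although the sequence of arithmetic operations is identical for every bit, the data moved by `cswap` depends on the secret bit, which is the kind of residual leakage the models described above learn to detect.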

Using Chinese Remainder Theorem (CRT) shortcuts to accelerate RSA exposes weaknesses as well. The partial exponentiations, which can be performed in parallel, create distinguishable power signatures that a neural network can separate and analyze. We demonstrate that attacking RSA-CRT reduces the number of traces needed to recover the private key by 30–50% compared to attacking standard RSA implementations.

##### 3.1.3 ECC

Elliptic Curve Cryptography (ECC) implementations are no different: they have their own vulnerabilities to ML-based side-channel attacks. The scalar multiplication at the core of ECC produces power traces that depend on the value of the scalar and on the curve points being processed. Deep learning models trained on such traces can effectively recover bits of the key. Our implementation requires only 1,000–5,000 traces for successful attacks and achieves 60–75% bit-recovery rates (percentage of scalar bits correctly identified) on ECDSA P-256 traces without defenses.

Our contribution also includes specific attacks against Curve25519 and demonstrates that even this curve, designed to mitigate side-channel attacks, still has vulnerabilities.

Machine learning techniques have been applied to Curve25519 side-channel analysis [6]. Our CNN-based approach differentiates between point doubling and point addition by learning subtle power consumption differences. Our transformer-based implementation achieves 40–55% classification accuracy (distinguishing point operations) against protected Curve25519 implementations using 5,000–10,000 traces.

Further attacks stem from nonce leakage in ECDSA signature generation. Our lattice attack implementation shows that ML-based extraction of 4–8 bits per nonce from power traces, over 200–500 signatures, can lead to private key recovery in specific scenarios. The neural network components identify these nonce bits from power traces with 70–80% accuracy, making the subsequent lattice attack computationally feasible.

Non-Adjacent Form (NAF) reduces the number of point additions and creates specific ternary patterns in power usage. Our CNN architectures, trained on NAF-based scalar multiplications, recover portions of the ternary representation with 65–75% accuracy in our experiments, which may lead to partial recovery of the key using algebraic methods.
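The ternary pattern mentioned above comes directly from how the scalar is recoded. The sketch below computes the NAF digits of a non-negative integer, least significant digit first; the function name `naf` is illustrative.

```python
from typing import List


def naf(k: int) -> List[int]:
    """Non-adjacent form of k >= 0, least significant digit first.

    Digits are in {-1, 0, 1} and no two adjacent digits are non-zero.
    """
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)   # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


print(naf(7))  # 7 = 8 - 1
```

In a scalar multiplication, each non-zero digit triggers a point addition or subtraction in addition to the doubling, so a classifier that labels each step as one of the three digits is reconstructing exactly this list.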

It is also important to note that these attack success rates were obtained in a lab setting with idealized measurements. Real-world deployments may involve added noise, environmental shielding, and other factors that greatly decrease an attack's success. Ideal lab settings offer SNR values greater than 20 dB, stable triggering without jitter, and controlled temperature. Real-world attacks must deal with electromagnetic noise from other devices, temperature fluctuations that shift the power consumption baseline, physical shielding and casings that absorb emanations, and concurrent activity not present on experimental boards. Furthermore, production devices exhibit many of these at the same time, so the compounding effect lowers success rates far below what evaluating each factor individually would suggest.

### 3.2 Comparative Performance

The effectiveness of different neural network architectures varies significantly based on the specific characteristics of the target implementation and available attack constraints. Our comprehensive evaluation reveals distinct advantages and trade-offs for each architectural approach.

CNNs demonstrate strong performance for general-purpose SCA, particularly when attacking implementations with localized leakage. The ability of CNNs to automatically learn spatial features through their convolutional filters eliminates the need for manual feature engineering. Our experiments show that optimized CNN architectures can achieve 70–85% key-byte recovery success rates (as defined in Section 3.3) against first-order masked AES implementations using 1,000–2,000 traces, representing a 2–5x reduction compared to traditional statistical methods (DPA/CPA) in our experiments. The use of large kernel sizes (11 or greater) proves beneficial for capturing extended temporal dependencies in power traces, while average pooling preserves magnitude information that max pooling would discard.

LSTM networks are well-suited for time-dependent attacks and non-synchronized traces. As a type of recurrent network, the LSTM retains state information from one timestep to the next, making it capable of learning trends over time from the applied operations. Bidirectional LSTMs achieve success on desynchronized traces even with significant misalignments where CNN accuracy degrades. Our experiments show 60–75% success rates with 500-sample desynchronizations. This is important because many real-world attacks will not result in perfectly synchronized traces.

Another option is a hybrid CNN-LSTM network, which combines the CNN's local feature detection with the LSTM's temporal modeling. The convolutional layers extract local features and pass them to LSTM layers that model how those features change over time. In our tests, CNN-LSTM hybrids reduce the number of required traces by 15–25% compared to standard CNNs when attacking protected implementations whose leakage is spread over multiple timesteps rather than concentrated in one.

Transformers are a newer architecture that has emerged for SCA in the past few years. The self-attention mechanism allows Transformers to assess the entire trace at once, determining whether two points are correlated no matter how far apart they are in the sequence. We develop a GPAM-type Transformer architecture and achieve a success rate 20–40% higher than that of CNNs in our tests on the same datasets, though at significantly higher computational cost. The attention heads in each layer operate at different temporal resolutions, so they learn to attend to short-term, localized patterns as well as long-term structure spanning the entire trace.

Furthermore, ensemble techniques that aggregate different architectures consistently outperform individual models in our experiments, since the models compensate for each other's weaknesses. For example, an ensemble of CNN, LSTM, and Transformer-based models requires 20–30% fewer traces than the best single model in our experiments. The ensemble works best against second-order masked implementations, where an attack can succeed with 3,000–5,000 traces under favorable conditions because the differently trained architectures provide complementary views of the same power consumption.

Table 1: Comparison of neural network architectures for side-channel attacks on first-order masked AES (ASCAD dataset). Trace counts represent published optimal results from literature; our experimental results (Section 3.2) required approximately 2 $\times$  more traces under realistic conditions.

| Architecture | Strengths                                               | Weaknesses                               | Typical Traces Required |
|--------------|---------------------------------------------------------|------------------------------------------|-------------------------|
| CNN          | Effective feature extraction, resistant to noise        | Less effective for temporal dependencies | 500–1,000               |
| LSTM         | Excellent temporal modeling, captures sequence patterns | Higher computational requirements        | 700–1,200               |
| CNN-LSTM     | Combines spatial and temporal analysis                  | Complex to optimize                      | 400–900                 |
| Transformer  | Superior for desynchronized traces, generalizes well    | Highest computational requirements       | 100–300                 |

### 3.3 Evaluation Metrics

A formal quantitative assessment is necessary to evaluate the effectiveness of these ML-based side-channel attacks. Our implementation provides the metrics needed to establish baselines and to compare attacks and defenses.

#### 3.3.1 Guessing Entropy

Guessing entropy [15] is the average number of guesses required before the correct key is found; lower values therefore indicate a more successful attack. It is computed by scoring every possible key guess, sorting the guesses by score, and recording the position at which the correct key falls in the sorted list. Our implementation accumulates log-probabilities rather than multiplying probabilities, which avoids numerical underflow as the number of traces grows.

Successful attacks therefore lower guessing entropy steadily as traces accumulate; effective defenses require 5–20x more traces than unprotected implementations to reach a comparable reduction in guessing entropy.
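The computation can be sketched as follows, assuming the model's outputs have already been mapped to per-trace log-probabilities for each of the 256 key-byte hypotheses. Ranks are zero-based here (0 means the correct key is the top guess), and `key_rank` and `guessing_entropy` are illustrative names rather than our library's API.

```python
import numpy as np


def key_rank(log_probs: np.ndarray, true_key: int) -> int:
    """Rank of the true key after summing per-trace log-probabilities.

    log_probs: (num_traces, 256), one column per key hypothesis.
    """
    scores = log_probs.sum(axis=0)   # a sum of logs avoids underflow
    return int(np.sum(scores > scores[true_key]))


def guessing_entropy(log_probs: np.ndarray, true_key: int, num_traces: int,
                     num_experiments: int = 100, seed: int = 0) -> float:
    """Average key rank over random subsets of `num_traces` attack traces."""
    rng = np.random.default_rng(seed)
    ranks = []
    for _ in range(num_experiments):
        idx = rng.choice(len(log_probs), size=num_traces, replace=False)
        ranks.append(key_rank(log_probs[idx], true_key))
    return float(np.mean(ranks))
```

Evaluating `guessing_entropy` for increasing values of `num_traces` gives the convergence curve that is typically reported; the first-order success rate is the fraction of experiments in which the rank is zero.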

#### 3.3.2 Success Rate

The success rate is the probability of guessing the key correctly within $x$ guesses; the most commonly reported value is the first-order success rate (the correct key ranked first). We average the success rate of our implementation over multiple runs of each experiment to account for statistical variance.

Plotted against the number of traces, the success rate tends to follow a sigmoidal curve whose midpoint marks the trace count beyond which key recovery becomes reliable.

#### 3.3.3 Mutual Information

Mutual information measures how much information the power leakage carries about the secret data, and it serves as a model-independent leakage metric for assessing distinguishing capability. We demonstrate that our implementation extracts significantly more information than a template attack from the same number of traces: up to 15–30% more in our experiments, depending on implementation and noise levels.

#### 3.3.4 Signal-to-Noise Ratio

ML models can extract features that improve the effective signal-to-noise ratio in the learned feature space compared to raw measurements. For instance, our analysis techniques assess both expected SNR and an “algorithmic SNR,” which assesses to what extent the model can increase the significance of the signal via learned functions. In our experiments, transformer models achieved the highest increase in algorithmic SNR, on average 6–12 dB above raw trace SNR.
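A per-sample SNR can be estimated as the variance of the class means divided by the mean of the class variances. The sketch below assumes NumPy and integer class labels such as the Hamming weight of an intermediate value; applying the same function to a network's learned features instead of raw samples is one possible way to obtain an "algorithmic SNR", though that reading is our assumption for this example. The names `snr` and `snr_db` are illustrative.

```python
import numpy as np


def snr(traces: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-sample SNR: Var(class means) / Mean(class variances).

    traces: (num_traces, num_samples); labels: (num_traces,) class ids.
    """
    classes = np.unique(labels)
    means = np.array([traces[labels == c].mean(axis=0) for c in classes])
    variances = np.array([traces[labels == c].var(axis=0) for c in classes])
    return means.var(axis=0) / variances.mean(axis=0)


def snr_db(traces: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """The same quantity expressed in decibels."""
    return 10.0 * np.log10(snr(traces, labels))
```

The difference between `snr_db` on learned features and on raw traces then gives a gain in dB that can be compared across architectures.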

## 3.4 Training Procedures

Training procedures and loss functions are vital for the success of neural network-based side-channel attacks as they are uniquely tailored to the purpose of key recovery.

### 3.4.1 Ranking Loss

By implementing ranking loss, we train directly to minimize the rank of the correct key, meaning that we train the attack to focus on what it needs to succeed. This is particularly beneficial for scenarios where relative key ranking is prioritized over global probability calibration; it decreased guessing entropy by 20–30% relative to cross-entropy training in our experiments.
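One common pairwise formulation penalizes every wrong key hypothesis whose score approaches or exceeds that of the correct key. The NumPy sketch below shows the forward computation only, for a single vector of accumulated key scores; the exact formulation and the `alpha` scaling used in our implementation may differ, and `ranking_loss` is an illustrative name.

```python
import numpy as np


def ranking_loss(scores: np.ndarray, true_key: int,
                 alpha: float = 1.0) -> float:
    """Pairwise logistic ranking loss over all wrong key hypotheses.

    scores: (256,) score per key hypothesis (higher means more likely).
    Each wrong key k contributes log2(1 + exp(-alpha * (s_true - s_k))).
    """
    diff = scores[true_key] - np.delete(scores, true_key)
    # logaddexp(0, x) = log(1 + exp(x)), computed stably
    return float(np.sum(np.logaddexp(0.0, -alpha * diff)) / np.log(2.0))
```

When all scores are equal the loss is 255 (one unit per wrong key), and it approaches zero as the correct key pulls ahead of every competitor, which ties the training objective directly to key rank.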

### 3.4.2 Focal Loss

Focal loss addresses class imbalance. In a side-channel attack, only one of the 256 candidate values for a key byte is correct. By giving more weight to hard examples while down-weighting easy negatives, this loss function improves convergence speed and attack performance.

Focal loss results in improved success rates for hard targets such as masked implementations [10]. Our experiments show a 10–15% improvement on masked implementations.
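The down-weighting of easy examples is visible in the standard focal loss formula, $-(1 - p_t)^{\gamma} \log(p_t)$, where $p_t$ is the predicted probability of the true class. The sketch below evaluates it for a single example in plain Python; the default `gamma=2.0` is a common choice and an assumption here, not a value taken from our experiments.

```python
import math
from typing import Sequence


def focal_loss(probs: Sequence[float], true_class: int,
               gamma: float = 2.0) -> float:
    """Focal loss for one example: -(1 - p_t)**gamma * log(p_t).

    With gamma = 0 this reduces to ordinary cross-entropy.
    """
    p_t = min(max(probs[true_class], 1e-12), 1.0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# An easy example (p_t = 0.9) is suppressed far more than a hard one (0.1)
for p in (0.9, 0.1):
    ce = focal_loss([1 - p, p], 1, gamma=0.0)
    fl = focal_loss([1 - p, p], 1, gamma=2.0)
    print(f"p_t={p}: cross-entropy={ce:.4f}, focal={fl:.4f}")
```

Raising `gamma` shifts more of the gradient budget toward traces the network currently misclassifies.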

### 3.4.3 Curriculum and Multi-Task Learning

Our GPAM implementation benefits from curriculum learning and multi-task learning for SCA. Training on progressively harder examples, together with auxiliary tasks, improves knowledge transfer and performance on the main key-recovery task.

Multi-task models achieve comparable key recovery performance with 15–25% fewer training traces, indicating better sample efficiency.

### 3.4.4 Curriculum Learning Attacks

We apply curriculum learning to attack training: models are trained first on clean traces and then on noisy, misaligned, and masked traces, which yields models that generalize far better in the wild. In our experiments, accuracy is 10–20% better on unseen devices compared to non-curriculum models.

## 3.5 Countermeasures

As ML advances rapidly and known side-channel attacks grow more powerful, new techniques are needed to counter what trained neural networks can learn.

### 3.5.1 Higher-Order Masking

We implement full Boolean masking countermeasures in which all sensitive computations are split into random shares. For instance, in the masked S-box implementation, we ensure that no intermediate value ever exists unmasked.
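The share-splitting idea can be illustrated for a single byte. In this minimal sketch, masking of order $d$ produces $d + 1$ shares whose XOR equals the sensitive value; the helper names `mask` and `unmask` are illustrative, and a real masked S-box never calls `unmask` on intermediate values.

```python
import secrets
from functools import reduce
from operator import xor
from typing import List


def mask(value: int, order: int) -> List[int]:
    """Split a byte into order + 1 Boolean shares."""
    masks = [secrets.randbelow(256) for _ in range(order)]
    masked = reduce(xor, masks, value)   # value ^ m1 ^ ... ^ m_order
    return [masked] + masks


def unmask(shares: List[int]) -> int:
    """Recombine the shares (only for checking the result)."""
    return reduce(xor, shares, 0)


shares = mask(0xA7, order=2)   # second-order masking gives three shares
assert unmask(shares) == 0xA7
```

Each share on its own is uniformly random, so an attacker has to combine leakage from all `order + 1` shares, which is what drives the trace requirements discussed in this section.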

However, we demonstrate that even second-order masked implementations remain susceptible to trained ensemble neural network attacks, which can combine the leakage of multiple shares given enough traces. The key, then, is to prevent neural networks from learning how the masked shares recombine.

This protection has limits, however: more recent studies indicate that sufficiently deep networks can learn to recombine shares up to third-order masking, although the trace requirements grow exponentially. For instance, first-order masking can be defeated with as few as 1,000–5,000 traces, second-order masking needs 10,000–50,000 traces, and third-order masking can require 100,000–500,000 traces or more to recover the key. The growth is exponential because the higher the masking order, the more complex the recombination function the neural network must learn, and the more widely the information is spread across time samples and mask shares. Preprocessing also becomes more demanding, often requiring centered product combining or higher-order statistical moments to produce usable features. Therefore, while such attacks are possible in theory, the practical data requirements make them unlikely against systems that have short lifespans or are otherwise inaccessible to malicious actors.

### 3.5.2 Hiding Techniques

Our software-based hiding countermeasures alter the order and timing of execution to decrease the SNR of sensitive operations (making leakage harder to detect). For example, random delay insertion adds temporal noise, yet falls short against alignment-invariant architectures.

Operation shuffling reorders operations of the same type that would otherwise execute in a fixed sequence, subject to their dependencies. We implement a random topological sort to maximize entropy while still guaranteeing correct execution. This increases the number of traces an attack requires by 2–5x, yet it cannot prevent the attack from succeeding.
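A random topological sort can be sketched as follows. The version below repeatedly picks one operation uniformly from those whose prerequisites are complete, which is simple but not necessarily uniform over all valid orders; the operation names and the function `random_topological_order` are hypothetical, and a deployed countermeasure would draw from a hardware or cryptographic random source.

```python
import random
from typing import Dict, List, Set


def random_topological_order(deps: Dict[str, Set[str]],
                             rng: random.Random) -> List[str]:
    """Return a random execution order that respects all dependencies.

    deps maps each operation to the set of operations that must run first.
    """
    remaining = {op: set(pre) for op, pre in deps.items()}
    order = []
    while remaining:
        ready = [op for op, pre in remaining.items() if not pre]
        if not ready:
            raise ValueError("dependency cycle or unknown prerequisite")
        op = rng.choice(ready)
        order.append(op)
        del remaining[op]
        for pre in remaining.values():
            pre.discard(op)
    return order


# 16 independent S-box lookups followed by one dependent step
deps = {f"sbox{i}": set() for i in range(16)}
deps["shiftrows"] = {f"sbox{i}" for i in range(16)}
print(random_topological_order(deps, random.Random()))
```

With 16 independent S-box lookups there are 16! possible orders, so the leakage of any given byte is spread over 16 positions in the trace.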

### 3.5.3 Constant-Time Implementations

Our library of constant-time implementations shows how sensitive operations can be implemented without data-dependent execution paths.

These implementations successfully eliminate timing variation; however, circuit-level power differences cannot be avoided. Our timing-assertion tools compare execution times across inputs and find only sub-microsecond differences; power leakage, however, remains exploitable.

### 3.5.4 Blinding and Randomization

Our implementation incorporates blinding: sensitive operations are performed on randomized intermediate values.

The blinding factor $r$ changes with every operation, so an adversary cannot average across traces to cancel the randomization. This complicates matters for the adversary, although blinding remains vulnerable to nontrivial ML methods if the adversary can collect a sufficient number of traces.
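As one concrete instance, the sketch below shows classic base blinding for an RSA private-key operation, with a fresh random $r$ per call. The choice of RSA and the toy parameters are assumptions made for illustration; `blinded_rsa_private_op` is a hypothetical name, and the modular inverse via `pow(r, -1, n)` requires Python 3.8 or later.

```python
import math
import secrets


def blinded_rsa_private_op(c: int, d: int, e: int, n: int) -> int:
    """Compute c**d mod n without exponentiating c directly."""
    # Fresh blinding factor for every call, coprime to n
    while True:
        r = secrets.randbelow(n - 2) + 2
        if math.gcd(r, n) == 1:
            break
    blinded = (c * pow(r, e, n)) % n   # c * r^e
    result = pow(blinded, d, n)        # (c * r^e)^d = c^d * r  (mod n)
    return (result * pow(r, -1, n)) % n


# Toy parameters (n = 61 * 53); far too small for real use
n, e, d = 3233, 17, 2753
assert blinded_rsa_private_op(2790, d, e, n) == pow(2790, d, n)
```

Because the exponentiation operates on `c * r^e` rather than on `c`, the intermediate values differ on every call even when the same ciphertext is processed repeatedly.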

### 3.5.5 ML-Based Attack Detection

Our implementation has systems that detect side-channel attack attempts in real time, based on neural networks.

Such detection systems achieve 85–92% true positive rates with less than 5% false positive rates in our tests under a controlled laboratory environment, and they allow stronger mitigation strategies to be invoked dynamically when an attack is detected. The detector assesses the statistical characteristics of power traces and flags anomalies that suggest an organized measurement attempt.

### 3.5.6 Combined Defenses

We combine our countermeasures with other defenses and adaptive measures.

Combining defenses makes the attacker's task harder, because all countermeasures must be bypassed at the same time. While a suitably trained neural network can bypass each countermeasure individually, the combination raises the cost considerably: key recovery requires 5–20x more traces than against an unprotected implementation.

### 3.5.7 Synthetic Trace Generation

Our system features comprehensive support for synthetic side-channel trace generation with customizable leakage features and countermeasures. This enables researchers to test attacks and defenses in realistic settings.

The ability to generate synthetic traces with different masking orders, noise levels, and implementation features allows for reproducible testing and algorithm development. However, synthetic traces lack some physical properties of real hardware: manufacturing differences across chips, temperature-dependent leakage variations, and coupling between nearby circuit elements. Therefore, in our experiments, algorithms trained purely on synthetic data underperformed by 15–25% when assessed on real hardware traces. Nevertheless, generating traces synthetically is useful for initial algorithm development, for comparative analysis of attacks, and when physical device access is limited. The best results, in terms of both data requirements and attack performance, seem to come from training on synthetic data and then fine-tuning with real traces rather than relying exclusively on either source.

### 3.5.8 Benchmarking Tools

A collection of benchmarking and visualization tools enables the comparative measurement of attacks in our library across different neural network designs, datasets, and defenses. The visualization tools give deeper insight into attack processes, attention heat maps of trained networks, and areas where leakage often occurs.

In essence, these tools give stable, comparable assessments across different approaches and reveal which filtering, architecture, and training choices produce the most successful attacks.

## 4. Conclusion

This research has demonstrated that ML drastically alters side-channel attacks, both how they are mounted and how power analysis must be defended against. Our implementation and analysis suggest that neural networks reduce the attacker expertise, implementation time, and financial investment needed to achieve cryptographic key recovery.

Our findings indicate that ML-based attacks outperform classical attacks in the following respects:

1. **Reduced Data Requirements:** DL achieves key recovery with substantially fewer traces than typical statistical attacks, on the order of 50–90% fewer. For example, CNNs recover keys from first-order masked AES implementations with only 1,000–2,000 traces, while typical DPA requires 5,000–10,000 traces. This lower trace requirement makes such attacks easier to carry out in the field, where traces may only be obtainable under certain circumstances.
2. **Architecture-Specific Benefits:** Each neural network architecture offers benefits that depend on the attack scenario. For example, CNNs learn features on their own from aligned traces and achieve key-byte success rates of 70–85% on masked implementations. LSTMs perform better with desynchronized traces, achieving 60–75% success rates even with 500-sample desynchronizations. Finally, Transformers require more resources and training time, but they outperform the others by 20–40% on the most complex datasets.
3. **Attacks Work on All Implementations:** Our implementation successfully attacks AES, RSA, and ECC implementations regardless of whether countermeasures are employed. Even sophisticated implementations (second-order masking, Montgomery ladder implementations, and constant-time coding) are vulnerable to ensemble neural network attacks as long as sufficient traces are collected.
4. **Empowering Practical Attacks:** Defenses that rely on statistical assumptions are vulnerable to ML attacks that violate those assumptions in practice. For example, a masking scheme built on the notion that no attacker can aggregate leakage from different points would be secure if the assumption held, but a neural network learns those associations on its own, even where a human analyst would consider the aggregation impractical. However, attacks are 5–20x harder when multilayer defenses are employed.
5. **Democratized Attacks:** DL is automated, meaning an attacker does not necessarily need prior knowledge of cryptography or manual feature engineering. With the tools we provide, those who want to experiment with attacking implementations do not need to be SCA experts; per our results, the capability is within reach of everyone from academic researchers to security auditors.

## **Implications:**

- *Security Assessment*: Enterprises must reassess the security of systems that use contemporary cryptographic implementations, particularly systems that rely only on classical defenses.
- *Defenses to Anticipate*: Future defenses should be designed from an ML perspective and may include adversarial training and randomization aimed specifically at defeating ML models.
- *Certifications to Expect*: Security certifications must incorporate ML-based testing methods to ensure that learned attack vectors have been evaluated and addressed.
- *Defenses We've Introduced*: Our real-time ML-based detection shows that active defenses can be deployed while an attack is in progress.

## **Future Research Directions:**

1. *Adversarial Machine Learning*: Training methods that make models immune to the defensive measures created against them while assessing defenses that exploit vulnerabilities in ML models.
2. *Transfer Learning Advancement*: Improved transfer of trained models between different implementations/devices to reduce the need for profiling in real attacks.
3. *Automated Architecture*: Ability to automatically assess the optimal neural network architecture for any given side-channel application instead of relying on pre-defined network architectures.
4. *Hardware-Software Integration*: Solutions that depend on unified protective strategies such that algorithmic countermeasures presume software design as attack-aware.
5. *Real World Attacks*: Moving beyond proof of concept in the lab to real-world attacks on actual off-the-shelf devices with realistic access and trace quality limitations.
6. *Comprehensive Benchmark*: A comprehensive benchmark for sound evaluation of attack and defense strategies across implementations and across attacker and designer constraints.
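A benchmark of the kind described in item 6 would need a common success metric. The sketch below computes key rank and guessing entropy, two metrics widely used in the SCA literature. It assumes the attack has already produced, for each attack trace, a log-probability for each of the 256 key-byte hypotheses. The array layout and function names are illustrative assumptions, not the interface of our implementation.

```python
import numpy as np


def key_rank(log_probs, true_key):
    """Rank of the true key byte (0 = best guess).

    log_probs: array of shape (n_traces, 256), where entry [t, k] is the
    log-probability that trace t assigns to key hypothesis k (assumed layout).
    """
    scores = log_probs.sum(axis=0)  # accumulate evidence over traces
    return int(np.sum(scores > scores[true_key]))


def guessing_entropy(log_probs, true_key, n_attack, n_repeats=100, seed=0):
    """Average key rank over random subsets of n_attack traces."""
    rng = np.random.default_rng(seed)
    ranks = []
    for _ in range(n_repeats):
        chosen = rng.choice(log_probs.shape[0], size=n_attack, replace=False)
        ranks.append(key_rank(log_probs[chosen], true_key))
    return float(np.mean(ranks))
```

Reporting guessing entropy as a function of `n_attack` gives a curve that can be compared across attacks and countermeasures, provided every method is evaluated on the same traces.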

The side-channel security landscape has been transformed by the advent of DL. On the one hand, attacks using neural networks are easy to mount and efficient; on the other hand, the same techniques enable ML-based detection and learned defensive responses. Because attacks and defenses continue to evolve against each other, side-channel security will remain an active and expanding line of research.

We facilitate future research with our open-source implementation at [github.com/ahmedtaha100/Neural-Cryptanalyst](https://github.com/ahmedtaha100/Neural-Cryptanalyst). The project, made publicly available to the security community, aims to serve as a one-stop vulnerability testbed that helps incrementally harden cryptographic implementations, both by surfacing new findings and by evaluating potential defenses against vulnerabilities that remain hidden.

As embedded, IoT, and even cloud applications rely ever more heavily on cryptographic implementations, those implementations must be assessed for side-channel vulnerabilities. This work illustrates that prior security assumptions need to be re-examined now that ML is widely available, and that research is needed to establish new assumptions that hold up against what today's neural networks can readily learn.

## 5. Acknowledgments

This manuscript originated as a term project for EN.695.641 Cryptology at Johns Hopkins University (Spring 2025). The author thanks Prof. Tom McGuire for guidance and feedback.

**Data Availability:** Implementation, datasets, and experimental scripts are available at [github.com/ahmedtaha100/Neural-Cryptanalyst](https://github.com/ahmedtaha100/Neural-Cryptanalyst).

## References

- [1] E. Bursztein, L. Invernizzi, K. Kral, D. Moghimi, J.-M. Picod, and M. Zhang, "Generalized Power Attacks against Crypto Hardware using Long-Range Deep Learning," IACR Trans. Cryptogr. Hardw. Embed. Syst., vol. 2024, no. 3, pp. 472-499, 2024.
- [2] L. Masure, C. Dumas, and E. Prouff, "A Comprehensive Study of Deep Learning for Side-Channel Analysis," IACR Trans. Cryptogr. Hardw. Embed. Syst., vol. 2020, no. 1, pp. 348-375, 2020.
- [3] S. Picek, G. Perin, L. Mariot, L. Wu, and L. Batina, "SoK: Deep Learning-based Physical Side-Channel Analysis," ACM Comput. Surv., vol. 55, no. 11, pp. 1-35, 2023.
- [4] L. Wu, G. Perin, and S. Picek, "I choose you: Automated hyperparameter tuning for deep learning-based side-channel analysis," Cryptology ePrint Archive, Paper 2020/1293, 2020.
- [5] E. Cagli, C. Dumas, and E. Prouff, "Convolutional Neural Networks with Data Augmentation Against Jitter-Based Countermeasures," in Proc. CHES 2017, pp. 45-68, 2017.
- [6] L. Weissbart, L. Chmielewski, S. Picek, and L. Batina, "Systematic Side-Channel Analysis of Curve25519 with Machine Learning," J. Hardw. Syst. Secur., vol. 4, no. 4, pp. 314-328, 2020.
- [7] G. Zaid, L. Bossuet, A. Habrard, and A. Venelli, "Methodology for Efficient CNN Architectures in Profiling Attacks," IACR Trans. Cryptogr. Hardw. Embed. Syst., vol. 2020, no. 1, pp. 1-36, 2020.
- [8] B. Hettwer, S. Gehrer, and T. Güneysu, "Applications of machine learning techniques in side-channel attacks: a survey," J. Cryptogr. Eng., vol. 10, pp. 135-162, 2020.
- [9] S. Hajra, S. Saha, M. Alam, and D. Mukhopadhyay, "TransNet: Shift Invariant Transformer Network for Side-Channel Analysis," in Progress in Cryptology - AFRICACRYPT 2022, LNCS, vol. 13503, Springer, 2022.
- [10] M. Kerkhof, L. Wu, G. Perin, and S. Picek, "Focus is Key to Success: A Focal Loss Function for Deep Learning-based Side-Channel Analysis," Cryptology ePrint Archive, Paper 2021/1408, 2021.
- [11] P. Kocher, J. Jaffe, and B. Jun, "Differential Power Analysis," in Advances in Cryptology - CRYPTO '99, LNCS, vol. 1666, Springer, 1999, pp. 388-397.
- [12] P. C. Kocher, "Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems," in Advances in Cryptology - CRYPTO '96, LNCS, vol. 1109, Springer, 1996, pp. 104-113.
- [13] C. E. Shannon, "Communication in the Presence of Noise," Proc. IRE, vol. 37, no. 1, pp. 10-21, Jan. 1949.
- [14] Y. Bai, R. Y. Acharya, and D. Forte, "SPERO: Simultaneous Power/EM Side-channel Dataset Using Real-time and Oscilloscope Setups," arXiv preprint arXiv:2405.06571, May 2024.
- [15] F.-X. Standaert, T. G. Malkin, and M. Yung, "A Unified Framework for the Analysis of Side-Channel Key Recovery Attacks," in Advances in Cryptology - EUROCRYPT 2009, LNCS, vol. 5479, Springer, 2009, pp. 443-461.
- [16] J.-J. Quisquater and D. Samyde, "ElectroMagnetic Analysis (EMA): Measures and Counter-Measures for Smart Cards," in E-smart 2001, LNCS, vol. 2140, Springer, 2001, pp. 200-210.
- [17] J. P. Mishra and S. K. Sahay, "Modern Hardware Security: A Review of Attacks and Countermeasures," arXiv preprint arXiv:2501.04394, Jan. 2025.
- [18] S. Chari, J. R. Rao, and P. Rohatgi, "Template Attacks," in Cryptographic Hardware and Embedded Systems - CHES 2002, LNCS, vol. 2523, Springer, 2003, pp. 13-28.
- [19] S. Mangard, E. Oswald, and T. Popp, Power Analysis Attacks: Revealing the Secrets of Smart Cards, Springer, 2007.
- [20] E. Brier, C. Clavier, and F. Olivier, "Correlation Power Analysis with a Leakage Model," in Cryptographic Hardware and Embedded Systems - CHES 2004, LNCS, vol. 3156, Springer, 2004, pp. 16-29.