



Max Planck Institute for Security and Privacy

# Cryptographic Engineering in Post-Quantum Cryptography

Vincent Hwang

November 6th, 2025, National Taiwan University, Taipei, Taiwan

# About Myself



- ▶ BSc., NTU CSIE, Taiwan (2016.09.01 ~ 2021.06.31).
  - ▶ ~ 2020: graph theory/algorithms, complexity theory (surveying).
  - ▶ 2020 ~ 2021, cryptographic engineering:
    - ▶ Courses Post-Quantum Cryptography.
    - ▶ Lattice-based cryptography (course).
    - ▶ Assembly optimizations (internship).
- ▶ MSc., NTU CSIE, Taiwan (2021.09.01 ~ 2022.06.31).
  - ▶ Cryptographic engineering (assembly, lattices).
- ▶ PhD (on going), MPI-SP, Germany (2023.01.01 ~ Now).
  - ▶ Cryptographic engineering:
    - ▶ Assembly programming for lattices.
    - ▶ Formal verification of assembly programs.
  - ▶ Elliptic-curve discrete logarithm (ongoing research).

# Cryptography



$P_a = P_b$ . Challenge: solve  $(d_a, d_b, P_a, P_b)$  from  $(G, Q_a, Q_b)$ .

# Cryptographic Engineering (Traditional Way)





## Cryptosystem

# Cryptographic Engineering (Holistic Way)



Algorithms



Cryptosystem

# Cryptographic Engineering (Holistic Way)



# Cryptographic Engineering (Holistic Way)



# Cryptographic Engineering (Holistic Way)



# Cryptographic Engineering (Holistic Way)



# Cryptographic Engineering (Holistic Way)



# Cryptographic Engineering (Holistic Way)



# Cryptographic Engineering (Holistic Way)



# Cryptographic Engineering (Life Cycle)



## Recommendations

Which parameters?  
Module dimension  
Polynomial dimension  
Coefficient ring

How to compute?

FFT  
Float  
NTT  
Modular arith.

Which hashes?

SHA-2  
SHA-3

Randomness sampling?

## High-level descriptions



## Implementations

Timing side-channels  
Conditional jumps  
Table lookups  
Variable-time inst.  
Power side-channels  
Performance (hardware)  
Area  
Latency  
Performance (software)  
Low-level arithmetic  
Register pressure  
Memory access  
Vectorization

# Post-Quantum Cryptography



- ▶ Pre-quantum: integer factorization (IF) and elliptic-curve discrete logarithm (ECDLP).
- ▶ Post-quantum (believed to be secure against quantum computers):
  - ▶ Lattice-based.
    - ▶ NTRU, LWE, M/R-LWE, M-LWR, M-SIS.
    - ▶ Polynomial multiplications, matrix multiplications over  $\mathbb{Z}_q$  ( $\mathbb{Z}/q\mathbb{Z}$ ).
  - ▶ Supersingular-isogeny-based.
  - ▶ Multivariate-based.
  - ▶ Code-based.
  - ▶ Hash-based.
  - ▶ Rich algebraic structures except for the hash-based ones.

# NIST Post-Quantum Cryptography Standards



Post-Quantum Cryptography Standardization by the National Institute for Standards and Technology (NIST).

- ▶ Selecting cryptosystems: 2016 ~ 2025. Four rounds.
- ▶ FIPS203 (ML-KEM, Kyber), 2024.
  - ▶ Lattice. Module-Learning-With-Error (M-LWE).
- ▶ FIPS204 (ML-DSA, Dilithium), 2024.
  - ▶ Lattice. M-LWE, Module-Short-Integer-Solution (M-SIS).
- ▶ FIPS205 (SLH-DSA, SPHINCS<sup>+</sup>), 2024.
  - ▶ Hash-based.
- ▶ FIPS206 (FN-DSA, Falcon), 2025?
  - ▶ Lattice. SIS over NTRU, fast Fourier sampling,
- ▶ HQC, 2027?
  - ▶ Code-based.

# What Am I Doing?



- ▶ Assembly optimization for polynomial multiplications.
  - ▶ Assembly: [Armv7-M](#) (microcontrollers), Armv8-A and AVX2 (smartphones, personal computers).
  - ▶ Lattices: [Dilithium](#), Kyber, NTRU, NTRU Prime, and Saber.
  - ▶ Polynomial rings:  
 $\mathbb{Z}_q[x]/\langle x^{256} + 1 \rangle$  ,  $\mathbb{Z}_{2^{13}}[x]/\langle x^{256} + 1 \rangle$  ,  $\mathbb{Z}_{2^k}[x]/\langle x^n - 1 \rangle$  ,  $\mathbb{Z}_q[x]/\langle x^p - x - 1 \rangle$
  - ▶ [Optimizations for modular arithmetic](#).
  - ▶ Optimizations for polynomial transformations.
- ▶ Formal verification of assembly.
  - ▶ [Floating-point in Falcon \(lattice\)](#).
- ▶ Cryptanalysis.
  - ▶ [ECDLP](#) on data-center-level GPUs.

This talk.

## Multiplying Polynomials without Powerful Multiplication Instructions

Vincent Hwang, YoungBeom Kim, and Seog Chung Seo

- ▶ Variable-time long multiplication instructions.
- ▶ Constant-time modular arithmetic.
- ▶ Prime modulus: generalized Barrett multiplication suitable for multi-limb arithmetic.
- ▶ Power-of-two modulus: Nussbaumer.
- ▶ Paper: <https://tches.iacr.org/index.php/TCHES/article/view/11926>.
- ▶ Artifact: [https://github.com/vincentvh/PolyMul\\_Without\\_PowerfulMul](https://github.com/vincentvh/PolyMul_Without_PowerfulMul).

# Where Are We?



## Recommendations

Which parameters?  
Module dimension  
Polynomial dimension  
Coefficient ring

How to compute?

FFT  
Float  
NTT  
**Modular arith.**

Which hashes?

SHA-2  
SHA-3

Randomness sampling?

## High-level descriptions



## Implementations

Timing side-channels  
Conditional jumps  
Table lookups  
**Variable-time inst.**  
**Emulation**

Power side-channels  
Performance (hardware)  
Area  
Latency

Performance (software)  
**Low-level arithmetic**  
Register pressure  
Memory access  
Vectorization

# What Is Constant-Time in Cryptographic Implementations?



- ▶ Constant-time: ~~constant execution time~~, **secret-independent** execution time.
- ▶ Variable-time: **secret-dependent** execution time.
- ▶ Common computations that leak timing side-channels if input is secret.
  - ▶ Conditional branches.
  - ▶ Table lookup.
  - ▶ Indexing with ~~secret index~~.
  - ▶ Some assembly instructions.
    - ▶ Divisions.
    - ▶ Floating-point arithmetic.
    - ▶ [Long multiplications](#).
    - ▶ Typical resolution: emulate with constant-time instructions.

# Modular arithmetic in ML-DSA



- ▶ ML-DSA.
- ▶ Module-lattice-based digital signature.
- ▶  $q = 2^{23} - 2^{13} + 1$ . Modular arithmetic in  $\mathbb{Z}_q$ .
- ▶ Let `int32_t a b`; we often need one of the following:
  - ▶ A 64-bit product.
    - ▶ `int64_t c = (int64_t)a * b;`
    - ▶ The highest 32 bits of a 64-bit product.
      - ▶ `int64_t c = (int32_t)((int64_t)a * b) >> 32;`
  - ▶ Cortex-M3: 32-bit Armv7-M instruction set architecture (ISA).
    - ▶ 32-bit long multiplications are variable-time → leak secret information.
    - ▶ Multi-limb arithmetic for the long product → significant overhead.

# Multiplication Instructions



R is a power of two,  $2^{32}$  on Cortex-M3.



# Generalized Barrett Multiplication (2-Limb)



Goal: compute  $a \cdot c \equiv ab \pmod{q}$ .  $\left\lfloor \frac{ab}{q} \right\rfloor \approx \left\lfloor \frac{ab'}{R} \right\rfloor$ ,  $b' = \left\lfloor \frac{bR}{q} \right\rfloor$ .

Originally, compute  $c = ab - \left\lfloor \frac{ab'}{R} \right\rfloor q$  with  $\left\lfloor \frac{ab'}{R} \right\rfloor$  as:



Instead, compute  $c = ab - \left\lceil \frac{ab'}{R} \right\rceil q$  with  $\left\lceil \frac{ab'}{R} \right\rceil$  as:





- ▶ For a  $\delta > 0$ ,  $\delta$ -integer approximation  $\llbracket \cdot \rrbracket$ :  $\forall r \in \mathbb{R}, |r - \llbracket r \rrbracket| \leq \delta$ .
- ▶ For a  $b$ , write  $b_l + b_h \sqrt{R} = \left\lfloor \frac{bR}{q} \right\rfloor$ . Define  $\llbracket \cdot \rrbracket_b$  as

$$\forall r \in \mathbb{R}, \llbracket r \rrbracket_b := \left\lfloor \frac{a_l b_h}{\sqrt{R}} \right\rfloor + \left\lfloor \frac{a_h b_l}{\sqrt{R}} \right\rfloor + a_h b_h$$

where  $a_l + a_h \sqrt{R} = \frac{rR}{\left\lfloor \frac{bR}{q} \right\rfloor}$ ,  $b_l + b_h \sqrt{R} = \left\lfloor \frac{bR}{q} \right\rfloor$ .

- ▶ Obviously,  $|\llbracket r \rrbracket_b - r| < 3$  (see previous slide).
- ▶  $|\llbracket r \rrbracket_b - \lfloor r \rfloor| < 3$  (see paper).
- ▶  $\left| \left( ab - \llbracket \frac{ab}{R} \rrbracket_b q \right) - ab \bmod q \right| \leq 3q$ .

# Results



- ▶  $1.92 \times$  faster modular multiplication.
  - ▶  $\approx$  as a long multiplication.
  - ▶ One of the operands must be a constant.
  - ▶ An extra precomputed constant.
  - ▶ No assumptions on the modulus.
  - ▶ No assumptions on the constants.
  - ▶ Above conditions are usually met in lattice-based cryptosystems.
- ▶  $1.51 \times$  faster NTT (for poly. mul.) in  $\mathbb{Z}_q[x]/\langle x^{256} + 1 \rangle$ .

Table: Cycle count on Cortex-M3.

|            |    |
|------------|----|
| Long       | 11 |
| Montgomery | 23 |
| Barrett    | 12 |

Optimize **a modular multiplication** instead of **a long multiplication + a modular reduction**.

# Formal Verification of Emulated Floating-Point Arithmetic in Falcon

Vincent Hwang

- ▶ Paper: [https://link.springer.com/chapter/10.1007/978-981-97-7737-2\\_7](https://link.springer.com/chapter/10.1007/978-981-97-7737-2_7).
- ▶ Artifact: [https://github.com/vincentvh/Float\\_formal](https://github.com/vincentvh/Float_formal).

# Where Are We?



## Recommendations

Which parameters?  
Module dimension  
Polynomial dimension  
Coefficient ring

How to compute?

FFT  
Float

NTT

Modular arith.

Which hashes?

SHA-2  
SHA-3

Randomness sampling?

## High-level descriptions



## Implementations

Timing side-channels  
Conditional jumps  
Table lookups  
Variable-time inst.  
Emulation

Power side-channels  
Performance (hardware)  
Area  
Latency

Performance (software)  
Low-level arithmetic  
Register pressure  
Memory access  
Vectorization



- ▶ FIPS206, FN-DSA. Drafting.
- ▶ Lattice-based digital signature scheme.
- ▶ Hash-and-sign.
- ▶ Double-precision floating-point arithmetic in signing.
- ▶ Concerns of floating-point arithmetic.
  - ▶ Not always constant-time.
  - ▶ Double-precision FPU does not even exist on some microcontrollers!
    - ▶ Cortex-M3 (Armv7-M).
    - ▶ Cortex-M4F (Armv7E-M + single-precision).
  - ▶ Emulating with integer and logical arithmetic.

# Floating-Point Arithmetic



Double-precision in this talk.



- ▶  $0 < e < 2047$  (normal values):

$$(-1)^s 2^{e-1075} (2^{52} + m).$$

- ▶ Zeros:  $e = m = 0$ .
- ▶ Other values:  $e = 0 \wedge m \neq 0$ ,  $e = 2047$ .

# Motivations of Formal Verification for Floats



**Incorrect zeroization** in floating-point multiplication (FP mul.).

- ▶ Emulated floating-point arithmetic (by Falcon submitters) in C and Armv7-M assembly.
- ▶ Zeroizing and rounding in the wrong order.
- ▶  $\exists$  **normal non-zero** floating-point products  $\approx \pm 2^{-1074}$  being **zeroized**.
- ▶ 692/2048 (34%) float constants in FFT admit such float products!

Questions.

- ▶  $\exists$  such floats in Falcon? Experimentally no, but no proofs.
- ▶ What are the programs actually doing?



- ▶  $\exists$  such float products in Falcon? Range checking.
  - ▶ No in FFTs of Falcon.
  - ▶ With **input conditions**,  $\neg\exists$  floats with incorrect zeroizations.
  - ▶ If FFT inputs are integers drawn from  $[-2^{15}, 2^{15}]$ , every floats in  $[2^{-476}, 2^{27}(2^{52} + 605182448294568)]$  ✓.
  - ▶ 0.3 seconds for an FP addition.
  - ▶ 2.6 seconds for an FP multiplication.
- ▶ What are the programs actually doing? Functional equivalence.
  - ▶ == more readable programs.
  - ▶ Armv7-M assembly  $\longleftrightarrow$  CryptoLine  $\longleftrightarrow$  Jasmin.
  - ▶  $\approx$  1 minute for each FP addition  $\longleftrightarrow$ .
  - ▶ 5 ~ 60 seconds for each FP multiplication  $\longleftrightarrow$ .

# More Published Works I



## Recommendations

Which parameters?  
Module dimension  
Polynomial dimension  
Coefficient ring

How to compute?

FFT

Float

NTT

Modular arith.

Which hashes?

SHA-2

SHA-3

Randomness sampling?

## High-level descriptions



Armv7-M

## Implementations

Timing side-channels  
Conditional jumps  
Table lookups  
Variable-time inst.  
Power side-channels  
Performance (hardware)  
Area  
Latency  
Performance (software)  
Low-level arithmetic  
Register pressure  
Memory access  
Vectorization

# More Published Works II



## Recommendations

Which parameters?  
Module dimension  
Polynomial dimension  
**Coefficient ring**

How to compute?  
FFT  
Float  
NTT  
Modular arith.

Which hashes?  
SHA-2  
SHA-3  
Randomness sampling?

## High-level descriptions



## Implementations

Timing side-channels  
Conditional jumps  
Table lookups  
Variable-time inst.  
Power side-channels  
Performance (hardware)  
Area  
Latency  
Performance (software)  
Low-level arithmetic  
Register pressure  
Memory access  
Vectorization

# More Published Works III



## Recommendations

Which parameters?  
Module dimension  
**Polynomial dimension**

Coefficient ring

How to compute?

FFT

Float

NTT

**Modular arith.**

Which hashes?

SHA-2

SHA-3

Randomness sampling?

## High-level descriptions



## Implementations

Timing side-channels  
Conditional jumps  
Table lookups  
Variable-time inst.  
Power side-channels  
Performance (hardware)

Area

Latency

Performance (software)  
Low-level arithmetic  
**Register pressure**  
**Memory access**  
**Vectorization**



## Recommendations

Which parameters?  
Module dimension  
Polynomial dimension  
Coefficient ring

How to compute?  
FFT  
Float  
NTT  
Modular arith.

Which hashes?  
SHA-2  
SHA-3  
Randomness sampling?

## High-level descriptions



Armv8-A Neon, x86 AVX2

## Implementations

Timing side-channels  
Conditional jumps  
Table lookups  
Variable-time inst.  
Power side-channels  
Performance (hardware)  
Area  
Latency  
Performance (software)  
Low-level arithmetic  
Register pressure  
Memory access  
Vectorization

# Publications



Authors in alphabetic order.

ACCEHHLNSWY20 CHKSSY21 ACCHKY21

Microcontroller opt.

AHKS22 HKS24 BHKPY22

AABEHOPSS25

Formal verification

HLSSTYY22

Survey of opt.

BHKYY21 AHY22 CCHY24 HLY24

H24a H24c

Mid-tier/high-end processor opt.

# Ongoing Works



- ▶ Deniable authenticated KEM.
  - ▶ USENIX 2026, rebuttal phase.
  - ▶ Feasibility results in practice (mid-tier/high-end processors).
  - ▶ Coauthors: theory.
  - ▶ My job: implementation.
- ▶ Elliptic-curve discrete logarithm on GPUs.
  - ▶ A random curve over a generic 131-bit prime.
  - ▶ H100 GPU.
  - ▶ Projection: 66,000 H100-days.
  - ▶ Low-level optimizations (inline ptx).
  - ▶ Memory access.



- ▶ Optimizations.
  - ▶ More architectures: Armv9-A and AVX-512.
  - ▶ More cryptosystems: Falcon.
- ▶ Formal verification of Falcon.
  - ▶ Transformation from high-dim. to one-dim. samplers.
    - ▶ Functional correctness.
    - ▶ Range properties.
  - ▶ Security arguments for the samplers.
    - ▶ Rényi divergence.
    - ▶  $\exists$  math. proofs for the one-dimensional. case in literature.
    - ▶  $\neg\exists$  math. proofs for the high-dimensional case.
- ▶ Cryptanalysis on GPUs.
  - ▶ Elliptic-curve method in the general number field sieve for **IF**.
    - ▶ Random curves over random  $\sim 130$ -bit integers.
  - ▶ Lattice sieving for shortest vector problem.
    - ▶ GPU-memory-friendly bucketing.
    - ▶ Inner products.



Thank you for attention