

# Harshith Kantamneni

kantamneniharshith@gmail.com | +1 (414) 916-5799 | linkedin.com/in/hk4231 | github.com/Drogon4231  
Madison, WI

M.S. ECE student specializing in CUDA Kernel Optimization, GPU Memory Systems, and Performance Engineering

## SKILLS

- **Languages:** C, C++, Python, CUDA
- **Parallel Programming:** Triton, OpenMP
- **Tools & Infrastructure:** PyTorch, Slurm, gem5, McPAT, Git
- **Domains:** GPU Kernel Optimization, GPU Memory Systems, High-Performance Computing, Computer Architecture, ML for Systems

## EDUCATION

|                                                                                                                      |                                            |
|----------------------------------------------------------------------------------------------------------------------|--------------------------------------------|
| <b>University of Wisconsin–Madison</b><br><i>M.S. Electrical and Computer Engineering</i>                            | <i>Madison, USA</i><br>Sep 2024 – Dec 2025 |
| • Coursework: High Performance Computing, Advanced Computer Architecture, Machine Learning, Fault-Tolerant Computing |                                            |

|                                                                                                      |                                        |
|------------------------------------------------------------------------------------------------------|----------------------------------------|
| <b>Vellore Institute of Technology</b><br><i>B.Tech in Electronics and Communication Engineering</i> | <i>Amaravati, India</i><br>2020 – 2024 |
|------------------------------------------------------------------------------------------------------|----------------------------------------|

## PROJECTS

|                                                                                                                                                  |                               |
|--------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------|
| <b>GPU Kernel Performance Benchmarking (CUDA &amp; Triton)</b><br><i>C++, CUDA, Triton, Python</i>                                               | Dec 2025 – Present            |
| • Implemented and optimized GPU kernels for <b>GEMM</b> and <b>parallel reductions</b> using both <b>CUDA</b> and <b>Triton</b> kernels.         |                               |
| • Applied shared-memory tiling, block-level parallelism, and launch configuration tuning to study performance trade-offs across implementations. |                               |
| • Benchmarked kernel throughput and effective memory bandwidth using <b>CUDA event timing</b> across varied configurations.                      |                               |
| • Analyzed sensitivity to tile shape, block size, and memory access patterns to identify occupancy and bandwidth bottlenecks.                    |                               |
| • Built a reusable benchmarking harness to enable fair, repeatable performance comparisons between CUDA and Triton kernel variants.              |                               |
| <b>ML-Guided CUDA Kernel Configuration Optimizer</b><br><i>Python, PyTorch, CUDA, Slurm</i>                                                      | Jan 2025 – May 2025<br>GitHub |
| • Trained a learning-based model to predict near-optimal <b>CUDA grid and block sizes</b> from workload characteristics.                         |                               |
| • Reduced exhaustive kernel tuning overhead by <b>&gt;95%</b> , achieving up to <b>30% runtime improvement</b> on matrix workloads.              |                               |
| • Automated large-scale kernel benchmarking using Slurm pipelines and CUDA event-based timing.                                                   |                               |
| <b>MI300X Memory-System Reverse Engineering &amp; gem5 Calibration</b><br><i>CUDA, gem5, Python</i>                                              | Sep 2025 – Dec 2025           |
| • Developed CUDA microbenchmarks (pointer-chasing, stride sweeps) to characterize <b>L2/L3/HBM latency and bandwidth</b> behavior.               |                               |
| • Analyzed access patterns and timing behavior to infer chiplet-level cache organization and memory hierarchy effects.                           |                               |
| • Calibrated gem5 MI300X memory parameters (cache size, line size, latency) to better align simulation with measured hardware trends.            |                               |
| • Established a reproducible methodology for studying modern GPU memory systems using microbenchmarks and simulation.                            |                               |
| <b>ML-Assisted Task Graph Partitioning</b><br><i>Python, XGBoost, Slurm</i>                                                                      | Jan 2025 – May 2025<br>GitHub |
| • Modeled task-graph structural features to predict runtime-efficient GPU partitioning strategies.                                               |                               |
| • Achieved <b>&lt;5% prediction error</b> , reducing design-space exploration time by <b>25%</b> .                                               |                               |
| • Enabled faster performance evaluation by replacing exhaustive sweeps with learned configuration selection.                                     |                               |
| <b>ABFT-GEMM Reliability and Performance Analysis</b><br><i>Python, Slurm</i>                                                                    | Sep 2025 – Dec 2025           |
| • Performed Monte-Carlo fault injection on GEMM workloads to study numerical robustness under soft-error conditions.                             |                               |
| • Measured performance and accuracy trade-offs across matrix sizes using batch execution on HPC clusters.                                        |                               |
| • Analyzed resilience overheads in compute-intensive kernels relevant to large-scale GPU workloads.                                              |                               |