

## Task 1.1 — Describe the XIAO ESP32S3

| البند               | (تفصيل) الموصفات                                                                        | TinyML الأهمية لـ                                                                         |
|---------------------|-----------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|
| Microcontroller     | Espressif ESP32-S3، Xtensa LX7 شرئي النواة، مع تعليمات متوجهة /AI. ([espressif.com][1]) | قدرة حسابية جيدة للاستدلال، وتعليمات متوجهة تسرّع طبقات الشبكات العصبية عبر ESP-NN. ([1]) |
| Operating Frequency | 240 MHz. ([seeedstudio.com][2])                                                         | . زمن استدلال أقصر $\Rightarrow$ تردد أعلى ([2])                                          |
| On-chip SRAM (DRAM) | 512 KB SRAM (+ ~16 KB RTC SRAM). ([alldatasheet.com][3])                                | Tensor Arena هذه الذاكرة تستضيف؛ أهم قيد عملی TFLite Micro الخاصة بـ ([3])                |
| External PSRAM      | XIAO ESP32S3 على إصدارات 8 MB عادة (Sense/Plus). ([kiwi-electronics.com][4])            | الصور؛ يمكن تهيئة الكبيرة buffers لاستخدامها لكنها أبطأ من TFLM SRAM. ([4])               |
| Flash               | 8 MB أو 16 MB (Sense/Plus). ([docs.zephyrproject.org][5])                               | ؛ لا tflite لتخزين برنامجك وملف النموذج. ([5])                                            |
| Wireless            | Wi-Fi 2.4 GHz + BLE 5.0. ([seeedstudio.com][2])                                         | مناسب لإرسال نتائج الاستدلال وتحديثات OTA. ([2])                                          |
| Power Consumption   | ـ وضع السكون العميق حوالي 14 $\mu$ A. ([Paradisetronic.com][6])                         | أساسي للتطبيقات المعتمدة على البطارية بين عمليات الاستدلال. ([6])                         |

Note — Variants: If you have the XIAO ESP32S3 Plus, the flash is typically 16 MB; for the XIAO ESP32S3 and Sense, it's usually 8 MB flash + 8 MB PSRAM. Check your product page or the printing on the box/board.

<https://www.espboards.dev/esp32/xiao-esp32s3-plus>

2) Why these points matter for TinyML in practice

512 KB SRAM is the most important constraint: the Tensor Arena for running TFLite Micro must live in contiguous memory; you can offload parts to PSRAM on the ESP32, but it's still slower, so smart design and quantization are essential.

[https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/docs/memory\\_management.md](https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/docs/memory_management.md)

ESP-NN on the S3 provides accelerated int8 kernels (Conv/Depthwise/FC/Softmax...), which significantly improves latency when using full-integer quantization.

<https://github.com/espressif/esp-nn>

## Task 2 — TFLite & Quantization (Theory)

### 2.1 What TensorFlow Lite (TFLite) does and why it's needed

TFLite is a deployment toolkit designed to run ML models on devices with limited compute, memory, and storage. It solves two key problems. First, it converts a Keras/TensorFlow model into a compact FlatBuffer .tflite file that can be parsed efficiently on the device. Second, it replaces the heavy TensorFlow runtime with a tiny interpreter—TFLite on general edge devices and **TFLite Micro** on MCUs—that contains only the operators your model actually uses.

Edge devices like the XIAO ESP32S3 benefit from TFLite because the binary and model footprint are small enough to fit into constrained flash. On MCUs, TFLite Micro uses static memory planning and pre-allocates a single **tensor arena**, avoiding dynamic allocations at run time. TFLite also enables fast, low-power **int8** inference by providing integer kernels, often with MCU-specific optimized implementations.

### 2.2 What quantization is and what PTQ does

Quantization reduces numerical precision—typically from 32-bit floating point to 8-bit integers—for both weights and activations. **Post-Training Quantization (PTQ)** performs this conversion after training by using a small **representative dataset** sampled from your real input distribution to calibrate value ranges. During calibration, the converter learns a scale and zero-point for each tensor so it can map real values to/from int8 accurately.

Quantization usually yields a model that is about **4× smaller** (32-bit → 8-bit), with real shrink slightly less due to metadata and overhead. Integer operations are typically faster and more energy-efficient than floating-point on MCUs, so latency and power draw improve. Accuracy often drops only slightly (commonly <1–2% for simple vision/classification tasks) and can be minimized with **per-channel** weight quantization for conv/FC layers and by ensuring the representative dataset mirrors your preprocessed inputs.

## Task 3 — Conversion & Quantization (Theory only)

### 3.1 MLP → Full-Integer TFLite (conceptual steps)

Begin with a trained Keras MLP. Choose **full-integer int8** quantization so that both weights and activations use int8 and the entire pipeline runs with integer kernels. Prepare a representative dataset of roughly **100–1000** samples, passed through the **same preprocessing** used at inference time, to cover the normal dynamic range for calibration. Configure the converter to optimize for size/latency, restrict supported ops to built-in int8, and set `inference_input_type/inference_output_type` to int8 to keep the graph fully integer. For an MLP, expect the .tflite file to be roughly **¼ the float size**—e.g., a ~0.9 MB float model might become

**~0.225 MB** ( $\pm$  overhead)—with accuracy typically very close to the float baseline if calibration data matches deployment data.

### 3.2 CNN → Full-Integer TFLite (conceptual steps)

The pipeline mirrors the MLP case, with two practical notes. Use **per-channel** quantization for convolution weights to better preserve accuracy, and pay attention to **activation memory**, since intermediate feature maps can dominate RAM on MCUs. For a small CNN of ~0.2 MB (float32 weights), a full-int8 TFLite model often lands near **~0.05 MB** ( $\pm$  overhead). With a good representative set, accuracy commonly remains close to float on small datasets.

#### Common pitfalls (both models)

Calibration that does not reflect **real, preprocessed inputs** leads to poor quantization ranges and larger accuracy drops. Leaving parts of the model in float (due to unsupported ops) results in **mixed precision**, which can block MCU deployment. Finally, do not equate **file size with RAM needs**: on MCUs, the main RAM consumer is the **tensor arena** (activations + scratch), while the model binary itself typically resides in flash.

### Task 4 — Deployment Feasibility (Theory)

#### 4.1 Size comparison & SRAM reasoning (theoretical, based on your earlier sizes)

| Model | Keras Size<br>(Float32, MB) | Quantized TFLite<br>Size (int8, MB,<br>$\sim\frac{1}{4}\times$ ) | SRAM<br>Constraint<br>(XIAO) | Can Model Fit in SRAM?*                                                    |
|-------|-----------------------------|------------------------------------------------------------------|------------------------------|----------------------------------------------------------------------------|
| MLP   | $\approx 0.90$              | $\approx 0.225$                                                  | <b>512 KB</b>                | Likely yes, if input dims are small and arena $<\sim 300\text{--}350$ KB   |
| CNN   | $\approx 0.20$              | $\approx 0.050$                                                  | <b>512 KB</b>                | Likely yes, but watch activation maps; still typically fine for small CNNs |

On XIAO ESP32S3, the **.tflite** model lives in flash while **SRAM** is consumed by the tensor arena (peak activations + scratch), so keep the arena well below 512 KB—around  $\leq 300$  KB leaves headroom for stack and I/O. With **full-int8** models, file sizes near **~0.225 MB (MLP)** and **~0.050 MB (CNN)** are plausible and both can run if activations fit; with small inputs (e.g., 28×28 grayscale) and compact layers you can typically get  $< 100$  ms per inference on the dual-core 240 MHz MCU using optimized int8 kernels—i.e., real-time classification is feasible. You’ll struggle with large input tensors or very wide/deep layers that inflate activations, with ops lacking int8 support (forcing float), or with poor calibration. Best practices: keep the graph fully int8, build a representative dataset that matches deployment preprocessing, profile peak

arena usage and tune width/stride/padding; if you’re near limits, prune channels, reduce filters or input size, or use depthwise-separable convolutions.

## Task 4 — Deployment Feasibility

| Model | Keras Size<br>(Float32, MB) | Quantized TFLite Size<br>(int8, MB) | SRAM Constraint<br>(XIAO) | Can Model Fit in<br>SRAM?* |
|-------|-----------------------------|-------------------------------------|---------------------------|----------------------------|
| MLP   | 1.283 MB                    | 0.107 MB                            | 512 KB                    | Yes (very likely)          |
| CNN   | 0.691 MB                    | 0.059 MB                            | 512 KB                    | Yes (very likely)          |

## 4.2 Conclusion on XIAO ESP32S3

The main constraint is the **512 KB SRAM**. With fully quantized **int8** models and small inputs such as **28×28×1**, the expected tensor arena is very small, making both models feasible on the XIAO ESP32S3. In terms of performance, the dual-core processor running up to **240 MHz** with optimized **int8** kernels can easily perform **MNIST-scale inference in under 100 ms**—often much faster—achieving near real-time operation.

