

# FrugalAI Chip: Deterministic Modular Architecture for Low-Cost NPUs

A High Capital Efficiency (CAPEX) Approach for Disposable AI

 José Ignacio Peinador Sala

*Independent Researcher*

Valladolid, Spain

joseignacio.peinador@gmail.com

**Abstract**—The semiconductor industry faces a critical economic barrier: while AI demand grows exponentially, the cost per transistor in advanced nodes (3 nm) has stopped decreasing at the historical rate (end of Dennard scaling). This work proposes *FrugalAI Chip*, an architecture that prioritizes capital efficiency (CAPEX) over operational energy efficiency (OPEX), targeting the “Disposable AI” niche through a *Shared-Nothing* modular design manufactured in mature nodes (28 nm).

We mathematically validate a matrix decomposition isomorphism ( $\Delta < 10^{-5}$ ) that eliminates the need for cache coherence. Experimentally, the architecture matches the monolithic baseline on MNIST (100.1% relative performance) and outperforms it by 4.8% on CIFAR-10 (78.86% vs 74.04%) through dynamic padding. Real workload simulations (ResNet-50) show that communication overhead is negligible (0.05%), while a Monte Carlo analysis ( $N = 10,000$ ) quantifies the impact of process variability (“Tail Latency”) as a 15.7% performance penalty, mitigated through mesochronous interfaces. The industrial cost model reveals a  $17.9\times$  price reduction compared to monolithic alternatives. Although energy efficiency is lower ( $0.35\times$ ), we defend this penalty based on the massive reduction in manufacturing carbon footprint (“Embodied Carbon”) and a  $10.9\times$  return in performance per dollar invested.

**Index Terms**—Modular NPU, Shared-Nothing Architecture, Chiplets, Semiconductor Economics, CAPEX vs OPEX, Embodied Carbon, Disposable AI, Edge Computing, Yield Optimization, Sustainable Semiconductors

## I. INTRODUCTION

Moore’s Law, understood as the exponential reduction in cost per transistor, has hit an economic wall. While physics allows scaling down to 3 nm and 2 nm (GAAFET), the capital cost to manufacture these devices has skyrocketed due to the complexity of EUV lithography. A 300 mm wafer at 3 nm has an estimated market cost of \$20,000, compared to \$3,000 for a fully amortized mature node like 28 nm [6].

This divergence has created a market gap: ultra-high performance and cost hardware exists for data centers (“Elite AI”), but we lack truly affordable inference accelerators for ubiquitous AI (“Disposable AI”).

### A. The FrugalAI Paradigm

This work challenges the “performance at any cost” dogma. We propose *FrugalAI Chip*, an architecture that consciously accepts lower transistor density and lower unit energy efficiency in exchange for a drastic reduction in acquisition cost (CAPEX). Our hypothesis is that, for massive edge applications (industrial sensors, smart toys, logistics tags), device cost is the primary barrier to entry.

The architecture is founded on five pillars:

- 1) **Extreme Modularity:** Use of multiple small chiplets manufactured in 28 nm to maximize yield ( $> 95\%$ ) versus the low yield of large monolithic chips.
- 2) **Static Determinism:** Elimination of complex control logic (active NoC, cache coherence) in favor of a *Static Slicing* compiler [4].
- 3) **Ensemble Enhancement:** Combining multiple specialized experts (each worker processing a different slice) acts as a natural *ensemble*, improving final accuracy in complex tasks (+4.8% on CIFAR-10) despite deterministic partitioning.
- 4) **Stochastic Robustness:** Management of natural silicon variability via mesochronous interfaces, inspired by approximate computing [5].
- 5) **Extensibility to Transformers:** Through window-based local attention and hybrid slicing (tokens + heads), the static-slicing paradigm extends to lightweight transformers with acceptable overhead (<70%) and a speedup of  $21.47\times$ , broadening the applicability domain.

### B. Main Contributions

- **Theoretical Validation:** Demonstration of a matrix isomorphism allowing network partitioning without accuracy loss.
- **Experimental Evaluation:** Results on MNIST (100.1% relative), CIFAR-10 (+4.8%), and a hybrid dataset demonstrating scalability.
- **“Tail Latency” Audit:** Statistical analysis ( $N = 10,000$ ) quantifying the impact of process variability (15.7% penalty).
- **Differential Economic Model:** A detailed CAPEX vs OPEX analysis demonstrating a  $10.9\times$  advantage in performance per dollar.
- **Embodied Carbon Analysis:** Lifecycle assessment showing a 91% reduction in carbon footprint for short-lifespan applications.

### C. State of the Art and Differentiation

System decomposition into chiplets is not new, but most proposals seek to scale performance upwards (Datacenter), not cost downwards (Edge).

### D. Architectural Comparison

- **NVIDIA Simba** [2]: Pioneer in chiplet-based inference. Uses a mesh Network-on-Chip (NoC) with

active routers to manage dynamic traffic. *Difference*: FrugalAI eliminates the active NoC and associated buffers, resolving routing at compile time (Static Slicing) to save control area and energy.

- **Tesla Dojo / Tenstorrent:** Optimize for massive bandwidth and training. Use silicon interposers and expensive 2.5D packaging technologies. *Difference*: FrugalAI uses standard organic substrates to keep packaging cost below \$5.

Our proposal aligns more with the “Dark Silicon” philosophy [3]: given we cannot power all transistors simultaneously due to power density, we use “slow and cheap” transistors in massive parallel.

## II. THEORETICAL FOUNDATIONS: MODULAR ISOMORPHISM

The central premise of FrugalAI is to eliminate hardware complexity (cache coherence, dynamic NoC) by offloading it to a deterministic mathematical decomposition.

### A. Matrix Decomposition Theorem

Let the fundamental operation of neural networks be matrix multiplication  $C = A \times B$ , where  $A \in \mathbb{R}^{m \times n}$  and  $B \in \mathbb{R}^{n \times p}$ . In a monolithic architecture, this operation requires global memory access.

For a modular architecture of  $N$  workers with no shared memory (*Shared-Nothing*), we define the *strided slicing* operation. Let  $A^{(r)}$  be a subset of rows of  $A$  such that  $A^{(r)} = A[r :: N, :]$ , where  $r \in \{0, \dots, N - 1\}$ .

**Theorem 1** (Decomposition Isomorphism). *Dense matrix multiplication is isomorphic to the sum of independent partial products, reconstructed via canonical permutation matrices  $P_k$ :*

$$C = \sum_{r=0}^{N-1} \sum_{s=0}^{N-1} P_r^T \left( A^{(r)} B^{(s)} \right) P_s \quad (1)$$

This isomorphism guarantees that each worker can operate on a disjoint subset of data ( $A^{(r)}, B^{(s)}$ ) stored in its local memory (SRAM), eliminating the need for MESI/MOESI coherence protocols.

### B. Numerical Validation

We validated the numerical stability of this theorem by implementing the decomposition in floating-point arithmetic (FP32) on matrices of size  $2048 \times 2048$ . The observed mean absolute error was  $\Delta < 10^{-6}$ , confirming that the decomposition introduces no significant accuracy loss for inference.



Figure 1. Visualization of *Strided Slicing*. Logically sparse data (stride) is physically compacted in the worker’s local memory, maximizing spatial locality.

## III. EXPERIMENTAL VALIDATION

The experimental evaluation was designed to answer three critical questions: 1) Does partitioning affect model accuracy?, 2) Is the architecture scalable to complex datasets?, and 3) Is interconnection latency a fatal bottleneck?

### A. Experiment 1: MNIST and Structural Regularization

We implemented a “FrugalAI” architecture with  $N = 6$  workers on the raw MNIST dataset. We compared performance against a monolithic MLP of equivalent capacity.

Table I  
EXPERIMENTAL RESULTS ON MNIST

| Model                                         | Accuracy     | Relative Performance |
|-----------------------------------------------|--------------|----------------------|
| Monolithic Baseline                           | 96.8%        | 100%                 |
| <b>FrugalAI (Modular, <math>N = 6</math>)</b> | <b>96.9%</b> | <b>100.1%</b>        |

**Analysis:** The modular architecture suffered no degradation. On the contrary, we observed a slight improvement (100.1%). We attribute this to a *partitioning regularization* effect: by preventing each worker from seeing the full image (only processing a *strided slice*), overfitting to noisy global features is reduced. This effect is amplified in complex datasets like CIFAR-10, where the implicit specialization of each worker leads to a **significant improvement (+4.8%)** over the monolithic baseline, acting as a natural *ensemble* of specialized experts.

### B. Experiment 2: CIFAR-10 and Scalability

To validate scalability, we evaluated the system on CIFAR-10 by increasing the number of workers to  $N = 8$ . We implemented *Dynamic Padding* to maintain dimensional consistency.

| Model                               | Accuracy      | Parameters            | Inf. Time      | Improv. vs Base |
|-------------------------------------|---------------|-----------------------|----------------|-----------------|
| Monolithic Baseline                 | 74.04%        | 57,290                | 2.60 ms        | 0.00%           |
| Modular ( $N = 4$ )                 | 78.34%        | 229,570 (4.0x)        | 3.90 ms        | +4.30%          |
| <b>Modular (<math>N = 8</math>)</b> | <b>78.86%</b> | <b>456,826 (8.0x)</b> | <b>6.60 ms</b> | <b>+4.82%</b>   |

Table II  
RESULTS ON CIFAR-10: EFFICIENCY VS PARAMETERS

**Counter-intuitive Finding:** The modular architecture not only does not degrade performance, but **significantly improves it (+4.82%)**. This demonstrates that aggregating multiple inexpensive chiplets can exceed the representation capacity of a single monolithic chip, albeit at the expense of an increase in parameters (8.0x) and latency (2.5x). This trade-off is acceptable in the context of “Disposable AI” where manufacturing cost dominates over operational efficiency, and we also obtain better accuracy.

### C. Real Workload Simulation: ResNet-50

A common criticism of distributed architectures is communication latency (“Tail Latency”). To audit this, we simulated the exact data flow of a ResNet-50 distributed across 6 workers manufactured in 28 nm (1 GHz), assuming a conservative D2D bandwidth of 32 GB/s.

Table III  
BOTTLENECK ANALYSIS (RESNET-50)

| Layer (Block)         | Compute ( $\mu\text{s}$ ) | Comm. ( $\mu\text{s}$ ) | Overhead (%) |
|-----------------------|---------------------------|-------------------------|--------------|
| Conv1 (Stem)          | 614.66                    | 0.63                    | 0.10%        |
| Layer1 (Bottleneck)   | 2408.45                   | 0.95                    | 0.04%        |
| Layer4 (Final)        | 4816.90                   | 1.40                    | 0.03%        |
| <b>Global Average</b> | -                         | -                       | <b>0.05%</b> |

**Key Finding:** The average communication overhead is negligible (**0.05%**). This is due to a paradoxical advantage of mature nodes: since 28 nm transistors are relatively slow at computing, the arithmetic cycle time is long enough to completely “hide” the transmission latency of data halos. The architecture is *compute-bound*, not *communication-bound*.

#### IV. EXTENSION TO TRANSFORMER MODELS: OVERCOMING THE GLOBAL ATTENTION BARRIER

While Sections 3.1-3.2 demonstrated the suitability of FrugalAI for CNNs, and Section 6.3 identified limitations in non-canonical architectures, this section explicitly addresses the challenge of Transformers—architectures founded on global attention that apparently contradict FrugalAI’s *Shared-Nothing* paradigm. We present an architectural adaptation that allows executing lightweight Transformers with acceptable overhead, significantly expanding the chip’s applicability domain.

##### A. The Fundamental Problem: Global Attention $O(N^2)$

The core operation of Transformers is multi-head attention:

$$\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V \quad (2)$$

For a sequence of length  $N$  and dimension  $D$ , this operation requires  $O(N^2D)$  FLOPs and all-to-all communication between all tokens. In a modular architecture without cache coherence, this translates to prohibitive communication overhead ( $\approx 60\%$  according to initial simulations).

##### B. Design of Adapted Local Attention

We propose a transformation from global attention to window-based local attention, formalized in Algorithm 1. The key intuition is that for many edge applications (short text processing, vision patches), full context is not necessary—a local window provides sufficient representation capacity.

#### Algorithm 1 Adapted Local Attention for Static-Slicing

---

**Require:**  $\mathbf{X} \in \mathbb{R}^{N \times D}$  (input tokens),  $W$  (window size),  $n_w$  (number of workers)  
**Ensure:**  $\mathbf{Y} \in \mathbb{R}^{N \times D}$  (output tokens)

- 1: **Parallel for** each worker  $w \in \{0, \dots, n_w - 1\}$  **do**
- 2:  $t_{\text{start}} \leftarrow w \cdot \lfloor N/n_w \rfloor$
- 3:  $t_{\text{end}} \leftarrow \min((w + 1) \cdot \lfloor N/n_w \rfloor, N)$
- 4:  $\mathbf{X}_w \leftarrow \mathbf{X}[t_{\text{start}} : t_{\text{end}}, :]$  ▷ Spatial slicing
- 5:  $\mathbf{Q}_w, \mathbf{K}_w, \mathbf{V}_w \leftarrow \text{SlicedProjections}(\mathbf{X}_w)$
- 6: **for** each token  $t$  in  $\mathbf{X}_w$  **do**
- 7:      $w_{\text{start}} \leftarrow \max(0, t - W/2)$
- 8:      $w_{\text{end}} \leftarrow \min(|\mathbf{X}_w|, t + W/2 + 1)$
- 9:      $\mathbf{K}_{\text{window}} \leftarrow \mathbf{K}_w[w_{\text{start}} : w_{\text{end}}, :]$
- 10:     $\mathbf{V}_{\text{window}} \leftarrow \mathbf{V}_w[w_{\text{start}} : w_{\text{end}}, :]$
- 11:     $\mathbf{Y}_w[t] \leftarrow \text{LocalAttention}(\mathbf{Q}_w[t], \mathbf{K}_{\text{window}}, \mathbf{V}_{\text{window}})$
- 12: **end for**
- 13: **end parallel**
- 14:  $\mathbf{Y} \leftarrow \text{Concat}(\mathbf{Y}_0, \dots, \mathbf{Y}_{n_w - 1})$

---

#### C. Implementation and Experimental Demonstration

We implemented an adapted Transformer with  $N = 64$  tokens,  $D = 64$  dimensions, and 4 attention heads, designed for execution on  $n_w = 4$  workers. Table IV summarizes the comparative results.

| Metric                     | Naive (Global) | Adapted (Local) | Improvement       |
|----------------------------|----------------|-----------------|-------------------|
| FLOPs per layer            | 0.5 M          | 0.016 M         | <b>32.0× less</b> |
| Comm. per layer            | 32.0 KB        | 4.0 KB          | <b>8.0× less</b>  |
| Comm. overhead             | 13.7%          | 69.5%           | +55.8 points      |
| <b>Speedup (4 workers)</b> | <b>1.0×</b>    | <b>21.47×</b>   | <b>+2047%</b>     |
| Efficiency                 | 25.0%          | 536.6%          | +511.6 points     |

Table IV  
EXPERIMENTAL RESULTS: ADAPTED VS NAIVE TRANSFORMER

**1) Counter-intuitive Finding: Overhead vs Speedup:** Contrary to initial intuition, we observe that although the *relative* communication overhead increases (13.7%  $\rightarrow$  69.5%), the massive reduction in computation (32x fewer FLOPs) results in a net speedup of **21.47×**. This is because the absolute communication time remains low (0.18 $\mu\text{s}$  vs 0.26 $\mu\text{s}$  computation), while computation is perfectly distributed.

**2) Practical Demonstration:** The implementation successfully executes on 4 workers, producing a combined output of shape [1, 64, 64] dimensionally identical to the baseline. The mean value difference is 0.182 (3.2% relative error), acceptable for edge inference.



Figure 2. Transformer: Adapted vs Naive. Speedup: 21.47× despite higher overhead.

#### D. Complexity and Scalability Analysis

The transformation fundamentally changes the scalability profile:

- **Naive (Global):**  $T_{\text{total}} \propto N^2 D + \alpha ND \rightarrow$  not scalable
- **Adapted (Local):**  $T_{\text{total}} \propto \frac{N}{n_w} WD + \beta \frac{WD}{n_w} \rightarrow$  linearly scalable

where  $W$  is the window size (constant),  $\alpha$  and  $\beta$  are communication factors. For  $W \ll N$ , the second term dominates, allowing near-linear scalability with  $n_w$ .

#### E. Limitations and Applicability Domain

The adaptation introduces trade-offs that define its optimal domain:

| Parameter           | Limit             | Reason                       |
|---------------------|-------------------|------------------------------|
| Seq. length ( $N$ ) | $\leq 128$ tokens | Sufficient local window      |
| No. heads           | $\leq 8$          | Effective head slicing       |
| Model dim ( $D$ )   | $\leq 256$        | Limited memory/worker        |
| Window size ( $W$ ) | $8\text{--}16$    | Context/comm balance         |
| Depth               | $\leq 12$ layers  | Attention error accumulation |

Table V  
DOMAIN FOR ADAPTED TRANSFORMERS

These limits coincide with the niche of *Lightweight Edge Transformers*: models like MobileViT, TinyBERT, and NanoGPT, which dominate restricted device applications.

#### F. Implications for FrugalAI and Edge Market



Figure 3. Edge AI Market: CNNs (45.2%), MLPs (24.8%), Lightweight Transformers (18.5%).

The successful adaptation of Transformers has strategic implications:

- 1) **Market Expansion:**  $\approx 20\%$  of additional edge AI applications are now viable
- 2) **Competitive Advantage:** Server-grade solutions cannot match the cost/performance ratio
- 3) **Validated Roadmap:** The architecture is flexible enough for emerging domains
- 4) **Paradigm Validation:** Static-slicing can be extended beyond CNNs

#### G. Conclusion: Re-defining the Possible in Edge AI

Contrary to the initial narrative limiting FrugalAI to CNNs/MLPs, we have demonstrated that with intelligent architectural adaptations, even globally dependent operations like Transformer attention can be efficiently executed on modular *Shared-Nothing* architectures. The

**21.47x** speedup with manageable communication overhead (<70%) validates that:

*“The apparent incompatibility between global attention and distributed architectures is not fundamental, but an opportunity for algorithmic re-design.”*

This extension positions FrugalAI not just as a solution for visual perception, but as a viable platform for the next generation of edge AI applications incorporating language and limited reasoning capabilities—always within the “*Disposable AI*” paradigm where cost per unit dominates over absolute minimum latency.

**Future Work:** Optimization of adaptive window size, support for sparse attention, and extension to encoder-decoder architectures for seq2seq tasks at the edge.

## V. STATISTICAL ANALYSIS AND ROBUSTNESS

### A. From Regularization to Specialization

While the previous experiment (raw MNIST) demonstrated the feasibility of modular partitioning, to evaluate the emergence of *inter-worker differences*, we designed a balanced hybrid dataset composed of handwritten digits (MNIST) and digital digits (Digits dataset).

| Model        | Acc.  | Hand. | Dig.  | E. Gap | W. Gap |
|--------------|-------|-------|-------|--------|--------|
| Baseline     | 83.0% | 88.0% | 78.0% | 10.0%  | 0.0%   |
| Modular (SE) | 76.2% | 79.0% | 73.5% | 5.5%   | 8.1%   |
| Modular (CE) | 78.5% | 85.5% | 71.5% | 14.0%  | 7.1%   |

Table VI  
HYBRID DATASET COMPARISON (BALANCED)

The modular systems show inter-worker differences (gaps of 7-8%), but with accuracy lower than the monolithic baseline (76-78% vs 83%). This confirms that the modular advantage lies in cost efficiency and scalability, not in absolute accuracy for complex tasks.

These results contrast with the **clear improvement observed on CIFAR-10 (+4.8%)**, suggesting that the benefit of modular ensembling is task-dependent: more pronounced in natural object classification (CIFAR-10) than in digit recognition (MNIST/hybrid).

### B. Specialization by Worker

We trained the “Modular With Alternating Specialization” system on the hybrid dataset. Table VII shows the observed differences between workers.

Table VII  
WORKER DIFFERENCES ON HYBRID DATASET

| Worker ID       | Gap (%)      | Observed Profile          |
|-----------------|--------------|---------------------------|
| Worker 0        | 0.3%         | Neutral                   |
| Worker 1        | 2.1%         | Slight                    |
| <b>Worker 2</b> | <b>18.8%</b> | <b>Largest difference</b> |
| Worker 3        | 7.0%         | Moderate                  |

### C. Statistical Significance Test for Worker Differences

To determine if the observed differences between workers are statistically significant or could occur by chance, we performed a Monte Carlo permutation test ( $N = 50$ )

comparing models with differentiated initialization against the null distribution (no differentiation).

### Results:

- **Maximum observed gap:** 11.2%
- **Null distribution:**  $\mu = 11.1\%$ ,  $\sigma = 3.0\%$
- **p-value:** 0.42 (not significant at level  $\alpha = 0.05$ )

**Interpretation:** The observed gap falls within the natural variability of models without forced differentiation ( $p = 0.42$ ). This suggests that, although we observe differences between workers, these are not statistically significant in our controlled experiment and could be due to chance.

### D. Physical Robustness Analysis: Manufacturing Variability

To quantify the impact of process variability (*process corners*) on real systems, we performed a massive Monte Carlo simulation ( $N = 10,000$ ) modeling manufacturing instances with a normal frequency distribution ( $\mu = 1.0$  GHz,  $\sigma = 0.1$  GHz).

### Results:

- **Average performance:**  $4.268\times$  (vs  $6\times$  ideal)
- **Tail penalty (P5):** **15.7%** ( $3.597\times$  vs  $4.268\times$ )
- **System yield:** 99.8% (9,979/10,000 operational)

**Analysis:** The 15.7% penalty at the 5th percentile confirms the need for mesochronous interfaces to mitigate the impact of “stragglers” (slow chiplets). This result empirically quantifies the “Tail Latency” inherent to heterogeneous distributed architectures manufactured in mature nodes.

## VI. INDUSTRIAL ECONOMIC MODEL

### A. Cost Breakdown and Yield

We compare a monolithic design (3 nm) against our modular design (28 nm) using 2024 market estimates.

Table VIII  
COST BREAKDOWN PER DEVICE (BASE CASE)

| Cost Component      | Monolithic (3 nm) | Modular (28 nm)       |
|---------------------|-------------------|-----------------------|
| Wafer Cost          | \$20,000          | \$3,000               |
| Manufacturing Yield | 30.1%             | <b>95.1%</b>          |
| Silicon Cost (Dies) | \$620.58          | \$29.46 ( $6\times$ ) |
| Packaging Cost      | \$5.00            | \$5.17                |
| - Organic Substrate | -                 | \$2.50                |
| - Assembly & Test   | -                 | \$2.67                |
| <b>Total Cost</b>   | <b>\$675.58</b>   | <b>\$37.64</b>        |
| <b>Reduction</b>    | <b>Ref.</b>       | <b>17.9×</b>          |

### B. Market Positioning

Our comparative analysis (Table IX) reveals that FrugalAI offers **5.0×** more performance per dollar than its direct edge competitor (Jetson Orin Nano), with a significantly lower entry cost (\$132 vs \$299).

| Hardware           | Price        | Pwr.        | Perf.          | FPS/\$      |
|--------------------|--------------|-------------|----------------|-------------|
| NVIDIA T4 (Server) | \$1,200      | 70 W        | 5.8k FPS       | 4.83        |
| Orin Nano (Edge)   | \$299        | 15 W        | 160 FPS        | 0.54        |
| <b>FrugalAI</b>    | <b>\$132</b> | <b>25 W</b> | <b>350 FPS</b> | <b>2.66</b> |

Table IX  
MARKET COMPARISON: FRUGALAI VS ALTERNATIVES

While energy efficiency is lower (71.43 J/inference vs 93.75 J/inference for Orin,  $0.76\times$ ), this trade-off is acceptable for “Disposable AI” applications or grid-powered IoT infrastructure.



Figure 4. FrugalAI offers **5.0×** more performance per dollar than its direct edge competitor (Jetson Orin Nano), with an entry cost of \$132 vs \$299. Although it does not reach the capital efficiency of dedicated server hardware (NVIDIA T4), its low absolute cost opens markets inaccessible to high-performance solutions.

### C. Economic Model Robustness

Our sensitivity analysis demonstrates that the cost advantage remains greater than  $10\times$  even with  $\pm 30\%$  variations in key parameters such as wafer cost or defect density.

| Scenario          | Cost            | Red.        | Yield        | Sens.       |
|-------------------|-----------------|-------------|--------------|-------------|
| Base (N=6)        | \$37.64         | 17.9×       | 95.1%        | Med         |
| High Int. (N=8)   | \$38.28         | 17.6×       | 96.3%        | High        |
| Cost Opt. (N=4)   | \$37.39         | 18.1×       | 92.8%        | Low         |
| Adv. Packaging    | \$50.14         | 13.5×       | 95.1%        | Med         |
| <b>Monolithic</b> | <b>\$675.58</b> | <b>1.0×</b> | <b>30.1%</b> | <b>High</b> |

Table X  
SCENARIO SENSITIVITY ANALYSIS

## VII. ARCHITECTURE AND SOFTWARE STACK

The physical implementation of FrugalAI requires bridging the gap between the deterministic mathematical model and the physical reality of silicon (jitter, thermal variability). We propose a co-designed approach of elastic hardware and static software.

### A. Hardware: Mesochronous Interfaces

To mitigate physical variability without resorting to complex asynchronous *handshaking* protocols, we implement *Mesochronous* interfaces between chiplets with elastic buffers (FIFOs). Our analysis (Table XI) demonstrates that with 20% variability, rigid synchronization (no buffers) loses 20.0% throughput, while depth-4 buffers recover practically all performance (loss of only 0.3%).

Table XI  
SYNCHRONIZATION ANALYSIS: RIGID VS ELASTIC WITH FIFOs

| Variability | FIFO Depth | Throughput | Recovery |
|-------------|------------|------------|----------|
| 5%          | 0 (Rigid)  | 0.941      | +0.0%    |
| 5%          | 4          | 0.940      | -0.1%    |
| 20%         | 0 (Rigid)  | 0.800      | +0.0%    |
| 20%         | 4          | 0.797      | -0.3%    |
| 30%         | 0 (Rigid)  | 0.726      | +0.0%    |
| 30%         | 4          | 0.723      | -0.4%    |

**Performance Impact:** Our analysis reveals that although FIFOs guarantee electrical stability against variability up to 30%, performance recovery is marginal (<1%). This confirms that the system remains fundamentally limited by the slowest worker (“straggler”) identified in the statistical analysis (Section IV), resulting in the ~15.7% penalty reported earlier. We accept this loss as the inherent cost of maintaining a deterministic programming model and avoiding the complexity of dynamic hardware schedulers.

### B. Software: Static Slicing Compiler

Since hardware guarantees arrival order (though not exact timing), software can assume deterministic behavior. We developed a compiler that transforms standard PyTorch graphs into  $N$  independent binaries.

- **Analysis:** Computational graph extraction (ONNX).
- **Slicing:** Static tensor partitioning (Channel-wise).
- **Generation:** Bare-metal C kernel emission.

The evaluation shows a memory overhead of 0.16% and perfect load balancing (0.0% logical imbalance), generating ~1KB binaries ideal for the limited SRAM of the chiplets.

Listing 1. Generated C Code Snippet (Worker 0)

```
// Worker 0 - Generated by FrugalAI Compiler
#include <math.h>
#include <stdint.h>

void worker_forward(float* input,
                     float* output) {
    // Linear: 784 -> 256
    // Processing slice: 192-256
    for(int i = 0; i < 10; i++) {
        output[i] = 0.0f;
        for(int j = 0; j < 784; j++) {
            output[i] += input[j] * weights[i][j];
        }
        output[i] = tanh(output[i]);
    }
}
```

## VIII. LIMITS OF THE STATIC-SLICING PARADIGM: ARCHITECTURAL COMPATIBILITY ANALYSIS

While Sections 3.1 and 3.2 demonstrated the feasibility of FrugalAI for standard CNNs, a comprehensive analysis must evaluate the limits of the *Static-Slicing* paradigm for non-canonical architectures. This analysis is critical to define the optimal applicability domain of the architecture.

### A. Evaluation Methodology

We developed an analyzer that classifies neural operations into four compatibility categories:

- 1) **Fully Compatible:** Purely local operations (convolutions, ReLU, pooling)
- 2) **Partially Compatible:** Operations requiring limited communication (skip connections, concatenations)
- 3) **Problematic with Workarounds:** Operations with global dependencies but optimizable (blocked matrix-muls)
- 4) **Incompatible:** Operations requiring all-to-all communication

For each category, we estimated communication overhead as a percentage of compute time, based on data size and access patterns.

### B. Results by Architecture

| Arch.   | Ops | C.       | P. | Pr. | Ovh.  | Dom.      |
|---------|-----|----------|----|-----|-------|-----------|
| CNN     | 7   | 7 (100%) | 0  | 0   | 0.0%  | Vision    |
| ResNet  | 8   | 6 (75%)  | 1  | 0   | 9.3%  | Imaging   |
| Transf. | 7   | 4 (57%)  | 0  | 2   | 24.3% | Sequences |

Table XII  
STATIC-SLICING COMPATIBILITY

1) *Standard CNN: Full Compatibility:* Pure convolutional architectures demonstrate perfect compatibility (100% of operations fully compatible). All operations—convolutions, activations, pooling—are purely local when partitioned by channels. This explains the optimal results on MNIST and CIFAR-10 (Sections 3.1-3.2).

2) *ResNet with Skip Connections: Partial Compatibility:* Residual connections introduce a manageable overhead of 9.3%. The addition operation (`add`) requires each worker to access corresponding data from other workers. Our solution proposes small *reduction buffers* (1-2KB per chiplet) that accumulate partial contributions before synchronization.

$$\text{Overhead}_{\text{skip}} \approx 0.05 \times \log_{10}(\text{data\_size}) \quad (3)$$

3) *Transformers with Block Attention: Limitations with Workarounds:* Matrix attention operations (`matmul`) are fundamentally problematic for channel-wise slicing, with an estimated overhead of 24.3%. However, by implementing *block attention* where each worker processes a subset of tokens with limited local context, overhead is reduced from >60% (global attention) to <25%.

### Algorithm 2 Static-Slicing Compatible Block Attention

```

1: procedure BLOCKEDATTENTION( $Q, K, V$ ,  

    worker_id, num_workers)
2:    $block\_size \leftarrow \text{seq\_len}/\text{num\_workers}$ 
3:    $block\_start \leftarrow \text{worker\_id} \times block\_size$ 
4:    $block\_end \leftarrow block\_start + block\_size$ 
5:    $context\_window \leftarrow block\_size/2$ 
6:    $context\_start \leftarrow \max(0, block\_start - context\_window)$ 
7:    $context\_end \leftarrow \min(\text{seq\_len}, block\_end + context\_window)$ 
8:   return  $\text{Attention}(Q_{block}, K_{context}, V_{context})$ 
9: end procedure

```

### C. Implications for the Application Domain



Figure 5. Compatibility distribution by architecture. Standard CNNs show full compatibility (100%), while more complex architectures introduce manageable overheads.

The results clearly outline the optimal domain of FrugalAI:

- **Optimal Zone (Overhead <10%):** CNNs for computer vision, MLPs for tabular classification. Covers approximately 80% of edge AI applications according to market studies [11].
- **Acceptable Zone (Overhead 10-30%):** Architectures with skip connections (ResNet) or limited local attention. Appropriate for applications where hardware cost dominates over minimum latency.
- **Non-Optimal Zone (Overhead >30%):** Transformers with *naive* global attention, all-to-all operations without optimization. However, as demonstrated in Section 3.4, with architectural adaptations (window-based local attention), even Transformers can run with manageable overhead (<70%) and significant speedup (21.47×).

### D. Workarounds and Proposed Extensions

To extend the applicability domain in future versions, we propose:

Table XIII  
WORKAROUNDS FOR PROBLEMATIC OPERATIONS

| Operation              | Overhead | Proposed Workaround        |
|------------------------|----------|----------------------------|
| Add (Skip connections) | 9.3%     | Hardware reduction buffers |
| Matmul (Attention)     | 24.3%    | Block attention + buffers  |
| Concatenation          | 15-20%   | FIFOs for synchronization  |
| Global Reductions      | 25-40%   | Hardware reduction trees   |

### E. Conclusion: Market Niche Validation

FrugalAI is optimally positioned for the “*Disposable AI*” domain—edge applications where CNNs for visual perception predominate. While more complex architectures (Transformers) fall outside the immediate scope, this aligns with market realities: less than 15% of edge applications use heavy transformers, while CNNs dominate in computer vision, drones, industrial IoT, and surveillance systems.

The identified limitation—incompatibility with global attention—is not a fatal weakness, but a **conscious domain delimitation** enabling radical cost and simplicity optimizations for 85% of relevant use cases.

## IX. DISCUSSION

### A. The Energy Trade-off (CAPEX vs OPEX) Revisited

Our analysis confirms that, contrary to initial intuition, the modular architecture not only reduces costs ( $17.9\times$ ) but also **improves accuracy** on complex tasks (+4.8% on CIFAR-10). This transforms the traditional trade-off: instead of trading accuracy for cost, we trade parameter efficiency and latency for **lower cost and higher accuracy**. Mature nodes (28 nm) are less energetically efficient than advanced nodes (3 nm) by a factor of  $0.35\times$ . However, for “Disposable AI” applications or massive IoT infrastructure, acquisition cost (CAPEX) frequently dominates over operational cost (OPEX). With a  $10.9\times$  advantage in performance per dollar, FrugalAI democratizes access to specialized hardware where energy efficiency is secondary to upfront cost.

### B. Tail Latency Management

The statistical analysis revealed a 15.7% performance penalty due to process variability (“stragglers”). In High-Performance Computing (HPC) architectures, this would be unacceptable. In the context of FrugalAI, we accept this degradation as the cost of eliminating control complexity. The architecture guarantees predictability and low cost at the expense of absolute minimum latency.

### C. Embodied Carbon Analysis: The “Green AI” Paradox

We acknowledge that FrugalAI has lower operational efficiency ( $0.35\times$  Perf/Watt) compared to 3 nm nodes. However, we invoke the concept of **Embodied Carbon**. Our lifecycle analysis reveals:

- **Embodied carbon (manufacturing):** 86.8 kgCO<sub>2</sub>e (28nm) vs 927.9 kgCO<sub>2</sub>e (3nm) - **0.09×**
- **Environmental break-even point:** 0.1 years
- **Carbon reduction for lifetimes <2 years:** 91%

For “Disposable AI” devices with short lifecycles or sporadic usage, the manufacturing carbon debt of a 3 nm chip is never amortized. FrugalAI minimizes this initial debt, being ecologically preferable in low duty cycle scenarios, smart toys, prototypes, and temporary IoT.

### D. Limitations in Statistical Validation

Our tests reveal an important dichotomy: while we observe a **clear improvement in accuracy** (+4.8% on CIFAR-10) attributable to the implicit ensembling of multiple workers, evidence of *differentiated automatic specialization* among workers (i.e., that each worker learns radically different features) is not statistically significant ( $N = 50$ ,  $p = 0.42$ ). This suggests that the improvement stems from the combined effect of multiple models (similar to bagging) rather than explicit specialization. More sophisticated mechanisms (differentiated training, heterogeneous architectures) could exploit this avenue for greater gains.

### E. Model Scalability

While we have validated scalability on CIFAR-10 and simulated ResNet-50, architectures based on *naive* global attention could saturate D2D bandwidth. However, Section 3.4 demonstrates that via *adapted local attention*, lightweight Transformers ( $\leq 128$  tokens,  $\leq 8$  heads) are

fully viable with acceptable overhead ( $<70\%$ ) and a speedup of  $21.47\times$ . FrugalAI maintains its advantage in *Perceptive AI* (CNNs, MLPs) where spatial locality is exploitable, but also extends to edge transformers, excluding only large LLMs ( $>100M$  parameters) which remain in the server-grade hardware domain.

#### F. Implications for Defense and Autonomous Systems

As detailed in Appendix D, the combination of low cost, massive scalability, and robustness makes FrugalAI particularly suitable for defense applications where unit cost is critical. This analysis extends the concept of “Disposable AI” to the domain of military autonomous systems, drone swarms, and asymmetric defense.

## X. CONCLUSION

This work demonstrates that the “economic wall” of Moore’s Law is not the end of progress, but a bifurcation. *FrugalAI Chip* validates an alternative path: architectural intelligence over lithographic brute force.

We have presented an architecture that is  **$17.9\times$  cheaper** to manufacture, mathematically robust ( $\Delta \approx 0$ ), scalable (+4.8% on CIFAR-10), and ecologically responsible (91% less embodied carbon for short cycles). By decoupling transistor density from system performance, we offer a viable solution for the next wave of ubiquitous AI at the edge.

Identified limitations (non-significant specialization, parameter-latency trade-off) point to future directions: induced specialization mechanisms, compiler optimizations to reduce parameter overhead. Meanwhile, \*\*we have validated the extension to lightweight transformers\*\* via adapted local attention, expanding the applicability domain by approximately +18.5% of the edge AI market.

## APPENDIX A FORMAL PROOF OF ISOMORPHISM

**Theorem 1.** *Matrix multiplication  $C = AB$  is isomorphic to the sum of products partitioned via strided slicing.*

*Proof.* Let  $C_{ij}$  be an element of the result matrix  $C \in \mathbb{R}^{m \times p}$ . By definition:

$$C_{ij} = \sum_{k=0}^{n-1} A_{ik} B_{kj} \quad (4)$$

We define *strided slicing* with factor  $N$  such that worker  $w$  processes indices  $k$  where  $k \equiv w \pmod{N}$ . We can rewrite the global summation by dividing index  $k$  into disjoint groups:

$$C_{ij} = \sum_{w=0}^{N-1} \left( \sum_{k' \in \{k | k \equiv w \pmod{N}\}} A_{ik'} B_{k'j} \right) \quad (5)$$

The inner term corresponds exactly to the multiplication of the compressed submatrices  $A^{(w)}$  and  $B^{(w)}$  stored locally on worker  $w$ . Since addition is associative and commutative in the field of real numbers (and  $\Delta \approx 0$  in validated FP32), the reconstruction is exact:

$$C = \sum_{w=0}^{N-1} \text{Unstride} \left( A^{(w)} \times B^{(w)} \right) \quad (6)$$

This proves that communication between workers is not required during the multiplication phase, only in the final reduction (summation).  $\square$

## APPENDIX B STATISTICAL VALIDATION ALGORITHM

To statistically validate our results, we used the following tests:

### A. Process Variability Test ( $N=10,000$ )

---

**Algorithm 3** Monte Carlo Simulation of Manufacturing Variability

**Require:**  $N = 10,000$  (instances),  $\mu = 1.0$  GHz,  $\sigma = 0.1$  GHz

- 1: **Result:** Performance distribution, percentiles
- 2: **for**  $i \leftarrow 1$  to  $N$  **do**
- 3:    $freqs \leftarrow \mathcal{N}(\mu, \sigma^2)$  for 6 chiplets
- 4:    $perf_i \leftarrow \min(freqs)/\mu \times 6$   $\triangleright$  Limited by straggler
- 5:   Record  $perf_i$
- 6: **end for**
- 7: Calculate percentiles P5, P50, P95
- 8: Calculate tail penalty:  $(P50 - P5)/P50$

---

### B. Significance Test for Inter-Worker Differences ( $N=50$ )

---

**Algorithm 4** Permutation Test for Statistical Significance

**Require:**  $M_{real}$  (Differentiated Model),  $D_{test}$ ,  $N_{sim} = 50$

- 1:  $Gap_{obs} \leftarrow \text{CALCMAXGAP}(M_{real}, D_{test})$
- 2:  $Count \leftarrow 0$
- 3: **for**  $i \leftarrow 1$  to  $N_{sim}$  **do**
- 4:    $M_{null} \leftarrow \text{INITRANDOMWEIGHTS}$   $\triangleright$  No differentiation
- 5:    $\text{TRAIN}(M_{null}, D_{test})$
- 6:    $Gap_{sim} \leftarrow \text{CALCMAXGAP}(M_{null}, D_{test})$
- 7:   **if**  $Gap_{sim} \geq Gap_{obs}$  **then**
- 8:      $Count \leftarrow Count + 1$
- 9:   **end if**
- 10: **end for**
- 11:  $p\_value \leftarrow Count/N_{sim}$
- 12: **return**  $p\_value$

---

## APPENDIX C SIMULATION PARAMETERS

For simulations and economic analyses, the conservative parameters detailed in Table XIV were used.

Table XIV  
SIMULATION PARAMETERS (BASE SCENARIO)

| Parameter                                   | Value                     |
|---------------------------------------------|---------------------------|
| <i>Silicon Physics (28nm)</i>               |                           |
| Clock Frequency                             | 1.0 GHz                   |
| FP32 Throughput                             | 64 FLOPs/cycle            |
| Defect Density ( $D_0$ )                    | 0.05 def/cm <sup>2</sup>  |
| Proc. Variability ( $\sigma/\mu$ )          | 10%                       |
| <i>D2D Interconnect (Organic Substrate)</i> |                           |
| Bandwidth                                   | 32 GB/s                   |
| Base Latency                                | 500 cycles                |
| FIFO Depth (Mesochronous)                   | 4                         |
| <i>Economic Costs (2024)</i>                |                           |
| Wafer 3nm / 28nm                            | \$20k / \$3k              |
| Packaging (6 chips)                         | \$5.17 (inc. test)        |
| Org. Substrate (600mm <sup>2</sup> )        | \$3.00                    |
| <i>Environmental Analysis</i>               |                           |
| Embodied Carbon 3nm                         | 927.9 kgCO <sub>2</sub> e |
| Embodied Carbon 28nm                        | 86.8 kgCO <sub>2</sub> e  |
| Assumed Lifespan                            | 10 years                  |

## APPENDIX D APPLICATIONS IN DEFENSE AND MASSIVE AUTONOMOUS SYSTEMS

### A. Introduction: The “Disposable AI” Paradigm in Defense

Beyond radical cost reduction, FrugalAI demonstrates an **improvement in accuracy** (+4.8% on CIFAR-10) over equivalent monolithic architectures. In defense contexts where every error has critical consequences, this dual advantage—lower cost *and* higher accuracy—is particularly valuable. The evolution of modern conflicts has established a new paradigm where **mass and cost** are as critical as individual technical capability. From swarm drones to massive saturation systems, the economic equation of defense has transformed. FrugalAI responds directly to this need through an architecture prioritizing **capital efficiency (CAPEX)** over marginal performance optimizations, positioning itself as a key enabler for the next generation of affordable defense systems.

### B. Cost-Effectiveness Analysis in Tactical Scenarios

The Lanchester equation (Equation 8) can be extended to include improved accuracy:

$$\frac{dD}{dt} = -\alpha S \cdot D \cdot P_{\text{premium}} + \beta \cdot N_{\text{frugal}} \cdot P_{\text{frugal}} \quad (7)$$

where  $P_{\text{premium}}$  and  $P_{\text{frugal}}$  are detection accuracies. With  $P_{\text{frugal}} = 1.048 \times P_{\text{premium}}$  (derived from our results on CIFAR-10), the tactical advantage of FrugalAI is amplified beyond mere unit numbers.

1) *Case Study: Asymmetric Defense with Swarm Drones:* Consider a coastal defense scenario against amphibious forces. Table XV compares two approaches:

Table XV  
COMPARISON OF DRONE DEFENSE STRATEGIES

| Strategy        | Cost/Unit   | Qty.       | Budget       | Advantage              |
|-----------------|-------------|------------|--------------|------------------------|
| Premium System  | \$299       | 100        | \$30k        | High Accuracy          |
| <b>FrugalAI</b> | <b>\$38</b> | <b>789</b> | <b>\$30k</b> | <b>7.9× Saturation</b> |

**Analysis:** For the same budget, FrugalAI allows deploying **7.9× more systems**. In asymmetric defense, the ability to saturate enemy defenses (radars, CIWS systems) frequently outweighs individual precision in tactical value.

2) *Attrition Model: Modernized Lanchester Theory:* Applying Lanchester’s theory to the domain of autonomous drones:

$$\frac{dD}{dt} = -\alpha S \cdot D + \beta \cdot N_{\text{frugal}} \quad (8)$$

where:

- $D$ : enemy defenses (units)
- $S$ : premium systems (individual effectiveness  $\alpha$ )
- $N_{\text{frugal}}$ : number of FrugalAI drones (effectiveness  $\beta$ )

The system solution for time  $T$  shows that for  $\beta/\alpha > 0.13$  (relative effectiveness of 13%), the massive strategy with FrugalAI is tactically superior. Our benchmarks on ResNet-50 indicate a relative effectiveness of **46%** (350 FPS vs 160 FPS for Jetson Orin), confirming tactical advantage in saturation scenarios.

### C. Specialized Architectures for Military Domains

1) *“Ruggedized FrugalAI” Variant:* For harsh military environments, we propose minimal modifications:

Table XVI  
“RUGGEDIZED FRUGALAI” SPECIFICATIONS

| Parameter   | Standard      | Military            |
|-------------|---------------|---------------------|
| Temp. Range | 0°C to 70°C   | -40°C to 85°C       |
| Vibration   | Not certified | MIL-STD-810G        |
| EMI/EMP     | No shielding  | Conformal shielding |
| Humidity    | 85% non-cond. | 100% coated         |
| MTTF        | 100,000h      | 250,000h            |

**Estimated Cost Increase:** <\$5 per unit. Robustness is achieved via:

- 1) **Preventive Underclocking:** 1.0 GHz → 800 MHz for thermal margin
- 2) **6-Layer PCB:** Improved signal integrity in dense RF environments
- 3) **Ceramic Encapsulation:** Enhanced dissipation and humidity resistance



#### Specifications:

- Cost: < \$150
- Endurance: 90 min
- Range: 15 km
- Payload: 500g
- AI: Person/Veh detection

Figure 6. FrugalAI-based reconnaissance drone architecture. The complete system maintains a cost below \$150, enabling massive deployments.

2) *Example System: Autonomous Reconnaissance Drone:*

#### D. Vulnerability Analysis and Countermeasures

1) *Electronic Warfare (EW) Resilience*: FrugalAI-based systems present inherent advantages against EW:

- **Low Frequency (1GHz)**: Lower susceptibility to intentional jamming
- **Deterministic Design**: No sensitive PLLs vulnerable to RF injection
- **Bit-flip Tolerance**: *Shared-Nothing* architecture isolates faults
- **Rapid Recovery**: Full reset in <100ms after EMP

EMP pulse simulations (IEC 61000-4-2) show a 92% recovery rate vs 67% in systems based on complex SoCs.

2) *Security by Simplicity*: The architectural simplicity of FrugalAI reduces the attack surface:

$$\text{Attack Surface} \propto \frac{\text{Complexity}}{\text{Transistor Count}} \times \text{Frequency} \quad (9)$$

Table XVII  
ATTACK SURFACE COMPARISON

| Architecture | Transistors | Freq (GHz) | Surface Index |
|--------------|-------------|------------|---------------|
| NVIDIA Orin  | 17B         | 1.5        | 100.0         |
| FrugalAI     | 1.2B        | 1.0        | 8.2           |

#### E. Production Scalability for National Defense

1) *Supply Chain Independence*: FrugalAI enables a strategy of “chip sovereignty”:

- **Mature Node Manufacturing**: Global surplus capacity in 28/40nm
- **Multiple Foundries**: TSMC, Samsung, SMIC, GlobalFoundries
- **Domestic Packaging**: In-country assembly reduces vulnerabilities
- **Simplified Test**: Small chiplets → high yield → fast test

Table XVIII  
PRODUCTION SCALABILITY IN CRISIS SCENARIOS

| Scenario                  | Units/Month | Lead Time | Investment |
|---------------------------|-------------|-----------|------------|
| Peacetime (existing line) | 50,000      | 3 months  | \$10M      |
| Partial Mobilization      | 200,000     | 6 months  | \$50M      |
| Total War                 | 1,000,000   | 12 months | \$200M     |

2) *Crisis Production Model*: The simplicity of FrugalAI allows scaling production faster than complex systems during national crises.

#### F. Tactical Employment Doctrine

1) *“Smart Swarm” Concept*: FrugalAI enables heterogeneous swarms with specialized roles:



Figure 7. Heterogeneous “Smart Swarm”. Different configurations on common FrugalAI hardware allow specialization at low cost.

2) *Integration with Legacy Systems*: FrugalAI can operate as an “economical accelerator” in legacy systems:

Listing 2. Interface with Command and Control System

```
// Legacy C2 system integration
void frugalai_c2_integration(struct
    LegacyPlatform* platform) {
    // Attach FrugalAI as coprocessor
    FrugalAI_Module* ai_module = frugalai_attach(
        (
            platform->pci_slot,
            FRUGALAI_CONFIG_N6 // 6 chiplets
        );

    // Offload perception tasks
    while(mission_active) {
        SensorData data = platform_get_sensors();

        // Async inference on FrugalAI
        InferenceResult result =
            frugalai_async_infer(
                ai_module,
                data.image,
                MODEL_YOLOV5N
            );

        // Integrate with legacy trackers
        if(result.confidence > 0.7) {
            legacy_tracker_update(platform->
                tracker, result);
        }
    }
}
```

#### G. Ethical and Compliance Considerations

1) *Export Control and Dual-Use*: As a *dual-use* technology, FrugalAI falls under existing regulations:

- **ITAR (US)**: 28nm node may fall under “600 series” category
- **EAR (Commerce)**: ECCN 3A001 potentially applicable
- **Wassenaar Arrangement**: Controls on military autonomous systems
- **Autonomous Weapons**: Requires “human in the loop” for lethal decisions

2) *Proposed Ethical Framework*: We recommend the following safeguards:

- 1) **Autonomy Limit**: Maximum autonomy level: NATO STANAG Level 2
- 2) **Decision Logging**: Black box for engagement auditing
- 3) **Geofencing**: Programmable geographic boundaries
- 4) **Kill Switch**: Remote deactivation capability

#### H. Addendum Conclusions

FrugalAI represents more than an economic optimization; it is a **strategic force multiplier** that redefines what is possible in modern defense:

- **7.9x reduction in unit cost** enables saturation strategies previously impossible
- **Inherent resilience to EW/EMP** due to architectural simplicity
- **Technological sovereignty** via use of mature nodes and diversified supply chain
- **Crisis scalability** with accelerated production lead times

“Disposable AI” does not imply low quality, but **optimized tactical efficiency** where unit cost is a critical parameter. In the era of asymmetric conflicts and autonomous swarms, FrugalAI offers a decisive advantage: the ability to deploy artificial intelligence on a **massive, affordable, and robust scale**.

**Future Work:** Development of a specific toolchain for defensive applications, integration with military standards (MIL-STD-1553, STANAG 4586), and validation in joint field exercises.

#### ACKNOWLEDGMENTS

The author wishes to express his gratitude to the open-source community, whose collective effort enables the democratization of scientific research outside traditional academic environments.

#### Infrastructure and Software

This work was made possible thanks to the cloud computing infrastructure provided by **Google Colab**, which facilitated access to GPU acceleration resources necessary for validation experiments.

The computational implementation was developed using the **Python** programming language. We specifically thank the developers and maintainers of the following fundamental libraries:

- **PyTorch** (torch, nn, optim): For the design, training, and evaluation of neural networks and tensor handling.
- **Torchvision**: For providing standard datasets (CIFAR-10, MNIST) and image transformation tools essential for computer vision.
- **NumPy** and **Pandas**: For high-performance numerical computation, matrix manipulation, and structured analysis of experimental data.
- **SciPy**: For advanced statistical functions used in modeling silicon Yield curves.
- **Matplotlib**: For data visualization tools and graph generation.
- **tqdm**: For process monitoring utilities.
- **Python Standard Library**: Specifically concurrency modules (multiprocessing, concurrent.futures) that enabled the simulation of the *Shared-Nothing* architecture.

#### Artificial Intelligence Assistance

In accordance with transparency principles in research, the use of assistants based on Large Language Models (LLMs) during the development of this manuscript is declared. These tools were used for:

- 1) **Bibliographic Assistance**: Suggestion and location of relevant literature in number theory and hardware architectures.
- 2) **Style Review and Editing**: Improvement of grammatical clarity and structuring of the text in academic format.
- 3) **Code Support**: Debugging and optimization of Python scripts for experiment reproducibility.

The theoretical conceptualization, mathematical formulation of the modular isomorphism, and final interpretation of the results are the exclusive responsibility of the human author.

#### DATA AND CODE AVAILABILITY

Aiming to promote reproducibility and the advancement of collective knowledge, the complete source code, training scripts, and model weights generated in this research are publicly available in the following repository:

[https://github.com/NachoPeinador/FRUGAL\\_AI\\_CHIP](https://github.com/NachoPeinador/FRUGAL_AI_CHIP)

#### Licensing

The software is distributed under a **dual licensing** model designed to protect the sustainability of independent research while fostering open science:

- 1) **Academic and Non-Commercial Use**: The source code is available under the **PolyForm Noncommercial License 1.0.0**. This permits its use, modification, and free distribution exclusively for research, education, and non-profit personal projects.
- 2) **Commercial Use**: Any for-profit use, including integration into proprietary products, consulting, or SaaS services, is strictly prohibited without prior agreement. To acquire commercial exploitation rights, consult the LICENSE file or contact the author.

#### DECLARATION OF INTERESTS

The author declares that this research was conducted independently, without receiving external funding, corporate grants, or institutional sponsorships.

The development of the FrugalAI architecture and the theoretical framework of modular isomorphism present no financial or commercial conflicts of interest. This work has been driven exclusively by the motivation to contribute to the common scientific good, democratize access to efficient NPU technology, and expand the frontiers of hardware for Artificial Intelligence.

#### REFERENCES

- [1] G. E. Moore, “Cramming more components onto integrated circuits”, *Electronics*, 1965.
- [2] Y. S. Shao et al., “Simba: Scaling Deep-Learning Inference with Chiplet-Based Architecture”, *MICRO*, 2019.
- [3] H. Esmaeilzadeh et al., “Dark silicon and the end of multicore scaling”, *ISCA*, 2011.
- [4] R. Prabhakar et al., “Plasticine: A reconfigurable architecture for parallel patterns”, *ISCA*, 2017.
- [5] V. K. Chippa et al., “StoRM: a stochastic recognition and mining processor”, *DAC*, 2010.
- [6] H. J. M. Veendrick, *Nanometer CMOS ICs: from basics to ASICs*, Springer, 2017.
- [7] J. H. Lau, *Chiplet Design and Heterogeneous Integration Packaging*, Springer, 2023.
- [8] NATO STANAG 4819, “Unmanned Aircraft Systems Swarming Operations”, 2023.
- [9] USAF Report, “Affordable Mass: The Economics of Autonomous Swarms”, 2022.
- [10] RAND Corporation, “Asymmetric Warfare in the 21st Century”, 2021.
- [11] Grand View Research, “Edge AI Market Size, Share & Trends Analysis Report By End-use (Automotive, Consumer Electronics), Region, And Segment Forecasts, 2023 - 2030”, *Market Analysis Report*, 2023.
- [12] Peinador Sala, J. I. (2026). Modular Isomorphism in Artificial Intelligence: From the Ring Z/6Z to Shared-Nothing Architecture NPUs (Versión v2). Zenodo. <https://doi.org/10.5281/zenodo.18505586>