

# Data Mining Hardware Descriptions

from Vendor Code, Configuration Tools, and Documentation

Niklas Hauser, emBO++25



**Who?**  
Hello, my name is Niklas and I like data science



Thank you for the introduction.  
Thanks to emBO for the opportunity to talk here.

My name is Niklas.

- I started studying Computer Science some time ago.
- I began building autonomous robots in 2010.
- We created a C++ library which is known as [modm.io](#), a C++23 library generator that supports several thousand Cortex-M devices.
- I then started at ARM working on Cortex-M sandboxing, before returning to university to study for my master's degree.
- There, I worked on a digital modular signalling system for railways.
- I'm currently working at Auterion on the open-source PX4 Autopilot.

## [modm.io](#): C++23 barebone embedded library

Modular, data-driven HAL and build system generator

- Generates startup code, linker script, peripheral drivers for microcontrollers.
- Supports 3034 STM32, 416 SAM, 388 AVR, and RP2040.
- Requires a lot of data for every supported device.



modm is a C++23 embedded library generator.

- The core of modm is a code generator written in Python called lbuild.
  - It queries a database of device data and formats the results into C++23 code.
  - The HAL is highly modular and configurable and it allows a very small maintainer team to support thousands of microcontrollers.
- Today we'll talk about the database part of this construct.
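To make the "query the database, format the result as C++" idea concrete, here is a minimal sketch. It is not lbuild's actual API: the `DEVICE` record, the `HEADER` template, and `generate_header` are hypothetical names made up for this illustration, and a real device entry holds far more data.

```python
from string import Template

# Hypothetical device record, standing in for one database query result.
DEVICE = {
    "name": "stm32f411re",
    "flash_kib": 512,
    "ram_kib": 128,
    "uarts": ["USART1", "USART2", "USART6"],
}

# Hypothetical C++ template; the generated names are illustrative only.
HEADER = Template("""\
// Generated for $name, do not edit.
#pragma once
#include <cstdint>

namespace board {
inline constexpr std::uint32_t FlashSize = $flash_kib * 1024;
inline constexpr std::uint32_t RamSize = $ram_kib * 1024;
$uart_lines
}  // namespace board
""")


def generate_header(device: dict) -> str:
    """Format one device record into a C++ header."""
    uart_lines = "\n".join(
        f"inline constexpr int {name}Index = {index};"
        for index, name in enumerate(device["uarts"])
    )
    return HEADER.substitute(
        name=device["name"],
        flash_kib=device["flash_kib"],
        ram_kib=device["ram_kib"],
        uart_lines=uart_lines,
    )


generated = generate_header(DEVICE)
print(generated)
```

The point of the split is that the template is written once, while the data changes per device, which is what lets a small team cover thousands of parts.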



Microcontroller hardware is quite complex nowadays.

Here is an STM32H7 with its many internal buses.

You can see distributed memories in yellow, you can see lots of peripherals, and many DMA engines.

Everything also needs to be externally connected via the pins.

That's a lot of hardware to abstract.



The HAL is actually part of the hardware-dependent software and there's a lot of it.


It's also operating systems, external sensors, communication protocols, and bootloaders.

So it's a fairly large topic, not just about the microcontrollers themselves.



So the idea is to parse every data source I can find and merge it into a single database.

Then I can share this among all my embedded friends: Zephyr, modm and embassy.

And then I would benefit from any of their improvements to the database as well.



## Configuration Tools: CubeMX

**STM32\_open\_pin\_data** contains all packages, pinouts, memories



And I decided to make this an open-source project on GitHub. It's split into individual pipelines, where each data source is converted eventually into Python.


Let's first focus on the trivially machine-readable data sources

The most well known is the CubeMX GUI application, which allows you to configure the pin functions of the STM32.

This is backed by an XML database that STMicro publishes on GitHub under a BSD licence.

It contains the entire catalog of STM32 devices ever made: their packages, their pinouts, and all alternate functions.

It's undocumented but you can get very far with simple XPath queries.

Many people already use this, including Zephyr, embassy and KiCad to generate HALs and footprints!
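As a rough idea of what such XPath queries look like, here is a sketch using Python's standard `xml.etree.ElementTree`. The `SAMPLE` document is hand-written and only imitates the shape of a pin description; its element and attribute names are assumptions for illustration, and the real files are larger and use XML namespaces.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; element and attribute names are assumptions.
SAMPLE = """
<Mcu RefName="STM32F411RETx" Package="LQFP64">
  <Pin Name="PA2" Position="16" Type="I/O">
    <Signal Name="TIM2_CH3"/>
    <Signal Name="USART2_TX"/>
  </Pin>
  <Pin Name="PA3" Position="17" Type="I/O">
    <Signal Name="TIM2_CH4"/>
    <Signal Name="USART2_RX"/>
  </Pin>
  <Pin Name="VDD" Position="19" Type="Power"/>
</Mcu>
"""

root = ET.fromstring(SAMPLE)

# Select pins by attribute value.
io_pins = [pin.get("Name") for pin in root.findall("./Pin[@Type='I/O']")]

# Find the pins carrying a signal: match the child, then step up to its parent.
usart_tx = [
    pin.get("Name")
    for pin in root.findall("./Pin/Signal[@Name='USART2_TX']/..")
]

# Map every pin to the signals it offers.
signals = {
    pin.get("Name"): [sig.get("Name") for sig in pin.findall("Signal")]
    for pin in root.findall("Pin")
}

print(io_pins, usart_tx, signals)
```

ElementTree only implements a subset of XPath, but attribute predicates and parent steps already cover most pinout questions.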



However, the CubeMX database also contains a fully annotated graph of the entire STM32 clock tree.


A typical configuration is to have an external clock source fed into the PLL, which then increases the clock frequency and feeds it into the system clock, from which most peripherals are clocked.



We can also render this clock graph as a Graphviz graph, and you can see that it contains all the frequency limits that CubeMX uses to solve the clock tree configuration.

Here we can follow the same configuration: external clock source gets fed into the PLL and comes out into the system clock.

But now there is a lot more detail visible.

You can also see that this is not really a tree, it's really a graph.

You can find more of these rendered clock graphs on my homepage.
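Rendering such a graph needs very little code once the edges are extracted. The sketch below emits Graphviz DOT text from a heavily simplified, hypothetical edge list (`EDGES`, including its frequency limits, is made up for illustration and does not describe a specific device). It also lists the nodes with more than one source, which is exactly why this is a graph and not a tree.

```python
from collections import Counter

# Hypothetical clock graph: (source, sink, max frequency in Hz or None).
EDGES = [
    ("HSE", "PLL", 26_000_000),
    ("HSI", "PLL", None),
    ("HSI", "SYSCLK", None),
    ("HSE", "SYSCLK", None),
    ("PLL", "SYSCLK", 100_000_000),
    ("SYSCLK", "AHB", 100_000_000),
    ("AHB", "APB1", 50_000_000),
]


def to_dot(edges) -> str:
    """Format the edge list as a Graphviz digraph with limits as edge labels."""
    lines = ["digraph clocks {", "  rankdir=LR;"]
    for source, sink, limit in edges:
        label = f' [label="<= {limit / 1e6:g} MHz"]' if limit is not None else ""
        lines.append(f'  "{source}" -> "{sink}"{label};')
    lines.append("}")
    return "\n".join(lines)


def multi_source_nodes(edges) -> list:
    """Nodes fed by more than one source, i.e. multiplexers in the clock graph."""
    incoming = Counter(sink for _, sink, _ in edges)
    return sorted(node for node, count in incoming.items() if count > 1)


dot = to_dot(EDGES)
shared = multi_source_nodes(EDGES)
print(dot)
print(shared)
```

The resulting text can be piped into the `dot` command line tool to get an SVG.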



So that was the easy part, let's now focus on more difficult data sources: source code.



We can convert the CMSIS header files back into a register map:  
 We know the order and width of the registers from the typedef struct.  
 We know the order and width of the bit fields from the macros.  
 And we know the peripheral instance and address from the typedef cast.

This does not give us enumerations of any bit fields unfortunately, since they are simply not in the header files.
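Interpreting the bit field macros can be sketched with two regular expressions. This is a simplified illustration, not the actual parser: `HEADER` is a shortened sample in the CMSIS style, and the width calculation assumes the mask literal is contiguous and right-aligned before the shift.

```python
import re

# Shortened sample in the style of a CMSIS device header.
HEADER = """
#define RCC_CFGR_SW_Pos      (0U)
#define RCC_CFGR_SW_Msk      (0x3UL << RCC_CFGR_SW_Pos)
#define RCC_CFGR_SWS_Pos     (2U)
#define RCC_CFGR_SWS_Msk     (0x3UL << RCC_CFGR_SWS_Pos)
#define RCC_CFGR_HPRE_Pos    (4U)
#define RCC_CFGR_HPRE_Msk    (0xFUL << RCC_CFGR_HPRE_Pos)
"""

POS = re.compile(r"#define\s+(\w+)_Pos\s+\((\d+)U\)")
MSK = re.compile(r"#define\s+(\w+)_Msk\s+\((0x[0-9A-Fa-f]+)UL\s*<<\s*\w+_Pos\)")


def parse_bit_fields(text: str) -> dict:
    """Return {field name: (bit position, bit width)} from _Pos/_Msk macro pairs."""
    positions = {name: int(pos) for name, pos in POS.findall(text)}
    fields = {}
    for name, mask in MSK.findall(text):
        # Assumes a contiguous, unshifted mask such as 0x3 or 0xF.
        width = int(mask, 16).bit_length()
        fields[name] = (positions[name], width)
    return fields


fields = parse_bit_fields(HEADER)
print(fields)
```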

## Parsing CubeHAL Header Files

CMSIS files are missing Bit Field Enumerations

- Neither the STM32 CMSIS-SVD files nor the CMSIS headers define bit field enumerations.
- We need to parse the low-level CubeHAL header files to reconstruct them.

```
/***** Bit definition for RCC_CFGR_SW register *****/
#define RCC_CFGR_SW_Pos      (0U)
#define RCC_CFGR_SW_Msk      (0x3UL << RCC_CFGR_SW_Pos)
#define RCC_CFGR_SW           RCC_CFGR_SW_Msk
#define RCC_CFGR_SW_0          (0x1UL << RCC_CFGR_SW_Pos)
#define RCC_CFGR_SW_1          (0x2UL << RCC_CFGR_SW_Pos)

#define LL_RCC_SYS_CLKSOURCE_HSI    0x00000000U
#define LL_RCC_SYS_CLKSOURCE_HSE    RCC_CFGR_SW_0
#define LL_RCC_SYS_CLKSOURCE_PLL    RCC_CFGR_SW_1
#if defined(RCC_PLLR_SYSCLK_SUPPORT)
#define LL_RCC_SYS_CLKSOURCE_PLLR   (RCC_CFGR_SW_1|RCC_CFGR_SW_0)
#endif
```

For some of the bit field enumerations we need to parse the CubeHAL low-level header files.

Same procedure, we interpret the macros.


We can do a reverse lookup to see which macros use the register bit field definitions and then work backwards from that.

Annoying, but doable.

But this does not give us every possible bit field enumeration.
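The reverse lookup can be sketched as follows. This is my simplification for illustration, not the real pipeline: it ignores preprocessor conditionals entirely, and the helper names (`evaluate`, `enumerations`) as well as the idea of grouping macros by their common prefix are assumptions.

```python
import os
import re

HEADER = """
#define RCC_CFGR_SW_Pos      (0U)
#define RCC_CFGR_SW_Msk      (0x3UL << RCC_CFGR_SW_Pos)
#define RCC_CFGR_SW          RCC_CFGR_SW_Msk
#define RCC_CFGR_SW_0        (0x1UL << RCC_CFGR_SW_Pos)
#define RCC_CFGR_SW_1        (0x2UL << RCC_CFGR_SW_Pos)
#define LL_RCC_SYS_CLKSOURCE_HSI    0x00000000U
#define LL_RCC_SYS_CLKSOURCE_HSE    RCC_CFGR_SW_0
#define LL_RCC_SYS_CLKSOURCE_PLL    RCC_CFGR_SW_1
"""

DEFINE = re.compile(r"^#define[ \t]+(\w+)[ \t]+(.+?)[ \t]*$", re.MULTILINE)
SUFFIX = re.compile(r"\b(0x[0-9A-Fa-f]+|\d+)[UuLl]+\b")
IDENT = re.compile(r"\b[A-Za-z_]\w*\b")
SAFE = re.compile(r"^[0-9A-Fa-fx\s()|&<>~+-]*$")


def evaluate(name: str, macros: dict) -> int:
    """Resolve a macro to an integer by recursively substituting other macros."""
    expr = SUFFIX.sub(r"\1", macros[name])  # drop U/UL integer suffixes
    expr = IDENT.sub(lambda m: str(evaluate(m.group(0), macros)), expr)
    if not SAFE.match(expr):
        raise ValueError(f"unsupported expression for {name}: {expr}")
    return int(eval(expr, {"__builtins__": {}}, {}))


def enumerations(field: str, macros: dict) -> dict:
    """Find LL_ macros that use the field's bit macros, then collect their siblings."""
    users = [
        name for name, expr in macros.items()
        if name.startswith("LL_") and re.search(rf"\b{field}(_\d+)?\b", expr)
    ]
    if not users:
        return {}
    prefix = os.path.commonprefix(users)
    position = evaluate(f"{field}_Pos", macros)
    return {
        name[len(prefix):]: evaluate(name, macros) >> position
        for name in macros if name.startswith(prefix)
    }


macros = dict(DEFINE.findall(HEADER))
sw_values = enumerations("RCC_CFGR_SW", macros)
print(sw_values)
```

Note that `LL_RCC_SYS_CLKSOURCE_HSI` never references the bit field itself, so it is only found through the shared prefix of its siblings. That kind of guesswork is what makes this approach annoying.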

## data.modm.io Conversion Pipelines



Now for the really hard stuff: parsing PDF datasheets.

PDFs are machine-renderable, but not machine-readable.

There's a lot of research out there on information extraction from PDFs, mostly relating to financial statements.

## STMicro PDF Documentation

You can look, but you cannot parse

- STMicro publishes >2600 PDFs for documentation: ~15GB on disk.
- You must consult multiple PDFs with thousands of pages: STM32H7A3/B0/B3.
- How hard could it possibly be to make all these PDFs machine-readable?



STMicro publishes a lot of PDFs: We are only looking at active components, microcontrollers, sensors, memories.

And there are over 2600 PDFs available: ~15GB.

Many PDFs apply to a single microcontroller: for the STM32H7 family shown here, 7 PDFs are involved.

Nobody reads them all.

## PDF Datasheets: Text

### 17 Cyclic redundancy check calculation unit (CRC)

#### 17.1 Introduction

The CRC (cyclic redundancy check) calculation unit is used to get a CRC code from an 8-, 16- or 32-bit data word and a generator polynomial.

Among other applications, CRC-based techniques are used to verify data transmission or storage integrity. In the scope of the industrial safety standards, they offer a means of verifying the Flash memory integrity. The CRC calculation unit helps compute a signature of the software during runtime to be compared with a reference signature generated at link time and stored at a given memory location.

#### 17.2 CRC main features

- Uses CRC-32 Ethernet polynomial  $x^{32} + x^{26} + x^{23} + x^{22} + x^{16} + x^{12} + x^{11} + x^{10} + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1$

## How can we access PDF data?

For text it's relatively simple: each glyph is individually positioned on the page.

There's no semantics for headings or lists or superscript. It's all just individually positioned characters.

THERE IS NO NEED TO OCR PDFS!
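Turning individually positioned glyphs back into text is mostly a clustering problem. The sketch below works on hypothetical `Glyph` records instead of a real PDF library; the `place` helper, the fixed glyph width, and both thresholds are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    char: str
    x: float      # left edge
    y: float      # baseline; PDF coordinates grow upwards
    width: float
    size: float   # font size


def place(text, x, y, size=10.0):
    """Hypothetical helper: fixed-width glyphs, spaces are not emitted at all."""
    width = 0.5 * size
    return [Glyph(c, x + i * width, y, width, size)
            for i, c in enumerate(text) if c != " "]


def group_lines(glyphs, tolerance=0.3):
    """Cluster glyphs whose baselines are within tolerance * font size."""
    lines = []
    for glyph in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - glyph.y) <= tolerance * glyph.size:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])
    return lines


def line_text(line, space_gap=0.25):
    """Order glyphs left to right and insert a space wherever the gap is wide."""
    ordered = sorted(line, key=lambda g: g.x)
    text = ordered[0].char
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.x - (prev.x + prev.width) > space_gap * cur.size:
            text += " "
        text += cur.char
    return text


# Glyphs arrive in arbitrary order; the page position decides the reading order.
glyphs = place("17.1 Introduction", 0, 680) + place("17 CRC", 0, 700)
texts = [line_text(line) for line in group_lines(glyphs)]
print(texts)
```

Headings, lists, and superscripts then have to be inferred from font size, indentation, and baseline offsets in the same spirit.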

## PDF Datasheets: Figures



Figures are a mix of vector graphics and text. There's no special indication that this is a figure, it must be detected.

## PDF Datasheets: Tables

Table 13: STM32F303xB/C Pin definitions (continued)

| Pin number | Pin functions |         |         |         |                      |          |               |       |                                                                                                                     |                                                               |
|------------|---------------|---------|---------|---------|----------------------|----------|---------------|-------|---------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|
|            | MULSP100      | LOPP100 | LOPP101 | LOPP102 | Pin name after reset | Pin type | I/O structure | Notes | Alternate functions                                                                                                 | Additional functions                                          |
| 13         | 52            | 34      | 28      | PB13    | IO                   | TF       |               | 14    | SPI2_SCK[2:2], S2_I2C1_SDA[2:2],<br>CONSTITUTION[1:1][1:1]<br>JTAG[1:1]Q[1:1]EVENTOUT[1:1]                          | ADC1IN5[1:1]COMP1_IN[1:1]<br>OPAMP1_IN[1:1]<br>OPAMP2_IN[1:1] |
| 12         | 53            | 35      | 27      | PB14    | IO                   | TF       |               | 15    | SPI2_MISO[2:2], S2_I2C1_SDN[2:2],<br>DISPARTA[1:1]RTS[1:1]<br>JTAG[1:1]Q[1:1]M[1:1]<br>JTAG[1:1]Q[1:1]EVENTOUT[1:1] | COMP1_IN[1:1]ADC1_IN[1:1]<br>OPAMP2_IN[1:1]                   |

A special case of a figure is a table, where the table cells are drawn in vector graphics and the text is placed inside that.

That's why if you just attempt to copy the text of the table into an editor, you usually get garbage.

Note the rotated text in the header: in which order is that copied? It's up to the PDF reader how to copy this text.

Here you can see the first page of a datasheet. We detect the double-column layout manually, then convert each side. We need to simplify the problem, so we:

- First convert all 2D information into an abstract syntax tree.
- Then modify that AST to detect the hierarchy of the document and normalize page breaks.
- Then format it as HTML.

If this sounds like a compiler, it basically is one: a PDF frontend, then a number of AST passes, then an HTML backend. And this actually works really well.
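The compiler analogy can be shown in miniature. This is a toy sketch under strong assumptions, not the real converter: the "AST" is a flat list of hypothetical `Node` records, the frontend receives already classified `(kind, text)` items instead of a PDF, and there is only one pass.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Node:
    kind: str   # "heading" or "paragraph" in this sketch
    text: str
    page: int


def frontend(pages):
    """Stand-in for the PDF frontend: flatten (kind, text) items per page into nodes."""
    return [Node(kind, text, number)
            for number, items in enumerate(pages, start=1)
            for kind, text in items]


def merge_page_breaks(nodes):
    """AST pass: rejoin a paragraph that was split by a page break."""
    merged = []
    for node in nodes:
        prev = merged[-1] if merged else None
        if (prev is not None and prev.kind == node.kind == "paragraph"
                and node.page == prev.page + 1 and not prev.text.endswith(".")):
            merged[-1] = Node("paragraph", f"{prev.text} {node.text}", node.page)
        else:
            merged.append(node)
    return merged


def backend(nodes):
    """HTML backend: heading depth is derived from the section number ("17.1" -> h2)."""
    lines = []
    for node in nodes:
        if node.kind == "heading":
            level = node.text.split()[0].count(".") + 1
            lines.append(f"<h{level}>{escape(node.text)}</h{level}>")
        else:
            lines.append(f"<p>{escape(node.text)}</p>")
    return "\n".join(lines)


pages = [
    [("heading", "17 CRC"), ("paragraph", "The CRC unit computes a signature of the")],
    [("paragraph", "software during runtime."), ("heading", "17.1 Introduction")],
]
html_output = backend(merge_page_breaks(frontend(pages)))
print(html_output)
```

Keeping each pass small and independent is what makes the whole conversion deterministic and easy to debug.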

| Table 10. STM32F413xG/H pin definition |        |         |                                                      |         |             |                  |       |                     |        |                                                                                                                          |
|----------------------------------------|--------|---------|------------------------------------------------------|---------|-------------|------------------|-------|---------------------|--------|--------------------------------------------------------------------------------------------------------------------------|
| Pin Number                             |        |         | Pin Name<br>(function after<br>reset) <sup>(1)</sup> |         | Pin<br>type | I/O<br>structure | Notes | Alternate functions |        | Additional<br>functions                                                                                                  |
| UFIOPN48                               | LQFP64 | WLCSPI1 | LQFP100                                              | UBGA100 | UBGA144     | LQFP144          |       |                     |        |                                                                                                                          |
| - - NC 1 B2 A3 1                       |        |         |                                                      |         |             | PE2              | I/O   | FT                  | (2)    | TRACED1,<br>SP4_SCK/DS4_CK,<br>SP5_SCK/DS5_WS,<br>SA1_MCLK_A,<br>QUADSPI_BK1_I02,<br>UART10_RX,<br>FSMC_A23,<br>EVENTOUT |
| - - NC 2 A1 A2 2                       |        |         |                                                      |         |             | PE3              | I/O   | FT                  | (2)    | TRACED0,<br>SA1_SD_B,<br>UART10_TX,<br>FSMC_A19,<br>EVENTOUT                                                             |
| - - NC 3 B1 B2 3                       |        |         |                                                      |         |             | PE4              | I/O   | FT                  | (2)(3) | TRACED1,<br>SP4_NSS/DS4_WS,<br>SP5_NSS/DS5_WS,<br>SA1_SD_A,<br>DDROM4_DATHIN3,<br>FSMC_A20,<br>EVENTOUT                  |
|                                        |        |         |                                                      |         |             |                  |       |                     |        | TRACED3_TIM9_CH1,<br>SP4_MISO,<br>SP5_MISO,                                                                              |

This is the result, for example the pin definition table in the datasheet.  
This is a pure HTML table with minimal CSS to look similar to the PDF.  
All of the data is converted as is including line breaks.

| Table 12: STM32F411xH alternate functions |                             |                                         |                               |                             |                             |                             |                                               |                                               |                                                |                            |                                |                                  |                           |                                  |                     |                     |              |
|-------------------------------------------|-----------------------------|-----------------------------------------|-------------------------------|-----------------------------|-----------------------------|-----------------------------|-----------------------------------------------|-----------------------------------------------|------------------------------------------------|----------------------------|--------------------------------|----------------------------------|---------------------------|----------------------------------|---------------------|---------------------|--------------|
| Port                                      | AF0                         | AF1                                     | AF2                           | AF3                         | AF4                         | AF5                         | AF6                                           | AF7                                           | AF8                                            | AF9                        | AF10                           | AF11                             | AF12                      | AF13                             | AF14                | AF15                |              |
|                                           | SYS_AF                      | TIME <sub>0</sub> /<br>LPTIM1           | TIME <sub>0</sub> /<br>LPTIM1 | TIME <sub>0</sub> /<br>I2C1 | TIME <sub>0</sub> /<br>I2C1 | TIME <sub>0</sub> /<br>I2C1 | SP1I2S <sub>0</sub>                           | SP1I2S <sub>0</sub>                           | SP1I2S <sub>0</sub>                            | SP1I2S <sub>0</sub>        | SP1I2S <sub>0</sub>            | SP1I2S <sub>0</sub>              | SP1I2S <sub>0</sub>       | SP1I2S <sub>0</sub>              | SP1I2S <sub>0</sub> | SP1I2S <sub>0</sub> |              |
| PA0                                       | -                           | TIME <sub>0</sub> /<br>I2C <sub>1</sub> | TIME <sub>0</sub> /<br>ETR    | TIME <sub>0</sub> /<br>CH1  | TIME <sub>0</sub> /<br>ETR  | -                           | -                                             | -                                             | -                                              | -                          | -                              | -                                | -                         | -                                | -                   | EVEN/<br>OUT        |              |
| PA1                                       | -                           | TIME <sub>2</sub> /<br>CH2              | TIME <sub>2</sub> /<br>CH2    | -                           | -                           | -                           | SP4I <sub>0</sub> MOSI <sub>1</sub><br>284_SD | -                                             | -                                              | USART2 <sub>0</sub><br>RTS | UART1 <sub>0</sub><br>RX       | QUADSPI <sub>0</sub><br>BK1_IODI | -                         | -                                | -                   | -                   | EVEN/<br>OUT |
| PA2                                       | -                           | TIME <sub>3</sub> /<br>CH3              | TIME <sub>3</sub> /<br>CH3    | -                           | TIME <sub>3</sub> /<br>CH3  | -                           | -                                             | I2S2 <sub>0</sub> CKIN                        | -                                              | -                          | UART2 <sub>0</sub><br>TX       | -                                | -                         | -                                | -                   | FSMC_D4/<br>D4      |              |
| PA3                                       | -                           | TIME <sub>2</sub> /<br>CH4              | TIME <sub>2</sub> /<br>CH4    | -                           | TIME <sub>2</sub> /<br>CH4  | -                           | -                                             | I2S2 <sub>0</sub> MCK                         | -                                              | -                          | USART2 <sub>0</sub><br>RX      | -                                | -                         | SAI1_SD_B                        | -                   | FSMC_D5/<br>D5      |              |
| PA4                                       | -                           | -                                       | -                             | -                           | -                           | -                           | -                                             | SP1I <sub>0</sub> NSS1 <sub>0</sub><br>SI1_W5 | SP1I <sub>0</sub> NSS1 <sub>0</sub><br>253_W5  | USART2 <sub>0</sub><br>CK  | DSIDIML <sub>0</sub><br>DATINI | -                                | -                         | -                                | -                   | FSMC_D6/<br>D6      |              |
| PA5                                       | -                           | TIME <sub>0</sub> /<br>I2C <sub>1</sub> | TIME <sub>0</sub> /<br>ETR    | -                           | TIME <sub>0</sub> /<br>CH1  | -                           | -                                             | SP1I <sub>2</sub> SK12 <sub>0</sub><br>SI1_CK | -                                              | -                          | DSIDIML <sub>0</sub><br>CKIN1  | -                                | -                         | -                                | -                   | FSMC_D7/<br>D7      |              |
| PA6                                       | TIME <sub>1</sub> /<br>B8IN | TIME <sub>1</sub> /<br>CH1              | TIME <sub>1</sub> /<br>B8IN   | TIME <sub>1</sub> /<br>CH1  | TIME <sub>1</sub> /<br>B8IN | -                           | -                                             | SP1I <sub>0</sub> MISO <sub>0</sub><br>281_SD | I2S2 <sub>0</sub> MCK                          | -                          | TIME1 <sub>0</sub><br>CH1      | QUADSPI <sub>0</sub><br>BK1_IODI | -                         | -                                | -                   | SIDIO_CMD           |              |
| PA7                                       | -                           | TIME <sub>1</sub> /<br>CH1              | TIME <sub>1</sub> /<br>CH2    | TIME <sub>1</sub> /<br>CH1  | TIME <sub>1</sub> /<br>CH1  | -                           | -                                             | SP1I <sub>0</sub> MOSI <sub>0</sub><br>281_SD | -                                              | -                          | DSIDIML <sub>0</sub><br>DATINI | -                                | TIME1 <sub>0</sub><br>CH1 | QUADSPI <sub>0</sub><br>BK2_IODI | -                   | EVEN/<br>OUT        |              |
| PA8                                       | MCO_1                       | TIME <sub>1</sub> /<br>CH1              | -                             | -                           | -                           | -                           | -                                             | I2C <sub>3</sub><br>SDA3                      | -                                              | -                          | USART1 <sub>0</sub><br>CKOUT   | UART1 <sub>0</sub><br>RX         | -                         | USB_FS <sub>0</sub><br>SOF       | CAN3_RX             | SIDIO_DI            |              |
| PA9                                       | -                           | TIME <sub>1</sub> /<br>CH2              | -                             | -                           | -                           | -                           | -                                             | I2C <sub>3</sub><br>SDA3                      | SP1I <sub>2</sub> SK12 <sub>0</sub><br>SI2_CK  | -                          | USART1 <sub>0</sub><br>CK      | UART1 <sub>0</sub><br>TX         | -                         | USB_FS <sub>0</sub><br>VBUS      | -                   | SIDIO_D2            |              |
| PA10                                      | -                           | TIME <sub>1</sub> /<br>CH3              | -                             | -                           | -                           | -                           | -                                             | SP2I <sub>0</sub> MOSU <sub>0</sub><br>282_SD | SP2I <sub>0</sub> MOSI <sub>0</sub><br>1255_SD | USART1 <sub>0</sub><br>RX  | -                              | -                                | USB_FS <sub>0</sub><br>ID | -                                | -                   | EVEN/<br>OUT        |              |
| PA11                                      | -                           | TIME <sub>1</sub> /<br>CH4              | -                             | -                           | -                           | -                           | -                                             | SP2I <sub>0</sub> NSS1 <sub>0</sub><br>SI2_W5 | SP4I <sub>0</sub> MISO <sub>0</sub><br>284_SD  | USART1 <sub>0</sub><br>CTS | UTSART5 <sub>0</sub><br>TX     | CAN1_RX                          | USB_FS <sub>0</sub><br>D6 | UART4_RX                         | -                   | EVEN/<br>OUT        |              |

Here is the alternate function table.

This is normally broken up across many pages; in the HTML it's just one long table.

| Table 24. RCC register map and reset values |               |              |            |              |              |  |  |  |  |
|---------------------------------------------|---------------|--------------|------------|--------------|--------------|--|--|--|--|
| Addr/offset                                 | Register name | PLL[4:0]     |            |              |              |  |  |  |  |
| 0x00                                        | RCC_CR        | MCQ2[1:0]    | Res.       | Res.         | 31           |  |  |  |  |
| 0x04                                        | RCC_PLLCFG    | PLL[2:0]     | PLL.R      | Res.         | 30           |  |  |  |  |
| 0x08                                        | RCC_CFGR      | MCQ2R[2:0]   | Res.       | Res.         | 29           |  |  |  |  |
| 0x0C                                        | RCC_CIR       | MCQ2R[3:0]   | PLLQ[3:0]  | PLL.I2SRDY   | 27           |  |  |  |  |
| 0x10                                        | RCC_AHB1RSTR  | MCQ2R[7:0]   | PLLQ[7:0]  | PLL.I2SRDY   | 26           |  |  |  |  |
| 0x14                                        | RCC_IER       | CSRC         | PLL.ON     | PLL.ON       | 24           |  |  |  |  |
| 0x18                                        | RCC_IDR       | MCQ1[1:0]    | PLL.RC     | PLL.RC       | 23           |  |  |  |  |
| 0x1C                                        | RCC_ISER      | DMARST[7:0]  | PLL.RC     | PLL.RC       | 21           |  |  |  |  |
| 0x20                                        | RCC_ISIER     | DMARST[15:0] | PLL.RDYC   | RTCPRE[4:0]  | CSRR[19]     |  |  |  |  |
| 0x24                                        | RCC_ISIER     | LSDRDYC      | PLL.RDYC   | PLL.PT[15:0] | HSEYP[18]    |  |  |  |  |
| 0x28                                        | RCC_ISIER     | LSDRDYC      | PLL.RDYC   | PLL.PT[15:0] | HSERDY[17]   |  |  |  |  |
| 0x2C                                        | RCC_ISIER     | LSDRDYC      | PLL.RDYC   | PLL.PT[15:0] | HSEDON[16]   |  |  |  |  |
| 0x30                                        | RCC_ISIER     | PTR[2:3]     | PLL.N[8:0] | PLL.N[8:0]   | HSICAL[7:0]  |  |  |  |  |
| 0x34                                        | RCC_ISIER     | PLLRDYIE     | PLLRDYIE   | PLLRDYIE     | PLL.M[5:0]   |  |  |  |  |
| 0x38                                        | RCC_ISIER     | PLLRDYIE     | PLLRDYIE   | PLLRDYIE     | PLL.M[4:0]   |  |  |  |  |
| 0x3C                                        | RCC_ISIER     | PLLRDYIE     | PLLRDYIE   | PLLRDYIE     | PLL.M[3:0]   |  |  |  |  |
| 0x40                                        | RCC_ISIER     | CSRF         | HPRE[3:0]  | HPRE[3:0]    | HSITRIM[4:0] |  |  |  |  |
| 0x44                                        | RCC_ISIER     | GPORFST      | GPORFST    | GPORFST      | PLL.M[4:0]   |  |  |  |  |
| 0x48                                        | RCC_ISIER     | GPORFST      | GPORFST    | GPORFST      | PLL.M[3:0]   |  |  |  |  |
| 0x4C                                        | RCC_ISIER     | LSRDYIE      | LSRDYIE    | LSRDYIE      | SWS[14:0]    |  |  |  |  |
| 0x50                                        | RCC_ISIER     | LSRDYIE      | LSRDYIE    | LSRDYIE      | SW[14:0]     |  |  |  |  |
| 0x54                                        | RCC_ISIER     | GPORFST      | GPORFST    | GPORFST      | HSRDY[0]     |  |  |  |  |
| 0x58                                        | RCC_ISIER     | LSRDYIE      | LSRDYIE    | LSRDYIE      | HSRDY[0]     |  |  |  |  |
| 0x5C                                        | RCC_ISIER     | GPORFST      | GPORFST    | GPORFST      | HSRDY[0]     |  |  |  |  |
| 0x60                                        | RCC_ISIER     | LSRDYIE      | LSRDYIE    | LSRDYIE      | HSRDY[0]     |  |  |  |  |

We also find the register layout information again for each peripheral. Note that the text is rotated only by CSS, so the table data is still easily accessible in HTML.

## PDF to HTML conversion

Open-sourced at [data.modm.io](https://data.modm.io)

- Manually written Python3 code based on pypdfium2.
  - ~157k PDF pages in 65mins on a MacBook Air M2 => ~25ms per page!
  - Works on all PDFs from STMicro: also sensors, not just STM32!
  - Most valuable data is inside tables, but table processing is hard and fuzzy.  
  - Not easily portable to other vendor data sheets due to content segmentation!
  - Figures and images are ignored, math formulas are not recognized.

And I can even convert the invisible table of bit fields and their enumeration descriptions into an HTML table.


And indeed this is accurate, the PLLR does not exist for this device, so the guard in the CubeHAL header is actually correct.


Unfortunately we have the enumeration value and description, but not a name. That would need to be generated from the description, and that is not always easy to do automatically.

I'm very happy with this pipeline. It's written in Python3 using native bindings for pdfium (PDF renderer in Chrome). It's entirely deterministic, so the translated HTML is byte reproducible. It's also very fast with 25ms per page. All STMicro PDFs are supported, including sensors.

Some compromises: it's not easily portable to other vendors, since the format recognition is hardcoded. I'm only interested in tables and text, so figures are completely ignored (should be converted to SVG) and math formulas are turned into garbage.

## PDF Formatting Mistakes

- The PDFs sometimes have formatting mistakes: tables with missing cell borders.
- Apply git patch to HTML result: works, but fragile.

| 31   | 30   | 29    | 28          | 27    | 26          | 25          | 24         | 23        | 22 | 21          | 20 | 19 | 18 | 17 | 16 |
|------|------|-------|-------------|-------|-------------|-------------|------------|-----------|----|-------------|----|----|----|----|----|
| Res  | Res  | Res   | Res         | Res   | PEC<br>BYTE | AUTOE<br>ND | RE<br>LOAD |           |    | NBYTES[7:0] |    |    |    |    |    |
| rs   | rs   | rw    | rw          | rw    | rw          | rw          | rw         | rw        | rw | rw          | rw | rw | rw | rw | rw |
| 15   | 14   | 13    | 12          | 11    | 10          | 9           | 8          | 7         | 6  | 5           | 4  | 3  | 2  | 1  | 0  |
| NACK | STOP | START | HEAD1<br>OR | ADD10 | RD<br>WRN   |             |            | SADD[9:0] |    |             |    |    |    |    |    |
| rs   | rs   | rs    | rw          | rw    | rw          | rw          | rw         | rw        | rw | rw          | rw | rw | rw | rw | rw |

## Interpreting Datasheet Tables

Substitution hell to fix typos in PDFs

```
package = package.replace("UFBGA/TFBGA64", "UFBGA64/TFBGA64")
package = package.replace("LQFN", "LQFP").replace("TSSOP", "TSSOP")
package = package.replace("UFBG100", "UFBGA100").replace("UBGA", "UFBGA")
package = package.replace("UOFN", "UFQFN").replace("WL CSP20L", "WL CSP20")
package = package.replace("UFQFN48E", "UFQFPN48+E").replace("UFQFN", "UFQFPN")
package = package.replace("LQFP48 SMPS<br>UFQFPN48 SMPS", "LQFP48/UFQFPN48+SMPS")
package = package.replace("LQFP48<br>64", "LQFP64").replace("LQFP<br>48", "LQFP48")
package = package.replace("UFQFPN<br>32", "UFQFPN32").replace("UFQFP<br>N48", "UFQFPN48")
package = package.replace("WL CSP<br>25", "WL CSP25")
```

## Interpreting Datasheet Tables

Substitution hell to fix typos in PDFs, now with more RegEx

```
patterns = {
    r" +": "", r".*?\(( [A-Z]+|DMA2D) \).*?": r"\1",
    r"Reserved|Port|Power|Registers|Reset|(\.*?)\_REG": "",
    r"(\d|I2S)d": r"\1", r"/I2S|CANMessageRAM|Cortex-M4|I2S\dext|^GPV$": "",
    r"Ethernet": "ETH", r"Flash": "FLASH", r"(?i).*ETHERNET.*": "ETH",
    r"(?i)Firewall": "FW", r"HDMI-1": "", "SPDIF-RX": "SPDIFRX",
    r"SPI2S2": "SPI2", "Tamper": "TAMP", "TT-FDCAN": "FDCAN",
    r"USBOTG([FH])S": r"USB_OTG_\1", "LCD-TFT": "LTDC", "DSIHOST": "DSI",
    "TIMER": "TIM", r"VREF\$": "VREFBUF", "DelayBlock": "DLYB",
    "I/O": "I/O", "DAC1/2": "DAC12",
    r"[a-z]": ""
}
```

Even though it's deterministic and reproducible, some formatting mistakes are not easy to fix.

A classic is missing cell borders in tables.

We could try to infer cells from whitespace analysis between text, but it's fairly unreliable for such issues.
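To show why whitespace analysis is unreliable, here is a minimal sketch of the idea: project all text spans onto the x-axis and treat the remaining gaps as column separators. The span coordinates and the `min_gap` threshold are made up for illustration.

```python
def column_gaps(rows, min_gap=2.0):
    """rows: lists of (x_start, x_end) text spans. Returns gap centers shared by all rows."""
    spans = sorted(span for row in rows for span in row)
    gaps = []
    right = spans[0][1]
    for start, end in spans[1:]:
        if start - right >= min_gap:
            gaps.append((right + start) / 2)
        right = max(right, end)
    return gaps


# Three clean columns: both separators are found.
regular = [
    [(0, 10), (20, 30), (40, 50)],
    [(0, 12), (20, 28), (40, 55)],
]
clean_gaps = column_gaps(regular)

# One wide text spanning the first two columns hides the first separator.
with_wide_cell = regular + [[(0, 30), (40, 50)]]
broken_gaps = column_gaps(with_wide_cell)

print(clean_gaps, broken_gaps)
```

A single merged cell or a long field name is enough to lose a column, and register tables are full of those.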

I just apply a git patch to the HTML, which works because the HTML is so reproducible.

Still, interpreting the HTML tables was actually way more annoying than converting the PDF.

You have to clean the data because of the many typos and random line breaks.

Here I'm using text substitution.

And then I decided to use Regex to fix many patterns in the register definitions.
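Applying such a substitution table is simple, the hard part is finding the patterns. A minimal sketch, assuming a small made-up `PATTERNS` table (these are not the real rules) that is applied in insertion order:

```python
import re

# Hypothetical subset; order matters because later patterns see earlier results.
PATTERNS = {
    r"<br>": "",              # line breaks inside table cells
    r"\s+": "",               # stray whitespace
    r"UFBG(?!A)": "UFBGA",    # missing letter, but do not touch correct names
    r"^LQPF": "LQFP",         # swapped letters
}


def clean(name: str, patterns: dict = PATTERNS) -> str:
    """Apply every regex substitution in order and return the normalized name."""
    for pattern, replacement in patterns.items():
        name = re.sub(pattern, replacement, name)
    return name


cleaned = [clean(raw) for raw in ["LQFP<br>48", "UFBG100", "UFBGA100", "WL CSP 25", "LQPF64"]]
print(cleaned)
```

Because the rules are applied sequentially, every new rule can break an earlier one, which is how you end up in substitution hell.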

## Interpreting Datasheet Tables

RegEx hell to fix typos for bit field reconstruction

```
off_replace = {"+": "", "0x000@0": "0x00", "to": "-", "x": "*", r"\(\d+\)": ""}
dom_replace = {"Register": "Bit position"}
reg_replace = {
    r"+|.|*": "", r"\(COM\(\d\)\)": r"\_COM\1",
    r"\[R\]es\$|\0x[\da-fA-FXx]+\(\.*\)|-": "",
    r"\(\?\)reserved|resetvalue.*": "", "enabled": "_EN", "disabled": "_DIS",
    r"\(\?\)Output|comparemode": "Output", "\(\?\)Inputcapturemode": "Input", "mode": "",
    r"\^TG_FS_": "OTG_FS", "toRTC": "RTC", "SPI2S_": "SPI",
    r"andTIM\(\d+.*": "", r"\xe[\d,]+": ""
}
fld_replace = {
    r"\d+|\d|\nd|st": "!", r"\.+|\.\*\?|\|\[\d+:\d\]|(\.*?)|-|^[\dXx]+\$|_|:0\|": "",
    r"\#dataregister|Independentdataregister": "DATA",
    r"\#framefilterreg0.*": "FRAME_FILTER_REG",
    r"\(\?\)reserved|\(\?\)regular|\(\?\)NotAvailable|ReferToSection\d+|Comparator": "",
    r"\#Completebits|Selectedchannelsequence|channelsequence|conversioninregularsequencebits": "",
    r"\#conversioninselectedsequencebits|conversioninjectedsequencebits|or|first|second|third|fourth": ""
}
bit_replace = {"*:": ""}
glo_replace = {r"\[R\]eserved": ""}

```

## Evaluation of Data Sources

Actual Science! OMG



Extracted 4 datasets with increasing complexity for ~2700 STM32 devices:

1. Interrupt vector table: PDF vs CMSIS Header
2. Package and pinout: PDF vs CubeMX Database
3. Pin functions: PDF vs CubeMX Database
4. MMIO register map and descriptions: PDF vs SVD vs Header

Compare PDF against machine-readable sources: Headers, SVD, CubeMX

## PDF Interrupt Table vs CMSIS Header

Device → Reference Manual → Table → Position + Name



| Position | Priority | Type of priority | Acronym    | Description                                                                                                        | Address     |
|----------|----------|------------------|------------|--------------------------------------------------------------------------------------------------------------------|-------------|
| -        | -        | -                | -          | Reserved                                                                                                           | 0x0000 0000 |
| -        | -3       | Fixed            | Reset      | Reset                                                                                                              | 0x0000 0004 |
| -        | -2       | Fixed            | NMI        | Non maskable interrupt. The RCC clock security system (CSS) and the RAM parity check are linked to the NMI vector. | 0x0000 0008 |
| -        | -1       | Fixed            | HardFault  | All classes of fault                                                                                               | 0x0000 000C |
| -        | 3        | Settable         | SVCall     | System service call via SWI instruction                                                                            | 0x0000 002C |
| -        | 5        | Settable         | PendSV     | Pendable request for system service                                                                                | 0x0000 0038 |
| -        | 6        | Settable         | SysTick    | System tick timer                                                                                                  | 0x0000 003C |
| 0        | 7        | Settable         | WWDG       | Window watchdog interrupt                                                                                          | 0x0000 0040 |
| 1        | 8        | Settable         | PVD_VDDIO2 | PVD and VDDIO2 supply comparator interrupt (combined EXTI lines 16 and 31)                                         | 0x0000 0044 |
| 2        | 9        | Settable         | RTC        | RTC interrupts (combined EXTI lines 17, 18 and 20)                                                                 | 0x0000 0048 |

98.8% match (N=190 109)

And this then got a little out of hand for the bit field enumerations.

I do not recommend using regex for this; there needs to be a better way.

Ok, but enough regexing around. Let's do some actual science!

We want to find out how accurate our data import pipelines actually are.

So we're going to compare the machine-readable data against the PDF data.

We evaluated four datasets in detail for this.

We fixed obvious spelling mistakes, but only as long as the fix is unambiguous.

This is fairly easy: it's the interrupt vector table for STM32 microcontrollers.

Quite good.
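The comparison for this dataset boils down to checking, per device, whether both sources name the same vector at the same table position. A minimal sketch, assuming each source has already been reduced to a position-to-name mapping; the `match_rate` helper and the sample data (including the mismatching `RTC_TAMP` entry) are made up for illustration.

```python
def match_rate(pdf: dict, header: dict) -> tuple:
    """Return (matches, total) over all positions present in either source."""
    positions = pdf.keys() | header.keys()
    # A position missing from one source counts as a mismatch.
    matches = sum(1 for pos in positions if pdf.get(pos) == header.get(pos))
    return matches, len(positions)

# Hypothetical sample: position -> vector name
pdf_table = {0: "WWDG", 1: "PVD_VDDIO2", 2: "RTC"}
header_table = {0: "WWDG", 1: "PVD_VDDIO2", 2: "RTC_TAMP"}

matches, total = match_rate(pdf_table, header_table)
print(f"{matches}/{total} vectors match")
```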

## PDF Pinout vs CubeMX Database

Device → Datasheet → Table → Pin Position + Name

| Pin number WLCSP100 | Pin number LQFP100 | Pin number LQFP64 | Pin number LQFP48 | Pin name (function after reset) | Pin type | I/O structure | Notes | Alternate functions | Additional functions |
|----|----|----|----|------|-----|-----|-----|----------------------------------------------------------------------|----------------------------------------------------|
| J3 | 52 | 34 | 28 | PB13 | I/O | TTa | (4) | SPI2_SCK_I2S2_CK_USART3<br>CTS,TIM1_CH1N,<br>TSC_G8_I03,EVENTOUT | ADC3_INS,COMP5_INP,<br>OPAMP4_VINP,<br>OPAMP3_VINP |
| J2 | 53 | 35 | 27 | PB14 | I/O | TTa | (4) | USART3_RTS_DE,<br>TIM1_CH2N,TIM15_CH1,<br>TSC_G8_I04,EVENTOUT,<br>SPI2_MISO_I2S2_SD,<br>TIM1_CH0N,RTC_REFIN,<br>TIM15_CH1N,TIM15_CH2,<br>EVENTOUT | COMP2_INP,ADC4_IN4,<br>OPAMP2_VINP |
| H4 | 54 | 36 | 28 | PB15 | I/O | TTa | (4) | JTDI,TIM16_CH1,<br>SWDIO,TIM16_CH1N,<br>SWCLK_JTCK | ADC |

**99.88% match (N=247 756)**

## PDF Pin Functions vs CubeMX Database

Device → Datasheet → Table → Pin Name + Function

| Port & Pin Name | AF0   | AF1        | AF2      | AF3        | AF4      | AF5      | AF6       | AF7           | AF8       | AF9    | AF10      | AF11 | AF12   | AF14      | AF15      |
|-----------------|-------|------------|----------|------------|----------|----------|-----------|---------------|-----------|--------|-----------|------|--------|-----------|-----------|
| PA12            | -     | TIM16_CH1  | -        | -          | -        | -        | TIM1_CH2N | USART1_RTS_DE | COMP2_OUT | CAN_TX | TIM1_ETR  | -    | USB_DP | EVENT_OUT |           |
| PA13            | SWDIO | TIM16_CH1N | -        | TSC_G4_I03 | -        | IR_OUT   | -         | USART3_CTS    | -         | -      | TIM4_CH3  | -    | -      | EVENT_OUT |           |
| PA14            | SWCLK | JTCK       | -        | TSC_G4_I04 | I2C1_SDA | I2C1_CH2 | TIM8_BKIN | USART2_TX     | -         | -      | -         | -    | -      | EVENT_OUT |           |
| PA15            | JTDI  | TIM2_CH1   | TIM8_CH1 | -          | I2C1_SCL | SPI1_NSS | SPI3_NSS  | I2S3_WS       | USART2_RX | -      | TIM1_BKIN | -    | -      | -         | EVENT_OUT |

Data cleanup for cells requires interpretation of newlines, commas, hyphenation

**96.2% match (N=1 107 035)**



The package pinout was extremely accurate. This is just the pin position and name on the package.

The signals are a bit more interesting, this is our first 2D data structure.

We looked at over a million signals in our dataset and didn't find any issues with our PDF-to-HTML pipeline, but we did find many issues in the CubeMX database, as well as formatting issues in the PDFs.

Still very accurate.
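The cell cleanup mentioned on the pin function slide can be sketched roughly as follows. This assumes that commas are the only real separators and that any line break inside a cell is just a wrapped name; the `split_signals` helper is hypothetical, and real datasheets need more special cases than this.

```python
import re

def split_signals(cell: str) -> list:
    """Split one pin-function table cell into individual signal names."""
    # Join names that were wrapped across lines, e.g. "USART3_\nCTS".
    text = re.sub(r"\s+", "", cell)
    # Assumption: commas are the only separators between signals.
    parts = text.split(",")
    # Empty cells are typically marked with a dash.
    return [part for part in parts if part and part != "-"]

print(split_signals("TIM1_CH1N,\nTSC_G8_IO3, EVENTOUT"))
print(split_signals("USART3_\nCTS"))
```

The ambiguity is visible here: a line break can separate two signals or wrap a single one, and only the surrounding commas tell you which.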

And finally we compared the register maps reconstructed from the CMSIS Header, vs the CMSIS-SVD vs the PDF.

And this was the most interesting part, because it shows that STMicro has three slightly different datasets for their hardware.

As a proxy for completeness we can look at the size of the register map, that is, how many bytes are occupied by the registers.

You can see that the register map reconstructed from the reference manual is very accurate.

BUT the device resolution is not great:

- the CMSIS headers create 183 register maps,
- the CMSIS-SVD files only 100 register maps, and
- the PDFs only 53 register maps.



Here is the conflict rate in more detail. We can see that the complex families like F7, H7 and L4 have the most conflicts overall.

Since we have three differing data sources, we can do majority voting and see how many differing registers we can fix.

It works well for simple families, and improves the matching data quite a bit, but we can also see that the combination of CMSIS header and CMSIS-SVD is the least successful in majority voting.

This is very weird, since the CMSIS header files are supposed to be generated from the CMSIS-SVD files.
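A minimal sketch of majority voting for a single register address, assuming each source yields either a name or `None` when the register is missing there. The function name and the sample values are illustrative, not taken from the evaluation code.

```python
from collections import Counter

def majority_vote(header, svd, pdf):
    """Return the value that at least two sources agree on, else None."""
    votes = Counter(v for v in (header, svd, pdf) if v is not None)
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    # With three sources, two matching votes are a majority.
    return value if count >= 2 else None

print(majority_vote("CR1", "CR1", "CR"))   # header and SVD agree
print(majority_vote("CR1", None, "CR1"))   # SVD is missing the register
print(majority_vote("CR1", "CR", "CTRL"))  # no majority, conflict remains
```

A conflict is only fixable when two sources agree, which is why a pair of sources that rarely agree with each other contributes the least.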

## Results Overview

It's almost great!



- We didn't find any systemic issues in our PDF-to-HTML pipeline!
- STMicro maintains three slightly different datasets for register maps???

| Dataset                       | Sources                                   | Method of Comparison                                                                          | Result                     | N                          |
|-------------------------------|-------------------------------------------|-----------------------------------------------------------------------------------------------|----------------------------|----------------------------|
| Device Identifier             |                                           | Datasheet $\supseteq$ CubeMX                                                                  | 93.2 %                     | 3024                       |
| Package                       | ③ Datasheet vs. ⑦ CubeMX                  | Datasheet = CubeMX                                                                            | 99.68 %                    | 2819                       |
| Pinout                        |                                           | Matching pin name at package position                                                         | 99.88 %                    | 247756                     |
| Pin Function                  |                                           | Matching index for function name at pin                                                       | 96.2 %                     | 1107035                    |
| Interrupt Vector Table        | ③ Reference Manual vs. ⑤ Header           | Matching vector name at table position                                                        | 98.8 %                     | 190109                     |
| Peripheral Register Bit Field | ④ Reference Manual vs. ⑤ Header vs. ⑥ SVD | Matching peripheral, register, or bit field name at byte or bit address after majority voting | 96.4 %<br>97.7 %<br>95.9 % | 42752<br>891044<br>3425903 |
| All Datasets                  | All Sources                               | Weighted average over all data points                                                         | 96.5 %                     | 5910442                    |

Overall, the machine-readable data is very accurate with 96.5% match at 5.9 million data points.
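The weighted average can be reproduced from the rows of the results table. This is a quick check that uses the rounded percentages from the slide, so the result is only approximate.

```python
# (match rate in percent, number of data points) for each row of the table
rows = [
    (93.2, 3024),      # Device Identifier
    (99.68, 2819),     # Package
    (99.88, 247756),   # Pinout
    (96.2, 1107035),   # Pin Function
    (98.8, 190109),    # Interrupt Vector Table
    (96.4, 42752),     # Peripheral
    (97.7, 891044),    # Register
    (95.9, 3425903),   # Bit Field
]

total = sum(n for _, n in rows)
weighted = sum(rate * n for rate, n in rows) / total
print(f"{weighted:.1f} % over {total} data points")
```

The bit field rows dominate the average simply because they contribute more than half of all data points.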

As a result, I would not use the PDF or the CMSIS-SVD files as primary data sources unless necessary.

Extract as much as possible from the CubeMX database and CMSIS headers instead.

## data.modm.io : Data Interface



So the question is: how do we make this data accessible?

We have a highly heterogeneous dataset, which includes clock graphs, so why not use a graph database?

The graph database then also acts as the interface to the external world.

## Knowledge Graph as Interface



- Perfect for heterogeneous datasets like hardware descriptions.
- Implements the Cypher query language, with bindings for many languages.
- Serialization into sorted plaintext allows trivial archiving.
- Simple to install and use: `pip install kuzu`

```
kuzu> MATCH (:Package)-[po:hasPin]->(pi:Pin)-[af:hasAlternateFunction]->(s:Signal)<--(pe:Peripheral)
    RETURN pi.name, po.position, pe.name, s.name, af.index;
```

| pi.name  | po.position | pe.name | s.name | af.index |
|----------|-------------|---------|--------|----------|
| PA0/WKUP | N3          | ETH     | CRS    | 11       |
| PH2      | K4          | ETH     | CRS    | 11       |
| PA0/WKUP | N3          | TIM1    | CH1    | 1        |

I chose Kuzu, which is a small and fast embedded graph database that implements the Cypher language.

It's easy to install and use and comes with many language bindings: C/C++, Rust, Python, WebAssembly.

You can see a part of the schema on the right and a Cypher query at the bottom showing alternate functions.

The query returns a table which you can then use to generate code.



There is also a browser-based explorer tool that includes graph visualization.  
Here you can see the package node, surrounded by all the pins and their corresponding signals.  
This is a bit chaotic.

## Knowledge Graph as Interface

```
MATCH p=(:Peripheral)-->(:Signal) RETURN p
```



We can also query only a part of the graph, like specific relations.  
Here we query all relations between peripherals and signals.

## Package and Pinout Shenanigans

| STM32G071GBU6 |     |     |     |      |      |
|---------------|-----|-----|-----|------|------|
| PB8           | PB7 | PB6 | PB5 | PB4  | PB3  |
| PC14          |     |     |     |      | PA14 |
| PC15          |     |     |     |      | PA13 |
| VDD           |     |     |     |      | PA12 |
| VSS           |     |     |     |      | PA11 |
| PF2           |     |     |     |      | PC6  |
| PA0           |     |     |     |      | PA8  |
| PA1           |     |     |     |      | PA9  |
|               | PA2 | PA3 | PA4 | PA5  | PA6  |
|               | PA7 | PA8 | PA9 | PA10 | PA11 |

```
pip install stm.layout
```

```
Pin Info
  Name: PA2
  Pos: 8
  Config: Default
  Mode: [ ] GPIO
        [ ] GPO
        [ ] Alternate
        [x] Analog
  Speed: Low
        Mid
        High
        Very High
  Type: Push-Pull
        Open-Drain
Resistor: None
          Pull-Up
          Pull-Down
```

```

Alternate Functions        Additional Functions
1: I2S1_SD|SPI1_MOSI      UCP2_I2MM
2: I2S1_TX                RCC_LSCO
3: -                      SYS_WKUP4
4: -                      UCP01_FRSTX1
5: TIM15_CH1              UCP01_FRSTX2
6: LPUART1_TX             COMP2_OUT
7: -                      -
8: -                      -
9: -                      -
10: -                     -
11: -                     -
12: -                     -
13: -                     -
14: -                     -
15: -                     -

```


But there are also many other things you can do with this code.

For example, regex your alternate functions.

This is a smol TUI tool based on the CubeMX database, here showing all the pins that have an ADC input signal.

## Package and Pinout Shenanigans

Screenshot: the pinout TUI showing an STM32H743 package, with a regex search for `I2C\d_S[DC][AL]` and the Pin Info, Alternate Functions, and Additional Functions panels for pin PB7.

## Memory Map Shenanigans

```
// Enable SYSCFG
if target.family in ["c0", "g0"]:
    if regs.set("RCC", "APB1ENR\d?", "SYSCFG.*?EN|AFIOEN"):
        // Enable SYSCFG
        {regs.result}
    RCC->APB1ENR |= RCC_APB1ENR_SYSCFGEN; __DSB();
    % end

% elif target.family == "f1":
    if regs.set("RCC", "APB2ENR\d?", "(?:PWR|BKP)EN"):
        // Enable power to backup domain
        {regs.result}
    RCC->APB2ENR |= RCC_APB2ENR_AFIOEN; __DSB();
    % end

% elif target.family == "l1":
    if regs.set("RCC", "APB1ENR\d?", "PWR\d?"):
        // Enable access to backup domain
        {regs.result}
    RCC->APB1ENR |= RCC_APB1ENR_SYSCFGEN; __DSB();
    % end

% elif target.family == "u5":
    if regs.set("RCC", "APB2ENR\d?", "DBP"):
        // Enable access to backup domain
        {regs.result}
    RCC->APB2ENR |= RCC_APB2ENR_SYSCFGEN; __DSB();
    % end
% endif

// Enable power to backup domain
if target.family == "f1":
    RCC->APB1ENR |= RCC_APB1ENR_PWREN; __DSB();
    % end

% elif target.family in ["f4", "g4", "l4", "f7", "l7", "u4"]:
    RCC->APB1ENR |= RCC_APB1ENR_PWREN; __DSB();
    // Enable SYSCFG
    RCC->APB2ENR |= RCC_APB2ENR_SYSCFGEN;
    % end

% elif target.family in ["r4", "l4", "l5"]:
    RCC->APB1ENR |= RCC_APB1ENR_PWREN; __DSB();
    RCC->APB1ENR |= RCC_APB1ENR_AFIOEN; __DSB();
    % end

% elif target.family == "u5":
    RCC->AHB3ENR |= RCC_AHB3ENR_PWREN; __DSB();
    % end

% endif
```

Why not simply™  
regex your  
register map?

###

This also works for BGA pins, here searching for all pins with I2C data and clock signals.

This is very useful for quickly finding alternate functions, since the CubeMX GUI is not great for searching like this.
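The search itself is simple once the pin data is in a plain structure. A minimal sketch, assuming a mapping from pin name to signal names; the `PINS` excerpt and the `find_pins` helper are hypothetical and not the TUI's actual code.

```python
import re

# Hypothetical excerpt of pin -> signal names. Real data would come from
# the CubeMX database.
PINS = {
    "PB6": ["TIM4_CH1", "I2C1_SCL", "USART1_TX"],
    "PB7": ["TIM4_CH2", "I2C1_SDA", "USART1_RX"],
    "PA2": ["TIM15_CH1", "LPUART1_TX"],
}

def find_pins(pattern: str, pins: dict = PINS) -> dict:
    """Return all pins that have at least one signal fully matching the regex."""
    regex = re.compile(pattern)
    hits = {}
    for pin, signals in pins.items():
        matched = [s for s in signals if regex.fullmatch(s)]
        if matched:
            hits[pin] = matched
    return hits

# All pins carrying an I2C data or clock signal
print(find_pins(r"I2C\d_S(DA|CL)"))
```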

## More Use Cases

**Everything is a Query when all you have is Data**

- Modularize and generate your own HAL much more easily.
- HTML version of pinout and clock configurator. No more CubeMX.
- Optimizing constraint solver for pinout and clock limitations.
- Diffing HTML versions of Datasheets and Reference Manuals.
- Much more accurate CMSIS-SVD files from the CMSIS Header + PDF.
- Testing AI models against PDF-to-HTML-to-Knowledge-Graph pipeline.

I apologize to your eyeballs. This is code from modm.

modm uses Python Jinja templates to generate startup code, where we need to enable the clock to the system config and power peripherals.

And it's very annoying since the bit is in different registers depending on the family.

###

So instead, why not just regex the register?

###

This abomination actually works really well...
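To make the idea concrete, here is a rough sketch of what a lookup like the `regs.set(...)` call on the slide could do internally. The `REGISTER_MAP` excerpt and the `set_bit` helper are hypothetical stand-ins written for this example; they are not modm's implementation.

```python
import re
from typing import Optional

# Hypothetical register map excerpt: peripheral -> register -> bit fields.
REGISTER_MAP = {
    "RCC": {
        "APB1ENR": ["TIM2EN", "PWREN"],
        "APB2ENR": ["SYSCFGEN", "USART1EN"],
    },
}

def set_bit(peripheral: str, reg_pattern: str, bit_pattern: str,
            regmap: dict = REGISTER_MAP) -> Optional[str]:
    """Return a C statement that sets the first bit field whose register and
    field names fully match the regexes, or None if nothing matches."""
    for register, bits in regmap.get(peripheral, {}).items():
        if not re.fullmatch(reg_pattern, register):
            continue
        for bit in bits:
            if re.fullmatch(bit_pattern, bit):
                return (f"{peripheral}->{register} |= "
                        f"{peripheral}_{register}_{bit};")
    return None

print(set_bit("RCC", r"APB\dENR\d?", r"SYSCFG.*?EN|AFIOEN"))
```

The template then no longer needs to know which register holds the bit for which family, only what the bit is roughly called.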

There are many more use cases that I didn't go into.

A nice one would be to create a simpler CubeMX application as an HTML page, something that can configure and generate code for other HALs.

You can apply a SAT solver to the database of course to solve design constraints and help with parts selection.

You can now diff PDFs via the HTML version.

And you can generate much more accurate SVD files than the official ones.

If you think your AI model can do better, I've basically built you a benchmark. BEWARE.

## Conclusion and Future Work

- STMicro publishes several machine-readable data sources on GitHub!
- Parsing machine-readable data sources is easy and very accurate.
- Parsing PDF/HTML is difficult due to typos and formatting mistakes.
- modm-data: PDF2HTML works well, rest is "academic" code quality.
  
- Knowledge Graphs are a good database for heterogeneous data sets.
- Documentation and discoverability of Knowledge Graph Ontology is difficult.

There's a lot of machine-readable data on GitHub, it needs to be put in a good database. Knowledge graphs are still pretty niche.

Parsing PDFs is hard because of human mistakes in the documents, not because of accessing the PDF itself. Some fuzzy matching is required.

The PDF2HTML pipeline works really well in modm-data, the rest needs to be rewritten.

## Data Mining Hardware Descriptions

Questions?

Niklas Hauser likes data science.

Homepage: [salkinium.com](http://salkinium.com)

Fediverse: [@salkinium@chaos.social](https://chaos.social/@salkinium)

Code: [github.com/salkinium](https://github.com/salkinium)

Thesis: [salkinium.com/master.pdf](http://salkinium.com/master.pdf)

Paper: [\(peer-reviewed!\)](http://salkinium.com/hp23.pdf)

GitHub: [github.com/modm-io/modm-data](https://github.com/modm-io/modm-data)



[data.modm.io](http://data.modm.io)

If you want to know more details, including citations, check out my master's thesis and my peer-reviewed paper.

All the code is public and somewhat documented.

Do you have questions?