

# A flexible control system for atomic, molecular and optical physics experiments

A. Trenkwalder,<sup>1,2</sup> M. Zaccanti,<sup>1,2</sup> and N. Poli<sup>1,2,3</sup>

<sup>1)</sup>*Istituto Nazionale di Ottica del Consiglio Nazionale delle Ricerche (INO-CNR), 50019 Sesto Fiorentino, Italy*

<sup>2)</sup>*European Laboratory for Nonlinear Spectroscopy (LENS), 50019 Sesto Fiorentino, Italy*

<sup>3)</sup>*Dipartimento di Fisica e Astronomia and INFN Sezione di Firenze, Università degli Studi di Firenze, Via Sansone 1, 50019 Sesto Fiorentino, Italy*

(\*Electronic mail: poli@lens.unifi.it)

(\*Electronic mail: trenkwalder@lens.unifi.it)

(Dated: 8 June 2021)

We have implemented a control system for experiments in atomic, molecular and optical physics based on a commercial low-cost board, featuring a field-programmable gate array as part of a system-on-a-chip on which a Linux operating system is running. The board features Gigabit Ethernet, allowing for fast data transmission and operation of remote experimental systems. A single board can control a set of devices generating digital, analog and radio frequency signals with precise timing given by either an external or an internal clock. Contiguous output and input sampling rates of up to 40 MHz are achievable. Several boards can run synchronously with a timing error approaching 1 ns. For this purpose, a novel auto-synchronization scheme is demonstrated, with possible application in complex distributed experimental setups with demanding timing requirements.

## I. INTRODUCTION

Experimental control and data acquisition systems are widespread in many fields of scientific and industrial research where test and measurement systems need to be controlled and experimental data have to be gathered. To control experiments in atomic, molecular and optical (AMO) physics, digital pulses as well as analog, radio-frequency and microwave signals need to be generated at well-defined times. For instance, laser cooling and trapping of atomic gases down to ultralow temperatures typically require a temporal resolution of one microsecond. For this task, field programmable gate arrays (FPGAs) are very well suited. These can generate arbitrary digital pulses which can be used to program digital-to-analog converters (DAC), direct-digital synthesizers (DDS), and other devices with the requested timing resolution. As a result, FPGAs are already successfully employed in both commercial<sup>1</sup> and open source<sup>2</sup> control systems.

Owing to their flexibility, FPGAs also find application in a wide range of different tasks, encompassing clock signal generation<sup>3</sup>, DDS programming<sup>4–6</sup>, arbitrary waveform generation<sup>7</sup>, lock-in demodulation<sup>8</sup>, high-speed data acquisition (DAQ)<sup>9</sup>, and digital feedback servo systems<sup>10,11</sup>. Moreover, FPGAs are increasingly used for the control of quantum systems and processors and as feedback devices for quantum measurements, and can even be used within cryogenic environments<sup>12–16</sup>. Applications of FPGAs in space are attracting growing interest<sup>17</sup>. Despite all of these applications, the development of a custom FPGA-based system is time consuming and commercial solutions tend to be expensive. Nonetheless, the advent of cheap, multi-purpose FPGA development boards targeted at hobbyists and students offers a solution with low cost and short development time, from which experimental research can also benefit thanks to the impressive capabilities of these boards.

Here we present a control system with a novel approach based on a commercial, low-cost system-on-a-chip (SoC) board, consisting of a central processing unit (CPU) which is tightly connected to an FPGA and to a set of hardware interfaces used to communicate with external devices. A Linux operating system, executed on the CPU, gives the flexibility to use high-level programming languages, which can be quickly adapted to any specific request, such as interfacing with external devices like USB, Secure Digital (SD) memory card or Ethernet, without the need for additional hardware or specifically designed micro-controllers. Furthermore, the presence of an electrically isolated Gigabit Ethernet interface allows fast data transfer and easy connection to remote locations.

All these features represent a clear advantage of FPGA-SoC systems with respect to previously realized FPGA-based solutions<sup>18</sup>, not only in terms of the superior data rates offered by the Ethernet interface, but also in terms of the additional flexibility given by the easily programmable CPU and the fact that these are stand-alone systems which can be used independently of the hardware and software environment.

As a simple yet powerful application of such extended capabilities, here we demonstrate a novel scheme to auto-synchronize several boards using only two coaxial cables and the Ethernet communication. Without user interaction or dedicated real-time networking hardware<sup>19</sup>, the propagation delays of the signals among distant boards are measured by the boards and are corrected automatically with a residual timing error approaching 1 ns.

The paper is organized as follows: First, we present the board architecture in Sec. II, and the developed software in Sec. III. We then present the measured performance and the auto-synchronization scheme in Sec. IV, and discuss the results in Sec. V.



FIG. 1. a) Schematics of the control hardware. The experiment control sequence is sent from the control computer over an Ethernet network (yellow connections) to the FPGA-SoC boards (red). Each board, hosted in a separate rack where digital, analog and DDS devices can be freely inserted, is connected via a buffer card to a bus (gray ribbon cable). All FPGA-SoC boards are clocked (green connections) either by an external clock source or by the primary board clock signal. All the boards are synchronized via the clock and the trigger (blue connections) signals. b) Image of the FPGA-SoC board (red; back side visible), mounted on the buffer card (green; 100 mm $\times$ 160 mm Eurocard size). Backplane and power connectors are located on the right and bottom side. The trigger and clock I/Os are on the left-top and left-bottom side, respectively. c) Image of the FPGA-SoC board (front side). The SoC is located in the center, the Ethernet connector is on the top side, and the two rows of pin sockets on the left and right side are used to connect with the buffer card. The external clock input is on one of the connectors on the bottom.

## II. HARDWARE ARCHITECTURE

An overview of our setup is presented in Fig. 1a. A control computer generates the experiment control sequence (represented by a list of actions to be executed at a precise time) which is sent over Ethernet to one or several FPGA-SoC boards (distinguished by their IP address). Each FPGA-SoC board, hosted within a 19" rack, drives via a buffer card a parallel bus over which digital and analog output devices and DDS are programmed at the specified time. These devices ultimately control the experiment and all physical parameters. The system is compatible with the well-established architecture in use at LENS, consisting of digital output devices with 16 TTL channels, analog output devices with two 16-bit DAC channels with a maximum output of $\pm 10$ V, and DDS devices with two channels with up to 200 MHz output frequency, which can be modulated in frequency and amplitude. After the user has uploaded the control sequence, the experiment starts and the FPGA-SoC consecutively puts the data on the bus at the times defined in the time-stamp part of the control sequence. Once all samples are generated, the entire sequence can be repeated several times. For better timing accuracy, the clock source of the FPGA-SoC can be switched from the internal crystal oscillator to an externally provided clock signal.

The heart of our control system is the Cora-Z7 board from Digilent<sup>20</sup>, which hosts the Zynq-7007S (Zynq-7010) FPGA-SoC from Xilinx with a single (dual) core CPU (ARM Cortex A9) clocked at 650 MHz. This represents the smallest FPGA-SoC from the Xilinx Zynq-7000 series. The board is provided with 512 MB of DDR3 SDRAM (16-bit data, clocked at 525 MHz), Gigabit Ethernet, and USB host and device ports. The FPGA part is nearly the same for the two variants and is similar to the low-end Artix-7 FPGA series, aiming for low cost, low power consumption and less demanding applications. It should be noted that, while we chose a particular FPGA-SoC board with Gigabit Ethernet to implement our control system, the system and the methods presented in this paper can be implemented with any other FPGA-SoC board with similar performance. For example, the DE10-Nano from Terasic Inc. is a possible alternative<sup>21</sup>.

A custom designed buffer card<sup>22</sup> is used to buffer the FPGA-SoC board signals and to shift the voltage level from the internal 3.3 V logic level to the 5 V (TTL) level of the bus. The buffer card also provides the needed buffers for the clock and trigger line used for the synchronization of different boards, as described below. An image of the FPGA-SoC board mounted on the buffer card is shown in Fig. 1b, and in Fig. 1c an image of the FPGA-SoC board (front side) is shown.

### A. The FPGA logic

Here we give an overview of the logic used in the FPGA to generate the experiment control data on the bus and all the signals necessary for the synchronization of several boards. A simplified block diagram is shown in Fig. 2. The FPGA-SoC is composed of two parts: the processing system (PS, top, green), consisting of a CPU on which a Linux operating system is running, and the programmable logic (PL, bottom, yellow), where our custom hardware is implemented. The two parts of the FPGA-SoC are tightly bound via interfaces and buses, enabling mutual data exchange at high speed. In this way, the two main tasks of the board are effectively separated between the two independent parts of the FPGA-SoC. While the processing system handles the communication via Ethernet with an external control computer, the logic part produces the signals on the bus. The driver mediates between both parts and coordinates the access to the external memory. The source code for programming the FPGA is written in Verilog. It is synthesized and implemented with the Vivado 2017.4 software from Xilinx running on Ubuntu 18.04 LTS, and is available online<sup>22</sup>. Detailed information on the FPGA resources used for this application is reported in Tab. III in Appendix D.

FIG. 2. Simplified block diagram of the Zynq-7000 SoC with the user data flow on the chip (thick lines). The processing system (PS, green) with 32-bit dual-core CPU allows the server and driver to access peripherals such as Gigabit Ethernet (GigE) and DDR memory using high-level programming languages and Linux system services. The programmable logic (PL, yellow/orange/red) holds the custom implementation of hardware. Interfaces efficiently transfer data between the two parts. Two phase-locked loop (PLL) modules generate three different clocks (clock out, bus clock and detection clock) from an external clock source or from the PL system clock (yellow; selected by the multiplexer “MUX”). An overall dynamic phase shift $\phi_{ext}$ can be applied, as well as an individual phase shift $\phi_{det}$ on the detection clock. The user data is received over GigE by the server and is written via the driver into a memory region reserved for direct-memory-access (DMA). The timing module reads the data via DMA from memory and uses one FIFO (TX) to buffer and transmit the data into the bus clocking region (orange). Data is read back into memory with the same DMA interface and another FIFO (RX). The auto-synchronization module generates a pulse on the trigger line and waits for its reception and a programmed number of cycles $N_w$ before it gives the start signal for the timing module to generate the data on the bus. In combination with a phase-shifted detection clock (red), the pulse round-trip number of cycles $N_{RT}$ between two boards can be measured. All control and status registers in the PL part can be accessed by the driver via the AXI-Lite interface and are transmitted with clock-domain-crossing (CDC<sup>23</sup>) modules between the different clocking regions. The DMA and timing modules send interrupts (IRQ) to notify the driver of important events.

In brief, we use one general purpose I/O (GPIO) port for the reading and writing of memory mapped registers (via the AMBA AXI4-Lite interface<sup>24</sup>), and one high performance (HP) port to efficiently transfer the experiment control sequence from the memory to the PL part and vice versa (using direct memory access, DMA<sup>25</sup>, via an AXI stream bus). The clock frequencies of the PL part, the CPU and the DDR memory are set to their default values, corresponding to 50 MHz, 650 MHz and 525 MHz, respectively.

The experiment control sequence (represented by thick lines in Fig. 2) is sent via Ethernet from the control computer to a TCP/IP server application running on the CPU. The server application interacts with a Linux kernel driver module<sup>22</sup>, which writes the data into DDR memory and programs the FPGA registers using the AXI-Lite bus. The data are transferred via DMA from the memory into a transmit (TX) first-in-first-out (FIFO) buffer<sup>26,27</sup> which holds a maximum of 8192 samples of 128 bits each. The FIFO serves to buffer gaps in the DMA data transmission, and allows efficient transfer of data between regions using different clocks (clock domains). In addition, we have implemented a receive DMA channel (RX), which can be used, for example, to read data from an analog input device that sends data on the bus.

In our case, the experiment control sequence consists of 64 bits per sample: 32 bits are used for the time-stamp, 7 address bits select which device on the bus to be updated, and 16 device specific bits define the new state of the device<sup>28</sup>. The time-stamp defines at which time the bus should be updated with the specific data and address of the corresponding device. After the bus has been updated, a pseudo-clock pulse (strobe) is generated by the FPGA on another pin of the bus, to initiate the state change of the selected device<sup>29</sup>. The time-stamp is defined in units of  $1/\Gamma_{sample}$  with  $\Gamma_{sample}$  being the output sampling rate of the bus, typically set to 1 MHz or 10 MHz.
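The sample format described above can be illustrated with a short Python sketch. The field widths (32, 7 and 16 bits) are taken from the text, but the bit positions chosen here (time-stamp in the upper 32 bits, data in bits 0 to 15, address in bits 16 to 22) and the function names are assumptions made for illustration; the actual layout is defined by the hardware and by the sources in Ref. 22.

```python
TIME_BITS, ADDR_BITS, DATA_BITS = 32, 7, 16  # field widths quoted in the text


def time_to_stamp(t_seconds: float, sample_rate_hz: float) -> int:
    """Convert a time in seconds into units of 1/Gamma_sample."""
    return round(t_seconds * sample_rate_hz)


def pack_sample(time_stamp: int, address: int, data: int) -> int:
    """Pack one sample into a 64-bit integer (assumed bit layout)."""
    if not 0 <= time_stamp < (1 << TIME_BITS):
        raise ValueError("time-stamp does not fit into 32 bits")
    if not 0 <= address < (1 << ADDR_BITS):
        raise ValueError("address does not fit into 7 bits")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data does not fit into 16 bits")
    bus_word = (address << DATA_BITS) | data   # assumed: address above data
    return (time_stamp << 32) | bus_word       # assumed: time-stamp in upper half


def unpack_sample(sample: int) -> tuple[int, int, int]:
    """Inverse of pack_sample: returns (time_stamp, address, data)."""
    data = sample & ((1 << DATA_BITS) - 1)
    address = (sample >> DATA_BITS) & ((1 << ADDR_BITS) - 1)
    time_stamp = sample >> 32
    return time_stamp, address, data
```

For example, `pack_sample(time_to_stamp(1e-3, 1e6), 0x12, 0xABCD)` would describe an update of the hypothetical device at address `0x12`, 1 ms after the start of the sequence at $\Gamma_{sample} = 1$ MHz.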

The timing module is responsible for outputting the data on the bus. It first extracts one 64-bit sample from the 128 bits of the TX FIFO and compares the time-stamp with an internal counter running at $\Gamma_{sample}$. When they are equal, the module outputs the 16+7 data and address bits and generates the previously mentioned strobe signal. The timing module internally uses a dedicated 50 MHz bus clock, which can be either the PL system clock (i.e. the internal oscillator of the FPGA-SoC board), or it can be generated from an external clock signal using a phase locked loop (PLL) of the FPGA-SoC. In the latter case, the frequency allowed for the external clock signal ranges from a minimum value of 10 MHz, limited by the PLL, to a maximum of 300 MHz, limited by the input buffer on the buffer card. A second PLL is used as a software controlled multiplexer (MUX) to switch between the two clock sources<sup>30</sup>. Both PLLs allow the phase of the generated clock signals to be changed dynamically. The auto-synchronization module, discussed in Sec. II B, uses these signals to synchronize several boards. The timing module can also trigger the output of the experimental sequence, which alternatively can be started by an external hardware trigger or via software. Finally, both the DMA TX/RX channels and the timing module communicate with the driver via interrupts. The DMA channels generate interrupts when buffers need to be updated. The timing module generates one interrupt when the experimental control sequence has been completed. Further interrupts are generated at a configurable frequency, typically 16 Hz, and are used to update the board status in the control software.
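The time-stamp comparison performed by the timing module can be mimicked in software. The following Python sketch is a behavioral model only and is not the Verilog implementation of Ref. 22: it steps a counter at $\Gamma_{sample}$ and emits a sample whenever the counter equals the time-stamp at the head of a queue standing in for the TX FIFO. The function name and the tuple layout are hypothetical.

```python
from collections import deque


def run_timing_module(samples, n_ticks):
    """Behavioral model of the time-stamp comparison.

    samples: iterable of (time_stamp, address, data) tuples, assumed to be
             sorted by strictly increasing time_stamp.
    n_ticks: number of counter ticks (in units of 1/Gamma_sample) to simulate.
    Returns the list of (tick, address, data) bus updates; each entry stands
    for the data being placed on the bus followed by a strobe pulse.
    """
    fifo = deque(samples)          # stands in for the TX FIFO
    bus_events = []
    for counter in range(n_ticks):
        # Output the next sample only when its time-stamp equals the counter.
        if fifo and fifo[0][0] == counter:
            _, address, data = fifo.popleft()
            bus_events.append((counter, address, data))
    return bus_events
```

In this model a sample whose time-stamp lies beyond `n_ticks` simply stays in the queue, which corresponds to a sequence that has not yet finished.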

### B. Auto-synchronization

In order to synchronize several FPGA-SoC boards, all boards need to start the experimental control sequence simultaneously and they need to use the same clock source to execute each command at the same time. The common clock can be either generated by one board, or provided externally. In both cases, a suitable amplification and distribution system to all boards is needed, which might introduce unknown phase shifts. Additionally, a starting (trigger) signal needs to be distributed from one board to all the others, and can accumulate an unknown delay. As discussed in the following, our scheme takes into account and corrects for both these effects.

To compensate for the delay of the start trigger signal, we adopt a scalable scheme, where one trigger line is connected with high impedance to all participating boards, see Fig. 3a. The trigger line is a coaxial cable with $50\,\Omega$ termination on both ends to avoid unwanted reflections. One board, called the primary board, receives the start signal from the control computer (or from an external hardware trigger), and generates a pulse in the trigger line which is detected by the other “secondary” boards. In order to compensate for the pulse propagation time between the boards, the propagation time is automatically measured in advance, such that each board can delay its execution accordingly and all boards can start at the same time.

To measure the propagation delay, the primary board instructs via Ethernet one of the secondary boards to introduce a short circuit in the trigger line using a bipolar or a field-effect transistor. Then the primary board generates a pulse in the trigger line, and it measures the round-trip time  $t_{RT}^i = N_{RT}^i T + \Delta t_{RT}^i$  needed by the pulse to propagate to the secondary board  $i$ , be reflected at the short circuit, and travel back (see Fig. 3b). Here  $N_{RT}^i = N_{det}^i - N_{gen}^i$  is the number of cycles between the generation ( $N_{gen}^i$ , blue) and the detection ( $N_{det}^i$ , red) of the pulse, and  $\Delta t_{RT}^i < T$  is a fraction of the period  $T$  of the bus clock of the primary board. While  $N_{RT}^i$  can be measured directly,  $\Delta t_{RT}^i$  cannot. This limits the resolution to the period  $T$ , which is 20 ns for the chosen 50 MHz bus clock frequency, and would not be satisfactory for bus output rates above 10 MHz. To measure the total delay with higher accuracy, the reflected pulse is sampled with a phase shifted replica (detection clock) of the bus clock signal. A train of trigger pulses is generated, and the phase shift of the detection clock is varied between pulses. For a linear increase of the detection clock phase, at:

$$\phi_{-}^{p,i} = \Delta t_{RT}^i \frac{360^\circ}{T}, \quad (1)$$

the measured $N_{RT}^i$ reduces by one. This change in $N_{RT}^i$ is detected, and $\Delta t_{RT}^i$ can be obtained<sup>32</sup>. In principle, this method would allow one to achieve a time resolution of about 20 ps, given the $0.3^\circ$ phase resolution of the PLL at the used clock frequency. However, noise in the generation and detection of the pulse actually limits the resolution to larger values.

FIG. 3. Triggering and auto-synchronization scheme for multiple boards. a) In the simplest configuration all boards are connected with a common clock (period $T$) provided by the primary board and daisy-chained from one board to the next using splitters. Additionally, a common trigger coaxial cable directly connects all boards and is terminated with $50\,\Omega$. The primary board generates a pulse in the trigger cable which all secondary boards detect with individual delay. The primary and secondary boards wait until all secondary boards have received the trigger pulse and start generating output simultaneously. The delays between the primary and secondary boards for the clock $\tau_c^i$ and the trigger $\tau_p^i$ are indicated ($i \in 0 \dots N$), with $N$ the number of secondary boards. b) The trigger delay $\tau_p^i$ of each secondary board $i$ is measured during the auto-synchronization by determining the round-trip time $t_{RT}^i$ of the pulse (orange) from the difference of the number of cycles from the generation of the pulse ($N_{gen}^i$, blue) and its detection ($N_{det}^i$, red). The time correction $\Delta t_{RT}^i < T$ is obtained by repeating the measurement and detecting the reflected pulse with a phase-shifted detection clock with increasing detection phase (black, seven phases shown) with respect to the bus clock (green) which is used to generate the pulse. At the phase $\phi_{-}^{p,i}$ the measured $N_{RT}^i$ reduces by one cycle and $\Delta t_{RT}^i$ is obtained. For board $i$ the trigger delay is calculated from $\tau_p^i = t_{RT}^i / 2$. For the determination of $\Delta \tau_c^i$ a similar measurement is done on each secondary board, where $\Delta t_s^i$, $N_{bus}^i$ and $\phi_{-}^{s,i}$ take the roles of $\Delta t_{RT}^i$, $N_{gen}^i$ and $\phi_{-}^{p,i}$ in the figure. The clock delay $\Delta \tau_c^i$ at board $i$ is calculated from Eq. (4). See text and Appendix A 1 for more details<sup>31</sup> and Figs. 6a and 9 for examples of the detection signal for varying detection phase.

This measurement is repeated for each secondary board $i = 0 \dots N$. With the measured round-trip time $t_{RT}^i$, the propagation time of the pulse from the primary board to the $i$-th secondary board is calculated as:

$$\tau_p^i = t_{RT}^i / 2. \quad (2)$$

It is important to notice that in this simplified treatment we neglect all additional (but constant) delays, both internal to the FPGA and due to the electronics needed for the generation and detection of the pulse. Details of the full model accounting for these additional delays are given in Appendix A 1.
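The arithmetic of Eqs. (1) and (2) can be summarized in a short Python sketch. It follows the simplified treatment above and therefore also neglects the constant delays discussed in Appendix A 1. The function names and the example numbers are hypothetical and are not measured values.

```python
T_BUS = 20e-9  # period of the 50 MHz bus clock in seconds


def find_phase_step(phases_deg, n_rt_counts):
    """Return the first detection phase at which the measured cycle count
    N_RT drops by one with respect to the previous phase, or None."""
    for k in range(1, len(n_rt_counts)):
        if n_rt_counts[k] == n_rt_counts[k - 1] - 1:
            return phases_deg[k]
    return None


def fractional_delay(phi_deg, period=T_BUS):
    """Eq. (1) solved for the fractional delay: dt_RT = phi * T / 360 deg."""
    return phi_deg / 360.0 * period


def round_trip_time(n_rt, phi_deg, period=T_BUS):
    """t_RT = N_RT * T + dt_RT."""
    return n_rt * period + fractional_delay(phi_deg, period)


def trigger_delay(n_rt, phi_deg, period=T_BUS):
    """Eq. (2): tau_p = t_RT / 2."""
    return round_trip_time(n_rt, phi_deg, period) / 2.0


# Illustrative phase scan: the count drops from 3 to 2 at 180 degrees.
phi_step = find_phase_step([0, 90, 180, 270], [3, 3, 2, 2])
tau_p = trigger_delay(3, phi_step)  # 3 full cycles plus half a period, halved
```

With these illustrative inputs the round-trip time is 70 ns and the trigger delay 35 ns. Calling `fractional_delay(0.3)` gives about 17 ps, which is the ideal resolution of about 20 ps quoted above.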

In order to achieve a perfect synchronization among all boards, the measurement of $\tau_p^i$ for each board discussed above is not sufficient, since the clocks of the secondary boards must be corrected for the delays $\tau_c^i$ introduced along the clock distribution line (see Fig. 3a). In this case, however, one needs to know only the introduced clock delay $\Delta\tau_c^i = \tau_c^i \% T$, where $\%$ denotes the modulo operation. To this end, a second set of measurements is carried out, where the primary board generates a train of pulses similarly to the previous scheme, but the measurement is now taken on the secondary boards. Since the pulses do not need to be reflected, all the secondary boards can measure the respective clock delay simultaneously. Each secondary board determines the time $\Delta t_s^i$ between the arrival of the pulse and the previous rising bus clock edge, local to the secondary board. Similarly to the delay $\Delta t_{RT}^i$, here the quantity $\Delta t_s^i$ is obtained by detecting the arrival of the pulse with both the detection and the bus clock simultaneously, giving $N_{det}^i$ and $N_{bus}^i$ (blue dashed line in Fig. 3b), respectively. The difference between the two signals $N_{det}^i - N_{bus}^i$ is monitored for a reduction of one cycle at the phase:

$$\phi_{-}^{s,i} = \Delta t_s^i \frac{360^\circ}{T}, \quad (3)$$

and  $\Delta t_s^i$  can be obtained. At the secondary board location, the calculated pulse delay with respect to the primary bus clock is  $\Delta t_{RT}^i/2$  and the difference to the observed delay  $\Delta t_s^i$  gives the unknown clock delay:

$$\Delta\tau_c^i = \Delta t_{RT}^i/2 - \Delta t_s^i. \quad (4)$$

Once $\Delta t_{RT}^i$ and $\Delta t_s^i$ are determined for each secondary board $i = 0 \dots N$, the external clock PLL phases $\phi_{ext}^i$ of each secondary board can be set to $\phi_{ext}^i = -\Delta\tau_c^i \frac{360^\circ}{T}$. In this way, the clocks of all secondary boards are synchronized with that of the primary one, the auto-synchronization measurement is completed, and all parameters are set.

In order to simultaneously generate data on all boards, the primary board sends a pulse in the trigger line. It then waits until all secondary boards have detected the trigger pulse, i.e. it waits for the largest propagation time $\tau_p^i$. Each secondary board $i$ waits $\tau_p^i$ less time than the primary board. After these waiting times, all boards synchronously start outputting data on their buses.
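The correction step can be sketched in Python as follows, using Eq. (4), the expression for $\phi_{ext}^i$, and the waiting rule stated above. Wrapping the phase into the interval from 0° to 360° is an assumption made here for convenience, and the function names and example values are hypothetical; the full model with constant delays is given in Appendix A 1.

```python
T_BUS = 20e-9  # period of the 50 MHz bus clock in seconds


def clock_delay(dt_rt, dt_s):
    """Eq. (4): d_tau_c = dt_RT / 2 - dt_s."""
    return dt_rt / 2.0 - dt_s


def external_phase_deg(dt_rt, dt_s, period=T_BUS):
    """phi_ext = -d_tau_c * 360 deg / T, wrapped into [0, 360) (assumption)."""
    return (-clock_delay(dt_rt, dt_s) * 360.0 / period) % 360.0


def wait_times(tau_p):
    """Waiting times after the trigger pulse has been generated.

    tau_p: list of trigger delays tau_p^i, one per secondary board.
    The primary board waits for the largest delay; secondary board i
    waits tau_p^i less than the primary board.
    Returns (primary_wait, [secondary_wait_0, secondary_wait_1, ...]).
    """
    primary_wait = max(tau_p)
    return primary_wait, [primary_wait - t for t in tau_p]


# Illustrative values only: dt_RT = 8 ns and dt_s = 1 ns give d_tau_c = 3 ns.
phi_ext = external_phase_deg(8e-9, 1e-9)          # about 306 degrees
primary, secondaries = wait_times([35e-9, 10e-9])  # 35 ns and [0 ns, 25 ns]
```

In the hardware the waiting times are applied as a programmed number of cycles $N_w$ (see Fig. 2); the conversion to cycles is omitted in this sketch.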

While we refer the reader to Appendix A 1 for more details, we emphasize that our auto-synchronization scheme allows for the synchronization of many boards on time scales of order of nanoseconds with a relatively simple scheme and few external components. A first experimental demonstration of this scheme together with measurements of the residual synchronization timing error are presented in Sec. IV.

## III. SOFTWARE

In this section we summarize the software implementation on the PS/CPU part of the SoC, on which a Linux operating system is running<sup>33</sup>. This is a fully featured operating system which provides system services and interfaces to external devices, and that can be configured for our specific needs.

To the operating system, the PL part appears as an external device, and our device driver can communicate with it via registers<sup>22</sup>.

### A. Control computer software

Many research laboratories, including ours, typically employ either LabVIEW or LabWindows/CVI<sup>34</sup> as user application programs. While our setup is currently adapted to work with this software, we emphasize that any other user application can easily be used with our hardware, provided that the data are sent via Ethernet to our TCP/IP server running on the FPGA-SoC. No additional driver or hardware is required, and no constraints are placed on the operating system of the control computer. For example, the freely-available, Python-based control software “labscript suite”<sup>35</sup> might be a viable alternative to the above mentioned commercial solutions. We provide the necessary files in Ref. 22 to use our FPGA-SoC board together with the suite.

In our specific case, we upgraded an existing control system based on a digital I/O card<sup>36</sup> installed on the experiment control computer, driving the bus via a 2 m long cable and a buffer card. The FPGA-SoC system completely replaces the former system, while maintaining compatibility with the previous hardware and software. For this, a new Windows dynamic link library (DLL) has been written, which communicates via Ethernet with the FPGA-SoC while keeping the same functions as the previous I/O card.

### B. TCP/IP server and Linux device driver

We have designed a simple TCP/IP server application, running on the FPGA-SoC, which receives commands and the user data from the control computer, and which communicates with our device driver that mediates between the two FPGA-SoC parts, see Fig. 2.

Our server application can control, via the device driver, the FPGA PL part, write the user data into reserved DMA (coherent) memory, and receive status information from it. The driver allows a user application to read back data from the PL part, wait for interrupts or for the end of the sequence. The driver maintains the ring buffers for the DMA transfer, and responds to the corresponding interrupts. We have reserved 128 MiB of memory for coherent DMA transfer. This size corresponds to  $10^7$  samples and 10 seconds of contiguous data output at  $\Gamma_{sample} = 1$  MHz. However, most applications typically do not require such a large number of samples and dense output of data. If needed, data could be uploaded via Ethernet during the experimental run as well. The reserved size is sufficiently large to store all user data directly into coherent memory, which keeps the server and driver simple, and it avoids additional copying for repeated runs. A timer interrupt, generated by the PL part, and transmitted by the driver, allows the server application to send status information at regular intervals to the control computer.
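The relation between the reserved memory and the available output time can be checked with a few lines of Python. The number of bytes stored per sample is an assumption in this sketch: 12 bytes is the value of $\beta$ used for the measurements in Sec. IV A, and it is passed as a parameter so that other values can be tried.

```python
RESERVED_BYTES = 128 * 2**20  # 128 MiB of coherent DMA memory


def buffer_capacity(bytes_per_sample, sample_rate_hz,
                    reserved_bytes=RESERVED_BYTES):
    """Return (number of samples, duration in seconds) that fit into the
    reserved memory for a given sample size and output sampling rate."""
    n_samples = reserved_bytes // bytes_per_sample
    return n_samples, n_samples / sample_rate_hz


# Assumed 12 bytes per sample at an output rate of 1 MHz.
n_samples, duration_s = buffer_capacity(12, 1e6)
```

Under this assumption the buffer holds about $1.1 \times 10^7$ samples, i.e. roughly 11 s of contiguous output at 1 MHz, of the same order as the figures quoted above.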

### C. Startup script

When the board is powered up, a bootloader reads from an SD card the binary data to program the PL part, loads the required Linux image into memory, and starts the operating system. After this is completed, our startup script reads a configuration file from the SD card which contains the IP address and other information, with which it configures the Linux system and launches our TCP/IP server application. The server may either initiate the auto-synchronization procedure on startup, or wait for instructions from the control computer. The startup script and the text configuration file allow the configuration of the board to be changed without recompiling the binary code from the sources.

## IV. MEASUREMENTS AND RESULTS

In this section we present and discuss measurements done on the FPGA-SoC board. For these measurements, specific code running on the FPGA-SoC system has been written, and the data has been acquired directly on the board and stored on a micro-SD card<sup>37</sup> for further analysis. Except for the verification of the synchronization error, no external measurement was needed. All the data presented in the paper is available in Ref. 38.

In the first part, Sec. IV A, measurements of the DMA transmission rates are shown, defining how fast data can be transmitted from the external memory into the PL part and back. This represents a direct measure of the maximum sampling rate at which the board can contiguously output and input data. In the second part, Sec. IV B, we present measurements on the data uploading rates over Gigabit Ethernet for both the Cora-Z7-10 and Cora-Z7-07S boards. This measurement confirms that Gigabit Ethernet is a good choice for experiments where a fast cycle time is required. In the last part, Sec. IV C, we present first measurements of the proposed auto-synchronization scheme outlined in Sec. II B, tested on a simple two-board configuration. An additional measurement presented in Appendix C demonstrates the start- and stop trigger option in cycling mode.

### A. DMA transmission rates

In order to measure the DMA data transmission rates of the FPGA-SoC board we have temporarily added a module in the PL part which allows one to transmit data without delay in a “loop-back” configuration between the TX and the RX FIFO buffers (see Fig. 2), and to measure the time interval required to transmit a certain number of samples. From the measured time $t$ and the number of samples $N$ we calculate the average data rate $\Gamma$ in MB/s using $\Gamma = \beta N/t$, with $\beta = 12$ bytes per sample for this measurement. In particular, we measure three distinct rates, shown in Fig. 4 for the Cora-Z7-10 board, as a function of the number of samples $N$: the transmission rate from the memory to the PL part (TX DMA, red circles), the transmission rate from the PL part to the memory (RX DMA, orange squares) and the transmission rate through the RX FIFO (green diamonds). Each experimental point (error bar) shown in the figure represents the mean value (standard deviation) of at least 20 repeated measurements for each $N$. The data are well fitted by a simple model (solid curves in Fig. 4) that has one delay and two rates as free parameters. For details about the fitting function and the fit results, we refer the reader to Appendix B and Tab. II therein.

FIG. 4. Measurement of DMA data transmission rates of the Cora-Z7-10 board as a function of number of samples $N$. Each measurement point is the mean value of at least 20 measurements and the error bar corresponds to the standard deviation. The curves are fits to the data as explained in Appendix B. The vertical dotted line at 8192 samples corresponds to the TX and RX FIFO buffer size. The horizontal dotted line at 600 MB/s corresponds to 1 sample/cycle for the 50 MHz PL clock frequency and the horizontal red dotted line is the fitted $\Gamma_{DMA} = 341(1)$ MB/s for large number of samples.

For the measurement of the TX transmission rate (red circles in Fig. 4) we measure the time interval from the first sample received out of the TX FIFO until the $N$-th sample is received. The first four samples are transmitted with the maximum possible rate of one sample per cycle, i.e. $\Gamma_{max} = \beta \times f_{PL} = 600$ MB/s (horizontal black dotted line) for the PL clock frequency of $f_{PL} = 50$ MHz. This is because the TX FIFO already contains three to four samples when the measurement starts (in agreement with the simulated latency of the FIFO used). As $N$ is increased, the rate drops rapidly until it reaches a constant rate $\Gamma_{DMA}$ (horizontal red dotted line), corresponding to the transmission rate from memory to the PL part. We remark that this characterization does not allow us to measure a possible delay between the start of the DMA transmission, initiated by the CPU, and the arrival of the first sample.
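The two rate definitions used in this section can be written down as a minimal sketch. The function names are illustrative and are not taken from the published source code; MB is taken as $10^6$ bytes, in line with $\beta \times f_{PL} = 600$ MB/s.

```python
BYTES_PER_SAMPLE = 12  # beta, bytes per sample used for this measurement


def average_rate_mb_s(n_samples, t_seconds, beta=BYTES_PER_SAMPLE):
    """Average data rate Gamma = beta * N / t in MB/s (1 MB = 1e6 bytes)."""
    return beta * n_samples / t_seconds / 1e6


def max_rate_mb_s(f_pl_hz, beta=BYTES_PER_SAMPLE):
    """Limit Gamma_max = beta * f_PL for one sample per PL clock cycle."""
    return beta * f_pl_hz / 1e6


# One sample per cycle at the 50 MHz PL clock
print(max_rate_mb_s(50e6))  # 600.0
```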

The second measurement (orange squares in Fig. 4) shows the RX transmission rate obtained from the time interval between the first sample written into the RX FIFO and the RX DMA interrupt<sup>39</sup>, which indicates that all $N$ samples have been transmitted from the PL part to the external memory. This second rate increases for increasing $N$, from very small to the same $\Gamma_{DMA}$ as observed for the TX measurement. This initial increase is consistent with a constant delay of 202(8) PL cycles, required for the RX DMA channel to start or finish the transmission. This delay is larger than expected<sup>40</sup>, and it points to a significant latency in the RX channel. Nonetheless, the large RX FIFO can easily compensate for such a latency.

The third measurement, shown in Fig. 4 as green diamonds, was taken simultaneously with the RX transmission rate, and it shows the data rate through the RX FIFO: namely, the rate obtained from the time  $N$  samples need to pass through the RX FIFO during active RX transmission. As long as the RX FIFO is not full, one sample per cycle is transmitted, corresponding to  $\Gamma_{max}$ . When the RX FIFO becomes full with  $N_{FIFO} = 8192$  samples (dotted vertical line in Fig. 4), the rate reduces to the RX and TX data transmission rate  $\Gamma_{DMA}$ . Since the RX FIFO is simultaneously loaded with  $\Gamma_{max}$ , and unloaded with  $\Gamma_{DMA}$ , we expect this rate to drop once the number of transmitted samples reaches  $N_{FIFO} \frac{\Gamma_{max}}{\Gamma_{max} - \Gamma_{DMA}} \approx 19 \times 10^3$  samples, a value close to the observed one of  $20(1) \times 10^3$  samples.
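The expected crossover can be checked with a short sketch. The FIFO fills at the net rate $\Gamma_{max} - \Gamma_{DMA}$, so it is full after a time $t = N_{FIFO}\,\beta / (\Gamma_{max} - \Gamma_{DMA})$, during which $\Gamma_{max}\, t / \beta$ samples have been written. The function name below is illustrative only.

```python
def samples_until_fifo_full(n_fifo, gamma_in, gamma_out):
    """Samples written before a FIFO of depth n_fifo becomes full.

    The FIFO is loaded at gamma_in and drained at gamma_out
    (same units, gamma_in > gamma_out).
    """
    if gamma_in <= gamma_out:
        raise ValueError("FIFO never fills unless gamma_in > gamma_out")
    return n_fifo * gamma_in / (gamma_in - gamma_out)


# Values quoted in the text: N_FIFO = 8192, Gamma_max = 600 MB/s, Gamma_DMA = 341 MB/s
n_cross = samples_until_fifo_full(8192, 600.0, 341.0)
print(f"{n_cross:.0f} samples")  # about 19e3 samples
```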

All three measurements give, for large numbers of samples, a consistent DMA transmission rate of $\Gamma_{DMA} = 341(1)$ MB/s (averaged over all measurements). This rate deviates from the rates specified by Xilinx<sup>25</sup> for the default settings. In particular, the TX rate is lower while the RX rate is higher than specified. However, their measured sum is 684(2) MB/s, which is only 2% lower than the value of 700 MB/s expected from the specification. Although the exact reason for this discrepancy is not clear (the ratio between the TX and RX rates can be adjusted<sup>41,42</sup>), the observed overall performance allows us to conclude that our DMA transmission rates are indeed close to the maximum possible ones for a single HP port. Finally, from the measured DMA transmission rate we can also directly deduce the maximum contiguous bus data rate of $\Gamma_{DMA}/\beta = 30$ to $40$ MHz<sup>43</sup>.

We note that the FPGA-SoC has 4 HP ports, and in our design there should be enough free resources to use at least one additional port to increase the DMA rate even further, if higher bus rates are needed. Short “bursts” of data output (input) of up to 8192 samples at higher frequencies are already possible with the present setup, as long as there is sufficient time before the “burst” to fill (empty) the TX (RX) FIFO and the rate afterwards is slow enough to prevent the TX (RX) FIFO from becoming empty (full). Although not shown here, we have performed the same measurement for the Cora-Z7-07S board, finding no significant deviation from the results presented in Fig. 4.

### B. Ethernet uploading rates

The uploading rate from the control computer to the FPGA-SoC board over Gigabit Ethernet is another measure of the performance of our system. It can be a limitation for experiments where short cycle times are needed, like experiments with optical tweezers<sup>45</sup> or with ions<sup>46</sup>.

Fig. 5a shows the uploading rate measured for the Cora-Z7-10 (solid blue circles) and Cora-Z7-07S (solid orange squares) boards. This measurement includes the total time of uploading and writing into reserved DMA memory. For each board the fastest strategy is used, depending on whether a dual-core CPU (Cora-Z7-10) or only a single-core CPU (Cora-Z7-07S) is present: for the dual-core CPU the server uses one thread to receive the uploaded data and a second thread to write the data into reserved DMA memory in parallel. For the single-core CPU it is fastest to immediately write the uploaded data into reserved DMA memory using a single thread<sup>47</sup>. Fig. 5b shows the corresponding times for the same data as in Fig. 5a.

FIG. 5. a) Measured rates for uploading and writing to reserved DMA memory (solid symbols) and uploading only (open symbols) as a function of number of samples $N$ for the Cora-Z7-10 (blue circles) and Cora-Z7-07S (orange squares) boards. The horizontal dotted line indicates the theoretical maximum rate of 118.7 MB/s for Gigabit Ethernet<sup>44</sup> and the vertical dotted line indicates the size of the receive buffer of the server. The numbers are the measured uploading and writing rates for $10^7$ samples. Each data point is the mean of at least 15 measurements and the error bar represents the standard deviation. The dotted curves are fits with Eq. (B1) with a delay and a single rate, and the fit results are summarized in Tab. II. b) Same data as in panel a, but showing the time for uploading only, or for uploading and writing to memory. Numbers give the fitted time needed for uploading and writing to DMA memory for 4 samples and $10^7$ samples for the Cora-Z7-10 (blue) and Cora-Z7-07S (orange) boards.

The rates are calculated from $\Gamma = N\beta / (t_{tot} - t_{ACK} - t_{RT}^{net}/2)$, where $N$ is the number of transmitted samples and $\beta = 12$ bytes per sample is used for the measurement. The time $t_{tot}$ is the time at which uploading and writing to memory is finished, and $t_{ACK}$ is the time at which the server acknowledges that it is ready to receive the data from the user application. The network round-trip time $t_{RT}^{net}$ is obtained during each individual measurement as the time from the acknowledgement of the server ($t_{ACK}$) until the arrival of the first data at the server. We take half of $t_{RT}^{net}$ under the assumption that sending and receiving involve the same delays, which is not necessarily the case. For each data point we have taken at least 15 measurements and plot the mean value and standard deviation (error bar).
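As an illustration of this rate definition, the following sketch evaluates $\Gamma$ from three timestamps. The function name and the example timestamps are hypothetical and chosen only to land near the measured rate; they are not measured values.

```python
def upload_rate_mb_s(n_samples, t_tot, t_ack, t_rt_net, beta=12):
    """Gamma = N * beta / (t_tot - t_ACK - t_RT_net / 2) in MB/s.

    All times are in seconds; half of the network round-trip time is
    subtracted, assuming symmetric sending and receiving delays.
    """
    duration = t_tot - t_ack - t_rt_net / 2
    if duration <= 0:
        raise ValueError("non-positive transfer duration")
    return n_samples * beta / duration / 1e6


# Hypothetical timestamps for an upload of 1e7 samples
rate = upload_rate_mb_s(10_000_000, t_tot=2.1252, t_ack=0.001, t_rt_net=0.0004)
print(f"{rate:.1f} MB/s")
```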

For small numbers of samples the observed uploading rate is small. This can be interpreted as a fixed delay (of the order of a few 100 $\mu$s, see Fig. 5b), which the user application or the server needs to start sending or receiving the data. For increasing numbers of samples, this delay becomes less important and the rate reaches a peak of about 70 to 80 MB/s at $32 \times 10^3$ samples (vertical dotted line), and decreases for larger numbers of samples. At $10^7$ samples the uploading and writing rate is 56.5(3) MB/s (47.2(4) MB/s) for the Cora-Z7-10 (Cora-Z7-07S) board, which corresponds to a time of 2.13(1) s (2.54(2) s). This time is even shorter than the typical calculation time the user application needs (about 7 s with labscript-suite) to generate this number of samples.

The peak in the rate is correlated with the receive buffer size (512 kB) of the server. If the buffer is chosen too small, the decrease in the rate at higher $N$ becomes much worse. This indicates that the overhead in handling large lists of small buffers can become significant. In this respect the Cora-Z7-10 board performs slightly better than the Cora-Z7-07S board, which is limited by its single-core CPU.

For comparison, we present another measurement where data are only uploaded, without writing to the reserved DMA memory. The resulting rates for the Cora-Z7-10 (open blue circles) and Cora-Z7-07S (open orange squares) boards are shown in Figs. 5a and 5b. For the calculation of the rate, $t_{tot}$ is now the time until all data are uploaded, without writing to reserved DMA memory. For the Cora-Z7-10 board the peak uploading rate reaches about 110 MB/s, which is very close to the theoretical maximum of 118.7 MB/s for Gigabit Ethernet<sup>44</sup>. With about 90 MB/s, the Cora-Z7-07S board is only slightly slower. In this measurement the CPU is still copying data into temporary buffers, which explains the difference between the boards and the observed decrease of the rate after the peak.

We fit the measurements with Eq. (B1) in Appendix B, using a delay time and a single transmission rate (dotted curves in Fig. 5). The fit is weighted by the standard deviation of each data point, which puts more weight on the less noisy data at large numbers of samples. See Tab. II for the fit results. The numbers in the figure are the fitted rates and times for both boards when uploading and writing $10.5 \times 10^6$ samples to reserved DMA memory.

The observed fast uploading and writing rates confirm that the FPGA-SoC board is indeed a suitable choice for applications where fast cycle times are required.

### C. Auto-synchronization

Here we present the first realization of the auto-synchronization scheme proposed in Sec. II B. In particular, first tests have been done utilizing two boards connected with different trigger cable lengths and using different external clock phases. Without loss of generality, we present the synchronization of the two boards that are directly connected with the trigger line, terminated with  $50\Omega$  on the primary board side and switchable on the secondary board side from  $50\Omega$  to high impedance to reflect the pulse. In the following we omit the index  $i = 0$  since here only one secondary board is used. For details on the theoretical analysis and the measurement of the secondary board external clock PLL phase we refer the reader to Appendix A 1 and A 2.

On the primary board we measure the round-trip cycle time $N_{RT}$ of the reflected pulse, and the phase $\phi_-^P$ at which $N_{RT}$ is reduced by one, see Fig. 6a for different lengths of the trigger coaxial cable<sup>48</sup>. Combining both measured values of $N_{RT}$ and $\phi_-^P$ we obtain, from Eq. (A1) in Appendix A 1, the round-trip time $t_{RT}$ shown in Fig. 6b. From a linear fit to the data (green line) we obtain the propagation delay per unit of cable length $L$ of $\frac{d\tau_p}{dL} = 4.9(4)$ ns/m, when averaged over leading and trailing edges of the pulse. This value is consistent with the expected one<sup>49</sup>.

FIG. 6. Auto-synchronization result for two boards at different trigger cable lengths. a) Round-trip cycle time $N_{RT}$ for the reflected pulse leading edge vs. detector phase shows jumps of one cycle at specific phases ($\phi_+$ and $\phi_-$, see Sec. II B and Appendix A 1 for details). Data are shown for selected cable lengths. b) Pulse round-trip time $t_{RT}$ calculated with Eq. (A1) for the trailing edge of the pulse for 12 cable lengths. The slope of the linear fit gives a propagation delay per unit cable length of $\frac{d\tau_p}{dL} = 4.9(4)$ ns/m, when averaged over leading and trailing edges of the pulse. c) Synchronization error as a function of cable length. Each point and error bar is the mean and standard deviation of five repetitions with external clock phase 0, 90, 180 and 270°. The red shaded area represents the 68% confidence interval of the average error over all data giving $(-0.5 \pm 1.3)$ ns. The inset shows all signal traces of the primary (blue) and secondary board (red) used to measure the synchronization error.

Based on a similar measurement protocol<sup>50</sup>, the secondary board determines the phase  $\phi_-^S$  of the negative jump in  $N_{det} - N_{bus}$  for the received pulse. The local clock of the second board is locked to the external clock provided by the primary one, where a short (ca. 20 cm long) cable is employed to ensure no additional phase shifts. To simulate different delays  $\Delta\tau_c$  of the external clock, four different auto-synchronization measurements are performed, where the external clock PLL phase of the secondary board is set to 0, 90, 180 or 270°, corresponding to  $\Delta\tau_c = 0, 5, 10$  or 15 ns respectively.
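The correspondence between the programmed PLL phase and the simulated clock delay is a simple proportionality, $\Delta\tau_c = \frac{\phi}{360^\circ} T$. A minimal sketch (the function name is illustrative), assuming the 50 MHz external clock used for this measurement:

```python
def phase_to_delay_ns(phase_deg, f_clk_hz=50e6):
    """Convert a clock phase in degrees into a delay in ns."""
    period_ns = 1e9 / f_clk_hz  # T = 20 ns at 50 MHz
    return phase_deg / 360.0 * period_ns


print([phase_to_delay_ns(p) for p in (0, 90, 180, 270)])  # [0.0, 5.0, 10.0, 15.0]
```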

The resulting synchronization error is verified in a final measurement for each cable length and  $\Delta\tau_c$  after the auto-synchronization is finished, see Fig. 6c. For this measurement, the resulting phase  $\phi_{ext}$ , obtained from Eq. (A4) in Appendix A 1, is added to the previously set external PLL clock phase  $\Delta\tau_c \frac{360^\circ}{T}$ , which, for perfect synchronization, should be compensated by  $\phi_{ext}$ . Then the primary board generates a trigger pulse and waits  $N_w^{prim} = \tau_p // T$  cycles (see Eq. (A7) and (A8) in Appendix A 1; the symbol  $//$  represents integer division), before it starts generating data on the bus. The secondary board starts generating data on the bus as soon as the trigger signal is detected. The synchronization error corresponds to the difference between the times at which secondary and primary boards start generating data on their own buses. The corresponding traces are recorded with an oscilloscope, see the inset of Fig. 6, and are fitted with a sigmoid function to obtain the synchronization error. See Appendix A 3 for further details. In Fig. 6c each data point (error bar) represents the mean (standard deviation) of the synchronization error, measured at least five times for each of the four external clock phases ( $\Delta\tau_c$ ). Averaging over all cable lengths, we obtain a synchronization error of  $(-0.5 \pm 1.3)$  ns (red shaded area in Fig. 6c) which is much smaller than the 25 ns time resolution for the maximum possible bus output rate of 40 MHz of the board.

Finally, we remark that although the basic principle of our auto-synchronization scheme is very simple, being based on a round-trip time measurement, the details can be involved. Developing such a scheme on an FPGA-only platform is feasible, but it might be challenging and time-consuming. In contrast, our FPGA-SoC board allows one to implement simple pulse generation and detection in hardware, while the data are analyzed and the ideal settings to minimize the error are calculated in software on the CPU. In this way, the system could be developed quickly, with errors corrected and formulas implemented in software, with no need to change the hardware every time. We believe that the auto-synchronization is not only a useful feature, but also a perfect example of the flexibility which the FPGA-SoC approach offers.

## V. CONCLUSIONS AND OUTLOOK

In conclusion, we have successfully implemented a versatile experimental control system based on a commercial, low-cost, and stand-alone FPGA-SoC board. We have demonstrated that the board can sustain bus output and input rates of up to 40 MHz and we have shown how the board can automatically synchronize with a timing error approaching 1 ns. Furthermore, we have proven the extreme flexibility, easy Ethernet connectivity, and computational power of the FPGA-SoC system, showing several examples in which the operating system, running on the board itself, is used not only to control the FPGA hardware, but also for data acquisition and analysis. Finally, we stress that no specific device driver, proprietary software, or operating system is needed to use our device, and that the whole source code to program the FPGA-SoC is freely available<sup>22</sup>. Although not discussed in the present work, our system can be easily extended to include the control of additional devices through the on-board USB host controller<sup>51</sup>, or via an adapter with the older GPIB standard<sup>52</sup>, widespread in many laboratories, or to directly read data with analog-to-digital converters (ADC). We also emphasize that our design is stand-alone and lightweight, and that its power consumption of less than 2 W makes it compatible with operation in remote locations, and even with experiments in space<sup>53–55</sup>. We believe that the auto-synchronization feature, devised and implemented in this work, will also help several ground-based experimental setups of growing complexity: for instance, setups which must bridge large distances to challenge relativity<sup>56</sup>, to detect gravitational waves with large-scale atom interferometers<sup>57,58</sup>, and to measure the difference in gravitational red-shift between two separated atomic lattice clocks<sup>59</sup>. Finally, our architecture, thanks to the rich features and flexibility offered by the new FPGA-SoC board, may find application in various research fields, extending well beyond our original purpose of controlling AMO physics experiments.

## ACKNOWLEDGMENTS

We thank Jacopo Catani for fruitful discussions, lending equipment and careful reading of the manuscript, Roberto Concas and Fabio Corti for machining and soldering a prototype buffer card, Giacomo Mazzamuto for help with GitHub, and all members of the Quantum Gases Group at LENS, in particular Leonardo Fallani and Daniele Tusi and the Yb team for testing the boards in their experiment. This work was supported by the ERC through grant No. 637738 PoLiChroM and by the Italian MIUR through the FARE grant No. R168HMFYM P-HELiCS. N.P. acknowledges support from European Research Council, Grant No. 772126 (TOCTOCGRAV).

The authors declare that they have no competing interests.

## DATA AVAILABILITY STATEMENT

| AVAILABILITY OF DATA                                                        | STATEMENT OF DATA AVAILABILITY                                                                                                                                       |
|-----------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Data openly available in a public repository that issues datasets with DOIs | The data that support the findings of this study are openly available at https://doi.org/10.5281/zenodo.4893285 |

## Appendix A: Auto-synchronization

In Sec. A 1 we present the full model of the auto-synchronization scheme outlined in Sec. II B and in Sec. A 2 we show additional data for the first implementation presented in Sec. IV C. In Sec. A 3 the fitting function is presented which is used to obtain the synchronization error shown in Fig. 6c in Sec. IV C. In Sec. A 4 sample detector signals are shown.

| name                     | value           | remark                                                              |
|--------------------------|-----------------|---------------------------------------------------------------------|
| $t_g + t_d$              | 205(1) ns       | offset from linear fit Fig. 6b <sup>a,b</sup>                       |
| $t_d$                    | -2(1) ns        | offset from linear fit Fig. 8 at $\Delta\tau_c = 0$ ns <sup>a</sup> |
| $\phi_+$                 | 25(1)$^\circ$   | measured <sup>b</sup>                                               |
| $\varphi_p^{crit}$       | 180(20)$^\circ$ | measured <sup>c</sup>                                               |
| $\phi_0$                 | 20$^\circ$      | fine-adjusted manually to minimize the error                        |
| $N_0$                    | 3               | adjusted manually to minimize the error                             |
| $\varphi_m$              | 90$^\circ$      | chosen                                                              |
| $\varphi_{add}$          | 70$^\circ$      | chosen                                                              |
| $\Delta\varphi_p^{crit}$ | 20$^\circ$      | chosen                                                              |
| $\delta\varphi_p^{crit}$ | 30$^\circ$      | chosen                                                              |

TABLE I. Used constants for the auto-synchronization. The measured standard deviation is given in brackets.

<sup>a</sup> Obtained from earlier measurements.

<sup>b</sup> At $f_{PL} = 50$ MHz.

<sup>c</sup> Error is smaller than $\Delta\varphi_p^{crit}$ but was not systematically measured.

### 1. Theoretical model

A graphical representation of all the quantities and delays involved in the synchronization scheme is presented in Fig. 7 for the primary and secondary boards. The measurement on the primary board gives for each secondary board  $i$  the round-trip number of cycles  $N_{RT}^i = N_{det}^i - N_{gen}^i$  and the negative jump in  $N_{RT}^i$  gives  $\Delta t_{RT}^i$  from Eq. 1. On the secondary board the time  $\Delta t_s^i$  is measured from the negative jump in  $N_{det}^i - N_{bus}^i$  using Eq. 3. From these quantities the waiting number of cycles  $N_w^i$  and the external clock phase  $\phi_{ext}^i$  and the detector phase  $\phi_{det}^i$  (see Fig. 2) are calculated as described below.

The model uses a set of constants which are summarized in Tab. I. They have been determined from several calibration measurements, or have been chosen for best performance, as described below. The PL system clock is 50 MHz for this measurement, but it should affect only  $t_g + t_d$  (see below) through the fixed number of clock cycles used for the CDCs. After these parameters have been determined, they can be applied for all boards and should not need to be changed as long as the boards are the same and the setup (hardware and software) is not changed.

Taking into account the generation time  $t_g$  (green) and the detection time  $t_d$  (orange) of the pulse, the round-trip time  $t_{RT}^i$  and propagation time  $\tau_p^i$  between the primary and the secondary board is obtained from:

$$t_{RT}^i = \begin{cases} N_{RT}^i \\ N_{RT}^i + 1 \end{cases} T + \Delta t_{RT}^i \quad \text{for} \quad \begin{cases} \phi_{-}^{p,i} > \phi_{+} \\ \phi_{-}^{p,i} < \phi_{+} \end{cases} \quad (\text{A1})$$

$$\tau_p^i = (t_{RT}^i - t_g - t_d) / 2.$$

This is the full relation corresponding to Eq. 2 in Sec. II B. At the phase $\phi_+$ of the detector clock, the measured $N_{RT}$ increments by one cycle. $\phi_+$ is at a small and positive detector phase, because the signal for $N_{gen}$ (blue solid line in Fig. 3b) has to be transmitted from the bus clock to the detection clock, and for too small a delay between the clock edges the signal is transmitted one cycle later. For detector phases above $\phi_+$ the signal can be transmitted within the same clock cycle<sup>60</sup>. Therefore, one cycle has to be added to $t_{RT}^i$ when the measured $\phi_{-}^{p,i} < \phi_{+}$. This happens regardless of the additional clock-domain-crossing stage (CDC, see Fig. 2, omitted in Fig. 3b for clarity), which is needed for the transmission of the signal for $N_{gen}$ from the bus clock to the detection clock. The sum $t_g + t_d$, used for the calculation of the propagation time $\tau_p^i$, is the experimentally obtained offset of the linear fit of the round-trip time vs. cable length (see green line in Fig. 6b).

FIG. 7. Graphical representation of the main quantities (delays and phases) involved in the auto-synchronization scheme. Upper part: the primary board generates the pulse and waits until detection of the reflected signal after a propagation time of $2 \times \tau_p^i$ (light gray). Delays involving the generation ($t_g$, green) and the detection ($t_d$, orange) of the pulse have to be added for the calculation of the total round-trip time $t_{RT}^i = N_{RT}^i T + \Delta t_{RT}^i$, with $T$ the clock cycle time. Lower part: the secondary board $i$ detects the pulse after the propagation time $\tau_p^i$ (dark gray), and the same delays as for the primary board are assumed. The delay of the local clock of the secondary board with respect to the primary board is $\Delta\tau_c^i$ and can be calculated from the difference $\varphi_p^i - \varphi_s^i$. The measured quantities $N_{RT}^i$, $\Delta t_{RT}^i$ and $\Delta t_s^i$ are indicated in red. The width of the pulse $w_p$ changes during the propagation due to dispersion, and affects the measurement if this involves both leading and trailing edges of the pulse (not shown here).
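Eq. (A1) and the expression for $\tau_p^i$ can be sketched as follows. The function names and the example measurement ($N_{RT} = 15$, $\Delta t_{RT} = 5$ ns, $\phi_- = 90^\circ$) are hypothetical; the constants are taken from Tab. I, with $t_g = 207$ ns following from $t_g + t_d = 205$ ns and $t_d = -2$ ns. The case $\phi_- = \phi_+$ is not specified by Eq. (A1) and is treated here like $\phi_- < \phi_+$, which is an assumption.

```python
def round_trip_time(n_rt, dt_rt, phi_minus, phi_plus=25.0, period=20.0):
    """Eq. (A1): round-trip time in ns.

    One extra cycle is added when the negative jump phi_minus lies
    below phi_plus (phases in degrees, times in ns).
    """
    cycles = n_rt if phi_minus > phi_plus else n_rt + 1
    return cycles * period + dt_rt


def propagation_time(t_rt, t_g, t_d):
    """One-way propagation time tau_p = (t_RT - t_g - t_d) / 2 in ns."""
    return (t_rt - t_g - t_d) / 2


t_rt = round_trip_time(n_rt=15, dt_rt=5.0, phi_minus=90.0)
tau_p = propagation_time(t_rt, t_g=207.0, t_d=-2.0)
print(t_rt, tau_p)  # 305.0 50.0
```

With the measured delay of about 4.9 ns/m, a propagation time of 50 ns would correspond to a trigger cable of roughly 10 m.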

From the propagation time  $\tau_p^i$  the pulse phase  $\varphi_p^i$  can be calculated:

$$\varphi_p^i = \left((\tau_p^i + t_g) \,\%\, T\right) \frac{360^\circ}{T} = \left( \frac{\Delta t_{RT}^i + t_g - t_d + (t_{RT}^i // T)\, T}{2} \,\%\, T \right) \frac{360^\circ}{T}. \quad (\text{A2})$$

The symbols % and // represent modulo and integer division, respectively. The term $(t_{RT}^i // T)\, T$ adds $T/2$ inside the modulo whenever the integer number of round-trip cycles $t_{RT}^i // T$ is odd, so the extra cycle added in Eq. (A1) for $\phi_{-}^{p,i} < \phi_{+}$ shifts $\varphi_p^i$ by half a period. $\varphi_p^i$ is the expected phase of the pulse which the secondary board would measure for $\Delta\tau_c^i = 0$. The actual pulse phase which the secondary board obtains is:

$$\varphi_s^i = (\Delta t_s^i - t_d) \frac{360^\circ}{T}, \quad (\text{A3})$$

where we assume that the detection delay  $t_d$  is the same as for the primary board. The difference between the primary and secondary pulse phase is a measure of the secondary clock delay  $\Delta\tau_c^i$ . This is used to set the external clock phase  $\phi_{ext}^i$  of the secondary board:

$$\phi_{ext}^i = -\Delta\tau_c^i \frac{360^\circ}{T} = \varphi_s^i - \varphi_p^i - \phi_0 + \xi(\varphi_p^i). \quad (\text{A4})$$

This is the full relation corresponding to Eq. 4 in Sec. II B. The additional phase factor $\phi_0$ is manually adjusted to minimize the synchronization error. This corrects a possible mismatch in $t_d$ between the primary and secondary board and corrects for our choice to measure $\varphi_p^i$ on the trailing edge and $\varphi_s^i$ on the leading edge of the pulse<sup>61</sup>. When $\varphi_p^i$ happens to be close to the critical phase $\varphi_p^{crit}$, the resulting synchronization error shows random jumps by $T$ in either positive or negative direction<sup>62</sup>. The security phase $\xi(\varphi_p^i)$ is introduced to avoid this region, which we define as $\pm\Delta\varphi_p^{crit}$ around $\varphi_p^{crit}$. $\xi(\varphi_p^i)$ is nonzero only if $\varphi_p^i$ is inside this region, and in this case adds $\pm\delta\varphi_p^{crit}$ to $\phi_{ext}$ according to:

$$\begin{aligned} \xi_0^i &= -\text{sign}(\varphi_p^i - \varphi_p^{crit}) \times \delta\varphi_p^{crit} \\ \xi(\varphi_p^i) &= \begin{cases} \xi_0^i & \text{for } |\varphi_p^i - \varphi_p^{crit}| < \Delta\varphi_p^{crit} \\ 0 & \text{otherwise} \end{cases}. \end{aligned} \quad (\text{A5})$$

The function $\text{sign}(x)$ gives $\pm 1$ depending on the sign of $x$. When $\xi(\varphi_p^i)$ is nonzero, the synchronization error increases by about $\delta\varphi_p^{crit} \frac{T}{360^\circ} \approx 1.7$ ns, but uncontrollable outliers are avoided. The data points at 20 m and 31.3 m in Fig. 6c and 6d represent such cases, where the measured $\varphi_p$ is within about $\pm 15^\circ$ of $\varphi_p^{crit}$ (see green shaded region in Fig. 8). Note that this correction depends only on the measured $\varphi_p^i$ and is automatically applied by the boards. For applications where the added synchronization error is unacceptable, the board can give a warning to the user and a slightly shorter or longer trigger cable might be used.
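The chain from Eq. (A2) to Eq. (A5) can be summarized in a short sketch using the constants of Tab. I. The function names are illustrative and do not refer to the published source code. Two points are assumptions not fixed by the text: $\text{sign}(0)$ is treated as $+1$, and the resulting $\phi_{ext}$ is not wrapped into $[0^\circ, 360^\circ)$.

```python
import math

T_NS = 20.0  # clock period at 50 MHz


def pulse_phase_primary(tau_p, t_g, period=T_NS):
    """Eq. (A2): pulse phase (degrees) expected at the secondary board."""
    return ((tau_p + t_g) % period) * 360.0 / period


def pulse_phase_secondary(dt_s, t_d, period=T_NS):
    """Eq. (A3): pulse phase (degrees) measured by the secondary board."""
    return (dt_s - t_d) * 360.0 / period


def security_phase(phi_p, phi_crit=180.0, half_width=20.0, shift=30.0):
    """Eq. (A5): nonzero only within +-half_width of the critical phase."""
    diff = phi_p - phi_crit
    if abs(diff) < half_width:
        # Assumption: sign(0) is treated as +1
        return -math.copysign(shift, diff)
    return 0.0


def external_clock_phase(phi_s, phi_p, phi_0=20.0):
    """Eq. (A4): external clock phase in degrees (not wrapped)."""
    return phi_s - phi_p - phi_0 + security_phase(phi_p)


phi_p = pulse_phase_primary(tau_p=50.0, t_g=207.0)  # 306.0
phi_s = pulse_phase_secondary(dt_s=3.0, t_d=-2.0)   # 90.0, hypothetical dt_s
print(external_clock_phase(phi_s, phi_p))            # -236.0
```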

The detection clock phase  $\phi_{det}^i$  is used not only during the auto-synchronization measurement, but also afterwards to detect the pulse on the secondary boards. It does not directly influence the synchronization error, but it is set such that the detection of the trigger pulse happens neither close to the rising or falling edges of the pulse, nor to the rising edge of the bus clock. This ensures reliable timing but might require one additional cycle to wait.  $\phi_{det}^i$  is set at least  $\varphi_{add}$  after the arrival of the pulse:

$$\begin{aligned} \varphi_{det}^i &= \varphi_s^i - \phi_{ext}^i + \varphi_{add} \\ \phi_{det}^i &= \begin{cases} \varphi_m & \text{for } \varphi_{det}^i \leq \varphi_m \\ \varphi_{det}^i & \text{for } \varphi_m < \varphi_{det}^i \leq 360^\circ - \varphi_m \\ \varphi_m & \text{for } 360^\circ - \varphi_m < \varphi_{det}^i \leq 360^\circ + \varphi_m \\ \varphi_{det}^i - 360^\circ & \text{otherwise.} \end{cases} \end{aligned} \quad (\text{A6})$$

The phase margin  $\varphi_m$  ensures that  $\phi_{det}^i$  has a phase outside of the region  $[-\varphi_m \dots + \varphi_m]$  to avoid that the detection of the pulse is too close to the bus clock rising edge where the timing would be unreliable. It was chosen to be significantly larger than  $\phi_+$ .
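A minimal sketch of Eq. (A6), with $\varphi_{add}$ and $\varphi_m$ from Tab. I. The function name is illustrative, and both the intermediate phase $\varphi_{det}^i$ and the final setting $\phi_{det}^i$ are returned because the former is needed again in Eq. (A7). The input phases in the example are hypothetical.

```python
def detection_phase(phi_s, phi_ext, phi_add=70.0, phi_m=90.0):
    """Eq. (A6): return (intermediate phase, detector clock phase) in degrees."""
    raw = phi_s - phi_ext + phi_add
    if raw <= phi_m:
        phi_det = phi_m           # too close to the bus clock edge: clamp
    elif raw <= 360.0 - phi_m:
        phi_det = raw             # safe region: use as calculated
    elif raw <= 360.0 + phi_m:
        phi_det = phi_m           # near the next bus clock edge: clamp
    else:
        phi_det = raw - 360.0     # wrap into the next cycle
    return raw, phi_det


print(detection_phase(100.0, 0.0))  # (170.0, 170.0)
print(detection_phase(250.0, 0.0))  # (320.0, 90.0)
```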

The last parameters to be determined are the numbers of cycles each board has to wait before it can start outputting data on the bus. For this, the propagation number of cycles $N_p^i$ has to be calculated:

$$N_p^i = (\tau_p^i + t_g + t_d) // T + \begin{cases} N_0 & \text{for } \varphi_{det}^i \leq 360^\circ - \varphi_m \\ N_0 + 1 & \text{otherwise.} \end{cases} \quad (\text{A7})$$

Here the experimentally determined constant integer $N_0 \in \mathbb{Z}$ adds a few cycles to account for the cycles needed to start the output. The +1 accounts for the above-mentioned case in which the detection clock was adjusted to detect the pulse one cycle later, to ensure reliable timing. With the knowledge of all $N_p^i$ of the secondary boards, the waiting number of cycles of the primary and secondary boards can be calculated:

$$\begin{aligned} N_w^{prim} &= \max_j(N_p^j) \\ N_w^i &= N_w^{prim} - N_p^i. \end{aligned} \quad (\text{A8})$$

The waiting number of cycles of the primary board is the largest of the $N_p^i$, i.e. $\max_j(N_p^j)$. Each secondary board waits correspondingly fewer cycles, and the secondary board with the largest $N_p^i$ does not need to wait at all.
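Eqs. (A7) and (A8) translate into a few lines of code. This is a sketch with illustrative function names; $N_0$ and $\varphi_m$ are taken from Tab. I, and the list of $N_p^i$ values in the example is hypothetical.

```python
def propagation_cycles(tau_p, t_g, t_d, raw_phi_det,
                       period=20.0, n_0=3, phi_m=90.0):
    """Eq. (A7): propagation number of cycles N_p for one secondary board.

    raw_phi_det is the intermediate phase of Eq. (A6) in degrees;
    times are in ns.
    """
    base = int((tau_p + t_g + t_d) // period)
    extra = n_0 if raw_phi_det <= 360.0 - phi_m else n_0 + 1
    return base + extra


def waiting_cycles(n_p):
    """Eq. (A8): waiting cycles of the primary and of each secondary board."""
    n_w_prim = max(n_p)
    return n_w_prim, [n_w_prim - n for n in n_p]


print(propagation_cycles(50.0, 207.0, -2.0, raw_phi_det=170.0))  # 15
print(waiting_cycles([15, 20, 17]))  # (20, [5, 0, 3])
```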

The first demonstration of this scheme is presented in Sec. IV C and Fig. 6 shows the results. In Fig. 8 in the next section the different phases are shown for the same data.

### 2. Measured external clock phase

Fig. 8a shows the phases $\varphi_p$ (green circles), $\varphi_s$ (blue squares) and $\phi_{ext}$ (orange diamonds) for the corresponding data presented in Fig. 6 in Sec. IV C. The linear fit (modulo $360^\circ$) of $\varphi_s$ vs. cable length (blue dashed line) gives a propagation delay per unit length of $\frac{d\tau_p}{dL} = 4.9(3)$ ns/m (averaged over leading and trailing edge of the pulse), which is consistent with the one measured on the primary board (see Fig. 6b). When $\Delta\tau_c = 0$, the offset of the linear fit gives the detection delay of the pulse $t_d$. For nonzero $\Delta\tau_c$ the offset is shifted accordingly.

FIG. 8. a) Example phases for $\Delta\tau_c \frac{360^\circ}{T} = 90^\circ$ corresponding to the data in Fig. 6c: primary pulse phase ($\phi_p$, green circles), secondary pulse phase ($\phi_s$, blue squares) and external PLL phase ($\phi_{ext}$, orange diamonds). The dashed (blue) line is a linear fit modulo $360^\circ$ through $\phi_s$ and gives $\frac{d\tau_p}{dL} = 4.9(3)$ ns/m. For cable lengths of 20 m and 31.3 m, the phase $\phi_p$ is within $\pm 20^\circ$ of $\phi_p^{crit} = 180^\circ$ (green shaded area) and the security phase is set to a nonzero value $\xi(\phi_p) = \mp 30^\circ$. This shifts $\phi_{ext}$ away from the ideal value but ensures that the synchronization error, although slightly increased, does not jump arbitrarily by $\pm T$. b) Correlation between the error of the external clock phase ($\phi_{ext} + \Delta\tau_c \frac{360^\circ}{T}$) and the measured synchronization error, plotted for all $\Delta\tau_c$ (different colors). The dashed (orange) line is a linear fit which gives a slope of $80(1)$ ps/degree and an offset of $263(4)^\circ$.

In Fig. 8b we show the synchronization error as a function of the sum $\phi_{ext} + \Delta\tau_c \frac{360^\circ}{T}$, i.e. how well the measured $\phi_{ext}$ compensates the externally applied clock delay $\Delta\tau_c$ (see Eq. (A4)). A linear fit (orange dashed line) gives a slope of 80(1) ps/degree, which is slightly larger than the expected $\frac{20\text{ns}}{360^\circ} \approx 56$ ps/degree, and the offset of 263(4)° indicates that there is an additional unaccounted phase shift on the external clock. The 20 cm long clock cable used would introduce a phase shift of only about 20° at the 50 MHz external clock frequency used for this measurement. Additional phase shifts can come from input and clock buffers and from propagation delays inside the FPGA<sup>63</sup>. The main contribution to the synchronization error can be attributed to the small difference in the measured pulse propagation delay per unit length $\frac{d\tau_p}{dL}$ between the primary board ($\tau_p$) and the secondary board ($\tau_s$). To compensate for this, we have chosen to use the leading edge of the pulse on the primary board and the trailing edge on the secondary board. With this choice, however, the pulse width needs to be compensated (using $\phi_0$), which we do at the moment only under the assumption that it does not change for varying cable length. This assumption does not strictly hold because of the dispersion of the pulse. Nevertheless, even with the present scheme, the resulting synchronization error in Fig. 6c is already very low.
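The conversions between clock phase and time delay used in this discussion can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the cable delay is estimated from the measured 4.9 ns/m quoted above rather than from a separate measurement of the clock cable.

```python
CLOCK_FREQ_HZ = 50e6            # external clock frequency used for this measurement
T_NS = 1e9 / CLOCK_FREQ_HZ      # clock period in ns (20 ns)

def delay_to_phase_deg(delay_ns, period_ns=T_NS):
    """Phase shift in degrees (modulo 360) caused by a time delay."""
    return (360.0 * delay_ns / period_ns) % 360.0

def phase_to_delay_ns(phase_deg, period_ns=T_NS):
    """Time delay in ns corresponding to a phase shift in degrees."""
    return phase_deg * period_ns / 360.0

# expected slope of the synchronization error versus phase error, in ps/degree
expected_slope_ps_per_deg = 1e3 * phase_to_delay_ns(1.0)

# phase shift of a 20 cm clock cable, assuming 4.9 ns/m propagation delay
cable_delay_ns = 0.20 * 4.9
cable_phase_deg = delay_to_phase_deg(cable_delay_ns)
```

With these assumptions the cable contributes a phase shift of slightly less than $20^\circ$, far below the fitted offset.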

### 3. Fitting function for the synchronization error

FIG. 9. Trigger coaxial and detector signals for 10 m cable length measured on the primary board. a) Signals in the coaxial cable at the primary (violet) and secondary (cyan) boards for $360^\circ$ phase. The detector signal (green, active low) is generated by the primary FPGA when a pulse has been detected. b) Detector signals (green, offset by phase) for different phases. The pulse generation time (orange) is delayed linearly with phase and the detector signal shows jumps in the leading (blue) and trailing edge (red) of the reflected pulse.

Here we present the fitting function used to fit the oscilloscope traces shown in the inset of Fig. 6. For each trace the auto-synchronization was performed as described in Sec. IV C. Afterwards, in order to measure the resulting synchronization error, the primary board generates another pulse, waits for the calculated waiting time $N_w^{prim}$, and then generates a signal on an auxiliary I/O pin which is recorded by an oscilloscope (blue traces in the inset of Fig. 6). Each trace consists of 14 data points with a resolution of 2 ns. After the secondary board detects the pulse, it immediately generates a signal on an auxiliary I/O pin which is used to trigger the oscilloscope and is recorded (orange traces) together with that of the primary board. The saved traces are fitted with a sigmoid function which is constructed from a piecewise-defined linear slope $s(t, t_0, k, y_-, y_+)$ and is smoothed with a Gaussian kernel $g(t, \sigma)$:

$$g(t, \sigma) = \frac{1}{norm} e^{-\frac{t^2}{2\sigma^2}}$$

$$\mu = \frac{y_+ + y_-}{2}, \quad v = \frac{y_+ - y_-}{2k}$$

$$s(t, t_0, k, y_-, y_+) = \begin{cases} y_- & t - t_0 \leq -v \\ \mu + (t - t_0)k & |t - t_0| < v \\ y_+ & t - t_0 \geq v \end{cases} \quad (A9)$$

$$f(t, t_0, k, \sigma, y_-, y_+) = s(t, t_0, k, y_-, y_+) \star g(t, \sigma).$$

The symbol $\star$ denotes the discrete convolution with fixed steps in time, and the Gaussian is normalized (norm) such that the sum over the discrete kernel entries is one. The function $f(t, t_0, k, \sigma, y_-, y_+)$ smoothly changes from the value $y_-$ for $t < t_0$ to the value $y_+$ for $t > t_0$. The slope $k$ and the width $\sigma$ of the Gaussian determine how fast the function changes between the two extremes around the time $t_0$.

Each trace is fitted individually with $f(t, t_0, k, \sigma, y_-, y_+)$, where $t_0$, $k$, $y_-$ and $y_+$ are free parameters and $\sigma = 2$ ns is kept fixed<sup>64</sup>. The resulting synchronization error is the fitted $t_0^{sec}$ of the secondary board minus the fitted $t_0^{prim}$ of the primary board.
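A minimal NumPy sketch of the fitting function of Eq. (A9) is given below. It is not the original analysis code: the function names, the truncation of the kernel at $\pm 4\sigma$ and the extension of the time axis to avoid edge effects are assumptions of this example, and the slope $k$ is assumed to have the same sign as $y_+ - y_-$.

```python
import numpy as np

def ramp(t, t0, k, y_minus, y_plus):
    """Piecewise linear slope s(t): a line through (t0, mu) clipped to [y_-, y_+]."""
    mu = 0.5 * (y_plus + y_minus)
    lo, hi = min(y_minus, y_plus), max(y_minus, y_plus)
    return np.clip(mu + (t - t0) * k, lo, hi)

def gaussian_kernel(dt, sigma, n_sigma=4.0):
    """Discrete Gaussian kernel, normalized such that its entries sum to one."""
    half = int(np.ceil(n_sigma * sigma / dt))
    tk = np.arange(-half, half + 1) * dt
    g = np.exp(-tk**2 / (2.0 * sigma**2))
    return g / g.sum()

def fit_function(t, t0, k, sigma, y_minus, y_plus):
    """f = s * g evaluated on the equally spaced time axis t."""
    dt = t[1] - t[0]
    g = gaussian_kernel(dt, sigma)
    half = len(g) // 2
    # extend the time axis on both sides so that the convolution has no edge effects
    t_ext = t[0] + np.arange(-half, len(t) + half) * dt
    s = ramp(t_ext, t0, k, y_minus, y_plus)
    return np.convolve(s, g, mode="valid")

# example: 14 points with 2 ns resolution, as for the oscilloscope traces
t = np.arange(14) * 2.0
f = fit_function(t, t0=13.0, k=0.5, sigma=2.0, y_minus=0.0, y_plus=1.0)
```

For the actual fit, `fit_function` could be passed to a least-squares routine such as `scipy.optimize.curve_fit`, with `sigma` held fixed at 2 ns as described above.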

### 4. Measured detection signal

Here we show examples of trigger signals and the detection signal for varying detector phase used for the auto-synchronization described in Sec. II B. The schematics of the pulse generation and detection electronics can be found in Ref. 22. The present electronics was, however, designed for a first test and has not been optimized for efficiency and noise resilience. Additionally, it was designed for a test with two boards, where the $50\ \Omega$ termination is part of the generation and detection circuitry and a bipolar transistor, responsible for the reflection of the pulse, induces a high impedance in the coaxial cable instead of a short circuit as proposed.

Fig. 9a shows the un-amplified signals in the trigger coaxial cable for primary (violet) and secondary (cyan) boards for $360^\circ$ phase and 10 m cable length. The detector signal (green, active low) is generated by the primary board on an auxiliary I/O pin of the FPGA-SoC and indicates when the pulse has been detected after amplification and rectification by the FPGA-SoC. The first peak at 20 ns is caused by noise on the supply when the pulse is created, the second at 80 ns is the detection of the generated pulse, and the last peak at 200 ns is the detection of the reflected pulse, which is the one we are interested in. The delay of 3 cycles of these signals is caused by the required detector input synchronization stage (which is the same as a CDC), consisting of 2 flip-flops in series, and one additional cycle to set or reset the output flip-flop. The small ripples on the signal are caused by the unshielded and unterminated clock signal cable used during this measurement. Fig. 9b shows the detector signal (green) at 10 m cable length for different phases between the bus clock and the pulse. The time of the generation of the pulse is indicated by the orange line. The leading and trailing edges of the reflected pulse are indicated by the blue and red lines, respectively. The jumps in these times are clearly visible and allow the round-trip time to be measured precisely with sub-cycle time resolution. See Fig. 3b for comparison.
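The origin of the 3-cycle delay can be illustrated with a simple cycle-based model of two synchronizer flip-flops followed by an output flip-flop. This is a behavioral sketch in Python and not the HDL of the design; the names are hypothetical, and all three flip-flops are assumed to be clocked by the same clock.

```python
def detector_pipeline(samples):
    """Return the output visible in each clock cycle for a 2-FF synchronizer
    followed by one output flip-flop (all registers start at 0)."""
    sync1 = sync2 = out = 0
    observed = []
    for x in samples:
        observed.append(out)                 # value visible during this cycle
        out, sync2, sync1 = sync2, sync1, x  # all flip-flops update on the clock edge
    return observed

detector_input = [0, 0, 1, 1, 1, 1, 1, 1]    # input rises in cycle 2
detector_output = detector_pipeline(detector_input)
latency_cycles = detector_output.index(1) - detector_input.index(1)
```

In this model the input is captured by the first flip-flop at the end of the cycle in which it changes, so the change becomes visible at the output three cycles later.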

## Appendix B: Data rate fitting function

Here we give the function used for modeling the measurements of the DMA transmission rates presented in Fig. 4, Sec. IV A, and the data uploading rates presented in Fig. 5, Sec. IV B. The fit results can be found in Tab. II.

The model function gives the resulting rate $\Gamma(N)$ as a function of the number of samples $N$ and includes a delay time (latency) $\tau$ and two data transmission rates, where $\Gamma_0$ is active for $N \leq N_\Theta$ and $\Gamma_1$ is active for $N > N_\Theta$:

$$\Gamma(N) = \frac{1}{\frac{\tau}{N\beta} + \frac{\Theta(N_\Theta - N)}{\Gamma_0} + \frac{\Theta(N - N_\Theta)}{\Gamma_1}} \quad (\text{B1})$$

$$\Theta(x) = \begin{cases} 0 & \text{for } x \leq 0 \\ 1 & \text{otherwise} \end{cases}$$

The value of $\beta$ is 12 bytes per sample for this measurement. The delay takes into account that data cannot be transmitted immediately after the start signal has been given. The two rates are used to model that data transmission can run at different speeds, for example when FIFO buffers are involved.

|               | units         | DMA TX           | DMA RX           | RX-FIFO             | upload & write (-10) | upload & write (-07S) |
|---------------|---------------|------------------|------------------|---------------------|----------------------|-----------------------|
| max. $\Gamma$ | MB/s          | 400 <sup>a</sup> | 300 <sup>a</sup> | 300 <sup>a</sup>    | 118.7 <sup>b</sup>   | 118.7 <sup>b</sup>    |
| $\tau$        | $\mu\text{s}$ | 0 <sup>c</sup>   | 4.0(2)           | 0 <sup>c</sup>      | 450(40)              | 350(50)               |
| $N_\Theta$    | 1             | 4 <sup>c</sup>   | —                | $20(1) \times 10^3$ | —                    | —                     |
| $\Gamma_0$    | MB/s          | 600 <sup>c</sup> | 341(2)           | 600 <sup>c</sup>    | 56.5(3)              | 47.2(4)               |
| $\Gamma_1$    | MB/s          | 342.73(3)        | —                | 340.50(5)           | —                    | —                     |

TABLE II. Fit results of the DMA and uploading data rates shown in Figs. 4 and 5, obtained with the model Eq. (B1). The DMA rates are given for the Cora-Z7-10 board; the rates of the Cora-Z7-07S board are the same within the error. The uploading rates include the writing to reserved DMA memory and are given for the Cora-Z7-10 and Cora-Z7-07S boards (-10 and -07S in the table headings, respectively). The top row (max. $\Gamma$) gives the expected or theoretical maximum data rates in MB/s.

<sup>a</sup> See Ref. 25 for the expected rates with the default DMA settings.

<sup>b</sup> See Ref. 44 for the maximum uploading rates. The measured rates include writing to reserved DMA memory.

<sup>c</sup> Fixed.
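The model of Eq. (B1) can be written as a short Python function, sketched below with the fit results of Tab. II as example inputs. The function name and signature are assumptions of this example. Following the description in the text, $\Gamma_0$ is applied for $N \leq N_\Theta$ and $\Gamma_1$ for $N > N_\Theta$, which also covers the case $N = N_\Theta$; if no second rate is given, $\Gamma_0$ is used for all $N$. Rates are in bytes per second with 1 MB = $10^6$ bytes.

```python
def data_rate(n_samples, tau, gamma0, gamma1=None, n_theta=None, beta=12):
    """Rate in bytes/s for n_samples samples of beta bytes each.

    tau     : delay (latency) in seconds
    gamma0  : rate in bytes/s used for n_samples <= n_theta
    gamma1  : rate in bytes/s used for n_samples > n_theta (optional)
    n_theta : threshold number of samples (optional)
    """
    if gamma1 is None or n_theta is None or n_samples <= n_theta:
        inverse_rate = 1.0 / gamma0
    else:
        inverse_rate = 1.0 / gamma1
    return 1.0 / (tau / (n_samples * beta) + inverse_rate)

# DMA RX: delay and a single rate (values from Tab. II)
rx_small = data_rate(1, tau=4.0e-6, gamma0=341e6)
rx_large = data_rate(10**6, tau=4.0e-6, gamma0=341e6)

# DMA TX: no delay, two rates with fixed threshold of 4 samples (values from Tab. II)
tx_first = data_rate(4, tau=0.0, gamma0=600e6, gamma1=342.73e6, n_theta=4)
tx_later = data_rate(8, tau=0.0, gamma0=600e6, gamma1=342.73e6, n_theta=4)
```

For small $N$ the rate is limited by the latency to about $N\beta/\tau$, while for large $N$ it approaches the transmission rate itself.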

For the measurement of the TX DMA rate, a possible delay cannot be detected and it was set to $\tau = 0$. The initial rate was set to the maximum possible $\Gamma_0 = \Gamma_{\max}$ and the second rate $\Gamma_1$ is left as a fitting parameter. The threshold number of samples is fixed to $N_\Theta = 4$ since this is the smallest number of samples which can be transmitted. This is because we have chosen to use a 16 byte wide (128 bits) data stream and $\beta = 12$ bytes, which have 48 bytes as the least common multiple, i.e. 4 samples. Unused samples are marked by the driver with a “no-operation” (NOP) bit, so that sample counts which are not a multiple of 4 pose no problem. For the measurement of the RX DMA rate and the data uploading rate, the fitting parameters are the delay $\tau$ and the rate $\Gamma_0$; no second rate is needed. For the measurement of the RX FIFO rate, $\Gamma_1$ and $N_\Theta$ are fitting parameters, and the delay and initial rate are again set to 0 and $\Gamma_{\max}$, respectively.
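The smallest transmittable block follows directly from the least common multiple of the stream width and the sample size, as the short sketch below shows. The helper `nop_padding` is hypothetical and only illustrates how many samples would have to be marked as NOP to fill the last block; it is not taken from the driver.

```python
import math

STREAM_WIDTH_BYTES = 16   # 128-bit wide data stream
BYTES_PER_SAMPLE = 12     # beta

# least common multiple of stream width and sample size
block_bytes = (STREAM_WIDTH_BYTES * BYTES_PER_SAMPLE
               // math.gcd(STREAM_WIDTH_BYTES, BYTES_PER_SAMPLE))
samples_per_block = block_bytes // BYTES_PER_SAMPLE

def nop_padding(n_samples):
    """Number of NOP samples needed to reach a multiple of samples_per_block."""
    return (-n_samples) % samples_per_block
```

For example, a sequence of 5 samples would occupy two blocks of 4 samples, with the last 3 samples marked as NOP.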

Note that the DMA rate measurements give not only the maximum possible bus output rate, but are also an excellent tool to verify the efficiency of the driver. Any delays in time-critical parts, like the interrupt service routine or where the DMA buffers are updated, severely impact the DMA transmission rate. For example, text messages for debugging purposes cannot be output, since the serial transmission of the text via USB to a host computer is too slow and would block the driver.

## Appendix C: Start and stop trigger

In Fig. 10 we present a measurement of the start trigger and the cycling mode<sup>65</sup> of the board. In addition, for demonstration we implemented the possibility to interrupt the execution of the sequence when the start trigger signal is reset after the board has been started. This might be useful to manually check the state of the experiment, or to wait for some external event, for example until the atom number reaches a certain value. The experimental sequence consists of an analog output performing a triangular ramp (orange) which is executed repeatedly in cycling mode. The dotted lines indicate the beginning of each cycle. A waveform generator provides the trigger signal (blue). Fig. 10a shows the unperturbed experiment: without the start-stop trigger activated, there is no relation between the trigger and the ramp, which we show for 5 realizations of the experiment. In Fig. 10b we show the result when the start-stop trigger is activated, which starts the execution of the ramp and then interrupts it as long as the trigger signal is low. We have again repeated this measurement 5 times and now all repetitions overlap.

FIG. 10. Demonstration of a start and stop trigger option in cycling mode. a) Analog output triangular ramp (orange) running with $4\ \mu\text{s}$ per sample in cycling mode. The FPGA board is freely running without waiting for the trigger signal (blue). b) Same ramp but with start and stop trigger enabled. Both panels show the accumulated signal for 5 repetitions (number in brackets of labels). The vertical dotted lines indicate the beginning of each experimental cycle.

## Appendix D: Resource utilization

Tab. III summarizes the resources used in the PL part and shows that we do not use all of the available resources although the FPGA is relatively small. This leaves room for further improvements or customization if needed.

| device |           | FF    | LUT   | BRAM | MMCM | PLL | DSP |
|--------|-----------|-------|-------|------|------|-----|-----|
| Z7-10  | available | 35200 | 17600 | 60   | 2    | 2   | 80  |
|        | used      | 13275 | 9824  | 38   | 2    | 0   | 0   |
|        | percent   | 38    | 56    | 63   | 100  | 0   | 0   |
| Z7-07S | available | 28800 | 14400 | 50   | 2    | 2   | 66  |
|        | used      | 13274 | 9825  | 38   | 2    | 0   | 0   |
|        | percent   | 46    | 68    | 76   | 100  | 0   | 0   |

TABLE III. Used resources for the Cora Z7-10 (Zynq-7010) and Cora Z7-07S (Zynq-7007S) FPGA-SoC boards: flip-flops (FF) store single-bit data, lookup tables (LUT) are used to represent logic operations, block RAM (BRAM) is a much larger (36 kbit) collection of flip-flops, and mixed-mode clock managers (MMCM) are phase-locked loops (PLL) which additionally allow dynamic phase shifting. We use neither classical PLLs nor digital signal processing (DSP) cells.

- <sup>1</sup>National Instruments Digital Reconfigurable I/O Device, <https://www.ni.com/en-us/shop/hardware/products/digital-reconfigurable-io-device.html>.
- <sup>2</sup>ARTIQ, open-source experimental control system, <https://m-labs.hk/experiment-control/artiq/>.
- <sup>3</sup>A. Keshet and W. Ketterle, Rev. Sci. Instrum. **84**, 015105 (2013).
- <sup>4</sup>G. Ramola, *A versatile digital frequency synthesizer for state-dependent transport of trapped neutral atoms*, Master thesis, Rheinischen Friedrich-Wilhelms-Universität Bonn (2015).
- <sup>5</sup>T. Pruttivarasin and H. Katori, Rev. Sci. Instrum. **86**, 115106 (2015).
- <sup>6</sup>Y. Du, W. Li, Y. Ge, H. Lu, K. Deng, and Z. Lu, Rev. Sci. Instrum. **88**, 096103 (2017).
- <sup>7</sup>S. Donnellan, I. R. Hill, W. Bowden, and R. Hobson, Rev. Sci. Instrum. **90**, 043101 (2019).

- <sup>8</sup>S. W. Mattingly and F. Skiff, Rev. Sci. Instrum. **89**, 043508 (2018).
- <sup>9</sup>S. Shu, L. Wang, D. Liu, C. Meiven, Y. Zhang, L. Jiarong, and F. Ji, Rev. Sci. Instrum. **89** (2018), 10.1063/1.5035364.
- <sup>10</sup>E. Perego, M. Pomponio, A. Detti, L. Duca, C. Sias, and C. E. Calosso, Rev. Sci. Instrum. **89**, 113116 (2018).
- <sup>11</sup>S. J. Yu, E. Fajneau, L. Q. Liu, D. J. Jones, and K. W. Madison, Rev. Sci. Instrum. **89**, 025107 (2018).
- <sup>12</sup>D. Ristè, M. Dukalski, C. A. Watson, G. de Lange, M. J. Tiggelman, Y. M. Blanter, K. W. Lehnert, R. N. Schouten, and L. DiCarlo, Nature **502**, 350 (2013).
- <sup>13</sup>I. Lamb, J. Colless, J. Hornibrook, S. Pauka, S. Waddy, M. Frechtlings, and D. Reilly, Rev. Sci. Instrum. **87**, 014701 (2016).
- <sup>14</sup>H. Homulle, S. Visser, B. Patra, G. Ferrari, E. Prati, F. Sebastiano, and E. Charbon, Rev. Sci. Instrum. **88**, 045103 (2017).
- <sup>15</sup>X. Qin, W. Zhang, L. Wang, Y. Zhao, Y. Tong, X. Rong, and J. Du, IEEE Trans. Instrum. Meas. **69**, 1127 (2020).
- <sup>16</sup>Y. Xu, G. Huang, J. Balewski, R. Naik, A. Morvan, B. Mitchell, K. Nowrouzi, D. I. Santiago, and I. Siddiqi, arXiv (2021), arxiv.org/abs/2101.00071.
- <sup>17</sup>S. Habinc, *Suitability of reprogrammable FPGAs in space applications* (Gaisler Research, 2002) <http://microelectronics.esa.int/techno/fpga_002_01-0-4.pdf>.
- <sup>18</sup>A. Bertoldi, C.-H. Feng, H. Eneriz, M. Carey, D. S. Naik, J. Junca, X. Zou, D. O. Sabulsky, B. Canuel, P. Bouyer, and M. Prevedelli, Rev. Sci. Instrum. **91**, 033203 (2020).
- <sup>19</sup>CERN, The White Rabbit Project, <https://white-rabbit.web.cern.ch/>.
- <sup>20</sup>Cora-Z7-10 and Cora-Z7-07S development boards from Digilent Inc., <https://reference.digilentinc.com/reference/programmable-logic/cora-z7/start>.
- <sup>21</sup>DE10-Nano Kit from Terasic Inc., <https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=167&No=1046>.
- <sup>22</sup>The source code, electronic schemes and Gerber files, the instructions for installation of the software and the compilation of the sources can be found at <https://github.com/INO-quantum/FPGA-SoC-experiment-control>.
- <sup>23</sup>C. E. Cummings, “Clock Domain Crossings (CDC) Design & Verification Techniques Using System Verilog,” (2008), SNUG 2008, Boston. <https://www.sunburst-design.com/papers/CummingsSNUG2008Boston_CDC.pdf>.
- <sup>24</sup>Second release of AMBA AXI and ACE Protocol Specification, Issue E, 22 February 2013. The Advanced eXtensible Interface (AXI) protocol is a part of ARM Advanced Microcontroller Bus (AMBA) structure. <https://developer.arm.com/documentation/ihi0022/latest/>.
- <sup>25</sup>Xilinx AXI DMA v7.1 LogicCORE IP product guide, PG021, June 14 2019, <https://www.xilinx.com/support/documentation/ip_documentation/axi_dma/v7_1/pg021_axi_dma.pdf>.
- <sup>26</sup>C. E. Cummings, “Simulation and Synthesis Techniques for Asynchronous FIFO design,” (2002), SNUG 2002, San Jose. <https://www.sunburst-design.com/papers/CummingsSNUG2002SJ_FIFO1.pdf>.

- <sup>27</sup>C. E. Cummings and P. Alfke, “Simulation and Synthesis techniques for Asynchronous FIFO Design with Asynchronous Pointer Comparisons,” (2002), SNUG 2002, San Jose. <https://www.sunburst-design.com/papers/CummingsSNUG2002SJ_FIFO2.pdf>.
- <sup>28</sup>An optional extended version uses 12 instead of 8 bytes per sample. This allows two independent buses to be driven by a single FPGA-SoC board with a modified buffer card.
- <sup>29</sup>The strobe signal is generated by the FPGA. For $\Gamma_{sample} = 1$ MHz it is a 500 ns long pulse starting 240 ns after the bus has been updated. The bus clock frequency must be at least twice the bus output rate to generate the strobe signal.
- <sup>30</sup>Cascading two PLLs is not advised, but in our case we need both for dynamic phase shifting. In addition, this allows the use of an external clock input pin in a different clocking region, which would otherwise be inaccessible.
- <sup>31</sup>For simplicity, the two cycles delay introduced by the clock-domain crossing (CDC) is not shown in Fig. 3b.
- <sup>32</sup>The actual algorithm to find the phase jump is similar to the Bisection method of finding the root of a function.
- <sup>33</sup>Petalinux 2017.4 from Xilinx which is built on Linux kernel version 4.9 and is compiled on Ubuntu LTS 18.04.
- <sup>34</sup>National Instruments Labview and LabWindows/CVI, Programming Environments for Electronic Test and Instrumentation. <https://www.ni.com/en-us/shop/software/programming-environments-for-electronic-test-and-instrumentation-category.html>.
- <sup>35</sup>P. Starkey, C. Billington, S. Johnstone, M. Jasperse, K. Helmerson, L. Turner, and R. Anderson, Rev. Sci. Instrum. **84**, 085111 (2013), see also <https://labscriptsuite.org/>.
- <sup>36</sup>DIO64 PCI I/O board from Viewpoint Systems, Inc. Requires Windows XP/7/8 and PCI slot and is no longer available. <https://www.viewpointusa.com/product/pxi/dio-64-event-detection-control>.
- <sup>37</sup>As permanent storage medium the board uses a micro-SD (Secure Digital) card which primarily contains the Linux boot loader and boot image but can contain additional files and folders and can be used as a hard drive. The Linux image is unpacked by the bootloader in a RAM drive, but if needed it can also be expanded into a partition of the SD card. Additionally, a USB flash drive can be attached to the board for external storage.
- <sup>38</sup>A. Trenkwalder, M. Zaccanti, and N. Poli, *Data and analysis for "A flexible control system for atomic, molecular and optical physics experiments"* (Zenodo, 2021) <https://doi.org/10.5281/zenodo.4893285>.
- <sup>39</sup>The interrupts are generated in the PL part and are thus directly accessible during the transmission rate measurement without involving the CPU.
- <sup>40</sup>On the TX DMA side we observe a delay of about 30 cycles between the arrival of the last data out of the FIFO and the TX interrupt.
- <sup>41</sup>Xilinx SDK user guide, system performance analysis, UG1145, v2018.2, <https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_1/ug1145-sdk-system-performance.pdf>.
- <sup>42</sup>Xilinx System Performance Analysis of an All Programmable SoC, XAPP1219 (v1.1) November 5, 2015, <https://www.xilinx.com/support/documentation/application_notes/xapp1219-system-performance-modeling.pdf>.
- <sup>43</sup>The measured $\Gamma_{DMA}$ corresponds to a maximum $\Gamma_{sample}$ of 42 MHz (28 MHz) for the 8 (12) bytes per sample versions. The given rates apply independently for data output and input on the bus and for simultaneous output and input (if the bus supports it).
- <sup>44</sup>IEEE 802.3ab, (1999), Gigabit Ethernet, 1000BASE-T with TCP/IP over Ethernet (II) protocol efficiency of 95% for 1460 bytes payload per frame of 1538 bytes.
- <sup>45</sup>M. Endres, H. Bernien, A. Keesling, H. Levine, E. R. Anschuetz, A. Krajenbrink, C. Senko, V. Vuletic, M. Greiner, and M. D. Lukin, Science **354**, 1024 (2016), <https://science.sciencemag.org/content/354/6315/1024.full.pdf>.
- <sup>46</sup>C. Sahin, P. Geppert, A. Müllers, and H. Ott, New J. Phys. **19**, 123005 (2017).
- <sup>47</sup>The change in the rate between using a single or two threads on both boards is only about 10%.
- <sup>48</sup>For cable lengths < 3 m the actual setup cannot detect the round-trip time since the reflected pulse is too close to the generated one. However, this situation is automatically detected and with the proposed scheme and further technical improvements shorter cables should be detectable.
- <sup>49</sup>Tasker RG58 CU coaxial cable specification gives velocity factor 0.66, corresponding to a propagation delay of 5.05(4) ns/m. See <https://www.tasker.it/db_files/products/276044f7e2.pdf>.
- <sup>50</sup>For the measurement on the secondary board the pulse is not reflected to avoid interference of the incoming with the reflected pulse. However, we have not observed a difference in the measurement result.
- <sup>51</sup>USB test and measurement class (USBTMC). <https://www.usb.org/document-library/test-measurement-class-specification>.
- <sup>52</sup>General purpose interface bus (GPIB), IEEE 488.2. <https://standards.ieee.org/standard/488-2-1992.html>.
- <sup>53</sup>L. Liu, D.-S. Lü, W.-B. Chen, T. Li, Q.-Z. Qu, B. Wang, L. Li, W. Ren, Z.-R. Dong, J.-B. Zhao, W.-B. Xia, X. Zhao, J.-W. Ji, M.-F. Ye, Y.-G. Sun, Y.-Y. Yao, D. Song, Z.-G. Liang, S.-J. Hu, D.-H. Yu, X. Hou, W. Shi, H.-G. Zang, J.-F. Xiang, X.-K. Peng, and Y.-Z. Wang, Nat. Commun. **9**, 2760 (2018).
- <sup>54</sup>D. C. Aveline, J. R. Williams, E. R. Elliott, C. Dutenhoffer, J. R. Kellogg, J. M. Kohel, N. E. Lay, K. Oudrhiri, R. F. Shotwell, N. Yu, and R. J. Thompson, Nature **582**, 193 (2020).
- <sup>55</sup>M. D. Lachmann, H. Ahlers, D. Becker, A. N. Dinkelaker, J. Grosse, O. Hellmig, H. Müntinga, V. Schkolnik, S. T. Seidel, T. Wendrich, A. Wenzlowski, B. Carrick, N. Gaaloul, D. Lüdtke, C. Braxmaier, W. Ertmer, M. Krutzik, C. Lämmerzahl, A. Peters, W. P. Schleich, K. Sengstock, A. Wicht, P. Windpassinger, and E. M. Rasel, Nat. Commun. **12**, 1317 (2021).
- <sup>56</sup>B. Hensen, H. Bernien, A. E. Dréau, A. Reiserer, N. Kalb, M. S. Blok, J. Ruitenberg, R. F. L. Vermeulen, R. N. Schouten, C. Abellán, W. Amaya, V. Pruneri, M. W. Mitchell, M. Markham, D. J. Twitchen, D. Elkouss, S. Wehner, T. H. Taminiau, and R. Hanson, Nature **526**, 682 (2015).
- <sup>57</sup>P. W. Graham, J. M. Hogan, M. A. Kasevich, and S. Rajendran, Phys. Rev. Lett. **110**, 171102 (2013).
- <sup>58</sup>B. Canuel, A. Bertoldi, L. Amand, E. Pozzo di Borgo, T. Chantrait, C. Danquigny, M. Dovalé Álvarez, B. Fang, A. Freise, R. Geiger, J. Gillot, S. Henry, J. Hinderer, D. Holleville, J. Junca, G. Lefèvre, M. Merzougui, N. Mielec, T. Monfret, S. Pelisson, M. Prevedelli, S. Reynaud, I. Riou, Y. Rogister, S. Rosat, A. Cormier, E. Landragin, W. Chaibi, S. Gaffet, and P. Bouyer, Sci. Rep. **8**, 14064 (2018).
- <sup>59</sup>M. Takamoto, I. Ushijima, N. Ohmae, T. Yahagi, K. Kokado, H. Shinkai, and H. Katori, Nat. Photonics **14**, 411 (2020).
- <sup>60</sup>If  $\phi_- \approx \phi_+$  the measurement is not reliable due to its sensitivity to noise.
- <sup>61</sup>This choice was motivated to have similar  $\frac{d\tau_p}{dL}$  for the measurements of the primary and secondary board. The average in  $\frac{d\tau_p}{dL}$  for the leading and trailing edge of the pulse is the same for both boards, but the primary board shows a larger discrepancy between the values obtained for the two edges. The difference is caused by the dispersion of the pulse.  $\phi_0$  corrects the phase shift introduced by half of the pulse width  $w_p/2$  (see Fig. 7) but does not correct for the changing width along the path.
- <sup>62</sup>The value of $\phi_p^{crit}$ (see Tab. I) has been determined experimentally, but it might depend on our choice of parameters. Its exact origin has not been investigated.
- <sup>63</sup>We do not use the feedback option which cancels such phase shifts.
- <sup>64</sup>When $\sigma$ is fitted as well, its correlation with the slope $k$ prevents the fit from converging properly for some traces and leads to large errors.
- <sup>65</sup>In cycling mode the board repeats the experimental sequence for a programmed number of times or infinitely until a stop command is sent.