

# Bit-Serial Parallel Processing Systems

KENNETH E. BATCHER

**Abstract**—About a decade ago, a bit-serial parallel processing system STARAN®<sup>1</sup> was developed. It used standard integrated circuits that were available at that time. Now, with the availability of VLSI, a much greater processing capability can be packed in a unit volume. This has led to the recent development of two bit-serial parallel processing systems: an airborne associative processor and a ground based massively parallel processor.

The airborne associative processor has about the same processing capability as a three cabinet STARAN system in a volume less than 0.5 cubic feet. The power required and weight are also reduced dramatically for the airborne environment.

The massively parallel processor has about 100 times the processing capability as the STARAN system in about the same volume. Floating point speeds are better than 100 MOPS (million operations per second). Integer arithmetic speeds depend on operand lengths—for 16 bit integers the speed is better than 3000 MOPS for addition and 450 MOPS for multiplication.

After presenting the basic rationale for bit-serial parallel processors, the organizations of the two recent systems are shown and some of their applications are outlined.

**Index Terms**—Airborne processors, bit-serial processors, custom VLSI chips, image processing, multidimensional access, parallel processors, radar processing.

## I. INTRODUCTION

LIKE other parallel processing systems, the systems we describe in this paper have a number of processing elements (PE's) operating in parallel on separate data streams under the control of a single control unit. The systems have a large number of processing elements (in the thousands) and each element is very simple—operating on its data stream bit by bit. Thus, we call them bit-serial parallel processing systems. Given an array of 1000 16 bit items, a conventional computer would process the array item by item in 1000 steps, whereas a bit-serial parallel processor would process the array broadside in 16 steps with each processing element treating a bit of one item on each step. The array is accessed by bit-slices instead of by items. Since the number of items in a typical array is much larger than the number of bits per item, processing is much faster.

A feature of bit-serial processing is its ability to handle data items of any length. There is no need to extend operands with filler bits to pack them into standard machine words as in a conventional computer. This raises the storage and processing

Manuscript received June 4, 1981; revised December 10, 1981.

The author is with the Digital Technology Department, Goodyear Aerospace Corporation, Akron, OH 44315.

<sup>1</sup> Registered Trademark, Goodyear Aerospace Corporation, Akron, OH 44315.

efficiency. Suppose we have an array of  $M$  operands with  $N$  bits per operand. Any parallel processor suffers some loss in efficiency when  $M$  is less than the number of processing elements. A conventional computer experiences a similar loss in efficiency when  $N$  is less than its word length. Parallel processors with processing elements of a standard word length suffer from both losses in efficiency when  $M$  is less than the number of PE's, and when  $N$  is less than the word length of a PE. Bit-serial parallel processors, like conventional computers, experience only one inefficiency.

Another advantage of bit-serial parallel processors occurs when only a part of the operand needs treatment. For example, only one cycle is required to test the signs of all elements in an array. Associative processing where data items are addressed by content is easily performed by reading out the keys bit by bit and comparing their values to the bits of a comparand. Only the keys involved in the search need be accessed.

While a data array would be accessed broadside by the set of processing elements, the same array would be read and written in the orthogonal direction (item by item) by input-output channels. This is the normal way data arrays are generated, stored, output, and processed in conventional computers. We should accommodate the rest of the world rather than force it to transfer data arrays by bit-slices. Thus, we need a means of accessing data in two directions: item by item for input and output and bit-slice by bit-slice for the set of processing elements.

In STARAN® processors, data arrays are stored in multidimensional access memories [1], [2] so they can be accessed in either direction equally well. The memories use the same kind of random access memory integrated circuits as conventional solid-state stores. The locations of the data bits are scrambled a certain way so data arrays can be accessed many different ways including the orthogonal directions. The inclusion of a few EXCLUSIVE-OR gates in the memory address bus generates the necessary addresses for the memory elements. A network called the flip network [3] is included in the memory data bus to scramble data being written into storage, and to unscramble data being read from storage. This network is also used to route data between processing elements and is akin to a number of multistage interconnection networks [4].

The first STARAN was demonstrated in 1972. It used standard off-the-shelf integrated circuits that were available at that time. Now with the availability of VLSI, much greater packing densities can be obtained. This has led to the recent development of two bit-serial parallel processors: the airborne associative processor and the massively parallel processor.

## II. AIRBORNE ASSOCIATIVE PROCESSOR

During 1978, studies to determine the feasibility of an airborne version of a STARAN processor were conducted under company research and development programs and under U.S. Navy contracts. The studies culminated in the design of an airborne associative processor using VLSI [5]. The architecture of the processor is shown in Fig. 1. The five major blocks are as follows:

- 1) the array unit containing over 2000 PE's and over 1 Mbyte of data storage,
- 2) the array control unit which supplies the control signals for the array unit,
- 3) the register and arithmetic section which controls the input and output of array unit data and generates array addresses,
- 4) the program execution control unit which executes the application program and drives the array control unit and the register and arithmetic section, and
- 5) the control memory which stores the application program and buffers data between the airborne associative processor and a host computer.

### A. Array Unit

The array unit comprises 17 array modules, 16 for the application and a spare module to replace any other array module found to be in error. Each array module contains 128 PE's and four  $32 \times 4096$  bit arrays of multidimensional access storage. Thus, the array unit contains 2048 PE's (plus spare) and one Mbyte of data storage (plus spare).

Fig. 2 depicts one array module. The module contains four custom design VLSI chips with each chip containing the registers for 32 PE's, a 32 line flip network, and a resolver (Fig. 3). The multidimensional access memories for one array module are packaged in eight hybrid packages with each hybrid containing 16  $1K \times 4$  random access memory (RAM) chips. The 32 PE's of one VLSI chip access the storage of two memory hybrid circuits through the 32 line flip network on the VLSI chip. The arrangement is akin to one STARAN array module except that the flip network is only 32 lines wide instead of 256 lines wide [1], [3]. The multidimensional access memories allow the register and arithmetic section to input or output all bits of a 32 bit operand in one memory cycle and the set of PE's to access one bit-slice from all operands in one memory cycle.

The processing elements are much like the PE's of STARAN. Each PE contains a one bit  $X$ -register and a one bit  $Y$ -register which perform the bulk of the bit-serial arithmetic [6]. The one-bit mask register holds a mask bit which governs writing of PE data into the multidimensional access storage in masked-write operations. Each PE also has a one bit hold register for masked-write operations. This register was not needed in STARAN systems because they performed masked writes by selectively setting the write-enable pins of the memory chips. In the airborne associative processor the memory chips are four bits wide so individual masking is not possible at the memory chips. Masked write operations are performed by reading the memory data into the hold register,



Fig. 1. Block diagram of the airborne associative processor.



Fig. 2. One airborne associative processor array module.



Fig. 3. Custom PE/Flip network chip.

changing those data bits where the mask register is set, and then writing the data back into the multidimensional access memory.

The resolver on each PE/Flip network chip tests the 32 bit data bus for the presence of one or more 1-states. The SUM-OR output is an INCLUSIVE-OR of the 32 data lines. The position of a 1-state is output on a 5 bit wide resolver address bus. The resolver is used after associative searches. After each PE has checked its associated storage for the presence or absence of data satisfying the search criteria the match bits are fed to the resolver which locates the position of one item matching the

criteria. The resolver outputs of the PE chips are collected together in a super resolver to see if any of the 2048 PE's in the system have matching data and to generate the address of one match.

The PE/Flip network chip is a custom VLSI chip using CMOS/SOS technology. Cycle times are longer than the STARAN cycle times which used ECL technology. The set of 2048 PE's in the airborne associative processor has about the same processing power as the set of 1024 PE's in a four-array STARAN system.

### B. Array Control Unit

The array control unit broadcasts control signals to all PE's in the array unit. Several basic array operations are possible using the PE registers, multidimensional access storage, or the common register in the control unit as operands:

- 1) the  $X$  and/or  $Y$  registers can be loaded or logically combined with data from the common register, the multidimensional access storage, or the PE registers;
- 2) the mask register can be loaded from the common register, the multidimensional access storage, or the PE registers;
- 3) the multidimensional access storage can be written (either masked or unmasked) with data from the common register or the PE registers; and
- 4) the common register can be loaded with data from the PE registers or the multidimensional access storage.

### C. Register and Arithmetic Section

The register and arithmetic section generates the addresses for the array unit operations and controls the input and output of array unit data. It contains 24 16 bit registers, two 32 bit registers, an Arithmetic and Logic Unit (ALU), and a number of selection gates (Fig. 4).

### D. Program Execution Control

The program execution control unit executes the application program. Conditional branching is provided to any instruction in the control memory. Instruction fetching is overlapped with execution for faster processing. Besides the program counter and the instruction register, the program control unit contains a stack to store up to 16 return addresses and logic to allow prioritized interrupts.

### E. Control Memory

The control memory has three types of storage: program memory holding the application program, buffer memory to buffer data to and from the host computer, and a read-only memory to store routines for certain essential operations such as the data transfer program and the basic built-in test routines. The host computer is a dual processor and the buffer memory communicates with each host processor over its memory bus. It has two banks, each holding 8192 32 bit words. To the host computer, the buffer memory appears as one of its own memory banks (the whole airborne associative processor occupies a memory bank space in the host computer).



Fig. 4. Register and arithmetic section.

### F. Applications

The airborne associative processor was designed to upgrade early warning radar surveillance and command and control processing aboard the Navy's E-2C aircraft. The E-2C is a carrier-based aircraft that receives target data from its own radar and track data from data links. Targets are processed and correlated to form track data which are output to three operator displays. The airborne associative processor fits inside the existing computer as an easily replaceable unit. The high-speed content addressability of the airborne associative processor is used to correlate radar reports with each other and with existing tracks. The software required to maintain the surveillance database is simplified as well.

With suitable modifications, the airborne associative processor could be adapted to other uses in airborne environments or to other applications where a large amount of specialized processing power must be packed into a small volume.

## III. MASSIVELY PARALLEL PROCESSOR

In 1971, the NASA Goddard Space Flight Center initiated a program to develop high-speed image processing systems. They will be required to process the large amount of image data that will come from satellites that NASA will orbit during the 1980's. These systems use thousands of PE's operating simultaneously to achieve their speed (massive parallelism). A typical satellite image contains millions of picture elements (pixels) that can generally be processed in parallel. In 1979 a contract was awarded to construct a massively parallel processor to be delivered in 1982 [7]. The processor has 16 896 PE's arranged in a 128 row  $\times$  132 column rectangular array. The PE's are in the array unit (Fig. 5). Other major blocks in the massively parallel processor are the array control unit, the staging memory, the program and data management unit, and the interface to a host computer.

### A. Array Unit

Logically, the array unit contains 16 384 PE's arranged in a 128 row  $\times$  128 column square array. Physically, the array unit contains an extra 128 row  $\times$  4 column rectangle of PE's



Fig. 5. Block diagram of the massively parallel processor.

for redundancy. Each PE communicates with its four nearest neighbors, north, south, east, and west. Each PE is a bit-serial processor. With a 10 MHz clock rate and 16 384 PE's operating in parallel, the system has a very high processing rate. Each PE can read two 16 bit integers, generate their sum, and store the 17 bit sum in 49 clock cycles so that 16 384 additions are performed in less than 5  $\mu$ s (more than 3000 MOPS). Floating-point operations are performed at a fast rate even though they are not particularly suited for bit-serial processing. Many different floating-point formats are possible. With a 32 bit format (1 bit sign, 7 bit base-16 exponent, and 24 bit fraction), the floating-point addition is better than 400 MOPS and multiplication is better than 200 MOPS.

1) *Array Topology*: The major application of the massively parallel processor is image processing. Since most of the processing is conducted between neighboring pixels, it is natural to connect the thousands of PE's together in a square array with each PE communicating with its nearest neighbors. We investigated the use of other interconnection networks like the multistage SIMD interconnection networks [4], but with over 16 000 PE's they become unwieldy. The layout of a square array is very simple with no long runs to slow down the transfer rate.

Certain image processing operations like the fast Fourier transform (FFT) require communication between pixels or points located far apart in the image. If we store one point in each PE then the routing time would be severe in a nearest neighbor square array topology. But this is not the best way to do FFT's on the MPP. Each PE can store several points in its RAM so that the number of PE's required to do an FFT can be reduced to a small compact subarray of the array unit. The processing power of the other PE's is not wasted since if we want to do one FFT, we will invariably want to do many FFT's so that we can divide the array unit up into many compact subarrays, each doing one FFT. For example, suppose we want to do many 5120 point FFT's. Ten points can be packed into each PE so each FFT can be performed in a 16 row  $\times$  32 column subarray of the array unit. Thirty-two such subarrays can do 32 FFT's in parallel. The longest communication path in each FFT is half the width of the subarray (16 columns), so the routing time can be reduced to a fraction of the computation time.

One may ask the question of how the data can be input and output effectively especially when they have a peculiar layout like in the FFT example. A 5120 point FFT is most easily performed by combining 1024 five point FFT's with five 1024 point FFT's where the position of any point is a function of its index modulo 5 and 1024. The 5120 points of one FFT have a scrambled layout. The permutations required are akin to the permutations required to change a data array from an item by item format to a bit-slice format. Some kind of buffer memory will be required in the array unit input-output path to convert data arrays to and from the bit-slice format. If it is properly designed the same buffer memory could perform other permutations as well, such as those required by the 5120 point FFT example.

Thus, in the massively parallel processor we use a simple square array topology in the array unit and insert a buffer memory (the staging memory) in its input-output path to perform the permutations required by particular application programs. The staging memory transforms the bit-serial format of the array unit to the item by item format of the outside world. With 16 000 PE's this is a better solution to the problem than the solution used in STARAN where a common multi-dimensional access memory was used for both PE random access storage and input-output transformation. Because of the planar nature of the array unit in the massively parallel processor we will refer to accesses as bit-plane accesses instead of bit-slice accesses.

Given a square array with 128 rows and 128 columns, what do we do around the edges? Some application programs would like to see a planar-topology where, for example, the PE's on the north edge see zero when data items are routed to the south. Other programs would like to see a cylindrical-topology where the PE's on the north edge see data from PE's on the south edge when data items are routed to the south. Also, some programs would rather have the 16 384 PE's connected in one long linear string rather than in a  $128 \times 128$  plane. Thus, the edge connections should be a programmable function.

A topology register is included in the array control unit to allow programming of the edge connections. Between the north and south edges of the array unit, one can either stitch them together to make the array look like a cylinder or separate them to make the array look like a plane (Fig. 6). Similarly, the east and west edges can independently be stitched together or separated (if both pairs of edges are stitched together the array looks like a torus). When the east and west edges are stitched together one can either stitch corresponding rows together or slide the stitching by one row so the ~~west~~ PE of row  $i$  communicates with the ~~east~~ PE of row  $i + 1$ . If one slides the stitching, the rows are connected together in spiral fashion so that the array of PE's looks like a long linear string.

2) *Redundancy*: One advantage of the rectangular nearest neighbor connection network is the easy way it allows faulty PE's to be bypassed. When a faulty PE is discovered one bypasses all the PE's in its column (or row) so the topology is not disturbed. We have found no similar technique for bypassing faults in a multistage SIMD interconnection network. To add redundancy to the array unit, we implement more than 128 columns and insert bypass gates in the east-west routing paths.

east  
west



Fig. 6. MPP array topologies.

The array is reduced to 128 columns logically by bypassing some columns. If a faulty PE is discovered we bypass its column and use one of the spare columns instead. Logically, the array will still have 128 columns. Of course, the physical position of many data items will be shifted when the bypassed columns are shifted, but this presents no problem if we do not try to save the data when a fault is discovered. Since the discovery of a fault usually implies the presence of faulty data in the faulty PE and/or its neighbors, we should not try to save the data anyway. After the array unit is reconfigured, recovery is accomplished by restarting the application program from the last checkpoint.

We could just add one redundant column of PE's and bypass the 129 columns individually. Instead, we divide the array up into 32 four-column groups and add a redundant four-column group so only 33 sets of bypass gates are required instead of 129. When a faulty PE is discovered, we bypass all PE's in its four-column group. All outputs from the group are disabled and the east-west routing paths of its two neighboring groups are stitched together. The redundancy of 3 percent ( $4/128$ ) is a small price to pay for the ability to reconfigure around any single faulty PE. The scheme does not handle the case of multiple PE's failing but the probability of this event within a reasonable service interval is minuscule.

**3) Processing Elements:** Each PE is a bit-serial element. Initially, the PE's had down-shifting binary counters for arithmetic [8], [9]. The PE design was modified to use a full adder and a shift register for arithmetic. The modified design performs the basic arithmetic functions faster. Each PE has six 1 bit registers ( $A$ ,  $B$ ,  $C$ ,  $G$ ,  $P$ , and  $S$ ), a shift register with programmable length, a RAM, a data-bus ( $D$ ), a full-adder, and some combinatorial logic (see Fig. 7). The nominal clock rate of the PE's is 10 MHz. In each clock cycle all PE's perform the same operations on their respective data streams (except where masked). The basic PE operations are microsteps of the



Fig. 7. One MPP processing element.

array instruction set. The control signals come from the PE control unit of the array control unit which reads the microcode from a writable control store. As long as there are no conflicts many PE operations can be combined into one 100 ns clock cycle.

**a) Data-Bus Source Selection:** The source of the data bus can be the state of the  $B$ -,  $C$ -,  $P$ -, or  $S$ -register, the state of a selected bit from the RAM, or the output of the equivalence function between  $P$  and  $G$  ( $P \equiv G$  equals 1 if and only if  $P$  and  $G$  are in the same state). The data bus state ( $D$ ) feeds a number of other parts of the PE.

**b) Logic and Routing:** The  $P$ -register is used for logic and routing operations. A logic operation combines the state of the  $P$ -register with the state of the data bus ( $D$ ) to form the new state of the  $P$ -register. Any of the 16 Boolean functions of  $P$  and  $D$  can be selected. A routing operation reads the state of the  $P$ -register in a neighboring PE (north, south, east, or west) and stores the state in the  $P$ -register. When routing occurs, the  $128 \times 128$  plane of  $P$ -registers is shifted synchronously in any of the four cardinal directions.

A logic or routing operation can be unmasked or masked. An unmasked operation is performed in all 16 384 PE's. A masked operation is performed in only those PE's where  $G = 1$ —the  $P$ -register is not disturbed in those PE's where  $G = 0$ .

**c) Arithmetic:** The full adder, the shift register, and registers  $A$ ,  $B$ , and  $C$  are used for bit-serial arithmetic operations. A full-add operation sums the bits in the  $A$ ,  $P$ , and  $C$  registers to form a 2 bit sum which is placed in the  $B$  and  $C$  registers with  $B$  receiving the least significant bit and  $C$  receiving the most significant bit. A half-add operation is similar except that only the bits in registers  $A$  and  $C$  contribute to the sum.

The shift register has a programmable length. Its length can be set to 2, 6, 10, 14, 18, 22, 26, or 30 bits. A shift operation shifts the register one place to the right with the state of the  $B$ -register entering at the left end of the shift register. If register  $A$  is shifted simultaneously then it reads the rightmost bit in the shift register. An operand of length  $4n$ , where  $n$  is an integer from 1 to 8, can be recirculated around the path formed by register  $B$ , the shift register, register  $A$ , and the full adder; the shift register length is set to  $4n - 2$ .

Register  $A$  has three operations: clear  $A$ , load  $A$  with the data-bus state  $D$ , or load  $A$  with the rightmost bit in the shift

register (shift *A*). Register *C* receives the carry bit in full-add and half-add operations and has two other operations: clear *C* and set *C*.

These microarithmetic operations are combined to perform the array arithmetic instruction set. The addition of two arrays of *n* bit integers is performed with each PE treating one pair of integers. Corresponding bits of the integers are fed to the *P* and *A* registers, respectively, starting with the least significant bits. They are added with full-add operations with the carry bits recirculating through register *C* and the sum bits being formed in register *B* and stored back in the RAM. It requires  $3n + 1$  cycles to read the two *n* bit integers from memory and store the  $(n + 1)$  bit sum back into memory. Subtraction is performed similarly except that the 1's complement of the subtrahend is loaded into the *P* register instead of its true value. Two's complement subtraction is done by initializing the *C*-register to 1 instead of 0. Note that the add and subtract operations we described read two operands from storage, and put the result back in storage so they are equivalent to a sequence of three instructions (load accumulator, add or subtract, and store accumulator) executed 16 384 times.

The result of an arithmetic operation can be sent to the shift register instead of storing it to memory. Multiplication is performed by recirculating the partial product through the shift register and adding the multiplicand to it with appropriate shifts. A multiplier bit in the *G*-register controls the loading of the *P*-register. Multiplication of an array of *n* bit integers by corresponding elements of an array of *m* bit integers to produce an array of  $(m + n)$  bit integers requires  $(m - 1)p + 2(m + n)$  cycles, where *p* is a multiple of 4 equal to *n*, *n* + 1, *n* + 2, or *n* + 3.

Division uses a nonrestoring algorithm where the partial dividend is recirculated through the shift register and the divisor or its complement is added for each quotient bit.

Floating-point multiplication is an addition of the exponents and a rounded multiplication of the fractions. Floating-point addition is a comparison of the values followed by an alignment of the fractions, addition of the fractions, and then a normalization of the result.

*d) Other PE Operations:* Other PE operations include loading the *G*-register from the data bus, writing the data bus to a selected bit of the RAM, loading the *S*-register from the data bus, feeding the SUM-OR tree from the data bus, and clearing the memory parity error indicator.

The SUM-OR tree is a tree of INCLUSIVE-OR gates with inputs from all 16 384 PE's. The output is fed to the array control unit which can test and store the result. The SUM-OR tree is used in maximum value and minimum value searches and in other operations where it is necessary to get a global result from the set of PE's.

The memory parity error indicator senses a parity error in the RAM and latches in the 1-state until cleared. The state of all indicators feeds the SUM-OR tree when the tree is not being used for a SUM-OR operation so the control unit will sense the presence of an error in any PE memory and take appropriate action.

*e) Input-Output:* The *S*-register in all PE's is used for input and output of array data. Columns of input data are shifted into the *S*-registers at the west edge of the array unit and across the array until all 16 384 *S*-registers are loaded.

Then the PE processing is interrupted for one machine cycle while the *S*-register plane is transferred to a selected bit-plane of the RAM's. *S*-register shifting can run at a 10 MHz rate so data can be input at a rate of 160 Mbytes/s (128 bits every 100 ns). Note that PE processing is only interrupted once every 128 columns or less than 1 percent of the time.

Data output is similar. The PE processing is interrupted for one cycle and a bit-plane of data is transferred from the RAM's to the *S*-registers. Processing resumes while the output plane is shifted across the array to the east edge where it is output column by column. Each column is 128 bits long and can be shifted out at a 10 MHz rate so that the output rate is also 160 Mbytes/s. Note that while an output plane is being shifted out, an input plane can be shifted into the array unit so input and output can proceed simultaneously.

At first glance an I/O rate of 320 Mbytes/s (160 in and 160 out) would seem to be more than adequate. But the processing rate is so high that some applications may still become I/O bound. One can see an indication of this from the fact that running I/O at a full rate slows the processing down by only a few percent. When such an application arises (and when fast enough peripherals are available), the array unit I/O scheme can be modified to input and output data at several places in the array instead of just at the east and west edges.

*e) Random Access Memories (RAM's):* Each PE has a RAM storing 1024 bits. The address lines of all PE's are tied together so the memories are accessed by bit-planes with one bit of a bit-plane accessed by each PE. Conventional RAM integrated circuits are used to make it easy to expand storage when advances in solid-state memory technology allow it. Four PE's share one  $1\text{K} \times 4$  RAM chip with an access time of about 50 ns. The address bus can be expanded up to 16 address lines so PE memory can be expanded to 65 536 bits per PE or 128 Mbytes total. The array unit clock system has enough flexibility to accommodate a wide variety of memory speeds so the massively parallel processor can be tailored to other applications which may require more memory at a slower speed.

*4) Packaging:* The PE RAM's use standard RAM integrated-circuits. All other components of eight PE's are put on a custom VLSI chip. The chip holds a 2 row  $\times$  4 column subarray of PE's and 2112 such chips are used in the array unit. The chip pinout is 52 pins and the complexity is about 8000 transistors.

A 16 row  $\times$  12 column subarray of 192 PE's is packaged on one  $22\text{ cm} \times 36\text{ cm}$  printed circuit board. The board contains 24 VLSI chips, 54 memory elements (48 for data plus 6 for parity), and some support circuitry. Eleven boards make up an array slice of 16 rows  $\times$  132 columns. Eight array slices (88 boards) make up the array unit and eight other boards hold the topology switches, the control signal fan-out, and other support circuitry. The 96 boards are packaged in one cabinet and cooled by forced air.

#### B. Array Control Unit

The array control unit has three subunits: the PE control unit to control processing in the array unit PE's, the I/O control unit to manage the flow of input/output data through the array unit, and the main control unit which runs the application program, performs any necessary scalar processing, and controls the other two subunits (Fig. 8). This division of



Fig. 8. Block diagram of the MPP array control unit.

responsibility allows array processing, scalar processing, and I/O to proceed simultaneously. A queue between the main control unit and the PE control unit can hold up to 16 calls to array processing routines.

1) *PE Control Unit*: The PE control unit generates all PE control signals except those associated with I/O and the *S*-registers. The control unit reads 64 bit wide microinstructions from the PE control memory. The PE control memory holds the standard library of array processing routines plus any user-generated routines so it is like the writable control store in other computers. When the PE control unit receives a call from the queue, it reads the calling parameters and jumps to the entry point of the called array processing routine. After executing the routine, the PE control unit then processes the next call from the queue.

The PE control unit contains a 64 bit wide common register to hold the scalar values required by routines that combine scalars with arrays (e.g., add scalar to array), search arrays for values (e.g., find all the elements of an array greater than a scalar), or generate a scalar from an array (e.g., find the minimum value in an array).

There are eight 16 bit index registers in the PE control unit. One index register holds the index of a selected bit in the common register. Since array processing is bit-serial, the common register scalar is also usually treated bit by bit. The selected common-register bit (*W*) can be tested by branch instructions, used to select a *P*-register logic function in all PE's, and loaded by the SUM-OR tree output. Note that using the common register bit (*W*) to select a *P*-register logic function allows one to select any of the 256 logic functions of three variables—in every PE the selected function between register *P*, the data bus state (*D*), and the common register bit (*W*) is stored in register *P*. This is the mechanism used to broadcast common register value to all PE's.

The other seven index registers can hold the addresses of array bit-planes in the PE RAM's. An array is usually processed by stepping through its bit-planes either from the most significant bit to least significant bit or vice versa. Any of the eight index registers can be used to hold the length of an array. Many of the array processing routines are of variable length so they use an index register to hold a loop count and decrement it once for each bit-plane treated.

Other registers in PE control include the topology register

to select the array unit topology, a program counter holding the location of the current microinstruction in the PE control memory, and a subroutine return stack to facilitate using some array processing routines as subroutines to other routines.

The instruction register is 64 bits wide. Most instructions are executed at a nominal 10 MHz rate. Several operations can be merged into one instruction, e.g., several PE operations, modification of several index registers, and conditional branching. Merging allows most of the control unit overhead to be absorbed so the PE's are doing useful work on every machine cycle.

2) *I/O Control Unit*: The I/O control unit shifts the PE *S*-registers, manages the flow of data in and out of the array unit, interrupts PE control to transfer data between the *S*-registers and the PE memory elements, and can also control the staging memory. Once initiated by main control or the program and data management unit the I/O control unit chains through a sequence of I/O commands in main control memory.

3) *Main Control Unit*: Main control reads and executes the application program from the main control memory. It performs all scalar processing itself and sends all array processing calls to the queue for processing by the PE control unit. Input and output operations for the I/O control unit are either sent directly to the I/O control unit or sent to the program and data management unit for coordination with its peripheral transfers.

The main control has 16 general-purpose registers, some registers to enter calling parameters into the PE control unit queue, and other registers to receive scalars generated by certain array processing routines.

### C. Staging Memory

The staging memory is in the I/O path of the array unit. Besides acting as a buffer between the array unit and the outside world, the staging memory reformats data so that both the array unit and the outside world can transfer data in the optimum format. The array unit sees data in a bit-plane format (one bit from 16 384 different items) while the outside world sees data in an item format (all bits of one item). The staging memory can also rearrange data to match the scrambled layouts of some application programs. The 5120 point FFT example (Section III-A.1) is one such program.

The staging memory comprises a main stager memory, an input substager, and an output substager (Fig. 9). The main stager memory can have 4, 8, 16, or 32 banks of storage with 16K, 64K, or 256K words per bank. Each word holds 64 data bits plus 8 error correction bits for single error correction and double error detection. A fully implemented main stager would hold 67 Mbytes of data. Each bank contains 72 dynamic MOS RAM elements. Initially, 16K-bit elements are used. When 64K bit elements are readily available, the storage in each bank can be quadrupled to 64K words, and when 256K bit elements are feasible, the storage per bank can be quadrupled again. Each bank can accept a 64 bit word and present a 64 bit word every 1.6  $\mu$ s cycle time (the cycle time also includes any memory refresh required), so that each bank has a 10 Mbyte/s I/O rate (5 Mbytes/s input and 5 Mbytes/s output). A 32 bank main stager can accept and present data at the 160 Mbyte/s rate of the array unit I/O ports.



Fig. 9. Block diagram of the staging memory.

The substagers are fast 128 bit  $\times$  1024 bit ECL multidimensional access memories [1]. The input substager accepts data in the format of the source (array unit, program and data management unit, or the host) and rearranges the data to agree with the main stager format. The output substager performs the complementary function of rearranging data from the main stager format to the format of the destination.

Many different main stager formats are possible—a main stager word may contain one bit of 64 different elements, two bits from 32 different elements, etc. The main stager format is selected based on the data formats in the source and the destination. A software module in the program and data management unit can be used to select the main stager format and program the internal transfers of the staging memory.

#### D. Program and Data Management Unit

The program and data management unit can control the overall flow of programs and data in and out of the massively parallel processor. It acts as a small-scale host when the normal host is not available. The program and data management units is a DEC PDP-11 minicomputer with a number of terminals, a line printer, disk storage, and a tape unit operating under DEC's RSX-11M real-time multiprogramming system. Custom interfaces provide communication with the array control unit and the staging memory.

The program development software package for the massively parallel processor is executed in the program and data management unit. The package includes an assembler for the PE control unit to facilitate developing array processing routines, an assembler for the main control unit to develop application programs, a linker to form load modules for the array control unit, and a control and debug module to load, execute, and debug programs. Much of the software development package is written in Fortran using a Ratfor preprocessor to ease the transporting of the package to the host computer.

#### E. Host Interface

The massively parallel processor to be delivered to NASA will use a DEC VAX-11/780 for a host computer. The staging memory is connected to a DEC DR-780 high-speed interface of the VAX which can transfer data at a rate of 6 Mbytes/s. The staging memory interface is designed to accommodate other devices such as high-speed disks. To allow control of the massively parallel processor by the host the array control interface can be switched from the program and data manage-

ment unit to the host computer. Transfer of the software is simplified by writing much of it in Fortran.

#### F. Applications

The massively parallel processor is designed for high speed processing of satellite imagery. The typical operations may include radiometric and geometric corrections and multispectral classification. Preliminary application studies indicate that the processor may also be useful for other image processing tasks, weather simulation, aerodynamic studies, radar processing, reactor diffusion analysis, and computer image generation.

The modular nature of the processor allows the number of processing elements and the capacities of its memories to be scaled up or down to match the requirements of the application.

## IV. CONCLUSIONS

Bit-serial parallel processors can perform certain tasks much faster than other architectures. The use of VLSI allows a large amount of processing power to be packed in a small volume. The airborne associative processor illustrates the use of bit-serial parallel processors in an airborne environment, while the massively parallel processor is an illustration of a ground-based system.

## REFERENCES

- [1] K. E. Batcher, "The multidimensional access memory in STARAN," *IEEE Trans. Comput.*, vol. C-26, pp. 174-177, Feb. 1977.
- [2] ———, "STARAN series E," in *Proc. 1977 Int. Conf. Parallel Processing*, Aug. 1977, pp. 140-143.
- [3] ———, "The Flip network in STARAN," in *Proc. 1976 Int. Conf. Parallel Processing*, Aug. 1976, pp. 65-71.
- [4] H. J. Siegel and S. D. Smith, "Study of multistage SIMD interconnection networks," in *Proc. 5th Annu. Symp. Comput. Architecture*, Apr. 1978, pp. 223-229.
- [5] T. DiGiacinto, "Airborne associative processor (ASPRO)," in *Proc. AIAA Comput. in Aerosp. III Conf.*, Oct. 1981, pp. 202-205.
- [6] K. E. Batcher, "STARAN parallel processor system hardware," in *Proc. 1974 Nat. Comput. Conf.*, May 1974, pp. 405-410.
- [7] ———, "Design of a massively parallel processor," *IEEE Trans. Comput.*, vol. C-29, pp. 836-840, Sept. 1980.
- [8] L. W. Fung, "A massively parallel processing computer," in *High-Speed Computer and Algorithm Organization*, D. J. Kuck et al., Eds. New York: Academic, 1977, pp. 203-204.
- [9] ———, "MPPC: A massively parallel processing computer," Goddard Space Flight Cen., Greenbelt, MD, GSFC Image Syst. Section Rep., Sept. 1976.



**Kenneth E. Batcher** received the B.S. degree in electrical engineering from Iowa State University, Ames, in 1957, and the M.S. and Ph.D. degrees from the University of Illinois, Urbana, in 1962 and 1964, respectively.

In 1957 he was a trainee at Goodyear Atomic Corporation, Portsmouth, OH, and in 1958 he joined the Goodyear Aerospace Corporation, Akron, OH, where he is now an Engineer, Principal in the Digital Technology Department. His main field of interest is parallel processing. He developed the odd-even and bitonic sorting networks, and was the chief architect on the STARAN and STARAN-E parallel processors. Currently, he is the chief architect on the Massively Parallel Processor.

Dr. Batcher is a member of the Association for Computing Machinery, Phi Eta Sigma, Phi Kappa Phi, and Sigma Xi.