Skip to content

GPU Acceleration

Rahul Rajesh edited this page Apr 15, 2020 · 10 revisions

Source Programs on the GPU

Note: This page is still a work in progress!

This page outlines the current work done to write source programs that can be accelerated using a GPU. The documentation will cover the techniques employed to do GPU acceleration and the language syntax to allow this form of acceleration.

Try it out here: https://source-gpu.herokuapp.com/
Source Code: https://github.com/rrtheonlyone/js-slang/tree/add-gpu
Slides: https://bit.ly/3cjZrdX

Graphical Processing Unit (GPU) to perform arbitrary computation

Essentially, the GPUs of the early 2000s were designed to produce a color for every pixel on the screen using programmable arithmetic units known as pixel shaders. In general, a pixel shader uses its (x,y) position on the screen as well as some additional information to combine various inputs in computing a final color. The additional information could be input colors, texture coordinates, or other attributes that would be passed to the shader when it ran. But because the arithmetic being performed on the input colors and textures was completely controlled by the programmer, researchers observed that these input “colors” could actually be any data.

So if the inputs were actually numerical data signifying something other than color, programmers could then program the pixel shaders to perform arbitrary computations on this data. The results would be handed back to the GPU as the final pixel “color,” although the colors would simply be the result of whatever computations the programmer had instructed the GPU to perform on their inputs. This data could be read back by the researchers, and the GPU would never be the wiser. In essence, the GPU was being tricked into performing nonrendering tasks by making those tasks appear as if they were a standard rendering. This trickery was very clever but also very convoluted.

Ref: "Cuda by Example, An introduction to General-Purpose GPU programming", chapter 1, 1.3.2

The GPU.js library

https://github.com/gpujs/gpu.js

GPU.js is a Javascript acceleration library for GPGPU. It is written in Javascript and works for both web and node. We will be making use of their API to run source programs on the GPU. An example GPU.js program is as follows:

import { GPU } from 'gpu.js';
const gpu = new GPU();
const multiplyMatrix = gpu.createKernel(function(a: number[][], b: number[][]) {
  let sum = 0;
  for (let i = 0; i < 512; i++) {
    sum += a[this.thread.y][i] * b[i][this.thread.x];
  }
  return sum;
}).setOutput([512, 512]);

const c = multiplyMatrix(a, b) as number[][];

How are source programs going to take advantage of this?

We aim for seamless integration with Source. Hence, students will write source programs as per normal and under certain conditions, the GPU will be invoked to run the code. In order to do this, we will define a formal specification of what constitutes a valid source-gpu program. Following which, we run through the implementation and show some results of this change.

GPU-Source language specification

Similar to source §4 specifications, we define our GPU-Source language using Backus-Naur form as follows:

gpu_statement := for ( ( gpu_for_let );
                         gpu_for_condition ;
                         gpu_for_assignment ) gpu_block     

gpu_for_let := let name = 0
gpu_for_condition := name <= number | name < number
gpu_for_assignment := name = name + 1

gpu_block := { gpu_statement } | { core_statements }

core_statements := core_statement... gpu_result_assignment

gpu_result_assignment := gpu_access[name] = gpu_expression;
gpu_access := name | gpu_access[name]

core_statement ::= const name = gpu_expression;
                  | gpu_let ; 
                  | gpu_assignment ;
                  | gpu_expression[gpu_expression] = gpu_expression; 
                  | while (gpu_expression) gpu_block
                  | for ((gpu_assignment | gpu_let); 
                         gpu_expression;
                         gpu_assignment) gpu_block

gpu_assignment ::= name = gpu_expression
gpu_let ::= let name = gpu_expression

gpu_expression ::= number
                  | true | false
                  | null
                  | name
                  | string
                  | gpu_expression binary-operator gpu_expression
                  | unary-operator expression
                  | gpu_function ( gpu_expressions ) 
                  | gpu_expression ? gpu_expression : gpu_expression
                  | gpu_expression[gpu_expression] 
                  | [ gpu_expressions ] 
                  | ( gpu_expression ) 

gpu_expressions ::= epsilon | gpu_expression ( , gpu_expression ) ... 

Restrictions

Even if the BNF syntax is met, GPU acceleration can only take place if all the restrictions below are satisfied. If all criteria are met, the gpu_statement loops are embarassingly parallel.

Special For Loops

In the BNF, we have special loops that take on this form:

for ( ( gpu_for_let );
        gpu_for_condition ;
        gpu_for_assignment ) 

These are the loops that will be taken into consideration for parallelization. However, on top of the BNF syntax, the below requirement must also be statisfied:

  • the names used in each for loop has to be different across the loops (no repeating identifiers among the for loops)
  • the names have to be the same within the loop

GPU Function

A gpu_function has to be a math_* function

Core Statement

Within core statement, there are some constraints:

  • no assignment to any global variables (all assignments only can be done to variables defined in the gpu_block)

GPU Result Statement

The GPU Result statement is the statement that stores a value calculated in core statements into a result array. It access an array at a certain coordinate e.g. array[i1][i2][i3]. This result array has to be defined outside the gpu_block.

The sequence of coordinates which we access in the result array i1, i2, i3... ik must be a prefix of the Special For Loop counters [c1,c2..cn]. If you have n special for loops, the array expression can take on k coordinates where 0 < k <= n. The order matters as well, it has to follow the same order as the special for loops; you cannot have name[c2][c1].

Examples of valid/invalid source programs based on the above rule:

Valid - Using first loop counter. (meaning the loop will be run across N threads; the first loop is parallelized away):

for (let i = 0; i < 10; i = i + 1) {
    for (let k = 0; k < 10; k = k + 1) {
        res[i] = arr[k % 2] + 1;
    }
}

Invalid - Counter used is not a prefix of for loop counters:

for (let i = 0; i < 10; i = i + 1) {
    for (let k = 0; k < 10; k = k + 1) {
        res[k] = arr[i % 2] + 1;
    }
}

Valid - Using first three loop counters:

for (let i = 0; i < 10; i = i + 1) {
    for (let j = 0; j < 10; j = j + 1) {
        for (let k = 0; k < 10; k = k + 1) {
            let x = math_pow(2, 10);
            let y = x * (1000);
            arr[i][j][k] = (x + y * 2);
        }
    }
}

Invalid - Indices are in wrong order (must respect for loop counter orders):

for (let i = 0; i < 10; i = i + 1) {
    for (let j = 0; j < 10; j = j + 1) {
        for (let k = 0; k < 10; k = k + 1) {
            let x = math_pow(2, 10);
            let y = x * (1000);
            res[k][j][i] = (x + y * 2);
        }
    }
}

Invalid - Using an index that is not part of a special for loop (see above):

for (let i = 1; i < 10; i = i + 2) {
    for (let j = 0; j < 10; j = j + 1) {
        for (let k = 0; k < 10; k = k + 1) {
            res[i] = arr1[i] + arr2[j];
        }
    }
}

GPU Acceleration

If all of the above requirements are satisfied, the code will be run in parallel for each of the counter variables specified in GPU Result Statement.

For example, if we consider the gpu-source program below:

for (let r = 0; r < 10; r = r + 1) {
    for (let c = 0; c < 10; c = c + 1) {
        let sum = 0;
        for (let i = 0; i < 10; i = i + 1) {
            sum += mat[r][i] * mat[i][c]  l
        }
        res[r][c] = sum;
    }
}

We see that r and c are used as part of our gpu_result_assignment so the inner body below will be parallelised using the GPU:

let sum = 0;
for (let i = 0; i < 10; i = i + 1) {
    sum += mat[r][i] * mat[i][c]  l
}
res[r][c] = sum;