I posted about this project last year, but I've made some progress since then. First, a quick demo:
This is an experimental GPGPU I've been working on, running on an FPGA. It uses a Larrabee-esque, software-focused programming model. Full source (RTL and tools) is available here:
https://github.com/jbush001/GPGPU
Documentation is here:
https://github.com/jbush001/GPGPU/wiki
This demo texture-maps a wrapping 16 x 16 image onto the screen using a nearest-neighbor algorithm. Each texture coordinate is computed with a 2x2 floating-point matrix multiply. The program is written in C++ and compiled with Clang, using an LLVM backend I implemented for this instruction set.
This demo probably isn't the cleanest or most optimal implementation, but it exercises many of the major hardware features of the design. Source for the demo is here. Here's a quick walkthrough of how it works:
This processor supports multiple hardware threads to hide the latency of floating-point operations and memory accesses. The number of hardware threads is a synthesis parameter; I'm using four right now. Tests I've run with some simple workloads suggest that more than four doesn't improve performance much, but this is an area I'll probably explore further.
This processor has a wide vector arithmetic pipeline with 16 lanes, supporting both floating-point and integer operations, and 32 general-purpose vector registers. The compiler uses GCC-style extensions (http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html) for vector types, which can be declared like this:
Code:
typedef int veci16 __attribute__((ext_vector_type(16)));   // 16 lanes of 32-bit int
typedef float vecf16 __attribute__((ext_vector_type(16))); // 16 lanes of 32-bit float
The compiler treats vectors as first-class types: they can be passed as parameters to functions, used as local variables, stored in structs, and so on. They can also be used directly in simple arithmetic expressions. For more advanced features such as scatter/gather loads, I've added a number of compiler builtins (which use C function-call syntax but compile directly to instructions).
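As a minimal illustration of what the first-class vector types look like in practice (this snippet is just a sketch using the types declared above, not code from the demo):
Code:
// Vectors pass and return by value like any other type; arithmetic
// operators apply element-wise across all 16 lanes.
vecf16 scaleAndBias(vecf16 values, vecf16 scale, vecf16 bias)
{
    return values * scale + bias;
}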
In this program, each lane of the vector represents a single pixel instance. A hardware thread in this architecture is more akin to a warp in a traditional GPU architecture. So, in this program there are 4 x 16 = 64 pixels being actively processed at a time.
The threads process batches of 16 pixels, interleaved at multiples of 16 along each scanline (I use the term 'strand' to refer to a hardware thread):
Code:
int myStrandId = __builtin_vp_get_current_strand();
...
// Each strand starts at its own 16-pixel offset within the scanline
// and strides by 64 pixels (4 strands x 16 lanes).
veci16 *outputPtr = kFrameBufferAddress + myStrandId;
for (int y = 0; y < kScreenHeight; y++)
{
    for (int x = myStrandId * 16; x < kScreenWidth; x += 64)
    {
For each batch of pixels, the demo performs a matrix multiply to compute the texture coordinates. A simple class holds the matrix:
Code:
Matrix2x2 displayMatrix;
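The full class definition isn't reproduced in this post. A minimal sketch of what Matrix2x2 could look like, assuming only the four scalar elements referenced below and the multiply operator used later to step the rotation:
Code:
struct Matrix2x2
{
    float a, b, c, d; // laid out as | a b |
                      //             | c d |

    // Standard 2x2 matrix product; used later as displayMatrix * stepMatrix.
    Matrix2x2 operator*(const Matrix2x2 &m) const
    {
        return Matrix2x2 { a * m.a + b * m.c, a * m.b + b * m.d,
                           c * m.a + d * m.c, c * m.b + d * m.d };
    }
};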
In the next code snippet, xv and yv are vectors containing the x and y screen coordinates of each pixel. The matrix elements (a, b, c, d) are scalar values. Clang doesn't allow mixing scalar and vector operands in an expression, so I've added a builtin that expands a scalar into a vector (__builtin_vp_makevector).
Code:
vecf16 u = xv * __builtin_vp_makevectorf(displayMatrix.a)
    + yv * __builtin_vp_makevectorf(displayMatrix.b);
vecf16 v = xv * __builtin_vp_makevectorf(displayMatrix.c)
    + yv * __builtin_vp_makevectorf(displayMatrix.d);
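The code that builds xv and yv isn't shown above. One plausible way to construct them, given the lane-per-pixel mapping described earlier (kLaneOffsets and the casts are my assumptions, not from the demo source):
Code:
// Hypothetical per-lane offsets: one pixel per lane within the batch.
const vecf16 kLaneOffsets = { 0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f,
    8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f };

// All 16 lanes of a batch share the same y; x varies by lane.
vecf16 xv = __builtin_vp_makevectorf((float) x) + kLaneOffsets;
vecf16 yv = __builtin_vp_makevectorf((float) y);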
This processor uses a unified pipeline: scalar operations simply use the lowest lane of the vector ALU. The design also allows mixing scalar and vector operands at the instruction level; the compiler automatically determines where this can occur and emits instructions like:
Code:
mul.f v0, v1, s2
This code eventually computes a vector 'pixelPtrs' of 16 different pointers into the source bitmap. It can then do a gather load of all of the pixel pointers, with the result for each pointer going into a lane of another vector, which is then stored as a single contiguous block into the framebuffer:
Code:
*outputPtr = __builtin_vp_gather_loadi(pixelPtrs);
This compiles to two instructions.
Code:
load.v v0, (v0)
store.v v0, (s12)
The load takes 16 cycles, but the store completes in a single cycle because the elements are contiguous in memory and there is a wide 512-bit path between the store buffer and the L2 cache. Since the L2 cache is write-back, I need to explicitly flush the finished pixels out to memory. I then increment the output pointer by four vector widths:
Code:
dflush(outputPtr);   // write the finished cache line back to memory
outputPtr += 4;      // advance 4 vector widths (64 pixels) to this strand's next batch
And rotate the matrix one step:
Code:
displayMatrix = displayMatrix * stepMatrix;
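stepMatrix isn't shown in these snippets. For a rotation effect it would plausibly be a small fixed-angle rotation matrix, along these lines (the angle is a made-up value, and this assumes the Matrix2x2 sketch above):
Code:
#include <math.h>

const float kStepAngle = 0.01f; // radians per step; hypothetical value
Matrix2x2 stepMatrix = {
    cosf(kStepAngle), -sinf(kStepAngle),
    sinf(kStepAngle),  cosf(kStepAngle)
};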
This demo uses a single core running at 25 MHz (I have run multiple cache-coherent cores in Verilog simulation, but they don't fit on my FPGA). Memory runs at 50 MHz so it can feed the display controller, which DMAs the image out of the shared SDR SDRAM. The demo achieves about 18 frames per second and takes an average of around 4.3 cycles to compute and write each pixel (including waiting for memory, etc.).
This design requires about 90k logic elements on a Cyclone IV FPGA.
The next thing I'm looking into is a 3D engine. I've implemented a simple 3D renderer in assembly, but I'd like to come up with something more sophisticated. I'm thinking a tile-based approach would be appropriate, with a tile size that fits in the cache and one hardware thread per tile.
Comments/feedback are welcome.