GPGPU running on FPGA

Jeff B

Newcomer
I posted about this project last year, but I've made some progress since. First, a quick demo:


This is an experimental GPGPU I've been working on, running on an FPGA. It uses a Larrabee-esque, software-focused programming model. Full source (RTL and tools) is available here:

https://github.com/jbush001/GPGPU

Documentation is here:

https://github.com/jbush001/GPGPU/wiki

This demo is texture mapping a wrapping 16 x 16 image (using a nearest neighbor algorithm) onto the screen. Each texture coordinate is computed using a 2x2 floating point matrix multiply. The program is written in C++, using the Clang compiler with an LLVM backend I implemented for this instruction set.

This demo probably isn't the cleanest or most optimal implementation, but it exercises many of the major hardware features of the design. Source for the demo is here. Here's a quick walkthrough of how it is implemented:

This processor supports multiple hardware threads to hide latency of floating point operations and memory accesses. The number of hardware threads in this architecture is a synthesis parameter, but I'm using four right now. Tests I've run with some simple workloads suggest that more than four doesn't improve performance much, but this is an area I will probably explore more.

This processor has a wide vector arithmetic pipeline with 16 lanes, which supports floating point and integer operations. There are 32 general purpose vector registers. The compiler uses GCC style extensions (http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html) for vector types, which can be declared like this:

Code:
typedef int veci16 __attribute__((ext_vector_type(16)));
typedef float vecf16 __attribute__((ext_vector_type(16)));

The compiler treats vectors as first class types, which can be passed as parameters to functions, used as local variables, members of structs, etc. They can also be used directly for simple arithmetic operations. For more advanced features such as scatter/gather loads, I've added a number of compiler builtins (which use C functional syntax, but compile directly to instructions).
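
For instance, vectors can be passed into and returned from functions and combined with the normal arithmetic operators. A minimal sketch using the typedefs above (the function itself is just for illustration and isn't part of the demo):

Code:
	// Element-wise linear interpolation; every operator here works per lane.
	vecf16 lerp16(vecf16 a, vecf16 b, vecf16 t)
	{
		return a + (b - a) * t;
	}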

In this program, each lane of the vector represents a single pixel instance. A hardware thread in this architecture is more akin to a warp in a traditional GPU architecture. So, in this program there are 4 x 16 = 64 pixels being actively processed at a time.

The threads process batches of 16 pixels, interleaved on multiples of 16 along each scanline (I use the term 'strand' to refer to each hardware thread):

Code:
	int myStrandId = __builtin_vp_get_current_strand();

...
		veci16 *outputPtr = kFrameBufferAddress + myStrandId;
		for (int y = 0; y < kScreenHeight; y++)
		{
			for (int x = myStrandId * 16; x < kScreenWidth; x += 64)
			{
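
The loop body (elided above) builds per-lane screen coordinates from the scalar loop counters before doing any math on them. A hedged sketch of what that could look like, assuming a hypothetical kLaneIndices constant whose 16 lanes hold 0 through 15:

Code:
	vecf16 xv = __builtin_vp_makevectorf((float) x) + kLaneIndices;	// lane i covers pixel x + i
	vecf16 yv = __builtin_vp_makevectorf((float) y);			// same y for the whole batch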

For each batch of pixels, it performs a matrix multiply to determine the texture coordinates. There is a simple class that contains the matrix:

Code:
	Matrix2x2 displayMatrix;

In the next code snippet, xv and yv are vectors that contain the x and y screen coordinates of each pixel. The matrix elements (a, b, c, d) are scalar values. The Clang compiler doesn't allow mixing scalar and vector register operands, so I've added a builtin that expands a scalar to a vector (__builtin_vp_makevector).

Code:
	vecf16 u = xv * __builtin_vp_makevectorf(displayMatrix.a)
		 + yv * __builtin_vp_makevectorf(displayMatrix.b);
	vecf16 v = xv * __builtin_vp_makevectorf(displayMatrix.c) 
		+ yv * __builtin_vp_makevectorf(displayMatrix.d);

This processor uses a unified pipeline. Scalar operations just use the lowest lane of the vector ALU. This design also allows mixing scalar and vector operands. The compiler automatically determines where this can occur and can emit instructions like:

Code:
   mul.f v0, v1, s2

This code eventually ends up computing a vector 'pixelPtrs' of 16 different pointers into the source bitmap. It can then do a gather load through all of the pixel pointers, with the result from each pointer going into a lane of another vector, which is then stored as a single contiguous block into the framebuffer:

Code:
	*outputPtr = __builtin_vp_gather_loadi(pixelPtrs);

This compiles to two instructions.

Code:
	load.v v0, (v0)
	store.v v0, (s12)

The load takes 16 cycles, but the store completes in a single cycle because the elements are contiguous in memory and there is a wide 512 bit path between the store buffer and the L2 cache (16 lanes x 4 bytes is exactly one 512 bit transfer). Since the L2 cache is write back, I need to flush the finished pixels out to memory. I then increment the pointer by 4 vector widths, skipping past the blocks the other three threads are handling:

Code:
	dflush(outputPtr);
	outputPtr += 4;

And rotate the matrix one step:

Code:
	displayMatrix = displayMatrix * stepMatrix;

This demo is using a single core, running at 25 MHz (I have run multiple cache-coherent cores in Verilog simulation, but they don't fit on my FPGA). Memory runs at 50 MHz to be able to feed the display controller, which DMAs the image out of the shared SDR SDRAM. This demo achieves about 18 frames per second and takes an average of around 4.3 cycles total to compute and write each pixel (including waiting for memory, etc).

This design requires about 90k logic elements on a Cyclone IV FPGA.

The next thing I'm looking into is a 3D engine. I've implemented a simple 3D renderer in assembly, but I'd like to come up with something more sophisticated. I'm thinking that a tile-based approach would probably be appropriate, with a tile size that fits in the cache and one hardware thread per tile.

Comments/feedback are welcome.
 
Cool project! What board are you using to test it on?

Btw, Altera has some OpenCL stuff but I haven't looked into it much.
http://www.altera.com/products/software/opencl/opencl-index.html
 
Thanks. I'm using Terasic's DE2-115 board.

The OpenCL stuff is interesting; I hadn't looked at that either. It looks like it compiles the OpenCL kernel into an FPGA bitstream. It doesn't have quite the flexibility of a GPU, but it would be useful for dedicated applications.
 
would that be suitable for bitcoin mining?

This project is cool, quite amazing how much only one person put together, but it's only capable of 400M ALU ops per second. That's several orders of magnitude behind what a modern GPU or even CPU can do, let alone a bitcoin mining ASIC.

The cost in area, clock speed, and flexibility of doing this on an FPGA is just too high.
 
Yeah, an FPGA tends to suffer from a lot of delay in the routing fabric that connects the logic elements together. For example, the UltraSparc T1 chip runs at 1.4 GHz, but the same design synthesized on a Xilinx FPGA has a maximum clock speed of around 75 MHz. I've seen preproduction mobile GPUs that run at single-digit clock speeds on FPGA (they're obviously much more complex).

Most FPGA accelerated applications are specifically designed to take advantage of FPGA logic resources, and generally utilize very specialized and highly parallel logic (for example, this one)

However, my intent in bringing this design up on FPGA wasn't really to do useful computation, but to explore the performance and flexibility of this type of architecture. While the design running on FPGA certainly isn't setting any speed records, it is a vast improvement over the same design running in Verilog simulation, which takes around half an hour to simulate a single frame on my laptop. :)
 
so it's a vector cpu, running in-order, but supporting 4x SMT, with 512bit wide vector units and running a software 'rasterizer' in your demo?

I have not much experience with FPGA, have you used some libs to synthesize e.g. the float units? have you used some units designed for throughput, not for low latency?

are your 'threads' scheduled dynamically or are you interleaving the execution 'hard wired'?

you said you use those 'threads' to also hide latency on the math side, is that because of a) 'waiting' for the result or b) do your math units have less throughput than 1 op/cycle? in the case of a), would it make sense to serialize the lanes to use the same unit and this way automagically hide latency?

really cool project!
 
so it's a vector cpu, running in-order, but supporting 4x SMT, with 512bit wide vector units and running a software 'rasterizer' in your demo?

Yeah, basically, although this demo is more of a texture engine than a rasterizer.

I have not much experience with FPGA, have you used some libs to synthesize e.g. the float units? have you used some units designed for throughput, not for low latency?

I built them from scratch. The floating point pipeline is four stages. I broke it up this way to spread out the logic delay and get a higher clock speed, at the expense of more overall latency (ie the result doesn't appear until four cycles later). So, I guess I'd say it is optimized for throughput.

are your 'threads' scheduled dynamically or are you interleaving the execution 'hard wired'?

It's round-robin. Basically, it tries to cycle through threads 0..1..2..3..0..1..., but skips any threads that are not runnable because they are blocked waiting for memory or arithmetic.
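
Roughly, the selection logic does something like this each cycle (a software sketch of the policy, not the actual RTL):

Code:
	// Start at the thread after the last one that issued and pick the first
	// thread whose ready bit is set; if none are ready, nothing issues.
	int nextThread(unsigned readyMask, int lastIssued, int numThreads)
	{
		for (int i = 1; i <= numThreads; i++)
		{
			int candidate = (lastIssued + i) % numThreads;
			if (readyMask & (1u << candidate))
				return candidate;
		}

		return -1;	// all threads blocked this cycle
	}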

you said you use those 'threads' to also hide latency on the math side, is that because of a) 'waiting' for the result or b) do your math units have less throughput than 1 op/cycle?

They do 1 op/cycle, but, as mentioned above, there are four stages in the floating point pipeline and thus four cycles of latency. So, if you have two instructions where the second depends on the result from the first, the thread must stall for that many cycles until the result is available. However, during that time, other threads can be scheduled, hiding the latency.

I built a visualizer app that graphically displays thread states over time. Each horizontal strip represents a hardware thread and the horizontal axis is time. The color at each instant is as follows:

- Red indicates a thread is waiting on data accesses (L1 data cache load miss or store buffer full).
- Yellow indicates a thread is waiting on a long latency instruction (floating point, integer multiply).
- Black indicates a thread is waiting on the instruction cache.
- Green indicates a thread that is ready to issue.

The thin blue line on the bottom indicates where instructions are issued (with gaps in the line showing where no instruction is available because all threads are blocked).

[Image: alpha-state-trace.png]


As you can see, threads spend a bunch of time blocked, but overall utilization seems to be pretty good.

in the case of a), would it make sense to serialize the lanes to use the same unit and this way automagically hide latency?

I believe a number of vector processors (like the Cray family) do this and call it 'vector chaining'. In my case, threads seem to be pretty effective at achieving the same goal. I think it's just a different philosophical approach.

really cool project!

Thanks!
 
Requires one to be quite a renaissance man to pull off a project like this: mathematics, boolean logic design, microprocessor design, computer programming, graphics programming... The only major piece missing is actually fabricating the hardware, which thankfully can be skipped through the wonders of FPGAs... :)

Thoroughly, monstrously impressive work. There are no other words for it.
 
thanks for the detailed reply.

I built them from scratch. The floating point pipeline is four stages. I broke it up this way to spread out the logic delay and get a higher clock speed, at the expense of more overall latency (ie the result doesn't appear until four cycles later). So, I guess I'd say it is optimized for throughput.
4 cycles actually looks very short; even those small ARM7 (Game Boy) CPUs had 3 cycles. I'd have expected latencies of 20-80 cycles.


It's round-robin. Basically, it tries to cycle through threads 0..1..2..3..0..1..., but skips any threads that are not runnable because they are blocked waiting for memory or arithmetic.
is that 'for free'? (I mean, can it cycle a whole round to the same thread for the next cycle if the 3 others were stalled?)
have you tried other scheduling strategies? I would have thought that running the same thread for as long as possible would be more efficient, as you have a lot of thread switching to hide latency without any need; that could be useful once one thread really needs it, e.g. when reading memory (but I'm in no way an expert, just curious :) ).

I built a visualizer app that allows graphically displaying thread states over time. ...
is there anything you haven't built yourself? looks like you really had all the fun :)

Each horizontal strip represents a hardware thread and the horizontal axis is time. The color at each instant is as follows:

- Red indicates a thread is waiting on data accesses (L1 data cache load miss or store buffer full).
- Yellow indicates a thread is waiting on a long latency instruction (floating point, integer multiply).
- Black indicates a thread is waiting on the instruction cache.
- Green indicates a thread that is ready to issue.

The thin blue line on the bottom indicates where instructions are issued (with gaps in the line showing where no instruction is available because all threads are blocked).

[Image: alpha-state-trace.png]


As you can see, threads spend a bunch of time blocked, but overall utilization seems to be pretty good.
which makes me think, is it true that halving the latency usually costs about 150% extra/more transistors/logic space? the other way around, if you'd double the latency, you'd reduce the usage to 40%, exchanging that for twice the thread count. the ratio of threads that don't wait for memory should rise, or would that thrash the cache/controller?

I believe a number of vector processors (like the Cray family) do this and call it 'vector chaining'. In my case, threads seem to be pretty effective at achieving the same goal. I think it's just a different philosophical approach.
I admit, it wasn't my idea. I think even the Pentium 4 does the same, with two math units to serve the 4 lanes of SSE.
 
4 cycles actually looks very short; even those small ARM7 (Game Boy) CPUs had 3 cycles. I'd have expected latencies of 20-80 cycles.

Those 4 cycles are probably just for the execution of the FP op, while in the comparison (ARM7TDMI) you're including all the front end stuff for getting the instruction going. I'm sure this design has more pipeline stages for that, but they don't have an impact on how many cycles apart dependent FP operations need to be.

4 cycles is actually pretty normal for FP operation latency if you compare it with various CPUs and even some other GPUs - a lot are 3-5 cycles. I think that includes Larrabee which this GPU design is similar to.

I wouldn't really know for sure, but breaking up FPU operations into 20+ cycles sounds arduous, just from a design perspective... Maybe not if a lot of those stages are just delays...
 
is that 'for free'? (I mean, can it cycle a whole round to the same thread for the next cycle if the 3 others were stalled?)

Yeah. Combinational logic finds the next ready thread each cycle. RTL is here, if you're curious:

https://github.com/jbush001/GPGPU/blob/master/rtl/core/arbiter.v

have you tried other scheduling strategies? I would have thought that running the same thread for as long as possible would be more efficient, as you have a lot of thread switching to hide latency without any need; that could be useful once one thread really needs it, e.g. when reading memory (but I'm in no way an expert, just curious :) ).

That's a good question. I've played with that a bit, but didn't see much difference between switch-every-cycle vs. run-until-block. Since the threads spend a lot of time waiting, scheduling appears to be driven more by the order in which threads are unblocked (which in turn is controlled by the policies for queuing memory requests, etc).

is there anything you haven't built yourself? looks like you really had all the fun :)

Heh... yeah. I'd be happy to take submissions, but unfortunately I haven't found many other people who are interested in hacking on this kind of stuff.

which makes me think, is it true that halving the latency usually costs about 150% extra/more transistors/logic space? the other way around, if you'd double the latency, you'd reduce the usage to 40%, exchanging that for twice the thread count. the ratio of threads that don't wait for memory should rise, or would that thrash the cache/controller?

I haven't heard that about the area/memory-latency tradeoff. I've always thought of memory latency as an unavoidable fact of nature that must be hidden somehow. :)

The cache effects of increasing the number of threads depend on how the software operates, I think. It is certainly necessary for the software to be very cache aware in this type of architecture.

Those 4 cycles are probably just for the execution of the FP op, while in the comparison (ARM7TDMI) you're including all the front end stuff for getting the instruction going. I'm sure this design has more pipeline stages for that, but they don't have an impact on how many cycles apart dependent FP operations need to be.

Yeah, exactly. In total, integer instructions use a 6 stage pipeline and floating point instructions use a 9 stage pipeline (most of the stages are shared between the two, except the arithmetic units).

It doesn't do IEEE 754 round-to-nearest mode (just truncation); the former would require another stage or two. The other parts could probably be broken up into a few more stages to get lower delay in the arithmetic portion, but this doesn't appear to be the critical path in terms of clock speed right now.
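
To illustrate the difference between the two rounding policies at the significand level (this is only a C-level sketch for explanation, not the project's RTL, and it ignores the carry-out/renormalization case):

Code:
	#include <cstdint>

	// Drop 'extraBits' low guard bits from a significand, either by truncating
	// or by rounding to nearest (ties to even).
	uint32_t dropGuardBits(uint64_t significand, int extraBits, bool roundNearest)
	{
		uint32_t truncated = uint32_t(significand >> extraBits);
		if (!roundNearest)
			return truncated;	// truncation: just discard the low bits

		uint64_t remainder = significand & ((1ull << extraBits) - 1);
		uint64_t half = 1ull << (extraBits - 1);
		if (remainder > half || (remainder == half && (truncated & 1)))
			truncated++;	// round up; a tie rounds to even

		return truncated;
	}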
 
File this one under the "Holy Thread Resurrection, Batman" category, but 4 years later, this project is still alive and kicking, with 4245 GitHub commits, lots of them just in the past week, 99.9% by the original author. The thing received a new name and a major architecture rework, the documentation is excellent, and if that's not enough, you can read more on his blog.

A mini GPU that works on an FPGA, custom compiler backend, JTAG debugging, code that is clean and very well commented, and whole bunch of graphics demos.

Unbelievable.
 