GPGPU hardware project

Jeff B

I've been working on a project and I thought people might find it interesting. It is a GPGPU hardware architecture, inspired philosophically by Larrabee (although the ISA is quite a bit different). I have implemented an assembler, C emulator, and synthesizable behavioral Verilog model (including L1 and L2 caches, hardware multi-threading, and vector floating point and integer arithmetic) which currently runs in simulation. Early versions of the pipeline ran on an FPGA, but the design has since exceeded the capacity of my low-end development board (it currently requires around 85k logic elements in the Cyclone IV family, so a single core should fit on something like the DE2-115 eval board).

Code and wiki documentation are here:

https://github.com/jbush001/VectorProc/

The processor uses a unified arithmetic pipeline. Rather than having separate scalar and vector functional units (each with its own set of instructions), there is a single 16-element-wide vector pipeline. Scalar operations simply use the lowest vector lane. One advantage of this design is that instructions can mix vector and scalar operands, with the scalar operand duplicated to all lanes. As in many vector architectures, vector instructions can be predicated with a mask register, with the result written back only to the selected lanes. This allows SPMD-style execution, with mask registers helping to track divergence and reconvergence. The processor uses a load-store architecture. There are a number of flexible vector transfer modes, including block, strided, and scatter/gather.
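To make the predication and scalar-duplication model concrete, here is a rough behavioral sketch in plain C (illustrative only, not the project's actual emulator code; the 16-lane width comes from the description above and all names are made up):

#include <stdint.h>

#define NUM_LANES 16                 /* width of the vector pipeline described above */

typedef struct { int32_t lane[NUM_LANES]; } vec_t;

/* Masked vector add: only lanes whose mask bit is set are written back;
   the other lanes keep their previous contents. */
static void vadd_masked(vec_t *dest, const vec_t *a, const vec_t *b, uint16_t mask)
{
    for (int i = 0; i < NUM_LANES; i++)
        if (mask & (1u << i))
            dest->lane[i] = a->lane[i] + b->lane[i];
}

/* Mixed operands: the scalar is duplicated to every selected lane. */
static void vadd_scalar(vec_t *dest, const vec_t *a, int32_t scalar, uint16_t mask)
{
    for (int i = 0; i < NUM_LANES; i++)
        if (mask & (1u << i))
            dest->lane[i] = a->lane[i] + scalar;
}

/* Gather load: each lane supplies its own address (treated as a host
   pointer here purely for illustration). */
static void vload_gather(vec_t *dest, const vec_t *addr, uint16_t mask)
{
    for (int i = 0; i < NUM_LANES; i++)
        if (mask & (1u << i))
            dest->lane[i] = *(const int32_t *)(intptr_t)addr->lane[i];
}

SPMD-style divergence maps onto the masks: the lanes that take one side of a branch are selected for that path, the remaining lanes run the other path, and the full mask is restored at the reconvergence point.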

Like OpenSPARC, this uses an in-order, single-issue pipeline. Generally speaking, I opt for simplicity at the expense of latency, then hide the latency using hardware multi-threading. The simple tests I've run so far suggest that this is working. The Verilog model is instrumented with a number of performance counters, and the visualizer tool (located in tools/visualizer) makes it possible to see this in action. The trace below shows a simple alpha blend benchmark running (it blends two 64x64 bitmaps). Each thread is represented by a horizontal stripe:
- Red indicates a thread is waiting on data accesses (L1 data cache load miss or store buffer full).
- Yellow indicates a thread is waiting on a long latency instruction (for example, multiplication, which has 4 cycles of latency).
- Black indicates a thread is waiting on the instruction cache.
- Green indicates a thread that is ready to issue.

The thin blue line on the bottom indicates where instructions are issued (with gaps in the line showing where no instruction is available because all threads are blocked).
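The issue logic behind that blue line boils down to something like the sketch below (plain C pseudocode of the idea, not the actual Verilog; the thread count and names are illustrative): each thread carries the blocking conditions listed above, and the issue stage round-robins among the threads with none of them set.

#include <stdbool.h>

#define NUM_THREADS 4                /* illustrative thread count */

/* Blocking conditions, matching the trace colors above. */
typedef struct {
    bool data_wait;                  /* red: L1 data load miss or store buffer full */
    bool long_latency_wait;          /* yellow: waiting on a long-latency result */
    bool icache_wait;                /* black: waiting on the instruction cache */
} thread_state_t;

/* Pick the thread to issue this cycle, or return -1 if every thread is
   blocked (a gap in the blue issue line). */
static int pick_issue_thread(const thread_state_t t[NUM_THREADS], int last_issued)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int next = (last_issued + i) % NUM_THREADS;   /* round-robin */
        if (!t[next].data_wait && !t[next].long_latency_wait && !t[next].icache_wait)
            return next;                              /* green: ready to issue */
    }
    return -1;
}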

[Image: alpha-state-trace.png]


As you can see, there is quite a bit of memory latency (indicated by the long red stripes), but the processor still manages to achieve good utilization by keeping at least one hardware thread ready to issue in most cases.
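For context, the per-pixel work in that benchmark is essentially the classic blend below (a plain C sketch assuming 32-bit pixels with 8-bit channels and a constant alpha; the actual benchmark code may differ). The equivalent vectorized loop handles 16 pixels per iteration, so every iteration issues two vector loads and one vector store, which is why the threads spend so much time waiting on memory.

#include <stdint.h>

/* Blend src over dest with a constant alpha (0-255). Illustrative only;
   the real benchmark's pixel format and blend equation may differ. */
static void alpha_blend(uint32_t *dest, const uint32_t *src, int count, uint32_t alpha)
{
    for (int i = 0; i < count; i++) {
        uint32_t d = dest[i], s = src[i], out = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            uint32_t dc = (d >> shift) & 0xff;
            uint32_t sc = (s >> shift) & 0xff;
            out |= ((sc * alpha + dc * (255u - alpha)) / 255u) << shift;
        }
        dest[i] = out;
    }
}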

I also hacked together a simple program that renders a 3D object. It's simplistic and doesn't take advantage of hardware multi-threading, but it does use a hierarchical parallel rasterizer, as Michael Abrash described in Dr. Dobb's Journal. This image was produced by the core running in Verilog simulation:

[Image: vsim.png]
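For anyone unfamiliar with the technique, the basic idea of the hierarchical rasterizer mentioned above is sketched here (plain C written from Abrash's description, not the actual code in the repo; all names are illustrative): a triangle is defined by three signed edge functions, and each square block of the render target is trivially rejected, trivially filled, or subdivided into four sub-blocks.

typedef struct { int a, b, c; } edge_t;              /* edge function a*x + b*y + c; >= 0 means inside */

static int edge_value(const edge_t *e, int x, int y)
{
    return e->a * x + e->b * y + e->c;
}

/* Classify a size x size block against one edge by evaluating its
   smallest- and largest-valued corners: +1 = fully inside,
   -1 = fully outside, 0 = straddling the edge. */
static int classify_block(const edge_t *e, int x, int y, int size)
{
    int max_x = x + (e->a >= 0 ? size - 1 : 0), max_y = y + (e->b >= 0 ? size - 1 : 0);
    int min_x = x + (e->a >= 0 ? 0 : size - 1), min_y = y + (e->b >= 0 ? 0 : size - 1);

    if (edge_value(e, min_x, min_y) >= 0) return 1;  /* worst corner inside */
    if (edge_value(e, max_x, max_y) < 0) return -1;  /* best corner outside */
    return 0;
}

static void fill_block(int x, int y, int size) { (void)x; (void)y; (void)size; /* write every pixel; omitted */ }
static void rasterize_pixels(const edge_t e[3], int x, int y, int size) { (void)e; (void)x; (void)y; (void)size; /* per-pixel edge tests; omitted */ }

#define MIN_BLOCK 4

static void rasterize_block(const edge_t e[3], int x, int y, int size)
{
    int all_inside = 1;
    for (int i = 0; i < 3; i++) {
        int c = classify_block(&e[i], x, y, size);
        if (c < 0) return;                            /* trivially rejected */
        if (c == 0) all_inside = 0;
    }
    if (all_inside) { fill_block(x, y, size); return; }      /* trivially filled */
    if (size <= MIN_BLOCK) { rasterize_pixels(e, x, y, size); return; }

    int half = size / 2;                              /* otherwise subdivide */
    rasterize_block(e, x,        y,        half);
    rasterize_block(e, x + half, y,        half);
    rasterize_block(e, x,        y + half, half);
    rasterize_block(e, x + half, y + half, half);
}

The partially covered blocks are where the vector hardware can help: the per-pixel edge tests for a block can be evaluated across the 16 lanes, with the resulting coverage used as the write mask.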


This project is primarily a learning exercise for me, so I'd love to hear comments and suggestions. The development environment should be fairly straightforward to set up, so patches or pull requests are also welcome if anyone is interested in hacking on it.
 
I had a very quick look at your wiki, Verilog code, and other git repos. I think what you did here is absolutely awesome.
 
This is too impressive not to deserve a respectable job offer. There aren't many people capable of doing or understanding what you did, hence the paucity of posts in this thread. Congratulations!
 
The most impressive part is that you appear to have good documentation. Most engineers don't document well. :smile:

Was this a solo or team project? Is it part of a thesis?
 
Thanks :) This has been a solo hobby project so far. My background is in software, so it has been fun learning more about hardware.

There have been some interesting discussions here and elsewhere about the benefits and drawbacks of more programmable architectures like Larrabee vs. ones that rely more on fixed-function units, and the architectural tradeoffs between them (the number of available registers and threads, for example). Personally, I was really excited about the idea of Larrabee when I first heard about it, but I'm probably a bit biased because I'm a software guy. The thing I like about this project is that it makes it possible to explore some of those ideas in a working system, albeit at a much smaller scale.
 
That's quite interesting; I've always wondered what a fusion of CPU & GPU architectures could look like.

I wonder if I could use it as a basis to test a few ideas, both hardware & software.
 
I'd certainly be interested to see what you come up with if you do. Let me know if you have any questions.
 
> Personally, I was really excited about the idea of Larrabee when I first heard about it, but I'm probably a bit biased because I'm a software guy.
I work at AMD, and I know a lot of the hardware designers here were excited (and some probably scared) about the potential of a Larrabee-type architecture. It generated a lot of discussion, and I'm sure it was the same at Nvidia. Competition is great at getting ideas flowing and generating passionate discussion.
 