I've been working on a project and I thought people might find it interesting. It is a GPGPU hardware architecture, inspired philosophically by Larrabee (although the ISA is quite a bit different). I have implemented an assembler, a C emulator, and a synthesizable behavioral Verilog model (including L1 and L2 caches, hardware multi-threading, and vector floating point and integer arithmetic), which currently runs in simulation. Early versions of the pipeline ran on an FPGA, but the design has since exceeded the capacity of my low-end development board (it currently requires around 85k logic elements on the Cyclone IV family, so a single core should fit on something like the DE2-115 eval board).
Code and wiki documentation are here:
https://github.com/jbush001/VectorProc/
The processor uses a unified arithmetic pipeline. Rather than having separate scalar and vector functional units (each with their own set of instructions), there is a single 16 element wide vector pipeline. Scalar operations simply use the lowest vector lane. One advantage of this design is that instructions can mix vector and scalar operands, with the latter being duplicated to all lanes. As in many vector architectures, vector instructions can be predicated with a mask register, with the result only being written back to selected lanes. This allows SPMD style execution, with mask registers helping to track divergence/reconvergence. The processor uses a load-store style architecture. There are a number of flexible vector transfer modes, including block, strided, and scatter/gather.
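To illustrate the masked-writeback and scalar-broadcast semantics, here is a small Python sketch. It models the behavior described above, not the actual ISA; the function names, the mask encoding, and everything else here are my own illustrative choices.

```python
NUM_LANES = 16

def broadcast(scalar):
    """Duplicate a scalar operand to all vector lanes."""
    return [scalar] * NUM_LANES

def masked_add(dest, a, b, mask):
    """Lane-wise add; the result is written back only to lanes whose
    mask bit is set, while unselected lanes keep their old values."""
    return [a[i] + b[i] if (mask >> i) & 1 else dest[i]
            for i in range(NUM_LANES)]

v = list(range(NUM_LANES))   # vector operand: 0..15
s = broadcast(100)           # scalar operand duplicated to all lanes
mask = 0x00FF                # predicate selects only the low 8 lanes
result = masked_add(v, v, s, mask)
# Low 8 lanes are updated; high 8 lanes are unchanged.
```

In SPMD terms, each lane is one "program instance": on a divergent branch, the mask for the taken path is ANDed into the current mask, and reconvergence restores the previous mask.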
Like OpenSPARC, this uses an in-order, single-issue pipeline. Generally speaking, I opt for simplicity at the expense of latency, then hide the latency using hardware multithreading. The simple tests I've run so far suggest that this approach is working. The Verilog model is instrumented with a number of performance counters, and the visualizer tool (located in tools/visualizer) makes this behavior visible. The trace below shows a simple alpha blend benchmark running (which blends two 64x64 bitmaps). Each thread is represented by a horizontal stripe:
- Red indicates a thread is waiting on data accesses (L1 data cache load miss or store buffer full).
- Yellow indicates a thread is waiting on a long latency instruction (for example, multiplication, which has 4 cycles of latency).
- Black indicates a thread is waiting on the instruction cache.
- Green indicates a thread that is ready to issue.
The thin blue line on the bottom indicates where instructions are issued (with gaps in the line showing where no instruction is available because all threads are blocked).
As you can see, there is quite a bit of memory latency (indicated by the long red stripes), but the processor still manages to achieve pretty good utilization by keeping at least one hardware thread ready in most cases.
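For reference, the per-pixel operation the benchmark performs is straightforward. Here is a minimal scalar sketch in Python; the function name and the exact rounding are my assumptions, not taken from the benchmark code. The vector pipeline would apply this to 16 pixels per instruction.

```python
def alpha_blend(src, dst, alpha):
    """Blend one 8-bit color channel:
    out = (src * alpha + dst * (255 - alpha)) / 255
    (integer arithmetic; exact rounding is an assumption)."""
    return (src * alpha + dst * (255 - alpha)) // 255

full = alpha_blend(200, 50, 255)   # fully opaque source -> 200
none = alpha_blend(200, 50, 0)     # fully transparent source -> 50
half = alpha_blend(200, 50, 128)   # roughly a 50/50 mix
```

Since this is just multiplies, adds, and a divide (or shift approximation) per channel, the multiplier latency mentioned above is exactly what the red/yellow stripes in the trace are hiding.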
I also hacked together a simple program that renders a 3D object. It's simplistic and doesn't take advantage of hardware multi-threading, but it does use a hierarchical parallel rasterizer, as Michael Abrash described in Dr. Dobb's Journal. This image was produced by the core running in Verilog simulation:
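The hierarchical approach Abrash described evaluates the triangle's edge functions at the corners of a pixel block: a block entirely outside any one edge is trivially rejected, a block inside all edges is trivially filled, and only blocks straddling an edge are subdivided. Here is a toy Python sketch of that idea; it is my own simplification (counterclockwise winding, power-of-two square blocks), not the project's actual rasterizer.

```python
def edge(ax, ay, bx, by, px, py):
    """Edge function: >= 0 when (px, py) is on or to the left of the
    directed edge a->b (assumes counterclockwise triangle winding)."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def rasterize(tri, x, y, size, out):
    """Recursively test a size x size pixel block against the triangle."""
    (ax, ay), (bx, by), (cx, cy) = tri
    corners = [(x, y), (x + size - 1, y),
               (x, y + size - 1), (x + size - 1, y + size - 1)]
    edges = [(ax, ay, bx, by), (bx, by, cx, cy), (cx, cy, ax, ay)]
    inside_all = True
    for (x0, y0, x1, y1) in edges:
        vals = [edge(x0, y0, x1, y1, px, py) for px, py in corners]
        if all(v < 0 for v in vals):
            return                    # trivial reject: fully outside one edge
        if any(v < 0 for v in vals):
            inside_all = False        # block straddles this edge
    if inside_all or size == 1:
        out.update((px, py) for px in range(x, x + size)
                            for py in range(y, y + size))
        return                        # trivial accept: fill the whole block
    half = size // 2                  # straddling block: subdivide into 4
    for sx in (x, x + half):
        for sy in (y, y + half):
            rasterize(tri, sx, sy, half, out)

covered = set()
rasterize(((0, 0), (8, 0), (0, 8)), 0, 0, 8, covered)
```

The scheme maps naturally onto the vector hardware: a 4x4 block is 16 pixels, so all three edge functions for a block can be evaluated with a handful of vector instructions, and the coverage result becomes a predicate mask for the fill.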
This project is primarily a learning exercise for me, so I'd love to hear comments and suggestions. The development environment should be fairly straightforward to set up, so patches or pull requests are also welcome if anyone is interested in hacking on it.