Nyami, open source GPU

The creator did a thread here some years ago. Does it have TMUs yet?

No, this is what the paper says:

A texture sampler function can be called by the pixel shader. This is implemented completely in software and performs bilinear filtering. Like other parts of the renderer, it is vectorized and computes values for up to 16 texels at a time using vector gather load instructions.

It doesn't go into detail about what the renderer's texture mapping actually does, but I doubt it implements anything close to the functions of a modern TMU: complex address generation, mip-mapping, trilinear and anisotropic filtering, and texture compression.
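For a rough picture of what the quoted gather-based sampling means in practice, here's a sketch (my own reconstruction from the paper's description, not actual Nyuzi code):

```cpp
#include <cstdint>

// Hypothetical sketch: fetch 16 texels given 16 (u, v) integer texel
// coordinates, the way a vectorized software sampler might. On a SIMD
// machine like this one, the loop body maps to address arithmetic on
// vector registers plus a single vector gather load instruction.
void gatherTexels(const uint32_t *texture, int pitch,
                  const int u[16], const int v[16], uint32_t out[16])
{
    for (int lane = 0; lane < 16; lane++)
    {
        // Per-lane address generation: row-major texel offset.
        int offset = v[lane] * pitch + u[lane];

        // The gather itself: 16 independent loads in one instruction.
        out[lane] = texture[offset];
    }
}
```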

They did, however, implement a hardware rasterization unit, and found that for their test cases it provided a significant speedup (28% on average) over the software rasterizer. But the test cases sound very primitive and/or archaic, and I doubt they represent rasterization-to-shading work ratios comparable to those of modern workloads.
 
It doesn't go into detail about what the renderer's texture mapping actually does, but I doubt it implements anything close to the functions of a modern TMU: complex address generation, mip-mapping, trilinear and anisotropic filtering, and texture compression.
FWIW, TMUs in modern graphics cards actually don't do all of the address generation anymore. For the really complex cases, like cube maps, there are special ALU instructions to help with that, and the results then get passed to the TMU. Albeit it's true the rest is all handled by hw: mip-mapping and trilinear aren't too complex to emulate with ordinary shader arithmetic, a decent anisotropic implementation sounds a bit problematic if you want to get it performant, and texture compression is a pain in the butt - you'd probably want some hw for that, though it wouldn't really need a TMU as such; it could just decode into some cache.
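To make the "trilinear is easy to emulate" point concrete, here's a minimal sketch, assuming a caller-supplied bilinear helper (hypothetical signature; the Texture.cpp link later in the thread shows a real one):

```cpp
#include <cmath>

struct Color { float r, g, b, a; };

// Caller-supplied bilinear sampler for a single mip level.
using BilinearFn = Color (*)(int mipLevel, float u, float v);

// Trilinear filtering emulated with ordinary shader-style arithmetic:
// two bilinear lookups in adjacent mip levels, then one lerp between them.
Color trilinearSample(BilinearFn bilinear, float u, float v, float lod)
{
    int level0 = (int) std::floor(lod);
    float frac = lod - (float) level0;

    Color c0 = bilinear(level0, u, v);
    Color c1 = bilinear(level0 + 1, u, v);

    return Color {
        c0.r + (c1.r - c0.r) * frac,
        c0.g + (c1.g - c0.g) * frac,
        c0.b + (c1.b - c0.b) * frac,
        c0.a + (c1.a - c0.a) * frac,
    };
}
```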
 
This GPU is more like a wide-SIMD little RISC core. There are 32 scalar registers and 32 vector registers; there's no huge static RF.
Thus, with a stack-based calling convention, any non-trivial (hello, Mandelbrot) graphics/compute task will start happily spilling (one cache line wide) registers onto the tiny stack. As such, cache pressure on a 4-thread core working on 16-wide SP vectors (512 bits) is probably a key consideration.
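To put rough numbers on that (my own back-of-the-envelope, using the 16k L1D figure mentioned later in the thread): each vector register is 512 bits = 64 bytes, so 32 vector registers are 2 kB of architectural state per thread, or 8 kB across 4 threads. A few spilled vector registers per thread can therefore displace a meaningful fraction of a 16 kB L1D.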

The paper shows 40% (or 71% when looking from above) worse cycle times when going from 256 lines to 64 lines (I'll pick the associativity that better matches my argument). Those are 64-byte lines, so we're talking 4 kB to 16 kB caches. I assume those are the per-core L1s; there was an L2/MEMCTRL on a ring joining the CUs, err, cores, at least in previous versions of the project.

IIRC texturing is quite cache friendly, with a hello-goodbye pattern as a triangle gets textured (even more so with fancy bilinear, or better, filtering). I'd venture TMUs with separate texture caches would have a marked effect on performance for graphics tasks more complicated than a Phong torus, thanks to cleaner L1Ds to spill into, and to HW parallelism.

Edit: there was another open source HDL FPGA project that had simple TMUs (I don't remember if they did more than bilinear RGB). It was a video synthesizer for video DJs that was even sold as a product, on a standard FPGA demo board in a little enclosure.
 
This GPU is more like a wide-SIMD little RISC core. There are 32 scalar registers and 32 vector registers; there's no huge static RF.
Thus, with a stack-based calling convention, any non-trivial (hello, Mandelbrot) graphics/compute task will start happily spilling (one cache line wide) registers onto the tiny stack. As such, cache pressure on a 4-thread core working on 16-wide SP vectors (512 bits) is probably a key consideration.

Yes, it does need to spill during function calls. I don't think the overhead of that dominates compared to framebuffer, texture, and data structure access, but I haven't measured that explicitly. That might be an interesting experiment. I've posted some rendering profile information here:

http://latchup.blogspot.com/2015/02/improved-3d-engine-profile.html

The paper shows 40% (or 71% when looking from above) worse cycle times when going from 256 lines to 64 lines (I'll pick the associativity that better matches my argument). Those are 64-byte lines, so we're talking 4 kB to 16 kB caches. I assume those are the per-core L1s; there was an L2/MEMCTRL on a ring joining the CUs, err, cores, at least in previous versions of the project.

The default configuration is 16k per core L1D and 128k shared L2 cache. Currently, it's just a switched interconnect between the cores, not a ring. The L1 is write-through.

IIRC texturing is quite cache friendly, with a hello-goodbye pattern as a triangle gets textured (even more so with fancy bilinear, or better, filtering). I'd venture TMUs with separate texture caches would have a marked effect on performance for graphics tasks more complicated than a Phong torus, thanks to cleaner L1Ds to spill into, and to HW parallelism.

Yes, I think so too. It's on my list of things to experiment with, but I haven't gotten around to it yet. ETC2 texture decompression seems surprisingly simple in hardware.
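For the curious, here's roughly why it's simple: in ETC1 (and ETC2's backwards-compatible modes), the per-pixel datapath boils down to a base color plus a table-driven intensity modifier and a clamp. A simplified sketch of my own, ignoring the differential/T/H/planar modes:

```cpp
#include <algorithm>
#include <cstdint>

// Intensity modifier table from the ETC1 spec; each row is selected by a
// 3-bit codeword per 2x4 subblock.
static const int kModifiers[8][4] = {
    { 2, 8, -2, -8 },       { 5, 17, -5, -17 },
    { 9, 29, -9, -29 },     { 13, 42, -13, -42 },
    { 18, 60, -18, -60 },   { 24, 80, -24, -80 },
    { 33, 106, -33, -106 }, { 47, 183, -47, -183 },
};

static uint8_t clamp255(int v)
{
    return (uint8_t) std::min(255, std::max(0, v));
}

// Decode one pixel: base color (already expanded from RGB444/555) plus a
// signed modifier chosen by the per-pixel 2-bit index. One add and one
// clamp per channel is the whole datapath, which is why it's hw-friendly.
void decodeEtcPixel(const uint8_t base[3], int tableCodeword, int pixelIndex,
                    uint8_t out[3])
{
    int modifier = kModifiers[tableCodeword][pixelIndex];
    for (int c = 0; c < 3; c++)
        out[c] = clamp255(base[c] + modifier);
}
```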

My software texture sampler is here:

https://github.com/jbush001/NyuziProcessor/blob/master/software/libs/librender/Texture.cpp#L95

It supports mip mapping and bilinear filtering.
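For reference, a minimal scalar bilinear filter looks something like this (a generic sketch, not the code from Texture.cpp, which operates on 16-wide vectors of packed RGBA):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct Image {
    const uint32_t *pixels; // packed RGBA8
    int width, height;
};

// One channel of bilinear filtering: fetch the 2x2 texel neighborhood and
// blend with the fractional parts of the texel coordinates.
float bilinearChannel(const Image &img, float u, float v, int channelShift)
{
    float x = u * img.width - 0.5f;
    float y = v * img.height - 0.5f;
    int x0 = (int) std::floor(x), y0 = (int) std::floor(y);
    float fx = x - x0, fy = y - y0;

    auto texel = [&](int tx, int ty) {
        tx = std::min(std::max(tx, 0), img.width - 1);  // clamp addressing
        ty = std::min(std::max(ty, 0), img.height - 1);
        return (float) ((img.pixels[ty * img.width + tx] >> channelShift) & 0xff);
    };

    float top = texel(x0, y0) * (1 - fx) + texel(x0 + 1, y0) * fx;
    float bot = texel(x0, y0 + 1) * (1 - fx) + texel(x0 + 1, y0 + 1) * fx;
    return top * (1 - fy) + bot * fy;
}
```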
 
FWIW, TMUs in modern graphics cards actually don't do all of the address generation anymore. For the really complex cases, like cube maps, there are special ALU instructions to help with that, and the results then get passed to the TMU. Albeit it's true the rest is all handled by hw: mip-mapping and trilinear aren't too complex to emulate with ordinary shader arithmetic, a decent anisotropic implementation sounds a bit problematic if you want to get it performant, and texture compression is a pain in the butt - you'd probably want some hw for that, though it wouldn't really need a TMU as such; it could just decode into some cache.

That makes sense. Do you have any examples showing what the ALU-to-TMU path is like in shaders generated for modern hardware? And yeah, I included some fixed-function parts that aren't necessarily in the TMUs themselves. Caches could also handle some of the addressing, like performing blocking.

When Intel talked about adding TMUs to Larrabee, one of the biggest reasons they cited was that they didn't need anything close to FP32 precision for blending texels in most ordinary workloads, and could get much higher throughput without it. Of course, they could also have added other packed data type sizes and operations for this, but fixed function makes it easier to handle less regular/constant data widths.
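To illustrate the precision point (my own example, not Larrabee's actual datapath): blending 8-bit texels only needs a few bits of weight, so a narrow fixed-point lerp does the job that would otherwise burn an FP32 multiply-add per channel:

```cpp
#include <cstdint>

// Blend two 8-bit texel channels with a fixed-point weight (0..256
// representing 0.0..1.0). A 16-bit intermediate is plenty; nothing close
// to FP32 precision is needed.
static inline uint8_t lerp8(uint8_t a, uint8_t b, uint32_t weight)
{
    return (uint8_t) ((a * (256 - weight) + b * weight) >> 8);
}

// Bilinear blend of a 2x2 neighborhood from the fractional coordinate
// bits: three narrow lerps per channel instead of FP32 math.
static inline uint8_t bilerp8(uint8_t t00, uint8_t t10, uint8_t t01,
                              uint8_t t11, uint32_t fx, uint32_t fy)
{
    return lerp8(lerp8(t00, t10, fx), lerp8(t01, t11, fx), fy);
}
```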

My software texture sampler is here:

https://github.com/jbush001/NyuziProcessor/blob/master/software/libs/librender/Texture.cpp#L95

It supports mip mapping and bilinear filtering.

Thanks, that's informative.

So am I right that the mip level is selected only from the u derivative of the top-left two pixels in a 4x4 block? Is this something you got from another renderer or from rendering literature?
 
The default configuration is 16k per core L1D and 128k shared L2 cache. Currently, it's just a switched interconnect between the cores, not a ring. The L1 is write-through.

You were working on a ring for v2, weren't you? Did simulations suggest a switched architecture was better?
 
That makes sense. Do you have any examples showing what the ALU-to-TMU path is like in shaders generated for modern hardware?
Well, this is what the open-source radeon driver does, albeit that's only for cube maps; I think everything else is really handled by the TMUs natively (but cube map address calcs are really complex):
http://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
Look at radeon_llvm_emit_prepare_cube_coords. There's an "AMDGPU.cube" LLVM intrinsic there (which really will issue the hw V_CUBETC_F32, V_CUBESC_F32, V_CUBEMA_F32, and V_CUBEID_F32 instructions; not unlike a software rasterizer, it calculates new s/t coords and determines the major axis and face), plus a bunch of ordinary shader math (and lots more of that with explicit derivatives, as they need to be transformed into cube coord space). This is then passed to the TMU just as "simple" non-cube texture lookups would be elsewhere.
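For anyone curious, the face/coordinate selection being described works roughly like this (a generic sketch of the standard cube map math, not the radeon code):

```cpp
#include <cmath>

// Standard cube map lookup math: pick the face from the major axis of the
// direction vector, then project the other two components into face (s, t).
// Face numbering and s/t signs follow the OpenGL convention.
void cubeCoords(float x, float y, float z, int *face, float *s, float *t)
{
    float ax = std::fabs(x), ay = std::fabs(y), az = std::fabs(z);
    float ma, sc, tc;

    if (ax >= ay && ax >= az) {        // major axis X: faces 0 (+X), 1 (-X)
        *face = x >= 0 ? 0 : 1;
        ma = ax; sc = x >= 0 ? -z : z; tc = -y;
    } else if (ay >= az) {             // major axis Y: faces 2 (+Y), 3 (-Y)
        *face = y >= 0 ? 2 : 3;
        ma = ay; sc = x; tc = y >= 0 ? z : -z;
    } else {                           // major axis Z: faces 4 (+Z), 5 (-Z)
        *face = z >= 0 ? 4 : 5;
        ma = az; sc = z >= 0 ? x : -x; tc = -y;
    }

    // Map from [-ma, ma] to [0, 1] texture coordinates.
    *s = (sc / ma + 1.0f) * 0.5f;
    *t = (tc / ma + 1.0f) * 0.5f;
}
```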

So am I right that the mip level is selected only from the u derivative of the top-left two pixels in a 4x4 block? Is this something you got from another renderer or from rendering literature?
Looks like quite the cheat to me :).
 
You were working on a ring for v2, weren't you? Did simulations suggest a switched architecture was better?

No, I think a ring bus with a write-back/write-invalidate protocol would still be better. I got it basically working, but started running into edge cases, some of which I documented here:

http://latchup.blogspot.com/2014/07/messy-details.html

For the simpler write-through/write-update protocol, a ring didn't seem to make as much sense, because the responses are broadcast to all cores.

The traffic seems to be much lower with a write-invalidate protocol. This analysis doesn't capture the improvement in latency for writes, which later scalability analysis suggests is the bottleneck (vs. interconnect bandwidth):

http://latchup.blogspot.com/2014/07/write-back-vs-write-through-bake-off.html
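A toy model of why the traffic differs so much (my own illustration, not from the linked posts): with write-through every store generates an interconnect message, while write-back/write-invalidate only generates traffic on the first write to a clean line and on eviction.

```cpp
#include <cstdio>
#include <set>

// Count interconnect messages for a stream of sequential stores under the
// two policies. Grossly simplified: one core, no sharers, no reads.
int main()
{
    const int kStores = 1000;
    const int kLineSize = 64;
    int writeThroughMsgs = 0, writeBackMsgs = 0;
    std::set<long> dirtyLines;

    for (int i = 0; i < kStores; i++) {
        long addr = i * 4L;                 // sequential 4-byte stores
        long line = addr / kLineSize;

        writeThroughMsgs++;                 // write-through: every store goes out

        if (dirtyLines.insert(line).second)
            writeBackMsgs++;                // write-back: one ownership/invalidate
                                            // message on first write to a line
    }

    writeBackMsgs += (int) dirtyLines.size(); // plus one writeback per dirty
                                              // line at eviction time

    printf("write-through: %d msgs, write-back: %d msgs\n",
           writeThroughMsgs, writeBackMsgs);
    return 0;
}
```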

I'll probably redesign this again for the next iteration.
 
So am I right that the mip level is selected only from the u derivative of the top-left two pixels in a 4x4 block? Is this something you got from another renderer or from rendering literature?

Yeah, it's a hack. Most algorithms I'm aware of use a diagonal in a 2x2 quad, but it was simpler and faster to just look at dU.
 
Yeah, it's a hack. Most algorithms I'm aware of use a diagonal in a 2x2 quad, but it was simpler and faster to just look at dU.
Generally, for compliant rendering, the APIs allow you to use the same LOD value for a whole 2x2 quad (you can use the dx values from either the top or bottom row, and the dy values from either the left or right column), but not for anything larger. This is also what hw does. You always need both dx and dy, however; without that, quality is going to be unacceptably low unless the geometry with the texture happens to be nearly co-planar to the screen (I think that's probably a bigger quality issue than applying the same LOD to 4x4 pixels, even). Note this does _not_ include things like sampling with explicit LOD or explicit derivatives, which require you to do things per-pixel. Albeit I know at least some intel hw has a tunable parameter so it can recognize if the calculated LODs are "similar enough" and force the same mip level if they are, as that's apparently cheaper for the lookups even with dedicated hw.
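For reference, the standard per-quad LOD computation that uses both derivatives looks something like this (a textbook sketch, not code from either project):

```cpp
#include <cmath>

// Textbook isotropic LOD selection for a pixel quad: estimate the texel
// footprint from both the x and y screen-space derivatives of the texture
// coordinates (u and v here are pre-scaled by the texture dimensions).
float computeLod(float dudx, float dvdx, float dudy, float dvdy)
{
    float lenX = std::sqrt(dudx * dudx + dvdx * dvdx); // footprint along screen x
    float lenY = std::sqrt(dudy * dudy + dvdy * dvdy); // footprint along screen y

    // Dropping either derivative (e.g. looking only at dU) underestimates
    // the footprint whenever the surface is angled away from the screen.
    return std::log2(std::fmax(lenX, lenY));
}
```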
 