Nyami, open source GPU

Discussion in 'Architecture and Products' started by Kaotik, Jan 21, 2016.

  1. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,166
    Likes Received:
    1,836
    Location:
    Finland
  2. fuboi

    Newcomer

    Joined:
    Aug 6, 2011
    Messages:
    90
    Likes Received:
    45
  3. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    No, this is what the paper says:

    It doesn't go into detail about what the renderer's texture mapping actually does, but I doubt it implements anything close to the functionality of a modern TMU: complex address generation, mip-mapping, trilinear and anisotropic filtering, and texture compression.

    They did, however, implement a hardware rasterization unit, and found that for their test cases it gave a significant speedup (28% on average) over the software rasterizer. But the test cases sound very primitive and/or archaic, and I doubt they represent rasterization-to-shading work ratios comparable to those of modern workloads.
     
  4. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,012
    Likes Received:
    112
    FWIW, the TMUs of modern graphics cards actually don't do all of the address generation anymore. For the really complex cases, such as cube maps, they've got special ALU instructions to help with that, and the results then get passed to the TMU. It's true that the rest is all handled by hardware, though: mip-mapping; trilinear filtering, which isn't too complex to emulate with ordinary shader arithmetic; anisotropic filtering, where a decent implementation sounds a bit problematic if you want it to be performant; and texture compression, which is a pain in the butt. You'd probably want some hardware for that last one, though it wouldn't really need a TMU as such; it could just decode into some cache.
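
    To make the trilinear point concrete: given bilinear results from the two nearest mip levels, the "ordinary shader arithmetic" is a single extra lerp weighted by the fractional lod. A minimal sketch (generic math, not any particular GPU's implementation):

```cpp
#include <cmath>

// Trilinear filtering as shader arithmetic: blend the bilinear samples
// from the two nearest mip levels by the fractional part of the lod.
float trilinear(float lowerMip, float upperMip, float lod) {
    float f = lod - std::floor(lod);  // fractional lod picks the blend weight
    return lowerMip * (1.0f - f) + upperMip * f;
}
```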
     
  5. fuboi

    Newcomer

    Joined:
    Aug 6, 2011
    Messages:
    90
    Likes Received:
    45
    This GPU is more like a wide-SIMD little RISC core. There are 32 scalar registers and 32 vector registers; there's no huge static RF.
    Thus, with a conventional stack-based calling convention, any non-trivial (hello, Mandelbrot) graphics/compute task will start happily spilling (one cache line wide) registers onto the tiny stack. As such, cache pressure on a 4-thread core working on 16-wide SP vectors (512 bits) is probably a key consideration.

    The paper shows 40% (or 71%, depending on which baseline you divide by) worse cycle counts when going from 256 cache lines down to 64 (I'll pick the associativity that better matches my argument). Those are 64-byte lines, so we're talking 4 kB to 16 kB caches. I assume those are the per-core L1s; there was an L2/MEMCTRL on a ring joining the CUs, err, cores, at least in previous versions of the project.

    IIRC texturing is quite cache-friendly, with a hello-goodbye access pattern as a triangle gets textured (even more so with fancy bilinear, or better, filtering). I'd venture TMUs with separate texture caches would have a marked effect on performance for graphics tasks more complicated than a Phong torus, thanks to cleaner L1Ds to spill into and to hardware parallelism.

    Edit: there was another open-source HDL FPGA project that had simple TMUs (I don't remember if they did more than bilinear RGB). It was a video synthesizer for video DJs that was even sold as a product: a standard FPGA demo board in a little enclosure.
     
  6. Jeff B

    Newcomer

    Joined:
    Jul 7, 2012
    Messages:
    11
    Likes Received:
    2
    Yes, it does need to spill during function calls. I don't think the overhead of that dominates compared to framebuffer, texture, and data structure access, but I haven't measured that explicitly. That might be an interesting experiment. I've posted some rendering profile information here:

    http://latchup.blogspot.com/2015/02/improved-3d-engine-profile.html

    The default configuration is 16k per core L1D and 128k shared L2 cache. Currently, it's just a switched interconnect between the cores, not a ring. The L1 is write-through.

    Yes, I think so too. It's on my list of things to experiment with, but I haven't gotten around to it yet. ETC2 texture decompression seems surprisingly simple in hardware.
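
    To give a feel for why ETC-family decode is cheap in hardware: per texel it boils down to a table lookup, an add, and a clamp. A sketch of ETC1-style reconstruction (the modifier values follow the published ETC1 tables; this is an illustration of the decode structure, not a complete ETC2 decoder):

```cpp
#include <algorithm>
#include <cstdint>

// ETC1 intensity modifier tables: each subblock picks one row via a 3-bit
// codeword, and each pixel picks one entry via a 2-bit index.
static const int kModifier[8][4] = {
    {2, 8, -2, -8},     {5, 17, -5, -17},   {9, 29, -9, -29},    {13, 42, -13, -42},
    {18, 60, -18, -60}, {24, 80, -24, -80}, {33, 106, -33, -106},{47, 183, -47, -183},
};

// Per-channel reconstruction: base color + table modifier, clamped to 8 bits.
uint8_t decodeTexel(uint8_t baseChannel, int table, int pixelIndex) {
    int v = baseChannel + kModifier[table][pixelIndex];
    return (uint8_t)std::min(std::max(v, 0), 255);
}
```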

    My software texture sampler is here:

    https://github.com/jbush001/NyuziProcessor/blob/master/software/libs/librender/Texture.cpp#L95

    It supports mip mapping and bilinear filtering.
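
    For readers who don't want to chase the link, the bilinear part boils down to this kind of math. This is a generic textbook sketch for a single-channel texture; names like Level and fetch are illustrative, not the librender API:

```cpp
#include <algorithm>
#include <cmath>

// One mip level of a single-channel texture, row-major.
struct Level {
    int width, height;
    const float* texels;  // width * height entries
};

// Clamp-to-edge texel fetch.
static float fetch(const Level& lv, int x, int y) {
    x = std::min(std::max(x, 0), lv.width - 1);
    y = std::min(std::max(y, 0), lv.height - 1);
    return lv.texels[y * lv.width + x];
}

// Bilinear sample at normalized (u, v): weight the four neighboring texels
// by the fractional position between their centers.
float sampleBilinear(const Level& lv, float u, float v) {
    float x = u * lv.width - 0.5f;   // shift so texel centers land on integers
    float y = v * lv.height - 0.5f;
    int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
    float fx = x - x0, fy = y - y0;
    float top = fetch(lv, x0, y0) * (1 - fx) + fetch(lv, x0 + 1, y0) * fx;
    float bot = fetch(lv, x0, y0 + 1) * (1 - fx) + fetch(lv, x0 + 1, y0 + 1) * fx;
    return top * (1 - fy) + bot * fy;
}
```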
     
  7. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    That makes sense. Do you have any examples showing what the ALU-to-TMU path looks like in shaders generated for modern hardware? And yeah, I included some fixed-function parts that aren't necessarily in the TMUs themselves. Caches could also handle some of the addressing, like performing blocking.

    When Intel talked about adding TMUs to Larrabee, one of the biggest reasons they cited was that they didn't need anything close to FP32 precision for blending texels in most ordinary workloads, and could get much higher throughput without it. Of course, they could also have added some other packed datatype sizes and operations for this, but fixed function makes it easier to handle less regular/constant data widths.
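
    That precision argument is easy to illustrate: an 8-bit channel lerp in fixed point needs only a small integer multiplier, nothing like an FP32 datapath. A hedged sketch of the kind of operation a fixed-function blender performs:

```cpp
#include <cstdint>

// Fixed-point lerp between two 8-bit texel channels.
// frac256 is the blend weight scaled to [0, 256]; 256 selects b exactly.
// Only small integer multiplies are needed, which is the point Intel made
// about not needing FP32 for texel blending.
static inline uint8_t lerp8(uint8_t a, uint8_t b, uint16_t frac256) {
    return (uint8_t)((a * (256 - frac256) + b * frac256) >> 8);
}
```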

    Thanks, that's informative.

    So am I right that the mip level is selected only from the u-derivative of the top-left two pixels in a 4x4 block? Is this something you got from another renderer or from the rendering literature?
     
  8. fuboi

    Newcomer

    Joined:
    Aug 6, 2011
    Messages:
    90
    Likes Received:
    45
    You were working on a ring for v2, weren't you? Did simulations suggest a switched architecture was better?
     
  9. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,012
    Likes Received:
    112
    Well, this is what the open-source radeon driver does, albeit only for cube maps; I think everything else is really handled by the TMUs natively (but cube map address calcs are really complex):
    http://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
    Look at radeon_llvm_emit_prepare_cube_coords. There's an "AMDGPU.cube" llvm intrinsic there (which really will issue the hw V_CUBETC_F32, V_CUBESC_F32, V_CUBEMA_F32, V_CUBEID_F32 instructions; this is not unlike what you'd do in a software rasterizer: calculate new s/t coords, determine the major axis and face), plus a bunch of ordinary shader math (and lots more of that with explicit derivatives, as they need to be transformed into cube coordinate space). This is then passed to the TMU just as "simple" non-cube texture lookups are elsewhere.

    Looks like quite the cheat to me :).
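
    For reference, the major-axis/face selection and s/t computation being described look roughly like this in scalar code. This is a generic sketch of the standard cube-map mapping (OpenGL face ordering and sign conventions), not the radeon driver's code:

```cpp
#include <cmath>

struct CubeCoord { int face; float s, t; };

// Map a direction vector to (face, s, t): pick the axis with the largest
// magnitude, derive the per-face sc/tc coordinates, then divide by the
// major-axis magnitude and remap from [-1,1] to [0,1].
CubeCoord prepareCubeCoords(float x, float y, float z) {
    float ax = std::fabs(x), ay = std::fabs(y), az = std::fabs(z);
    float ma, sc, tc;
    int face;
    if (ax >= ay && ax >= az) {              // +/-X major axis
        ma = ax; sc = (x > 0) ? -z : z; tc = -y; face = (x > 0) ? 0 : 1;
    } else if (ay >= az) {                   // +/-Y major axis
        ma = ay; sc = x; tc = (y > 0) ? z : -z; face = (y > 0) ? 2 : 3;
    } else {                                 // +/-Z major axis
        ma = az; sc = (z > 0) ? x : -x; tc = -y; face = (z > 0) ? 4 : 5;
    }
    return { face, 0.5f * (sc / ma + 1.0f), 0.5f * (tc / ma + 1.0f) };
}
```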
     
    #9 mczak, Jan 22, 2016
    Last edited: Jan 22, 2016
  10. Jeff B

    Newcomer

    Joined:
    Jul 7, 2012
    Messages:
    11
    Likes Received:
    2
    No, I think a ring bus with a write-back/write-invalidate protocol would still be better. I got it basically working, but started running into edge cases, some of which I documented here:

    http://latchup.blogspot.com/2014/07/messy-details.html

    For the simpler write-through/write-update protocol, a ring didn't seem to make as much sense, because the responses are broadcast to all cores.

    The traffic seems to be much lower with a write-invalidate protocol. This analysis doesn't capture the improvement in latency for writes, which later scalability analysis suggests is the bottleneck (vs. interconnect bandwidth):

    http://latchup.blogspot.com/2014/07/write-back-vs-write-through-bake-off.html

    I'll probably redesign this again for the next iteration.
     
  11. Jeff B

    Newcomer

    Joined:
    Jul 7, 2012
    Messages:
    11
    Likes Received:
    2
    Yeah, it's a hack. Most algorithms I'm aware of use a diagonal in a 2x2 quad, but it was simpler and faster to just look at dU.
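
    The two strategies can be contrasted directly. The following is an illustrative sketch (not the actual Nyuzi code), with the usual rho-from-both-derivatives lod on one side and the dU-only shortcut on the other:

```cpp
#include <algorithm>
#include <cmath>

// Classic lod: rho is the longer of the two screen-axis gradient lengths
// of the texel-space coordinate, and lod = log2(rho).
float lodFromBothDerivatives(float dudx, float dvdx, float dudy, float dvdy,
                             int texSize) {
    float rho = std::max(std::hypot(dudx * texSize, dvdx * texSize),
                         std::hypot(dudy * texSize, dvdy * texSize));
    return std::log2(std::max(rho, 1.0f));
}

// The shortcut: derive the mip level from the horizontal u step alone.
// Cheap, but wrong for rotated or anisotropically viewed triangles.
float lodFromDuOnly(float dudx, int texSize) {
    return std::log2(std::max(std::fabs(dudx) * texSize, 1.0f));
}
```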
     
  12. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,012
    Likes Received:
    112
    Generally, for compliant rendering, the APIs allow you to use the same lod value for a 2x2 quad (you can use the dx values from either the top or bottom row, and the dy values from either the left or right column), but not for anything larger. This is also what hw does. Obviously you always need both dx and dy, however; without that, quality is going to be unacceptably low unless the geometry with the texture happens to be nearly co-planar to the screen (I think that's probably a bigger issue quality-wise than applying the same lod to 4x4 pixels, even). Note this does _not_ include things like sampling with explicit lod or explicit derivatives, which require you to do things per-pixel. Albeit I know at least some intel hw has a tunable parameter so it can recognize whether the calculated lods are "similar enough" and force the same mip level if they are, as that's apparently cheaper for the lookups even with dedicated hw.
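
    Concretely, the per-quad scheme amounts to taking finite differences across the 2x2 quad and computing one lod for all four pixels. A generic sketch (not any particular driver's or GPU's code):

```cpp
#include <algorithm>
#include <cmath>

// One shared lod for a 2x2 pixel quad. Quad layout:
// index 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
// dx comes from the top row, dy from the left column, per the API allowance.
float quadLod(const float u[4], const float v[4], int texSize) {
    float dudx = u[1] - u[0], dvdx = v[1] - v[0];  // top row difference
    float dudy = u[2] - u[0], dvdy = v[2] - v[0];  // left column difference
    float rho = std::max(std::hypot(dudx, dvdx), std::hypot(dudy, dvdy)) * texSize;
    return std::log2(std::max(rho, 1.0f));
}
```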
     