Really wide SIMT in GPUs

Discussion in 'Architecture and Products' started by Infinisearch, Jul 24, 2015.

  1. Rodéric

    Rodéric a.k.a. Ingenu
    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,031
    Likes Received:
    898
    Location:
    Planet Earth.
    On GCN I think the sweet spot is 16 fragments per triangle.
     
    Simon F likes this.
  2. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    869
    Likes Received:
    277
    Sure, from the consumer's PoV; I was thinking about the producer's PoV.
    To be totally scalable you would have to be able to rasterize 1px triangles as fast, relative to their size, as 300px ones (so a 1px triangle would cost 1/300th as much). I don't think any current architecture achieves this.
     
  3. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,314
    Likes Received:
    140
    Location:
    On the path to wisdom
    But why would you want to? Rasterising one pixel triangles is enormously wasteful - and I don't mean just execution unit occupancy, but density of input data per surface area. At that level of detail you're much better off using alternative representations.
     
    3dcgi and Simon F like this.
  4. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    869
    Likes Received:
    277
    Because if you can do that, your hardware architecture is of the kind that can merge all 1px+ triangles into larger vectors and rasterize them without penalty at full throughput.
    Currently you always lose for any multiple > 1. Either it becomes super complex, with lots of overhead to merge and then split, or you can't merge at all because one vector needs to serve exactly one z-tile, or for other reasons.
     
  5. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,560
    Likes Received:
    157
    Location:
    In the Island of Sodor, where the steam trains lie
    Pixels are generally rendered in quads in order to support the gradient functions of the rendering APIs, so a 1x1 triangle will be as slow as a 2x2, at least AFAICS. Also, the triangle set-up is a relatively non-trivial operation and would still need to be done if you wanted anything other than point-sampling.
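
    As a rough illustration of the quad argument, here is a minimal Python sketch. It assumes shading is launched per 2x2 quad aligned to even pixel coordinates and that every touched quad costs four lanes. The helper names (quads_touched, shaded_lanes, lane_efficiency) are made up for this example and are not part of any API.

```python
def quads_touched(pixels):
    """Return the set of 2x2 quads covered by a set of (x, y) pixel coordinates."""
    return {(x // 2, y // 2) for x, y in pixels}


def shaded_lanes(pixels):
    """Lanes launched under the assumption that each touched quad costs 4 lanes."""
    return 4 * len(quads_touched(pixels))


def lane_efficiency(pixels):
    """Fraction of launched lanes that correspond to covered pixels."""
    return len(pixels) / shaded_lanes(pixels)


one_pixel = {(5, 3)}                           # a 1x1 triangle
full_quad = {(4, 2), (5, 2), (4, 3), (5, 3)}   # a fully covered 2x2 quad
straddling = {(5, 3), (6, 3)}                  # 2 pixels in two different quads

for name, px in [("1x1", one_pixel), ("2x2", full_quad), ("straddling", straddling)]:
    print(name, shaded_lanes(px), lane_efficiency(px))
```

    Under these assumptions a lone pixel occupies 4 lanes, the same as a fully covered quad, and two pixels that straddle a quad boundary occupy 8.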
     
  6. rapso

    Newcomer

    Joined:
    May 6, 2008
    Messages:
    215
    Likes Received:
    27
    Nowadays it would probably be more efficient and accurate if the fragment programs calculated the gradient themselves (yes, this implies FPs would also need to interpolate explicitly). Then fragment merging would work out better and small triangles would be less of a problem.
     
  7. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    869
    Likes Received:
    277
    You can always generate gradients, which really are gradients, per pixel, because they are always geometrically derived. The other stuff is actually a cross-lane "delta", and less useful than a real cross-lane operation. Cross-lane can be configured symmetrically with the packing when you have full vector swizzles. I don't see the need for ddx/ddy support.
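
    A minimal sketch of what "geometrically derived" could look like for a plain interpolated attribute, in Python. It assumes the attribute u varies linearly in screen space across the triangle (no perspective correction) and covers only interpolated inputs, not values computed later in the shader. The function name plane_gradient is hypothetical.

```python
def plane_gradient(v0, v1, v2):
    """Screen-space gradient (du/dx, du/dy) of an attribute u that varies
    linearly across a triangle. Each vertex is a tuple (x, y, u).
    Assumptions: no perspective correction, non-degenerate triangle."""
    x0, y0, u0 = v0
    x1, y1, u1 = v1
    x2, y2, u2 = v2
    # Twice the signed area of the triangle in screen space.
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    if det == 0:
        raise ValueError("degenerate triangle")
    dudx = ((u1 - u0) * (y2 - y0) - (u2 - u0) * (y1 - y0)) / det
    dudy = ((u2 - u0) * (x1 - x0) - (u1 - u0) * (x2 - x0)) / det
    return dudx, dudy


# Example attribute u = 0.5 * x + 1.5 * y sampled at three vertices.
grad = plane_gradient((0, 0, 0.0), (4, 0, 2.0), (0, 2, 3.0))
print(grad)
```

    The result is the same for every pixel of the triangle, so in this simplified case no neighbouring lane is needed to obtain it.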

    True. But I think you could load-balance way better. Anyway, it's a fantasy. I don't think something large could be redesigned in a radically different way nowadays. And nobody buys the small, simple GPUs that could start from scratch. Honestly I don't know how we could possibly get alternatives, except by going full software compute. And even then you wouldn't have the right instructions and paths to make it great.
     
  8. rapso

    Newcomer

    Joined:
    May 6, 2008
    Messages:
    215
    Likes Received:
    27
    SIMT/Compute is a little bit limiting. But if you schedule your vector lanes/threads by hand, then you could maybe achieve a way higher utilization and efficiency.
    With the increasing complexity of shader programs, it will become useful at some point to split them into 'passes'. This is simply because the shader is limited by its worst-case scenario: maybe 90% of your shader could actually run with half of the resources (e.g. register count), but you allocate for the worst case. If you split the work into passes, you could use completely different setups for gathering shadowing information, for doing your PBR shading, and for applying detail textures or gathering light probe information.
    NVidia has a paper about it targeting raytracing: http://www.nvidia.com/docs/IO/76976/HPG2009-Trace-Efficiency.pdf where they manage 'jobs' per lane by hand and also divide the task into two parts, and whichever part has higher occupancy is executed. It's a bit cumbersome on SIMT/Compute, but on a CPU-like programming model you could increase the efficiency. I'm referring to
    And IF you could increase efficiency, then you could go for wider SIMT, because the sweet spot would move from the current SIMT16 (on NV/ATI afaik) to maybe 64.
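
    To make the utilization argument concrete, here is a toy Python model. It is not the method from the linked paper, only a sketch under simple assumptions: each job needs a known number of lockstep steps (at least 1), a batch runs until its longest job finishes, and refilling an idle lane is free. The function names are hypothetical.

```python
def static_utilization(job_lengths, width):
    """Jobs are issued in fixed batches of `width`; each batch occupies all
    lanes until its longest job has finished."""
    total_slots = 0
    for start in range(0, len(job_lengths), width):
        batch = job_lengths[start:start + width]
        total_slots += width * max(batch)
    return sum(job_lengths) / total_slots


def refill_utilization(job_lengths, width):
    """Each lane fetches the next pending job as soon as it becomes idle
    (assumed to cost nothing in this model)."""
    lanes = [0] * width                   # remaining steps per lane
    pending = list(reversed(job_lengths)) # pop() then yields original order
    steps = 0
    while pending or any(lanes):
        for i in range(width):
            if lanes[i] == 0 and pending:
                lanes[i] = pending.pop()
        steps += 1
        lanes = [max(0, remaining - 1) for remaining in lanes]
    return sum(job_lengths) / (width * steps)


jobs = [8, 1, 1, 1, 1, 1, 1, 8]  # two long jobs among short ones
static = static_utilization(jobs, width=4)
refill = refill_utilization(jobs, width=4)
print(static, refill)
```

    In this toy case the fixed batches keep the lanes busy about 34% of the time and per-lane refill about 55%. Real refill has a cost, so the gap on hardware would be smaller than the model suggests.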

    But I have doubts this could be done at driver level or with just a small extension. You'd need to create higher-level control over how data flows (which is currently handled by drivers/hardware). E.g. you could bin fragments by light source count, not only using some screen-space carving and depth bound checks, but by really using the shadowing information on top (which you gather anyway). BUT you cannot just do that for the whole screen and potentially create hundreds of megabytes of data that is written to main memory and consumed later on; you'd rather keep the buffers at L2 or L3 size and dynamically schedule between the various task types that generate and consume that data.

    AVX512 should have all the needed parts to research that.
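
    The worst-case register allocation point earlier in this post can be put into numbers with a small Python sketch. The figures are assumptions chosen to resemble a GCN-style SIMD (256 vector registers per lane, at most 10 wavefronts in flight), allocation granularity is ignored, and the register counts for the example shader are invented.

```python
def waves_in_flight(regs_per_thread, regs_per_lane=256, max_waves=10):
    """Wavefronts that fit on one SIMD for a given per-thread register count.
    regs_per_lane and max_waves are assumed, GCN-like limits."""
    return min(max_waves, regs_per_lane // regs_per_thread)


# Hypothetical shader: most of it needs 42 registers, one path needs 84.
monolithic = waves_in_flight(84)  # whole shader allocated for the worst case
light_pass = waves_in_flight(42)  # the cheap part run as its own pass
heavy_pass = waves_in_flight(84)  # only the expensive part pays for 84
print(monolithic, light_pass, heavy_pass)
```

    With these made-up numbers the monolithic shader runs at 3 waves per SIMD, while the cheap pass on its own would reach 6, which is the kind of gain splitting into passes is after.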
     
  9. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,439
    Likes Received:
    280
    Some, like Tonga, need 8 pixel triangles to max out the prim and pixel rates.
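
    One way to read such a figure is as the ratio of peak pixel rate to peak primitive rate. The Python sketch below assumes 4 primitives per clock and 32 pixels per clock for a Tonga-like part, and 64 pixels per clock for a part with a wider back end. These rates are assumptions for illustration, not numbers stated in this thread.

```python
def balance_pixels_per_triangle(pixels_per_clock, prims_per_clock):
    """Triangle size (in pixels) at which the pixel back end and the
    primitive front end are both saturated."""
    return pixels_per_clock / prims_per_clock


tonga_like = balance_pixels_per_triangle(32, 4)     # assumed rates
wider_backend = balance_pixels_per_triangle(64, 4)  # assumed rates
print(tonga_like, wider_backend)
```

    Below that size the primitive rate is the limit; above it, the pixel rate is.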
     
  10. Rodéric

    Rodéric a.k.a. Ingenu
    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,031
    Likes Received:
    898
    Location:
    Planet Earth.
    Ah nice to know ty.
     