AMD: R9xx Speculation

Discussion in 'Architecture and Products' started by Lukfi, Oct 5, 2009.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Here:

    http://www.gamedev.net/community/forums/topic.asp?topic_id=540832

    is an HS and DS. Looking at the ISA for the DS:

    Code:
    00 ALU: ADDR(32) CNT(19) 
          0  x: ADD         T0.x, -R0.y,  1.0f      
             y: MUL_e       T0.y,  R0.x,  (0x40400000, 3.0f).x      
             z: ADD         T0.z, -R0.x,  1.0f      
             w: MUL_e       T0.w,  R0.y,  (0x40400000, 3.0f).x      
             t: MOV         R1.x,  0.0f      
          1  x: MUL_e       ____,  R0.x,  PV0.y      
             y: MUL_e       ____,  PV0.z,  PV0.z      
             z: MUL_e       ____,  R0.y,  PV0.w      
             w: MUL_e       T1.w,  PV0.x,  PV0.x      
             t: MOV         R1.y,  1      
          2  x: MUL_e       R3.x,  T0.w,  PV1.w      
             y: MUL_e       R5.y,  T0.z,  PV1.x      
             z: MUL_e       R5.z,  T0.y,  PV1.y      
             w: MUL_e       R5.w,  T0.x,  PV1.z      
             t: MUL_e       R4.y,  T0.z,  PV1.y      
          3  x: MUL_e       R2.x,  R0.x,  R0.x      
             y: MUL_e       R2.y,  R0.y,  R0.y      
             w: MUL_e       R2.w,  T0.x,  T1.w      VEC_120 
    [...]
    
    indicates that R0.xy contains the vertex output by TS. Sigh, first time I've actually studied the shader closely enough.

    So LDS is not holding the output from TS in this case, it's being written directly into GPRs. This is pretty similar to what the interpolator block used to do, creating allocations in the register file and populating shader inputs with data by setting GPRs.

    Taken at face value (and ignoring the registers HS uses), the 8 GPR allocation for this DS means that 32 threads can be in flight on the SIMD. Which is 2048 vertices output from TS.

    So the count of vertices that a SIMD can accept from TS is determined by GPR allocation for the DS.
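Spelling out the arithmetic behind that claim (the 256-GPR budget per thread slot and the 64-wide wavefront are my assumptions about Evergreen, not stated above):

```python
# Back-of-envelope for the 8-GPR DS above. Assumptions: a 256-GPR
# budget per SIMD thread slot and 64 threads per wavefront.
GPR_BUDGET = 256        # register budget per SIMD lane (assumption)
DS_GPRS = 8             # the DS's allocation, from the disassembly
WAVEFRONT_SIZE = 64     # threads per wavefront (assumption)

wavefronts_in_flight = GPR_BUDGET // DS_GPRS                 # 32
vertices_in_flight = wavefronts_in_flight * WAVEFRONT_SIZE   # 2048
print(wavefronts_in_flight, vertices_in_flight)  # 32 2048
```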

    This shader also has an integer in R0.z as input. What is it?
     
  2. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,559
    Likes Received:
    34
  3. caveman-jim

    Regular

    Joined:
    Sep 19, 2005
    Messages:
    305
    Likes Received:
    0
    Location:
    Austin, TX
    Nah, came from email from NV.
     
  4. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    From what I understand, it sounds more like something to reduce overdraw.
     
  5. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
It still doesn't mean much for BW, especially for NI's 1-2 triangles per clock peak.
    I don't think you're reading my post properly. There is no reason to store all the vertices produced by a patch.
    Not much more than typical vertex shaders.
    For patches like that, you just have to write the 3 or 4 control points to GDS. Then any SIMD can do the DS.
    In a way it is. If you compile shaders to use the same number of registers per fragment, then you can basically have an ubershader to work on wavefronts using any of those shaders.
When did I say it was irrelevant? Yes, I did give you a possibility: there could be a data-path bottleneck somewhere. You could have bank conflicts, or maybe limitations from all fragments in the DS accessing the same control point. Caches used for regular vertex processing (where all data comes from RAM) may alleviate that.
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Well you've probably seen my reply to 3dcgi by now, which indicates that each vertex output by TS is effectively consuming 16 bytes, an entire register, as it's written to the register file, not LDS. And that allocation is multiplied by the entire allocation of registers for the DS.

    The basis of my position isn't that all the vertices need to be stored. It's that when a huge lump of vertices are the result of one or a few patches owned by a single SIMD, TS throughput can be affected. Subject to the total count of SIMDs that can accept TS output.

    In the trivial case (which I suspect isn't realistic): if there's only one SIMD running HS/DS then TS has to stop while waiting for DS threads to complete processing. More realistically several SIMDs will be there to take on the workload of DS.

    Regardless of the number of SIMDs occupied by HS/DS, when a patch results in TS outputting more than X vertices (X dependent on register allocation of DS), TS is going to stall because it can't multi-thread patches - it treats them strictly sequentially. That's my interpretation, and I suspect it's a major factor in the performance cliff we see. Cypress falls to 1/6 to 1/10 of GTX480 throughput in the worst cases.
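A toy model of that stall argument; every number here is an assumption for illustration, not measured hardware behaviour:

```python
# If TS must hand a whole patch's vertices to one SIMD, a patch that
# overflows the SIMD's buffer forces TS to idle while DS drains it.
def ts_stall_cycles(verts_per_patch, simd_capacity, ds_verts_per_cycle):
    """Cycles TS idles per patch once the owning SIMD's buffer is full."""
    overflow = max(0, verts_per_patch - simd_capacity)
    return overflow / ds_verts_per_cycle

# A TF=64 triangle patch yields (64+1)*(64+2)//2 = 2145 domain points,
# already past a 2048-vertex SIMD buffer, so TS has to wait:
print(ts_stall_cycles(2145, 2048, 0.25))  # 388.0
```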

    Fermi probably stalls its TSs too, when amplification is very high. There's certainly warnings:

    http://www.highperformancegraphics.org/media/Hot3D/HPG2010_Hot3D_NVIDIA.pdf

    Migration/sharing of workload, something like Fermi, but which appears not to be part of Evergreen. Although, to be fair, there's no hard evidence for this.

    I imagine the architecture would have to be re-jigged for that kind of support. The implicit inputs to a pixel shader, say, aren't like the inputs to a DS. This ubershader would be of a type distinct from all those currently implemented.

    Under "Hardware Tessellator Progression", the slide says "Gen 8 - AMD Radeon HD 6900 Series - Scalability and off-chip buffering". You keep trying to dodge around "off-chip buffering" as though it has nothing to do with making tessellation faster. If moving data off-die is a performance win it's probably for the same reason as seen in GS: coarse granularity data, in huge wodges, is too voluminous to keep on-die.

    Anyway, it turns out that LDS wasn't the buffer under strain, it was the register file, which appears to make the strain worse...

    Yes, those can happen, definitely. Why would they scale with tessellation factor? LDS reads are solely for HS params that are inputs to DS, i.e. control-points and tessellation factors.

    Broadcast is fine in ATI as far as I know.

    The L1 texture/vertex cache you mean?

    The DS I referenced earlier does read stuff from RAM (two VFETCH instructions). I suspect it might be something to do with the original stream of patches - perhaps it's just scale/bias and offset data for the patch buffer, to enable calculation of the right address in LDS to fetch HS params from.

    In the same vein, the HS I referenced earlier reads two ints using a single VFETCH, which appear to be to generate LDS write addresses.

    In both cases VFETCH doesn't look like it could be a bottleneck. There's a lot of ALU work in HS/DS for the SIMD to hide VFETCH latency.
     
  7. SirPauly

    Regular

    Joined:
    Feb 16, 2002
    Messages:
    491
    Likes Received:
    14
  8. R300King!

    Newcomer

    Joined:
    Aug 4, 2002
    Messages:
    231
    Likes Received:
    5
    From that link.. googled of course.

    Umm, why would anyone think the 6970 would have the same number of shaders as the 5870?

    I don't get it. I think it'll probably be 4-D shaders but the total will be more than 1920. How many more, not sure, maybe 2240. :)
     
  9. TOAO_Cyrus

    Newcomer

    Joined:
    Feb 16, 2007
    Messages:
    16
    Likes Received:
    0
I don't get what you're saying; it doesn't have the same number of shaders, it's got 50% more if the 480 number is to be believed. Correct me if I'm wrong, but removing one lane probably doesn't save that much die area, so it's a pretty substantial increase for the same process.
     
  10. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    In that case, what you're saying is that the DS doesn't run fast enough. Could have saved a lot of typing and miscommunication that way...

    I don't agree with the assertion that a vertex must be domain shaded by the SIMD that created the patch from which it was born. You can put some control point data in the GDS, and the workload sharing is very similar to vertices being pumped into several SIMDs for regular geometry sans geometry shader or tessellation. It's not some Fermi exclusive technology.
    I've done no such thing. In fact, my entry into this discussion was about the fact that the BW required by off-chip buffering of TS output is very low, so it may well make things faster. They don't have the potentially 50-100 byte vertex size of non-tessellated geometry.
    By your own calculations, the buffer can still be very big. In your scenario, TS is not stalling because the buffer is inadequate, it's stalling because the DS isn't working fast enough. Unless you increase the speed of the DS, more buffer space won't help.
    What concrete evidence do we have about this scaling? Recall this thread:
    http://forum.beyond3d.com/showthread.php?t=57035
    Cypress showed the same 6 clocks per triangle added when the factor is 25, 50, and 100.
    What I was suggesting is that a VS with VFETCH is fast enough to do at least one vert per clock, so regardless of what is holding back the TS/DS, reading the tessellated triangles (and maybe even the control points) from memory has the potential to hit the same rate. This basically transforms the DS into a VS.
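A rough check of the "BW required by off-chip buffering is very low" point; the rates and sizes below are my own assumptions for illustration:

```python
# If a tessellated vertex is just a (u, v) domain point rather than a
# full 50-100 byte post-VS vertex, streaming TS output through memory
# costs little bandwidth relative to a >100 GB/s DRAM interface.
CLOCK_HZ = 850e6              # Cypress-class engine clock (assumption)
TRIS_PER_CLOCK = 2            # NI-class peak setup rate quoted above
BYTES_PER_DOMAIN_POINT = 8    # one (u, v) pair (assumption)

# With good vertex reuse, roughly one new vertex per triangle:
bw_gb_s = CLOCK_HZ * TRIS_PER_CLOCK * BYTES_PER_DOMAIN_POINT / 1e9
print(round(bw_gb_s, 1))  # 13.6 GB/s each way
```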
     
    #4390 Mintmaster, Nov 2, 2010
    Last edited by a moderator: Nov 2, 2010
  11. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    Something has to tell you where to fetch HS output data from.
     
  12. R300King!

    Newcomer

    Joined:
    Aug 4, 2002
    Messages:
    231
    Likes Received:
    5
The actual number of shaders is less than a 50% increase. Only if you're counting the clusters (320 to 480), then yes. But the 320 had 5 each, and the 480 only has 4. So the actual increase in shaders is 20%.

    And my first statement was..

    The article was saying that the hottest rumor was that the NEW 6970 has 1920 4D and not 1600 5D. Why on earth would the 6970 have 1600 5D shaders? WHY? The OLD 5870 had that, why would they even think the new chip would have the same damn thing. That's what I was saying. ;)
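The arithmetic in the post above, spelled out:

```python
# Cluster count grows 50%, but going from 5-wide to 4-wide VLIW means
# the total ALU count grows only 20%.
old_total = 320 * 5   # 1600 shaders (5870-style, 5-wide)
new_total = 480 * 4   # 1920 shaders (rumoured 6970, 4-wide)
print(new_total / old_total)  # 1.2
```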
     
  13. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    297
    Likes Received:
    38
    Location:
    Herwood, Tampere, Finland
Cypress has 20 processor cores, each of which is a 16-way SIMD, and for each SIMD lane there is a 6-way VLIW unit capable of 5 FP ALU operations and one branch.


The 6970 might be 30 processor cores, each a 16-way SIMD, with a 5-way VLIW per SIMD lane capable of 4 FP ALU operations and one branch.


The number of processors, the total number of SIMD lanes, and the number of threads operated on simultaneously would all grow by 50%.
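The totals implied by the two layouts described above:

```python
# cores x SIMD lanes x FP slots per VLIW
cypress_sps = 20 * 16 * 5       # 1600 SPs (known Cypress config)
cayman_guess_sps = 30 * 16 * 4  # 1920 SPs (speculative 6970 config)
lane_growth = (30 * 16) / (20 * 16)  # the 50% growth in SIMD lanes
print(cypress_sps, cayman_guess_sps, lane_growth)  # 1600 1920 1.5
```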



    But:

I think the shader counts are not very interesting here:

The 6870 series got almost the same performance as the 5870 series, with about 80% of the pixel shader power, 80% of the texturing power, 87.5% of the memory bandwidth, and 106% of theoretical pixel output performance, plus some tweaks to internal buffering, thread handling etc.



So I don't see much point in increasing shader power much from the 5800 to the 6900 series, as the 5870 seems to already have "too much" shading power, and especially if they stay at 32 ROPs there just would not be any use for much more shader power in any games coming out soon. Of course GPGPU performance, where ROPs are not used at all, would increase, but that's not a priority.

Going from 32 to 64 ROPs would probably give a much bigger performance increase than going from 480x4 to some >500x4 shaders.


    So, I think either of the following makes sense:


1) Just the internal buffering and thread handling tweaks Barts got, better geometry performance, and no big increase in shader count, with quite a small increase in die size. A maximum of 480x4 shaders, maybe only 400x4.

2) A considerably bigger chip: doubling of ROPs from 32 to 64, at least 480x4 shaders, and either a very high-clocked 256-bit memory bus or a 512-bit memory bus. And of course all the buffering and thread handling tweaks Barts has, plus better geometry performance.
     
  14. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    297
    Likes Received:
    38
    Location:
    Herwood, Tampere, Finland
    Another thought/speculation about Barts:

    We know, that originally Barts was supposed to be "6700" manufactured with 32nm.

So when the 32nm process was cancelled, what changed?

    I think the 256-bit memory bus and "slow-clocked" memory controller came with that:

With 32nm the chip might have been too small for the padding of a 256-bit memory bus, so the original plan was probably a 128-bit bus, the same 32 ROPs, but very high-clocked memory for the 6770 SKU.

When they had to change to 40nm and the chip size increased, they had space for the pads a 256-bit memory bus requires. And then they could use the "lower-end" memory controller and slower-clocked memory and still get better bandwidth than a 128-bit bus with high-clocked memory.
     
    #4394 hkultala, Nov 2, 2010
    Last edited by a moderator: Nov 2, 2010
  15. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,559
    Likes Received:
    34
We already calculated that the 256-bit ~4Gbps pads might fit on a ~200mm² die. So the 32nm performance chip might have been a die of that size, but maybe with some more SIMDs (2x8, ~1280 SPs) or a 4D design.
     
  16. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,455
    Likes Received:
    471
hkultala: Now I think that Barts was originally a refresh part, planned for launch around Easter. I think it's possible that Barts was never prepared for 32nm manufacturing, but was a 40nm part from the start. During its development it picked up some of NI's features (better filtering, UVD 3.0 etc.), was delayed, and launched as part of the NI family. For these reasons it could be quite misleading to use Barts as a basis for extrapolating Cayman's performance (or anything)...
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
Yes. I originally contemplated using TF=64 as my argument, but I was expecting people to reject that as totally unreasonable "because TF=64 is something no-one would ever use". Only later did I realise that in something like terrain rendering TF=64 would be completely normal - as would TF=1024 if it were available, because application of a height-field to a plane isn't helped by a high density of control points.

    The shaders I've played with show that HS is writing patch params to LDS and DS is reading them from LDS. Sure, there may be some magic happening behind the scenes. The inconvenient truth here is that Evergreen tanks when given high TF, and in my view that's consistent with TS stalling because the sink buffer is too small.

    It shouldn't be, no. But Cypress had features cut due to the 40nm fuck-up...

    So what is the mechanism by which an off-chip buffer increases tessellation performance (or allows it to increase along with other changes for better scaling in the architecture)?

The usual way to increase DS throughput is to increase the count of SIMDs running DS. But this is where coarse granularity hits hard: a high TF overflows the capacity of any single SIMD, unless DS throughput is >1 vertex per cycle (in which case 1 SIMD would be enough anyway). Once a SIMD is chock full of vertices from a patch, TS has got nowhere to send more vertices, because a single SIMD is the only place those vertices can go (according to the locked HS/DS theory).

    If TS could just switch to another SIMD for DS execution, then it wouldn't have to stall. Well, unless there's not enough SIMDs with available capacity (which is more likely on the smaller GPUs).

    An off-die buffer can soak-up the peaks caused by high-TF. Without it, even occasional high-TF patches will cause grief. That's the basis of my interpretation of that line on the slide. Producer-consumer relies upon big-enough intermediate buffers, "80:20 rule", whatever. The original ATI GS design uses an off-die buffer to avoid stalling caused by large amplification.

    Obviously there's the underlying argument in some people's eyes that Cypress/Barts achieves adequate performance in its 80:20 compromise. If Cayman is re-designed for "scalability" in its tessellation performance, and buffering is re-worked to achieve that increased performance, then it points to the current design having inadequate buffering to support scaling.

    6 clocks (you later adjusted to 6.5 clocks) is so slow, even at LOD 25, that it can't get any slower? A comparison with Juniper and Redwood could be useful, as only math would vary.

    What is TF for LOD 25? Is the count of patches constant regardless of LOD?
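For reference, the standard domain-point math (general tessellation arithmetic, not something established in this thread): a triangle domain tessellated uniformly at integer factor n yields n² triangles and (n+1)(n+2)/2 domain points, so amplification grows quadratically with TF.

```python
# Triangle and domain-point counts for a uniformly tessellated
# triangle patch at integer factor n.
def tri_patch_counts(n):
    triangles = n * n
    domain_points = (n + 1) * (n + 2) // 2
    return triangles, domain_points

for n in (1, 25, 64):
    print(n, *tri_patch_counts(n))
# 1 1 3
# 25 625 351
# 64 4096 2145
```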

    And the other crucial factor being that the GPU can use as many or as few SIMDs for DS as are necessary - without the "static" allocation issue that the locked-HS/DS theory implies. It's real load-balancing, whereas the current architecture appears to behave like a non-unified GPU - adequate only for low TFs.

    From that thread:

     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Thanks, yes, you've probably since noticed the penny had dropped last night.
     
  19. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Sure it might look rosy if you compare to the 5870. What happens when you consider the 6870 has 124% of the clock-speed and fill-rate of the 5850, 97% of the flops and texture rate, 105% bandwidth and is only 6-7% faster? Looks to me that pixel fill-rate isn't the determining factor at all and bandwidth/texturing still rules. Couple that with the much higher clock speed and Barts isn't doing anything magical with respect to its theoretical numbers vs Cypress.

    For Cayman to achieve higher performance than its theoretical numbers would indicate vs Cypress it would need to have more dramatic efficiency improvements than we've seen in Barts.
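The percentages above can be sanity-checked against launch spec sheets; the spec values below are my own assumption of the comparison being made:

```python
# 6870 vs 5850 ratios from public launch specs (assumed values).
hd6870 = dict(clock=900, sps=1120, tmus=56, bw=134.4)  # MHz, count, GB/s
hd5850 = dict(clock=725, sps=1440, tmus=72, bw=128.0)

clock_ratio = hd6870["clock"] / hd5850["clock"]   # fill-rate too: same ROPs
flops_ratio = (hd6870["sps"] * hd6870["clock"]) / (hd5850["sps"] * hd5850["clock"])
bw_ratio = hd6870["bw"] / hd5850["bw"]
print(round(clock_ratio, 2), round(flops_ratio, 2), round(bw_ratio, 2))
# 1.24 0.97 1.05
```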
     
  20. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    You know these things? :p

    Barts did not exist on the roadmap in 32nm. Barts in 40nm turned up before the 32nm cancellation.
     