AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

Discussion in 'Architecture and Products' started by UniversalTruth, Dec 17, 2010.

  1. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    I was thinking the same; of course the 7970 will have its fps capped at 60 and then halved to 30 with v-sync on (if the frame rate drops to around 35 fps). Sadly I can't find any 7970 benchmark results.
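    To put numbers on the v-sync point above, here is a minimal sketch. It assumes plain double-buffered v-sync on a 60 Hz display, where every frame is held until the next refresh boundary; `vsync_fps` is a made-up helper name, and triple buffering or adaptive v-sync would not behave this way.

```python
import math

def vsync_fps(raw_fps: float, refresh_hz: float = 60.0) -> float:
    """Effective frame rate under plain double-buffered v-sync.

    Assumption: a finished frame waits for the next refresh boundary,
    so each frame occupies a whole number of refresh intervals.
    """
    if raw_fps <= 0:
        raise ValueError("raw_fps must be positive")
    # Number of refresh intervals one frame spans (small epsilon guards
    # against floating-point noise when raw_fps equals refresh_hz).
    intervals = max(1, math.ceil(refresh_hz / raw_fps - 1e-9))
    return refresh_hz / intervals

if __name__ == "__main__":
    for raw in (75, 60, 35, 25):
        print(raw, "->", vsync_fps(raw))
```

    Under those assumptions a card rendering at about 35 fps is shown at 30 fps, and one rendering above 60 fps is capped at 60.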

    Note that AMD gives its results at 2560x1600, not 1920x1080 (and AA extreme = supersampling enabled).

    http://forums.guru3d.com/showthread.php?t=367114
    670 without v-sync (or maybe with adaptive v-sync)
    [​IMG]
     
    #3921 lanek, Aug 18, 2012
    Last edited by a moderator: Aug 18, 2012
  2. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,910
    Likes Received:
    2,230
    Location:
    Germany
    You like those better?
    http://imgur.com/wRWL0
     
  3. Broken Hope

    Regular

    Joined:
    Jul 13, 2004
    Messages:
    483
    Likes Received:
    1
    Location:
    England
    I think I had adaptive vsync enabled by accident, didn't make much difference with normal Vsync though since Sleeping Dogs supports triple buffering anyway.

    [​IMG]

    My 670 isn't massively overclocked, it's a Gigabyte Windforce running at stock, but yeah it does have a slight factory overclock. Puts it at around stock 680 performance from what I can see in most reviews.

    Disabling Vsync doesn't make much difference on the 670

    [​IMG]

    I'd prefer if the graphs had actual FPS numbers on them, but AMD don't like to do that.

    If anyone with a 7970 wants to do a comparison and has a similar system I'd like to see how much better GCN handles the game.

    I have a 2500K at 4.4 GHz and, as I said earlier, a Gigabyte 670 Windforce.
     
  4. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,859
    Likes Received:
    2,790
    Location:
    Finland
    Dunno if this is the right thread to ask, but which app is most likely correct on core voltage (vddc) - furmark, afterburner or gpu-z?
    According to Afterburner, the 7970 I just got is a 1175mV one - I lowered the voltage to 1024mV in Afterburner, and according to it, it stays there during 3D load now.
    Furmark says the VDDC however is 1170mV, and according to GPU-Z it fluctuates around 968-970mV or so
     
  5. ECH

    ECH
    Regular

    Joined:
    May 24, 2007
    Messages:
    682
    Likes Received:
    7
    Adaptive vsync can override in game vsync. That's interesting to know. What's also interesting is that regardless if you have vsync on or not your min FPS remains the same. Which brings me back to my original question about vram playing a factor even at just 1080 resolution. But it does look, to me, that you are at a pretty high OC on that 670 though.
     
  6. Broken Hope

    Regular

    Joined:
    Jul 13, 2004
    Messages:
    483
    Likes Received:
    1
    Location:
    England
    VRAM doesn't appear to be an issue, it doesn't even get close to maxing out the card.

    [​IMG]

    Unless GPU-Z isn't reporting correctly.
     
  7. ECH

    ECH
    Regular

    Joined:
    May 24, 2007
    Messages:
    682
    Likes Received:
    7
    Are the low frame rates sustained, or is that just the lowest point reported? A line graph using fraps might shed some light. Can anyone with a 7970 GHz edition do the same?
     
  8. Broken Hope

    Regular

    Joined:
    Jul 13, 2004
    Messages:
    483
    Likes Received:
    1
    Location:
    England
  9. AlphaWolf

    AlphaWolf Specious Misanthrope
    Legend

    Joined:
    May 28, 2003
    Messages:
    8,949
    Likes Received:
    951
    Location:
    Treading Water
    Yes truly amazing. Certainly a first. /sarcasm
     
  10. ECH

    ECH
    Regular

    Joined:
    May 24, 2007
    Messages:
    682
    Likes Received:
    7
    First I've heard of that. Haven't seen any pics to know exactly what's being said.
    Oh yeah, do you have a line graph for that benchmark you posted?
     
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,346
    Likes Received:
    3,864
    Location:
    Well within 3d
    Being able to wait based on VALU_CNT doesn't matter as much on SI because the 4-cycle vector execution phase and issue restrictions mean no subsequent vector op can issue.
    That's an implementation detail that may not be worth exposing in the ISA if AMD wants room to tinker with the vector pipeline or issue rates in the future.

    This section and its diagram are heavily lifted from section 8.2 of the Evergreen ISA doc. I'm not even sure if some of those paths or write caches exist as described.
    Still, even if this were an error, this and the collection of other defects are not as bad as the Bulldozer software guide's (edit: at least the initial release) many egregious misses.
     
    #3931 3dilettante, Aug 21, 2012
    Last edited by a moderator: Aug 21, 2012
  12. HMBR

    Regular

    Joined:
    Mar 24, 2009
    Messages:
    417
    Likes Received:
    105
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I can't work out what you're saying here.
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,346
    Likes Received:
    3,864
    Location:
    Well within 3d
    Using S_WAITCNT makes sense for the counters in question because the implementation allows multiple memory, export, and data share instructions to issue instead of stalling for completion. There's no fixed time period that they can be expected to complete within.

    The SIMD vector issue currently does not do that because the SIMD only issues every 4 cycles, which is the same time it takes for a VALU instruction to execute. There is no reason to wait because the instruction is either guaranteed to be done due to the round-robin issue, or it's a special or DP instruction and the unit won't be idle.

    Doing something that changes instruction latency relative to the 1:4 cycle ratio, makes it variable, or raises the VALU issue rate can require additional logic. Alternatively, AMD could use the extra bits in a future version of waitcnt and allow for the compiler to indicate how many instructions the hardware can try to issue before worrying about dependences.
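    The 1:4 argument above can be sketched in a few lines. This is only a toy model: it assumes the publicly described GCN figures of 64-wide wavefronts, 16-lane SIMDs and four SIMDs per CU with round-robin issue, and the function names are made up for illustration.

```python
import math

def valu_cycles(wavefront_size: int = 64, simd_lanes: int = 16) -> int:
    """Cycles for one VALU op to pass a whole wavefront through a SIMD."""
    return math.ceil(wavefront_size / simd_lanes)

def issue_interval(simds_per_cu: int = 4) -> int:
    """Cycles between issue slots for one SIMD under round-robin issue
    (assumption: the CU visits one SIMD per cycle)."""
    return simds_per_cu

def needs_valu_wait(valu_latency: int, interval: int) -> bool:
    """True if a VALU op could still be in flight at the next issue slot,
    i.e. the case where some wait or interlock logic would be required."""
    return valu_latency > interval

if __name__ == "__main__":
    # Current cadence: latency equals the issue interval, so no wait needed.
    print(needs_valu_wait(valu_cycles(), issue_interval()))  # False
    # Hypothetical: a longer-latency op, or a SIMD issuing every cycle.
    print(needs_valu_wait(8, issue_interval()))              # True
    print(needs_valu_wait(valu_cycles(), 1))                 # True
```

    With latency and issue interval both at 4 cycles the check is always false, which matches the point that a VALU wait count has nothing to do today; changing either side of the ratio is what would give it a job.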
     
    #3934 3dilettante, Aug 21, 2012
    Last edited by a moderator: Aug 21, 2012
  15. ECH

    ECH
    Regular

    Joined:
    May 24, 2007
    Messages:
    682
    Likes Received:
    7
    The 7900 series is winning at a lower gpu overclock.
    1920x1080
    2560x1600
    I'm surprised not to see more reviews with 7900 series overclocked.
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Except for the cases described in table 4.2, when no non-dependent instructions are available.

    Which is effectively what table 4.2 is about, for dependent instructions.

    I don't disagree with the general concept here. Table 4.2 fundamentally consists of VALU instructions being dependent upon data moves to/from the scalar unit. Not a whole lot different from data moves to/from LDS.

    NOPs (or non-dependent instructions) avoid a thread switch. Issuance of an S_WAIT VALU_CNT would imply a thread switch (once non-dependent instructions run out). The latter sounds like it should be "costly" yet the same argument could apply to S_WAIT LGKM_CNT when fetching data from LDS, since those should generally be very low latency (similar magnitude to the values shown in table 4.2, at least for a single LDS instruction).

    So, to be quite honest, I don't see why AMD didn't use VALU_CNT to effect the functionality of table 4.2. The only argument I can muster is that these are hard latencies the compiler can see. That "simplification" results in NOPs being possible, though, instead of using those cycles for another thread.

    And that still leaves the mystery of why this counter is named/sized but nothing is said about its use. (Hardware bug for this particular scenario?)
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,346
    Likes Received:
    3,864
    Location:
    Well within 3d
    Most of these scenarios would not work with a WAITCNT, since it works by incrementing/decrementing a counter when an instruction of a given type is issued/completed, and the source instructions of these side-effects or crossovers from the SIMD to scalar domain would already have completed as far as data dependence goes.

    NOPs or independent instructions in the 4.2 table probably prevent undefined behavior. The ISA document is stating that the hardware won't catch these dependences, so it's going to blindly issue even if the end results make no sense. The wait counters wouldn't catch these because the originating instructions have completed.

    The LDS is a common resource between wavefronts, and each operation can vary in duration based on bank conflicts. One common thread for these wait states is that they are used in cases where the CU's control logic has deferred some authority to outside scheduling and arbitration, be it the LDS, GDS, the graphics pipeline at the other end of the export bus, the scalar cache controller, or the vector memory pipeline.
    The SIMDs seem to be simpler and have little self-management capability.

    There are several awkward points about SI that I'm curious about. One is that S_WAITCNT is severely constrained if you want to use scalar memory reads. These, of all the operations, can complete out of order. It doesn't sound like the architecture will stop you from using a different wait value, but who knows what it'll do.

    Another corner case is VALU instructions with an LDS operand. How are these being handled, and what implications are there for sourcing from something that is not fixed latency?
    Is this one reason why there isn't a wait count available for VALU instructions?

    It's effectively constrained to 1 or 0 for now. The way the architecture acts, an S_WAITCNT for VALU_CNT would only be allowed a value of 0.
    This might be a case of something going on behind the scenes that they can't fully hide; it's in the "wiggle room" category when they came to a crossroads in the design and didn't finalize until later, or it's "room to grow".

    Another place that AMD may have been vacillating is in the HW_ID table. The SIMD and CU identifier bits are separated by two reserved bits.
    Does this mean they reserve the possibility for a combination of increasing SIMDs per CU or CUs per array?


    One thing that could come up in the future is if the SIMDs become more capable of self-management, which might necessitate adding a wait count.
    Vector multi-issue or the introduction of 16-wide instructions for closer alignment with future CPU extensions might make the count more important.
     
  18. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Trying to keep something confidential?
     
  19. hoom

    Veteran

    Joined:
    Sep 23, 2003
    Messages:
    3,008
    Likes Received:
    538
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I disagree. These CNT registers are "watched" by the top-level logic that controls thread arbitration.

    The issuing instruction (that increments CNT) always completes immediately. The text talks about "completion" in the sense of data being returned, not in terms of thread state. The thread will be switched-out upon S_WAITCNT.

    For example the shader can execute a series of 4 LDS fetches then issue S_WAITCNT for LGKM and then the thread will be switched-out. (R700 also hides LDS access latency by grouping-up LDS accesses - same as for TEX. Evergreen and NI don't hide LDS latency.)
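    The "4 LDS fetches, then S_WAITCNT" case above can be modelled with a toy counter. This sketch assumes one fetch issues per cycle and that the counter goes up on issue and down when data returns; `WaitCounter` and `resume_cycle` are hypothetical names, not anything from the ISA document.

```python
class WaitCounter:
    """Toy model of one S_WAITCNT-style counter (e.g. LGKM_CNT)."""

    def __init__(self) -> None:
        self.outstanding = 0

    def issue(self) -> None:
        self.outstanding += 1      # incremented when the op issues

    def complete(self) -> None:
        if self.outstanding == 0:
            raise RuntimeError("completion without a matching issue")
        self.outstanding -= 1      # decremented when data has returned

    def can_proceed(self, wait_value: int) -> bool:
        return self.outstanding <= wait_value


def resume_cycle(latencies, wait_value: int) -> int:
    """Cycle at which the wavefront may continue past the wait.

    Assumption: op i issues at cycle i and returns at cycle i + latency.
    """
    counter = WaitCounter()
    pending = []
    for issue_cycle, latency in enumerate(latencies):
        counter.issue()
        pending.append(issue_cycle + latency)
    pending.sort()
    cycle = len(latencies)         # the wait is reached after the last issue
    while True:
        # Retire every op whose data has returned by this cycle.
        while pending and pending[0] <= cycle:
            pending.pop(0)
            counter.complete()
        if counter.can_proceed(wait_value):
            return cycle
        cycle += 1


if __name__ == "__main__":
    print(resume_cycle([10, 10, 10, 10], wait_value=0))  # waits for all four
    print(resume_cycle([10, 10, 10, 10], wait_value=2))  # waits for two
```

    In this model a wait value of 0 holds the wavefront until the last fetch has returned, while a larger value lets it resume once only that many fetches are still outstanding; the cycles in between are what another wavefront could use.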

    Yes, this is purely a dependency fault in terms of timeliness of register content (or variability in the "path length" through logic to access that register, causing the timing problem).

    But S_WAITCNT implies sleeping the thread until the counter hits the value specified, so I don't see how your final sentence would apply.

    And cases 5, 6 and 7 in Table 4.2 consist of VALU versus non-VALU external dependencies :wink:

    But yes, the SIMDs are dumb. Dumber than R600 SIMDs.

    Yes the programmer (i.e. the shader compiler, 99.99999% of the time) is supposed to know that 0 is the only meaningful wait value.

    Scalar memory reads (from constant memory) will have to be "bunched up" over several sets, if you want to do lots and lots of them.

    Presumably you're referring to bank conflicts during LDS access. LGKM_CNT is only decremented once the variable latency caused by the bank conflict has fully passed.
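    As a rough illustration of why LDS latency varies with bank conflicts, here is a minimal sketch. The bank count (32), the bank width (4 bytes) and the rule that lanes reading the same word are served together are all assumptions made for the example, not figures taken from the SI document, and `conflict_degree` is a made-up helper.

```python
from collections import defaultdict

def conflict_degree(byte_addresses, num_banks: int = 32, bank_width: int = 4) -> int:
    """Worst-case number of distinct words requested from a single bank.

    A result of 1 means the access is conflict-free; N means the busiest
    bank needs roughly N passes (assumed model, see lead-in).
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // bank_width
        words_per_bank[word % num_banks].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)

if __name__ == "__main__":
    lanes = range(32)
    print(conflict_degree([4 * i for i in lanes]))       # consecutive words
    print(conflict_degree([8 * i for i in lanes]))       # stride of 2 words
    print(conflict_degree([4 * 32 * i for i in lanes]))  # stride of 32 words
```

    Under these assumptions consecutive words spread over all banks, a stride of two words doubles up on the even banks, and a stride equal to the bank count sends every lane to the same bank, which is the variable part of the latency that the counter has to cover.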

    I dare say a good way to think of these S_WAITCNT instructions is that they are protecting the VGPRs. That's all they do, ensuring that VGPRs are permanently coherent. With that model in mind, it becomes clear that table 4.2 consists entirely of non-VGPR operands for shader instructions and expresses dependency constraints that don't relate to VGPRs.

    I still can't think what VALU_CNT could be used for, though.

    I don't think it means anything. Bits 30/31 are also not mentioned. This is just stuff documented for debugging as far as I can tell.

    I'm sure a future chip will have more than 2048 VALU lanes.

    Hopefully AMD will work out how to get past a mere two shader engines and to resolve the coherency problems that multiple shader engines create. If NVidia can tackle that problem there's no reason why AMD can't.
     