AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

Why even discuss benchmarks taken with vsync on..

(and with unknown, differing CPUs, control panel settings, etc.)

I was thinking the same. Of course the 7970 will have its FPS capped at 60 FPS max, and then halved to 30 FPS with v-sync on (if the FPS drops close to 35 FPS). Sadly I can't find any 7970 benchmark results.

Note AMD give the results at 2560x1600, not 1920x1080 (and AA extreme = supersampling enabled).

http://forums.guru3d.com/showthread.php?t=367114
670 without v-sync (or maybe with adaptive v-sync)
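For reference, the 60 → 30 FPS halving mentioned above is just double-buffered vsync quantization: a finished frame can only flip on a vblank, so the effective frame time rounds up to a whole multiple of the refresh interval. A minimal sketch (hypothetical `vsync_fps` helper; assumes plain double buffering, no triple buffering or adaptive vsync):

```python
import math

def vsync_fps(render_ms, refresh_hz=60.0):
    """Effective FPS under double-buffered vsync: the finished frame
    waits for the next vblank, so the frame time rounds up to a whole
    number of refresh intervals."""
    interval_ms = 1000.0 / refresh_hz
    vblanks = math.ceil(render_ms / interval_ms)
    return 1000.0 / (vblanks * interval_ms)

# A frame taking ~17 ms (just over one 16.7 ms interval) needs two
# vblanks, so the rate snaps from 60 down to 30 FPS.
print(vsync_fps(17.0))
print(vsync_fps(10.0))
```

This is why triple buffering (as in Sleeping Dogs, per the posts below) changes the picture: it decouples rendering from the flip, so FPS is not forced onto these divisors.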
 
I thought he used an older 7970 at stock (non-GHz Edition) vs a 670 at its highest boost/OC rate. I noticed that his max FPS is 62.1 for the 670 w/vsync? Something is not right with that.

I think I had adaptive vsync enabled by accident, didn't make much difference with normal Vsync though since Sleeping Dogs supports triple buffering anyway.

http://i.imgur.com/wEIEh.png


My 670 isn't massively overclocked, it's a Gigabyte Windforce running at stock, but yeah, it does have a slight factory overclock. Puts it at around stock 680 performance from what I can see in most reviews.

Disabling Vsync doesn't make much difference on the 670

http://i.imgur.com/lMq7K.png



I'd prefer if the graphs had actual FPS numbers on them, but AMD don't like to do that.

If anyone with a 7970 wants to do a comparison and has a similar system I'd like to see how much better GCN handles the game.

I have a 2500K at 4.4GHz and, as I said earlier, a Gigabyte 670 Windforce.
 
Dunno if this is the right thread to ask, but which app is most likely correct on core voltage (VDDC): FurMark, Afterburner, or GPU-Z?
According to Afterburner, the 7970 I just got is a 1175mV one. I lowered the voltage to 1024mV in Afterburner, and according to it, it stays there under 3D load now.
FurMark says the VDDC is 1170mV, however, and according to GPU-Z it fluctuates around 968-970mV or so.
 
I think I had adaptive vsync enabled by accident, didn't make much difference with normal Vsync though since Sleeping Dogs supports triple buffering anyway.

http://i.imgur.com/wEIEh.png

My 670 isn't massively overclocked, it's a Gigabyte Windforce running at stock, but yeah, it does have a slight factory overclock. Puts it at around stock 680 performance from what I can see in most reviews.

Disabling Vsync doesn't make much difference on the 670

http://i.imgur.com/lMq7K.png

I have a 2500K at 4.4GHz and, as I said earlier, a Gigabyte 670 Windforce.
Adaptive vsync can override in-game vsync; that's interesting to know. What's also interesting is that regardless of whether you have vsync on or not, your min FPS remains the same. Which brings me back to my original question about VRAM playing a factor even at just 1080p. But it does look to me like you're at a pretty high OC on that 670, though.
 
VRAM doesn't appear to be an issue; it doesn't even get close to maxing out the card.

http://i.imgur.com/xSO21.png

Unless GPU-Z isn't reporting correctly.
Are the low frame rates sustained, or reported at their lowest peak? A line graph using Fraps might shed some light. Can anyone with a 7970 GHz Edition do the same?
 
That would only have any benefit for scenarios 5, 6 and 7 in table 4.2, and only when the shader compiler has no non-dependent instructions available to schedule.
Being able to wait based on VALU_CNT doesn't matter as much on SI because the 4-cycle vector execution phase and issue restrictions mean no subsequent vector op can issue.
That's an implementation detail that may not be worth exposing in the ISA if AMD wants room to tinker with the vector pipeline or issue rates in the future.

Strangely, there is a single reference to a "clause", in section 9.2. It made me wonder if there's an underlying 8-instruction cadence for clauses. i.e. AAAAAAAABBBBBBBBAAAAAAAA... represents two wavefronts, alternating in their issue of VALU instructions. But that seems extremely unlikely.

This section and its diagram are heavily lifted from section 8.2 of the Evergreen ISA doc. I'm not even sure some of those paths or write caches exist as described.
Still, even if this were an error, this and the collection of other defects are not as bad as the Bulldozer software guide's (edit: at least the initial release) many egregious misses.
 
Being able to wait based on VALU_CNT doesn't matter as much on SI because the 4-cycle vector execution phase and issue restrictions mean no subsequent vector op can issue.
That's an implementation detail that may not be worth exposing in the ISA if AMD wants room to tinker with the vector pipeline or issue rates in the future.
I can't work out what you're saying here.
 
Using S_WAITCNT makes sense for the counters in question because the implementation allows multiple memory, export, and data share instructions to issue instead of stalling for completion. There's no fixed time period that they can be expected to complete within.

The SIMD vector issue currently does not do that because the SIMD only issues every 4 cycles, which is the same time it takes for a VALU instruction to execute. There is no reason to wait because the instruction is either guaranteed to be done due to the round-robin issue, or it's a special or DP instruction and the unit won't be idle.

Doing something that changes instruction latency relative to the 1:4 cycle ratio, makes it variable, or raises the VALU issue rate could require additional logic. Alternatively, AMD could use the extra bits in a future version of waitcnt and allow the compiler to indicate how many instructions the hardware can try to issue before worrying about dependences.
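The issue-increments / completion-decrements behaviour described in this exchange can be sketched as a toy model (hypothetical Python, my own names; it assumes in-order completion only, whereas real SI scalar memory reads can complete out of order, which is exactly the constraint raised further down the thread):

```python
from collections import deque

class WaitCounter:
    """Toy model of a GCN-style wait counter (e.g. LGKM_CNT):
    incremented when a long-latency op issues, decremented when its
    data returns; S_WAITCNT stalls until at most N ops are outstanding."""

    def __init__(self):
        self.count = 0
        self.inflight = deque()  # completion times, in issue order

    def issue(self, now, latency):
        """Issuing increments the counter immediately; the issuing
        instruction itself 'completes' right away."""
        self.count += 1
        self.inflight.append(now + latency)

    def s_waitcnt(self, now, n):
        """Return the time execution may resume with count <= n.
        Models the wavefront sleeping until enough data has returned.
        In-order completion only: out-of-order returns (scalar memory
        reads) would make any n > 0 meaningless, hence waitcnt 0."""
        while self.count > n:
            t = self.inflight.popleft()
            now = max(now, t)
            self.count -= 1
        return now
```

E.g. four LDS fetches issued back to back, then one S_WAITCNT for the lot, matches the "bunched up" usage described later in the thread.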
 
The SIMD vector issue currently does not do that because the SIMD only issues every 4 cycles, which is the same time it takes for a VALU instruction to execute. There is no reason to wait because the instruction is either guaranteed to be done due to the round-robin issue, or it's a special or DP instruction and the unit won't be idle.
Except for the cases described in table 4.2, when no non-dependent instructions are available.

Doing something that changes instruction latency relative to the 1:4 cycle ratio,
Which is effectively what table 4.2 is about, for dependent instructions.

makes it variable, or raises the VALU issue rate could require additional logic. Alternatively, AMD could use the extra bits in a future version of waitcnt and allow the compiler to indicate how many instructions the hardware can try to issue before worrying about dependences.
I don't disagree with the general concept here. Table 4.2 fundamentally consists of VALU instructions being dependent upon data moves to/from the scalar unit. Not a whole lot different from data moves to/from LDS.

NOPs (or non-dependent instructions) avoid a thread switch. Issuance of an S_WAIT VALU_CNT would imply a thread switch (once non-dependent instructions run out). The latter sounds like it should be "costly" yet the same argument could apply to S_WAIT LGKM_CNT when fetching data from LDS, since those should generally be very low latency (similar magnitude to the values shown in table 4.2, at least for a single LDS instruction).

So, to be quite honest, I don't see why AMD didn't use VALU_CNT to effect the functionality of table 4.2. The only argument I can muster is that these are hard latencies the compiler can see. That "simplification" results in NOPs being possible, though, instead of using those cycles for another thread.

And that still leaves the mystery of why this counter is named/sized but nothing is said about its use. (Hardware bug for this particular scenario?)
 
Except for the cases described in table 4.2, when no non-dependent instructions are available.
Most of these scenarios would not work with a WAITCNT, as it works via an increment/decrement when an instruction of a given type is issued/completed, and the source instructions of these side-effects or crossovers from the SIMD to the scalar domain would have completed as far as data dependence goes.

NOPs (or non-dependent instructions) avoid a thread switch. Issuance of an S_WAIT VALU_CNT would imply a thread switch (once non-dependent instructions run out). The latter sounds like it should be "costly" yet the same argument could apply to S_WAIT LGKM_CNT when fetching data from LDS, since those should generally be very low latency (similar magnitude to the values shown in table 4.2, at least for a single LDS instruction).
NOPs or independent instructions in the table 4.2 cases probably prevent undefined behavior. The ISA document is stating that the hardware won't catch these dependences, so it will blindly issue even if the end results make no sense. The wait counters wouldn't catch these because the originating instructions have completed.

The LDS is a common resource between wavefronts, and each operation can vary in duration based on bank conflicts. One common thread for these wait states is that they are used in cases where the CU's control logic has deferred some authority to outside scheduling and arbitration, be it the LDS, GDS, the graphics pipeline at the other end of the export bus, the scalar cache controller, and vector memory pipeline.
The SIMDs seem to be simpler and have little self-management capability.
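The bank-conflict variability mentioned above can be illustrated with a toy cost model (hypothetical; assumes a GCN-style LDS with 32 banks of 4-byte words, where lanes hitting the same bank serialize; real hardware also broadcasts identical addresses for free, which is not modeled here):

```python
from collections import Counter

def lds_access_cycles(addresses, num_banks=32, word_bytes=4):
    """Toy estimate of one LDS operation's cost: accesses to distinct
    banks proceed in parallel, while accesses to the same bank
    serialize, so the cost is the worst-case number of lanes that
    land on a single bank."""
    banks = Counter((addr // word_bytes) % num_banks for addr in addresses)
    return max(banks.values())

# 32 consecutive words hit 32 distinct banks: conflict-free.
print(lds_access_cycles(range(0, 128, 4)))
# A 128-byte stride puts every lane on bank 0: fully serialized.
print(lds_access_cycles(range(0, 2048, 128)))
```

This variability is why LDS completion has to be tracked by a counter rather than by a fixed instruction-spacing rule like table 4.2.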

There are several awkward points about SI that I'm curious about. One is that S_WAITCNT is severely constrained if you want to use scalar memory reads. These, of all the operations, can complete out of order. It doesn't sound like the architecture will stop you from using a different wait value, but who knows what it'll do.

Another corner case is VALU instructions with an LDS operand. How are these handled, and what are the implications of sourcing from something that is not fixed latency?
Is this one reason why there isn't a wait count available for VALU instructions?

And that still leaves the mystery of why this counter is named/sized but nothing is said about its use. (Hardware bug for this particular scenario?)
It's effectively constrained to 1 or 0 for now. The way the architecture acts, an S_WAITCNT for VALU_CNT would only be allowed a value of 0.
This might be a case of something going on behind the scenes that they can't fully hide; it's in the "wiggle room" category, where they came to a crossroads in the design and didn't finalize until later; or it's "room to grow".

Another place that AMD may have been vacillating is in the HW_ID table. The SIMD and CU identifier bits are separated by two reserved bits.
Does this mean they reserve the possibility for a combination of increasing SIMDs per CU or CUs per array?


One thing that could come up in the future is if the SIMDs become more capable of self-management, which might necessitate adding a wait count.
Vector multi-issue or the introduction of 16-wide instructions for closer alignment with future CPU extensions might make the count more important.
 
Most of these scenarios would not work with a WAITCNT, as it works via an increment/decrement when an instruction of a given type is issued/completed, and the source instructions of these side-effects or crossovers from the SIMD to the scalar domain would have completed as far as data dependence goes.
I disagree. These CNT registers are "watched" by the top-level logic that controls thread arbitration.

The issuing instruction (that increments CNT) always completes immediately. The text talks about "completion" in the sense of data being returned, not in terms of thread state. The thread will be switched-out upon S_WAITCNT.

For example the shader can execute a series of 4 LDS fetches then issue S_WAITCNT for LGKM and then the thread will be switched-out. (R700 also hides LDS access latency by grouping-up LDS accesses - same as for TEX. Evergreen and NI don't hide LDS latency.)

NOPs or independent instructions in the table 4.2 cases probably prevent undefined behavior. The ISA document is stating that the hardware won't catch these dependences, so it will blindly issue even if the end results make no sense. The wait counters wouldn't catch these because the originating instructions have completed.
Yes, this is purely a dependency fault in terms of timeliness of register content (or variability in the "path length" through logic to access that register, causing the timing problem).

But S_WAITCNT implies sleeping the thread until the counter hits the value specified, so I don't see how your final sentence would apply.

The LDS is a common resource between wavefronts, and each operation can vary in duration based on bank conflicts. One common thread for these wait states is that they are used in cases where the CU's control logic has deferred some authority to outside scheduling and arbitration, be it the LDS, GDS, the graphics pipeline at the other end of the export bus, the scalar cache controller, and vector memory pipeline.
The SIMDs seem to be simpler and have little self-management capability.
And cases 5, 6 and 7 in table 4.2 consist of VALU versus non-VALU external dependencies ;)

But yes, the SIMDs are dumb. Dumber than R600 SIMDs.

There are several awkward points about SI that I'm curious about. One is that S_WAITCNT is severely constrained if you want to use scalar memory reads. These, of all the operations, can complete out of order. It doesn't sound like the architecture will stop you from using a different wait value, but who knows what it'll do.
Yes the programmer (i.e. the shader compiler, 99.99999% of the time) is supposed to know that 0 is the only meaningful wait value.

Scalar memory reads (from constant memory) will have to be "bunched up" over several sets, if you want to do lots and lots of them.

Another corner case is VALU instructions with an LDS operand. How are these handled, and what are the implications of sourcing from something that is not fixed latency?
Is this one reason why there isn't a wait count available for VALU instructions?
Presumably you're referring to bank conflicts during LDS access. LGKM_CNT is only decremented once the variable latency caused by the bank conflict has fully passed.

I dare say a good way to think of these S_WAITCNT instructions is that they are protecting the VGPRs. That's all they do, ensuring that VGPRs are permanently coherent. With that model in mind, it becomes clear that table 4.2 consists entirely of non-VGPR operands for shader instructions and expresses dependency constraints that don't relate to VGPRs.

I still can't think what VALU_CNT could be used for, though.
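That "S_WAITCNT protects VGPRs" model splits hazards into two buckets, which could be restated as a sketch (hypothetical classifier with made-up operation names; it only restates the mental model from the post above, not actual hardware behaviour):

```python
# Variable-latency operations whose results are tracked by a wait
# counter are covered by S_WAITCNT; fixed-latency crossovers between
# the vector and scalar domains (the table 4.2 cases) are not
# counter-tracked and must be spaced out with NOPs or independent
# instructions. Operation names here are illustrative only.
COUNTER_TRACKED = {"vmem_load", "lds_read", "smem_read", "export"}
FIXED_LATENCY_CROSSOVER = {"valu_writes_sgpr_then_salu_reads",
                           "salu_writes_vcc_then_valu_reads"}

def required_interlock(op):
    """Which software interlock the compiler must emit for `op`,
    under the VGPR-protection model of the wait counters."""
    if op in COUNTER_TRACKED:
        return "s_waitcnt"
    if op in FIXED_LATENCY_CROSSOVER:
        return "nops_or_independent_instructions"
    return "none"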

Another place that AMD may have been vacillating is in the HW_ID table. The SIMD and CU identifier bits are separated by two reserved bits.
Does this mean they reserve the possibility for a combination of increasing SIMDs per CU or CUs per array?
I don't think it means anything. Bits 30/31 are also not mentioned. This is just stuff documented for debugging as far as I can tell.

I'm sure a future chip will have more than 2048 VALU lanes.

Hopefully AMD will work out how to get past a mere two shader engines and to resolve the coherency problems that multiple shader engines create. If NVidia can tackle that problem there's no reason why AMD can't.
 