AMD: Southern Islands (7*** series) Speculation/Rumour Thread

I disagree. These CNT registers are "watched" by the top-level logic that controls thread arbitration.
I guess I'm trying to follow AMD's terminology choice in stating that completion is when the data an instruction is meant to write back has been written back and the counter is decremented.

The issuing instruction (that increments CNT) always completes immediately. The text talks about "completion" in the sense of data being returned, not in terms of thread state. The thread will be switched-out upon S_WAITCNT.
Writeback would be the last stage prior to instruction completion, unless there's an exception detect stage for something like ECC failures.
I am somewhat unclear on what you mean by "switched out". The ten hardware threads per SIMD don't switch out in the same sense as the phrase is used for standard CPU cores and thread context switches. The wavefront is just flagged as not being ready when the SIMD's issue cycle comes back around. The wavefront is still in the group of 10, with hopefully up to 9 others ready to issue in its stead.

But S_WAITCNT implies sleeping the thread until the counter hits the value specified, so I don't see how your final sentence would apply.
None of the counters other than VALU_CNT, if it were being used, would have bearing on the scenarios in table 4.2.
Even if they were used, these scenarios involve instructions that would be considered complete, it's just that a few side effects like flag changes don't propagate as quickly as result forwarding.
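To make the counter mechanics concrete, here's a rough sketch in GCN assembly of how I read it (the registers and offset are made up for illustration):

    s_buffer_load_dword s8, s[0:3], 0x10   ; issue increments LGKM_CNT
    v_mul_f32           v3, v1, v2         ; independent VALU work can still be issued
    s_waitcnt           lgkmcnt(0)         ; wavefront flagged not-ready until LGKM_CNT drops to 0
    v_mul_f32           v4, s8, v4         ; only now is s8 safe to read

Until the counter decrements, the SIMD simply picks one of its other wavefronts on each of its issue cycles.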

Presumably you're referring to bank conflicts during LDS access. LGKM_CNT is only decremented once the variable latency caused by the bank conflict has fully passed.
I'm under the impression that LGKM_CNT is not global to the entire CU.
Rather, this is tracking the issued instructions for a given wavefront, and that wavefront can't account for what the other 39 wavefronts in the CU might have in flight for the LDS at the time.
The LDS-direct read is also not listed as being one of the things tracked by LGKM_CNT, which may make S_WAITCNT for LDS, and any possible LDS reads feeding a vector instruction, as restrictive as it is for scalar cache reads.

I dare say a good way to think of these S_WAITCNT instructions is that they are protecting the VGPRs. That's all they do, ensuring that VGPRs are permanently coherent. With that model in mind, it becomes clear that table 4.2 consists entirely of non-VGPR operands for shader instructions and expresses dependency constraints that don't relate to VGPRs.
If VGPRs were all that mattered, S_WAITCNT wouldn't be so restrictive with scalar memory reads.
4.2 contains a bunch of corner cases where side-effects and hardware state updates fall outside the regular data forwarding paths. The rest of the cases rely on the 4-cycle gap between instruction issues to allow results to flow over.

I still can't think what VALU_CNT could be used for, though.
With the current ratios in the architecture, possibly very little.

I don't think it means anything. Bits 30/31 are also not mentioned. This is just stuff documented for debugging as far as I can tell.
It's two reserved bits right in the middle of the state encoding. The relationship of each field to the ones above and below is pretty well established.
If during the SI design stage they hadn't yet decided on the maximum number of SIMDs, they probably had to allocate bits in advance for a product that hadn't been finalized.

Hopefully AMD will work out how to get past a mere two shader engines and to resolve the coherency problems that multiple shader engines create. If NVidia can tackle that problem there's no reason why AMD can't.
There's "can't" and then there's "won't", however I am hopeful that something at that level is going to evolve more than we've seen over the last several generations.
Another GPU generation will have to justify its existence on 28nm, with Tahiti already gobbling up much of the TDP room before a successor has the chance to improve on it.
Adding a few more CUs might not cut it.
 
I think what you're missing is two key concepts:

  1. S_WAIT is the sole mechanism by which a thread cedes control to other threads that are waiting.
  2. VGPRs are the target of all operations that result in a thread ceding (i.e. sleeping).
I guess I'm trying to follow AMD's terminology choice in stating that completion is when the data an instruction is meant to write back has been written back and the counter is decremented.
It can be hundreds of cycles before the result is ready in VGPRs.
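As a sketch of what that looks like for a vector memory read (the buffer descriptor in s[4:7] and the register numbers are placeholders, not anything from the doc):

    buffer_load_dword v4, v0, s[4:7], 0 offen  ; issue increments VMCNT; data can be hundreds of cycles away
    v_add_f32         v5, v1, v2               ; anything independent can still run in the meantime
    v_add_f32         v6, v2, v3
    s_waitcnt         vmcnt(0)                 ; sleep this wavefront until the load has landed in v4
    v_add_f32         v7, v4, v5               ; first use of the loaded value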

Writeback would be the last stage prior to instruction completion, unless there's an exception detect stage for something like ECC failures.
I am somewhat unclear on what you mean by "switched out". The ten hardware threads per SIMD don't switch out in the same sense as the phrase is used for standard CPU cores and thread context switches. The wavefront is just flagged as not being ready when the SIMD's issue cycle comes back around. The wavefront is still in the group of 10, with hopefully up to 9 others ready to issue in its stead.
I have no idea why you are using CPU-think when we're discussing GPUs that hide latency by sleeping threads.

I'm under the impression that LGKM_CNT is not global to the entire CU.
Rather, this is tracking the issued instructions for a given wavefront, and that wavefront can't account for what the other 39 wavefronts in the CU might have in flight for the LDS at the time.
Obviously, S_WAITCNT is per thread.

The LDS-direct read is also not listed as being one of the things tracked by LGKM_CNT,
That's because it's almost as fast (if not as fast, I'm unclear) as a read from VGPRs. It's reading a single value and broadcasting it to all 64 work items in a thread.

If you look at section 9.3 you will see there are three types of LDS access. The first two, LDS_DIRECT and Parameter, are "broadcast" reads, i.e. the same value(s) are broadcast to all work items.

Parameter data is contiguously arranged in LDS, so this is a burst read from LDS into VGPRs via VALU and so acts directly, without incurring "variable" latency that would warrant sleeping the thread.
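If it helps, this is roughly what the parameter-interpolation path looks like in the ISA; the attribute index, the M0 setup and the registers here are just placeholders, not taken from the doc:

    s_mov_b32       m0, s2            ; M0 supplies the parameter area in LDS for interpolation
    v_interp_p1_f32 v2, v0, attr0.x   ; first half of the interpolation, reading parameter data straight from LDS
    v_interp_p2_f32 v2, v1, attr0.x   ; second half with the j coordinate; no S_WAITCNT is involved here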

which may make S_WAITCNT for LDS, and any possible LDS reads feeding a vector instruction, as restrictive as it is for scalar cache reads.
Indexed/atomic LDS (and GDS) operations and constant cache scalar reads are both handled by LGKM (that's why it's L, G and K). M = message but I don't know anything about messages.
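A rough illustration of the two kinds of reads sharing the one counter (the addresses and registers are invented):

    s_mov_b32      m0, -1               ; usual idiom: set the LDS size limit in M0
    ds_read_b32    v1, v0               ; LDS read: increments LGKM_CNT
    s_load_dwordx4 s[8:11], s[0:1], 0x0 ; constant cache (K) read: also increments LGKM_CNT
    s_waitcnt      lgkmcnt(0)           ; one wait covers both kinds of access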

If VGPRs were all that mattered, S_WAITCNT wouldn't be so restrictive with scalar memory reads.
Yes they would, because reads of scalars from constant cache can be indexed and so different per work item. The target is still the same: VGPRs.

4.2 contains a bunch of corner cases where side-effects and hardware state updates fall outside the regular data forwarding paths. The rest of the cases rely on the 4-cycle gap between instruction issues to allow results to flow over.
Agreed. Bear in mind that these paths are non-VGPR paths, which are generally all that VALU instructions deal with.

It's two reserved bits right in the middle of the state encoding. The relationship of each field to the ones above and below is pretty well established.
If during the SI design stage they hadn't yet decided on the maximum number of SIMDs, they probably had to allocate bits in advance for a product that hadn't been finalized.
I think we can be pretty sure they don't know how many SIMDs they'll end up with eventually.

There's "can't" and then there's "won't", however I am hopeful that something at that level is going to evolve more than we've seen over the last several generations.
Another GPU generation will have to justify its existence on 28nm, with Tahiti already gobbling up much of the TDP room before a successor has the chance to improve on it.
Adding a few more CUs might not cut it.
I'm "naively" hopeful that stacking/interposer tech will end up saving die-space and power that's "wasted" on GDDR5/6 etc. type tech. But that's beyond 28nm GPUs, I suspect.
 
I think what you're missing is two key concepts:

  1. S_WAIT is the sole mechanism by which a thread cedes control to other threads that are waiting.
  2. VGPRs are the target of all operations that result in a thread ceding (i.e. sleeping).
Threads don't have the ability to seize control, which makes ceding it a little pointless.
The CU is going to try to issue 5-6 instructions from a SIMD during that SIMD's issue cycle, and a wavefront can only provide one of them.
S_WAIT is necessary to hide the fact that the hardware is incapable of detecting when a long-latency operation has completed, and that the hardware may or may not issue blindly over a dependence, based on whether the CU gets back to that wavefront in time.

S_WAIT always applies to VGPR-targeted ops, unless it's a scalar read, or sendmsg (not necessarily the only ones, in case I've missed something). The option to have a vector memory read write directly to the LDS is a case I'd be curious about, since the logic in the ISA doc indicates instruction issue should increment VMCNT, but the destination is defined to be the LDS and the count decrements with a write to a VGPR. Maybe it just treats it as if a VGPR is the destination.

It can be hundreds of cycles before the result is ready in VGPRs.
And until then AMD doesn't consider the instruction complete, per table 3.1.
It has to store somewhere the instruction's destination identifiers, but it doesn't have the interlocks necessary to prevent a later instruction from reading too early. If the instruction were truly complete and gone, there would be no telling where to put the read data when it got back.

I have no idea why you are using CPU-think when we're discussing GPUs that hide latency by sleeping threads.
It's a question intended for clarity's sake, since GPU discussions often take established terms and mangle them thoroughly, and we already had the term "sleeping".


That's because it's almost as fast (if not as fast, I'm unclear) as a read from VGPRs. It's reading a single value and broadcasting it to all 64 work items in a thread.
The LDS is a heavily banked single-ported structure. It is generally fast, but not fixed latency.
An LDS read is available as a data operand even for the non-parameter instructions.

Parameter data is contiguously arranged in LDS, so this is a burst read from LDS into VGPRs via VALU and so acts directly, without incurring "variable" latency that would warrant sleeping the thread.
This guarantees max bandwidth from the LDS so that the instructions provided in the ISA for interpolation don't become unexpected performance landmines, but how does it guarantee that the LDS isn't busy when this instruction is issued?

Yes they would, because reads of scalars from constant cache can be indexed and so different per work item. The target is still the same: VGPRs.
A constant fetch is a scalar memory read, and the doc does discuss scalar memory reads that target SGPRs. An out-of-order SGPR completion can wreck a shader as badly as an out-of-sequence read into the VGPRs, even if that data is never written to a VGPR.
Chapter 7 discusses scalar memory reads to scalar registers, and it states the need for S_WAITCNT=0.
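In other words, something like this pattern (the SGPR numbers are arbitrary):

    s_load_dword s4, s[0:1], 0x0   ; two scalar reads in flight,
    s_load_dword s5, s[0:1], 0x4   ; which may return out of order
    s_waitcnt    lgkmcnt(0)        ; hence the requirement to wait for 0 before touching s4/s5
    s_add_u32    s6, s4, s5        ; the consumer never touches a VGPR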
 
I'd prefer if the graphs had actual FPS numbers on them, but AMD don't like to do that.

If anyone with a 7970 wants to do a comparison and has a similar system I'd like to see how much better GCN handles the game.

I have a 2500K at 4.4GHz and, as I said earlier, a Gigabyte 670 Windforce.

Unfortunately, I cannot compete, having only a measly four-year-old dual core to power my HD 7970.
 
Unfortunately, I cannot compete, having only a measly four-year-old dual core to power my HD 7970.


I'm just posting this as info; I'm not responsible if his bench is inaccurate (the Steam demo doesn't have the benchmark, so I can't check).

From what I see in his sig: 2600K + 7970 @ 1100MHz.

http://forums.guru3d.com/showpost.php?p=4395030&postcount=19

Now, seeing the type of game it is, I would rather use Fraps (and let's be honest, you don't strictly need the highest fps to play it well).
 
I really miss the days when my 4870X2 cost 450 euros.

I sure as hell ain't making double the money I made back then.

The 4870 was in a lower class than the 7970 though. I bought a 4870X2 as well, my first (and probably last) dual card. I got it because, with our dollar being stronger for the first time back then, the price was the same as what I usually pay for video cards, and I wanted 1GB VRAM (1GB 4870s didn't show up for a few months) for future proofing; if I had to have insane power consumption and heat, then I wanted insane performance, not GTX280 level. Anyway, the thing died after a year, and my 5870 that replaced it, while slightly slower in benches, gave a nicer gaming experience.

I'm glad AMD are back to covering the high end with single GPU chips myself. Although it seemed to burn them a little this time...
 
The 4870 was in a lower class than the 7970 though. I bought a 4870X2 as well, my first (and probably last) dual card. I got it because, with our dollar being stronger for the first time back then, the price was the same as what I usually pay for video cards, and I wanted 1GB VRAM (1GB 4870s didn't show up for a few months) for future proofing; if I had to have insane power consumption and heat, then I wanted insane performance, not GTX280 level. Anyway, the thing died after a year, and my 5870 that replaced it, while slightly slower in benches, gave a nicer gaming experience.

I'm glad AMD are back to covering the high end with single GPU chips myself. Although it seemed to burn them a little this time...

How did they burn themselves?
 
So who would like some Southern Islands die shots? The blog post this is attached to won't be going up for a few hours, but there's no harm in letting you guys see these now.

Thanks to AMD's GCN Hot Chips presentation we have some pretty good die shots of SI. These are from the same source as the earlier shots, but in much better quality (~600px x 600px for Tahiti).

Tahiti

Pitcairn

Cape Verde
 
Any larger? :D

We already have a rather good shot of Tahiti around here, but the thing is so dense you can hardly make out the juicy details.
 
Any larger? :D

We already have a rather good shot of Tahiti around here, but the thing is so dense you can hardly make out the juicy details.
To the best of my knowledge this is much better than the last Tahiti die shot, but Google may be letting me down here.

As for the size of the images, sorry, but that's as large as I have them. If you want anything any better you'll have to get in line and start begging AMD like the rest of us.:p
 
To the best of my knowledge this is much better than the last Tahiti die shot, but Google may be letting me down here.

As for the size of the images, sorry, but that's as large as I have them. If you want anything any better you'll have to get in line and start begging AMD like the rest of us.:p
Here you go. ;)

Edit:
You should rotate your Pitcairn shot so the orientation of the CUs matches that of the other GPUs.

Edit2:
If I interpret the die shots correctly, a complete CU, including the 4 TMUs and its share of the common stuff (L1 sD$, I$), measures about 4.4 to 4.5 mm² in Pitcairn, while it measures 5.5 mm² in Tahiti. The TMUs of each CU, including the vector L1 vD$, measure ~1.5 (Pitcairn, really blurry) to 1.7 mm² (that's what I got from the old Tahiti shot). That means we can estimate that the increased DP capabilities and the ECC in Tahiti cost about 1 mm² per CU, and the ECC handling in the TMUs for the vector L1$ maybe 0.2 mm² (with a very high uncertainty because of the quality of the pictures).
 
So who would like some Southern Islands die shots? The blog post this is attached to won't be going up for a few hours, but there's no harm in letting you guys see these now.

Thanks to AMD's GCN Hot Chips presentation we have some pretty good die shots of SI. These are from the same source as the earlier shots, but in much better quality (~600px x 600px for Tahiti).

Tahiti

Pitcairn

Cape Verde
Thanks!
 
How did they burn themselves?

NV have adopted the small die strategy while AMD have shifted back a little with the launch of the 6900/7900 series. NV seem to have certainly won over the general gamer community as their smaller GK104 is considered 'better' than Tahiti by most around the net (not me, I just bought a 7950 for a killer $299AUD). They're able to sell a card with a smaller die, less memory, and a narrower bus for more money. AMD never managed that with their small die strategy.

To be clear, they burnt themselves by going back on their strategy somewhat and then their competitor did the exact same strategy but executed much more successfully at the same time. If they'd gone for a slightly bigger Pitcairn with faster memory, like their old strategy ala 4800/5800 series, perhaps they could have had a very similar product to GK104. Of course NV probably would have still marketed it better, with turbo and the like winning over consumers by winning reviews. As a consumer I prefer the path AMD have taken, but what benefits me in being able to buy a card I see as far more future proof than the 660Ti for $50 less, doesn't necessarily benefit AMD. ;)
 
NV have adopted the small die strategy while AMD have shifted back a little with the launch of the 6900/7900 series. NV seem to have certainly won over the general gamer community as their smaller GK104 is considered 'better' than Tahiti by most around the net (not me, I just bought a 7950 for a killer $299AUD). They're able to sell a card with a smaller die, less memory, and a narrower bus for more money. AMD never managed that with their small die strategy.

To be clear, they burnt themselves by going back on their strategy somewhat and then their competitor did the exact same strategy but executed much more successfully at the same time. If they'd gone for a slightly bigger Pitcairn with faster memory, like their old strategy ala 4800/5800 series, perhaps they could have had a very similar product to GK104. Of course NV probably would have still marketed it better, with turbo and the like winning over consumers by winning reviews. As a consumer I prefer the path AMD have taken, but what benefits me in being able to buy a card I see as far more future proof than the 660Ti for $50 less, doesn't necessarily benefit AMD. ;)

Actually, Nvidia's decision is more a response to AMD's 7900 GCN than a strategy. They never planned to use the same die for the 680, 670, 660 Ti, 660 and maybe even the 650; it is not a strategy. AMD released their cards first and Nvidia adapted the performance of the chips they could produce right now to be competitive at those performance levels and price ranges. Seriously, I can't imagine they didn't plan the GK100/104/106 a long time ago, and even now, look at the time they need to release cards that are ultimately just cut-down SKUs of the GK104, with non-reference 670s faster than the 680 and the 660 Ti faster than the 670.
 