AMD: Southern Islands (7*** series) Speculation/Rumour Thread

As a side note, I find it interesting that AMD is doing what Nick has been advocating for future CPUs: run wide vectors on narrower SIMDs (i.e. running 64bytes vectors on 16 wide SIMD).
 
As a side note, I find it interesting that AMD is doing what Nick has been advocating for future CPUs: run wide vectors on narrower SIMDs (i.e. running 64[strike]bytes[/strike] wide vectors on 16 wide SIMD).
Corrected. ;)
AMD and Nvidia (with 32 wide vectors, though) have been doing this for ages already. That isn't exactly new.
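For anyone who missed how that works in practice, here's a toy sketch in plain C (the 64-wide wavefront / 16-lane SIMD split is from the GCN presentation; everything else is illustrative): one vector instruction occupies the SIMD for four consecutive cycles.

[code]
#include <stdio.h>

#define WAVE_WIDTH 64   /* logical vector width (one wavefront) */
#define SIMD_WIDTH 16   /* physical ALU lanes                   */

int main(void) {
    float a[WAVE_WIDTH], b[WAVE_WIDTH], c[WAVE_WIDTH];
    for (int i = 0; i < WAVE_WIDTH; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* One 64-wide vector add is executed as four back-to-back passes
       over the 16 physical lanes, one pass per cycle. */
    for (int cycle = 0; cycle < WAVE_WIDTH / SIMD_WIDTH; ++cycle) {
        int base = cycle * SIMD_WIDTH;
        for (int lane = 0; lane < SIMD_WIDTH; ++lane)  /* lanes run in parallel in hardware */
            c[base + lane] = a[base + lane] + b[base + lane];
        printf("cycle %d: elements %2d..%2d done\n", cycle, base, base + SIMD_WIDTH - 1);
    }
    return 0;
}
[/code]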
 
Not really. From the point of view of a divergent wavefront, things might be a bit better due to improved occupancy, but certainly not 4x better.

Not 4x better in the general case, but it goes a long way toward reducing the waste due to divergence. An idle lane in Cayman wastes 1/16th of the available compute resources per clock; with GCN it's only 1/64th.
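To spell out the arithmetic (a trivial C snippet; the 16- and 64-lane figures come from the posts above, the rest is just division):

[code]
#include <stdio.h>

int main(void) {
    /* Fraction of a unit's per-clock compute wasted by one idle lane
       is 1 / (lanes in the scheduling unit). */
    const int cayman_simd_lanes = 16;      /* one VLIW4 SIMD            */
    const int gcn_cu_lanes      = 4 * 16;  /* 4 SIMDs x 16 lanes per CU */

    printf("Cayman: one idle lane wastes 1/%d = %.2f%% per clock\n",
           cayman_simd_lanes, 100.0 / cayman_simd_lanes);
    printf("GCN:    one idle lane wastes 1/%d = %.2f%% per clock\n",
           gcn_cu_lanes, 100.0 / gcn_cu_lanes);
    return 0;
}
[/code]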
 
The question is how much beefier a single CU is than a single Cayman SIMD. The 2x denser 28nm will probably offset this, but I still think they will lose something in area.
They could cram more Cayman SIMDs inside the chip on 28nm (the 6990 works fine for graphics).
In the best-case scenario a single Cayman SIMD should be equal to a single CU :?:
 
The question is how much beefier a single CU is than a single Cayman SIMD. The 2x denser 28nm will probably offset this, but I still think they will lose something in area.
They could cram more Cayman SIMDs inside the chip on 28nm (the 6990 works fine for graphics).
In the best-case scenario a single Cayman SIMD should be equal to a single CU :?:

At what point does scaling VLIW4 ALUs start costing you extra transistors? I don't know much about this kind of stuff, but I doubt it's linear.
 
The question is how much beefier a single CU is than a single Cayman SIMD. The 2x denser 28nm will probably offset this, but I still think they will lose something in area.
They could cram more Cayman SIMDs inside the chip on 28nm (the 6990 works fine for graphics).
In the best-case scenario a single Cayman SIMD should be equal to a single CU :?:

Although I don't think it's the case here, how about a hot-clocked-ALUs hypothesis breaking your theory above? :p
 
The question is how much beefier a single CU is than a single Cayman SIMD. The 2x denser 28nm will probably offset this, but I still think they will lose something in area.
They could cram more Cayman SIMDs inside the chip on 28nm (the 6990 works fine for graphics).
In the best-case scenario a single Cayman SIMD should be equal to a single CU :?:

It's too early to make that declaration, I think. If Cayman does as well against Fermi in BF3 as it does in other titles, then you would be correct. There will be a crossover point where VLIW becomes inefficient, but that time may still be a few years out.
 
Although I don't think it's the case here, how about a hot-clocked-ALUs hypothesis breaking your theory above? :p
I doubt that. They would need to hotclock everything, not just the ALUs. That's different from Fermi, where they can run the schedulers at half the clock.

The changes to the ALUs and the register files are actually going in the direction of reduced complexity. That means either they shave off some cycles of latency (so back-to-back issue of dependent instructions works, as was almost stated on one slide [the wording isn't very clear]) or the clocks can be raised even without hot clocking (at least the ALUs/register files won't hold them back).
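A toy model of why that latency shaving matters for dependent chains (all numbers here are hypothetical, just to show the shape of the trade-off):

[code]
#include <stdio.h>

/* Cycles to run a chain of n dependent ALU ops from a single wave.
   If the result latency exceeds the issue interval, every dependent
   op waits for the result; otherwise back-to-back issue just works. */
static long chain_cycles(long n_ops, int issue_interval, int result_latency) {
    int step = result_latency > issue_interval ? result_latency : issue_interval;
    return n_ops * step;
}

int main(void) {
    long n = 1000;
    printf("latency hidden (4-cycle pipe, 4-cycle cadence):  %ld cycles\n",
           chain_cycles(n, 4, 4));
    printf("latency exposed (8-cycle pipe, 4-cycle cadence): %ld cycles\n",
           chain_cycles(n, 4, 8));
    return 0;
}
[/code]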
 
I like to reflect on the temporal aspects of GPU development. ;)

Not many enthusiasts stop to consider that this stuff wasn't designed recently. In reality it has likely been in the works since the RV770 days. And yet ATI of course works hard to sell us on Cypress and Cayman being the best things since sliced bread. Less than a year ago VLIW4 was the hottest game in town.

Undoubtedly this recent presentation was also, in a way, a smokescreen for the next-next generation of GPU hardware. If this stuff has taped out, it's definitely old news for them.
 
^I am really questioning the NI family right now. We do know that NI was supposedly introduced due to the delay of the 32nm node @ TSMC, but I am not so sure now. So I wonder if NI was exactly as intended, just on a different node (and maybe with fewer SIMDs). Also, was SI always intended to have the new architecture that was to launch @ the 32nm node? If so, maybe the 32nm delay was a good thing for AMD, giving them more time to play around with the new SIMD structure. They may have seen the same growing pains as Fermi. By not stumbling at the same time NV did, they may have picked up a bigger customer base.
I guess this sounded a little more like a conspiracy theory than I intended :)
 
^I am really questioning the NI family right now. We do know that NI was supposedly introduced due to the delay of the 32nm node @ TSMC, but I am not so sure now. So I wonder if NI was exactly as intended, just on a different node (and maybe with fewer SIMDs). Also, was SI always intended to have the new architecture that was to launch @ the 32nm node? If so, maybe the 32nm delay was a good thing for AMD, giving them more time to play around with the new SIMD structure. They may have seen the same growing pains as Fermi. By not stumbling at the same time NV did, they may have picked up a bigger customer base.
I guess this sounded a little more like a conspiracy theory than I intended :)

According to Dave Baumann, Cayman is exactly the same as it would have been on 32nm, but whether it would have been the high-end chip on 32nm is another matter entirely (since it would have been around the size of Barts or so).
 
Got it. That puts my world back together (with regard to AMD's timeline). I guess I may have read too many silly-season posts and confused myself.
 
The scalar unit is something of a second-class citizen, since it doesn't seem capable of writing to memory, and there are only certain ways it can gather data from the SIMD units.
The CU itself is a multi-issue unit that could, with some evolution, become a 5-wide FP coprocessor, allowing a CPU/GPU combo where the CPU can issue to a CU as one kernel, much like the shared FPU on Bulldozer, while the GPU shares its resources.

The memory subsystem and the interconnect are more significant changes than the right turn taken by the execution hardware.
The caches are probably larger per unit of storage, given that they take traffic from multiple directions.
The L1/L2 crossbar would have been changed significantly, adding write capability and writeback support. The crossbar has more clients than Fermi's, and the coherency model sounds more demanding on AMD's chip, though it seems it can be optional.
 
Yes, 5 instructions of different types, which have to come from different waves, each cycle. That is definitely a bit more than what it was traded for: one issue of 4 or 5 operations max every 4 cycles.

I thought branches were already co-issued with VLIW instructions on existing architectures.

But you are completely right on the first point: the beefed-up issue capabilities don't change the fact that there isn't much dynamic behavior going on. The instructions are plainly issued in order to a given, predetermined unit, with no fancy stuff going on.

I'm struggling to understand how this is any different or less dynamic than what Fermi does. GCN seems to be doing the same thing, except that it only has 10 wavefronts to choose from per SIMD instead of 24. The issue logic actually seems to be a lot more complex than Fermi's, which can only dispatch one instruction per clock (or two in the case of GF10x).
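For what it's worth, here is how I read the in-order, round-robin issue being debated above, as a toy C model (the 4-SIMD rotation and 10 waves per SIMD are from the slides; the wave selection policy is my assumption):

[code]
#include <stdio.h>

#define SIMDS          4   /* SIMDs per CU, serviced in strict rotation */
#define WAVES_PER_SIMD 10  /* wavefronts an issue slot can pick from    */

int main(void) {
    int next_wave[SIMDS] = {0};  /* toy wave selection: plain round robin */

    for (int cycle = 0; cycle < 8; ++cycle) {
        int simd = cycle % SIMDS;  /* one SIMD considered per cycle */
        int wave = next_wave[simd];
        next_wave[simd] = (wave + 1) % WAVES_PER_SIMD;
        /* The chosen wave's next instruction goes, in program order,
           to whatever unit its type dictates; no reordering. */
        printf("cycle %d: SIMD %d issues wave %d's next in-order op\n",
               cycle, simd, wave);
    }
    return 0;
}
[/code]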
 
GCN seems to guarantee that you won't stall on instruction or RAW (read-after-write) latency. Fermi has no such guarantee, hence its more complex scoreboarding mechanism.
 
GCN seems to guarantee that you won't stall on instruction or RAW (read-after-write) latency. Fermi has no such guarantee, hence its more complex scoreboarding mechanism.

I don't think there's any such guarantee. If you don't give GCN enough wavefronts to process, it will stall, just like Fermi. The difference seems to be that GCN tracks a single instruction per wavefront and is unable to take advantage of ILP. Fermi's scoreboard allows it to track multiple in-flight instructions per warp; it seems to be a maximum of ~4 (see the linked paper).

http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
 
I don't think there's any such guarantee. If you don't give GCN enough wavefronts to process, it will stall, just like Fermi. The difference seems to be that GCN tracks a single instruction per wavefront and is unable to take advantage of ILP. Fermi's scoreboard allows it to track multiple in-flight instructions per warp; it seems to be a maximum of ~4 (see the linked paper).

http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

OK, GCN can hide all instruction and RAW latencies with 4 waves/CU, the bare minimum. Fermi needs more than the bare minimum to hide it all, hence a more complex scoreboard.
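Here's the arithmetic behind that claim, with the ALU latency as an explicit assumption (nothing in the slides confirms the exact figure):

[code]
#include <stdio.h>

int main(void) {
    /* One wave on each of the 4 SIMDs means a given wave gets an issue
       slot at most every 4th cycle, and a 64-wide vector op occupies
       its 16-lane SIMD for 4 cycles anyway. If the ALU result latency
       is <= 4 cycles (assumption!), the previous op is complete before
       the wave's next slot, so nothing stalls. */
    int wave_issue_interval = 64 / 16;  /* cycles between a wave's issue slots */
    int assumed_alu_latency = 4;        /* hypothetical GCN ALU latency        */

    printf("stall-free with 1 wave per SIMD (4 waves/CU): %s\n",
           assumed_alu_latency <= wave_issue_interval ? "yes" : "no");
    return 0;
}
[/code]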
 
OK, GCN can hide all instruction and RAW latencies with 4 waves/CU, the bare minimum.

How are you defining bare minimum: pipeline depth? I haven't seen anything conclusive about GCN's ALU pipeline indicating that it's only 4 cycles. The "vector back-to-back instruction issue" in the slides could be referring to the round-robin issue, not necessarily back-to-back issue from the same wave.

Fermi needs more than the bare minimum to hide it all, hence a more complex scoreboard.

You don't need a complex scoreboard if you're relying solely on TLP. Fermi's additional complexity comes from a few things:

1. Instruction issue runs at warp execution speed; on GCN it runs at 4x wavefront execution speed, so Fermi needs 4x the number of dispatchers to feed an equivalent number of SIMDs.
2. The ALU pipeline is deeper, so the "bare minimum" number of warps required for latency hiding is higher.
3. Multiple instructions can be in flight from the same warp.

The scoreboarding is only necessary for #3. It actually lets Fermi get away with fewer warps than the "bare minimum" would otherwise suggest.
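A Little's-law sketch of that trade-off, in the spirit of the Volkov talk linked above (the latency and issue-rate numbers are illustrative, not measured Fermi figures):

[code]
#include <stdio.h>

int main(void) {
    /* In-flight work needed to hide latency = latency x issue rate.
       With only 1 instruction in flight per warp (pure TLP), warps
       needed = latency / issue interval; tracking ~4 in-flight
       instructions per warp (point 3) cuts that by up to 4x. */
    int alu_latency    = 20;  /* illustrative ALU latency, cycles        */
    int issue_interval = 2;   /* illustrative cycles per issue, per SIMD */
    int ilp            = 4;   /* max in-flight instructions per warp     */

    int warps_tlp_only = (alu_latency + issue_interval - 1) / issue_interval;
    int warps_with_ilp = (warps_tlp_only + ilp - 1) / ilp;

    printf("warps needed, 1 op in flight each:  %d\n", warps_tlp_only);
    printf("warps needed, %d ops in flight each: %d\n", ilp, warps_with_ilp);
    return 0;
}
[/code]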
 