Corrected. As a side note, I find it interesting that AMD is doing what Nick has been advocating for future CPUs: running wide vectors on narrower SIMD (i.e. running 64[strike]bytes[/strike] wide vectors on 16 wide SIMD).
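To make the "wide vector on narrow SIMD" point concrete, here is a minimal sketch in Python. The function name `run_wavefront` and the one-pass-per-cycle model are my own simplifications for illustration, not anything from AMD's documentation; it only shows that a 64-wide wavefront on 16 lanes takes four passes.

```python
def run_wavefront(op, a, b, simd_lanes=16):
    """Apply op element-wise to one wavefront, simd_lanes elements per pass.

    Assumption: one pass over the lanes is modelled as one cycle.
    """
    assert len(a) == len(b)
    out = []
    cycles = 0
    for start in range(0, len(a), simd_lanes):
        # Each pass handles the next slice of the wavefront on the narrow SIMD.
        chunk_a = a[start:start + simd_lanes]
        chunk_b = b[start:start + simd_lanes]
        out.extend(op(x, y) for x, y in zip(chunk_a, chunk_b))
        cycles += 1
    return out, cycles


WAVEFRONT_SIZE = 64
a = list(range(WAVEFRONT_SIZE))
b = [2] * WAVEFRONT_SIZE
result, cycles = run_wavefront(lambda x, y: x * y, a, b)
print(cycles)  # 4 passes for 64 elements on 16 lanes
```

With `simd_lanes=64` the same call would finish in one pass, which is the trade-off being discussed: narrower hardware, more cycles per instruction.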
Not really. From the point of view of a divergent wavefront, things might be a bit better due to improved occupancy, but certainly not 4x better.
The question is how much beefier a single CU is than a single Cayman SIMD. The 2 times denser 28nm will probably offset this, but I still think they will lose something in area.
They could cram more Cayman SIMDs inside the chip on 28nm. (6990 works fine for graphics.)
In the best-case scenario a single Cayman SIMD should be equal to a single CU.
I doubt that. They would need to hotclock everything, not just the ALUs. That's different from Fermi, where they can run the schedulers at half the clock. Although I don't think it's the case for the above, how about a hot-clocked ALUs hypothesis breaking your theory above?
^I am really questioning the NI family right now. We do know that NI was supposedly introduced due to the delay of the 32nm node @ TSMC, but I am not so sure now. So I wonder if NI was exactly how it was intended, just on a different node (and maybe with fewer SIMDs). Also, was SI always intended to have the new architecture that was to be launched on the 32nm node? If so, maybe the 32nm delay was a good thing for AMD, so they could have more time to play around with the new SIMD structure. They may have seen the same growing pains as Fermi. By not stumbling at the same time NV did, they may have picked up a bigger customer base.
I guess this sounded a little more like a conspiracy theory than I intended.
Yes, 5 instructions of different types, which have to come from different waves each cycle. That is definitely a bit more than what was traded for the number of operations per issue (4 or 5 max) every 4 cycles.
But you are completely right on the first point: the beefed-up issue capabilities don't change the fact that there is not much dynamic behaviour going on. The instructions are plainly issued in order to a given, predetermined unit with no fancy stuff going on.
GCN seems to guarantee that you won't stall for instruction or RAW latency. Fermi has no such guarantees, hence it has a more complex scoreboarding mechanism.
I don't think there's any such guarantee. If you don't give GCN enough wavefronts to process, it will stall, just like Fermi. The difference seems to be that GCN tracks a single instruction per wavefront and is unable to take advantage of ILP. Fermi's scoreboard allows it to track multiple in-flight instructions per warp; it seems to be a maximum of ~4 (see linked paper).
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
OK, GCN can hide all instruction and RAW latencies with 4 threads/CU, the bare minimum.
Fermi needs more than the bare minimum to hide it all, hence a more complex scoreboard.
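A rough way to put numbers on this back-and-forth is Little's law (work in flight = latency x throughput), which is the framing the linked Volkov slides use. The sketch below is only a toy model: the function name `wavefronts_needed` and every latency and issue-interval figure in it are placeholder assumptions for illustration, not measured values for GCN or Fermi.

```python
import math


def wavefronts_needed(latency_cycles, issue_interval_cycles, ilp=1):
    """Wavefronts (or warps) per SIMD needed to keep it issuing without stalls.

    Little's law: instructions in flight = latency / issue interval.
    Each wavefront is assumed to contribute `ilp` independent instructions.
    """
    in_flight = math.ceil(latency_cycles / issue_interval_cycles)
    return math.ceil(in_flight / ilp)


# Assumed GCN-like case: an instruction occupies the SIMD for 4 cycles and its
# result is ready after those same 4 cycles, so one wavefront per SIMD suffices.
gcn_per_simd = wavefronts_needed(latency_cycles=4, issue_interval_cycles=4)
print(gcn_per_simd * 4)  # 4 SIMDs per CU -> 4 wavefronts per CU

# Assumed Fermi-like case: hypothetical 20-cycle latency, a warp issued every 2 cycles.
print(wavefronts_needed(20, 2))         # 10 warps with no ILP
print(wavefronts_needed(20, 2, ilp=4))  # 3 warps if 4 independent instructions per warp
```

Under these assumed numbers the model reproduces both positions: a design where latency equals the issue interval needs only the bare minimum of wavefronts, while a design with longer latency needs either more warps or a scoreboard that can track several in-flight instructions per warp.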