AMD: Sea Islands R1100 (8*** series) Speculation/ Rumour Thread

If transcendentals are a deciding factor here, wouldn't Cayman also have to loop over three out of its four VLIW lanes?
FWIW, HD 5870 sustains the following GFLOPS: 660/1030/1470/1820/1520. Wouldn't Cayman have to be much worse?

edit: We should be discussing this in the SI thread btw.
 
My theory is that GCN is running slow, compute-specific transcendentals designed for maximum accuracy. Prior GPUs are running graphics transcendentals, which are "single cycle" and approximate.
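For what it's worth, OpenCL exposes both flavours side by side, so the distinction is easy to picture. A minimal sketch (the kernel and the residual check are purely illustrative, not anything from the benchmark):

// Sketch only: the spec holds sin()/cos() to tight ULP bounds ("compute"
// precision), while native_sin()/native_cos() have implementation-defined
// precision and can map to the fast approximate graphics path. Building
// with -cl-fast-relaxed-math likewise lets the compiler relax the plain calls.
__kernel void trig_gap(__global const float *x, __global float *gap)
{
    int i = get_global_id(0);
    float precise = sin(x[i]) * cos(x[i]);                // accurate, slower
    float fast    = native_sin(x[i]) * native_cos(x[i]);  // approximate, fast
    gap[i] = precise - fast;   // residual exposes the accuracy trade-off
}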

What are GCN's numbers?
 
I doubt that. I don't think AMD will drop hardware transcendentals when the competition is apparently keeping a fair chunk of them around.
 
I didn't say GCN isn't capable of graphics-transcendentals. I'm hypothesising that the compiler interprets the transcendentals in this IL shader as compute transcendentals, requiring highest precision, causing lowered performance.
 
My theory is that GCN is running slow, compute-specific transcendentals designed for maximum accuracy. Prior GPUs are running graphics transcendentals, which are "single cycle" and approximate.

What are GCN's numbers?
I don't think I understand what you're implying here. Could you elaborate a bit, i.e. why would GCN not use the already implemented transcendental routines (which are, btw, partly more precise than in previous generations), and do what exactly instead?

edit:
I didn't say GCN isn't capable of graphics-transcendentals. I'm hypothesising that the compiler interprets the transcendentals in this IL shader as compute transcendentals, requiring highest precision, causing lowered performance.

Ah, now I think I'm starting to get a grasp of what you mean. But - if that's the case - wouldn't the compiler decide in the same way for Cayman and possibly Cypress as well?
 
Ah, now I think I'm starting to get a grasp of what you mean. But - if that's the case - wouldn't the compiler decide in the same way for Cayman and possibly Cypress as well?
My entire hypothesis is "not necessarily". Documentation and/or tests are required.

Pixel shader, IL and OpenCL versions of this algorithm could be doing different things (working with different degrees of precision) dependent upon the chip. Until we get the manuals (IL for GCN and GCN hardware ISA) we won't know for sure.

OpenCL's precision specification for transcendentals is much looser than CUDA's, for example.
 
I didn't say GCN isn't capable of graphics-transcendentals. I'm hypothesising that the compiler interprets the transcendentals in this IL shader as compute transcendentals, requiring highest precision, causing lowered performance.

That is even more odd. If the code needs accurate transcendentals, why would the compiler for 770 generate approximate ones? If true, that would imply that the compiler for 770 was totally brain damaged.

Chances are that the accurate version of a transcendental operation starts from the approximate one and then refines it. In that case, slightly less approximate transcendentals for GCN would lead to lower performance. I haven't looked at the code, but unless it is a synthetic, there is no reason to think that GCN should scale like Dhrystone.
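To illustrate the "start approximate, then refine" idea, here's a sketch using reciprocal as a stand-in for a real transcendental (purely hypothetical code, not from the benchmark):

// Sketch only: start from the cheap hardware estimate and spend extra ALU
// work refining it. Each Newton-Raphson step roughly doubles the number of
// correct bits, so extra accuracy directly costs extra cycles.
float refined_recip(float x)
{
    float r = native_recip(x);     // fast, approximate estimate
    return r * (2.0f - x * r);     // one Newton-Raphson refinement step
}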
 
Can someone speculate what die size it could be based on the specs below, or how much bigger it would be than the AMD Radeon HD 7970?

Sea Islands (Radeon HD 8970)
28nm tech
900MHz GPU
3072 unified shaders (GCN "1D" architecture)
48 ROPs
192 TMUs
GDDR5 at 1375MHz on a 512-bit bus (352.0 GB/s bandwidth)

1. The HD 8970 will use the same TSMC 28nm process.
2. This review shows that Tahiti has ~35% bandwidth headroom left.
3. 7950 = 7 CU-Arrays (28 CU), 7970 = 8 CU-Arrays (32 CU).

I believe that AMD will add only 2 CU-Arrays (+30%) to its "Tahiti-next".
HD 8950 = 9 CU-Arrays (36 CU), 8970 = 10 CU-Arrays (40 CU).
Die size: ~420mm² (roughly R600-sized)

The next big jump in performance will happen with the 20-nm process.
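For what it's worth, the quoted memory spec is at least self-consistent: 1375 MHz × 4 (GDDR5 quad data rate) = 5.5 Gbps per pin, and 5.5 Gbps × 512 bit / 8 = 352 GB/s.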
 
Another issue is register allocation. GCN's peak register allocation per work item is half R600's. Brute force N-Body is a perfect place to do maximal vectorisation, i.e. stuffing as many particles into a work item as is possible.

GCN may be over-stuffed and so running at much lower performance, with too few threads per core.
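To make the "stuffing particles into a work item" trade-off concrete, here's a hypothetical brute-force sketch with two bodies per work item (all names are illustrative):

// Sketch only: each work item accumulates forces for TWO bodies instead of
// one. Every extra body roughly doubles the live register state, trading
// occupancy (threads per core) for more ALU work per work item.
// Assumes the global size is n/2 and pos[] holds .xyz position, .w mass.
__kernel void nbody2(__global const float4 *pos, __global float4 *acc, int n)
{
    int i = get_global_id(0) * 2;
    float3 p0 = pos[i].xyz, p1 = pos[i + 1].xyz;
    float3 a0 = (float3)(0.0f), a1 = (float3)(0.0f);
    for (int j = 0; j < n; ++j) {
        float4 q = pos[j];
        float3 d0 = q.xyz - p0, d1 = q.xyz - p1;
        float r0 = native_rsqrt(dot(d0, d0) + 1e-6f);  // softened distance
        float r1 = native_rsqrt(dot(d1, d1) + 1e-6f);
        a0 += d0 * (q.w * r0 * r0 * r0);
        a1 += d1 * (q.w * r1 * r1 * r1);
    }
    acc[i]     = (float4)(a0, 0.0f);
    acc[i + 1] = (float4)(a1, 0.0f);
}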

Horrible register allocation is a long-standing problem for AMD. GCN may be extra-horrible in its youth. It should improve...
Except that GCN registers are scalar, so you don't need to worry about packing registers into vec4 registers and dealing with port restrictions as in previous chips. In general, GCN uses fewer registers than EG/NI, but there's still some tuning to do.

And while peak register space is smaller per thread, it's much easier to hit peak ALU utilization.
 
And while peak register space is smaller per thread, it's much easier to hit peak ALU utilization.
Can you give any insight into the performance seen here:

http://www.hardwarecanucks.com/foru...s/49646-amd-radeon-hd-7970-3gb-review-22.html

I'm looking at the MandelS and MandelV benchmarks. On HD6970 MandelS is slower than MandelV. This would appear to imply that S is a scalar implementation, while V is a vector - I don't know where these benchmarks come from, so I can't be sure.

Assuming the scalar/vector significance of these, why is vector slower on HD7970?
 
Thanks. Looking at the code, they are indeed scalar and vector implementations.

Vector's register allocation on HD5870 is 12 (according to a very old version of GPU Shader Analyzer), so I don't think register allocation is an issue.

I suspect the slowdown of vector on Tahiti is due purely to branch incoherence for the 4 pixels that are computed by a single work item.
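To show where that cost comes from, here's a hypothetical float4 inner loop (just the general shape, not the actual benchmark code): the loop can only exit once all four pixels have escaped, so the whole quad pays for its slowest pixel.

// Sketch only: four Mandelbrot pixels per work item. Lanes that escape
// early keep spinning until the slowest lane finishes or max_iter is hit.
__kernel void mandel4(__global const float4 *cre, __global const float4 *cim,
                      __global int4 *iters, int max_iter)
{
    int gid = get_global_id(0);
    float4 cr = cre[gid], ci = cim[gid];
    float4 zr = (float4)(0.0f), zi = (float4)(0.0f);
    int4 count = (int4)(0);
    for (int i = 0; i < max_iter; ++i) {
        float4 zr2 = zr * zr, zi2 = zi * zi;
        int4 alive = (zr2 + zi2) <= (float4)(4.0f);  // -1 where still inside
        if (!any(alive))
            break;                                   // only when ALL have escaped
        count -= alive;                              // subtracting -1 adds 1
        zi = 2.0f * zr * zi + ci;
        zr = zr2 - zi2 + cr;
    }
    iters[gid] = count;
}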

So I don't think this vector versus scalar comparison is of any use in this discussion.
 
It seems Tahiti really likes ComputeMark…

I'd really like to see someone benching graphics cards on a full set of real-life compute workloads. There's so much variability on synthetic benches that it's hard to draw conclusions.
 
It seems Tahiti really likes ComputeMark…
With the exception of Fluid 2D - while much faster than the 6970, it does really poorly compared to the GTX 580 (given the bandwidth and peak flop numbers, you would expect it to be twice as fast as in the other tests). No idea why though.
The Mandel scalar vs. vector issue isn't that dramatic though, as the difference is "only" ~13%. The GTX 580 also seems to show the same pattern, though it is much less pronounced.

Oh and btw is "Sea Islands" for real? Strange since it would be abbreviated SI just like Southern Islands. Might make sense if it is indeed basically the same but sounds a bit odd.
 
It seems to me that it is the code for Fluid 2D which is borked. If Fluid 3D is just a straightforward extension of dimensions, which I think it is, there is no reason for the performance to be so atrocious.
 
The 2D version of the fluid kernel was done for Fermi as a workaround, due to the poor handling of 3D textures by NV's drivers, I think. At least this was the case by the time Fermi launched.
 
given the bandwidth and peak flop numbers, you would expect it to be twice as fast as in the other tests

Southern Islands made huge strides in perf/flop but still isn't quite up to Fermi. I know there are some who aren't interested in such trivia, but it's interesting to note that even after dropping VLIW, SI is still lagging Fermi somewhat in utilization, at least in these tests. nVidia will probably still need upward of 3.5 TFLOPS to compete for the compute crown though.

Note that the Mandelbrot numbers should be even higher on a Tesla card given the artificial double-precision hobbling of Geforces. Will have to wait and see if FirePros have more DP throughput than Radeons.

 
Southern Islands made huge strides in perf/flop but still isn't quite up to Fermi. I know there are some who aren't interested in such trivia, but it's interesting to note that even after dropping VLIW, SI is still lagging Fermi somewhat in utilization, at least in these tests. nVidia will probably still need upward of 3.5 TFLOPS to compete for the compute crown though.
Yes, it's still something like 20-30% behind in perf/flop in these compute tests (hopefully they don't hit some bandwidth limitations). Given the much higher "flop density", that's quite OK I think, though I'm wondering why exactly it is slower.
In the Fluid 2D test, however, it's still doing terribly for some reason.
 
IMHO, perf/flop is a shallow metric. I'd rather measure perf/mm² or perf/W. Since bandwidth and its growth are so limited, perf/bw might have some relevance, but I just don't see how perf/flop matters.
 