AMD: R9xx Speculation

Really?

Look at DP throughput, and do some analysis on slide 70, it seems there's something more to come...

2 DP MUL or ADD, but only 1 DP MAD/FMA per clock? It seems I was right when speculating about VLIW4 = half-rate DP with semi-specialized, symmetrical units, disabled on Radeons only for product segmentation.

I don't see why they'd go higher* than 1:4 DP:SP ratio. What good is it except for marketing? I would prefer increasing SP throughput with DP naturally increasing at the same time.

* Closer to one.
 
It's funny how much work Dave Baumann's site gives him! I suspect he would be willing to review this card himself if he weren't its product manager!! ;)

As far as I know, Dave Baumann hasn't been affiliated with Beyond3D since he moved to work at ATI/AMD.
 
I don't see why they'd go higher* than 1:4 DP:SP ratio. What good is it except for marketing? I would prefer increasing SP throughput with DP naturally increasing at the same time.
For GPGPU, it would mean a considerable lead (>=1.5TFlops).

As for the 1:2 DP MAD/FMA ratio, with 1:2 ADD and 1:2 MUL DP throughput there's no reason to limit MAD/FMA throughput to 1:4 since "simple" optimisation gives 1:2 rate for free.
 
For GPGPU, it would mean a considerable lead (>=1.5TFlops).

As for the 1:2 DP MAD/FMA ratio, with 1:2 ADD and 1:2 MUL DP throughput there's no reason to limit MAD/FMA throughput to 1:4 since "simple" optimisation gives 1:2 rate for free.

But it would seem likely to me that going from, say, 1 DP, 2 SP TFLOPS to 1 DP, 4 SP TFLOPS would also not require that much extra hardware.

Edit: Except if they are bandwidth limited...
 
Look at DP throughput, and do some analysis on slide 70, it seems there's something more to come...

2 DP MUL or ADD, but only 1 DP MAD/FMA per clock? It seems I was right when speculating about VLIW4 = half-rate DP with semi-specialized, symmetrical units, disabled on Radeons only for product segmentation.
Besides, Cayman will do only ADDs at a 1:2 ratio, and MUL/MAD/FMA at 1:4. That's most probably just an error in the slide that got carried over from the Cypress presentation (which had the same misleading "2 64 Bit ADD or MUL" in it, though it is true only for ADD).
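To make the two readings of the slide easier to compare, here is a minimal sketch of the ratio arithmetic. It assumes one VLIW4 unit issues 4 SP operations per clock; that figure, and the names `SP_OPS_PER_CLOCK`, `slide_reading` and `corrected_reading`, are illustrative assumptions rather than anything taken from AMD's material.

```python
from fractions import Fraction

# Assumption: one VLIW4 unit issues 4 single-precision ops per clock.
SP_OPS_PER_CLOCK = 4

# DP ops per clock per VLIW4 unit under each reading discussed above.
slide_reading = {"ADD": 2, "MUL": 2, "FMA": 1}      # slide taken literally
corrected_reading = {"ADD": 2, "MUL": 1, "FMA": 1}  # only ADD runs at half rate

def dp_to_sp_ratios(dp_ops_per_clock):
    """Return the DP:SP issue ratio for each operation type."""
    return {op: Fraction(n, SP_OPS_PER_CLOCK) for op, n in dp_ops_per_clock.items()}

for name, table in (("slide", slide_reading), ("corrected", corrected_reading)):
    ratios = dp_to_sp_ratios(table)
    print(name, {op: f"{r.numerator}:{r.denominator}" for op, r in ratios.items()})
```

Under the literal reading MUL comes out at 1:2, under the corrected one at 1:4; ADD and FMA are the same in both.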
 
More thoughts: 50% more SIMDs require 50% more bandwidth, but according to the slides there are still four slices of L2, so either the slices' data paths must be twice as wide or the chip will be even more cache-constrained. Also, with only 512 KB of L2 there will be more texture fetch misses than on rv870, which automatically means underutilization of the SIMDs (i.e. SIMD utilization could be worse in texture-fetch-heavy shaders such as parallax occlusion, sun shafts, shadow filtering, etc.).
 
What's the difference between concurrent kernel execution on rv870 (it's obvious that AMD is comparing this feature with NV's parallel kernel processing)
That model is where kernel execution fills all available SIMDs. The easiest way to think about this is when a kernel is "ending", i.e. as SIMDs finish off their final threads for kernel A, they become available to start work on kernel B.

This is queued overlapped-execution.
and execution of multiple compute kernels on rv970? To me it's the same two kernels, going by the number of dispatch processors in both rv870/940 and rv970.
This model launches multiple (prolly only 2 per SIMD) kernels regardless of the occupation of a SIMD by any other kernel. i.e. two compute kernels could both fill all SIMDs. Here kernels A and B can be launched independently.

B doesn't have to wait for free SIMDs - B isn't waiting for A to give it breathing room.

This is task parallelism (though presumably restricted to some unknown number of distinct kernels).

I've never seen any statement by AMD as to the number of concurrent kernels supported in Evergreen (merely 2 across the entire GPU?) and I don't see any statement for this new feature, either.

If you want to criticise AMD for making the comparison of Evergreen with Fermi, I'll join in, just as soon as I know the constraints of Evergreen's. I don't though - but I do suspect Fermi is more finely-grained. But asynchronous launch is, if they're using the term correctly, a step forward from what's seen in Fermi.

Though I doubt it allows more than two kernels per SIMD - because management of GPR allocation gets seriously tricky with 3 due to fragmentation. Even with register spill through caches into global memory 3 is going to be tricky - and I suspect Cayman doesn't have cached register spill like Fermi.
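A toy sketch of the difference between the two launch models described in this post, under loud assumptions: the per-SIMD drain times are made up, and the asynchronous case simply assumes every SIMD has enough free resources to accept a second kernel. The function name `b_start_times` is hypothetical and does not model any real scheduler.

```python
def b_start_times(a_finish_times, asynchronous):
    """Earliest cycle at which kernel B can begin on each SIMD.

    a_finish_times: hypothetical cycle at which each SIMD drains kernel A.
    asynchronous=False: queued overlapped execution, B waits per SIMD for A.
    asynchronous=True: independent launch, B may start at cycle 0
    (assumes the SIMD can hold a second kernel alongside A).
    """
    if asynchronous:
        return [0] * len(a_finish_times)
    return list(a_finish_times)

# Made-up drain times for kernel A on four SIMDs.
a_done = [90, 100, 100, 120]

queued = b_start_times(a_done, asynchronous=False)
independent = b_start_times(a_done, asynchronous=True)
print("queued overlapped:", queued)
print("independent launch:", independent)
```

In the queued case B only trickles in as individual SIMDs free up; in the independent case it does not wait for A at all.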
 
There are none in HQ mode.

So how come the HD5000/6000 series shows noticeable texture shimmering in some games while any GeForce shows little to none even on the Q setting? Is it a hardware limitation then?

With all this raw power, why can't modern Radeon cards filter as clean as possible, providing a smooth calm image? R520/580 did way better in this area, and your main competitor has been offering superb AF-Quality without any "compromises" since 2006...
 
I get the impression things aren't going to change much based on the current (lack of) architectural change. Unless there's some as yet undisclosed magic it'll probably be similar performance to the 580 with lower die size, power consumption and hopefully cost. The story with geometry doesn't seem to have changed either.
 
So how come the HD5000/6000 series shows noticeable texture shimmering in some games while any GeForce shows little to none even on the Q setting? Is it a hardware limitation then?

With all this raw power, why can't modern Radeon cards filter as clean as possible, providing a smooth calm image? R520/580 did way better in this area, and your main competitor has been offering superb AF-Quality without any "compromises" since 2006...

There was one theory posted in the "HD5 AF broken" thread, namely that AMD/ATI use more detailed LOD values by default: "softening" the LOD by +0.65 makes the shimmering disappear on Radeons and, incidentally, "sharpening" the LOD by -0.65 makes the shimmering appear on GeForces.
 
I get the impression things aren't going to change much based on the current (lack of) architectural change. Unless there's some as yet undisclosed magic it'll probably be similar performance to the 580 with lower die size, power consumption and hopefully cost. The story with geometry doesn't seem to have changed either.

Doesn't it say 2 polygons per clock now? Versus 1 prior?
 
There was one theory posted in the "HD5 AF broken" thread, namely that AMD/ATI use more detailed LOD values by default: "softening" the LOD by +0.65 makes the shimmering disappear on Radeons and, incidentally, "sharpening" the LOD by -0.65 makes the shimmering appear on GeForces.

Well, in my opinion, an IHV has no business adjusting the LOD. If the user or an application requests it, fine. Everything else is just inscrutable.
Normally, the LOD should stay at 0, right?
 
Well, in my opinion, an IHV has no business adjusting the LOD. If the user or an application requests it, fine. Everything else is just inscrutable.
Normally, the LOD should stay at 0, right?

Of course - I don't know much about that stuff though, so the next question is, is there a fixed 0, or is it determined by the hardware?
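For reference, a minimal sketch of how an LOD bias usually enters mip level selection, following the textbook form (level = log2 of the texel-to-pixel scale factor, plus bias, clamped). This is not a description of what any Radeon or GeForce driver does; `mip_level`, `texels_per_pixel` and `max_level` are names chosen for illustration.

```python
import math

def mip_level(texels_per_pixel, lod_bias=0.0, max_level=10):
    """Textbook mip selection: log2 of the scale factor plus bias, clamped.

    texels_per_pixel: how many texels (per axis) map onto one screen pixel.
    lod_bias: positive values pick blurrier levels, negative values sharper ones.
    max_level: assumed index of the smallest mip level in the chain.
    """
    level = math.log2(texels_per_pixel) + lod_bias
    return min(max(level, 0.0), float(max_level))

# The +/-0.65 offsets are the ones mentioned in the thread.
for bias in (-0.65, 0.0, 0.65):
    print(f"bias {bias:+.2f} -> level {mip_level(4.0, bias):.2f}")
```

In this form a bias of 0 is only the neutral offset; the level itself still depends on the computed scale factor, so there is no single fixed "0" level across a scene.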
 