AMD: R9xx Speculation

Kaotik · Nov 22, 2010

simbus82 said:
Slides are gone.

I have saved them here http://img97.yfrog.com/gal.php?g=fot024.jpg

I'm hosting them as well: http://home.akku.tv/~akku38901/HD6900/

Love_In_Rio · Nov 22, 2010

simbus82 said:
Slides are gone.

I have saved them here http://img97.yfrog.com/gal.php?g=fot024.jpg

It's funny how much work Dave Baumann's site creates for him! I suspect he would be willing to review this card himself if he weren't its product manager!

CRoland · Nov 22, 2010

PSU-failure said:
Really?

Look at DP throughput, and do some analysis on slide 70, it seems there's something more to come...

2 DP MUL or ADD, but only 1 DP MAD/FMA per clock? It seems I was right when speculating about VLIW4 = half-rate DP with semi-specialized, symmetrical units, disabled on Radeons only for product segmentation.

I don't see why they'd go higher* than 1:4 DP:SP ratio. What good is it except for marketing? I would prefer increasing SP throughput with DP naturally increasing at the same time.

* Closer to one.

Kaotik · Nov 22, 2010

Love_In_Rio said:
It's funny how much work Dave Baumann's site creates for him! I suspect he would be willing to review this card himself if he weren't its product manager!

As far as I know, Dave Baumann hasn't been affiliated with Beyond3D since he moved to work at ATI/AMD.

PSU-failure · Nov 22, 2010

CRoland said:
I don't see why they'd go higher* than 1:4 DP:SP ratio. What good is it except for marketing? I would prefer increasing SP throughput with DP naturally increasing at the same time.

For GPGPU, it would mean a considerable lead (>= 1.5 TFLOPS).

As for the 1:2 DP MAD/FMA ratio, with 1:2 ADD and 1:2 MUL DP throughput there's no reason to limit MAD/FMA throughput to 1:4 since "simple" optimisation gives 1:2 rate for free.
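A quick sketch of the arithmetic behind the 1.5 TFLOPS figure. The 3.0 TFLOPS single-precision peak is a hypothetical input, chosen only so that half rate lands on 1.5; the thread does not state Cayman's actual SP peak, and `dp_peak_tflops` is an illustrative helper.

```python
def dp_peak_tflops(sp_peak_tflops, dp_to_sp_ratio):
    """Peak DP throughput implied by an SP peak and a DP:SP rate ratio."""
    return sp_peak_tflops * dp_to_sp_ratio

# Hypothetical SP peak in TFLOPS (assumption, not a disclosed spec).
SP_PEAK = 3.0

half_rate = dp_peak_tflops(SP_PEAK, 1 / 2)     # 1:2 DP:SP
quarter_rate = dp_peak_tflops(SP_PEAK, 1 / 4)  # 1:4 DP:SP
print(half_rate, quarter_rate)
```

With the same SP peak, moving from 1:4 to 1:2 doubles the DP figure, which is the whole of the GPGPU argument here.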

CRoland · Nov 22, 2010

PSU-failure said:
For GPGPU, it would mean a considerable lead (>= 1.5 TFLOPS).

As for the 1:2 DP MAD/FMA ratio, with 1:2 ADD and 1:2 MUL DP throughput there's no reason to limit MAD/FMA throughput to 1:4 since "simple" optimisation gives 1:2 rate for free.

But it would seem likely to me that going from, say, 1 DP, 2 SP TFLOPS to 1 DP, 4 SP TFLOPS would also not require that much extra hardware.

Edit: Except if they are bandwidth limited...

OlegSH · Nov 22, 2010

Jawed said:
Maybe you should think about the difference between concurrent and asynchronous

What's the difference between concurrent kernel execution on rv870 (AMD is obviously comparing this feature with NV's parallel kernel processing) and execution of multiple compute kernels on rv970? To me it looks like the same two kernels, going by the number of dispatch processors in both rv870/940 and rv970.

Mize · Nov 22, 2010

Love_In_Rio said:
It's funny how much work Dave Baumann's site

It's been Rys' site for quite some time.
http://beyond3d.com/content/about

rpg.314 · Nov 22, 2010

No mention of a cache hierarchy is odd.

Undecided specs even now is weird.

Gipsel · Nov 22, 2010

PSU-failure said:
Look at DP throughput, and do some analysis on slide 70, it seems there's something more to come...

2 DP MUL or ADD, but only 1 DP MAD/FMA per clock? It seems I was right when speculating about VLIW4 = half-rate DP with semi-specialized, symmetrical units, disabled on Radeons only for product segmentation.

Besides, Cayman will do only ADDs at a 1:2 ratio, with MUL/MAD/FMA at 1:4. That's most probably just an error in the slide, carried over from the Cypress presentation (which had the same misleading "2 64 Bit ADD or MUL" in it, though it is true only for ADD).
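The per-clock figures follow from those ratios. A minimal sketch, assuming one VLIW4 unit issues four SP operations per clock (the thread's VLIW4 premise) and taking the 1:2 and 1:4 ratios from the post above; the names are illustrative.

```python
from fractions import Fraction

SP_OPS_PER_CLOCK = 4  # assumption: one VLIW4 unit, four SP lanes

# DP:SP issue ratios as described in the post above.
DP_RATIO = {
    "ADD": Fraction(1, 2),
    "MUL": Fraction(1, 4),
    "MAD": Fraction(1, 4),
    "FMA": Fraction(1, 4),
}

def dp_ops_per_clock(op):
    """DP operations of type `op` that one VLIW4 unit can issue per clock."""
    return SP_OPS_PER_CLOCK * DP_RATIO[op]

for op in DP_RATIO:
    print(op, dp_ops_per_clock(op))
```

Under these assumptions only ADD reaches two DP operations per clock, which is why a slide reading "2 64 Bit ADD or MUL" would overstate MUL.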

ferro · Nov 22, 2010

rpg.314 said:
No mention of a cache hierarchy is odd.

Undecided specs even now is weird.

I think they are just undisclosed.

OlegSH · Nov 22, 2010

More thoughts: 50% more SIMDs require 50% more bandwidth, but according to the slides there are still four slices of L2, so either the slices' data paths must be twice as wide or the chip will be even more cache constrained. Also, with 512 KB of L2 capacity there will be more texture fetch misses compared to rv870, which automatically means underutilization of the SIMDs (i.e. SIMD utilization could be worse in shaders with heavy texture fetching, such as parallax occlusion, sun shafts, shadow filtering, etc.).
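The bandwidth argument can be sketched in relative terms. This assumes aggregate L2 bandwidth scales with slice count times slice data-path width and is shared evenly across SIMDs; the baseline of 20 SIMDs and the unit slice width are placeholders, not disclosed specs.

```python
def l2_bandwidth_per_simd(num_simds, num_slices, slice_width):
    """Even share of aggregate L2 bandwidth per SIMD (arbitrary units)."""
    return num_slices * slice_width / num_simds

BASE_SIMDS = 20                   # placeholder baseline SIMD count
MORE_SIMDS = BASE_SIMDS * 3 // 2  # 50% more SIMDs

baseline = l2_bandwidth_per_simd(BASE_SIMDS, num_slices=4, slice_width=1.0)
same_width = l2_bandwidth_per_simd(MORE_SIMDS, num_slices=4, slice_width=1.0)
double_width = l2_bandwidth_per_simd(MORE_SIMDS, num_slices=4, slice_width=2.0)

print(same_width / baseline)    # per-SIMD share with unchanged slices
print(double_width / baseline)  # per-SIMD share with slices twice as wide
```

With four unchanged slices each SIMD gets about two thirds of its former share; doubling the slice width more than restores it.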

fellix · Nov 22, 2010

rpg.314 said:
No mention of a cache hierarchy is odd.

Undecided specs even now is weird.

Judging by this slide, the L2 cache is still read-only.

Jawed · Nov 22, 2010

OlegSH said:
What's the difference between concurrent kernel execution on rv870 (AMD is obviously comparing this feature with NV's parallel kernel processing)

That model is where kernel execution fills all available SIMDs. The easiest way to think about this is when a kernel is "ending", i.e. as SIMDs finish off their final threads for kernel A, they become available to start work on kernel B.

This is queued overlapped-execution.

and execution of multiple compute kernels on rv970? To me it looks like the same two kernels, going by the number of dispatch processors in both rv870/940 and rv970.

This model launches multiple (prolly only 2 per SIMD) kernels regardless of the occupation of a SIMD by any other kernel. i.e. two compute kernels could both fill all SIMDs. Here kernels A and B can be launched independently.

B doesn't have to wait for free SIMDs - B isn't waiting for A to give it breathing room.

This is task parallelism (though presumably restricted to some unknown number of distinct kernels).

I've never seen any statement by AMD as to the number of concurrent kernels supported in Evergreen (merely 2 across the entire GPU?) and I don't see any statement for this new feature, either.

If you want to criticise AMD for making the comparison of Evergreen with Fermi, I'll join in, just as soon as I know the constraints of Evergreen's. I don't though - but I do suspect Fermi is more finely-grained. But asynchronous launch is, if they're using the term correctly, a step forward from what's seen in Fermi.

Though I doubt it allows more than two kernels per SIMD - because management of GPR allocation gets seriously tricky with 3 due to fragmentation. Even with register spill through caches into global memory 3 is going to be tricky - and I suspect Cayman doesn't have cached register spill like Fermi.
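A toy model of the two launch behaviours described above, as a sketch only. It assumes every wavefront takes the same time, and it uses the guess of two resident kernels per SIMD from this post; the function names and all the numbers are illustrative, not AMD's scheduler.

```python
import heapq

def queued_start_of_b(num_simds, a_wavefronts, wave_time=1):
    """Queued overlapped execution: B's first wavefront starts only when
    some SIMD has run out of kernel A work."""
    free_at = [0] * num_simds      # time at which each SIMD is next free
    heapq.heapify(free_at)
    for _ in range(a_wavefronts):  # dispatch all of A first, greedily
        t = heapq.heappop(free_at)
        heapq.heappush(free_at, t + wave_time)
    return free_at[0]              # earliest free SIMD once A is dispatched

def async_start_of_b(launch_time, resident_kernels, max_kernels_per_simd=2):
    """Asynchronous launch: B starts at its launch time if the SIMD still
    has a free kernel slot, regardless of A. Returns None if no slot is free."""
    if resident_kernels < max_kernels_per_simd:
        return launch_time
    return None

# B has to wait for A's tail in the queued model...
print(queued_start_of_b(num_simds=4, a_wavefronts=10))
# ...but starts at its own launch time in the asynchronous model.
print(async_start_of_b(launch_time=0, resident_kernels=1))
```

The third resident kernel is simply refused in this model, which mirrors the GPR fragmentation concern rather than any known hardware limit.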

boxleitnerb · Nov 22, 2010

Dave Baumann said:
There are none in HQ mode.

So how come the HD5000/6000 series shows noticeable texture shimmering in some games while any GeForce shows little to none even on the Q setting? Is it a hardware limitation then?

With all this raw power, why can't modern Radeon cards filter as cleanly as possible, providing a smooth, calm image? R520/580 did way better in this area, and your main competitor has been offering superb AF quality without any "compromises" since 2006...

trinibwoy · Nov 22, 2010

I get the impression things aren't going to change much based on the current (lack of) architectural change. Unless there's some as yet undisclosed magic it'll probably be similar performance to the 580 with lower die size, power consumption and hopefully cost. The story with geometry doesn't seem to have changed either.

Kaotik · Nov 22, 2010

boxleitnerb said:
So how come the HD5000/6000 series shows noticeable texture shimmering in some games while any GeForce shows little to none even on the Q setting? Is it a hardware limitation then?

With all this raw power, why can't modern Radeon cards filter as cleanly as possible, providing a smooth, calm image? R520/580 did way better in this area, and your main competitor has been offering superb AF quality without any "compromises" since 2006...

There was one theory posted in the "HD5 AF broken" thread: AMD/ATI use more detailed LOD values by default. By "softening" the LOD by +0.65, the shimmering disappears on Radeons; incidentally, by "sharpening" the LOD by -0.65, the shimmering appears on GeForces.
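For reference, a sketch of how a LOD bias shifts mip selection. It uses the standard formula, LOD = log2(texel-to-pixel footprint) + bias, clamped to the mip chain; whether either vendor's driver applies an offset like this by default is only the theory quoted above. `mip_lod` and its parameters are illustrative names.

```python
import math

def mip_lod(texels_per_pixel, lod_bias=0.0, max_level=10):
    """Mip level of detail for a given texel-to-pixel footprint.

    texels_per_pixel: how many texels one pixel spans along its longest axis.
    lod_bias: offset added to the computed LOD (positive = blurrier).
    max_level: index of the smallest mip level (hypothetical chain length).
    """
    lod = math.log2(texels_per_pixel) + lod_bias
    return min(max(lod, 0.0), float(max_level))

footprint = 4.0  # example: one pixel covers 4 texels along one axis
print(mip_lod(footprint))         # no bias
print(mip_lod(footprint, +0.65))  # "softened": higher, blurrier mip level
print(mip_lod(footprint, -0.65))  # "sharpened": lower, more detailed mip level
```

A positive bias selects a blurrier mip level, which suppresses shimmering at the cost of detail; a negative bias does the opposite. With a bias of 0 the level comes purely from the computed footprint.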

Rangers · Nov 22, 2010

trinibwoy said:
I get the impression things aren't going to change much based on the current (lack of) architectural change. Unless there's some as yet undisclosed magic it'll probably be similar performance to the 580 with lower die size, power consumption and hopefully cost. The story with geometry doesn't seem to have changed either.

Doesn't it say 2 polygons per clock now, versus 1 before?

boxleitnerb · Nov 22, 2010

Kaotik said:
There was one theory posted in the "HD5 AF broken" thread: AMD/ATI use more detailed LOD values by default. By "softening" the LOD by +0.65, the shimmering disappears on Radeons; incidentally, by "sharpening" the LOD by -0.65, the shimmering appears on GeForces.

Well, in my opinion, an IHV has no business adjusting the LOD. If the user or an application requests it, fine. Everything else is just inscrutable.
Normally, the LOD should stay at 0, right?

Kaotik · Nov 22, 2010

boxleitnerb said:
Well, in my opinion, an IHV has no business adjusting the LOD. If the user or an application requests it, fine. Everything else is just inscrutable.
Normally, the LOD should stay at 0, right?

Of course - I don't know much about that stuff though, so the next question is, is there a fixed 0, or is it determined by the hardware?
