AMD: R9xx Speculation

mczak · Oct 19, 2010

DavidGraham said:
As for the two-rasterizers claim, I think this is the same as Cypress, which has 2 rasterizers.

Dunno, the two dispatch units are definitely new in the diagram (in contrast to the two rasterizers).
I was wondering, do they really mean what is drawn there? Vertex Assembler feeding into one dispatch processor, Geometry Assembler feeding into the other (I guess it could make sense, and might actually explain why you get the max performance increase without too heavy tessellation, but it still looks awkward).

Die size is 255mm², not 230!

Yes. So the perf/area improvements compared to Cypress mostly come from the clock increase (that is, from using fewer SIMDs clocked higher). Well, considering it's so similar, with the same shader units, I think this isn't really a big surprise. Still, with a die size almost exactly between Juniper and Cypress but performance much closer to Cypress, that's not too shabby (still only slightly bigger than GF106 and way smaller than GF104!).

Jawed · Oct 19, 2010

hkultala said:
Can current radeons execute different shader programs in different SIMD processors?

Yes, for graphics (different shaders, e.g. VS and PS, can share a SIMD as well, I believe). Unclear for compute kernels.

Could this change be related to that, so that now there are really two "more independent" groups of shaders which can execute 2 different shader programs, whereas previously all had to execute the same one?

The tessellation gain tails off pretty quickly above the "sweet spot" that we've already observed in earlier comparisons (i.e. moderate tessellation levels).

I was under the impression that each shader engine in Cypress is effectively distinct in terms of all execution resources. But if that's not the case, then this may well be the source of the improvements.

This sort of accords with my old theory on poor tessellation performance: that it's basically constrained by thread-generation (and rasterisation is also bottlenecked by that).

So it doesn't seem to me like a complete fix for tessellation performance.

Arnold Beckenbauer · Oct 19, 2010

Man from Atlantis said:
4+1 continues

Nope.

Xenos is a 4+1 GPU. All VS units (R300-R580) were 4+1 units, but since R600 it's 1+1+1+1+1.
Crap:
You mean 4 Shader-Quads and one TMU-Quad, don't you?

Man from Atlantis · Oct 19, 2010

Arnold Beckenbauer said:
Nope.

Xenos is a 4+1 GPU. All VS units (R300-R580) were 4+1 units, but since R600 it's 1+1+1+1+1.
Crap:
You mean 4 Shader-Quads and one TMU-Quad, don't you?

I meant the 4 simple + 1 complex shader style. There are 80 = 20x4 SPs.

Topman · Oct 19, 2010

Arnold Beckenbauer said:
Nope.

Xenos is a 4+1 GPU. All VS units (R300-R580) were 4+1 units, but since R600 it's 1+1+1+1+1.
Crap:
You mean 4 Shader-Quads and one TMU-Quad, don't you?

ignore the yellow/orange unit (TMU).
look inside the red one - a fatter unit
14 SIMDs x 16 x 5-way = 1120 ALUs?

Squilliam · Oct 19, 2010

Arnold Beckenbauer said:
Nope.

Xenos is a 4+1 GPU. All VS units (R300-R580) were 4+1 units, but since R600 it's 1+1+1+1+1.
Crap:
You mean 4 Shader-Quads and one TMU-Quad, don't you?

I think he means VLIW width.

Anyway wasn't Xenos simply 48 shader units, nothing particularly fancy like VLIW?

Arnold Beckenbauer · Oct 19, 2010

Squilliam said:
I think he means VLIW width.

Anyway wasn't Xenos simply 48 shader units, nothing particularly fancy like VLIW?

It was Vec4+1, so it was Superscalar.
http://www.beyond3d.com/content/articles/4/7

It's been said that Xenos's shader processor is an array of 48 ALUs, however it is more correct to say that it is 3 separate arrays of SIMD (Single Instruction Multiple Data) ALUs. Each one of the 48 ALUs can co-issue a vector (Vec4) and a scalar instruction simultaneously, essentially allowing a "5D" operation per cycle.

Gipsel · Oct 19, 2010

Jawed said:

This is Cypress for SIN:

Code:

      1  x: MULADD      ____,  PV0.x,  (0x3E22F983, 0.1591549367f).x,  0.5      
      2  w: FRACT       ____,  PV1.x      
      3  z: MULADD      ____,  PV2.w,  (0x40C90FDB, 6.283185482f).y,  (0xC0490FDB, -3.141592741f).x      
      4  y: MUL         ____,  PV3.z,  (0x3E22F983, 0.1591549367f).x      
      5  t: SIN         R0.x,  PV4.y

The instruction sequence is different for R600:

Code:

      0  w: MULADD      ____,  R0.x,  C0.x,  0.5      
      1  z: FRACT       ____,  PV0.w      
      2  y: MULADD      ____,  PV1.z,  C0.z,  C0.w      
      3  t: SIN         R0.x,  PV2.y

So it seems to have evolved.
But this is the D3D assembly:

Code:

    ps_3_0
    def c0, 0.159154937, 0.5, 6.28318548, -3.14159274
    dcl_color v0.x
    mad r0.x, v0.x, c0.x, c0.y
    frc r0.x, r0.x
    mad r0.x, r0.x, c0.z, c0.w
    sincos r1.y, r0.x
    mov oC0, r1.y

Obviously R600 didn't need the normalization by 2PI. The Cypress code does virtually the same, and is not very well optimized.
What you see is the IL instructions "pireduce" and "sin" in succession, the latter expecting an already reduced input (between -PI and +PI according to the IL spec). But the ISA instruction SIN expects a normalized input in radians/2PI. It can work within a range of -256..256, i.e. -512PI..512PI before the MUL, which is an improvement over R600, where the input needed to be in the -PI..PI range. Hence the compiler inserts an additional MUL to do this normalization. The last step of pireduce is:

MAD dest, intermediate_value, 2PI, -PI

and the sin in IL adds a division by 2PI. So one can optimize away the last MUL if one exchanges the last instruction of pireduce with:

MAD dest, intermediate_value, 1, -0.5
or even simpler
ADD dest, intermediate_value, -0.5

Way to go for the shader compiler!
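A minimal Python sketch of the argument above, for anyone who wants to check the arithmetic. It assumes, as described in the post, that the ISA SIN takes its input in units of radians/2PI; `hw_sin`, `sin_compiled` and `sin_optimized` are hypothetical names used only for illustration, and the model runs in double precision rather than the hardware's float32.

```python
import math

TWO_PI = 2.0 * math.pi

def hw_sin(t):
    # Assumed model of the ISA SIN: input is the angle in radians / 2PI.
    return math.sin(TWO_PI * t)

def frac(x):
    # FRACT: fractional part, always in [0, 1).
    return x - math.floor(x)

def sin_compiled(x):
    # The five-step Cypress sequence from the listing above.
    t = x * (1.0 / TWO_PI) + 0.5   # MULADD
    t = frac(t)                    # FRACT
    t = t * TWO_PI - math.pi       # MULADD (last step of pireduce)
    t = t * (1.0 / TWO_PI)         # MUL    (normalization for SIN)
    return hw_sin(t)               # SIN

def sin_optimized(x):
    # Same result with the MULADD + MUL pair folded into a single ADD.
    t = x * (1.0 / TWO_PI) + 0.5   # MULADD
    t = frac(t)                    # FRACT
    t = t - 0.5                    # ADD
    return hw_sin(t)               # SIN
```

Both functions should agree with `math.sin(x)` to within rounding error, which is the point of the proposed optimization: one instruction fewer, same value.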

Kaotik · Oct 19, 2010

http://dvd4arab.maktoob.com/showthread.php?t=2560070

ASUS HD6850 DirectCU, was this posted yet?

edit:
According to the poster's sig, he has an HD 6970 @ 1050/6800MHz in his own machine.

Unknown Soldier · Oct 19, 2010

DavidGraham said:
I still find it unlikely for Barts to have 1120 ALUs of 4D shaders; the wavefront size would be horrible.

1120 = 17.5 x 64 ALU SIMD (can't happen)
1120 = 14 x 80 ALU SIMD (wavefront size is disastrous)

In fact, if it had 1120 ALUs, it probably wouldn't be much different from an overclocked HD 5830 with 32 functional ROPs and boosted memory frequency; that would be enough for the HD 5830 to even overtake the HD 5850.

So the ALU count is either 1280 or 960; those are the only ones that make sense right now.

Man from Atlantis said:
4+1 continues

Well, it's definitely 1120, that's for sure: 80x14.
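The SIMD arithmetic being argued over can be sketched as follows. This assumes 16 lanes per SIMD (taken from the "14 SIMDs x 16 x 5" figure earlier in the thread) and treats the VLIW width as the only unknown; `simd_count` is a hypothetical helper, not anything from AMD's documentation.

```python
def simd_count(total_alus, vliw_width, lanes_per_simd=16):
    # ALUs per SIMD = lanes x VLIW width (64 for 4-wide, 80 for 5-wide).
    return total_alus / (lanes_per_simd * vliw_width)

# Candidate ALU totals mentioned in the thread, against 4-wide and 5-wide units.
for total in (960, 1120, 1280):
    for width in (4, 5):
        n = simd_count(total, width)
        note = "whole number of SIMDs" if n == int(n) else "not a whole number"
        print(f"{total} ALUs, VLIW{width}: {n:g} SIMDs ({note})")
```

Under these assumptions 1120 only divides evenly with 5-wide units (14 SIMDs of 80), while 960 and 1280 divide evenly either way.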

RedVi · Oct 19, 2010

Tchock said:

So AMD got the memory bandwidth for the 5850 wrong in their own official slide?

They've stated the 5870's memory bandwidth; the 5850 is 128GB/s, so the 6870 is an improvement.
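For reference, a rough sketch of how these bandwidth figures fall out. The bus width (256-bit) and effective GDDR5 data rates used here are assumptions taken from the commonly quoted spec sheets, not from the slide under discussion; only the 128GB/s figure for the 5850 appears in this thread.

```python
def bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    # Peak bandwidth = bytes per transfer x effective transfers per second.
    return bus_width_bits / 8 * data_rate_gbps

# Assumed effective data rates in Gbps per pin (not stated in the thread).
cards = {"HD 5850": 4.0, "HD 6870": 4.2, "HD 5870": 4.8}
for name, rate in cards.items():
    print(f"{name}: {bandwidth_gb_s(256, rate):.1f} GB/s")
```

With those assumed rates, the 5850 lands on the 128GB/s quoted above and the 6870 sits between the 5850 and the 5870.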

jimbo75 · Oct 19, 2010

Has anyone seen slide 20? Maybe it says what resolution these were benched at.

rpg.314 · Oct 19, 2010

fellix said:
About the MLAA thing, I wonder if it would be possible to use it in conjunction with MSAA; the latter would provide some minimum level of sub-pixel coverage.

I think MLAA is implemented using a DX compute shader, so it should be a post-process. MSAA should definitely be usable along with it.

Man from Atlantis · Oct 19, 2010

more from Sweclockers
http://img217.imageshack.us/img217/3116/70001e.jpg
http://img209.imageshack.us/img209/7409/10004a.jpg
http://img31.imageshack.us/img31/8382/20004w.jpg
http://img710.imageshack.us/img710/6076/30004.jpg
http://img252.imageshack.us/img252/7376/40003p.jpg
http://img252.imageshack.us/img252/8409/50003.jpg
http://img132.imageshack.us/img132/5441/60003.jpg

GZ007 · Oct 19, 2010

Tchock said:
Geez.

Xbitlabs tested an overclocked GTX 460 with the new 260 drivers and the numbers are quite interesting. Is it true that the new drivers give such a performance boost?

http://www.xbitlabs.com/articles/video/display/asus-engtx460-directcu-top.html

Sontin · Oct 19, 2010

Man from Atlantis said:
http://img31.imageshack.us/img31/8382/20004w.jpg

I lol'ed.
So workstation makes up only 2% of all GPU sales. With 1.9% of the whole market, nVidia earns nearly 1/2 of AMD's graphics business revenue. That's pretty amazing.

neliz · Oct 19, 2010

AnarchX said:
http://gathering.tweakers.net/forum/list_message/34874406#34874406

I hope this is no fake.

nope it's not

Kef · Oct 19, 2010

neliz said:
nope it's not

WTF is surface format optimization? I hope it's an end to the texture shimmering or something.

Gipsel · Oct 19, 2010

GZ007 said:
Xbitlabs tested an overclocked GTX 460 with the new 260 drivers and the numbers are quite interesting. Is it true that the new drivers give such a performance boost?
http://www.xbitlabs.com/articles/video/display/asus-engtx460-directcu-top.html

What did you expect from an overclock by about 30% above default (675 -> 880MHz with slightly raised voltage)? Of course it can close the gap to the HD6870 running at default voltage and default clock quite a bit. I guess the same would be true for a similarly overclocked HD6850 (would be 1.01 GHz clock).
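The clock arithmetic above can be checked with a quick sketch. The GTX 460 clocks (675 and 880MHz) are from the post; the HD6850 default clock of 775MHz is an assumption from the reference spec and is not stated in this thread. `scaled_clock` is a hypothetical helper.

```python
def scaled_clock(base_mhz, ref_default_mhz, ref_oc_mhz):
    # Apply the same relative overclock to a different base clock.
    return base_mhz * ref_oc_mhz / ref_default_mhz

gtx460_gain = 880 / 675 - 1                  # relative overclock of the tested card
hd6850_equiv = scaled_clock(775, 675, 880)   # 775MHz default is an assumption

print(f"GTX 460 overclock: {gtx460_gain:.1%}")
print(f"Equivalent HD6850 clock: {hd6850_equiv:.0f} MHz")
```

This gives an overclock of roughly 30% and an equivalent HD6850 clock of about 1.01GHz, matching the figures in the post.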

Mize · Oct 19, 2010

meh...all this is pretty ho hum until we see independent Cayman benches against 5970 and 480
