AMD: R9xx Speculation

And about the two rasterizers claim, I think this is the same as Cypress having 2 rasterizers.
Dunno, the two dispatch units are definitely new in the diagram (in contrast to the two rasterizers).
I was wondering whether they really mean what is drawn there: Vertex Assembler feeding into one dispatch processor, Geometry Assembler feeding into the other (I guess it could make sense, and might actually explain why you get the max performance increase without too heavy tessellation, but it still looks awkward).
Die size is 255 mm², not 230!
Yes. So the perf/area improvements compared to Cypress mostly come from the clock increase (that is, from using fewer SIMDs clocked higher). Well, considering it's so similar, with the same shader units, I think this isn't really a big surprise. Still, with die size almost exactly between Juniper and Cypress but performance much closer to Cypress, that's not too shabby (still only slightly bigger than GF106 and way smaller than GF104!).
 
Can current Radeons execute different shader programs in different SIMD processors?
Yes, for graphics (different shaders, e.g. VS and PS, can share a SIMD as well, I believe). Unclear for compute kernels.

Could this change be related to that, so that now there are really two "more independent" groups of shaders which can execute 2 different shader programs, whereas previously all had to execute the same one?
The tessellation gain tails off pretty quickly above the "sweet spot" that we've already observed in earlier comparisons (i.e. moderate tessellation levels).

I was under the impression that each shader engine in Cypress is effectively distinct in terms of all execution resources. But if that's not the case, then this may well be the source of the improvements.

This sort of accords with my old theory on poor tessellation performance: that it's basically constrained by thread-generation (and rasterisation is also bottlenecked by that).

So it doesn't seem to me like a complete fix for tessellation performance.
 
171915m9l893m3ymi39lic.jpg


4+1 continues

Nope.

Xenos is a 4+1 GPU. All VS units (R300-R580) were 4+1 units, but since R600 it's 1+1+1+1+1.
Crap:
You mean 4 Shader-Quads and one TMU-Quad, don't you?
 
I think he means VLIW width.

Anyway wasn't Xenos simply 48 shader units, nothing particularly fancy like VLIW?
 
It was Vec4+1, so it was Superscalar.
http://www.beyond3d.com/content/articles/4/7

It's been said that Xenos's shader processor is an array of 48 ALU's, however it is more correct to say that it is 3 separate arrays of SIMD (Single Instruction Multiple Data) ALU's. Each one of the 48 ALU's can co-issue a vector (Vec4) and a scalar instruction simultaneously, essentially allowing a "5D" operation per cycle.
 
This is Cypress for SIN:

Code:
      1  x: MULADD      ____,  PV0.x,  (0x3E22F983, 0.1591549367f).x,  0.5      
      2  w: FRACT       ____,  PV1.x      
      3  z: MULADD      ____,  PV2.w,  (0x40C90FDB, 6.283185482f).y,  (0xC0490FDB, -3.141592741f).x      
      4  y: MUL         ____,  PV3.z,  (0x3E22F983, 0.1591549367f).x      
      5  t: SIN         R0.x,  PV4.y

The instruction sequence is different for R600:
Code:
      0  w: MULADD      ____,  R0.x,  C0.x,  0.5      
      1  z: FRACT       ____,  PV0.w      
      2  y: MULADD      ____,  PV1.z,  C0.z,  C0.w      
      3  t: SIN         R0.x,  PV2.y
So it seems to have evolved.
But this is the D3D assembly:

Code:
    ps_3_0
    def c0, 0.159154937, 0.5, 6.28318548, -3.14159274
    dcl_color v0.x
    mad r0.x, v0.x, c0.x, c0.y
    frc r0.x, r0.x
    mad r0.x, r0.x, c0.z, c0.w
    sincos r1.y, r0.x
    mov oC0, r1.y
Obviously R600 didn't need the normalization by 2PI. The Cypress code does virtually the same, and is not very well optimized.
What you see is the succession of the IL instructions "pireduce" and "sin", the latter expecting an already reduced input (between -PI and +PI according to the IL spec). The ISA instruction SIN, however, expects a normalized input in radians/2PI (though it works within a range of -256..256, i.e. -512PI..512PI before the MUL; that is an improvement over R600, where the input needed to be in the -PI..PI range), hence the compiler inserts an additional MUL to do this normalization. The last step of pireduce is:

MAD dest, intermediate_value, 2PI, -PI

and the sin in IL adds a division by 2PI. So one can optimize away the last MUL if one exchanges the last instruction of pireduce with:

MAD dest, intermediate_value, 1, -0.5
or even simpler
ADD dest, intermediate_value, -0.5

Way to go for the shader compiler!
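
To check the algebra, here is a rough Python sketch of both sequences. It is only a model: `hw_sin` is a hypothetical stand-in for the ISA SIN (taking its input in units of radians/2PI, as described above), and the function names are made up for illustration. Nothing here is taken from the actual shader compiler.

```python
import math

TWO_PI = 2.0 * math.pi
INV_TWO_PI = 1.0 / TWO_PI


def fract(v: float) -> float:
    """Fractional part, v - floor(v), as FRACT is assumed to behave."""
    return v - math.floor(v)


def hw_sin(turns: float) -> float:
    """Hypothetical model of the ISA SIN: input is radians / 2PI."""
    return math.sin(turns * TWO_PI)


def sin_as_compiled(x: float) -> float:
    """Mirrors the Cypress listing: MULADD, FRACT, MULADD, MUL, SIN."""
    t = x * INV_TWO_PI + 0.5   # MULADD: scale to turns, shift by half a turn
    t = fract(t)               # FRACT: wrap into [0, 1)
    t = t * TWO_PI - math.pi   # MULADD: last step of pireduce, now in [-PI, PI)
    t = t * INV_TWO_PI         # MUL: normalization the compiler adds for SIN
    return hw_sin(t)


def sin_shortened(x: float) -> float:
    """Same thing with the last pireduce step replaced by ADD -0.5."""
    t = x * INV_TWO_PI + 0.5   # MULADD
    t = fract(t) - 0.5         # FRACT, then ADD -0.5 (no MUL needed)
    return hw_sin(t)


for x in (-10.0, -1.0, 0.0, 0.5, 3.0, 25.0):
    print(x, sin_as_compiled(x), sin_shortened(x), math.sin(x))
```

In this model both variants agree with `math.sin` up to floating-point rounding, which is the point of the suggested replacement: multiplying by 2PI and then by 1/(2PI) cancels out.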
 
I still find it unlikely for Barts to have 1120 ALUs of 4D shaders, the wavefront size would be horrible.

1120 = 17.5 X 64 ALU SIMD (Can't happen)
1120 = 14 X 80 ALU SIMD (Wavefront size is disastrous)

In fact, if it had 1120 ALUs, then it probably wouldn't be much different from an overclocked HD 5830 with 32 functional ROPs and boosted memory frequency; that would be enough for the HD 5830 to even overtake the HD 5850.

So the ALU count is either 1280 or 960, those are the only ones that make sense right now.

Well, it's definitely 1120, that's for sure: 80x14.
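
For anyone who wants to redo the divisibility argument, a small Python sketch follows. The 16 lanes per SIMD is an assumption inferred from the 64-ALU (4D) and 80-ALU (5D) figures quoted above, and `simd_count` is just an illustrative helper name.

```python
def simd_count(total_alus: int, vliw_width: int, lanes_per_simd: int = 16):
    """Return the number of SIMDs, or None if the ALU total doesn't divide evenly.

    lanes_per_simd = 16 is an assumption (64 / 4 = 80 / 5 = 16).
    """
    alus_per_simd = vliw_width * lanes_per_simd
    count, remainder = divmod(total_alus, alus_per_simd)
    return count if remainder == 0 else None


for total in (960, 1120, 1280):
    for width in (4, 5):
        print(total, f"{width}D ->", simd_count(total, width))
```

With these assumptions 1120 only works out for 5-wide units (14 SIMDs of 80), while 960 works for both widths and 1280 also divides evenly in both cases (20 SIMDs of 64, or 16 of 80).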
 
About the MLAA thing, I wonder if it would be possible to use it in conjunction with MSAA; the latter would provide some minimum level of sub-pixel coverage.

I think MLAA is implemented using a DX compute shader, so it should be a post process. MSAA should definitely be usable along with it.
 
Xbitlabs tested an overclocked GTX 460 with the new 260 drivers and the numbers are quite interesting. Is it true that the new drivers give such a performance boost?
http://www.xbitlabs.com/articles/video/display/asus-engtx460-directcu-top.html
What did you expect from an overclock by about 30% above default (675 -> 880MHz with slightly raised voltage)? Of course it can close the gap to the HD6870 running at default voltage and default clock quite a bit. I guess the same would be true for a similarly overclocked HD6850 (would be 1.01 GHz clock).
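
The clock arithmetic, as a quick Python sketch. The GTX 460 clocks are the ones quoted above; the 775 MHz default clock for the HD6850 is an assumption filled in here (it is not stated in the thread), and the variable names are only illustrative.

```python
GTX460_DEFAULT_MHZ = 675      # from the post above
GTX460_OVERCLOCK_MHZ = 880    # from the post above
HD6850_DEFAULT_MHZ = 775      # assumption: not given in the thread

# Relative overclock of the tested GTX 460
ratio = GTX460_OVERCLOCK_MHZ / GTX460_DEFAULT_MHZ
print(f"GTX 460 overclock: {100 * (ratio - 1):.1f}%")

# The same relative overclock applied to an HD6850
hd6850_equivalent_mhz = HD6850_DEFAULT_MHZ * ratio
print(f"Equivalent HD6850 clock: {hd6850_equivalent_mhz / 1000:.2f} GHz")
```

Under that assumption the numbers line up with the post: roughly a 30% overclock, which would put a similarly overclocked HD6850 at about 1.01 GHz.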
 