> Despite this, it is a nasty implementation detail to reveal to software, and it does restrict what the compiler can pack in a cycle, however uncommon stretches of code dominated by references to just one bank may be in graphics.

You're referring to the limitation of there being a maximum of 3 operands from each of .x, .y, .z or .w? So if 12 operands were all ".x", that would lead to severely constrained throughput (e.g. 1 MUL per cycle).
> You're referring to the limitation of there being a maximum of 3 operands from each of .x, .y, .z or .w? So if 12 operands were all ".x", that would lead to severely constrained throughput (e.g. 1 MUL per cycle).

That was what I was referring to in the part of my statement about packing. For clarity, that is separate from the part about exposing an ugly low-level implementation detail.
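A quick back-of-the-envelope on that (my arithmetic, assuming each MUL reads two GPR source operands, a 4-slot group, and at most 3 reads per bank per group): with all operands in .x, one MUL consumes 2 of the 3 available .x reads and a second MUL would need a 4th, so only one MUL issues per group and 12 MULs take ~12 cycles. Spread the same operand pairs evenly over .x/.y/.z/.w and four MULs per group need only 2 reads per bank, so the same work finishes in ~3 cycles.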
> Assuming there is ILP then in theory it can be faster. Commonly the pipeline registers (PV.xyzw and PS) are used as operands. Also T registers' 4 components can be freely used.

This assumes that there is ILP that does not run afoul of bank conflicts, since ILP is not defined relative to banking issues.
> I just wrote Brook+, OpenCL and DirectCompute test shaders. Brook+ and OpenCL worked moderately well, using PV and T registers resulting in ~2 MULs per cycle (instead of 1). It's actually 7 cycles for 11 MULs, i.e. ~1.6 MULs per cycle.

Are these test shaders coded specifically to test banking conflicts?
> DirectCompute "cheated" by loading operands from memory into .x and .y components, though still ended up with ~2 MULs per cycle.

That sounds like the compiler in that case at least knew there would be a problem.
> Slide 13: https://hub.vscse.org/resources/287/download/cuda_fermi_overview_DK.pdf

Thanks. I'm glad you had a reference at hand; I simply forgot where I got the 64 bytes/clock per L2 partition figure for Fermi.
230 GB/s L2. I'm guessing it's 6 x 64 B x 600 MHz.
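(That works out: 6 partitions x 64 B/clock x 600 MHz = 230.4 GB/s, matching the ~230 GB/s figure.)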
__kernel void test2(
global float* A,
global float* B,
global float* C,
global float* D,
global float* E,
global float* F,
global float* G,
global float* H,
global float* I,
global float* J,
global float* K,
global float* L,
global float* Z) {
int x = get_global_id(0);
float Z1 = A[x] * B[x] * C[x] * D[x];
float Z2 = E[x] * F[x] * G[x] * H[x];
float Z3 = I[x] * J[x] * K[x] * L[x];
Z[x] = Z1 * Z2 * Z3;
}
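The compiler's ISA output for test2 follows; note how PV (the previous group's results) and T0 turn up as operands, the workaround mentioned above: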
02 ALU: ADDR(68) CNT(13)
23 x: LSHR R0.x, R5.y, (0x00000002, 2.802596929e-45f).x
w: MUL_e ____, R0.x, R4.x
24 x: MUL_e ____, R5.x, R6.x
w: MUL_e T0.w, PV23.w, R1.x VEC_021
25 x: MUL_e ____, PV24.x, R8.x
y: MUL_e ____, R7.x, R2.x VEC_021
26 x: MUL_e ____, T0.w, R10.x
y: MUL_e ____, PV25.x, R9.x VEC_021
z: MUL_e ____, PV25.y, R11.x VEC_102
27 z: MUL_e ____, PV26.z, R3.x
w: MUL_e ____, PV26.y, PV26.x
28 x: MUL_e R3.x, PV27.w, PV27.z
__kernel void test3(
global float* A,
global float* Z) {
int x = get_global_id(0);
Z[x] = A[x] * A[x + 1] * A[x + 2] * A[x + 3] * A[x + 4] * A[x + 5] * A[x + 6] * A[x + 7] * A[x + 8] * A[x + 9] * A[x + 10] * A[x + 11];
}
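The ISA output for test3 follows: every operand comes from an .x component, and the dependent multiply chain issues just one MUL per group: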
02 ALU: ADDR(73) CNT(11)
23 x: MUL_e ____, R3.x, R4.x
24 y: MUL_e ____, PV23.x, R5.x
25 y: MUL_e ____, PV24.y, R0.x
26 z: MUL_e ____, PV25.y, R6.x
27 z: MUL_e ____, PV26.z, R1.x
28 w: MUL_e ____, PV27.z, R2.x
29 w: MUL_e ____, PV28.w, R7.x
30 w: MUL_e ____, PV29.w, R8.x
31 x: MUL_e ____, PV30.w, R9.x
32 y: MUL_e ____, PV31.x, R10.x
33 x: MUL_e R11.x, PV32.y, R11.x
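For comparison, a sketch of how one might dodge the single-.x pattern with vector loads (test3_vec is a hypothetical name; this assumes the compiler keeps float4 components in distinct .x/.y/.z/.w banks, and reassociating the product changes rounding, so it is not strictly equivalent):

__kernel void test3_vec(
global float* A,
global float* Z) {
    int x = get_global_id(0);
    // A[x..x+11] as three float4s, one value per component
    float4 a = vload4(0, A + x);
    float4 b = vload4(0, A + x + 4);
    float4 c = vload4(0, A + x + 8);
    float4 p = a * b * c;              // component-wise, four independent chains
    Z[x] = (p.x * p.y) * (p.z * p.w);  // final reduction across components
}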
> Is the compiler being conservative because these are floating point values? Would ints be handled differently?

I don't think this scheduling is supported for normal integer math (the bit-count and SAD stuff does, though).
> [IMG]http://img835.imageshack.us/img835/3922/62953082.jpg[/IMG]

Are we supposedly looking at Antilles here, or just two Caymans in CrossFire? If it's Antilles, I doubt 900 MHz is possible without exceeding a 300 W TDP.
> The memory on the AMD Caicos board is Hynix H5TQ2G63BFR-12C-024A GDDR3, and the memory on the ATI Cedar (HD5450) is Samsung K4W1G1646E-HC12. Both are GDDR3 spec'd at 800 MHz, but the Hynix on the new board has a density of 2G / 8K ref / 64 ms, while the Cedar board's has a density of 1G / 8K / 64 ms. Memory bandwidth has increased from 12.8 GB/s on Cedar to 25.6 GB/s on Caicos. Is this because of the change in density from 1G to 2G?

No, memory bandwidth will stay the same (but memory size will double, of course). Who said it increased?
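For the arithmetic (standard formula, using the figures already quoted): bandwidth = (bus width / 8) x effective data rate, so Cedar's 64-bit bus at 800 MHz (1600 MT/s effective) gives 8 B x 1.6 GT/s = 12.8 GB/s. Chip density (1 Gb vs 2 Gb) only changes capacity; doubling bandwidth would need a wider bus or faster memory.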
> Talking Juniper -> Barts: increasing the bus width from 128-bit to 256-bit, would this effectively double the board's memory bandwidth while keeping the same Samsung GDDR5 memory modules from the Juniper boards?

Yes. Note that Juniper used 8 chips in clamshell mode, which effectively halves the per-chip data width from 32-bit to 16-bit; presumably Barts would use 8 chips in "normal" mode.
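As a worked example (assuming Juniper's 4.8 Gbps GDDR5 is carried over, which I haven't verified): 128 bit / 8 x 4.8 GT/s = 76.8 GB/s, while 256 bit / 8 x 4.8 GT/s = 153.6 GB/s, i.e. exactly double with the same modules.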
> No, memory bandwidth will stay the same (but memory size will double, of course). Who said it increased?
Well, I checked a press-deck slide for Cedar to get the 11.8 GB/s figure, and there was a Caicos GPU-Z screenshot showing 25.6 GB/s. Could just be an error in GPU-Z, I suppose.
> Is the compiler being conservative because these are floating point values?

The compiler could be being conservative if the serial MULs are paired in such a way that the first MUL's result is not rounded and normalised (as would happen in a dot product, I presume). Is that what you were thinking of?
> The compiler could be being conservative if the serial MULs are paired in such a way that the first MUL's result is not rounded and normalised (as would happen in a dot product, I presume). Is that what you were thinking of?

I think the OpenCL compiler will maintain strict ordering unless you opt in to relaxed FP ordering. I forget the exact name of the compiler option, but did you pass anything like fast-math or relaxed-math (e.g. -cl-fast-relaxed-math)?
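For reference, a minimal host-side sketch of passing that option (standard OpenCL API; the program and device handles are assumed to have been created earlier):

#include <CL/cl.h>

/* Rebuild an existing program with relaxed FP ordering.
   -cl-fast-relaxed-math lets the compiler reassociate floating-point
   operations, lifting the strict-ordering constraint on the MUL chain. */
cl_int build_relaxed(cl_program program, cl_device_id device)
{
    return clBuildProgram(program, 1, &device,
                          "-cl-fast-relaxed-math", NULL, NULL);
}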