> Despite this, it is a nasty implementation detail to reveal to software, and it does restrict what the compiler can pack in a cycle, however uncommon stretches of code dominated by references to just one bank may be in graphics.

You're referring to the limitation of there being a maximum of 3 operands from each of .x, .y, .z or .w? So if 12 operands were all ".x", that would lead to severely constrained throughput (e.g. 1 MUL per cycle).
> You're referring to the limitation of there being a maximum of 3 operands from each of .x, .y, .z or .w? So if 12 operands were all ".x", that would lead to severely constrained throughput (e.g. 1 MUL per cycle).

That was what I was referring to in the part of my statement about packing. For clarity, that is separate from the part about exposing an ugly low-level implementation detail.
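A quick back-of-the-envelope on that (my arithmetic, assuming each MUL reads two GPR source operands, a 4-slot group, and at most 3 reads per bank per group): with all operands in .x, one MUL consumes 2 of the 3 available .x reads and a second MUL would need a 4th, so only one MUL issues per group and 12 MULs take ~12 cycles. Spread the same operand pairs evenly over .x/.y/.z/.w and four MULs per group need only 2 reads per bank, so the same work finishes in ~3 cycles.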
> Assuming there is ILP then in theory it can be faster. Commonly the pipeline registers (PV.xyzw and PS) are used as operands. Also T registers' 4 components can be freely used.

This assumes that there is ILP that does not run afoul of bank conflicts, since ILP is not defined relative to banking issues.
> I just wrote Brook+, OpenCL and DirectCompute test shaders. Brook+ and OpenCL worked moderately well, using PV and T registers resulting in ~2 MULs per cycle (instead of 1). It's actually 7 cycles for 11 MULs, i.e. ~1.6 MULs per cycle.

Are these test shaders coded specifically to test banking conflicts?
> DirectCompute "cheated" by loading operands from memory into .x and .y components, though still ended up with ~2 MULs per cycle.

That sounds like the compiler in that case at least knew there would be a problem.
> Slide 13: https://hub.vscse.org/resources/287/download/cuda_fermi_overview_DK.pdf

Thanks. I'm glad you had a reference at hand; I simply forgot where I got the 64 bytes/clock per L2 partition figure for Fermi.
230 GB/s L2. I'm guessing it's 6 x 64 B x 600 MHz.
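(That works out: 6 partitions x 64 B/clock x 600 MHz = 230.4 GB/s, matching the ~230 GB/s figure.)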
__kernel void test2(
global float* A,
global float* B,
global float* C,
global float* D,
global float* E,
global float* F,
global float* G,
global float* H,
global float* I,
global float* J,
global float* K,
global float* L,
global float* Z) {
int x = get_global_id(0);
float Z1 = A[x] * B[x] * C[x] * D[x];
float Z2 = E[x] * F[x] * G[x] * H[x];
float Z3 = I[x] * J[x] * K[x] * L[x];
Z[x] = Z1 * Z2 * Z3;
}
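The compiler's ISA output for test2 follows; note how PV (the previous group's results) and T0 turn up as operands, the workaround mentioned above: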
02 ALU: ADDR(68) CNT(13)
23 x: LSHR R0.x, R5.y, (0x00000002, 2.802596929e-45f).x
w: MUL_e ____, R0.x, R4.x
24 x: MUL_e ____, R5.x, R6.x
w: MUL_e T0.w, PV23.w, R1.x VEC_021
25 x: MUL_e ____, PV24.x, R8.x
y: MUL_e ____, R7.x, R2.x VEC_021
26 x: MUL_e ____, T0.w, R10.x
y: MUL_e ____, PV25.x, R9.x VEC_021
z: MUL_e ____, PV25.y, R11.x VEC_102
27 z: MUL_e ____, PV26.z, R3.x
w: MUL_e ____, PV26.y, PV26.x
28 x: MUL_e R3.x, PV27.w, PV27.z
__kernel void test3(
global float* A,
global float* Z) {
int x = get_global_id(0);
Z[x] = A[x] * A[x + 1] * A[x + 2] * A[x + 3] * A[x + 4] * A[x + 5] * A[x + 6] * A[x + 7] * A[x + 8] * A[x + 9] * A[x + 10] * A[x + 11];
}
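The ISA output for test3 follows: every operand comes from an .x component, and the dependent multiply chain issues just one MUL per group: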
02 ALU: ADDR(73) CNT(11)
23 x: MUL_e ____, R3.x, R4.x
24 y: MUL_e ____, PV23.x, R5.x
25 y: MUL_e ____, PV24.y, R0.x
26 z: MUL_e ____, PV25.y, R6.x
27 z: MUL_e ____, PV26.z, R1.x
28 w: MUL_e ____, PV27.z, R2.x
29 w: MUL_e ____, PV28.w, R7.x
30 w: MUL_e ____, PV29.w, R8.x
31 x: MUL_e ____, PV30.w, R9.x
32 y: MUL_e ____, PV31.x, R10.x
33 x: MUL_e R11.x, PV32.y, R11.x
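For comparison, a sketch of how one might dodge the single-.x pattern with vector loads (test3_vec is a hypothetical name; this assumes the compiler keeps float4 components in distinct .x/.y/.z/.w banks, and reassociating the product changes rounding, so it is not strictly equivalent):

__kernel void test3_vec(
global float* A,
global float* Z) {
    int x = get_global_id(0);
    // A[x..x+11] as three float4s, one value per component
    float4 a = vload4(0, A + x);
    float4 b = vload4(0, A + x + 4);
    float4 c = vload4(0, A + x + 8);
    float4 p = a * b * c;              // component-wise, four independent chains
    Z[x] = (p.x * p.y) * (p.z * p.w);  // final reduction across components
}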
> Is the compiler being conservative because these are floating point values? Would ints be handled differently?

I don't think this scheduling is supported for normal integer math (the bit-count and SAD stuff does, though).
> [IMG]http://img835.imageshack.us/img835/3922/62953082.jpg[/IMG]

Are we supposedly looking at Antilles here, or just two Caymans in CrossFire? If it's Antilles, I doubt 900 MHz is possible without exceeding a 300 W TDP.
> The memory on the AMD Caicos board is Hynix H5TQ2G63BFR-12C-024A GDDR3, and the memory on the ATI Cedar (HD5450) is Samsung K4W1G1646E-HC12. Both are GDDR3 spec'd at 800 MHz, but the Hynix on the new board has a density of 2G / 8K ref / 64 ms, while the Cedar board's has a density of 1G / 8K / 64 ms. Memory bandwidth has increased from 12.8 GB/s on Cedar to 25.6 GB/s on Caicos. Is this because of the change in density from 1G to 2G?

No, memory bandwidth will stay the same (but memory size will double, of course). Who said it increased?
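For the arithmetic (standard formula, using the figures already quoted): bandwidth = (bus width / 8) x effective data rate, so Cedar's 64-bit bus at 800 MHz (1600 MT/s effective) gives 8 B x 1.6 GT/s = 12.8 GB/s. Chip density (1 Gb vs 2 Gb) only changes capacity; doubling bandwidth would need a wider bus or faster memory.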
> Talking Juniper -> Barts: increasing the bus width from 128-bit to 256-bit, would this effectively double the board's memory bandwidth while keeping the same Samsung GDDR5 memory modules from the Juniper boards?

Yes. Note that Juniper used 8 chips in clamshell mode, which effectively halves the per-chip data width from 32-bit to 16-bit; presumably Barts would use 8 chips in "normal" mode.
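As a worked example (assuming Juniper's 4.8 Gbps GDDR5 is carried over, which I haven't verified): 128 bit / 8 x 4.8 GT/s = 76.8 GB/s, while 256 bit / 8 x 4.8 GT/s = 153.6 GB/s, i.e. exactly double with the same modules.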
> No, memory bandwidth will stay the same (but memory size will double, of course). Who said it increased?
Well, I checked a press-deck slide for Cedar to get the 11.8 GB/s figure, and there was a Caicos GPU-Z screenshot showing 25.6 GB/s. Could just be an error in GPU-Z, I suppose.
> Is the compiler being conservative because these are floating point values?

The compiler could be being conservative if the serial MULs are paired in such a way that the first MUL's result is not rounded and normalised (as would happen in a dot product, I presume). Is that what you were thinking of?
> The compiler could be being conservative if the serial MULs are paired in such a way that the first MUL's result is not rounded and normalised (as would happen in a dot product, I presume). Is that what you were thinking of?

I think the OpenCL compiler will maintain strict ordering unless you opt in to relaxed FP ordering. I forget the exact name of the compiler option, but did you pass anything like fast-math or relaxed-math (e.g. -cl-fast-relaxed-math)?
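For reference, a minimal host-side sketch of passing that option (standard OpenCL API; the program and device handles are assumed to have been created earlier):

#include <CL/cl.h>

/* Rebuild an existing program with relaxed FP ordering.
   -cl-fast-relaxed-math lets the compiler reassociate floating-point
   operations, lifting the strict-ordering constraint on the MUL chain. */
cl_int build_relaxed(cl_program program, cl_device_id device)
{
    return clBuildProgram(program, 1, &device,
                          "-cl-fast-relaxed-math", NULL, NULL);
}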