NVIDIA Fermi: Architecture discussion

DemoCoder · Apr 13, 2010

Time slicing shaders doesn't seem like a useful feature. Because there are so many cores, it would be better to implement some work queue scheme with resource reservation (e.g. I'm enqueuing this task, I need at least N cores for it). The context saving/restoration is too costly IMHO.

Andrew Lauritzen · Apr 13, 2010

DemoCoder said:
Time slicing shaders doesn't seem like a useful feature. Because there are so many cores, it would be better to implement some work queue scheme with resource reservation (e.g. I'm enqueuing this task, I need at least N cores for it). The context saving/restoration is too costly IMHO.

While I agree that cooperative scheduling is definitely desirable and more efficient, being *able* to do preemption is important for correctness guarantees on producer/consumer workloads. These workloads can't be expressed robustly in the current APIs/languages, but it'll probably be desirable to support them, at least at a coarse granularity, in the future.
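
To make the correctness point concrete, here is a toy model in Python (not GPU code). Everything in it is hypothetical and only for illustration: the function name `runs_to_completion`, the one-item-per-task rule, and the round-robin form of preemption. It assumes consumers spin on their core until an item exists, and that without preemption a running task never gives its core up.

```python
from collections import deque

def runs_to_completion(num_cores, launch_order, preemptive, max_steps=1000):
    """Toy scheduler: each producer makes one item and exits,
    each consumer spin-waits on its core until it can take one item."""
    pending = deque(launch_order)   # tasks not currently on a core
    running = []                    # tasks currently holding a core
    items = 0                       # produced but not yet consumed
    for _ in range(max_steps):
        # Fill free cores in launch order.
        while pending and len(running) < num_cores:
            running.append(pending.popleft())
        if not running:
            return True             # every task has finished
        still_running = []
        progressed = False
        for task in running:
            if task == "producer":
                items += 1          # produce and exit
                progressed = True
            elif items > 0:
                items -= 1          # consume and exit
                progressed = True
            else:
                still_running.append(task)  # consumer keeps spinning
        if preemptive:
            # Spinning consumers are evicted and re-queued behind other work.
            pending.extend(still_running)
            running = []
        else:
            running = still_running
            if not progressed:
                return False        # all cores held by spinners: deadlock
    return False

order = ["consumer", "consumer", "producer", "producer"]
print(runs_to_completion(2, order, preemptive=False))  # consumers starve the producers
print(runs_to_completion(2, order, preemptive=True))   # producers eventually get a core
```

With two cores and the consumers launched first, the non-preemptive run can never schedule a producer, while the preemptive run finishes. Launching the producers first also works without preemption, which is exactly the kind of ordering dependence that makes such workloads fragile.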

rpg.314 · Jun 29, 2010

Some useful stuff here

http://www.highperformancegraphics.org/media/Hot3D/HPG2010_Hot3D_NVIDIA.pdf

Davros · Jun 30, 2010

GPUs already multitask, don't they? For example, graphics run at the same time as the CUDA-calculated water in Just Cause 2.

neliz · Jun 30, 2010

Davros said:
GPUs already multitask, don't they? For example, graphics run at the same time as the CUDA-calculated water in Just Cause 2.

I don't think so, it would be nuts if nV had to issue separate instructions to their CUDA Cores, best have one thread and one instruction per cycle. Heck, next thing we know someone comes up and insists that CUDA is made of nothing more than GPU commands.

sorry

KimB · Jun 30, 2010

neliz said:
I don't think so, it would be nuts if nV had to issue separate instructions to their CUDA Cores, best have one thread and one instruction per cycle. Heck, next thing we know someone comes up and insists that CUDA is made of nothing more than GPU commands.

sorry

Yeah, makes more sense for the game to execute all of the CUDA commands on the GPU, so that the GPU is in pure CUDA-mode, finish preparing the frame, and then render the frame in normal graphics mode.

CarstenS · Jun 30, 2010

I guess that depends on your definition of "at the same time". Obviously, one ALU can only execute one command at a time, and that command cannot belong to both a CUDA kernel and a graphics kernel.

On earlier-than-Fermi GPUs, though, the whole chip could only execute one kernel at a time and needed to be in "CUDA state" or "graphics state", with significant switching times. That has been remedied in Fermi, where multiple kernels can run simultaneously but still have to belong to the same group (compute or graphics); switching time has been drastically shortened, according to Nvidia. AMD claims the same capability, running multiple kernels at once, for their chips since I don't know when. So they've probably been there much earlier.

fellix · Jun 30, 2010

CarstenS said:
AMD claims the same capabilities, running multiple kernels at once, for their chips since I don't know when. So they've been there probably much earlier.

R600?

Jawed · Jun 30, 2010

In graphics mode conceptually you have, as a minimum, vertex shader and pixel shader kernels running "independently" on the GPU at the same time.

Xenos/R600 run these as independent kernels.

G80 onwards, theoretically, is the same. Though there's always the chance that NVidia implemented VS/GS/PS by running a single uber-kernel and then simply using a run-time constant in the context (hardware thread's context) to specify which sub-section of the uber-kernel to run, one section for VS, another for GS and a third for PS. Would require a fair bit of digging to find out for sure...

Meanwhile:

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=135388&enterthread=y

"At present, there are no plans to support concurrent kernels."

I wouldn't be surprised if this is a feature that's a victim of the Evergreen cut-backs, i.e. it's in there but doesn't really work, so they're not bothering.

Jawed

mhouston · Jul 1, 2010

Evergreen can do it, and in fact can do it under DXCS right now. It's just tricky. On the CL side, we have some work to do to get this through, namely getting out-of-order queues working and fixing up some dependency tracking that is too aggressive. Om stated the status a little too strongly. We're looking at getting this exposed, but there's no ETA for CL.

cho · Jul 1, 2010

something maybe useful:

http://we.pcinlife.com/thread-1457980-1-1.html

the source code:
http://www.eecg.toronto.edu/~moshovos/CUDA08/arx/microbenchmark.zip

http://www.eecg.toronto.edu/~moshovos/CUDA08/arx/microbenchmark_report.pdf

KimB · Jul 1, 2010

cho said:
something maybe useful:

http://we.pcinlife.com/thread-1457980-1-1.html

"not authorized" Presumably viewing of this thread requires registration. Also, it's in Chinese, which doesn't help many of us.

cho · Jul 1, 2010

Hmm, OK, I've changed the permission setting; it should be viewable now.

Jawed · Jul 1, 2010

Ooh, pipeline length is shorter, which means fewer hardware threads are required to hide pipeline latency, in general.

Throughput is peculiar, it never goes above 28 ops/clk as far as I can see (for things one would expect to be 32 ops/clk).

Throughput for a 32-bit integer MUL is spot-on, though, 15.9 ops/clk.

Double-precision throughput looks like a disaster zone. Register bandwidth spoilt by bad register allocation? In other words, I think the nature of the test is screwing things up, and this will eventually come out right once the driver is more mature.

Some of the GT200 numbers in your test differ from that paper (GTX280 in both cases), e.g. you report 6.0 ops/clk for MAD but the paper has 7.9, and 12.4 ops/clk for MUL but the paper has 11.2.

Something going on in the driver/compiler or register allocation?...

Jawed

cho · Jul 1, 2010

here is the benchmark results on ubuntu 10.04 + cuda toolkit 3.1 + 256.35 + GTX 480:

Code:

Running (16 x 16 x 16) blocks of 512 empty threads...done
Running (16 x 16 x 16) blocks of 512 empty threads: 79.610 ms

Running clock() test...
kclock: 
   (3591554430, 3591554454): 24


kclock_test2: [10 blocks, 1 thread(s)/block]
kclock_test2: [30 blocks, 1 thread(s)/block]
  Block 00: start: 3591605780, stop: 3591608096
  Block 01: start: 3591606066, stop: 3591608382
  Block 02: start: 3591609784, stop: 3591612100
  Block 03: start: 3591605790, stop: 3591608106
  Block 04: start: 3591606072, stop: 3591608388
  Block 05: start: 3591609790, stop: 3591612106
  Block 06: start: 3591605790, stop: 3591608106
  Block 07: start: 3591606072, stop: 3591608388
  Block 08: start: 3591609790, stop: 3591612106
  Block 09: start: 3591605798, stop: 3591608114
  Block 00: start: 3591616774, stop: 3591619090
  Block 10: start: 3591620632, stop: 3591622948
  Block 20: start: 3591616874, stop: 3591619190
  Block 01: start: 3591616790, stop: 3591619106
  Block 11: start: 3591620662, stop: 3591622978
  Block 21: start: 3591620626, stop: 3591622942
  Block 02: start: 3591616796, stop: 3591619112
  Block 12: start: 3591616780, stop: 3591619096
  Block 22: start: 3591616908, stop: 3591619224
  Block 03: start: 3591616808, stop: 3591619124
  Block 13: start: 3591616788, stop: 3591619104
  Block 23: start: 3591620652, stop: 3591622968
  Block 04: start: 3591616862, stop: 3591619178
  Block 14: start: 3591616790, stop: 3591619106
  Block 24: start: 3591616948, stop: 3591619264
  Block 05: start: 3591616876, stop: 3591619192
  Block 15: start: 3591616802, stop: 3591619118
  Block 25: start: 3591616778, stop: 3591619094
  Block 06: start: 3591616880, stop: 3591619196
  Block 16: start: 3591616860, stop: 3591619176
  Block 26: start: 3591616866, stop: 3591619182
  Block 07: start: 3591616910, stop: 3591619226
  Block 17: start: 3591620612, stop: 3591622928
  Block 27: start: 3591620618, stop: 3591622934
  Block 08: start: 3591620614, stop: 3591622930
  Block 18: start: 3591616870, stop: 3591619186
  Block 28: start: 3591616948, stop: 3591619264
  Block 09: start: 3591620628, stop: 3591622944
  Block 19: start: 3591620622, stop: 3591622938
  Block 29: start: 3591616794, stop: 3591619110


Running pipeline tests...
Pipeline latency (512 dependent operations)
  mul:          9228 clk (18.023 clk/warp)

Running pipeline tests...

  K_ADD_UINT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_RSQRT_FLOAT_DEP128     latency:        17950 clk (70.117 clk/warp)
  K_ADD_DOUBLE_DEP128     latency:        6148 clk (24.016 clk/warp)

  K_ADD_UINT_DEP128     throughput:         4666 clk (28.091 ops/clk)
  K_RSQRT_FLOAT_DEP128     throughput:        32876 clk (3.987 ops/clk)
  K_ADD_DOUBLE_DEP128     throughput:        32752 clk (4.002 ops/clk)

  K_ADD_UINT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_SUB_UINT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_MAD_UINT_DEP128     latency:        5130 clk (20.039 clk/warp)
  K_MUL_UINT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_DIV_UINT_DEP128     latency:        67596 clk (264.047 clk/warp)
  K_REM_UINT_DEP128     latency:        67596 clk (264.047 clk/warp)
  K_MIN_UINT_DEP128     latency:        9228 clk (36.047 clk/warp)
  K_MAX_UINT_DEP128     latency:        9228 clk (36.047 clk/warp)
  K_ADD_UINT_DEP128     throughput:         4662 clk (28.115 ops/clk)
  K_SUB_UINT_DEP128     throughput:         4666 clk (28.091 ops/clk)
  K_MAD_UINT_DEP128     throughput:         8224 clk (15.938 ops/clk)
  K_MUL_UINT_DEP128     throughput:         8224 clk (15.938 ops/clk)
  K_DIV_UINT_DEP128     throughput:        77310 clk (1.695 ops/clk)
  K_REM_UINT_DEP128     throughput:        75536 clk (1.735 ops/clk)
  K_MIN_UINT_DEP128     throughput:         9280 clk (14.124 ops/clk)
  K_MAX_UINT_DEP128     throughput:         9796 clk (13.380 ops/clk)

  K_ADD_INT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_SUB_INT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_MAD_INT_DEP128     latency:        5130 clk (20.039 clk/warp)
  K_MUL_INT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_DIV_INT_DEP128     latency:        77580 clk (303.047 clk/warp)
  K_REM_INT_DEP128     latency:        76044 clk (297.047 clk/warp)
  K_MIN_INT_DEP128     latency:        9228 clk (36.047 clk/warp)
  K_MAX_INT_DEP128     latency:        9228 clk (36.047 clk/warp)
  K_ABS_INT_DEP128     latency:        9228 clk (36.047 clk/warp)
  K_ADD_INT_DEP128     throughput:         4664 clk (28.103 ops/clk)
  K_SUB_INT_DEP128     throughput:         4662 clk (28.115 ops/clk)
  K_MAD_INT_DEP128     throughput:         8224 clk (15.938 ops/clk)
  K_MUL_INT_DEP128     throughput:         8228 clk (15.930 ops/clk)
  K_DIV_INT_DEP128     throughput:        95372 clk (1.374 ops/clk)
  K_REM_INT_DEP128     throughput:        88822 clk (1.476 ops/clk)
  K_MIN_INT_DEP128     throughput:         9280 clk (14.124 ops/clk)
  K_MAX_INT_DEP128     throughput:         9286 clk (14.115 ops/clk)
  K_ABS_INT_DEP128     throughput:         9298 clk (14.097 ops/clk)

  K_ADD_FLOAT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_SUB_FLOAT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_MAD_FLOAT_DEP128     latency:        5130 clk (20.039 clk/warp)
  K_MUL_FLOAT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_DIV_FLOAT_DEP128     latency:        162842 clk (636.102 clk/warp)
  K_MIN_FLOAT_DEP128     latency:        9228 clk (36.047 clk/warp)
  K_MAX_FLOAT_DEP128     latency:        9228 clk (36.047 clk/warp)
  K_ADD_FLOAT_DEP128     throughput:         4662 clk (28.115 ops/clk)
  K_SUB_FLOAT_DEP128     throughput:         4664 clk (28.103 ops/clk)
  K_MAD_FLOAT_DEP128     throughput:         5432 clk (24.130 ops/clk)
  K_MUL_FLOAT_DEP128     throughput:         4664 clk (28.103 ops/clk)
  K_DIV_FLOAT_DEP128     throughput:       221442 clk (0.592 ops/clk)
  K_MIN_FLOAT_DEP128     throughput:         9282 clk (14.121 ops/clk)
  K_MAX_FLOAT_DEP128     throughput:         9280 clk (14.124 ops/clk)

  K_ADD_DOUBLE_DEP128     latency:        6150 clk (24.023 clk/warp)
  K_SUB_DOUBLE_DEP128     latency:        6148 clk (24.016 clk/warp)
  K_MAD_DOUBLE_DEP128     latency:        6150 clk (24.023 clk/warp)
  K_MUL_DOUBLE_DEP128     latency:        6148 clk (24.016 clk/warp)
  K_DIV_DOUBLE_DEP128     latency:        173078 clk (676.086 clk/warp)
  K_MIN_DOUBLE_DEP128     latency:        12292 clk (48.016 clk/warp)
  K_MAX_DOUBLE_DEP128     latency:        12294 clk (48.023 clk/warp)
  K_ADD_DOUBLE_DEP128     throughput:        32752 clk (4.002 ops/clk)
  K_SUB_DOUBLE_DEP128     throughput:        32766 clk (4.000 ops/clk)
  K_MAD_DOUBLE_DEP128     throughput:        32754 clk (4.002 ops/clk)
  K_MUL_DOUBLE_DEP128     throughput:        32760 clk (4.001 ops/clk)
  K_DIV_DOUBLE_DEP128     throughput:       258918 clk (0.506 ops/clk)
  K_MIN_DOUBLE_DEP128     throughput:        65530 clk (2.000 ops/clk)
  K_MAX_DOUBLE_DEP128     throughput:        65520 clk (2.000 ops/clk)

  K_AND_UINT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_OR_UINT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_XOR_UINT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_SHL_UINT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_SHR_UINT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_AND_UINT_DEP128     throughput:         4666 clk (28.091 ops/clk)
  K_OR_UINT_DEP128     throughput:         4666 clk (28.091 ops/clk)
  K_XOR_UINT_DEP128     throughput:         4664 clk (28.103 ops/clk)
  K_SHL_UINT_DEP128     throughput:         8242 clk (15.903 ops/clk)
  K_SHR_UINT_DEP128     throughput:         8242 clk (15.903 ops/clk)

  K_UMUL24_UINT_DEP128     latency:        9234 clk (36.070 clk/warp)
  K_MUL24_INT_DEP128     latency:        9234 clk (36.070 clk/warp)
  K_UMULHI_UINT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_MULHI_INT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_USAD_UINT_DEP128     latency:        5130 clk (20.039 clk/warp)
  K_SAD_INT_DEP128     latency:        5130 clk (20.039 clk/warp)
  K_UMUL24_UINT_DEP128     throughput:         9332 clk (14.045 ops/clk)
  K_MUL24_INT_DEP128     throughput:         9334 clk (14.042 ops/clk)
  K_UMULHI_UINT_DEP128     throughput:         8224 clk (15.938 ops/clk)
  K_MULHI_INT_DEP128     throughput:         8222 clk (15.942 ops/clk)
  K_USAD_UINT_DEP128     throughput:         8240 clk (15.907 ops/clk)
  K_SAD_INT_DEP128     throughput:         8242 clk (15.903 ops/clk)

  K_FADD_RN_FLOAT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_FADD_RZ_FLOAT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_FMUL_RN_FLOAT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_FMUL_RZ_FLOAT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_FDIVIDEF_FLOAT_DEP128     latency:        21260 clk (83.047 clk/warp)
  K_FADD_RN_FLOAT_DEP128     throughput:         4664 clk (28.103 ops/clk)
  K_FADD_RZ_FLOAT_DEP128     throughput:         4664 clk (28.103 ops/clk)
  K_FMUL_RN_FLOAT_DEP128     throughput:         4662 clk (28.115 ops/clk)
  K_FMUL_RZ_FLOAT_DEP128     throughput:         4662 clk (28.115 ops/clk)
  K_FDIVIDEF_FLOAT_DEP128     throughput:        32908 clk (3.983 ops/clk)

  K_DADD_RN_DOUBLE_DEP128     latency:        6148 clk (24.016 clk/warp)
  K_DADD_RN_DOUBLE_DEP128     throughput:        32752 clk (4.002 ops/clk)

  K_RCP_FLOAT_DEP128     latency:        74766 clk (292.055 clk/warp)
  K_SQRT_FLOAT_DEP128     latency:        70688 clk (276.125 clk/warp)
  K_RSQRT_FLOAT_DEP128     latency:        17950 clk (70.117 clk/warp)
  K_RCP_FLOAT_DEP128     throughput:        93152 clk (1.407 ops/clk)
  K_SQRT_FLOAT_DEP128     throughput:        90428 clk (1.449 ops/clk)
  K_RSQRT_FLOAT_DEP128     throughput:        32884 clk (3.986 ops/clk)

  K_SINF_FLOAT_DEP128     latency:        10248 clk (40.031 clk/warp)
  K_COSF_FLOAT_DEP128     latency:        10248 clk (40.031 clk/warp)
  K_TANF_FLOAT_DEP128     latency:        29708 clk (116.047 clk/warp)
  K_EXPF_FLOAT_DEP128     latency:        27154 clk (106.070 clk/warp)
  K_EXP2F_FLOAT_DEP128     latency:        22558 clk (88.117 clk/warp)
  K_EXP10F_FLOAT_DEP128     latency:        27154 clk (106.070 clk/warp)
  K_LOGF_FLOAT_DEP128     latency:        22552 clk (88.094 clk/warp)
  K_LOG2F_FLOAT_DEP128     latency:        17950 clk (70.117 clk/warp)
  K_LOG10F_FLOAT_DEP128     latency:        22552 clk (88.094 clk/warp)
  K_POWF_FLOAT_DEP128     latency:        27232 clk (106.375 clk/warp)
  K_SINF_FLOAT_DEP128     throughput:        32772 clk (4.000 ops/clk)
  K_COSF_FLOAT_DEP128     throughput:        32778 clk (3.999 ops/clk)
  K_TANF_FLOAT_DEP128     throughput:        98380 clk (1.332 ops/clk)
  K_EXPF_FLOAT_DEP128     throughput:        32902 clk (3.984 ops/clk)
  K_EXP2F_FLOAT_DEP128     throughput:        33012 clk (3.970 ops/clk)
  K_EXP10F_FLOAT_DEP128     throughput:        32986 clk (3.974 ops/clk)
  K_LOGF_FLOAT_DEP128     throughput:        32888 clk (3.985 ops/clk)
  K_LOG2F_FLOAT_DEP128     throughput:        32882 clk (3.986 ops/clk)
  K_LOG10F_FLOAT_DEP128     throughput:        32912 clk (3.982 ops/clk)
  K_POWF_FLOAT_DEP128     throughput:        66116 clk (1.982 ops/clk)

  K_INTASFLOAT_UINT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_FLOATASINT_FLOAT_DEP128     latency:        4620 clk (18.047 clk/warp)
  K_INTASFLOAT_UINT_DEP128     throughput:         8220 clk (15.945 ops/clk)
  K_FLOATASINT_FLOAT_DEP128     throughput:         8228 clk (15.930 ops/clk)

  K_POPC_UINT_DEP128     latency:        5130 clk (20.039 clk/warp)
  K_CLZ_UINT_DEP128     latency:        9228 clk (36.047 clk/warp)

  K_POPC_UINT_DEP128     throughput:         8228 clk (15.930 ops/clk)
  K_CLZ_UINT_DEP128     throughput:         9310 clk (14.079 ops/clk)

  K_ALL_UINT_DEP128     latency:        12336 clk (48.188 clk/warp)
  K_ANY_UINT_DEP128     latency:        12336 clk (48.188 clk/warp)
  K_SYNC_UINT_DEP128     latency:        58 clk (0.227 clk/warp)

  K_ALL_UINT_DEP128     throughput:        16484 clk (7.951 ops/clk)
  K_ANY_UINT_DEP128     throughput:        16488 clk (7.950 ops/clk)
  K_SYNC_UINT_DEP128     throughput:          108 clk (1213.630 ops/clk)


Pipeline latency/throughput with multiple warps (200 iterations of 256 ops)
  K_ADD_UINT_DEP128:
     1 warp  (  1 thr)    924000 clk (18.047 clk/warp, 0.055 ops/clk)   Histogram { (18: 200) }
     1 warp  (  2 thr)    924000 clk (18.047 clk/warp, 0.111 ops/clk)   Histogram { (18: 200) }
     1 warp  (  3 thr)    924000 clk (18.047 clk/warp, 0.166 ops/clk)   Histogram { (18: 200) }
     1 warp  (  4 thr)    924000 clk (18.047 clk/warp, 0.222 ops/clk)   Histogram { (18: 200) }
     1 warp  (  6 thr)    924000 clk (18.047 clk/warp, 0.332 ops/clk)   Histogram { (18: 200) }
     1 warp  (  8 thr)    924000 clk (18.047 clk/warp, 0.443 ops/clk)   Histogram { (18: 200) }
     1 warp  ( 16 thr)    924000 clk (18.047 clk/warp, 0.887 ops/clk)   Histogram { (18: 200) }
     1 warp  ( 24 thr)    924000 clk (18.047 clk/warp, 1.330 ops/clk)   Histogram { (18: 200) }
     1 warp  ( 32 thr)    924000 clk (18.047 clk/warp, 1.773 ops/clk)   Histogram { (18: 200) }
     2 warps ( 64 thr)    924000 clk (18.047 clk/warp, 3.546 ops/clk)   Histogram { (18: 400) }
     3 warps ( 96 thr)    924400 clk (18.047 clk/warp, 5.317 ops/clk)   Histogram { (18: 600) }
     4 warps (128 thr)    924800 clk (18.051 clk/warp, 7.087 ops/clk)   Histogram { (18: 800) }
     5 warps (160 thr)    925600 clk (18.056 clk/warp, 8.850 ops/clk)   Histogram { (18: 1000) }
     6 warps (192 thr)    926000 clk (18.059 clk/warp, 10.616 ops/clk)   Histogram { (18: 1200) }
     7 warps (224 thr)    926800 clk (18.065 clk/warp, 12.375 ops/clk)   Histogram { (18: 1400) }
     8 warps (256 thr)    926916 clk (18.064 clk/warp, 14.141 ops/clk)   Histogram { (18: 1600) }
     9 warps (288 thr)    928328 clk (18.071 clk/warp, 15.884 ops/clk)   Histogram { (18: 1800) }
    10 warps (320 thr)    928742 clk (18.076 clk/warp, 17.641 ops/clk)   Histogram { (18: 2000) }
    11 warps (352 thr)    928930 clk (18.079 clk/warp, 19.401 ops/clk)   Histogram { (18: 2200) }
    12 warps (384 thr)    929168 clk (18.093 clk/warp, 21.160 ops/clk)   Histogram { (18: 2400) }
    13 warps (416 thr)    930940 clk (18.103 clk/warp, 22.879 ops/clk)   Histogram { (18: 2600) }
    14 warps (448 thr)    931248 clk (18.111 clk/warp, 24.631 ops/clk)   Histogram { (18: 2800) }
    15 warps (480 thr)    932606 clk (18.121 clk/warp, 26.352 ops/clk)   Histogram { (18: 3000) }
    16 warps (512 thr)    932754 clk (18.130 clk/warp, 28.104 ops/clk)   Histogram { (18: 3200) }


  K_MUL_FLOAT_DEP128     throughput:         4664 clk (28.103 ops/clk)
  K_MAD_FLOAT_DEP128     throughput:         5374 clk (24.390 ops/clk)

  KADD_MUL     throughput:         4146 clk (31.614 ops/clk)

  KADD_MUL2     throughput:    64 thrds      2570 clk (6.375 ops/clk)

++++++++++++++++++++++++++++++++++++++++++++++++++
  K_SYNC_UINT_DEP128     latency:        58 clk (0.227 clk/warp)
  K_SYNC_UINT_DEP128     latency:        60 clk (0.234 clk/warp)
  K_SYNC_UINT_DEP128     latency:        62 clk (0.242 clk/warp)
  K_SYNC_UINT_DEP128     latency:        64 clk (0.250 clk/warp)
  K_SYNC_UINT_DEP128     latency:        66 clk (0.258 clk/warp)
  K_SYNC_UINT_DEP128     latency:        68 clk (0.266 clk/warp)
  K_SYNC_UINT_DEP128     latency:        70 clk (0.273 clk/warp)
  K_SYNC_UINT_DEP128     latency:        72 clk (0.281 clk/warp)
  K_SYNC_UINT_DEP128     latency:        76 clk (0.297 clk/warp)
  K_SYNC_UINT_DEP128     latency:        78 clk (0.305 clk/warp)
  K_SYNC_UINT_DEP128     latency:        80 clk (0.312 clk/warp)
  K_SYNC_UINT_DEP128     latency:        82 clk (0.320 clk/warp)
  K_SYNC_UINT_DEP128     latency:        84 clk (0.328 clk/warp)
  K_SYNC_UINT_DEP128     latency:        88 clk (0.344 clk/warp)
  K_SYNC_UINT_DEP128     latency:        90 clk (0.352 clk/warp)
  K_SYNC_UINT_DEP128     latency:        94 clk (0.367 clk/warp)
Running register file test...
Max threads x regs/thread before kernel spawn failure.
  [516 x   4 =  2064]
  [516 x   8 =  4128]
  [516 x  12 =  6192]
  [516 x  16 =  8256]
  [516 x  20 = 10320]
  [516 x  24 = 12384]
  [516 x  28 = 14448]
  [516 x  32 = 16512]
  [516 x  36 = 18576]
  [516 x  40 = 20640]
  [516 x  44 = 22704]
  [516 x  48 = 24768]
  [516 x  52 = 26832]
  [516 x  56 = 28896]
  [516 x  60 = 30960]
  [516 x  64 = 33024]
  [516 x  68 = 35088]
  [516 x  72 = 37152]
  [516 x  76 = 39216]
  [516 x  80 = 41280]
  [516 x  84 = 43344]
  [516 x  88 = 45408]
  [516 x  92 = 47472]
  [516 x  96 = 49536]
  [516 x 100 = 51600]
  [516 x 104 = 53664]
  [516 x 108 = 55728]
  [516 x 112 = 57792]
  [516 x 116 = 59856]
  [516 x 120 = 61920]
  [516 x 124 = 63984]
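
For anyone reading the dump above: the two derived columns can be reproduced from the raw clock counts. The sketch below is in Python and rests on assumptions inferred from the printed numbers, not from the benchmark source: the latency figure looks like total clocks divided by 256 dependent ops, and the throughput figure looks like 512 threads x 256 ops divided by total clocks. The constant and function names are made up for illustration.

```python
# Assumptions inferred from the printed numbers, not from the benchmark source.
THREADS = 512          # threads per throughput run (16 warps)
OPS_PER_THREAD = 256   # dependent ops per thread in the *_DEP128 kernels

def latency_clk_per_op(total_clk, ops=OPS_PER_THREAD):
    """Clocks per dependent op for a single warp (the 'clk/warp' column)."""
    return total_clk / ops

def throughput_ops_per_clk(total_clk, threads=THREADS,
                           ops=OPS_PER_THREAD, iterations=1):
    """Total ops issued divided by total clocks (the 'ops/clk' column)."""
    return threads * ops * iterations / total_clk

# K_ADD_UINT_DEP128 latency, reported as 18.047 clk/warp
print(round(latency_clk_per_op(4620), 3))
# K_ADD_UINT_DEP128 throughput, reported as 28.091 ops/clk
print(round(throughput_ops_per_clk(4666), 3))
# K_MUL_UINT_DEP128 throughput, reported as 15.938 ops/clk
print(round(throughput_ops_per_clk(8224), 3))
# K_ADD_DOUBLE_DEP128 throughput, reported as 4.002 ops/clk
print(round(throughput_ops_per_clk(32752), 3))
# Multi-warp run, 16 warps, 200 iterations, reported as 28.104 ops/clk
print(round(throughput_ops_per_clk(932754, iterations=200), 3))
```

If those assumptions hold, every throughput figure in the single-kernel tables was measured with 16 warps resident, which matters when comparing against a nominal 32 ops/clk.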

Jawed · Jul 1, 2010

I don't understand why you posted those results. There are differences, but I don't really feel like scouring them to find nuggets of gold.

Anyway, looking again I notice that:

Code:

  KADD_MUL     throughput:         4146 clk (31.614 ops/clk)

is 32 ops/clk, non-dependent. I missed that in the earlier posting.

I've also noticed for GTX280, Pipeline latency/throughput with multiple warps (200 iterations of 256 ops), K_ADD_UINT_DEP128:

Code:

  6 warps (192 thr)   1249528 clk (24.253 clk/warp, 7.867 ops/clk)

Notice that this is the peak ops/clk, at only 6 warps, whereas with GTX480 the peak ops/clk occurs at 16 warps. This entirely contradicts my earlier suggestion that the reduced pipeline length would reduce the count of hardware threads required to hide pipeline latency.

This may just be another artefact of this test/compilation. If it truly needs to be heavily populated with hardware threads that's going to be quite awkward.

Maybe this is because the register file has been substantially reworked for the load/store design, where ALU operands always come from registers and never from anywhere else, so the register file has to support more clients.
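
One other possible reading, sketched in Python as back-of-envelope arithmetic rather than anything confirmed: the number of warps needed to keep the ALUs busy with fully dependent code is roughly the pipeline latency divided by the clocks one warp instruction occupies the lanes. The lane counts (8 per SM on GT200, 32 per SM on GF100) are the published figures; the latencies are the measured clk/warp values quoted in this thread; the function names are invented for the example.

```python
WARP_SIZE = 32

def warps_to_saturate(latency_clk, lanes_per_sm):
    """Warps needed so a new dependent op is always ready to issue."""
    clk_per_warp_issue = WARP_SIZE / lanes_per_sm
    return latency_clk / clk_per_warp_issue

def expected_ops_per_clk(warps, latency_clk, lanes_per_sm):
    """Latency-bound throughput, capped at the lane count."""
    return min(lanes_per_sm, warps * WARP_SIZE / latency_clk)

# GTX280: 8 lanes/SM, ~24.25 clk measured latency
print(warps_to_saturate(24.253, 8))
print(expected_ops_per_clk(6, 24.253, 8))

# GTX480: 32 lanes/SM, ~18.13 clk measured latency
print(warps_to_saturate(18.130, 32))
print(expected_ops_per_clk(16, 18.130, 32))
```

Under this model GTX280 saturates at about 6 warps, while GTX480 would need about 18, more than the 16 warps (512 threads) the test goes up to. That would account for both the peak landing at 16 warps and the roughly 28 ops/clk ceiling, without implying anything about the register file.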

Jawed

Gipsel · Jul 1, 2010

Jawed said:
Double-precision throughput looks like a disaster zone. Register bandwidth spoilt by bad register allocation? In other words, I think the nature of the test is screwing things up, and this will eventually come out right once the driver is more mature.

I don't think so, as Nvidia has artificially limited the DP throughput on consumer cards to 1/4 of that of the Tesla cards. So 4 ops/clock per SM is exactly what one expects (only 168 GFlop/s peak for a GTX480).
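
The 168 GFlop/s figure can be checked with a few lines of Python. The SM count (15) and shader clock (1401 MHz) are the retail GTX 480 specifications and are assumptions here, since the post does not state them; each DP op is counted as a fused multiply-add, i.e. 2 flops.

```python
DP_OPS_PER_CLK_PER_SM = 4   # matches the ~4.0 ops/clk measured above
FLOPS_PER_FMA = 2           # one multiply plus one add
NUM_SMS = 15                # assumption: retail GTX 480
SHADER_CLOCK_GHZ = 1.401    # assumption: retail GTX 480 shader clock

def peak_dp_gflops(ops_per_clk_per_sm=DP_OPS_PER_CLK_PER_SM,
                   num_sms=NUM_SMS, clock_ghz=SHADER_CLOCK_GHZ):
    """Peak double-precision GFlop/s for the whole chip."""
    return ops_per_clk_per_sm * FLOPS_PER_FMA * num_sms * clock_ghz

print(round(peak_dp_gflops(), 2))
```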

Jawed · Jul 1, 2010

I totally forgot about that

Mindfury · Jul 3, 2010

rpg.314 · Jul 3, 2010

7 SMs for 460 seems rather low. That's a rather big gap between 465 and 460.

Maybe we'll see some last-minute tweaking.
