Is a 32-bit MADD/FMA ALU exactly half of a 64-bit MADD/FMA ALU (i.e. are there no extra bits needed to calculate to that precision)? Does it change the way the registers and caches load and store data, etc.? If there are differences like the above and they cost more power and die space, then that's a likely explanation.
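For what it's worth, a back-of-envelope answer to the first question, assuming a plain array multiplier whose area grows roughly with the square of the significand width (IEEE-754 binary32 carries 24 significand bits including the implicit one, binary64 carries 53):

```latex
% Rough multiplier-area ratio, double vs. single precision:
\frac{53^2}{24^2} = \frac{2809}{576} \approx 4.9
```

So a double-precision FMA is closer to 5x a single-precision one in multiplier area alone, before counting the wider adder/normalizer and the doubled register and cache bandwidth per operand; that asymmetry is exactly the kind of power and die cost the post speculates about.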
> Is the reduction in ray tracing performance in GK104 not enough evidence for you?

You mean the up to 2x improvement in ray tracing performance, right?
> Is the reduction in ray tracing performance in GK104 not enough evidence for you?

Which ray tracer? The NVidia ray tracing library?

EDIT: hahahahahahaha
> I think rpg.314 was referring to LuxMark.

When NVidia starts optimising for OpenCL, I guess that will be relevant. Didn't LuxMark show a massive decline in performance with recent drivers on GTX580?
> It's obvious that the 680 requires more hand holding than the 580.

Why do you think that won't be the case for the big Kepler chip? Have you spent time on the presentation I linked in post 3966?
> Didn't LuxMark show a massive decline in performance with recent drivers on GTX580?

It did -- for almost 6 months now. Sloppy OCL 1.1 stack and no sign of a 1.2 implementation anytime soon. Go figure where it all lies in the priority list for NV...
Dodge.
All I am saying is that the compiler is not to blame for GK104's sucky compute performance. It's the cache subsystem and the register-file size vs ALU count trade-off made here. It has too little RF and cache to do justice to its ALUs in compute.
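Putting rough numbers on that, using the per-SM figures I believe were published (GF110 SM: 32 cores, 32768 32-bit registers, 64 KB L1/shared; GK104 SMX: 192 cores, 65536 registers, 64 KB L1/shared):

```latex
% On-chip storage per ALU lane, GF110 SM vs. GK104 SMX:
\text{registers/lane:}\quad \tfrac{32768}{32} = 1024 \;\text{(GF110)} \qquad \tfrac{65536}{192} \approx 341 \;\text{(GK104)}
\\
\text{L1+shared/lane:}\quad \tfrac{64\,\text{KB}}{32} = 2\,\text{KB} \;\text{(GF110)} \qquad \tfrac{64\,\text{KB}}{192} \approx 0.33\,\text{KB} \;\text{(GK104)}
```

Even crediting Fermi's hot-clocked ALUs with an effective 2x lane count, Kepler still ends up with roughly two-thirds of the registers and a third of the L1/shared capacity per flop, which matches the "too little RF and cache for its ALUs" reading.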
Static scheduling is not going anywhere. It's here to stay.
> The flops/register ratio hasn't changed from GF110->GK104. So how does your theory of insufficient latency hiding explain why it is slower than GF110 in some workloads?

I think the number of active threads per SMX is the concern in this case.
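To illustrate the active-threads point with compute capability 3.0's documented limits (65536 32-bit registers and at most 2048 threads / 64 warps per SMX): full occupancy leaves only 32 registers per thread, and anything more register-hungry cuts the resident warp count. A minimal sketch, with a hypothetical kernel:

```cuda
// Sketch of how the register budget caps active threads per SMX on CC 3.0:
//   65536 regs / 2048 threads = 32 regs/thread at full occupancy;
//   a 64-reg/thread kernel fits only 65536 / 64 = 1024 threads (32 warps),
//   i.e. half the warps available to hide latency.
//
// __launch_bounds__(256, 8) tells nvcc to budget registers so that
// 8 blocks x 256 threads (the full 2048) can stay resident per SMX.
__global__ void __launch_bounds__(256, 8)
saxpy_sketch(float *y, const float *x, float a, int n)  // hypothetical kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = fmaf(a, x[i], y[i]);  // low register pressure: fits easily
}
```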
> The flops/register ratio hasn't changed from GF110->GK104. So how does your theory of insufficient latency hiding explain why it is slower than GF110 in some workloads?

I'm not trying to explain anything, since the data is generally crap. My only observation is that two instructions from the same hardware thread can't be issued consecutively (whereas Fermi and prior apparently could). As far as I can tell, this constraint applies whether or not the second instruction is dependent upon the first. GCN doesn't have any constraint here (well, there will be some indexed register modes that do raise this constraint, but I'm ignoring those).
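If that issue restriction is real, the standard workaround is instruction-level parallelism inside each thread, so there is always a non-dependent instruction available for the gap (or another warp to fill it). A sketch with a hypothetical kernel, assuming the back-to-back restriction described above and a grid launched with at most n threads:

```cuda
// A single accumulator makes every fmaf wait on the previous one; four
// independent accumulators give the scheduler a non-dependent instruction
// to issue in each gap (or let other warps fill the slots).
__global__ void ilp_demo(float *out, const float *in, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float a0 = 0.f, a1 = 0.f, a2 = 0.f, a3 = 0.f;

    for (int k = tid; k + 3 * stride < n; k += 4 * stride) {
        a0 = fmaf(in[k],              2.0f, a0);  // four chains with no
        a1 = fmaf(in[k +     stride], 2.0f, a1);  // dependence between
        a2 = fmaf(in[k + 2 * stride], 2.0f, a2);  // consecutive instructions
        a3 = fmaf(in[k + 3 * stride], 2.0f, a3);
    }
    out[tid] = a0 + a1 + a2 + a3;  // combine the partial sums
}
```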
> Why would they make it even harder to extract GPU performance for their HPC customers?

Because their research shows that relying upon "minimal compilation effort" requires hardware overheads that totally dominate power consumption. NVidia is directing its efforts towards getting the right data to the right place at the right time.
> Why would they make it even harder to extract GPU performance for their HPC customers?

Well, I will be very interested to see how this plays out in the coming months, and how much extra compute performance nVidia is able to squeeze out through better drivers.
* The execution time for an instruction (those with maximum throughput, anyway) is 11 clock cycles, down from 22 in Fermi (a measurement sketch follows this list).
* Throughputs of 32-bit operations are no longer identical. Max throughput per clock on an SMX is: 192 floating-point multiply-adds, 168 integer adds, or 136 logical operations. These all had the same throughput in compute capability 2.0.
* Relative to the throughput of single-precision multiply-add, the throughput of integer shifts, integer comparisons, and integer multiplies is lower than before.
* The throughput of the intrinsic special functions, relative to single-precision floating-point MAD, is slightly higher than in compute capability 2.0.
* Max # of blocks (16), warps (64), and threads (2048) per SMX
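For reference, figures like the 11-cycle number typically come from a clock()-style microbenchmark along these lines. A minimal sketch (hypothetical names, illustrative constants; real harnesses unroll more and average many repetitions), not whatever harness produced the numbers above:

```cuda
// Launch as fma_latency<<<1, 1>>>(d_out, d_cycles, 1.0f): a lone thread
// means no other warps can cover the stalls, so elapsed clocks / DEPTH
// approximates the latency of a dependent instruction chain.
#define DEPTH 4096

__global__ void fma_latency(float *out, long long *cycles, float seed)
{
    float x = seed;
    long long start = clock64();
    #pragma unroll 16
    for (int i = 0; i < DEPTH; ++i)
        x = fmaf(x, 1.0000001f, 0.25f);  // each FMA depends on the previous
    long long stop = clock64();
    *out = x;                 // keep the chain live past the optimizer
    *cycles = stop - start;   // divide by DEPTH for cycles per instruction
}
```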
> Furthermore, Kepler's pipeline latency is now halved as well.

Isn't that just a result of ditching the hot-clock domain and reducing the # of ALU pipeline stages?
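A quick sanity check on that, taking roughly 1.5 GHz for GF110's hot clock and 1 GHz for GK104's base clock (approximate figures, so treat this as a ballpark):

```latex
% Wall-clock ALU latency = cycles / clock frequency:
\text{GF110: } \tfrac{22}{1.5\,\text{GHz}} \approx 14.7\,\text{ns} \qquad
\text{GK104: } \tfrac{11}{1.0\,\text{GHz}} = 11\,\text{ns}
```

The absolute latency drops too, not just the cycle count measured at a slower clock, which is consistent with genuinely fewer ALU pipeline stages rather than the clock-domain change alone.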