NVIDIA Kepler speculation thread

Shouldn't matter - if you assign 10 to each vALU, you still get 40 cycles of latency hiding out of one vALU instruction across all 40 wavefronts, no?
 
I wonder why?

BF3 Multiplayer Summary - GTX 680 SLI offered the best multiplayer experience, despite having less VRAM capacity and memory bandwidth. We were able to run with motion blur enabled and HBAO turned on at 5760x1200 with FXAA and averaged 60-70 FPS. This amount of performance is perfect for multiplayer, and with the highest in-game settings enabled the game looked great in multiplayer. AMD Radeon HD 7970 CrossFireX struggled for performance, even though it had more RAM and memory bandwidth. To get the game feeling smooth enough, with enough performance, we had to lower ambient occlusion and motion blur. GeForce GTX 680 SLI was the clear winner in multiplayer.
http://www.hardocp.com/article/2012/03/28/nvidia_kepler_geforce_gtx_680_sli_video_card_review/5
 
A typical VALU instruction takes 4 clocks to execute, so being able to issue one VALU instruction per SIMD every 4 clocks is exactly the right balance. It only takes 4 wavefronts to get peak ALU performance, not 16.
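A quick sanity check of that balance (a sketch; the 16-lane SIMDs, 64-wide wavefronts and round-robin issue are the commonly cited GCN figures, assumed here rather than taken from a vendor document):

Code:
/* Host-side arithmetic only: checks that one wavefront per SIMD is enough
   to keep a GCN CU's vector ALUs saturated under the assumed figures
   (ignoring dependency stalls). */
#include <stdio.h>

int main(void)
{
    const int wave_width   = 64;  /* work items per wavefront (assumed) */
    const int simd_lanes   = 16;  /* lanes per SIMD (assumed)           */
    const int simds_per_cu = 4;   /* SIMDs sharing one issue port       */

    int busy_clocks    = wave_width / simd_lanes; /* 64/16 = 4 clocks/instr */
    int issue_interval = simds_per_cu;            /* each SIMD gets an issue
                                                     slot every 4 clocks    */
    printf("SIMD busy %d clocks, re-issued every %d clocks: %s\n",
           busy_clocks, issue_interval,
           busy_clocks == issue_interval
               ? "4 wavefronts (1 per SIMD) sustain peak"
               : "more wavefronts needed");
    return 0;
}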
I think Gipsel was considering a scenario of heavy register allocation and using the same register allocation for both architectures.

He was indicating that whereas Kepler cannot hide ALU latency with relatively few hardware threads, GCN has no problem.

---

Why does a MAD take NVidia 11 cycles, but AMD 4 cycles?

---

The ability to copy from work item to work item should be very nice, obviating moves through local memory. This is similar to Larrabee's shuffle.
 
Good question considering it loses in single player. What's different about MP? Are the textures lower res or anything?

Multiplayer is MUCH more CPU-demanding.

hardocp said:
Even with just FXAA enabled we are over the VRAM limit, 3541 MB was being used, when the cards only have 3GB per GPU. Increasing to 2X and 4X AA took us to near 5 GB of memory.

The issue could easily be some memory reporting problem - i.e. the CFX setup reports 7 GB of VRAM, which BF3 then tries to fill (and doesn't bother cleaning up resources not currently in use), leading to massive VRAM swapping.
 
Good question considering it loses in single player. What's different about MP? Are the textures lower res or anything?
I wouldn't call a difference of 3 FPS losing; however, MP is more memory intensive than SP: bigger maps, larger textures, and more alpha and particle effects (explosions, smoke, etc.).

The issue could easily be some memory reporting problem - i.e. the CFX setup reports 7 GB of VRAM, which BF3 then tries to fill (and doesn't bother cleaning up resources not currently in use), leading to massive VRAM swapping.
The game is multi-GPU aware; it wouldn't do that.
 
I wouldn't call a difference of 3 FPS losing; however, MP is more memory intensive than SP: bigger maps, larger textures, and more alpha and particle effects (explosions, smoke, etc.).

Well, in the 4x MSAA test it was 5 FPS (or 12%), which seems significant.


The game is multi-GPU aware; it wouldn't do that.

It's clearly doing something odd with the Radeons, as it's sucking up double the VRAM.
 
A typical VALU instruction takes 4 clocks to execute, so being able to issue one VALU instruction per SIMD every 4 clocks is exactly the right balance. It only takes 4 wavefronts to get peak ALU performance, not 16.
Yes, but my argument targeted the latency hiding capabilities (for memory accesses) indicated by the number of wavefronts/warps/workgroups in flight (and the time it takes to issue instructions for them) in the case of a certain (high) register allocation.
Thanks!
Actually, according to some low level tests it is between 18 and 22 (hot clock) cycles on Fermi depending on register bank conflicts, so maybe nV opted for a constant 11 cycles to get rid of the variable latency for the static scheduling.
Gipsel - good point on max register usage cutting down on the number of warps per SMX, I had forgotten to account for that.

I don't quite understand your GCN numbers though. Doesn't a CU track a maximum of 40 wavefronts? You can process an ALU op for 4 entire waves every 4 cycles. So by executing 1 ALU instruction over all 40 waves, you can hide 40 cycles of memory access latency (much more than on Fermi/Kepler), again assuming you have enough registers for 40 wavefronts.
I assumed a high register usage, so this isn't the case anymore. ;)
For such heavy threads the number of workgroups in flight is simply limited by the size of the register files, where nV's GPUs are at a disadvantage.
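To put rough numbers on the register-file argument (a sketch; the 64 KB-per-SIMD and 65536-registers-per-SMX figures are the commonly quoted Tahiti/GK104 numbers, assumed here, and 64 regs/thread is just an illustrative "heavy" allocation):

Code:
/* Wavefronts/warps in flight under a heavy register allocation. */
#include <stdio.h>

int main(void)
{
    const int regs_per_thread = 64;  /* "heavy" allocation; Kepler actually
                                        caps a thread at 63 registers, 64 is
                                        used here for symmetry */

    /* GCN: 64 KB of VGPRs per SIMD -> 256 registers per lane, and each
       SIMD tracks at most 10 wavefronts. */
    int gcn_waves = 256 / regs_per_thread;           /* -> 4 per SIMD */
    if (gcn_waves > 10) gcn_waves = 10;

    /* Kepler: 65536 32-bit registers per SMX, shared by up to 64 warps
       of 32 threads. */
    int kep_warps = 65536 / (32 * regs_per_thread);  /* -> 32 per SMX */
    if (kep_warps > 64) kep_warps = 64;

    /* Threads in flight per ALU lane is what matters for latency hiding. */
    printf("GCN CU:     %d waves x 64 threads / 64 lanes  = %d threads/lane\n",
           4 * gcn_waves, 4 * gcn_waves * 64 / 64);
    printf("Kepler SMX: %d warps x 32 threads / 192 lanes = %d threads/lane\n",
           kep_warps, kep_warps * 32 / 192);
    return 0;
}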
I think Gipsel was considering a scenario of heavy register allocation and using the same register allocation for both architectures.

He was indicating that whereas Kepler cannot hide ALU latency with relatively few hardware threads, GCN has no problem.
Exactly.
The scalar ALU with its separate register file should actually enable AMD to get away with slightly less register usage than nVidia in quite a few cases, as for instance constants or addresses which are the same for all elements in a wavefront can be supplied from there and don't have to sit in the vector registers.
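A sketch of the kind of case meant here, written as CUDA-style source for concreteness (the placement in SGPRs is the assumption being illustrated, not something the compiler is guaranteed to do):

Code:
#include <cstdio>

__global__ void scale(float *data, float factor, int n)
{
    /* 'data', 'factor' and 'n' are wavefront-uniform: identical for every
       work item. The assumption illustrated here is that GCN can keep them
       once in scalar registers (SGPRs), instead of replicating them across
       the vector register file as a purely per-lane architecture must.
       Only 'i' and the loaded value are truly per-lane. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 64;
    float h[n], *d;
    for (int i = 0; i < n; ++i) h[i] = (float)i;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<1, n>>>(d, 2.0f, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[5] = %f\n", h[5]);   /* expect 10.0 */
    return 0;
}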
Why does a MAD take NVidia 11 cycles, but AMD 4 cycles?
Are the 4 cycles confirmed by some low level benchmark?
Considering the simplifications in the ALUs compared to the VLIW architecture (which had 8 cycles arithmetic latency) it appeared quite possible (and the AFDS presentation mentioned back-to-back issue of vector ops without alternating between wavefronts [vector to scalar issue needs to have a 4 cycle latency penalty, btw.]). But seeing that GCN is actually able to hit frequencies above 1 GHz (my initial guess was not higher than VLIW), it could still be 8 cycles, even if the architecture presentation at AFDS would then have been misleading on this point.

If it is indeed 4 versus 11 cycles, I can only speculate that the reasons are similar to the ones for Fermi: even as Kepler goes down the route to much more static scheduling, the actual register access could still be part of the effective latency (not hidden by result forwarding incorporated into the pipeline), while it is not for AMD.

The ability to copy from work item to work item should be very nice, obviating moves through local memory. This is similar to Larrabee's shuffle.
I will wait for benchmarks of this feature. If it is as fast as in Larrabee's case, it somewhat contradicts NV's recent mantra of localizing the register files as close as possible to the ALUs to keep the power cost low. But even with some additional latency it appears to be a nice idea, as it should still be faster and lower power than an exchange through the local memory. Maybe they are even partly reusing the shuffle network of the local memory (which they duplicated for each scheduler/register file set in Kepler?) and just saving the writes and reads to the local memory SRAM.
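For reference, this is the kind of register-to-register exchange the feature exposes in CUDA; a minimal sketch of a warp-wide sum that never touches local (shared) memory (shown with the modern __shfl_xor_sync spelling; the Kepler-era intrinsic was __shfl_xor without the mask):

Code:
#include <cstdio>

/* Butterfly reduction: every step pulls a value straight out of another
   lane's registers, no shared-memory round trip. */
__global__ void warpSum(const float *in, float *out)
{
    float v = in[threadIdx.x];
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffffu, v, offset);
    if (threadIdx.x == 0)
        *out = v;   /* lane 0 now holds the sum of all 32 lanes */
}

int main()
{
    float h_in[32], *d_in, *d_out, h_out;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;   /* expect 32.0 */
    cudaMalloc(&d_in, 32 * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, 32 * sizeof(float), cudaMemcpyHostToDevice);
    warpSum<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %f\n", h_out);
    return 0;
}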
 
(4 SIMDs per CU) * (10 wavefronts per SIMD) * (4 clocks per instruction) = 160 clocks

My understanding was that the 4 SIMDs operate in parallel and have a fixed set of wavefronts assigned to them - why are you adding their individual latency hiding abilities?

Gipsel said:
I assumed a high register usage, so this isn't the case anymore.
For such heavy threads the number of workgroups in flight is simply limited by the size of the register files, where nV's GPUs are at a disadvantage.

Ah, right. It seems to me NV's GPUs are at a disadvantage in either scenario though.
 
My understanding was that the 4 SIMDs operate in parallel and have a fixed set of wavefronts assigned to them - why are you adding their individual latency hiding abilities?
You're right, I goofed. The SIMDs execute simultaneously, so even though it's 4 clocks per instruction, that execution overlaps with the other SIMDs.
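In other words (a sketch, using the same assumed GCN figures as above):

Code:
/* The 4 SIMDs hide latency in parallel, so their capacities don't add up. */
#include <stdio.h>

int main(void)
{
    const int waves_per_simd   = 10;
    const int clocks_per_instr = 4;
    printf("per SIMD: %d waves x %d clocks = %d clocks hidden "
           "by one vALU instruction per wave\n",
           waves_per_simd, clocks_per_instr,
           waves_per_simd * clocks_per_instr);  /* 40, not 4 x 40 = 160 */
    return 0;
}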
 
I wouldn't call a difference of 3 FPS losing; however, MP is more memory intensive than SP: bigger maps, larger textures, and more alpha and particle effects (explosions, smoke, etc.).


The game is multi-GPU aware; it wouldn't do that.


This is a damn good question. Sadly I don't have 3 monitors to test with, but I would suggest they try the same without FXAA (only 4x MSAA) just to see what happens. But yes, something is strange in their test if the 680s take only 2040 MB vs 4000 MB+ for the 7970s...
 
Wasn't Nvidia touting some sort of feature that would drop the FPS on the peripheral monitors to increase the FPS on the central one?

Is there any way to track the specific FPS across all three monitors to see how and when this feature is working?
 
LordEC911 said:
Wasn't Nvidia touting some sort of feature that would drop the FPS on the peripheral monitors to increase the FPS on the central one?

Is there any way to track the specific FPS across all three monitors to see how and when this feature is working?
This was a rumor launched by one of the usual sites, but I never saw it mentioned in any review. Probably as much a feature as the HW-accelerated physics and the $300 launch price.
 
This was a rumor launched by one of the usual sites, but I never saw it mentioned in any review. Probably as much a feature as the HW-accelerated physics and the $300 launch price.

Yeah, you are right. Now that I think about it I didn't see it mentioned in a single review.
 
When a surround setup is enabled (i.e. 5760x1080), the new drivers let you efficiently emulate a single-monitor resolution (i.e. 1920x1080) without having to disable surround (they add black bars on both sides of the rendered frame). You can now keep surround enabled at all times but decide in which games to actually use it. There is still some cost compared to disabling surround, but it should be negligible.

As usual, some leakers understand half of what they read and are fine with it ;)
 
I suppose it's already been asked but I don't feel like digging through 20 pages. ;)

Is the chip-known-as-680 a midrange GPU pushed to high clocks so NVIDIA didn't have to push out their big high end chip that is probably not yielding very well? (See unhappy presentation about TSMC)
 
I suppose it's already been asked but I don't feel like digging through 20 pages. ;)

Is the chip-known-as-680 a midrange GPU pushed to high clocks so NVIDIA didn't have to push out their big high end chip that is probably not yielding very well? (See unhappy presentation about TSMC)

It's a GPU that is comparable to GF104/114 in physical terms (size, power envelope) so perhaps you can think of it in whatever terms you thought of GF104/114.

As for the big chip (GK110, apparently) it's probably not yielding well, but that's normal for a big chip on 28nm. More importantly, it's just not ready for launch.
 