Shouldn't matter - if you assign 10 to each vALU, you still get 40 cycles of latency hiding out of one vALU instruction across all 40 wavefronts, no?
Edit: Yep, this is right.
http://www.hardocp.com/article/2012/03/28/nvidia_kepler_geforce_gtx_680_sli_video_card_review/5
BF3 Multiplayer Summary - GTX 680 SLI offered the best multiplayer experience, despite having less VRAM capacity and memory bandwidth. We were able to run at 5760x1200 with motion blur enabled, HBAO turned on, and FXAA, and averaged 60-70 FPS. This amount of performance is perfect for multiplayer, and with the highest in-game settings enabled the game looked great. AMD Radeon HD 7970 CrossFireX struggled for performance, even though it had more RAM and memory bandwidth. To get the game feeling smooth enough we had to lower ambient occlusion and motion blur. GeForce GTX 680 SLI was the clear winner in multiplayer.
A typical VALU instruction takes 4 clocks to execute, so being able to issue one VALU instruction per SIMD every 4 clocks is exactly the right balance. It only takes 4 wavefronts to get peak ALU performance, not 16.
I think Gipsel was considering a scenario of heavy register allocation and using the same register allocation for both architectures.
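As a quick sanity check of the issue-rate claim in the quote, here is a minimal sketch using only the numbers quoted in this thread (4 SIMDs per CU, 4-clock vALU instructions); it is back-of-the-envelope arithmetic, not an architecture model:

```python
# Sanity check: with a 4-clock vector instruction, each SIMD needs a new
# instruction only once every 4 clocks, so one wavefront per SIMD can
# already sustain the peak issue rate.
SIMDS_PER_CU = 4
CLOCKS_PER_VALU_OP = 4

# One wavefront per SIMD: while its current instruction executes over
# 4 clocks, that same wavefront's next instruction can be issued right
# on time, so the vector ALUs never starve.
min_waves_for_peak_alu = SIMDS_PER_CU * 1
print(min_waves_for_peak_alu)  # 4, not 16
```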
Good question considering it loses in single player. What's different about MP? Are the textures lower res or anything?
hardocp said: Even with just FXAA enabled we are over the VRAM limit - 3541 MB was being used, when the cards only have 3 GB per GPU. Increasing to 2X and 4X AA took us to near 5 GB of memory.
Good question considering it loses in single player. What's different about MP? Are the textures lower res or anything?
I wouldn't call a difference of 3 FPS losing. However, MP is more memory intensive than SP: bigger maps, larger textures, and more alpha and particle effects (explosions, smoke, etc.).
The issue could easily be some memory reporting problem - i.e. that the CFX setup reports 7 GB of VRAM, which BF3 then tries to fill (/doesn't care about cleaning up stuff not currently used), leading to massive VRAM swapping.
The game is multi-GPU aware; it wouldn't do that.
I wouldn't call a difference of 3 FPS losing. However, MP is more memory intensive than SP: bigger maps, larger textures, and more alpha and particle effects (explosions, smoke, etc.).
The game is multi-GPU aware; it wouldn't do that.
It's clearly doing something odd with the Radeons, as it's sucking up double the VRAM.
A typical VALU instruction takes 4 clocks to execute, so being able to issue one VALU instruction per SIMD every 4 clocks is exactly the right balance. It only takes 4 wavefronts to get peak ALU performance, not 16.
Yes, but my argument targeted the latency-hiding capabilities (for memory accesses) indicated by the number of wavefronts/warps/workgroups in flight (and the time the issue of instructions takes for these) in the case of a certain (high) register allocation.
Is this what you're looking for?
http://forums.nvidia.com/index.php?showtopic=225312&view=findpost&p=1387312
Thanks!
Gipsel - good point on max register usage cutting down on the number of warps per SMX; I had forgotten to account for that.
I assumed a high register usage, so this isn't the case anymore.
I don't quite understand your GCN numbers though. Doesn't a CU track a maximum of 40 wavefronts? You can process an ALU op for 4 entire waves every 4 cycles. So by executing 1 ALU instruction over all 40 waves, you can hide 40 cycles of memory access latency (much more than on Fermi/Kepler), again assuming you have enough registers for 40 wavefronts.
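The 40-wavefront argument above can be checked with a quick back-of-the-envelope script. The numbers (4 SIMDs per CU, 10 wavefronts per SIMD, 4-clock vALU cadence) are the ones being discussed in this thread, not independently verified figures:

```python
# Back-of-the-envelope check of the GCN latency-hiding numbers above.
SIMDS_PER_CU = 4
WAVEFRONTS_PER_SIMD = 10
CLOCKS_PER_VALU_OP = 4

# Total wavefronts a CU tracks at once.
wavefronts_per_cu = SIMDS_PER_CU * WAVEFRONTS_PER_SIMD

# A SIMD round-robins over its own 10 wavefronts; issuing one vALU
# instruction for each keeps it busy for 10 * 4 = 40 clocks, which is
# how long the first wavefront's memory access can be hidden before
# that wavefront comes up for issue again.
clocks_hidden_per_simd = WAVEFRONTS_PER_SIMD * CLOCKS_PER_VALU_OP

print(wavefronts_per_cu, clocks_hidden_per_simd)  # 40 40
```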
I think Gipsel was considering a scenario of heavy register allocation and using the same register allocation for both architectures.
Exactly.
He was indicating that whereas Kepler cannot hide ALU latency with relatively few hardware threads, GCN has no problem.
Are the 4 cycles confirmed by some low-level benchmark? Why does a MAD take NVidia 11 cycles, but AMD 4 cycles?
The ability to copy from work item to work item should be very nice, obviating moves through local memory. This is similar to Larrabee's shuffle.
I will wait for benchmarks of this feature. If it is as fast as in the case of Larrabee, it somewhat contradicts NV's recent mantra of localizing the register files as close as possible to the ALUs to get a low power cost. But even with some additional latency it appears like a nice idea, as it should still be faster and lower power than an exchange through the local memory. Maybe they are even partly reusing the shuffle network for the local memory (which they duplicated for each scheduler/register file set in Kepler?) and just save the writing and reading to the local memory SRAM.
(4 SIMDs per CU) * (10 wavefronts per SIMD) * (4 clocks per instruction) = 160 clocks
Gipsel said: I assumed a high register usage, so this isn't the case anymore.
For such heavy threads the number of workgroups in flight is simply limited by the size of the register files, where nV's GPUs have a disadvantage.
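To make the register-pressure point concrete, here is a hypothetical occupancy sketch. The register-file figures are the commonly cited ones (GCN: 64 KB of vector registers per SIMD; Kepler: 64K 32-bit registers per SMX), but the 64-registers-per-thread allocation is an assumed example value, not a measurement from any real kernel:

```python
# Hypothetical sketch: how many wavefronts/warps fit in flight before
# the register file runs out, for an assumed per-thread allocation.
def waves_in_flight(regs_per_thread, lanes, regfile_regs, hw_max_waves):
    """Waves that fit given the register budget, capped by the HW limit."""
    regs_per_wave = regs_per_thread * lanes
    return min(hw_max_waves, regfile_regs // regs_per_wave)

# GCN SIMD: 16384 32-bit vector registers, 64-wide wavefronts, HW max 10.
gcn_waves = waves_in_flight(64, 64, 16384, 10)
# Kepler SMX: 65536 registers shared by 32-wide warps, HW max 64 warps.
kepler_warps = waves_in_flight(64, 32, 65536, 64)

print(gcn_waves, kepler_warps)  # 4 32
```

With this heavy (assumed) allocation, occupancy on both sides is register-limited well below the hardware maximum, which is the scenario Gipsel describes.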
My understanding was that the 4 SIMDs operate in parallel and have a fixed set of wavefronts assigned to them - why are you adding their individual latency hiding abilities?
You're right, I goofed. The SIMDs execute simultaneously, so even though it's 4 clocks per instruction, that execution overlaps with the other SIMDs.
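The correction boils down to one line of arithmetic: the four SIMDs run concurrently, so their latency-hiding windows overlap in time rather than adding up. A sketch using the same numbers as the earlier 160-clock product:

```python
# The four SIMDs run concurrently, so their latency-hiding windows
# overlap rather than adding up.
SIMDS_PER_CU = 4
WAVES_PER_SIMD = 10
CLOCKS_PER_INSTR = 4

# Multiplying by the SIMD count treats them as if they ran one after
# another - that was the goof.
serial_assumption = SIMDS_PER_CU * WAVES_PER_SIMD * CLOCKS_PER_INSTR
# Each SIMD hides 10 * 4 = 40 clocks, and all four do so in parallel.
overlapped = WAVES_PER_SIMD * CLOCKS_PER_INSTR

print(serial_assumption, overlapped)  # 160 40
```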
LordEC911 said: Wasn't Nvidia touting some sort of feature that would drop the FPS on the peripheral monitors to increase the FPS on the central?
Is there any way to track the specific FPS across all three monitors to see how and when this feature is working?
This was a rumor launched by one of the usual sites. But I never saw it mentioned in any review. Probably as much of a feature as the HW accelerated physics and $300 launch price.
I suppose it's already been asked but I don't feel like digging through 20 pages.
Is the chip-known-as-680 a midrange GPU pushed to high clocks so NVIDIA didn't have to push out their big high-end chip, which is probably not yielding very well? (See the unhappy presentation about TSMC.)