NVIDIA Kepler speculation thread

Discussion in 'Architecture and Products' started by Kaotik, Sep 21, 2010.

  1. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    Shouldn't matter - if you assign 10 to each vALU, you still get 40 cycles of latency hiding out of one vALU instruction across all 40 wavefronts, no?
     
  2. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    Edit: Yep this is right.
     
    #3842 OpenGL guy, Mar 30, 2012
    Last edited by a moderator: Mar 30, 2012
  3. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,791
    Likes Received:
    2,602
    I wonder why ???

    http://www.hardocp.com/article/2012/03/28/nvidia_kepler_geforce_gtx_680_sli_video_card_review/5
     
  4. AlphaWolf

    AlphaWolf Specious Misanthrope
    Legend

    Joined:
    May 28, 2003
    Messages:
    8,486
    Likes Received:
    335
    Location:
    Treading Water
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I think Gipsel was considering a scenario of heavy register allocation and using the same register allocation for both architectures.

    He was indicating that whereas Kepler cannot hide ALU latency with relatively few hardware threads, GCN has no problem.

    ---

    Why does a MAD take NVidia 11 cycles, but AMD 4 cycles?

    ---

    The ability to copy from work item to work item should be very nice, obviating moves through local memory. This is similar to Larrabee's shuffle.
     
  6. Psycho

    Regular

    Joined:
    Jun 7, 2008
    Messages:
    745
    Likes Received:
    39
    Location:
    Copenhagen
    Multiplayer is MUCH more cpu demanding.

    The issue could easily be some memory reporting problem, i.e. the CFX setup reports 7GB of VRAM, which BF3 then tries to fill (or doesn't bother cleaning up stuff not currently in use), leading to massive VRAM swapping.
     
  7. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,791
    Likes Received:
    2,602
    I wouldn't call a difference of 3 FPS losing. However, MP is more memory intensive than SP: bigger maps, larger textures, and more alpha and particle effects (explosions, smoke, etc.).

    The game is multi-GPU aware, it wouldn't do that.
     
  8. AlphaWolf

    AlphaWolf Specious Misanthrope
    Legend

    Joined:
    May 28, 2003
    Messages:
    8,486
    Likes Received:
    335
    Location:
    Treading Water
    Well, in the 4x MSAA test it was 5 FPS (or 12%), which seems significant.

    It's clearly doing something odd with the Radeons, as it's sucking up double the VRAM.
     
  9. boxleitnerb

    Regular

    Joined:
    Aug 27, 2004
    Messages:
    407
    Likes Received:
    0
    How would it do that if it only has 3072MB? Clearly a reporting error. This is nothing new.
     
  10. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Yes, but my argument targeted the latency hiding capabilities (for memory accesses) indicated by the number of wavefronts/warps/workgroups in flight (and the time the issue of instructions takes for these) in case of a certain (high) register allocation.
    Thanks!
    Actually, according to some low level tests it is between 18 and 22 (hot clock) cycles on Fermi depending on register bank conflicts, so maybe nV opted for a constant 11 cycles to get rid of the variable latency for the static scheduling.
    I assumed a high register usage, so this isn't the case anymore. ;)
    For such heavy threads the number of workgroups in flight is simply limited by the size of the register files where nV's GPUs have a disadvantage.
    Exactly.
    The scalar ALU with its separate register file actually should enable AMD to get away with slightly less register usage than nVidia in quite a few cases, as for instance constants or addresses which are the same for all elements in a wavefront can be supplied from there and don't have to be in the vector registers.
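A rough sketch of the register arithmetic behind the "heavy threads" point. The register file sizes, resident-wavefront limits and the 63 registers per thread below are assumptions taken from public spec sheets, not figures stated in this thread, and allocation granularity is ignored.

```python
def waves_in_flight(regfile_regs, regs_per_thread, threads_per_wave, max_waves):
    """Wavefronts/warps that fit, limited by register file size or the hardware cap."""
    by_registers = regfile_regs // (regs_per_thread * threads_per_wave)
    return min(by_registers, max_waves)

# Assumed hardware figures (32-bit registers), not taken from the thread:
KEPLER_SMX = dict(regfile_regs=65536, threads_per_wave=32, max_waves=64)  # per SMX, 192 ALUs
GCN_SIMD = dict(regfile_regs=16384, threads_per_wave=64, max_waves=10)    # per SIMD, 16 ALUs

REGS_PER_THREAD = 63  # hypothetical register-heavy kernel

kepler_warps = waves_in_flight(regs_per_thread=REGS_PER_THREAD, **KEPLER_SMX)
gcn_waves_per_simd = waves_in_flight(regs_per_thread=REGS_PER_THREAD, **GCN_SIMD)

# Resident threads per ALU lane, as a crude measure of latency-hiding headroom
kepler_threads_per_alu = kepler_warps * 32 / 192
gcn_threads_per_alu = gcn_waves_per_simd * 64 / 16

print(kepler_warps, gcn_waves_per_simd)             # 32 4
print(kepler_threads_per_alu, gcn_threads_per_alu)
```

Under these assumptions both designs hold about the same number of heavy threads per SMX/CU, but the SMX spreads them over three times as many ALUs.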
    Are the 4 cycles confirmed by some low level benchmark?
    Considering the simplifications in the ALUs compared to the VLIW architecture (which had 8 cycles arithmetic latency) it appeared quite possible (and the AFDS presentation mentioned back-to-back issue of vector ops without alternating between wavefronts [vector to scalar issue needs to have a 4 cycle latency penalty, btw.]). But seeing that GCN is actually able to hit frequencies above 1 GHz (my initial guess was not higher than VLIW), it could still be 8 cycles, even if the architecture presentation at the AFDS would then have been misleading on this point.

    If it is indeed 4 versus 11 cycles I can only speculate that the reasons are similar to the ones for Fermi: Even as Kepler goes down the route to a much more static scheduling, the actual register access could still be part of the effective latency (not hidden by result forwarding incorporated to the pipeline) while it is not for AMD.

    I will wait for benchmarks of this feature. If it is as fast as in the case of Larrabee, it somewhat contradicts NV's recent mantra of localizing the register files as close as possible to the ALUs to keep the power cost low. But with some additional latency it appears like a nice idea, as it should still be faster and lower power than an exchange through the local memory. Maybe they are even partly reusing the shuffle network for the local memory (which they duplicated for each scheduler/register file set in Kepler?) and just save the writing and reading to the local memory SRAM.
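To make the comparison concrete, here is a toy model of the two exchange paths. It only models what each lane ends up holding, not latency or power, and the function names are made up for illustration (they are not any vendor's API).

```python
def shuffle(regs, src_lane):
    """Register-to-register exchange: lane i reads the value held by lane src_lane[i]."""
    return [regs[s] for s in src_lane]

def exchange_via_local_memory(regs, src_lane):
    """Same exchange done the older way: store to local memory, sync, load."""
    local_mem = [None] * len(regs)
    for lane, value in enumerate(regs):      # every lane writes its register
        local_mem[lane] = value
    # (a barrier would sit here on real hardware)
    return [local_mem[s] for s in src_lane]  # every lane reads its partner's slot

WARP = 32
regs = [lane * 10 for lane in range(WARP)]              # one value per lane
rotate_by_one = [(lane + 1) % WARP for lane in range(WARP)]

via_shuffle = shuffle(regs, rotate_by_one)
via_local = exchange_via_local_memory(regs, rotate_by_one)
print(via_shuffle == via_local)  # True
```

Both paths give the same result; the shuffle just skips the SRAM write and read, which is where the speed and power saving would come from.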
     
    #3850 Gipsel, Mar 30, 2012
    Last edited by a moderator: Mar 30, 2012
  11. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    My understanding was that the 4 SIMDs operate in parallel and have a fixed set of wavefronts assigned to them - why are you adding their individual latency hiding abilities?

    Ah, right. It seems to me NV's GPUs are at a disadvantage in either scenario though.
     
  12. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    You're right, I goofed. The SIMDs execute simultaneously so even though it's 4 clocks per instruction, that execution overlaps with the other SIMDs.
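The corrected arithmetic as a small sketch, using the numbers from the posts above (4 SIMDs, 10 wavefronts each, 4 clocks per vector instruction). The round-robin issue order is a simplifying assumption.

```python
CYCLES_PER_VECTOR_INSTR = 4  # from the thread: 4 clocks per instruction
WAVES_PER_SIMD = 10          # from the thread: 10 wavefronts assigned to each vALU
SIMDS_PER_CU = 4             # from the thread: 4 SIMDs running in parallel

def cycles_covered_per_simd(waves=WAVES_PER_SIMD, cycles=CYCLES_PER_VECTOR_INSTR):
    """Cycles one SIMD stays busy issuing one vector instruction from each of its wavefronts."""
    return waves * cycles

# Correct: the SIMDs overlap in time, so the CU-wide figure is the per-SIMD figure.
covered = cycles_covered_per_simd()

# The goof: summing over SIMDs as if they ran one after another.
wrong_sum = SIMDS_PER_CU * covered

print(covered)    # 40
print(wrong_sum)  # 160
```

So one vALU instruction per wavefront across all 40 wavefronts covers 40 cycles of latency, not 160.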
     
    #3852 OpenGL guy, Mar 30, 2012
    Last edited by a moderator: Mar 30, 2012
  13. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland

    This is a damn good question. Sadly I don't have 3 monitors to test with, but I would suggest they try the same run without FXAA (4x MSAA only) just to see what happens. But yes, something is strange in their test if the 680 takes only 2040 MB vs. 4000 MB+ for the 7970s...
     
    #3853 lanek, Mar 31, 2012
    Last edited by a moderator: Mar 31, 2012
  14. LordEC911

    Regular

    Joined:
    Nov 25, 2007
    Messages:
    789
    Likes Received:
    74
    Location:
    'Zona
    Wasn't Nvidia touting some sort of feature that would drop the FPS on the peripheral monitors to increase the FPS on the central?

    Is there any way to track the specific FPS across all three monitors to see how and when this feature is working?
     
  15. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    This was a rumor launched by one of the usual sites. But I never saw it mentioned in any review. Probably as much of a feature as the HW accelerated physics and $300 launch price.
     
  16. LordEC911

    Regular

    Joined:
    Nov 25, 2007
    Messages:
    789
    Likes Received:
    74
    Location:
    'Zona
    Yeah, you are right. Now that I think about it I didn't see it mentioned in a single review.
     
  17. Tridam

    Regular Subscriber

    Joined:
    Apr 14, 2003
    Messages:
    541
    Likes Received:
    47
    Location:
    Louvain-la-Neuve, Belgium
    When a surround setup is enabled (e.g. 5760x1080), the new drivers let you efficiently emulate a single-monitor resolution (e.g. 1920x1080) without having to disable surround (they add black on both sides of the rendered frame). You can now keep surround enabled at all times but decide in which games to actually use it. There is still some cost compared to disabling surround but it should be negligible.

    As usual with some leakers they understand half of what they read and are fine with it ;)
     
  18. kyleb

    Veteran

    Joined:
    Nov 21, 2002
    Messages:
    4,165
    Likes Received:
    52
    That is a nice feature though.
     
  19. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,457
    Likes Received:
    580
    Location:
    WI, USA
    I suppose it's already been asked but I don't feel like digging through 20 pages. ;)

    Is the chip-known-as-680 a midrange GPU pushed to high clocks so NVIDIA didn't have to push out their big high end chip that is probably not yielding very well? (See unhappy presentation about TSMC)
     
  20. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,496
    Likes Received:
    910
    It's a GPU that is comparable to GF104/114 in physical terms (size, power envelope) so perhaps you can think of it in whatever terms you thought of GF104/114.

    As for the big chip (GK110, apparently) it's probably not yielding well, but that's normal for a big chip on 28nm. More importantly, it's just not ready for launch.
     