NVIDIA Maxwell Speculation Thread

Discussion in 'Architecture and Products' started by Arun, Feb 9, 2011.

  1. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    No confirmation, but I believe a Ti version based on GM204 is more than just a possibility: 460 Ti, 560 Ti, 660 Ti, etc. This GM206 has half the specs of a GTX 980, which leaves a lot of room between the 970 and this one.
    980 = 16 SMM, 970 = 13 SMM, 960 = 8 SMM... (wouldn't a Ti with 10-12 SMM and a 256-bit 2GB version do the trick?)
     
    #2721 lanek, Jan 15, 2015
    Last edited: Jan 15, 2015
  2. Kaarlisk

    Regular Subscriber

    Joined:
    Mar 22, 2010
    Messages:
    293
    Likes Received:
    49
    I'm really hoping there will be 4GB versions that aren't much more expensive. Otherwise, it will be hard to choose between the GTX 960 and the R9 280X.
     
  3. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    FP10 is slightly too inaccurate for accumulation purposes. I would prefer to accumulate in FP16 even when I output in FP11. In this case FP16 causes practically no loss, whereas FP10 / FP11 would both cause loss. It is also debatable whether FP11 output (storage) is enough for PBR (with a realistic dynamic range and realistic specular exponents). I find it (barely) enough when used as a storage format. Accumulating multiple lights (of similar brightness) would reduce the mantissa quality by roughly one or two bits (and reducing "barely enough" by two bits is not going to please the artists).
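    The accumulation-precision point can be sketched numerically. The toy Python below rounds to n explicit mantissa bits after every add (ignoring exponent range limits and denormals, so it only approximates real FP16/FP11/FP10), and shows the accumulation error growing as mantissa bits shrink:

    ```python
    import math

    def quantize(x, mant_bits):
        # Round x to `mant_bits` explicit mantissa bits (plus the implicit
        # leading bit); exponent range is ignored -- a simplification.
        if x == 0.0:
            return 0.0
        m, e = math.frexp(x)            # x = m * 2**e, 0.5 <= |m| < 1
        scale = 2.0 ** (mant_bits + 1)  # +1 for the implicit leading bit
        return math.ldexp(round(m * scale) / scale, e)

    def accumulate(lights, mant_bits):
        acc = 0.0
        for light in lights:
            acc = quantize(acc + light, mant_bits)
        return acc

    lights = [0.37] * 100               # 100 lights of similar brightness
    exact = sum(lights)
    for mant_bits, name in [(10, "FP16"), (6, "FP11"), (5, "FP10")]:
        err = abs(accumulate(lights, mant_bits) - exact) / exact
        print(f"{name}: relative error {err:.2e}")
    ```

    At 5-6 mantissa bits the accumulator's step size eventually exceeds the per-light contribution and the sum stops advancing, which is exactly the loss being described.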
    If you are at 110 you are already dead :D. A more realistic scenario is optimizing down from something like ~70 to 48 (or 64). This provides a big performance boost (3 -> 5 concurrent waves). Obviously it requires quite a bit more work than saving four registers, but four is always a good start...
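    The wave-count arithmetic behind "3 -> 5 concurrent waves" is easy to reproduce. A minimal sketch, assuming GCN-style numbers (a 256-entry VGPR file per SIMD lane, at most 10 waves per SIMD) and ignoring allocation granularity and SGPR/LDS limits:

    ```python
    def waves_per_simd(vgprs_per_thread, vgpr_file=256, max_waves=10):
        # Concurrent waves per SIMD when limited only by the VGPR budget.
        return min(max_waves, vgpr_file // vgprs_per_thread)

    for vgprs in (110, 84, 70, 64, 48, 24):
        print(f"{vgprs:3d} VGPRs -> {waves_per_simd(vgprs)} waves")
    ```

    Going from 70 to 48 VGPRs moves occupancy from 3 to 5 waves, while 64 VGPRs only gets you to 4 -- which is why the exact target matters.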
    DirectX 10/11 IL is horrible. It's still vector based, and the compiler does silly things trying to optimize your code for vector architectures (which no longer exist). All the major PC GPU vendors (+ Imagination -> Apple) moved to scalar architectures years ago.

    The only reason for writing DirectX assembly (IL) in DX8 / DX9 (SM 2.0) was the strict instruction limit. The first hlsl compilers were VERY bad, frequently overflowing the 64-instruction limit. You basically had to hand-write the assembly language in order to do anything complex with SM 2.0. The strict 64-instruction limit was silly, as it was an IL instruction count limit (not an actual limit on the hardware microcode ops).
    I can only talk about the Xbox 360 here, as Microsoft has made most of the low-level details of the architecture public, including the microcode syntax (thanks to the XNA project). The Xbox 360 supported inline microcode (an hlsl asm block), making it easy to write the most critical sections in microcode.

    Isolate documentation: http://msdn.microsoft.com/en-us/library/bb313977(v=xnagamestudio.31).aspx.
    Other hlsl extended attributes: http://msdn.microsoft.com/en-us/library/bb313968(v=xnagamestudio.31).aspx.

    Some microcode stuff (and links to more) can be found here: http://synesthetics.livejournal.com/3720.html

    Unfortunately many of the XNA pages have been removed (most likely because XNA was discontinued), so most of the links in that presentation (and some Google search results) no longer work. Google cache helps.
    One of the most important things to remember is that only peak GPR usage matters. People often describe this as a problem of GPU architecture design. However, it is sometimes a good thing as well: you can freely use as many GPRs as you like elsewhere (assuming those registers are not live at the peak), and you only need to optimize the peak to reduce the GPR count (not the other local peaks that are smaller than the biggest one).
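    The "only the peak matters" observation fits in a couple of lines of Python: the allocated register count is the maximum of the liveness profile, so raising any non-peak point toward the peak is free (hypothetical numbers):

    ```python
    def allocated_gprs(live_counts):
        # A shader's GPR allocation is set by its worst program point.
        return max(live_counts)

    profile = [20, 32, 48, 96, 40, 24]   # a single peak of 96 live registers
    print(allocated_gprs(profile))        # 96

    # Using far more registers everywhere *except* at the peak costs nothing:
    padded = [min(95, n + 40) for n in profile[:3]] + profile[3:]
    print(allocated_gprs(padded))         # still 96
    ```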
    Yes... but Larrabee was slightly too big and less energy efficient than the competition. Hopefully Intel returns to this concept in the future. Intel has the best chance of pulling this off, as they have quite a big process advantage.
    I didn't mean that the tessellation API is messy. The API is perfect for tessellation, but it could have been more generic to suit some other purposes as well. This pipeline setup has some nice properties, such as running multiple shaders concurrently at different granularities (and passing data between them on-chip).

    There are several use cases where you'd want different granularities for different processing and memory accesses. The GCN scalar unit is helpful in some of them (when the granularity difference is 1:64 or more), but it's not generic enough. The work suitable for the scalar unit is automatically extracted from the hlsl code by the compiler. As you said earlier, the compilers are not always perfect. I would prefer to manually state which instructions (and loads) are scalar, to ensure that my code works the way I intend. Basically you perform "threadId / 64" before your memory request (and math) and hope for the best. It seems that loads/math based on system value semantics (and constants read from constant buffers) have a higher probability of being extracted to the scalar unit. The scalar unit is also very good for reducing register pressure (as it stores a value once per wave, not once per thread). If you have values that are constant across the 64 threads, the compiler should definitely keep them in scalar registers (as scalar -> vector moves are fast).
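    The uniformity condition the compiler looks for can be illustrated with a toy wave model (hypothetical helper, not actual compiler logic): a value is scalar-unit material only when all 64 lanes of the wave agree, which is exactly what "threadId / 64" produces:

    ```python
    WAVE_SIZE = 64

    def classify(lane_values):
        # One scalar register serves the whole wave if every lane agrees;
        # otherwise each of the 64 lanes needs its own vector register slot.
        return "scalar" if len(set(lane_values)) == 1 else "vector"

    thread_ids = list(range(WAVE_SIZE))              # 0..63: divergent per lane
    wave_id = [t // WAVE_SIZE for t in thread_ids]   # threadId / 64: uniform

    print(classify(thread_ids))  # vector
    print(classify(wave_id))     # scalar
    ```

    This also shows the register-pressure win being described: the uniform value occupies one register per wave instead of one per thread.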
    This sounds like a good idea. However, couldn't it just store the data to memory, since all memory accesses go through the L2 cache? If the lines are not evicted, the GPU will in practice transfer data L1 -> L2 -> L1 (of another CU). To ensure that the temporary memory areas are not written back to RAM after they have been read by the other CU, the GPU should mark these pages as invalid once the other CU has received all the data. On the writing side it should of course also ensure that the line is not loaded from memory (make it a special case to implement a PPC-style "cache line zero" before writing). This way it would use the L2 in a flexible way, and would automatically spill to RAM when needed.

    I am starting to feel that we hijacked this thread... This is getting a little bit off topic already...
     
    #2723 sebbbi, Jan 15, 2015
    Last edited: Jan 15, 2015
    Lightman likes this.
  4. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    If you're speaking about a 4GB version of a 128-bit GM206 960? It shouldn't be very relevant performance-wise (even with Maxwell's memory compression). Now this could well be one more good reason to see a 960 Ti with a 256-bit bus in two versions, 2GB and 4GB.
     
  5. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    797
    Likes Received:
    223
    :roll:

    Since NVIDIA claims that an SMM has 90% of the performance of an SMX, why not also advertise an "effective" core clock of 1521 MHz ( = 1127·(192·0.9)/128 )? :twisted:
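    For what it's worth, the arithmetic in the joke checks out, using the post's numbers (1127 MHz clock, 192 cores per SMX, 128 per SMM):

    ```python
    clock_mhz = 1127          # boost clock figure used in the post
    smx_cores, smm_cores = 192, 128
    smm_efficiency = 0.90     # "an SMM has 90% the performance of an SMX"

    effective_clock = clock_mhz * (smx_cores * smm_efficiency) / smm_cores
    print(f"{effective_clock:.0f} MHz")   # 1521 MHz
    ```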
     
  6. A1xLLcqAgt0qc2RyMz0y

    Veteran

    Joined:
    Feb 6, 2010
    Messages:
    1,589
    Likes Received:
    1,490
    Nvidia’s Monstrous 12GB Quadro M6000 Flagship GM200 GPU Confirmed Via Driver Update – Launching Soon

    http://wccftech.com/quadro-m6000-flagship-professional-gpu-spotted-gm200-finally

    Is there any truth to this rumor about the GM200?

    Would Nvidia release another Quadro 6000 when they already have a Kepler one released:

    http://www.nvidia.com/object/product-quadro-6000-us.html

    I am speculating here, but if there is going to be a Maxwell GM200-based Quadro M6000, and the 6000 model number is kept, then the performance will be close to the Kepler-based Quadro 6000 (the Maxwell part may have units disabled), while Maxwell brings new features and lower power.
     
    #2726 A1xLLcqAgt0qc2RyMz0y, Jan 15, 2015
    Last edited: Jan 15, 2015
  7. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,887
    Likes Received:
    4,534
    Interesting from the GTX 960 leak ...

     
    #2727 pharma, Jan 15, 2015
    Last edited: Jan 15, 2015
  8. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    Sorry, but this made me smile.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    fp16 has 11 effective bits (10 stored), which is roughly 3 significant digits (3.25?). So convert an fp32 result that you know is in the range 0.f - 1.f to int by multiplying by 1024 before packing (losing a quarter of a bit). Gamma-encode before packing if artefacts appear in the darker tones due to repeated encode/decode...
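    The multiply-by-1024 trick works because fp16 represents every integer up to 2^11 = 2048 exactly, so an integer in 0..1024 survives the round trip untouched. A quick Python check (using struct's 'e' half-precision format; the helper name is just for illustration):

    ```python
    import struct

    def roundtrip_fp16(value):
        # Store a Python float as IEEE-754 half precision and read it back.
        return struct.unpack('e', struct.pack('e', value))[0]

    # Every integer 0..1024 is exactly representable in fp16:
    assert all(roundtrip_fp16(float(i)) == i for i in range(1025))

    # So a [0,1] fp32 value packed as round(x * 1024) is stored losslessly,
    # quantized to 1/1024 steps:
    x = 0.3337
    encoded = float(round(x * 1024))
    decoded = roundtrip_fp16(encoded) / 1024
    print(x, "->", decoded)
    ```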

    I think it's important to distinguish between vec4, which is good for expressing computations on vertices and pixels, and scalar ALUs. I agree, compilers should not be working with (converting to) vec4 as the base data type when they're targeting scalar hardware.

    As I predicted in those exciting times before Larrabee got canned, GPUs were about to hit the wall due to the process-node slowdown and power, which would have meant Larrabee catching up pretty rapidly. I suspect Intel ran away mostly because of the software problem: it would have had to interact with consumers in the enthusiast gaming space to retain credibility. You could argue it does now with its APUs, but I'm doubtful.

    This is a rich topic, that I've only vaguely explored with loop counters and associated math. Sounds like you've had more luck than me!

    I agree that an LRU policy could work. But AMD's performance penalties with high tessellation factors, which are generally not solved by writing off-chip (and therefore being cached in L2), indicate that AMD's L2 isn't really working that way. I suspect NVidia configures the cache specifically for this case in tessellation. And this kind of use case, supported by effective, configurable, on-chip storage, is precisely what developers need to make progress with their own pipelined algorithms.

    I was hoping someone would share experience with register allocation woes on NVidia :razz:
     
  10. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    946
    Likes Received:
    413
    The problem is, it's not [0,1] but more like [0,20] per source, and a good sun/probe can reach [0,16k]. That you'd blow the accumulator's hypothetical [0,1] with 100 [0,1] lights is also clear, even if they're distributed over the hemisphere.
     
  11. Kaarlisk

    Regular Subscriber

    Joined:
    Mar 22, 2010
    Messages:
    293
    Likes Received:
    49
    Yes, GM206. It is a form of future-proofing; I'd like to keep it for 3-4 years.
     
  12. Alatar

    Newcomer

    Joined:
    Aug 26, 2014
    Messages:
    26
    Likes Received:
    18
  13. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    Ouch, well, it is TSMC who will be happy (fewer dies per wafer)... especially now that Qualcomm (and Apple) have dropped them for 16nm FF (going back to Samsung / GloFo).
     
  14. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    4,309
    Likes Received:
    1,102
    Location:
    35.1415,-90.056
    To quote a crappy American movie: "That's a huge bitch!" ;)

    The next gen after the 980 will be my next video card purchase, and I'm going back to Team Green after four iterations of Team Red. LET'S MAKE IT HAPPEN, NV!!
     
  15. psolord

    Regular

    Joined:
    Jun 22, 2008
    Messages:
    444
    Likes Received:
    55
    The die being so huge maybe means there will be many bad chips early on, hence GeForce chips may stock up pretty quickly and a 980 Ti could launch sooner than the 780 Ti did?
     
  16. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    (Have to resist putting on the Charlie hat. :wink: )
    Around 624mm2 is big, but if they can produce a 780 Ti that's only 10-15% smaller for over a year, they should be able to do just the same with this one.

    These die-shot measurements always come out a bit larger than the actual size, so it's probably closer to 600mm2.
     
  17. Dangerman

    Newcomer

    Joined:
    Apr 1, 2014
    Messages:
    43
    Likes Received:
    8
    That's probably 600mm2 or slightly under, extrapolating from GM200 being 50% bigger than GM204 and its specs being 50% higher.
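    The extrapolation is straightforward, assuming the commonly reported ~398 mm² figure for GM204:

    ```python
    gm204_area = 398.0   # mm^2, commonly reported GM204 die size (assumption)
    unit_scale = 1.5     # GM200 carries ~50% more units than GM204

    gm200_estimate = gm204_area * unit_scale
    print(f"~{gm200_estimate:.0f} mm^2")   # ~597 mm^2
    ```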
     
  18. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    9 times out of 10 the borders fool folks into thinking that dies are larger than they actually are. GK110 is 551mm2, and GM200 shouldn't be as big as everyone thinks it is based on the pictures.
     
  19. Psycho

    Regular

    Joined:
    Jun 7, 2008
    Messages:
    746
    Likes Received:
    41
    Location:
    Copenhagen
    Alatar's estimate is based on the relative difference to GK110, so the border (chip vs. package) issue doesn't apply here.
     
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Ugh, yes, whoops, I didn't think of that. Clearly, true floating point doesn't have that problem. A scaled log might work. Clutching at straws now (EDIT: especially because of the bits lost to the most significant digit)...
     
    #2740 Jawed, Jan 16, 2015
    Last edited: Jan 16, 2015


  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.