In this case I'm thinking of packing three 10-bit integers to save four overall
FP10 is slightly too inaccurate for accumulation purposes. I would prefer to accumulate in FP16 when I output in FP11. In this case FP16 causes practically no loss, while FP10 / FP11 would both cause loss. It is also debatable whether FP11 output (storage) is enough for PBR (with realistic dynamic range and realistic specular exponents). I find it (barely) enough when used as a storage format. Accumulating multiple lights (of similar brightness) would reduce the mantissa quality by roughly one or two bits (and reducing "barely enough" by two bits is not going to please the artists).
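To illustrate the packing idea from the quote, here is a minimal HLSL sketch (function names are made up; it assumes the three values are unsigned and fit in 10 bits):

```hlsl
// Pack three 10-bit unsigned values into a single 32-bit uint,
// so they occupy one GPR instead of three.
uint Pack3x10(uint3 v)
{
    return (v.x & 0x3FF) | ((v.y & 0x3FF) << 10) | ((v.z & 0x3FF) << 20);
}

// Unpack when the values are needed again.
uint3 Unpack3x10(uint packed)
{
    return uint3(packed & 0x3FF, (packed >> 10) & 0x3FF, (packed >> 20) & 0x3FF);
}
```

The extra ALU work is usually cheap compared to the registers saved, but (as said above) packed float formats like FP10/FP11 are a different story when you need to accumulate into them.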
But anyway, if your register allocation is at 110, then saving four registers isn't going to magically get you down to 84, which is what you need to get a whole extra hardware thread on the ALU
If you are at 110 you are already dead. A more realistic scenario is optimizing down from something like ~70 registers to 48 (or 64). This provides a big performance boost (3 -> 5 concurrent waves, since the 256-entry VGPR file of a GCN SIMD is divided between its waves: floor(256/70) = 3, but floor(256/48) = 5). Obviously it requires quite a bit more work than saving four registers, but four is always a good start.
I have not-so-fond memories of debugging complex memory addressing in IL by writing stuff to a buffer instead of the kernel's output. On PC, assembly is still subject to the whims of the driver, so the only real solution seems to be writing/patching the ELF format.
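For illustration, the write-to-a-debug-buffer trick might look something like this in HLSL (the buffer name and the address math are made up; the point is just to dump intermediates for CPU readback):

```hlsl
// Debug-only side channel: dump intermediate addressing results to a separate
// UAV instead of (or in addition to) the kernel's normal output.
RWStructuredBuffer<uint2> debugOut : register(u1);

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // Stand-in for whatever complex address calculation is being debugged.
    uint byteAddress = ((dtid.x * 12) + (dtid.x >> 4)) & 0xFFFFFFFC;

    // Record (thread id, computed address), then read the buffer back on the CPU.
    debugOut[dtid.x] = uint2(dtid.x, byteAddress);
}
```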
DirectX 10/11 IL is horrible. It's still vector based. The compiler does silly things trying to optimize your code for vector architectures (which no longer exist). All the major PC GPU vendors (+ Imagination -> Apple) moved to scalar architectures years ago.
The only reason for writing DirectX assembly (IL) in DX8 / DX9 (SM 2.0) was the strict instruction limit. The first hlsl compilers were VERY bad, frequently overflowing the 64 instruction limit. You basically had to hand-write the assembly in order to do anything complex with SM 2.0. The strict 64 instruction limit was a silly limit, as it was an IL instruction count limit (not an actual limit on the hardware microcode ops).
Are there whims to deal with on console if you write assembly?
I can only talk about the Xbox 360 here, as Microsoft has posted most of the low level details about the architecture publicly, including the microcode syntax (thanks to the XNA project). Xbox 360 supported inline microcode (hlsl asm block), making it easy to write the most critical sections in microcode.
Isolate documentation:
http://msdn.microsoft.com/en-us/library/bb313977(v=xnagamestudio.31).aspx
Other hlsl extended attributes:
http://msdn.microsoft.com/en-us/library/bb313968(v=xnagamestudio.31).aspx
Some microcode stuff (and links to more) can be found here:
http://synesthetics.livejournal.com/3720.html
Unfortunately many of the XNA pages have been removed (most likely because XNA is discontinued), so most of the links in that presentation (and some Google search results) do not work. Google cache helps.
Other techniques I forgot to mention last night...
One of the most important things to remember is that only peak GPR usage matters. People often describe this as a problem of GPU architecture design. However, it is sometimes a good thing as well, since it means that you can freely use more GPRs in other places (assuming those new registers are not live at the peak), and you only need to optimize the peak to reduce the GPR count (not the other local peaks that are smaller than the biggest one).
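A contrived HLSL sketch of the idea (the live-value comments are illustrative; the real allocation of course comes from the compiler):

```hlsl
Texture2D    tex0;
SamplerState samp;

float4 ShadePixel(float2 uv : TEXCOORD0) : SV_Target
{
    // Section A: several wide values are live at the same time -> this is the peak.
    float4 s0 = tex0.Sample(samp, uv);
    float4 s1 = tex0.Sample(samp, uv + float2(0.001, 0.0));
    float4 s2 = tex0.Sample(samp, uv + float2(0.0, 0.001));
    float4 blurred = (s0 + s1 + s2) * (1.0 / 3.0);

    // Section B: s0/s1/s2 are dead here, so these temporaries reuse the same
    // registers. Adding more work here does not raise the allocation as long
    // as its local register usage stays below the peak of section A.
    float  luma  = dot(blurred.rgb, float3(0.299, 0.587, 0.114));
    float3 toned = blurred.rgb / (1.0 + luma);
    return float4(toned, 1.0);
}
```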
To catch up with what could have been possible with Larrabee years ago
Yes... but Larrabee was slightly too big and less energy efficient compared to the competition. Hopefully Intel returns to this concept in the future. Intel has the highest chance to pull this off, as they have quite a big process advantage.
I can't remember seeing an alternative tessellation API that would have been demonstrably as capable and cleaner. Anyway, I'm not sure if there's likely to be yet more pipeline stages.
I didn't mean that the tessellation API is messy. This API is perfect for tessellation, but it could have been more generic to suit some other purposes as well. This pipeline setup has some nice properties, such as running multiple shaders concurrently with different granularity (and passing data between them on-chip).
There are several use cases where you'd want to have different granularity for different processing and memory accesses. The GCN scalar unit is helpful for some of them (when the granularity difference is 1:64 or more), but it's not generic enough. The work suitable for the scalar unit is automatically extracted from the hlsl code by the compiler. As you said earlier, the compilers are not always perfect. I would prefer to manually state which instructions (and loads) are scalar to ensure that my code works the way I intend. Basically you need to perform "threadId / 64" before your memory request (and math) and hope for the best. It seems that loads/math based on system value semantics (and constants read from constant buffers) have a higher probability of being extracted to the scalar unit. The scalar unit is also very good for reducing register pressure (as a value needs to be stored only once per wave, not once per thread). If you have values that are constant across 64 threads, the compiler should definitely keep these values in scalar registers (as scalar -> vector moves are fast).
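A hedged HLSL sketch of that "threadId / 64" pattern (buffer names are made up, and whether the load actually lands on the scalar unit is still up to the compiler):

```hlsl
StructuredBuffer<float4>   perWaveData : register(t0);
RWStructuredBuffer<float4> output      : register(u0);

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // All 64 threads of a wave compute the same index, so the compiler can
    // (hopefully) issue this as a scalar load and keep the value in SGPRs:
    // one register per wave instead of one per thread.
    uint waveIndex = dtid.x / 64;
    float4 waveConstant = perWaveData[waveIndex];

    // Per-thread work that consumes the wave-wide value.
    output[dtid.x] = waveConstant * (float)(dtid.x & 63);
}
```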
What we need is to be able to lock L2 cache lines (or call them last-level cache lines if you want to make it generic) for on-chip pipe buffering. GPUs have loads of L2. NVidia is effectively doing so as part of load-balancing tessellation.
This sounds like a good idea. However, couldn't it just store the data to memory, because all memory accesses go through the L2 cache anyway? If the lines are not evicted, the GPU will in practice transfer data from L1 -> L2 -> L1 (of another CU). To ensure that the temporary memory areas are not written out to RAM after they have been read by the other CU, the GPU should mark these pages as invalid once the other CU has received all the data. On the writing side it should of course also ensure that the line is not loaded from memory first (a special case similar to the PPC-style "cache line zero" before writing). This way it would use the L2 in a flexible way, and would automatically spill to RAM when needed.
I am starting to feel that we hijacked this thread... This is getting a little bit off topic already...