NVIDIA Maxwell Speculation Thread

It seems that NVCC will use significantly more registers by default when building for Maxwell GPUs, yet the register file size per SM is the same as Kepler's.

I don't know whether that means Maxwell simply doesn't care about occupancy as much, thanks to the L2 cache and possibly reduced instruction latencies, or whether they actually have no choice but to use that many registers because they cut corners here and there too much (L1?).

https://devtalk.nvidia.com/default/...hats-new-about-maxwell-/post/4127010/#4127010

Also, notice that the deviceQuery sample that comes with CUDA 6.0 reports that the maximum SMEM usable per block is still 48 KB for Maxwell cards, despite Maxwell having 64 KB of SMEM per SM. Although, according to the above figure, NVCC can build SM 5.0 objects even if a block asks for 64 KB of SMEM, so I don't know whether it's an issue with the driver, NVCC, or both.
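
For reference, the same limits can be double-checked straight from the runtime with cudaGetDeviceProperties rather than relying on the sample's output. A minimal sketch (standard CUDA runtime API fields only; nothing Maxwell-specific is assumed):

// query_limits.cu -- print the per-block shared memory and register limits
// that the CUDA runtime reports for device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s (sm_%d%d)\n", prop.name, prop.major, prop.minor);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers per block:     %d\n", prop.regsPerBlock);
    return 0;
}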
 
It seems that NVCC will use significantly more registers by default when building for Maxwell GPUs, yet the register file size per SM is the same as Kepler's.

Nope, sorry to mislead w/the register count. It's still 64K. The regs were changed because I was trying out a different block size. I'll fix the post so that the regs appear the same.
 
What is the possibility of, or Nvidia's desire for, a 256-bit memory bus 2xGM107 in one chip if there are delays or problems with TSMC 20 nm?
 
It seems that NVCC will use significantly more registers by default when building for Maxwell GPUs, yet the register file size per SM is the same as Kepler's.

Nope, sorry to mislead w/the register count. It's still 64K. The regs were changed because I was trying out a different block size. I'll fix the post so that the regs appear the same.

@LiXiangyang, actually you are right that when NVCC is targeting sm_35 it's very frugal with registers, while the Maxwell sm_50 allocation seems less optimized. Perhaps because sm_35 has been polished for the past year? :smile:

Anyway, you made a good observation and it actually took some effort to push the register allocation upward in my dummy test program. The test program was just emitting a slew of outstanding loads followed by an equal number of stores with some shared memory accesses mixed in. The sm_35 target was intelligently optimizing the register footprint until I stopped it (__syncthreads() is the solution to everything).

The screen captures are now updated.
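
For the curious, the dummy test was roughly of this shape. This is only a hypothetical reconstruction from the description above (the kernel name, N, and the 256-thread block size are my own choices), not the actual test program:

// Each thread issues N independent loads, mixes in a shared-memory access,
// then hits a barrier before storing, so all N values must stay live in
// registers across the __syncthreads(). Launch with 256-thread blocks;
// 'in' and 'out' must hold N * gridDim.x * blockDim.x floats.
#define N 32

__global__ void reg_pressure(const float* __restrict__ in, float* out) {
    __shared__ float smem[256];
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float v[N];
    #pragma unroll
    for (int i = 0; i < N; ++i)           // a slew of outstanding loads
        v[i] = in[tid + i * gridDim.x * blockDim.x];

    smem[threadIdx.x] = v[0];             // some shared memory mixed in
    __syncthreads();                      // forces the loaded values to stay live

    float acc = smem[(threadIdx.x + 1) % blockDim.x];
    #pragma unroll
    for (int i = 0; i < N; ++i)           // an equal number of stores
        out[tid + i * gridDim.x * blockDim.x] = v[i] + acc;
}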
 
Plus don't forget that register pressure will be lowered a little bit thanks to the shorter arithmetic pipeline.
Do you have a number for this? Is the pipeline really shorter or did they implement result forwarding to cut the register file access from the instruction latency?
 
I'm wondering if Sandra might be launching a bunch of warps/wavefronts that all do memory access and end up getting scheduled in round robin fashion, so that what Sandra really measures is something more like the size of the scheduler's queue of warps/wavefronts rather than actual cache latency.
I thought about one of the specific tests some time ago (pre GCN) to find an explanation for the weird (and clearly wrong) numbers. My conclusion was that at least in one of the tests (I think it was local/shared memory latency) Sandra was basically spitting out a convoluted measure for the clause scheduling latency. So yes, sometimes Sandra appears to be a test of the scheduling behaviour (and the architectural knowledge of the Sandra guys ;)), not memory latency.
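
For contrast, the textbook way to measure raw latency rather than scheduling behaviour is a single-thread dependent pointer chase, so there are no other warps for the scheduler to hide behind. A rough sketch of that idea (not what Sandra actually does; 'chain' is assumed to be a pre-built cycle of indices spanning the working set, and the kernel is launched as <<<1,1>>>):

// Single-thread dependent pointer chase: each load's address depends on the
// previous load's result, so the latency cannot be hidden by other warps.
__global__ void pointer_chase(const int* __restrict__ chain,
                              int steps, long long* cycles, int* sink) {
    int idx = 0;
    long long start = clock64();
    for (int i = 0; i < steps; ++i)
        idx = chain[idx];                 // serially dependent loads
    long long stop = clock64();
    *cycles = (stop - start) / steps;     // average cycles per access
    *sink = idx;                          // keep the chase from being optimized away
}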
 
Tomshardware has the traditional (for CPU at least) graph that plots average memory latency for random accesses within a block of a certain size. The result is remarkable in that the latency has gone down dramatically, especially for the external memory, where it goes from 280 cycles for a 650Ti to 180 for the 750Ti. That's really a massive improvement.

(Link: http://www.tomshardware.com/reviews/geforce-gtx-750-ti-review,3750.html)

Wow, no wonder nv managed to double perf/W on the same node.
 
rpg.314's point about L1 misses mainly going to L2 also makes sense, and I also didn't know that L1 mainly cached stacks. Sounds like maybe things might be different for later chips.
On a GPU, the stack variables are in registers.

L1 is for spilled registers and other data.
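
One way to see that split in practice is to compile with ptxas verbose output and compare a kernel whose locals stay in registers with one that is forced into local memory. A hypothetical snippet (the kernel names are mine):

// Compile with:  nvcc -arch=sm_50 -Xptxas -v spill_demo.cu
// ptxas -v reports the register count plus any local-memory/spill bytes.

// Plain scalars live entirely in registers.
__global__ void stays_in_regs(float* out) {
    float a = out[threadIdx.x];
    float b = a * 2.0f + 1.0f;
    out[threadIdx.x] = b;
}

// A large, dynamically indexed per-thread array cannot be register-allocated,
// so it is placed in local memory (cached by L1/L2, backed by DRAM).
__global__ void spills_to_local(const int* idx, float* out) {
    float buf[256];
    for (int i = 0; i < 256; ++i)
        buf[i] = out[i] + i;
    out[threadIdx.x] = buf[idx[threadIdx.x] & 255];  // runtime index forces local memory
}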
 
Also, notice that the deviceQuery sample that comes with CUDA 6.0 reports that the maximum SMEM usable per block is still 48 KB for Maxwell cards, despite Maxwell having 64 KB of SMEM per SM. Although, according to the above figure, NVCC can build SM 5.0 objects even if a block asks for 64 KB of SMEM, so I don't know whether it's an issue with the driver, NVCC, or both.

There is a bug in the RC version of the ptx assembler. The limit is indeed 48 KB of shared memory per block. This will be corrected for the official release.
 
Sorry - my ignorance of GPU programming is showing. I can see how you'd want to keep as much stuff as possible in regs. But I don't really understand where you would spill registers to, if not to something that is essentially stack like. Explanations appreciated :)
 
There is a bug in the RC version of the ptx assembler. The limit is indeed 48 KB of shared memory per block. This will be corrected for the official release.

Which? The limit or the documentation?
 
Not on GM204.. I'm 99.99% sure that on DP they will cripple it again and disable DP units.

Then you obviously didn't understand what you just quoted. If GM204 is also going to have slightly more FP64 SPs compared to GK104 (just as GM107 vs. GK107), GM204 might end up with MORE SMMs than I would have expected.
 
Sorry - my ignorance of GPU programming is showing. I can see how you'd want to keep as much stuff as possible in regs. But I don't really understand where you would spill registers to, if not to something that is essentially stack like. Explanations appreciated :)


GK110 fits 30720 threads on chip at once. The runtime allocates stack space for each of these threads to hold spilling, but only a small amount for each thread, since there are so many of them. The other wrinkle is that stacks are not contiguous - all the stacks are interleaved so that when a SIMD vector does a load from the same address relative to each lane's stack pointer, the load is vectorized. This is why they talk about "local memory", since it's allocated and accessed differently than normal "global memory".
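
That per-thread stack reservation is also visible (and adjustable) through the runtime. A small sketch, with the 30720-thread figure taken from the post above rather than from any API:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // The runtime reserves a small per-thread stack for spills and ABI calls.
    size_t stack_bytes = 0;
    cudaDeviceGetLimit(&stack_bytes, cudaLimitStackSize);
    printf("default stack per thread: %zu bytes\n", stack_bytes);

    // Total reserved if a chip keeps 30720 threads resident (GK110 figure
    // quoted above) -- still modest because each stack is small.
    printf("~%zu KB across 30720 resident threads\n",
           stack_bytes * 30720 / 1024);

    // It can be raised if a kernel needs deeper spilling or recursion.
    cudaDeviceSetLimit(cudaLimitStackSize, 4096);
    return 0;
}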
 
On Maxwell sm50 profiles, blocks are HW-limited to 48 KB of shared memory (each). This is the same as on Kepler and Fermi. ptxas was misreporting the capability as 64 KB per block. This is incorrect. The real HW maximum size is 48 KB of shared memory per block.

SMM has a total shared memory capacity of 64 KB. So it can run one block that uses 48 KB, or two blocks each using 32 KB of shared memory at the same time.

Hope this clarifies things.
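
In kernel terms, the limits look like this. A hedged illustration only (kernel names are mine), assuming statically declared shared memory and that 'out' has at least blockDim.x elements:

// One block using the full 48 KB per-block limit (one such block per SMM).
__global__ void uses_48k(char* out) {
    __shared__ char smem[48 * 1024];
    smem[threadIdx.x] = (char)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = smem[(threadIdx.x + 1) % blockDim.x];
}

// Two of these 32 KB blocks fit in the SMM's 64 KB and can run concurrently.
__global__ void uses_32k(char* out) {
    __shared__ char smem[32 * 1024];
    smem[threadIdx.x] = (char)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = smem[(threadIdx.x + 1) % blockDim.x];
}

// __shared__ char smem[64 * 1024];  // exceeds the 48 KB per-block HW limit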
 
On Maxwell sm50 profiles, blocks are HW-limited to 48 KB of shared memory (each). This is the same as on Kepler and Fermi. ptxas was misreporting the capability as 64 KB per block. This is incorrect. The real HW maximum size is 48 KB of shared memory per block.

SMM has a total shared memory capacity of 64 KB. So it can run one block that uses 48 KB, or two blocks each using 32 KB of shared memory at the same time.

Hope this clarifies things.


Absolutely. Thanks Bob!
 
Impressive indeed for a low-end card! It's about time low-end cards get some love.

A $150 video card is anything but low-end. Some people are enthusiasts but either don't want to spend insane amounts of money or simply don't have it; they buy cards in this exact price segment and play everything on close to maximum settings!

Keep in mind also that just one process generation (40 nm) ago, this was high-end performance!
 
A $150 video card is anything but low-end. Some people are enthusiasts but either don't want to spend insane amounts of money or simply don't have it; they buy cards in this exact price segment and play everything on close to maximum settings!

Keep in mind also that just one process generation (40 nm) ago, this was high-end performance!


at a $500 MSRP...
 