NVIDIA Maxwell Speculation Thread

Discussion in 'Architecture and Products' started by Arun, Feb 9, 2011.

  1. LiXiangyang

    Newcomer

    Joined:
    Mar 4, 2013
    Messages:
    81
    Likes Received:
    47
    It seems that NVCC uses significantly more registers by default when building for Maxwell GPUs, yet the register file size per SM is the same as Kepler's.

    I don't know whether that means Maxwell simply doesn't care about occupancy as much, thanks to the L2 cache and possibly reduced instruction latencies, or whether they actually have no choice but to use that many registers because they cut corners here and there too much (L1?).

    https://devtalk.nvidia.com/default/...hats-new-about-maxwell-/post/4127010/#4127010

    Also notice that the deviceQuery tool that comes with CUDA 6.0 still reports a maximum of 48 KB of shared memory per block for Maxwell cards, despite Maxwell having 64 KB of shared memory per SM. Yet according to the figure above, NVCC can build SM 5.0 objects even if a block asks for 64 KB of shared memory, so I don't know whether this is an issue with the driver, with NVCC, or with both.
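    For reference, a minimal sketch of the kind of runtime query the deviceQuery sample is built on (the device index and the exact fields printed are just my choices here, not from the linked post); on a first-generation Maxwell card it reports 48 KB for sharedMemPerBlock even though the SMM has 64 KB in total:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Print the per-block limits the CUDA runtime reports for device 0.
    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // assuming the Maxwell card is device 0
        printf("Compute capability : %d.%d\n", prop.major, prop.minor);
        printf("Shared mem / block : %zu bytes\n", prop.sharedMemPerBlock);
        printf("Registers / block  : %d\n", prop.regsPerBlock);
        return 0;
    }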
     
  2. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,875
    Likes Received:
    2,183
    Location:
    Germany
    ...
     
    #1162 CarstenS, Feb 19, 2014
    Last edited by a moderator: Feb 19, 2014
  3. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    Nope, sorry to mislead w/the register count. It's still 64K. The regs were changed because I was trying out a different block size. I'll fix the post so that the regs appear the same.
     
  4. Osamar

    Newcomer

    Joined:
    Sep 19, 2006
    Messages:
    200
    Likes Received:
    14
    Location:
    40,00ºN - 00,00ºE
    What is the possibility of, or NVIDIA's desire for, a dual-GM107-style chip with a 256-bit memory bus (two GM107s in one chip) if there are delays or problems with TSMC 20nm?
     
  5. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    @LiXiangyang, actually you are right: when NVCC is targeting sm_35 it's very frugal with registers, and the Maxwell sm_50 allocation seems less optimized. Perhaps because sm_35 code generation has been polished for the past year? :smile:

    Anyway, you made a good observation, and it actually took some effort to push the register allocation upward in my dummy test program. The test program just emitted a slew of outstanding loads followed by an equal number of stores, with some shared memory accesses mixed in; a rough sketch is included below. The sm_35 target was intelligently optimizing the register footprint until I stopped it (__syncthreads() is the solution to everything).

    The screen captures are now updated.
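    Roughly, the kernel looked something like this (a reconstruction, not the actual test program; the batch size is arbitrary): a batch of independent loads kept live across a __syncthreads(), then stored back, so the compiler cannot collapse the register footprint.

    #define BATCH 32

    // Hypothetical register-pressure kernel: BATCH outstanding loads, a barrier,
    // then BATCH stores. Raising BATCH pushes the per-thread register count up.
    __global__ void reg_pressure(const float* __restrict__ in, float* __restrict__ out) {
        float v[BATCH];
        int base = (blockIdx.x * blockDim.x + threadIdx.x) * BATCH;
    #pragma unroll
        for (int i = 0; i < BATCH; ++i)
            v[i] = in[base + i];          // all BATCH values stay live here
        __syncthreads();                  // keeps the compiler from folding each load straight into its store
    #pragma unroll
        for (int i = 0; i < BATCH; ++i)
            out[base + i] = v[i];
    }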
     
    #1165 pixelio, Feb 19, 2014
    Last edited by a moderator: Nov 10, 2014
  6. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Do you have a number for this? Is the pipeline really shorter or did they implement result forwarding to cut the register file access from the instruction latency?
     
  7. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    I thought about one of the specific tests some time ago (pre GCN) to find an explanation for the weird (and clearly wrong) numbers. My conclusion was that at least in one of the tests (I think it was local/shared memory latency) Sandra was basically spitting out a convoluted measure for the clause scheduling latency. So yes, sometimes Sandra appears to be a test of the scheduling behaviour (and the architectural knowledge of the Sandra guys ;)), not memory latency.
     
  8. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Wow, no wonder nv managed to double perf/W on the same node.
     
  9. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    On a GPU, the stack variables are in registers.

    L1 is for spilled registers and other data.
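    A quick way to see this (my own sketch, assuming a Kepler/Maxwell-class part): cap the register budget with __launch_bounds__ and compile with nvcc -Xptxas -v, and ptxas reports the spill stores/loads and local memory that back the "stack" variables which no longer fit in registers.

    // 8 resident blocks of 256 threads caps the allocation at ~32 registers per thread
    // (64K registers per SM / 2048 threads), which forces the array below to spill.
    __global__ void __launch_bounds__(256, 8)
    spill_demo(const float* in, float* out) {
        float tmp[64];                           // 64 live values can't fit in ~32 registers
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        for (int i = 0; i < 64; ++i)
            tmp[i] = in[idx + i * stride];
        float acc = 0.f;
        for (int i = 0; i < 64; ++i)
            acc += tmp[i] * tmp[63 - i];         // use every element so the array can't be dropped
        out[idx] = acc;
    }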
     
  10. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    There is a bug in the RC version of the ptx assembler. The limit is indeed 48 KB of shared memory per block. This will be corrected for the official release.
     
  11. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    946
    Likes Received:
    46
    Location:
    LA, California
    Sorry - my ignorance of GPU programming is showing. I can see how you'd want to keep as much stuff as possible in regs. But I don't really understand where you would spill registers to, if not to something that is essentially stack like. Explanations appreciated :)
     
  12. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,875
    Likes Received:
    2,183
    Location:
    Germany
    Which? The limit or the documentation?
     
  13. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,429
    Likes Received:
    181
    Location:
    Chania
    Then you obviously didn't understand what you just quoted. If GM204 is also going to have slightly more FP64 SPs compared to GK104 (just as GM107 does vs. GK107), GM204 might end up with MORE SMMs than I would have expected.
     
  14. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177

    GK110 fits 30720 threads on chip at once. The runtime allocates stack space for each of these threads to hold spilling, but only a small amount for each thread, since there are so many of them. The other wrinkle is that stacks are not contiguous - all the stacks are interleaved so that when a SIMD vector does a load from the same address relative to each lane's stack pointer, the load is vectorized. This is why they talk about "local memory", since it's allocated and accessed differently than normal "global memory".
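    A rough back-of-the-envelope check of those numbers (my sketch; the interleaved layout itself is handled by the hardware and driver and isn't visible here): the per-thread stack size can be queried with cudaDeviceGetLimit, and multiplying it by the 15 SMX x 2048 = 30720 resident threads gives the order of magnitude of local memory the runtime sets aside for stacks.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        size_t stack_bytes = 0;
        cudaDeviceGetLimit(&stack_bytes, cudaLimitStackSize);    // per-thread stack reservation
        const size_t resident_threads = 15 * 2048;               // GK110: 15 SMX * 2048 threads each
        printf("Per-thread stack : %zu bytes\n", stack_bytes);
        printf("Stack backing    : ~%zu KB for %zu resident threads\n",
               stack_bytes * resident_threads / 1024, resident_threads);
        return 0;
    }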
     
  15. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    On Maxwell sm50 profiles, blocks are HW-limited to 48 KB of shared memory (each). This is the same as on Kepler and Fermi. ptxas was misreporting the capability as 64 KB per block. This is incorrect. The real HW maximum size is 48 KB of shared memory per block.

    SMM has a total shared memory capacity of 64 KB. So it can run one block that uses 48 KB, or two blocks each using 32 KB of shared memory at the same time.

    Hope this clarifies things.
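    In code, that limit looks like this (a hypothetical illustration, not from the post above): a single block can request up to 48 KB of dynamic shared memory, and asking for 64 KB makes the launch fail, even though the SMM could keep two 32 KB blocks resident at once.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void use_smem(float* out) {
        extern __shared__ float buf[];           // dynamic shared memory, sized at launch
        buf[threadIdx.x] = threadIdx.x;
        __syncthreads();
        out[threadIdx.x] = buf[threadIdx.x];
    }

    int main() {
        float* d_out;
        cudaMalloc(&d_out, 256 * sizeof(float));
        use_smem<<<1, 256, 48 * 1024>>>(d_out);  // OK: exactly the 48 KB per-block limit
        printf("48 KB: %s\n", cudaGetErrorString(cudaGetLastError()));
        use_smem<<<1, 256, 64 * 1024>>>(d_out);  // fails: above the per-block HW limit on sm_50
        printf("64 KB: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(d_out);
        return 0;
    }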
     
  16. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,875
    Likes Received:
    2,183
    Location:
    Germany

    Absolutely. Thanks Bob!
     
  17. Osamar

    Newcomer

    Joined:
    Sep 19, 2006
    Messages:
    200
    Likes Received:
    14
    Location:
    40,00ºN - 00,00ºE
  18. Cookie Monster

    Newcomer

    Joined:
    Sep 12, 2008
    Messages:
    167
    Likes Received:
    8
    Location:
    Down Under
  19. UniversalTruth

    Veteran

    Joined:
    Sep 5, 2010
    Messages:
    1,747
    Likes Received:
    22
    A $150 video card is anything but low-end; enthusiasts who either don't want to spend insane amounts of money or simply don't have it buy cards in exactly this price segment and play everything at close to maximum settings!

    Keep in mind also that just one process node (40 nm) generation ago, this was high-end performance!
     
  20. xDxD

    Regular

    Joined:
    Jun 7, 2010
    Messages:
    412
    Likes Received:
    1

    at a $500 MSRP...
     