NVIDIA Maxwell Speculation Thread

rpg.314 · Feb 13, 2014

Arun said:
Since they support 48KB Shared Memory + 16KB L1 per 192 ALU SMX on Kepler, each 32 ALU SM will need to have at least 48KB Shared Memory for backwards compatibility. That's a LOT more shared memory (and associated bandwidth) than on Kepler!

Awesome.

I thought Kepler was a step backwards. They increased the ALUs per core but didn't increase the shared memory/L1 at all. All in all, it's great that GPUs are finally getting more cache.
KNL is supposed to be brutal with its caching, see RWT.

psurge · Feb 13, 2014

dnavas said:
I'm also intrigued by "the number of instructions per clock cycle has been increased" because "holy hot clock cycle reincarnation" and "wait, this is like Fermi++, wth was Kepler then"....

Revisiting the hot clock thing seems unlikely for power reasons. Maybe better IPC, due to e.g. smaller SIMD size? The diagram also makes it look like the warp scheduler to ALU ratio went up.

I'm wondering if the L2 on that diagram should really read L3.

constant · Feb 13, 2014

Yes, it looks like 32 SP:s per warp scheduler (like good old GF100).

This signals that they're improving the on-chip memory to SP ratio. Hopefully they will also increase the number of shared memory banks from 32 to 64; this would require an increase in the number of load/store units.

All in all, it looks much more balanced than Kepler. It will be interesting to see if they can also increase the L1/SMEM scratchpad from 64 KB to 128 KB.

Also, why not increase the register file from 256 KB to 512 KB? Registers don't require as much die area as cache does, right?

More registers would enable more threads in flight etc. => better latency hiding => better utilization of bandwidth => higher throughput.
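
A rough sketch of the register argument above, in Python. It assumes Kepler-like limits (a register file of 32-bit registers and a cap of 2048 resident threads per SM) and ignores allocation granularity, so the numbers are illustrative only. `max_resident_threads` is a made-up helper name, not anything from CUDA.

```python
def max_resident_threads(regfile_bytes, regs_per_thread,
                         thread_cap=2048, warp_size=32):
    """Threads that fit in the register file, rounded down to whole warps.

    Assumptions (not from the thread): 4-byte registers, a fixed
    per-SM thread cap, and no register allocation granularity.
    """
    total_regs = regfile_bytes // 4              # number of 32-bit registers
    limit_by_regs = total_regs // regs_per_thread
    warps = min(limit_by_regs, thread_cap) // warp_size
    return warps * warp_size


if __name__ == "__main__":
    for regs in (32, 64, 128):
        small = max_resident_threads(256 * 1024, regs)
        large = max_resident_threads(512 * 1024, regs)
        print(f"{regs:3d} regs/thread: 256 KB -> {small:4d} threads, "
              f"512 KB -> {large:4d} threads")
```

Under these assumptions, a bigger register file only adds threads in flight for kernels that use more than 32 registers per thread; lighter kernels already hit the thread cap.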

OK, this post turned out to be more like my wishlist:

128 KB L1/smem
64+ LD/ST units
512 KB registers

Exciting times!

UniversalTruth · Feb 13, 2014

tviceman said:
NEVERMIND

GTX 480, GTX 570 and GTX 660 should have practically identical performance. What are you arguing? I have checked through three TechPowerUp reviews to reach this conclusion.

Novum · Feb 13, 2014

I think that the SMM itself is not the compute unit (in OpenCL speak) anymore, but the four sub-elements are. I speculated before that they are going to break up the SMX equals CU design.

In my opinion it's very similar to AMD CUs, with the difference that it's 32 ALUs instead of 64 ALUs (because of Thread-Group-Size), and therefore the Quad-TMU needs to be shared by two CUs, because 32 ALUs with a Quad-TMU would be too low an ALU:TMU ratio. The question is whether it's 1xSIMD32, 2xSIMD16 or 4xSIMD8. The last option would give them scheduling similar to an AMD CU, with round-robin scheduling from one Warp-Scheduler and 4 clocks of latency between instructions to use for store forwarding.

constant · Feb 13, 2014

Novum said:
I think that the SMM itself is not the compute unit (in OpenCL speak) anymore, but the four sub-elements are. I speculated before that they are going to break up the SMX equals CU design.

In my opinion it's very similar to AMD CUs, with the difference that it's 32 ALUs instead of 64 ALUs (because of Thread-Group-Size), and therefore the Quad-TMU needs to be shared by two CUs, because 32 ALUs with a Quad-TMU would be too low an ALU:TMU ratio. The question is whether it's 1xSIMD32, 2xSIMD16 or 4xSIMD8. The last option would give them scheduling similar to an AMD CU, with round-robin scheduling from one Warp-Scheduler and 4 clocks of latency between instructions to use for store forwarding.

GT200 --> 8 SP:s executed one warp in 4 clocks ( 8 SP:s / SM - one scheduler )
Fermi --> 16 SP:s executed one warp in 2 clocks ( 32 SP:s / SM - two schedulers* ) **
Kepler --> 32 SP:s executed one warp in 1 clock (192 SP:s / SM - 4 schedulers*)

Now for Maxwell I can spot 4 schedulers as well.

*Capable of doing dual-issue
** Fermi had an exception with the GF114 that had 48 SP:s / SM (think GTX460), Kepler is similar to this.
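
The clocks-per-warp figures in the list above follow from dividing the 32-thread warp by the number of lanes that execute it. A minimal sketch, assuming a warp is simply issued over consecutive clocks on a fixed-width SIMD group (`clocks_per_warp` is an illustrative name):

```python
WARP_SIZE = 32  # threads per warp on all the architectures listed


def clocks_per_warp(simd_lanes):
    """Clocks needed to push one warp through a SIMD group of this width."""
    return -(-WARP_SIZE // simd_lanes)   # ceiling division


if __name__ == "__main__":
    # Lane counts are the per-warp SP counts quoted in the post above.
    for name, lanes in (("GT200", 8), ("Fermi", 16), ("Kepler", 32)):
        print(f"{name}: {lanes} SP:s -> {clocks_per_warp(lanes)} clock(s) per warp")
```

The same function covers the 1xSIMD32, 2xSIMD16 and 4xSIMD8 options discussed earlier: an 8-wide group would again take 4 clocks per warp.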

mczak · Feb 13, 2014

psurge said:
Revisiting the hot clock thing seems unlikely for power reasons. Maybe better IPC, due to e.g. smaller SIMD size? The diagram also makes it look like the warp scheduler to ALU ratio went up.

Yes I read that as "higher effective throughput compared to theoretical throughput" (so better utilization really) as well.

constant said:
** Fermi had an exception with the GF114 that had 48 SP:s / SM (think GTX460), Kepler is similar to this.

I wouldn't really call this an exception, rather GF100 (and the unbroken version, GF110) was the exception, since all other chips were like that, not just gf114.

Novum · Feb 13, 2014

constant said:
GT200 --> 8 SP:s executed one warp in 4 clocks ( 8 SP:s / SM - one scheduler )
Fermi --> 16 SP:s executed one warp in 2 clocks ( 32 SP:s / SM - two schedulers* ) **
Kepler --> 32 SP:s executed one warp in 1 clock (192 SP:s / SM - 4 schedulers*)

Now for Maxwell I can spot 4 schedulers as well.

It's not 4 global schedulers per SMM. It's one Scheduler per Sub-CU. Like I said most likely the SMM is not the CU itself anymore.

silent_guy · Feb 13, 2014

constant said:
Registers don't require as much die area as cache does, right?

You need more bandwidth to access registers than cache because you need to read up to 3 operands and write 1 result at the same time. So I'm afraid it's going to be just the opposite.
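
To put a number on that, here is a back-of-the-envelope sketch. It assumes a 32-lane group issuing one three-operand instruction (an FMA, say) per clock on 4-byte values; the function name and defaults are illustrative, not taken from any NVIDIA document.

```python
def regfile_bytes_per_clock(lanes=32, reads=3, writes=1, operand_bytes=4):
    """Peak register traffic for one instruction issued across all lanes."""
    return lanes * operand_bytes * (reads + writes)


if __name__ == "__main__":
    total = regfile_bytes_per_clock()
    print(f"32 lanes, 3 reads + 1 write: {total} bytes per clock")
```

The register file has to sustain that rate for every group that issues in a clock, which is why its ports cost area.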

constant · Feb 13, 2014

Novum said:
It's not 4 global schedulers per SMM. It's one Scheduler per Sub-CU. Like I said most likely the SMM is not the CU itself anymore.

Using your terminology: one per CU and 4 CU:s per SMM (btw you haven't clearly defined what you mean by a CU yet).

Novum · Feb 13, 2014

Yeah I did. What OpenCL defines as compute unit.

tviceman · Feb 13, 2014

UniversalTruth said:
GTX 480, GTX 570 and GTX 660 should have practically identical performance. What are you arguing? I have checked through three techpowerup reviews to reach this conclusion

That's why I said nevermind. I was looking at GTX 660 Ti scores, not GTX 660.

fellix · Feb 13, 2014

Any idea how the LDS/L1 will fit along this new SMM subdivision?
Probably shared among all four sub-SMs, or local, and then there would be a second-level shared cache for each SMM, followed by the big 2MB "L3".

dnavas · Feb 13, 2014

psurge said:
Revisiting the hot clock thing seems unlikely for power reasons. Maybe better IPC, due to e.g. smaller SIMD size? The diagram also makes it look like the warp scheduler to ALU ratio went up.

Yeah, the hot-clock thing was a joke. I know sometimes it's hard to tell whether I'm being dumb on purpose or from ignorance :> Warning: I'm about to be dumb from ignorance....

Yes, there appears to be a dispatcher per 16 ALUs, and a scheduler per 32. Before, we had 8 dispatchers and 4 schedulers across 192 ALUs (which makes for 24 and 48). Those same dispatchers also fed DP work, and we have yet to see the DP capabilities on here.
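
The ratios above as a small calculation. The Kepler counts are the ones given in the post; the Maxwell counts assume one 32-ALU block with one scheduler and two dispatchers, as read off the leaked diagram, so they are speculative.

```python
def alus_per_unit(alus, dispatchers, schedulers):
    """Return (ALUs per dispatcher, ALUs per scheduler)."""
    return alus / dispatchers, alus / schedulers


if __name__ == "__main__":
    layouts = {
        "Kepler SMX": (192, 8, 4),           # counts quoted in the post
        "Maxwell block (diagram)": (32, 2, 1),  # assumption from the diagram
    }
    for name, (alus, disp, sched) in layouts.items():
        per_disp, per_sched = alus_per_unit(alus, disp, sched)
        print(f"{name}: {per_disp:.0f} ALUs/dispatcher, {per_sched:.0f} ALUs/scheduler")
```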

Are the dispatchers doing less work? Is batch size decreasing? I would have thought that such decreases would be reflected in CUDA, and I thought those updates were for in-pipeline job spawning (raison d'être of Denver), but maybe there are other changes as well? Maybe the dispatchers are not doing less, and some instructions can be dual-issued -- maybe int & fp, maybe mul & add -- mul & add doesn't seem particularly crazy if you assume the bandwidth for an additional arg, but it isn't clear to me that there'd be a lot of gain. Another crazy note -- the 2011 talk that was mostly about power use also mentioned work being done on VLIW architectures....

zorg · Feb 13, 2014

fellix said:
Any idea how the LDS/L1 will fit along this new SMM subdivision?

I heard that the four subdivisions will share one 64KB LDS, and one L1 data/texture cache plus the texturing block. The LDS bandwidth is 64 bytes/cycle. Each subdivision has a 64KB register file and 8 LD/ST units.

But I don't have the cards, I just speak with my Chinese friend.

It won't be a compute monster if true. Maybe even worse than Kepler.

tviceman · Feb 13, 2014

zorg said:
I heard that the four subdivisions will share one 64KB LDS, and one L1 data/texture cache plus the texturing block. The LDS bandwidth is 64 bytes/cycle. Each subdivision has a 64KB register file and 8 LD/ST units.

But I don't have the cards, I just speak with my Chinese friend.

It won't be a compute monster if true. Maybe even worse than Kepler.

That just goes to show Nvidia is further bifurcating their GPUs from the flagship die. I, for one, think it's a good move from a business / die size / profit margin perspective. Let the big die be big and do what it does best, let the other dies continue to focus on the best possible bang for buck per mm^2 in graphics.

zorg · Feb 13, 2014

tviceman said:
That just goes to show Nvidia is further bifurcating their GPUs from the flagship die. I, for one, think it's a good move from a business / die size / profit margin perspective. Let the big die be big and do what it does best, let the other dies continue to focus on the best possible bang for buck per mm^2 in graphics.

If the rumors are accurate. Don't get me wrong, it seems legit, but who knows.
He also said that the API functionality will be the same as Kepler. It's hard to believe that they still don't support 64 UAVs ...

DSC · Feb 13, 2014

Fermi and Kepler support UAVs. MS is too boneheaded in their D3D11.1 design.

http://nvidia.custhelp.com/app/answers/detail/a_id/3196/~/fermi-and-kepler-directx-api-support

Fermi and Kepler GPUs do not support two of these features:

UAVOnlyRenderingForcedSampleCount supports 16x raster coverage sampling
TIR - aliased color-only rendering with up to 16x raster coverage sampling

These two features are intended as path rendering acceleration aids for Direct2D, used optionally if the hardware supports feature level 11_1. We felt that for Fermi and Kepler, it was more important to maximize our investment in work that is more important to 3D graphics and therefore chose not to implement support for those two features.

UAVs in the vertex, geometry and tessellation shaders

is supported by Fermi and Kepler GPUs however because this is only exposed through the hardware feature level 11_1, as a group of three features, we currently do not support it via the DX11.1 interfaces. We may expose support for the UAVs in the vertex, geometry and tessellation shaders feature on an app-specific basis in the future.

mczak · Feb 13, 2014

zorg said:
It won't be a compute monster if true. Maybe even worse than Kepler.

Why would that be worse? That would mean LDS is still shared per SMX/SMM, and there's still a (slight) net increase of LDS size per ALU (because the SMM is smaller), as well as of LD/ST units. Or, said differently, all in all there's the same amount of LDS and LD/ST units, but fewer ALUs (and TMUs) per SMM, and of course more of those SMMs. I don't know off-hand what LDS bandwidth Kepler had, but otherwise I don't see anything which would make compute worse.
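
A quick per-ALU comparison of the figures being discussed. The SMM numbers are the rumored ones from this thread (four 32-ALU subdivisions, one 64KB LDS, 8 LD/ST units per subdivision). The Kepler numbers assume the 48KB shared memory configuration quoted at the top of the thread and 32 LD/ST units per SMX, which is an assumption here. `per_alu` is just an illustrative helper.

```python
def per_alu(amount, alus):
    """Amount of a shared resource available per ALU."""
    return amount / alus


if __name__ == "__main__":
    designs = {
        # name: (ALUs, LDS bytes, LD/ST units)
        "Kepler SMX": (192, 48 * 1024, 32),             # 32 LD/ST is an assumption
        "Maxwell SMM (rumored)": (4 * 32, 64 * 1024, 4 * 8),
    }
    for name, (alus, lds_bytes, ldst) in designs.items():
        print(f"{name}: {per_alu(lds_bytes, alus):.0f} B LDS per ALU, "
              f"{per_alu(ldst, alus):.3f} LD/ST units per ALU")
```

If the rumor holds, both ratios go up per ALU, which is the point being made above; it says nothing about LDS bandwidth, which is left out because the Kepler figure isn't given in the thread.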

Novum · Feb 13, 2014

fellix said:
Any idea how the LDS/L1 will fit along this new SMM subdivision?
Probably shared among all four sub-SMs, or local, and then there would be a second-level shared cache for each SMM, followed by the big 2MB "L3".

If the sub-elements are the CUs, then every one of them must have its own LDS (required by the programming models). But the L1 could still be shared.
