Nvidia Pascal Announcement

Discussion in 'Architecture and Products' started by huebie, Apr 5, 2016.

  1. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    988
    Likes Received:
    280
    Because price is determined by performance, not by the size of the die.

    If Intel followed the logic that die size determines selling price, they would be selling processors from new nodes at much lower prices than the previous node, even though the new ones outperform the older ones. Clearly they are not doing that.

    I'm not sure why people expect Nvidia to do something that nobody else in the industry does.
     
  2. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    Yep, it's ~2x the cost per transistor.
     
  3. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    Well, going by the diagrams of the dynamic load balancing, it seems both graphics and compute kernels can run on an individual SMM, and the ALUs that finish can then switch over to another kernel when they are free (or is this done via preemption? It was kind of vague). There seems to be no time delay, though, so I'm assuming there is no need for preemption.
     
    pharma likes this.
  4. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    All I see on page 32 is concurrent execution of multiple grids of the same kernel on the TK1 in the final batch of stages, and both grids even had the same dimensions(!). I don't see overlap of different kernels, apart from a small measurement error most likely originating from monitoring the completion signals on the CPU long after the GPU has internally dispatched the next grid.

    If there were any parallel execution on the same SM, it should have shown up in the early stages as well, where only different kernels were involved, and not just in the final stages. Only the K40 shows a measurable overlap between different kernels, but it also has more than one SMX, so that was expected.

    On the TK1, using multiple streams really just cut the overhead of switching between kernels, to the point where the order of the signals on the CPU appears inverted, but not to the point where we could speak of concurrent execution.

    So far, that looks more like a case in point that different kernels cannot run in parallel; but the workload wasn't orthogonal as requested, so we can't rule out that the concurrency was merely prevented by the whole SMX coincidentally being saturated for the entire duration of one grid.
    Yes, from compute to graphics at instruction granularity, from graphics to compute at vertex/pixel granularity. I think it was hinted, though, that the driver needs to trigger the rescheduling; it's not entirely automated. And I don't know if this actually involves preemption, or if it is just triggered when the SMM runs idle, to avoid the overhead involved with preemption.
     
    nnunn likes this.
  5. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Well, if it's using preemption, there should be some latency incurred; we shall see when tests come out. At least more latency than the driver forcing a reschedule, I presume.
     
  6. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Yeah, I thought Dynamic Parallelism (and that was an example they provided somewhere) changed between Fermi, then Kepler, and now Pascal.
    I thought Maxwell was the same as Kepler *shrug*.
    Ah well, I wonder what changed in it then.
    Cheers
     
  7. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    711
    Likes Received:
    282
    GPU performance has increased exponentially over the last 15 years; prices have not.
    The same could be said of ARM CPUs.

    It's undeniable that over the last couple of generations the introduction price of GPUs has gone up by $50+.
    A healthy competitive ecosystem keeps prices low.
    We can only hope AMD comes up with something great again.
    (Speaking, of course, as a customer, not a stakeholder.)
     
    #927 Voxilla, May 16, 2016
    Last edited: May 16, 2016
  8. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,935
    Likes Received:
    1,629
    Hopefully we will know more once some detailed reviews start to come in, though not sure many will delve into this aspect.
     
  9. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    The leaked slide on that mentioned a "sub 100usec latency". So probably still potentially something like 100,000 clock cycles. Nothing I would call blazing fast. Perfectly fine for a single asynchronous time warp kernel call per frame, but a hundred context switches may add up to something significant. But who knows how this will work in practice. That needs to be tested.
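    A quick back-of-envelope sketch in Python puts those numbers in context (the ~1 GHz core clock and the 60 Hz frame budget are illustrative assumptions, not figures from the slide):

```python
# Rough arithmetic behind the "sub 100usec" figure (illustrative numbers only).
switch_latency_s = 100e-6   # leaked slide: sub-100-microsecond context switch
gpu_clock_hz = 1e9          # assumed ~1 GHz core clock (not from the slide)
frame_budget_s = 1 / 60     # assumed 60 Hz frame: ~16.7 ms

cycles_per_switch = switch_latency_s * gpu_clock_hz
print(f"cycles per switch: {cycles_per_switch:,.0f}")  # ~100,000 cycles

# One asynchronous timewarp switch per frame is negligible...
print(f"1 switch:     {switch_latency_s / frame_budget_s:.1%} of frame")
# ...but a hundred switches eats a large slice of the frame budget.
print(f"100 switches: {100 * switch_latency_s / frame_budget_s:.1%} of frame")
```

    So under these assumptions a single switch is well under 1% of a frame, while a hundred of them consumes more than half of it.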
     
    #929 Gipsel, May 16, 2016
    Last edited: May 16, 2016
    nnunn and Razor1 like this.
  10. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    Makespan drops from 84 ms to 78.9 ms using concurrent kernels.

    The kernels at the end of the job are small enough to cohabitate the single SMX.

    Concurrency doesn't mean preemption... hopefully we are discussing the same thing.
     
  11. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    The grids at the end of the job are apparently also identical in grid dimensions and in the kernel invoked; only the passed parameters differ, at least if I didn't mistake the context. This is still in line with the assumption that different kernels cannot simultaneously cohabitate the same SMX. At no point was it stated that they would have to belong to the same grid.
    But preemption is a possible method of providing concurrency. The question is whether we can have simultaneous, and not merely concurrent, execution.
     
  12. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,172
    Location:
    La-la land
    Except I didn't dispute that.

    In fact, it's pretty much implied by my statement. :p

    Didn't dispute that either. Although as a customer I wouldn't mind less moneygrubbing from NV, but lacking any true competition there's no pressure on them not to gouge the hell out of their users, just like Intel is doing with their high-end chips. $1.5k for an 8-core Socket 2011 CPU, whee... *golfclap*
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,128
    Likes Received:
    2,887
    Location:
    Well within 3d
    The option of allowing a wavefront a fixed period of time to complete before a context switch is invoked was proposed in some of AMD's descriptions of preemptible compute. That seems in line with other QoS measures that try to provide predictable or deterministic latencies rather than the absolute minimum latency.

    Some of that latency could come from that kind of grace period, since that's plenty of time to write out context, or it might give an idea of the latency involved in determining at the driver level that something must be preempted and signalling it. That might be fast for an external request, given the queuing delays that can be experienced in other scenarios.

    That might not be used in the allocation method, similar to how AMD's priority queue functionality (edit for clarity: quick response queue) has a minimum GCN version that cannot preempt graphics.

    The dynamic resource allocation slide, if it accurately describes prior Nvidia GPUs in the static scenario, might indicate they lacked the ability to rapidly check SM availability for compute issue.
    That ability would seemingly fall out of the implementation of preemption functionality (what's the point of preempting if you don't know when a switch is ready?), even if the allocation method itself does not use a preemption request. There does not seem to be any actual change in the area devoted to the graphics portion, so it's not getting cut off.
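    The grace-period idea above can be sketched as: give the running wavefront a fixed window to retire on its own, and pay the full context-save cost only if it overruns. A toy model in Python (all names and numbers are hypothetical, not vendor figures):

```python
def switch_cost(remaining_us, grace_us=30.0, save_cost_us=70.0):
    """Toy model of grace-period preemption (hypothetical numbers).

    remaining_us: time the running wavefront still needs to finish.
    If it retires within the grace window, the switch costs only that wait;
    otherwise we wait out the window and pay the full context-save cost.
    """
    if remaining_us <= grace_us:
        return remaining_us            # wave drains voluntarily, no state saved
    return grace_us + save_cost_us     # forced context save after the window

print(switch_cost(5.0))    # short wave: 5.0 us, cheapest case
print(switch_cost(500.0))  # long wave: 100.0 us, the worst case a QoS bound must budget
```

    The worst case in this toy (window plus save cost) is the deterministic bound a QoS guarantee would have to quote, which is consistent with quoting a latency ceiling rather than a typical latency.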
     
    #933 3dilettante, May 16, 2016
    Last edited: May 16, 2016
    nnunn likes this.
  14. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    398
    Price is determined by how much it costs to make the product versus how much consumers are willing to pay for it. If sales of the 1080 drop because it's too expensive, then Nvidia will have to drop the price. Something we saw (and suffered) with Intel: selling three-year-old CPUs for more than when they came out. A complete insult in my opinion, but as long as people keep giving them 5 stars on Amazon they are going to keep selling, and the prices (even for used ones...) won't drop a single dollar.
     
  15. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    No, they're not. They wouldn't have shorter runtimes if they were the same size grid.

    Different kernels can run concurrently on the same SM if there are enough resources.

    Here's an example: https://gist.github.com/allanmac/049837785a10b7999fce6ca282f62dc6

    Kernel "a" and "b" are similar and both require half of the Maxwell v1's shared memory.

    Thus the expectation is that only a pair of kernels can fit into an SMM.

    Two timelines are shown running on a 3 SMM Quadro K620:
    1. Launch 6 kernels on 6 streams. The first kernel is "b" the rest are "a". All run concurrently.
    2. Launch 7 kernels on 7 streams. The first and 7th are "b". The rest are "a". Six run concurrently and the seventh runs by its lonesome.
    This is reliable and routine behavior that CUDA devs depend on.

    Any questions?

    ck --device=1 --nkernels=6
    (timeline image: all six kernels run concurrently)

    ck --device=1 --nkernels=7
    (timeline image: six kernels overlap, the seventh runs alone)
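    The resource-fit behavior described above can be modeled with a toy first-fit allocator in Python (3 SMMs with room for two half-shared-memory kernels each, matching the Quadro K620 setup described; the allocator itself is a guess for illustration, not NVIDIA's actual policy):

```python
def schedule(n_kernels, n_sm=3, slots_per_sm=2):
    """First-fit toy: each kernel needs half an SMM's shared memory,
    so each SMM can hold at most two resident kernels at a time."""
    free = [slots_per_sm] * n_sm
    placed, waiting = [], []
    for k in range(n_kernels):
        for sm in range(n_sm):
            if free[sm] > 0:
                free[sm] -= 1
                placed.append((k, sm))   # kernel k becomes resident on SMM sm
                break
        else:
            waiting.append(k)            # no SMM has room: runs after a drain

    return placed, waiting

placed, waiting = schedule(6)
print(len(placed), waiting)   # 6 []  -> all six run concurrently
placed, waiting = schedule(7)
print(len(placed), waiting)   # 6 [6] -> the seventh runs by its lonesome
```

    The toy reproduces both timelines: six kernels fill the 3x2 slots exactly, and a seventh has nowhere to go until something retires.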
     
    elect, nnunn, CarstenS and 3 others like this.
  16. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    No, thank you very much for proving it properly and providing a reproducible test.
    The second run with two "b" kernels doesn't allow any conclusion, as they might have been matched, but the first one with the 5:1 ratio does.
     
  17. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    If you run the example for 6, 12, 18, 24, 30, etc. times you'll see many different concurrent arrangements of "a" and "b" kernels.

    Translation: the scheduling rule is probably somewhat dynamic, a mix of first-fit and round-robin. I wouldn't be surprised if there were a heuristic hiding in there to drain more of the same kernel type before launching a different kernel on the same SM (for many, many reasons).

    For 600 concurrent kernels it looks like this (purple is launch mod 6 = 0):

    (timeline image: the varying concurrent arrangements of the 600 kernels)
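    For contrast, a round-robin placement (again only a guess at the hardware's heuristic, for illustration) spreads the same six kernels across the SMMs in a different order than plain first-fit, which fills each SMM before moving on:

```python
def round_robin(n_kernels, n_sm=3, slots_per_sm=2):
    """Round-robin toy placement: rotate across SMMs instead of filling
    each one before moving on (pure speculation about the real heuristic)."""
    free = [slots_per_sm] * n_sm
    placed = []
    sm = 0
    for k in range(n_kernels):
        tried = 0
        # Skip over full SMMs, giving up after one full rotation.
        while free[sm] == 0 and tried < n_sm:
            sm = (sm + 1) % n_sm
            tried += 1
        if free[sm] > 0:
            free[sm] -= 1
            placed.append((k, sm))   # kernel k lands on SMM sm
            sm = (sm + 1) % n_sm     # next kernel starts at the next SMM
    return placed

print(round_robin(6))   # kernels 0..5 land on SMMs 0,1,2,0,1,2
```

    Either policy fills all six slots, so the occupancy result is the same; only the arrangement of kernels across SMMs differs, which would show up as the varied patterns in the timelines.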
     
    elect, nnunn, CSI PC and 3 others like this.
  18. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    988
    Likes Received:
    280
    What time tomorrow (May 17, 2016) does the GTX 1080 NDA expire?
     
  19. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Would be interesting if someone could run this on a Fermi GPU, to see if there is a difference between the architectures.
    Cheers
     
  20. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,490
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    PCPer will host a live stream with NV's Tom Petersen at 10:00am PST (1:00pm EST), so I guess it'll be by that time -- http://www.pcper.com/live/
     
    #940 fellix, May 16, 2016
    Last edited: May 16, 2016
    pharma and pixelio like this.