Nvidia Pascal Announcement

Only slightly larger than good ole GK104, but at a much higher sales price. The GeForce 770 was upper midrange or lower high-end, however you want to look at it, and it certainly did not retail for 600-700 USD. :p

Because price is determined by performance, not the size of the die.

If Intel used the logic that die size determines selling price, they would be selling processors from new nodes at much lower prices than the previous node, even though the new ones outperform the older ones, which they clearly are not doing.

I'm not sure why people expect Nvidia to do something that everyone else in the industry doesn't do.
 
This didn't change. It can now switch between the two modes without halting the GPU, and possibly without waiting for the SM to run dry, thanks to preemption.

How? It's not like you could control the SM selection, nor does the API permit any pattern in which actual concurrent execution of multiple kernels would be required. So unless someone manages to achieve a measurable speedup by concurrently running two kernels with orthogonal per-SM resource constraints (e.g. two kernels which intentionally exhaust the RF and SMEM limits, respectively), I still can't consider this confirmed.
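
Something along these lines is the kind of test I mean. Purely a sketch with made-up kernel bodies, block counts and iteration counts: kernel_a is limited by shared memory, kernel_b by thread slots, so their per-SM constraints are mostly orthogonal. If the two-stream run finishes in roughly the time of the longer kernel alone, the SMs were really shared; if it takes about the sum of both, they ran back to back.

Code:
#include <cstdio>

__global__ void kernel_a(float *out, int iters)          // SMEM-limited
{
    extern __shared__ float buf[];                       // ~40 KB requested at launch

    // Touch the buffer so the allocation is not purely nominal.
    for (int i = threadIdx.x; i < 40 * 1024 / 4; i += blockDim.x)
        buf[i] = (float)i;
    __syncthreads();

    float acc = 0.0f;
    for (int i = 0; i < iters; ++i)
        acc += buf[(threadIdx.x + i) % (40 * 1024 / 4)];
    if (threadIdx.x == 0)
        out[blockIdx.x] = acc;
}

__global__ void kernel_b(float *out, int iters)          // thread-slot-limited
{
    float acc = (float)threadIdx.x;
    for (int i = 0; i < iters; ++i)
        acc = acc * 1.000001f + 0.5f;                    // just burn ALU time
    if (threadIdx.x == 0)
        out[blockIdx.x] = acc;
}

int main()
{
    float *out;
    cudaMalloc(&out, 1024 * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iters = 1 << 20;
    const int smem  = 40 * 1024;
    float ms;

    // Back to back in a single stream: no chance of overlap.
    cudaEventRecord(start);
    kernel_a<<<16,  128, smem, s0>>>(out, iters);
    kernel_b<<<16, 1024,    0, s0>>>(out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("one stream : %8.3f ms\n", ms);

    // Same work on two streams: only noticeably faster if the SMs are shared.
    cudaEventRecord(start);
    kernel_a<<<16,  128, smem, s0>>>(out, iters);
    kernel_b<<<16, 1024,    0, s1>>>(out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("two streams: %8.3f ms\n", ms);

    return 0;
}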


Well, with the diagrams of the dynamic load balancing, it seems both graphics and compute kernels can be run on an individual SMM, but the ALUs that finish can then switch over to another kernel when they are free (or is this done by preemption? It was kind of vague). There seems to be no time delay though, so I'm assuming there is no need for preemption.
 
Below is a presentation from GTC 2015. See pages 29, 30 and 32. The K1 has only one multiprocessor, so any execution overlap of independent kernel grids is happening within the same multiprocessor.
All I see on page 32 is concurrent execution of multiple grids of the same kernel on the TK1 in the final batch of stages, and both grids even had the same dimensions(!) - but I don't see overlap of different kernels, apart from the small measurement error most likely originating from monitoring the completion signals on the CPU long after the GPU has dispatched the next grid internally.

If there was any parallel execution on the same SM, it should have happened in the early stages as well, as different kernels were only involved then, and not only in the final stages. Only the K40 exposes a measurable overlap between the different kernels, but it also has more than one SMX, so this was expected.

On the TK1, using multiple streams really just cut the overhead for switching between kernels to the point where the order of the signals on the CPU appears inverted, but not to the point where we could speak about concurrent execution.

So far, that looks more like a case in point that different kernels cannot run in parallel, but the workload wasn't orthogonal as requested, so we can't eliminate the possibility that the concurrency was just prevented by coincidentally saturating the whole SMX for the whole duration of one grid.
Well, with the diagrams of the dynamic load balancing, it seems both graphics and compute kernels can be run on an individual SMM, but the ALUs that finish can then switch over to another kernel when they are free (by the use of preemption?)
Yes, from compute to graphics at instruction granularity, from graphics to compute at vertex/pixel granularity. I think it was hinted though that the driver needs to trigger the rescheduling, it's not entirely automated. And I don't know if this actually involves preemption, or if it is just triggered when the SMM runs idle to avoid the overhead involved with preemption.
 
Well, if it's using preemption, there should be some latency incurred; we shall see when tests come out. At least more than the driver forcing a rescheduling, I presume.
 
You mean the Dynamic Parallelism stuff? I thought so as well, but that's actually just sequential execution with an explicit yield.

From https://devblogs.nvidia.com/parallelforall/cuda-dynamic-parallelism-api-principles/:

So, no, this isn't confirming it either.
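
For reference, a device-side launch with that explicit sync point looks roughly like this. Minimal sketch, not taken from the blog post; the compile flags in the comment are just the usual ones for dynamic parallelism:

Code:
// Build with something like: nvcc -arch=sm_35 -rdc=true cdp.cu -lcudadevrt
#include <cstdio>

__global__ void child(int depth)
{
    printf("child grid running, depth %d, block %d\n", depth, blockIdx.x);
}

__global__ void parent()
{
    // Device-side launch: the child grid is queued by the parent thread.
    child<<<2, 1>>>(1);

    // The explicit yield point: the parent block may be swapped out here so
    // that the child grid can run. Execution is well-ordered, but nothing
    // here requires the parent and child to run simultaneously.
    cudaDeviceSynchronize();

    printf("parent resumes after the child grid has completed\n");
}

int main()
{
    parent<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
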
Yeah, I thought the Dynamic Parallelism (and that was an example they provided somewhere) had changed between Fermi, then Kepler, and now Pascal.
Maxwell I thought was the same as Kepler *shrug*.
Ah well, I wonder what changed in it then.
Cheers
 
Because price is determined by performance, not the size of the die.

GPU performance has increased exponentially over the last 15 years; prices have not.
The same could be said for ARM CPUs.

It's undeniable that over the last couple of generations the introduction price of GPUs has gone up by $50+.
A healthy competitive ecosystem keeps prices low.
We can only hope AMD comes up with something great again.
(Speaking of course as a customer, not a stakeholder)
 
Well, if it's using preemption, there should be some latency incurred; we shall see when tests come out. At least more than the driver forcing a rescheduling, I presume.
The leaked slide on that mentioned a "sub 100usec latency". So probably still something like 100,000 clock cycles. Nothing I would call blazing fast. Perfectly fine for a single asynchronous time warp kernel call per frame, but a hundred context switches may add up to something significant. But who knows how this will work in practice. That needs to be tested.
 
If there was any parallel execution on the same SM, it should have happened in the early stages as well

Makespan drops from 84 to 78.9 ms using concurrent kernels.

The kernels at the end of the job are small enough to cohabitate the single SMX.

Concurrency doesn't mean preemption... hopefully we are discussing the same thing.
 
Makespan drops from 84 to 78.9 ms using concurrent kernels.

The kernels at the end of the job are small enough to cohabitate the single SMX.
The grids at the end of the job are apparently also identical in grid dimensions and in the kernel invoked; only the passed parameters differ, at least if I didn't mistake the context. This is still in line with the assumption that different kernels cannot simultaneously cohabitate the same SMX. At no point was it stated that they would have to belong to the same grid.
Concurrency doesn't mean preemption... hopefully we are discussing the same thing.
But preemption is a possible method to provide concurrency. The question is whether we can have simultaneous, and not only concurrent, execution.
 
Because price is determined by performance, not the size of the die.
Except, I didn't dispute that.

In fact, it's pretty much implied by my statement. :p

I'm not sure why people expect Nvidia to do something that everyone else in the industry doesn't do.
Didn't dispute that either. Although as a customer I wouldn't mind less moneygrubbing from NV; lacking any true competition, there's no pressure on them not to gouge the hell out of their users, just like Intel is doing with their high-end chips. $1.5k for an 8-core socket 2011 CPU, whee... *golfclap*
 
The leaked slide on that mentioned a "sub 100usec latency". So probably still potentially something like 100,000 clock cycles. Nothing I would call blazing fast. Perfectly fine for a single asynchronous time warp kernel call per frame, but a hundred context switches may add up to something significant. But who knows how this will work in practice. That needs to be tested.

The option of allowing a wavefront a fixed period of time to complete before a context switch is invoked was proposed in some of AMD's descriptions of pre-emptable compute. That seems in line with other QoS measures that try to provide predictable or deterministic latencies rather than the absolute minimum latency.

Some of that latency could come from that kind of grace period, since that's plenty of time to write out context, or it might give an idea of the latency it takes to determine at the driver level that something must be preempted and to signal it. That might be fast for an external request, given the queuing delays that can be experienced in other scenarios.

That might not be used in the allocation method, similar to how AMD's priority queue functionality (edit for clarity: quick response queue) has a minimum GCN version that cannot preempt graphics.

The dynamic resource allocation slide, if it accurately describes prior Nvidia GPUs in the static scenario, might indicate that they lacked the ability to rapidly check for SM availability for compute issue.
That ability would seemingly fall out of the implementation of preemption functionality (what's the point of preempting if you don't know when a switch is ready?), even if the allocation method itself does not use a preemption request. There does not seem to be any actual change in the area devoted to the graphics portion, so it's not getting cut off.
 
GPU performance has increased exponentially over the last 15 years; prices have not.
The same could be said for ARM CPUs.

It's undeniable that over the last couple of generations the introduction price of GPUs has gone up by $50+.
A healthy competitive ecosystem keeps prices low.
We can only hope AMD comes up with something great again.
(Speaking of course as a customer, not a stakeholder)
Price is determined by how much it costs to make the product vs. how much consumers are willing to pay for it. If sales of the 1080 drop because it's too expensive, then Nvidia will have to drop the price. Something we saw (and suffered) with Intel: selling CPUs that are 3 years old for more than when they came out. A complete insult in my opinion, but as long as people keep giving them 5 stars on Amazon they are going to keep selling, and the prices (even for used ones...) won't drop a single dollar.
 
The grids at the end of the job are apparently also identical in grid dimensions

No, they're not. They wouldn't have shorter runtimes if they were the same size grid.

Different kernels can run concurrently on the same SM if there are enough resources.

Here's an example: https://gist.github.com/allanmac/049837785a10b7999fce6ca282f62dc6

Kernel "a" and "b" are similar and both require half of the Maxwell v1's shared memory.

Thus the expectation is that only a pair of kernels can fit into an SMM.

Two timelines are shown running on a 3 SMM Quadro K620:
  1. Launch 6 kernels on 6 streams. The first kernel is "b" the rest are "a". All run concurrently.
  2. Launch 7 kernels on 7 streams. The first and 7th are "b". The rest are "a". Six run concurrently and the seventh runs by its lonesome.
This is reliable and routine behavior that CUDA devs depend on.
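
If anyone wants to double-check the co-residency directly, something like the sketch below records which SMM each block ran on and when. This is not part of the gist; only %smid and clock64() are documented features, the rest is made up for illustration, and the clock64() values are only comparable between blocks that report the same SM id.

Code:
#include <cstdio>

struct Trace { unsigned smid; long long t0, t1; float sink; };

__device__ unsigned read_smid()
{
    unsigned id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// TAG makes the two instantiations genuinely different kernels.
template <int TAG>
__global__ void tagged_kernel(Trace *trace, int iters)
{
    extern __shared__ float buf[];          // 32 KB requested at launch: at most two
    if (threadIdx.x == 0) buf[0] = 0.0f;    // resident blocks per SMM on gen-1 Maxwell

    long long t0 = clock64();
    float acc = (float)(TAG + threadIdx.x);
    for (int i = 0; i < iters; ++i)
        acc = acc * 1.000001f + 0.5f;       // keep the SMM busy long enough to be visible
    long long t1 = clock64();

    if (threadIdx.x == 0) {
        trace[blockIdx.x].smid = read_smid();
        trace[blockIdx.x].t0   = t0;
        trace[blockIdx.x].t1   = t1;
        trace[blockIdx.x].sink = acc;        // keep the loop from being optimized away
    }
}

int main()
{
    Trace *ta, *tb;
    cudaMallocManaged(&ta, 3 * sizeof(Trace));
    cudaMallocManaged(&tb, 3 * sizeof(Trace));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // One block per SMM from each kernel on a 3-SMM part like the K620.
    tagged_kernel<0><<<3, 128, 32 * 1024, s0>>>(ta, 1 << 22);
    tagged_kernel<1><<<3, 128, 32 * 1024, s1>>>(tb, 1 << 22);
    cudaDeviceSynchronize();

    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            if (ta[i].smid == tb[j].smid && ta[i].t0 < tb[j].t1 && tb[j].t0 < ta[i].t1)
                printf("a[%d] and b[%d] overlapped on SMM %u\n", i, j, ta[i].smid);
    return 0;
}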

Any questions?

ck --device=1 --nkernels=6

[attached timeline: 6 kernels on 6 streams, all running concurrently]



ck --device=1 --nkernels=7
[attached timeline: 7 kernels on 7 streams, six concurrent and the seventh alone]
 
Any questions?
No, thank you very much for proving it properly and providing a reproducible test.
The second run with two "b" kernels doesn't allow any conclusion as they might have been matched, but the first one with the 5:1 ratio does.
 
The second run with two "b" kernels doesn't allow any conclusion as they might have been matched, but the first one with the 5:1 ratio does.

If you run the example with 6, 12, 18, 24, 30, etc. kernels, you'll see many different concurrent arrangements of "a" and "b" kernels.

Translation: the scheduling rule is probably somewhat dynamic and a mix of first-fit and round-robin. I wouldn't be surprised if there was a heuristic hiding in there to drain more of the same kernel type before launching a different kernel on the same SM (for many many reasons).

For 600 concurrent kernels it looks like this (purple is launch mod 6 = 0):

[attached timeline: 600 concurrent kernels]
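
Just to picture what a mix of first-fit and round-robin could mean here, a purely illustrative toy model of block placement on a 3-SMM, 2-slots-per-SMM part. This is not how the hardware scheduler is documented to work; it only mimics the kind of arrangements seen in the timelines above:

Code:
#include <cstdio>

const int SMS = 3, SLOTS = 2;          // K620-like: 3 SMMs, two resident blocks each
char sm[SMS][SLOTS];                   // 0 = free slot, otherwise the kernel tag
int  next_sm = 0;                      // round-robin starting point

// First-fit search that starts at the round-robin pointer.
void place(char tag)
{
    for (int k = 0; k < SMS; ++k) {
        int s = (next_sm + k) % SMS;
        for (int slot = 0; slot < SLOTS; ++slot) {
            if (sm[s][slot] == 0) {
                sm[s][slot] = tag;
                printf("kernel %c -> SMM %d, slot %d\n", tag, s, slot);
                next_sm = (s + 1) % SMS;
                return;
            }
        }
    }
    printf("kernel %c has to wait for a free slot\n", tag);
}

int main()
{
    const char launches[] = "baaaaab";  // the 7-kernel run: first and seventh are "b"
    for (int i = 0; launches[i]; ++i)
        place(launches[i]);             // the seventh launch finds no free slot
    return 0;
}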
 
If you run the example with 6, 12, 18, 24, 30, etc. kernels, you'll see many different concurrent arrangements of "a" and "b" kernels.

Translation: the scheduling rule is probably somewhat dynamic and a mix of first-fit and round-robin. I wouldn't be surprised if there was a heuristic hiding in there to drain more of the same kernel type before launching a different kernel on the same SM (for many many reasons).

For 600 concurrent kernels it looks like this (purple is launch mod 6 = 0):

[attached timeline: 600 concurrent kernels]

Would be interesting if someone could run this on a Fermi GPU, to see if there is a difference between the architectures.
Cheers
 