PS: I was told that Maxwell / Pascal can only execute a single kernel per SMM, i.e. that all warps resident on an SMM must belong to the same kernel.
Very interesting. Also, regarding HDR: as far as I can see, the GeForce won't be able to play HDR video content (there is no mention of it), which is a disappointment.
> 2x4k @60Hz 10bit HEVC encode but 2x4k @120Hz / 8k @30Hz 320Mbps 12bit decode.

12 or 10 bit?
2x4k @60Hz 10bit HEVC encode but 2x4k @120Hz / 8k @30Hz 320Mbps 12bit decode.
See the third table of this article (NVIDIA GeForce GTX 1080 Video Support):
http://videocardz.com/59962/nvidia-geforce-gtx-1080-final-specifications-and-launch-presentation
> That's not correct.

Is there anything to back that up?
> 2x4k @60Hz 10bit HEVC encode but 2x4k @120Hz / 8k @30Hz 320Mbps 12bit decode.
> See the third table of this article (NVIDIA GeForce GTX 1080 Video Support):
> http://videocardz.com/59962/nvidia-geforce-gtx-1080-final-specifications-and-launch-presentation

Caveat: The full 10/12 bit decoder is probably only for Windows. The existing (10 bit) HEVC decoder on the GTX 950 is limited to 8 bit in Linux, mostly due to an unfortunate design limitation of the Linux VDPAU decoder interface. This affects AMD and Intel, too.
> Sure, the CUDA forums have examples of Kepler and Maxwell GPUs that support more concurrent streams than there are multiprocessors.
> [...]
> For example, a tiny two SMX GK208 Kepler GPU can support 16 concurrent streams. With the proper driver, a single SMX K1 mobile GPU supports 4 streams.

But that does not guarantee parallel execution, does it? You may schedule to 16 different queues and they won't block each other, but there is no guarantee that they are processed in parallel rather than temporarily stalling each other on the available SMMs. Plus, we have observed weird disparities between compute via CUDA and via DirectX before, so even if it works with CUDA (for which we don't have proof yet), that doesn't allow for a generic statement.
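For what it's worth, the multi-stream setup being discussed can be sketched in a few lines of CUDA. This is an untested illustration, not something from the thread: the kernel, stream count and launch dimensions are all arbitrary placeholders. Each launch is a single small block, so several of the grids can in principle be co-resident on one multiprocessor; whether they actually overlap is up to the scheduler, which is exactly the point under debate.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Arbitrary busy-loop kernel; one block per launch, so a single grid
// cannot fill an SM on its own.
__global__ void busy(float *out, int iters)
{
    float v = (float)threadIdx.x;
    for (int i = 0; i < iters; ++i)
        v = v * 1.0001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

int main()
{
    const int nStreams = 16;  // deliberately more streams than a small GPU has SMs
    cudaStream_t streams[nStreams];
    float *buf;
    cudaMalloc(&buf, nStreams * 128 * sizeof(float));

    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate(&streams[i]);

    // Launches on distinct non-default streams are allowed to overlap;
    // nothing here forces them to.
    for (int i = 0; i < nStreams; ++i)
        busy<<<1, 128, 0, streams[i]>>>(buf + i * 128, 1 << 20);

    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(streams[i]);
    cudaFree(buf);
    return 0;
}
```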
> 314 mm^2... Looks like this will be another cash cow for Nvidia.

Only slightly larger than good ole GK104, but with a majorly higher sales price. The GeForce 770 was upper midrange or lower high-end, however you want to look at it, and it certainly did not retail for 600-700 USD.
> Only slightly larger than good ole GK104, but a majorly higher sales price. Geforce 770 was upper midrange or lower high-end, however you want to look at it, and certainly did not retail for 600-700 USD.

An x mm^2 chip on 14nm is also vastly more expensive to make than an x mm^2 chip on 28nm.
> x mm^2 chip on 14nm is also vastly more expensive to make than x mm^2 chip on 28nm

Yes, but $700 is like twice the price the 770 retailed for, AFAIR. I.e., GP104 costs several hundred dollars more to make than GK104? Seems a tad much, maybe.
> PS: [...]
> Why I'm tempted to believe that rumor is that this would allow Nvidia to perform allocation of SMEM, registers etc. for a single kernel in a single go, statically, with only warp initialization happening at run time. Seems like a great way to eliminate any chance of fragmentation of the RF and SMEM, so you can avoid any form of paging inside these, and possibly even optimize for a multi-tier layout of these.
> Why I'm less convinced about it is the graphics pipeline, as that would also mean all the shared special function units (mainly the Raster Engine) would be unable to run unless at least a single SMM had a corresponding kernel active, so you wouldn't be able to put these units under permanent load. On the other hand, this probably allows reusing parts of the polymorph engine rather than providing dedicated resources to each function or requiring arbitration, so who knows?

Not sure this fits, but didn't NVIDIA show a difference in how it handles parallel recursion between Fermi (more crude and complex) and Kepler (separate streams) with CUDA examples?
> Not sure this fits, but didn't NVIDIA show a difference in how it handles parallel recursion between Fermi (more crude and complex) and Kepler (separate streams) with CUDA examples?

You mean the Dynamic Parallelism stuff? I thought so as well, but that's actually just sequential execution with an explicit yield.
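For reference, the Dynamic Parallelism pattern in question looks roughly like this (untested sketch; needs a CC 3.5+ device and `nvcc -arch=sm_35 -rdc=true`; kernel names are made up). The parent grid blocks at the device-side synchronize until the child grid has finished, which is the "explicit yield" rather than concurrent execution:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void child()
{
    printf("child thread %d\n", threadIdx.x);
}

__global__ void parent()
{
    if (threadIdx.x == 0) {
        // Device-side launch of a nested grid.
        child<<<1, 4>>>();
        // The parent waits here until the child grid completes: an
        // explicit yield, not parent and child running side by side.
        cudaDeviceSynchronize();
    }
}

int main()
{
    parent<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```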
> However, concurrent execution is in no case guaranteed: it is primarily intended for better utilization of GPU resources. Any program that depends on specific grids or thread blocks executing concurrently is ill-formed.

So, no, this isn't confirming it either.
> PS: [...]
> Why I'm tempted to believe that rumor is that this would allow Nvidia to perform allocation of SMEM, registers etc. for a single kernel in a single go, statically, with only warp initialization happening at run time. Seems like a great way to eliminate any chance of fragmentation of the RF and SMEM, so you can avoid any form of paging inside these, and possibly even optimize for a multi-tier layout of these.
> Why I'm less convinced about it is the graphics pipeline, as that would also mean all the shared special function units (mainly the Raster Engine) would be unable to run unless at least a single SMM had a corresponding kernel active, so you wouldn't be able to put these units under permanent load. On the other hand, this probably allows reusing parts of the polymorph engine rather than providing dedicated resources to each function or requiring arbitration, so who knows?

It's actually rather easy to write a program that runs two kernels on the same SM simultaneously. Your rumor is false.
> The thing that couldn't be done (at least until Pascal, not sure if this has changed) is to run a compute kernel and a graphics kernel concurrently on the same SM.

This didn't change. It just can switch between either mode now without halting the GPU, and possibly without waiting for the SM to run dry, thanks to preemption.
> It's actually rather easy to write a program that runs two kernels on the same SM simultaneously.

How? It's not like you could control the SM selection, nor does the API permit any pattern in which actual concurrent execution of multiple kernels would be required. So unless someone manages to achieve a measurable speedup by concurrently running two kernels with orthogonal per-SM resource constraints (e.g. two kernels which intentionally exhaust the RF and the SMEM limit, respectively), I still can't consider this confirmed.
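A test along the lines suggested above could look roughly like this: one kernel sized to hog shared memory, one with a wide dependent chain to drive register usage, timed back-to-back versus on two streams. This is an untested sketch; all kernel bodies and launch parameters are placeholders and would have to be tuned to the per-SM limits of the target GPU before the timing comparison means anything.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder SMEM-bound kernel: in a real test the static shared array
// would be sized up against the per-SM shared memory limit.
__global__ void smem_heavy(float *out, int iters)
{
    __shared__ float s[3072];                 // 12 KB, placeholder size
    s[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    float v = s[threadIdx.x];
    for (int i = 0; i < iters; ++i)
        v = v * 1.0001f + s[(i + threadIdx.x) % 64];
    out[threadIdx.x] = v;
}

// Placeholder register-bound kernel: many live values in a dependent
// chain push per-thread register usage up.
__global__ void reg_heavy(float *out, int iters)
{
    float a = 1.f, b = 2.f, c = 3.f, d = 4.f, e = 5.f, f = 6.f, g = 7.f, h = 8.f;
    for (int i = 0; i < iters; ++i) {
        a = b * 1.01f + h; b = c * 1.01f + a; c = d * 1.01f + b; d = e * 1.01f + c;
        e = f * 1.01f + d; f = g * 1.01f + e; g = h * 1.01f + f; h = a * 1.01f + g;
    }
    out[threadIdx.x] = a + b + c + d + e + f + g + h;
}

static double run_ms(bool concurrent, float *buf, int iters,
                     cudaStream_t s0, cudaStream_t s1)
{
    cudaDeviceSynchronize();
    auto start = std::chrono::steady_clock::now();
    smem_heavy<<<1, 64, 0, s0>>>(buf, iters);
    if (!concurrent)
        cudaStreamSynchronize(s0);            // force serial execution
    reg_heavy<<<1, 64, 0, concurrent ? s1 : s0>>>(buf + 64, iters);
    cudaDeviceSynchronize();
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}

int main()
{
    float *buf;
    cudaMalloc(&buf, 128 * sizeof(float));
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    const int iters = 1 << 22;
    printf("serial:     %.2f ms\n", run_ms(false, buf, iters, s0, s1));
    printf("concurrent: %.2f ms\n", run_ms(true,  buf, iters, s0, s1));

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(buf);
    return 0;
}
```

A clearly sub-2x concurrent time with kernels that each exhaust a different per-SM resource would be the measurable speedup asked for above; identical timings for both runs would suggest the grids are being serialized after all.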