PS: I was told that Maxwell / Pascal can only execute a single kernel per SMM, i.e. that all warps resident on an SMM must belong to the same kernel.
Very interesting. Also, regarding HDR: as far as I can see, the GeForce won't be able to play HDR video content (there is no mention of it), which is a disappointment.
> 2x4k @60Hz 10bit HEVC encode but 2x4k @120Hz / 8k @30Hz 320Mbps 12bit decode.

12 or 10 bit?
2x4k @60Hz 10bit HEVC encode but 2x4k @120Hz / 8k @30Hz 320Mbps 12bit decode.
See the third table of this article (NVIDIA GeForce GTX 1080 Video Support):
http://videocardz.com/59962/nvidia-geforce-gtx-1080-final-specifications-and-launch-presentation
> That's not correct.

Is there anything to back that up?
> 2x4k @60Hz 10bit HEVC encode but 2x4k @120Hz / 8k @30Hz 320Mbps 12bit decode.
> See the third table of this article (NVIDIA GeForce GTX 1080 Video Support):
> http://videocardz.com/59962/nvidia-geforce-gtx-1080-final-specifications-and-launch-presentation

Caveat: The full 10/12 bit decoder is probably only for Windows. The existing (10 bit) HEVC decoder on the GTX 950 is limited to 8 bit in Linux, mostly due to an unfortunate design limitation of the Linux VDPAU decoder interface. This affects AMD and Intel, too.
> Sure, the CUDA forums have examples of Kepler and Maxwell GPUs that support more concurrent streams than there are multiprocessors.
> [...]
> For example, a tiny two SMX GK208 Kepler GPU can support 16 concurrent streams. With the proper driver, a single SMX K1 mobile GPU supports 4 streams.

But that does not guarantee parallel execution, does it? You may schedule to 16 different queues and they won't block each other, but there is no guarantee that they are processed in parallel rather than temporarily stalling each other on the available SMMs. Plus, we have observed weird disparities between compute via CUDA and via DirectX before, so even if it works with CUDA (for which we don't have proof yet), that doesn't allow for a generic statement.
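For what it's worth, the multi-stream setup being discussed can be sketched in a few lines of CUDA. This is an untested illustration, not something from the thread: the kernel, stream count and launch dimensions are all arbitrary placeholders. Each launch is a single small block, so several of the grids can in principle be co-resident on one multiprocessor; whether they actually overlap is up to the scheduler, which is exactly the point under debate.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Arbitrary busy-loop kernel; one block per launch, so a single grid
// cannot fill an SM on its own.
__global__ void busy(float *out, int iters)
{
    float v = (float)threadIdx.x;
    for (int i = 0; i < iters; ++i)
        v = v * 1.0001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

int main()
{
    const int nStreams = 16;  // deliberately more streams than a small GPU has SMs
    cudaStream_t streams[nStreams];
    float *buf;
    cudaMalloc(&buf, nStreams * 128 * sizeof(float));

    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate(&streams[i]);

    // Launches on distinct non-default streams are allowed to overlap;
    // nothing here forces them to.
    for (int i = 0; i < nStreams; ++i)
        busy<<<1, 128, 0, streams[i]>>>(buf + i * 128, 1 << 20);

    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(streams[i]);
    cudaFree(buf);
    return 0;
}
```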
> 314 mm^2... Looks like this will be another cash cow for Nvidia.

Only slightly larger than good ole GK104, but with a majorly higher sales price. The GeForce 770 was upper midrange or lower high-end, however you want to look at it, and it certainly did not retail for 600-700 USD.
> Only slightly larger than good ole GK104, but a majorly higher sales price. Geforce 770 was upper midrange or lower high-end, however you want to look at it, and certainly did not retail for 600-700 USD.

An x mm^2 chip on 14nm is also vastly more expensive to make than an x mm^2 chip on 28nm.
> x mm^2 chip on 14nm is also vastly more expensive to make than x mm^2 chip on 28nm

Yes, but $700 is like twice the price the 770 retailed for, AFAIR. I.e., GP104 costs several hundred dollars more to make than GK104? Seems a tad much, maybe.
> PS: [...]
> Why I'm tempted to believe that rumor is that this would allow Nvidia to perform allocation of SMEM, registers etc. for a single kernel in a single go, statically, with only warp initialization happening at run time. Seems like a great way to eliminate any chance of fragmentation of the RF and SMEM, so you can avoid any form of paging inside these, and possibly even optimize for a multi-tier layout of these.
> Why I'm less convinced about it is the graphics pipeline, as that would also mean all the shared special function units (mainly the Raster Engine) would be unable to run unless at least a single SMM had a corresponding kernel active, so you wouldn't be able to put these units under permanent load. On the other hand, this probably allows reusing parts of the polymorph engine rather than providing dedicated resources to each function or requiring arbitration, so who knows?

Not sure this fits, but didn't NVIDIA show a difference in how it handles parallel recursion between Fermi (more crude and complex) and Kepler (separate streams) with CUDA examples?
> Not sure this fits, but didn't NVIDIA show a difference in how it handles parallel recursion between Fermi (more crude and complex) and Kepler (separate streams) with CUDA examples?

You mean the Dynamic Parallelism stuff? I thought so as well, but that's actually just sequential execution with an explicit yield.
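For reference, the Dynamic Parallelism pattern in question looks roughly like this (untested sketch; needs a CC 3.5+ device and `nvcc -arch=sm_35 -rdc=true`; kernel names are made up). The parent grid blocks at the device-side synchronize until the child grid has finished, which is the "explicit yield" rather than concurrent execution:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void child()
{
    printf("child thread %d\n", threadIdx.x);
}

__global__ void parent()
{
    if (threadIdx.x == 0) {
        // Device-side launch of a nested grid.
        child<<<1, 4>>>();
        // The parent waits here until the child grid completes: an
        // explicit yield, not parent and child running side by side.
        cudaDeviceSynchronize();
    }
}

int main()
{
    parent<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```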
> However, concurrent execution is in no case guaranteed: it is primarily intended for better utilization of GPU resources. Any program that depends on specific grids or thread blocks executing concurrently is ill-formed.

So, no, this isn't confirming it either.
> PS: [...]
> Why I'm tempted to believe that rumor is that this would allow Nvidia to perform allocation of SMEM, registers etc. for a single kernel in a single go, statically, with only warp initialization happening at run time. Seems like a great way to eliminate any chance of fragmentation of the RF and SMEM, so you can avoid any form of paging inside these, and possibly even optimize for a multi-tier layout of these.
> Why I'm less convinced about it is the graphics pipeline, as that would also mean all the shared special function units (mainly the Raster Engine) would be unable to run unless at least a single SMM had a corresponding kernel active, so you wouldn't be able to put these units under permanent load. On the other hand, this probably allows reusing parts of the polymorph engine rather than providing dedicated resources to each function or requiring arbitration, so who knows?

It's actually rather easy to write a program that runs two kernels on the same SM simultaneously. Your rumor is false.
> The thing that couldn't be done (at least until Pascal, not sure if this has changed) is to run a compute kernel and a graphics kernel concurrently on the same SM.

This didn't change. It just can switch between either mode now without halting the GPU, and possibly without waiting for the SM to run dry, thanks to preemption.
> It's actually rather easy to write a program that runs two kernels on the same SM simultaneously.

How? It's not like you could control the SM selection, nor does the API permit any pattern in which actual concurrent execution of multiple kernels would be required. So unless someone manages to achieve a measurable speedup by concurrently running two kernels with orthogonal per-SM resource constraints (e.g. two kernels which intentionally exhaust the RF and the SMEM limit, respectively), I still can't consider this confirmed.
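A test along the lines suggested above could look roughly like this: one kernel sized to hog shared memory, one with a wide dependent chain to drive register usage, timed back-to-back versus on two streams. This is an untested sketch; all kernel bodies and launch parameters are placeholders and would have to be tuned to the per-SM limits of the target GPU before the timing comparison means anything.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder SMEM-bound kernel: in a real test the static shared array
// would be sized up against the per-SM shared memory limit.
__global__ void smem_heavy(float *out, int iters)
{
    __shared__ float s[3072];                 // 12 KB, placeholder size
    s[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    float v = s[threadIdx.x];
    for (int i = 0; i < iters; ++i)
        v = v * 1.0001f + s[(i + threadIdx.x) % 64];
    out[threadIdx.x] = v;
}

// Placeholder register-bound kernel: many live values in a dependent
// chain push per-thread register usage up.
__global__ void reg_heavy(float *out, int iters)
{
    float a = 1.f, b = 2.f, c = 3.f, d = 4.f, e = 5.f, f = 6.f, g = 7.f, h = 8.f;
    for (int i = 0; i < iters; ++i) {
        a = b * 1.01f + h; b = c * 1.01f + a; c = d * 1.01f + b; d = e * 1.01f + c;
        e = f * 1.01f + d; f = g * 1.01f + e; g = h * 1.01f + f; h = a * 1.01f + g;
    }
    out[threadIdx.x] = a + b + c + d + e + f + g + h;
}

static double run_ms(bool concurrent, float *buf, int iters,
                     cudaStream_t s0, cudaStream_t s1)
{
    cudaDeviceSynchronize();
    auto start = std::chrono::steady_clock::now();
    smem_heavy<<<1, 64, 0, s0>>>(buf, iters);
    if (!concurrent)
        cudaStreamSynchronize(s0);            // force serial execution
    reg_heavy<<<1, 64, 0, concurrent ? s1 : s0>>>(buf + 64, iters);
    cudaDeviceSynchronize();
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}

int main()
{
    float *buf;
    cudaMalloc(&buf, 128 * sizeof(float));
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    const int iters = 1 << 22;
    printf("serial:     %.2f ms\n", run_ms(false, buf, iters, s0, s1));
    printf("concurrent: %.2f ms\n", run_ms(true,  buf, iters, s0, s1));

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(buf);
    return 0;
}
```

A clearly sub-2x concurrent time with kernels that each exhaust a different per-SM resource would be the measurable speedup asked for above; identical timings for both runs would suggest the grids are being serialized after all.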