Nvidia Pascal Announcement

CarstenS · Jun 4, 2016

Not so sure about that. Rather looks like they might just use those to manage their inventory - and at the same time make a premium. While I obviously cannot be sure, I would guess no new GM200/204 out of TSMC any more.

CSI PC · Jun 4, 2016

Ryan Smith said:
Note that NVIDIA just updated Quadro M6000 and Tesla M40 to 24GB barely two months ago. They may be winding down GeForce production, but NVIDIA is going to be minting GM200 for some time to come.

Yeah, I think they will initially continue with GM200 even when they go into production with Tesla GP102, as the performance gap should be large enough for one to be a competitively priced Tesla and the other more about performance at a price.
One to watch out for IMO is when they stop some of the Kepler K series Tesla products, such as the K80.

Cheers

CarstenS · Jun 4, 2016

And whatever is supported natively on GTX 1080: AotS is a DirectX title and as such has to make do with what's exposed there. And FP16 isn't on Pascal. As well as fine-grained preemption, I might add.

RecessionCone · Jun 4, 2016

CarstenS said:
Not so sure about that. Rather looks like they might just use those to manage their inventory - and at the same time make a premium. While I obviously cannot be sure, I would guess no new GM200/204 out of TSMC any more.

GM200 is on the roadmap (as M40) for a long time still (far into 2017). I have no inside information into Nvidia's production schedules, but at the moment it is much easier to buy a M40 than a P100, and I expect that to be true at least until Q1 2017.

CarstenS · Jun 4, 2016

Well, seems the odds are against me.

CSI PC · Jun 4, 2016

So do any of the publication members here know if their magazine/site discussed with Oxide the FP16 pipe they use and implemented in AotS, back when they reviewed the game or interviewed them?
Cheers

CSI PC · Jun 4, 2016

CarstenS said:
And whatever is supported natively on GTX 1080: AotS is a DirectX title and as such has to make do with what's exposed there. And FP16 isn't on Pascal. As well as fine-grained preemption, I might add.

I thought the conclusion from discussion/hypotheticals was that FP16 was reduced on 1080, but full with Tesla P100; what was never concluded was how this is done with 1080.
Do you think fine-grained preemption is also missing generally, or is your context more around AotS?

Did I miss something in the threads or on a site/publication?
Thanks

CarstenS · Jun 4, 2016

I am only stating what's exposed by the driver in DirectX (see DX Caps Viewer), not making any statements about the hardware.

CSI PC · Jun 4, 2016

CarstenS said:
I am only stating what's exposed by the driver in DirectX (see DX Caps Viewer), not making any statements about the hardware.

OK, here is another being-a-pain question: if fp16 calculations currently have no benefit on existing hardware, then can someone ask Oxide why they went ahead and used it in a way that seems pretty core to the procedural map generation/rendering?
Or again is this something AMD hardware can take advantage of (this would still tie into your driver-DirectX comment) or something related to render target/texture filtering?
Cheers

Edit:
Context being the Stardock statement and the terrain image issues associated with the 1080FE:

We (Stardock/Oxide) are looking into whether someone has reduced the precision on the new FP16 pipe.
Both AMD and NV have access to the Ashes source base. Once we obtain the new cards, we can evaluate whether someone is trying to modify the game behavior (i.e. reduce visual quality) to get a higher score.
In the meantime, taking side by side screenshots and videos will help ensure GPU developers are dissuaded from trying to boost numbers at the cost of visuals.

https://www.reddit.com/r/pcgaming/c...ng_the_aots_image_quality_controversy/d3t9ml4

Psycho · Jun 4, 2016

CSI PC said:
OK, here is another being-a-pain question: if fp16 calculations currently have no benefit on existing hardware, then can someone ask Oxide why they went ahead and used it

Even though you don't have double-speed fp16 ALUs, 16-bit operands will ease register pressure and maybe bandwidth. At least GCN takes advantage of this, and I would assume it's the same for Maxwell.

CSI PC · Jun 4, 2016

Psycho said:
Even though you don't have double-speed fp16 ALUs, 16-bit operands will ease register pressure and maybe bandwidth. At least GCN takes advantage of this, and I would assume it's the same for Maxwell.

Yeah, that is a good point; I wonder if that is the sole reason they did this or if their scope was broader.
It will be interesting as well to see the implications this has with the 1080.
Cheers

MDolenc · Jun 4, 2016

There is general advantage of using fp16 data (memory size, bandwidth,...), but this has been around since floating point was first introduced to GPU pipeline.
Of current architectures only GCN3 can reduce register pressure by using fp16 (that is from min precision hints in HLSL). This presumably applies to GCN4 (Polaris) and Pascal as well.
Now, it shouldn't need pointing out, but since I'm starting to have some serious doubts regarding all this I'll do it anyway: taking 3 fp16 registers and doing an actual fp16 multiply-add on them can produce significantly different results than taking 3 fp16 registers and actually doing an fp32 multiply-add on them.

P.S.: Forgot about Tegra X1, that's current too.
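
To make the multiply-add point concrete, here is a minimal sketch in Python (standard library only). It assumes a non-fused multiply-add that rounds after each step, and it does not model any particular GPU. The helper names (to_f16, to_f32, mad_fp16, mad_fp32) and the input values are made up for illustration.

```python
import struct

def to_f16(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_f32(x):
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def mad_fp16(a, b, c):
    # Hypothetical fp16 ALU: the product is rounded to fp16,
    # then the sum is rounded to fp16 again.
    return to_f16(to_f16(a * b) + c)

def mad_fp32(a, b, c):
    # Hypothetical fp32 ALU fed fp16 inputs: intermediates stay in fp32,
    # only the final result is rounded to fp16 for storage.
    return to_f16(to_f32(to_f32(a * b) + c))

# All three inputs are exactly representable in fp16.
a = b = to_f16(1.0 + 2.0 ** -10)
c = to_f16(-(1.0 + 2.0 ** -9))

r16 = mad_fp16(a, b, c)
r32 = mad_fp32(a, b, c)
print(r16, r32)
```

With these inputs the fp16 path cancels to exactly zero, while the fp32 path keeps the low bits of the product. The absolute gap is tiny, but anything that later scales or divides by the result would see two very different values.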

RecessionCone · Jun 5, 2016

I've been using FP16 on Maxwell to reduce memory allocations and off-chip bandwidth. It works pretty well, but the conversion instructions run quite a bit slower than FP32 ops. Makes it hard to use to reduce register pressure.
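
The storage-only use of FP16 described here can be sketched as follows. This is plain Python rather than CUDA, the sample values are made up, and Python doubles stand in for the higher-precision math, so it shows only the size saving and the load/convert/store pattern, not the cost of the conversion instructions.

```python
import struct

values = [0.1, 0.25, 1.5, 3.14159]  # hypothetical sample data
n = len(values)

# Same data stored as fp32 and as fp16.
as_fp32 = struct.pack(f"<{n}f", *values)
as_fp16 = struct.pack(f"<{n}e", *values)
print(len(as_fp32), len(as_fp16))  # fp16 takes half the bytes

# Load: convert fp16 up to a wider type before doing any math.
loaded = struct.unpack(f"<{n}e", as_fp16)

# Compute at higher precision (Python doubles stand in for fp32 here).
total = sum(loaded)

# Store: convert the result back down to fp16.
stored = struct.pack("<e", total)
```

Every load and store in this pattern involves a conversion, which is the step reported above as running slower than FP32 ops on Maxwell.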

CarstenS · Jun 5, 2016

I'm sure Oxide did their very best to optimize that and keep losses in check.

CSI PC · Jun 5, 2016

CarstenS said:
I'm sure, Oxide did their very best to optimize that and keep losses in check.

Well it would be amusing if they went this route solely looking at GCN3

But then maybe Nvidia should have engaged better *shrug*.
And yeah, I am also curious about the performance benefit/penalty this has for both manufacturers and their various cards, along with the potential implications we see with the terrain differences using the 1080FE.
Cheers

MDolenc · Jun 5, 2016

RecessionCone said:
I've been using FP16 on Maxwell to reduce memory allocations and off-chip bandwidth. It works pretty well, but the conversion instructions run quite a bit slower than FP32 ops. Makes it hard to use to reduce register pressure.

Just to add here: this is CUDA and requires manual conversions. It's not exposed to DX. But then again CarstenS says Pascal doesn't expose min precision either.

Entropy · Jun 5, 2016

MDolenc said:
There is general advantage of using fp16 data (memory size, bandwidth,...), but this has been around since floating point was first introduced to GPU pipeline.
Of current architectures only GCN3 can reduce register pressure by using fp16 (that is from min precision hints in HLSL). This presumably applies to GCN4 (Polaris) and Pascal as well.
Now, it shouldn't need pointing out, but since I'm starting to have some serious doubts regarding all this I'll do it anyway: taking 3 fp16 registers and doing an actual fp16 multiply-add on them can produce significantly different results than taking 3 fp16 registers and actually doing an fp32 multiply-add on them.

P.S.: Forgot about Tegra X1, that's current too.

I may not understand you fully, but are you talking about 32-bit operations on 16-bit operands producing differences in results outside what would be expected from that change of precision?
Otherwise, numerical differences are to be expected - the question is whether those differences produce significant issues in the real-life use case.

For someone who mostly belongs to another field, this would seem to be one of the nice things about interactive graphics programming: if it looks fine, then it IS fine.

CarstenS · Jun 5, 2016

MDolenc said:
Just to add here: this is CUDA and requires manual conversions. It's not exposed to DX. But then again CarstenS says Pascal doesn't expose min precision either.

Correct. I was talking about DX, since AotS is a DX game and it has to use what the API exposes here. In CUDA, things are different (also from OpenCL and OpenGL, where FP16 doesn't seem to be exposed currently either).

sebbbi · Jun 5, 2016

Full fp16 (ALU + reg) vs fp32 ALU running on fp16 registers (split 32 bit register to upper and lower) should only result in slight additional rounding errors. Assuming of course that the result is stored/loaded to/from 16 bit register after each operation. DX allows 1 ULP error. 32 bit ALU with proper rounding at output to 16 bit register produces ~0.5 ULP max error (mantissa cut would be 1 ULP). Native fp16 ALU results in 1 ULP max error (assuming it follows DX spec). We are talking about 0.5 ULP difference per instruction at most. So GCN3 vs Pascal should be almost identical (assuming Pascal is fp16 ALU and GCN3 is fp16 storage + fp32 ALU).
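
The 0.5 ULP versus 1 ULP distinction can be checked with a small Python sketch (standard library only). It assumes positive values in the normal fp16 range; ulp_f16 and truncate_f16 are illustrative helpers, not part of any API, and the input 0.7 is arbitrary.

```python
import math
import struct

def to_f16(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def ulp_f16(x):
    """Spacing of fp16 values around x (positive, normal range only)."""
    _, e = math.frexp(x)      # x = m * 2**e with 0.5 <= m < 1
    return 2.0 ** (e - 11)    # fp16 carries 11 significant bits

def truncate_f16(x):
    """Cut the mantissa instead of rounding (positive, normal range only)."""
    u = ulp_f16(x)
    return math.floor(x / u) * u

x = 0.7
u = ulp_f16(x)
err_nearest = abs(to_f16(x) - x) / u         # proper rounding
err_truncate = abs(truncate_f16(x) - x) / u  # mantissa cut
print(err_nearest, err_truncate)
```

For this input, rounding lands within half a ULP of the exact value, while the mantissa cut lands more than half a ULP away but still under 1 ULP.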

Shader compiler is not allowed to reorder floating point math freely. This is especially important to know when writing numerically stable fp16 code. Good article about things that compilers are not doing: http://www.humus.name/index.php?page=Articles&ID=6
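
A quick way to see why reordering is not allowed: fp16 addition is not associative. The sketch below emulates fp16 addition in Python by rounding after every add; add16 is a made-up helper and the values are chosen to sit right where the fp16 spacing becomes 2.

```python
import struct

def to_f16(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def add16(x, y):
    # Emulated fp16 add: exact sum in double, then one rounding to fp16.
    return to_f16(x + y)

a, b, c = 2048.0, 1.0, 1.0

left = add16(add16(a, b), c)    # (a + b) + c
right = add16(a, add16(b, c))   # a + (b + c)
print(left, right)
```

In the first ordering each 1.0 is lost to rounding on its own; in the second the two small terms are combined first and survive. A compiler that regrouped the expression would silently change the result.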

Of course, GPUs that do not support fp16 at all have significantly higher precision in math done on min16float variables. But if this results in notable differences in a shipping application, it is most likely the developer's fault. You should always check your fp16 code on both fp16 and fp32 to ensure that the image looks the same. #ifdef the type attribute (this allows you to disable fp16 in all shaders with a single-line code change). Every rendering programmer who has worked with PS3 knows how to deal with this. But fp16 support on modern PC hardware is still very limited, meaning that many developers don't yet have a full hardware matrix to test it.
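
The "one switch, test both" workflow can be mimicked outside a shader. The Python sketch below is only an analogy for the #ifdef approach: a single flag selects fp16 or fp32 rounding for every operation in a toy lerp, and the two results are compared. The function, inputs, and the tolerance of one fp16 step near 1.0 are all assumptions made for illustration.

```python
import struct

def to_f16(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_f32(x):
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def lerp(a, b, t, use_fp16):
    # The single switch: pick the rounding applied after every operation.
    q = to_f16 if use_fp16 else to_f32
    a, b, t = q(a), q(b), q(t)
    return q(a + q(t * q(b - a)))

lo = lerp(0.25, 0.75, 0.3, use_fp16=True)
hi = lerp(0.25, 0.75, 0.3, use_fp16=False)

diff = abs(hi - lo)
TOLERANCE = 2.0 ** -10  # hypothetical threshold, not from any spec
print(lo, hi, diff <= TOLERANCE)
```

If the difference stays within whatever tolerance is visually acceptable, the fp16 path is safe to ship; if not, the same switch turns it off everywhere at once.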

CSI PC · Jun 5, 2016

So looks like G-Sync and FastSync are integral to each other, of course it may just be early driver/technology issues or teething problems.
But a member on another site has reported that for G-Sync to behave correctly with his 1080, he also had to enable FastSync.

I got it figured out. You guys were both on the right track. It was a refresh rate problem. I ended up leaving Gsync on and also enabled Fastsync. Everything seems to be working perfect now. It really odd though because I never had this happen at all with the same setting on my 980. Something must be a little different in the driver. Either way thanks to both of you for the suggestions. I'm just glad the card doesn't have issues. +rep

http://www.overclock.net/t/1601922/anybody-having-problems-with-gtx-1080

Any publications likely to test or investigate G-Sync/FastSync/etc with Pascal cards and also Maxwell 2?
Cheers
