Native FP16 support in GPU architectures

Discussion in 'Architecture and Products' started by xpea, Oct 17, 2014.

  1. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    3,864
    Likes Received:
    363
    Location:
    35.1415,-90.056
    Your statement was that IPC had not changed since the C2D days; I have demonstrated that claim to be incorrect. You point out cache differences as if that somehow invalidates the comparison, but that architectural change (among dozens of others) is what allowed the increase.

    I do not understand why you pointed it out, as strapping that L1 cache structure onto a C2D isn't going to magically net a 30% increase in IPC, and you know it.
     
  2. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,445
    Likes Received:
    181
    Location:
    Chania
    As I said, a single FP64 unit at 1GHz on 28nm is roughly 0.025mm² for synthesis alone, a figure I grabbed from one of Dally's past presentations long before Kepler appeared.

    :lol:
     
  3. homerdog

    homerdog donator of the year
    Legend Veteran Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,270
    Likes Received:
    1,038
    Location:
    still camping with a mauler
    Haswell is basically just Conroe with moar cachez and 22nm. Xmas told me so you know it's true.


    :grin:
     
  4. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    3,864
    Likes Received:
    363
    Location:
    35.1415,-90.056
    My intention was not to be inflammatory, but rather to clear up a misunderstanding that seems to pervade a lot of minds in the x86 world.

    I do infrastructure architecture for a living now; one of my many job duties is defining standards around hardware configuration for our x86 platforms. It very much surprises me when a "server admin", who had been the subject matter expert for spec'ing servers for decades before I showed up, makes statements not dissimilar to the above: "Oh, clock speeds have been the same forever, the only reason new CPUs are faster nowadays is because there are more cores."

    Not true at all. A lot of these same folks seem clueless to the implications of NUMA domains and their effect on code not specifically built to deal with them. The server admins here had been buying dual-socket servers for everything and then blaming terrible performance on the lower clockspeeds of the decacore CPUs they purchased. It boggles their minds when I suggest removing one of the CPUs and benchmarking again, especially when I finally forced the issue and the product performance increased some ~25%.

    Dual-socket decacore servers, running 32GB of RAM, running nothing more than a fat stack of very simplistic (and non-computationally bound) JVMs. What idiot doesn't profile the use case?
     
    #64 Albuquerque, Oct 21, 2014
    Last edited by a moderator: Oct 21, 2014
  5. homerdog

    homerdog donator of the year
    Legend Veteran Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,270
    Likes Received:
    1,038
    Location:
    still camping with a mauler
    I would say most "server admins". I regularly work with self-proclaimed experts (who, naturally, are paid more than me) who don't know the difference between an HDD and an SSD.
     
  6. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,579
    Likes Received:
    660
    Location:
    WI, USA
    lol yeah, I actually use an FX 5900 Ultra for some old games. I have only barely used a PS3.
     
  7. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    3,864
    Likes Received:
    363
    Location:
    35.1415,-90.056
    Ah yes, but all in the name of standards: "We only buy this configuration." Yeah, a $12,000 configuration that doesn't meet the needs of 80% of our deployed application base.

    Anyway, I need to stop hijacking this thread. IPC on modern processors has indeed increased significantly compared to processors from five years ago.
     
  8. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,445
    Likes Received:
    181
    Location:
    Chania
    Ewwww, that was the successor of the dustblower :shock: Jensen claimed back then that the G7x in the PS3 (I think Sony called it the "Reality Synthesizer") was twice as fast as a GeForce 6800. [/end of OT]
     
  9. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    I don't know how their hardware works, but I assume that they are only storing the (post-projection) position data to memory. According to my experiments in GPU-based software rasterization, I would definitely split the vertex shader (and vertex buffer) into two parts. There isn't usually that much ALU work and fetched data shared by the position calculations and the other calculations (tangent frame transform + decompress + tangent fetch + UV fetch). This way you save 75%-80% of the storage cost (for non-skinned stuff). I would personally execute all the parts of the vertex shader that produce varyings in the second stage (output them directly to LDS or a similar on-chip memory pool), so there's not much gain in using smaller-format varyings.

    Obviously with random shaders (not under your control) it might be hard to ensure that the splitting is always a win (or works 100% perfectly), and it might be hard to split the vertex data efficiently (the driver might do this, but transform feedback and other dynamic modifications still need to work).
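
    The storage claim above can be sanity-checked with a little arithmetic. The attribute sizes below are illustrative assumptions for a typical non-skinned vertex, not numbers from the post: a float4 clip-space position written by the first stage, versus the full set of varyings produced by the second.

```python
# Sketch of the split-vertex-shader storage argument. Sizes are assumed,
# not taken from the post above.

POSITION_BYTES = 16  # float4 clip-space position, the only first-stage output

# Varyings produced by the second stage (which could stay on-chip, e.g. in LDS):
VARYING_BYTES = (
    16    # tangent frame as a quaternion (float4)
    + 8   # texture coordinates (2 x float)
    + 12  # world-space position (3 x float)
    + 12  # world-space normal (3 x float)
)

full_vertex = POSITION_BYTES + VARYING_BYTES  # bytes if everything hit memory
saving = 1.0 - POSITION_BYTES / full_vertex
print(f"storage saved by writing only positions: {saving:.0%}")  # -> 75%
```

    With these assumed sizes, writing only positions to memory saves 75% of the per-vertex storage, the low end of the 75%-80% range quoted above; fatter varying sets push the saving higher.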
     
    #69 sebbbi, Oct 21, 2014
    Last edited by a moderator: Oct 21, 2014
  10. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,579
    Likes Received:
    660
    Location:
    WI, USA
    My 5900U is dead silent. It has an Arctic Accelero S2 on it. It does have some inductor noise but it's just enough for you to know it's there.

    GeForce FX has some advantages for old games, like hardware palettized textures and table fog. The 45.23 drivers (the oldest a 5900 can run) have solid compatibility with old stuff, which makes it better than ATI and also GF6+. FX cards are great as long as you aren't trying to play D3D9 games. Think of them as beefed-up GF4 Ti cards.
     
  11. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,314
    Likes Received:
    140
    Location:
    On the path to wisdom
    Rogue executes the vertex shader only once. Post-transform geometry is written to memory using lossless compression.

    There are various reasons for and against a two-stage approach, and this is an area where mobile GPU vendors have experimented a lot.
     
  12. MrMilli

    Newcomer

    Joined:
    Apr 6, 2008
    Messages:
    10
    Likes Received:
    0
    As I suspected. In GFXBench 3.0 Manhattan Offscreen: 32.5 fps vs 31 fps. A small victory for PowerVR. Need to see more benchmarks though.
     
  13. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    A small victory despite the 20/28 nm implementation difference, at unknown power consumption and cost difference.

    Hard to say who it's a small victory for other than Apple - it doesn't tell us much about the underlying architecture. It would be equally wrong to say GTX980 is a big victory for NVIDIA over PowerVR.
     
  14. MrMilli

    Newcomer

    Joined:
    Apr 6, 2008
    Messages:
    10
    Likes Received:
    0
    True, but the A8X is a 3 billion transistor chip. I'm pretty sure they can't push clocks that high, even on 20nm. Remember, 20nm isn't that magical. Apparently the K1 GPU is clocked around 850 MHz.
     
  15. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    We'll have to see what Nvidia's 20nm Maxwell SoC can do, then. Given Maxwell on 28nm, I'm thinking Maxwell on 20nm might be kind of magical.
     
  16. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,445
    Likes Received:
    181
    Location:
    Chania
    It'll lead you to the same dead end, since NV and Apple have different hardware refresh cycles. Erista will most likely be ahead of the A8X, just as the A9 will likely end up ahead of Erista.

    Maxwell is "magical" already in desktop GPUs, even on "just" 28nm. The whole thing is way OT anyway, but aren't we overrating the benefits of the underlying manufacturing process a bit?
     
  17. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    Sorry to divert the thread; I've been away for a long time. One of the things that concerns me in these mobile benchmarking reviews is that no one seems to be doing direct image quality analysis anymore.

    In the heyday of desktop GPUs evolving quickly, vendors would get caught with their pants down using all kinds of approximations, deliberately cheating, having poor anisotropic filtering, or otherwise disobeying what the benchmark requested of the drivers.

    How do we know this isn't occurring again? I don't think we should trust GFXBench or 3DMark, Apple or Nvidia; reviewers should be digging deep into the outputs of these chips. Who knows, maybe there's another "Quack.exe" lurking. Unfortunately, these mobile platforms seem very locked down: it's hard to get source and hard to probe the sandboxes to make the kinds of hacks that exposed cheating in the past (although it's easier to expose cheats on Android).
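
    The kind of analysis being asked for can at least be started with a crude automated diff: capture the same benchmark frame from a reference renderer and from the device under test, then compare per pixel. A minimal sketch, assuming the framebuffers are available as flat lists of 8-bit (r, g, b) tuples; how captures are obtained on each platform is left open, which is exactly the locked-down part lamented above.

```python
def frame_diff(ref, test):
    """Per-pixel comparison of two framebuffers given as flat lists of
    (r, g, b) tuples in 0-255. Returns the maximum per-channel error and
    the fraction of pixels that differ at all. A large max error, or a
    spatially clustered set of differing pixels, hints that filtering or
    precision was quietly cut on one side."""
    assert len(ref) == len(test), "frames must have the same pixel count"
    max_err = 0
    differing = 0
    for (r1, g1, b1), (r2, g2, b2) in zip(ref, test):
        err = max(abs(r1 - r2), abs(g1 - g2), abs(b1 - b2))
        if err:
            differing += 1
            max_err = max(max_err, err)
    return max_err, differing / len(ref)

# Toy frames: the "device" output darkens one pixel, as a lossy shortcut might.
reference = [(200, 200, 200)] * 4
device = [(200, 200, 200)] * 3 + [(180, 200, 200)]
print(frame_diff(reference, device))  # -> (20, 0.25)
```

    A real harness would of course work on full-resolution captures and add perceptual metrics, but even this level of diffing is what caught the desktop-era cheats.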
     
  18. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,445
    Likes Received:
    181
    Location:
    Chania
    Yes, it should definitely be a concern, but I'm also sure that all involved IHVs are wide awake, watching what each competitor is doing. Oh, and I definitely missed you; it's good to see you again.
     
  19. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland

    Sadly, this is not the same market. It's already really hard to find reliable benchmarks for mobile, let alone people who understand how the benchmarks work when they use them for reviews.
     
  20. Dominik D

    Regular

    Joined:
    Mar 23, 2007
    Messages:
    782
    Likes Received:
    22
    Location:
    Wroclaw, Poland
    We don't. That's why we need a proper OGL cert suite. What Khronos provides is a joke compared to what you have to go through for DirectX validation. I hope there will be some sort of transparency into waivers for different cores too; certain vendors cut corners wherever they can (or can't).
     