PowerVR Rogue Architecture

We don't execute different thread types simultaneously, but we can switch to a new thread type in the next clock, and threads can issue to either the F32 or F16 datapath, just not both at the same time.
Thanks for the info Rys :)

To confirm that I understood it correctly: If there are stall cycles (cache miss) in my pixel shader that is running FP16 instructions, the same USC can efficiently fill the stall cycles with for example vertex shader FP32 instructions?
 
More common where? Where did you take these statistical samples from?

Current and older Android/iOS games, or did you get input from games currently in development?

I'm asking this because as mobile games approach full-fledged desktop/console games in graphics fidelity, I thought the presence of dedicated FP16 units would start to diminish, not grow.



Or to just face the elephant in the room at once: please reassure us that the changed ratio for the 6XT series is meant to let us play newer games with better graphics and higher framerates, and not just to get a faster GFXBench or 3DMark score :oops:


The decision on what to build, and how much of each thing, is driven by an enormous set of analysed inputs: shaders and kernels from games old and new on mobile and PC, internal tests and microbenchmarks, input from and discussion with developers (especially the major middleware guys; I talk to Epic very regularly, for example), regression suites from our customers, the research team's research simulators....

Basically everywhere graphics and compute code runs in any market, we try and take a look at apps and tests and middleware to see what they're doing now and what developers want in the future, and put it all through current HW and SW and the stuff in development and research for the future.

Benchmarks are just one of many many inputs to the process, but games and UIs are what we focus on when making a call one way or the other at decision making time.
 
Thanks for the info Rys :)



To confirm that I understood it correctly: If there are stall cycles (cache miss) in my pixel shader that is running FP16 instructions, the same USC can efficiently fill the stall cycles with for example vertex shader FP32 instructions?


Yep. If we sleep threads we can run anything else that's ready, issuing to whatever paths those threads need.
 
To answer Sebbi's question differently: there's no "modality". There are F16 and F32 ALUs, and what's executed depends only on the instruction being issued.
 
To answer Sebbi's question differently: there's no "modality". There are F16 and F32 ALUs, and what's executed depends only on the instruction being issued.
Using Intel's CPU terminology: The F16 and F32 ALUs share a dispatch port.
 
I bumped into this super exciting Apple extension: https://www.khronos.org/registry/gles/extensions/EXT/EXT_shader_framebuffer_fetch.txt

I assume this allows me to perform all kinds of custom read/modify/write operations inside the on-chip tile buffers? If yes, then this is HUGE :)

For example I could implement super efficient deferred rendering using this extension: the rasterization pass writes the g-buffers, the g-buffer data stays in the on-chip tile buffer, and the lighting shader reads it from there and writes the lighting results on top of the existing g-buffer data. I could also do soft particle blending with this... etc, etc.
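To sketch what I mean (purely illustrative GLSL ES based on my reading of the extension spec; the texture and varying names are made up), a custom soft-particle style blend could look like this:

#extension GL_EXT_shader_framebuffer_fetch : require
precision mediump float;

uniform sampler2D u_particleTex;   // made-up particle texture
varying vec2 v_uv;
varying float v_fade;              // made-up soft-particle fade factor

void main()
{
    // gl_LastFragData[0] is the current framebuffer value for this pixel,
    // read straight from the on-chip tile buffer, no external memory traffic.
    vec4 dst = gl_LastFragData[0];
    vec4 src = texture2D(u_particleTex, v_uv);
    src.a *= v_fade;
    // Custom read/modify/write "over" blend done entirely in the shader.
    gl_FragColor = vec4(mix(dst.rgb, src.rgb, src.a), dst.a);
}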

I would like to see new OpenGL extensions that expose the TBDR hardware more. This extension is a very good start (assuming I understood it correctly). If possible, random access read/write inside the on-chip tile buffer (within the tile) would be even better. That would allow crazy algorithms with zero bandwidth cost. At 2048x1536 you need crazy stuff to make high quality post processing possible (mobile SoCs have very limited BW).
 
Rys has repeatedly made the point that FP16 ALUs are rather ideal for blending and compositing.

I completely agree.

I've developed various compositing kernels (via GPU compute) over the last few years and *not* having FP16 support is painful. FP32 is overkill and U8 is not enough. Furthermore, integer composites wind up requiring more instructions even with hackery -- at least with the GPU instruction sets I'm familiar with.

Additionally, compositing ordered pixels is not an insignificant workload in a pipeline despite mapping almost perfectly to SIMT. Compositing can grow to be an ever larger part of a pipeline if you're doing things like refreshing or redrawing a static (layered) scene.
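To give a concrete (if simplified) picture of the kind of kernel I mean, here's a premultiplied-alpha "over" composite written as a mediump fragment shader; on most mobile GPUs mediump maps to FP16, which is exactly the sweet spot between U8 and FP32 (the texture names are placeholders):

precision mediump float;        // mediump generally maps to FP16 on mobile GPUs

uniform sampler2D u_srcLayer;   // placeholder: layer being composited in
uniform sampler2D u_dstLayer;   // placeholder: accumulation so far
varying vec2 v_uv;

void main()
{
    vec4 src = texture2D(u_srcLayer, v_uv);
    vec4 dst = texture2D(u_dstLayer, v_uv);
    // Premultiplied-alpha "over": out = src + (1 - src.a) * dst.
    // FP16 keeps intermediate precision above U8 without paying the FP32 cost.
    gl_FragColor = src + (1.0 - src.a) * dst;
}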

I've read in a few places that FP16 units have half the gate delay and require about a quarter of the transistors.

CUDA doesn't expose FP16v2 ops but it looks like Broadwell's IGP 5300 can/will/does in OpenCL.

Rogue questions for Rys:

(1) Do Rogues with FP16 ALUs support "packed" FP16v2s in a 32-bit register? I assume so (see the sketch below this list for what I mean by packed).

(2) From the architectural diagrams that are out there I assume Rogue can either execute a max of 2 FP32 ops or 4 FP16 ops per thread clock. But not simultaneously perform an FMA32x2 and FMA16x4?

(3) How much is (2) gated by register-file throughput? Not sure you can answer that one. :cool:
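To be clear about what I mean by "packed" in (1): the portable way to express it at the API level is GLSL ES 3.0's packHalf2x16/unpackHalf2x16, i.e. two FP16 values living in one 32-bit word. Not claiming this maps one-to-one onto Rogue's register file, just illustrating the layout:

#version 300 es
precision mediump float;

layout(location = 0) out vec4 o_color;

void main()
{
    // Two FP16 values packed into a single 32-bit word, and unpacked again.
    highp uint bits     = packHalf2x16(vec2(0.25, 1.5));
    mediump vec2 halves = unpackHalf2x16(bits);
    o_color = vec4(halves, 0.0, 1.0);
}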

Thanks!
 
1) We do pack values and the main RF is common to the datapaths.

2) Correct, we don't run them simultaneously.

3) Hard to answer without disclosing other bits we don't want to talk about!
 
Rys (or anyone else): can you give any insight into how GFXBench calculates "Render Quality - Medium Precision" and "Render Quality - High Precision" PSNR results? What makes these render quality values fluctuate so much from one GPU architecture to another?
 
Rys (or anyone else): can you give any insight into how GFXBench calculates "Render Quality - Medium Precision" and "Render Quality - High Precision" PSNR results? What makes these render quality values fluctuate so much from one GPU architecture to another?
It's worth us looking at the code to double-check. I think it's noteworthy that every single pre-DX11 Radeon gets ~3750dB despite using 0.5 ULP FP32 for everything (and I'm pretty sure that has nothing to do with mipmap selection, although it'd be interesting if it did). Qualcomm seems to have a very similar PSNR to PVR's, while ARM's is slightly higher (but still lower than NVIDIA's, despite claimed DX11 support since T604).

BTW, a small but IMO very important note - I need to double check the driver code to make sure about behaviour on different cores/drivers, but AFAIK every single shipping Rogue compiler should always force FP32 for all operations inside a normalize(). This is one of the *rare* cases where developers love to shoot themselves in the foot (caused us issues on SGX, e.g. in Taiji) so sometimes you just gotta hide away the bullets ;) We can do this because FP16<->FP32 conversion is 100% free on Rogue. That also means developers shouldn't be afraid to mix-and-match and use FP32 where they need it (e.g. inverse-matrix multiplication of depth in deferred shading, coordinate calculation for very large textures, etc.)
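To make the mix-and-match point concrete, something along these lines is what I'd suggest (purely illustrative GLSL ES 3.0 with made-up names; only the depth reconstruction is forced to highp, everything else stays mediump, and the conversions cost nothing on Rogue):

#version 300 es
precision mediump float;                 // FP16-friendly default for the lighting math

uniform highp sampler2D u_depthTex;      // made-up depth g-buffer
uniform highp mat4 u_invProj;            // made-up inverse projection matrix
in vec2 v_uv;
layout(location = 0) out vec4 o_color;

void main()
{
    // Position reconstruction needs FP32 range/precision, so force highp here.
    highp float depth  = texture(u_depthTex, v_uv).r;
    highp vec4 clip    = vec4(v_uv * 2.0 - 1.0, depth * 2.0 - 1.0, 1.0);
    highp vec4 viewPos = u_invProj * clip;
    viewPos.xyz /= viewPos.w;

    // The lighting itself is happy at mediump (FP16).
    mediump vec3 n = vec3(0.0, 0.0, 1.0);              // placeholder normal
    mediump vec3 l = normalize(vec3(0.3, 0.5, 0.8));   // placeholder light direction
    mediump float ndotl = max(dot(n, l), 0.0);
    o_color = vec4(vec3(ndotl), 1.0);
}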
 
If they are just taking a PSNR of the image, I imagine stuff like texture filtering is going to have as large an effect as (if not a larger one than) ALU precision... and there's a huge variety of implementations there, of varying quality.
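For reference, the standard definition (assuming that's what they actually use) is just

    MSE  = mean over all pixels of (rendered - reference)^2
    PSNR = 10 * log10(MAX^2 / MSE)      with MAX = 255 for 8-bit output

so any per-pixel difference (filtering, mipmap selection, dithering) feeds into the score in exactly the same way ALU precision does.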

That said, PSNR is pretty much a terrible way to measure "better" in things like that... hrm. We need some better mobile benchmarks :S
 
Thanks for answering, guys. This thread has been great :)


I don't understand the hostility towards FP16 (...)

What hostility are you talking about? The questions I've seen in this thread are just what you would expect to find in a formal interview with a journalist from the hardware media.
Rys, Dominik D and JohnH just answered the questions in an honest and straightforward way (thank you for that, honestly), and that was it.

I asked about the FP16 subject because I've seen speculation about it in lots of places. And if we're given the opportunity to hear the explanation straight from the source, then we should ask the question for the benefit of all.

Moreover, this was a peaceful and adult conversation. There's no need to come forth and protect someone who doesn't need protection.



Why are consumers still unable to get basic framerate data on a game-by-game basis as we are accustomed to on the PC?
I've been wondering about that.
Is it possible to make at least a FRAPS clone for Android?
 
I think silent_guy was referring to the parallel thread, where fp16 hostility (too strong a word, but whatever) is pretty prominent. :)
 