PowerVR Rogue Architecture

Rys

It looks like there's enough of an interest in Rogue's architecture to warrant a dedicated technical thread. Fire away if you want to ask questions about why we made certain design decisions, including F16 (although keep technical discussion about F16 in general to the other thread in the Architecture forums, because we're certainly not the only vendor to implement a compute core this way), or if you just want to understand more about Rogue in general.
 
The point where I'm confused is whether there's any dedicated INT10 hw in Rogue. In PowerVR's developer guide there are recommendations for Series6 covering highp and mediump; lowp is marked for Series5 only. Probably a very dumb question: could it be that mediump & lowp are handled by the FP16 ALUs?
 
There's no INT10 native support. Less than F16 is promoted to either F16 or F32, depending on what the compiler thinks is the best fit for the core, given the surrounding computational context.
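Purely as a toy illustration of what that promotion rule could look like (this is not PowerVR's compiler; the heuristic, the names and the Python framing are all invented for the sketch):

```python
# Hypothetical sketch only, not PowerVR's compiler. It illustrates the stated rule:
# anything below F16 has no native type, so it's promoted to F16 by default, or to
# F32 when other ops in the same chain already run at F32 (to avoid bouncing
# values between datapaths). The heuristic itself is invented.

NATIVE = {"f16", "f32"}      # assumed native datapath precisions
DEFAULT_PROMOTION = "f16"    # where sub-F16 (lowp/INT10-style) requests land by default

def promote(ops):
    """ops: list of (name, requested_precision) tuples forming one dependency chain."""
    # If anything in the chain genuinely needs F32, keep the whole chain at F32.
    context = "f32" if any(p == "f32" for _, p in ops) else DEFAULT_PROMOTION
    return [(name, prec if prec in NATIVE else context) for name, prec in ops]

ui_blend = [("mul", "lowp"), ("add", "lowp")]                # UI composition chain
lighting = [("dot", "f32"), ("mul", "lowp"), ("add", "f16")]
print(promote(ui_blend))   # everything promoted to f16
print(promote(lighting))   # the lowp op promoted to f32 to match its chain
```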
 

Typically the FP16 ALUs, and if there aren't enough free resources they just use the FP32 units? Doesn't that waste power (especially in the latter case), or are INT10 cases a rather small proportion of today's mobile workloads?
 
Because we don't run F32 and F16 together there's a bit more nuance to the "free resources" part of the decision, but effectively yes. One of the primary uses of the F16 datapath is INT10-style UI blending and composition, so the vast majority of that kind of workload, if not the whole thing if you're very careful or can hand-tune the ISA, runs on the F16 datapaths.

It's a very common workload (more common than rendering "game" pixels over the lifetime of the device), with sRGB commonly thrown into the mix these days as well.
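As a rough sanity check of why the F16 datapath is adequate for this kind of composition work, here's a sketch using numpy's float16 as a stand-in (not the actual hardware path; 8-bit values are used for simplicity, and the real pipeline adds sRGB conversion on top):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Random 8-bit source/dest colours and alphas, as a UI compositor might see them.
src   = rng.integers(0, 256, n).astype(np.uint8)
dst   = rng.integers(0, 256, n).astype(np.uint8)
alpha = rng.integers(0, 256, n).astype(np.uint8)

def blend(src, dst, alpha, dtype):
    """Classic src-over blend done entirely at the given float precision."""
    s = src.astype(dtype) / dtype(255)
    d = dst.astype(dtype) / dtype(255)
    a = alpha.astype(dtype) / dtype(255)
    out = s * a + d * (dtype(1) - a)
    # Requantise back to 8 bits, as the framebuffer/display would.
    return np.clip(np.rint(out.astype(np.float32) * 255), 0, 255).astype(np.uint8)

out16 = blend(src, dst, alpha, np.float16)
out32 = blend(src, dst, alpha, np.float32)

diff = np.abs(out16.astype(int) - out32.astype(int))
print("max 8-bit difference:", diff.max())                     # typically 0 or 1 LSB
print("pixels differing:   ", int((diff > 0).sum()), "of", n)
```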
 
Two last questions for now:

1. Do other ULP mobile GPUs still use dedicated hw for INT or not? Even a gut feeling would be good enough.

2. While I realize that you don't run FP16 & FP32 in parallel, does that hold per ALU rather than for the entire GPU, or not?

Oh and thank you :D
 
Yes, there's dedicated integer logic in other architectures. And yes, each USC is independent so they don't run in lock step like that.
 
Rys, can you explain the reasoning behind moving to a 2:1 ratio of FP16:FP32 ALUs in Series 6XT vs. the 1.5:1 ratio of FP16:FP32 ALUs in Series 6?
 
To give more FP16 throughput (specifically more throughput for more common types of arithmetic; the setup in Series6 is more limited in its application as well as being a flop less), because it's an incredibly common part of the workload in many common games and applications.
 
More common where? Where did you take these statistical samples from?
Current and older Android/iOS games, or did you get input from games currently in development?
I'm asking this because as mobile games approach full-fledged desktop/console games in graphics fidelity, I thought the presence of dedicated FP16 units would start to diminish, not grow.



Or to just face the elephant in the room at once, please reassure us that the changed ratio for the 6XT series is meant to make us play newer games with better graphics and faster framerates, and not just to get a faster gfxbench or 3dmark score :oops:
 
I don't understand the hostility towards FP16: it's obvious that it will consume less power, so it makes total sense to advocate its usage. And given enough incentives (read: FP16 is faster than FP32), game developers will adopt it no matter what.

(And I'm pretty sure Rys has done plenty of statistic sampling...)
 
Hell, I always use FP16.
It's ~20% faster for a slightly worse look, so why not? OK, next year pure FP32, but right now phones need all the help they can get.
 
I personally think Rogue is a step in the right direction. With many other mobile architectures you still have to use the painful ~FP10 formats to extract the best performance. I sincerely hope that lowp vanishes soon and that the ES standard will dictate an FP16 minimum for mediump.

FP16 is great for pixel shaders. You can avoid the vast majority of precision issues by thinking about your numeric ranges and transforms. Doing your pixel shader math in view space or in camera-centered world space goes a long way in the right direction.

Obviously you need FP24/32 for vertex shaders. However, most of the GPU math is done in pixel shaders, as high resolutions such as 2048x1536 and 2560x1600 are common in mobile devices. The decision to increase the FP16 ALUs was a correct one.
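To put rough numbers on the camera-centered point above, here's a quick sketch with numpy's float16 standing in for mediump (the positions are made up for the illustration):

```python
import numpy as np

# A surface point far from the world origin, and a camera standing next to it.
# Both values are made up for the illustration.
world_pos  = 4097.37
camera_pos = 4095.00

# Storing the absolute world position in FP16: the spacing between representable
# values around 4096 is 4.0, so the error is on the order of whole units.
err_world = abs(float(np.float16(world_pos)) - world_pos)

# Storing the camera-relative position instead: the magnitude is small, so FP16
# resolves it to roughly a thousandth of a unit.
rel = world_pos - camera_pos
err_camera = abs(float(np.float16(rel)) - rel)

print(f"FP16 error, absolute world position : {err_world:.4f}")    # ~1.37
print(f"FP16 error, camera-relative position: {err_camera:.6f}")   # ~0.0009
```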

The increased FP16 ALU count also means that the performance trade-off from dropping the lowp ALUs is decreased. This is certainly a good thing. I hope that other mobile GPU developers think alike and lowp is soon gone :)

One question to Rys: if the vertex shader is running in FP32 and the pixel shader in FP16, can both of them be executed simultaneously? Or is there even a need for this in a tiling architecture (vertex processing -> tiling -> pixel processing)?
 
Criticism of providing dedicated FP16 performance would only have merit if PowerVR's FP32 performance was somehow lacking, but it's not.

And overall application/benchmark performance improved by increasing FP16 capability without increasing FP32, so the approach again proved itself.

And it's been explained several times that the math done by the GPU across the OS and all apps -- not just games -- is dominated by more basic ops for GUI/page composition, so the justification is obvious.
 
@ToTTenTranz: Workloads come from both mobile and desktop scenarios. These are composition workloads, games, benchmarks and productivity scenarios (web browsing for example). Some workloads are synthetic and some are real workloads you'd find in the wild. This is true for all steps of the development: theoretical analysis, design, validation. And I'm sure it's true for every vendor out there. It would be weird if it weren't.

@sebbbi: It's been public for ages that VS and PS can be run in parallel on Rogue. You can run PVRTune and see the overlap yourself. :) (BTW: tools for PowerVR devices are under-appreciated) Rys also wrote that "each USC is independent so they don't run in lock step like that" which means that yes, you can do that. Zero overlap increases latency and is probably a pretty bad idea for modern screens (could make sense for other uses though).
 
I should actually lean back and just lurk reading when you folks start explaining things, but it took me some time as a layman to understand how Series5 roughly worked with highp/mediump/lowp through the vector ALUs, I'm still not sure I got scheduling across 5XT MPs right, and am now battling to get a very rough understanding of how Rogue works... each piece of additional information creates more questions.

Dominik,

OK, but at which level can you pick between different precisions? At the USC level, or does each SIMD16 run only one type of precision at a time? If the latter, how are things handled in a 6XE case with only one USC, or just half a USC?

Another one would be how the heck they manage to keep all those slots busy, especially in 6XT cores and especially with FP16, considering how "wide" each SIMD lane actually is.
 
I can't discuss internals publicly so it's better if Rys tackles this one, sorry.

One thing I can say is that the driver is doing its best to provide the compiler with as much context around the stuff being compiled as possible. You get hints from the API, you understand what's being executed and on what resources (unless you're doing bindless), you know what shader it is (pixel, vertex, ...), what the inputs and outputs are, and so on. You take this knowledge and pass it to the compiler, which does whatever it can to maximize HW utilization and minimize all the bad things (latency, resource usage, etc.). If there's something the compiler could do better - it will be discussed, assessed against a wider spectrum of APIs and needs, and potentially implemented. This has been discussed in the driver optimization thread before and is an absolute must for every driver on every HW.

On top of that, the driver has to (well, should) be conservative, so if e.g. we know that we could optimize something and gain a 10% boost 90% of the time but it'd break a bunch of apps - we won't do it. If there's something that can be done in the API to mitigate this - it should be discussed with the API people (Khronos, MS). If there's something that the HW could do, it's discussed with the HW folks and, if sane, prototyped, measured and budgeted (people, time, die area). This is how every business operates (or should).

So, yeah, I can't state how things work internally w/o getting shot in the head, but people should follow Occam's razor in discussions about HW. Changes in HW - from every vendor, I'm sure - happen for a reason. It's absolutely counterproductive and probably counter-factual to assume that companies do stuff in the HW for political reasons. There are engineers measuring and tweaking stuff. They are the people who make the calls - sometimes correct, sometimes not. Happens. So how work in the GPU is scheduled and what precision is being used - those things come from real-world data, not from a religious leader. And in general things improve from core to core. :)
 
Dominik wrote: "It's been public for ages that VS and PS can be run in parallel on Rogue. You can run PVRTune and see the overlap yourself."
I don't personally develop on PVR hardware (but I am discussing with guys who do daily), so I am not familiar with the development tools.

Obviously VS and PS can be running at the same time. What I wanted to know was whether there is a downside or an upside if the VS is running FP32 code and the PS is running FP16 code. Basically, what I wanted to know is the granularity of "we don't run F32 and F16 together". Does this mean that a single USC is in either FP32 or FP16 mode (and the other execution unit is shut down)? I don't know enough about the capability of the PVR architecture to run multiple shaders simultaneously to ask the right questions. GCN, for example, can run multiple shaders at the same time on the same CU. If one of the shaders uses different units/resources than the other, both can co-exist happily and finish faster (than just running them one after the other). This also applies to the VS + PS scenario on GCN. It would be nice if the vertex shader could use the FP32 execution resources at the same time as the pixel shader is using the FP16 execution resources.
 
Again, I can't talk about internals and granularity of fp16/fp32 work, but I can say that this kind of dependency would affect HW performance on Windows, where DXGK is scheduling work from multiple sources (applications and compositor) and you can't tell upfront what VS you'll be running in parallel with what PS. So you'd have to be conservative and run everything fp32, swallow the lack of overlap (unless you can hide it) or deal with image quality and run everything fp16.

It would be hard to manage this kind of dependency forced by the HW design even on platforms where you control scheduling yourself, as you'd have to look in your scheduler at the stuff you're running in parallel from the angle of shader precision (which is most likely tied into the ISA, so your user mode driver decided the machine shader precision beforehand), on top of resolving resource dependencies, fences and whatnot.

It is feasible to give this kind of explicit control to the developer, especially for single-app scenarios (DX12+ probably could do that), but it should be optional, not mandatory, and you still have to design your driver and your HW for all cases, not just for high-performance games.
 
We don't execute different thread types simultaneously, but we can switch to a new thread type in the next clock, and threads can issue to either the F32 or F16 datapath, just not both at the same time.

There's more than just the F16 and F32 paths in the hardware as well. As mentioned in another thread ages ago, there's also the SFU path and the integer path.
 