PowerVR GR6500 Raytracing GPU?

I wish Google, Amazon, Microsoft, or one of the major smartphone companies like LG or HTC would include this in one of their higher-end tablets.
 
Well, we are moving toward ray tracing anyway; we are just unlucky to have to use rasterizers :(
I really wish Nintendo had picked PowerVR Wizard, but they wouldn't have gotten the same level of dev support as with NVIDIA (which is massively bigger than IMG).
 
Awesome update!

With the news that AMD will have tile-based rasterization and NVIDIA already (secretly) had it since Maxwell, it seems that PowerVR basically mapped out the future, and this is probably another area where the big two will catch up eventually.
 
AMD and Maxwell may have tile-based rendering, but the "true" strength of PowerVR is Tile-Based Deferred Rendering and all the actions taking place on chip, if I understand the tech correctly. It's still a few steps ahead of AMD and NVIDIA.


(attached comparison slides: IMG0040877_1.jpg vs IMG0040878_1.jpg, slides-32.jpg)


"Shade once" means they are actually doing HSR (Hidden Surface Removal) per tile. How they are doing it is a different story altogether, but it shouldn't be that far from TBDR's efficiency. They also have an immediate rendering mode, though: they can toggle this binning mode on/off, which seems to suggest that you don't always get improved perf/power using TBDR.
 
Would they need so much VRAM bandwidth if they were near PowerVR TBDR efficiency? (Genuine question.)
 
The more compute units you have, the more bandwidth you need to keep them fed. Over the years compute has been growing much faster, so vendors are trying to find whatever ways they can to reduce bandwidth requirements and maintain that FLOPS:BW ratio. TBDRs are also more power efficient because they don't shade any invisible pixels, which is why desktop GPUs adopted tiling as their focus on power increased.
 
Would they need so much VRAM bandwidth if they were near PowerVR TBDR efficiency? (Genuine question.)
It's a lot of bandwidth in absolute terms, but compared to Vega's direct predecessor Fiji, theoretical bandwidth remains the same while compute throughput increases significantly.
 
Would they need so much VRAM bandwidth if they were near PowerVR TBDR efficiency? (Genuine question.)
The iPad Pro (PowerVR TBDR) has 51.2 GB/s of bandwidth. Intel laptop GPUs with comparable (somewhat slower) performance have only 25.6 GB/s (I am talking about models without eDRAM). AMD's APUs (integrated Radeon) don't have huge bandwidths either. Nobody is questioning that PowerVR is slightly more bandwidth efficient in rasterization (than Pascal & Vega), but the reality is: if you scale the GPU 8x bigger (integrated vs discrete), then you also need 8x more bandwidth, no matter what kind of bandwidth savings you have.
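
A quick back-of-the-envelope sketch of that scaling point (the 51.2 GB/s figure is from above; the 8x scale-up and the 30% "TBDR saving" are purely hypothetical numbers, only there to show that a constant-factor saving doesn't cancel a large compute scale-up):

```python
ipad_bw = 51.2            # GB/s, integrated PowerVR TBDR (iPad Pro)
scale = 8                 # hypothetical integrated -> discrete compute scale-up
tbdr_saving = 0.30        # hypothetical constant-factor bandwidth saving

print(ipad_bw * scale)                      # ~410 GB/s at unchanged efficiency
print(ipad_bw * scale * (1 - tbdr_saving))  # ~287 GB/s: still discrete-GPU territory
```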

High bandwidth is also needed for compute tasks. NV & AMD GPUs are also used for GPGPU, and all AAA games nowadays use compute shaders. Triangle rasterization is becoming a smaller and smaller portion of the frame time.
 
"Shade once" means they are actually doing HSR (Hidden Surface Removal) per tile. How they are doing it is a different story altogether, but it shouldn't be that far from TBDR's efficiency. They also have an immediate rendering mode, though: they can toggle this binning mode on/off, which seems to suggest that you don't always get improved perf/power using TBDR.
It's not just that it can be toggled on/off, but that there are cases where it has to fall back to an immediate mode or the number of bins that spawn from the primitive stream rises to an inflection point where there's as much work put into the bins as an immediate stream.
The pipeline needs to track how many resources a bin's fragments are going to take up in terms of buffers and on-chip resources, which means fragments can cumulatively oversubscribe GPU resources unless the binning logic cuts a bin short and sends it for shading with fewer primitives than a full load.
Other scenarios are shaders that update depth, which can force the GPU to become conservative for fear of discarding something that it shouldn't. Transparency is another area where binning can at least become less effective depending on how quickly multiple fragments per pixel reach bin or GPU resource limits.
I believe the worst-case is something like each triangle becomes its own bin, which is touted in patents as merely acting like an IMR with a delay. What that actually costs in terms of latency and lost throughput may require empirical measurement.
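
A highly simplified sketch of that bin-oversubscription behaviour (illustrative only, not IMG's or any other vendor's actual algorithm; the per-primitive resource estimates and the budget are made-up numbers): primitives accumulate into a bin until an estimated on-chip budget would be exceeded, the bin is flushed early, and in the worst case every primitive ends up in its own bin.

```python
def bin_primitives(prims, budget):
    """prims: per-primitive on-chip resource estimates. Returns the flushed bins."""
    bins, current, used = [], [], 0
    for cost in prims:
        if current and used + cost > budget:
            bins.append(current)   # cut the bin short, send it for shading early
            current, used = [], 0
        current.append(cost)
        used += cost
    if current:
        bins.append(current)
    return bins

print(len(bin_primitives([1] * 64, budget=16)))   # 4 well-filled bins
print(len(bin_primitives([20] * 8, budget=16)))   # 8 bins of one primitive each,
                                                  # i.e. "an IMR with a delay"
```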

High bandwidth is also needed for compute tasks. NV & AMD GPUs are also used for GPGPU, and all AAA games nowadays use compute shaders. Triangle rasterization is becoming a smaller and smaller portion of the frame time.
With GCN at least and AMD's emphasis on asynchronous compute, this is something of a blurred distinction. The dedicated graphics load may have a shrinking share of overall computational throughput, but in terms of its contribution as a serial component in wall-clock time it has an effect.
 
With GCN at least and AMD's emphasis on asynchronous compute, this is something of a blurred distinction. The dedicated graphics load may have a shrinking share of overall computational throughput, but in terms of its contribution as a serial component in wall-clock time it has an effect.
Yes. It might sound backwards, but async compute actually allows much larger rasterization hardware usage.

For example, let's discuss this workload: 25% g-buffer (raster), 25% shadows (raster), 25% lighting (compute), 25% post processing (compute).

This example is 50% raster (pixel+vertex shader) + 50% compute shaders. All rasterization and geometry setup hardware is thus idling 50% of the frame. This is a realistic scenario for modern deferred shaded games. You can't do more raster related tasks, as you need to spare some GPU time for high quality lighting and post processing as well.

However, with async compute you can execute compute shaders concurrently with raster tasks. Most people would just think: "hey, let's add some compute stuff and overlap it with the raster tasks". This obviously works. But wouldn't it be better to overlap all those existing raster tasks with the existing compute tasks? One way to do this: render a full frame's worth of raster tasks (100% rasterization and geometry setup hardware usage), then during the next frame continue with lighting and post processing in the compute queue, while the raster pipeline is processing the next frame.

Basically, asynchronous compute has allowed you to double your geometry workload size. Geometry passes can take double time (wall clock) without consuming the whole GPU. Shadows are pure raster + geometry setup: practically zero ALU, register or sampler usage, so the compute overlap is perfect. G-buffer rendering is raster + geometry setup, but it also consumes bandwidth and uses texture units; still, it combines very well with lighting. There also exist G-buffer rendering techniques that do not sample textures and use very little bandwidth: https://forum.beyond3d.com/threads/modern-textureless-deferred-rendering-techniques.57611/.
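
A minimal sketch of the steady-state math, using the 25/25/25/25 split above (the function names and the overhead term are just illustrative, not measured values):

```python
def frame_time_serial(raster, compute):
    # No overlap: raster passes, then compute passes, back to back.
    return raster + compute

def frame_time_pipelined(raster, compute, overhead=0.0):
    # Frame N's raster runs alongside frame N-1's compute; the slower of the
    # two sets the steady-state frame time.
    return max(raster, compute) + overhead

raster  = 0.25 + 0.25   # g-buffer + shadow maps
compute = 0.25 + 0.25   # lighting + post processing

print(frame_time_serial(raster, compute))         # 1.0: original frame time
print(frame_time_pipelined(raster, compute))      # 0.5: raster and compute fully overlap
print(frame_time_pipelined(2 * raster, compute))  # 1.0 again: geometry work doubled
                                                  # at the same frame time
```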

In conclusion: Async compute will make rasterization bandwidth consumption more important. Wall clock time (total pipeline depth) however gets less important as there's lots of other concurrent work running to hide the latency. Thus DCC and tiling are excellent techniques.
 
This example is 50% raster (pixel+vertex shader) + 50% compute shaders. All rasterization and geometry setup hardware is thus idling 50% of the frame. This is a realistic scenario for modern deferred shaded games. You can't do more raster related tasks, as you need to spare some GPU time for high quality lighting and post processing as well.

However, with async compute you can execute compute shaders concurrently with raster tasks. Most people would just think: "hey, let's add some compute stuff and overlap it with the raster tasks". This obviously works. But wouldn't it be better to overlap all those existing raster tasks with the existing compute tasks? One way to do this: render a full frame's worth of raster tasks (100% rasterization and geometry setup hardware usage), then during the next frame continue with lighting and post processing in the compute queue, while the raster pipeline is processing the next frame.
I'm not debating whether AC can increase the time budget available for the rasterization portion of the GPU, just that it makes the calculation of its contribution to frame time more complicated and that scaling compute throughput would bring Amdahl's law into the mix.
This is another way the evaluation can become blurry with asynchronous compute.
Wall-clock time in this scenario is no longer one frame in isolation, but something of a split across multiple stages.

Frame N's time will be the sum of the longest overlapping phases in frame N and N+1:
Time of Stage N = max(Comp(N-1), Raster(N)) + overhead
Time of Stage N+1 = max(Comp(N), Raster(N+1))

If the assertion is that Raster will never take longer than Compute, then rasterization's apparent contribution to overall frame time gets much lower (setting aside that Raster and Comp likely take longer individually since another frame is doing some work alongside them).
In scenarios where compute happens to take less time, rasterization would tend to dominate.
Raising CU count ahead of front-end resources would make the time for Comp go down, and then suddenly rasterization appears to have a larger contribution.
What this is generally doing, as you hinted at, is actually raising wall-clock time for individual phases in order to get better utilization--although the gain versus intra-frame AC is bound up primarily in poorer resource allocation and the terrible straightline performance and coarse granularity of the GPU domain, where dependences and sharing penalties just take a long time to resolve. If it weren't for that, then splitting across another frame versus intra-frame wouldn't have been so much of a win.
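
A toy numerical illustration of that shift (all numbers are hypothetical, chosen only to show how the max(Comp, Raster) term above behaves as compute throughput scales):

```python
raster = 8.0                     # ms, assumed fixed: serial/front-end bound
for cu_scale in (1, 2, 4):
    compute = 12.0 / cu_scale    # ms, assumed to scale with CU count
    stage = max(compute, raster) # per-stage time from the formula above
    print(cu_scale, stage, f"raster share ~{raster / stage:.0%}")
# 1x CUs: 12 ms stages, raster hidden under compute (~67%)
# 2x CUs:  8 ms stages, raster now sets the floor (~100%)
# 4x CUs:  8 ms stages, no further gain: Amdahl's law on the raster phase
```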

As noted, the two components can be more finely divided, and could be more finely synchronized. Alternatively, dependencies could be relaxed (use data from N-2 instead of N-1, bleed across 3 or 4 frames), which can raise utilization with various downsides.

Basically, asynchronous compute has allowed you to double your geometry workload size. Geometry passes can take double time (wall clock) without consuming the whole GPU.
Adjusting the workload in this manner does make it more difficult to calculate the contribution of one stage versus another. It would likely change the ratios given for the example. This is more of a GCN item, but with Vega's teaser and the PS4 Pro, some of the assumptions about overlap, responsibilities for each stage, and the time each stage takes may be varying.

In conclusion: Async compute will make rasterization bandwidth consumption more important. Wall clock time (total pipeline depth) however gets less important as there's lots of other concurrent work running to hide the latency. Thus DCC and tiling are excellent techniques.
This seems somewhat contradictory to the premise that rasterization would contribute less if it becomes more important in its consumption of a limited resource. The tiling techniques in particular are actually adding some serialization, although without knowing where they draw the line the worst-case is unclear.
The wall-clock time that rendering is concerned about would be at what point the outputs are expected to reflect the inputs of a user. This is really relaxed--but not infinitely more so--compared to some other metrics that CPUs or other hardware are subject to.
 
This seems somewhat contradictory to the premise that rasterization would contribute less if it becomes more important in its consumption of a limited resource.
What I am trying to point out is that all resource usage matters when async compute is used. Prior to compute overlap (async compute or otherwise), if some pass didn't use 100% of the memory bandwidth, the remainder was lost. The same was true for ALU, samplers, groupshared memory, registers and the various fixed function rasterization hardware. DCC, for example, didn't improve performance at all if bandwidth wasn't the bottleneck in that particular shader pass. Nowadays you'd likely overlap that pixel shader with some compute shader, and it could use the remaining bandwidth. The result is that DCC now shows an improvement. NVIDIA's tiling also reduces render target bandwidth usage, so it has similar implications. If tiling adds some latency, background compute is perfect for hiding it -> potential worst case stalls are much less of a problem.
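
A toy bandwidth-accounting sketch of why that is (all numbers are made up; the 30% "DCC saving" is a hypothetical compression ratio, not a measured figure):

```python
BW_CAPACITY = 100.0       # arbitrary GB/s budget for the whole GPU

def leftover_for_async_compute(raster_bw_used):
    # Bandwidth left for a concurrently running, bandwidth-bound compute job.
    return BW_CAPACITY - raster_bw_used

raster_bw = 60.0          # ALU-bound raster pass that still moves 60 GB/s
dcc_saving = 0.30         # hypothetical render-target compression ratio

print(leftover_for_async_compute(raster_bw))                     # 40 GB/s
print(leftover_for_async_compute(raster_bw * (1 - dcc_saving)))  # 58 GB/s: the DCC
# saving shows up as extra throughput for the async job, not as a faster raster pass
```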

Quad efficiency (pixel shaders running the whole 2x2 quad, even if the triangle only covers one pixel) is also becoming a bigger issue. Triangles have gotten very small. It's common to see 50%+ more pixel shader invocations because of poor quad efficiency. Quad overdraw is pure extra ALU, register and sampler usage; masked out lanes (helper lanes) do not access memory at all. Traditionally this didn't matter at all if you were mostly bottlenecked by geometry processing, ROP fill rate or bandwidth. Nowadays you'd want to run async compute concurrently with your g-buffer shaders, so techniques that defer expensive operations (such as sampling all material textures) to a later compute shader pass are going to show bigger gains. Compute shaders don't suffer from quad efficiency problems.
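
A toy model of that quad overhead (illustrative only, not any particular GPU's rasterizer): every 2x2 quad a triangle touches launches four pixel shader invocations, even for pixels the triangle doesn't actually cover.

```python
def quad_overhead(covered_pixels):
    quads = {(x // 2, y // 2) for x, y in covered_pixels}   # 2x2 quads touched
    invocations = 4 * len(quads)                            # whole quads get shaded
    return invocations / len(covered_pixels) - 1.0          # fraction of extra invocations

print(quad_overhead([(5, 5)]))                        # 3.0 -> 300% extra for a 1-pixel triangle
print(quad_overhead([(i, i + 1) for i in range(8)]))  # 3.0 -> thin sliver, one pixel per quad
print(quad_overhead([(x, y) for x in range(8) for y in range(8)]))  # 0.0 -> solid 8x8 block
```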

Pixels in general have gotten much more complex to render. Quad inefficiency when calculating complex lighting and filtering multiple shadow maps is not preferable. With async compute, deferring more is a better option than going back to forward rendering (heavy quad overdraw for every operation). Traditionally this kind of deferred rendering utilized GPUs poorly (simple pixel shader didn't utilize most resources at all), but async compute solves that problem. It of course adds some latency, but the amount of wasted work goes down a lot.

If you are worried about the total pipeline latency, you can always split the screen into big macro tiles, and render + light them all concurrently. This approaches forward rendering latency as the tile size gets smaller. You need to synchronize once before doing post processing (bloom, motion blur, DOF etc. require neighbor pixels). This way you don't even need to overlap two frames, as you can find plenty of concurrent tasks inside a single frame. Overlapping two frames would still likely bring some benefits, however.
 
Pixels in general have gotten much more complex to render. Quad inefficiency when calculating complex lighting and filtering multiple shadow maps is not preferable. With async compute, deferring more is a better option than going back to forward rendering (heavy quad overdraw for every operation). Traditionally this kind of deferred rendering utilized GPUs poorly (simple pixel shader didn't utilize most resources at all), but async compute solves that problem. It of course adds some latency, but the amount of wasted work goes down a lot.
I am not disagreeing with the premise that AC helps in many situations. What I think is more nuanced is the way we would calculate the share of wall-clock time between the rasterization component and compute, and the difference between wall-clock time and utilization.
This does go back to my reference to Amdahl and the claim that rasterization's share of frame time is going down. Deferring in the front rasterization bucket adds latency, such that utilization is improved in the compute bucket. This means the serial portion (non-scalable with parallel resources) of the rasterization phase goes up, while the amount of redundant work in the compute portion goes down, potentially reducing its run-time (more if CU count rises).
Depending on the extent of the load on the raster phase, and how much winds up being culled/coalesced before getting to the compute side, its share of execution time actually goes up even if the overall run-time is better than the alternative.

What splitting this across frames might do, as the parallel portion is whittled down, is leave the theoretical floor of the wall-clock time at ~2x the serial component. That may not be a common concern except for very small frame budgets or something particularly twitchy, so I agree it can be generally a satisfactory trade-off.

One additional complication, particularly with GCN, was the statements made in the DX12 thread about overly relying on the async compute synthetic as a measure of a graphics vendor's ability to get concurrency out of a command stream. Drivers and GPU hardware already run ahead and pipeline work quite some distance--perhaps imperfectly--in a manner similar to what explicitly pipelining graphics and asynchronous compute across two frames (asynchronous across a broad synchronization point?) is doing.
Perhaps certain GPUs that have coherent color pipelines, more aggressive drivers, and more flexible front ends are already effective at doing this, whereas an architecture with incoherent raster output, inferior geometry throughput and culling, and synchronization issues and driver bugs for deferred engines seems to benefit more (citation: AMD says Vega's coherent ROPs avoid the latter issue).
A driver or intelligent GPU might in the future optimize the pipelined case, by detecting a frame split like this and quietly reordering the second frame's work back into the first or eliding some stream-out and read-in as it sees fit.

If you are worried about the total pipeline latency, you can always split the screen into big macro tiles, and render + light them all concurrently. This approaches forward rendering latency as the tile size gets smaller. You need to synchronize once before doing post processing (bloom, motion blur, DOF etc. require neighbor pixels). This way you don't even need to overlap two frames, as you can find plenty of concurrent tasks inside a single frame. Overlapping two frames would still likely bring some benefits, however.
This or some variant might be what AMD is hoping for with the "scalability" descriptor for Navi and the rumors of a multi-chip solution.
Would tiling and synchronization points fall under one of the two buckets in the wall-clock time calculation, or in another category? If this is somehow handled by a work distributor or a future version of Vega's binning logic, would it fall under the rasterization category?

To perhaps bring my digression more on-topic, a more fully deferred solution and traversal of acceleration structures could take a decent amount of time, but elide even more compute.
If something like the read and dirty bits for page table entries were extended to a graphics context, an acceleration structure might be some kind of hierarchical buffer of samples whose inputs (rays, surfaces, pages) haven't changed since the last frame, along with their values. In that case, the compute portion drops precipitously.
 
I am not disagreeing with the premise that AC helps in many situations. What I think is more nuanced is the way we would calculate the share of wall-clock time between the rasterization component and compute, and the difference between wall-clock time and utilization.
This does go back to my reference to Amdahl and the claim that rasterization's share of frame time is going down. Deferring in the front rasterization bucket adds latency, such that utilization is improved in the compute bucket. This means the serial portion (non-scalable with parallel resources) of the rasterization phase goes up, while the amount of redundant work in the compute portion goes down, potentially reducing its run-time (more if CU count rises).
Agreed. A more complex rasterizer that buffers (= delays) triangles to extract better locality will naturally take more wall clock time (before triangles reach the compute units and results reach memory). This will make the startup delay worse after flushing the GPU. Moving the ROP caches under L2 will help (Vega), as you no longer need to flush the ROP caches to make ROP output visible to the texture units. There's also no need to flush the L2 cache to make UAV compute writes visible to the ROP caches (when writing to render targets with compute shaders). The reduced number of cache flushes will also help async compute, as background tasks might otherwise accidentally stall on cache flushes.
One additional complication, particularly with GCN, was the statements made in the DX12 thread about overly relying on the async compute synthetic as a measure of a graphics vendor's ability to get concurrency out of a command stream. Drivers and GPU hardware already run ahead and pipeline work quite some distance--perhaps imperfectly--in a manner similar to what explicitly pipelining graphics and asynchronous compute across two frames (asynchronous across a broad synchronization point?) is doing.
GCN also perfectly overlaps compute and graphics tasks from the same queue. It even overlaps rasterization tasks (pixel shaders) writing to two different render targets. DX12 split barriers offer fine grained synchronization that doesn't stall the GPU; Vulkan's vkCmdSetEvent/vkCmdWaitEvents do the same. Async compute is better for random background tasks that are not timing critical (or if you want to overlap multiple frames - in this case two queues are easier to get right).
Perhaps certain GPUs that have coherent color pipelines, more aggressive drivers, and more flexible front ends are already effective at doing this, whereas an architecture with incoherent raster output, inferior geometry throughput and culling, and synchronization issues and driver bugs for deferred engines seems to benefit more (citation: AMD says Vega's coherent ROPs avoid the latter issue).
I am mainly writing GCN console code using low level APIs, so I don't have any intimate knowledge of the PC AMD drivers. As long as you implement constructs similar to Vulkan events or DX12 split barriers, you will get good GPU utilization on GCN without many stalls... assuming of course your workload has some task parallelism. If the next dispatch/draw always has a dependency on the previous one, then your design obviously will not run well (on any modern GPU)...

Having the ROP caches under L2 in Vega will be a big improvement, and Polaris (GCN4) already improved the geometry pipeline a lot, so I don't see any big problems. Especially in DX12 and Vulkan games, where the drivers aren't going to be a problem. NVIDIA has much more resources to write custom DX11 driver code paths for all the most important AAA engines, so I expect NVIDIA to keep their lead in DX11. The reduced number of flushes will of course make the worst case scenario run faster, so Vega will probably need a little less hand holding to perform well.
 