ROPs and GCN cores

TomRL

Newcomer
I've been seeing a lot of talk about pixel throughput being dictated by GCN cores and not ROPs in the future. Where does this leave ROPs? The PS4 has apparently already gone overkill with its 32 ROPs, so what are developers going to do with that extra ROP headroom?

Disclaimer: I know nothing about what I'm talking about; the line of logic I used above is based on total ignorance, so it probably doesn't even hold.

Edit: If I'm a lost cause here and talking completely out of my back end, a book recommendation on how GPUs work would be more than welcome.
 
More ROPs benefit ROP bound shaders. Usually pixel shaders that are very simple are ROP bound. Good examples are shadow map rendering, simple particle rendering (to a 32 bpp render target), simple foliage rendering, UI element blitting, etc.

Any alpha blending to a (HDR) 64 bpp or higher bit depth render target is always bandwidth bound on all modern GPUs (including Titan and R9 290X). HDR blending is thus never ROP bound. Output to a 64 bpp render target (or MRT) with z-buffering also tends to be BW bound before it is ROP bound, assuming the shader reads some textures. Obviously these kinds of shaders can be TMU (texture filtering / cache) or ALU (math) bound instead of BW / ROP bound. Or bound by some other part of the GPU pipeline.
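To put rough, illustrative numbers on that (my own, not from the post above): blending to a 64 bpp target reads 8 bytes and writes 8 bytes of render target per fragment, so 16 bytes of traffic before any texture fetches. At 1080p with, say, 8 layers of transparent overdraw that is already about 2.07M x 8 x 16 ≈ 265 MB per frame, or roughly 16 GB/s at 60 fps, which is a big slice of any current GPU's bandwidth.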

Modern GPUs are never ROP bound on 2d image processing operations (such as lighting or post processing), because compute shaders do not use ROPs. Obviously many launch window games still use modified old engines or need backwards compatibility and thus still use pixel shaders for post processing and lighting. Simple blur kernels for example can be ROP bound when executed as a pixel shader (compute shader is much faster for blur kernels for various reasons, but old code still exists).
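As a rough illustration of that last point, here is a minimal HLSL sketch (the resource names and the 5-tap kernel are made up) of a blur written as a compute shader; the result leaves the shader through a plain UAV store, so no ROP work is involved:

[code]
// Minimal sketch: horizontal 5-tap box blur as a compute shader.
// gSource / gDest are illustrative names, not from any engine mentioned above.
Texture2D<float4>   gSource : register(t0);
RWTexture2D<float4> gDest   : register(u0);

[numthreads(8, 8, 1)]
void BlurCS(uint3 id : SV_DispatchThreadID)
{
    float4 sum = 0;
    [unroll]
    for (int dx = -2; dx <= 2; ++dx)
        sum += gSource[uint2(id.x + dx, id.y)]; // direct texel loads (no edge handling here)
    gDest[id.xy] = sum / 5.0;                   // UAV store: bypasses the ROPs entirely
}
[/code]

The same blur as a pixel shader would funnel every output pixel through the render target / ROP path; the compute version can also share fetched texels through group shared memory, which is part of why it tends to be faster.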
 
Hi sebbbi sir

Could you envision any way to make at least some use of the extra ROPs in the PS4, perhaps in the case of Sony 1st party titles?

Perhaps the primary vision for 32 ROPs is that while standard PS4 games might not make use of them, they help with fillrate for future 4K content to support Sony's 4K TV initiative [strike]and ensure the fill rate for upcoming Project Morpheus 1080p @ 120 Hz content (60 fps to each eye)[/strike]?
I know that for 4K games [strike]& 120 Hz Morpheus games in both these scenarios[/strike] IQ would be reduced.
 
If you double your screen resolution you also double your pixel shader instances. This means that you need double the ALU, double the bandwidth and double the TMUs as well. Doubling the ROPs alone isn't enough, unless you simplify your shaders radically (do half the stuff in the pixel shader).

Sometimes ROPs are of course the bottleneck, and having more ROPs improves performance a lot. If for example your game targets a super high resolution, you likely write much simpler shaders, and this would be a case where ROPs are valuable. But it's important to note that ROPs alone don't let any game scale up to a higher resolution; you need simpler shaders or an otherwise beefier GPU as well.
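To put concrete numbers on that: going from 1080p to 2160p ("4K") is 4x the pixels (about 2.07M → 8.3M), so with unchanged shaders you would need roughly 4x the ALU, bandwidth and TMU throughput as well, not just 4x the ROPs.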
 
Yeah I always found that to be a curious assertion. More ROPs only help if ROPs are already the bottleneck. If you're shader bound at 1080p you'll be shader bound at 4k too.
 
Yeah I always found that to be a curious assertion. More ROPs only help if ROPs are already the bottleneck. If you're shader bound at 1080p you'll be shader bound at 4k too.

If they are porting last gen content, many indie games, or mobile games. The PS4 was developed back around the time Sony had big dreams for their 4K initiative; that would make sense given that very few developers are going to make use of those extra ROPs @ 1080p. Trine 2 developer Frozenbyte discussed during their Eurogamer interview the possibility of doing 30 fps @ 4K on the PS4.
 
If they are porting last gen content, many indie games, or mobile games. The PS4 was developed back around the time Sony had big dreams for their 4K initiative; that would make sense given that very few developers are going to make use of those extra ROPs. Trine 2 developers Frozenbyte discussed during their Eurogamer interview being able to do 30 fps @ 4K on the PS4.
If you are porting last gen content to 4K resolution or to 1080p @ 60 fps (from 720p @ 30 fps) you will definitely benefit from all the extra ROPs you can get.
 
So pixel throughput is how many pixels a GPU can push out. If the GCN cores ended up dictating pixel throughput, would it make it a significantly easier job for developers to hit a 1080p or 60 fps target than when ROPs dictate it? Are there any developers that do this yet? And could you possibly use GCN for throughput and ROPs for antialiasing on the same frame? (ROPs do AA, don't they?)
 
And could you possibly use GCN for throughput and ROPs for antialiasing on the same frame? (ROPs do AA, don't they?)
Yes, ROPs handle MSAA (subpixel storage, per sample tests and replication). The antialiasing resolve (blending the subsamples together) is nowadays most often done by a custom shader.
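For illustration, a custom resolve of that kind might look roughly like this in HLSL (4x MSAA and the resource names are assumptions; a real resolve would usually weight or tonemap the samples rather than box filter them):

[code]
// Sketch of a custom MSAA resolve done in a compute shader instead of the
// fixed-function ResolveSubresource path.
Texture2DMS<float4, 4> gColorMS  : register(t0); // 4x MSAA color buffer
RWTexture2D<float4>    gResolved : register(u0); // single-sample output

[numthreads(8, 8, 1)]
void ResolveCS(uint3 id : SV_DispatchThreadID)
{
    float4 sum = 0;
    [unroll]
    for (int s = 0; s < 4; ++s)
        sum += gColorMS.Load(int2(id.xy), s); // read each subsample directly
    gResolved[id.xy] = sum * 0.25;            // simple box filter average
}
[/code]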

If your rendering shader is ROP bound you can for example use the remaining CU (compute unit) resources by running a compute shader simultaneously (asynchronous compute). Compute shaders do not use ROPs at all. Many modern engines perform their lighting and post processing in compute shaders.
 
Modern GPUs are never ROP bound on 2d image processing operations (such as lighting or post processing), because compute shaders do not use ROPs.
This strikes me as a curious discrepancy. How do compute shaders get away with not using ROPs, and why? Are ROPs in current GPUs merely a vestigial organ of past GPU design practices; are they really needed if compute shaders can get by without using them?
 
This strikes me as a curious discrepancy. How do compute shaders get away with not using ROPs, and why? Are ROPs in current GPUs merely a vestigial organ of past GPU design practices; are they really needed if compute shaders can get by without using them?

You need ROPs to update a pixel in a render target. That includes operations like blending and multi-sampling, and maybe pixel format conversion (I don't know if that's coupled with the ROPs or not, but it does happen somewhere). Historically it also included depth test and update; these days I don't know how much of that is coupled either, or if it only covers late-Z for example.

If you just want to do simple writes to memory you don't need to use the ROPs; there are simpler store pipelines available now for that. The same way you can use simple load pipelines instead of TMUs, which do a bunch of other texture-related things.
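A small sketch of that contrast (all names and the 1920-wide buffer layout are made up): the SV_Target output below goes through the ROPs (blending, MSAA, format conversion), while the UAV store next to it uses one of those plain store paths:

[code]
// One render target is bound at slot 0, so in DX11 the pixel shader UAV starts at u1.
RWStructuredBuffer<float4> gSideChannel : register(u1);

float4 ContrastPS(float4 pos : SV_Position) : SV_Target
{
    float4 color = float4(1.0, 0.5, 0.25, 1.0);     // some shaded result
    uint index = (uint)pos.y * 1920 + (uint)pos.x;  // assumed 1920-wide buffer
    gSideChannel[index] = color; // plain memory store, no ROP involved
    return color;                // render target write, goes through the ROPs
}
[/code]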
 
Yeah I always found that to be a curious assertion. More ROPs only help if ROPs are already the bottleneck. If you're shader bound at 1080p you'll be shader bound at 4k too.

It's a trinity... GPUs are over-balanced these days, so maybe we've forgotten the basics of it.
 
This strikes me as a curious discrepancy. How do compute shaders get away with not using ROPs, and why? Are ROPs in current GPUs merely a vestigial organ of past GPU design practices; are they really needed if compute shaders can get by without using them?

ROPs are units which process the pre-defined output calculations on a pre-defined raster on a pre-defined buffer, offering special features such as multi-sampling. Which means the ROPs also know about stuff that happens in the rasterizer, and they know stuff about the pixel neighbourhood in the same quad, which means they can write/blend in 4-pixel bursts. There are controls to specify what ROPs do, like blending modes, alpha test etc. That's a lot of global state, and it's extremely slow to change. But they are extremely good and fast at what they do. They are the equivalent of a TMU, but for writing. There are only a few of those units available, enough to write to at most 4 different surfaces simultaneously per input-set (which is a pixel instance). That's unique.

Compute doesn't have any pre-defined outputs; you bind outputs the same way you bind inputs. You can't attach ROPs to compute shaders, because no rasterizer is supposed to be involved, and there is absolutely no persistent state meant to be connected to a compute shader. All state is local and essentially buffers of various kinds (constant buffer, read-only texture buffer, read-write texture buffer, plain memory buffer, etc.). They are kept minimal to allow much more flexibility. Their writing capability is severely limited though, basically only variations of uint-moves. But they can be yielded very well because there is almost no chip-wide state to switch. There are more possible units for writing than there are ROPs, enough for 8 different surfaces, but you can only write to one per input-set (which is a thread) at a time, and you can only write at most 2x uints/floats.

One pixel shader instance equals one compute thread in granularity. You can run 256 quads (256x4 = 1024) against 1024 compute threads, and you could issue 4x the number of writes with ROPs as with compute, and 2x the size: 4 floats per ROP write vs. 2 floats per thread write. The maximum is thus 4x4x4 bytes = 64 bytes per ROP and 1x2x4 = 8 bytes per thread.
Additionally, because the compute outputs are tightly coupled to the memory controller, you need to worry about bank conflicts. The ROPs have caches, so the issue doesn't exist there. On the other hand you have absolute control over what you want to do with your output in compute, including atomic operations. Something like that doesn't exist for ROPs.

With DX12 we get typed UAV writes, and the writing will be a bit better, but we're still far from having ROP-like writing functionality for compute.

I hope I got it all broadly correct, some numbers are different for different architectures. :)
 
I hope I got it all broadly correct, some numbers are different for different architectures. :)
Good post. Minor additions below...
ROPs are units which process the pre-defined output calculations on a pre-defined raster on a pre-defined buffer, offering special features such as multi-sampling. Which means the ROPs also know about stuff that happens in the rasterizer, and they know stuff about the pixel neighbourhood in the same quad, which means they can write/blend in 4-pixel bursts. There are controls to specify what ROPs do, like blending modes, alpha test etc. That's a lot of global state, and it's extremely slow to change. But they are extremely good and fast at what they do.
ROP hardware performs MSAA replication and (late) depth + stencil testing. The pixel shader outputs one value per pixel; the ROP does depth, stencil and triangle coverage tests at sample frequency. Modern hardware also has depth buffer and color buffer compression, and hardware closely tied to the ROP handles these tasks as well. Modern GPUs also have early depth and hierarchical depth+stencil hardware. This hardware operates before the pixel shader is invoked (= not directly a ROP task, but only available when ROPs are used = in pixel shaders).

When blending is enabled ROP guarantees that pixels written to the back buffer at the same (x,y) location are blended in correct order. Order is determined by the triangle order in the vertex buffer. Other memory writes (or reads) do not have ordering guarantees (yet). DirectX 11.3 ROV (rasterizer ordered views) adds ordering capability to UAV writes from the pixel shader.
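As a rough sketch of what that looks like from the shader side (resource name made up), declaring the UAV as a rasterizer ordered view makes overlapping pixel shader invocations at the same pixel perform their accesses to it in primitive order, which is what enables a programmable blend like this:

[code]
// Sketch of a "programmable blend" through a rasterizer ordered view (DX11.3).
RasterizerOrderedTexture2D<float4> gOrderedTarget : register(u0);

void CustomBlendPS(float4 pos : SV_Position, float4 src : COLOR0)
{
    int2 p = int2(pos.xy);
    float4 dst = gOrderedTarget[p];            // read what earlier triangles wrote
    gOrderedTarget[p] = lerp(dst, src, src.a); // ordered read-modify-write
}
[/code]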
Their writing capability is severely limited though, basically only variations of uint-moves. But they can be yielded very well because there is almost no chip-wide state to switch. There are more possible units for writing than there are ROPs, enough for 8 different surfaces, but you can only write to one per input-set (which is a thread) at a time, and you can only write at most 2x uints/floats.
DirectX 11.0 actually supports typed UAV writes (https://msdn.microsoft.com/en-us/library/windows/desktop/hh447238(v=vs.85).aspx). Typed UAV loads are not supported until DirectX 11.3. This has always been a silly limitation.
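A tiny sketch of that asymmetry (names illustrative): the typed write below is fine on DX11.0, while the commented-out typed load of the same float4 UAV is only legal on single-component 32-bit formats until DX11.3 typed UAV loads arrive:

[code]
RWTexture2D<float4> gColorUAV : register(u0);

[numthreads(8, 8, 1)]
void WriteOnlyCS(uint3 id : SV_DispatchThreadID)
{
    gColorUAV[id.xy] = float4(1, 0, 0, 1); // typed UAV write: OK on DX11.0
    // float4 c = gColorUAV[id.xy];        // typed UAV load of a float4 format:
    //                                     // not allowed until DX11.3
}
[/code]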
The ROPs have caches, so the issue doesn't exist there. On the other hand you have absolute control over what you want to do with your output in compute, including atomic operations. Something like that doesn't exist for ROPs.
UAV writes and reads are also cached (by the L1 and L2 of the GPU). Bank conflicts can occur if the developer is careless about the memory access patterns. On GCN, the ROP caches (color and depth block caches) are fully separate from the main cache hierarchy. On GCN global atomics write directly to L2 (passing through L1). ROP "kind of" supports atomics. Alpha blending provides multiple different atomic read-modify-write operations. You can't get the atomic result back however (ROPs are "fire and forget").
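A quick sketch of that difference (buffer name made up): a compute atomic hands the previous value back to the shader, whereas the ROP blend is also an atomic read-modify-write on the render target but the shader never sees the result:

[code]
RWStructuredBuffer<uint> gCounters : register(u0);

[numthreads(64, 1, 1)]
void CountCS(uint3 id : SV_DispatchThreadID)
{
    uint previous;
    InterlockedAdd(gCounters[0], 1, previous); // atomic add, original value returned
    // Additive alpha blending in the ROP is the "fire and forget" equivalent:
    // the read-modify-write happens, but 'previous' never comes back.
}
[/code]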
 
And after Mantle I was thinking, strangely, that the UAV load limitations would be completely eliminated...
 
DirectX 11.0 actually supports typed UAV writes (https://msdn.microsoft.com/en-us/library/windows/desktop/hh447238(v=vs.85).aspx). Typed UAV loads are not supported until DirectX 11.3. This has always been a silly limitation.

Oh yes, sorry, I flipped it around. Here's the extent of it visible:
https://msdn.microsoft.com/en-us/library/windows/desktop/ff471325(v=vs.85).aspx#UAV_Typed_Load

It can lead to really weird constructions in the shader where the I/O code ends up asymmetric.
Personally I use AMP for most compute stuff and I rarely ever use RW-textures; I manage to work with W-textures almost exclusively (of course it's the same buffer type, I simply don't read from it).
I haven't used a RWStructuredBuffer yet; I wonder what the implications for those loads are. Would it produce one load instruction for every uint in the structure?
 
RWTextures have some odd limitations. For example there is no way to read or write the mips (no .mips operator). And you can't sample the texture either. Of course you can create a SRV to the same texture to sample it and/or access the mips, but you can't simultaneously bind two views (UAV + SRV) to the same resource, limiting the possible read + modify usage patterns quite a bit.
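To make those limitations concrete, a small sketch (names made up); the commented-out lines are the operations that simply don't exist on a RWTexture, and the SRV that would provide them can't be bound at the same time as the UAV of the same resource:

[code]
RWTexture2D<float4> gTexUAV : register(u0); // UAV view of the texture
Texture2D<float4>   gTexSRV : register(t0); // SRV view of the same resource
SamplerState        gLinear : register(s0);

[numthreads(8, 8, 1)]
void MipTouchCS(uint3 id : SV_DispatchThreadID)
{
    gTexUAV[id.xy] = float4(0, 0, 0, 1);    // a mip 0 store is all the UAV offers
    // gTexUAV.mips[1][id.xy] = ...         // no .mips operator on a RWTexture
    // gTexUAV.SampleLevel(gLinear, uv, 0); // RWTextures cannot be sampled either
    // gTexSRV.SampleLevel(gLinear, uv, 2); // works, but the SRV cannot be bound
    //                                      // simultaneously with the UAV above
}
[/code]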
 