Is pixel fillrate about to become more important again?

Markus

Newcomer
It looks to me like events are conspiring to reverse the long trend towards more shader math ops per texture read. Anyone like to chime in and give a clearer understanding?

What I see is:

Increases in shader performance have outpaced memory bandwidth for a very long time. If it's easy to improve shaders and hard to improve fillrate, then GPUs will be made to provide more shader math capability and games will follow, making ever more complex shaders that incestuously utilize cache bandwidth without referencing the outside world (RAM) as much. As late as the ATi R400 series (e.g. Radeon X850) there was an equal number of pixel shaders, TMUs and ROPs, whereas the RX 480 has 2304 unified shaders to 144 TMUs and 32 ROPs. Some of those shaders will be cranking on vertices, but still, that's a card capable of 72 math operations and 4.5 texture reads per output pixel.
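Just as a sanity check on those ratios, here's the back-of-the-envelope arithmetic (the unit counts are the commonly quoted ones for each chip and should be treated as approximate):

```python
# Back-of-the-envelope per-clock ratios; unit counts are the commonly quoted ones
# for each chip and should be treated as approximate.
gpus = {
    "X850 XT (R480)":   {"alus": 16,   "tmus": 16,  "rops": 16},   # 16 vec4 pixel pipes
    "RX 480 (Polaris)": {"alus": 2304, "tmus": 144, "rops": 32},   # scalar unified shaders
}

for name, units in gpus.items():
    ops_per_pixel   = units["alus"] / units["rops"]
    reads_per_pixel = units["tmus"] / units["rops"]
    print(f"{name}: {ops_per_pixel:.1f} ALU ops and "
          f"{reads_per_pixel:.2f} texture reads per output pixel per clock")
```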

3D stacks of memory (e.g. HBM) with extremely wide interfaces on a silicon interposer appear to make it possible to keep making memory interfaces wider and wider without eating so much power. HBM1 to HBM2 is a factor ~2 leap in memory performance, and this seems like it could keep going for some time. How fast does the memory controller grow in die size when the memory interface increases in width?
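For a rough feel for the numbers (the per-pin data rates below are ballpark figures for illustration, not exact specs):

```python
# Peak DRAM bandwidth ~= bus width (bits) * per-pin data rate (Gbit/s) / 8.
# Per-pin rates below are ballpark figures for illustration only.
def peak_bandwidth_gb_s(bus_width_bits, gbit_per_pin):
    return bus_width_bits * gbit_per_pin / 8.0

configs = [
    ("256-bit GDDR5 @ 8 Gbps",    256,  8.0),
    ("4-stack HBM1 @ 1 Gbps/pin", 4096, 1.0),   # Fury X class
    ("4-stack HBM2 @ 2 Gbps/pin", 4096, 2.0),   # the ~2x leap mentioned above
]
for name, width, rate in configs:
    print(f"{name}: ~{peak_bandwidth_gb_s(width, rate):.0f} GB/s")
```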

Each new process node seems to be spaced farther apart and transistor performance, cost and power seem to be making smaller increments. Geometric scaling started failing in 2000 and Dennard scaling has been almost dead since 2006. ITRS has basically given up and does not offer a road map beyond 10 nm any more; though TSMC might try to keep going on their own to 5 nm or whatever; looks like the end is in sight for silicon.

High resolution displays and VR also seem to make fillrate relatively more important than it used to be. Many VR games now use forward renderers; as far as I understand, mainly to reduce fillrate requirements compared to deferred shading. Foveated rendering is an interesting wildcard; it would keep the resolution required to match the human visual system quite modest, as you only need 60 pixels per degree within a few degrees of the fovea and detail can be drastically stepped down further out. Framerate has also finally started to be recognized as important.
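A toy estimate of what foveation could buy, assuming uniform pixels per degree within each zone (the zone sizes and peripheral densities are just guesses on my part):

```python
# Toy foveated-rendering pixel budget per eye. Assumes a square 110x110 degree field
# and uniform angular resolution within each zone; zone extents and densities are guesses.
fov_deg = 110.0

def pixels(extent_deg, ppd):
    return (extent_deg * ppd) ** 2        # square region, ppd pixels per degree

uniform = pixels(fov_deg, 60)             # whole field at foveal quality (60 ppd)

foveal    = pixels(10, 60)                          # central ~10 degrees at 60 ppd
mid       = pixels(30, 20) - pixels(10, 20)         # out to 30 degrees at 20 ppd
periphery = pixels(fov_deg, 8) - pixels(30, 8)      # the rest at 8 ppd
foveated  = foveal + mid + periphery

print(f"uniform 60 ppd: {uniform / 1e6:.1f} Mpixels per eye")
print(f"foveated:       {foveated / 1e6:.1f} Mpixels per eye "
      f"(~{uniform / foveated:.0f}x fewer)")
```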

I have a Vive and I don't at all regret it. Amazing experiences can allow you to overlook all the various visual artefacts, but they are still present and things would be better without them. Among the visual artefacts and problems I would put resolution first of all; you're blowing up the pixels with a huge 110 degree FOV, which makes it feel similar to sitting close to a 19 inch CRT at somewhere around 800x600 to 1024x768. Second is the optics; you can't get away from god rays with a Fresnel lens. Thirdly I'd say refresh rate. 90 Hz is OK, but not nearly close to perfect. Strobing is what saves it from being outright bad (90 FPS on a CRT or on the Vive does look better than 144 FPS on a constantly lit, non-strobing LCD; the movement of your eyes during the constantly lit phase smears the object, AKA persistence blur). Normal mapping doesn't work well, only on small features; in VR you can't fool me: I can see that door is perfectly flat and that beveling or whatever is not really there; more polygons are very important, at least up close. A distant fifth is better shader quality; that's not very important for feeling present in this alien world, but surely it will look nicer. At current resolutions, supersampling and AA more generally make a huge difference in quality. Everything about the requirements for VR just screams fillrate to me.

What about the balance of TMUs to ROPs? Is there a reason for this to change? AFAIK, if a surface uses a bunch of different maps (e.g. specular map, specular sharpness map, normal map, colour map etc.) you'd want quite a lot of texture reads for each output pixel. I don't see a reason for fewer TMUs per ROP, but perhaps more shader work can be rolled into textures instead of computing it all on the fly?

Do you think we can expect a significant change in the ratios between shaders, ROPs and TMUs in the coming cards from Nvidia and AMD sporting HBM2?
 
I can't see why it would change, I'd expect the trend to continue.
Actually, even early R5xx chips had the same 1:1 ALU:TMU ratio as R4xx (later R5xx chips tripled that ratio). However, the comparison with newer chips isn't quite apples to apples, since these were vec4 shader ALUs, not scalar ones (and, as you mentioned, the chips had some separate vertex shader ALUs as well), so the increase in ALU:TMU ratio isn't all _that_ big.
 
That's interesting. I did not know shaders were scalar today, but it makes some sense to me when they're unified shaders that are tasked with doing just about everything, including non-graphics compute applications.

I remember there being a fuss over the X1800 and X1900; the X1800 has, as you say, a 1:1 ALU:TMU ratio.

If what you can get at the end of Moore's law is more memory bandwidth, can it be put to good use without a corresponding increase in ALU performance?
 
By scalar he means the vector is transposed and fed into what is essentially a scalar processor in a predictable manner to pipeline everything. There are, however, actual scalar processors on some GPUs, typically tasked with flow control and scalar operations. Some papers out there suggest we may be seeing more scalar capabilities in the future, where a shader transitions from vector to scalar for uniform operations.
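A crude way to picture that transposition (assuming a GCN-style 64-wide wavefront, with numpy standing in for the hardware):

```python
import numpy as np

# 64 "threads" (a GCN-style wavefront, assumed here), each holding one vec4.
aos = np.random.rand(64, 4).astype(np.float32)    # array-of-structures: thread-major

# Vec4 view: each thread conceptually performs one 4-wide operation on its own vector.
vec4_result = aos * 2.0

# Scalar view: transpose so each component becomes one 64-wide scalar instruction
# issued across the whole wavefront -- x for all threads, then y, z, w.
soa = aos.T                                       # structure-of-arrays: component-major
scalar_result = np.empty_like(soa)
for component in range(4):                        # 4 scalar instructions, 64 lanes each
    scalar_result[component] = soa[component] * 2.0

assert np.allclose(vec4_result, scalar_result.T)  # same math, different scheduling
```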

The whole texturing issue gets interesting largely because of tessellation. Geometry can be pushed down to a sub-pixel level at which point a normal map is somewhat irrelevant.

As for 90Hz, some of the reprojection techniques (see asynchronous timewarp for VR) could make that interesting. Slightly adjusting the frame immediately before presenting it to account for certain camera movement. Foveated rendering works for VR, but likely not standard screens, as a user could stare at the corner. With some motion blur effects it would likely reduce your fillrate concerns.

One big thing you're missing here is compute shaders. Physics simulation and game logic being accelerated alongside graphics where fillrate is largely meaningless.
 
The whole texturing issue gets interesting largely because of tessellation. Geometry can be pushed down to a sub-pixel level at which point a normal map is somewhat irrelevant.

You not only can do that, you must do that in VR, at least up close. If your normal map represents something like surface roughness, that still seems to work fine; but as soon as you try to fake larger details like small rocks, beveled surfaces and such, the sleight of hand is immediately visible.

As for 90Hz, some of the reprojection techniques (see asynchronous timewarp for VR) could make that interesting. Slightly adjusting the frame immediately before presenting it to account for certain camera movement.

I'm aware of this, but taking 45 FPS and pushing it up to 90 FPS with ATW just keeps you from getting sick; it still looks bad in most scenarios that matter. ATW can only account for head motion; in a stationary environment this works very well. Most applications I care about don't have a stationary environment. If you're following a moving character or object with your eyes with ATW at 45 FPS, the screen will flash (it's a low-persistence screen) and the afterimage will follow along on your retina in a correct way; the next frame the screen will flash, corrected for your head motion, and the environment will be in the correct place, but the moving character or object will still be in the old position and the afterimage on your retina is in the wrong place. The perception is that the object is moving backwards every other frame. In something like a rally game, where the entire environment is a moving object, this means the entire environment appears to lurch one step backward and two steps forward on alternating frames.

Foveated rendering works for VR, but likely not standard screens, as a user could stare at the corner. With some motion blur effects it would likely reduce your fillrate concerns.

Can correct motion blur be faked given very accurate eye-tracking? "Standard" motion blur in games, which relies on mouse-look rotational speed and/or in-game camera translational speed to decide what to blur, tends to blur all the wrong things and looks worse than no motion blur. If you are rotating evenly, following a moving object, what the "standard" implementation will do is blur everything, without any knowledge that you're following and focusing on a moving object or character that needs to not be blurred. This seems like a hard problem; it seems like you'd need to render the velocity of each pixel of the final image into a buffer and use this to blur the image, correctly accounting for the movement of the screen and fovea. The "easiest" solution to correct motion blur may be the brute-force stupid one of simply blasting out ~1000 FPS; ATW will only help you here with stationary objects.
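Something like this is what I imagine the velocity-buffer approach would look like; a rough CPU-side sketch, where the velocity is assumed to already be measured relative to the gaze point (that's the part that needs the eye tracking):

```python
import numpy as np

def motion_blur(image, velocity, samples=8):
    """Blur each pixel along its screen-space velocity vector.

    image:    (H, W, 3) float array.
    velocity: (H, W, 2) per-pixel motion in pixels per frame, ideally measured
              relative to the gaze point so tracked objects stay sharp.
    """
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    out = np.zeros_like(image)
    for i in range(samples):
        t = i / (samples - 1) - 0.5                      # sample along [-0.5, 0.5] of the motion
        sx = np.clip(xs + velocity[..., 0] * t, 0, w - 1).astype(int)
        sy = np.clip(ys + velocity[..., 1] * t, 0, h - 1).astype(int)
        out += image[sy, sx]
    return out / samples

# Toy usage: a gradient image where the right half moves 12 px/frame horizontally.
img = np.tile(np.linspace(0, 1, 64)[None, :, None], (64, 1, 3))
vel = np.zeros((64, 64, 2), dtype=np.float32)
vel[:, 32:, 0] = 12.0        # only the "moving" half gets blurred
blurred = motion_blur(img, vel)
```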

One big thing you're missing here is compute shaders. Physics simulation and game logic being accelerated alongside graphics where fillrate is largely meaningless.

It seems to me that rendering complexity is rapidly outpacing this problem. Doom of 2016 has fewer enemies at a time than Doom of 1993 and especially Doom 2 of 1994. The enemies are massively more individually complex, and there's a bunch of physically simulated objects that mostly amount to eye candy, but it still seems to be on the same familiar scale, whereas the increase in graphical complexity is bordering on the ridiculous.

Of course, I could be wrong, and it could just be an artifact of GPUs scaling faster than CPUs. When GPUs become more used for enemy AI, pathing, physics and other parts of the game simulation, perhaps there will be a long-overdue increase in the complexity of the simulation that just couldn't happen before?
 
Required ROP rate is directly tied to shader complexity. The ratios of ROP:ALU and ROP:BW stay relatively constant when resolution is changed (or double-eye rendering is used). Pixel shader cost scales 1:1 with resolution (however, quad efficiency and cache locality improve slightly when resolution grows). If we also assume a LOD system that tries to keep triangle size (in pixels) constant, the vertex shader cost also roughly scales according to resolution. Thus ROP cost scales similarly to ALU and BW cost when resolution is increased.
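In toy numbers (the per-pixel and per-vertex costs below are arbitrary placeholders; the point is just that the ratios cancel out when resolution scales):

```python
# Scaling argument in toy numbers: if a LOD system keeps triangle size (in pixels)
# constant, pixel work, vertex work, ROP writes and bandwidth all scale by the same
# factor with resolution, so ROP:ALU and ROP:BW stay put.
# The per-pixel and per-vertex costs are arbitrary placeholders.
def frame_costs(pixels, alu_per_pixel=200, bytes_per_pixel=64,
                pixels_per_triangle=16, alu_per_vertex=100):
    triangles = pixels / pixels_per_triangle        # LOD keeps triangle size constant
    alu  = pixels * alu_per_pixel + 3 * triangles * alu_per_vertex
    bw   = pixels * bytes_per_pixel
    rops = pixels                                   # one ROP write per covered pixel
    return alu, bw, rops

for name, res in [("1080p", 1920 * 1080), ("4K", 3840 * 2160)]:
    alu, bw, rops = frame_costs(res)
    print(f"{name}: ROP:ALU = 1:{alu / rops:.0f}, ROP:BW = 1:{bw / rops:.0f} bytes")
```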

Forward shading (usually used in VR games) is actually less ROP bound than deferred shading. Forward shading pixel shaders are much more complex (more ALU) than deferred counterparts. This is because forward rendering calculates the whole material formula and lighting and samples shadow maps in the pixel shader. RGBA16F output is also full ROP rate (on all Radeons at least). Thus modern PBR forward shading is practically never ROP bound (no matter what resolution).

Deferred shading on the other hand uses much simpler shaders. Usually the shader just reads input textures and stores them in the g-buffer. Possibly the normal is also transformed, but nothing else needs to be done. Output goes to multiple render targets, each requiring an additional ROP processing cycle (1/2, 1/3 or 1/4 ROP rate). Thus deferred shading (the g-buffer pass) is often ROP bound. The deferred lighting pass is often done using compute shaders, meaning that it doesn't use ROPs at all. Compute shaders are also nowadays commonly used in post processing (no ROPs).
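A rough sketch of why the g-buffer pass hits the ROPs: with N render targets you pay roughly N ROP cycles per pixel, so effective fillrate divides by the MRT count (the ROP count, clock and g-buffer layout below are placeholders, not any specific GPU or engine):

```python
# Rough pure-ROP time for a deferred g-buffer pass vs. a single-target forward pass.
# ROP count, clock and the g-buffer layout are placeholders, not a specific GPU/engine.
rops, clock_ghz = 32, 1.2
peak_fill_gpix_s = rops * clock_ghz                # Gpixels/s with a single render target

gbuffer = ["albedo RGBA8", "normal RGB10A2", "roughness/metalness RGBA8", "motion RG16F"]
mrt_count = len(gbuffer)                           # each extra target costs a ROP cycle

pixels_4k = 3840 * 2160
forward_ms  = pixels_4k / (peak_fill_gpix_s * 1e9) * 1e3
deferred_ms = forward_ms * mrt_count               # ~1/N ROP rate with N render targets

print(f"forward (1 RT): {forward_ms:.2f} ms of pure ROP time at 4K")
print(f"deferred ({mrt_count} RTs, g-buffer): {deferred_ms:.2f} ms of pure ROP time at 4K")
```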
 
It looks to me like events are conspiring to reverse the long trend towards more shader math ops per texture read. [...] Do you think we can expect a significant change in the ratios between shaders, ROPs and TMUs in the coming cards from Nvidia and AMD sporting HBM2?

It already has become relevant. Nvidia is attempting memory compression and high clock rates, but it's not enough. Fury X with HBM was good, but constrained to 4 GB. When both the RAM constraint and the artificial clock limits due to throttling are lifted, it will be a different ball game.

Make no mistake this is a game, gentlemen and ladies, and it is being played so that dead stock can be blown out.
 
Err no lol.......

Selling no inventory vs. writing it all off and making new stock that can sell: the latter is better than leaving dead weight around. The Fury line and the 3x0 line were all dead weight; if AMD had a choice they would just replace them entirely, which right now they don't (so to speak). The RX 480 makes all of the 3x0 line irrelevant, and nV's cards take care of the rest.
 
It seems to me that rendering complexity is rapidly outpacing this problem. Doom of 2016 has fewer enemies at a time than Doom of 1993 and especially Doom 2 of 1994. The enemies are massively more individually complex, and there's a bunch of physically simulated objects that mostly amount to eye candy, but it still seems to be on the same familiar scale, whereas the increase in graphical complexity is bordering on the ridiculous.

Of course, I could be wrong, and it could just be an artifact of GPUs scaling faster than CPUs. When GPUs become more used for enemy AI, pathing, physics and other parts of the game simulation, perhaps there will be a long-overdue increase in the complexity of the simulation that just couldn't happen before?
It's my belief the APIs are still largely what's holding things back. While Vulkan would likely work now, DX12 is still too limited in audience if high-level APIs need to be discarded. It's easier to change detail levels than the number of objects rendered when dealing with characters.

I'm aware of this, but taking 45 FPS and pushing it up to 90 FPS with ATW just keeps you from getting sick; it still looks bad in most scenarios that matter. ATW can only account for head motion; in a stationary environment this works very well.
It's not even VR where I'd consider this. Doom's TSSAA I think is a prime example. In effect they are downsampling frames with their reprojection: render lots of frames quickly with no AA or postprocessing, then apply a slight reprojection, supersampling, and postprocessing coupled to the actual vblank using compute (a rough sketch of the accumulation idea follows at the end of this post). In the case of ROPs you'd remove bandwidth costs for all the subsamples, AA, and postprocessing for an entire frame if you were rendering 120 frames and presenting 60.

Tiago Sousa: I've always been a fan of amortising/decoupling frame costs. TSSAA is essentially doing that - it reconstructs an approximately 8x super-sampled image from data acquired over several frames, via a mix of image reprojection and couple heuristics for the accumulation buffer.
It has a relatively minimal runtime cost, plus the added benefit of temporal anti-aliasing to try to mitigate aliasing across frames (eg shading or geometry aliasing while moving camera slowly). It's mostly the same implementation between consoles and PC, differences being some GCN-specific optimisations for consoles and couple of minor simplifications.

http://www.eurogamer.net/articles/digitalfoundry-2016-doom-tech-interview

Yes this is different from the traditional ATW to increase perceived framerate, but this method should help with perceived input latency while potentially increasing FPS with the reduced AA and postprocessing costs. Most accounts say Doom plays really well with the effects. The only difficulty, I think, is that it's forcing Pascal to preempt while prior generations still need that updated driver.
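Not id's actual TSSAA, obviously, but a minimal sketch of the "blend the current frame into reprojected history" idea it builds on (the blend factor and the rejection heuristic are made-up placeholders):

```python
import numpy as np

def temporal_accumulate(history, current, motion, alpha=0.1, reject_threshold=0.2):
    """Blend this frame's raw render into a reprojected history buffer.

    history: (H, W, 3) accumulated result from previous frames.
    current: (H, W, 3) this frame's raw (aliased) render.
    motion:  (H, W, 2) per-pixel screen-space motion in pixels (camera + objects).
    alpha and the color-difference rejection threshold are placeholder heuristics.
    """
    h, w, _ = current.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Reproject: look up where each pixel was last frame.
    px = np.clip(xs - motion[..., 0], 0, w - 1).astype(int)
    py = np.clip(ys - motion[..., 1], 0, h - 1).astype(int)
    reprojected = history[py, px]
    # Crude history rejection: if the reprojected color disagrees too much
    # (disocclusion, lighting change), fall back to the current frame.
    diff = np.abs(reprojected - current).mean(axis=-1, keepdims=True)
    keep = diff < reject_threshold
    blended = alpha * current + (1.0 - alpha) * reprojected
    return np.where(keep, blended, current)
```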
 
It's my belief the APIs are still largely what's holding things back. While Vulkan would likely work now, DX12 is still too limited in audience if high-level APIs need to be discarded. It's easier to change detail levels than the number of objects rendered when dealing with characters.

No it's not. Doom 2016's character modelling and development pipeline is quite complex; it's not a single model you are looking at, it's quite a few different models needed to create their complex, different death scenes. DX12 and Vulkan aren't holding the number of characters back, it's the amount of graphics resources that is holding them back, because you still have those consoles that are the lowest common denominator. A 7970-class GPU and those memory amounts aren't sufficient for a crazy number of meshes, textures, etc., especially when they are pushing 4K or multiple 4K textures per character.

It's not even VR where I'd consider this. Doom's TSSAA I think is a prime example. In effect they are downsampling frames with their reprojection: render lots of frames quickly with no AA or postprocessing, then apply a slight reprojection, supersampling, and postprocessing coupled to the actual vblank using compute. In the case of ROPs you'd remove bandwidth costs for all the subsamples, AA, and postprocessing for an entire frame if you were rendering 120 frames and presenting 60.

Yes this is different from the traditional ATW to increase perceived framerate, but this method should help with perceived input latency while potentially increasing FPS with the reduced AA and postprocessing costs. Most accounts say Doom plays really well with the effects. The only difficulty, I think, is that it's forcing Pascal to preempt while prior generations still need that updated driver.

We have no idea what the real problem is right now, even with async off, Pascal's performance doesn't change, which is just weird.
 
No it's not. Doom 2016's character modelling and development pipeline is quite complex; it's not a single model you are looking at, it's quite a few different models needed to create their complex, different death scenes. DX12 and Vulkan aren't holding the number of characters back, it's the amount of graphics resources that is holding them back, because you still have those consoles that are the lowest common denominator. A 7970-class GPU and those memory amounts aren't sufficient for a crazy number of meshes, textures, etc., especially when they are pushing 4K or multiple 4K textures per character.
Updating and syncing all those meshes would be the problem in my mind. Doom tends to fill rooms with similar monsters so texturing shouldn't change much by adding more. That would leave geometry throughput or animation as the culprit and I'm sure a nightmare setting which spawned more monsters on PC wouldn't be unreasonable. Allowing more draw calls to get away from instancing should help there.

We have no idea what the real problem is right now, even with async off, Pascal's performance doesn't change, which is just weird.
Performance doesn't change, yet the timing of frames seems to vary significantly around the vblank interval. They are using an effect that would likely only apply around that interval as well. I would think pipelining frames that alternate between heavy graphics and heavy compute might cause some issues like that. Seems a little odd to use Doom as a launch demo for Pascal and not notice or fix a presentation issue by now. They even demoed with/without vsync. Async on/off wouldn't necessarily make a difference in this case; it would help the timing with it on and likely boost performance more than expected, but most of the postprocessing may be occurring on frames close to the vblank. It would make sense that swings in graphics:compute ratios for Maxwell and even Pascal could be problematic. Most of the time that wouldn't occur often in a rendering path until you get an async-style effect. I'll admit I'm just guessing here, but if I were debugging, that'd be one of the first things I'd look at, and it may be problematic to solve.
 
DX12 and Vulkan aren't holding the number of characters back, it's the amount of graphics resources that is holding them back, because you still have those consoles that are the lowest common denominator. A 7970-class GPU and those memory amounts aren't sufficient for a crazy number of meshes, textures, etc., especially when they are pushing 4K or multiple 4K textures per character.
id Software is using virtual texturing. VT has (almost) constant texture memory cost, no matter how high the texture resolution or how many different textures you have. With VT you can basically use all your assets in every single level, and VT only loads those texels that are currently visible. As there is a fixed number of pixels on the screen (~2M for 1080p, ~8M for 4K), there is a hard limit on the texels you need to have loaded.

For example, our Xbox 360 games (~720p) had a fixed-size 48 MB VT texture atlas in memory, and that was used to store all loaded texture data (except for some special cases). Players could create levels with dozens of 4K textures visible simultaneously, but the VT system would never use more than 48 MB of memory. This works fine, as each new visible surface hides roughly as many pixels as it adds, maintaining a roughly constant 1:1 visible texel ratio. Virtual texturing doesn't load texture data for hidden surfaces, so overdraw doesn't count.

With virtual texturing, a single 256 MB texture atlas should be enough (to hold all texture data) for a modern PBR pipeline at 4K. Additional texture data might however be needed depending on how you do your decaling and how you composite your materials at runtime. id Software is nowadays using "dynamic" virtual texturing, meaning that they no longer just load baked data: VT pages are generated at runtime (similar to the RedLynx system). This is a great way to reduce disk storage and amortize decaling, material compositing, etc. costs over multiple frames.
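As a back-of-the-envelope version of that fixed-budget argument (layer count, bytes per texel and the overhead factor are assumptions for illustration, not anyone's actual numbers):

```python
# Back-of-the-envelope resident-texture budget for virtual texturing: you only need
# about one unique texel per screen pixel per material layer, regardless of how many
# source textures exist. Layer count, bytes/texel and overhead are assumptions.
screen_pixels   = 3840 * 2160      # 4K
material_layers = 4                # e.g. albedo, normal, roughness/metalness, AO
bytes_per_texel = 1                # BC7/DXT-class block compression ~ 1 byte per texel
overhead        = 2.0              # page granularity, mip chain, filter borders, in-flight pages

resident_mb = screen_pixels * material_layers * bytes_per_texel * overhead / 2**20
print(f"~{resident_mb:.0f} MB resident, independent of total asset size")
```

That lands well under the 256 MB figure above; the slack is what pays for streaming latency and whatever runtime compositing needs.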
 