Breakdown of graphics bandwidth usage

Today's GPUs typically have around 20GB/s of bandwidth. A few questions on bandwidth for this gen and the next:

It's been said that as shaders get more complex, memory bandwidth will become less important. Current benchmarks (3DMark05 included) don't show this; performance scales better when memory is overclocked versus the core. This could change in the future. Do you think this will be the case?

There's the idea that going forward, colour and Z access is going to become less important, while texture access will become more important (possibly due to increased use of render-to-texture). Agree/disagree? Thoughts on this matter?

A (probably incomplete) list of consumers of bandwidth:

- Reading of geometry data
- Reading of textures
- Colour & Z access
- Stencil
- AA

Could someone guesstimate, for a typical game (say Half-Life 2), with 4x MSAA, what's the breakdown of the bandwidth numbers in percentages?

E.g. out of 100% of the bandwidth:

- 30% texture access (texture maps, rendering to texture, etc.)
- 30% colour/Z
- 20% stencil
- 20% geometry
 
It's going to be VERY application dependent. It also depends on what bandwidth-saving features you're using.

Z/stencil will likely be the largest (assuming either front-to-back draw order or a Z prepass). There are bandwidth-saving mechanisms for Z in use on both NV and ATI hardware (ATI's Hierarchical Z and NV's Z compression) that might end up making this number 2.

Color would be number 2 and might beat out Z in cases where heavy alpha blending is in use.

Then Textures - the small caches do a good job of reducing this dramatically unless you have dependent reads or the mip bias is set very negative.

Then Vertices last by a long way.
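
As a very rough back-of-the-envelope illustration of that ordering, here's a small sketch; every number in it (overdraw, formats, vertex count, effective texel cost after caching and compression) is an assumption for illustration, not a measurement from any real game or hardware.

```python
# Rough per-frame traffic estimate, before any Z/colour compression,
# purely to illustrate the ordering above. All numbers are assumptions.

width, height   = 1024, 768
overdraw        = 3.0        # assumed average depth complexity
bytes_z_stencil = 8          # 24-bit Z + 8-bit stencil: ~4-byte read + 4-byte write
bytes_color     = 4          # RGBA8 write (blending would add a 4-byte read)
bytes_texel     = 1.0        # assumed effective bytes/pixel after DXT + cache hits
texture_layers  = 2          # assumed average texture lookups per pixel
vertices        = 100_000    # assumed vertices per frame
bytes_vertex    = 32         # position + normal + one UV set

fragments = width * height * overdraw
z_bw      = fragments * bytes_z_stencil
color_bw  = fragments * bytes_color
tex_bw    = fragments * bytes_texel * texture_layers
vtx_bw    = vertices * bytes_vertex

total = z_bw + color_bw + tex_bw + vtx_bw
for name, bw in [("Z/stencil", z_bw), ("Colour", color_bw),
                 ("Textures", tex_bw), ("Vertices", vtx_bw)]:
    print(f"{name:10s} {bw / 2**20:6.1f} MB/frame  ({100 * bw / total:4.1f}%)")
```

With those (made-up) inputs you get roughly Z/stencil 52%, colour 26%, textures 13%, vertices 9%, which is the ordering described above; change the assumptions (heavy blending, negative mip bias, a Z prepass) and the ranking shifts.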
 
Render to texture, then using the texture in rendering another frame, basically counts as rendering two frames (what is treated as a texture in the second pass is exclusively used as a frame-buffer in the first pass). These two passes will each have their own usage patterns, with their own balances of vertex, texture, Z and color accesses.

If you use render-to-texture to enable post-processing effects, you can expect the vertex and Z accesses to be virtually zero during the post-process pass; if the post-processing can be done in one pass through the pixel shader, the texture cache doesn't screw up, and the result format of the post-processing is the same as the input format, the ratio of color buffer writes to texture accesses should be about 1:1. This does, however, say nothing about the typical case of 3D rendering.
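
To put a number on that 1:1 ratio, a minimal sketch; the resolution and formats below are assumptions chosen just to make the arithmetic concrete.

```python
# Minimal sketch of the post-processing case described above: one full-screen
# pass, one texture fetch per output pixel, same input and output format.
# Resolution and format sizes are assumptions for illustration only.

width, height   = 1024, 768
bytes_per_texel = 4   # RGBA8 source (the previous pass's render target)
bytes_per_pixel = 4   # RGBA8 destination

texture_reads = width * height * bytes_per_texel   # one fetch per pixel
color_writes  = width * height * bytes_per_pixel   # one write per pixel
# No Z traffic, and vertex traffic is negligible (a single full-screen quad).

print(f"texture reads : {texture_reads / 2**20:.1f} MB/frame")
print(f"colour writes : {color_writes / 2**20:.1f} MB/frame")
print(f"ratio reads:writes = {texture_reads / color_writes:.0f}:1")
```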
 
This is just a simple example and as others have said the bandwidth usage changes from application to application, from map to map, from batch to batch, from frame to frame and even inside a frame. Using AF, MSAA or SSAA or changing the resolution also affects the bandwidth distribution (and by a lot!).

The data for the graphs comes from two frames of the UT2004 Primeval map rendered at a resolution of 1024x768 and sampled every 100K cycles. The GPU is configured with 4 vertex shaders, 8 fragment shaders and 4 memory buses. I would test MSAA, but that is yet to be implemented, likely in the far future ...

The first graph shows how much of the total GPU bandwidth is consumed by each kind of data access (these are real accesses to GPU memory; caches and texture/Z compression reduce the amount of GPU bandwidth required).

The second graph shows the bandwidth usage for each kind of data access relative to each other so that they can be compared easily.

bw.png


bw-usage-relative.png
 
JF_Aidan_Pryde said:
Could someone guesstimate, for a typical game (say Half-Life 2), with 4x MSAA, what's the breakdown of the bandwidth numbers in percentages?
It varies hugely - it's easy to generate shaders with any particular bottleneck. However, Z bandwidth is typically larger than colour, and if texture compression is in use (which it should be for 90% of textures) texture bandwidth is usually relatively low.

One interesting point is that as the resolution rises bandwidth typically drops because coherency leads to relatively more effective compression.
 
RoOoBo, that DAC bandwidth seems fairly strange. I think this should be a rather constant share of bandwidth over time instead of short peaks. The DAC can't buffer the whole frame, and it has to feed the monitor continuously.
 
Dio said:
It varies hugely - it's easy to generate shaders with any particular bottleneck. However, Z bandwidth is typically larger than colour, and if texture compression is in use (which it should be for 90% of textures) texture bandwidth is usually relatively low.

One interesting point is that as the resolution rises bandwidth typically drops because coherency leads to relatively more effective compression.

Are there a couple words missing in that last para after "rises" or "drops"? Otherwise my whole worldview just collapsed in a smoking heap. :LOL: Did you mean to subset just texture bandwidth there?
 
Xmas said:
RoOoBo, that DAC bandwidth seems fairly strange. I think this should be a rather constant share of bandwidth over time instead of short peaks. The DAC can't buffer the whole frame, and it has to feed the monitor continuously.

The DAC was set in the simulator just to dump the frame after each swap command as fast as possible, purely for verification purposes, and to work as if there was a single vsynced color buffer, so the GPU waits until the whole color buffer is dumped before starting the next frame. In fact I don't usually count the DAC cycles or memory consumption in my current experiments.

The DAC has a fixed refresh rate mode with non-synchronized display support, but it may display weird things when the map that stores whether a color buffer block is cleared/compressed isn't properly updated, and I never use it. I completely forgot that there is a bandwidth limit out of the DAC to the monitor, so it isn't implemented, but the bandwidth from the memory controller to the DAC can be configured and could be used as a way to simulate that limit (in the examples above it was set to half the GPU bandwidth, 32 bytes per cycle).

So, in short, no, the DAC behaviour isn't meant to be realistic right now, and I only use it in those graphs to show the frame transitions.
 
geo said:
Are there a couple words missing in that last para after "rises" or "drops"? Otherwise my whole worldview just collapsed in a smoking heap. :LOL: Did you mean to subset just texture bandwidth there?
Why would the positive effects of coherency just be limited to texture bandwidth? Or am I just reading the wrong way?
 
Very nice info RoOoBo. It seems that color data still consumes at least 50% of the used bandwidth (at least in UT2004). By the way, how do you measure this data?
 
Now that you mention it, it strikes me as very odd that color read and color write bandwidth seems to be identical across the whole frame. That should only be the case if alpha blending is always on or if you're rendering very small objects only while block based frame buffer compression is enabled (i.e. when MSAA is enabled). Both cases are highly unlikely for UT2k4.
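
(For reference, the reason blending would make the two track each other: a blended fragment is a read-modify-write on the color buffer, so every write is paired with a read of the same size. A minimal sketch, purely illustrative and not modelled on any particular hardware:)

```python
# Minimal sketch of why color reads track color writes when alpha blending
# is enabled: blending reads the destination, combines, then writes it back.

def write_fragment(framebuffer, x, y, src_rgba, src_alpha, blend):
    if blend:
        dst = framebuffer[y][x]                      # color READ (4 bytes)
        out = tuple(src_alpha * s + (1 - src_alpha) * d
                    for s, d in zip(src_rgba, dst))  # blend in the ROP
    else:
        out = src_rgba                               # no destination read needed
    framebuffer[y][x] = out                          # color WRITE (4 bytes)

# Toy usage: a 4x4 float framebuffer, one blended fragment.
fb = [[(0.0, 0.0, 0.0, 1.0) for _ in range(4)] for _ in range(4)]
write_fragment(fb, 1, 2, (1.0, 0.5, 0.0, 1.0), 0.5, blend=True)
```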
 
I think that most of that trace is using alpha blending for the terrain (grass). Of course the simulator may be wrong (in fact that is very likely; the question is by how much) and/or I may have screwed something up when I generated the graphs. The color compression algorithm doesn't work as it should right now (its only use is fast color clear or compressing single-colored blocks), and maybe the simulator is reading blocks that it shouldn't, but the theory is that the current implementation only reads uncompressed blocks when blend isn't enabled.

I don't remember which frames I used but most of the time in that trace there are only a few objects other than the terrain, and most of the terrain is grass.

I have uploaded a version of the simulator compiled for win32 to my website in case someone wants to try it. There is also a UT2004 trace and a trace from one of Humus's demos.
 
Xmas said:
Why would the positive effects of coherency just be limited to texture bandwidth? Or am I just reading the wrong way?

Damned if I know --I'm just trying to rescue my worldview! :LOL: Everything I've ever read says that as resolution goes up so does bandwidth usage, and here's Dio (whose technical acumen is unquestioned) telling me it just ain't so.
 
geo said:
Xmas said:
Why would the positive effects of coherency just be limited to texture bandwidth? Or am I just reading the wrong way?

Damned if I know --I'm just trying to rescue my worldview! :LOL: Everything I've ever read says that as resolution goes up so does bandwidth usage, and here's Dio (whose technical acumen is unquestioned) telling me it just ain't so.
Bandwidth usage per frame goes up of course, but since there's only a limited number of processing units, bandwidth usage per clock or per second goes down.
Similarly, AF requires more texels from a higher-res mipmap, so it needs more bandwidth than isotropic filtering. However, since you're usually still limited to a single bilinear sample per clock per TMU, you need less bandwidth per clock (color and Z reads/writes are still once per pixel).
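
A rough numerical illustration of that per-frame vs. per-second distinction (all figures assumed; it also ignores the coherency and compression gains mentioned earlier, which would push the per-second number down a bit further at higher resolutions):

```python
# Rough illustration of "per frame" vs. "per second" bandwidth (all numbers
# are assumptions). A fill-limited GPU shades a fixed number of pixels per
# second, so bytes moved per second stay roughly flat while bytes per frame
# grow with resolution (the frame rate drops instead).

pixel_rate      = 800e6   # assumed sustained pixels/second (shader-limited)
bytes_per_pixel = 16      # assumed average memory traffic per shaded pixel
overdraw        = 3.0     # assumed average depth complexity

for w, h in [(1024, 768), (1600, 1200)]:
    pixels_per_frame = w * h * overdraw
    fps        = pixel_rate / pixels_per_frame
    per_frame  = pixels_per_frame * bytes_per_pixel
    per_second = pixel_rate * bytes_per_pixel      # independent of resolution
    print(f"{w}x{h}: {per_frame/2**20:6.1f} MB/frame at {fps:5.1f} fps "
          f"-> {per_second/2**30:.1f} GB/s")
```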
 
Xmas said:
Bandwidth usage per frame goes up of course, but since there's only a limited number of processing units, bandwidth usage per clock or per second goes down.

Ah. The sun shines again, children play, birds chirp, etc. Thanks. Be interesting to get a feel for how much this helps --so bumping from 1024 to 1600 would usually indicate X percent increase (uncompressed), but actually only requires X - Y percent due to the phenomenon you guys are pointing at. On what order is Y?
 
geo said:
Ah. The sun shines again, children play, birds chirp, etc. Thanks. Be interesting to get a feel for how much this helps --so bumping from 1024 to 1600 would usually indicate X percent increase (uncompressed), but actually only requires X - Y percent due to the phenomenon you guys are pointing at. On what order is Y?
As always, it depends on several factors. Considering bandwidth required per frame, vertex bandwidth usually stays the same as long as there is no resolution-dependent geometry LOD system.

Texturing bandwidth scales less than linearly, because more pixels per triangle means more coherency and more texture magnification. So it depends on texture resolution and how textures are mapped to polygons. In today's games, texture bandwidth requirements per frame may scale proportionally when going from 640x480 to 800x600. But going from 2048x1536 to 3840x2880, you'll almost always see mip level 0 magnified. So there's practically an upper bound on how much texture bandwidth you might require per frame.

When no framebuffer compression is used, framebuffer (color, Z, and DAC readout) bandwidth scales roughly linearly with resolution. However, framebuffer compression becomes more efficient at higher resolution, because the ratio of compressible "interior tiles" to incompressible "edge tiles" increases.

Dave's reviews feature those niche fillrate graphs: http://www.beyond3d.com/reviews/sapphire/512/index.php?p=10
The game is GPU limited, yet fillrate grows with higher resolution. Of course this is not only the effect of less bandwidth being required per pixel, but also of higher rendering efficiency. These always come together.
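
As a rough illustration of the interior vs. edge tile point above (nothing here comes from real hardware; the 8x8 tile size and the disc-shaped stand-in primitive are assumptions), the fraction of fully covered tiles grows as resolution rises:

```python
# For a primitive whose screen-space proportions stay fixed, count 8x8 tiles
# that are fully covered ("interior", easy to compress) vs. partially covered
# ("edge", hard to compress) as resolution rises. Purely illustrative.

def tile_stats(width, height, tile=8):
    # Hypothetical primitive: a disc centred on screen, radius = 40% of height.
    cx, cy, r = width / 2, height / 2, 0.4 * height
    inside = lambda x, y: (x - cx) ** 2 + (y - cy) ** 2 <= r * r

    interior = edge = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            covered = sum(inside(tx + i + 0.5, ty + j + 0.5)
                          for j in range(tile) for i in range(tile))
            if covered == tile * tile:
                interior += 1
            elif covered > 0:
                edge += 1
    return interior, edge

for w, h in [(640, 480), (1024, 768), (1600, 1200)]:
    interior, edge = tile_stats(w, h)
    print(f"{w}x{h}: interior {interior}, edge {edge}, "
          f"ratio {interior / edge:.1f}")
```

The interior tile count grows with the area while the edge tile count grows with the perimeter, so the interior:edge ratio keeps improving with resolution, which is why compression tends to get relatively more effective.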
 
geo said:
Xmas said:
Bandwidth usage per frame goes up of course, but since there's only a limited number of processing units, bandwidth usage per clock or per second goes down.
Ah. The sun shines again, children play, birds chirp, etc. Thanks. Be interesting to get a feel for how much this helps --so bumping from 1024 to 1600 would usually indicate X percent increase (uncompressed), but actually only requires X - Y percent due to the phenomenon you guys are pointing at. On what order is Y?
I've mentioned this before a few times here, as the majority of people here have made the same assumption :D. Firstly, I would consider that the term 'memory bandwidth' refers to 'memory accessed per second', not 'memory used'. It's a common confusion. (Although Xmas' explanations have been mostly excellent, note that 'bandwidth per frame' is not a term that makes sense to me - although it's an easy mistake to make, I made a similar one while drafting this, and it's possible I'm wrong with my exact specification of the term.)

For an architecture which is executing at a particular number of pixels per clock, it's clearly going to use an approximately equal amount of memory per clock (which is equivalent to bandwidth), all else being equal.

Now, granularity losses decrease as resolution rises; compression is likely to improve at higher res (as there is lower entropy per pixel in the input data), and textures are more likely to be magnified. Against that, textures may select higher mipmap levels, and display refresh consumes more bandwidth, assuming the framerate is below the refresh rate. I also neglected to mention one thing, which is that I'm assuming the initial resolution is high enough to limit pre-rasterisation bottlenecks.

I don't have any hard figures but I'd guess that on average we see no change or a small net reduction in memory bandwidth once we get the rasteriser up over 90% busy - although it's easy to generate cases that are miles away from this pattern.
 