How to calculate the required video card memory bandwidth?

991060

To begin with, everyone knows that a better card is equipped with faster memory, and higher GPU<->memory bandwidth generally means higher performance (as far as the same GPU architecture is concerned). I'm wondering how to calculate the required memory bandwidth that can satisfy the GPU's needs.

I have a coarse equation here:
required bandwidth = (scene width) * (scene height) * FPS * (scene complexity) * [(average number of texels used per pixel * 4 bytes) + (average number of Z reads/writes per pixel * 8 bytes)]

Did I miss something important? :rolleyes:
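For what it's worth, here's that formula as a quick Python sketch; the resolution, frame rate and per-pixel counts plugged in below are just numbers I made up for illustration:

# A rough sketch of the formula above; every number below is a made-up example, not a measurement.
def required_bandwidth(width, height, fps, scene_complexity,
                       texels_per_pixel, z_ops_per_pixel):
    # texels assumed 4 bytes each, each Z read/write assumed 8 bytes, as in the formula
    per_pixel_bytes = texels_per_pixel * 4 + z_ops_per_pixel * 8
    return width * height * fps * scene_complexity * per_pixel_bytes

# e.g. 1024x768 @ 60 fps, scene complexity 3, 2 texels and 1 Z op per pixel
print(required_bandwidth(1024, 768, 60, 3, 2, 1) / 2**20, "MB/s")   # 2160.0 MB/s with these guesses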
 
(1600 * 1200 * 4 * (1 (front buffer) + 1 (back buffer) + 1 (Z buffer) + 6 (AA back buffer) + 6 (AA Z buffer))) / 2^20 ≈ 110 MiB

OpenGL Guy's equation for ATI cards.
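Just as a sanity check on the arithmetic, a throwaway Python snippet (nothing here is card-specific):

# Buffer footprint at 1600x1200, 32 bpp, 6x AA: front + back + Z + 6 AA color + 6 AA Z
buffers = 1 + 1 + 1 + 6 + 6
print(1600 * 1200 * 4 * buffers / 2**20, "MiB")   # ~109.9 MiB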
 
Bandwidth, K.I.L.E.R...

I'd say:

fps * width * height * [((overdraw * 2 + scene complexity) * bytes per pixel * # of AA samples) + (overdraw * avg # of texels per pixel * bytes per texel)] + bandwidth for downsampling/RAMDAC

where scene complexity is the average number of "layers" per pixel, i.e. the number of triangles you'd intersect if you shoot a ray through that pixel. And overdraw is the average number of times a pixel is written to the frame buffer. Note that overdraw <= scene complexity, because of the depth test.

This is of course ignoring any bandwidth saving features such as hierarchical Z, compression, etc. And it assumes "simple rendering" with no special passes like stencil shadows.

Specifically, you need:
Reading Z: fps * width * height * scene complexity * AA samples * bytes per pixel (Z) [* compression factor (Z)] [* hierZ reject factor]
Writing Z: fps * width * height * overdraw * AA samples * bytes per pixel (Z) [* compression factor (Z)]
Writing color: fps * width * height * overdraw * AA samples * bytes per pixel (color) [* compression factor (color)]
Texturing: fps * width * height * overdraw * avg texel per pixel * bytes per texel [* compression factor (texture)]
Downsampling/RAMDAC:
either: width * height * AA samples * bytes per pixel (color) * refresh rate
or: width * height * bytes per pixel (color) * [(AA samples + 1) * fps + refresh rate]
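To make that concrete, here is the breakdown above as a small Python sketch. All the inputs in the example call (resolution, overdraw, texel counts, refresh rate and so on) are invented values, and the optional compression / hier-Z factors are simply left out (i.e. treated as 1):

def frame_bandwidth(fps, width, height, scene_complexity, overdraw,
                    aa_samples, bytes_z, bytes_color,
                    texels_per_pixel, bytes_texel, refresh_rate,
                    downsample_on_scanout=True):
    pixels_per_sec = fps * width * height
    z_read    = pixels_per_sec * scene_complexity * aa_samples * bytes_z
    z_write   = pixels_per_sec * overdraw * aa_samples * bytes_z
    col_write = pixels_per_sec * overdraw * aa_samples * bytes_color
    texturing = pixels_per_sec * overdraw * texels_per_pixel * bytes_texel
    if downsample_on_scanout:
        # downsample every refresh, straight to the RAMDAC
        scanout = width * height * aa_samples * bytes_color * refresh_rate
    else:
        # resolve once per frame (read AA samples, write 1), then scan out at refresh rate
        scanout = width * height * bytes_color * ((aa_samples + 1) * fps + refresh_rate)
    return z_read + z_write + col_write + texturing + scanout

# Invented example: 1024x768 @ 60 fps, 4x AA, scene complexity 3, overdraw 2,
# 32-bit Z and color, 2 texels per pixel at 4 bytes each, 85 Hz refresh.
print(frame_bandwidth(60, 1024, 768, 3, 2, 4, 4, 4, 2, 4, 85) / 1e9, "GB/s")   # ~7.1 GB/s with these guesses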
 
Oh, this is far more complex than that. You've missed plenty of potential 'bandwidth' consumption.
 
Yeah, loads of stuff, and some is even ATI or nVidia specific!
For example, I think that when you load a recently used pixel shading program on the NV3x, it's loaded from memory, not AGP.

And ignoring bandwidth-saving features is only really practical as long as you aren't speaking of anything better than a TNT2 M64 :p

IMO, if you wanted to get something with less than a 10% margin of error, you'd have to benchmark the compression techniques in as many ways as possible to see how good they are, and loads of other stuff.

And the question is: is it worth doing all that work for such a, let us say, boring thing? Don't get me wrong, I like theoretical things, but this has no real-world use...


Uttar
 
Uttar said:
And the question is: is it worth doing all that work for such a, let us say, boring thing? Don't get me wrong, I like theoretical things, but this has no real-world use...

I've a very quick and simple method for calculating max theoretical bandwidth (more for laymen like me, and probably not accurate for all occasions):

128bit SDRAM: memory clockspeed * 16
128bit DDR: memory clockspeed * 32
256bit DDR: memory clockspeed * 64

It saves me time so shoot me ;)
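The same shortcut in Python, in case anyone wants to plug in other cards (bus width in bits, clock in MHz; the 256-bit / 310 MHz example below is just an illustration):

# Peak theoretical bandwidth in MB/s: (bus width in bits / 8) bytes per transfer * transfers per clock * clock in MHz
def peak_bandwidth_mb(bus_bits, clock_mhz, data_rate=2):   # data_rate = 2 for DDR, 1 for SDR
    return bus_bits / 8 * data_rate * clock_mhz

print(peak_bandwidth_mb(256, 310))   # 256-bit DDR at 310 MHz -> 19840.0 MB/s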
 
Uttar said:
And the question is: is it worth doing all that work for such a, let us say, boring thing? Don't get me wrong, I like theoretical things, but this has no real-world use...


Uttar

Yes, you're right, I know it has little use in the real world. But I just want a way to calculate the approximate required bandwidth. A benchmark can tell me what's happening, but it can't tell me why it happens (at least not the reasons for every phenomenon) :D
 
Dio said:
Oh, this is far more complex than that. You've missed plenty of potential 'bandwidth' consumption.

Since we don't know a lot about the cache implementations, what's the bandwidth used without on-chip caches and without any bandwidth-saving technology? And how much can those missing parts affect the final result? If the effect is a big one, I won't spend more time on this topic.
 
Well alright, let's consider overdraw reduction... without it you're going to have width * height * 4 (assuming an average depth complexity of four), but with, say, a PowerVR card you are going to have width * height * 1. That's one hell of a difference, matey, and it's just the start.
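Just to put rough numbers on that (a trivial sketch; 1024x768 at 32 bpp and 60 fps chosen arbitrarily):

# Color-write bandwidth per second with and without overdraw reduction, 1024x768, 32 bpp, 60 fps
width, height, bytes_color, fps = 1024, 768, 4, 60
immediate = width * height * 4 * bytes_color * fps   # depth complexity ~4, every layer written
deferred  = width * height * 1 * bytes_color * fps   # e.g. a PowerVR-style deferred renderer
print(immediate / 2**20, deferred / 2**20)           # 720.0 vs 180.0 MB/s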
 
991060 said:
Since we don't know a lot about the cache implementations, what's the bandwidth used without on-chip caches and without any bandwidth-saving technology? And how much can those missing parts affect the final result? If the effect is a big one, I won't spend more time on this topic.
It's a big one. Without having decent figures for those you won't get within a factor of 2.
 
Sorry to resurrect the thread, but I thought it worth mentioning that the calculation for the RAMDAC contribution is not quite right.

Whilst
width * height * bytes per pixel (color) * refresh rate
may be correct averaged over a second, the bandwidth whilst the actual frame is being scanned out is higher; the requirement averaged over a second is brought down by the periods of inactivity during h and v blanking.
To calculate it correctly you need to use the pixel clock for a given mode:

bytes per pixel (color) * dotclock


Relatively speaking a small difference, but salient nonetheless.

CC
 
Hm, since I'd expect pretty much any memory access to go through some caching/reordering/block transferring mechanism to avoid page breaks, I don't think that is very relevant. Texture blocks and framebuffer tiles are also only read and written "every once in a while" (though very often ;) ), and not in every cycle.
 
It is difficult to have a tiled memory layout that is efficient for both texture and DAC reads. Texture accesses are typically much smaller, whereas the DAC has to read the framebuffer in a linear manner, so it breaks pages frequently. Generally we try to put textures into a different bank than the active display, as this means that texture reads/writes do not cause page breaks for the DAC; holding off DAC memory requests can cause nasty display disturbances.

Using 1280x1024 @ 85 Hz as an example:
the pixel clock for this mode is typically 157.5 MHz; multiply that by your bytes per pixel and you get the peak bandwidth for the DAC at any given point.
e.g. 32 bpp: 157.5 MHz * 4 bytes = 630 MB/s

If you use the refresh rate instead, you end up with 1280 * 1024 * 85 Hz = 111.4 Mpixels/s, and multiplying by 4 gives a memory bandwidth of 445.6 MB/s,


a difference of about 185 MB/s.
CC
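CC's two figures, reproduced in Python for anyone who wants to try other modes (the 157.5 MHz pixel clock is the commonly quoted VESA figure for 1280x1024 @ 85 Hz):

# DAC read bandwidth at 1280x1024 @ 85 Hz, 32 bpp (4 bytes per pixel)
bytes_per_pixel = 4
peak_during_scanout = 157.5e6 * bytes_per_pixel               # pixel clock includes blanking overhead
average_over_second = 1280 * 1024 * 85 * bytes_per_pixel      # active pixels * refresh rate only
print(peak_during_scanout / 1e6, average_over_second / 1e6)   # 630.0 vs ~445.6 MB/s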
 
Very true. The one advantage with the DAC reads is that they are independent and predictable - which is not remotely true for the 3D block...
 
Almost OT but not quite :)
Is it safe to say that whenever an R9700 is faster than an R9500P, the latter is bandwidth limited? Same core and clock, only half the bandwidth available. In benchmarks it's very rare that both run at the same speed (except in synthetic VS/PS tests), so almost all situations are bandwidth limited on the 9500P, right?
I ask because I own one, and I wonder how it will handle HL2 and the like, and whether 128-bit memory will heavily hurt the efficiency of the 8 pipes, or whether I should consider moving to a 256-bit memory card (cheap used 9700Ps will become available when the 9900 gets out ;) )
 
nyt said:
Is it safe to say that whenever an R9700 is faster than an R9500P, the latter is bandwidth limited?

It is safe to say that some part of the scene is bandwidth limited.

I ask because I own one, and I wonder how it will handle HL2 and the like

If HL2 is shader limited (more specifically, arithmetic limited in the shaders), then the difference between the R9700 and R9500P might get smaller.
 
Well... that depends. See, HL2 is going to look really bad if you use multisampling FSAA, because you end up sampling texels that don't belong to the polygon you want, and if it's a lightmap you're doing the sampling on then you can end up with horrible lighting errors. As Gabe explained, you get those errors when multisampling in any game that uses lightmaps, but you probably don't notice them as much; HL2 uses a whole lot of complicated lighting and they've really cut down on the other graphical errors, which makes this one stand out more.

The way to fix the problem is to use centroid sampling, which limits the samples to the polygon that you want, but there's a problem with that: DX requires PS 3.0 compliance in order to use centroid sampling, even though the R3x0 series does have the hardware to do it. Unless ATi and Valve can get around DX's requirement of PS 3.0, you'll have to use the PS 2.0 units to clamp the samples, which takes more pixel shader power, which is already the limiting factor. On nVidia cards the PS 2.0 path is the only option, as they don't have the hardware for centroid sampling, which means FSAA without horrible graphical errors is going to be pretty much impossible given how much the NV3x is already limited by its DX9 shader performance.
 