How to calculate the required video card memory bandwidth?

991060

To begin with, everyone knows that a better card is equipped with faster memory, and higher GPU<->memory bandwidth generally means higher performance (as far as the same GPU architecture is concerned). I'm wondering how to calculate the required memory bandwidth that can satisfy the GPU's needs.

I have a coarse equation here:
required bandwidth = (scene width) * (scene height) * FPS * (scene complexity) * [(average number of texels used per pixel * 4 bytes) + (average number of Z reads/writes per pixel * 8 bytes)]

Did I miss something important? :rolleyes:
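For what it's worth, here's that formula as a quick Python sketch; the resolution, frame rate and per-pixel counts plugged in below are just numbers I made up for illustration:

# A rough sketch of the formula above; every number below is a made-up example, not a measurement.
def required_bandwidth(width, height, fps, scene_complexity,
                       texels_per_pixel, z_ops_per_pixel):
    # texels assumed 4 bytes each, each Z read/write assumed 8 bytes, as in the formula
    per_pixel_bytes = texels_per_pixel * 4 + z_ops_per_pixel * 8
    return width * height * fps * scene_complexity * per_pixel_bytes

# e.g. 1024x768 @ 60 fps, scene complexity 3, 2 texels and 1 Z op per pixel
print(required_bandwidth(1024, 768, 60, 3, 2, 1) / 2**20, "MB/s")   # 2160.0 MB/s with these guesses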
 
(1600 * 1200 * 4 * (1 (front buffer) + 1 (back buffer) + 1 (Z buffer) + 6 (AA back buffer) + 6 (AA Z buffer))) / 2^20 ≈ 110 MiB

OpenGL Guy's equation for ATI cards.
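Just as a sanity check on the arithmetic, a throwaway Python snippet (nothing here is card-specific):

# Buffer footprint at 1600x1200, 32 bpp, 6x AA: front + back + Z + 6 AA color + 6 AA Z
buffers = 1 + 1 + 1 + 6 + 6
print(1600 * 1200 * 4 * buffers / 2**20, "MiB")   # ~109.9 MiB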
 
Bandwidth, K.I.L.E.R...

I'd say:

fps * width * height * [((overdraw * 2 + scene complexity) * bytes per pixel * # of AA samples) + (overdraw * avg # of texels per pixel * bytes per texel)] + bandwidth for downsampling/RAMDAC

where scene complexity is the average number of "layers" per pixel, i.e. the number of triangles you'd intersect if you shoot a ray through that pixel. And overdraw is the average number of times a pixel is written to the frame buffer. Note that overdraw <= scene complexity, because of the depth test.

This is of course ignoring any bandwidth saving features such as hierarchical Z, compression, etc. And it assumes "simple rendering" with no special passes like stencil shadows.

Specifically, you need:
Reading Z: fps * width * height * scene complexity * AA samples * bytes per pixel (Z) [* compression factor (Z)] [* hierZ reject factor]
Writing Z: fps * width * height * overdraw * AA samples * bytes per pixel (Z) [* compression factor (Z)]
Writing color: fps * width * height * overdraw * AA samples * bytes per pixel (color) [* compression factor (color)]
Texturing: fps * width * height * overdraw * avg texel per pixel * bytes per texel [* compression factor (texture)]
Downsampling/RAMDAC:
either: width * height * AA samples * bytes per pixel (color) * refresh rate
or: width * height * bytes per pixel (color) * [(AA samples + 1) * fps + refresh rate]
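To make that concrete, here is the breakdown above as a small Python sketch. All the inputs in the example call (resolution, overdraw, texel counts, refresh rate and so on) are invented values, and the optional compression / hier-Z factors are simply left out (i.e. treated as 1):

def frame_bandwidth(fps, width, height, scene_complexity, overdraw,
                    aa_samples, bytes_z, bytes_color,
                    texels_per_pixel, bytes_texel, refresh_rate,
                    downsample_on_scanout=True):
    pixels_per_sec = fps * width * height
    z_read    = pixels_per_sec * scene_complexity * aa_samples * bytes_z
    z_write   = pixels_per_sec * overdraw * aa_samples * bytes_z
    col_write = pixels_per_sec * overdraw * aa_samples * bytes_color
    texturing = pixels_per_sec * overdraw * texels_per_pixel * bytes_texel
    if downsample_on_scanout:
        # downsample every refresh, straight to the RAMDAC
        scanout = width * height * aa_samples * bytes_color * refresh_rate
    else:
        # resolve once per frame (read AA samples, write 1), then scan out at refresh rate
        scanout = width * height * bytes_color * ((aa_samples + 1) * fps + refresh_rate)
    return z_read + z_write + col_write + texturing + scanout

# Invented example: 1024x768 @ 60 fps, 4x AA, scene complexity 3, overdraw 2,
# 32-bit Z and color, 2 texels per pixel at 4 bytes each, 85 Hz refresh.
print(frame_bandwidth(60, 1024, 768, 3, 2, 4, 4, 4, 2, 4, 85) / 1e9, "GB/s")   # ~7.1 GB/s with these guesses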
 
Oh, this is far more complex than that. You've missed plenty of potential 'bandwidth' consumption.
 
Yeah, loads of stuff, and some is even ATI or nVidia specific!
For example, I think that when you load a recently used pixel shading program on the NV3x, it's loaded from memory, not AGP.

And ignoring bandwidth-saving features is only really practical as long as you aren't speaking of anything better than a TNT2 M64 :p

IMO, if you wanted to get something with less than a 10% margin of error, you'd have to benchmark the compression techniques in as many ways as possible to see how good they are, and loads of other stuff.

And the question is: is it worth doing all that work for such a, let us say, boring thing? Don't get me wrong, I like theoretical things, but this has no real-world use...


Uttar
 
Uttar said:
And the question is: is it worth doing all that work for such a, let us say, boring thing? Don't get me wrong, I like theoretical things, but this has no real-world use...

I've a very quick and simple method for calculating max theoretical bandwidth (more for laymen like me, and probably not accurate for all occasions):

128bit SDRAM: memory clockspeed * 16
128bit DDR: memory clockspeed * 32
256bit DDR: memory clockspeed * 64

It saves me time so shoot me ;)
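The same shortcut in Python, in case anyone wants to plug in other cards (bus width in bits, clock in MHz; the 256-bit / 310 MHz example below is just an illustration):

# Peak theoretical bandwidth in MB/s: (bus width in bits / 8) bytes per transfer * transfers per clock * clock in MHz
def peak_bandwidth_mb(bus_bits, clock_mhz, data_rate=2):   # data_rate = 2 for DDR, 1 for SDR
    return bus_bits / 8 * data_rate * clock_mhz

print(peak_bandwidth_mb(256, 310))   # 256-bit DDR at 310 MHz -> 19840.0 MB/s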
 
Uttar said:
And the question is: is it worth doing all that work for such a, let us say, boring thing? Don't get me wrong, I like theoretical things, but this has no real-world use...


Uttar

Yes, you're right, I know it has little use in the real world. But I just want a way to calculate the approximate required bandwidth. A benchmark can tell me what's happening, but it can't tell me why it happens (at least not the reasons for every phenomenon) :D
 
Dio said:
Oh, this is far more complex than that. You've missed plenty of potential 'bandwidth' consumption.

Since we don't know a lot about the cache implementations, what's the bandwidth used without on-chip caches and without any bandwidth-saving technology? And how much can those missing parts affect the final result? If the effect is a big one, I won't spend more time on this topic.
 
Well alright, let's consider overdraw reduction... without it you're going to have width * height * 4 (assuming an average depth complexity of four), but with, say, a PowerVR card you are going to have width * height * 1. That's one hell of a difference, matey, and it's just the start.
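Just to put rough numbers on that (a trivial sketch; 1024x768 at 32 bpp and 60 fps chosen arbitrarily):

# Color-write bandwidth per second with and without overdraw reduction, 1024x768, 32 bpp, 60 fps
width, height, bytes_color, fps = 1024, 768, 4, 60
immediate = width * height * 4 * bytes_color * fps   # depth complexity ~4, every layer written
deferred  = width * height * 1 * bytes_color * fps   # e.g. a PowerVR-style deferred renderer
print(immediate / 2**20, deferred / 2**20)           # 720.0 vs 180.0 MB/s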
 
991060 said:
Since we don't know a lot about the cache implementations, what's the bandwidth used without on-chip caches and without any bandwidth-saving technology? And how much can those missing parts affect the final result? If the effect is a big one, I won't spend more time on this topic.
It's a big one. Without having decent figures for those you won't get within a factor of 2.
 
Sorry to resurrect the thread, but I thought it worth mentioning that the calculation for the RAMDAC contribution is not quite right.

Whilst
width * height * bytes per pixel (color) * refresh rate
may be correct averaged over a second, the bandwidth whilst the actual frame is being scanned out is higher; the requirement averaged over a second is brought down by the periods of inactivity during h and v blanking.
To calculate it correctly you need to use the pixel clock for a given mode:

bytes per pixel (color) * dotclock


Relatively speaking a small difference, but salient nonetheless.

CC
 
Hm, since I'd expect pretty much any memory access to go through some caching/reordering/block transferring mechanism to avoid page breaks, I don't think that is very relevant. Texture blocks and framebuffer tiles are also only read and written "every once in a while" (though very often ;) ), and not in every cycle.
 
It is difficult to have a tiled memory layout that is efficient for both texture and DAC reads. Texture accesses are typically much smaller, whereas the DAC has to read the framebuffer in a linear manner, so it breaks pages frequently. Generally we try to put textures into a different bank than the active display, as this means that texture reads/writes do not cause page breaks for the DAC; holding off DAC memory requests can cause nasty display disturbances.

Using 1280x1024 @ 85 Hz as an example:
the pixel clock for this mode is typically 157.5 MHz; multiply that by your bytes per pixel and you get the peak bandwidth for the DAC at any given point.
e.g. 32 bpp: 157.5 MHz * 4 bytes = 630 MB/s

If you use the refresh rate instead, you end up with 1280 * 1024 * 85 Hz = 111.4 Mpixels/s, and multiplying by 4 gives a memory bandwidth of 445.6 MB/s,


a difference of about 185 MB/s.
CC
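CC's two figures, reproduced in Python for anyone who wants to try other modes (the 157.5 MHz pixel clock is the commonly quoted VESA figure for 1280x1024 @ 85 Hz):

# DAC read bandwidth at 1280x1024 @ 85 Hz, 32 bpp (4 bytes per pixel)
bytes_per_pixel = 4
peak_during_scanout = 157.5e6 * bytes_per_pixel               # pixel clock includes blanking overhead
average_over_second = 1280 * 1024 * 85 * bytes_per_pixel      # active pixels * refresh rate only
print(peak_during_scanout / 1e6, average_over_second / 1e6)   # 630.0 vs ~445.6 MB/s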
 
Very true. The one advantage with the DAC reads is that they are independent and predictable - which is not remotely true for the 3D block...
 
Almost OT but not quite :)
Is it safe to say that whenever an R9700 is faster than an R9500P, the latter is bandwidth limited? Same core and clock, only half the bandwidth available. In benchmarks it's very rare that both run at the same speed (except in synthetic VS/PS tests), so almost all situations are bandwidth limited on the 9500P, right?
I ask because I own one, and I wonder how it will handle HL2 and the like, and whether 128-bit memory will heavily hurt the efficiency of the 8 pipes, or whether I should consider moving to a 256-bit memory card (cheap used 9700Ps will become available when the 9900 gets out ;) )
 
nyt said:
Is it safe to say that whenever an R9700 is faster than an R9500P, the latter is bandwidth limited?

It is safe to say that some part of the scene is bandwidth limited.

I ask because I own one, and I wonder how it will handle HL2 and the like

If HL2 is shader limited (more specifically, arithmetic limited in the shaders), then the difference between the R9700 and R9500P might get smaller.
 
Well... that depends. See, HL2 is going to look really bad if you use multisampling FSAA, because you end up sampling texels that don't belong to the polygon you want, and if it's a lightmap you're doing the sampling on then you can end up with horrible lighting errors. As Gabe explained, you get those errors when multisampling in any game that uses lightmaps, but you probably don't notice them as much; HL2 uses a whole lot of complicated lighting and they've really cut down on the other graphical errors, which makes this one stand out more.

The way to fix the problem is to use centroid sampling, which limits the samples to the polygon that you want, but there's a problem with that: DX requires PS 3.0 compliance in order to use centroid sampling, even though the R3x0 series does have the hardware to do it. Unless ATi and Valve can get around DX's requirement of PS 3.0, you'll have to use the PS 2.0 units to clamp the samples, which takes more pixel shader power, which is already the limiting factor. On nVidia cards the PS 2.0 path is the only option, as they don't have the hardware for centroid sampling, which means FSAA without horrible graphical errors is going to be pretty much impossible given how much the NV3x is already limited by its DX9 shader performance.
 