How to calculate theoretical max throughput in GB/s?

Frontino

Newcomer
Which elements of a GPU do I have to consider, and what operation do I need to perform, to know the maximum theoretical throughput in bytes per second that, say, a G80 can handle?

For the older NV30, I theorized: 256-bit (which was the internal architecture of the chip, right?), so 32 bytes * 500MHz = 16,000 MB/s, which corresponded to the bandwidth of the card's RAM.

Was that a correct operation?
 
Bandwidth would be the term you're looking for.

You can derive a card's bandwidth to its video RAM by taking its bus width and dividing by 8 (to get bytes rather than bits), then multiplying by the transfer rate of the RAM. E.g. GTX 285: 512-bit bus (divide by 8 yields 64 bytes per transfer), multiplied by a RAM transfer rate of 2.484 billion transfers per second (1242MHz RAM clock, DDR = 2484 MT/s). 64 x 2.484 billion ~= 159 GB/s.
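That arithmetic can be sketched as a small helper; the GTX 285 numbers below are the ones quoted in the post:

```python
def memory_bandwidth_gbs(bus_width_bits, transfers_per_second):
    """Peak memory bandwidth in GB/s: (bus width in bytes) x (transfers per second)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second / 1e9

# GTX 285: 512-bit bus, 1242MHz GDDR3 (DDR -> 2484 MT/s effective)
print(memory_bandwidth_gbs(512, 2484e6))  # ~159 GB/s
```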
 
Yes, that is correct.

To calculate memory bandwidth, you need to know the width of the bus and multiply it by the effective frequency. For instance, the GTS 250 has a 256-bit bus and an effective frequency of 2200MHz, so the memory bandwidth would be:

256/8 (= 32 bytes) x 2200MHz = 70.4 GB/s

For calculating the theoretical arithmetic output you need to know the number of shader cores, their frequency, and the number of operations each core can do in one clock cycle. For example:

Each RV870 core can do one multiply + one add operation each cycle, i.e. 2 operations; each core is clocked at 850MHz, and there are 1600 of them.

The equation would be:
2 (operations per cycle) X 1600 (cores) X 850MHz (core frequency) = 2.72 TeraFLOPS
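The same calculation in code; carrying the units through (850MHz = 850e6 Hz) shows the result lands at 2.72 x 10^12, i.e. teraFLOPS:

```python
def peak_flops(ops_per_cycle, core_count, clock_hz):
    """Theoretical peak FLOPS: ops/cycle x cores x clock."""
    return ops_per_cycle * core_count * clock_hz

# RV870 (Cypress): 1600 ALUs at 850MHz, 1 MUL + 1 ADD per ALU per cycle
print(peak_flops(2, 1600, 850e6) / 1e12)  # 2.72 TFLOPS
```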
 
Sorry. I didn't explain myself correctly.

What I'm asking is how to calculate the GB/s of the GPU, not the video RAM.
Since the GFLOPS could vary depending on the types of operation performed, a throughput in GB/s would be more general and equal across all situations.

So, HOW DO I CALCULATE THE MAX THROUGHPUT IN GB/S OF THE GPU?
 
You're confusing terms. Instruction throughput is measured in operations per second, not gigabytes. You may be able to calculate some approximation of how many bytes are consumed by the instructions, but you'd have to look at the entire instruction stream to get anything more significant than a gross approximation.
 

You simply can't do that, because a GPU usually has multiple different units that operate at different frequencies, and they have different parameters too!

For instance, the GPU contains:
1- A hardware rasterizer, which outputs one triangle every clock cycle, so the unit here is MTri/s (millions of triangles per second).
2- Texture units, which output one texel every clock cycle, so the unit here is GTexel/s (gigatexels per second).
3- ROPs, which output one pixel every clock cycle, so the unit here is GPixel/s (gigapixels per second).
4- Shader cores: already explained.

And for each type you calculate its throughput by knowing how many units there are and their frequencies. For example, in the GTS 250:

There is one rasterizer operating at 738MHz, so: 1 X 738 = 738 MTri/s
There are 64 texture units operating at 738MHz, so: 64 X 738 = 47.2 GTexel/s
There are 16 ROPs operating at 738MHz, so: 16 X 738 = 11.8 GPixel/s
There are 128 shader cores operating at 1836MHz, so: 2 X 128 X 1836 = 470 GFLOPS
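All of those per-unit rates follow the same pattern, units x clock (x ops per cycle for the shaders); a sketch using the GTS 250 figures (note that 64 x 738MHz works out to ~47.2 GTexel/s):

```python
def unit_rate(unit_count, clock_hz, outputs_per_cycle=1):
    """Outputs per second for a fixed-function GPU block."""
    return unit_count * clock_hz * outputs_per_cycle

# GTS 250 figures as given in the post
print(unit_rate(1, 738e6) / 1e6)        # 738.0  MTri/s   (rasterizer)
print(unit_rate(64, 738e6) / 1e9)       # ~47.2  GTexel/s (texture units)
print(unit_rate(16, 738e6) / 1e9)       # ~11.8  GPixel/s (ROPs)
print(unit_rate(128, 1836e6, 2) / 1e9)  # ~470   GFLOPS   (shader cores)
```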
 
Look.
Aren't "triangles", "texels" and FLOPS just a series of zeros and ones? Of digital bits?
Then I'm sure it's possible to figure out how many bytes pass through each unit of the GPU in a second.
Is there a way to know the datapath width of every single component of the chip, so I can make the math by myself?
 
:LOL:
The thing is, that byte math is really only applicable to memory. All processors (multipliers, adders, dividers, etc.) are sequence changers: they change sequences of ones and zeroes into new sequences; people just use pretty names for them. These processors receive bits from RAM and store them in caches or registers, so we must know the exact number of those in each GPU, and then we need to know their width (how many bits fit in each register). After being received, these bits are changed into new bits and written to new registers and then to RAM. You shouldn't try to measure how many bits there are during the sequence-changing process; you should only try to measure the incoming and outgoing bits, and that is where knowing the register count and size for each GPU comes in handy.

However, I remember something about the RV870 having a cache system that is able to manipulate 1 terabyte or more of data.
 
You'll need to narrow down what you are looking for. Some of those units you are using don't have a fixed or single binary length to begin with.

It is possible to guesstimate things that are disclosed or software-visible. There are theoretical peaks and various special cases that make a single hard number difficult to arrive at.

Using Cypress as an example, the bandwidth of components like the data paths from the texture caches, shared memory, or the read crossbar is available because those are disclosed.

ALU operand bandwidth can be calculated based on disclosed data about the number of register file ports per unit.
For example, a single 1/16 portion of a Cypress SIMD can pull in 12 operands per cycle, even though its 5 ALUs, each capable of a three-operand FMADD, could consume 15. The fifth unit must piggy-back on a shared or unused operand port if it is to be used.

The more specialized hardware like ROPs and TMUs can have some rough estimates based on the precision of the units and their disclosed rates, but anything done internally would not be figured into this.
The other stuff like the command processor, various state machines and their buffers, setup pipeline, rasterizer, and special function hardware can have a lot of implementation-specific internal data traffic that we would not be privy to.
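As a rough illustration of the operand-bandwidth point above, one can multiply the disclosed per-lane operand count by the operand width, lane count, SIMD count, and clock. The Cypress figures below (12 operands per lane per cycle, 16 lanes per SIMD, 20 SIMDs, 850MHz) come from the disclosed specs discussed here; the result is a ceiling on register-file traffic, not measured traffic:

```python
def operand_bandwidth_tbs(operands_per_lane, operand_bytes, lanes, simds, clock_hz):
    """Upper bound on register-file operand traffic, in TB/s."""
    return operands_per_lane * operand_bytes * lanes * simds * clock_hz / 1e12

# Cypress: 12 operands/lane/cycle, 32-bit (4-byte) operands,
# 16 lanes per SIMD, 20 SIMDs, 850MHz core clock
print(operand_bandwidth_tbs(12, 4, 16, 20, 850e6))  # ~13.06 TB/s
```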
 
Also, you can try to extrapolate from the previous texel/pixel/triangle/FLOP figures by knowing the word size for each of them. For example: pixels can be 32-bit (128-bit in HDR), vertices can be 64- or even 128-bit, texels could be the same, and FLOPs are usually 64-bit in size.
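That extrapolation can be sketched directly. The word sizes are the assumptions from the post above, so treat the results as rough upper bounds rather than real traffic figures:

```python
def rate_to_gbs(items_per_second, bits_per_item):
    """Convert an items/s rate to GB/s, given an assumed word size in bits."""
    return items_per_second * bits_per_item / 8 / 1e9

# GTS 250 texel rate (64 units x 738MHz), assuming 32-bit texels
print(rate_to_gbs(64 * 738e6, 32))  # ~188.9 GB/s
# GTS 250 pixel rate (16 ROPs x 738MHz), assuming 32-bit pixels
print(rate_to_gbs(16 * 738e6, 32))  # ~47.2 GB/s
```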
 
Sorry, I didn't see 3dilettante's post. I was surprised to learn that:
"Some of those units you are using don't have a fixed or single binary length to begin with."
That explains a lot! But shouldn't registers have a fixed length, though? Like maybe 128-bit or whatever?
 
Frontino,

I think it's next to impossible to sum up all the datapaths within a GPU without some access to company secrets from the manufacturers. Sometimes they might tell you that their internal ringbus/crossbar/whatever is capable of transferring x GiByte/s, but there's much more traffic going on, and even if all the relevant maximum datapaths were disclosed, there'd be no way to know which ones might share routing, i.e. block each other during concurrent transfers.
 
"But shouldn't registers have a fixed length, though? Like maybe 128-bit or whatever?"

If you mean the physical sections of a register file that store an operand are of a fixed length, yes. They are physically set in silicon.
A FLOP, however, is not a physical object.
For example, a single-precision FLOP is going to use 32-bit operands, while double-precision will use 64-bit operands.
Internally, a GPU would treat two separate 32-bit register locations as a combined 64-bit value.
A floating-point ADD or MUL takes 2 operands, and two of them issued together would take 4.
However, the general usage of FLOPs counts an FMADD as two FLOPs even though it takes only 3 operands.

A software-visible "register" is not necessarily physically represented in a 1:1 manner.
Cypress uses 4x32-bit registers, which can be parceled out as individual 32-bit operands, subject to a lot of restrictions.

Many programmable operations will by default upgrade the data they work on to the width of the register file components or data paths, but special-purpose hardware operations can use reduced width data paths (or expanded in special-cases like the FMA instruction).
Some items may be a certain size at one point in the pipeline and have a different size altogether later, depending on factors like graphics mode, format, and what kind of work is being done.
These representations can involve quite a bit of internal state and data that aren't really discussed because programmers can't or shouldn't be able to see them directly.
 