The Official NVIDIA G80 Architecture Thread

Second, the texture fill rate given is 18400M/s, while NVIDIA's site (as well as many other sites) lists 36800M/s. My own testing shows that it's 18400M/s (the numbers I actually got were between 18114M/s and 18192M/s). This is because there are only 32 texture addressing units; NVIDIA's number comes from the 64 texture filtering units. I think it's confusing the matter somewhat... I think the best rating here is the fastest rate at which the pixel/fragment shader can receive texture samples (which is 18400M/s, as this site lists).
Yeah they have been confusing people on that.

http://enthusiast.hardocp.com/article.html?art=MTIxOCw2LCxoZW50aHVzaWFzdA==

NVIDIA whitepapers said:
In essence, full speed bilinear anisotropic filtering is nearly free on GeForce 8800 GPUs. FP16 bilinear texture filtering is also performed at 32 pixels per clock (about 5x faster than GeForce 7x GPUs), and FP16 2:1 anisotropic filtering is done at 16 pixels per clock. Note that the texture units run at the core clock, which is 575MHz on the GeForce 8800 GTX.

At the core clock rate of 575MHz, texture fill rate for both bilinear filtered texels and 2:1 bilinear anisotropic filtered texels is 575MHz x 32 = 18.4 billion texels/second. However, 2:1 bilinear anisotropic filtering uses two bilinear samples to derive a final filtered texel to apply to a pixel. Therefore, GeForce 8800 GPUs have an effective 36.8 billion texel/second fill rate when equated to raw bilinear texture filtering horsepower.

They'd like to say it's 36.8 GT/s, but real-world performance has clearly shown that it isn't. It seems those extra TF units only come in handy in AF and trilinear situations.
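For what it's worth, both figures fall out of the same multiplication; only the unit count differs. A minimal C sketch, using the 575MHz core clock and the 32 addressing / 64 filtering unit counts quoted above (the helper name is made up for illustration and does not come from any NVIDIA tool):

```c
#include <stdint.h>

/* Peak texel rate in Mtexels/s: core clock in MHz times the number of
 * units that can each produce one result per clock.
 * Hypothetical helper, for illustration only. */
static uint32_t peak_mtexels_per_sec(uint32_t core_clock_mhz, uint32_t units)
{
    return core_clock_mhz * units;
}
```

Calling peak_mtexels_per_sec(575, 32) gives the 18400M/s addressing-limited figure, and peak_mtexels_per_sec(575, 64) gives the 36800M/s filtering-limited one.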
 
If the filterers are decoupled from the addressing units, I think 36.8 GT/s is the fairest number to be honest, because they ought to get excellent overall efficiency out of it. If each filterer is hardwired to an addresser, then you'd have a pretty good argument for 18.4 GT/s, I guess.

Honestly, I'm not even sure how we could test that. You'd need to vary your sampling rate between 1 and 4+ bilinear samples per pixel, and given the adaptivity of any modern anisotropic filtering algorithm, that seems rather hard to manage to me.

I do see one way to do it, potentially, but it's rather complicated: First, create a single octo-textured quad (two tris, since the hardware doesn't natively handle quads) with the appropriate slope so that you can be sure a high level of anisotropic filtering is being applied, and write down the performance. Then, do the same but with blatant texture magnification. Write down the performance again. Then do the same but with the quad highly subdivided and every other triangle using magnification. In theory, if the addressing and filtering units are decoupled, the performance should be the average of the two previous scenarios...
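The final comparison in that test could be sketched roughly like this in C. Big assumptions here: "performance" is taken to mean frame time in milliseconds, the two kinds of triangles cover equal screen area, and the tolerance is arbitrary. The function names are invented for this sketch and nothing in it has been run against real hardware:

```c
#include <math.h>
#include <stdbool.h>

/* Frame time predicted for the mixed scene if addressing and filtering are
 * decoupled: the mean of the two pure-scene frame times.
 * Assumes half of the pixels fall in each sampling mode. */
static double expected_mixed_ms(double aniso_ms, double magnify_ms)
{
    return 0.5 * (aniso_ms + magnify_ms);
}

/* True if the measured mixed-scene frame time lies within rel_tol
 * (e.g. 0.05 for 5%) of the "average" prediction. */
static bool matches_decoupled_prediction(double aniso_ms, double magnify_ms,
                                         double mixed_ms, double rel_tol)
{
    double expected = expected_mixed_ms(aniso_ms, magnify_ms);
    return fabs(mixed_ms - expected) <= rel_tol * expected;
}
```

A mixed result that lands close to the slower of the two pure scenes, rather than near their average, would hint that each filterer is tied to an addresser.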

Maybe I should bother implementing that one of these days... :)


Uttar
 
And now that we've determined what they don't have to do with. . .? ;)

The G80 has pretty much done away with quads right? Do we expect the R600 to do away with quads?
 
SIMD and scalar are orthogonal. You can have one without the other, or you can have both (or neither).

Well, if we were to describe the smallest processing unit of G80 as capable of working on the same channel of four different elements, how does that fit a scalar definition?
 
From the point of view of a single fragment or vertex it's scalar. That's what is important to get maximum utilization in a programming model that supports operations on scalars, vec2, vec3 and vec4.
 
I get that, but I thought Bob's comment about scalar and SIMD being orthogonal was a more general statement. If we're defining scalar based on the number of components per element involved in a given instruction, then it matters little whether G80 is a 4x4 or 16x1 configuration.
 
Well, I was wondering if Bob was going to say that quads are still doing their thing in G80, and the reasons why. Because, as you point out, the logic of his answer seems to suggest there is no reason they couldn't use quads consistent with their descriptions. . .and his answer also suggests there might possibly be several reasons unrelated to scalar/SIMD why they would still use quads. It was a question we asked, and did not receive an answer to.
 
The G80 has pretty much done away with quads right?

I don't know. NV certainly does not talk about them in the G80 paradigm they are trying to teach all of us. However, if I'm reading Bob correctly above (and there's always the possibility I'm not --I hope we've determined by now that I'm *not* Rys and/or Uttar at this level of detail) then nothing in the new paradigm excludes quads.
 
If the filterers are decoupled from the addressing units, I think 36.8 GT/s is the fairest number to be honest, because they ought to get excellent overall efficiency out of it. If each filterer is hardwired to an addresser, then you'd have a pretty good argument for 18.4 GT/s, I guess.
The former case would be akin to saying that NV10 had the same texture rate as NV15. It only has 32 pixels' worth of addressing capability, and the full texturing capability can only be accessed in certain scenarios that involve 2x sampling per pixel (much like NV10).
 
...then nothing in the new paradigm excludes quads.

Hmm...

B3D article by Rys said:
There are 128 such processors (called SPs by NVIDIA) in a full G80, grouped in clusters of 16, giving the outward appearance of an 8-way MIMD setup of 16-way SIMD SP clusters.

...
Each cluster has its own data cache and associated register file

Doesn't sound like quads to me.
 
Nobody's suggesting that the single SP and the 16 SP cluster don't exist --just whether there's another intermediate functional grouping in between. In other words, I don't see Rys's description excluding the possibility. And looking at the way Tridam explains it upstream, you get a sense of the malleability of the language.
 
If quads are not there anymore, how do they compute the texture LOD every time we fetch a texture using, as the texture coordinate, a value that has not been derived from one or more interpolants?
Basically you'd need to run your shader 3 times per pixel... it would not make much sense to not have quads anymore IMHO
 
trinibwoy said:
Well if we were to describe the smallest processing unit of G80 as capable of working on the same channel of four different elements how does that fit a scalar definition?
Uhh, scalar SIMD?

Scalar != MIMD. Take G71's vertex shaders for example. They're MIMD, yet operate on 4+1 vectors. And yet we know we can have vector SIMD machines. This can only be explained by "SIMD vs MIMD" being a separate concept from "scalar vs vector".
 
The relationship between scalar/vector and SIMD/MIMD isn't what I was getting at. Just trying to understand the definition of "vector". It seems the definition of vector is dependent on an execution unit operating on multiple components of the same element - which is stricter than the more general definition I had in mind.
 
It seems the definition of vector is dependent on an execution unit operating on multiple components of the same element - which is stricter than the more general definition I had in mind.

That's just because the SIMD shader cores that have been used so far have operated like this.

Calling G80 a bunch of scalar cores is really pushing it. Really, G80 has 8 massive vector cores, each with 16 ALUs.

The difference is in the way these cores are sequenced by the driver/compiler. The old way was AOS, G80 is SOA.

Cheers
 
AOS? SOA?

Could you expand the acronyms please ;)
 
AOS = array of structures, SOA = structure of arrays
Code:
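/* AOS: one array whose entries each hold all four components. */
struct vec4 {
  float x, y, z, w;
} aos[32];

/* SOA: one structure holding a separate 32-entry array per component. */
struct vec4soa {
  float x[32];
  float y[32];
  float z[32];
  float w[32];
} soa;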
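To make the difference in sequencing a bit more concrete, here is a small C sketch that computes the squared length of the xyz part of 32 elements in both layouts. The function names and the ELEMS constant are invented for this example, and the loops only illustrate the idea of "one component of many elements per step"; they are not meant to describe the actual G80 instruction stream:

```c
#include <stddef.h>

#define ELEMS 32

struct vec4 { float x, y, z, w; };
struct vec4soa { float x[ELEMS], y[ELEMS], z[ELEMS], w[ELEMS]; };

/* AOS sequencing: each step touches several components of ONE element.
 * A vec3 operation on 4-wide vector hardware leaves the w lane idle. */
static void length_sq_aos(const struct vec4 *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i].x * in[i].x + in[i].y * in[i].y + in[i].z * in[i].z;
}

/* SOA sequencing: each loop is one scalar operation applied to the SAME
 * component of every element, so all lanes stay busy whether the shader
 * works on scalars, vec2, vec3 or vec4. */
static void length_sq_soa(const struct vec4soa *in, float *out)
{
    for (size_t i = 0; i < ELEMS; ++i) out[i]  = in->x[i] * in->x[i];
    for (size_t i = 0; i < ELEMS; ++i) out[i] += in->y[i] * in->y[i];
    for (size_t i = 0; i < ELEMS; ++i) out[i] += in->z[i] * in->z[i];
}
```

Both functions produce the same values; what changes is which dimension the hardware would run in parallel: the components of one element (AOS) or one component across many elements (SOA).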
 