NV30 pixel shader, very interesting...

A question to start: when a company like Nvidia quotes a FLOP count for its processors (supposedly 200 GFLOPS for NV30), does it count only one type of FLOP operation (FMAD), or does it also count special purpose floating point ops in addition to the FMADs?

I recall 51 GFLOPS for the pixel shader in NV30. With each virtual pipeline having 4 FMAD units running at half precision, the number would have been obtained by counting something like: 2 ops per MAD * 2 half-float ops * 4 FMAD units per pipe * 8 pipelines * 400 million cycles per second = 51.2 GFLOPS. It seems Nvidia is not counting the special purpose pixel units, which should be able to execute 1 op per cycle and be included in the 8 pipelines. I am sure they have to be in there, or sin, cos, log2, ddx, ddy, etc. support would be missing, or not feasible. This caught my attention. Any thoughts?
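
For anyone who wants to check the multiplication, here it is as a small Python snippet; the per-unit counts are my reading of the rumoured NV30 layout, not confirmed specs:

ops_per_mad = 2       # a MAD counts as a multiply plus an add
half_float_rate = 2   # two half-precision ops per full-precision slot (assumed)
fmad_units = 4        # assumed FMAD units per virtual pipeline
pipelines = 8
clock_hz = 400e6      # 400 MHz

gflops = ops_per_mad * half_float_rate * fmad_units * pipelines * clock_hz / 1e9
print(gflops)  # 51.2 -- matches the quoted pixel shader figure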
 
Well, as you said yourself, it's 200 GFLOPS for the NV30 as a whole and 50 GFLOPS for the NV30 pixel shader.

I don't quite see how they could get to 200 GFLOPS if they only counted traditional floating point operations. 50 out of 200 is exactly 1/4, and the pixel shader probably accounts for more than 1/4 of the traditional floating point operations, at least 1/3...

Or is my reasoning just plain wrong :?:


Uttar
 
Nice numbers. :)
Wait a few months and those numbers will look like the number 1 does compared to 10K. ;)
 
Remember Nvidia's LUDICROUS floating point numbers for all of their chips.

Riva128 = 16 GFLOPs (thus more than 2x Playstation2's 6.2 GFLOPs)

GeForce256 = 50 GFLOPs

GeForce3 = 76 GFLOPs

:rolleyes: :rolleyes: :rolleyes:

Nvidia's GFLOPs are what I'd like to call "NVFLOPs": they seem about 10x trumped up beyond the standards that even SONY goes by.
 
The only FLOP I see is the tardy launch - very Bit Boyish...

A reasonable FLOP measure would surely be an average over the composite 3D instructions typical in processing 3D geometry. Even something as dumb as counting all maths instructions by their frequency of occurrence in 3DMark and running that mix through your shaders for 10 minutes would do.

Also, are their figures maximum burst rates for 1/1000th of a second, or minimum sustained rates under continuous load for hours?
 
megadrive0088 said:
Remember Nvidia's LUDICROUS floating point numbers for all of their chips.

Riva128 = 16 GFLOPs (thus more than 2x Playstation2's 6.2 GFLOPs)

They are just counting how much raw FLOP power is used up in the rasterization pipeline. This number corresponds to how much CPU power you'd have to have if you wanted to render the scene in software.

3dfx made similar claims in their ads of over a billion ops.

Remove the GS from the PS2 and let's see if the EE could rasterize in software faster than the Riva128. Or a Pentium 4, for that matter.

Don't get caught up in comparing HW-accelerated fixed functions with "general purpose FLOPS".

The fact is, 3D hardware is enormously parallel and does an enormous amount of computation for each pixel. Just think of all the perspective divides, setup, and LERPs going on.
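
To put rough numbers on that, here is a back-of-the-envelope sketch in Python; every per-op count and the fillrate figure below are my own illustrative assumptions, not any real chip's datapath:

# Ops for one perspective-correct, bilinearly filtered textured pixel.
recip_and_muls = 1 + 2        # 1/w, then u and v each multiplied by it
bilinear_dot4 = 4 + 3         # 4-vector dot product: 4 muls + 3 adds
channels = 4                  # R, G, B, A
ops_per_pixel = recip_and_muls + bilinear_dot4 * channels  # = 31

fillrate = 250e6              # assume 250 Mpixels/s, single-textured
print(ops_per_pixel * fillrate / 1e9)  # ~7.75 billion ops/s on pixels alone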
 
megadrive0088 said:
Remember Nvidia's LUDICROUS floating point numbers for all of their chips.

Riva128 = 16 GFLOPs (thus more than 2x Playstation2's 6.2 GFLOPs)

GeForce256 = 50 GFLOPs

GeForce3 = 76 GFLOPs
Those weren't "FLOPS" in previous chips. They were "operations," and before the GF256, they were all integer (the 256's T&L pipe had to use 32-bit floats...).

That, and those numbers are far from unbelievable. Take a quick look at texture filtering, for example.

Let's take the TNT, at 90MHz and two bilinear textures per clock. Each bilinear filter could be done like a 4-vec dot product (one vector being the weighting, the other the samples), for a total of 7 integer operations. That makes a total of 14 integer ops between the two pipelines. Then the TNT2 could do a weighted blend between the two texture stages, requiring 3 more integer ops. That's a total of 17 ops per clock, at 90MHz, for 1.53 billion low-precision integer operations per second.

Anyway, just remember that for the older processors, most of those operations were at very low precision, and all for very special cases. This means that it would likely take a generalized processor capable of many gigaops to emulate one of these in realtime. But that is meaningless: we all know that dedicated hardware is much faster, since shortcuts can be made all over the place when the datapath is known.

But now, with the latest programmable graphics chips, the processing power is really exciting, because the usage is relatively open-ended, and the processors are incredibly powerful.
 
g__day said:
Also are their figures maximum burst rates for 1/1000 th of a second or their minimum sustained rates under continued load for hours?

Depends on what they include. Include all the specialized execution units in a GPU (texture compression, anti-aliasing, anisotropic filtering, sin/cos/log/exp/ddx/ddy, etc.), and the number can be inflated drastically. I would suspect, however, that it's not far from the numbers reported when doing normal rendering. I doubt that all of the specialized execution units will ever run in parallel together in a GPU.

Anyway, one last thing. You must realize that designing hardware to do a task is different from programming a general-purpose CPU to do the same task. Because of those differences, directly translating one to the other is foolish, so the op counts cannot be directly compared in any case.
 
Chalnoth said:
That, and those numbers are far from unbelievable. Take a quick look at texture filtering, for example.

Let's take the TNT, at 90MHz and two bilinear textures per clock. Each bilinear filter could be done like a 4-vec dot product (one vector being the weighting, the other the samples), for a total of 7 integer operations. That makes a total of 14 integer ops between the two pipelines. Then the TNT2 could do a weighted blend between the two texture stages, requiring 3 more integer ops. That's a total of 17 ops per clock, at 90MHz, for 1.53 billion low-precision integer operations per second.
You have to do the bilinear filtering for each of the 4 color components R, G, B, A, so you have to multiply that op number by at least 4, for a total of 17*4 = 68 operations per clock...
 
Oh, that's right. Shoot. I knew I was forgetting something. Yes, that makes for about 6 billion operations per second from texture filtering alone, which is what I remember calculating not long ago.
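
Spelled out with the per-channel correction, the arithmetic looks like this in Python (same op-counting assumptions as my post above):

bilinear_ops = 4 + 3    # 4-vector dot product: 4 muls + 3 adds per channel
textures = 2            # two bilinear textures per clock
blend_ops = 3           # weighted blend between the two texture stages
channels = 4            # R, G, B, A
clock_hz = 90e6         # 90 MHz

ops_per_clock = (bilinear_ops * textures + blend_ops) * channels  # = 68
print(ops_per_clock * clock_hz / 1e9)  # 6.12 billion ops per second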
 
Talking about flops, Sony is claiming that Playstation 3 will achieve 1 Teraflops. I wonder how this will compare to what Nvidia has out in 2005, when Playstation 3 will be out?
 
bbot said:
Talking about flops, Sony is claiming that Playstation 3 will achieve 1 Teraflops. I wonder how this will compare to what Nvidia has out in 2005, when Playstation 3 will be out?
Do you mean the claimed number or actual number? :D
 
I'd say it'll get quite close to it... and still with the programmability and raw speed...

also, can you say 1,024-bit DRAM bus ( on the CPU, the Broadband Engine ) running at 1-2 GHz or more?

say cheeese :D


P.S.: before anyone comments on that, remember the DRAM uses a crossbar switch mechanism ( was it 8 memory controllers administering 8 memory banks each...? )
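
For reference, the raw bandwidth arithmetic on that rumoured bus, in Python (assuming one transfer per clock, which may not match the real design):

bus_bits = 1024
for clock_ghz in (1.0, 2.0):
    gb_per_s = bus_bits / 8 * clock_ghz  # 128 bytes per transfer
    print(clock_ghz, "GHz ->", gb_per_s, "GB/s")  # 128 and 256 GB/s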
 
DemoCoder said:
They are just counting how much raw FLOP power is used up in the rasterization pipeline. This number corresponds to how much CPU power you'd have to have if you wanted to render the scene in software.

DemoCoder, are you saying that even without the Visualizer ( the GPU chip ), the Cell chip should run at close to 4x the speed of the NV30 running everything purely in software? :LOL:

hehe sorry I had to say it :)
 
Isn't the teraflop calculation taking into account the combined processing power of the maximum of 16 Cell processors in a single system?

Edit: nope, just found a PC magazine article where they say IBM claims it's for one chip.
 
I'll believe Cell when I see it. 1 Teraflop on your desktop by 2005? That's two years away. They haven't got much time.
 
there are 16 pages of argument, DemoCoder... ( the 1st page contains the link to the patent, and from page 8 onward the discussion really picks up :) )

and again, it seems that they are targeting 4 GHz on a 65 nm (SOI) manufacturing process... 2005 is not THAT near, and the specs have pretty much been finalized at this point ( projected specs )...

that is one chip...
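
Purely as arithmetic on those projected specs, here is what 1 teraflops at 4 GHz demands per cycle in Python (the unit breakdown in the comment is hypothetical, not from the patent):

target_flops = 1e12
clock_hz = 4e9
print(target_flops / clock_hz)  # 250 flops per cycle, on one chip
# e.g. a hypothetical 32 FMAD units * 4-wide * 2 flops = 256/cycle would cover it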
 
I've read the papers, and I still don't think they'll pull it off by 2005. It's not that they won't be able to build it by 2005, but I doubt they'll be able to build it cost-effectively. Just look at the issues we're having with the .13um process. Do you expect that in 2 years, on a process that isn't mature yet, they will produce the most complicated chip in history with no hitches along the way?
 