Historical GPU arithmetic performance

Confirmed: I timed a large loop of interleaved, independent ADDPS and MULPS instructions on a 3.2 GHz Prescott and on a 2.0 GHz Northwood, and I get a throughput of 1 instruction/cycle on both.
I stand corrected. Thanks a lot for the conclusive testing!
 
So in what ways do Core2 and Nehalem improve on P4 then in terms of floating point throughput? On the face of it, it would seem a high-end Pentium D should be faster in floating point than a high-end C2D, since they each push the same FLOPS/clock and the PD clocks higher. I know that's not the case in the real world, but I'm curious as to the difference.
 
Core 2 (and i7) have twice the SIMD execution unit width. Simply put, NetBurst has two multipliers and two adders, which as confirmed can work simultaneously so it can do 4 FLOPS/clock. Core 2 has four multipliers and four adders so it can do 8 FLOPS/clock.
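
To put rough numbers on that, here is a minimal sketch of the peak-throughput arithmetic: FLOPs per clock times clock frequency times core count. The FLOPs/clock figures are the ones stated above; the clock speeds (3.6 GHz and 3.0 GHz) are illustrative assumptions rather than figures from this thread, and `peak_gflops` is a hypothetical helper name.

```python
def peak_gflops(flops_per_clock: int, clock_ghz: float, cores: int = 1) -> float:
    """Theoretical peak single-precision GFLOPS: FLOPs/clock x GHz x cores."""
    return flops_per_clock * clock_ghz * cores


# FLOPs/clock per core as stated in the post: NetBurst 4, Core 2 8.
# The clock speeds are assumed example values, not measurements.
netburst = peak_gflops(4, 3.6, cores=2)
core2 = peak_gflops(8, 3.0, cores=2)

print(f"NetBurst dual core: {netburst:.1f} GFLOPS peak")
print(f"Core 2 dual core:   {core2:.1f} GFLOPS peak")
```

Under these assumptions the higher clock does not come close to making up for half the SIMD width. These are theoretical peaks only; real code rarely sustains them.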
 
Oh yeah, sorry I was just being dumb, for some reason I thought the C2D and the P4 were doing the same number of ops per clock/per core in the previously posted figures. Obviously they weren't though.

Cheers for the explanation anyway!
 
This might not be entirely on-topic but I found an interesting benchmark from SiSoft Sandra comparing CPU vs. GPGPU performance
Oh, nice. Some observations:
- their 9600GT has 96 shaders...
- Performance of the CPUs is basically halved with double vs. single float (as you'd expect). Performance of the GTX260 is 1/6th (theoretical: 1/8th), so that makes sense (though as usual I have to wonder about nvidia's dedicated DP unit: what's the point if you get beaten by a CPU anyway?). Performance of the 9600GT is 1/12th or so; I don't know what the code looks like to do it, but that seems reasonable. However, performance of the HD4870 stays virtually the same, and the HD3850 also takes a very small hit, way below the theoretical hit (should probably be 1/4th). So any ideas why the SP results are so low for these? What's the limitation here? Scalar code?

btw there's also memory bandwidth comparison:
http://www.sisoftware.net/index.html?dir=qa&location=cpu_vs_gpu_mem&langx=en&a= - again the HD4870 fails to live up to its potential (hardly faster than HD3850)
 
Didn't they use Brook+ to get the code running on ATI? Thought I read that somewhere.

Brook+ performance can be far short of theoretically obtainable - both due to basic compilation quality and due to the use of the memory hierarchy.

I think CUDA tends to be more controllable, as it's a less abstracted approach to programming the GPU.

Even on pure scalar code an HD4870 should be faster than a 9600GT - 160 ALUs x 750MHz = 120G instructions/s versus 64 ALU lanes x 1625MHz = 104G instructions/s. That's all the sanity check one needs to dismiss the quality of the so-called arithmetic code being executed by this benchmark...
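
The sanity check above is just ALU count times clock. A small sketch that reproduces it, using only the figures quoted in the post (`scalar_rate_ginstr` is a hypothetical helper name):

```python
def scalar_rate_ginstr(alus: int, clock_mhz: float) -> float:
    """Scalar issue rate in billions of instructions per second,
    assuming one scalar instruction per ALU (or ALU lane) per clock."""
    return alus * clock_mhz / 1000.0


hd4870 = scalar_rate_ginstr(160, 750)     # 160 ALUs at 750 MHz
gf9600gt = scalar_rate_ginstr(64, 1625)   # 64 ALU lanes at 1625 MHz

print(f"HD4870: {hd4870:.0f} G instructions/s")
print(f"9600GT: {gf9600gt:.0f} G instructions/s")
```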

GTX260 should be 3x faster than 9600GT, too. Even in this respect the benchmark is void in some fundamental way.

Jawed
 
Not that I think that any game is going to use doubles anytime soon, but with that said, it's easy to run out of precision on a 32-bit float, so it's not like it's useless for games. You can usually work around float precision problems though.
 
Aye, that's the general gist of what I was implying. A game could possibly benefit from double precision but it isn't needed. And the hit you would take to performance would hardly justify using double precision rather than using a workaround.

Regards,
SB
 
What kind of improvement would we see between SP vs DP in graphics?
In practice, none. Someone who would use doubles in graphics to solve a precision issue simply isn't using the range of a float optimally or uses formulas that are not robust.

Let's not forget that merely seven years ago the norm was 8 bits of precision per color channel; now it's 32 bits (of which 23 are mantissa bits). Since every extra bit doubles the precision, that's a truly massive increase in little time. So double-precision floating-point numbers (a misnomer, really, since 29 extra mantissa bits give far more than twice the precision) offer nearly unimaginable precision: enough to describe the height of mountains in nanometers, measured from the center of the earth! Only some esoteric scientific applications actually require such precision.
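
The "nanometers from the center of the earth" claim can be checked by looking at the spacing between adjacent representable values at that magnitude. A minimal sketch, assuming a mean earth radius of 6.371e6 m and Python 3.9+ for `math.ulp`; `ulp_f32` is a hypothetical helper that emulates float32 with the standard `struct` module:

```python
import math
import struct


def ulp_f32(x: float) -> float:
    """Gap between float32(x) and the next larger float32 (x positive, finite)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    here = struct.unpack("<f", struct.pack("<I", bits))[0]
    nxt = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return nxt - here


EARTH_RADIUS_M = 6.371e6  # assumed mean radius in metres

print("float32 step at earth radius:", ulp_f32(EARTH_RADIUS_M), "m")
print("float64 step at earth radius:", math.ulp(EARTH_RADIUS_M), "m")
```

At that distance a 32-bit float resolves steps of half a metre, while a double resolves steps of a little under a nanometre.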

Although single-precision is far less accurate than that, it's still a massive leap beyond 8 bits and we haven't reached the limit of its capabilities yet. The reason the hardware supported 32 bits so quickly is that lower precisions (16 and 24 bit) are not a standard outside graphics and that extending the precision is relatively cheap. For the same reasons some GPUs already support 64 bit, though not for graphics.
 
Let's not forget that merely seven years ago the norm was 8 bits of precision per color channel; now it's 32 bits (of which 23 are mantissa bits).

Back in those days they only computed color information though. As soon as you want anything more than that you need a good deal more precision. ATI went to FP24 in R300 simply because that's what was required to get enough subpixel precision from an interpolated texture coordinate. In the vertex pipeline it was always FP32.

It's not hard to get precision issues with FP32. You wonder why you're only getting point sampling despite having enabled a linear filter; it turns out you ran out of fractional bits in the texture coordinate and there's nothing wrong with the filter. Pixel shaders get all kinds of inputs these days that, unless you take special care, can run into precision problems. Something as simple as a world space position can break down as soon as you move some distance away from the origin. Models or even individual vertices start shaking randomly (we call it the "jitter bug" over at Avalanche ;)). It can usually be handled by changing what parameters you send, what order you do operations in, and so on. Any form of time input needs to be reset now and then just so that things don't get jerky after you've played the game for an hour or two.
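
As an illustration of the "change the order of operations" kind of fix, here is a minimal sketch of one common approach: subtract the camera position at higher precision first and only send the small offset as FP32. This is an assumed example, not necessarily what any particular engine does; the scene values and the `to_f32` helper (float32 emulated via the standard `struct` module) are hypothetical.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Hypothetical scene: a vertex 12.3 cm from a camera that sits 100 km from the origin.
camera_x = 100_000.0
vertex_x = 100_000.123
true_offset = 0.123

# Naive order: both world-space values are rounded to FP32, then subtracted.
naive = to_f32(to_f32(vertex_x) - to_f32(camera_x))

# Reordered: subtract in double precision first, then round the small offset to FP32.
rebased = to_f32(vertex_x - camera_x)

print(f"naive error:   {abs(naive - true_offset):.3e} m")
print(f"rebased error: {abs(rebased - true_offset):.3e} m")
```

In the naive order the vertex snaps to the FP32 grid at 100 km, which is several millimetres coarse, so the offset is off by about 2 mm. In the reordered version the error drops to the order of nanometres.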

With that said, I don't think FP64 is going to take over anytime soon. But I wouldn't rule out that at some point some people will start using it for everything just for convenience. Much like how FP32 is the norm in pixel shaders today, and FP16 is a thing of the past (except on PS3, where it continues to cause trouble).
 
So, improving precision can reduce bugs and virtualize a more accurate environment?
Take clipping, for instance: is it caused by imprecise calculations?

About HDRR: does it affect only lighting or other things also?

Would it be good to have FP-precision monitors as well? (I can't believe we can see just 16M colors at max)
 
I would much rather expect Extended Precision (32-bit mantissa, I-don't-remember-how-many-exponent-bits) to become standard rather than FP64, if we are to ever move away from FP32 (which I'm slightly skeptical about, frankly, unless we're thinking more than 5 years down the road?) - this is because it allows significant synergies with 32-bit INT for the execution units, although data paths still need to be wider obviously.
 
So, improving precision can reduce bugs and virtualize a more accurate environment?
It's not the solution to making real-time graphics more lifelike, if that's what you mean. Graphical artifacts are usually a consequence of taking shortcuts in the calculations to speed things up. But when using robust unapproximated formulas one can create very realistic images with just 32-bit floats.
Take clipping, for instance: is it caused by imprecise calculations?
Depends on the kind of clipping you're talking about. Intersecting objects are really a collision detection / physics problem. For that there also exist robust FP32 computations, but it's relatively expensive to fully eliminate object intersection.

Z-clipping is either caused by using a bad near / far plane ratio, or a low-precision z-buffer (less than 32-bit).
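
To see why the near / far ratio matters so much, here is a rough sketch of how much view-space distance a single depth-buffer step covers. It assumes a standard perspective depth mapping d(z) = far * (z - near) / (z * (far - near)) into [0, 1] and a fixed-point buffer; the plane distances and the `depth_step` helper are hypothetical examples.

```python
def depth_step(z: float, near: float, far: float, bits: int = 24) -> float:
    """Approximate view-space distance covered by one depth-buffer step at depth z.

    Uses the slope of d(z) = far * (z - near) / (z * (far - near)),
    which is far * near / (z^2 * (far - near)), and a fixed-point buffer
    with 2**bits levels.
    """
    slope = far * near / (z * z * (far - near))
    return 1.0 / (slope * (2 ** bits))


# Same far plane and same object distance, two different near planes.
tight = depth_step(z=1000.0, near=1.0, far=10_000.0)
loose = depth_step(z=1000.0, near=0.1, far=10_000.0)

print(f"near = 1.0: one step covers about {tight:.3f} units at z = 1000")
print(f"near = 0.1: one step covers about {loose:.3f} units at z = 1000")
```

Under these assumptions, pulling the near plane in by a factor of ten makes the depth resolution at a distance roughly ten times coarser, which is where z-fighting comes from.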
About HDRR: does it affect only lighting or other things also?
Just lighting. It's based on the physical law of exposure.
Would it be good to have FP-precision monitors as well? (I can't believe we can see just 16M colors at max)
Yes, the range of today's consumer monitors is pretty poor. Pure white is typically set to a value comfortable for reading text, but for movies and games that's very low. We need a standard in which 1.0 is paper white but it can also display higher intensities.
 