30 SIMDs running at 1.5GHz is still pretty fast, though you could argue it's no faster than ~3GHz 4-core SSE with perfect SIMD utilisation (scalar code on the GPU would be 4x faster). The GPU has more bandwidth to play with.

I can't see how a G200 can be fast with only 1/32 of its raw performance...
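As a rough sanity check on those numbers (my own assumptions, not figures from the thread: a fully divergent workload keeps just one useful lane busy per SIMD per clock, and SSE is 4 lanes wide), the peak rates work out like this:

```c
#include <stdio.h>

/* Back-of-the-envelope throughput comparison under assumed conditions:
 * fully "scalar" (divergent) GPU code -> one useful op per SIMD per clock;
 * CPU SSE at perfect utilisation -> 4 ops per core per clock. */
int main(void)
{
    double gpu_scalar = 30 * 1.5e9;      /* 30 SIMDs x 1.5 GHz, 1 lane each */
    double cpu_scalar = 4 * 3.0e9;       /* 4 cores x 3 GHz, scalar         */
    double cpu_sse4   = 4 * 3.0e9 * 4;   /* 4 cores x 3 GHz, 4-wide SSE     */

    printf("GPU scalar : %4.1f Gops/s\n", gpu_scalar / 1e9);   /* 45.0 */
    printf("CPU scalar : %4.1f Gops/s\n", cpu_scalar / 1e9);   /* 12.0 */
    printf("CPU SSE    : %4.1f Gops/s\n", cpu_sse4   / 1e9);   /* 48.0 */
    printf("GPU scalar vs CPU scalar: %.2fx\n", gpu_scalar / cpu_scalar); /* 3.75 */
    return 0;
}
```

That gives ~45 Gops/s for "scalar" GPU code versus ~48 Gops/s for perfectly utilised SSE, and a ~3.75x edge over scalar CPU code, which looks like where the "no faster than SSE / 4x faster than scalar" figures come from.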
Flat or nested? What cycle counts for the alternate paths? What kinds of scatter/gather operations is the code doing?

Random.
Looking at the code and from what I remember, the vector CS is faster than the scalar CS on ATI. The PS is scalar only, I believe.

For a trivial Mandelbrot it's simple; I will get some data when time permits... Unfortunately I have to work tomorrow...
Just to confirm: the PS was faster than the scalar CS, right?
All the GPUs have real predicates. ATI has a stack of predicates. NVidia has predicate registers. I don't understand the distinction you're trying to make.

It's a bool.
Current hardware has no support for a real predicate register used on a per-ALU basis; having it in hardware would also allow implementing the trick you said doesn't increase performance.
Sorry, you're right.

It may avoid a performance hit of up to 75%, so it may increase performance by up to 4x (running at full rate instead of a quarter of it).
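For what it's worth, here is a toy model of why a 4-way divergent branch costs up to 75% (my own illustration, not how any of these GPUs actually implements predication): each path is run over all lanes with a per-lane mask, so only the lanes whose predicate matches do useful work in each pass.

```c
#include <stdio.h>

#define LANES 8

/* Toy SIMD divergence model: a 4-way branch is serialized into 4 masked
 * passes, so each pass only does real work in the lanes taking that path. */
int main(void)
{
    int path[LANES] = {0, 1, 2, 3, 0, 1, 2, 3};  /* case taken by each lane */
    int issued = 0, useful = 0;

    for (int p = 0; p < 4; ++p) {                /* serialize the 4 paths    */
        for (int lane = 0; lane < LANES; ++lane) {
            issued++;                            /* lane slot is occupied    */
            if (path[lane] == p)
                useful++;                        /* predicate set: real work */
        }
    }
    printf("utilisation: %d/%d = %.0f%%\n",
           useful, issued, 100.0 * useful / issued);  /* 25%: a 75% hit */
    return 0;
}
```

Utilisation drops to 25%, so anything that lets the masked-off lanes do useful work instead would, in the best case, recover the full 4x.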
It'd be good to work out how Fermi does DP. If it really has 32 SP units and 32 integer units per core that are entirely distinct (sharing register file ports and not overlapped) then something along those lines sounds feasible.

So, they did a double-precision multiplication using 1 integer multiplier and 2 FP multipliers; maybe they already did what I was describing :smile:
It would be funny if Fermi and Larrabee do SP, DP and INT the same way.
http://www.lirmm.fr/arith18/papers/libo-multipleprecisionmaf.pdf

AMD probably did a long multiplication using each of the multipliers in the thin ALUs to do part of it; this is the simplest method and requires few extra transistors. But a 2x-precision multiplier could be done with just 3 1x-precision multipliers. The two methods that make sense to me are Karatsuba (sketched below) and long multiplication with the lowest partial product replaced by a good guess; both require some extra hardware, but cheaper than a full multiplier.
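In case it helps, here is a minimal sketch of the Karatsuba idea (my own example, not AMD's datapath): a 32x32 -> 64-bit multiply built from three 16x16-bit multiplies plus adds and shifts, i.e. the 3-multiplier trick scaled down.

```c
#include <stdint.h>
#include <stdio.h>

/* Karatsuba: build a 2x-precision multiply from three 1x-precision
 * multiplies. Here 1x = 16 bits, 2x = 32 bits, product = 64 bits. */
static uint64_t karatsuba32(uint32_t x, uint32_t y)
{
    uint32_t x1 = x >> 16, x0 = x & 0xFFFF;   /* high / low 16-bit halves */
    uint32_t y1 = y >> 16, y0 = y & 0xFFFF;

    uint64_t z2 = (uint64_t)x1 * y1;          /* 1st 16x16 multiply */
    uint64_t z0 = (uint64_t)x0 * y0;          /* 2nd 16x16 multiply */
    /* 3rd multiply: (x1+x0) and (y1+y0) are 17-bit values, so this one
     * needs one extra bit of width on each input plus the adders and
     * subtractors -- the "some extra hardware" mentioned above. */
    uint64_t z1 = (uint64_t)(x1 + x0) * (y1 + y0) - z2 - z0;

    return (z2 << 32) + (z1 << 16) + z0;      /* recombine with shifts/adds */
}

int main(void)
{
    uint32_t a = 0xDEADBEEF, b = 0x12345678;
    printf("karatsuba: %016llx\n", (unsigned long long)karatsuba32(a, b));
    printf("reference: %016llx\n", (unsigned long long)((uint64_t)a * b));
    return 0;
}
```

The cost over three plain 1x multipliers is the slightly wider third multiplier plus a few adders, which is why it's cheaper than a fourth full multiplier but not free.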
The optimal multiple-precision unit described there is 3.7x the size of the single-precision unit. It's 18% bigger than the optimal double-precision-only unit, and the double-precision-only unit is 3.1x the size of the single-precision unit.
Hmm, I guess the 16->4 granularity I was describing is like the 256->64 granularity you're describing.

So people already talked about some of what I described here :smile:
Jawed