SPE vs. GPU Vertex shader

How does a SPE compare with the vertex engines used in today's GPUs?

What's the FLOPS rating of vertex engines from NVIDIA and ATI?

How many vertex operations can SPEs do?
 
Deano Calver said:
I'm close to screaming (partly because I can't say a lot about next-gen GPUs) but hopefully this will explain enough to show GPUs are good at maths as well.

GPUs (not surprisingly) are good at what they do, so why do it on a CPU that isn't designed for the job? On their home turf (doing maths on lots of separate bits of data) they are very good.

GPUs are seriously SIMD. They have a number of units ('quads' is an old name for them, but that distinction will go away), each working on lots of data.

So let's compare an SPU SIMD unit to an imaginary near-future GPU SIMD unit (this is meant as an order-of-magnitude thought experiment, so read nothing into the numbers).

1 FMAC instruction in a SPU will operate on 4 floats per cycle
1 FMAC instruction in a GPU unit will operate on 48 floats per cycle

If we take 4GHz for the SPU and 500MHz for the GPU, then the SPU can perform 8 times as many instructions.
So in 1 GPU cycle:
8 FMAC instructions in a SPU will operate on 32 floats
1 FMAC instruction in a GPU unit will operate on 48 floats

So even this crude back-of-the-envelope calculation shows that a GPU isn't exactly outclassed...

Just in case there's any doubt: GPUs will ship with multiple 'units' as illustrated here, just as Cell ships with multiple SPUs...

And we are not even thinking about all the 'free' data conversion, low-latency memory reads and lerps that are part of the fixed hardware...
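For reference, a minimal C sketch of the arithmetic above, using the thread's deliberately made-up numbers (4GHz SPU, 500MHz GPU, 4- vs. 48-float FMACs); nothing here reflects real hardware specs:

[code]
#include <stdio.h>

int main(void)
{
    /* Deliberately made-up numbers from the thought experiment above */
    const double spu_clock = 4.0e9;           /* 4 GHz SPU               */
    const double gpu_clock = 0.5e9;           /* 500 MHz GPU             */
    const double spu_floats_per_fmac = 4.0;   /* one SPU FMAC: 4 floats  */
    const double gpu_floats_per_fmac = 48.0;  /* one GPU-unit FMAC: 48   */

    /* In one GPU cycle the SPU issues spu_clock/gpu_clock instructions */
    double spu_issues = spu_clock / gpu_clock;             /* = 8  */
    double spu_floats = spu_issues * spu_floats_per_fmac;  /* = 32 */

    printf("Per GPU cycle: SPU touches %g floats, GPU unit touches %g\n",
           spu_floats, gpu_floats_per_fmac);
    return 0;
}
[/code]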
 
Deano,

When you say 48 FMACs per cycle, how is that partitioned across the vertex and pixel units? I.e. how many FMACs come from the vertex shaders and how many from the pixel shaders?

What exactly is a GPU 'unit' by the way? A unified shader unit? ;)
 
JF_Aidan_Pryde said:
How does a SPE compare with the vertex engines used in today's GPUs?
From a hw point of view SPEs and current vertex shader engines are quite different.
SPEs are very high-frequency, single-threaded, in-order processors, with lots of registers and a 'big' and very fast local memory.
Vertex shader engines are 'low'-frequency, multi-threaded processors with a small number of registers (I believe far fewer than the shader abstractions suggest) and a small amount of local memory.

What's the FLOPS rating of vertex engines from NVIDIA and ATI?
This is wild territory ;)
I believe most vertex shader implementations are capable of 8-10 ops per clock, but that is not directly comparable with SPE ops per clock, because some vertex shader hw implementations perform a lot of sub-operations for free, like swizzling, negating, masking, and so on.
Dunno what a SPE can do 'for free' (a SPE has a secondary pipeline devoted to stuff like that); the base number is 8 ops per clock.
How many vertex operations can SPEs do?
What does that mean?

SPEs are way more flexible than current vertex shader engines.
A SPE can create or destroy vertices, assemble primitives, and so on..
Vertex shaders are still 'limited' to working on a single vertex at a time, but I believe they are way more efficient than SPEs at reaching peak utilization rates, even if we still don't know whether developers who write SPE code currently have quality tools/compilers.
In the near future I believe vertex shaders will get very fast at things like texture sampling, while a SPE will have a hard time performing the same task efficiently.
There's no clear winner imho, especially since we are comparing an architecture designed to be fast at vertex shading with an architecture designed to be efficient across quite a wide range of applications.
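To make the 'for free' point above concrete, a toy C sketch (not real shader or SPU code): DX9-era vertex shader assembly lets one instruction apply source swizzles and negation as free modifiers, roughly "mad r0, r1.yzxw, -r2, r3", while a generic SIMD pipeline has to spend separate shuffle/negate issue slots to arrange the same operands:

[code]
#include <stdio.h>

typedef struct { float x, y, z, w; } vec4;

/* Vertex-shader style: swizzle and negate folded into the FMAC itself,
   roughly "mad r0, r1.yzxw, -r2, r3" -- one instruction slot. */
static vec4 mad_folded(vec4 a, vec4 b, vec4 c)
{
    vec4 r = { a.y * -b.x + c.x,
               a.z * -b.y + c.y,
               a.x * -b.z + c.z,
               a.w * -b.w + c.w };
    return r;
}

int main(void)
{
    vec4 a = {1, 2, 3, 4}, b = {1, 1, 1, 1}, c = {0, 0, 0, 0};
    vec4 r = mad_folded(a, b, c);
    /* A generic SIMD pipeline would burn extra shuffle/negate issue
       slots to get the same operand arrangement (on a SPE the permute
       can at least run on the secondary pipeline nAo mentions). */
    printf("%g %g %g %g\n", r.x, r.y, r.z, r.w);
    return 0;
}
[/code]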
 
If a GPU vertex engine operates on 48 floats per instruction, then that number was surely reached by a lot of creative nvmaths. The chip doesn't even have the load/store bandwidth to sustain ONE such vertex engine, much less the 5-6 of them that are present in top-of-the-line GPUs. Much less any other chip operations either, I might add (pixel, zixel, texture, shader programs etc.).

Anyway, in what way is this debate even remotely meaningful? Even next-gen titles on next-gen consoles won't be anywhere near vertex-limited, simply because there won't be room in console memory to store meshes detailed enough to max out vertex processing performance. Counting low, the expected peak might be 3 billion vertices/sec for a GPU (500MHz x 6 engines), and certainly a comparable number from an 8-SPU CELL. Does anyone think software will actually come anywhere near this silly-high figure? :LOL:
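For reference, a quick C sketch of that peak figure and what it would imply per frame (the 500MHz clock, 6 engines, and one vertex per engine per clock are assumptions from the post above, not real specs):

[code]
#include <stdio.h>

int main(void)
{
    /* Assumptions from the post above: 500 MHz, 6 vertex engines,
       one vertex per engine per clock at peak */
    const double clock_hz = 500e6;
    const double engines  = 6.0;

    double peak_verts_per_sec = clock_hz * engines;   /* 3e9 */
    printf("Peak: %.1e verts/s, i.e. %.1e verts/frame at 60 fps\n",
           peak_verts_per_sec, peak_verts_per_sec / 60.0);  /* 5e7 */
    return 0;
}
[/code]

Fifty million vertices per frame at 60fps, which underlines the point about how silly-high the peak figure is next to any realistic mesh budget.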
 
Using higher-order surfaces would alleviate the memory problem. PSP hinted at this direction.

I started this thread because my gut feeling tells me that SPEs are too general-purpose to compete effectively against vertex engines, especially given how much other work they also have to do. If 4 SPEs are used for physics, AI etc. then that leaves 4 for vertex processing. I'm not sure that's very competitive.
 
Guden Oden said:
If a GPU vertex engine operates on 48 floats per instruction, then that number was surely reached by a lot of creative nvmaths. The chip doesn't even have the load/store bandwidth to sustain ONE such vertex engine, much less the 5-6 of them that are present in top-of-the-line GPUs.

Can you elaborate on this? A modern GPU with over 20GB/s of bandwidth can't sustain one vertex engine?
 
Guden Oden said:
If a GPU vertex engine operates on 48 floats per instruction, then that number was surely reached by a lot of creative nvmaths. The chip doesn't even have the load/store bandwidth to sustain ONE such vertex engine, much less the 5-6 of them that are present in top-of-the-line GPUs. Much less any other chip operations either, I might add (pixel, zixel, texture, shader programs etc.).
Most of the bandwidth needed to sustain vertex processing is INTERNAL bandwidth.
The GPU loads/stores per-thread live registers from an on-chip memory, loads/stores vectors from the register file, and so on.
The only external bandwidth you need is for vertex fetching; everything else is internal (except texture sampling from a vertex shader).
 
48 32-bit floats in and out, that's three kilobits per cycle (or in other terms, more than a third of the ZX81 home computer's entire RAM :D). What memory interface delivers that kind of performance? :p
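To put a number on that rhetorical question, a small C sketch (assuming the hypothetical 500MHz clock used earlier in the thread): sustaining that traffic from external memory would need on the order of 192GB/s, an order of magnitude beyond the ~20GB/s quoted above, which is exactly why nAo's distinction between internal register-file bandwidth and external memory bandwidth matters:

[code]
#include <stdio.h>

int main(void)
{
    /* 48 32-bit floats in plus 48 out per cycle, at an assumed 500 MHz */
    const double clock_hz        = 500e6;
    const double floats_each_way = 48.0;
    const double bytes_per_float = 4.0;

    double bytes_per_cycle = 2.0 * floats_each_way * bytes_per_float; /* 384 B  */
    double bits_per_cycle  = bytes_per_cycle * 8.0;                   /* 3072 b */
    double gb_per_sec      = bytes_per_cycle * clock_hz / 1e9;        /* ~192   */

    printf("%.0f bits/cycle -> %.0f GB/s if it all hit external memory\n",
           bits_per_cycle, gb_per_sec);
    return 0;
}
[/code]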
 
JF_Aidan_Pryde said:
If 4 SPEs are used for physics, AI etc. then that leaves 4 for vertex processing. I'm not sure that's very competitive.

Does it have to be?

How much raw vertex performance do we actually need before we start hitting diminishing returns?

It wouldn't matter if Cell couldn't match the raw benchmark performance of a next-gen GPU with some or all of its shaders working on vertices, IMO. As long as it's "enough" that there isn't a visible difference (and there wouldn't be if you're talking about 1bn polys versus 2bn polys and the like ;)).

Low-poly square "Doom 3" heads aside, Carmack was on the right track with his "I don't need more polys, I need better polys" argument. Pixel shading power will be far more important next-gen, imo.
 
_phil_ said:
With virtual displacement mapping I don't think polys will be much of an issue

In no way could virtual displacement be a satisfying replacement for REAL displacement.

I think you missed the bigger picture of that post.

It was by PC-Engine.

And it wasn't anti-Sony.

In fact it could be interpreted as a positive post.

Now that's something! :)
 
nAo said:
Vertex shader engines are 'low'-frequency, multi-threaded processors with a small number of registers (I believe far fewer than the shader abstractions suggest) and a small amount of local memory.
Surely they have got to have ~16 4-float vectors for Shader Model 1, and something like 32(?) for models 2 and 3.
 
Simon F said:
Surely they have got to have ~16 4-float vectors for Shader Model 1, and something like 32(?) for models 2 and 3.
I'm thinking that a vertex/pixel shader processor could theoretically work with just the maximum number of registers that a single instruction can address, even if it would be more efficient to have more real registers than that.
 
nAo said:
Simon F said:
Surely they have got to have ~16 4-float vectors for Shader Model 1, and something like 32(?) for models 2 and 3.
I'm thinking that a vertex/pixel shader processor could theoretically work with just the maximum number of registers that a single instruction can address, even if it would be more efficient to have more real registers than that.
I'm not quite sure I understand what you are saying.
 
Simon F said:
I'm not quite sure I understand what you are saying.
If a processor switches threads each clock cycle, it could prefetch data from a temporary store of live registers into the real registers. You wouldn't need more real registers than the maximum number of registers you can address within a single instruction.
Obviously I'm not saying that current hw works that way, but it seems it could, at least from a theoretical point of view.
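A toy C model of the idea, purely illustrative and nothing like real hardware: each thread's live registers sit in a bigger on-chip store, and the only 'real' registers are the two sources and one destination a single instruction can address, refilled on every issue:

[code]
#include <stdio.h>

#define THREADS   4
#define LIVE_REGS 32  /* per-thread live registers, kept in on-chip memory */
#define REAL_REGS 3   /* all one instruction can address: 2 sources, 1 dest */

static float live[THREADS][LIVE_REGS];  /* the big, slower on-chip store */
static float real_regs[REAL_REGS];      /* the only "real" register file */

/* One toy instruction: live[dst] = live[src0] + live[src1] for a thread */
static void issue(int thread, int dst, int src0, int src1)
{
    real_regs[0] = live[thread][src0];  /* prefetch sources into real regs */
    real_regs[1] = live[thread][src1];
    real_regs[2] = real_regs[0] + real_regs[1];  /* execute */
    live[thread][dst] = real_regs[2];   /* write back to the live store */
}

int main(void)
{
    for (int t = 0; t < THREADS; ++t) {
        live[t][0] = (float)t;
        live[t][1] = 10.0f;
    }
    /* Switch thread every "cycle": one thread's refill latency hides
       behind the issues of the other threads */
    for (int cycle = 0; cycle < 2 * THREADS; ++cycle)
        issue(cycle % THREADS, 2, 0, 1);

    printf("thread 0, r2 = %g\n", live[0][2]);  /* 10 */
    return 0;
}
[/code]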
 
_phil_ said:
In no way could virtual displacement be a satisfying replacement for REAL displacement.
I don't see why it couldn't be. Considering the memory and performance savings of virtual displacement, I sure hope it sees extensive use. It's also very effective in most situations (have you seen it in action? I'm sure you'll agree!).

A combination of denser poly meshes than now and virtual displacement would seem to me to be the way to go, without going all-out and modelling every bump and scratch in a surface with polys.
 
(have you seen it in action?)

Yes, that's why I said that. Also, it doesn't work at edges/borders, like any 'fake geometry' method.
Still, this isn't a useless feature, and it has its uses, but it won't totally make up for additional geometry.
 