VPs integrated onto an x86 CPU
Can anyone estimate the cost/usefulness/complexity for Intel or AMD of integrating a bunch (16 or more) of P10-like Vertex Processors onto the core of the P4/Athlon?
How many gates would be needed? Could the VPs be run at full speed (2 - 3 GHz)? Would AGP be the limiting factor?
I think this could be a cool solution, and a way for especially AMD to differentiate itself from Intel.
Could such an architecture be able to reach 1 billion triangles/s (using 0.13 or 0.09 um process tech)? I do understand that this would choke AGP, so what kind of occlusion culling algorithm could be implemented on the CPU to prevent this?
By the way, how many triangles/s can today's P4 and Athlon handle (using the same method of calculation that NVIDIA uses)?
To be quite honest, I don't think this would be very feasible. Having a bunch of VUs chomping through a gazillion vertices per second would starve the CPU of bandwidth for other tasks (like running actual game code), that's why such things were off-loaded onto a GPU to begin with. :)
It would also need software support, which is always slow to materialize. Especially since 3DNow! and SSE(2) already exist, they'd literally be competing against themselves... AMD is putting its weight behind SSE2 these days, and they even doubled the available registers in their Hammer series of CPUs (from 8 to 16), so I would not expect this to happen soon.
There are far better and more straight forward ways for AMD and Intel to parallelize their architecture to make better use of their transistor count for their CPUs.
The first is to put more than one CPU on a die. This should happen with AMD's Sledgehammer, but not Clawhammer, at least in the beginning.
The second is to use hyper-threading for multi-tasking. This is already happening with Intel's CPUs.
The third is to come up with an arbitrary-vector-length version of SSE2, call it, say, SSE3. Some new instructions would be added and some old ones updated so that the same binary could automatically make use of any vector length (rather than having a hardcoded, implied vector length in the instruction, as it is now). This would allow future CPUs to increase the SSE register size to any size desired, and all existing SSE3 binaries would take full advantage of the new vector length without recompilation. A new CPU could simply double the vector size and then run SSE3 binaries (roughly) twice as fast.
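The vector-length-agnostic idea above can be sketched in plain C. This is only an illustration of the programming model, not a real instruction set: `query_vector_length()` is a made-up stand-in for some hypothetical instruction that would report the hardware's SIMD width at run time.

```c
#include <stddef.h>

/* Hypothetical: the CPU reports its SIMD vector length at run time
 * instead of the length being hardcoded into each instruction.
 * Here it is faked with a constant; a real "SSE3" of this kind would
 * expose it through a dedicated (imaginary) instruction. */
static size_t query_vector_length(void) { return 4; /* floats per vector */ }

/* The same compiled loop then works for ANY vector width: it strides
 * by whatever length the hardware reports, so a future CPU that
 * doubles its registers runs this old binary roughly twice as fast. */
void scale_array(float *dst, const float *src, float k, size_t n)
{
    size_t vl = query_vector_length();
    size_t i = 0;
    for (; i + vl <= n; i += vl)
        for (size_t j = 0; j < vl; ++j)  /* stands in for one vector op */
            dst[i + j] = src[i + j] * k;
    for (; i < n; ++i)                   /* scalar tail */
        dst[i] = src[i] * k;
}
```

The point is that the loop never mentions a fixed width like "4 floats"; the binary adapts to whatever the hardware reports.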
One final note about multiple CPUs on a die. The time is fast approaching that it will become realistic to put multiple CPUs (most likely on the same die) in standard desktop machines. The CPU manufacturer that does this first is likely to get a good jump over the competition.
Actually, Slot 1 and probably also Slot A supported multiple CPUs per cartridge (but only one L2 cache per cartridge).
Slot 1 supported, I think, two CPUs and Slot 2 supported up to 4...
I do agree that putting two CPUs (fully or through hyper-threading) on a single die is a likely next step. For Intel, this will probably be realized fully (compared to the more experimental support that already exists in the P4) next year in the form of Prescott. AMD should be able to do something similar when moving Hammer to their 0.09 um process. And they may be forced by competition from Intel to do it not only in their Opteron series, but also in the Athlon (the Hammer version). The problem here is that few of today's applications make use of multi-threading.
We will likely see extensions/improvements to the SSE units. But if I understand things correctly, SSE2 is still very simple (and low-performing) compared to the VUs of the PS2. I guess the same is true if you compare SSE2 to the VPs of the 3DLabs P10. Another problem with SSE2 is that, because it shares a lot of hardware with the other parts of the CPU, heavy use of it stalls the application.
My initial suggestion was a standalone multimedia processor (MMP) integrated onto the same die as the x86 CPU. The MMP would consist of 16 or more simple floating-point processors, a command processor, its own on-die program memory, access to the x86 MMU, and some path to the CPU for communication. The MMP would be able to run the program code located in the on-chip program memory without the x86 needing to be involved.
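To make the "CPU is not involved" part concrete, here's a rough C sketch of how the x86 side might talk to such an MMP through a command ring. Everything here (the opcodes, struct layout, ring size) is invented for illustration; the real design would obviously differ.

```c
#include <stdint.h>

/* Hypothetical command block the x86 side writes into memory the MMP's
 * command processor can see. All names are made up for illustration. */
typedef enum { MMP_TRANSFORM_VERTS, MMP_CULL, MMP_AUDIO_FX } mmp_opcode;

typedef struct {
    mmp_opcode op;
    uint64_t   src;    /* address, translated via the shared x86 MMU */
    uint64_t   dst;
    uint32_t   count;  /* elements to stream through the 16 FP units */
} mmp_cmd;

/* The CPU just appends commands to a ring; the MMP's own command
 * processor walks the ring independently, so the x86 can go back to
 * running game code immediately after submission. */
typedef struct {
    mmp_cmd  ring[256];
    uint32_t head, tail;
} mmp_queue;

static int mmp_submit(mmp_queue *q, mmp_cmd c)
{
    uint32_t next = (q->head + 1) % 256;
    if (next == q->tail) return -1;   /* ring full */
    q->ring[q->head] = c;
    q->head = next;  /* a real device would also need a doorbell write */
    return 0;
}
```

The design choice this illustrates: the only CPU cost per batch of work is a few stores, which is what makes the MMP worthwhile compared to running the same loop in SSE2 on the main pipeline.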
As far as application support goes, I guess this could to a large extent be solved through native support in DirectX. Otherwise it could be supported through the drivers. Of course a native API would be needed.
I can see a big use for this MMP. In 3D it could work on the culling, vertex processing etc. The level of involvement would be dictated by the power of the GPU. It could also be used for ultra-advanced 3D audio algorithms - remember some games use the VU1 of the PS2 for audio.
For professional audio the MMP could be used for realtime effects. Musicians/producers are dying for this level of performance, as the CPU never seems to be fast enough. There are PCI DSP cards with 4 SHARC DSPs costing $1000+ for this purpose. The MMP would crush such a card.
I can see a lot of other uses for the MMP: hundreds of tasks where a limited range of instructions is applied to a stream of data. The MMP would probably be an order of magnitude faster than the x86 with SSE2 on these tasks.
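For a feel of what "a limited range of instructions applied to a stream of data" means in practice, here's a tiny audio-style kernel in plain C (my own example, not from any real product): just a multiply and a clamp swept over a long buffer. Kernels like this map directly onto SIMD hardware (on SSE that's essentially `mulps`/`minps`/`maxps`) and would be ideal MMP fodder.

```c
#include <stddef.h>

/* Apply a gain to a stream of samples and hard-clip to [-1, 1].
 * Two float ops per sample, no branches the hardware can't predict:
 * exactly the streaming shape a bank of simple FP units eats up. */
void gain_and_clip(float *buf, size_t n, float gain)
{
    for (size_t i = 0; i < n; ++i) {
        float s = buf[i] * gain;
        if (s >  1.0f) s =  1.0f;
        if (s < -1.0f) s = -1.0f;
        buf[i] = s;
    }
}
```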
I see the cons, but I still think the pros outweigh them.
You still fail to see the fundamental problem: bandwidth.
A CPU doesn't have the memory subsystem needed to support 16 multi-GHz VUs, especially not while running ordinary program code too.
Also, what's a 'cool thing to have' in a CPU (16 VUs running at core clock) and what's a cost-efficient thing to have, considering how much real-world impact it's likely to have, are two totally different things. Using a set of CPU-located VUs to cull polys, calculate vertices, etc. is POINTLESS when every GPU released from now till the end of time will be optimized to do the same job, and do it better, since it won't consume any CPU resources in the process.
SSE2 is unlikely to cause the whole CPU to stall, no matter how much it's utilized since it'll only require FPU resources. Why would you think that?
On an Athlon XP, SSE instructions use the 'complex instruction' pipe (not sure if that's the official name). The load/store pipe and 'simple instruction' pipe are free to accept new instructions, so there's no stalling going on. FPU instructions also do not cause stalls in the integer pipes. I'd assume Hammer to work the same way or at least similar regarding SSE2 since the basic architecture supposedly is quite similar between the two chip families.
Well, right now the idea of putting a 256-bit, full-speed DDR bus on an ordinary CPU (with VUs) seems a bit insane, but if you actually did it, you'd get something like a "super console": a system with somewhat lower performance than a standalone VPU, but without the AGP bottleneck. At this point in time we have 20 GB/s for the VPU and ~2 GB/s for the CPU. Merging them would cut graphics bandwidth by just 10%, which isn't much compared to the gain of eliminating AGP. And of course, right now we could achieve something very similar by building a P10 card with a large onboard memory. We'd simply forget about the CPU, let it handle joystick movement or keep Windows "thinking" that it is still needed, and move the rest of the work to the card. In fact I think they are already starting in this direction: after all, 3Dlabs is promising things like hardware-accelerated Photoshop filters... Maybe even oldskool assembly skills will gain some value again.
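The 10% figure above is just the poster's two bandwidth numbers divided; spelling the arithmetic out (numbers are the post's own estimates, not measurements):

```c
/* CPU traffic as a percentage of graphics bandwidth on a shared bus.
 * Inputs are the post's rough figures: ~20 GB/s wanted by the VPU,
 * ~2 GB/s consumed by the CPU. */
static double cpu_share_pct(double vpu_gbs, double cpu_gbs)
{
    return cpu_gbs / vpu_gbs * 100.0;  /* 2/20 -> 10% */
}
```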
vBulletin® v3.8.6, Copyright ©2000-2013, Jelsoft Enterprises Ltd.