This is something pretty off topic that I posted in the ATI GPU Transcoding thread over in the algorithm forum. It ended up being mainly a critique of the current state of GPGPU versus host processing, and of the need for a new paradigm. I think it might be more relevant here:
Based on all these quality questions and the uncertainty about what is actually being done on the GPU, I'm starting to think GPGPU video encoding may just be a non-starter at this point.
I was thinking last night that the really optimal system architecture needs something more than just a strong multicore OoOE CPU and a fast GPU with tons of shaders. I mean, this video-compression-on-the-GPU idea has been around since the first AVIVO converter for Radeon X1800 cards, but it's still little more than an ugly hack that sacrifices quality to a great extent.
What you can do on DX10-class shader hardware just seems too limited to replicate the quality possible on a CPU. I'm not a programmer, so I couldn't tell you if it's the programming model, memory constraints, or something else. We all know that the more specialized execution hardware is, the less flexible it is; and the more flexible it is, the less performance potential it has.
I'm starting to think there needs to be an intermediary level of hardware standing between a good general-purpose CPU and a high-performance GPU. What I'm thinking of is something like a chip integrating a gang of Cell-style SPEs connected directly to both the CPU and GPU by a very high-speed interconnect. Since both AMD and Intel are operating without an FSB anymore, this is actually more possible in the PC space than ever before. I must admit, though, I was at first thinking of the console space, where custom interconnects are the norm.
I actually got to thinking about this because of things like the PhysX PPU and Toshiba's SpursEngine. We even sometimes get threads here asking whether a game is "totally hardware accelerated" or about the introduction of an "AI accelerator". What I'm thinking about would actually make those questions moot, while gaining significant performance for certain tasks that could be a lot faster given the right kind of vector unit but still don't work terribly well on the GPU. Besides, why would you want to take performance away from the graphics anyway?
When I look at Larrabee, I wonder if it will really deliver performance competitive with nVidia's or ATI's ALU monsters. But it should be gangbusters for things like video processing and encoding, physics simulation, or sound processing.
And who was it that was going to market a SpursEngine PCI-E add-in card with video processing software? I look at that and wonder why someone doesn't write a Havok or PhysX driver for it. I wonder why it doesn't have digital audio outputs and sound drivers that do positional audio and encode Dolby Digital and DTS output in real time, the way sound is handled on the PS3.
I also look at the PhysX acceleration on nVidia GPUs in Mirror's Edge for the PC and wonder how some really pretty flapping fabric or plastic has any effect on the fundamentals of gameplay. I think GPU physics look pretty, and that's pretty much as far as it goes.
So here is what I envision as an optimal system topology.
Code:
2-4 core OoOE processor                    integrated            DDR3 memory,
with a large L2/L3 cache budget.  <----- memory controller ----> 128-256 bit bus
Provides strong general-purpose
performance and coordination.
        |
        |------ QPI/HT/FlexIO-class interconnect
        |
Vector Multi-Processing Unit:              Ring bus/EIB-style high-speed
Larrabee, SpursEngine, or another <------- on-chip connection that also allows
ASIC in that mold.                         quick CPU-GPU communication,
        |                                  bypassing the processing elements.
        |------ PCI-E 2.0 or whatever is in vogue (faster is better)
        |
High-performance AMD or nVidia             Whatever VRAM setup they want:
DX10/DX11-class GPU with video output. <-- 256-512 bit GDDR3-GDDR5 and up
I think you get the idea. An important factor is for the vector chip to have high-speed access to both pools of RAM, since it lacks its own memory beyond local store/cache. As a note, the VMPU basically replaces the classic north bridge position. Sitting at the center, it is the most obvious place to hang an I/O chip with a modest HT or PCI-E connection for USB/FireWire/Ethernet/SATA/audio (though audio should also be possible through the GPU for HDMI).
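From what I understand of how Cell coders live with only a small local store, the trick is double buffering: stream the next chunk of data in while crunching the current one, so the vector unit never stalls on either memory pool. Here's a rough C sketch of that pattern; the dma_get/dma_put/dma_wait calls are hypothetical stand-ins for whatever async copy primitives the VMPU would expose (on Cell, the MFC DMA intrinsics play this role):
Code:
/* Sketch: double-buffered streaming on a local-store vector core.
 * The dma_* functions below are hypothetical; sizes are in floats. */
#include <stddef.h>

#define CHUNK 1024                       /* floats per local-store buffer */

extern void dma_get(void *ls, const float *mem, size_t n, int tag); /* async copy in  */
extern void dma_put(const void *ls, float *mem, size_t n, int tag); /* async copy out */
extern void dma_wait(int tag);           /* block until transfers on tag finish */
extern void process(float *buf, size_t n);          /* the actual compute kernel */

void stream(float *src, float *dst, size_t nchunks)
{
    static float buf[2][CHUNK];          /* two buffers in local store */

    dma_get(buf[0], src, CHUNK, 0);      /* prime the pipeline */
    for (size_t i = 0; i < nchunks; i++) {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < nchunks) {
            dma_wait(nxt);               /* buf[nxt]'s old write-back must finish */
            dma_get(buf[nxt], src + (i + 1) * CHUNK, CHUNK, nxt);
        }
        dma_wait(cur);                   /* current chunk is now in local store  */
        process(buf[cur], CHUNK);        /* crunch it while the next streams in  */
        dma_put(buf[cur], dst + i * CHUNK, CHUNK, cur);
    }
    dma_wait(0);
    dma_wait(1);                         /* drain the last write-backs */
}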
I think the advantage should be obvious over the silliness of putting PPUs, SpursEngines, and sound cards on add-in cards, which limits their ability to combine with the CPU or GPU in the way that is possible in something like the PS3. Obviously that system has deficiencies in PPE and GPU performance, but we've seen cool uses of the SPUs for post-processing effects, vertex processing, etc., thanks to the tighter integration of the GPU and Cell.
As I mentioned before, I'm not a programmer or an electronics engineer, so maybe this is pie in the sky, but I think this kind of architecture offers so many cool possibilities beyond the current paradigm. For example, if the physics simulation is centralized on the VMPU, both the GPU and CPU should be able to access those results very quickly, to enhance visuals and gameplay alike. It should offer a superior audio solution with real-time DTS/DD encoding, a feature that should be simple in software even on today's systems but which is still far too rare. Like the oft-wished-for Cell-plus-Xenos architecture, if you are shader limited you could move vertex calculations to the VMPU and use all your ALUs for pixel shading. It's flexible enough and powerful enough that the possibilities should be enormous: ray tracing, distributed computing, etc., all stand to benefit from this kind of tiered performance architecture. And finally, to bring this back on topic, it should allow for high-quality transcoding at a speed far greater than is possible even on the fastest quad cores. Not to mention it would not peg every core at 100% for hours, which would make multitasking much nicer.
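To sketch what I mean by the CPU and GPU both tapping the VMPU's physics results: imagine one result buffer that all three chips can see over the shared interconnect, which the VMPU publishes each step and the others snapshot. Everything here is hypothetical (the names, and the assumption of a coherent shared address space over the QPI/HT-class link), and a real version would need proper memory fences; this is just the shape of the idea:
Code:
/* Hypothetical: a physics buffer visible to CPU, VMPU, and GPU.
 * The VMPU bumps `frame` around each update (seqlock style) so
 * readers always grab a consistent snapshot. */
#include <stdint.h>

#define MAX_BODIES 4096

typedef struct { float x, y, z, pad; } vec4;

typedef struct {
    volatile uint32_t frame;                   /* odd = VMPU mid-update */
    vec4 pos[MAX_BODIES];
} phys_buffer;

/* VMPU side: publish one completed simulation step. */
void vmpu_publish(phys_buffer *pb, const vec4 *new_pos, int n)
{
    pb->frame++;                               /* now odd: in flux   */
    for (int i = 0; i < n; i++)
        pb->pos[i] = new_pos[i];
    pb->frame++;                               /* even again: stable */
}

/* CPU (gameplay) or GPU (visuals) side: retry until the read
 * didn't straddle an update. */
void read_snapshot(const phys_buffer *pb, vec4 *out, int n)
{
    for (;;) {
        uint32_t f = pb->frame;
        if (f & 1)
            continue;                          /* writer is mid-update */
        for (int i = 0; i < n; i++)
            out[i] = pb->pos[i];
        if (pb->frame == f)
            return;                            /* consistent snapshot  */
    }
}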