Software Rendering Will Return!

I was going to news this but still haven't - anyway, PowerVR's SGX turns out to be a 100% MIMD GPU according to their latest developer SDK. No branching granularity requirements; it just works at full performance (although it's still faster to use predication for small branches for a variety of reasons).
That's neat, but on the other hand you can't discount the overhead of all the additional PCs and, more seriously, of load balancing such a design when comparing it to more "multi-core SIMD" structures like GPUs, or even CPUs (SSE)/Cell/etc. It's just another point on the continuous curve, and it only really proves its utility if it is similarly powerful while being more flexible than GPU-like designs.

So certainly if PowerVR's stuff runs at G92-like speeds without branch granularity penalties I'll be really interested! But conversely, if they run at "100% efficiency" yet only achieve 1/8 or 1/16 the throughput of G92, then you haven't gained anything ;) I suspect the reality is somewhere in between, which implies that we can't really throw out either of the extremes in favour of the other.
 
I was going to news this but still haven't - anyway, PowerVR's SGX turns out to be a 100% MIMD GPU according to their latest developer SDK. No branching granularity requirements; it just works at full performance (although it's still faster to use predication for small branches for a variety of reasons). If you're really smart and you're not forced to use a retarded ISA like x86, it *is* possible to create a good MIMD GPU (as PowerVR has proven, much to my surprise). And the implication then is that many-core CPUs become completely redundant.

You wouldn't happen to have a direct url to this latest dev SDK? I would enjoy the read.

As for the issues of MIMD vs SIMD,

MIMD works as long as the MIMD cores don't have to go to global memory. The second MIMD cores need to access main memory, you need either caching or something which coalesces reads and writes into the large (banked?) chunks needed for good memory throughput. In either case, even though you have MIMD, your data structures and algorithms need to be SIMD-like in order to achieve good performance (unless they aren't memory bound, but as parallelism increases nearly everything becomes memory bound, right?). In the case of caching you need to have great data locality and use all the data in the cache lines, and in the non-cached global memory case you need to be coalesce-aware.
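
To make that concrete, here's a rough C++ sketch (names and layout are purely illustrative) of the same trivial update written array-of-structures versus structure-of-arrays; the SoA version is the "SIMD-like" layout that keeps cache lines fully used and lets accesses coalesce, whether the consumers are SIMD lanes or independent MIMD cores:

    #include <cstddef>
    #include <vector>

    // Array-of-structures: a core/lane touching only .x strides through memory
    // and wastes most of every cache line / memory transaction.
    struct ParticleAoS { float x, y, z, w; };

    void scale_x_aos(std::vector<ParticleAoS>& p, float s) {
        for (std::size_t i = 0; i < p.size(); ++i)
            p[i].x *= s;                      // 4 useful bytes per 16-byte stride
    }

    // Structure-of-arrays: the same work becomes a dense, sequential sweep,
    // which is exactly what caches and coalescing hardware both want.
    struct ParticlesSoA {
        std::vector<float> x, y, z, w;
    };

    void scale_x_soa(ParticlesSoA& p, float s) {
        for (std::size_t i = 0; i < p.x.size(); ++i)
            p.x[i] *= s;                      // every byte fetched is used
    }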

So in this way I'm not sure MIMD is a win, because all you gain is better branching granularity while you're still constrained by memory access patterns.

However, if your MIMD elements can work mostly in a very small and fast local store of memory, or work accessing hardware queues between other scalar cores, then global memory access can be less of an issue. Personally I find this exciting; however, at this point my guess is the lure of MIMD for most programmers goes away when they realize that a critical part of the algorithm is designing memory access patterns and the "software floorplan" (i.e. the distribution and location of stages of the algorithm on each core).
 
Nehalem is a ~265mm² chip on 45nm.
K10 is 283 mm² and is sold for as little as 119 € retail (19% VAT inc.). In a year and a half a 45 nm K10.5 is going to cost less, clock higher, and consume less. Intel's only logical response will be to make Yorkfield cheaper (if that doesn't already happen much sooner). Nehalem will be expensive, but only for the first year.

And let's not forget the tri-core AMDs and quad-core mobile Intels. The race for more CPU cores is absolutely unstoppable, and competition keeps prices dropping at a steady rate.
Even after a straight shrink to 32nm it should be as big as Conroe was on 65nm, and with higher wafer costs (although it should clock much higher)...
It has integrated memory controllers, so while the chip might be relatively more expensive the chipset gets cheaper. Without IGP you can toss out the northbridge altogether.
It could be in mainstream PCs with less cache in late 2010, but that depends a lot on your definition of mainstream...
2010, 2011, 2012... I don't care much. It's going to happen.
Compared to nothing. This is what they should do, not what they have done.
Do what? You haven't specified what Larrabee lacks to be a "Game Processing Unit". You have to be comparing it to something. A completely MIMD architecture?
You won't be able to run quite a few parts of AI on a stream processor. Larrabee is better than a stream processor there, but not by that much; most of the processing grunt is still in the SIMD unit.
As far as I know A.I. only becomes processing intensive if you have crowds of hundreds or thousands of entities. But that's still way less than the number of pixels on your screen. And with SMT each core can be doing graphics on the SIMD units and A.I. on the scalar units. You'll also always have a CPU to do the less parallel workloads. I have a really hard time believing game A.I. is ever going to be a bottleneck anyway. The only times I've ever seen it being an issue is when they use an interpreted scripting language. :oops:

Also, as has been discussed before in relation to physics, these things don't really scale with gameplay. Some games can add some interactive fluids, some intelligent butterflies, etc. but it really stops at some point. Look at sound for instance. First we had discrete sound cards, then integrated sound, and nowadays sound codecs that have no influence on benchmarks whatsoever.
I was going to news this but still haven't - anyway, PowerVR's SGX turns out to be a 100% MIMD GPU according to their latest developer SDK. No branching granularity requirements; it just works at full performance (although it's still faster to use predication for small branches for a variety of reasons).
It has to be at least 4-way SIMD to compute texture coordinates per quad for mipmapping, right?
If you're really smart and you're not forced to use a retarded ISA like x86...
It has its quirks but it's not retarded. Feel free to convince me though by suggesting an ISA that is going to be Intel's doom in no time.
...it *is* possible to create a good MIMD GPU (as PowerVR has proven, much to my surprise). And the implication then is that many-core CPUs become completely redundant.
They won't go back to single-core. Ever. The MHz race is over but Intel is well on its way to convincing us that you're a loser if you don't have a quad-core yet. You're also mistaking the average computer user for a gamer. We need multi-core CPUs just to keep up with the needs of applications. Developers are getting used to the idea of multi-threading and they're not suddenly going to move their concurrent general-purpose code to another processing unit. Just to make sure you get the picture: every system sold today is at least dual-core. It's safe to invest in multi-threading technology. A minority of systems equipped with a "Game Processing Unit" and a single-core CPU beyond 2010 is not going to make anyone happy. Game developers rely on baseline performance. Taking steps back is never going to happen. Several soon-to-be-released games even claim to make good use of quad-core and up...
 
Even Celerons are dual-core today.

That only happened just recently, in late January/early February.

If newegg is to be believed, there is only one dual-core Celeron out right now. To my knowledge, this has been the case since January/February.
 
That only happened just recently, in late January/early February.

If newegg is to be believed, there is only one dual-core Celeron out right now. To my knowledge, this has been the case since January/February.
Your point being?
 
You wouldn't happen to have a direct url to this latest dev SDK? I would enjoy the read.
http://www.imgtec.com/powervr/insider/powervr-login.asp?SDK=WindowsXPOGLES2 - you have to register first though.

MIMD works as long as the MIMD cores don't have to go to global memory. The second MIMD cores need to access main memory, you need either caching or something which coalesces reads and writes into the large (banked?) chunks needed for good memory throughput.
Yeah, that's perfectly correct; in the end, you just can't get away from coherence requirements at one level or another.

So in this way I'm not sure MIMD is a win, because all you gain is better branching granularity while you're still constrained by memory access patterns.
Much more importantly though, a massively parallel MIMD core can run any program a CPU could run and it is the best possible kind of architecture for branching; excluding one-time overhead, branching is effectively free as it can be hidden by other threads.

And being able to handle anything a CPU could handle isn't just useful for games. It also makes CPUs largely useless for things like video encoding... Y'know, the last major justification to buy a better CPU in all mainstream desktop CPU reviews?

However, if your MIMD elements can work mostly in a very small and fast local store of memory, or work accessing hardware queues between other scalar cores, then global memory access can be less of an issue.
Yes, then you're changing the programming model a fair bit but sometimes you've gotta do what you've gotta do. Finding the best approach in terms of software isn't an easy task though obviously.

Personally I find this exciting, however at this point, my guess is the lure of MIMD for most programmers goes away when they realize that a critical part of the algorithm is designing memory access patterns and the "software floorplan" (ie distribution and location of stages of the algorithm on each core).
Yeah, that kind of thing is what I was referring to when I said SGX+Ambric would be awesome. But as you said, most mainstream programmers likely wouldn't share that point of view - which is why figuring out smarter architectures is still necessary.
 
Sigh, lost a long reply due to a browser crash. Either way:
  • Pricing for Barcelona isn't a very representative data point in my book given AMD's likely margins at such a price. And don't forget the strong euro...
  • AMD has said again and again that they don't see the point of more than tri/quad-core for desktops even in the next several years. Intel isn't that aggressive either; AFAIK, we'll still see dual-cores being very common in the 32nm generation.
  • Wrt IMC: point granted, that does make the prices harder to compare.
  • Wrt timeframes: Any year you don't see quad-cores being mainstream is a year that other things could happen. Time isn't on anybody's side in this industry.
  • Yes, full MIMDness is what I was referring to. Ideally, you want to be able to transcode x86 instructions (or vice-versa) too.
  • AI and sound can both be very expensive depending on the level of complexity. Let's put it this way: an UT3 bot is much more expensive than a Medieval: Total War AI unit, and there's no reason to believe it couldn't get even more expensive in the future. As for sound, you can do quite expensive things with materials and ray/wavetracing... (and no, I'm not convinced it's overkill either)
  • Doing both AI and graphics on the same Larrabee core at the same time would reduce the level of latency that can be hidden for the SIMD-based GPU workloads.
  • Regarding physics, I think many of the apparent limits are more related to workflows/toolsets/implementations than designer creativity.
  • Yes, 4-way SIMD is required for texturing, so the threads are likely grouped 4-by-4 and synced (with no performance penalty except maybe not being able to hide as much latency for a few cycles) around texture instructions.
  • x86 is far from the worst ISA imaginable, but standard ISAs in general really aren't good, especially when you have specific workloads or architectures in mind that could benefit from specific tricks.
  • Going back to single-core isn't the point. The point is, are we going to go farther than tri/quad-core, ever, in the mainstream market? And are we going to ever need more single-thread performance than a 3GHz Conroe? Certainly we'll need more aggregate performance for some things, but those can come from not-really-CPUs cores.
  • Intel might be trying to convince everyone you need to buy a quad-core, but they certainly haven't convinced me (I recently bought a E8400 and not due to budget limitations).
  • And NVIDIA is certainly working pretty hard to convince OEMs and consumers that you're a loser if you actually buy a quad-core, unless you need to do plenty of video encoding. And the amount of money attributed to a GPU compared to CPUs is increasing at most OEMs.
  • $300-375 PCs at all major OEMs in the USA are single-core, not dual-core, and the ratio of single-core PCs is obviously higher outside of the USA/Western Europe. But yes, dual-core is becoming very mainstream indeed.
  • As I said, being able to run that 'game processing unit' code efficiently on a x86 CPU is necessary for commercial success; ideally, you want to be able to take x86 code and run that on it too, but that's much more problematic already. Middleware is also important.
 
And are we going to ever need more single-thread performance than a 3GHz Conroe?

Yes. We need as much single-thread performance as we can get, or much better ways to master concurrency.

As I said, being able to run that 'game processing unit' code efficiently on a x86 CPU is necessary for commerical success; ideally, you want to be able to take x86 code and run that on it too, but that's much more problematic already. Middleware is also important.

Don't walk down another long-term ISA dependency route. A more neutral bytecode would be a better solution for this job.
 
It has to be at least 4-way SIMD to compute texture coordinates per quad for mipmapping, right?
I don't think so. Just consider how you would solve it on a standard CPU running multiple threads.
 
Pricing for Barcelona isn't a very representative data point in my book given AMD's likely margins at such a price. And don't forget the strong euro...
Point taken. However, it does prove that quad-cores are not necessarily overly expensive. Once AMD gets its act together again and some games and other applications start using quad-core, Phenoms will be pretty popular and Intel will have to drop its prices as well. On a mature 45 nm process, I'm sure they'll still have a good margin. So it's just a matter of time before it becomes really attractive to buy a quad-core instead of a dual-core.
AMD has said again and again that they don't see the point of more than tri/quad-core for desktops even in the next several years.
For several years. Sure. They still have trouble producing quad-cores that are reliable and fast at acceptable margins. So I don't expect them to already make statements about octa-cores. They want to sell Phenom X3 and X4's now and not have people wait for K11.
Intel isn't that aggressive either; AFAIK, we'll still see dual-cores being very common in the 32nm generation.
For handhelds, yes. For everything else it really doesn't make sense to have a 35 mm² chip that sells for 20 €. Quad-cores will be as ubiquitous in the 32 nm generation as dual-cores are in the 45 nm generation.
AI and sound can both be very expensive depending on the level of complexity. Let's put it this way: an UT3 bot is much more expensive than a Medieval: Total War AI unit, and there's no reason to believe it couldn't get even more expensive in the future. As for sound, you can do quite expensive things with materials and ray/wavetracing... (and no, I'm not convinced it's overkill either)

Regarding physics, I think many of the apparent limits are more related to workflows/toolsets/implementations than designer creativity.
These kinds of tasks scale a little, but not by much. Let me put it this way: if we suddenly had a 10x faster monster graphics card, we could easily increase resolution, anti-aliasing, blur, etc. With a 10x higher computational budget for something like physics you can't instantly put that to use; you're limited by a lack of software. And while I'm sure game designer creativity is limitless, it becomes increasingly absurd to throw more resources at it. If you keep adding features it will never be finished. Duke Nukem Forever syndrome.

So I'm really convinced that the steady evolution of CPU performance is fine to keep up with any increase in A.I./physics/sound complexity.
Yes, 4-way SIMD is required for texturing, so the threads are likely grouped 4-by-4 and synced (with no performance penalty except maybe not being able to hide as much latency for a few cycles) around texture instructions.
So you're enthusiastic about SGX's 4-way SIMD but not about software rendering on a multi-core CPU with 4-way SIMD?
x86 is far from the worst ISA imaginable, but standard ISAs in general really aren't good, especially when you have specific workloads or architectures in mind that could benefit from specific tricks.
Sure, it's not the best ISA for a very specific workload. But x86 is quite good at running a very large range of workloads. Developers (like Tim Sweeney) are screaming for processors that don't limit their creativity by making only a few select workloads efficient. Games look too much alike because every time you try something other than what the competition is doing, performance plummets.

Things are definitely improving with every generation, but I think Larrabee will be a significant leap forward in allowing developers to try new things. On systems without Larrabee, a multi-core CPU will be a good fallback for the non-graphical workloads.
Going back to single-core isn't the point. The point is, are we going to go farther than tri/quad-core, ever, in the mainstream market? And are we going to ever need more single-thread performance than a 3GHz Conroe?
Why do you mention a 3 GHz Conroe and not a 2 GHz Conroe? Exactly, because it's never enough. Question answered.
Certainly we'll need more aggregate performance for some things, but those can come from not-really-CPUs cores.
Why not? We're all going to have 100+ GFLOPS CPUs in several years, but other hardware will continue to exist in all sorts and sizes (including being absent). The CPU is the only processor developers can rely on, so it had better get faster as technology advances.

If there's no texture sampling involved, a multi-core CPU almost always beats IGPs at GPGPU tasks...
 
I don't think so. Just consider how you would solve it on a standard CPU running multiple threads.
Sure, SIMD isn't necessary to be able to execute a pixel shader, but it's crucial for good efficiency / area. Four independent scalar pipelines are going to have to execute the same operations for a quad, so why not make that one SIMD pipeline with less control logic?
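
As a rough sketch of what I mean (SSE used purely for illustration, the shader math is made up): the same multiply-add a pixel shader would do for each pixel of a 2x2 quad, written once as four scalar pipelines would see it and once as a single 4-wide instruction stream.

    #include <xmmintrin.h>   // SSE

    // Four "independent scalar pipelines": four copies of the same op,
    // four sets of control logic all deciding the same thing.
    void shade_quad_scalar(float c[4], const float a[4], const float b[4]) {
        for (int i = 0; i < 4; ++i)
            c[i] = a[i] * b[i] + 1.0f;
    }

    // One 4-wide SIMD pipeline: identical math, one instruction stream,
    // one decoder/scheduler for the whole quad.
    void shade_quad_simd(float c[4], const float a[4], const float b[4]) {
        __m128 va = _mm_loadu_ps(a);
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(_mm_mul_ps(va, vb), _mm_set1_ps(1.0f));
        _mm_storeu_ps(c, vc);
    }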
 
  • AI and sound can both be very expensive depending on the level of complexity. Let's put it this way: an UT3 bot is much more expensive than a Medieval: Total War AI unit, and there's no reason to believe it couldn't get even more expensive in the future. As for sound, you can do quite expensive things with materials and ray/wavetracing... (and no, I'm not convinced it's overkill either)
  • Doing both AI and graphics on the same Larrabee core at the same time would reduce the level of latency that can be hidden for the SIMD-based GPU workloads.
  • Regarding physics, I think many of the apparent limits are more related to workflows/toolsets/implementations than designer creativity.

Expanding on the issues of SIMD and GP stuff like Physics, AI, and Sound:

From my perspective SIMD/GPUs are perfect for those problems, even without fine branch granularity. My guess is that resistance towards SIMD/GPGPU in non-graphics areas comes from several things: developers preferring to spend the GPU on better graphics, a lack of the necessary GPU API features (which I think DX10 finally provides, but that isn't much use while consoles don't support it), a lack of easy-to-program general-purpose SIMD languages (devs and compilers often don't get along with SIMD C intrinsics, for example), SIMD not working as well on x86 because there's no texture fetch hardware or latency hiding (gather is too expensive), and a lack of experience solving problems in ways that map onto SIMD/GPUs. It simply requires a different way of thinking about and solving problems.
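
As one concrete example of the intrinsics friction (just a sketch, function names are mine): the dot product every dev writes in ten seconds versus its hand-written SSE equivalent, where the horizontal reduction alone is enough to scare people off.

    #include <xmmintrin.h>

    // What the developer wants to write.
    float dot_scalar(const float* a, const float* b, int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) sum += a[i] * b[i];
        return sum;
    }

    // What the intrinsics version looks like (n assumed a multiple of 4).
    // The vertical part is easy; the horizontal reduction at the end is the
    // kind of boilerplate compilers rarely generate for you.
    float dot_sse(const float* a, const float* b, int n) {
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i),
                                             _mm_loadu_ps(b + i)));
        // Horizontal add: {a0,a1,a2,a3} -> a0+a1+a2+a3 in every lane.
        __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1));
        acc  = _mm_add_ps(acc, shuf);              // {a0+a1, a1+a0, a2+a3, a3+a2}
        shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(1, 0, 3, 2));
        acc  = _mm_add_ps(acc, shuf);              // full sum in all lanes
        float out;
        _mm_store_ss(&out, acc);
        return out;
    }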

Audio is an obvious case which is easily SIMDable; it's just that GPU mixing adds latency because of the GPU->CPU->audio device data transfers. Like image processing algorithms, audio algorithms would mostly be TEX bound (well cached, but limited by the latency of the filtering units), and would greatly benefit from something similar to the CUDA programming model (shared local store, for reuse) or the vertex stream-out approach (where you can take one point as input and output many points, allowing reuse of TEX reads to output multiple inline samples at one time, something you don't get with MRTs).
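
The inner loop of a mixer really is this trivially SIMDable (a toy sketch, not anyone's actual engine code; frame count assumed to be a multiple of 4):

    #include <xmmintrin.h>

    // Mix one voice into an accumulation buffer with a per-voice gain.
    void mix_voice(float* mixbus, const float* voice, int frames, float gain) {
        __m128 g = _mm_set1_ps(gain);
        for (int i = 0; i < frames; i += 4) {
            __m128 acc = _mm_loadu_ps(mixbus + i);
            __m128 src = _mm_loadu_ps(voice + i);
            acc = _mm_add_ps(acc, _mm_mul_ps(src, g));   // mixbus += voice * gain
            _mm_storeu_ps(mixbus + i, acc);
        }
    }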

Physics is also very easily SIMDable, with some very good examples out there on CUDA and the SPUs. GPU/SIMD is near ideal for many algorithms. One problem here is that traditional CPU approaches are geared towards limiting collision tests, and thus usually employ sorting. On GPU/SIMD it is better to remove sorting completely at the expense of doing more collision tests. Again, different methods to solve the same problem, with GPU/SIMD much faster than the CPU for the same chip area and process, I'd wager.
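
As a sketch of the "skip the sort, just test more pairs" style (SoA layout and names are my own illustration; count assumed to be a multiple of 4): one query sphere tested against four candidates per iteration, every lane doing useful work whether or not a CPU-style broadphase would have culled them.

    #include <xmmintrin.h>

    // Positions and radii stored SoA so four candidates load as one vector each.
    struct SpheresSoA { const float *x, *y, *z, *r; int count; };

    // Writes a 4-bit mask of which of spheres[i..i+3] overlap the query sphere.
    void overlap_query(const SpheresSoA& s,
                       float qx, float qy, float qz, float qr,
                       int* out_masks) {
        __m128 px = _mm_set1_ps(qx), py = _mm_set1_ps(qy), pz = _mm_set1_ps(qz);
        __m128 pr = _mm_set1_ps(qr);
        for (int i = 0; i < s.count; i += 4) {
            __m128 dx = _mm_sub_ps(_mm_loadu_ps(s.x + i), px);
            __m128 dy = _mm_sub_ps(_mm_loadu_ps(s.y + i), py);
            __m128 dz = _mm_sub_ps(_mm_loadu_ps(s.z + i), pz);
            __m128 d2 = _mm_add_ps(_mm_mul_ps(dx, dx),
                         _mm_add_ps(_mm_mul_ps(dy, dy), _mm_mul_ps(dz, dz)));
            __m128 rs = _mm_add_ps(_mm_loadu_ps(s.r + i), pr);
            __m128 hit = _mm_cmple_ps(d2, _mm_mul_ps(rs, rs)); // d^2 <= (r1+r2)^2
            out_masks[i / 4] = _mm_movemask_ps(hit);
        }
    }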

As for AI, most research is serial in nature, right? I'd bet all games, given the nature of scripting development, are serial as well. Even if I showed a method by which you could SIMD the AI calculations, the problem of the development pipeline comes up: it's all new, incompatible ideas, foreign to 99% of the devs out there. You cannot simply toss out everything and start from scratch with unproven technology.

My feeling is that SIMD/GPU/GPGPU stuff is simply currently stuck in uncharted waters because of practical/market issues rather than the lack of solutions for the hardware.
 
So I'm really convinced that the steady evolution of CPU performance is fine to keep up with any increase in A.I./physics/sound complexity.
I think you are underestimating the number of ways some non-graphics tasks such as AI and sound can explode in complexity, once you get past relatively small examples.
It's often a matter of the number of components in an AI agent's state machine or representation and the number and complexity of the relationships between them.
For increasingly complex decision making in more dynamic environments, the AI component could take up more than what most games use now.
The AIs we see now don't take up much CPU time because developers often avoid the implementations that do.

I could very well create an AI state machine whose computational complexity explodes for a very minor increase in the number of state elements or the relationships between them.
I'm not certain how much GPUs can do for this, but it depends on what stage of the AI's calculations we're looking at.
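
As a toy illustration (numbers and names entirely made up): take a utility-style decision step whose cost is agents x targets x behaviours; bumping any one of those knobs multiplies the whole per-frame bill.

    #include <cstdio>

    // Cost of one AI tick, in "evaluations", for a planner where every agent
    // scores every behaviour against every perceived target.
    long long ai_tick_cost(long long agents, long long targets, long long behaviours) {
        return agents * targets * behaviours;
    }

    int main() {
        // A modest squad shooter vs. a slightly more ambitious one: doubling
        // two of the three knobs quadruples the per-frame AI work.
        std::printf("%lld evaluations/frame\n", ai_tick_cost(32, 16, 8));   // 4096
        std::printf("%lld evaluations/frame\n", ai_tick_cost(64, 32, 8));   // 16384
        return 0;
    }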


Sure, it's not the best ISA for a very specific workload. But x86 is quite good for running a very large range of workloads. Developers (like Tim Sweeney) are screaming for processors that don't limit their creativity by making only a few select workloads efficient.
It's not clear to me that the advent of more specialized hardware somehow constrained the creativity of developers; the last decade of 3D graphics advancements might back me up on this.
I'd blame market factors and the rising cost of development versus the risk of not making back the investment for throttling far more creativity than the fact that GPUs don't run the file system.
The problem is that there's no such thing as an architecture that can avoid making only select workloads efficient without making all workloads less efficient.

Games look too much alike because every time you try something other than what the competition is doing, performance plummets.
I disagree with the broadness of this claim. Performance doesn't falter just because you code a procedure that isn't written the way a competitor would. If it falters, the big drops are often due to algorithmic or physical reasons, which are universal.

If a game looks a lot like another, one shouldn't also count out the very likely case that non-technical reasons ($$$$) made it so.


If there's no texture sampling involved, a multi-core CPU almost always beats IGPs at GPGPU tasks...
There was an example for financial modeling where the more specialized hardware on the GPUs allowed for some impressive gains. In part, it was due to the special function hardware capable of performing certain mathematical functions that general hardware would have taken dozens to hundreds of cycles to match.
The rate of uptake on even more complex special instructions that don't run through the microcode engine is extremely slow on CPUs because they can't guarantee it will ever be used.
 
Sure, SIMD isn't necessary to be able to execute a pixel shader, but it's crucial for good efficiency / area. Four independent scalar pipelines are going to have to execute the same operations for a quad, so why not make that one SIMD pipeline with less control logic?
Sometimes but not always. Once there are "if-then-elses" or, worse, "loops" in the code, MIMD will be far preferable. IMHO

Please note that I'm not implying anything about the features of SGX by these statements.
 
I think you are underestimating the number of ways some non-graphics tasks such as AI and sound can explode in complexity, once you get past relatively small examples.
I'm well aware of algorithmic complexity. But the question is, do we really need it? Or, better, do we really need it now? In my opinion it's better to just modestly increase complexity with newer games, instead of attempting to make a sudden jump using hardware only a fraction of people will ever have. Multi-core is making CPUs capable of exciting new things for a very wide audience.

Games that get criticised for bad A.I. won't be saved by just throwing more cycles at it. And I have yet to see a game where you can dial the A.I. down or up. The gameplay has to be the same for everyone. So even if specialized hardware for it existed you can't just enable it for the minority of people who have it.
The AIs we see now don't take up much CPU time because developers often avoid the implementations that do.
Exactly. It's called optimization. If I can't see the difference between a simple heuristic and a million node neural network then I'm not going to buy hardware that accelerates the latter.
There was an example for financial modeling where the more specialized hardware on the GPUs allowed for some impressive gains. In part, it was due to the special function hardware capable of performing certain mathematical functions that general hardware would have taken dozens to hundreds of cycles to match.

The rate of uptake on even more complex special instructions that don't run through the microcode engine is extremely slow on CPUs because they can't guarantee it will ever be used.
Which CPU against which IGP? And how much effort did they put into writing SSE optimized transcendental functions and multi-threading?

It's true that some transcendental instructions are missing, but I don't think the lack of any guarantee that they'll be used is the issue. What's really missing is a specification. Intel added a ~12-bit single-cycle reciprocal approximation to SSE, while AMD's implementation is 14-bit accurate. This led to some software running nicely on an AMD and acting funky on an Intel. They could try matching the DirectX 10 requirements, but is that enough for the foreseeable future? What about double precision? I don't think it's a technical matter of implementation or transistor budget. They would have to mobilize an IEEE group to create an industry-wide standard.
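
For reference, this is the kind of thing being described: the SSE approximate reciprocal is only specified to roughly 12 bits of relative accuracy and the exact result differs between implementations, so code that wants consistent answers typically tacks on a Newton-Raphson step (a sketch):

    #include <xmmintrin.h>

    // Approximate reciprocal: fast, but only ~12-bit accurate, and the exact
    // bits are implementation-dependent, so Intel and AMD can disagree.
    __m128 rcp_fast(__m128 a) {
        return _mm_rcp_ps(a);
    }

    // One Newton-Raphson iteration: x1 = x0 * (2 - a * x0).
    // This roughly doubles the number of accurate bits and shrinks the
    // cross-vendor difference to the last bit or two.
    __m128 rcp_refined(__m128 a) {
        __m128 x0  = _mm_rcp_ps(a);
        __m128 two = _mm_set1_ps(2.0f);
        return _mm_mul_ps(x0, _mm_sub_ps(two, _mm_mul_ps(a, x0)));
    }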

Anyway, a parallel table lookup instruction would already work wonders. But then they might as well take the effort to implement a gather instruction to optimize texture sampling as well... :D
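
Right now that kind of lookup has to be faked with scalar loads, something like this sketch (the helper name is mine), which is exactly why texture-style fetches hurt so much on x86 SIMD:

    #include <emmintrin.h>   // SSE2

    // "Gather" four floats from a table using four 32-bit indices.
    // Without a real gather instruction every lane turns into a scalar load.
    __m128 gather_ps(const float* table, __m128i idx) {
        // Spill the indices; per-lane extraction would need SSE4.1.
        int i[4];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(i), idx);
        return _mm_set_ps(table[i[3]], table[i[2]], table[i[1]], table[i[0]]);
    }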
 
Sometimes but not always. Once there are "if-then-elses" or, worse, "loops" in the code, MIMD will be far preferable. IMHO
You still need four pixels taking the exact same code path to have texture coordinate gradients for mipmapping.

The only exception would be non-dependent texture lookups, in which case you interpolate the LOD. But having two approaches would make things unnecessarily complicated.

With 4-way SIMD you just have a granularity of four elements, which is excellent.
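
For context, the LOD those gradients feed is just finite differences across the quad, something like this sketch (standard mipmapping formula, my own variable names):

    #include <cmath>
    #include <algorithm>

    // u,v for the 2x2 quad, already scaled to texels:
    // index 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
    float quad_mip_lod(const float u[4], const float v[4]) {
        float dudx = u[1] - u[0], dvdx = v[1] - v[0];   // derivative along x
        float dudy = u[2] - u[0], dvdy = v[2] - v[0];   // derivative along y
        float rho = std::sqrt(std::max(dudx * dudx + dvdx * dvdx,
                                       dudy * dudy + dvdy * dvdy));
        return std::log2(std::max(rho, 1.0f));          // clamp to the base level
    }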
 
I'm well aware of algorithmic complexity. But the question is, do we really need it? Or, better, do we really need it now? In my opinion it's better to just modestly increase complexity with newer games, instead of attempting to make a sudden jump using hardware only a fraction of people will ever have. Multi-core is making CPUs capable of exciting new things for a very wide audience.

It's not necessarily a sudden jump. Many big titles that have or might benefit from significant AI usage already presuppose a certain level of graphics capability.
If we're going to project that massively multicore CPUs are to become mainstream, it is far more likely to see 1 IGP or GPU core become standard before 8 or more additional CPUs get pasted in.

Games that get criticised for bad A.I. won't be saved by just throwing more cycles at it. And I have yet to see a game where you can dial the A.I. down or up. The gameplay has to be the same for everyone. So even if specialized hardware for it existed you can't just enable it for the minority of people who have it.
Good AI doesn't come without a cost in cycles. If it did, they'd use it.
The hardware doesn't necessarily need to be all that specialized, as certain phases of AI calculation can be fitted to hardware already present.

Exactly. It's called optimization. If I can't see the difference between a simple heuristic and a million node neural network then I'm not going to buy hardware that accelerates the latter.
It's called prioritization. In a number of games it's extremely obvious that the AI is weak. The constant litany of complaints about stupid AI and the faint praise for games that don't have AI that does something astoundingly stupid is evidence enough that the problem is noticeable.

If good AI weren't intensive, more robust methods would be used.
The problem is that AI's computational demands are not easily predicted and can take up more CPU and memory resources than a dev is willing to devote. But the fact that they're unwilling doesn't mean the shortfall goes unnoticed.

Which CPU against which IGP? And how much effort did they put into writing SSE optimized transcendental functions and multi-threading?
Sorry, I misread your sentence to say GPU as opposed to IGP.
I can't seem to find the example now, though I'm pretty sure it was mentioned somewhere on the forum; the GPU's strength in transcendentals and other special functions came up in the context of simulations.
 
You still need four pixels taking the exact same code path to have texture coordinate gradients for mipmapping.
Yes, but they don't all have to be in lockstep in order to achieve this.
With 4-way SIMD you just have a granularity of four elements, which is excellent.
The problem is that once you have branching, efficiency plummets. I have designed a SIMD system, and the number of idle units when branching/looping/subroutine calls occurred worried me.
 
It's not necessarily a sudden jump. Many big titles that have or might benefit from significant AI usage already presuppose a certain level of graphics capability.
I play Crysis on my laptop's Quadro FX 350M (G72). It isn't great but it's perfectly playable. Now, even if that GPU was suited for A.I., the dual-core CPU would still be a much better fit. Let low-end GPUs do what they're best at (read: graphics) and use the versatile power of the CPU for everything else or you're going to lose much of your audience.

In the 32 nm era when quad-core makes it into mainstream there will still be a majority of systems equipped with low-end graphics cards. But the CPU has four powerful cores at your disposal...

I might as well rewrite your phrase and replace "graphics" with "CPU".
If we're going to project that massively multicore CPUs are to become mainstream, it is far more likely to see 1 IGP or GPU core become standard before 8 or more additional CPUs get pasted in.
Excellent. But that core isn't going to be significantly faster at anything other than graphics. It will have SIMD units very much like the other cores, and texture samplers. And unless texture samplers can vastly accelerate A.I. you'd better let the other cores handle it.

Also, heterogeneous cores again make things more difficult for developers. What's Intel going to do and what is AMD going to do? Will all CPUs have an IGP or just the mobile or low-end ones? They can't even deliver reliable graphics drivers for the X3*00, so how attractive do you think it's going to be for developers to use that core for GPGPU? What they can rely on, however, is that quad-core will be ubiquitous, octa-core will make its entry, and in case someone still has a dual-core the application will still be compatible. Just like no one created a game that can only run with a PhysX P1 (except AGEIA), no one's going to create a game that will only run on a specific heterogeneous architecture.
Good AI doesn't come without a cost in cycles. If it did, they'd use it.
Yes, it costs cycles, but it's nonsense that this is why some games have poor A.I. Do you honestly think extra cycles alone are going to fix that? It has a lot more to do with programmer skill, budget, time, and creativity than just raw cycles.
 
Yes but they don't have to all be in lock step in order to achieve this.
That's what I said. But you do have to use SIMD to get rid of redundant control logic and optimize performance / area. The only reason not to use at least 4-way SIMD would be if you don't aim the chip primarily at graphics.
The problem is that once you have branching the efficiency plummets. I have designed a SIMD system and the number of idle units when branching/looping/subroutine calls occurred, worried me.
Yes, with 4-way SIMD in the worst case you're computing four quads for just four pixels. But again, each of these pixels needs gradients for mipmapping. So while this is unfortunate (for this extreme case), I can't think of any better approach. Also, unless you truly use random data there is going to be some coherence between pixels, and efficiency goes up rapidly for 4-way SIMD. You could also combine it with 2x2 supersampling and force all fragments to take the same branch (you get 4x AA and per-pixel branch granularity).
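
For what it's worth, the usual way that worst case gets handled is plain predication: both sides of the branch run for the whole quad and a per-pixel mask selects the result, so divergence costs extra math but the quad (and its gradients) stays intact. A sketch, with the shader math made up:

    #include <xmmintrin.h>

    // Evaluate "cond ? f(x) : g(x)" for a 2x2 quad with 4-wide predication:
    // both paths execute, the mask selects per pixel. Worst case you pay for
    // both branches, but the quad never breaks apart.
    __m128 shade_branch(__m128 x, __m128 threshold) {
        __m128 mask     = _mm_cmpgt_ps(x, threshold);          // per-pixel condition
        __m128 if_val   = _mm_mul_ps(x, _mm_set1_ps(2.0f));    // "then" path: f(x)
        __m128 else_val = _mm_add_ps(x, _mm_set1_ps(1.0f));    // "else" path: g(x)
        return _mm_or_ps(_mm_and_ps(mask, if_val),
                         _mm_andnot_ps(mask, else_val));       // select by mask
    }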

If you know how to improve the worst case in an elegant way I'd love to hear about it! :)
 