It looks like the sweet spot for monolithic multicore chips right now is around four. After that, Intel and the other manufacturers have been talking about putting in smaller ones.

Ok, fair enough.
But don't you think that a chip with, say, 16 x86 cores is, well, overkill for most tasks, and pretty inefficient for the things that might want the speed? As you said yourself, using two cores to run a current task generally gives barely a 40% improvement.
So, we need to come up with a better way to run programs. Throw von Neumann in the garbage bin, and Amdahl with him, so to speak. And build something that does work in this changed world.
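For reference, that ~40% figure is about what Amdahl's law predicts when only a bit over half the work parallelizes. A minimal sketch in C++ (the 0.57 parallel fraction is just a number picked to match the quoted figure, not anything measured):

```cpp
// Amdahl's law: speedup = 1 / ((1 - p) + p / n),
// where p is the parallelizable fraction and n the number of cores.
#include <cstdio>

double amdahl(double p, int n) { return 1.0 / ((1.0 - p) + p / n); }

int main() {
    const double p = 0.57;  // assumed parallel fraction, chosen to match ~40% on 2 cores
    std::printf("2 cores:  %.2fx\n", amdahl(p, 2));   // ~1.40x, i.e. ~40% improvement
    std::printf("16 cores: %.2fx\n", amdahl(p, 16));  // ~2.15x, nowhere near 16x
    return 0;
}
```

Run the same fraction out to 16 cores and you only get a little over 2x, which is the whole point about diminishing returns.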
By 2010 the 32nm process will be available: http://www.eet.com/news/semi/showAr...d=VVYS20XLZI3CIQSNDLSCKHA?articleID=193100380

I've said this before and I'll say it again.
I don't think anyone is arguing that large-scale parallelism isn't the future. The interesting questions aren't about what the processor itself will look like; they're about what a system with, say, 100 cores will look like:
What will communication between the cores look like?
What will the memory system look like?
200 mm^2 is still in the 175 mm^2 range. And maybe, with memory banks that can be remapped after testing, the yields could improve enough to make a big chip viable.
The latency with eDRAM will be significantly lower than with external DRAM, and the bandwidth will also be much higher.
Heat problems.

I was writing that based on your first post (talking about an ~80 mm^2 die). 128 MB of eDRAM or SRAM would be kind of nifty. Also, are you taking into account how much more compact memory generally is, mm^2-wise? (I can't tell and I don't want to try to think about it!)
This is why I am talking about how it could be managed (VMS or manually).

You could likely fit almost all your game data in there... at least for a game made nowadays -- who knows about 5+ years from now. Interesting idea nonetheless -- eDRAM for GPUs seemed like sort of an obvious choice, but I never thought about it for CPUs.
The problem with OS models is that scheduling becomes harder and harder as the number of cores goes up. How will you efficiently schedule work and maintain your mutexed data on 80 cores? With something like software transactional memory, you might pay a performance penalty of 20-50% depending on implementation (hardware support could knock this way down), but your actual performance will scale very well to a high number of cores, even under high contention. What's more, your code will be incredibly simple, with practically zero deadlocks, race conditions, or over-contended locks.
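C++ has no standard STM, so take this as a rough sketch of just the optimistic read/compute/validate/retry core of the idea, applied to a single atomic word (the names and numbers are made up; real STM generalizes this to whole read/write sets, and hardware support would cut the overhead further):

```cpp
// Not real STM -- just the optimistic read/compute/validate/retry core of it,
// applied to a single atomic word. A real STM generalizes this to arbitrary
// read and write sets.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<int> shared_score{0};

void add_score(int delta) {
    int seen = shared_score.load();
    // Retry until nobody else modified the value between our read and our write.
    while (!shared_score.compare_exchange_weak(seen, seen + delta)) {
        // 'seen' has been refreshed with the latest value; just try again.
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 8; ++i)
        workers.emplace_back([] { for (int j = 0; j < 100000; ++j) add_score(1); });
    for (auto& t : workers) t.join();
    std::printf("total: %d\n", shared_score.load());  // 800000 -- no locks, no deadlocks
    return 0;
}
```

The nice property is the one described above: no thread ever blocks while holding a lock, so there is nothing to deadlock on, and contention only costs retries.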
I think the Windows model and the SMP PC/server model for scheduling is a very poor one for a very large number of cores and for parallel applications. SMP servers with a large number of nodes aren't really used to run parallel applications; they merely run many independent processes at the same time.
For really massively parallel processing, look at the supercomputer programming model, where you have control nodes which control the processing and farm out tasks to compute nodes, which do the processing and return the results to the control nodes. The nodes are self-contained, independent execution engines, each with its own CPU core and local memory. Blue Gene, for example, uses 130,000 such self-contained "CPU cores" with "local-store RAM" connected together by communication links. That is proof that the method scales and delivers performance, even though nobody is claiming the programming is easy.
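For illustration, here's a toy version of that control-node/compute-node split. On a Blue Gene-style machine the queue would be message passing over the interconnect and every node would own its local memory; plain threads and a mutex just stand in for that here, and all the names are made up:

```cpp
// Toy control-node / compute-node farm: the control node fills a work queue up
// front, the compute nodes drain it and return partial results.
#include <atomic>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::mutex mtx;
std::queue<int> tasks;            // work farmed out by the control node
std::atomic<long> result_sum{0};  // results returned to the control node

void compute_node() {
    for (;;) {
        int work;
        {
            std::lock_guard<std::mutex> lock(mtx);
            if (tasks.empty()) return;  // nothing left to do
            work = tasks.front();
            tasks.pop();
        }
        result_sum += static_cast<long>(work) * work;  // "processing" on local data
    }
}

int main() {
    // Control node: farm out 1000 tasks before the compute nodes start.
    for (int i = 1; i <= 1000; ++i) tasks.push(i);

    std::vector<std::thread> nodes;
    for (int i = 0; i < 4; ++i) nodes.emplace_back(compute_node);
    for (auto& n : nodes) n.join();

    std::printf("sum of squares: %ld\n", result_sum.load());
    return 0;
}
```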
What do you think a supercomputer programming model runs if not many independent processes? These big shared-nothing clusters only work well with parallel tasks that don't intercommunicate often, since their communication links are so much slower than their local RAM.
On the contrary, SMP is relatively good at the kinds of tasks that require close synchronization between many parallel tasks, because of its fully shared, coherent memory. The bread and butter of an SMP machine is a single process with multiple threads.
The tradeoff for all that sharing is that scaling an SMP to thousands of nodes is impractical. But for a console, you're not going to see thousands of nodes any time soon, and most of the problems solved in a game engine are not the hugely parallel, shared-nothing type of tasks that run well on a huge supercomputer cluster (at least not today, anyway).
Btw, for games, does it really matter if all the tasks finish?
Say you are calculating the new states of your objects, and you have your data spatially partitioned and indexed. When it is time to start making draw calls to render the next screen, you first send the new states of all the objects that finished processing, and then the old states of the objects that didn't. The update jobs write the new object states back from their local storage when they finish, so there is no problem there.
Even better: you don't even have to wait for the objects to finish processing at all. You simply have a separate, unsynchronized managing thread that continuously hands new object-update jobs to cores whenever processing capacity is available. And you only time the screen updates and I/O.
In that way, you essentially make your game cycle asynchronous, and eliminate the worst problems and stalls.
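A bare-bones sketch of that idea, with names and numbers made up: each object carries its last committed state, an unsynchronized worker keeps committing new states as it gets to them, and the timed render loop just reads whatever was committed last, old or new, without waiting for anyone:

```cpp
// Each object keeps its last committed state; an unsynchronized worker commits
// fresh states whenever it finishes one, and the timed render loop just reads
// whatever was committed last. All names here are made up for illustration.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

struct Object {
    std::atomic<float> committed_x{0.0f};  // last state the renderer may use
};

// Worker: keeps updating objects, completely unsynchronized with rendering.
void update_worker(std::vector<Object>& objs, std::atomic<bool>& running) {
    while (running)
        for (auto& o : objs)
            o.committed_x.store(o.committed_x.load() + 0.1f);  // "simulate" and commit
}

int main() {
    std::vector<Object> objects(4);
    std::atomic<bool> running{true};
    std::thread worker(update_worker, std::ref(objects), std::ref(running));

    for (int frame = 0; frame < 5; ++frame) {  // the only timed part: rendering
        std::this_thread::sleep_for(std::chrono::milliseconds(16));
        std::printf("frame %d:", frame);
        for (auto& o : objects)                          // old or new state,
            std::printf(" %.1f", o.committed_x.load());  // whichever was committed last
        std::printf("\n");
    }

    running = false;
    worker.join();
    return 0;
}
```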
Yes. But then again, the simulation isn't perfect even when you sync all of it, simply because you always have to pick a moment to freeze everything and do a render.
Things don't need to be done at the render speed, they just have to be consistent. DOOM 3's internal simulation runs at a fixed number of tics per second; the renderer goes up and down by tens of frames per second.

And that generally means doing everything in steps of the supposed render speed. But when you take too long to update all your objects, you lag. And when you finish too quickly, you get out of sync as well. And most games that use a single loop don't have correction mechanisms either.
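For comparison, the usual way to decouple a fixed simulation tic from a variable render rate is an accumulator loop along these lines (the 60 tics per second and the 25 ms fake frame time are just illustrative numbers):

```cpp
// Fixed simulation tics, variable render rate: leftover time accumulates and
// the simulation runs however many whole tics it owes each frame.
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    using clock = std::chrono::steady_clock;
    const double tic = 1.0 / 60.0;  // fixed simulation step: 60 tics per second
    double accumulator = 0.0;
    double sim_time = 0.0;
    auto previous = clock::now();

    for (int frame = 0; frame < 10; ++frame) {  // the render loop
        std::this_thread::sleep_for(std::chrono::milliseconds(25));  // pretend rendering took 25 ms
        auto now = clock::now();
        accumulator += std::chrono::duration<double>(now - previous).count();
        previous = now;

        while (accumulator >= tic) {  // run as many fixed tics as we owe
            sim_time += tic;          // advance the simulation exactly one tic
            accumulator -= tic;
        }
        std::printf("frame %d: simulation at %.3f s\n", frame, sim_time);  // then render
    }
    return 0;
}
```

Slow frames just run extra tics to catch up, and fast frames carry the leftover time over instead of drifting out of sync.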
In real life, things just happen, and you only care when you interact. That's the bottom line. And when you calculate states independently (asynchronously), you get fewer "AI artifacts", so to speak.
You just need a different model, in which all your objects roam free until interaction.
And collisions and I/O should be no problem (again: as long as you partition and index everything spatially). They have to happen in either case, and if you have good locality, you need to do far fewer calculations.
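A uniform grid is the simplest version of that spatial indexing; in this sketch (cell size and types are arbitrary) objects are only ever compared against the few that land in the same cell, and a fuller version would also look at the neighbouring cells:

```cpp
// Uniform-grid spatial index: objects are bucketed by cell, so interaction and
// collision tests only consider the few objects in the same (or neighbouring)
// cells instead of everything in the world. Cell size and types are arbitrary.
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Entity { int id; float x, y; };

using CellKey = std::int64_t;

CellKey cell_key(float x, float y, float cell = 10.0f) {
    auto cx = static_cast<std::int32_t>(x / cell);
    auto cy = static_cast<std::int32_t>(y / cell);
    return (static_cast<CellKey>(cx) << 32) ^ static_cast<std::uint32_t>(cy);
}

int main() {
    std::vector<Entity> world = {{0, 1, 1}, {1, 2, 3}, {2, 55, 60}, {3, 56, 61}};

    // Build the grid index.
    std::unordered_map<CellKey, std::vector<int>> grid;
    for (const auto& e : world) grid[cell_key(e.x, e.y)].push_back(e.id);

    // Only entities sharing a cell are candidates for interaction here
    // (a fuller version would also check the eight neighbouring cells).
    for (const auto& [key, ids] : grid) {
        std::printf("cell %lld:", static_cast<long long>(key));
        for (int id : ids) std::printf(" %d", id);
        std::printf("\n");
    }
    return 0;
}
```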
In a simulated system, all information is by default completely visible and completely correct, insofar as the simulation is capable of representing it.

The key assumption that needs to be broken is that actors necessarily have correct and perfect information about the state of the world. In the real world, insects, animals, and human beings neither have a complete view of their surroundings, nor is the view they do have necessarily correct (they can have inaccurate data).
You are not aware of the state of the food, but the state still exists, and by virtue of whatever this thing called reality is, one second to you is approximately one second to the food that may or may not be cooking too long.

One of the key reasons we are even able to function is that we are able to ignore vast quantities of information and focus only on what is salient to the task at hand. We even ignore information immediately in our visual field. In writing this, I have ignored the state of some food cooking on the stove. I do not know its current state. I have a rough idea where it is, and using mental models, I can estimate whether I think it might be done or not, but that's it. I have also ignored the flashy icons on my desktop indicating IMs.
My concerns are not just with AI, but with simulation integrity. A pathfinding algorithm can work fine with incomplete data. No algorithm works fine if the very bases of time and space can flit around due to factors not present in the simulated world.

Most of the lock-step AI parallelization arguments come from the flawed assumption that for AI to work, a correct and consistent database of the world must be available to the AI algorithms. Real-world pathfinding and tracking, for example, does not depend on correct knowledge of where everyone else is or the lay of the landscape. People can pathfind in completely new environments. Sometimes they fail and get lost. That's reality. They use mostly local information; insects, bacteria, and white blood cells pathfind via chemical gradients.
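To make the "local, possibly wrong information" point concrete, here is a small made-up sketch of an agent that never queries the authoritative world state: it keeps its own last-seen memory of targets and plans against that, stale or not:

```cpp
// The agent never queries the authoritative world state: it keeps its own
// possibly stale, possibly wrong memory of where things were last seen and
// plans against that. All names here are illustrative.
#include <cmath>
#include <cstdio>
#include <unordered_map>

struct Pos { float x, y; };

struct Agent {
    Pos self{0, 0};
    std::unordered_map<int, Pos> last_seen;  // beliefs, not ground truth

    // Perception: only targets currently within range update the beliefs.
    void perceive(int id, Pos truth, float range = 20.0f) {
        float dx = truth.x - self.x, dy = truth.y - self.y;
        if (std::sqrt(dx * dx + dy * dy) <= range) last_seen[id] = truth;
    }

    // Planning uses whatever the agent believes, correct or not.
    void chase(int id) const {
        auto it = last_seen.find(id);
        if (it == last_seen.end()) { std::printf("no idea where %d is, wander\n", id); return; }
        std::printf("heading for last known position of %d: (%.0f, %.0f)\n",
                    id, it->second.x, it->second.y);
    }
};

int main() {
    Agent a;
    a.perceive(7, {10, 5});  // target 7 seen once at (10, 5)
    // ... target 7 has since moved far away, but the agent never saw it happen ...
    a.chase(7);              // still heads for (10, 5): wrong, but perfectly workable
    return 0;
}
```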