Future console CPUs: will they go back to OoOE, and other questions.

Ok, fair enough.

But don't you think that a chip with, say, 16 x86 cores is, well, overkill for most tasks, and pretty inefficient for the things that might want the speed? As you said yourself, using two cores to run a current task generally gives barely a 40% improvement.
It looks like the sweet spot for monolithic multicore chips right now is around four cores. Beyond that, Intel and the other manufacturers have been talking about adding smaller ones instead.

So, we need to come up with a better way to run programs. Throw von Neumann in the garbage bin, and Amdahl with him, so to speak. And build something that does work in this changed world.

There are alternatives to the von Neumann architecture, but few of them have much of a track record when it comes to actual implementations.
The current model also has a lot of advantages in how well silicon can be made to support it.
Von Neumann's architecture is also much more understandable and maintainable for designers and programmers than a lot of the parallel models.

Amdahl isn't going away. It's not a question of instructions or machine architecture. It's the nature of all problems that there will be some portion of work that cannot be run simultaneously with everything else.
 
I've said this before and I'll say it again.
I don't think anyone is arguing that large-scale parallelism isn't the future. The interesting question isn't what the processors themselves will look like; it's what the rest of a system with, say, 100 cores will look like.
What will communication between the cores look like?
What will the memory system look like?
By 2010 the 32nm process will be available http://www.eet.com/news/semi/showAr...d=VVYS20XLZI3CIQSNDLSCKHA?articleID=193100380
This means designs can be shrunk to roughly an eighth of their current area (or, equivalently, fit roughly eight times as much logic in the same area).

People could have 24 Xenon cores at 9GHz, but that would generate too much heat and bandwidth would be a problem. Let's prioritize vertical performance increases (more single-thread performance) over horizontal increases (more threads).

This better balances the software, reducing the risk of the diminishing returns from leftover serial code that Amdahl's Law describes.

Then let's have only eight OoOE RISC cores with SIMD, in an SMP configuration.
This could mean (rough arithmetic check below):
- 8 x 9GHz OoOE RISC cores with strong SIMD in SMP (groups of 4 cores?)
- some L2 cache (not much, maybe 4MBytes)
- 144 Giga issues/sec
- 72 Giga dot products/sec
- 576 GFLOPS (or double that if you add a second vector unit to each core)
- under 30 watts
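
Roughly, the arithmetic behind those throughput figures, under my own assumptions (dual-issue cores and one 4-wide multiply-add per cycle; the post above doesn't spell these out):

```python
# Back-of-envelope check of the figures above. Issue width and SIMD shape
# are assumptions on my part, not something specified in the post.
cores          = 8
clock_ghz      = 9       # GHz
issue_width    = 2       # assumed dual-issue per core
simd_width     = 4       # assumed 4-wide vector unit
flops_per_lane = 2       # multiply-add counted as 2 flops per lane

giga_issues = cores * clock_ghz * issue_width                  # 144 Giga issues/sec
giga_dots   = cores * clock_ghz                                # 72 Giga dot products/sec (1 per cycle per core)
gflops      = cores * clock_ghz * simd_width * flops_per_lane  # 576 GFLOPS

print(giga_issues, giga_dots, gflops)  # -> 144 72 576
```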

Die size:
- 8 cores: 35mm2 (4.4mm2 each)
- 4MB L2 cache: 20mm2
- glue logic: 25mm2
- total die size: 80mm2

Now how to sustain half a teraflop of performance? :oops:
And please without much heat.

My idea is lots of embedded DRAM (EDRAM).

Add some 64MBytes and that means another 60mm2, but you could have half a terabyte per second of memory bandwidth.

The EDRAM could be explicitly managed (with the programmer in full control), automatically managed like a virtual memory system, or a mix of both.

Maybe with large pages (4KBytes) and WSClock-style replacement, this EDRAM could work.
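
To make the WSClock idea concrete, here is a toy sketch of one eviction sweep over the EDRAM page frames. The page fields, the TAU working-set window and the pool structure are my own illustrative choices, not part of the proposal above:

```python
# Toy WSClock eviction over an EDRAM-resident page pool.
# All names and parameters here are invented for illustration.
from dataclasses import dataclass

PAGE_SIZE = 4 * 1024     # the 4KByte pages suggested above
TAU       = 50_000       # working-set window in virtual-time ticks (arbitrary)

@dataclass
class Page:
    vaddr: int
    referenced: bool = False
    dirty: bool = False
    last_use: int = 0

class WSClockPool:
    def __init__(self, frames):
        self.frames = frames     # list of Page objects backed by EDRAM
        self.hand = 0

    def pick_victim(self, now):
        """One sweep of the clock hand; returns the frame index to reuse."""
        n = len(self.frames)
        dirty_candidate = None
        for _ in range(n):
            p = self.frames[self.hand]
            if p.referenced:                 # used recently: second chance
                p.referenced = False
                p.last_use = now
            elif now - p.last_use > TAU:     # fell out of the working set
                if not p.dirty:
                    return self.hand         # clean and old: evict now
                dirty_candidate = self.hand  # dirty: needs write-back first
            self.hand = (self.hand + 1) % n
        # nothing clean and old this sweep: fall back to a dirty old page,
        # or whatever the hand currently points at
        return dirty_candidate if dirty_candidate is not None else self.hand
```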
 
And if you add 128MB it means 120mm2 of EDRAM and 200mm2 of total die size.

But the EDRAM part could be designed with some spare memory banks that are mapped in after the testing phase, increasing chip yields and reducing overall cost.

By 2015 (16nm process) we could have 512MBytes of EDRAM and 2 TERAFLOPS :oops:
 
That seems a bit conservative, pascal.

I'd think they'd still be aiming for a 175mm^2 +/- 25mm^2 die size (unless they go the Wii route in the future).

I wonder about a 4MB per-core cache. Why not a larger shared cache? Maybe ~16MB of shared L2 if you're talking about 8 cores, and possibly even some L3.

It seems >8 cores is likely too, and non-symmetrical at that.

Why do you consider bandwidth such a big problem? I'd say latency will continue to be the biggest issue. In the PC space it doesn't seem to be a huge issue at this point, and in the console market there seems to be an abundance of CPU RAM bandwidth -- I've never heard any dev complain about RAM bandwidth on the CPU side, only latency.
 
200mm2 is still in the 175mm2 +/- 25mm2 range. And maybe with post-test mapped memory banks the yields could improve, making a big chip viable.

The latency with EDRAM will be significantly lower than external DRAM, and the bandwidth will also be much higher.
 
200mm2 is still in the 175mm2 +/- 25mm2 range. And maybe with post-test mapped memory banks the yields could improve, making a big chip viable.

The latency with EDRAM will be significantly lower than external DRAM, and the bandwidth will also be much higher.

I was writing that based on your first post (talking about an ~80mm^2 die). 128MB of EDRAM or SRAM would be kind of nifty. Also, are you taking into account how much more compact memory generally is, mm^2-wise? (I can't tell and I don't want to try to think about it!)

You could likely fit almost all your game data in there... at least for a game made nowadays -- who knows in 5+ years. Interesting idea nonetheless -- EDRAM for GPUs seemed like sort of an obvious choice, but I never thought about it for CPUs.
 
I was writing that based on your first post (talking about an ~80mm^2 die). 128MB of EDRAM or SRAM would be kind of nifty. Also, are you taking into account how much more compact memory generally is, mm^2-wise? (I can't tell and I don't want to try to think about it!)
Heat problems.
80mm2 cores + cache + glue logic
120mm2 EDRAM
200mm2 total die size

You could likely fit almost all your game data in there... at least for a game made nowadays -- who knows in 5+ years. Interesting idea nonetheless -- EDRAM for GPUs seemed like sort of an obvious choice, but I never thought about it for CPUs.
This is why I am talking about how it could be managed (VMS or manually).

Also, my guess is that 1T-SRAM has even lower latency.

And GPUs never really devoted such a large percentage of their die area to EDRAM.
 
The problem with OS models is that scheduling becomes harder and harder as the number of cores goes up. How will you efficiently schedule work and maintain your mutex'ed data on 80 cores? With something like software transactional memory, you might pay a performance penalty of 20-50% depending on implementation (hardware support could knock this way down), but your actual performance will, even under high contention, scale very well on a high number of cores, and what's more, your code will be incredibly simple with practically zero deadlocks, race conditions, or over-contended locks.
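
For anyone who hasn't seen the model, here's a stripped-down sketch of the read-log/write-log/retry idea -- emphatically not a real STM (no per-location version clocks, no contention manager), just enough to show why the user-side code stays simple:

```python
# Toy illustration of the retry idea behind software transactional memory.
# User code is straight-line; the runtime re-runs it on conflict.
import threading

class TVar:
    """A transactional variable: a value plus a version number."""
    def __init__(self, value):
        self.value = value
        self.version = 0

_commit_lock = threading.Lock()

def read(tv, read_log):
    read_log.setdefault(tv, tv.version)   # remember the version we read
    return tv.value

def write(tv, value, write_log):
    write_log[tv] = value                 # buffer the write until commit

def atomically(txn):
    while True:                           # retry until a clean commit
        read_log, write_log = {}, {}
        result = txn(read_log, write_log) # run speculatively, no locks held
        with _commit_lock:
            # validate: nothing we read changed behind our back
            if all(tv.version == v for tv, v in read_log.items()):
                for tv, new_value in write_log.items():
                    tv.value = new_value
                    tv.version += 1
                return result
        # conflict detected: fall through and re-run the transaction

# usage: a transfer between two shared counters, no explicit locks in user code
a, b = TVar(100), TVar(0)
def transfer(read_log, write_log):
    write(a, read(a, read_log) - 10, write_log)
    write(b, read(b, read_log) + 10, write_log)
atomically(transfer)
```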

I think the Windows model and the SMP PC/server model for scheduling is a very poor one for a very large number of cores and parallel applications. SMP servers with a large number of nodes aren't really used to run parallel applications; they merely run many independent processes at the same time.

For really massively parallel processing, look at the supercomputer programming model, where you have control nodes which control the processing and farm out tasks to compute nodes, which do the processing and return the results to the control nodes. The nodes are self-contained, independent execution engines, each with its own CPU core and local memory. Blue Gene, for example, uses 130,000 such self-contained "CPU cores" with "local store RAM" connected together by communication links. That is proof that the method scales and delivers performance, even though nobody is claiming programming is easy.

The supercomputing programming model/architecture is also eerily similar to the Cell and the AMD/Intel massively multicore designs.
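
A toy version of that control-node/compute-node split, squeezed onto one machine (the task is a placeholder, and a real machine ships work units over an interconnect such as MPI, but the shape is the same):

```python
# Rough sketch of the control-node / compute-node pattern described above,
# shrunk onto one machine with multiprocessing.
from multiprocessing import Pool

def compute_node(work_unit):
    """Self-contained task: runs to completion using only its local data."""
    chunk_id, data = work_unit
    return chunk_id, sum(x * x for x in data)

def control_node(dataset, n_nodes=4, chunk=1024):
    # The control node does the scheduling and the collation -- not a
    # preemptive OS scheduler on the compute nodes.
    work = [(idx, dataset[i:i + chunk])
            for idx, i in enumerate(range(0, len(dataset), chunk))]
    with Pool(n_nodes) as pool:
        results = dict(pool.map(compute_node, work))
    return sum(results.values())

if __name__ == "__main__":
    print(control_node(list(range(10_000))))
```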
 
I think the Windows model and the SMP PC/server model for scheduling is a very poor one for a very large number of cores and parallel applications. SMP servers with a large number of nodes aren't really used to run parallel applications; they merely run many independent processes at the same time.

For really massively parallel processing, look at the supercomputer programming model, where you have control nodes which control the processing and farm out tasks to compute nodes, which do the processing and return the results to the control nodes. The nodes are self-contained, independent execution engines, each with its own CPU core and local memory. Blue Gene, for example, uses 130,000 such self-contained "CPU cores" with "local store RAM" connected together by communication links. That is proof that the method scales and delivers performance, even though nobody is claiming programming is easy.

o_O

What do you think a supercomputer programming model runs, if not many independent processes? These big shared-nothing clusters only work well with parallel tasks that do not intercommunicate often, since their communication links are so much slower than their local RAM.

On the contrary, SMP is relatively good at the kinds of tasks that require close synchronization between many parallel tasks, because of its fully shared coherent memory. The bread and butter of an SMP machine is a single process with multiple threads.

The tradeoff for having so much sharing is that scaling an SMP to thousands of nodes is impractical. But for a console, you're not going to see thousands of nodes any time soon, and most of the problems solved in a game engine are not the hugely parallel, shared-nothing type of task that runs well on a huge supercomputer cluster (at least not today, anyway).
 
Here's an alternative to traditional superscalar-type architectures that appears to adapt quite well to different types of workloads. Single-threaded performance is comparable to the Alpha 21264, multi-threaded throughput is comparable to Niagara, and perf/mm^2 is dramatically better.

http://wavescalar.cs.washington.edu/wavecache.shtml

This paper looks at various configurations of the architecture, from 16 to 512 PEs.
http://wavescalar.cs.washington.edu/WaveScalarISCA06.pdf

Since it's a dataflow architecture, it can devote all 512 of its PEs to a single thread (assuming that such parallelism exists in the code). Check out the single-threaded (SpecINT) performance - scaling with PE count is less than optimal and nicely illustrates some of 3dilettante's points vis-a-vis Amdahl.

On another note: does any hardware currently support transactional memory? Is implementing it going to be a huge cost versus what we have today?
 
o_O

What do you think a supercomputer programming model runs, if not many independent processes? These big shared-nothing clusters only work well with parallel tasks that do not intercommunicate often, since their communication links are so much slower than their local RAM.

On the contrary, SMP is relatively good at the kinds of tasks that require close synchronization between many parallel tasks, because of its fully shared coherent memory. The bread and butter of an SMP machine is a single process with multiple threads.

The tradeoff for having so much sharing is that scaling an SMP to thousands of nodes is impractical. But for a console, you're not going to see thousands of nodes any time soon, and most of the problems solved in a game engine are not the hugely parallel, shared-nothing type of task that runs well on a huge supercomputer cluster (at least not today, anyway).

The way tasks are scheduled and the way memory locality is organised are very different.

OS-driven preemptive scheduling across SMP cores is what's used on desktop PCs and file/database servers, and memory is shared between the cores.

In supercomputers, OS-managed pre-emptive scheduling isn't used to allocate tasks to compute nodes; instead this is done explicitly by a process on the control node, which then waits for the results to be returned. With Linux/Unix (which are the only OSes used for supercomputing applications), it is possible to use OS-managed pre-emptive scheduling over a network cluster (OpenMosix) and this works efficiently; however, OpenMosix is used only for desktop clusters.

Also, unlike the PC OS model, the individual compute node CPUs don't time-share lots of processes on the same CPU. Instead they run one assigned task to completion, return the results, and then run the next one. The program running on the control node manages the scheduling of tasks to compute nodes, not the compute node OS. The reason is no doubt that OS-managed scheduling makes task completion times unpredictable and indeterminate (since individual tasks can be interrupted by any other task that happens to be running on the same CPU at the same time), and therefore slow and inefficient if one parallel task has to complete, or its results be collated, before the next can start.

As for the SMP shared memory model, supercomputers don't use it - at least not as part of the memory access model used for parallel processing. Instead they use a Cell-type local store model (but with a much bigger "local store" of, say, 256MB to 1GB per compute node). There are no SMP supercomputers, although to increase density, reduce power consumption, and reduce heat dissipation (in order to save on floor space, electricity supply and air conditioning operating costs), the compute nodes may use commercially available SMP machines locally.

For SMP, the performance sweet spot is 4 CPUs. For NUMA, it is something like 64 CPUs. Above about 256 CPUs, you really need to go for the Cell-type architecture of a PPE as controller plus SPEs as compute nodes. The massive 80+ core chips Intel and AMD are planning will have to follow the Cell model, since NUMA on a single-chip multi-core design is not practical because each core would require its own separate memory bus.
 
Btw, for games, does it really matter if all the tasks finish?

Say, you are calculating the new states of your objects, and you have your data spatially partitioned and indexed. And when it is time to start making draw calls to render the next screen, you first send the new states of all the objects that finished processing, and then the old states of the objects that didn't. They have to update the global object states from their local storage when finished, so there is no problem there.

Even better: you don't even have to wait for objects to finish processing altogether. You simply have a different, unsynchronized managing thread that continuously adds new object update jobs to cores when there is processing available. And you only time the screen updates and I/O.

In that way, you essentially make your game cycle asynchronous, and eliminate the worst problems and stalls.
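
Something like the following, roughly: worker threads update objects whenever a core is free, each object publishes whatever state it last finished, and the render step just snapshots those without waiting. All names and the "physics" here are invented for illustration:

```python
# Sketch only: a free-running manager feeds update jobs, workers commit new
# states when they finish, and rendering never waits for stragglers.
import copy
import queue
import threading
import time

class GameObject:
    def __init__(self, oid):
        self.oid = oid
        self.committed = {"pos": 0.0}      # last fully updated state
        self.lock = threading.Lock()

    def simulate(self, dt):
        new_state = {"pos": self.committed["pos"] + dt}  # placeholder update
        with self.lock:                                  # publish atomically
            self.committed = new_state

jobs = queue.Queue()

def worker():
    while True:
        obj, dt = jobs.get()
        obj.simulate(dt)
        jobs.task_done()

def manager(objects, dt):
    # The unsynchronized managing thread: just keeps the cores fed.
    while True:
        for obj in objects:
            jobs.put((obj, dt))
        time.sleep(dt)

def render(objects):
    # New states where available, old states where not -- no global barrier.
    frame = []
    for obj in objects:
        with obj.lock:
            frame.append(copy.copy(obj.committed))
    return frame

objects = [GameObject(i) for i in range(64)]
for _ in range(4):                                       # one worker per core
    threading.Thread(target=worker, daemon=True).start()
threading.Thread(target=manager, args=(objects, 1 / 60), daemon=True).start()
print(len(render(objects)))                 # timed by display/IO, not by the sim
```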
 
Btw, for games, does it really matter if all the tasks finish?

Say, you are calculating the new states of your objects, and you have your data spatially partitioned and indexed. And when it is time to start making draw calls to render the next screen, you first send the new states of all the objects that finished processing, and then the old states of the objects that didn't. They have to update the global object states from their local storage when finished, so there is no problem there.

Even better: you don't even have to wait for objects to finish processing altogether. You simply have a different, unsynchronized managing thread that continuously adds new object update jobs to cores when there is processing available. And you only time the screen updates and I/O.

In that way, you essentially make your game cycle asynchronous, and eliminate the worst problems and stalls.

It really depends on how important correctness is for the simulation or for a given situation.
The more relaxed the constraints are, the better performance can be.

For some simulations, the difference is minor or can be corrected with successive approximations. The engine needs to be designed to handle it, otherwise accumulated errors will ruin the output.

I'm not sure this can be used too heavily in a game. It could be worse than bad lag; it would be inconsistent lag. If the AI is asynchronous to rendering or the simulation, it could just as easily be behind or ahead of where it should be. It might not know things it must, or know things it shouldn't.

With physics, objects might collide when they shouldn't, or pass through each other when they should hit.

Since IO can influence the turnout of the simulation, a delay in processing objects can cause their old states to be applied in an invalid way to the new state.

You can just sit there and hope it won't happen, but it eventually will at 60 fps.

Synchronization isn't there to slow things down, it's there to make sure things make sense. The trick is knowing when things can make sense even when they aren't in lockstep.
 
Yes. But then again, the simulation isn't perfect even when you sync all of it. Simply because you always have to pick a moment to freeze everything and do a render.

And that generally means: doing everything in steps of the supposed render speed. But when you take too long to update all your objects, you lag. And when you finish too quickly, you get out of sync as well. And most games that use a single loop don't have correction mechanisms either.



In real life, things just happen. And you only care when you interact. That's the bottom line. And when you calculate states independently (asynchronously), you have fewer "AI artifacts", so to speak.

You just need a different model, in which all your objects roam free until interaction.

And collisions and IO should be no problem (again: as long as you partition and index everything spatially). They have to happen in either case, and if you have good locality, you need to do far fewer calculations.
 
DiGuru - I really like the "aesthetics" of your distribute-work-according-to-spatial-locality idea. In a game environment, though, won't interaction almost always be in the vicinity of the player? How would you handle cases where there are lots of interactions within a very tiny space (say, a building exploding into thousands of pieces)?

Also, so I can get a handle on your asynchronous state updates - how would you handle a case where calculating a particular entity's state consistently takes longer to compute than the local simulation timestep? Are you suggesting arbitrarily dropping some interactions when necessary to keep entity clocks semi-synchronized?
 
Yes. But then again, the simulation isn't perfect even when you sync all of it. Simply because you always have to pick a moment to freeze everything and do a render.

No simulation is perfect because time has to advance in steps anyway. That's the only way machines can simulate anything.
If there is no synchronization, the more easily simulated portions will pull ahead of the more difficult parts.

And that generally means: doing everything in steps of the supposed render speed. But when you take too long to update all your objects, you lag. And when you finish too quickly, you get out of sync as well. And most games that use a single loop don't have correction mechanisms either.
Things don't need to be done at the render speed; they just have to be consistent. DOOM 3's internal simulation runs at a fixed number of tics per second, while the renderer goes up and down by tens of frames per second.
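
For reference, the usual shape of that decoupling is a fixed-step accumulator loop -- a generic sketch, not DOOM 3's actual code:

```python
import time

TIC_RATE = 60                     # simulation tics per second (assumed)
DT = 1.0 / TIC_RATE

def game_loop(simulate, render):
    previous = time.perf_counter()
    accumulator = 0.0
    while True:
        now = time.perf_counter()
        accumulator += now - previous
        previous = now
        while accumulator >= DT:  # every object advances by the same fixed tic
            simulate(DT)
            accumulator -= DT
        alpha = accumulator / DT  # fraction of the way to the next tic
        render(alpha)             # renderer runs as fast or as slow as it can
```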

What is important is that second 1 for object A is shown in the same second as second 1 for object Z.
Object A must behave properly if in second 3 it collides with object C. If they are allowed to go off on their own good time, object A will check for collisions in second 3, but object C hasn't finished processing second 2. The collision detection will fail.

If the events and behaviors aren't consistent, the simulation ceases to make sense.

In real life, things just happen. And you only care when you interact. That's the bottom line. And when you calculate states independently (asynchronously), you have fewer "AI artifacts", so to speak.

Asynchronous does not mean independent, it means there is no single step or beat that keeps everything in lockstep. Independent means things in the simulation have nothing and will have nothing to do with one another, or the user.

If objects are not given a sense of time within the simulation they will not behave as they would in reality.

An object that takes 50 instructions to simulate will be twice as far ahead as an object that takes 100 instructions if the simpler object is not told to wait.
It would make perfect sense to the object to go as fast as it is going, it doesn't know the difference. If there isn't something keeping track of both objects, nothing is going to know there was a problem.

The observer (player) would know something is wrong if this isn't carefully dealt with.
An object can work ahead so long as it keeps a history of its states and the rest of the simulation accesses the right part of the history.

That of course has its own problems: every time step an object stores increases its memory footprint, and the further ahead something works, the more likely it is to be wrong.
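
Something like this toy state history, with the maximum lead chosen arbitrarily:

```python
# Sketch of the state-history idea: an object that works ahead keeps its
# recent states tagged by simulation tic, and everything else queries the
# state for the tic it is actually on. The footprint grows with how far
# ahead the object is allowed to run.
from collections import deque

class StateHistory:
    def __init__(self, max_lead=8):
        self.history = deque(maxlen=max_lead)   # (tic, state), oldest first

    def record(self, tic, state):
        self.history.append((tic, state))

    def state_at(self, tic):
        """Return the newest recorded state at or before `tic`, or None."""
        best = None
        for t, s in self.history:
            if t <= tic:
                best = s
            else:
                break
        return best
```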

If temporal order is violated only rarely and not in an important area, it can be ignored. If it happens too much, or it happens somewhere important, the game is broken.

I won't care if one fish in a school of herring in the background jumps ten inches. I will care if an enemy sapper manages to squeeze off 30 rounds a second from a bolt action carbine.

You just need a different model, in which all your objects roam free until interaction.

I need you to define exactly how that would be implemented. How do objects roam? How do they know if they need to interact? What tells them? A simulation doesn't just know something, it needs to get data from somewhere, and it needs to know it's valid.

And collisions and IO should be no problem (again: as long as you partition and index everything spatially). They have to happen in either case, and if you have good locality, you need to do far fewer calculations.

Can you elaborate on the model being used to maintain consistency? If a simulation environment changes, it must be applied consistently or things will not make sense.

Things won't know unless there is something they can look to for coordination or they are kept from going out of step.
 
The key assumption that needs to be broken is that actors necessarily have correct and perfect information about the state of the world. In the real world, insects, animals and human beings neither have an accurate and complete view of their surroundings, nor do they necessarily have a correct one (they can have inaccurate data).

One of the key reasons we are even able to function is that we are able to ignore vast quantities of information and focus only on what is salient to the task at hand. We even ignore information immediately in our visual field. In writing this, I have ignored the state of some food cooking on the stove. I do not know its current state. I have a rough idea where it is, and using mental models, I can estimate whether I think it might be done or not, but that's it. I have also ignored flashy icons on my desktop indicating IMs.

Most of the lock-step AI parallelization arguments come about from the flawed assumption that for AI to work, a correct and consistent database of the world must be available to the AI algorithms. Real-world pathfinding and tracking, for example, do not depend on correct knowledge of where everyone else is and the lay of the landscape. People can pathfind in completely new environments. Sometimes they fail and get lost. That's reality. They use mostly local information; insects, bacteria and white blood cells pathfind via chemical gradients.


AI and physics can be parallelized as long as you build your world database according to salience principles - spatial locality, incomplete and sometimes inaccurate information, and non-local transfer of information in a macroscopic, statistical sense. (For my character to visualize a non-local phenomenon, I don't need to know where every particle is in a tornado on the horizon, or an explosion, just macroscopic variables like local wind speed, temperature, pressure, etc.)
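
One minimal way to encode that kind of locality is a coarse uniform grid, so an actor only ever queries its own neighbourhood and accepts that the answer may be incomplete or slightly stale. The cell size and the 2D layout are arbitrary choices for the sketch:

```python
# A salience-friendly spatial index: actors ask for "what's near me",
# never for a globally consistent world database.
from collections import defaultdict

CELL = 8.0   # metres per grid cell (made up)

class SpatialGrid:
    def __init__(self):
        self.cells = defaultdict(list)

    def _key(self, x, y):
        return (int(x // CELL), int(y // CELL))

    def insert(self, obj, x, y):
        self.cells[self._key(x, y)].append(obj)

    def nearby(self, x, y):
        """Everything in the 3x3 block of cells around (x, y) -- local,
        possibly stale information, which is all the actor gets."""
        cx, cy = self._key(x, y)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                yield from self.cells.get((cx + dx, cy + dy), [])
```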
 
The key assumption that needs to be broken is that actors necessarily have correct and perfect information about the state of the world. In the real world, insects, animals and human beings neither have an accurate and complete view of their surroundings, nor do they necessarily have a correct one (they can have inaccurate data).
In a simulated system, all information is by default completely visible and completely correct in so far as the simulation is capable of representing it.
It's up to the simulation and the elements within it to define what it means for data to be perceived inaccurately. Having data that's just wrong doesn't lend itself to anything.

One of the key reasons we are even able to function is that we are able to ignore vast quantities of information and focus only on what is salient to the task at hand. We even ignore information immediately in our visual field. In writing this, I have ignored the state of some food cooking on the stove. I do not know its current state. I have a rough idea where it is, and using mental models, I can estimate whether I think it might be done or not, but that's it. I have also ignored flashy icons on my desktop indicating IMs.
You are not aware of the state of the food, but the state still exists, and by virtue of whatever this thing called reality, one second to you is approximately one second to the food that may or may not be cooking too long.
If you wait for an hour, the food will be burned, regardless of your awareness. Unless you're on crank, and then it just seems like you were waiting for an hour.

In a simulated system, if there is no implicit or explicit timekeeping, the entity that represents you could go through dozens of cycles while, for some reason, the food object has not. If nothing keeps track of that, you could wait an hour and come back to an ice-cold pot.

Most of the lock-step AI parallelization arguments come about from the flawed assumption that for AI to work, a correct and consistent database of the world must be available to the AI algorithms. Real-world pathfinding and tracking, for example, do not depend on correct knowledge of where everyone else is and the lay of the landscape. People can pathfind in completely new environments. Sometimes they fail and get lost. That's reality. They use mostly local information; insects, bacteria and white blood cells pathfind via chemical gradients.
My concerns are not just with AI, but with simulation integrity. A pathfinding algorithm can work fine with incomplete data. No algorithm works fine if the very bases of time and space can flit around due to factors not present in the simulated world.

We haven't invented an AI that understands its world when it's stable, much less one that can question the axioms of its own existence when it's not.

Without AI, there are still things that must make sense even if they make no decisions.
Collision detection doesn't work if one of two objects that should have hit each other is three update cycles behind.
 