Future console CPUs: will they go back to OoOE, and other questions.

Contention will happen in any system configuration. The problem is the level of contention.
If you have enough memory bandwidth and low latency, then a small SMP configuration will experience good performance.

Let's say your external memory is much slower than the CPUs inside the chip, and the caches are not that big; then there will be some contention.

But now think about a big enough, low-latency, high-bandwidth EDRAM inside the chip.

In fact, this EDRAM could be divided into individually addressable banks, with a fast 512-bit ring bus inside the chip (I think I saw that somewhere :) ) connecting 8 RISC processors, 8 x 16 MB EDRAM memory banks, and an external memory interface (128 bits at 800 MHz).

Contention will still happen, but much more smoothly.

And this is SMP, because you share a single real memory space.
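As a rough sketch of what that bank split could look like (a minimal illustration with assumed parameters: 8 banks of 16 MB and a 128-byte interleave granularity, not taken from any real design), consecutive lines rotate across banks so several cores streaming through memory don't all pile onto the same bank:

```python
# Hypothetical bank-interleaved decode for 8 x 16 MB of on-chip EDRAM.
# Consecutive 128-byte lines land on different banks, so cores touching
# adjacent addresses contend less for any single bank.

NUM_BANKS = 8
BANK_SIZE = 16 * 1024 * 1024   # 16 MB per bank (assumed)
LINE_SIZE = 128                # interleave granularity (assumed)

def decode(addr: int) -> tuple[int, int]:
    """Map a flat physical address to (bank index, offset within that bank)."""
    line = addr // LINE_SIZE
    bank = line % NUM_BANKS
    offset = (line // NUM_BANKS) * LINE_SIZE + addr % LINE_SIZE
    assert offset < BANK_SIZE, "address outside the 128 MB of EDRAM"
    return bank, offset

# Eight consecutive lines map to eight different banks:
print([decode(i * LINE_SIZE)[0] for i in range(8)])   # [0, 1, 2, 3, 4, 5, 6, 7]
```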
 
Contention will happen in any system configuration. The problem is the level of contention.

Of course it is. And of course I expect you to know that ;)

Point being, highway-traffic-level contention hitting you at full strength the moment you step out of your doorway (i.e. the L2 cache) makes simple tasks like getting the newspaper from your doorstep rather complicated ;)
 
Point being, highway-traffic-level contention hitting you at full strength the moment you step out of your doorway (i.e. the L2 cache) makes simple tasks like getting the newspaper from your doorstep rather complicated ;)
My point is that you can go higher than 4 cores with SMP before hitting any serious contention. Just don't be held back by old paradigms.

I agree that at today's supercomputing levels, with hundreds or thousands of nodes, SMP is good only within the individual nodes' architectures. But I don't see that (hundreds or thousands of cores) happening in the desktop or console environment in the next two generations.

My guess is that for consoles, heat and price will impose a limit on what can reasonably be done.
 
You still haven't defined what such synchronization is. That my Pentium CPU can't run a general-purpose simulation of itself that runs faster than itself is patently obvious; the existence of such a simulation would lead to infinite computational power (keep running the sim recursively). However, there is a difference between asserting this and asserting that X "takes no time" in the universe.
Synchronization: the act of establishing relative order or simultaneity. Whether five billion events occur at the same time, one after another, or in some mathematically complex arrangement, the flow of time does not slow down to indicate that additional processing is going on to figure out how they are placed relative to each other.

For a system running a simulation, establishing that order exists in the context of the simulation takes additional time. It does not take additional time for a given reality.

This trivializes the very real problem of information loss in physics, namely -- Do Black Holes Erase Information? (convert pure states into mixed states and violate unitary evolution) This is an intense debate in physics today. You simply do not know whether the universe has exactly the amount of storage it needs to store an ever increasing amount of entropy, and maybe black holes are the natural garbage collection mechanism. And as I said, on the other side is the holographic principle which implies it has too much storage for what it does.
Black holes are an imperfect and non-permanent garbage collector, if Hawking radiation is substantiated.
I don't even care if no causal connection can be made between the information that went into the black hole and what came out, just whether the quantity is larger. By measure of mass and energy, it will be exactly the same. I do not know whether the minimum representation will be larger or not.

I only care about the sum total of the symbols needed to represent the state of the universe.

I don't believe that's what I claimed. I claimed only that the time needed to compute step N, call it T(N), can grow smaller. This is true for the majority of inputs. In fact, for some inputs, the space used doesn't even need to keep growing.
Define exactly what you mean by that. T(N) for HashLife can grow smaller versus T(N) for a standard version, or T(N) for HashLife versus T(N/2) for HashLife?

The way I read one of your previous posts, it sounded like you were saying it got faster and faster the larger the problem size got. That would be wrong.
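As a toy sketch of the weaker claim that the measured cost per step can fall (this is not HashLife, just memoisation of whole states of a tiny 1D rule-110 automaton with wraparound; HashLife's hashed quadtrees are far more aggressive): once the orbit becomes periodic, every further step is a cache hit rather than a recomputation, so time per step drops even as N keeps growing.

```python
from functools import lru_cache

RULE = 110   # elementary CA rule, chosen arbitrarily for the toy example

@lru_cache(maxsize=None)
def step(state: tuple) -> tuple:
    """One synchronous update of a 1D cellular automaton on a ring."""
    n = len(state)
    return tuple(
        (RULE >> (state[i - 1] * 4 + state[i] * 2 + state[(i + 1) % n])) & 1
        for i in range(n)
    )

state = tuple(int(c) for c in "00010011")   # 8-cell ring: at most 256 distinct states
for _ in range(1000):
    state = step(state)

# By the pigeonhole principle the orbit repeats within 256 steps, after which
# every call is answered from the cache.
print(step.cache_info())
```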

However, we have been talking about game simulations, and I'm not sure I need my game simulation to exhibit universality or correctly deal with it without loss.
Universality not so much; that any loss is managed or controlled is more important.

I disagree with your definition of a random outcome, which sounds suspiciously like bogus definitions of free will in philosophical arguments. It can't even be converted into a formal mathematical logic.
No, it's an argument for the desired kind of randomness for a simulation.
Namely, that if something is based on random chance, nothing in the state of the system at time t will indicate how it will turn out at t+1.

Since the examples of randomness you showed are non-computable, I do not see how well they could be applied to a simulation that must transition from t to t+1 in finite time via some form of calculation.

That is, of course, assuming such completely random and totally independent phenomena exist. I didn't say it was proven that they do, only that if they do, we lack the means to reproduce them accurately.

You can't rule out, for example, that a given outcome is simply the result of an unknown cellular automaton and an unknown seed (which could, in fact, be the sum total of all information in the simulation). For any "random outcome" by your definition, I can simply propose a responsible underlying mechanism coupled to inaccessible state.
The problem is that if the unknown seed per outcome is not totally random, then some pattern will emerge. Seed values are used by pseudorandom generators, so there is a way to figure out how the input is mapped to the output.
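A minimal sketch of that point about pseudorandom generators (the constants are the classic Numerical Recipes LCG parameters; everything else is invented for illustration): the whole output stream is a pure function of the hidden seed, so an observer with access to that state sees nothing random at all.

```python
M, A, C = 2**32, 1664525, 1013904223   # classic Numerical Recipes LCG constants

def lcg(seed: int):
    """Linear congruential generator: the stream is a fixed function of the seed."""
    state = seed
    while True:
        state = (A * state + C) % M
        yield state

inside = lcg(42)    # what the simulated world sees: an opaque "random" stream
outside = lcg(42)   # an observer who knows the hidden state reproduces it exactly
print([next(inside) for _ in range(5)])
print([next(outside) for _ in range(5)])   # identical output: no randomness left
```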

If we're in the business of simulating the world, that might indicate another metaphysical restriction:
Nothing random in a given reality is random to an external reality.

Randomness has nothing to do with determinism. It has to do with whether the sequence of output can be compressed smaller than the sequence itself.
I admit I'm not up on algorithmic definitions of randomness.
But if a random number generator outputs a string of ten binary zeroes, it's compressible.
But how does that make the generator non-random? The probability of any given combination of outputs is the same, even if one of them can be run through a compressor to produce a smaller representative string.
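A crude way to see the compressibility measure in action (zlib as a stand-in for an ideal compressor, which makes this only a rough proxy for algorithmic randomness): a long constant run shrinks to almost nothing, while bytes from the OS entropy source barely shrink at all.

```python
import os
import zlib

constant = b"\x00" * 10_000     # highly structured output
entropy = os.urandom(10_000)    # OS entropy source, as incompressible as we can get

print(len(zlib.compress(constant)))   # a few dozen bytes
print(len(zlib.compress(entropy)))    # about 10,000 bytes, often slightly more
```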

If randomness is not just a backward-looking measurement of compressibility, and there exist ways to produce it, then why is it that no truly random number generators exist?
The applications for it exist: cryptology, simulation, etc.

When this occurs, we throw up our hands and say "I can't seem to find a mechanism to compute the series of outcomes other than just to list them".
In which case, it may mean that nothing in any of our simulations will ever meet our criterion for randomness, since we have the entire state and can see how the outcomes come about. An execution trace would show how every outcome was computed.

Try reading the chapter "Newway and the Cellticks" in Hans Moravec's _Mind Children_ for a gedanken experiment, or, more recently, the discovery of a possible mechanism in the cosmic microwave background for about 10k bits to have been encoded by a creator or otherwise supplied as input from the external system.
Couldn't the external system have just wiped any such signature away as a matter of basic function?
 
Contention will happen in any system configuration. The problem is the level of contention.
If you have enough memory bandwidth and low latency, then a small SMP configuration will experience good performance.

Let's say your external memory is much slower than the CPUs inside the chip, and the caches are not that big; then there will be some contention.

But now think about a big enough, low-latency, high-bandwidth EDRAM inside the chip.

In fact, this EDRAM could be divided into individually addressable banks, with a fast 512-bit ring bus inside the chip (I think I saw that somewhere :) ) connecting 8 RISC processors, 8 x 16 MB EDRAM memory banks, and an external memory interface (128 bits at 800 MHz).

Contention will still happen, but much more smoothly.

And this is SMP, because you share a single real memory space.
Are you talking about local storage, or huge level 1 caches? In the first case it isn't SMP, while in the second you might be better off using all that memory as your main memory, and treating all external memory as "slow storage" - like a hard disk, but faster. Because if you don't, you will still create a huge traffic jam when all those local memories are synchronized with each other through main memory.
 
Are you talking about local storage, or huge level 1 caches? In the first case it isn't SMP, while in the second you might be better off using all that memory as your main memory, and treating all external memory as "slow storage" - like a hard disk, but faster. Because if you don't, you will still create a huge traffic jam when all those local memories are synchronized with each other through main memory.
I wrote about this before in this same thread: the possibility of having a large EDRAM inside the chip acting as the main memory and using virtual-memory-style management.

This idea is the same, just adding what people do in large mainframes: dividing it (the main memory) into large, individually addressable/accessible banks (more bandwidth, less waiting, less contention), but with a continuous real address space gluing them together so that the group of banks forms the main memory.
Virtual memory management and TLB updates could be done by hardware (like in the old days), just to make sure it stays low latency (IIRC, nowadays when there is a TLB miss, the OS does the TLB update). Maybe implement WSClock in hardware. Eventually the programmer could request (from the OS) a number of pages, or maybe a whole bank, to act as local memory.
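For reference, a simplified sketch of the WSClock replacement policy mentioned above (dirty-page writeback and the full "nothing found on the first sweep" handling are left out, and the working-set window is an invented number): frames sit on a circular list, the sweep clears reference bits, and the first frame that is both unreferenced and older than the window gets evicted.

```python
from dataclasses import dataclass

TAU = 50   # working-set window, in the same virtual-time units as last_use (assumed)

@dataclass
class Frame:
    page: int
    referenced: bool
    last_use: int

def wsclock_evict(frames: list, hand: int, now: int) -> tuple[int, int]:
    """Return (index of the frame to evict, new hand position)."""
    n = len(frames)
    for _ in range(n):
        f = frames[hand]
        if f.referenced:
            f.referenced = False          # recently used: clear bit, give a second chance
            f.last_use = now
        elif now - f.last_use > TAU:
            return hand, (hand + 1) % n   # unreferenced and outside the working set: evict
        hand = (hand + 1) % n
    return hand, (hand + 1) % n           # fallback: evict where the sweep ended
```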

Also add a crossbar switch, or a ring bus, or whatever you have, to the memory bus.
For a small number of cores (< 8), my guess is this will work great.

Use cheap commodity external memory.
Imagine never paging to disk anymore :)
 
The stark, simple reason you don't get SMP in supercomputing is that at the levels of parallelism found in supercomputers, SMP would go into the corner and die sobbing.

The more cores, the more contention kills your SMP utopia, as simple as that. And OS scheduling and time-sharing have nothing to do with it - you're already dead the moment you get your gazillion threads up and running, each on its private SMP core, even if they were nicely synchronized at boot-up and there were no subsequent scheduling interference whatsoever.

I absolutely agree, but you are talking about SMP meaning "SMP hardware", with symmetrical cores sharing common RAM. There is also "SMP OS scheduling", meaning automatic OS task scheduling where tasks are assigned to cores in a symmetrical way - i.e. evenly distributed between all cores to balance the load. There is a Linux kernel extension (OpenMosix) which allows this to be done on any networked cluster of computing nodes with local memory. In effect this makes any supercomputing cluster, or any cluster of desktops, work like a big SMP machine which will run any Linux program without modification. The way it works is that when the kernel scheduler is handed a process to schedule, it migrates and re-migrates the process to another networked computer if necessary to balance the load. On Linux and Unix this is possible in a completely transparent way because of the Linux/Unix concept of all device drivers being accessed as files, because IPC is mainly done through pipes, and because the X Window System is a completely client-server architecture. This means processes can run completely independently of hardware locality - device access, IPC, and where they display can all be streamed over the network. The only restriction is that processes/threads that share a memory space need to be migrated to the same node. Linux is also king of network and clustered filesystems.

Because the network is relatively slow, this is of most benefit for compute-intensive processes - like the applications that run on supercomputers - and those that are relatively long-lived. The ideal type of application would be something like ray tracing. The fact is that the easy Windows/Linux-style pre-emptive multitasking scheduler, where you simply start a process and it gets allocated to a core, is available for supercomputers, but nobody in the supercomputer industry uses it, choosing instead the more difficult (programming-wise) manually assigned scheduling. I am pretty sure the reason is the difficulty of making randomly distributed parallel tasks finish at the same time, given unpredictable delays caused by other tasks and the resulting waits for other processes - the speed of the task being controlled by the speed of the slowest thread.

http://openmosix.sourceforge.net/
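As a toy sketch of that load-balancing idea (this is not OpenMosix code; the threshold and the process/load structures are invented for illustration): a busy node picks its most CPU-hungry process and migrates it to the least-loaded node, but only when the imbalance is large enough to pay for the migration.

```python
IMBALANCE = 1.5   # migrate only if our load exceeds the cluster minimum by this factor (assumed)

def pick_migration(loads: dict, procs_on: dict, me: str):
    """Return (pid, target node) to migrate, or None to stay put."""
    target = min(loads, key=loads.get)
    if target == me or loads[me] < IMBALANCE * loads[target]:
        return None
    # Prefer the most CPU-hungry local process: it best amortises the migration cost.
    proc = max(procs_on[me], key=lambda p: p["cpu"])
    return proc["pid"], target

loads = {"node0": 3.9, "node1": 0.4, "node2": 1.1}
procs = {"node0": [{"pid": 101, "cpu": 0.9}, {"pid": 102, "cpu": 0.2}]}
print(pick_migration(loads, procs, "node0"))   # -> (101, 'node1')
```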
 
Synchronization: the act of establishing relative order or simultaneity. Whether five billion events occur at the same time, one after another, or in some mathematically complex arrangement, the flow of time does not slow down to indicate that additional processing is going on to figure out how they are placed relative to each other.
For a system running a simulation, establishing that order exists in the context of the simulation takes additional time. It does not take additional time for a given reality.

I realize this is kinda OT, and I guess I'm a little bit late to the party, but what about time dilation? If you have matter moving relatively fast (relatively as in Relativity Theory), i.e. changing state quickly, local time slows down.
If you've got lots of matter clumped together (creating a large gravity well), meaning lots of stuff to synchronize with each other, local time also slows down.

Since time dilation works asymptotically, the universe actually has a very bad scaling algorithm, just with really small starting values!

*waits for DC or Chalnoth to smack me because I've understood nothing*
 
I realize this is kinda OT, and I guess I'm a little bit late to the party, but what about time dilation? If you have matter moving relatively fast (relatively as in Relativity Theory), i.e. changing state quickly, local time slows down.
Local time appears to slow down to an observer outside of the matter's frame of reference. The matter itself sees no change, and--depending on the exact circumstances of the experiment--if it could look back at the observer, it would think that the observer was slowing down.

Which one is slowing down, exactly?

edit: In addition, just moving fast doesn't really mean the local state of the system is changing all that much.

If you've got lots of matter clumped together (creating a large gravity well), meaning lots of stuff to synchronize with each other, local time also slows down.

There's a gravitational trend towards time appearing to move slowly to an outside observer, but it is based on the mass of the system, and it does not directly correspond to the mathematical complexity of the system.

An aggregate of 50 atoms is astoundingly complex to model, but if it were compared to the time frame of a single lonely atom, there would be little if any difference.

If both the aggregate and the atom were accelerated in the same way, their time dilation would be approximately the same, not counting gravitational effects that are not proportionate to the complexity of the system.

A singularity is astoundingly massive, but from our point of view on the outside it is incredibly simple. It's an ideal point of mass, for all we know, yet it may bring time to a stop relative to the outside universe.
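For reference, the standard formulas being appealed to here: both kinematic and gravitational (Schwarzschild) time dilation depend only on relative speed or on mass and distance, with no term for how mathematically complex the system is.

```latex
% Kinematic time dilation between frames in relative motion at speed v:
\Delta t' = \frac{\Delta t}{\sqrt{1 - v^2/c^2}}

% Gravitational time dilation outside a non-rotating mass M, at radius r:
\Delta \tau = \Delta t \, \sqrt{1 - \frac{2GM}{r c^2}}
```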
 
Local time appears to slow down to an observer outside of the matter's frame of reference. The matter itself sees no change, and--depending on the exact circumstances of the experiment--if it could look back at the observer, it would think that the observer was slowing down.

Which one is slowing down, exactly?

Of course, an observer moving at that speed would not notice himself slowing down, since all perception slows down as well.
I'm not sure how the slowing down of the outside observer can be tied into this model though :)

edit: In addition, just moving fast doesn't really mean the local state of the system is changing all that much.

Well, it changes fast relative to the rest of the universe that it moves fast relative to...

There's a gravitational trend towards time appearing to move slowly to an outside observer, but it is based on the mass of the system, and it does not directly correspond to the mathematical complexity of the system.

Isn't mass proportional to the number of protons in the atoms? So the larger the atoms, the more protons to calculate, and the higher the complexity.
If I've got a chunk of hydrogen atoms, then everything is dandy, because the atoms are relatively simple. Change that to something heavier, like lead, and the number of interactions that need to be synchronized at the proton/electron level is much higher.

An aggregate of 50 atoms is astoundingly complex to model, but if it were compared to the time frame of a single lonely atom, there would be little if any difference.

That's what I meant with
"Since time dilation works asymptotically, the universe actually has a very bad scaling algorithm, just with really small starting values!"
The effect is usually negligible, but as soon as it kicks in it increases rapidly.
 
Well, it changes fast relative to the rest of the universe that it moves fast relative to...

Unless there's some kind of acceleration or interaction with another particle, the state is still the same. Motion alone doesn't mean anything.

Isn't mass proportional to the number of protons in the atoms? So the larger the atoms, the more protons to calculate, and the higher the complexity.
The key factor is the strength of the gravity field in the frame of reference, and the same overall strength can be arrived at with solutions of different mathematical complexity.

On scales larger than individual particles, an observer in the center of ten dwarf stars clustered together will have a time flow similar to one in the center of a trinary star system of equal mass.

Getting nearer to any gravity well leads to a disproportionate effect based on gravitational strength, but the effect of every other atom everywhere else is still the same mathematically. They may be weaker, but they entail the same required calculation.

If I've got a chunk of hydrogen atoms, then everything is dandy, because the atoms are relatively simple. Change that to something heavier, like lead, and the number of interactions that need to be synchronized at the proton/electron level is much higher.

Two widely spaced lead atoms can slow time as much as a chunk of more tightly-packed hydrogen atoms. Probably not much, but still.

If a theoretically massive but short-lived Higgs boson were there, it would slow time in accordance with its mass, even though it is a single particle.

Millions of neutrinos going by the same spot would have a similar effect: despite their interacting only weakly with each other gravitationally, on the scale of millions of such checks, time will slow in proportion to the strength of gravity in the area.
A single electron would do the same.

I could be wrong, since gravity's effect at such small scales is frustratingly small, but nothing seems to indicate that gravity doesn't track time flow much more closely than other possible measures do.

If a black hole were at a given point, it would stop or nearly stop time, and as far as we know it is essentially a single point, or very close to one.

"Since time dilation works asymptotically, the universe actually has a very bad scaling algorithm, just with really small starting values!"
The effect is usually negligible, but as soon as it kicks in it increases rapidly.

The effect doesn't need to track with complexity, however.
 
...
If a theoretically massive but short-lived Higgs boson were there, it would slow time in accordance with its mass, even though it is a single particle.
...
This is a perfect example of how a thread about OoOE in consoles' CPUs can flow into particle physics in a coherent way :LOL:
 
Sun's Rock CPU details finally revealed

I talked about hardware scouting vs OoOE in this thread earlier and how it is likely Sun's next-gen will have scouting. I then talked about the concept "Out of Order Commit" and posted some links to papers.

Well, it appears Sun's Rock will feature Out Of Order Commit (Sun calls it Out of Order Retirement) in addition to Hardware Scouting, and 8-16 cores (multithreaded), with a bumped up clock rate too.

A 2-3 GHz, 8-16 core CPU with 4 threads per core and much-improved serial thread performance will be very nice to see.
 
Well, it appears Sun's Rock will feature Out Of Order Commit (Sun calls it Out of Order Retirement) in addition to Hardware Scouting, and 8-16 cores (multithreaded), with a bumped up clock rate too.

I'd love to see some details about how they do this. From the article above, it sounds like the store buffers found in normal OoO CPUs.

Cheers
 
I provide a link to the IEEE paper here on how it's done.

Yes, they will have a store buffer, but they don't have a ROB. Instead, they use a checkpoint table and "undo" transactions that turn out to be wrong, like in transactional memory or database designs. They argue (not Sun, the original IEEE academic paper) that checkpoint tables scale better than ROBs, and of course, hardware scouts can look much further ahead in prefetching than an OoOE design can. OoOC designs aim to have thousands of instructions in flight.

I'm sure there will be naysayers and skeptics, as there seem to be every time there is any novel innovation in microarchitecture. At least for TLP-based workloads which are heavily I/O bound, these processors, given equivalent process and transistor budgets, should blow x86 OoOE price/perf and perf/watt out of the water. Companies like Azul have already done so in the Java world.
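A very rough sketch of the checkpoint-and-rollback idea those papers describe (nothing here reflects Rock's actual microarchitecture; the register names and single-checkpoint structure are invented for illustration): results update architectural state as they complete, and a misprediction is undone with one bulk restore of the checkpoint instead of squashing entries out of a reorder buffer.

```python
import copy

class CheckpointCore:
    """Toy model of out-of-order commit via checkpoints, not a real pipeline."""

    def __init__(self):
        self.regs = {"r1": 0, "r2": 0, "r3": 0}
        self.checkpoints = []              # a small checkpoint table, not a ROB

    def take_checkpoint(self):
        self.checkpoints.append(copy.deepcopy(self.regs))

    def execute(self, reg, value):
        self.regs[reg] = value             # result commits immediately, in any order

    def resolve(self, speculation_correct: bool):
        ckpt = self.checkpoints.pop()
        if not speculation_correct:
            self.regs = ckpt               # single bulk rollback to the checkpoint

core = CheckpointCore()
core.take_checkpoint()                     # speculate past a long-latency event
core.execute("r2", 7)                      # speculative results commit out of order
core.execute("r1", 3)
core.resolve(speculation_correct=False)
print(core.regs)                           # {'r1': 0, 'r2': 0, 'r3': 0} - state restored
```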
 