Future console CPUs: will they go back to OoOE, and other questions.

aaaaa0,
You'll get no argument from me against SMP. I view SMP like CPU cache, virtual memory, OoOE, branch prediction, transactional memory, etc.: programming models that vastly simplify coding and, in many cases, offer good performance on a range of problems without as many pathological worst-case gotchas as distributed supercomputing.

It comes down to the purpose of an architecture and the reality that a successful platform requires development of content, and the lower the barriers and costs, the more content you get.

One can't depend on game programmers, or any developer, having to make PhD-level breakthroughs in algorithms in order to extract sustained high performance from an architecture. Over time, as with GPU Gems and SIGGRAPH, one would hope that enough "breakthrough" algorithms get published for NUMA-style architectures that developers can pick them up "off the shelf" and reuse them, like they do with many rendering techniques today. One doesn't expect every developer to invent Spherical Harmonics themselves, for example.

The only reason I often seem to be arguing on the side of CELL is that I see far too many arguments from the opposition that start out: "Amdahl's law, SMP synchronization. Algorithm X is therefore hopeless unless serial CPU execution speed increases." That may be true in some cases, but I think in a lot of cases the problem is the assumption that Algorithm X (which may or may not be parallelizable over a distributed bus) is the only way to compute result Y, rather than looking at what Y is trying to achieve and tapping into the immense literature already out there for solving Y.

Essentially, don't go looking for papers on "how to accelerate my n^2 shared-everything algorithm on SMP" and then conclude "NUMA sucks for computing these kinds of results." Go looking for papers on "how to efficiently compute these results on a NUMA cluster."
 
A local pipe is a FIFO structure in local memory with two pointers, one for access to each end of the pipe.

Which operating system? Some local pipes may be shared memory, others use copies. It is true that you can implement a FIFO by having the writer keep a pointer to the end of a shared memory buffer and the reader keep a pointer to the last read location (eof = readPtr >= writePtr), and it is also true that this system has no contention until EOF or buffer full, in which case the reader or writer must wait.
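
To make that concrete, here is a minimal sketch (mine, not anything posted in the thread) of such a single-producer/single-consumer FIFO in shared memory: each side owns one index, and the only "waiting" happens when the buffer is empty or full. The class name and the choice to return failure rather than block are my assumptions.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

// Single-producer/single-consumer ring buffer: the writer owns writeIdx_,
// the reader owns readIdx_, and neither needs a lock.
template <typename T, std::size_t N>
class SpscFifo {
    T buf_[N];
    std::atomic<std::size_t> readIdx_{0};   // advanced only by the consumer
    std::atomic<std::size_t> writeIdx_{0};  // advanced only by the producer
public:
    bool push(const T& v) {                 // returns false when full
        std::size_t w = writeIdx_.load(std::memory_order_relaxed);
        std::size_t next = (w + 1) % N;
        if (next == readIdx_.load(std::memory_order_acquire))
            return false;                   // buffer full: writer must wait
        buf_[w] = v;
        writeIdx_.store(next, std::memory_order_release);
        return true;
    }
    std::optional<T> pop() {                // returns empty at "eof"
        std::size_t r = readIdx_.load(std::memory_order_relaxed);
        if (r == writeIdx_.load(std::memory_order_acquire))
            return std::nullopt;            // readPtr caught up with writePtr: reader must wait
        T v = buf_[r];
        readIdx_.store((r + 1) % N, std::memory_order_release);
        return v;
    }
};
```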

It is also true that some algorithms paradoxically run faster on pipe implementations than some shared memory code, due to the FIFO discipline imposed on the producers and consumers. On the other hand, it is also true that some shared memory implementations whip the ass of FIFOs due to the ability to randomly access and selectively lock portions of the buffer -- better cache locality and no need to re-request data via some protocol. Instead, the request protocol is substituted with a lock protocol (hopefully, one of low contention).

To get around that problem, some FIFO bound algorithms will end up copying and caching data, and exchanging notifications to distribute update information to retain coherency of local copies. On SMP, this is a waste, because you're not using the CPU caches to their fullest potential.

Using a message passing infrastructure to coordinate processing of global shared data structures is tricky business and highly dependent on the nature of the algorithm being run. What works for Algorithm A on NUMA won't work for Algorithm B. That's unlike SMP, where a few primitives like concurrent data structures or mutexes/locks/condition variables/etc. pretty much work for everything at reasonable performance, though they become the source of many bugs later.
 
That's precisely the answer I was looking for from SPM. I was also looking for the realization that local IPC with shared memory is way faster because it avoids memory copies, which is why it's such a common optimization for pipes.

I was also looking for SPM to realize that an RPC or pipe over the network is thousands and thousands of times slower than a local shared memory RPC call or pipe implementation.

The network communication is a thousand times slower, not the pipe implementation. A local pipe is a very fast serial connection. However, supercomputers use network connections to communicate between control and compute nodes because SMP architecture performance begins to drop off after about 4 cores due to bus contention, with the result that supercomputers can be thousands of times faster than is possible with shared-memory SMP machines.

The key fact you are missing is that the network and pipes are not used in the same way as shared memory - they are used to transfer commands, responses and results from control nodes to compute nodes. The very act of NOT using an SMP architecture with shared memory between all cores, and using network connections instead, is exactly what allows supercomputers to achieve their massive increase in speed. The processes running on the individual nodes (which is where the memory access speed is required) can access memory without any bus contention.

I was also looking for SPM to realize that the underlying implementation of a pipe requires all those "complex wait states" he was complaining about.

What does a "complex wait state" have to do with a pipe? The delays due to wait states occur because if automatic scheduling by the OS is used in an SMP system running a pre-emptive multitasking OS (like Linux or Windows) running different processes, as opposed to explicit scheduling of tasks to specific cores or nodes, then it is inevitable that some of a set of parallel task threads get assigned to cores where other processes take up more time, and hence the parallel threads finish at different times, and you have to wait for the last thread to finish before you can collect the results and carry on to the next stage. The same problem will happen with processes using shared memory - you still have to wait for the process to complete before you can get its results. As I said, you can access data located at different locations, but not at different times. For example, suppose you are doing a matrix inversion, you have a hundred processes running on a hundred cores, and one other process is running on one of those cores; all of the cores finish in 5s, but the one with the other process running finishes in 10s. Then the time you need to invert the matrix has doubled from 5s to 10s. Is this so hard to understand?
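
A toy illustration (mine, not SPM's code) of that point: with a fork/join split, the wall-clock time of the stage is the maximum of the per-thread times, so one core that is sharing time with another process delays the whole stage. The sleep durations stand in for the 5s/10s slices in the example above.

```cpp
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

// Each "slice" of the matrix takes 5s, except the one whose core is shared
// with another process, which takes 10s.
void worker(int id, int delayed_id) {
    using namespace std::chrono_literals;
    std::this_thread::sleep_for(id == delayed_id ? 10s : 5s);
}

int main() {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int i = 0; i < 100; ++i)
        threads.emplace_back(worker, i, /*delayed_id=*/42);
    for (auto& t : threads)
        t.join();                            // the barrier: wait for the slowest thread
    auto secs = std::chrono::duration_cast<std::chrono::seconds>(
                    std::chrono::steady_clock::now() - start).count();
    std::cout << "stage took " << secs << "s\n";   // ~10s, not ~5s
}
```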

SPM claimed and claimed again in this thread that "SMP machines are poor at single process, multiple threads with lots of shared state."

In this, he is just wrong. "Single process (ie shared address space) multiple threads" is SMP's bread and butter.

You have misread what I said. What I said was that SMP machines are poor when it comes to running single processes on a PC, since applications are not usually threaded in a way that will allow the workload to be evenly distributed between cores, as is the case with PC games and most PC desktop applications. In other words, if you run a single process on a dual core machine you will get maybe 1.2x, not 2x, single core performance; a faster single core processor would be better for desktop applications and current games. Where SMP machines perform well is on servers, on which you have many similar independent processes that run asynchronously. These get distributed evenly between cores and so you get 2x the performance on a dual core machine. Again, is this so difficult to understand?

I said that SMP machines using pre-emptive OS scheduling are poor at running tightly bound parallel tasks. Again, this is correct. It has nothing to do with shared memory or pipes. It has to do with other processes which you have no control over (because the OS schedules them) delaying the processing of some of the parallel threads relative to others.

However it's clear that SPM has a fundamental misunderstanding of what SMP actually is and what its strengths are. He constantly asserts that SMP is tied to OS-scheduled pre-emptive multitasking, which it is not -- SMP describes the arrangement of processors and the memory model, and has nothing to do with the scheduling policy of the OS running on it.

I think you have the misunderstanding. The term SMP can refer to multiple symmetrical cores. It can also refer to automatic scheduling of processes by the OS, where the processes are symmetrically distributed between processors, as opposed to having certain processes allocated to certain processors which may be associated with certain hardware locality. I have made this clear in my posts by calling it "automatic" or "pre-emptive" "OS scheduling".

The OS running on an SMP machine is free to schedule one thread per CPU and never preempt it, or any other arrangement it desires.

Sure, who said anything different? However, when comparing SMP server and PC architectures with supercomputer architectures, you should realise that the applications running on the former, even games, hardly ever use anything other than automatic OS scheduling, while the latter hardly ever use anything but explicit scheduling, even though, as I made clear, both options exist for both architectures. There is a very good reason for that, as you would understand if you read my post more carefully.
 
We can proceed to memoize results just as was done in HashLife, as long as we can design a suitable hash algorithm (as I showed in another message, such an algorithm has already been created for N-body gravimetric problems). Physics within the minimal partition size will be run as the base case.

Demo - that was a fascinating link and thank you for sharing it. I'd never heard of HashLife before. I'm having trouble seeing how this type of algorithm can be practical unless the number of states assignable to a cell is very small. Anyhow, if you have a chance, could you link to how this technique was applied to N-body problems? I'm extremely curious to see it and couldn't turn up anything that sounded remotely like it through Google. Do they assign some kind of density to a cell and then ignore gravitational influence of cells sufficiently far away? Are the particles still explicitly represented?
 
with the result that supercomputers can be thousands of times faster than is possible with shared-memory SMP machines.

Only for problems that work well with distributed shared-nothing architectures.

What does a "complex wait state" have to do with a pipe?

"Complex wait state" is your words not mine. You need to synchronize access to a pipe. It doesn't just magically work by itself.

explicit scheduling of tasks to specific cores or nodes, then it is inevitable that some of a set of parallel task threads get assigned to cores where other processes take up more time, and hence the parallel threads finish at different times, and you have to wait for the last thread to finish before you can collect the results and carry on to the next stage.

You cannot rely on the tasks on two different processors being completely in lockstep. If you have a shared data structure between two processors, you must synchronize access, even if no OS were running at all and you were just programming on the bare metal. This has nothing to do with SMP; it occurs any time you have any shared data structure between ANY processors in ANY particular machine.

If you farm out workitems, then you need to know when they finish. This means you need synchronization.
If you queue the results coming back, then access to the queue needs to be synchronized.
If two processors both need to update some state, then that state needs to be synchronized.

Having your OS pre-emptively schedule things can make it worse, but it is NOT safe to assume n processors will be in lockstep and will finish tasks at the same time no matter what.
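
A minimal sketch (the class and names are mine, just to illustrate the points above) of what that synchronization looks like for the second case: results coming back from farmed-out work items are collected through a queue guarded by a mutex and a condition variable.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Results from farmed-out work items; every access goes through the mutex,
// and the collector sleeps on the condition variable until a result arrives.
template <typename Result>
class ResultQueue {
    std::queue<Result> results_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void put(Result r) {                       // called by the worker processors
        {
            std::lock_guard<std::mutex> lk(m_);
            results_.push(std::move(r));
        }
        cv_.notify_one();
    }
    Result take() {                            // called by the collector
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !results_.empty(); });  // "knowing when they finish"
        Result r = std::move(results_.front());
        results_.pop();
        return r;
    }
};
```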

You have misread what I said. What I said was that SMP machines are poor when it comes to running single processes on a PC, since applications are not usually threaded in a way that will allow the workload to be evenly distributed between cores, as is the case with PC games and most PC desktop applications.

That is not what you wrote. This is what you wrote:

SPM said:
SMP machines are poor at single process, multiple threads with lots of shared state.

SMP is better than any other alternative for applications with this sort of design.
 
Demo - that was a fascinating link and thank you for sharing it. I'd never heard of HashLife before. I'm having trouble seeing how this type of algorithm can be practical unless the number of states assignable to a cell is very small.

It's not really the per-cell state that matters that much but the volume of the phase space swept out by the "base case". HashLife's "base case" is a 4x4 grid which has 65,536 different states.
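
A rough sketch (my own, only to make the base case above concrete): a 4x4 block of binary cells packs into 16 bits, giving the 65,536 states mentioned, and the 2x2 centre of that block one generation later can be computed once per state and thereafter looked up.

```cpp
#include <array>
#include <cstdint>

// Bit i of 'block' is the cell at (i % 4, i / 4) in the 4x4 block.
inline int cell(uint16_t block, int x, int y) { return (block >> (y * 4 + x)) & 1; }

// Conway's rule applied to the four centre cells of a 4x4 block; returns 4 bits.
uint8_t evolveCentre(uint16_t block) {
    uint8_t out = 0;
    for (int y = 1; y <= 2; ++y)
        for (int x = 1; x <= 2; ++x) {
            int n = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if (dx || dy) n += cell(block, x + dx, y + dy);
            int alive = cell(block, x, y);
            int next = (n == 3) || (alive && n == 2);
            out |= next << ((y - 1) * 2 + (x - 1));
        }
    return out;
}

// The memo table: all 65,536 base cases solved once up front.
std::array<uint8_t, 65536> buildBaseCaseTable() {
    std::array<uint8_t, 65536> table{};
    for (uint32_t b = 0; b < 65536; ++b)
        table[b] = evolveCentre(static_cast<uint16_t>(b));
    return table;
}
```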

Anyhow, if you have a chance, could you link to how this technique was applied to N-body problems? I'm extremely curious to see it and couldn't turn up anything that sounded remotely like it through Google. Do they assign some kind of density to a cell and then ignore gravitational influence of cells sufficiently far away? Are the particles still explicitly represented?

Look at the paper I linked to on the parallel hashed oct-tree. It's not "HashLife", but it demonstrates how macroscopic state representations in aggregate, plus clever hash table and tree traversal design, can drastically reduce the problem of network latency and bandwidth in a highly parallel cluster algorithm. It applies this technique to N-body in that paper.
 
Is this so hard to understand?
Is it so hard to understand that there are more things you can do, even on a PC, than blindly spawning threads and waiting for them all to finish?

For starters, separate processes are often used instead of threads on *nix. They don't share memory as such.

But the main thing is that you can use a process as a virtual, specialized core that runs workloads for you as long as you need it to, and runs those workloads to completion. That is the streaming model. And that is what you use in *nix for inter-process/thread communication.

So, things only keep on going as long as you feed the pipes more data. And you do load balancing by adding new workloads to processes that are (nearly) done processing.

To reduce congestion and locks, you want to partition your data, so you can have all the local objects be processed by the same process. And you can do load balancing by increasing and decreasing the radius of that locality.

And when you use large timesteps and large locales for areas that are far away and invisible to the player, you can skip things like collision detection altogether. So there is no reason to actually process the graphical models; you only need to run coarse AI rules.
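
A rough sketch (the grid scheme and names are mine, not from the post above) of that partition-by-locality idea: bucket objects by a coarse grid cell so that one worker process owns all the "local" objects in a locale, and rebalance load by changing the cell size, i.e. the radius of the locality.

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Object { float x, y, z; int id; };

using CellKey = std::uint64_t;

// Quantize a position to a grid cell and pack the three coordinates
// into 21 bits each (wrapping negative coordinates for this sketch).
CellKey cellOf(const Object& o, float cellSize) {
    auto q = [&](float v) {
        return static_cast<std::uint64_t>(
            static_cast<std::int64_t>(std::floor(v / cellSize)) & 0x1FFFFF);
    };
    return (q(o.x) << 42) | (q(o.y) << 21) | q(o.z);
}

// Assign each locale (grid cell) to one of `numWorkers` worker processes,
// so all objects in the same locale are handled by the same process.
std::unordered_map<int, std::vector<Object>>
partition(const std::vector<Object>& objects, float cellSize, int numWorkers) {
    std::unordered_map<int, std::vector<Object>> perWorker;
    for (const auto& o : objects) {
        int worker = static_cast<int>(cellOf(o, cellSize) % numWorkers);
        perWorker[worker].push_back(o);
    }
    return perWorker;
}
```

Growing `cellSize` makes each locale (and each worker's share) larger; shrinking it does the opposite, which is the load-balancing knob described above.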
 
Which operating system? Some local pipes may be shared memory, others use copies. It is true that you can implement a FIFO by having the writer keep a pointer to the end of a shared memory buffer and the reader keep a pointer to the last read location (eof = readPtr >= writePtr), and it is also true that this system has no contention until EOF or buffer full, in which case the reader or writer must wait.

On Linux. Standard I/O pipes (which can communicate with child processes and threads spawned by the process) are in the memory space of the parent process. Named pipes are FIFOs implemented in the memory space of the process that created them, and are visible in the filesystem.
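
A bare-bones example (mine; error handling omitted) of the first kind of pipe mentioned above: an anonymous pipe created with pipe() and shared with a child spawned by fork(). Data written at one end is read at the other.

```cpp
#include <unistd.h>
#include <sys/wait.h>
#include <cstdio>
#include <cstring>

int main() {
    int fds[2];
    pipe(fds);                       // fds[0] = read end, fds[1] = write end
    if (fork() == 0) {               // child: the "compute node"
        close(fds[1]);
        char buf[64] = {0};
        read(fds[0], buf, sizeof(buf) - 1);
        std::printf("child got command: %s\n", buf);
        close(fds[0]);
        return 0;
    }
    close(fds[0]);                   // parent: the "control node"
    const char* cmd = "invert matrix block 7";
    write(fds[1], cmd, std::strlen(cmd));
    close(fds[1]);                   // closing the write end gives the reader EOF
    wait(nullptr);
    return 0;
}
```

A named pipe would instead be created with mkfifo() and opened by its filesystem path, but the read/write usage is the same.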

It is also true that some algorithms paradoxically run faster on pipe implementations than some shared memory code, due to the FIFO discipline imposed on the producers and consumers. On the other hand, it is also true that some shared memory implementations whip the ass of FIFOs due to the ability to randomly access and selectively lock portions of the buffer -- better cache locality and no need to re-request data via some protocol. Instead, the request protocol is substituted with a lock protocol (hopefully, one of low contention).

To get around that problem, some FIFO bound algorithms will end up copying and caching data, and exchanging notifications to distribute update information to retain coherency of local copies. On SMP, this is a waste, because you're not using the CPU caches to their fullest potential.

Using a message passing infrastructure to coordinate processing of global shared data structures is tricky business and highly dependent on the nature of the algorithm being run. What works for Algorithm A on NUMA won't work for Algorithm B. That's unlike SMP, where a few primitives like concurrent data structures or mutexes/locks/condition variables/etc. pretty much work for everything at reasonable performance, though they become the source of many bugs later.

With pipes you always have to allow for the overhead of copying, since you access them programmatically as a serial device. However, this overhead is no more than, for example, function calls passing arguments via the stack, and is no more a reason not to use them than avoiding functions because they have to copy arguments/results to/from the stack. If you want random access to local memory, any serial access method is less efficient than direct access, so what you would do is put the code that needs random access in the process that has access to the local memory space, and use the pipe to pass commands, references/indices etc. and get results/replies back from that code.

Local shared memory isn't a panacea. If you have 8 cores sharing it, then you have one eighth of the memory bandwidth per core. This will dramatically reduce performance if the programs running do a lot of local memory access - as, for example, trying to invert a matrix would do. Using separate memory spaces for the cores and using pipes for communicating commands/results over a network will seriously speed up parallel applications like this that intensively use local memory.

What I am trying to say is that a shared-memory SMP machine with 80 cores will be seriously slowed by bus contention on shared memory even on run-of-the-mill applications that can make heavy use of the cache, and a NUMA architecture isn't practical on a single chip due to pinout limitations, so future architectures will use a Cell-like architecture - for performance reasons. The SMP shared memory architecture and SMP automatic scheduling aren't a panacea, as some seem to think. They work very well in some cases and very badly in others.
 
What I am trying to say is that a shared-memory SMP machine with 80 cores will be seriously slowed by bus contention on shared memory even on run-of-the-mill applications that can make heavy use of the cache, and a NUMA architecture isn't practical on a single chip due to pinout limitations, so future architectures will use a Cell-like architecture - for performance reasons. The SMP shared memory architecture and SMP automatic scheduling aren't a panacea, as some seem to think. They work very well in some cases and very badly in others.
I agree SMP isn't a panacea, but for the 2010 and 2015 console generations custom SMP will probably work well. Definitely, a small SMP of custom cores with strong SIMD can work well.

With a single chip it is a compromise between:
- middle complexity of programming (between a single core and a sea of independent cores/memories)
- middle to upper level of performance (between a single core and a sea of cores)

By the 2010 generation: Nintendo may go SMP, MS will probably keep SMP (and maybe port Windows to it), and Sony and Cell's destiny is unknown for now.

By the 2015 generation: Nintendo will probably have SMP, MS will probably keep SMP if Windows runs on it, Sony ???

My guess is that the problem with Cell is that it is harder to program now. Programmers like Carmack have already stated that it is difficult to program.

And the 235mm2 90nm Cell chip is not very attractive. With that die space range and process you could have four Xenos cores and 2MB of L2 cache with a total of 155 Gigaflops and 12.8 Giga dot products (FP not much different from Cell, and with much better scalar performance). And if you look at it from an architecture point of view, external memory latency is more of a problem (than the core distribution) for high volumes of randomly distributed data at this point in time.

And the VAX/VMS had lots of shared memory. It was one of the best and finest implementations of a computing system, with small 512-byte pages and 1 GigaByte of shared virtual memory (where the OS was kept installed/running). One could improve upon it and have a 1 Giga virtual space pre-shared in the memory architecture. And scheduling can be customized.

Shared virtual address space, memory pointers, multiple heaps management, low latency memory, this can make people very happy :smile2:
 
My guess is that the problem with Cell is that it is harder to program now. Programmers like Carmack have already stated that it is difficult to program.
Perhaps they should write it on Linux and port it to the other systems, then. ;)
 
Perhaps they should write it on Linux and port it to the other systems, then. ;)

What would that change? The fundamental programming model of CELL is completely different.

Id has been one of very few companies to consistently release their games on Linux.

Cheers
 
What would that change? The fundamental programming model of CELL is completely different.

Id has been one of very few companies to consistently release their games on Linux.

Cheers
Duh.

But the model isn't so different as you think.
 
I can give some definitions for those.
Many thanks for writing your response to my *amateurish* questions in such great detail. It is nothing short of brilliant. Sorry that I didn't reply to your post many days ago, but lately my life has been quite busy. Sometimes I would like to perform a time warp just to be well timed, thus keeping up with the times.:smile:

I don't know how much you already know, so I'm going to set out some basic concepts, as far as I know them.
Frankly speaking, I would define myself as an interested gamer who tries to find out about almost everything in my universe of interests, which is varied and somewhat eclectic (life is complex by definition, isn't it?), but since this is a forum about consoles and modern technology...
Forgive me if it's too basic
Nothing doing! You don't need to ask for forgiveness; how could I forget (and not forgive) anyone who once, or even sometimes, tried to help me? Your definitions make for an interesting, well-explained and informative read.

I'm in the computer field, but I'm not an engineer, at least not yet.
I'm still working on that.
Well, good luck.:yep2:
The fun part about computational machines is that any machine can be made to simulate another machine. The only question is how much time you have to waste, and whether you want to get sued.
It's unreachable for current processors, and probably unreachable for many,many years. Being special-purpose allows Xenos to focus its resources on only a few tasks. General-purpose chips have to worry about a lot of other workloads and problems Xenos never needs to worry about.
I find this a bit contradictory... I mean, from your words it looks like it makes sense in the long run, but in the short run, although OoO chips seem to be wonderful processors capable of making sense of the huge amount of data input most modern chips receive nowadays, they are not (and will not be) able to match Xenos and Cell FP performance, thus saying BYE BYE forever to backwards compatibility for either the X720? or PS4 with their analogous current consoles.

It is not just the instruction set that has been maximized to obtain stratospheric FP numbers; much of the simplicity that OoO brings to modern chips is gone! It's my personal evaluation from the information I've gathered reading and observing through the forum. I hope my theory and its consequent interpretation doesn't fall to pieces.

Cheers 3dilettante.
 
By 2010 the 32nm process will be available http://www.eet.com/news/semi/showAr...d=VVYS20XLZI3CIQSNDLSCKHA?articleID=193100380
This means designs can be shrunk to roughly an eighth of their area (a 90nm-to-32nm linear shrink is about 2.8x, so roughly 8x in area).

People could have 24 Xenos cores at 9GHz, but that would generate too much heat and bandwidth would be a problem. Let's prioritize vertical performance increase (more single-thread performance) over horizontal increase (more threads).

This better balances the software, reducing the risk of diminishing returns from any serial code, as expressed by Amdahl's Law.

Then let's have only 8 (eight) OoOE RISC cores with SIMD in an SMP configuration.
This could mean (arithmetic reconstructed below):
- 8 x 9GHz OoOE RISC with strong SIMD cores in SMP (groups of 4 cores ?)
- some L2 caches (not much, maybe 4MBytes)
- 144 Giga issues/sec
- 72 Giga dot products/sec
- 576 GigaFlops/sec (or double if you add a second vector unit to each core)
- under 30 watts
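
Reading that list back as arithmetic; the per-core issue width and the "8 flops per dot product" (a 4-wide multiply-add) are my assumptions, not stated in the post:

```latex
\begin{align*}
  8~\text{cores} \times 9~\text{GHz} \times 2~\text{issues/cycle} &= 144~\text{Giga issues/s}\\
  8~\text{cores} \times 9~\text{GHz} \times 1~\text{dot product/cycle} &= 72~\text{Giga dot products/s}\\
  72~\text{G/s} \times 8~\text{flops per 4-wide multiply--add} &= 576~\text{GFLOPS}
\end{align*}
```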

Die size:
- 8 cores 35mm2 (4.4mm2 each core)
- 4MB L2 cache 20mm2
- glue logic 25mm2
- total size 80mm2

Now how to sustain half a teraflop of performance? :oops:
And please without much heat.

My idea is lots of embedded DRAM (eDRAM).

If you add some 64 MBytes, that means another 60mm2 or so, but you could have half a terabyte per second of memory bandwidth.

The eDRAM could be explicitly managed (with the programmer in full control), automatically managed like a virtual memory system, or a mix of both.

Maybe with large pages (4 KBytes) and the WSClock algorithm this eDRAM could work.
Believe me, that's so close to my dream console.

I guess they have different roles and ideas, but I think it would be nice to see 3dilettante, ERP, Fafalada, Fran, nAo, DeanoC, Gubbi, Shifty Geezer, Acert93 and so many others forum members working together in the design process of a new console, if not by their own hands, at least with MS actually hearing them, allowing potentially good minds to express their ideas.

I would add to the list someone whose nick is Megatonator, but this person doesn't post here.

I smell a design full of AAs (AA filter, AAA games and AF) and users being as happy as a lark.

Well, just joking.

Cheers

________________________________________

"Unlucky in games, lucky in love"

Gamertag: Cyaneyes or Arctic
 
If we are just going to repeat tired old arguments, I'll just repeat my truism: most of the compute-intensive work in games boils down to simulation ... and reality is massively parallel.

I missed this truism before, so I just wanted to make a quick note of this.

True, reality may be massively parallel (or not, it could just be a miraculously fast single thread), but as far as we know, in reality:

a) Synchronization is free, if something is not synchronized, it's because it's not meant to be
b) Storage is exactly as big as it will ever need to be
c) Storage is accessible exactly as fast as it needs to be (not just speed of light, things may possibly go faster)
d) If phenomena A and B take 1 second to complete, it doesn't matter if B is 10 times as mathematically complex.
e) Reality has at its base behaviors that may be truly random, no amount of massive parallelism will make that work with an algorithm. (Minor aside, not really important for most things)
f) There is no doubt if two things will interact. If they interact, it's because they interact. If they don't, they (by definition) do not. That's not a viable development method.
g) Whoever's at the controls has not been tempted to hit Ctrl-Alt-Delete
h) The hardware apparently doesn't fail after 20 billion years.

I think you may need to read more on how HashLife actually works. HashLife can and does skip over steps, and it does have an increasing performance slope. Indeed, the first 100 steps may take 1 minute to calculate, the next 1000 steps may take 10 seconds, and the next 1,000,000 steps may be calculated in only 1 second. If you were to display a HashLife simulation in progress, you'd see some regions jumping far forward, and others even appearing to move backwards sometimes. However, in any given region, all of the values are provably correct for the local time of that cell.
I think that depends on how you are displaying the simulation. If you cram all the parallel activity together without taking program flow into account, sure, it would look weird. There's still a meaningful flow. Since the algorithm is not in-place, a read from memory locations by following the return address stack can show a very definite order.

As the program runs, it learns, and as more and more configurations of the grid are encountered, its speed only increases. Simulating arbitrarily large grids into the far future, paradoxically, is easier than simulating in the near term. And yes, the problem does skip over time steps, literally, in a log_2(t) fashion. It can do so because the locality of prediction guarantees that many intermediate steps are not needed in order to calculate the smaller predictable grid at time t*2. The only major downside is the memory consumed in running the algorithm.
It does not skip a calculation, the simulation just doesn't repeat itself. At some point in the program's execution, that given time step with the given preconditions was done at least once.

I would also note, for practical reasons, that the "only major downside" is a pretty major downside, especially since it relies heavily on the hardware and run-time environment to enforce its implicit ordering of calculations. If the algorithm were running in-place, things would have been a lot more complicated. Because it is not in-place, place in memory and in the structure supplies what naive time-keeping otherwise would.

Think this only applies to Life? Let us impose a speed of light on our game world of 'C', and let us impose a "Planck time" of t (e.g. 100Hz, 1/100th of a second). Let us divide the game world into partitions, with the smallest being C*t in size (if C = 100 m/s and t = 1/100th s, then C*t = 1 meter). In any partition NxN in size (N = some multiple of Ct), the state of the partition N/2 x N/2 in size at time t*2 is completely determined by the state at t.
I agree this could be useful in certain situations. The rest of my issues lie in running it in real-time.

It becomes very critical that you tell me what a good C would be.
If it's a game where a human player's time frame is inserted into the game world, C can't be whatever we choose, it must be close enough so that it isn't ridiculous to play, but not so fast that it becomes ridiculous to simulate.

The relationship between C, the partition size, and the minimum time tick determine the size of the workload.

You want a fast enough C so that the player doesn't see everything moving as if it's underwater, but if C is large, the time-tick must be smaller if you don't want the partition size to grow too large, since that prevents culling in a crowded space and it forces more unnecessary synchronization.

If the time-tick is too small, you wind up looping through things that almost never change.
Worse, time-ticks have at least a minimal serializing effect. That means synchronization, or praying nothing goes wrong too often, because cleanup code takes real-world time.

But the larger the time-tick, the more you must decrease C or shrink partition size.

If the partition size is smaller, you must decrease C or speed up the time tick to limit propagation.
You want fewer partitions, since the more there are, the more state you have to maintain. But you don't want partitions to have too much going on inside of them, since that ups the computation cost. You can't eliminate excess work if the partitioning is too large to allow for adequate subdivision.

There's another problem.
You don't want subdivisions so small that objects span too many of them, but in a game, not everything is a point-sized object like in HashLife.
A free-flowing (or nearly so) 3D world also has a lot more neighbors than a 2D grid that artificially removes a vast set of complications from the simulation.

It becomes a volume problem, worse, a non-integral volume problem if this is a game world and you do not continue to use the fixed points in space idea behind the Life simulation.

A meter-sized cube is a very coarse subdivision of a game world. Remember, as the "Planck length" of the simulation, any and all changes must be in multiples of a meter.
This also means that things must either move at least 1m per time-slice or not move at all.
So unless the minimal time-slice is more than a second, there is no way to move at less than a meter per second, nor at anything but whole multiples of a meter per second.

That paradigm works great when time and space units are infinitesimally small.

We can proceed to memoize results just as was done in HashLife, as long as we can design a suitable hash algorithm (as I showed in another message, such an algorithm has already been created for N-body gravimetric problems). Physics within the minimal partition size will be run as the base case.
The ceiling for perfect predictability in a game with user input is the number of time steps between system reads of user input. Any more and the viability of the speculative results becomes increasingly suspect the more they speculate past a given IO poll.

If the system can be guaranteed not to be perturbed in a fashion that would not follow from the previous time step, then HashLife's assumption would be correct.

What to do? Hash the combination of possible user inputs (input, action, place in the spatial grid, relationship to any number of other time stamps) as well? Invalidation also needs to propagate, so the size and effectiveness of the memoization cache are inversely related to the number of perturbations to the system per a given number of time steps.

If it is 1 perturbation per time step, and it is maintained for longer than the stride of all the layers, then the number of valid memoized results will be little more than what you would get from an initial run and stopping, then starting, then stopping, then starting.

What is the maximal size of the state representation of a body in an N-body gravimetric problem?
Aside from position, the rest is calculated from the positions of everything else. What happens if you aren't worried about point objects that have no health, and don't have a fixed function that determines their next move in the space of perhaps a half a kilobyte?

Once this is finished, we can simulate game physics in arbitrarily large worlds and query for the state of any game partition at any large time in the future without necessarily having to recompute the entire world, or the intervening steps. Note, this does not depend on the base-case problem being analytically solvable! (e.g. the 3-body problem is insoluble analytically) Life, and many cellular automata, do not present analytic solutions.
Are you trying to enter real life now? If N=1 and it takes 1 minute, and N=10 it takes .1 minutes, do you really think N=10^64 will take less time than it takes for an electron to move between the atoms in the processor?

Memoization is a win as long as the cost of the memory access is dwarfed by computational complexity. It can't make the cost of memory access zero.

What about practical constraints? Memoizing an arbitrarily large universe means an arbitrarily large amount of memory. HashLife takes more memory to simulate something; it doesn't scale as insanely as a non-canonicalized quad tree (canonicalization being somewhat serializing, by the way), but at a bare minimum it takes the minimum number of bits needed to represent the unique state of the system. If a larger universe always takes the same number of bits or less than a smaller one, then something is being lost.

If we can agree that it takes some number of bits X to represent the simulation state, and that nothing is being lost, then X will slowly but inexorably increase. It's trivially true that at some point every bit will be traversed by the simulation as it calculates the next time-step. Unless the time per bit magically is always faster, HashLife does not have an eternal speedup.

I'd expect it to have a kind of nifty-looking time function, though.
Depending on the input, it could go faster or slower based on some factors:

The number of unique canonicalized nodes: the given simulation is simple, and it is probably not sustainable to have all canonical nodes present per layer, so this one doesn't strike me as being a dominant factor very often.
At level zero, there are two canonical nodes. Unless of course a simulation is entirely empty. (After a generation, it's still entirely empty, but that's just a silly setup.)
What's cool about the canonicalized tree is that it's an incredibly compressed layer, since the number of leaf nodes in a quad tree goes up as a power of four of the number of levels.

At level one, there are at most 16 canonical nodes, and as long as the size of the simulated world is such that the number of non-canonicalized nodes is greater than 16 (plus a fudge factor for the 256 extra bits needed in a 64-bit pointer system to point down to the children), it will be more compact.

At level two, there are at most 16^4 canonical nodes, but that's ~65 thousand for what could be billions of nodes otherwise. I'm very interested in seeing the math concerning the number of canonical nodes vs problem size.

And so on.
Each successive level added to the top of the tree will have this growth for the maximal number of canonical nodes, though for any problem arbitrarily larger, it is a fixed number against any number of additional nodes.
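
A compressed-quadtree sketch (the structure and names are mine, not from any particular HashLife source) of the canonicalization being discussed: a node is keyed by its four children, and a hash table guarantees each distinct subtree is stored exactly once, so identical regions of the grid share one representation.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>

struct Node {
    const Node* nw; const Node* ne; const Node* sw; const Node* se;
    int level;                       // level 0 = single cell
};

struct ChildKey {
    const Node* nw; const Node* ne; const Node* sw; const Node* se;
    bool operator==(const ChildKey& o) const {
        return nw == o.nw && ne == o.ne && sw == o.sw && se == o.se;
    }
};
struct ChildKeyHash {
    std::size_t operator()(const ChildKey& k) const {
        auto h = [](const Node* p) { return reinterpret_cast<std::uintptr_t>(p); };
        return h(k.nw) * 3 + h(k.ne) * 7 + h(k.sw) * 13 + h(k.se) * 17;
    }
};

class NodeFactory {
    std::unordered_map<ChildKey, std::unique_ptr<Node>, ChildKeyHash> canonical_;
public:
    // Returns the one canonical node with these children, creating it if new.
    const Node* intern(const Node* nw, const Node* ne, const Node* sw, const Node* se) {
        ChildKey key{nw, ne, sw, se};
        auto it = canonical_.find(key);
        if (it != canonical_.end()) return it->second.get();
        auto node = std::make_unique<Node>(Node{nw, ne, sw, se, nw ? nw->level + 1 : 1});
        const Node* raw = node.get();
        canonical_.emplace(key, std::move(node));
        return raw;
    }
};
```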

It's a very nifty algorithm, but it won't magically make all the extra work disappear.
Given problem size N, it cannot be said that every problem size from N+1 to infinity will run faster.

What it does guarantee is that, thanks to the limited number of canonical nodes, the amount of arithmetic avoided is unbelievably greater than the number of lookups.

The number of lookups places a floor that continuously grows, but at a sublinear rate. It never goes permanently down. There are inputs that can cause the numbers to oscillate, but not stay down.

The pointer size:
The pointer size determines the size of the hash space (or a quarter of it); once the number of canonical nodes exceeds 4x the pointer size, something is guaranteed to collide. Thankfully for HashLife, the simplicity of the problem means there's plenty of room in the space being hashed to for things to work for a long, long time.

If the number of bits needed to represent a given cell increases, the hash function runs into problems.
If the rules became that a cell could have 4 states (dead, ailing, normal, robust), that's a max of 4 canonical nodes at level 0, 256 at level 1, 256^4 at level 2, and so on.

For a two-bit simulation, it's likely something else would get in the way of performance long before the hash became a problem.
I wonder about a simulation in which an object's bit representation exceeds the pointer size, which means that there would likely be a collision once the number of cells got large enough.

It is far more likely, however, that other problems with holding all of this data will get in the way, though this can be minimized with a memory setup where a given location can store the entirety of a node's contents.

The special cases are the following:

1. objects which don't exist completely within one partition or another, but which span two partitions.

If the objects aren't points, or nearly so, then it is very likely to happen. HashLife's objects don't span anything; they and their unit of space are the same.

If that assumption is not changed, the partitions are just too big, and everything is a cube.
If that assumption is changed, then objects could span more than just two partitions. In a full-fledged 3d environment, it's all partitions in a sphere around the object.

2. partitions which include non-linear or non-local inputs: random events, AI controlled characters, human controlled characters

HashLife already handled #2 a long time ago by introducing hash entry invalidation and dirty bit concepts.
I'm curious to see how it affected performance scaling. Especially if invalidations become too burdensome. Beyond a certain number of future steps, it might be pointless.

#1 may or may not be handled by subdividing the object.
That would involve some kind of synchronization, which would be a variable load depending on how often this occurs.

In any case, I'm not advocating that HashLife's algorithm itself be used to solve the large-scale simulation problem in games. For one, it burns memory like crazy. What I am trying to show is that monotonically increasing step-by-step time simulations are not necessarily needed to get results, and that lazy evaluation and "call by need" may offer alternatives.
HashLife is step-by-step. It's just so simple that you can reuse steps.

That doesn't work so nicely once the relationships and outcome spaces explode.

HashLife also does assume an implicit form of synchronization. It's present in the function calls and the use of the "new" operator. It's (very intelligently) sacrificing memory space to save on time spent calculating results, and it's using the work the OS would be doing anyway as its synchronization.

I am also trying to stimulate interest in the idea that regions of the game simulation can be partitioned into local predictable cones of influence (which may not include AI actors or players) which need not be simulated at every step.
That's a good idea, but it's not just a cone, not unless the given object is the only one that can move or interact.

It's more of a warped sphere, where the cone juts out in front and the rest of the sphere (smaller in back) is there to catch if other objects' cones may meet up with the object.
The coarser the time step, the larger the sphere portion becomes.

In my "pot roast" example, going and looking at the pot roast 30 minutes later and seeing it replaced with a cooked facsimile is the equivalent of a semantic hash table lookup (hashtable.get(roast, 30 minutes) -> return pre-simulated cooked roast).
My objection was that without some form of coordination, you do not know the value "30 minutes".
How can you without some kind of timekeeper or synchronizing scheme? Where is the time coming from?

HashLife works great because it rests on the assumption that the system handles all of that already. Unless you think it's a good idea that the "new" operator be allowed to write to an address in the heap without checking to see if there's a valid object there, there is some form of synchronization.

I'm arguing that while lock-step is a form of synchronization, it is not the only form.
As long as a system's resources are not infinite and operations take non-zero time, some of it will be necessary.

Depending on level design, one can mark whether an object is "local" or has non-local effects (aroma of meat cooking, sounds of boiling) and thus the simulator can decide whether it needs to micro-simulate that partition, or whether it can replace it with a cached pre-computed representation later.
It still needs to know which cached representation to use. If that is not known, it does not matter how many precomputed states there are. There is no algorithm that can make garbage into a valid result.
If an algorithm can take a "garbage" input and somehow make a valid computation, it's either lucky coincidence, or by tautology, the input wasn't garbage at all.

The reason HashLife has philosophical implications for our reality is the implication that what you experience as an arrow of time and history of states may be completely illusory, and that reality itself does not necessarily "compute" all states in between. Yes, we can run measurements to measure phenomena at arbitrarily small resolutions, but if the universe works like HashLife, then our perception of having taken 1,000 samples at 1000Hz may simply be an illusion, and our memory of "now" merely contains the perception of having carried out all of those steps, when in fact the universe jumped forward 1 second and our memory reflects the correct and consistent history that never in fact occurred.
Or if you think there's something to quantum foam theory, it calculates everything and more.

I've already gone into the fact that a simulation, if properly made, does not have any perception of outside time. A player's actions are mapped so that they match what an object would do in the simulation. The problem is that without synchronization of some sort, outside time does filter in.

Some things take longer to simulate than others. Maybe because of instruction count or mathematical complexity, it takes 2 ns in our time for A and 3 ns for B.
If A is not made to wait by something and it is allowed to loop, it will go faster than B. Unless the simulation is defined in such a way that it doesn't matter, something must keep track so that if B needs to interact with A, it interacts with the right version of A.

People have long questioned the continuum vs the discrete conception of time. However, in both conceptions, a monotonically increasing linear order was always assumed.
There is a lot of evidence this is not entirely the case at the quantum level.
Though as far as games are concerned, this is a non-issue; quantum mechanics does not make intuitive sense, and it's not really fun either.

When my character tunnels through a wall of a skyscraper and I fall 1000 feet, that's a bug, not quantum mechanics sneaking up on me.

Step T_n depends on Step T_(n-1) according to transition rules or the laws of physics. The philosophical debate engendered by HashLife was whether we even need this assumption; maybe the universe can "jump" or "tunnel" between arbitrarily large configurations, as long as all possible observers in the later configuration agree that the later state was an evolutionary consequence of the former.
It's not really necessary that observers agree even then. Einstein already illustrated the folly of determining simultaneity between three observers signalling across a distance with relativistic speeds.
Time may be entirely local, but thanks to the great speed of light and the matter-based laws of our means of perception and thinking, it's not very often we see odd effects.

Then there's entanglement experiments, which have been shown to possibly violate even the speed of light.

Time, of course, would be merely an illusion, an after-the-fact mental explanation created by back-predicting the states that could have led to the current one.
Even if it is, it is a consistently maintained one, even without our need to justify it mentally.

I perceive myself to have had a life history, because information in the state of my brain leads to me believe that the current state (now) could have been the consequence of prior ones (the past). Cause and effect extrapolation. The assumption of continuum, that a continuous stream of time existed between all events in my memory, are just that, an assumption, or an extrapolation.

And I'm sure if we are in a vast simulation, that there are higher beings tearing their n-dimensional hair out trying to simulate it so it's not taking forever for Them or Him or Her or Whatever.

In any case, I bring it up because HashLife is not just a curiosity. When Bill Gosper invented it, people were already playing with the idea of artificial life and intelligence within these grids, and philosophers began to ask "what if?" our universe amounted to nothing more than a big cellular automaton.
As far as I understand cellular automata, there are signs the universe routinely violates propagation restrictions, so it may be a bit more complex than that. It's also not proven that the universe is deterministic, or that what is random in the universe is truly random.

If there is true randomness (it might not be, it could just seem that way), then no automaton can be the answer. There is no algorithmic way to arrive at randomness.

If so, what if it evaluates using HashLife techniques? Then many of our most fundamental assumptions are wrong. Sure, sophistry right? Still, it makes for interesting thought experiments.
It doesn't make a single one of our physical assumptions wrong at all. It might make for interesting theological debate, but as long as internal consistency is maintained, it doesn't matter at all how weird things are outside of our simulation.

The idea that our reality is The Reality is almost a non sequitur to physical laws, which only worry about consequences within their context: a context which is safely insulated from a possible exterior world.
 
True, reality may be massively parallel (or not, it could just be a miraculously fast single thread), but as far as we know, in reality:
a) Synchronization is free, if something is not synchronized, it's because it's not meant to be
b) Storage is exactly as big as it will ever need to be
c) Storage is accessible exactly as fast as it needs to be (not just speed of light, things may possibly go faster)
d) If phenomena A and B take 1 second to complete, it doesn't matter if B is 10 times as mathematically complex.
e) Reality has at its base behaviors that may be truly random, no amount of massive parallelism will make that work with an algorithm. (Minor aside, not really important for most things)
f) There is no doubt if two things will interact. If they interact, it's because they interact. If they don't, they (by definition) do not. That's not a viable development method.
g) Whoever's at the controls has not been tempted to hit Ctrl-Alt-Delete
h) The hardware apparently doesn't fail after 20 billion years.

A list of truisms, at best tautological and non-falsifiable, at worst, unsupportable claims, depending on definitions of words used, such as:

a) "synchronization" and "free" in what sense? Is the Pauli Exclusion Principle "synchronization"? Is neutron degeneracy pressure? Is electromagnetism? How do you assign cost? If I assign cost according to statistical mechanics, then "synchronization" isn't necessarily free, since the system's modes of freedom are altered nevertheless, the cost being an increase in entropy. Is the "synchronization" of two binary stars in mutual orbit via gravity "free"? The resulting system radiates its energy away. Without defining synchronization, the statement lacks meaning.

b) again, what is "storage"? What does "need" mean? You mean information about states? How do you know storage is exactly as big as it needs to be? This seems to imply intent or design. How would this statement ever be testable? On the one hand, we have the holographic universe proponents showing that the universe actually has far more storage than it really needs, and on the other hand, we have some philosophers who think that quantization proves it doesn't have enough. If the universe had less storage than it "needs", and the result of running out of states was some natural phenomenon, you'd just claim that the phenomenon is meant to exist and therefore it purposely has "less" than it "needs" to prevent the phenomenon, and therefore has exactly what it "needs". Nice tautology.

It has been shown that if the universe is closed (collapses to a point in finite proper time), its storage and information processing capacity diverge to infinity (infinite storage and infinite CPU cycles), while paradoxically, if it is "open", its storage capacity is finitely limited. Thus, the amount of storage the universe is capable of is not necessarily determined by physical laws, but merely by whether it has enough mass-energy to be closed.

Your list basically says "A is A and not B, therefore A"

It does not skip a calculation, the simulation just doesn't repeat itself. At some point in the program's execution, that given time step with the given preconditions was done at least once.

This is semantics. If you want to do algorithmic analysis, you have to be consistent about how you count "work". If I can predict 2*t steps ahead using a combination of previously solved subproblems, I have skipped. With both spatial and time compression, it's not even necessarily the case that all the steps from (T, 2*T) at coordinates (x,y) on the grid need to have been computed, if they have been done elsewhere at another time and place. There is an entirely practical difference between the simple case of memoization as used in classic divide and conquer algorithms (typically non-Matroid problems that require dynamic programming, like edit distance or optimal triangulation) and HashLife; at best, in the classic algorithms, the speedup converts an exponential algorithm to a polynomial one, where the amortized speedup per step is constant. If *every cell* at a given time step had to be processed in a prior time step, I might agree, but a trivial counterexample is extremely large empty regions of space, or regions with blinkers or breadloafs, in which you can advance the simulation very far ahead in a particular light-cone slice without ever touching the vast majority of those cells or doing any intermediate steps in the cone.


Anyway, let's quote the DDJ version:
Dr Dobbs said:
At this point, the Life algorithm is not quite HashLife yet. On some large universes with regular patterns, it is faster than conventional algorithms, even highly optimized ones. For instance, on the classic breeder (see Figure 4), the time per generation is constant, even though the population rises with the square of the number of generations. All the infrastructure we have constructed so far lets you make one final huge step towards creating the final HashLife algorithm—compression of time.

Earlier I mentioned I would run some nontrivial patterns for trillions of generations. Even just counting to a trillion takes a fair amount of time for a modern CPU; yet HashLife can run the breeder to one trillion generations, and print its resulting population of 1,302,083,334,180,208,337,404 in less than a second. This requires only a relatively small change to the program I have described so far.

For HashLife to fail to produce a speedup with an unbounded automaton would require not only that the initial conditions provide infinite growth, but that the structure produced is ever growing in complexity, which we know necessitates that the initial conditions encode universality -- e.g. a universal Turing machine. A non-universal CA would quickly produce spatial and temporal compression opportunities.

As far as I understand cellular automata, there are signs the universe routinely violates propagation restrictions, so it may be a bit more complex than that. It's also not proven that the universe is deterministic, or that what is random in the universe is truly random.

CA are not constrained by locality. Try reading Stephen Wolfram's "A New Kind of Science" for examples of physics models using CA.

If there is true randomness (it might not be, it could just seem that way), then no automaton can be the answer. There is no algorithmic way to arrive at randomness.

There is indeed an algorithmic way to arrive at randomness. Randomness is rigidly defined by algorithmic complexity theory as information which is incompressible. If the shortest possible representation for a given bitstring is the bitstring itself, then the bitstring is said to be random. If a program is the shortest possible representation for a given bitstring, then the program itself is random. The reason the digits of Pi, for example, aren't random is that there exists a much more compressed representation of Pi, namely a program which generates its digits. It is *NOT* because the program is deterministic and produces the same output for a given input every time.
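
This is the standard Kolmogorov/Chaitin formalization of the paragraph above (my rendering of it, with U a fixed universal machine and c a machine-dependent constant):

```latex
\[
  K_U(x) = \min\{\, |p| \;:\; U(p) = x \,\}, \qquad
  x \text{ is algorithmically random iff } K_U(x) \ge |x| - c.
\]
```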

In an unbounded, or effectively unbounded, system (so that cycles do not exist), a universal cellular automaton encoding a random program (random in the sense of algorithmic complexity theory) would produce truly random phenomena. It would be impossible to produce analytic equations or any model which could predict the sequence of the phenomena precisely, since that would, by definition, violate the claim that the automaton itself was running the shortest possible program, unless the scientists had discovered the automaton itself, which is extremely unlikely.

To give a realistic example, let's take Chaitin's number "Omega". One can define it precisely, yet it is non-computable (if it were computable, one could instantly solve the Halting Problem as well as prove ANY theorem in mathematics magically and in constant time). However, the first 64 bits of it have been calculated, and those first 64 bits are random in every sense of the word. Chaitin also proved that randomness exists in basic mathematics: for example, whether a given Diophantine equation has a finite or infinite number of solutions can be true for no particular reason (the answer cannot be derived from application of axioms and logic within the given math system), and therefore it just is one or the other, with no possibility in principle of coming up with a procedure to calculate the answer for an arbitrary Diophantine equation input. (It is, in fact, equivalent to the Halting Problem.)


It doesn't make a single one of our physical assumptions wrong at all.

Yes and no. It may not violate a predictive model we have made of observed phenomena, but it may certainly violate the interpretations we make of such models. Whether or not it does violate physical assumptions depends on whether or not mutually compatible interpretations can be experimentally discriminated. If, for example, there were an experiment that could distinguish between the Many Worlds, Copenhagen, or Transactional interpretations of QM (renowned physicist David Deutsch seems to think so), it would imply that the (more correct) interpretation was producing predictions that either don't agree with the standard theory, or are not covered by it. In either case, it means either an explicit assumption is wrong, or an implicit one.


The idea that our reality is The Reality is almost a non sequitur to physical laws, which only worry about consequences within their context: a context which is safely insulated from a possible exterior world.

Physical laws "worry"? :) I posted a long diatribe on this in the General Forum explaining my position of agnosticism via the epistomology problem, that humans at best can only construct models of the universe, that we may build confidence in them as they continue to agree with observed data over massive numbers of experiments and withstand counterexample, but we may neither never prove them correct, nor can we know if the interpretation of them is "correct", indeed, the very idea is meaningless. (in answer to the question of whether or not we can model a creator or intelligent designer's "intent" or "will" for us)

Doing physical science is like trying to clone the output of a forever inaccessible Turing machine. We observe the outputs and construct a parallel program which tries to produce the same outputs. We may succeed, but we can never a) prove that our program will always agree, and b) prove that our program *IS* the same as the underlying machine. c) For any given output, there may be an arbitrary number of programs all isomorphic in output, and therefore you face a kind of polytheist problem. d) There is no guarantee the universe has the "most compact/shortest" one, and e) it doesn't make sense to try to describe transcendent structures (underlying reality) in terms of your own mental concepts embedded within the reality. The MAP is not the TERRITORY.

That said, interpretations are not useless. They create frontiers of thought that lead to new directions of thinking based on implicit properties or assumptions in the interpretation. In mathematics, one can often find 10 objects, all isomorphic, yet it is easier to solve some problems or think in some of the systems vs the others.

I will continue to assert that the problem of game simulations is much more tractable and parallelizable than many people claim, and that reasonable and lifelike games can be designed not just via spatial subdivision and local physics, but also by replacement of simulation parts with facsimiles and pre-computed results (some online, some offline), and by relaxation of the constraints of universal flow of time and physics everywhere.

Hell, even The Matrix Reloaded got this right. The machines didn't simulate birds or clouds at the particle level. As the Oracle says to Neo before his big fight with the multiple Smiths, they wrote separate custom programs for many of the things in the Matrix (birds, bullets, etc.). Humans simply don't need quantum physics simulation unless they are scientists running experiments.
 
Believe me, that's so close to my dream console.
Glad you liked it :)
But there are two things wrong:
- The chip could be faster, almost 870 Gigaflops with a single vector unit :oops:
- 1T-SRAM is denser than I thought, and 128MB could be done in about 55mm2 at 32nm (based on Flipper)

Remember that there was a time when L2 cache was external to the chip (Pentium Pro and before).
Maybe it is time we condensed more memory hierarchy levels inside the chip.

I guess they have different roles and ideas, but I think it would be nice to see 3dilettante, ERP, Fafalada, Fran, nAo, DeanoC, Gubbi, Shifty Geezer, Acert93 and so many other forum members working together in the design process of a new console, if not by their own hands, then at least with MS actually hearing them and allowing potentially good minds to express their ideas.
I would like to hear it too.
 
a) "synchronization" and "free" in what sense?
Synchronization within the bounds of our reality is infinitely scalable and takes no additional time: it doesn't take any time. If two events have some kind of synchronous relationship, their sequence of events will flow the same relative to the rest of the universe. If two billion things happen at the same time, the evaluation of the events is the same relative to an outside observer.

Atoms don't take a second to pause and figure out what they'll be doing to coordinate with every other atom within range. They just are.

In a real-world system, we see that synchronization does not take the same amount of time independent of the number of objects to be synchronized, and the act of checking for synchronization takes a non-zero amount of our time.

For the simulated world in the system, it doesn't matter. For those doing the simulation, it does.

Perhaps it's a metaphysical law: that which is instantaneous within the context of one reality cannot be strictly instantaneous with respect to an exterior reality that is simulating it.

Perhaps there are some similar laws (assuming an exterior world is similar to the one being simulated):

That which is arbitrarily parallel within a given simulated context cannot be equally parallel in the external context.

That which is irreducibly serial cannot be run in fewer time units in the external context than it can be run in the internal context.

The minimum clock tick of a simulation cannot be shorter than the equivalent clock tick of what is running the simulation.

Is the Pauli Exclusion Principle "synchronization"? Is neutron degeneracy pressure? Is electromagnetism? How do you assign cost?
Cost is the time required to execute the change: the additional time required to register a change as the size of the problem grows. There is no additional time needed (so far as we know) for an electron captured by an ion to check whether it has a neighbor in its energy level compared to when it does not. For a computer simulating the action, it takes at least one clock cycle.

N+1-body simulations take longer to simulate a given time step than an N-body system, but trinary star systems (as far as we know) have the same rate of time flow as binary systems. A star cluster of 100 stars has the same rate of time flow as a binary system, as far as we know.

That is most definitely not true of a computer's external context that is actually simulating it.
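A toy sketch of that asymmetry (my own example, a naive O(N^2) gravity step): the simulated time step dt is identical for every run, but the wall-clock cost of computing it grows with the number of bodies:

import random, time

def step(bodies, dt=1.0):
    # bodies: list of [position, velocity], each a 2-element list [x, y]
    for i, (pos, vel) in enumerate(bodies):
        ax = ay = 0.0
        for j, (other, _) in enumerate(bodies):
            if i == j:
                continue
            dx, dy = other[0] - pos[0], other[1] - pos[1]
            inv_d3 = (dx * dx + dy * dy + 1e-9) ** -1.5
            ax += dx * inv_d3
            ay += dy * inv_d3
        vel[0] += ax * dt
        vel[1] += ay * dt
    for pos, vel in bodies:
        pos[0] += vel[0] * dt
        pos[1] += vel[1] * dt

for n in (2, 10, 100, 1000):
    bodies = [[[random.random(), random.random()], [0.0, 0.0]] for _ in range(n)]
    t0 = time.perf_counter()
    step(bodies)   # one simulated tick, same dt regardless of n
    print(n, "bodies:", round(time.perf_counter() - t0, 4), "seconds of our time")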

b) Again, what is "storage"? What does "need" mean? You mean information about states? How do you know storage is exactly as big as it needs to be?
We're getting into base assumptions here.
If we assume that representing the state of a given system requires some kind of symbolic combination of data that differentiates that state from all other unique states, and we assume that such a representation cannot take up 0 space, that makes up our need for storage.

For the universe, nothing can move out of reality. By definition, everything that ever happens in the universe exists, so it is represented. Any combination of internal states is stored. No matter how complex the state becomes and how large the representation is, there is no pop-up for a lack of hard drive space.

This seems to imply intent or design.
It's more of a failure of language; I'm running out of words that work well for such a topic, and English tends to presume some kind of agency in its verbs.

How would this statement ever be testable?
Conservation of energy and matter. If either disappeared based on how complex a system's representation became, it would indicate that something was being lost from the representation of the universe.
Either that, or we could try to move off the edge of reality, but that would make no sense at a basic existential level. A trip that has a nonexistent end doesn't make much sense to an atom.

Nice tautology.
Yes, that's the point. Within the context of the universe, the tautology holds true. But to anything outside that world, it would be untrue. The same would go for any simulation we try to create within our universe. It will have the same tautology internally if we so define it, but we would always find that it would not translate to our reality.

Your list basically says "A is A and not B, therefore A"
Running a simulation is the act of saying X is A, and therefore, by comparison with Y that represents B, X is not Y and A is not B, therefore X, therefore A.
There's an action inserted into what is for us a given.
A simulation's software and hardware can never say "it just is", reality does that all the time.

This is semantics. If you want to do algorithmic analysis, you have to be consistent about how you count "work".
Fine, a practical one: the number of clock cycles needed to complete a stride for every level in the tree on a fixed hardware platform (memory not being a concern, just no adding of execution resources or memory ports whenever it seems convenient).

If I can predict 2*t steps ahead using a combination of previously solved subproblems, I have skipped.
A prediction takes at least one clock cycle, and it assumes that all previously solved subproblems encompass the next time step. There are pathological inputs and IO that would alter this, and it is trivially true for any unbounded growth situation that the number of predictions will have to increase, since the stride length in software is fixed relative to the population. If it is not, the act of changing the number of generations predicted is itself a calculation that takes non-zero time.

With both spatial and time compression, it's not even necessarily that all the steps from (T,2*T) at coordinates (x,y) on the grid need to have been computed if they have been done elsewhere at another time and place.
Per procedure call, at least one calculation will be made, if only to see if a result has been memoized. Unless you expect the number of lookups to sit at exactly one given number (even if the tree adds 80 more levels), the simulation will eventually take more time to complete as N increases.
Since compression is not perfect, the size of the state being run through will always increase if the problem size increases.
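To put a number on "at least one calculation per call", here is a tiny sketch (the update rule is a hypothetical stand-in, purely for illustration): even once the memo table is fully warm, every generation still costs a call and a lookup:

calls = 0
memo = {}

def advance(state):
    global calls
    calls += 1                      # the unavoidable per-generation work
    if state in memo:               # at least one lookup happens every time
        return memo[state]
    result = state[1:] + state[:1]  # stand-in for the real, expensive update rule
    memo[state] = result
    return result

s = "abcd"
for _ in range(10):
    s = advance(s)
print(calls)   # 10: the memo removes recomputation, not the per-step overhead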

For HashLife to fail to produce a speedup with an unbounded automaton would require not only that the initial conditions provide infinite growth, but that the structure produced is ever growing in complexity, which we know neccessitates that the initial conditions must encode universality -- e.g. a universal turing machine. A non-universal CA would quickly produce spatial and temporal compression opportunities.
The speedup it is talking about is relative to a naive version of HashLife, not between HashLife and itself running a larger problem.
It's not a declaration that a problem that is N*2 in size will take less time to calculate than one of size N. If that were true, then a problem of size N=10^100 would take less than a picosecond, and you could then use a subset of the results to solve a problem of size 10 way faster.

It's saying that there is a pathological case where it is even worse than a non-memoized version at a given problem size. Considering that they mention people are trying to make HashLife run various flavors of a Turing machine, this case will be true if they succeed.

There is indeed an algorithmic way to arrive at randomness. Randomness is rigidly defined by algorithmic complexity theory as information which is incompressible. If the shortest possible representation for a given bitstring is the bitstring itself, then the bitstring is said to be random. If a program is the shortest possible representation for a given bitstring, then the program itself is random. The reason the digits of Pi, for example, aren't random is that there exists a much more compressed representation of Pi, namely a program which generates its digits. It is *NOT* because the program is deterministic and produces the same output for a given initial input every time.
That measure of compressibility is closely related to a representation's Shannon entropy. That is a measure of randomness with respect to representation, not a means of creating random outcomes.

A random outcome is one where the state at time T has no bearing on the outcome at time T+1. Even if you knew the entire state at T, it would not lead to a prediction at T+1. Name an algorithm that can do that.

"Anyone who is trying to generate random numbers by deterministic means is, of course, living in a state of sin." (I think that was von Neumann)
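A small illustration of the point (standard library PRNG, nothing exotic): for any algorithmic generator, the state at step T fully determines step T+1, so reseeding replays the exact same "random" stream:

import random

gen_a = random.Random(12345)
gen_b = random.Random(12345)

print([gen_a.randint(0, 9) for _ in range(10)])
print([gen_b.randint(0, 9) for _ in range(10)])  # identical: each output is a pure function of prior state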

In an unbounded, or effectively unbounded, system in which cycles do not exist, a universal cellular automaton encoding a random program (random in the sense of algorithmic complexity theory) would lead to a truly random phenomenon.
If we were to follow the execution trace of this automaton would we see how the initial state relates to the final result?
If we can, and this automaton is running on a physical system, it is not random at all.

It would be impossible to produce analytic equations or any model which could predict the sequence of the phenomenon precisely
Between algorithm step N-1 and final step N, would we see a relationship between them? If so, it is not random.

To give a realistic example, let's take Chaitin's number "Omega". One can define it precisely, yet it is non-computable (if it were computable, one could instantly solve the Halting Problem, as well as prove ANY theorem in mathematics magically and in constant time). However, the first 64 bits of it have been calculated. These first 64 bits are random in every sense of the word.
When it comes to probabilistically random behavior, the act of calculation indicates it is trivially non-random. A calculation takes a prior state and derives a later one based on that state. That is not random.
If it were so simple to create random numbers algorithmically, why is it that no system anywhere does it? There are a lot of users of encryption that would kill for something like that.

If your example is non-computable, what use is it in a simulation that is entirely about computation?

Yes and no. It may not violate a predictive model we have made of observed phenomena, but it may certainly violate the interpretations we make of such models.
What's there to interpret? Something happens given a certain state, or it doesn't.
Thus far, nothing beyond our reality has proven to be observable and nothing we know of has been proven to interact with anything outside of reality.
How various things arrive at their end states within a given reality is, as you said earlier, irrelevant.

Science doesn't guess at what's outside of reality, it doesn't work with it at all.
If two overarching theories predict the same result in every situation, then they are the same theory.

There's no differentiation based on a possible outside world, it could be using turtles with pieces of paper on their back for all it matters to the simulated reality.

Physical laws "worry"? :)
Limitation of my word choice, not my intended meaning.
Physical laws only operate within their context, as far as we know.
Any other factors are our intellectual concerns, and have no bearing on the relationships the laws represent.

Doing physical science is like trying to clone the output of a forever inaccessible Turing machine.
The assumption still exists that it's a Turing machine, and as such has rules that only exist or have an effect within their own context.
It's a very primal assumption, but just as a Turing machine programmed to run on a Java VM doesn't care that it's running on a VM, it must be assumed when studying reality that such abstraction is true if any attempt at study is to work.

We observe the outputs and construct a parallel program which tries to produce the same outputs.
It's highly likely we'd know long before it got to that point that our simulation was off.
If we believe the universe is not integer-based, then the representation within a simulation will always be inadequate.
It would only fully capture its subject if memory location size was infinite (not the size of the pool of memory, just the data at a given address).
As far as we know, there are irrational numbers that cannot be expressed in finite space in any positional numbering system, and they would be truncated and inaccurate for any non-infinite memory location set.
Even one time step involving an infinite amount of anything can never be calculated through in finite time.
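A tiny illustration of the finite-representation point, using ordinary 64-bit floats:

import math

print(0.1 + 0.2 == 0.3)        # False: none of these values is exact in binary
print(math.sqrt(2) ** 2 == 2)  # False: sqrt(2) was truncated to 53 bits before squaring
print(f"{math.pi:.30f}")       # only ~16 significant digits survive; the rest is noise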

If you believe that there's validity in chaos theory, then it's essentially guaranteed that something will not match, unless there is evidence of a similar limitation in reality, but that would undermine a lot of the stochastic phenomena we do know about.

We may succeed, but we can never a) prove that our program will always agree, and b) prove that our program *IS* the same as the underlying machine. c) For any given output, there may be an arbitrary number of programs all isomorphic in output, and therefore you face a kind of polytheist problem. d) There is no guarantee the universe has the "most compact/shortest" one, and e) it doesn't make sense to try to describe transcendent structures (underlying reality) in terms of your own mental concepts embedded within the reality. The MAP is not the TERRITORY.
a) That's life; we can't know unless we were outside of the reality we're in. The possibility could exist if we had access to the basic framework.

b) That's a sign of a good simulation. We aren't supposed to know anything about the underlying machine. Even if it weren't, if two machines always produce the same outcome, they are equally valid for the problem at hand.

c) Not a big problem unless you concern yourself with irrelevancies.

d) Doesn't matter there either, as long as it works.

e) It's not worth doing even if we could; unless there's a serious flaw in our reality's methods, there is no means to gain any external information.

I will continue to assert that the problem of game simulations is much more tractable and parallelizable than many people claim, and that reasonable and lifelike games can be designed not just via spatial subdivision and local physics, but also by replacement of simulation parts with facsimiles and pre-computed results (some online, some offline), and by relaxation of the constraints of universal flow of time and physics everywhere.

I'll maintain my original contention that parallelism is good, but that reckless parallelization is a sacrifice of correctness to the demons of indeterminism, and incites the dread wrath of Amdahl and al-Khwarizmi.

Hell, even The Matrix Reloaded got this right. The machines didn't simulate birds or clouds at the particle level. As the Oracle says to Neo before his big fight with the multiple Smiths, they wrote separate custom programs for many of the things in the Matrix (birds, bullets, etc.). Humans simply don't need quantum physics simulation unless they are scientists running experiments.
It took some serious liberties with latency and time flow.
It depends somewhat on how many of these pod batteries there were, and where they were located geographically. I was under the impression that the machines had colonized pretty much the entire world.
If there were pods elsewhere on the planet, then any conversation between people who happened to be in widely separated pods would inexplicably have small periods of lag.

It would be virtually impossible to battle-rap.

Not that it mattered, the movie was presupposed to have an almost mystical justification for everything, even when it was supposedly bound to technology. It had to be, since the human battery thing was incredibly stupid from a technical standpoint.
 
Synchronization within the bounds of our reality is infinitely scalable and takes no additional time: it doesn't take any time.

You still haven't defined what such synchronization is. To say that my Pentium CPU can't run a general-purpose simulation of itself that runs faster than itself is patently obvious; the existence of such a simulation could lead to infinite computation power (keep running the sim recursively). However, there is a difference between asserting this and asserting that X "takes no time" in the universe.

For the universe, nothing can move out of reality. By definition, everything that ever happens in the universe exists, so it is represented. Any combination of internal states is stored. No matter how complex the state becomes and how large the representation is, there is no pop-up for a lack of hard drive space.

This trivializes the very real problem of information loss in physics, namely: Do Black Holes Erase Information? (i.e., convert pure states into mixed states and violate unitary evolution). This is an intense debate in physics today. You simply do not know whether the universe has exactly the amount of storage it needs to store an ever-increasing amount of entropy; maybe black holes are the natural garbage collection mechanism. And as I said, on the other side is the holographic principle, which implies it has too much storage for what it does.

It's not a declaration that a problem that is N*2 in size will take less time to calculate than one of size N. If that were true, then a problem of size N=10^100 would take less than a picosecond, and you could then use a subset of the results to solve a problem of size 10 way faster.

I don't believe that's what I claimed. I claimed only that the time needed to compute step N, call it T(N), can grow smaller. This is true for the majority of inputs. In fact, for some inputs, it doesn't even need to keep growing its use of space.

It's saying that there is a pathological case where it is even worse than a non-memoized version at a given problem size. Considering that they mention people are trying to make HashLife run various flavors of a Turing machine, this case will be true if they succeed.

I already brought up universality and Turing Machines inside Life in my last post, so yes, there are pathological cases, and for the same reason that lossless data compression can't compress all inputs, time compression can't compress all runs of an algorithm. People aren't "trying to make Turing Machines" in Life; it's already been done, and you can download the TM today in most Life simulators.

Moreover, time compression that worked for all inputs would probably lead to a violation of the Halting Theorem: you would simply encode the program to be checked into a UTM inside HashLife, and then wait a finite time for it to finish.

However, we have been talking of game simulations, and I'm not sure I need my game simulation to exhibit universality or correctly deal with it without loss.

That measure of compressibility is closely related to a representation's Shannon entropy. That is a measure of randomness with respect to representation, not a means of creating random outcomes.

No, randomness in Algorithmic Complexity Theory is more closely related to Godel's Incompleteness Theorem. It is based on the fact that for any formal axiom system encoding N bits of information, one cannot prove statements containing more than N+c bits of information. The randomness is inherent: there are a large number of theorems that are nevertheless true, but undecidable in the FAS, and therefore true for completely random reasons. There is no dispute about this.


I disagree with your definition of a random outcome, which sounds suspiciously like bogus definitions of free will in philosophical arguments. It can't even be converted into a formal mathematical logic. You can't rule out, for example, that a given outcome is simply the result of an unknown cellular automaton and unknown seed (which could in fact be the sum total of all information in the simulation). For any "random outcome" by your definition, I can simply propose an underlying mechanism responsible, coupled to inaccessible state.

Randomness has nothing to do with determinism. It has to do with whether the sequence of output can be compressed smaller than the sequence itself. When this occurs, we throw up our hands and say "I can't seem to find a mechanism to compute the series of outcomes other than just to list them". Even people studying Quantum Randomness use the algorithmic concept of randomness and not "world at state T can't be a function of ANY information of world at state T-1"

Randomness is inherently related to fundamental limits of metamathematics, the very limits of mathematical epistemology.

e) It's not worth doing even if we could; unless there's a serious flaw in our reality's methods, there is no means to gain any external information.

Try reading Hans Moravec's _Mind Children_, the chapter "Newway and the Cellticks", for a gedanken experiment; or, more recently, the proposal of a possible mechanism by which about 10k bits could have been encoded in the cosmic microwave background by a creator, or otherwise, as input from the external system.

Also, try reading some introductory papers to algorithmic complexity theory and randomness here: http://www.cs.auckland.ac.nz/CDMTCS/chaitin/

P.S. The best we have, as far as acquiring an understanding of the universe goes, is rational thought based on logic. Whether the universe is in the equivalence class of a Turing Machine or not is unknown. However, if you operate strictly within logic, no one has yet been able to produce a model of a machine (even in principle) that would violate the Church-Turing hypothesis. (Remember, CT doesn't care about speed or tractability.) So much so that many physicists are now starting to argue that CT should be accepted as a physical principle. David Deutsch shows that a universal quantum computer can't compute non-recursive functions, and therefore is still, in essence, just a Turing machine, albeit a faster one. Asking whether a machine more powerful than a Turing machine can exist (*in principle*) may be like asking whether Godel's Incompleteness Theorem can be violated, or whether the reals can be mapped one-to-one onto the integers. It just may not be possible, in which case the universe would have to be based on transcendental logic.
 
With the OS-scheduled SMP multi-threading model, threads are impossible to keep in sync without a lot of complex wait states, because any other thread that timeshares on a core can slow it down. This is a problem if there is any kind of component of the task that has to be performed in series - for example, if you have to finish adding two matrices together before multiplying the result by a third matrix. That is the stark simple reason SMP is not used in supercomputing.

The stark simple reason why you don't get SMP in supercomputing is that at the levels of parallelism found in supercomputers, SMP would go in the corner and die sobbing.

The more the cores, the more contention kills your SMP utopia, as simple as that. And OS scheduling and time-sharing have nothing to do with it - you're already dead by the moment you get your gazillion threads up and running, each on its private SMP core, even if they were nicely synchronized at boot-up and with no subsequent scheduling interference whatsoever.
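For what it's worth, you can watch the effect with a toy sketch (Python multiprocessing standing in for SMP cores; the numbers are illustrative, not a benchmark). Total work is held constant; only the number of workers fighting over one shared lock changes:

import multiprocessing as mp
import time

TOTAL = 200_000

def hammer(counter, lock, iters):
    for _ in range(iters):
        with lock:                  # every increment serializes on the same lock
            counter.value += 1

def run(n_workers):
    counter = mp.Value('i', 0)
    lock = mp.Lock()
    workers = [mp.Process(target=hammer, args=(counter, lock, TOTAL // n_workers))
               for _ in range(n_workers)]
    t0 = time.perf_counter()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.perf_counter() - t0

if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "workers:", round(run(n), 3), "s")  # same total increments each run; extra time is contention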
 