Fast software renderer

Why would a core on Larrabee run half of the chip's clock?
The core is running at the full clock, but it's roughly half as wide as the dual-issue Pentium when running general code.
It was indicated in Intel slides (there was some additional reinforcement of my interpretation in this forum) that the core has a scalar x86 pipe and the vector pipe, and absent a vector instruction issue, plain x86 code is not going to take up the full width.
The performance jump from single to dual issue is close enough to doubling that I rounded it off.
 
Do you know it is actually implemented this way? Shared memory is not the only way to implement these features, as other vendors have demonstrated.
So theoretically nVidia *could* have implemented it this way, but I wouldn't be surprised if they didn't. In fact, I'd be surprised if they did.
No I don't know, that's why I used the words "I reckon".

http://forum.beyond3d.com/showpost.php?p=1289839&postcount=918

That's all I have. Feel free to suggest why graphics would leave shared memory unused.

Jawed
 
Speculation has benefits where it is appropriate, depending on the performance criteria of a design.
The wastage may be invariant over time: for code that exposes no obvious extra non-speculative work per clock, a given performance level may only be reachable through speculation.
For workloads where this is not a problem, the amount of non-computational active logic and die area expended on what turns out to be irrelevant bit-fiddling becomes increasingly inappropriate.
Jawed said:
Still, having lots of space for context on chip is currently the only solution. And with the difference in performance between on-die memory and off-die memory only increasing, there's no solution in sight.
I'll try to comment on both simultaneously to save on trees. ;)

There is a solution for having to keep lots of data on chip. That's why I pointed back to NV40. Clearly they've improved on that architecture since. What changed is that strands are processed faster when you decouple the ALUs from the texture samplers. On NV40 each instruction in a single strand had the same latency as the entire pipeline length. Nowadays it's a mix of texture sampling latencies, branch latencies, and ALU latencies. This reduced the number of strands in flight, allowing more registers per strand.

But we can do even better. That's where caches, speculation and out-of-order execution come in. Caches reduce the average memory access latency, speculation reduces average branch latency, and out-of-order execution reduces latencies due to instruction order. So all of them help process a strand much faster so we don't have to store so many contexts and have to reduce throughput when we run out of registers (slide 7).
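To put rough, made-up numbers on that: if a strand issues a long-latency fetch every 20 instructions and the average latency of that fetch is 200 cycles, you need roughly 200 / 20 = 10 strands in flight to stay busy. Halve the average latency through caching or reordering and about 5 strands suffice, leaving twice the registers for each of them. The figures are illustrative; only the proportionality matters.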

Of course the burning question is: do we really need to do even better than today's GPU architectures? In my opinion, yes. Shaders are becoming more complex every day. And to support things like deep function calls or recursion a large stack is required (CPUs typically have around 1 MB of stack space). Even though 1 MB won't be needed any time soon, I still expect a spectacular increase in the temporary data a strand needs to store. Any architecture not able to cope with that will be left behind.

So I wouldn't instantly exclude something like speculation as (part of) the solution. Maybe it shouldn't be implemented as aggressively as on CPUs, and maybe there are better ways to spend your transistors, but the answer is not simply adding ever more registers.

Oh and if you're worried about power consumption, there has been some interesting research about branch prediction confidence. When it's lower than a threshold, instruction fetch is halted till the branch is resolved. By choosing the right threshold a good balance between power consumption and utilization can be obtained, while keeping the number of strands in check.
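Here's a minimal sketch of that gating idea, purely illustrative; the counter width, threshold and names are my own assumptions, not any published design:

Code:
/* Minimal sketch of confidence-gated instruction fetch ("pipeline gating").
 * Counter width, threshold and names are illustrative assumptions only. */
#include <stdbool.h>
#include <stdint.h>

#define CONF_MAX    15   /* 4-bit saturating confidence counter          */
#define GATE_THRESH 4    /* below this, stop fetching past the branch    */

typedef struct {
    uint8_t confidence;  /* saturating counter per predictor entry       */
} branch_entry_t;

/* Keep fetching speculatively past a predicted branch? */
bool allow_speculative_fetch(const branch_entry_t *e)
{
    return e->confidence >= GATE_THRESH;  /* low confidence: gate fetch  */
}

/* Update the counter when the branch actually resolves. */
void update_confidence(branch_entry_t *e, bool prediction_was_correct)
{
    if (prediction_was_correct) {
        if (e->confidence < CONF_MAX)
            e->confidence++;
    } else {
        e->confidence = 0;   /* mispredicted: distrust this branch again */
    }
}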
 
Do you think AVX will have similar problems?
Yes. I believe they should have made it 2 x SSE, instead of SSE with 256-bit registers. Some instructions only operate on the lower 128 bits. Although this can be corrected in the future, it means we'll have to write yet another code path for AVX2. This sort of unnecessary complication slows down the adoption of new ISA extensions.

Translating existing code from SSE to AVX should have been straightforward: use the corresponding AVX instructions and process twice the data in parallel. But because some instructions don't have a 256-bit equivalent yet (or should I say 2 x 128-bit), it can be a bit more complicated. Of course they did this to save a few transistors by not having to place two equivalent SSE units in parallel, but that's a mistake in my book. They could have foreseen having to do that at a later point anyway. I wouldn't care if some 256-bit instructions had an extra cycle of latency because they're executed on 128-bit units. At least that preserves binary compatibility, and just like when they widened SSE execution units from 64-bit to 128-bit it would seamlessly increase performance of existing executables on newer hardware!
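To illustrate the kind of translation I mean, here's a trivial saxpy-style loop in SSE and its mechanical 256-bit counterpart (my own sketch; it assumes n is a multiple of the vector width and that every instruction used has a 256-bit form, which is exactly what isn't guaranteed):

Code:
#include <immintrin.h>

/* SSE: y[i] = a*x[i] + y[i], 4 floats per iteration (n a multiple of 4). */
void saxpy_sse(float *y, const float *x, float a, int n)
{
    __m128 va = _mm_set1_ps(a);
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
}

/* AVX: identical structure, 8 floats per iteration. The translation is only
 * this easy because every instruction here has a straightforward 256-bit form. */
void saxpy_avx(float *y, const float *x, float a, int n)
{
    __m256 va = _mm256_set1_ps(a);
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(y + i, _mm256_add_ps(_mm256_mul_ps(va, vx), vy));
    }
}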

Same thing for scatter/gather. They should have already added it to the ISA. It doesn't matter if the first implementation uses multiple cycles to read/write each element individually. People would already be able to use the instructions and performance would increase from one CPU generation to the next. This would also allow Intel to assess how many transistors they should throw at it, using real world applications. The problem of collecting data elements from different memory locations is only going to get bigger, so again they could have foreseen that and added the instructions early on.
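Conceptually a gather is nothing more than the loop below; a first implementation that literally does it one element at a time would already let software be written against the instruction (a hypothetical sketch, not any specific instruction's semantics):

Code:
#include <stdint.h>

/* Conceptual gather: dst[lane] = base[idx[lane]] for an 8-wide vector.
 * A first hardware implementation could legitimately do exactly this,
 * one element per cycle, and software could already target it. */
void gather8_f32(float dst[8], const float *base, const int32_t idx[8])
{
    for (int lane = 0; lane < 8; ++lane)
        dst[lane] = base[idx[lane]];
}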

Last but not least, by specifying foreseeable ISA extensions and sharing them with the competition, they can speed up adoption and just battle things out at the hardware level. There are no winners when they add different ISA extensions. Developers just use the lowest common denominator while hardware manufacturers waste precious transistors and R&D on unused features.
 
There is a solution for having to keep lots of data on chip.
To get this out of the way, all the solutions you use later still increase the amount of data on-chip.
Some of it gets moved around, but the amount doesn't go down.

That's why I pointed back to NV40. Clearly they've improved on that architecture since. What changed is that strands are processed faster when you decouple the ALUs from the texture samplers.
There were a number of things the older style of pipeline had going against it. There's nothing speculative about decoupling until you explicitly add speculation; it can enhance utilization without trying to manufacture work the hardware doesn't know exists.

On NV40 each instruction in a single strand had the same latency as the entire pipeline length. Nowadays it's a mix of texture sampling latencies, branch latencies, and ALU latencies. This reduced the number of strands in flight, allowing more registers per strand.
Register amounts have gone up. How are you defining in-flight, though? In Nvidia's parlance, there are thousands of "threads" at the ready for instruction issue at any given time in the latest chips.

But we can do even better. That's where caches,
A metric ton of on-chip data storage.

speculation and out-of-order execution come in.
There is additional internal context and intermediate storage in the hardware. Depending on the scheme, there can be a massive amount more.
A modern OoO x86 has effectively several hundred registers per-thread.

Caches reduce the average memory access latency,
This is on-chip storage, however. An objection to large amounts of registers isn't valid if pushing the same data they hold into a cache 1 mm to the side doesn't also count.

speculation reduces average branch latency,
The components of that latency interest me.
For many real-world examples, that latency is much smaller when the core is not heavily speculating or OoO. The fetch phase may get an extra stage, and the rename, ROB, and schedule stages exist because of OoO. A regular instruction set would help, too.

2/3 of the PIII's mispredict pipeline stages are due to the factors listed.

and out-of-order execution reduces latencies due to instruction order. So all of them help process a strand much faster so we don't have to store so many contexts and have to reduce throughput when we run out of registers (slide 7).

This isn't apples to apples. Nvidia's model exposes the full scope of the register allotment problem to the programmer.
OoO will have a lot of internal state that is hidden from the programmer, but is physically still there.
The internal context of a CPU thread on an OoO processor is much larger than it appears to be from the software viewpoint.

Of course the burning question is: do we really need to do even better than today's GPU architectures? In my opinion, yes. Shaders are becoming more complex every day. And to support things like deep function calls or recursion a large stack is required (CPUs typically have around 1 MB of stack space). Even though 1 MB won't be needed any time soon, I still expect a spectacular increase in the temporary data a strand needs to store. Any architecture not able to cope with that will be left behind.
The speculation versus non-speculation debate doesn't really cover this, as fully non-speculative CPU designs can be made to do all those things.

Oh and if you're worried about power consumption, there has been some interesting research about branch prediction confidence. When it's lower than a threshold, instruction fetch is halted till the branch is resolved. By choosing the right threshold a good balance between power consumption and utilization can be obtained, while keeping the number of strands in check.
Do they take measures to limit the on-chip storage penalty of having additional data per branch predictor entry?
Is there also research into increasing branch predictor throughput?
 
Personally, I am in favor of non-speculative designs for parallel architectures. And while we are on this topic, let me go out on a limb and proclaim that in 10 years, the massive data centers will be shifting from the current CPUs over to massively multithreaded CPUs which do no speculation.

Cloud computing has all the tools necessary to crack open the death grip of x86. Power efficiency is paramount, it has/will have the scale of the internet to feed the fabs, and virtualization is there to parallelize any workload that you can throw at it. :LOL:

And for MS and Google at least, an ISA switch is a JIT compiler away. And all the optimizations in the compilers' ARM backends will be put to a totally different use.

That does not mean that Intel will watch as the rest of the world eats its lunch and dinner. I mean that competition in this area will be severe.

On the area of power efficiency at least, I think the debate is well settled. x86 sucks here. It's not that modern ARM cores don't do superscalar or OoOE, it's just that their ISAs are far, far more efficient when they are doing the same thing. For serial code, OoOE is good. It so happens that more efficient in-order, scalar CPUs gain more when they speculate and do multi-issue, partly because of their higher code density.

Now, x86-decoders-are-tiny. Sure, folks, they are. But they don't consume 0 power. And if you are running data centers for a cloud, then changing ISA is like changing TV channels. And in hard times, every penny counts.
 
Where did you get that information ?

I don't know, it's too long ago... it was all over the place some time ago, when the buzz about Larrabee started. The X4000-series is the last of their current generation, and the next generation of IGPs will be based on Larrabee.
 
Personally, I am in favor of non-speculative designs for parallel architectures. And while we are on this topic, let me go out on a limb and proclaim that in 10 years, the massive data centers will be shifting from the current CPUs over to massively multithreaded CPUs which do no speculation.
That's quite a jump from one extreme to the other. There would be workloads or service criteria that wouldn't be as forgiving of reduced straight-line performance.
It's also a case where off-die factors such as lock contention or maintaining a certain level of memory consistency without cratering performance can change the situation so that speculation would still be necessary in order to allow parallel scaling for larger systems.

Cloud computing has all the tools necessary to crack open the death grip of x86. Power efficiency is paramount, it has/will have the scale of the internet to feed the fabs, and virtualization is there to parallelize any workload that you can throw at it. :LOL:
I guess we'll see about that. I'm not yet sold on the idea that it solves as many problems (or that there are as many problems) as is claimed.


On the area of power efficiency at least, I think the debate is well settled. x86 sucks here. It's not that modern ARM cores don't do superscalar or OoOE, it's just that their ISAs are far, far more efficient when they are doing the same thing. For serial code, OoOE is good. It so happens that more efficient in-order, scalar CPUs gain more when they speculate and do multi-issue, partly because of their higher code density.
Current ARM cores don't perform OoOE. It will be interesting to see what the next iteration will be like.
That there is a power penalty to x86, even Intel has admitted. Whether it can eclipse Intel's engineering and process lead enough to be a deal-killer is not yet answered.
ARM is perhaps the only ISA now with any kind of install base anywhere that can threaten Intel at the low end, but ISAs don't work miracles.
 
On the area of power efficiency at least, I think the debate is well settled. x86 sucks here. It's not that modern ARM cores don't do superscalar or OoOE, it's just that their ISAs are far, far more efficient when they are doing the same thing.

Indeed ARM has a very neat ISA; imagine doing a = b + (c << d) or a = b*c + d in one 32-bit instruction.
Those are 4 different registers, and it can do this even conditionally.
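For those who haven't written ARM assembly, in C terms each of these can map to a single 32-bit instruction (the register choices in the comments are just what a compiler might plausibly emit):

Code:
#include <stdint.h>

uint32_t shift_add(uint32_t b, uint32_t c, uint32_t d)
{
    return b + (c << d);   /* e.g. ADD r0, r0, r1, LSL r2 */
}

uint32_t mul_add(uint32_t b, uint32_t c, uint32_t d)
{
    return b * c + d;      /* e.g. MLA r0, r0, r1, r2     */
}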

Not that Intel cannot design good ISAs; the i960 was really good, different but similar in quality to ARM. Very strange that it was not used for Larrabee.
When marketing managers get into control the most bizarre of things can happen.

Now, x86-decoders-are-tiny. Sure, folks, they are. But they don't consume 0 power. And if you are running data centers for a cloud, then changing ISA is like changing TV channels. And in hard times, every penny counts.

I have here a picture of one of the Nehalem cores in front of me, annotated with functional part naming. It may not be accurate, as 'instruction decode and microcode' appears to be larger than branch prediction. Maybe you can point to a more accurate drawing?
 
That's quite a jump from one extreme to the other. There would be workloads or service criteria that wouldn't be as forgiving of reduced straight-line performance.

If a workload scales with cores well enough, then straight-line performance is not the be-all and end-all of all things. And even for serial code, you pay by the hour on the cloud, so it gets amortized there. Even if you are running something like entropy encoding on the cloud, at times of high load your work may not get as much time on a CPU.

With an ISA like ARM, each instruction can do a lot more. So dual issue for ARM makes much more sense.

Which parallel workloads are unforgiving of reduced straight line perf?

It's also a case where off-die factors such as lock contention or maintaining a certain level of memory consistency without cratering performance can change the situation so that speculation would still be necessary in order to allow parallel scaling for larger systems.



Current ARM cores don't perform OoOE. It will be interesting to see what the next iteration will be like.

here

http://en.wikipedia.org/wiki/ARM_architecture

for Cortex-A9, it says it does OoO and superscalar. I dunno if it is on the market yet. I thought it was.

Indeed ARM has a very neat ISA; imagine doing a = b + (c << d) or a = b*c + d in one 32-bit instruction.
Those are 4 different registers, and it can do this even conditionally.

Yup. I really like this ISA. :D

I have here a picture of one of the Nehalem cores in front of me, annotated with functional part naming. It may not be accurate, as 'instruction decode and microcode' appears to be larger than branch prediction. Maybe you can point to a more accurate drawing?

Maybe branch prediction is larger than decode. But x86 decoders are more complex than they are for RISCs, and afaik ARM does not use microcode.
 
To get this out of the way, all the solutions you use later still increase the amount of data on-chip.
Some of it gets moved around, but the amount doesn't go down.
Size matters, but it's what you do with it that counts most! ;)

Once again, let's look at NV40. It has quite a big register file, but it has poor usage. Subsequent architectures improved on it a lot. The amount of data hasn't really gone down; the relative transistor budget for registers is roughly the same. But the available storage is used much more efficiently, allowing each strand to use more registers without crippling performance. So this is clear proof that architectural improvements can be a better option than just throwing more silicon at it.

Same thing for texture caches, even in modern architectures. If the average latency is too high then you'll run out of threads to hide that latency. So some minor changes to the prefetching heuristics (speculation!) can make a big difference. Likewise an increase in size costs maybe 1% of additional die space but if that fixes the biggest bottleneck it's transistors very well spent.

One of the remaining problems with today's GPU architectures is that they have lots and lots of small caches, buffers, queues, register files, local storage, etc. Each of them has to be overdimensioned a little to keep it from becoming a bottleneck. So on average only part of the storage is used. The rest is mostly wasted, just sitting there waiting for you to one day give it a workload that results in 100% utilization. Ironically, when that happens, it becomes a bottleneck. Given the multitude of buffer-like structures, there's always going to be one that is swamped, while the rest sits there as dead silicon. The reason we have this many buffers is that it's the optimal configuration for running 3DMark. If you know how much storage is needed for every stage of the pipeline, a fixed dedicated structure is most efficient. But times are changing...

When you have to support a wide variety of workloads, and even future ones that you don't know about, the solution is unification. Think of the unification of vertex and pixel processing. A feeble first step in unification, but with remarkable results. You're no longer bottlenecked by either vertex or pixel processing. And despite the fact that dedicated vertex and pixel units are smaller, so computational density is higher, unified architectures achieve better performance in real-world applications. Unification compensates for the variation in workload, across different applications but also within the same application. You just use the silicon for what you need most. So the next step is unifying the data structures, to compensate for the variation in storage needs.

Larrabee is leading the way. It uses the L2 cache for many different purposes. It stores unprocessed vertices, processed vertices, primitive gradients, framebuffer data, shader constants, spilled registers, etc. None of that data poses a bottleneck any more, and storage not used by one task is automatically used to assist other tasks. Note that much of it can be controlled by software. So Larrabee can cater for many different workloads, including 3DMark but also things current GPUs don't handle well, and even future applications...

There's obviously a tipping point for everything. CPUs have too much storage (cache), wasting precious die space, while GPUs have too little, decimating performance with certain workloads. But both can be improved by borrowing ideas from each other. CPUs can increase SMT to hide latency, at a minor increase in thread context but allowing a decrease in relative cache size, while GPUs can unify data storage structures, also leading to better utilization. Note for instance that Nehalem-EX will sport a seemingly massive 24 MB L3 cache, but that's actually less cache per thread than Penryn (24 MB over 8 cores and 16 threads is 1.5 MB per thread, versus 3 MB per thread for a dual-core Penryn with 6 MB of L2), for a server chip versus a desktop chip to boot.

So I'm really not convinced that speculation won't ever find its way into GPUs. You don't have to take it to the extreme like today's CPUs. That would be past the tipping point. But when branches are killing performance because you've run out of threads, resorting to speculation doesn't sound so bad. Note that you won't have to speculate with classic graphics workloads with short shaders, but it's a lifesaver for workloads that are branchy and use lots of registers. It has an implementation cost too, but with the things we'd like to run on the GPU becoming ever more complex there's bound to be a day when speculation is cheaper than adding more context storage.
There were a number of things the older style of pipeline had going against it.
Absolutely. And in a few years time you'll say exactly the same thing about today's architectures.
There is additional internal context and intermediate storage in the hardware. Depending on the scheme, there can be a massive amount more.
A modern OoO x86 has effectively several hundred registers per-thread.
I know. The Unabridged Pentium 4 mentions 128 integer alias data registers and 128 floating-point alias data registers. But they didn't grow the physical register file when adding Hyper-Threading. Also, P6 had 40 alias data registers, in total...

My point is that adding out-of-order execution and speculation doesn't force you to have a physical register file much larger than the logical register file. In fact they can be the same size. Register renaming is not a requirement of either technique. So once again it's just a knob the GPU designers can turn to optimize the architecture. There's no need to turn it all the way to the CPU setting. And if turning it off leads to the best performance/area/wattage balance for the workloads they want to cater to, then fine, I accept that. But I do think it's wrong not to have a knob at all. For a long time the knob (or switch) to unify the vertex and pixel processing was set to off, but it would clearly have been foolish to conclude that it would never be needed. Last but not least, the law of diminishing returns tells GPU designers that they can't keep turning the same knobs to achieve higher performance. So one day they'll have to look at the remaining ones as well...

Anyway, unless there is some glaring evidence this will never happen, I'll leave the discussion at this. It has cost enough trees already and it looks like we're not going to reach consensus. Which is fine. After all we're speculating about speculation. ;)
 
If a workload scales with cores well enough, then straight-line performance is not the be-all and end-all of all things.

Which parallel workloads are unforgiving of reduced straight line perf?
Remote computation for any user interactive workloads is going to have a latency maximum. Perhaps a server used for OTOY could service 100,000 simultaneous users on a little platformer game, if the response time or frame rate were 1 frame per second. It would no longer be useful for the intended workload.

In somewhat related hardware, various web server and transaction benchmarks and by extension the consumers of such services have maximum allowable response times.


With an ISA like ARM, each instruction can do a lot more. So dual issue for ARM makes much more sense.
That's not necessarily true, and not necessarily a good thing from a silicon perspective.
High-performance ARM may internally split up some of the more anachronistic things like the built-in shift option into a separate internal operation, because it is a potential timing liability.

for Cortex-A9, it says it does OoO and superscalar. I dunno if it is on the market yet. I thought it was.
The products I heard rumors about were for next year.

Maybe branch prediction is larger than decode. But x86 decoders are more complex than they are for RISCs, and afaik ARM does not use microcode.

I know some older and simpler ARM chips did not have microcode. I do not know whether that is true of the more recent ones.

Once again, let's look at NV40. It has quite a big register file, but it has poor usage. Subsequent architectures improved on it a lot. The amount of data hasn't really gone down; the relative transistor budget for registers is roughly the same. But the available storage is used much more efficiently, allowing each strand to use more registers without crippling performance. So this is clear proof that architectural improvements can be a better option than just throwing more silicon at it.
Do you mean NV45 or G80?
G80 would not be a data point in your favor in terms of strands in flight or the number and size of the registers.

Same thing for texture caches, even in modern architectures. If the average latency is too high then you'll run out of threads to hide that latency.
Those caches are there more for saving bandwidth than hiding latency. The optimal thread counts and the assumed arithmetic density for the bulk of GPU work are proportioned so that a trip to DRAM is accounted for.

So some minor changes to the prefetching heuristics (speculation!) can make a big difference. Likewise an increase in size costs maybe 1% of additional die space but if that fixes the biggest bottleneck it's transistors very well spent.
That would be a different kind of speculation than what was focused on earlier. It is interesting that in many cases, once bandwidth utilization becomes high without prefetching, it is often better to disable it, at least in the case of multi-socket CPU work.

Once there is no spare bandwidth, speculation becomes a liability, and the upsides are much smaller for architectures that already tolerate a massive amount of latency. As speculation and prefetching for CPUs can easily increase bandwidth demands 2-3 times, this might be why Larrabee's hardware has not been disclosed as doing much of this.

RV770 does prefetch based on what the triangle setup hardware determines are necessary texture coordinates. It's not the stride-based prefetching CPUs do, as it actually knows what those addresses are. It still might be speculative; there might be ways to branch away for some lanes that would have consumed the fetches. I'm not versed in those particulars.

So at least RV770 speculates in a narrow sense for about 5% of the die.
The UVD decoder might have some amount, as it's a specialized MIPS-based core of some sort, I think. I haven't seen much analysis on that one.

Larrabee is leading the way. It uses the L2 cache for many different purposes. It stores unprocessed vertices, processed vertices, primitive gradients, framebuffer data, shader constants, spilled registers, etc.
Textures?
And no bottlenecks?
I don't recall that the first step in eliminating bottlenecks when working with many processes is to make them all use the same resource.
There are always bottlenecks, unless the design purposefully leaves performance on the table everywhere.

None of that data poses a bottleneck any more, and storage not used by one task is automatically used to assist other tasks.
From a software standpoint, it sounds like this all comes totally free...

But when branches are killing performance because you've run out of threads, resorting to speculation doesn't sound so bad.
That's not really the reason why branches kill performance in SIMD architectures.
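(For readers following along: the usual culprit is lane divergence within a SIMD batch. When lanes disagree on a branch, both paths are executed under a mask, so the cost is paid regardless of how many other strands are waiting. A rough scalar emulation of the idea, with a made-up lane count:)

Code:
/* Scalar emulation of predicated SIMD execution over 8 hypothetical lanes.
 * When the per-lane conditions diverge, both paths are evaluated and the
 * result is selected per lane; extra strands in flight don't remove that cost. */
#define LANES 8

void divergent_branch(float out[LANES], const float in[LANES])
{
    for (int lane = 0; lane < LANES; ++lane) {
        int   take_if     = (in[lane] > 0.0f);  /* per-lane predicate */
        float then_result = in[lane] *  2.0f;   /* "if" path          */
        float else_result = in[lane] * -0.5f;   /* "else" path        */
        out[lane] = take_if ? then_result : else_result;
    }
}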


It has an implementation cost too, but with the things we'd like to run on the GPU becoming ever more complex there's bound to be a day when speculation is cheaper than adding more context storage.
That's a pretty fuzzy statement.
Any unspecified amount of X can be made cheaper than any unspecified addition of Y.

Factors like arithmetic throughput growth, bandwidth constraints, and power constraints would typically make this case unlikely for most reasonable scenarios.

Absolutely. And in a few years time you'll say exactly the same thing about today's architectures.
It will get said of all architectures at some point. Designs are meant to target the current workload and a small window of time in the future.
The product in question had serious problems at the outset.

I know. The Unabridged Pentium 4 mentions 128 integer alias data registers and 128 floating-point alias data registers. But they didn't grow the physical register file when adding Hyper-Threading. Also, P6 had 40 alias data registers, in total...
The P4 as a data point would serve as one where speculation had been taken to an excessive level.
The fact that a lot of on-chip storage did not grow, combined with the interference of extreme speculation, made performance highly unpredictable. And the later Prescott that actually fixed some of those problems was thermally limited to the point that it mostly didn't matter.

My point is that adding out-of-order execution and speculation doesn't force you to have a physical register file much larger than the logical register file.
The register file for a Tomasulo OoO engine is the most obvious source of context growth.
Changing to a different method still adds context, be it a central table or other forms of status tracking.
Are there methods that add less context? Yes, but they are far less effective without register renaming and would encourage the 64-128 ISA-visible registers we see anyway.
The tiny register file of x86 might be a benefit, as context switches are much less painful as a result, but it wouldn't be a benefit if it weren't for register renaming.

Last but not least, the law of diminishing returns tells GPU designers that they can't keep turning the same knobs to achieve higher performance. So one day they'll have to look at the remaining ones as well...
The primary knobs they've most recently been looking at are improving on-chip communications, divergence handling, and getting smart about memory traffic.
OoO might mitigate some of the first by making some latency hiccups more tolerable.
OoO and speculation tend to be negatives on the rest.

Anyway, unless there is some glaring evidence this will never happen, I'll leave the discussion at this. It has cost enough trees already and it looks like we're not going to reach consensus. Which is fine. After all we're speculating about speculation. ;)

As you wish. The thread's a bit off on a tangent, anyway.
 
Current ARM cores don't perform OoOE. It will be interesting to see what the next iteration will be like.
That's not really fair; they are generally used as single cores in larger systems ... with decreasing process sizes, their area and power consumption become less and less important. Wasting some for ease of development and single-thread speed is becoming a definite option. Microcontrollers in a way are coming to the same point normal CPUs were decades ago.
 
That's not really fair; they are generally used as single cores in larger systems ... with decreasing process sizes, their area and power consumption become less and less important. Wasting some for ease of development and single-thread speed is becoming a definite option. Microcontrollers in a way are coming to the same point normal CPUs were decades ago.

That sentence was in the context of comparing efficiency with x86.
The A9, when it comes out, would be a chip that indicates how much power the ISA saves once it has a performance target similar to something like a ULV Core chip.
Until the chip comes out in product form, we can't take its power numbers at face value, either.
 
Microcontrollers in a way are coming to the same point normal CPUs were decades ago.

What do you imply with that?
The ARM started as a desktop processor 22 years ago:
http://en.wikipedia.org/wiki/Acorn_Archimedes

It was made by a company called Acorn, not too different from how Apple was at the time.
This company made all the hardware for its PC, including the microprocessor, and also the operating system.
The OS had an advanced windowed GUI and it did cooperative multitasking.

Technologically it was miles ahead of how the x86 PC was at that time.
From a graphics point of view it was a dream machine for programming.
The ARM instruction set was so simple and powerful that you could actually enjoy writing assembler.
Over a time span of a decade I wrote dozens of graphical demos and algorithms including the original FQuake.
 