Was Cell any good? *spawn

Arwin · Aug 7, 2012

So, I did remember correctly:

If we consider the DMA transfers (Memory-LS communication) and the computation, the
theoretical speed-up should be the same as when not considering the communication. But this
is not the case as the PPE cannot feed all the SPEs with the required input and write back the
output results in the main memory global system fast enough.

and

The expected gain in performance was twice the actual that we are getting. The reason is due
to the idle time during which the SPEs are waiting for data from the PPE. The bottleneck is
reordering data for the gather/scatter task. The reasons could be the low peak performance of
the PPE (IPC throughput) and the number of L1/L2 cache misses (memory bound).
5.1.5 Conclusions
The major constraint is the PPE performance. The PPE is an in-order execution processor
with very few functional units. This property makes it difficult to get good performance on
the critical task that is executed on it. We cannot rely on executing critical code on such a
processing unit. We should port it (if possible or feasible) to the SPEs.
Data that we have to use for element loop computation is sparse in main memory and nontrivial to access from the SPEs. To help on that task the PPE gathers/scatters the required data
into a sequential buffer. The main unsolved problem is that the PPE driven gather/scatter
approach is not efficient and the SPEs are only working 60% of the time in the element loop
computation; they are idle waiting for data from the PPE in the remaining time. If we would
have started a new code from scratch we would have changed noticeable the data layout
structure. The data layout can be an important handicap for DMA transfers to the SPEs.
The solution could be to include a better PPE unit (Power6 or Power7 processor) in the
Cell/B.E. chip or implement the gather/scatter task through SPEs using DMA list (SPE
driven). In fact, the SPE driven approach for gather/scatter task can be implemented with the
actual hardware. However it requires DMA list research and maybe some reordering of the
mesh data at the beginning. It should be interesting if future Cell/B.E. implementations would
not have constraints for DMA list transfers of less than 16bytes.
We have SPE-wised the element loop computation, and the second remaining computational
part is the solver. Most of the time the solver executes SpMV (sparse matrix vector
operation). To further optimize the program we could provide an internal/external library with
such linear algebra operations on the SPEs.

Mike Acton also reached this conclusion and discussed them on these forums, that PPE feeding bottleneck could be overcome by the SPE doing all this themselves.

Note, by the way, that only three projects were actually optimised to use the SPEs at all as discussed in 6-5 (Alya, BSIT and Siesta). Some of the others were ported but only used PPE.

SIMD (Single Instruction, Multiple Data) is processing in which a single instruction operates
on multiple data elements that make up a vector data-type. SIMD instructions are also called
vector instructions. This style of programming implements data-level parallelism. Most
processor architectures available today have been augmented, at some point, in their design
evolution with short vector, SIMD extensions. Examples include Streaming SIMD Extensions
(SSE) for x86 processors, and PowerPC’s Velocity Engine with AltiVec and VMX. The
different architectures exhibit large differences in their capabilities. The vector size is either
64 bits or, more commonly, 128 bits. The register file size ranges from just a few to as many
as 256 registers. Some extensions only support integer types while others operate on single
precision, floating point numbers, and yet others process double precision values. Today, the
Synergistic Processing Element (SPE) of the CELL processor can probably be considered the
state of the art in short vector, SIMD processing. Possessing 128-byte long registers and a
fully pipelined fused, multiply-add instruction, it is capable of completing as many as eight
single precision, floating point operations each clock cycle.

This was in 2009, but I think this part was not changed from the original Cell, so their SIMD at least has managed to stay 'state of the art', at least according to this paper.

aaronspink said:
tool wise Cell never really has had a good set. One thing you notice pretty quickly is that anything you want to optimize kernel wise almost always has a call/function within MKL. Cells tools and libraries are severely lacking even today compared GPGPU. A large part of Cell's issues stem from it being a relatively exotic architecture will very poor tools support. It is interesting to contrast Cell in that regard with the data that came out of the TACC workshop on Intel's MIC which had large full scientific workloads ported in days and with good scaling.

I don't think many people disagree with you on this one. As the article says itself, 'the control is put back into the hands of the programmer' with the downside being that you have to do a lot of work you otherwise wouldn't have. Even with some of the tools maturing and becoming quite good (and the Cell simulator I think is pretty helpful too, as it allows you to follow your data exactly, quite nice), Cell remains (too much of) a specialist device to become very mainstream. But it is still a really, really cool design and chip to have in a console, and of all the things that have gotten in the way of the PS3, I'd say the blue laser was a far bigger issue (delaying its launch) as well as the company that made the device being crap at the software side of things.

Thanks for digging up these papers by the way.

EDIT: and respect to the guys working on this. They really know their stuff and it is surprising how far they got with the amount of time they've spent.

Nesh · Aug 7, 2012

aaronspink said:
Cell by and large was their plan for a GPU. Hence why Cell does ok at GPU tasks like geometry and anti-aliasing, etc. The only reason Sony had to scramble at the last moment is that their Cell as graphic theory didn't work.

Even if that was the case, it doesnt go against what I stated. The problem was not Cell, it was their initial planning for the overall console design and Cell was a great CPU that if Sony planned from the beginning to have a proper GPU the PS3 would have had a Cell+better GPU.

Nesh · Aug 7, 2012

Shifty Geezer said:
Absolutely. Cell is not IBM's design. They wanted homogenous multicore and didn't want SPEs. SPEs were all Toshiba's doing, and so it's no wonder that Cell schizmed into IBM's CPU and Toshiba's SPURSEngine.

That makes me wonder what part of the overall CELL design was part of Sony's vision

Shifty Geezer · Aug 7, 2012

Nesh said:
That makes me wonder what part of the overall CELL design was part of Sony's vision

A scalable architecture AFAIK. None of the particulars have been attributed to Sony from what I've seen.

jonabbey · Aug 7, 2012

Shifty Geezer said:
A scalable architecture AFAIK. None of the particulars have been attributed to Sony from what I've seen.

Though Kutaragi claimed that he was the one to insist on 8 SPEs for the aesthetics of it all. :smile:

Carl B · Aug 7, 2012

I think Kutaragi was definitely the man at the top of the whole thing, to both the betterment and detriment of the whole endeavor at times, and Sony had quite a bunch of engineers dedicated to the project as well. It's just the shouting matches of which we have been made aware occurred due to clashes of philosophy between IBM and Toshiba... and I think really, beyond the engineers in isolation, to an extent driven by the egos and corporate interests aligned on all sides.

Gubbi · Aug 8, 2012

Carl B said:
For the FFT implementation, it was a combination of not being able to port to the SPEs in this instance and bad PPE performance. I did give the page numbers you know.

The PRACE FFT performance number is not comparable to the others because PRACE ran mod2f on a CELL cluster with a problem size much bigger than the other papers listed here. The low performance is a result of data having to travel over PCIe when accessing data on other nodes.

What is intersting about the FFT solution is not the mediocre performance, but the absurd amount of work needed to get it up and running.

Carl B said:
I'll highlight also for doc D6-6 that pages 59 shows Cell as the top performer on mod2am, and page 60 affirms the relative strengths. A tweaked 2005/2006 chip beaten (barely) by an 8-core Nehalem and NVidia C1060, when you consider wattage, die sizes, and transistors, an extremely strong showing.

mod2am is dense linear algebra, - linpack stuff. Data access patterns are regular and computational density is about as high as it gets. CELL reaches 79% of peak FP performance, Core i7 reaches 94%. CELL wins because of higher peak FP performance.

If all computational problems could be reduced to dense linear algebra, the world would be ruled by vector CPUs. They can't, so it isn't.

Cheers

Andrew Lauritzen · Aug 8, 2012

Vitaly Vidmirov said:
You can use cache where needed, with any cache-policy, prefetch data you need in advance. All this is not possible on "classic" CPUs.

What on earth are you talking about? Are you honestly suggesting that software caches are as efficient and viable vs. their hardware counterparts? And you can't prefetch on CPUs?? What?

Vitaly Vidmirov said:
GPUs are SPMD machines and waste resources with their wide warps/wavefronts.

Uhh... they are SIMD machines, just like the Cell and CPUs and other stuff. SPMD is just programming model trickery. And sure wide SIMD has trade-offs, but we're into a pretty silly argument if you're really saying that 4-wide is some sort of magical efficiency maxima for all workloads. Furthermore this argument will have Cell losing squarely against OOE CPUs.

Vitaly Vidmirov said:
"Advanced" prefetchers that Intel uses is a joke compared to explicitly programmable DMAs.

LOL. Yeah, cause hardware prefetchers have *such* a hard time on DMA-style linear block transfers. And you're totally unable to prefetch in software...

You know what, this is getting so silly that I'm just going to stop there. No offense, but I've lost all confidence that most of the participants in this conversation have actually written optimized code on all of these different architectures. Without that background it's just fanboy BS and ignorance talking.

Arwin said:
I'd love to hear (honestly) about recursive/irregular algorithms actually being the faster ones. As far as I've learnt, by far the most efficient method of doing almost anything these days is streaming data along a factory line of modifier code.

Nah, that's just the brute force way. Most algorithms have the dumb brute force way that lets the throughput people advertise their massive GFLOPS numbers and then they have the real way that handles massively more data but requires irregular workflow. There was even a great article from the early days (80s IIRC) calling out parallel computing papers for this stupidity but alas we still see it to this day.

Two examples of the top of my head are the silly O(n^2) N-body numbers all the GPU people love to spout off vs the realistic algorithms (hierarchical, clustered) ones that everyone actually uses. Similarly most graphics algorithms do far better with BVH's, etc. than the standard brute force nonsense. I just showed some numbers in my BPS 2012 presentation today that demonstrate how the current GPU methods of light culling (such as used in BF3) typically do about two orders of magnitude too many fully-redundant tests, and thus even a modest CPU can outperform a monster GPU (by a wide margin considering perf/W).

Gubbi · Aug 8, 2012

Andrew Lauritzen said:
Two examples of the top of my head are the silly O(n^2) N-body numbers all the GPU people love to spout off vs the realistic algorithms (hierarchical, clustered) ones that everyone actually uses. Similarly most graphics algorithms do far better with BVH's, etc. than the standard brute force nonsense. I just showed some numbers in my BPS 2012 presentation today that demonstrate how the current GPU methods of light culling (such as used in BF3) typically do about two orders of magnitude too many fully-redundant tests, and thus even a modest CPU can outperform a monster GPU (by a wide margin considering perf/W).

Let's extend the last one to all kinds of space decomposition (KD-tress, octrees etc) since its used everywhere in games, from game logic to physics.

Cheers

sebbbi · Aug 8, 2012

Andrew Lauritzen said:
What on earth are you talking about? Are you honestly suggesting that software caches are as efficient and viable vs. their hardware counterparts? And you can't prefetch on CPUs?? What?

I don't think he was suggesting that. Hardware caches are of course better for random access patterns. However if you know your own access pattern (for example have generated an array to data pointers in previous algorithm step for data fetching) and are going to manually prefetch anyway, a fully manual local store is often (slightly) better than an automated cache, because you have full control over it. Coding manual cache logic for random access memory patterns of course is a major PITA.

Andrew Lauritzen said:
And sure wide SIMD has trade-offs, but we're into a pretty silly argument if you're really saying that 4-wide is some sort of magical efficiency maxima for all workloads.

That's actually a very interesting topic of discussion. I have been doing some AVX (eight wide vector) coding lately, and the step from 4 wide vectors is actually quite large. For SoA style large scale stream processing both are pretty much identical to code (and 512 bit and 1024 bit AVX2/3/4 would also be pretty much the same). Vector width can be easily abstracted away. But for simple 3d/4d linear algebra code for single objects (process/access object positions and rotations in game world) 8d vectors are hard to use, since you cannot just simply use your vector classes for math (simple linear algebra has almost 1:1 mapping to 4d vector intrinsics). To utilize all the 8 lanes you basically have to "VLIW" two operations together manually to fill the vectors. This isn't often feasible.

I would say that four wide SIMD is easier to use compared to wider SIMD units, because it is perfect for simple linear algebra, and games tend to do a lot of simple linear algebra (game worlds are 3d after all, and while 3d isn't exactly 4d, the efficiency is still very good). Wider SIMDs pretty much force you to do large scale SoA style batch processing. This isn't exactly a bad thing, since SoA processing tends to be more efficient. But it's not as easy to use, and not suitable for all kinds of algorithms.

Of course if we bring Cell to this discussion, you better be doing SoA style linear batch processing on it anyways. So 4 wide vectors do not make Cell an easier processor to utilize. But for general purpose CPUs (with good cache hierarchies) it's often easier to utilize 4d vectors compared to wide vectors.

Andrew Lauritzen said:
Two examples of the top of my head are the silly O(n^2) N-body numbers all the GPU people love to spout off vs the realistic algorithms (hierarchical, clustered) ones that everyone actually uses. Similarly most graphics algorithms do far better with BVH's, etc. than the standard brute force nonsense. I just showed some numbers in my BPS 2012 presentation today that demonstrate how the current GPU methods of light culling (such as used in BF3) typically do about two orders of magnitude too many fully-redundant tests, and thus even a modest CPU can outperform a monster GPU (by a wide margin considering perf/W).

I know exactly what you are talking about. We did our (tiled deferred) light culling using CPU in our last project. GPU processing would have required too much brute force (current generation console GPUs are not that well suited for better algorithms than brute force).

Quadtree based light culling reduces the required light culling tests drastically. It wouldn't make the memory access pattern irregular (as long as the light list fits in thread block shared memory). Of course it's impossible to make a generalization of these kinds of approaches, and it's hard to make algorithms that scale automatically to different sizes of thread block shared memories (no performance cliffs and always a proper result). Proper cache hierachies (and cache oblivious algorithms) are great for this. I think that (almost) everyone in this thread agrees that Cell isn't a good general purpose CPU. It isn't good for writing reusable scalable code (automatic memory hierarchy and parallel scaling). However you can get good performance out of it in a closed box (console), if you spend considerable time to adjust your algorithms (and data access patterns) to be suitable for it.

When I earlier said that Cell forced developers to design around the data access patterns, I didn't mean to imply that all developers were using bad access patterns in PS2/GC/PC games. The local stores weren't even the biggest change in the console architecture. Multithreading was a much bigger change and 500+ cycle memory latencies were also a really big change (if I remember correctly the memory latency on PS2 was around 40 cycles). Of course best developers were aggressively optimizing their memory access patterns in PS2 days as well, but that wasn't as crucial as it is for current generation consoles (Xbox 360 & PS3). For example in our PSP game (Warhammer 40K: Squad Command) we pretty much focused all our optimization efforts for improving graphics rendering performance. CPU memory access patterns never were a huge bottleneck for architectures of that era. But when we started developing for Xbox 360, it was clearly apparent that we had to make some big changes for our data structures

Shifty Geezer · Aug 8, 2012

Gubbi said:
The PRACE FFT performance number is not comparable to the others because PRACE ran mod2f on a CELL cluster with a problem size much bigger than the other papers listed here. The low performance is a result of data having to travel over PCIe when accessing data on other nodes.

Okay. Thanks for the explanation.

What is intersting about the FFT solution is not the mediocre performance, but the absurd amount of work needed to get it up and running.

It's so absurd as to be hard to believe! That doesn't mean it's not accurate, but I'd like to know the ins and outs so that I could accept the results.

mod2am is dense linear algebra, - linpack stuff. Data access patterns are regular and computational density is about as high as it gets. CELL reaches 79% of peak FP performance, Core i7 reaches 94%. CELL wins because of higher peak FP performance.

That was its point, although the lower efficiency contradicts some opinions of the pro-Cell side. Cell clearly isn't getting greater efficiencies in this case from its memory subsystem.

sebbbi · Aug 8, 2012

Gubbi said:
Let's extend the last one to all kinds of space decomposition (KD-tress, octrees etc) since its used everywhere in games, from game logic to physics

Dice published an whitepaper about Battlefield 3 viewport culling some time ago. In their previous technology they used bounding volume hierarchies (KD-trees and octrees are bounding volume hierarchies as well). In their new engine they switched to regular grids and "brute force" linear SoA vector batch processing. Culling performance improved by 3x.

This is a good example how improving your memory access patterns and transforming your problem to a linear stream processing problem can bring you huge gains. Linear processing is very easy to vectorize and run parallel on several CPU cores, so the slight bit of extra brute force processing (caused more coarse large scale block culling) isn't a big problem.

In general for current generation game programming, you do not want to use traditional (one node -> pointer -> one node) tree structures, because these structures are really bad for CPU caches. You want to be using bucketized structures (that waste slight bit of arithmetic instructions) for much improved memory access efficiency.

sebbbi · Aug 8, 2012

DICE culling stuff in page 70 (and lots of other data oriented stuff from their technology): http://dice.se/wp-content/uploads/Introduction_to_Data-Oriented_Design.pdf
Slides about DICE culling system: http://www.slideshare.net/DICEStudio/culling-the-battlefield-data-oriented-design-in-practice

Arwin · Aug 8, 2012

Shifty Geezer said:
It's so absurd as to be hard to believe! That doesn't mean it's not accurate, but I'd like to know the ins and outs so that I could accept the results.

The 6.6 paper explains most stuff in fair detail.

That was its point, although the lower efficiency contradicts some opinions of the pro-Cell side. Cell clearly isn't getting greater efficiencies in this case from its memory subsystem.

From the paper:

Suitability of the compiler/ paradigm to implement the kernel:
The paradigm is well suited for performing parallel computations on blocks of data (i.e.
running many similar tasks on different blocks of data). FFT is not a perfect match here since
the number of operations performed on one block of data is limited to some very basic
twiddle computations. The granularity of the task seems to be a big problem here, since we
are additionally limited by the very small Local Store of each SPU (256KB). To achieve good
performance some additional Cell programming techniques like SIMD computations or DMA
transfers have to be used. Some of them are partially addressed by the CSs language itself.

We also don't know here if they were still limited by PPE throughput as discussed in the other paper for other tests, that could perhaps have been overcome.

Also, note the original PPU performance and the final performance using the SPUs:

The final version works with CSs for problems larger than 2^M, where M>15. The initial
performance on PPU was very poor: 33,5 MFlops (2^20 problem). The final performance is
2656 MFlops (2^20 problem).

It is pretty clear that no amount of PPUs on the same die is going to be remotely close to the performance of SPUs in this area. In fact, didn't bkilian mention that because of this quite some impressive work was done though on the 360's GPU to get AVC decoding to work for HD DVD? If so, that was some early and groundbreaking GPGPU programming going on there, before that came into the general picture. Wasn't the GPU of the 360 later also employed in Kinect input analysis?

Carl B · Aug 8, 2012

Gubbi said:
mod2am is dense linear algebra, - linpack stuff. Data access patterns are regular and computational density is about as high as it gets. CELL reaches 79% of peak FP performance, Core i7 reaches 94%. CELL wins because of higher peak FP performance.

Absolutely, but though the i7 does reach greater efficiencies, in this instance to have 6-times the die size, greater wattage, and ten times the transistor count... and be a couple of years younger, the crown has to go to Cell as an architecture on the task. And it essentially did anyway in absolute terms.

None of that is to say anything other than there are tasks where Cell demonstrably excels even vs more modern architectures, GPU or otherwise (circa 2009). The fact that those tasks are a narrower subset of problems one would actually run definitely isn't in dispute.

Andrew Lauritzen · Aug 8, 2012

sebbbi said:
However if you know your own access pattern (for example have generated an array to data pointers in previous algorithm step for data fetching) and are going to manually prefetch anyway, a fully manual local store is often (slightly) better than an automated cache, because you have full control over it.

Right, but the number of interesting irregular cases where you have a statically predictable memory access pattern I can count on one hand. If you're talking about cases where you can prefetch and hide the latency then you're still better off with a hardware managed cache because the latencies will typically be lower (although they are of course doable with some pain on scratch pads). And yes manual prefetching is unworkable, but compilers in DSLs can do it pretty well. For instance on Larrabee it was quite straightforward to transform reads/writes in shaders to prefetch, fiber switch, load/store-op which mimics effectively what GPUs do with their stall-on-use RF model.

I've certainly heard a lot of the "OMG caches may not do what I want" fear mongering and definitely you can construct failure cases if you know the cache configuration, but that's hardly interesting. In practice even on architectures where you have cache pinning and locking I've found vanishingly few cases that it actually turns out to be an issue in real code. If you write good code (i.e. good access patterns) and let them do their thing they mostly do the right thing. In fact in a lot of cases if you start screwing around with them that's when you start to hit the failure cases.

sebbbi said:
But for simple 3d/4d linear algebra code for single objects (process/access object positions and rotations in game world) 8d vectors are hard to use, since you cannot just simply use your vector classes for math (simple linear algebra has almost 1:1 mapping to 4d vector intrinsics). To utilize all the 8 lanes you basically have to "VLIW" two operations together manually to fill the vectors. This isn't often feasible.

I submit that the horizontal style has always been the wrong way to use SIMD. I know pretty much all games like to do it because then you can just tuck a bit of SSE/whatever away in your math library and think you're optimizing, but you get much poorer speedups than if you do it "GPU-style" instead. i.e. you need to start vectorizing the outer loops (that operate over many objects/matrices/vectors), not the inner ones. I'll also take the opportunity again to point out the awesomeness of ISPC (http://ispc.github.com) here, since you can express pretty much arbitrary vectorized code in a sort of GPU-style SPMD model that will then compile for all sorts of architectures (even upcoming ones) for you and spit out an object file to link in.

sebbbi said:
This isn't exactly a bad thing, since SoA processing tends to be more efficient. But it's not as easy to use, and not suitable for all kinds of algorithms.

Sure but that's the same argument that you just made about what Cell has forced people to do... it's the "right" way in the long run, so while you can get away with the simple way for a while ultimately if you want more performance you'll have to transform your code.

sebbbi said:
I think that (almost) everyone in this thread agrees that Cell isn't a good general purpose CPU. It isn't good for writing reusable scalable code (automatic memory hierarchy and parallel scaling). However you can get good performance out of it in a closed box (console), if you spend considerable time to adjust your algorithms (and data access patterns) to be suitable for it.

Totally agreed, although I don't think everyone agrees on the first point (but whatever). My only argument here has been that while I was initially pretty interested in Cell and wrote a pile of code on cell blades and PS3 retrospectively the lack of caches wipes out too large a chunk of efficient algorithms for it to be truly interesting as a CPU. And when you look at it as a throughput machine, it's not a particularly efficient one in terms of useful FLOPS compared to GPUs.

Anyways I have no problem with people enjoying tooling around with optimizing on esoteric architectures. Hell I still have a soft spot for Larrabee

But that has nothing to do with whether it or Cell is a good architecture for writing fast code and shipping products efficiently. The latter is what this thread is about I believe.

Andrew Lauritzen · Aug 8, 2012

sebbbi said:
In general for current generation game programming, you do not want to use traditional (one node -> pointer -> one node) tree structures, because these structures are really bad for CPU caches. You want to be using bucketized structures (that waste slight bit of arithmetic instructions) for much improved memory access efficiency.

Totally agreed, but generally that just means you want to make your "leaves" bigger in your tree. Sometimes your branching factor if you need to traverse it using SIMD in various ways. But you still want a tree (or hash table or something similar) at the top level of your acceleration structure or you're just pissing away performance (and power).

ERP · Aug 8, 2012

sebbbi said:
In general for current generation game programming, you do not want to use traditional (one node -> pointer -> one node) tree structures, because these structures are really bad for CPU caches. You want to be using bucketized structures (that waste slight bit of arithmetic instructions) for much improved memory access efficiency.

FWIW IMO you've never wanted to do that on a console EVER.
When we worked on N64 we used a parse grid for gross geometry culling, because it totally destroyed any sort of tree implementation in performance. Plus is has the nice side effect that it reduces gross culling to a rasterization problem. And that happens to be my favorite hammer.

Yes there were lots of people at the time still using BSP trees/KD trees and even Oct or Quad trees, my assumption is that these people had read about these techniques implemented them and never actually tested the performance, because I can't see how they were ever comparable from a performance standpoint.

sebbbi · Aug 8, 2012

Andrew Lauritzen said:
I submit that the horizontal style has always been the wrong way to use SIMD. I know pretty much all games like to do it because then you can just tuck a bit of SSE/whatever away in your math library and think you're optimizing, but you get much poorer speedups than if you do it "GPU-style" instead. i.e. you need to start vectorizing the outer loops (that operate over many objects/matrices/vectors), not the inner ones.

[...]

Sure but that's the same argument that you just made about what Cell has forced people to do... it's the "right" way in the long run, so while you can get away with the simple way for a while ultimately if you want more performance you'll have to transform your code.

Exactly

Horizontal (/ AoS / single object) way is just wrong (it's not "proper" SIMD). SoA ("GPU-style") vector processing yields much better utilization (all lanes are always used even for sqrt/cmp/rcp/etc and dependency chains are shorter allowing better vector pipeline utilization). But as you said many game developers are still using AoS style processing. IBM specially added AoS dot product for Xenon VMX128 vector units, and Intel also did that for SSE4. Even the hardware designs are supporting this inefficient (but commonly used) way of processing.

And that's one of the reasons why I like Cell... it forces developers to rethink their data processing approaches. Cell is not well suited for AoS style (single instance at a time) processing (that is often used for object oriented approaches that have virtual methods and irregular memory access patterns). You need to do things properly. If you have millions of lines of legacy code, you need all the push available to make good decisions

ERP said:
Yes there were lots of people at the time still using BSP trees/KD trees and even Oct or Quad trees, my assumption is that these people had read about these techniques implemented them and never actually tested the performance, because I can't see how they were ever comparable from a performance standpoint.

I have to admit that we still used octree (loose octree) in Trials Evolution for view culling and area queries. However our implementation was cache optimized (bucketed structure). Octree isn't that bad if you use it for coarse culling, and have fat leaf nodes (that contain large linear arrays of objects). But I agree that a simple two (or three) level grid would be better (would be easy to SoA SIMD process and straightforward to scale to N threads).

Gubbi · Aug 9, 2012

sebbbi said:
Dice published an whitepaper about Battlefield 3 viewport culling some time ago. In their previous technology they used bounding volume hierarchies (KD-trees and octrees are bounding volume hierarchies as well). In their new engine they switched to regular grids and "brute force" linear SoA vector batch processing. Culling performance improved by 3x.

Couple of comments:

You can think of a regular grid search as a radix search which again you can think of as a tree-search with very fat nodes (ie low depth). I'm not questioning that it can make sense, especially on current console hardware. I'm just saying that stream computing is not a magic bullet that can solve all parallel problems.

Generally, the larger the latency of your memory subsystem the fatter nodes you want, ie. Octrees are a fat version of a special case of three KD-tree nodes with alternating x,y,z decomposition. The main advantage of octrees is that you can avoid branches entirely when traversing an Octree and "just" have to contend with data-dependencies.

Looking at this . If that is really how they did view frustrum tests (which I doubt) then they gained a lot of performance by eliminating poorly predictable branches.

sebbbi said:
In general for current generation game programming, you do not want to use traditional (one node -> pointer -> one node) tree structures, because these structures are really bad for CPU caches. You want to be using bucketized structures (that waste slight bit of arithmetic instructions) for much improved memory access efficiency.

The top nodes of a tree usually have excellent temporal spatial coherene, because they are common to a large amount of queries, the bottom nodes abviously do not. A hybrid scheme with smallish top nodes (ie. regular KD/octree) and fat leaf nodes (radix/grid) would make sense. The balance depends on how latency prone/sensitive your CPU is and how much bandwidth you have, - by increasing the node size you're basically trading bandwidth for lower latency.

I fully accept brute force makes sense on current gen console hardware, but a four way (ie four parallel queries) octree traversel on modern OOOe CPUs is fast.

Cheers

Was Cell any good? *spawn

Arwin

Now Officially a Top 10 Poster

Nesh

Double Agent

Nesh

Double Agent

Shifty Geezer

uber-Troll!

jonabbey

Carl B

Friends call me xbd

Gubbi

Andrew Lauritzen

Moderator

Gubbi

sebbbi

Shifty Geezer

uber-Troll!

sebbbi

sebbbi

Arwin

Now Officially a Top 10 Poster

Carl B

Friends call me xbd

Andrew Lauritzen

Moderator

Andrew Lauritzen

Moderator

ERP

sebbbi

Gubbi

Similar threads