Analog and pad limitations aren't unique to Rambus interfaces; they're something all chips face.
edit: or will face it, if they use most of their perimeter prior to a shrink
Why wouldn't it?
GPGPUs are eating the SPEs' lunch anyhow. Cell v2.0 is Fermi/GCN.
If it's as flexible as Cell, I would say no. Are they as flexible, though? Plus, SPEs draw very little power. The 7 SPEs in the PS3 draw only about 35W, total, at 90nm. What would it be for 32 of them at 28nm?

I dunno if you're being sarcastic, but I've been trying to make this argument a lot around here. Would Cell really be necessary with a fat Fermi/Kepler/GCN GPU right beside it? Or even with an older design GPU like Cayman?
Flexible isn't a very useful word. Specifically, GPGPUs aren't as fast at running code with a lowish thread count, lots of jumps, and data that fits well in a small local pool.
On the other hand, they are much better at loads where you want to directly access a larger pool of memory.
Considering a next-gen Cell would be paired with a unified GPU, it would be freed up to take on such tasks, and physics is likely the biggest hog and one that can provide very impressive visual bang. Hearing that the 5-year-old Cell outcompetes the 4-core i7 is quite nice, and if true is further evidence that heterogeneous cores are a must-have in the console space (and, if true, makes one wonder about the performance difference between Cell and the Wii U's 3-core CPU, as surpassing the i7 core for core is doubtful).

My opinion is that GPGPU has proven to be a huge red herring. HardOCP has been posting videos of tech demos from the Infernal Engine all week. The developers narrating are very emphatic on the point that it is far preferable to have your GPU processing graphics. Their physics engine, which is very impressive, relies on heavily threaded, CPU-based processing. The results are pretty cool, but they still point out that the PS3 version is the one they like to demo because it can simulate the most objects. Even better than their PC build with 8 threads on an i7 platform. -Brad Grenz
Wonder how this affects physics calculations on GPGPUs vs Cell.

Cell has 8- and 16-bit SIMD instructions, for one. For another, GPUs use parallelism not merely to hide memory latency, they use it to hide pretty much all latency in the pipeline... because they usually work on massively parallel problems, and because memory latency is orders of magnitude higher than everything else, they just don't bother trying to keep latencies low for anything else. Their actual instruction latency is hideous compared to Cell. That means you are forced to have many more threads than on Cell. -Mfa
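As a rough, made-up-numbers illustration of that latency-hiding arithmetic (the latencies below are assumptions picked for the example, not measured figures for any real chip):

```c
/* Little's-law style estimate of how many independent threads are needed
 * per issue slot just to hide *instruction* latency. The latency numbers
 * are illustrative assumptions, not vendor specifications. */
#include <stdio.h>

int main(void) {
    double gpu_instr_latency  = 20.0;  /* assumed dependent-ALU latency, cycles */
    double spe_instr_latency  = 6.0;   /* assumed SPE dependent-ALU latency     */
    double issue_per_thread   = 1.0;   /* instructions issued per thread/cycle  */

    /* To keep the pipeline full: concurrency >= latency * issue rate. */
    printf("GPU threads needed per issue slot: ~%.0f\n",
           gpu_instr_latency * issue_per_thread);
    printf("SPE independent ops needed:        ~%.0f\n",
           spe_instr_latency * issue_per_thread);
    return 0;
}
```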
If any such changes are possible, it would put paper performance on par with the 6 TFLOPS Cell cluster that can do realtime raytracing of large, complex scenes.
Was reading a thread here at Beyond3D titled something like "is there something Cell can still do better than a modern CPU/GPU". The answer seemed to be that one of the problems with architectures such as Larrabee was the cache: cache structures introduce complexity/power problems as you scale core counts, while Cell's local store can scale much better, according to some people in that thread.

I wonder if a newer SPE-like processor could be made wider, like 16-wide or something, with a larger local store to accommodate it. The SPE was designed with clock speed in mind, but we know that clock-speed-focused design is not the way to go, unless they made some sort of breakthrough in managing power.
Also, from another old discussion: was including a local store instead of a cache in the SPE a mistake?
The lack of coherence between the Local Stores is probably seen as a disadvantage, but once you start to scale Cell it'll turn out to be a big advantage.
Once you start adding in piles of cores, coherent caches will become a major source of latency and power consumption. -ADEX
That was one of the larger design decisions in creating the Cell: there is a limit to the amount of cache you can use before you hit diminishing returns. Whereas not only is the SDRAM predictable, it's infinitely scalable. -Terarrim
I also found this quote with regard to crowd AI somewhere:

If physics is not well suited to run on CELL then the CELL designers have failed, because physics was one of the applications they tried to address with the CELL design. I believe they know better than you and me; in fact, the CELL architecture seems well suited for physics calculations. -Nao
15,000 at 30fps, pretty impressive:

Each individual chicken has its own behavior model interacting with other birds. The simulator was demonstrated to provide realtime (30fps) performance with several thousand chickens. In fact, when the number of chickens was increased to a total of 15,000 birds, the Cell B.E. processor was still able to perform the simulation at interactive speed, but the graphics rendering was not able to keep pace, even on a state-of-the-art NVidia GPU, and started dropping 2 out of 3 frames, resulting in 10fps "sluggish" video output. -RapidMind chicken farm simulation
The memory wall: the processor frequency has now surpassed the speed of the DRAM, and the current workaround of using multi-level caching leads to increased memory latency...
The slow main memory access on traditional x86 architectures creates a data flow bottleneck causing processor idle times. This results in much lower sustained performance than the theoretical peak of the CPU. To combat the bottleneck, state of the art processors have significant cache (L1, L2, L3), typically several megabytes on the processor chip. This uses up space that would otherwise be available to allow more transistors (and more processing power, as well as more heat). This “wasted” cache memory area is one explanation for why Moore’s law no longer translates into equivalent performance increases.-link
It seems that many persons around here have the false assumption that physics is just raw vector maths (number crunching). It's not. Physics engines use a lot of complex acceleration structures to speed up their work and to keep the memory access patterns manageable. Traversing these structures often includes a lot of branches and semi random memory accesses. There are some forms of physics simulations (for example particle systems) that are straightforward to simulate in parallel, but also systems that are much harder (for example complex rigid body systems with lots of constraints between the bodies).
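To make the "acceleration structures, branches and semi-random access" point concrete, here is a minimal C sketch; the node layout and names are made up for illustration, not taken from any real engine:

```c
/* Sketch of why broad-phase physics queries are branchy and pointer-chasing
 * rather than pure number crunching. Node layout is illustrative only. */
#include <stddef.h>

typedef struct { float min[3], max[3]; } AABB;

typedef struct BVHNode {
    AABB bounds;
    struct BVHNode *left, *right;   /* NULL for leaves          */
    int body_index;                 /* valid only at leaf nodes */
} BVHNode;

int overlaps(const AABB *a, const AABB *b) {
    for (int i = 0; i < 3; ++i)
        if (a->max[i] < b->min[i] || b->max[i] < a->min[i])
            return 0;
    return 1;
}

/* Collect bodies whose bounds overlap 'query'. Every step is a branch plus
 * a dependent load from a semi-random address -- exactly the pattern that
 * caches/local stores help with and wide SIMD alone does not. */
int query_bvh(const BVHNode *node, const AABB *query,
              int *hits, int max_hits, int count) {
    if (!node || count >= max_hits || !overlaps(&node->bounds, query))
        return count;
    if (!node->left && !node->right) {          /* leaf */
        hits[count++] = node->body_index;
        return count;
    }
    count = query_bvh(node->left,  query, hits, max_hits, count);
    count = query_bvh(node->right, query, hits, max_hits, count);
    return count;
}

/* By contrast, a particle integration step is pure streaming math and maps
 * trivially onto SIMD or GPU hardware: */
void integrate(float *px, float *vx, const float *ax, int n, float dt) {
    for (int i = 0; i < n; ++i) { vx[i] += ax[i] * dt; px[i] += vx[i] * dt; }
}
```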
Yes, this is true in general. However, the scaling depends very much on the memory access patterns. If each core has its own L2 cache and the majority of memory operations happen inside the core's own L2, the scaling is much better.

The memory wall is said by some to limit standard architectures to about 8 cores before performance starts to drop drastically, with 16 cores delivering 2-core performance at some tasks, and performance going off a cliff as one approaches 64 cores.
Exactly this. Compute is cheap, and raw gigaflop numbers are mostly meaningless. What matters is getting data where it needs to be.

IMO, the memory system is much more important.
The number of outstanding requests, bandwidth, and latency are the most important parameters. Look at how much die area is devoted to floating point math versus the memory system. A (SIMD) FPU is a fraction of a core, and a core is a fraction of a CPU die. At the same time you have load/store units the size of FPUs, on-die L1, L2 and L3 caches, and integrated memory controllers. The combined memory subsystem of a modern CPU can be more than 60% of the total die size.
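The same Little's-law arithmetic as in the earlier sketch applies to the memory system; a back-of-the-envelope version (the bandwidth and latency figures are assumptions picked for illustration):

```c
/* Little's law for memory: sustained bandwidth is limited by
 * (outstanding requests * request size) / latency.
 * All figures below are illustrative assumptions. */
#include <stdio.h>

int main(void) {
    double latency_ns = 100.0;   /* assumed DRAM round-trip latency       */
    double line_bytes = 64.0;    /* typical cache-line-sized request      */
    double peak_gbps  = 25.6;    /* assumed peak memory bandwidth (GB/s)  */

    /* Bytes that must be "in flight" to sustain peak bandwidth. */
    double bytes_in_flight = peak_gbps * latency_ns;     /* GB/s * ns = bytes */
    double requests_needed = bytes_in_flight / line_bytes;
    printf("Outstanding requests needed for peak: ~%.0f\n", requests_needed);

    /* With only, say, 8 outstanding misses per core: */
    double achievable = 8.0 * line_bytes / latency_ns;   /* bytes/ns = GB/s */
    printf("Bandwidth with 8 outstanding misses:  ~%.1f GB/s\n", achievable);
    return 0;
}
```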
But *only* for operations where the granularity of memory operations is large. There are workloads where a 32bit*4 gather over the entire memory pool through a cache will absolutely crush the Cell DMA. Also, the burden of implementing all that is on the programmer.

Wrt. CELL: The SPUs in CELL are really dumb, but fast, small processors. What empowers the SPUs is the DMA engine of CELL. The semi-autonomous DMA engine can implement flexible aggregating operations, e.g. gather/scatter is trivial.
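A rough sketch of the double-buffered local-store streaming pattern being described; dma_get()/dma_wait() here are hypothetical wrappers (roughly what mfc_get plus a tag-status wait do on a real SPU), so treat this as an illustration of the technique rather than SDK code:

```c
/* Double-buffered streaming through a small local store, Cell SPU style.
 * dma_get()/dma_wait() are hypothetical stand-ins for the MFC intrinsics;
 * here they are implemented synchronously so the sketch is self-contained. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4096                              /* bytes per DMA transfer */

static void dma_get(void *local, uint64_t ea, size_t bytes, int tag) {
    /* On a real SPU this would be roughly mfc_get(local, ea, bytes, tag, 0, 0);
     * here it is a synchronous memcpy from a host address used as "ea". */
    memcpy(local, (const void *)(uintptr_t)ea, bytes);
    (void)tag;
}
static void dma_wait(int tag) { (void)tag; /* MFC tag-status wait on real HW */ }

float process_stream(uint64_t ea, size_t total_bytes) {
    static float buf[2][CHUNK / sizeof(float)]; /* two local-store buffers */
    size_t nchunks = total_bytes / CHUNK;
    float sum = 0.0f;

    if (nchunks == 0) return 0.0f;
    dma_get(buf[0], ea, CHUNK, 0);              /* prefetch first chunk */

    for (size_t i = 0; i < nchunks; ++i) {
        int cur = (int)(i & 1), nxt = cur ^ 1;
        if (i + 1 < nchunks)                    /* overlap the next transfer */
            dma_get(buf[nxt], ea + (i + 1) * CHUNK, CHUNK, nxt);

        dma_wait(cur);                          /* wait for the current chunk */
        for (size_t j = 0; j < CHUNK / sizeof(float); ++j)
            sum += buf[cur][j];                 /* compute while next DMA runs */
    }
    return sum;
}

/* usage (host simulation): process_stream((uint64_t)(uintptr_t)big_array,
 *                                         n * sizeof(float));              */
```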
It depends. What I'd like to see is clothing physics on all characters, hair physics, deformable terrain, environments composed of destructible objects, muscle deformation physics, fluid physics, weather simulations, volumetric clouds affected by wind physics and maybe cloud formation physics, etc.: not just token use of physics here and there, but ubiquitous use of physics everywhere, affecting gameplay and looks. This could very well take up most of the performance provided, not just a few percent.

Of course the automated cache logic (and coherency logic) costs die space (and causes extra heat and manufacturing costs). The big question is, how much? And how many more execution units (and other performance-boosting features) could we have if we didn't have automated caches? Many algorithms require fast data caching, and it's very hard to beat (fixed-function) hardware cache logic with software cache implementations. Also, it's harder to implement general-purpose (not performance-critical) code without any automated cache logic. So the question also becomes: is the required extra software development cost reasonable just to get a few percent performance boost by making the hardware simpler?
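To give a feel for what a software cache costs per access compared to hardware cache logic, here is a minimal direct-mapped software cache in C (sizes, layout and the memcpy refill are illustrative assumptions; on an SPE the refill would be a DMA):

```c
/* Minimal direct-mapped software cache, showing the per-access overhead
 * (index/tag math, compare, branch, possible refill) that hardware cache
 * logic performs for free. Sizes and layout are illustrative assumptions. */
#include <string.h>
#include <stdint.h>

#define LINE_WORDS 32                       /* 128-byte lines              */
#define NUM_LINES  64                       /* ~8 KB of "local store" data */

static float    lines[NUM_LINES][LINE_WORDS];
static uint32_t tags[NUM_LINES];            /* which line index is cached  */
static int      valid[NUM_LINES];

/* Read element i of a big array living in "main memory" through the
 * software cache. Every access pays for the math and the branch below. */
float sw_cache_read(const float *main_mem, uint32_t i) {
    uint32_t line   = i / LINE_WORDS;       /* which 128-byte line    */
    uint32_t offset = i % LINE_WORDS;
    uint32_t slot   = line % NUM_LINES;     /* direct-mapped slot     */

    if (!valid[slot] || tags[slot] != line) {        /* miss: refill   */
        memcpy(lines[slot], main_mem + line * LINE_WORDS,
               LINE_WORDS * sizeof(float));          /* DMA on real SPE */
        tags[slot]  = line;
        valid[slot] = 1;
    }
    return lines[slot][offset];             /* even the hit path costs ~10 ops */
}
```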
Consoles are supposed to last about a decade, and middleware developers can handle software used by multiple companies. Even those with internal engines can reuse and optimize what they learn. A next-gen Cell can also leverage what was learned in this generation.

Manually managed pools (a la Cell) are easier to build, and give you better performance numbers for the same amount of transistors spent, but have a huge cost in programmer productivity.
I've heard the caches are getting ever bigger in an attempt to deal with the issue, but as the number of cores goes up the approach breaks down.

Yes, this is true in general. However, the scaling depends very much on the memory access patterns. If each core has its own L2 cache and the majority of memory operations happen inside the core's own L2, the scaling is much better.
If cache sizes have had to balloon exponentially to keep up with just a few cores, I'm not entirely sure that putting 30+ cores with smallish caches won't result in subpar performance, as expected from the memory-wall issues.

In this respect, the "memory wall" is a classic producer/consumer problem, and it's the reason that on-die cache sizes have ballooned in recent years. As the memory wall gets higher and higher, it takes more and more cache to get you over it. -arstechnica
It would be an interesting exercise to track cache size per core in deployed HPC systems, since larger caches have been the biggest defense against the memory wall. Cache has been growing exponentially to try and keep up with the multiplying cores. -hpcwire