CELL V2.0 (out-of-order vs in-order PPE)

Poll: in-order (IOE) vs. out-of-order (OOOE) PPE cores for CELL V2.0

  • Unleash the hounds, Smithers: go for in-order execution and clock speed.
    Votes: 9 (25.0%)
  • Appease pesky multiplatform developers and implement out-of-order execution.
    Votes: 27 (75.0%)

  Total voters: 36
If cache sizes have had to balloon to keep up with just a few cores, I'm not entirely sure that putting 30+ cores with smallish caches won't result in subpar performance, given the usual memory wall issues.
If you connect 30 SPEs to a single main memory interface, you will hit the memory wall as well. It doesn't matter what kind of cores you are using if you have a single shared main memory that is used by all those cores (to read inputs and to write results). Local work memories (or caches) are useful for storing temporary structures, but you still have to move data to/from main memory at some point.

SPEs can move data between each other without passing it through main memory, but so can general purpose CPUs (through the L3 cache). Of course the interconnect network between the cores will become the bottleneck if you scale the core count too far (more cores require a more complex interconnect network, as you cannot just create a direct link from each core to every other core). A fully distributed memory system is the only way, if we want to keep scaling up the core count in the future (to thousands of cores).
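To put a rough number on the shared-interface problem: by Little's law, keeping a memory interface busy requires roughly bandwidth x latency bytes in flight at any moment. A minimal sketch of that arithmetic, where every figure is an illustrative assumption rather than something from the posts above:

```c
/* Back-of-envelope bytes-in-flight estimate (Little's law).
 * All numbers here are illustrative assumptions, not measured figures. */
#include <stdio.h>

int main(void)
{
    double bw   = 25.6e9;   /* shared memory interface bandwidth, bytes/s  */
    double lat  = 200e-9;   /* assumed ~200 ns round-trip latency to DRAM  */
    double line = 128.0;    /* one cache line / small DMA element, bytes   */

    double in_flight = bw * lat;   /* bytes that must be outstanding       */
    printf("to saturate %.1f GB/s at %.0f ns latency: %.0f bytes in flight "
           "(~%.0f concurrent %g-byte accesses)\n",
           bw / 1e9, lat * 1e9, in_flight, in_flight / line, line);
    return 0;
}
```

With those assumed numbers you need on the order of 40 outstanding accesses just to saturate one interface, which is far more concurrency than a handful of outstanding cache misses provides.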
 
For the time being we only seem to need to deal with tens of cores on a chip for CPUs. Future XDR technology provides substantially improved bandwidth. The IBM guys intended to use a crossbar approach, and I would assume there is now more transistor budget to spend on it, so how does a crossbar compare in complexity and scalability?

The SPEs can work without large L2/L3 caches, and a 32-core system could likely be fed without problems by the hundreds of GB/s that future XDR offers. Will the cache-based approach work with little or no L2 and L3? If not, you're committing the designers to large, hot caches that will likely take more die area than the processing elements themselves.

The Cell was originally envisioned to be fed adequately by the 25.6 GB/s that XDR provides, for the original 1 PPE + 8 SPE configuration. One would presume 4x the number of units could be fed by 4x the bandwidth, so ~100 GB/s seems reasonable.
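A quick sanity check of that scaling argument, using only the figures quoted above (the 32-SPE configuration is the hypothetical one under discussion, not an announced part):

```c
/* Bandwidth-per-SPE check for the original Cell and a hypothetical 4x
 * scale-up, using only the 25.6 GB/s and 1 PPE + 8 SPE figures above. */
#include <stdio.h>

int main(void)
{
    const double bw_orig   = 25.6;  /* GB/s, original XDR interface */
    const int    spes_orig = 8;
    const int    scale     = 4;     /* 4x the units, 4x the bandwidth */

    printf("original: %d SPEs share %.1f GB/s -> %.1f GB/s each\n",
           spes_orig, bw_orig, bw_orig / spes_orig);
    printf("scaled:   %d SPEs share %.1f GB/s -> %.1f GB/s each\n",
           spes_orig * scale, bw_orig * scale,
           (bw_orig * scale) / (spes_orig * scale));
    return 0;
}
```

Per-SPE bandwidth stays flat only if the interconnect and memory controllers actually scale with the core count, which is exactly the crossbar question.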
 
Crossbar is the theoretically best approach -- ignoring costs, you always want a crossbar. But it is the most expensive kind of interconnect to scale up. Crossbars work fine with a low port count; when you scale up, you want to move to a ring, mesh, or something like that.
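A rough way to see the scaling difference is to count crosspoints/links and worst-case hop counts for the three topologies. The formulas below are the usual idealized ones (the mesh figures round n up to the nearest square grid), ignoring real-world routing details:

```c
/* Crude topology comparison: crosspoint/link count and worst-case hop
 * count (diameter) for a crossbar, a ring, and a square 2D mesh.
 * Idealized formulas for illustration only. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    for (int n = 8; n <= 64; n *= 2) {
        int side = (int)ceil(sqrt((double)n));   /* next square grid that fits n */
        printf("n=%2d: crossbar %4d crosspoints | ring %2d links, diameter %2d"
               " | mesh %3d links, diameter %2d\n",
               n,
               n * n,                   /* crossbar grows quadratically        */
               n, n / 2,                /* ring: cheap links, long worst path  */
               2 * side * (side - 1),   /* mesh: roughly 2n links              */
               2 * (side - 1));         /* mesh diameter ~ 2*sqrt(n)           */
    }
    return 0;
}
```

The crossbar's crosspoint count grows quadratically with port count, while a ring or mesh grows roughly linearly, which is the cost/scalability trade-off described above.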
 
Interesting; I've heard that simple interconnects can scale up to 64 cores. Would a mesh become necessary at 32 cores, or would a ring still be viable?

With regard to performance and bandwidth, the following sounds promising:
Most (90%) of a stream processor's work is done on-chip, requiring only 1% of the global data to be stored to memory. -wiki
This three-level organization of storage (register file, local store, main storage) -- with asynchronous DMA transfers between local store and main storage -- is a radical break with conventional architecture and programming models because it explicitly parallelizes computation and the transfers of data and instructions.

The reason for this radical change is that memory latency, measured in processor cycles, has gone up several hundredfold in the last 20 years. The result is that application performance is often limited by memory latency rather than peak compute capability or peak bandwidth. When a sequential program on a conventional architecture performs a load instruction that misses in the caches, program execution now comes to a halt for several hundred cycles. Compared with this penalty, the few cycles it takes to set up a DMA transfer for an SPE is quite small. Even with deep and costly speculation, conventional processors manage to get at best a handful of independent memory accesses in flight. The result can be compared to a bucket brigade in which a hundred people are required to cover the distance to the water needed to put the fire out, but only a few buckets are available.

The most productive SPE memory-access model appears to be the one in which a list (such as a scatter-gather list) of DMA transfers is constructed in an SPE's local store so that the SPE's DMA controller can process the list asynchronously while the SPE operates on previously transferred data. In several cases, this new approach to accessing memory has led to application performance exceeding that of conventional processors by almost two orders of magnitude, significantly more than anyone would expect from the peak performance ratio (about 10x) between the Cell Broadband Engine and conventional PC processors.
- CELL Broadband Architecture From 20,000 Feet, Dr. P. Hofstee, Architect, IBM
The question is, is physics (or any subset of it) in this near two-order-of-magnitude category? If it is, that would put a hypothetical next-gen Cell several times above the performance of a state-of-the-art, high-end 2011 GPU in this area.
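The access model Hofstee describes (queue up DMA transfers while computing on previously fetched data) looks roughly like this on an SPE. This is a minimal double-buffering sketch assuming the Cell SDK's spu_mfcio.h intrinsics; process(), the chunk size, and the buffer layout are illustrative placeholders, and a real kernel would more likely build DMA lists than issue single mfc_get transfers:

```c
/* Minimal double-buffered SPE streaming sketch (illustrative only).
 * Assumes the Cell SDK's MFC intrinsics from spu_mfcio.h. */
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384                               /* 16 KiB per DMA transfer */

static volatile uint8_t buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(volatile uint8_t *data, unsigned n);   /* hypothetical kernel */

void stream_in(uint64_t ea, unsigned nchunks)     /* ea = effective address in main memory */
{
    unsigned cur = 0;

    /* Kick off the first transfer, tagged with the buffer index. */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (unsigned i = 0; i < nchunks; i++) {
        unsigned next = cur ^ 1;

        /* Start fetching chunk i+1 into the other buffer while we compute. */
        if (i + 1 < nchunks)
            mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);

        /* Block only on the transfer we are about to consume. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process(buf[cur], CHUNK);                 /* compute overlaps the in-flight DMA */
        cur = next;
    }
}
```

The point of the pattern is that the SPE never stalls on memory as long as the compute time per chunk covers the DMA time, which is what the bucket-brigade analogy in the quote is getting at.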
 