As I understand it, as shown in the Suzuoki PDF on Deano's site, the reason why LS was chosen over cache is because the access patterns observed for modern "hot-spots" (to borrow Suzuoki's terminology) go something like this:
Directly taken from slide 15:
Instruction is small and reused
- Loop Intensive
- Good news for cache system
Data is large and not to be revisited again
- Sometimes larger than L2 cache
- Bad news for cache system
I'm a computer architecture guy (actually, more of a semiconductor process guy), but I can easily imagine that most gaming code requires a small number of instructions to handle your-favorite-example-algorithm (collision, procedural synthesis, whatever), and vast amounts of data that constitute the objects in the scene/game world.
In that case, it would make sense to manually load the instructions (which are heavily accessed, and small), and then go to main mem for the data, doing away with the cache altogether, since the cache would be worthless for large, one-use data (for example, examining every object in the world once per frame).
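To make that concrete, here is a minimal sketch of that streaming style, assuming the Cell SDK's spu_mfcio.h DMA intrinsics; process_world, process_chunk, and the chunk size are my own illustrative names and numbers, not anything from the Suzuoki PDF:

#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384  /* 16KB of object data per DMA transfer (multiple of 16 bytes) */

static char buf[CHUNK] __attribute__((aligned(128)));

extern void process_chunk(char *p, uint32_t n);  /* hypothetical per-object work */

/* Stream a large array of world objects through the small LS:
   pull a chunk in, touch it once, move on. No tag compares, no
   miss logic - the program itself knows what is resident. */
void process_world(uint64_t ea, uint32_t total_bytes)
{
    const uint32_t tag = 0;
    uint32_t off;
    for (off = 0; off < total_bytes; off += CHUNK) {
        uint32_t n = (total_bytes - off < CHUNK) ? (total_bytes - off) : CHUNK;
        mfc_get(buf, ea + off, n, tag, 0, 0);  /* DMA: main mem -> LS */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();             /* block until the DMA completes */
        process_chunk(buf, n);
    }
}

Real SPE code would double-buffer - kick off the next mfc_get while processing the current chunk - which hides the memory latency entirely; a cache can't be told ahead of time to do that.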
Several reasons for this approach:
- As mentioned before, cache is more expensive (tags, logic, the works). A 256KB cache will be bigger in die size than a 256KB LS.
- Latency. This hurts in two ways. Not only is cache slower than an equivalent-sized local store when it hits (because of logic overhead due to n-way associativity), but it is also slower when it misses - the mem request goes to the cache, the cache misses, then it goes to main mem. The SPE, with its programmer-controlled LS, simply knows what is in LS and what is not, and just reads main mem directly. Now, the 10-cycle lag of the cache is insignificant next to the 100-1000 cycle lag of memory, so this is not such a big deal (see the back-of-the-envelope numbers after this list). But the point remains.
And let's not forget the I/O controller! That was the best-case scenario, with a perfect I/O controller that makes sure waiting read/write requests from the cache are executed in an optimal way that prevents additional waiting. This is usually not true, so you get additional latency from keeping the cache and memory coherent.
- When it comes to cache, bigger != better. The Pentium 4 is a case in point. There was a big hoopla about the Pentium 4's L2 cache going from 1MB to 2MB. The problem is, the 2MB P4 is actually slower in most scenarios: going from 1MB to 2MB meant a few more cache hits, but each hit was slower because the cache was bigger, so overall performance was actually lower. The point of a cache is to be fast, and to be fast, it must be small. As mentioned before, the A64 does quite well with a 512KB L2.
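To put rough numbers on the latency point above, here is a quick AMAT (average memory access time = hit time + miss rate x miss penalty) comparison; all cycle counts are illustrative assumptions on my part, not measured CELL figures:

#include <stdio.h>

/* Back-of-the-envelope AMAT comparison. All cycle counts below
   are illustrative assumptions. */
int main(void)
{
    double hit_time     = 10.0;    /* assumed cache hit latency, cycles   */
    double miss_penalty = 500.0;   /* assumed main memory latency, cycles */
    double ls_latency   = 6.0;     /* assumed LS load latency, cycles     */

    /* Streaming over large, one-use data: nearly every access misses. */
    double miss_rate  = 0.95;
    double amat_cache = hit_time + miss_rate * miss_penalty;

    /* The LS program DMAs the data in up front, then hits LS every
       time (DMA cost amortized/overlapped, so it's left out here). */
    double amat_ls = ls_latency;

    printf("cache AMAT: %.0f cycles\n", amat_cache);  /* ~485 */
    printf("LS    AMAT: %.0f cycles\n", amat_ls);     /* 6    */
    return 0;
}

Flip miss_rate down to 0.02 for a loop-intensive workload and the cache comes out around 20 cycles - the "good news for cache system" half of the slide.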
In summary, a cache is only superior to a LS when it can ensure significantly more hits than a LS - which, thankfully, is most of the time. But for some scenarios (like the one in the Suzuoki PDF), the LS is superior. It is questionable whether the scenarios used to design CELL are characteristic of computer game loads, but I for one don't doubt that this model has at least some merit.