Now, if you have some free time soon, I think a lot of us would be curious what you and other developers would think of, say, a 50MB chunk of embedded memory on-die with full read/write access, ~1TB/s of bandwidth, and low latency.
For graphics, it's obviously the frame buffer. For other things, latency determines its usability.
One of the things Cell is genuinely good at is physics, because not only do you have a lot of throughput, you can also build all kinds of interesting acceleration structures with the 6-cycle-latency local pool. GPUs aren't nearly as good at this, because their local pools are small, and because the latency of even the simplest operations kills all kinds of interesting acceleration structures -- they are really only good at brute force. Having access to a few megabytes of low-ish-latency (you can't get anywhere near Cell's 6 cycles, but something like a guaranteed <40 cycles and mostly running out of L1 would still be pretty awesome), massive-bandwidth memory would make it a physics monster, with the limits much higher than Cell's.
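To see why latency, not bandwidth, is what kills acceleration structures: each step of a BVH or spatial-hash walk depends on the previous load, so the chain serializes. A rough sketch with illustrative numbers (only Cell's 6 cycles and the <40-cycle hope come from the post; the rest are assumptions):

```python
# Pointer-chasing traversal cost: depth dependent loads in series,
# plus a little compute per node. Numbers are illustrative assumptions,
# except Cell's 6-cycle local store latency mentioned above.
def traversal_cycles(depth, load_latency, work_per_node=4):
    # Each node visit waits for its load before it can pick the next child.
    return depth * (load_latency + work_per_node)

for name, lat in [("Cell local store (6 cy)", 6),
                  ("hoped-for eDRAM (<40 cy)", 40),
                  ("GPU off-chip memory (~400 cy, assumed)", 400)]:
    print(name, "->", traversal_cycles(20, lat), "cycles for a 20-deep walk")
```

The 400-cycle case shows why GPUs fall back to brute force: the serialized walk costs thousands of cycles per query no matter how much raw throughput sits idle.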
As far as other uses are concerned, I can't really say. There are a lot of things that would benefit, but the thing is, the framebuffer and physics would benefit so much *more* that it would simply make sense to reserve it all for them.
I've wondered a lot about this rumour of the Nextbox having 6-8GB of DDR3/4. My biggest concern would be load times. I mean, loading 6-8GB worth of data from an optical drive, whether Blu-ray or DVD, into a relatively slow pool of main RAM would take forever.
How would you get around that?
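Back-of-envelope, with assumed sustained drive speeds (these figures are my assumptions, not from the thread):

```python
# Rough time to fill 8 GB of RAM from an optical drive.
# Assumed sustained rates: 12x DVD ~16 MB/s, 6x Blu-ray ~27 MB/s.
def load_time_minutes(gigabytes, mb_per_s):
    return gigabytes * 1024 / mb_per_s / 60

for name, speed in [("12x DVD", 16), ("6x Blu-ray", 27)]:
    print(f"{name}: ~{load_time_minutes(8, speed):.1f} min to stream 8 GB")
```

Even the optimistic Blu-ray case lands around five minutes for a full fill, which is why some smarter strategy than "load everything up front" is needed.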
One of the best ways to use RAM when disk access is slow is simply to cache more. The more you can cache, the more bandwidth you free up for real use, and the more fidelity you get out of your limited pipe.
Also, I got shouted down last time, but I still think that 1Gbps+ connections will be commonplace for a huge chunk of customers by the latter half of the generation. Perhaps people will be more receptive now that Google has started laying fibre in the USA? Anyone with a Gbps-class connection is better off using the RAM as the only local storage and getting game content over the wire from the closest CDN.
Forgot to respond. The reason I said cost was because of this.
<snip 4 times better than sram>
I don't know if there's anything to back that up though.
It's legit. But it's comparing against traditional SRAM -- and as I understand it, T-RAM is claiming a little less than a 5X gain over SRAM. SRAM is known to be fat and expensive.
However, you should remember the caveat that, as far as we know, T-RAM might not even exist. (Now, there's a lot of money behind it, so I expect it to pop up eventually, but genuinely new semiconductor tech has a habit of seeing decades of delays.)
I don't know if this would even be worthwhile, but would it be feasible for a console version of a GPU to replace the L2 cache with a memory tech that performs better?
This would be very hard, if not impossible. One thing to keep in mind is that as far as switching individual cells goes, SRAM is by far the fastest. Seriously, compared to things like eDRAM or 1T-SRAM it can be hundreds of times faster. However, the alternatives can advertise better speeds because SRAM cells are so fat that when you build a pool of any significant size, the access latency is completely dominated by the wire delay of getting the signal there. So even though 1T-SRAM takes a lot longer to switch, because it's on average half the distance away, it can win that time back.
But that only holds when you have tens of megabytes of the stuff. For caches in the low-megabyte range, the traditional approach is the fastest known one. (That's why it's used.)
This rate of data consumption would only occur if the SPs only ran one single instruction on each chunk of data,
Well, a single instruction per chunk of data would actually be ~8kB per clock: 2 FLOPs = 1 FMA = 3×4B of input plus 4B of output. 2B/FMA is the traditional practical value for normal shaders when doing direct rendering, and it assumes a lot of operand reuse. I don't have enough experience with it to say how it applies to indirect rendering.