First, the not-insanely-technical-and-long part...
wolf2 said:
Assuming TSMC's process is of similar size (but optimized for low-power), why can't a 5 or 8MB chunk of eDRAM be embedded in a notebook chip for use as the VRAM? It seems the power savings in this topology would be fairly dramatic as you would eliminate close to 250MB/sec of bandwidth fetch from the shared memory for servicing the display.
Based on the cell sizes given by TSMC and IBM, I would indeed assume that the density of their EDRAM is very similar - I already noted this in the news post.
As for using EDRAM to save frontbuffer bandwidth on laptops, this is an interesting approach, but it seems a bit expensive to me for a rather small benefit. Remember that 5-8MB of EDRAM is still quite expensive today. I do not have the insider knowledge to judge this properly, but I would tend to believe that 16MB of stacked low-power DRAM would be a better design choice for that purpose. This even allows you to have a full 2560x1600 frontbuffer there. Sadly, as I said, I cannot estimate the exact cost and power implications of this approach without insider data.
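To put rough numbers on the figures above - this is just back-of-the-envelope arithmetic with assumed panel resolutions, 4 bytes per pixel and a 60Hz refresh, and the helper names are purely illustrative:

```python
# Front-buffer size and scanout bandwidth, under the assumptions stated above.

def frontbuffer_bytes(width, height, bytes_per_pixel=4):
    """Memory needed to hold one uncompressed front buffer."""
    return width * height * bytes_per_pixel

def refresh_bandwidth(width, height, bytes_per_pixel=4, refresh_hz=60):
    """Bandwidth spent scanning the front buffer out to the display."""
    return frontbuffer_bytes(width, height, bytes_per_pixel) * refresh_hz

# A typical laptop panel (assumed 1280x800): ~245 MB/s, i.e. roughly the
# "close to 250MB/sec" figure quoted above.
print(refresh_bandwidth(1280, 800) / 1e6, "MB/s")

# A 2560x1600 front buffer needs ~15.6 MiB, which is why 16MB of stacked
# low-power DRAM is enough to hold it entirely.
print(frontbuffer_bytes(2560, 1600) / 2**20, "MiB")
```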
And now, here comes the long part of the post! For those who don't have the time or desire to read it, the basic idea is to store the compressed version of the framebuffer in EDRAM. Because the compressed data occupies a subset of the memory areas that the uncompressed or less-compressed versions would use, keeping that subset in EDRAM saves substantial bandwidth even in the worst-case scenario.
---
My idea for this is fairly simple. Consider how framebuffer compression likely works given memory burst lengths, and that you likely have intermediate compression stages between "fully compressed" and "fully uncompressed".
Consider a GPU where Z has X compression levels and colour has Y compression levels - so X-1 and Y-1 levels if you exclude "uncompressed". Now, consider how that works on current GPUs due to memory burst lengths: the most aggressive compression level would likely only require writing and/or reading one burst of data, while the uncompressed level would require writing and/or reading several bursts of data.
However, the memory area used for the uncompressed data includes the area that would be used for the various levels of compressed data. So, given this, let us consider a maximum compression ratio of 4:1 and what happens if only one memory area out of four is exclusively present in EDRAM, while the three others are exclusively present in VRAM. In the worst case, you only save 25% of the bandwidth; in the best case, if all the data in the framebuffer is fully compressed (very unlikely except for test scenes with only one big triangle!), no read/write access to VRAM is required at all. The average savings should be very high for a reasonable amount of EDRAM.
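Here is a quick sketch of that arithmetic; the per-block layout (four burst-sized areas, one of them resident in EDRAM) and the level-to-burst mapping are assumptions chosen to match the 4:1 example, not a description of any real GPU:

```python
# Each pixel block is stored in up to four burst-sized areas; one lives in
# EDRAM, the other three in VRAM. The compression level of a block determines
# how many areas are touched (1 burst when fully compressed, 4 when
# uncompressed). All numbers here are illustrative.

EDRAM_AREAS = 1   # burst-sized areas resident in EDRAM per block

def vram_bursts(bursts_touched):
    """Bursts that still have to go out to VRAM for one block access."""
    return max(0, bursts_touched - EDRAM_AREAS)

def bandwidth_saving(block_burst_counts):
    """Fraction of framebuffer traffic kept on-chip, for a list of blocks."""
    total = sum(block_burst_counts)
    to_vram = sum(vram_bursts(b) for b in block_burst_counts)
    return 1.0 - to_vram / total

# Worst case: every block uncompressed -> only 25% of the traffic stays on-chip.
print(bandwidth_saving([4] * 100))                        # 0.25
# Best case: everything compresses 4:1 -> no VRAM traffic at all.
print(bandwidth_saving([1] * 100))                        # 1.0
# A more plausible mix of compression levels lands in between.
print(bandwidth_saving([1] * 60 + [2] * 25 + [4] * 15))   # ~0.59
```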
So, let us also consider an extension of this scheme to support arbitrary resolutions and higher utilization rates of the EDRAM. Instead of reserving the same number of burst-sized memory areas for every block of pixels, a variable number of areas could be reserved. The driver would inform the GPU that burst-sized memory areas 1 and 2 are always in EDRAM for every block, while area 3 is in EDRAM for one block out of N, where N is either an integer or a floating-point number. Areas 4, 5 and 6 would always be in VRAM. This could also be tuned separately for Z, Colour and Stencil if need be. In the extreme case, even the most aggressive level of compression is not guaranteed to be in EDRAM.
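A minimal sketch of what such a per-block reservation rule could look like, using the same area numbering as above; the function name, the defaults and the way the "one block out of N" placement is spread across the framebuffer are my own assumptions:

```python
def area_in_edram(block_index, area_index, one_block_out_of=2.5):
    """Decide where one burst-sized area of one pixel block lives."""
    if area_index <= 2:
        return True                    # areas 1 and 2: always in EDRAM
    if area_index == 3:
        # Area 3 is in EDRAM for roughly one block out of N (N may be
        # fractional), spread evenly across the block indices.
        return (block_index % one_block_out_of) < 1
    return False                       # areas 4, 5, 6, ...: always in VRAM

# e.g. with N = 2.5, two blocks out of every five get area 3 in EDRAM:
print([area_in_edram(b, 3) for b in range(5)])   # [True, False, False, True, False]
```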
From my point of view, and as I have already said in the past, I believe TBDR rendering to also be usable as an efficient form of memory footprint compression. An interesting hybrid implementation I can think of is a natural evolution of Zhu's (now at NVIDIA; previously CTO of GigaPixel) patent on dynamic allocation of memory blocks for TBDR rendering.
The biggest problem with TBDR rendering for memory footprint compression is that in certain areas, for example those with pixel-sized triangles, the footprint is going to be higher than that of even a naive IMR. The solution to this, assuming you are willing to dedicate enough hardware to the problem, is to allow a tile of pixels to revert to IMR rendering if the footprint required would exceed that of the IMR implementation. If the API exposes order-independent transparency, then tiles which need that information could be forced to never switch to IMR mode, and the allocator would then spill to VRAM, system RAM, etc.
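To make the fallback a bit more concrete, here is a very rough sketch of that per-tile decision; the tile size, the class layout and the spill hook are all invented for illustration, and this is purely my speculation rather than a description of the GigaPixel patent:

```python
# Per-tile bookkeeping: a tile starts in TBDR mode with dynamically allocated
# storage, but reverts to IMR mode if its footprint would exceed what a plain
# IMR needs for the same tile. Tiles that must keep their per-sample data
# (e.g. for order-independent transparency) stay in TBDR mode and spill
# off-chip instead.

IMR_TILE_FOOTPRINT = 16 * 16 * 4      # assumed IMR colour footprint of a 16x16 tile

class Tile:
    def __init__(self, needs_oit=False):
        self.needs_oit = needs_oit    # order-independent transparency requested
        self.mode = "TBDR"
        self.allocated_bytes = 0

    def allocate(self, nbytes, spill_to_vram):
        """Account for more per-tile data; maybe revert to IMR or spill."""
        self.allocated_bytes += nbytes
        if self.allocated_bytes <= IMR_TILE_FOOTPRINT:
            return                        # still no worse than a naive IMR
        if self.needs_oit:
            spill_to_vram(self, nbytes)   # cannot drop the data: spill off-chip
        else:
            self.mode = "IMR"             # footprint exceeded: revert this tile
            self.allocated_bytes = IMR_TILE_FOOTPRINT
```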
It is difficult to be confident of how memory footprint would be affected by such a hybrid implementation without the kind of data that only those working at Imagination Technologies (or at NVIDIA, arguably, given they still have GigaPixel employees!) have access to. I would tend to believe the results could be extremely interesting, however.
Anyway, I am getting carried away. Even without use of TBDR-like technology, I hope this post clearly shows how usable EDRAM is for PC solutions. A number of optimizations could be done to further minimize the logic costs of this implementation, such as deciding whether something is in EDRAM based on the upper bits of the memory address (although this would waste some VRAM!). I would tend to believe this should be cheap enough to implement as it is, however, at least relative to the size of the EDRAM macro.
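For that address-based variant, a minimal sketch of what the check could look like, assuming the EDRAM simply shadows the top of the VRAM address range (which is exactly where the wasted VRAM would come from); the sizes and the placement are my own assumptions:

```python
# The memory controller looks only at the upper bits of an address to decide
# whether an access goes to the EDRAM macro or to external VRAM.

VRAM_SIZE  = 256 * 2**20               # 256 MiB of external VRAM (assumed)
EDRAM_SIZE = 8 * 2**20                 # 8 MiB EDRAM macro (assumed)
EDRAM_BASE = VRAM_SIZE - EDRAM_SIZE    # EDRAM shadows the top 8 MiB of VRAM

def is_in_edram(address):
    """Single compare on the upper address bits: EDRAM or VRAM?"""
    return (address >> 23) == (EDRAM_BASE >> 23)   # 2**23 bytes = 8 MiB granularity
```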
Other aspects of the GPU could obviously benefit, to a lesser extent, from using EDRAM or similar techniques instead of SRAM. For example, on an IGP or a handheld GPU, you could significantly increase the size of the L2 texture cache without increasing costs, if you already need EDRAM somewhere else in the design. I would tend to believe a much larger texture cache wouldn't do miracles, but every bit of bandwidth you can save counts in these market segments. As long, of course, as your costs remain reasonable.
EDIT: Please note that this is purely speculation, and not based on any insider information.