ATI and Z-RAM

Blitzkrieg

Newcomer
Just wondering: ATI has experience with eDRAM, so could they now use Z-RAM, given that AMD has a license?
Fabbing it in AMD's own fabs might not be possible, but I assume the IBM fabs could offer it.
Wouldn't it be good for integrated and mobile graphics, with its very low power use and good performance?
 
Uttar! Quick! It's a Z-RAM thread!

(He will probably post a dissertation...)
 
Damn you Baron, you're no fun ;)
I was halfway through writing a long post, but I realized it'd be needlessly big and boring, so here goes instead... I'm sorry if you're aware of some of the facts I'm listing first; it's just that I'm sure many people aren't. While several of my statements would be appropriate for any form of FB-RAM (including Intel's recently announced one), they're obviously more focused on Z-RAM, and especially so on Gen2.

- Works on a plain SOI process, unlike eDRAM, which only works on specifically designed process variants.
- Nearly perfect yields, at least for Renesas' FB-RAM variant, and most likely also for Z-RAM.
- Optimizable for power, density and/or speed. Density of up to 10Mbits/mm2 on 45nm.
- License fee of ~$5M minimum for a company-wide technology license, or lower per-product.
- The response time can be as low as 3ns, so arguably it'd be faster than SRAM for large L3 arrays.

- PC GPUs: Unlikely before 2011, because TSMC takes SOI the least seriously of all fabs but Intel.
- PC IGPs: Same as for GPUs, but there is some obvious potential with AMD's "Fusion", we'll see...
- Handheld GPUs: Very likely within the next 2-3 years, especially so at AMD, but also at NVIDIA imo.

- There are imo tons of potential usages for Z-RAM at these densities, not just a dumb framebuffer.
- For GPGPU, it could be used as a gigantic mid-latency scratchpad and help a lot with raytracing etc.
- Even for TBDRs, it could be interesting; heck, DR could be seen as memory footprint compression.
- If used as a framebuffer for a non-TBDR, it is key to make the footprint mostly independent of AA.

I think one of the most interesting aspects of Z-RAM for GPUs is that it can really be reused for a vast number of tasks, including GPGPU, if implemented properly. Let's consider G80, and assume 8192 threads in flight, total. On 90nm, 8MBytes of Z-RAM would take about 30mm2, thus not increasing die size too much (~6-7%). For GPGPU, that buys you 1KiB of storage per thread, which for certain algorithms is much more useful than the parallel data cache, which is 10-20x smaller.
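
As a quick sanity check, here's that arithmetic as a small sketch; the 4x density scaling back from 45nm to 90nm and the ~480mm2 G80 die size are assumptions on my part, the rest are just the figures above:

```python
# Back-of-the-envelope version of the numbers above. All figures are this
# post's assumptions (plus an assumed ~480mm2 G80 die), not vendor data.

density_45nm_mbit_per_mm2 = 10                               # claimed Z-RAM Gen2 density at 45nm
density_90nm_mbit_per_mm2 = density_45nm_mbit_per_mm2 / 4    # rough scaling back two half-nodes

array_mbit  = 8 * 8                                          # 8 MBytes -> 64 Mbit
area_mm2    = array_mbit / density_90nm_mbit_per_mm2         # ~25-30 mm2 on 90nm
g80_die_mm2 = 480                                            # approximate G80 die size (assumption)
threads     = 8192                                           # threads in flight (assumption)

per_thread_kib = 8 * 1024 / threads                          # 8 MiB over 8192 threads -> 1 KiB

print(f"Array area: ~{area_mm2:.0f} mm2 (~{100 * area_mm2 / g80_die_mm2:.0f}% of the die)")
print(f"Per-thread storage: {per_thread_kib:.0f} KiB")
```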

You'd expect the number of threads to scale roughly linearly with the transistor count in the near/mid future, so this per-thread figure could be maintained. If you weren't too reliant on random memory accesses, you could probably double that by halving the number of threads. And obviously, increasing the part of the die dedicated to Z-RAM would also increase that.

For a fair comparison with CELL, you'd have to consider that the 256KiB is shared between 4 SIMD channels, so in practice, only 64KiB would be available per thread when running in SoA; and for various efficiency reasons (because the processor doesn't hide latency automatically), you'd likely want to further divide that by at least 4x IMO, so you're really back at about 16KiB. So one KiB per thread would still be a lot less, but you could quadruple that and still have only 20% of the die dedicated to Z-RAM, so there's some potential flexibility there.
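
Or, spelled out (the two 4x divisions being this post's own assumptions, not anything official):

```python
# Rough version of the CELL comparison above.
spe_local_store_kib   = 256
soa_lanes             = 4   # 256 KiB shared across 4 SIMD channels in SoA
latency_hiding_factor = 4   # extra division assumed for software-managed latency hiding

effective_kib = spe_local_store_kib / soa_lanes / latency_hiding_factor
print(f"Effective per-'thread' local store on CELL: ~{effective_kib:.0f} KiB")

# Versus the Z-RAM scratchpad: 1 KiB/thread at ~6-7% of the die; quadrupling
# the array to ~4 KiB/thread would still only be around 20% of the die.
```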

The big question, then, is how you could best use that Z-RAM for things other than GPGPU, while keeping your offerings extremely scalable. If you need to hit every PC market segment with your architecture, from $50 IGPs to $2000 Quad-SLI systems, scalability is key. Obviously, that problem isn't present for consoles (as Xenos further demonstrates), given that they're closed systems. But unless you're making an architecture only for the console (or handheld) market and getting paid back via NREs and royalties, you're fucked nowadays if your architecture isn't fully scalable.

Using that memory as a TBDR cache is appealing, but the problem is that memory footprint under traditional TBDR implementations scales primarily with scene complexity, not resolution. Given the PC market's dynamics, it would be best to aim for a solution whose footprint scales with resolution, while also not being overly dependent on antialiasing settings. This is because a customer who wishes to play 3D games on a 30" monitor is unlikely to do so on a $50 IGP, while someone on his 17" TFT is also unlikely to buy a $500 GPU.
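
To put some purely illustrative numbers on that, assuming 8 bytes of color+Z per sample:

```python
# Illustrative only: a per-sample color+Z framebuffer scales with resolution
# AND AA, which is exactly what you don't want for a small on-chip array.

def color_z_footprint_mib(width, height, aa_samples, bytes_per_sample=8):
    return width * height * aa_samples * bytes_per_sample / 2**20

for (w, h), aa in [((1280, 1024), 1), ((1280, 1024), 4),
                   ((2560, 1600), 1), ((2560, 1600), 4)]:
    print(f"{w}x{h} @ {aa}x AA: ~{color_z_footprint_mib(w, h, aa):.0f} MiB")

# A TBDR's parameter buffer, by contrast, scales mostly with scene complexity
# (binned triangles per frame), so it's largely resolution/AA independent;
# which also means it doesn't shrink much on a low-end, low-resolution part.
```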

There are several ways I can imagine to reach such a form of scalability, but most of them depend on average triangle size. As such, their scalability becomes limited by scene complexity as resolutions scale further down, and they only make sense if the developer aims for a sufficiently high average triangle size. Ironically, you are back at TBDR-like memory footprint scaling then; arguably, you could do slightly better, but the efficiency graph would be similar, just offset.

So, what could be done there, besides giving up on transparency for the application developer? I'm not sure, and it remains to be seen. I do not believe this is an unsolvable problem, however. Personally, I believe the most likely option is a variant of traditional TBDR architectures aimed at minimizing memory footprint, instead of minimizing overdraw and memory bandwidth usage. While an exotic compression algorithm would be interesting, I am skeptical it would be sufficiently more efficient to justify its implementation cost.

The other solution is to simply use it as a lossy Z-buffer that could page out to video memory when compression fails. It could also simultaneously be used as a large-scale texture cache (~1MiB?) or geometry shader cache. Under one implementation, you could imagine one or multiple FB-RAM arrays doing all of the above at the same time, or possibly at different times depending on the workload.
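
Sketching that out very roughly (tile granularity, the compression scheme and the spill policy are all assumptions here, not how any real part works):

```python
# A minimal sketch of the "spill when it doesn't fit" idea above.

class OnChipZCache:
    def __init__(self, capacity_bytes, compress, decompress):
        self.capacity, self.used = capacity_bytes, 0
        self.on_chip = {}        # tile_id -> compressed tile (in FB-RAM)
        self.spilled = {}        # tile_id -> raw tile (paged to video memory)
        self.compress, self.decompress = compress, decompress

    def store(self, tile_id, raw_tile):
        self.used -= len(self.on_chip.pop(tile_id, b""))   # drop any old copy
        self.spilled.pop(tile_id, None)
        packed = self.compress(raw_tile)
        if self.used + len(packed) <= self.capacity:
            self.on_chip[tile_id] = packed       # compressed tile fits on chip
            self.used += len(packed)
        else:
            self.spilled[tile_id] = raw_tile     # compression "failed" to fit:
                                                 # fall back to video memory

    def load(self, tile_id):
        if tile_id in self.on_chip:
            return self.decompress(self.on_chip[tile_id])
        return self.spilled[tile_id]
```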

So, in conclusion, the opportunities for any kind of high-density, mid-latency on-chip storage are tremendous IMO, and Z-RAM is a key candidate for high-profile implementation opportunities in the next 5 years. Even for the PC market, the possibilities are endless, and the array could be reused for a variety of tasks, including GPGPU processing. However, keeping the architecture scalable for traditional rendering is a key design goal, and on-chip storage complicates this. Smart implementations and algorithms are necessary, but these requirements will eventually be met, sooner rather than later.

For other markets, such as the handheld industry, a variety of factors make this much simpler, and in fact the technology could already be used today as a drop-in replacement for the large SRAM arrays of these products. The timeframes in which more complex algorithms will be necessary for proper usage of Z-RAM in the handheld market are delayed by at least 5 years compared to those of the PC market.

In the console market, FB-RAM is even more appealing. Sony, IBM and Chartered are all high-profile manufacturers in this business, and they all use SOI. In 5 years, you'd expect it to be really usable at TSMC too. So all major console players would likely be able to use FB-RAM for their future projects, and I'd be VERY surprised if none of them did. Heck, I'd be very surprised if any of them didn't!

Arguably, "concluding" in three paragraphs isn't exactly very concise, but hey, if you didn't want to read this, you didn't have to! :)


Uttar
P.S.: And damn you again, Baron ;)
P.P.S.: In case that wasn't clear, I don't think ATI is considering making two distinct architectures for the PC, one for IGPs/Laptops and one for Desktops. That's just a waste of R&D budget. A bit of extra research to make Fusion more efficient? Yes. It'd be totally awesome if it could share L3 cache with the GPU. I'm skeptical about that, but we'll see.
 
Doesn't the ~400 MHz clock speed (when specifically optimized for performance on a 65nm fab process) make it somewhat of a limiting factor?

Well, surely the clock speed will improve with Gen3 and beyond, but we don't really know by how much, and any of ISi's forecasts that I've seen talk more about further improvements in density and power consumption than about the attainable clock speed.
 
NEC's 90nm eDRAM is quoted as being 250MHz at 0.9V, so I don't think 400MHz is a particularly bad figure from that point of view. It's certainly limiting if you want to use it for a desktop CPU's L2 cache though, never mind L1. So it seems limited to an L3 cache for CPUs, at least for now, and in the PC market. In the handheld space, it's most likely very usable as L2, imo.
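
Just to illustrate why, assuming a 3GHz desktop core:

```python
# Rough illustration of why a ~400 MHz array reads as L3 rather than L1/L2
# material on a desktop CPU. The 3 GHz core clock is an assumption.

core_clock_hz  = 3.0e9
array_clock_hz = 400e6
zram_access_ns = 3.0                              # claimed best-case response time

array_cycle_ns = 1e9 / array_clock_hz             # 2.5 ns per array cycle
core_cycles    = zram_access_ns * core_clock_hz / 1e9

print(f"Array cycle time: {array_cycle_ns:.1f} ns")
print(f"Raw access: ~{core_cycles:.0f} core cycles, before tags/routing/queuing")

# ~9 core cycles of raw access time is hopeless for an L1 (2-4 cycles) and
# marginal for an L2 (~10-15 cycles), but perfectly fine for a large L3.
```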

So, if you want to take the specific example of AMD, let's look at two pictures illustrating K8L, one fully official and the other not quite so:
http://news.com.com/i/ne/p/2006/amdslide4_532x402.jpg
http://aycu16.webshots.com/image/1735/1362245863807578290_rs.jpg

It's fairly easy to see that on K8L's cores, if you exclude the L3 cache and the smaller buffer/register arrays, the logic ratio is very high (relatively speaking, compared to other modern CPUs, of course). So I think from AMD's point of view, it's more interesting to keep the size of the L2 cache in check and add a much larger Z-RAM-based L3. This should also help to keep L2 latency down... (a road they apparently didn't decide to take with Brisbane, heh!)

If you look at Penryn, about two thirds of the die represents the 6MiB of L2 cache. I'm sure that in the 45nm timeframe, AMD would love to have a lower-latency L2 than Intel and a much bigger L3 at the same time. I think such a design choice makes sense for them, but it remains to be seen when such things will actually happen. Exciting times ahead, and that's also true for the CPU industry, for a change - heh! :)


Uttar
 
*Z-RAM education* :runaway:
 
Would there be any problems in building an L3 cache with standard 6-T SRAMs for the tags running at full speed and utilizing Z-RAM for the main arrays? If you sample the main arrays at 8 bits per clock per bus bit line, à la DDR3, and cycle them at 1/8th the clock rate, would that then give a functional equivalent of a full 6-T SRAM design?
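
(For raw throughput, at least, the arithmetic seems to line up; here's a rough sketch with an assumed tag clock and bus width:)

```python
# Quick sketch of the bandwidth-equivalence idea above: run the Z-RAM arrays
# at 1/8th the clock but prefetch 8 bits per bit line per array cycle,
# DDR3-style. The 3.2 GHz tag clock and 256-bit bus are assumptions.

tag_clock_hz   = 3.2e9                                # full-speed 6-T SRAM tags
prefetch_bits  = 8
array_clock_hz = tag_clock_hz / prefetch_bits         # 400 MHz Z-RAM arrays
bus_width_bits = 256                                  # assumed L3 data bus width

sram_gbs = tag_clock_hz   * bus_width_bits * 1             / 8e9
zram_gbs = array_clock_hz * bus_width_bits * prefetch_bits / 8e9

print(f"Full-speed SRAM data array: {sram_gbs:.0f} GB/s")
print(f"1/8-speed Z-RAM array, 8n prefetch: {zram_gbs:.0f} GB/s")

# Throughput matches on paper; what the prefetch can't hide is the longer
# initial access latency and bank-busy time between dependent random accesses.
```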
 
Thanks for the information, great posts. So it is definitely going to be very interesting the next few years, can't wait.
 