New Fusion information, but now in English!

It would make a lot of sense to produce a Fusion processor on an advanced (and more expensive) SOI process in one of AMD's own fabs:

In that timeframe AMD could implement their large L3 cache not in SRAM anymore but in embedded DRAM cells from IBM, which use SOI to build up the capacitor.

Remember that the Xbox 360 GPU already uses eDRAM for a very fast framebuffer, among other things. The larger L3 cache (eDRAM holds roughly 3x more memory than SRAM in the same area) could be made configurable, so that the CPU and GPU can access the same data without going over external memory. That would really allow the GPU and the CPU to collaborate efficiently.

Compared to the 6MB in Shanghai, the Fusion processor could feature an 18MB L3 cache.
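
A quick back-of-the-envelope check on that figure (the ~3x eDRAM-vs-SRAM density factor is the usual rule of thumb, not a confirmed number for AMD's process):

```python
# 6T SRAM vs 1T1C eDRAM: the common rule of thumb is ~3x the bits per mm2.
sram_l3_mb = 6                    # Shanghai's L3, implemented in SRAM
edram_density_factor = 3          # assumed density advantage of eDRAM

print(sram_l3_mb * edram_density_factor)   # -> 18 MB in roughly the same area
```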

The integrated memory controller could DMA into this L3 cache to completely hide latency for larger GPU-type workloads.
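
A minimal sketch of what that could look like from the software side, assuming such a DMA engine were exposed; all names here (dma_fetch, process_tile, TILE_BYTES) are hypothetical, and in hardware the fetch would run concurrently with the compute rather than sequentially as plain Python does:

```python
# Double-buffered streaming: while the GPU part works on one tile that is
# already resident in the L3 partition, the memory controller DMAs the next
# tile in from external DRAM, hiding its latency behind computation.

TILE_BYTES = 2 * 1024 * 1024        # tile sized to fit a slice of an 18MB L3

def dma_fetch(tile_index: int) -> bytearray:
    """Stand-in for the integrated memory controller filling one L3 tile."""
    return bytearray(TILE_BYTES)    # placeholder for the fetched data

def process_tile(data: bytearray) -> None:
    """Stand-in for the GPU-type computation on a resident tile."""
    pass

def stream(num_tiles: int) -> None:
    next_tile = dma_fetch(0)        # prime the pipeline
    for i in range(num_tiles):
        current = next_tile
        if i + 1 < num_tiles:
            next_tile = dma_fetch(i + 1)   # overlaps with compute in hardware
        process_tile(current)
```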

This would indeed be more than a simple combination of CPU and GPU: it would allow a real fusion of the two, in the sense that each part can work on the workloads it handles best.

If I were a betting man I'd say the 3x jump in L3 size from Barcelona to Shanghai is due to the use of Z-RAM. I doubt it can get much denser than that for AMD... 18MB caches are probably a long way out for them, unless they start building in Intel's fabs :p
 
Shanghai in 45nm

If I were a betting man I'd say the 3x jump in L3 size from Barcelona to Shanghai is due to the use of Z-RAM. I doubt it can get much denser than that for AMD... 18MB caches are probably a long way out for them, unless they start building in Intel's fabs :p

I thought Shanghai was going to be produced on a 45nm process. 45nm is also the IBM process node that includes embedded DRAM.
 
I thought Shanghai was going to be produced on a 45nm process. 45nm is also the IBM process node that includes embedded DRAM.

This is true, so perhaps we will see some sort of amazing technology dropped on us. I suspect there may be roadblocks, though, such as limits on reaching an 18MB size, and the actual performance gain from faster CPU-to-GPU communication might turn out to be limited.
 
If I were a betting man I'd say the 3x jump in L3 size from Barcelona to Shanghai is due to the use of Z-RAM. I doubt it can get much denser than that for AMD... 18MB caches are probably a long way out for them, unless they start building in Intel's fabs :p
In fact, SRAM is way easier to design than functional units.

If it's eDRAM, then it could be a little harder, but with a way higher density.

Though, I don't see any benefit from adding more and more L2/L3. For example in Penryn: the die is almost twice the "needed" size, just because of the added 4MB L2, and without real performance gains.
 
I thought that Penryn has 6MB L2?
6MB L2 = 300M transistors.

Given the low performance gain between the 2MB and 4MB C2D, the 2MB part has a better perf/transistor ratio, and the same goes for Penryn. Rough indices (reproduced in the sketch after this list):

2MB C2D, index 100 with 200M transistors (perf/transistor index: 100).
4MB C2D, index 105 with 300M transistors (perf/transistor index: 70).
6MB Penryn, projected index 115 with 400M transistors (perf/transistor index: 57).
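
For reference, here is that arithmetic spelled out (the perf indices are the rough estimates quoted above, not benchmark data):

```python
# 6T SRAM: 6 transistors per bit, so 6MB of L2 is ~300M transistors.
l2_bits = 6 * 2**20 * 8
print(l2_bits * 6 / 1e6)          # -> ~302M transistors, i.e. "300M"

# Perf/transistor index, normalized so the 2MB C2D is 100.
chips = [("2MB C2D", 100, 200), ("4MB C2D", 105, 300), ("6MB Penryn", 115, 400)]
base = 100 / 200                  # perf per million transistors of the 2MB part
for name, perf, mtransistors in chips:
    print(name, round(100 * (perf / mtransistors) / base, 1))
# -> 100.0, 70.0, 57.5 (quoted as 57 above)
```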

Remember, a quad-core 2MB K10 counts ~460M transistors.

I don't know why they add so much cache, since it's obvious the expected performance gain is so small given the high hit rates we already see.

In the case of Fusion, the bigger L3 could be useful given the number of calculation units it could contain, but it's in no way useful in a "standard" CPU.
 
I don't know why they add so much cache, since it's obvious the expected performance gain is so small given the high hit rates we already see.

Everybody already has a CPU; Intel and AMD have to convince them to replace it with a new one. To do that, they have to make existing software (which is almost exclusively single-threaded) run faster.

What else are you going to do with the extra transistors Moore's Law keeps handing out? The return on investment for additional execution units, bigger branch-prediction tables, instruction look-ahead pools, or any of the other traditional ILP mechanisms is pretty small these days, just like cache is.

There are a few workloads that benefit from lots of cache, so bigger caches give you an app or two for your marketing slides to get hysterical about.
 
There are a few workloads that benefit from lots of cache, so bigger caches give you an app or two for your marketing slides to get hysterical about.
This is perfectly right from a marketing perspective, but it is horribly wrong from a financial and accounting perspective.

The only thing that matters at the end of the day is revenue/mm2, where yield is implicitly taken into account. In the CPU industry, a number of industry dynamics influence this ratio:

- Revenue from a product is not linearly proportional to performance, even when TDPs don't matter; indeed, prices tend to grow faster than performance, which can make the mid-range more interesting than the low-end, for example.
- Traditionally, there is very little redundancy in CPUs except for cache; thus, the implicit effect of yield on the ratio makes cache less expensive per mm2 (see the toy model after this list). It is also the densest kind of transistor, and as such is even cheaper per transistor. Logic redundancy is slowly changing though, with dies harboring defective cores being sold more and more (Intel is doing it with Conroe, IBM is doing it with CELL, and now AMD is doing it with Phenom).
- Both AMD and Intel have their own fabs; as such, it is very important to have them running at full capacity as much as possible. If you realize that you have too much capacity, then increasing your mix of chips with (slightly) lower revenue/mm2 can even be a good strategy, ironically enough.
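
To illustrate the yield point from the list above, here's a toy Poisson yield model; the defect density and die split are invented round numbers, purely for illustration:

```python
import math

DEFECTS_PER_MM2 = 0.003           # ~0.3 defects/cm2, an assumed figure

def poisson_yield(critical_area_mm2: float) -> float:
    """Fraction of dies with zero defects in the area that cannot be repaired."""
    return math.exp(-DEFECTS_PER_MM2 * critical_area_mm2)

logic_mm2, cache_mm2 = 60.0, 40.0   # hypothetical 100mm2 die split

# Without redundancy, a defect anywhere kills the die:
print(f"no repair:    {poisson_yield(logic_mm2 + cache_mm2):.1%}")   # ~74.1%
# With spare rows/columns, defects in the cache are repairable, so only
# the logic area is yield-critical -- cache mm2 are effectively cheaper:
print(f"cache repair: {poisson_yield(logic_mm2):.1%}")               # ~83.5%
```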

It's a complex question really, so I won't go into my opinions of what makes the most sense in practice. Needless to say though, I disagree it's only a marketing decision (well, if your company's organization really sucks, it might be - but that's suboptimal and another problem completely).

My opinion is that the 'ideal' long-term solution is to have memory chips use photonic interfaces to significantly increase bandwidth and reduce power. Interestingly, photonics requires SOI, and so does Z-RAM, so it seems to me there's an obvious synergy there... It's difficult to know exactly what the economics of that would be without being an insider, though. And either way, it's a long way off, so I'm really just digressing here!
 
- Traditionally, there is very little redundancy in CPUs except for cache; thus, the implicit effect of yield on the ratio makes cache less expensive per mm2. It is also the densest kind of transistor, and as such is even cheaper per transistor. Logic redundancy is slowly changing though, with dies harboring defective cores being sold more and more (Intel is doing it with Conroe, IBM is doing it with CELL, and now AMD is doing it with Phenom).

I think that was before the multi-core era gave them so many more options.

Just an example: The Core 2 Duo E4xxx.
There are two additional sub-products of this die. The Core 2 Solo U2xxx and the Pentium Dual-Core E2xxx.
Intel played with cache in the second one (disabling 1MB of the 2MB and selling it as a "dual-core for the masses") and further disabled one of the two cores while reducing clock speeds down to Ultra Low Voltage levels in the first one.
Yet, the same core exists in a Celeron 5xx at higher clock speeds and voltages.

So, out of just one of their two mainstream 65nm dies today (the other being the E6xxx/T7xxx/Q6xxx/Xeons), Intel is able to offer a wealth of products.
This flexibility with clocks, amounts of L2, FSBs, voltages, numbers of active cores and disabled features (VT, etc.) didn't exist in the Pentium III days, for instance.
The "tri-core" move by AMD didn't surprise me as it was just another logical step.
 
Large L3 cache might be "splittable"

Though, I don't see any benefit from adding more and more L2/L3. For example in Penryn: the die is almost twice the "needed" size, just because of the added 4MB L2, and without real performance gains.

That's right, increasing cache sizes shows diminishing benefits. But a Fusion CPU could split the available L3 cache so that, let's say, 12 or 16MB are used as a fast framebuffer, texture cache, etc. for the GPU part. The CPU could even get access to GPU memory without going out to main memory or over any bus system like PCIe.
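
For a sense of scale (assuming a plain 32-bit color buffer; the resolutions are just common examples):

```python
# Bytes per frame at 4 bytes/pixel -- a 12-16MB L3 partition could hold a
# full framebuffer at these resolutions with room left over for textures.
for w, h in [(1280, 1024), (1600, 1200), (1920, 1200)]:
    print(f"{w}x{h}: {w * h * 4 / 2**20:.1f} MB")
# -> 5.0 MB, 7.3 MB, 8.8 MB
```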

So there would indeed be a synergy as soon as the GPU gets widely used for purposes other than rendering.
 
let's say, 12 or 16MB are used as a fast framebuffer, texture cache, etc. for the GPU part.
Yup, and this is even more attractive with Z-RAM, which I am very, very confident that we will see in Bulldozer, if not earlier. I hope we'll see it in the first non-MCM Fusion part too, but who knows. Anyway, you don't even need 12MB to get some very nice returns on an IGP imo... Needing as much eDRAM as your framebuffer's size is a massively outdated concept.
 
Yup, and this is even more attractive with Z-RAM, which I am very, very confident that we will see in Bulldozer, if not earlier. I hope we'll see it in the first non-MCM Fusion part too, but who knows. Anyway, you don't even need 12MB to get some very nice returns on an IGP imo... Needing as much eDRAM as your framebuffer's size is a massively outdated concept.

Have I detected a small "sting" at Xenos? :smile:
 