CELL Patents (J Kahle): APU, PU, DMAC, Cache interactions?

j^aws · Aug 22, 2004

Panajev2001a said:
Also note that by taking these unit areas from the PSX core, we're also inheriting the areas of datapaths etc., which would scale with our calculations.

The problem they are seeing right now is that routing logic and wires do not scale as fast as you would want them to, and they are in any case a big limit on clock-speed scaling in future chips.

http://realworldtech.com/page.cfm?ArticleID=RWT062004172947

If Cell is a VLIW processor, then we'd see most of the scheduling/control logic removed from the hardware and into software, a clever compiler, no?

j^aws · Aug 22, 2004

Panajev2001a said:
Scared of multi-processors!

No, I want better cache for the APUs and the PU.

I want more general purpose performance:

Sounds like you'd prefer the Xe CPU!

Panajev2001a said:
I want the PU to be easier to program and to be able to better direct the APUs.

Do you think that you'd get low level access to the PUs and APUs? I have this feeling that low level access would jeopardise future compatibility with future faster APUs in Cell (there are likely to be many implementations of Cell) and therefore devs would be abstracted from them, no?

Vince · Aug 22, 2004

Jaws said:
If Cell is a VLIW processor, then we'd see most of the scheduling/control logic removed from the hardware and into software, a clever compiler, no?

Some members of the IBM team behind Cell patented something related to this in 2002, if memory serves me correctly. Should check it out.

PS. We went over this HERE a while ago. You can take or add or correct me on my area calculations if you want, I didn't scale the logic down correctly.

j^aws · Aug 22, 2004

Vince said:
Jaws said:

If Cell is a VLIW processor, then we'd see most of the scheduling/control logic removed from the hardware and into software, a clever compiler, no?

Some members of the IBM team behind Cell patented something related to this in 2002, if memory serves me correctly. Should check it out.

Thanks, I'll see if I can dig something up!

Vince said:
PS. We went over this HERE a while ago. You can take or add or correct me on my area calculations if you want, I didn't scale the logic down correctly.

I've checked the link out, it's about the Vietnam war? Not sure if this was the intended link or not!

Panajev2001a · Aug 22, 2004

Jaws said:
Panajev2001a said:

Scared of multi-processors!

No, I want better cache for the APUs and the PU.

I want more general purpose performance:

Sounds like you'd prefer the Xe CPU!

Are they that "incredibly" different ?

Cough...

I want engineering excellence: awesome Vector and good Scalar performance is better (for PlayStation 3 too) than excellent Vector and poor Scalar performance.

Jaws said:
Panajev2001a said:

I want the PU to be easier to program and to be able to better direct the APUs.

Do you think that you'd get low level access to the PUs and APUs? I have this feeling that low level access would jeopardise future compatibility with future faster APUs in Cell (there are likely to be many implementations of Cell) and therefore devs would be abstracted from them, no?

Lower level access to the PU ?

Why would that be a problem ?

Also abstraction of the orchestration of APUs means that someone has to write that abstraction code and that code needs to run fast.

Also, sooner or later low level access to APUs and PU will be granted IMHO.

I do not think this would introduce backward-compatibility issues.

DeanoC · Aug 22, 2004

Gubbi said:
Shurely some mistake Deano. Did you mistake KB for MB ?

Athlons have 128KB of level 1 cache (split between I and D caches, yes) clocking at 2.4GHz (0.416ns cycle time) with a load-to-use penalty of 3 cycles (1.25ns random access latency).

Stuffing lots of RAM on chip seems to be a good way to boost performance in future CPUs.

Exactly, a PC chip (which always uses lots of die space for cache, as it's a good general performance improvement) maxes out at 128K. And you're suggesting that each APU (of which a single chip is meant to have between 8 and 32) has a similar amount at a similar speed? (Assuming the APUs clock at around 2 GHz.)

Now if you want to clock the APUs down to 500 MHz it's a different story, but >1 GHz on-chip memory is massively expensive. Hardware engineers are always telling me that even small increases in cache RAM are very expensive.

I'd be happy to be wrong...

Panajev2001a · Aug 23, 2004

Jaws said:
Panajev2001a said:

Also note that by taking these unit areas from the PSX core, we're also inheriting the areas of datapaths etc., which would scale with our calculations.

The problem they are seeing right now is that routing logic and wires do not scale as fast as you would want them to, and they are in any case a big limit on clock-speed scaling in future chips.

http://realworldtech.com/page.cfm?ArticleID=RWT062004172947

If Cell is a VLIW processor, then we'd see most of the scheduling/control logic removed from the hardware and into software, a clever compiler, no?

PUs will not be VLIW processors; the APUs likely will be. We do not expect branch prediction even for them, judging by all the patents we have seen.

I want to see internal buses and APUs running fast (one 4 GHz PE vs. four 1 GHz PEs): since I do not expect the APUs to be MT (Multi-Threaded, no evidence in the patents we have seen so far), I want to reduce context-switching time to the minimum.

The PU will have to juggle tons of threads to keep the system running efficiently, and a good thread scheduler will need a nice and fast PU to do its job.

Still, even though you remove a nice chunk of scheduling and control logic, you need to worry about wires and the wall that they are putting on CPU scaling.

CELL is going the right way: a highly parallel multi-core solution (tons of little cores).

I will even grant you that yields on such a 300 mm^2 beast would not be so bad: the degree of redundancy you have helps you.

You have lots of repeating blocks: SRAM cells, DRAM cells, APUs, PUs, DMACs.

Once each block has been simulated (in your simulator, tested on an FPGA or using a custom ASIC), tested, debugged, fixed, etc... replicating it will not be an impossible undertaking.

Plus, and this you forgot, you can add a few more blocks just to increase yields: if you put 9 APUs per PE it will be much more probable to have working PEs with 8 fully functional APUs.
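
To put rough numbers on the spare-APU idea, here is a minimal Python sketch. It assumes each APU is defect-free independently and with the same probability, and the 0.95 figure is purely illustrative (not a real process number); `pe_yield` is a hypothetical helper name. It also ignores the extra die area the spare costs and any defects outside the APUs.

```python
from math import comb


def pe_yield(p_apu: float, built: int, needed: int) -> float:
    """Probability that at least `needed` of `built` APUs work,
    assuming each APU is good independently with probability p_apu."""
    return sum(
        comb(built, k) * p_apu**k * (1 - p_apu) ** (built - k)
        for k in range(needed, built + 1)
    )


# Illustrative assumption only: 95% chance that any single APU is good.
p = 0.95
no_spare = pe_yield(p, built=8, needed=8)   # all 8 of 8 must work
one_spare = pe_yield(p, built=9, needed=8)  # any 8 of 9 must work

print(f"8 of 8 working: {no_spare:.3f}")
print(f"8 of 9 working: {one_spare:.3f}")
```

Under these assumptions the version with one spare always comes out ahead, because it includes every outcome where exactly one APU is bad.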

Aside from only rumored, but not confirmed in any way, manufacturing process hiccups (and other economic concerns: even with what we have said so far, 1 PE is still cheaper than 4 PEs at the same frequency), the main concern is the heat such a beast would produce.

Panajev2001a · Aug 23, 2004

DeanoC said:
Gubbi said:

Shurely some mistake Deano. Did you mistake KB for MB ?

Athlons have 128KB of level 1 cache (split between I and D caches, yes) clocking at 2.4GHz (0.416ns cycle time) with a load-to-use penalty of 3 cycles (1.25ns random access latency).

Stuffing lots of RAM on chip seems to be a good way to boost performance in future CPUs.

Exactly, a PC chip (which always uses lots of die space for cache, as it's a good general performance improvement) maxes out at 128K. And you're suggesting that each APU (of which a single chip is meant to have between 8 and 32) has a similar amount at a similar speed? (Assuming the APUs clock at around 2 GHz.)

Now if you want to clock the APUs down to 500 MHz it's a different story, but >1 GHz on-chip memory is massively expensive. Hardware engineers are always telling me that even small increases in cache RAM are very expensive.

I'd be happy to be wrong...

The LS does not have to run at 4 GHz, first. Second, it could if you accept a few more cycles of latency (no-one is asking for a single-cycle-latency LS).

Still, Pentium 4 Prescott CPUs have 1 MB of L2 cache running at 3.5 GHz, sold at a very nice profit, right now on a still not perfected, non-SOI manufacturing process.

Itanium 2 McKinley (first generation Itanium 2): 3 MB of cache, 400+ mm^2, 130 nm manufacturing process with 200 mm wafers, clock-speed >= 1.0-1.2 GHz...

Cost ? $130-140 for Intel per CPU.

Intel plans to push in 90 nm more than 24 MB of cache.

"Otellini said that Montecito has 1.7 billion transistors on each chip, 24MB of cache, is dual core, and would have three times increase in performance bandwidth, introduced in 2005. "We have silicon for this today," he claimed. He also said that nine out of 10 RISC firms now ship Itanium processors. The tenth is Sun, of course."

Fafalada · Aug 23, 2004

DeanoC said:
Exactly, a PC chip (which always uses lots of die space for cache, as it's a good general performance improvement) maxes out at 128K.

Cache takes more die space than regular embedded memory though, and moreover, it has a different curve in regards to size/performance benefits.

I don't know about how high it will clock, but given the process and suggested size of the BE, I don't think it's so unreasonable to expect a couple of MB of embedded memory in it.

I wouldn't expect anything silly like the 64MB eDram pool some have suggested on TOP of APU local storage.

passerby · Aug 23, 2004

1c contribution.

In personal guesses, I'm currently on the '<<< 4GHz' side. A lot of our guesses could be made much simpler and more believable (even more elegant!) if we didn't need to try scaling conjectures to the 'preferred embodiment of 32GF per APU' statement.

Some time ago there was a report of a near-TF (or multi-100 GF) CPU created by an institute - was it based in Israel? I recall that the Inquirer/Register reported it. No reference to clockspeed, just a vague statement: 'this CPU achieved its performance by having a large amount of its area dedicated to mathematical computations'. And of course this product is suitable for use only in certain domains.

Hey, as another really good example, the newest video cards only clock in the half-GHz range, and can still post high GF numbers.

My current guess:
Each APU is just in the sub-1GHz regions - maybe even much lower. How they claim high GF numbers - well we'll wait and see. After all we don't know much about what an APU is - just lots of papers about how APUs deploy and work together, but nothing technical on an APU itself.

DeanoC · Aug 23, 2004

Panajev2001a said:
The LS does not have to run at 4 GHz, first. Second, it could if you accept a few more cycles of latency (no-one is asking for a single-cycle-latency LS).

Still, Pentium 4 Prescott CPUs have 1 MB of L2 cache running at 3.5 GHz, sold at a very nice profit, right now on a still not perfected, non-SOI manufacturing process.

Itanium 2 McKinley (first generation Itanium 2): 3 MB of cache, 400+ mm^2, 130 nm manufacturing process with 200 mm wafers, clock-speed >= 1.0-1.2 GHz...

Cost ? $130-140 for Intel per CPU.

Intel plans to push in 90 nm more than 24 MB of cache.

"Otellini said that Montecito has 1.7 billion transistors on each chip, 24MB of cache, is dual core, and would have three times increase in performance bandwidth, introduced in 2005. "We have silicon for this today," he claimed. He also said that nine out of 10 RISC firms now ship Itanium processors. The tenth is Sun, of course."

Even at full clockspeed the latency varies. No L1 RAM has single-cycle latency these days. Currently it varies between 3-9 cycles (dependent on processor) and this stuff is amazingly expensive. L2 even at full clock has a much higher latency (10+ cycles).

You're basically backing up what I said: low-latency fast RAM will be VERY scarce, whereas slower, high-latency RAM is a lot cheaper.

eDRAM, L2 and L3 are all the second, cheaper type, whereas L1/scratchpad RAM is the first.

What I originally suggested was just that if you're talking about MB levels of RAM on the chip, it's mostly going to be the slower, higher-latency stuff.

The limit IMO would be 1 MB of SRAM on chip, which if we have 8 APUs would mean the original 128K local ram.
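
The budget arithmetic here is easy to check with a small Python sketch. The helper name `local_store_total_kb` is made up for illustration; the 8 x 128 KB case is the one stated above, and the 32-APU case is the larger configuration also discussed in this thread.

```python
def local_store_total_kb(apus: int, kb_per_apu: int) -> int:
    """Total on-chip local store, in KB, if every APU gets kb_per_apu."""
    return apus * kb_per_apu


# 8 APUs with 128 KB each fills the suggested 1 MB SRAM limit.
one_pe_kb = local_store_total_kb(apus=8, kb_per_apu=128)

# 32 APUs with 128 KB each would need four times that.
four_pe_kb = local_store_total_kb(apus=32, kb_per_apu=128)

print(f"8 APUs:  {one_pe_kb} KB = {one_pe_kb / 1024:g} MB")
print(f"32 APUs: {four_pe_kb} KB = {four_pe_kb / 1024:g} MB")
```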

DeanoC · Aug 23, 2004

Fafalada said:
DeanoC said:

Exactly, a PC chip (which always uses lots of die space for cache, as it's a good general performance improvement) maxes out at 128K.

Cache takes more die space than regular embedded memory though, and moreover, it has a different curve in regards to size/performance benefits.

I don't know about how high it will clock, but given the process and suggested size of the BE, I don't think it's so unreasonable to expect a couple of MB of embedded memory in it.

I wouldn't expect anything silly like the 64MB eDram pool some have suggested on TOP of APU local storage.

Couple of megs of eDRAM sure but eDRAM has 20+ cycle latency (hidden well in graphics processors).

It's the large amount of very fast, low-latency SRAM that I'm doubting.

Gubbi · Aug 23, 2004

DeanoC said:
Even at full clockspeed the latency varies. No L1 RAM has single-cycle latency these days. Currently it varies between 3-9 cycles (dependent on processor) and this stuff is amazingly expensive. L2 even at full clock has a much higher latency (10+ cycles).

But scratchpad memory isn't cache. In a cache you have to check tags to see if you have a hit. Then you have to choose the way (no one uses flat caches these days). Then you have to get the data out of the SRAM array. All this in 3 cycles @ 2.4GHz.

APU scratchpad memory is directly addressable, so using the same 3 cycles you can either 1.) go for a faster clock, or 2.) save power.
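
As a rough Python sketch of the cycles-versus-nanoseconds trade-off: the helper names below are hypothetical, and the 4 GHz line is only an illustration of holding the same absolute access time at a higher clock, not a claim about any real design.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Wall-clock latency in ns; one cycle at f GHz lasts 1/f ns."""
    return cycles / clock_ghz


def ns_to_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Number of cycles that a fixed latency in ns spans at a given clock."""
    return latency_ns * clock_ghz


# The 3-cycle L1 access at 2.4 GHz quoted in this thread.
athlon_l1_ns = cycles_to_ns(3, 2.4)

# Illustration: the same absolute access time seen from a 4 GHz clock.
cycles_at_4ghz = ns_to_cycles(1.25, 4.0)

print(f"3 cycles @ 2.4 GHz = {athlon_l1_ns:.2f} ns")
print(f"1.25 ns @ 4.0 GHz = {cycles_at_4ghz:.1f} cycles")
```

The point of separating the two helpers is that an SRAM array has a roughly fixed access time in ns, so a faster clock shows up as more cycles of latency rather than as a faster memory.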

32APUs x 128KB is just 4MB SRAM. Intel already has CPUs with 6MB level 3 cache on die (Itanium 2), and that's 2 logic generations before what CELL is supposed to be on.

Cheers
Gubbi

nAo · Aug 23, 2004

Gubbi said:
Intel already has CPUs with 6MB level 3 cache on die (Itanium 2), and that's 2 logic generations before what CELL is supposed to be on.

Even if the PS3 CELL chip is expensive, I doubt it will be as expensive as an Itanium 2 to build...

Panajev2001a · Aug 23, 2004

Fafalada said:
DeanoC said:

Exactly, a PC chip (which always uses lots of die space for cache, as it's a good general performance improvement) maxes out at 128K.

Cache takes more die space than regular embedded memory though, and moreover, it has a different curve in regards to size/performance benefits.

I don't know about how high it will clock, but given the process and suggested size of the BE, I don't think it's so unreasonable to expect a couple of MB of embedded memory in it.

I wouldn't expect anything silly like the 64MB eDram pool some have suggested on TOP of APU local storage.

Also, power consumption for SRAM transistors is way lower than logic transistors.

Panajev2001a · Aug 23, 2004

DeanoC said:
Panajev2001a said:

The LS does not have to run at 4 GHz, first. Second, it could if you accept a few more cycles of latency (no-one is asking for a single-cycle-latency LS).

Still, Pentium 4 Prescott CPUs have 1 MB of L2 cache running at 3.5 GHz, sold at a very nice profit, right now on a still not perfected, non-SOI manufacturing process.

Itanium 2 McKinley (first generation Itanium 2): 3 MB of cache, 400+ mm^2, 130 nm manufacturing process with 200 mm wafers, clock-speed >= 1.0-1.2 GHz...

Cost ? $130-140 for Intel per CPU.

Intel plans to push in 90 nm more than 24 MB of cache.

"Otellini said that Montecito has 1.7 billion transistors on each chip, 24MB of cache, is dual core, and would have three times increase in performance bandwidth, introduced in 2005. "We have silicon for this today," he claimed. He also said that nine out of 10 RISC firms now ship Itanium processors. The tenth is Sun, of course."

Even at full clockspeed the latency varies. No L1 RAM has single-cycle latency these days. Currently it varies between 3-9 cycles (dependent on processor) and this stuff is amazingly expensive. L2 even at full clock has a much higher latency (10+ cycles). You're basically backing up what I said: low-latency fast RAM will be VERY scarce, whereas slower, high-latency RAM is a lot cheaper.

Pentium 4's Northwood (512 KB of L2 cache) has a pure L2 latency of 7-9 cycles, and this is true even at 3+ GHz.

L1's latency is only 2 cycles for the 8 KB Data cache Northwood uses: very small cache, but 2 cycles of latency at 3 GHz is impressive.

eDRAM, L2 and L3 are all the second, cheaper type, whereas L1/scratchpad RAM is the first.

What I originally suggested was just that if you're talking about MB levels of RAM on the chip, it's mostly going to be the slower, higher-latency stuff.

The limit IMO would be 1 MB of SRAM on chip, which if we have 8 APUs would mean the original 128K local ram.

Fine by me as that is what I expect: 1 PE at 4 GHz with 8 APUs per PE.

How fast do you see this LS being in terms of latency ?

If Intel can push 512 KB of L2 cache (which is more complex than regular scratch-pad RAM) at 3 GHz with 7-9 cycles of pure latency on a 130 nm process, what can STI do in 65 nm ?

Panajev2001a · Aug 23, 2004

nAo said:
Gubbi said:

Intel already has CPUs with 6MB level 3 cache on die (Itanium 2), and that's 2 logic generations before what CELL is supposed to be on.

Even if the PS3 CELL chip is expensive, I doubt it will be as expensive as an Itanium 2 to build...

$140 for an Itanium 2 (still with x86 compatibility logic in there), for a >400 mm^2 chip with 3 MB of cache running at 1-1.2 GHz (and faster now; that was an estimate of the cost when the chip was launched) in 130 nm?

Guden Oden · Aug 23, 2004

DeanoC said:
No L1 RAM has single cycle latency these days.

Like Gubbi already mentioned, scratchpad RAM is not cache, so this does not apply...

and this stuff is amazingly expensive.

Huh? "Amazingly" expensive? Why? HOW?

One transistor isn't necessarily any more "expensive" than another you know, they're all etched the same way out of the silicon substrate. Anyway, remind me, what does a 2.8+GHz CPU cost these days? Not THAT much!

L2 even at full clock has a much higher latency (10+ cycles).

The limit IMO would be 1 MB of SRAM on chip

I'm not contradicting you or anything (because I don't know myself), but how do you figure this to be the case, common sense perhaps?

If someone in 1998 would have told you Sony would release a console two years later with a total of 40+ million transistors in its two main chips, what would your general reaction have been? Probably one of disbelief I would think.

I wouldn't bet the house on having 32APUs and 4MB SRAM, but we can be sure that if it is doable, Sony will do it. They're not afraid of aiming for over the top solutions rather than the nice and safe middle-of-the-road approach like most others.

j^aws · Aug 23, 2004

Vince said:
....
PS. We went over this HERE a while ago. You can take or add or correct me on my area calculations if you want, I didn't scale the logic down correctly.

Okay, I've had a quick scan through that thread. It's interesting we've used two approaches and we both end up with a BE ~ 300 mm2.

However, your BE is slightly more conservative, with less eDRAM than mine (with 32 MB) and no L2 and L3 cache for the PUs. Still, it shows that with two different approaches we're around the same ballpark figure!

Some comparable figures:

4 PUs: Jaws ~21 mm2, Vince ~25 mm2

32 APUs: Jaws ~170 mm2, Vince ~162 mm2

64 MB eDRAM: Jaws ~55 mm2, Vince ~59 mm2

Both sets add up to 246 mm2!
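
For anyone who wants to re-run the sums, a quick Python sketch using only the figures listed above (the dictionary layout and variable names are just for illustration):

```python
# Area estimates in mm2, copied from the comparison above.
areas_mm2 = {
    "Jaws": {"4 PUs": 21, "32 APUs": 170, "64 MB eDRAM": 55},
    "Vince": {"4 PUs": 25, "32 APUs": 162, "64 MB eDRAM": 59},
}

# Sum the three blocks for each estimate.
totals = {who: sum(parts.values()) for who, parts in areas_mm2.items()}

for who, total in totals.items():
    print(f"{who}: {total} mm2")
```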

Your approach was more pessimistic in that you didn't use geometric ^2 scaling and you factored in a 2 mm2 fudge factor per APU. My calculations took the fudge factor inherent in the unit areas into account and are mostly one process drop, from 90 nm to 65 nm. Still, in hindsight I should increase my 4 PUs to ~28 mm2 with the 128 KB L2 cache.

It's interesting that you used PowerPC 440s as the PU cores, which is unlikely now as IBM has sold the IP. The PU cores could be something completely new that wasn't on IBM's roadmap when Cell was announced, to go with the new ISA and architecture? Or the rumoured embedded PowerPC 300 series?

j^aws · Aug 23, 2004

Panajev2001a said:
Jaws said:

Panajev2001a said:

Scared of multi-processors!

No, I want better cache for the APUs and the PU.

I want more general purpose performance:

Sounds like you'd prefer the Xe CPU!

Are they that "incredibly" different ? Cough...

You might want to get that cough seen to!

If they are not that fundamentally different, then what is the real agenda behind Cell and all this investment? Why not just ask IBM for 4 dual issue PowerPC 970/980s to burn into a 65nm CPU and perhaps save some cash?

Panajev2001a said:
Lower level access to the PU ?

Why would that be a problem ?

By low level, I mean register level / assembly level / microcode level or some similar type of access to the PU and APUs. There are likely to be different implementations of APUs and PUs for Cell... Wouldn't this affect future compatibility?

Panajev2001a said:
...
Also abstraction of the orchestration of APUs means that someone has to write that abstraction code and that code needs to run fast.

Wouldn't a VLIW compiler deal with that?

Panajev2001a said:
Also, sooner or later low level access to APUs and PU will be granted IMHO.

Yeah, if devs complain about performance or need to go that extra mile...but at the cost of future compatibility? Sony would probably accept this for any edge on the PS3, especially if Xenon has low level access. Wouldn't your head hurt if we get the BE, dealing with 32 APUs and 4 PUs!

And I thought Sony wanted to make it easier for devs!
