CELL Patents (J Kahle): APU, PU, DMAC, Cache interactions?

V3 · Aug 24, 2004

Just read an interesting, if true, article about Intel. Here

That's interesting, if true.

PS. How big is NV40? I know it's large, but how large?

http://www.beyond3d.com/previews/nvidia/nv40/index.php?p=2

At 222 Million transistors and the same process size as the previous generation high end parts, the NV40 is a big chip, fitting only about 190 full cores per 300mm wafer. In fact the chip is so large, you may wonder what their yields will be like and hence what the availability will be like - nonetheless, it seems like a fairly impressive feat to have a chip this large running on 130nm.

NV40 probably already contains more than 32 processors, even if they're stream processors. The Broadband Engine is not too big a deal compared to it. Only its clock speed, at 4 GHz, will be an achievement.

j^aws · Aug 24, 2004

Vysez said:
Jaws said:

You dont have a link to that do you ? Cheers..

Of course I have a link: a little site named Beyond3d.com

It appears that Dave didn't add the die size (which I remembered because I saw it lately; I checked the transistor count, though).

Thanks...He's got it down as 0 mm2, isn't that false advertising!

Vysez said:
But since you're looking for a link, I googled this link.

This link says 270 - 305 mm2...

Duh...I was just there getting the 970FX die size, didn't know they did GPUs, nice...though 270-305 mm2 is quite a large range :? ...still, it's BE territory!
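As a rough cross-check of that 270-305 mm2 range against the "about 190 full cores per 300mm wafer" figure quoted earlier in the thread, here is a minimal sketch using a common first-order gross-dies-per-wafer approximation. The function name is made up for illustration, and the estimate ignores scribe lines, edge exclusion and yield, so treat it as a ballpark only.

```python
import math

def gross_dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """First-order estimate: wafer area / die area, minus an edge-loss term."""
    radius = wafer_diameter_mm / 2
    whole_dies = math.pi * radius ** 2 / die_area_mm2
    # Partial dies lost around the circumference of the wafer
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(whole_dies - edge_loss)

# Both ends of the die-size range mentioned above, on a 300mm wafer
for area in (270, 305):
    print(area, "mm2 ->", gross_dies_per_wafer(300, area), "candidate dies")
```

With these assumptions the upper end of the range lands close to the ~190 figure, which suggests the NV40 die is nearer 300 mm2 than 270 mm2.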

Fafalada · Aug 24, 2004

Deano said:
Couple of megs of eDRAM sure but eDRAM has 20+ cycle latency (hidden well in graphics processors).
It's the lots of very fast, low-latency SRAM I'm doubting.

Actually there's an interesting question - what kind of pipeline length will the APUs have in the first place. I mean, I sincerely doubt we can expect something as shallow as 4 stages, and the mem latency only needs to be enough to fit the pipeline length if we follow VU design.

Jaws said:
The PPC 970FX is the dual issue, SOI, strained silicon, higher clocking, 90 nm modified version of the G5/PPC 970 core...

Er, last I checked 970FX was still 8-way superscalar.
Anyway, why not something like a modified PPC440 instead - it's lightweight, already 2-way SS, and unlike the 970 family it doesn't come with a ton of baggage that has no real purpose in a console.

Meh...didn't you see that die image of the PSX core, with 4 MB eDRAM at 130nm and the rest of the core at 90nm, and only 86mm2 die area, and my attempt to show a BE ~ 300 mm2 at 65 nm

Well I just think Sony probably wants a chip that won't melt the casing of the machine if you run it more than 30 minutes.

Gubbi · Aug 24, 2004

Fafalada said:
Deano said:

Couple of megs of eDRAM sure but eDRAM has 20+ cycle latency (hidden well in graphics processors).
It's the lots of very fast, low-latency SRAM I'm doubting.

Actually there's an interesting question - what kind of pipeline length will the APUs have in the first place. I mean, I sincerely doubt we can expect something as shallow as 4 stages, and the mem latency only needs to be enough to fit the pipeline length if we follow VU design.

I don't think it will be much above 4. The APUs are likely to be simple two way (load/store +exec) superscalar CPUs. There are very clear advantages to keeping pipeline length down, first and foremost you can make do with simple branch prediction.

Given the super simple nature of an APU, instruction fetch, decode and issue should all be able to be performed in one cycle (or 2 at the most). IBM has a tradition of resolving branching in the fetch/decode stage as well, so you might count that as a 3-way superscalar.

The exec stage might be pipelined (like IBM has done in Power4/5), so that will take 2 cycles for simple instructions (add, sub, boolean ops etc), loads and stores will take 3 or 4 cycles (guesstimate). Floating point operations will take 3 or 4 cycles (for fadd and fmul).

And then one cycle for result write back.

So you'll see something like 4-6 cycles for simple ops, and 6-8 cycles for floating point/memory operations.
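As a sanity check, the per-stage guesses above can simply be summed. This is only a sketch: the stage names, the dictionary layout and the `total_latency` helper are illustrative, and the cycle counts are the guesstimates from this post, not figures from any IBM or Sony document.

```python
# (min, max) cycles per stage, as guessed in the post above
FRONT_END = (1, 2)    # fetch + decode + issue in one cycle, two at most
WRITE_BACK = (1, 1)   # one cycle for result write back
EXEC = {
    "simple (add, sub, boolean)": (2, 2),
    "load/store": (3, 4),
    "floating point (fadd, fmul)": (3, 4),
}

def total_latency(exec_range):
    """Add front end, execute and write-back ranges end to end."""
    low = FRONT_END[0] + exec_range[0] + WRITE_BACK[0]
    high = FRONT_END[1] + exec_range[1] + WRITE_BACK[1]
    return low, high

for kind, cycles in EXEC.items():
    print(kind, total_latency(cycles))
```

Summed this way the stages give 4-5 cycles for simple ops and 5-7 for floating point and memory ops, a touch under the 4-6 and 6-8 quoted, so the quoted ranges are best read as leaving about a cycle of slack.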

Fafalada said:
Jaws said:

The PPC 970FX is the dual issue, SOI, strained silicon, higher clocking, 90 nm modified version of the G5/PPC 970 core...

Er, last I checked 970FX was still 8-way superscalar.
Anyway, why not something like a modified PPC440 instead - it's lightweight, already 2-way SS, and unlike the 970 family it doesn't come with a ton of baggage that has no real purpose in a console.

The PPC 970 can fetch 8 instructions per cycle, but can only decode/issue 5 of which one is a branch, so in reality it's only a 4 or 5 issue superscalar.

As for PPC 440, I think that is too simple (being single issue). You'd want to be able to co-issue load/store instructions to keep your execution units fed.

What I described above is more akin to the PPC 603, but with a SIMD exec unit.

Cheers
Gubbi

Deepak · Aug 24, 2004

Link

"We will be assessing the impact of this new IBM technology on systems along the entire cellular computing cascade. At the high end of this cascade are supercomputers, in the middle are workstations and servers, and at the bottom are personal gaming and other consumer entertainment devices -- all based on a common compute cell publicly referred to as Cell processor. Over the past five years IBM and Sony have concurrently engaged in bringing the new class computing cascade into commercial life within about the same 2005-2006 period; at the high end is IBM's world performance shattering super-computing system, the P-machine (for Petaflop, 10 to the 15th power floating point operations per second), and at the low end is Sony's super-entertainment system (PlayStation 3-class gaming console and related systems)."

Panajev2001a · Aug 24, 2004

Jaws said:
Hmmm...I wonder what this next Linux / Cell OS kit would be like... I hope they don't lock things down like the current kit...

Hey, be nice: they only locked the DVD drive, the I/O CPU and the SPU2 from direct control, but at least for the I/O CPU and the SPU2 they have exposed them a bit through Linux (/dev/dsp1 and /dev/dsp2 for the SPU2 [these two are working, there should be more]).

The EE and GS are fully exposed to the programmer, including the DMAC thanks to a library called SPS2 (this library does minimal set-up work for you: it is a library that allows you to access the hardware at a very low level, it is not a 3D engine built for you even though you can easily find material to start in the samples or in other projects on the PlayStation 2 Linux website):

http://playstation2-linux.com/projects/sps2/

Panajev2001a · Aug 24, 2004

The PPC 970 can fetch 8 instructions per cycle, but can only decode/issue 5 of which one is a branch, so in reality it's only a 4 or 5 issue superscalar.

No, the 5 thingies it tracks down are not simple instructions, but instruction groups composed of one or more iops (some can be NOPs I agree).

Let's see what IBM has to say about it:

The PowerPC 970 can dispatch five instructions (including one branch) per cycle, and can issue up to eight instructions to the execution units per cycle.

http://www-3.ibm.com/chips/products/powerpc/newsletter/dec2002/newproductfocus2.html

As I noted above, IBM's PowerPC 970 fetches 8 instructions per cycle from the L1 cache into an instruction queue, from which the instructions are pulled for decoding at a rate of 8 per cycle.

[...]

Almost all the PowerPC ISA instructions, with a few exceptions, translate into exactly one IOP. Of the instructions that translate into more than one IOP, IBM distinguishes two types:

* A cracked instruction is an instruction that splits into exactly 2 IOPs.
* A millicoded instruction is an instruction that splits into more than 2 IOPs.

This difference in the way instructions are classified is not arbitrary. Rather, it ties into a very important design decision that the Power4's designers made regarding how the chip tracks instructions at various stages of execution. Before I explain this decision and the impact that it has on the 970, I should first recap how instructions normally move from a processor's front end to its execution core.

[...]

If you take a look at the middle of the large PPC 970 diagram from the beginning of the article, then you'll notice that right below the "decode, cracking, and group formation" phase I've placed a group of five boxes. These five boxes represent what IBM calls a "group", and each "group" consists of five IOPs arranged in program order according to certain rules and restrictions. It is these organized and packaged groups of five IOPs, and not single IOPs in isolation, that the 970 dispatches in-order to the six issue queues in its execution core. Once the IOPs in a group reach their proper issue queues, they can then be issued out of order to the execution units at a rate of 8 IOPs/cycle for all the queues combined. Before they reach the completion stage, however, they need to be placed back into their group so that an entire group of 5 IOPs can be completed each cycle.

I probably shouldn't go any further in discussing how these groups work without first explaining the reason for their existence. By assembling IOPs together into specially ordered groups of five for dispatch and completion, the 970 can track these groups, and not individual IOPs, through the various stages of execution. So instead of tracking 100 individual IOPs in-flight as they work their way through the 100 or so execution slots available in the execution core, the 970 need track only 20 groups. IOP grouping, then, significantly reduces the overhead associated with tracking and reordering the huge volume of instructions that can fit into the 970's "deep and wide" design.

The price the 970 pays for this reduced overhead is a loss of execution efficiency brought on by the reduced granularity of control that comes from being able to dispatch, schedule, issue, and complete instructions on an individual basis. Let me explain.

When the 970's front end assembles an IOP group there are certain rules it must follow. The first rule is that the group's five slots must be populated with IOPs in program order, starting with the oldest IOP in slot 0 and moving up to newest IOP in slot 4. Another rule is that all branch instructions must go in slot 4, and slot 4 is reserved for branch instructions only. This means that if the front end can't find a branch instruction to put in slot 4, then it can issue one less instruction that cycle. Similarly, there are some situations in which the front end must insert noops into the group's slots in order to force a branch instruction into slot 4. "Noop" (pronounced "no op") is short for "no operation"--it's a kind of non-instruction instruction that means "do nothing". In other words, the front end must sometimes insert empty execution slots, or pipeline bubbles, into the instruction stream in order to make the groups comply with the rules.

The above rules aren't the only ones that must be adhered to when building groups. Another rule dictates that instructions destined for the conditional register unit (CRU) can go only in slots 0 and 1. And then there are the rules dealing with cracked and millicoded instructions. From IBM's Power4 whitepaper:

Cracked instructions flow into groups as any other instructions with one restriction. Both IOPs must be in the same group. If both IOPs cannot fit into the current group, the group is terminated and a new group is initiated. The instruction following the cracked instruction may be in the same group as the cracked instruction, assuming there is room in the group. Millicoded instructions always start a new group. The instruction following the millicoded instruction also initiates a new group.

And that's not all! A group has to have the following resources available before it can even dispatch to the core. If just one of the following resources is too tied up to accommodate the group or any of its instructions, then the entire group has to wait until that resource is freed up before it can dispatch.

* Group Completion Table entry: The GCT is the 970's equivalent of a reorder buffer. The GCT has 20 entries for keeping track of 20 active groups as the groups' constituent instructions make their way through the ~100 execution slots available in the execution core's pipelines. Regardless of how few instructions are actually in the execution core at a given moment, if those instructions are grouped so that all 20 GCT entries happen to be full then no new groups can be dispatched.
* Issue Queue slot: If there aren't enough slots available in the appropriate issue queues to accommodate all of a group's instructions, then the group must wait to dispatch. (In a moment I'll elaborate on what I mean by "appropriate issue queues".)
* Rename Registers: There need to be enough register rename resources available so that any instruction which requires register renaming can issue when it's dispatched to its issue queue.

Again, when it comes to the above restrictions, one bad instruction can spoil the whole bunch.

Because of its use of groups the 970's dispatch bandwidth is sensitive to a whole complex host of factors, not the least of which is a sort of "internal fragmentation" of the group completion table that could potentially arise and needlessly choke dispatch bandwidth if too many of the groups in the GCT are partially or mostly empty.
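To make the grouping rules above concrete, here is a simplified sketch of a group former. It is not IBM's logic: `Instr`, `form_groups` and the `EMPTY` marker are hypothetical names, the CRU slot rule and the dispatch resource checks (GCT, issue queues, rename registers) are left out, and splitting a long millicoded sequence into chunks of four IOPs is an assumption made only to keep the example small.

```python
from dataclasses import dataclass

NON_BRANCH_SLOTS = 4   # slots 0-3; slot 4 is reserved for a branch
EMPTY = "-"            # an unused slot (a bubble)

@dataclass
class Instr:
    name: str
    iops: int = 1           # 1 = simple, 2 = cracked, more than 2 = millicoded
    is_branch: bool = False

def form_groups(instrs):
    """Pack instructions into five-slot groups in program order."""
    groups = []
    current = []            # IOPs placed in slots 0-3 of the open group

    def close(branch=None):
        nonlocal current
        slots = current + [EMPTY] * (NON_BRANCH_SLOTS - len(current))
        slots.append(branch if branch is not None else EMPTY)
        groups.append(slots)
        current = []

    for ins in instrs:
        if ins.is_branch:
            # A branch always lands in slot 4 and ends the group
            close(branch=ins.name)
        elif ins.iops > 2:
            # Millicoded: starts a new group, and the next instruction does too
            if current:
                close()
            parts = [f"{ins.name}.{i}" for i in range(ins.iops)]
            for start in range(0, len(parts), NON_BRANCH_SLOTS):
                current = parts[start:start + NON_BRANCH_SLOTS]
                close()
        else:
            # Simple or cracked: all of its IOPs must fit in the same group
            if len(current) + ins.iops > NON_BRANCH_SLOTS:
                close()
            if ins.iops == 1:
                current.append(ins.name)
            else:
                current.extend(f"{ins.name}.{i}" for i in range(ins.iops))
    if current:
        close()
    return groups

program = [
    Instr("add"), Instr("lwz"), Instr("cracked", iops=2), Instr("mul"),
    Instr("sub"), Instr("bne", is_branch=True),
    Instr("milli", iops=3), Instr("or"),
]
for group in form_groups(program):
    print(group)
```

Even in this toy version, a handful of instructions leaves several slots empty, which is the "internal fragmentation" the article describes.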

[...]

You can't really get a full picture of what the 970 offers until you examine its execution core and issue queues. The 970 offers twelve (depending on how you count them) execution units for doing the actual grunt work of executing instructions, and though twelve is a relatively large number, a simple enumeration of execution resources doesn't tell you nearly as much as an examination of how those resources are organized.

[...]

http://arstechnica.com/cpu/02q2/ppc970/ppc970-4.html

Gubbi · Aug 24, 2004

Panajev2001a said:
The PPC 970 can fetch 8 instructions per cycle, but can only decode/issue 5 of which one is a branch, so in reality it's only a 4 or 5 issue superscalar.

No, the 5 thingies it tracks down are not simple instructions, but instruction groups composed of one or more iops (some can be NOPs I agree).

PPC 970 and Power 4 can have 20 groups in flight or 100 instructions. It dispatches the instructions in one group at a time.

Panajev2001a said:
Let's see what IBM has to say about it:

The PowerPC 970 can dispatch five instructions (including one branch) per cycle, and can issue up to eight instructions to the execution units per cycle.

IBM PPC design has traditionally used local queues (reservation stations) where others use a global instruction scheduling window. Instructions are issued (in IBM lingo - dispatched) at the rate of one group or 4 instructions (plus a branch) per cycle.

Throughput is 4 instructions + a branch and hence it's a 4 (or 5) way superscalar.

Cheers
Gubbi

Panajev2001a · Aug 24, 2004

That is not the peak it can do; you are defining its superscalarity by basically the number of IOPs in the group:

Once the IOPs in a group reach their proper issue queues, they can then be issued out of order to the execution units at a rate of 8 IOPs/cycle for all the queues combined.

j^aws · Aug 24, 2004

Fafalada said:
......

Jaws said:

The PPC 970FX is the dual issue, SOI, strained silicon, higher clocking, 90 nm modified version of the G5/PPC 970 core...

Er, last I checked 970FX was still 8-way superscalar.
Anyway, why not something like a modified PPC440 instead - it's lightweight, already 2-way SS, and unlike the 970 family it doesn't come with a ton of baggage that has no real purpose in a console.

...why doesn't anyone get my gist! ...What I'm saying is that the PPC 970FX core will be modified to a dual issue core and three of those cores will make the Xe CPU ~ 3 GHz with a die around 200 mm2, IMHO!

Btw, the PPC 400 series IP has been sold by IBM to AMCC : Source...

Fafalada said:
Well I just think Sony probably wants a chip that won't melt the casing of the machine if you run it more then 30minutes.

Well, you need to get some of these cooling towers...I just bought some from e-bay, apparently you need one for each PE.

DeanoC · Aug 24, 2004

Jaws said:
...why doesn't anyone get my gist! ...What I'm saying is that the PPC 970FX core will be modified to a dual issue core and three of those cores will make the Xe CPU ~ 3 GHz with a die around 200 mm2, IMHO!

No, we get your gist, but it's just not remotely true ;-)

j^aws · Aug 24, 2004

Panajev2001a said:
Jaws said:

Hmmm...I wonder what this next Linux / Cell OS kit would be like... I hope they don't lock things down like the current kit...

Hey, be nice: they only locked the DVD drive, the I/O CPU and the SPU2 from direct control, but at least for the I/O CPU and the SPU2 they have exposed them a bit through Linux (/dev/dsp1 and /dev/dsp2 for the SPU2 [these two are working, there should be more]).

The EE and GS are fully exposed to the programmer, including the DMAC thanks to a library called SPS2 (this library does minimal set-up work for you: it is a library that allows you to access the hardware at a very low level, it is not a 3D engine built for you even though you can easily find material to start in the samples or in other projects on the PlayStation 2 Linux website):

http://playstation2-linux.com/projects/sps2/

Nah, I wasn't being nasty or anything...I've always thought their 'hobbyist' dev kits were a great idea ever since the Net Yaroze. I was waiting for the PS2 kit but what ultimately put me off it was that you couldn't expand the RAM from the base 32 MB!

... I could live with the other flaws I suppose. It would be nice if they made the PS3's RAM expandable for its Linux / Cell OS kit...and let you play DVDs, games etc. without having to re-boot the damn thing! :? ...and a DVI output would be cool also 8)

j^aws · Aug 24, 2004

DeanoC said:
Jaws said:

...why doesn't anyone get my gist! ...What I'm saying is that the PPC 970FX core will be modified to a dual issue core and three of those cores will make the Xe CPU ~ 3 GHz with a die around 200 mm2, IMHO!

No, we get your gist, but it's just not remotely true ;-)

Bah...stop hiding behind your fake NDA!

PS: Hehehe...Am I warm or cold, if I said PPC 976?

PPS: Or 3 Z80 superscalar 'Power Cores'?

DeanoC · Aug 24, 2004

Jaws said:
PS: Hehehe...Am I warm or cold, if I said PPC 976?

Honestly don't have a clue. But I suspect cold.

Console CPU != Desktop or Server CPU.

Different sets of requirements, code and applications to run.

Gubbi · Aug 24, 2004

Panajev2001a said:
That is not the peak it can do; you are defining its superscalarity by basically the number of IOPs in the group:

Once the IOPs in a group reach their proper issue queues, they can then be issued out of order to the execution units at a rate of 8 IOPs/cycle for all the queues combined.

Superscalarity is not defined by peak but by throughput.

It's defined by min(fetch, decode, issue/dispatch, exec, retire), not max(fetch, decode, issue/dispatch, exec, retire)
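A small sketch of that min-versus-max distinction, plugging in the PPC 970 widths quoted in this thread (fetch and decode at 8 per cycle and issue at up to 8 per the Ars Technica excerpt, dispatch and completion at one five-slot group per cycle). The dictionary and helper names are made up for illustration.

```python
# Per-cycle stage widths for the PPC 970, as quoted earlier in this thread
STAGE_WIDTH = {
    "fetch": 8,
    "decode": 8,
    "dispatch": 5,   # one group: 4 instructions plus a branch
    "issue": 8,      # out of order, all issue queues combined
    "complete": 5,   # one group per cycle
}

def sustained_width(widths):
    """Throughput is bounded by the narrowest stage."""
    return min(widths.values())

def peak_width(widths):
    """The widest stage, which is what an '8-issue' label reflects."""
    return max(widths.values())

print("sustained:", sustained_width(STAGE_WIDTH))
print("peak:", peak_width(STAGE_WIDTH))
```

Under these numbers both sides of the argument are describing the same chip: 8 is the widest stage, 5 (4 plus a branch) is the narrowest.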

Cheers
Gubbi

archie4oz · Aug 24, 2004

PPC 970 and Power 4 can have 20 groups in flight or 100 instructions. It dispatches the instructions in one group at a time.

Actually the instruction window is 200 instructions (the 100 just relates to the GCT). And yes it dispatches (and completes) 1 group at a time.

However...

IBM PPC design has traditionally used local queues (reservation stations) where others use a global instruction scheduling window. Instructions are issued (in IBM lingo - dispatched) at the rate of one group or 4 instructions (plus a branch) per cycle.

It still uses local queues (5 in Power4, 6 in the 970), and instructions are issued out of order at up to 8 per clock, hence IBM referring to it as an 8-issue design. If you have a problem with their definition, take it up with them as they're the ones advertising it as an 8-issue design...

PS: Hehehe...Am I warm or cold, if I said PPC 976?

I agree with Deano... Cold...

AzBat · Aug 24, 2004

Xenon Hardware Overview from Xbox-Scene.com said:
The Xenon CPU is a custom processor based on PowerPC technology. The CPU includes three independent processors (cores) on a single die. Each core runs at 3.5+ GHz. The Xenon CPU can issue two instructions per clock cycle per core. At peak performance, Xenon can issue 21 billion instructions per second.

The Xenon CPU was designed by IBM in close consultation with the Xbox team, leading to a number of revolutionary additions, including a dot product instruction for extremely fast vector math and custom security features built directly into the silicon to prevent piracy and hacking.

Each core has two symmetric hardware threads (SMT), for a total of six hardware threads available to games. Not only does the Xenon CPU include the standard set of PowerPC integer and floating-point registers (one set per hardware thread), the Xenon CPU also includes 128 vector (VMX) registers per hardware thread. This astounding number of registers can drastically improve the speed of common mathematical operations.

Each of the three cores includes a 32-KB L1 instruction cache and a 32-KB L1 data cache. The three cores share a 1-MB L2 cache. The L2 cache can be locked down in segments to improve performance. The L2 cache also has the very unusual feature of being directly readable from the GPU, which allows the GPU to consume geometry and texture data from L2 and main memory simultaneously.

Xenon CPU instructions are exposed to games through compiler intrinsics, allowing developers to access the power of the chip using C language notation.
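The "21 billion instructions per second" figure in the overview is just the product of the numbers it lists. A minimal sketch of that arithmetic, taking the clock as exactly 3.5 GHz (the overview says "3.5+ GHz") and using a hypothetical helper name:

```python
def peak_issue_rate(cores: int, clock_hz: int, issue_width: int) -> int:
    """Peak instructions issued per second across all cores."""
    return cores * clock_hz * issue_width

CORES = 3
CLOCK_HZ = 3_500_000_000    # 3.5 GHz, the lower bound given in the overview
ISSUE_WIDTH = 2             # instructions per clock per core
THREADS_PER_CORE = 2

rate = peak_issue_rate(CORES, CLOCK_HZ, ISSUE_WIDTH)
print(rate / 1e9, "billion instructions per second at peak")
print(CORES * THREADS_PER_CORE, "hardware threads")
```

Note that the two-per-clock issue width and the two hardware threads per core are separate numbers in the overview; the peak figure uses only the issue width.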

I don't put much faith in the Xenon Block Diagram or the Xenon Hardware Overview. However, it seems a lot of people do and are using it to define the Xbox2 CPU. Now I don't completely understand CPU technology, but I'll see if I can explain it well enough to get my question across. Hopefully, somebody here can answer it.

First, as defined by the Xenon Hardware Overview, SMT equals "symmetric hardware threads". I've also seen SMT equal "Simultaneous Multi-Threading" and SMP equal "Symmetric Multi-Processing". I think this leads to problems that Jaws and others are having when they believe that either a PPC970 or Power5 derivative processor more accurately describes the Xbox2 CPU. They believe the PPC970 series only executes one SMT per core and that the Power5 executes two SMT per core.

Are you guys stating that the use of "two symmetric hardware threads" and "2 instructions" per core used in the overview above actually compares to "2-way/issue superscalar"? If that's the case, are you saying that the Xenon Hardware Overview is not describing a PPC900 series or Power5 derivative processor? Is it more likely to be describing a derivative of the embedded PowerPC 400 series or something else?

Tommy McClain

ERP · Aug 24, 2004

I don't think they're explicitly saying anything other than taking a little leaked info and jumping to conclusions that make sense to you doesn't often equate to reality.

AzBat · Aug 24, 2004

ERP said:
I don't think they're explicitly saying anything other than taking a little leaked info and jumping to conclusions that make sense to you doesn't often equate to reality.

LOL, agreed. I'm just trying to understand why folks like Jaws, TeamXbox, etc. believe (technically) that the PPC970 series or Power5 is more likely than something like the PowerPC 400 series. Personally I'm not sure what to think with regard to the CPU other than it's made by IBM. LOL Considering we're almost 10 months since the Microsoft/IBM announcement, it's amazing we don't know more than we do.

Tommy McClain

j^aws · Aug 24, 2004

AzBat said:
ERP said:

I don't think they're explicitly saying anything other than taking a little leaked info and jumping to conclusions that make sense to you doesn't often equate to reality.

LOL, agreed. I'm just trying to understand why folks like Jaws, TeamXbox, etc. believe (technically) that the PPC970 series or Power5 is more likely than something like the PowerPC 400 series. Personally I'm not sure what to think with regard to the CPU other than it's made by IBM. LOL Considering we're almost 10 months since the Microsoft/IBM announcement, it's amazing we don't know more than we do.

Tommy McClain

Hehehe...just a bit of harmless fun...

So in conclusion...
...it's not the PPC 400 series, IBM has sold the IPs as mentioned above...

...it's not the PPC 970/970fx/976/980 series as it's server class and prone to excessive heat with three cores in a console...

...it's not the Power 4/4+/5 series as they're server class also...

so...dun dun dun...

...it's the PowerPC 300/350 series that are actually linked with the Cell PUs...It's for the embedded market to replace the ageing PPC 400 series, low power and heat generation for 3cores in a console and will likely be dual issue like the 400 series...

About 10 months ago, IBM began to lay out the foundation for a major architectural revamp to the PowerPC line that will allow chips to run much cooler for mobile and embedded applications, sources recently told AppleInsider.

The first fruits of the endeavor are rumored to include a 64-bit PowerPC 300 series built on a 65-nanometer (nm) process. The series will reportedly tip off with the PowerPC 350 due out in mid-2005, and will be followed by the 45-nm PowerPC 360 in 2007, according to preliminary documents detailing IBM's PowerPC roadmap.

Sources were unsure if the PowerPC 300 series would debut as a variant of the cell processor, but did confirm that it will utilize the PowerPC instruction set. The chips will reportedly consist of many specialized cores--each handling one or two instructions--connected together by an ultra-wide high bandwidth on-chip-fabric capable of processing 128GB of data per second.

A subsequent report on IBM's 65-nm process claims that IBM's Fishkill fab has been tooled, and engineers have recently begun to produce experimental parts, and work out the bugs. "The first primordial [65-nm] components should be produced soon," sources said, "just slightly behind those of Intel."

However, it's expected to take approximately 18 months before Phase III of East Fishkill fab will be capable of producing complex microprocessors on a 65-nm process.

In August, IBM announced that it had entered into a multi-year agreement with Infineon Technologies AG and Chartered Semiconductor Manufacturing to speed up the process of 65-nanometer chip development, and later 45-nm chips. The announcement also cited a focus on variants tuned for high performance and low power -- presumably the PowerPC 300 series.

AppleInsider...

I personally believe those cores are too large for the Cell PUs and they will likely be some custom / clean sheet core? Then again, they still could be based on them :? ... the PPC 300 could be the Cell PUs and the PPC 350 could be the Xenon cores, but at 65 nm and not 90 nm :?
