CELL Patents (J Kahle): APU, PU, DMAC, Cache interactions?

guess #2
One vector processor that executes instructions on 4 sets of 4x32-bit registers per cycle. Which means the APU must be able to transfer 512 bits of data to the registers from APU local memory/cache really, really fast to be useful (a 512-bit wide bus for APU local mem?). Not all code will benefit from this. But if a large portion (?%) of executed gaming code can be constructed into SIMD operations of this form, it will be useful. Of course, to be really useful it is highly desirable that the 'CPU core' of the APU can interact well with this VP, so that complicated instructions can be executed more easily. BTW, with 4x32-bit registers we can store all the vertices of a triangle plus its normal. (Obviously I speak as an outsider to the discipline of 3D graphics.) This APU would be able to hit 32 GFLOPS at 1 GHz. Of course, if STI can guarantee really high mem <-> register speed and a more aggressive VP, they may be able to deliver the numbers at 500 MHz...
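To sanity-check those numbers (just my own back-of-the-envelope sketch in C, nothing taken from the patents), one hypothetical 4-wide FMAC pipe doing a*b+c per lane per cycle would look roughly like this:

/* hypothetical 4-lane fused multiply-add: a*b+c per lane = 2 flops per lane */
typedef struct { float x, y, z, w; } vec4;

static inline vec4 vfmadd(vec4 a, vec4 b, vec4 c)
{
    vec4 r;
    r.x = a.x * b.x + c.x;
    r.y = a.y * b.y + c.y;
    r.z = a.z * b.z + c.z;
    r.w = a.w * b.w + c.w;
    return r;
}

That's 4 lanes x 2 flops = 8 flops per pipe per cycle, so four such pipes at 1 GHz give 4 x 8 x 1e9 = 32 GFLOPS, matching the figure above. And loading four such 128-bit register sets per cycle from local memory would be 4 x 128 = 512 bits, which is where the 512-bit bus worry comes from.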


Vector processing works great with data-parallel tasks, but for code that doesn't have instruction-level parallelism, all you need is a simplified scalar core. The simpler you make the architecture (the CPU core and vector pipes), the more energy-efficient the overall design will be, hopefully leading to more performance per watt. Vector processing will work out great with lots of eDRAM.

My guess with CELL is that Sony is saying, superscalar processors should be buried in a graveyard and Ken Kutaragi has a nice tombstone waiting for them.
 
No, the fundamental idea behind Cell is to provide one single architecture with a common make-up and instruction set that scales from rock bottom to cutting-edge high-end performance...

exactly. thank you. I am so tired of explaining to people (on other boards) that "the Cell" is not one specific processor, but an architecture from which a vast range of specific Cell-based processors can be built.

The broadband engine will be a monstrous chip no matter what, that much is certain. What we're arguing over is exactly HOW monstrous it will be!

exactly, exactly. thanks again Guden.
 
ERP said:
FMACs are assumed because they are actually useful operations.

What do you intend to accomplish with an FDMAD?
....

I wasn't implying that the FMACs weren't useful or anything... my motivation was to 'see' if there were other assumptions that could be valid and could show (or Sony could show) the 'same' flops rating but at a lower clock speed.

The reason I mentioned 'division' was that it's conspicuous by its absence in the APU from our assumptions, especially when the PS2's EE has 2 FDIVs for VU1 and 1 FDIV for VU0. Doesn't this seem a bit 'strange'?

ERP said:
Also an FMAD is

Ans = Arg1*Arg2+Arg3

at 128 bits/argument I count 384 bits of input and 128 bits of output...Hmmm a lot like your diagram.

You're right, or we could have 12 32-bit inputs and 4 32-bit outputs as I mentioned above. I'm not too hot on my matrix theory, but couldn't the FDMAD be 'useful' for reciprocals/inverses or division of the FMAD equation above, Arg1*Arg2+Arg3, with any of 'Arg(1-3)'? So a maximum of only 3 Args are still used for 1 Ans? And if this FDMAD was still capable of FMAD, you'd basically get the 'division' for free? Or is none of this useful? :)
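For what it's worth, the usual way FMAC-only hardware gets division back 'for free' is a Newton-Raphson reciprocal refinement, so an FDIV-less APU isn't necessarily a problem. A rough sketch in C (the seed argument is hypothetical; real hardware would get it from a small lookup table or a reciprocal-estimate instruction):

/* reciprocal of d via Newton-Raphson: x_{n+1} = x_n * (2 - d * x_n),
   built entirely out of multiply / multiply-add style operations */
float recip_fmac_only(float d, float seed)   /* seed = rough guess at 1/d */
{
    float x = seed;
    for (int i = 0; i < 3; ++i) {
        float e = 2.0f - d * x;   /* FMAC-shaped: multiply plus add/subtract */
        x = x * e;                /* refine; each pass roughly doubles the correct bits */
    }
    return x;
}

So a pure-FMAC APU could still divide, just at a handful of instructions per divide rather than with dedicated FDIV units like the VUs have.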
 
passerby said:
......
As for whether such a thing is possible, I think there is a reasonable chance. After all the EE is, simply put, a CPU + 10 FPU implementation (right? can't recall)....

IIRC, 9 FMACs, 3 FDIVs and 1 FPU.

passerby said:
Of course the important question is how well this team is integrated to work together. Think about how difficult it is to get useful work out of VU0 on the EE.

Still my biggest worry... will the compiler etc. be up to scratch! :?

passerby said:
....
Out of guesses now. Visit this thread again at CELL's unveiling....

:LOL: Either way, it will be funny looking back... 'Did I actually write this s!&t?!' :LOL:
 
Brimstone said:
......
Vector processing works great with data-parallel tasks, but for code that doesn't have instruction-level parallelism, all you need is a simplified scalar core. The simpler you make the architecture (the CPU core and vector pipes), the more energy-efficient the overall design will be, hopefully leading to more performance per watt. Vector processing will work out great with lots of eDRAM.

My guess with CELL is that Sony is saying, superscalar processors should be buried in a graveyard and Ken Kutaragi has a nice tombstone waiting for them.

I'm still not sure if Cell's PUs will be superscalar or not :?

Here's a really nice article (pre-Cell, circa 2001) comparing instruction-level parallelism (ILP) and thread-level parallelism (TLP), looking at multi-threading (both simultaneous and chip multi-threading: SMT, CMT) versus VLIW (EPIC) designs....

(EV8-part3-fig3.gif: Figure 3 from the article, showing the relationship between EPIC, CMP, and SMT)


Conclusion


Over the last several years Intel and HP have heavily promoted the EPIC processor design approach, and IA-64 in particular, as the next great step in the evolution of high end processor design. It is readily apparent that relying on increasing exploitation of ILP to drive processor performance onwards and upwards is a difficult path to follow and will offer meager and hard fought for gains. It is even debatable whether or not EPIC is the best way to increase ILP exploitation, since the 12 year old postulate used to justify EPIC, the idea that superscalar processors would choke on their own complexity, is demonstrably no more true today than it was in 1989. There is also no reason that full predication, memory disambiguation, and data speculative techniques cannot also be used by superscalar RISC based architectures. This would obtain the benefit of these features without the need for static scheduling or code size expansion.


TLP is also a basic source of higher performance and the reason we have single computer systems with 16, 32, 64 or more processors in operation running large scale applications like data base management, on-line transaction processing, and simulations of physical phenomena for scientific and technical applications. CMP is the obvious approach to using TLP to increase MPU performance. But SMT increases TLP exploitation of a uniprocessor MPU by modestly building on the mechanisms of speculative out-of-order execution already in place in high-end processors. For a given set of execution resources (functional units, caches, TLBs etc), SMT provides better single thread performance and multithread performance than CMP. The disadvantage is increased design effort and time to market. The relationship between EPIC, CMP, and SMT is shown in Figure 3.

The SMT processor can be thought of as the multi-fuel engine of computer architectures. When high levels of ILP is present in the workload an EV8-like SMT can use its wide issue width, and deep, speculative out-of-order execution to help exploit ILP nearly as well as an aggressive EPIC processor. When high levels of TLP is present, then the SMT can exploit it more adroitly than a CMP. In contrast, CMP cannot generally exploit high ILP content in workloads while EPIC cannot exploit high TLP content in workloads. SMT seems to be the best approach to use to design a general purpose microprocessor.


Intel and HP argue that CMP and SMT are techniques that can eventually be applied to IA-64 processors once the ILP well runs dry. A CMP IA-64 processor may appear relatively soon because ILP-based performance increases from wider issue fall off rapidly, especially for an in-order processor. A CMP with dual 6 issue wide IA-64 processor cores might prove superior to a single 12 issue wide design for many applications, especially if EPIC compiler technology development stalls. On the other hand, applying SMT techniques to IA-64 appears very, very difficult. Not only do IA-64 implementations deliberately avoid the superscalar implementation infrastructure that SMT builds on, the huge architected state of IA-64 (128 GPRs, 128 FP registers etc.) would mean support for extra threads would greatly increase the size and/or number of physical register files which could hurt clock rates. Other complex elements like the register stack engine would likely have to be replicated on a per thread basis. It is ironic that an SMT enabled EPIC MPU would need to accrue far more hardware complexity than that which Schlansker and Rau originally sought to avoid.


The Alpha EV8 is an exciting new design for several reasons. It is by far the most aggressive speculative out-of-order execution superscalar RISC processor yet proposed. It will exploit SMT, arguably the most important new development in computer architecture in the last ten years, to double its sustained throughput to 8 to 10 billion instructions per second. When the EV8 first ships (2002?) it should drop easily into Compaq’s then existing high performance computing platforms built around the EV7 and its on-chip dual 4 channel direct Rambus memory controllers and four 6.4 GB/s interprocessor communication link channels. It is hard to imagine what other architecture or platform could come close to challenging single or multiple processor EV8 systems in raw performance. But the onus is on Compaq to execute their high-end product strategy much more effectively then they have since acquiring DEC and Alpha in order for this technology to have the impact in the marketplace that it deserves.

Full article, based around the Alpha EV8: http://www.realworldtech.com/page.cfm?ArticleID=RWT011601000000

The article predicts bold things for superscalar (and the EV8 in particular), which makes me think that STI must be supremely confident in Cell in order to progress with the project.

I think the Xenon CPU falls under the SMT/CMT curve on that graph. Not sure where Cell would be, especially not knowing whether the PUs are superscalar or not; prolly closer to EPIC or a new class, a hybrid EPIC/SMT/CMT :?
 
Jaws said:
......
I'm still not sure if Cell's PUs will be superscalar or not :?
......
I think the Xenon CPU falls under the SMT/CMT curve on that graph. Not sure where Cell would be, especially not knowing whether the PUs are superscalar or not; prolly closer to EPIC or a new class, a hybrid EPIC/SMT/CMT :?


CELL has been called a media processor a few times. So CELL is designed to run in an environment flooded with parallelism.

Why pack on so much eDRAM? Vector processing.

Why would CELL use a VLIW core? You get larger code size and the environment already has plenty of data parallelism. Plus the additional transistors spent on cache, design complexity, and power/heat.

A superscalar or VLIW core with the SIMD "band-aid" is a waste. Cut to the chase and design an architecture that knows it's going to crunch multimedia code from the get-go and cuts through it like a hot knife through butter. Hence a simple scalar core combined with massive vector processing firepower.
 
Unfortunately there is no sign of multithreading support in the CELL architecture (according to the issued patents) so far..
 
nAo said:
Unfortunately there is no sign of multithreading support in the CELL architecture (according to the issued patents) so far..

I don't understand? 64 MB of eDRAM should give plenty of low-latency bandwidth for vector processing. Vector instructions are all about parallelism.

The CELL PS3 compiler searches the game code for parallelism and turns it into vector instructions; what's left over is scalar.
 
Brimstone said:
I don't understand? 64 MB of eDRAM should give plenty of low-latency bandwidth for vector processing. Vector instructions are all about parallelism.
The main problem here is not just eDRAM (or SRAM) latency but also the latency of the ALU calculations themselves.
On the PS2 VUs, where we have single-cycle memory/register access, we already have to process more than one vertex at a time to fill all the free ALU slots caused by vector/scalar (DIV!!) instruction latency.
Now we expect APU pipelines to be longer than the VUs' pipelines; moreover, we expect increased memory access latency too, due to the higher clock.
I'm not sure about that.. we're missing a lot of details here.. but we can expect a big number of 'free' slots to fill with meaningful calculations.
Maybe with very long/complex (vertex) shaders this will not be a problem.. but we don't want to code a shader to juggle 10 vertices at the same time, do we? ;)
What about shaders that use texture sampling? I don't want to think about that now..
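In code terms the trick looks something like this (pseudo-C with made-up vec4 helpers, just to show the shape of the scheduling, nothing more):

/* made-up 4-wide helpers; assume vmul goes through a multi-cycle FMAC-style pipe */
typedef struct { float x, y, z, w; } vec4;
vec4 vmul(vec4 a, vec4 b);
vec4 vadd(vec4 a, vec4 b);

void transform_naive(vec4 *out, const vec4 *v, vec4 s, vec4 t, int n)
{
    /* every op waits on the previous result, so an N-cycle pipeline
       leaves N-1 empty issue slots between dependent instructions */
    for (int i = 0; i < n; ++i)
        out[i] = vadd(vmul(v[i], s), t);
}

void transform_interleaved(vec4 *out, const vec4 *v, vec4 s, vec4 t, int n)
{
    /* two independent vertices in flight: the second vmul issues into
       the slots where the first result is still cooking */
    for (int i = 0; i + 1 < n; i += 2) {
        vec4 a = vmul(v[i], s);
        vec4 b = vmul(v[i + 1], s);
        out[i]     = vadd(a, t);
        out[i + 1] = vadd(b, t);
    }
}

The longer the pipelines get, the more independent vertices you need in flight to fill the slots, which is exactly the '10 vertices at the same time' worry.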

The CELL PS3 compiler searches the game code for parallelism and turns it into vector instructions; what's left over is scalar.
I doubt it will work that way..
 
The CELL PS3 compiler searches the game code for parallelism and turns it into vector instructions; what's left over is scalar.
Some time ago I spent a few weeks playing around with compilers, including gcc, icc, ecc, trying to auto-generate vector code. Yes, they can auto-generate vector code - but only for the really, really obvious cases. Those important loops that suck up all the computation time only need a tiny bit of data ambiguity for the compiler to give up. I don't really blame the compiler - how is it supposed to know? Anyway, I concluded I was better off programming vector code manually if I really wanted performance. Auto-vectorization by compilers is just a slight bonus for applications that don't have much 'vectorizable' code, or where it is not performance-critical but 'would be nice'.
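To illustrate what I mean (a made-up minimal example in C, not taken from any of those compilers' docs): the first loop below is the 'really, really obvious' kind the auto-vectorizers handled fine; the second has just enough data ambiguity (the pointers might alias) that they typically gave up and emitted scalar code:

/* obvious case: restrict promises dst and src never overlap,
   so the compiler can safely emit wide SIMD loads/stores */
void scale(float *restrict dst, const float *restrict src, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i];
}

/* same loop, but dst and src could overlap; without proof of
   independence the auto-vectorizer usually falls back to scalar code */
void scale_maybe_aliased(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i];
}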
 
passerby said:
The CELL PS3 compiler searches the game code for parallelism and turns it into vector instructions; what's left over is scalar.
Some time ago I spent a few weeks playing around with compilers, including gcc, icc, ecc, trying to auto-generate vector code. Yes, they can auto-generate vector code - but only for the really, really obvious cases. Those important loops that suck up all the computation time only need a tiny bit of data ambiguity for the compiler to give up. I don't really blame the compiler - how is it supposed to know? Anyway, I concluded I was better off programming vector code manually if I really wanted performance. Auto-vectorization by compilers is just a slight bonus for applications that don't have much 'vectorizable' code, or where it is not performance-critical but 'would be nice'.

That was for SIMD extensions, right? I very much doubt the PS3 is going to have the SIMD "band-aid" approach.


SIMD extensions are not vector instructions. A compiler working on code for a CISC, RISC, or VLIW processor with SIMD extensions is dealing with something totally different from a vector architecture.
 
Brimstone said:
That was for SIMD extensions, right? I very much doubt the PS3 is going to have the SIMD "band-aid" approach.

SIMD extensions are not vector instructions. A compiler working on code for a CISC, RISC, or VLIW processor with SIMD extensions is dealing with something totally different from a vector architecture.

They are short vector instructions. The fact that the ALU operations are decoupled from the actual data streaming in a SIMD extended microprocessor has little influence on the actual analysis the compiler has to do to generate vector code. - For one instruction set or the other.

Cheers
Gubbi
 
nAo said:
but we don't want to code a shader to juggle 10 vertices at the same time, do we? ;)

I'll put money on there being even more latency than that......
If you look at the number of registers in the patents, they're probably a good indicator: a 4-fold increase in registers would, to me, imply a 4-fold increase in latency, since the extra registers are mostly there to keep more independent results in flight.
But I have no real info so it's all speculation.

nAo said:
What about shaders that use texture sampling? I don't want to think about that now..

Me neither; I'm really hoping for a more conventionally architected rendering unit.
 
ERP said:
I'll put money on there being even more latency than that......
If you look at the number of registers in the patents, they're probably a good indicator. A 4-fold increase in registers would, to me, imply a 4-fold increase in latency.

But how is that a bad thing in the overall scheme of things? Correct me where I'm wrong, but if you're using a Control Processor to handle the thread management (DMAC access and APU tasking), such a latency increase presumably means you can clock the Control Processor at a slower speed than the calculation units.
 
Vince said:
ERP said:
I'll put money on there being even more latency than that......
If you look at the number of registers in the patents, they're probably a good indicator. A 4-fold increase in registers would, to me, imply a 4-fold increase in latency.

But how is that a bad thing in the overall scheme of things? Correct me where I'm wrong, but if you're using a Control Processor to handle the thread management (DMAC access and APU tasking), such a latency increase presumably means you can clock the Control Processor at a slower speed than the calculation units.

It depends on whether you have to manage the latency by hand like you do on the PS2. Personally, about 4 verts in flight is about all I can manage by hand; I start to lose track of registers after that.

You can of course just provide tools to do the interleaving; VCL does this for VU code. Although they'll have to do better than the pure brute-force approach VCL uses if they want such a tool to be useful.
 
Vince said:
But how is that a bad thing in the overall scheme of things? Correct me where I'm wrong, but if you're using a Control Processor to handle the thread management (DMAC access and APU tasking), such a latency increase presumably means you can clock the Control Processor at a slower speed than the calculation units.
There is no mention of special (hw-assisted) thread management in the current CELL patents.
We'd need fine-grained multithreading (an APU should be able to switch threads every single clock tick) and we'd need a way to assign banks of registers and local memory (SRAM) to an APULET. Each thread assigned to a particular APU should be able to trigger a register bank switch every time a different thread's instructions are executed.
Again.. there is nothing like that in the patents we have examined so far..
I believe the word 'thread' is mentioned just here and there in the CELL patents but it's never really addressed.
Obviously an APU (controlled by a PU) can run multiple threads, but at this time we don't have any info on how it can be done in a useful way.
Maybe there is a lot of info we still don't know about multithreading and CELL..at least that's my personal hope ;)
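Just to make concrete the kind of thing I mean (purely speculative pseudo-C; nothing of the sort appears in the patents), a fine-grained scheme would look roughly like a barrel processor, with each APU cycling through per-apulet register banks:

#include <stdint.h>

#define N_THREADS 4   /* number of resident apulets: pure guesswork */

struct thread_ctx {
    uint32_t regs[128][4];   /* private bank of 128 x 128-bit registers */
    uint32_t sram_base;      /* this apulet's window into local SRAM */
    int      ready;          /* not stalled on DMA or a long-latency op */
};

static struct thread_ctx bank[N_THREADS];

/* every clock tick the issue logic would pick the next ready thread,
   round-robin, and run it against its own register bank */
static int pick_next_thread(int last)
{
    for (int i = 1; i <= N_THREADS; ++i) {
        int t = (last + i) % N_THREADS;
        if (bank[t].ready)
            return t;
    }
    return last;   /* nothing ready: stall */
}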

ciao,
Marco
 
Brimstone said:
The CELL PS3 compiler searches the game code for parallelism and turns it into vector instructions; what's left over is scalar.
Not gonna be that simple, I'm afraid. To properly search code for parallelism, the compiler's domain knowledge would have to be comparable to that of the programmer.
The job of 'teaching' the compiler where to look will still be up to the programmer. The best I would expect is means/tools to make that job easier, through language extensions or similar.

Anyway, the area of parallelization where manual labour will fail or seriously struggle is what ERP mentions - trying to parallelize small tasks to hide latencies. Whether the solution they plan is hw or sw based, this is what I would worry most about, not whether the compiler can auto-insert vector muladds into code :p

nAo said:
Maybe there is a lot of info we still don't know about multithreading and CELL..at least that's my personal hope
I'd say it's not only yours ;)

ERP said:
Personally, about 4 verts in flight is about all I can manage by hand; I start to lose track of registers after that.
It's not just keeping track of data in flight - the worst part is maintaining the freaking code. Every change - you have to redo the optimization steps. And what happens when you maintain a couple of dozen programs - it becomes unmanageable to do by hand, no matter how you look at it.
 
As I'm not sure whether someone has already brought up the Sony papers from this year's ISSCC 2004 (back in February) on this forum, I'll add them to this thread before I forget.

http://www.isscc.org/isscc/2004/ap/ISSCC2004_AdvanceProgram.pdf

It's a processor for PSP or Clie.

3.5 Dynamic Voltage and Frequency Management for a Low-Power Embedded Microprocessor
3:45 PM
S. Akui, K. Seno, M. Nakai, T. Meguro, T. Seki, T. Kondo, A. Hashiguchi, H. Kawahara, K. Kumano, M. Shimura
Sony, Shinagawa, Japan
A dynamic voltage and frequency management scheme that autonomously controls the clock frequency (8 to 123MHz at 0.5MHz step) and adaptively controls the voltage (0.9 to 1.6V at 0.5mV step) with a leakage power compensation effect is developed for a low-power embedded microprocessor. It achieves 82% power reduction in personal information management (PIM) application.

Now, this memory interface technology may be for Cell's RAM.

7.5 A 160Gb/s Interface Design Configuration for Multichip LSI
10:45 AM
T. Ezaki, K. Kondo, H. Ozaki, N. Sasaki, H. Yonemura, M. Kitano, S. Tanaka, T. Hirayama
Sony, Shinagawa, Japan
The Multichip LSI (MCL) comprised of both an embedded 123MHz CPU and a 64Mb memory in one package is introduced. 1300 signal lines are directly connected by microbumps between the two chips and achieve 160Gb/s signal interface performance. Both the CPU and memory are fabricated in a 0.15µm CMOS technology.

and a report about this technology.

The advance program for ISSCC 2005 should be announced towards the end of this year; then we'll see whether Sony unveils Cell at ISSCC 2005 or not.
 
Now, this memory interface technology may be for Cell's RAM.

Nice find; that technology is similar to a related patent I've read previously.

If they go this route, they won't be using eDRAM; they'll just bond the logic and memory dies together and call it the Broadband Engine. This sort of method is what I've been suspecting they'll use, but all will become clear soon, I suppose.
 
160 Gbits per second is only about 20 GB/s, 5.6 GB/s slower than the XDR memory interface the rumors mentioned (the last one rumored was 51.2 GB/s). Also, 150 nm technology?
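(Checking the arithmetic, assuming one bit per signal line per 123 MHz CPU clock: 1300 lines x 123 MHz is about 160 Gb/s, and 160 / 8 = 20 GB/s, so the figures are at least self-consistent.)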
 