Sony talked about Cell and Cell based workstations

Brimstone said:
As far as processing real-time graphics, there is no way, transistor for transistor, that Sony/IBM will trump ATI or Nvidia. A CELL-based console should be very flexible, but today we are already starting to see flexibility in VPU designs, and DX10 should allow for even more.

The overall design of the PS3 will be elegant, but I fail to see how CELL is going to give a performance advantage compared to the constantly improving designs of other companies.

The weakness of the PC model is in the partitioning of tasks between the CPU and GPU, and the clunky communication path between them. Furthermore, the CPU has an architecture that is the result of 25 years or so of evolution of a processor designed to manipulate text. The benefits of the Sony/IBM solution are not likely to be at the rasterizer end.
 
Brimstone said:
What is so ambitious about CELL compared to the continual evolution of hardware from AMD, ATI, INTEL, 3dlabs, Nvidia, Sun, and PowerVR?

How is Cell not more ambitious than any analogous design and engineering project we've seen in the consumer-oriented field in the last 5, perhaps even 50, years?

To design, from scratch, a modular architecture targeted at a pervasive broadband future in which resources (be it information, storage, or computation) can be nonlocal and shared, within tolerances, by a macroscopic and distributed "system" is nothing less than a paradigm shift in thinking, design and, quite frankly, everything. Add in the presumed performance level of a single node (a computational level of a TFLOP(+)/s per IC, with bandwidth in the hundreds or perhaps even the low thousands of MB/s) and how is that not ambitious for a 2005/2006 timeframe in a consumer device? Never have we seen such a move towards all these aspects in a single IC, a single device... and a consumer one at that.

It's an open question whether nVidia and ATI will be close in computational output and bandwidth. And, although the 3D vendors will be the most comparable with an STI part due to their bias towards computational/floating-point performance, don't forget Intel and AMD's [negative] influence on the system; they have such a low probability of reaching parity due to their architectures' legacy. See Intel's x86 architect's comments here. And with x86-64's success in the marketplace, the ability to reach parity is hovering around zero IMHO.


What I find telling is SA's recent response to MfA in the 3D Forum. SA, historically, has proven to be a very intelligent and well-informed individual, and I think his comment sums up what Cell and, by extension, PS3 is striving for -- as opposed to the current PC platform:

SA said (http://www.beyond3d.com/forum/viewtopic.php?p=284737#284737):
When you look at the computation market, GPU manufacturers have targeted the floating point performance segment far more than the CPU manufacturers.

I agree that pretty much all of the computation intensive tasks will eventually move to what is now considered the GPU. This includes all the physical simulation, collision detection, 3d graphics (including all the culling), etc.

This means the GPU of the future will need to gradually transform into a massively parallel general purpose floating point vector processor that can be programmed using standard programming languages like C++. This in turn means general purpose addressing, branching, and stack management. Something that looks more like a FASTMATH processor than a current GPU. However, one major difference compared to a CPU is that the vast majority of the transistors will be used for actual logic with large numbers of floating point ALUs rather than for on-die cache. Instead future GPUs/vector processors will likely continue to rely on very high external memory bandwidths with a small amount of on-die cache.

It also means scaling up frequencies significantly using dynamic logic.

Which shares many qualities with STI's Cell.

As I, and a few others, stated a few years back (~2001) when there was all the debate over Cell and how different it is from any architecture of that time, Cell is an embodiment of tomorrow's architecture: it's heavily biased towards computation/floating-point performance and bandwidth. Its network-aware aspects are also over the horizon, but they reflect how the future's computing environment (be it intersystem, intrasystem and local, or intrasystem and nonlocal) will be. Today's PC paradigm needs to die, and it needs to die hard - if that isn't obvious to you already. And until the unlikely event happens, as stated in that thread, of a merger between a CPU-esque and a GPU-esque company, like it or not, STI and Sony are the only game in town... and they have a four-year, several-billion-dollar headstart.

I don't intend to gloat, but I don't think I was wrong in my arguments with Joe and Dave. In fact, I can't help but feel that what we're debating, what the 3D Forum is debating, and what we're beginning to see is exactly what we stated - be it the DX10/PC's inherent architectural inferiorities or the bound resting almost solely on computational resources. IMHO, the whole concept and architecture underlying a PC needs to evolve or it will die - and that will be one hard task with the political and corporate barriers in place.
 
Almost all the innovation associated with CELL is in software: how to slice and dice a problem, both code and data, so that it can be distributed across an array of processors.

Also, I don't think it's fair to compare CELL to traditional processors; CELL is a lot less general purpose than the Opteron, P4, or PPC 970. Single-thread performance is going to be a bitch to achieve on CELL.

This also means there's a very real risk of reducing it to niche markets such as rendering or signal processing in general.

Calling CELL a network processor is SONY putting a positive spin on the fact that the memory subsystem is a packet switched network, which means fine grain memory arbitration is going to COST.

Everything depends on the toolchain and whether SONY and IBM can get it up to snuff in time. CELL isn't very good as a general-purpose CPU, and it won't do as well as dedicated graphics processors on rendering either. What it does do is put an awful lot of execution units at the programmer's (or compiler's) disposal.

Cheers
Gubbi
 
Microsoft can only bloat the same old programs we have been using for a decade so much ... it is general purpose processors which are only suited for niche applications. They are either way overpowered for what they are running (desktop applications) or hugely inefficient (games and multimedia).
 
The reason we call it a network processor is the nature of the Apulet itself and how that lends itself to natural distributed processing over any network.

Gubbi said:
This also means there's a very real risk of reducing it to niche markets such as rendering or signal processing in general.

I could see CELL-based CPUs in desktop PCs: speed up the PU in terms of clock, and have a dual-PE system with 4 APUs per PE, or 4 PEs with 4 APUs per PE ( to reduce the load on the PUs in terms of orchestrating the APUs and to increase single-thread and scalar performance ).

If you open your Task Manager you can see quite a lot of processes running at the same time ( on systems where people are not paying attention to MSCONFIG, there are tons of them ): at least 4 or more of them can be executed in parallel with few dependency issues, and with some work on the OS and the tasks we are running, we could likely get more.

I think that such an architecture would be fast enough for most of the "solved" computing problems modern Desktop PC users have to face ( Word Processing, Database traversal, Web Browsing, Multi-tasking, video encoding/decoding, graphics composition/editing ), while providing the necessary juice for the applications that really need it, the applications that cause people to upgrade their computers more often than not: games.
The mix of 3D graphics, Physics, A.I., etc... makes for a nice assortment of quite parallel-processing-friendly problems.

I would not mind running a CELL-based Desktop PC: I might not get the best single-thread performance ever or refresh Word documents at 400 fps, but in most of the applications I run it will be fast enough ( current processors are hitting a big fat wall for those applications anyway ), especially because I keep tons of different programs running at the same time. What matters the most is that in multimedia applications ( be it running 3DS Max, playing advanced 3D games, etc... ) my PC will be able to pull ahead and deliver the performance I want.

Sure, maybe running HSpice would not be the best idea, and maybe there are several specialized problems that would need faster single-thread performance, but I say it is time for PCs to branch off: multimedia PCs driven by parallel architectures on both the CPU and the GPU side ( one being a bit more general purpose than the other ), and Workstation PCs driven by your K9, Pentium V/VI processors and next-generation Itanium CPUs.

I have used PS2 Linux and I have tried using it the way a PC would be used: sure, it is not too fast, but it has only 32 MB of RAM and a 300 MHz R5900i driving it ( with nice and flexible, but not fully general-purpose, VUs ).

I imagine increasing the RAM to 256 MB and the clock rate of the EE ( a 12+ million-transistor chip ) to something like 600-800 MHz ( the original EE and GS were built on a 250 nm process, with the GS using stacked-capacitor e-DRAM rather than the newer trench-capacitor e-DRAM, which according to engineers from both Sony/SCE and Toshiba should allow for quite a bit higher clock rates in the logic portion of a chip that uses e-DRAM ), and we would see nice performance as a Desktop PC and AWESOME 3D graphics and physics processing ( a peak of 16.5 GFLOPS and a fill-rate of 6.4 GPixels/s ).

A CELL based Desktop PC might not exclude a proper GPU, depending on the power of the CELL based processor used.

Also, if you had special instructions in the APUs that catered to "texture filtering", or if we had fast TMUs in our CELL-based processor, I do not see how a DirectX 10+ GPU would be much faster in 3D processing than what CELL could achieve.

Still, I like the very versatile CELL model: I want to see a REYES renderer operating in real-time, with full micro-polygon-based rendering and stochastic AA delivering high-quality motion blur and edge AA, and CELL seems to me the best way to approach such a model... very high floating-point power to maximize polygon processing and shading performance, plus a very fast rasterizer with humongous fill-rate.

Sorry, I digressed.

For all we know, PlayStation 3, to give an example, might have a more conventional GPU, or it might have what you see in the patents as Visualizer PEs.

Even if we have a normal GPU for Shading and Texturing, a CELL based CPU can have a field day with other problems that have remained CPU bound in modern games and can surely give more than just a hand in 3D Graphics processing.
 
If you measure efficiency as FLOPS/watt for a task that fits your special-purpose architecture, then yes.

However, it is as easy to find problems that fly on your special-purpose architecture as it is to find ones that don't.

Take two problems:
1.) Dense matrix/vector math.
2.) Collision detection (quad-tree based or whatever space decomposition you like to use; it will involve a lot of pointer chasing anyway).

The first one can more or less use all the execution units you can throw at it. The second relies more on a fine-grain, low-latency memory subsystem than on execution units. My prediction is that #1 will fly on CELL and #2 will grind to a halt (compared to a general-purpose CPU). In fact, #2 is likely to be so slow that you'll use completely different algorithms on CELL, with considerably more computational overhead, and end up with lower efficiency.
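To make the contrast concrete, here is a rough C++ sketch of the two kinds of kernel (nothing Cell-specific; the data structures are just illustrative stand-ins). The dense kernel streams through contiguous arrays and scales with however many FP units you can feed, while the tree query is a chain of dependent loads that mostly sits waiting on memory.

[code]
#include <cstddef>
#include <vector>

// 1.) Dense math: contiguous data, independent iterations, trivially split
//     across as many floating-point units as the chip provides.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] += a * x[i];
}

// 2.) Collision query against a space-decomposition tree: every step needs
//     the node fetched by the previous step, so the walk is bounded by
//     memory latency rather than by execution units.
struct AABB { float min[3], max[3]; };

struct Node {
    AABB  bounds;
    Node* child[2];   // null on leaves
    int   objectId;   // valid only on leaves
};

bool overlaps(const AABB& a, const AABB& b) {
    for (int i = 0; i < 3; ++i)
        if (a.max[i] < b.min[i] || b.max[i] < a.min[i]) return false;
    return true;
}

void query(const Node* n, const AABB& box, std::vector<int>& hits) {
    if (!n || !overlaps(n->bounds, box)) return;   // each node: a dependent load
    if (!n->child[0] && !n->child[1]) { hits.push_back(n->objectId); return; }
    query(n->child[0], box, hits);
    query(n->child[1], box, hits);
}
[/code]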

Cheers
Gubbi
 
Collision detection can still be done for lots of objects in parallel, as for latency ... well that is what multi-threading is for (or alternatively allow lots and lots of outstanding prefetches).
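For what it's worth, a minimal sketch of that idea in plain C++ (pickChild() is a hypothetical stand-in for the real traversal logic, and __builtin_prefetch plays the role that an asynchronous DMA into local store would play on an APU-style core): walk many independent queries round-robin, starting the fetch for each one's next node before moving on, so the latency of each fetch is hidden behind work on the other queries.

[code]
#include <cstddef>
#include <vector>

struct Node {             // minimal stand-in for a space-decomposition node
    Node* child[2];       // null on leaves
    int   objectId;
};

struct Strand {           // one in-flight collision query
    const Node* next;     // node already prefetched, not yet visited
    int         queryId;
};

// Hypothetical stand-in for the real traversal decision (bounds tests etc.);
// here we simply descend the left child so the sketch stays short.
static const Node* pickChild(const Node* n) { return n->child[0]; }

// Round-robin over many independent queries: by the time we come back to a
// strand, the node prefetched for it has (hopefully) arrived, so each fetch
// overlaps with work on the other strands. __builtin_prefetch is a GCC/Clang
// builtin; on an APU it would be an asynchronous DMA into local store.
void runStrands(std::vector<Strand>& strands) {
    while (!strands.empty()) {
        for (std::size_t i = 0; i < strands.size(); /* advanced below */) {
            const Node* n = strands[i].next;
            // ... test n against strands[i].queryId's volume, record hits ...
            if (const Node* next = pickChild(n)) {
                __builtin_prefetch(next);        // start the next fetch now
                strands[i].next = next;
                ++i;
            } else {
                strands[i] = strands.back();     // this query is done
                strands.pop_back();
            }
        }
    }
}
[/code]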
 
MfA said:
Microsoft can only bloat the same old programs we have been using for a decade so much ... it is general purpose processors which are only suited for niche applications. They are either way overpowered for what they are running (desktop applications) or hugely inefficient (games and multimedia).

I cannot believe it, poetry in words :oops:.

At least I know I am not alone :).
 
MfA said:
Collision detection can still be done for lots of objects in parallel, as for latency ... well that is what multi-threading is for (or alternatively allow lots and lots of outstanding prefetches).

The individual APU isn't multithreaded (to my knowledge). This means that whenever it needs another node in the tree search (which is going to be at every step), it'll need to arbitrate to main memory to get it (possibly the on-die eDRAM), and hence the APU halts. Is the PE going to have to arbitrate as described in the patents, or can the APUs arbitrate themselves? Not at all clear to me.

CELL still seems like a glorified signal processor to me, nothing more. It's not general purpose enough to do general-purpose (single-thread) stuff fast, and not special purpose enough to compete with the speed and efficiency of special-purpose circuitry (GPUs).

I'd love to be surprised though.

Cheers
Gubbi

Edit: typos
 
Gubbi said:
In fact, #2 is likely to be so slow that you'll use completely different algorithms on CELL, with considerably more computational overhead, and end up with lower efficiency.
#2 (or any kind of pointer-chase-heavy algorithm across large data structures) is already a memory-access-limited problem on desktop CPUs - at least on a chip like the BE you could run several dozen of them at the same time.

Gubbi said:
The individual APU isn't multithreaded (to my knowledge). This means that whenever it needs another node in the tree search (which is going to be at every step), it'll need to arbitrate to main memory to get it (possibly the on-die eDRAM), and hence the APU halts. Is the PE going to have to arbitrate as described in the patents, or can the APUs arbitrate themselves? Not at all clear to me.
Well, the APU can issue DMA fetches on its own, but the specific kind of arbitration will likely depend on the problem you're solving.
If I can run the collision-testing process in the same way as we do rendering nowadays (kickstart a few APUs and let them work while I do other stuff), that would probably be preferable to using the PE for arbitration, even if memory accesses won't be optimally scheduled.
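Conceptually something like the sketch below, with host threads standing in for APUs and a shared counter standing in for whatever completion signalling the real hardware ends up providing (CollisionJob, runJob and the thread count are all made up for illustration): the "PU" fills a list of jobs, kicks off the workers, and goes on with other things while they drain it.

[code]
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Made-up job description: a slice of the object pairs to test this frame.
struct CollisionJob {
    int firstObject, lastObject;
};

// Stand-in for the actual narrow-phase work an "APU" would do on its slice.
void runJob(const CollisionJob& job) {
    (void)job;   // ... test pairs job.firstObject .. job.lastObject ...
}

void kickOffAndContinue(const std::vector<CollisionJob>& jobs) {
    std::atomic<std::size_t> nextJob{0};
    std::vector<std::thread> apus;

    for (int i = 0; i < 4; ++i)                  // "kickstart a few APUs"
        apus.emplace_back([&] {
            for (std::size_t j; (j = nextJob.fetch_add(1)) < jobs.size(); )
                runJob(jobs[j]);
        });

    // ... the "PU" carries on with other work here while the jobs drain ...

    for (auto& t : apus) t.join();               // pick up the results later
}
[/code]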
 
Gubbi said:
My prediction is that #1 will fly on CELL and #2 will grind to a halt (compared to a general-purpose CPU). In fact, #2 is likely to be so slow that you'll use completely different algorithms on CELL, with considerably more computational overhead, and end up with lower efficiency.

I think the team of 100+ highly experienced engineers that designed this micro-architecture already thought of these issues - and more besides. Don't you?
 
The bottleneck for some time hasn't been the speed of the machine but the speed of development.

Most coders don't have an intimate knowledge of machine-level features, and they certainly don't want to worry about them while writing game code. So a machine that will perform badly UNLESS the coder knows about DMA, small memory pools, synchronisation etc. is going to cause problems in development.

From a game point of view (not the high-profile specialised jobs of graphics, physics, sound) we are more interested in the PU than the APUs. If the PU isn't really good, then all the APUs in the world won't make the slightest bit of difference to complex games.

Cost of development is proving a problem, and Cell looks to raise that cost even higher. Pure brute-force performance may lose out to simpler development; in the end, if you don't have the time to use all that extra performance, what good was it? And if it runs 'normal' code slower, it could actually make for worse (but prettier) games.

I certainly hope Sony haven't forgotten this fact, but from things I've heard it's not looking good. The Sony PR hype might say it's easy, but who believes the hype?

Edit:
Of course, none of the next-gen architectures look that easy to get good performance out of... So it may be the only way to get the jump we all want...
 
Fafalada said:
Gubbi said:
The individual APU isn't multithreaded (to my knowledge). This means that whenever it needs another node in the tree search (which is going to be at every step), it'll need to arbitrate to main memory to get it (possibly the on-die eDRAM), and hence the APU halts. Is the PE going to have to arbitrate as described in the patents, or can the APUs arbitrate themselves? Not at all clear to me.
Well, the APU can issue DMA fetches on its own, but the specific kind of arbitration will likely depend on the problem you're solving.
If I can run the collision-testing process in the same way as we do rendering nowadays (kickstart a few APUs and let them work while I do other stuff), that would probably be preferable to using the PE for arbitration, even if memory accesses won't be optimally scheduled.

It is exactly this mechanism I'd like to know more about. Are we talking about explicitly setting up a DMA with source, destination and range for every access, then waiting for it to complete (polling for completion)? That's a lot of overhead just to read one node in a tree.

It would also mean locking the entire space-decomposition tree down as shared-read in the embedded DRAM.
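Spelled out, the pattern I'm worried about looks roughly like this (dma_get()/dma_done() are purely hypothetical stand-ins, stubbed with a plain copy so the sketch compiles, since no real APU DMA interface has been published): one explicit transfer and one completion poll per tree node, before the node can even be examined.

[code]
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for an APU DMA interface: on real hardware dma_get()
// would queue an asynchronous transfer from an effective address into local
// store, and dma_done() would poll a completion tag. The stub just treats the
// effective address as a host pointer and copies.
static void dma_get(void* localStore, std::uint64_t effectiveAddr,
                    std::uint32_t size, int /*tag*/) {
    std::memcpy(localStore,
                reinterpret_cast<const void*>(static_cast<std::uintptr_t>(effectiveAddr)),
                size);
}
static bool dma_done(int /*tag*/) { return true; }

struct Node {
    std::uint64_t childEA[2];   // effective addresses of the children (0 = none)
    float         bounds[6];
    int           objectId;
};

// One dependent fetch per level: set up a transfer, then spin until it lands.
// Every node costs a full DMA round trip before we can even look at it, which
// is exactly the per-access overhead in question.
int walkToLeaf(std::uint64_t rootEA, int whichChild) {
    alignas(16) Node node{};                     // staging buffer in "local store"
    std::uint64_t ea = rootEA;
    while (ea != 0) {
        dma_get(&node, ea, sizeof(Node), /*tag=*/0);
        while (!dma_done(0)) { /* poll */ }      // the APU stalls here every step
        ea = node.childEA[whichChild];
    }
    return node.objectId;
}
[/code]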

Cheers
Gubbi
 
Deano,

What you're saying basically sounds as if programmers aren't interested in doing their job; they just want to lean back, relax and have the money come pouring into their pockets all by itself.

Certainly it's NOT too much to ask that programmers learn about the hardware they're writing code for! That's why Japanese developers typically kick US coders' asses when it comes to slickness, especially on "weird" hardware such as PS2; because they're willing to get down with it.

US coders seem to be stuck in their lazy-ass rear-wheel-drive-automatic-transmission way of thinking even when out on the race track, unable to understand why they're getting outclassed by manual transmission 4WD drivers. What, you mean it's not possible to win by just steering and stepping on the gas pedal?! You gotta be kidding me! :p

Ok, so I'm generalizing, but when listening to PC dev-people like Sweeney and Carmack, that's certainly the way they come off; they're so obsessed with their own ease and comfort when coding. So programming for a console requires WORK and THINKING. Tough shit, pal; if that's too much to ask of you, maybe you should run home to mama instead, or switch careers and apply for a job with MS over in Redmond. I'm sure nobody will ask you to do any hard optimization work over there! :devilish:

;)
 
DeanoC said:
The bottleneck for some time hasn't been the speed of the machine but the speed of development.

Most coders don't have an intimate knowledge of machine-level features, and they certainly don't want to worry about them while writing game code. So a machine that will perform badly UNLESS the coder knows about DMA, small memory pools, synchronisation etc. is going to cause problems in development.

How does this

Q : Has a software development framework for Cell been announced to developers?

CTO : Not yet. But we've settled the basic policy and discussed it with some vendors. Unlike with PS2, we will provide libraries with dedicated support, as with the PSP.

CTO : Since the OS/library manages distributed processing, software based on our framework will gain the benefits of distributed processing transparently.

sound to your ears?

Also, governing three CPUs ( each being something like a PowerPC 970+ CPU ) that can each process up to two threads in SMT mode, with a big shared L2 that can be written/read to/from by the GPU, is no easy task for a programmer who knows nothing about computer architecture and performance tuning.

From a game point of view (not the high-profile specialised jobs of graphics, physics, sound) we are more interested in the PU than the APUs. If the PU isn't really good, then all the APUs in the world won't make the slightest bit of difference to complex games.

True, but I do not think the PU will be too slow ( I imagine a PowerPC 440-like core running at at least 1-2 GHz, if not more ), and in a chip like the Broadband Engine we would have 4 of them.

Also, with APUs being able to do I/O processing ( by setting particular flags in the shared DRAM they can have data go from the I/O device to/from the APU's LS without basically stopping at GO, I mean the shared DRAM ;) ) as well as DMA work on their own with very little intervention from the PU, I think we could assign the APUs, or pipelines of APUs, to tasks other than graphics, physics, A.I. and sound.

Granted, it will not be ultra efficient, but we can keep more APUs busy this way if in a particular application we need to use the PUs for other tasks.
 
As long as a processor can have enough outstanding prefetches/DMA requests, you can do vertical multithreading in software.
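A sketch of the simplest form of that is classic double buffering (reusing the same hypothetical dma_get()/dma_done() stand-ins as in the earlier sketch, and assuming the stream length is a multiple of the chunk size): the "thread switch" is just flipping to the buffer whose transfer was started one iteration earlier.

[code]
#include <cstdint>
#include <cstring>

// Same hypothetical dma_get()/dma_done() stand-ins as before, stubbed with a
// plain copy so this compiles on its own.
static void dma_get(void* dst, std::uint64_t ea, std::uint32_t size, int /*tag*/) {
    std::memcpy(dst, reinterpret_cast<const void*>(static_cast<std::uintptr_t>(ea)), size);
}
static bool dma_done(int /*tag*/) { return true; }

constexpr std::uint32_t CHUNK = 4096;   // bytes of work per block

// Double buffering: while we crunch the chunk that has already arrived, the
// DMA for the next chunk is in flight, so transfer latency hides behind
// computation -- vertical multithreading done by hand, in software.
// Assumes totalBytes is a multiple of CHUNK to keep the sketch short.
void processStream(std::uint64_t srcEA, std::uint32_t totalBytes,
                   void (*crunch)(const std::uint8_t*, std::uint32_t)) {
    if (totalBytes == 0) return;
    alignas(16) std::uint8_t buf[2][CHUNK];
    int cur = 0;
    dma_get(buf[cur], srcEA, CHUNK, cur);               // prime the first fetch
    for (std::uint32_t off = 0; off < totalBytes; off += CHUNK) {
        if (off + CHUNK < totalBytes)                   // start the next fetch early
            dma_get(buf[cur ^ 1], srcEA + off + CHUNK, CHUNK, cur ^ 1);
        while (!dma_done(cur)) { /* rarely waits if crunch() is long enough */ }
        crunch(buf[cur], CHUNK);                        // work on the arrived chunk
        cur ^= 1;
    }
}
[/code]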
 
Guden Oden said:
Deano,
Certainly it's NOT too much to ask that programmers learn about the hardware they're writing code for! That's why Japanese developers typically kick US coders' asses when it comes to slickness, especially on "weird" hardware such as PS2; because they're willing to get down with it.

Sorry, that's rubbish. I ported Silent Hill 2, so I know about Japanese development practices. They have the same problems the West has. In fact, they're catching up with us in software engineering practices after finding that the traditional Japanese ad-hoc software practices weren't cost effective.

Application-domain-specific knowledge matters too. Somebody who knows the solution to the problem isn't necessarily the person who knows the architecture like the back of their hand. If you restrict yourself to people who know both (which is what you're saying), then you cut out about 75% of the development staff. You're only employing people who are experts in two fields (for example AI and PS2). That's going to raise costs because, not surprisingly, they're going to be pretty damn scarce.

Would be great for me personally though :) Of course, you're going to have to pay via £100 games...
 
Panajev2001a said:
Q : Has a software development framework for Cell been announced to developers?

CTO : Not yet. But we've settled the basic policy and discussed it with some vendors. Unlike with PS2, we will provide libraries with dedicated support, as with the PSP.

CTO : Since the OS/library manages distributed processing, software based on our framework will gain the benefits of distributed processing transparently.

sound to your ears?
Sounds like hype to my ears. Exactly how is it going to enforce data locality, functional blocks, etc.? It's going to be transparent :) Does anybody really believe that? Certainly people like EA or the Sony game studios don't.

Panajev2001a said:
Also, governing three CPUs ( each being something like a PowerPC 970+ CPU ) that can each process up to two threads in SMT mode, with a big shared L2 that can be written/read to/from by the GPU, is no easy task for a programmer who knows nothing about computer architecture and performance tuning.

Agreed, Xenon isn't going to be easy either.

Panajev2001a said:
True, but I do not think the PU will be too slow ( I imagine a PowerPC 440-like core running at at least 1-2 GHz, if not more ), and in a chip like the Broadband Engine we would have 4 of them.
Could be a bad assumption that PS3 CPU == Patent Broadband Engine. ;)

And even if we have four 2 GHz PowerPCs, that gives us the same problems that Xenon has, as well as the APU problem.

Panajev2001a said:
Also, with APUs being able to do I/O processing ( by setting particular flags in the shared DRAM they can have data go from the I/O device to/from the APU's LS without basically stopping at GO, I mean the shared DRAM ;) ) as well as DMA work on their own with very little intervention from the PU, I think we could assign the APUs, or pipelines of APUs, to tasks other than graphics, physics, A.I. and sound.

Granted, it will not be ultra efficient, but we can keep more APUs busy this way if in a particular application we need to use the PUs for other tasks.

Only if we retrain all the level designers (who aren't usually trained programmers, let alone low-level programmers), gameplay coders etc. to use 128K only, with no random pointer walks. Similar problems will occur on Xenon due to massive cache-miss costs. On PS3 it simply won't run, as they won't be able to access anything outside the 128K memory.

Customers want more content and bigger games, but the platform holders are making that harder to do. It's about development, not hardware. From a tech-head point of view I like the idea of both the next-generation architectures I know something about, but from a senior development staff budget/cost point of view I get scared and hide in the cupboard :)

I'm very interested in the Cell OS and libraries, as well as XNA. Ask anybody involved in next-gen game development and we all agree hardware is irrelevant. Too complicated/advanced and it could be a curse, not a blessing.

A few years ago Sony said only 4 developers could make games for PS2; we managed to disprove that prediction. Does anybody want a world where the only developers are EA, Sony/Enix, Nintendo and SEGA, making sequels to their franchises?
 
Deano said:
Only if we retrain all the level designers (who aren't usually trained programmers, let alone low-level programmers), gameplay coders etc. to use 128K only, with no random pointer walks. Similar problems will occur on Xenon due to massive cache-miss costs. On PS3 it simply won't run, as they won't be able to access anything outside the 128K memory.
Oh, that's easily solved: just don't give level designers access to anything except a script language that prevents them from doing anything you don't want them to :LOL:

Incidentally, one of the recent patents outlined that we will be able to run APU programs >128 KB after all (whee, overlays are making a big comeback), and we already know that they can access external memory for data as well.
 