Nonamer, the VUs on the EE only have Micro-memories, which are not HW caches, and in total they amount to only 40 KB of SRAM (for instructions and data).
My point still stands, and in fact at the end of your latest post you basically seem to understand what I was trying to highlight.
Perhaps I was being too vague and oversimplified the situation. I was really referring to the cache/eDRAM. Main RAM could never be enough; only the on-die data storage could be made fast enough. Registers are mostly irrelevant (what's the point of having so many FPUs if registers could feed them?)
What is the point of having so many FPUs if registers could feed them? I am sorry, but I do not quite see what point you are trying to make with a statement like that.
The point of having so many FPUs is, practically speaking, to be able to "crunch a lot of math" each cycle.
You say "duh" to me and then say "well in some cases ( Note: namely multimedia applications which is one of the major areas CELL is targeted to ) this power can be used, but not with just any kind of code".
A "duh" would be a good prize for that too
Now don't get me wrong. I'm sure if there's anything like a 1TFLOP of power in the PS3 it can all be used. Just not for general applications.
So you are telling me that for non-multimedia or non-vector-friendly applications we would not reach 1 TFLOPS?
That is some horrible news: I do not know if Word 2005 and Excel 2005 will be able to run then... oh no, if I start Mozilla too, performance will surely slow down to a crawl...
Meanwhile, for graphics, vector processing in general, and multimedia applications, CELL will be able to flex its muscles much better.
It does look like that is what CELL was meant for, and not a "whoopsie" on the CELL designers' part.
Registers are not irrelevant: just because you got a little too accustomed to an 8 GPRs + FP stack architecture (cough... IA-32... cough) does not mean that bigger register files cannot help.
You know it already: a good memory hierarchy is one that takes good care of the design of each step of a relatively long chain. No step of the hierarchy is designed with the naive idea that it will negate the need for the level below it, as the cost and density trade-offs of each level are well known.
The more realistic purpose is to take pressure off the successive steps of the hierarchy in the best way possible, considering the applications the processor is targeted at.
To make a long story short (this has been debated before, since the point was already brought up by Deadmeat): the idea is to provide enough registers that LOAD/STORE operations from/to the LS (Local Storage, 128 KB of SRAM in each APU) are kept to the lowest number possible.
You do not see as many main memory LOAD/STORE instructions in register-heavy architectures such as IPF as you see in common x86 code; guess why...
The LS might not be exactly as fast as the registers, but thanks to the good amount of registers (basically 32x128-bit registers for each of the 4 FP/FX unit groups in each APU) the pressure on it will be reduced.
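Just to illustrate the register pressure point in plain C (nothing CELL-specific here; the function names are mine, and the volatile array is only a crude way to model spill traffic): the same four dot products are computed twice, once with the accumulators forced through memory on every iteration, once with the accumulators kept in plain locals a compiler can hold in registers. The second version only touches memory to stream the input in, which is exactly what a big register file buys you.

```c
#include <stdio.h>

#define N 1024

float a[4][N], b[N];

/* Accumulators forced through memory: one extra load + store per
 * element per accumulator, i.e. the "pressure" on the next memory
 * level when there are not enough registers to go around. */
void dot4_spilled(float out[4]) {
    volatile float acc[4] = {0, 0, 0, 0};   /* volatile models a spill slot */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < 4; k++)
            acc[k] = acc[k] + a[k][i] * b[i];
    for (int k = 0; k < 4; k++) out[k] = acc[k];
}

/* Accumulators kept in registers: memory is touched only to stream
 * a[][] and b[] in. */
void dot4_in_registers(float out[4]) {
    float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (int i = 0; i < N; i++) {
        float bi = b[i];
        acc0 += a[0][i] * bi;
        acc1 += a[1][i] * bi;
        acc2 += a[2][i] * bi;
        acc3 += a[3][i] * bi;
    }
    out[0] = acc0; out[1] = acc1; out[2] = acc2; out[3] = acc3;
}

int main(void) {
    for (int i = 0; i < N; i++) {
        b[i] = 1.0f;
        for (int k = 0; k < 4; k++) a[k][i] = (float)(k + 1);
    }
    float r[4];
    dot4_in_registers(r);
    printf("%f %f %f %f\n", r[0], r[1], r[2], r[3]);
    return 0;
}
```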
Oh, btw, the APU's functional units cannot directly process operands from the e-DRAM: everything they process has to be contained in the LS (programs basically have to be subdivided into 128 KB chunks to achieve optimal efficiency).
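In other words, the working model is "stream a chunk into LS, compute on it, stream it back out". Here is a minimal sketch of that pattern in C, assuming nothing about the real CELL programming interface: dma_get/dma_put, LS_CHUNK and process_in_chunks are made-up names, and the "DMA" is just memcpy standing in for whatever mechanism actually moves data between the e-DRAM and the Local Storage.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LS_CHUNK (128 * 1024)  /* assumed 128 KB of Local Storage used as a data buffer */

/* Stand-ins for the real e-DRAM <-> LS transfer mechanism (hypothetical
 * names; here they are plain memcpy for illustration only). */
static void dma_get(void *ls, const void *edram, size_t n) { memcpy(ls, edram, n); }
static void dma_put(void *edram, const void *ls, size_t n) { memcpy(edram, ls, n); }

/* Process a large buffer living in e-DRAM/main memory by streaming it
 * through the LS one chunk at a time: the functional units only ever
 * see data that has first been pulled into local_buf. */
void process_in_chunks(unsigned char *edram_buf, size_t total) {
    static unsigned char local_buf[LS_CHUNK];   /* models the APU Local Storage */
    for (size_t off = 0; off < total; off += LS_CHUNK) {
        size_t n = (total - off < LS_CHUNK) ? (total - off) : LS_CHUNK;
        dma_get(local_buf, edram_buf + off, n);    /* pull chunk into LS  */
        for (size_t i = 0; i < n; i++)             /* compute only on LS  */
            local_buf[i] = (unsigned char)(local_buf[i] + 1);
        dma_put(edram_buf + off, local_buf, n);    /* push results back   */
    }
}

int main(void) {
    size_t total = 300 * 1024;                 /* a bit more than two chunks */
    unsigned char *buf = calloc(total, 1);     /* stands in for e-DRAM data  */
    if (!buf) return 1;
    process_in_chunks(buf, total);
    printf("buf[0] = %u, buf[%zu] = %u\n", buf[0], total - 1, buf[total - 1]);
    free(buf);
    return 0;
}
```

In practice you would presumably double-buffer (pull chunk N+1 while still computing on chunk N) so the transfers hide behind the computation, which is what all that e-DRAM bandwidth is there for.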
Pressure is taken away from the e-DRAM because we have that many GPRs per APU and a good amount of LS per APU, so the TB/s of bandwidth the e-DRAM provides is not what cripples CELL.
Nobody ever said that the VUs were extremely inefficient because they were so unbalanced...
Well, compared to a VU1, each APU has 4x the amount of Micro-memory and 4x the amount of registers, it is trying to feed fewer execution units (VU1 can, for example, feed up to 5 parallel FMACs), and it has MUCH higher bandwidth connecting it to the next step of the memory hierarchy.