First Cell demo (48 MPEG 2 Videos)

Status
Not open for further replies.
We aren't any closer to an answer of "orders of magnitude" or not (unless you stand behind your numbers claim), so I guess you lose on both counts.
 
randycat99 said:
We aren't any closer to an answer of "orders of magnitude" or not (unless you stand behind your numbers claim), so I guess you lose on both counts.

Uh no you might want to look up the definition of a single order of magnitude, then do the same for the plural form. :LOL:

If you want to believe CELL was running at 100MHz and at 10% utilization to prove that it is orders of magnitude faster then go ahead, it's water under the bridge at this point.

If CELL was orders of magnitude faster then you can be sure even Toshiba would be trumpeting it. ;)

BTW why do you suppose the clock speed was secret? :LOL:

If I had a processor that could do that and only required a couple hundred MHz, I wouldn't hide that fact. The competition already knows the GFLOPS rating of a CELL. ;)
 
So you made a strawman argument that someone here is making claims that Cell will be 100+ times the fastest PC of current day??? Who here has made such a claim? Where do you place a single P4, right now? 20 GFLOPs? 40? 15? So somewhere you see a rising belief that Cell will be 1.5/2.0/4 TFLOPs??? Wow! You broke some serious news! :rolleyes:

Hey, know what? If a P4 is 20-ish, and a single order of magnitude brings 200, then Cell may not be so "off-track", after all. Maybe that is where you went wrong with your argument?
 
A 3.5GHz dual core HT P4 is not 20ish; besides, GFLOPS isn't everything. Regardless, both ERP and aaaaa00 must be arguing with that same strawman too. We have enough information to reasonably declare it's not orders of magnitude faster in real-world situations. We haven't even talked about other types of apps that favor a P4, not to mention apps that require double precision. ;)

Really if you want to believe CELL is orders of magnitude faster by simply comparing theoretical SP GFLOPS numbers then feel free nobody is stopping you. I'm more interested in how apps run on these processors that's one of the reasons why I brought up the DVD streams comparison.
 
randycat : Order of magnitude, though not particularly defined, is taken generally to mean 10x (at least when I was at uni.) Therefore a processor an order of magnitude more powerful than another is something like 10x more powerful.

I would say PCEngine's observations, that if an HT DC P4 can manage 40 streams Cell isn't an order of magnitude higher, is a valid statement.

pahcman : No-one's getting annoyed at Cell not being 10^n times more powerful than other processors. We're getting annoyed at PCEngine's attempts to flog some understanding of Cell's performance from facts that haven't been proven yet. If he wasn't harping on about how wrong everyone was to think Cell's a wonder chip (which i don't think anyone here is really bothered about) there wouldn't be this argument of people trying to correct his manic behaviour.

PCEngine : What's stupid with your statements is that you've said we can't tell Cell performance from this as we don't know how this demo taxed the Cell system (especially as it wasn't a hardware demo), and we haven't confirmed an HT DC P4 can handle 40 streams. The jury is out. The debate is ongoing. You've taken the comments of two programmers as fact and ignored other statements from other sources (including a programmer) who think otherwise.

Add to that your a rude, arrogant twat who persists in insulting people but thinks that's okay, civilised, mature behaviour. ;) :D

You've all the intellectual capabilities of the fungus that grows on the rotting remains of a maggot ;) :LOL:
You've the charm and social standing of a diarroeic camel's rectal discharges ;) :D
You've the reasoning capacity of a the wart on a pregnant baboon's unborn child ;) :LOL: :p
You're a fat :LOL: , retarded ;) , spastic :LOL: :LOL: , nazi ;) :LOL: , m***** f***** gay nigger ;) :p :D :LOL: :cry: :LOL: :p :p :LOL:

Mods ... can we have some rules that explain the point I illustrate above and PCEngine doesn't seem to get? This behaviour is totally out of place.
 
:LOL: Anyway...I made a hypothesis based on some assumptions to conclude it's not orders of magnitude faster which I think is reasonable. If people want to disagree then fine, I have no problem with that. It's the people that think in a vacuum ->CELL = 300GFLOPS therefore CELL = orders of magnitude faster that's the issue. To those people I'd ask how CELL compares in scientific computing apps that need DOUBLE PRECISION? :LOL:
 
nAo said:
JF_Aidan_Pryde said:
Why 640 x 480 x 2? What's the x2?
Decoded streams must be written (+1) and read (+1), that's why we have a 2x factor.
With resizing, don't you just have to read in the 48 streams again (1.47GB/s) and output at 1920x1024? With fancier filtering, the number of operations in the CPU will need to go up but bandwidth should still be no more than 1.47GB/s input, no?
Unfortunately this is not the case.
Every filter with a domain that is larger than a single sample will require multiple samples per pixel.
This means sample values are going to be reused multiple times and since we don't have a cache that can hold a full decoded stream no matter what we do, some sample is going to be fetched multiple times from memory.
Obviously there are ways to improve this situation, like storing a decoded stream in some hierarchical tiled fashion and filtering a tile at a time, following a special tiling order to maximize sample reuse between tiles.
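
As a rough check on the 1.47GB/s figure above: the thread doesn't state the pixel format or frame rate behind it, so the sketch below assumes 4 bytes per decoded pixel at 25 frames per second, which is one combination that reproduces the number for a single pass over 48 streams at 640x480. (2 bytes per pixel with the write + read factor of 2 gives the same total.) The function name and both constants are illustrative assumptions, not anything known about the demo.

```python
# Back-of-envelope memory traffic for 48 decoded 640x480 streams.
WIDTH, HEIGHT = 640, 480
STREAMS = 48
BYTES_PER_PIXEL = 4  # ASSUMPTION: not stated in the thread
FPS = 25             # ASSUMPTION: not stated in the thread


def stream_bandwidth(width, height, bytes_per_pixel, fps, streams, passes=1):
    """Bytes per second moved when every frame is touched `passes` times."""
    return width * height * bytes_per_pixel * fps * streams * passes


# One pass: just reading the decoded streams back in.
read_only = stream_bandwidth(WIDTH, HEIGHT, BYTES_PER_PIXEL, FPS, STREAMS)

# Two passes: decoded frames are written (+1) and then read (+1).
write_and_read = stream_bandwidth(
    WIDTH, HEIGHT, BYTES_PER_PIXEL, FPS, STREAMS, passes=2
)

print(f"read only:      {read_only / 1e9:.2f} GB/s")
print(f"write and read: {write_and_read / 1e9:.2f} GB/s")
```

Any filter that re-fetches samples from memory pushes the real traffic above the one-pass number, which is the point being made in the reply.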

Wouldn't cache fix such problems? For example, if bilinear filtering is used, four samples are required for each output pixel. Surely the neighbouring pixels would be fetched into cache (like tiles as you said) and the values used multiple times to calculate the value of all resulting pixels. Isn't this the reason why GPUs don't need 4x memory bandwidth to do bilinear filtering? Is there something special about CPU cache that makes this scenario no longer valid?
 
JF_Aidan_Pryde said:
Wouldn't cache fix such problems?
Cache would alleviate such problems.
For example, if bilinear filtering is used, four samples are required for each output pixel. Surely the neighbouring pixels would be fetched into cache (like tiles as you said) and the values used multiple times to calculate the value of all resulting pixels.
This is true, but IF the cache can't store ALL the unique samples you need to filter the image, it has to drop some samples once you fetch new ones. You can't visit all the tiles without dropping some edge: sooner or later you walk every tile, but since the cache can't hold all the samples, some tile data is lost and the hw has to fetch it another time.

Isn't this the reason why GPUs don't need 4x memory bandwidth to do bilinear filtering? Is there something special about CPU cache that makes this scenario no longer valid?
I believe GPU caches are built in a way to maximize hit under bilinear filtering pattern, but the basic principle is the same.
GPUs don't need 4x the memory bandwidth to bilerp texels, but cache can't just reduce the bandwidth requirement to 1x.
That's why GPU caches are 'small': designers don't want to capture ALL the texture, they just want to reuse samples under a certain walking order among texture tiles.
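
A toy simulation of the point above, under loud assumptions: a fully associative LRU cache that holds individual samples (real CPU and GPU caches work on lines and are organised differently), a 64x64 image, and a 2x2 bilinear footprint per output pixel. All names and sizes are made up for illustration. It counts how many samples have to be fetched from memory when the output is walked row by row versus tile by tile.

```python
from collections import OrderedDict


def count_fetches(order, cache_size):
    """Count memory fetches for 2x2 bilinear footprints with an LRU cache."""
    cache = OrderedDict()
    fetches = 0
    for x, y in order:
        # Each output pixel reads its four neighbouring input samples.
        for key in ((x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)):
            if key in cache:
                cache.move_to_end(key)         # hit: mark as most recently used
            else:
                fetches += 1                   # miss: fetch from memory
                cache[key] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used
    return fetches


def row_major(width, height):
    """Visit output pixels one full row after another."""
    return [(x, y) for y in range(height - 1) for x in range(width - 1)]


def tiled(width, height, tile):
    """Visit output pixels tile by tile, row-major inside each tile."""
    order = []
    for ty in range(0, height - 1, tile):
        for tx in range(0, width - 1, tile):
            for y in range(ty, min(ty + tile, height - 1)):
                for x in range(tx, min(tx + tile, width - 1)):
                    order.append((x, y))
    return order


WIDTH = HEIGHT = 64
CACHE_SIZE = 32  # deliberately smaller than one image row
TILE = 8

unique = WIDTH * HEIGHT
row_fetches = count_fetches(row_major(WIDTH, HEIGHT), CACHE_SIZE)
tiled_fetches = count_fetches(tiled(WIDTH, HEIGHT, TILE), CACHE_SIZE)

print(f"unique samples:     {unique}")
print(f"row-major fetches:  {row_fetches} ({row_fetches / unique:.2f}x)")
print(f"tiled fetches:      {tiled_fetches} ({tiled_fetches / unique:.2f}x)")
```

In this model the tiled walk gets much closer to 1x than the row-by-row walk, but it still refetches samples along tile edges, so a small cache reduces the traffic without bringing it all the way down to 1x.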

EDIT: typos
 
PC-Engine said:
:LOL: Anyway...I made a hypothesis based on some assumptions to conclude it's not orders of magnitude faster which I think is reasonable.
Save that your hypothesis assumes this Toshiba demo is showing the absolute limits of Cell's performance. For all we know a single SPE can process 48 MP2 streams and they could have had a couple of hundred little images on screen. It's like seeing a Ferrari beat a Fiat 500 in a drag race with the Ferrari travelling no faster than 60 miles an hour. It still beats the Fiat, but it doesn't mean the Ferrari can ONLY do 60 MPH. We have no idea what the code optimisation was like. We don't know why one SPE wasn't used. It's possible the Cell processor choked and couldn't handle it. It's also possible that for demonstration purposes showing 8x6 thumbnail videos was clearer than showing 24x18 so Toshiba limited how many videos to work with. The key point being we DON'T KNOW and therefore cannot derive any sensible benchmarks from this demo.

Do you disagree with this?
If people want to disagree then fine, I have no problem with that.
Why do you insult them then?

It's the people that think in a vacuum ->CELL = 300GFLOPS therefore CELL = orders of magnitude faster that's the issue.
What people? :? No-one here was saying that. We were just talking about what this demo does/doesn't show. Read through the first few pages of posts and the debate is polite and intelligent, with a jovial few smart-arse remarks. The first antagonist is you, as is the second. It then drops into a smiley ridden slag-fest with you trying to prove you're right over something that is of no concern to anyone. So what if Cell is or isn't a 1 teraflop uber-processor? No-one's lives are at stake! It wouldn't be the first time promises/hype never came true! Why are you so insistent on trying to convince everyone not to have any faith in Cell?

Honestly, I can't see why you're arguing your point.
 
Thanks for the explanation, nAo.
I guess that's why Cell uses a programmer-controlled local store, so that for something like this, the pixels required for whatever filtering is used are already in the local store before the filtering gets underway. :)

Using the stream method between SPEs, this would be very nice. The decoded stream data gets stored in the local store of the next SPE. The next SPE then just reads its local store as a perfect pixel cache to do filtering.

---

Regarding order of magnitude speedups, it is true in terms of raw theoretical specs. It's a reasonable comparison since both are fabbed at 90nm, have similar die sizes and have roughly 200+ million transistors.

Pentium 4 Dual Core 3.5GHz (250 Million transistors)
Single core FP performance: 3.5 x 4 (SSE) = 14 GFLOPS
Dual core = 28 GFLOPS

Cell (234 Million transistors)
Cell @ 3.5GHz = 3.5 x 8 (FMADD) x 8 SPEs = 224GFLOPS

That's to say, Cell has 8 times the peak single-precision floating point capability of a dual core Pentium 4 at the same clock speed (224 / 28), i.e. close to one order of magnitude.

---
Notes: * Pentium 4 calculation is for the SSE unit only (excludes regular FPU).
* Cell calculation is for SPEs only (excludes PPE).
* Assuming media applications, i.e. regular single precision FP instructions
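
The arithmetic above can be sketched as a few lines of Python. It only reproduces the theoretical peak figures from this post, under the same notes (SSE only for the P4, SPEs only for Cell, single precision); `peak_gflops` is a made-up helper name and none of this says anything about real-world throughput.

```python
def peak_gflops(clock_ghz, flops_per_cycle, units=1):
    """Theoretical peak: clock (GHz) x flops per cycle per unit x unit count."""
    return clock_ghz * flops_per_cycle * units


# Pentium 4 dual core: 4 single-precision flops per cycle per core (SSE only).
p4_dual = peak_gflops(3.5, 4, units=2)

# Cell: 8 flops per cycle per SPE (4-wide FMADD), 8 SPEs, PPE excluded.
cell_spes = peak_gflops(3.5, 8, units=8)

ratio = cell_spes / p4_dual

print(f"P4 dual core: {p4_dual:.0f} GFLOPS")
print(f"Cell SPEs:    {cell_spes:.0f} GFLOPS")
print(f"ratio:        {ratio:.1f}x")
```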
 
For the floating point talk, keep in mind that a: it doesn't strictly follow IEEE SP precision, but b: it does do 26 GFLOPS DP. The Blue Gene/L chip does between 5 and 6. Granted, it's clocked much lower. Take that as you will. Anyway, it's been an entertaining thread, but unlike some *cough cough* whose raison d'etre is turning every interesting conversation into a pissing contest, I've got a life: off to the Canadian superbike opener in Shannonville. Hope things settle down in here and I'll return to the sandbox later. Play nice, kiddies.
 
Shifty Geezer said:
pahcman : No-one's getting annoyed at Cell not being 10^n times more powerful than other processors. We're getting annoyed at PCEngine's attempts to flog some understanding of Cell's performance from facts that haven't been proven yet. If he wasn't harping on about how wrong everyone was to think Cell's a wonder chip (which i don't think anyone here is really bothered about) there wouldn't be this argument of people trying to correct his manic behaviour.

Yes, pce does seem overboard with his replies, but imho this thread took a dangerous turn when someone decided to trumpet this Cell demo as wiping out PC chips, with WMV HD instead of MPEG-2 examples to boot!

From there on, the replies spiralled into misunderstanding. One side said it's an impressive demo but Pentiums can do fine, and aaaaa00 even set up his own tests. The other side took that as implying Cell is no better than a P4... it all started, imho, because of biased perceptions toward members.

Back and forth, I don't see any replies tying the final Cell in PS3 to this demo? In fact, I see people pressing pce to give a number for Cell vs P4.

As we know, this demo is no indication of anything; it just shows 48 MPEG-2 streams can run well. Can we leave this as a nice demo from Toshiba?
 
JF_Aidan_Pryde said:
...
Pentium 4 Dual Core 3.5GHz (250 Million transistors)
Single core FP performance: 3.5 x 4 (SSE) = 14 GFLOPS
Dual core = 28 GFLOPS

Cell (234 Million transistors)
Cell @ 3.5GHz = 3.5 x 8 (FMADD) x 8 SPEs = 224GFLOPS
...

I get different figures,

A dual core HT Pentium, 4-threads,

1 core ~ 1 (FPU) + 4(SSE) ~ 5 Flops per cycle

Dual core HT pentium ~ 5+5 ~10 Flops per cycle ~ 35 Gflops @ 3.5 GHz



Cell with 1 PPE, 8 SPEs, (with FMADD), 10-threads,

PPE ~ 2 (FPU) + 8 (VMX) ~ 10 Flops per cycle

8 SPE ~ 8*8 (SPU) ~ 64 Flops per cycle

CELL ~ 10+64 ~ 74 Flops per cycle ~ 259 GFlops @ 3.5 GHz

Unless I missed something?
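
For what it's worth, the tally above checks out if you take the per-unit flops-per-cycle figures in this post at face value. A small sketch (the dictionary labels are just illustrative names):

```python
CLOCK_GHZ = 3.5

# Flops per cycle per unit, exactly as listed in the post above.
p4_core = {"FPU": 1, "SSE": 4}
cell = {"PPE FPU": 2, "PPE VMX": 8, "8 SPEs (8 each, FMADD)": 8 * 8}

p4_flops_per_cycle = 2 * sum(p4_core.values())  # two cores
cell_flops_per_cycle = sum(cell.values())       # PPE + 8 SPEs

p4_gflops = p4_flops_per_cycle * CLOCK_GHZ
cell_gflops = cell_flops_per_cycle * CLOCK_GHZ

print(f"Dual core HT P4: {p4_flops_per_cycle} flops/cycle, {p4_gflops:.0f} GFLOPS")
print(f"Cell:            {cell_flops_per_cycle} flops/cycle, {cell_gflops:.0f} GFLOPS")
```

The difference from the 28 vs 224 figures comes down to which units are counted (regular FPU and PPE included here), not to the arithmetic.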
 
Save that your hypothesis assumes this Toshiba demo is showing the absolute limits of Cell's performance. For all we know a single SPE can process 48 MP2 streams and they could have had a couple of hundred little images on screen. It's like seeing a Ferrari beat a Fiat 500 in a drag race with the Ferrari travelling no faster than 60 miles an hour. It still beats the Fiat, but it doesn't mean the Ferrari can ONLY do 60 MPH. We have no idea what the code optimisation was like. We don't know why one SPE wasn't used. It's possible the Cell processor choked and couldn't handle it. It's also possible that for demonstration purposes showing 8x6 thumbnail videos was clearer than showing 24x18 so Toshiba limited how many videos to work with. The key point being we DON'T KNOW and therefore cannot derive any sensible benchmarks from this demo.
Do you disagree with this?

Uh wasn't the demo shown in a videoclip played in Windows? :LOL:

Take my advice and reread the thread and understand it otherwise you're just wasting my time dude.


Why do you insult them then?

It's obvious you have a problem reading from page 1? It started on page 2 and it wasn't me just in case you're still asleep.

What people? No-one here was saying that. We were just talking about what this demo does/doesn't show. Read through the first few pages of posts and the debate is polite and intelligent, with a jovial few smart-arse remarks. The first antagonist is you, as is the second. It then drops into a smiley ridden slag-fest with you trying to prove you're right over something that is of no concern to anyone. So what if Cell is or isn't a 1 teraflop uber-processor? No-one's lives are at stake! It wouldn't be the first time promises/hype never came true! Why are you so insistent on trying to convince everyone not to have any faith in Cell?

And what does that have ANYTHING to do with the example I gave?? If aaaaa00 gave the example instead of me would it change the point??? And where did you get the idea that this was to convince everyone to not have faith in CELL??? You mean if some anonymous poster came in and posted this instead of me, it would also mean that person has an agenda??? Lay down the pipe man and stop wasting my time. BTW for the last time, I don't think it's me who needs to reread the thread. ;)

I'll put it this way for the technically challenged. On paper CELL is roughly an order of magnitude faster than the P4 in my example if strictly talking about GFLOPS. This demo shows 48 DVD streams + downsampling. You telling me CELL could actually do about 400 DVD streams without downsampling???? It doesn't matter if CELL could do more than 48 understand??? Anything below a certain number will make it less than an order of magnitude get it???? Have a nice day. :LOL:
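
To put numbers on the "order of magnitude" argument: an order of magnitude is a factor of 10, so the number of orders between two figures is the base-10 logarithm of their ratio. The sketch below uses the 48 streams shown in the demo and the 40 streams claimed for the P4 earlier in the thread, which nobody here has confirmed, so treat both the input and the helper name as assumptions.

```python
import math


def orders_of_magnitude(ratio):
    """Orders of magnitude expressed as log10 of a ratio."""
    return math.log10(ratio)


cell_streams = 48  # shown in the Toshiba demo (not necessarily the limit)
p4_streams = 40    # ASSUMPTION: unconfirmed figure claimed in this thread

demo_orders = orders_of_magnitude(cell_streams / p4_streams)

# Streams Cell would need to be a full order of magnitude ahead of that P4.
streams_for_one_order = 10 * p4_streams

print(f"48 vs 40 streams: {demo_orders:.2f} orders of magnitude")
print(f"needed for one full order: {streams_for_one_order} streams")
```

This only compares the numbers quoted in the thread; it can't say how many streams Cell could actually handle beyond what the demo showed.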
 
Jaws said:
Unless I missed something?

The notes!
JF_Aidan_Pryde's notes said:
Notes:
* Pentium 4 calculation is for the SSE unit only (excludes regular FPU).
* Cell calculation is for SPEs only (excludes PPE).
* Assuming media applications, ie. regular single precision FP instructions
 
rendezvous said:
Jaws said:
Unless I missed something?

The notes!
JF_Aidan_Pryde's notes said:
Notes:
* Pentium 4 calculation is for the SSE unit only (excludes regular FPU).
* Cell calculation is for SPEs only (excludes PPE).
* Assuming media applications, ie. regular single precision FP instructions

Haha!...I never read the small print! :)

Well, it makes sense now...I guess I've seen these figures so many times that it just looked odd to me when I always personally include those 'notes' for consistency... ;)
 
Slightly off topic, but when comparing a 3.5GHz dual core HT P4 with Cell, I think we should remember that only the Extreme Edition supports HT, its max clock speed at the moment is only 3.2GHz, and it might take a while before they can crank it up. Also, the model mentioned above will cost billions, whereas Cell will be in a ~$300-400 console. Keeping that in mind, Intel doesn't look so good.
 
PC-Engine said:
Jaws do you know what the double precision GFLOPS is for that same P4?

Not sure, I know PPC G5 > Pentium for DP flops per clock,

1 core G5, using FMADD ~ 2 (*2 FPU) ~ 4 Flops per cycle

Dual core G5 ~ 4+4~ 8 flops per cycle ~ 28 GFlops @ 3.5 GHz

Because the Pentium can't do FMADDs,

Dual core HT Pentium ~ 1(*2 FPU) ~ 2 Flops per cycle ~ 7 GFlops @ 3.5 GHz

Note, SSE and VMX units can't do DP AFAIK...
 
Oh man I don't know if I should laugh or cry. Yes it costs Intel $1000 each to manufacture Itaniums, Xeons, and P4EEs. Thanks for that post Dr. Evil. Better luck next time. :LOL:

BTW the comparison has nothing to do with Intel vs another company, but of course you need to turn it into X company is better than Y company. :rolleyes:

But I can play that game too. Intel makes HUGE profits on their highend cpus. SONY will be taking losses on the hardware. ;)

Thanks Jaws. :)

I guess that means CELL at 3.5GHz is orders of magnitude faster too when DP is used. :LOL:
 