ISSCC 2005

version · Feb 9, 2005

OK if fully pipelined and working on 6 vertices at the same time, but the code is big.

vload r1,r,r
vload r2,r,r
vload r3,r,r
vload r4,r,r
vload r5,r,r
vload r6,r,r
vmul r,r,r1
vmadd r,r,r2
vmadd r,r,r3
vmadd r,r,r4
vmadd r,r,r5
vmadd r,r,r6
vmul r,r,r1
vmadd r,r,r2
vmadd r,r,r3
vmadd r,r,r4
vmadd r,r,r5
vmadd r,r,r6
vmul r,r,r1
vmadd r,r,r2
vmadd r,r,r3
vmadd r,r,r4
vmadd r,r,r5
vmadd r,r,r6
vmul r,r,r1
vmadd r,r,r2
vmadd r,r,r3
vmadd r,r,r4
vmadd r,r,r5
vmadd r,r,r6

This is a matrix-vertex multiply on 6 vertices. It is a joke, but there are no stalls.
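
A rough way to check the "no stall" claim is to model the listing as a single-issue, in-order pipeline. The sketch below is only an illustration under stated assumptions: the 6-cycle latency is taken from the 6-7 cycle figure mentioned in this thread, the matrix columns are assumed to be already in registers, and `issue_cycles`, `transform` and the register names are hypothetical helpers, not anything from the SPU toolchain.

```python
def issue_cycles(program, latency):
    """Simulate a single-issue, in-order pipeline (an assumed, simplified model).

    program: list of (dest, sources) register-name tuples.
    Returns (cycles needed to issue everything, total stall cycles).
    """
    ready = {}   # register -> first cycle at which its value can be read
    cycle = 0
    stalls = 0
    for dest, sources in program:
        earliest = max((ready.get(s, 0) for s in sources), default=0)
        if earliest > cycle:
            # The instruction has to wait for an operand still in flight.
            stalls += earliest - cycle
            cycle = earliest
        ready[dest] = cycle + latency
        cycle += 1
    return cycle, stalls


def transform(vertices, interleaved):
    """Build one mul + three madd dependency chain per vertex (4x4 matrix)."""
    chains = [
        [(f"acc{v}", [f"acc{v}"] if step else [f"vtx{v}"]) for step in range(4)]
        for v in range(vertices)
    ]
    if interleaved:
        # Step 0 for every vertex, then step 1 for every vertex, and so on.
        return [chains[v][step] for step in range(4) for v in range(vertices)]
    # One vertex at a time: each madd waits on the previous result.
    return [ins for chain in chains for ins in chain]


print(issue_cycles(transform(6, True), 6))    # interleaved across 6 vertices
print(issue_cycles(transform(6, False), 6))   # one vertex at a time
```

In this model the interleaved order issues all 24 instructions back to back, while the one-vertex-at-a-time order spends most of its cycles waiting on the previous result. The price is the larger code and the six accumulator registers.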

MrFloopy · Feb 9, 2005

OK if fully pipelined and working on 6 vertices at the same time, but the code is big.

Yes, size versus speed is the tradeoff in such a simple case. For more complicated cases, however, the size issue diminishes, because loop unrolling to hide data fetches is no longer required.

Anyway, this is getting OT, as these are general issues affecting programming on pretty much every processor available in the past 10 or so years, and they are not unique to Cell.

MfA · Feb 9, 2005

He is not loop unrolling for data fetches, or at least not any more than for the latency of every other instruction he used in there.

Guden Oden · Feb 9, 2005

version said:
this is not a problem

How would YOU know how much extra hardware might be needed to run both int and float ops simultaneously?

The SPU is already 21 million transistors; that's what, 2/3 the size of the original AMD Athlon? Besides, local storage only delivers 128 bits of data per cycle. If you don't have all your data in registers already, chances are very good you're not going to see any speedup from simultaneous execution. That's likely part of the reason why STI didn't make the bugger do simultaneous execution in the first place.

latency is a BIG nightmare

You complain way too much. If you're going to keep whining like that, better stay away from the gaming business altogether... Go work for the Microsoft Office team instead. There, nobody is going to ask you to write high-performing tight code!

version · Feb 9, 2005

I mean, if add and mul are 6-7 cycles, divide is 30.

MfA · Feb 9, 2005

3DNow! does it in 2 iterations, so that is a bit of an overestimation IMO.

version · Feb 9, 2005

MfA said:
3DNow! does it in 2 iterations, so that is a bit of an overestimation IMO.

Yes, a 2-cycle imprecise result, and more cycles with iterations (Newton-Raphson method).

pc999 · Feb 9, 2005

Megadrive1988 said:
pc999 said:

Is that for consumers, i.e., not servers or the like?

Intel's Tanglewood is no doubt for servers, workstations and whatnot, not consumers

In that case it is better to ask about performance per die size; they usually make real monsters in those markets.

nAo · Feb 9, 2005

Dunno if it is true, but a guy called KelleyCook on Ars Forums wrote that:

For what it's worth, the SPE ISA itself is not new and is already included within GCC 3.4.

It comes with many of Freescale's (née Motorola) PPC cores. They call their base core, e500.

EDIT: I googled around and it seems e500 doesn't support 4-sp vectors :?

ciao,
Marco

MfA · Feb 9, 2005

version said:
Yes, a 2-cycle imprecise result, and more cycles with iterations (Newton-Raphson method).

Actually I was mistaken: only 1 lookup and 1 iteration are needed for a single-precision result with 3DNow! ... the iteration is just broken up into 2 steps because of the lack of FMA.

Anyway as I said, 30 cycles is an overestimation.
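
The lookup-plus-one-iteration scheme can be sketched numerically. This is only an illustration: the 12-bit seed accuracy is an assumption chosen for the example rather than the documented precision of any particular instruction, and `rough_reciprocal` and `refine` are hypothetical names. The point is that one Newton-Raphson step roughly squares the relative error, so a seed good to about 12 bits lands near single precision.

```python
import math


def rough_reciprocal(d, bits=12):
    """Stand-in for a table lookup: 1/d truncated to about `bits` significant bits."""
    mantissa, exponent = math.frexp(1.0 / d)
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)


def refine(d, x):
    """One Newton-Raphson step for f(x) = 1/x - d.

    Written as a multiply followed by a multiply-subtract, which is why a
    machine without fused multiply-add needs two operations for it.
    """
    return x * (2.0 - d * x)


d = 3.0
x0 = rough_reciprocal(d)      # imprecise estimate of 1/3
x1 = refine(d, x0)            # one iteration
err0 = abs(x0 * d - 1.0)      # relative error of the seed
err1 = abs(x1 * d - 1.0)      # relative error after one step
print(err0, err1)
```

A divide is then the refined reciprocal times the numerator, so its cost is a handful of dependent multiply-class operations rather than a long dedicated divider latency.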

archie4oz · Feb 9, 2005

The e500 doesn't support *any* vectors at all without SPE... However I think this person is confusing Motorola's SPE (their 2nd SIMD arch) for the SPE units in Cell...

DudeMiester · Feb 9, 2005

This thread was an interesting read. Cell looks very nice; fingers crossed for high-quality real-time raytracing!

AutomatedMech · Feb 9, 2005

Edited by moderator

Stop trolling the boards deadmeat. You have been banned in the past. You are not allowed to post here.

aaronspink · Feb 9, 2005

Megadrive1988 said:
''This is still the biggest chip technology advance in probably 20 years,'' said Richard Doherty, research director at Envisioneering Group in Seaford, Nassau County.

If anything, claims of a 10-fold leap in performance are understated, Doherty said. ''Our estimate is 10 to 20, so they're being conservative,'' he said.

He added Cell developers said they could have put 16 cores on the same size chip if they had thought it necessary.
Whether the team's work eclipses that of leader Intel remains to be seen. Monday was also the day on which Intel said it was now making a two-processor chip.

''It's poor timing,'' Doherty said. ''The twin-piston engine comes out the same day as a V-8.''

Ahh, ANALysts. So called because they have their head up.....

10x? bah, I can get coprocessors today that do 100 Gflops+ with all the same restrictions and programming headaches.

Aaron Spink
speaking for myself inc.

Deadmeat4 · Feb 9, 2005

...

Edited by moderator

Strike 2

Megadrive1988 · Feb 9, 2005

Something that I only asked about and never got a solid explanation for: since when has it been known that Cell does not have any eDRAM?

What happened to the 16-64 MB of eDRAM that was supposed to be one of the major advantages of Cell?

PC-Engine · Feb 9, 2005

Megadrive1988 said:
Something that I only asked about and never got a solid explanation for: since when has it been known that Cell does not have any eDRAM?

What happened to the 16-64 MB of eDRAM that was supposed to be one of the major advantages of Cell?

My guess would be they figured it was too difficult to manufacture the eDRAM at 65nm so they increased the LS and cache instead. Also the eDRAM was probably too slow? If you recall the eDRAM in EE+GS@90nm wasn't using 90nm. It was using 130nm IIRC.

Megadrive1988 · Feb 9, 2005

PC-Engine said:
Megadrive1988 said:

Something that I only asked about and never got a solid explanation for: since when has it been known that Cell does not have any eDRAM?

What happened to the 16-64 MB of eDRAM that was supposed to be one of the major advantages of Cell?

My guess would be they figured it was too difficult to manufacture the eDRAM at 65nm so they increased the LS and cache instead. Also the eDRAM was probably too slow? If you recall the eDRAM in EE+GS@90nm wasn't using 90nm. It was using 130nm IIRC.

OK, I suppose that all makes sense. Well, Sony had better compensate for the lack of eDRAM with lots more external main memory.
<that's the RAM hugger in me talking>

archie4oz · Feb 9, 2005

My guess would be they figured it was too difficult to manufacture the eDRAM at 65nm so they increased the LS and cache instead. Also the eDRAM was probably too slow?

eDRAM would be *much* easier than logic to deal with...

Plus I dunno about the slow part... While eDRAM has a longer latency than SRAM does, the much higher density offered by eDRAM means less wire delay than you get with SRAM, which can almost offset the latency penalty suffered by eDRAM...

V3 · Feb 9, 2005

Something that I only asked about and never got a solid explanation for: since when has it been known that Cell does not have any eDRAM?

What happened to the 16-64 MB of eDRAM that was supposed to be one of the major advantages of Cell?

Like I said before, the patent doesn't mention eDRAM. It was because of the 1024-bit bus that people assumed it was eDRAM.

Though with just a single Cell, do you think 25 GB/s of memory bandwidth is sufficient for the PS3 to feed Cell and the NV GPU without eDRAM somewhere in the system?
