Does Cell Have Any Other Advantages Over XCPU Other Than FLOPS?

ERP said:
3.0/7.0 is in double precision. The C++ standard is quite explicit about default FP precision being double
It didn't look like C++ to me. ANSI C allows expressions to be evaluated at higher precision ... and the comparison depends on what exactly happened to q; if it was converted from extended to double, there's trouble.

C++ has this nasty tidbit, which seems to allow the same thing:
The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.
A double doesn't have to be a double, even in C++.
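To make that concrete, here's a minimal sketch of how it can bite (my own hypothetical example; whether it actually misfires depends on compiler, target, and flags, e.g. x87 vs. SSE2 code generation):

Code:
#include <cstdio>

// On an x87 build, the division may be carried out in an 80-bit register,
// while q was rounded to a 64-bit double when stored to memory, so the
// comparison can fail. With SSE2 codegen both sides are plain doubles.
// The volatiles just block constant folding so the effect is observable.
double divide(double a, double b) { return a / b; }

int main() {
    volatile double a = 3.0, b = 7.0;
    double q = a / b;               // stored: rounded to 64-bit double
    if (q == divide(a, b))          // result may still be at 80-bit precision
        std::printf("equal\n");
    else
        std::printf("not equal\n");
}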
 
DemoCoder said:
But if you want a given sequence of code to produce the same result on two different processor architectures when dealing with doubles, then it's not true.
If by "same result" you mean produce exact same FP numbers, then obviously you're dealing with bigger problems then just precision - you have to force the compilers to use the exact same computation order on instruction level (which will further cripple performance of one or both your platforms).
Probably have to force using software implementations of any complex FP operations that one or the other platform might have in hardware as well.
 
Well, exact reproducibility might be desirable for some applications, but it was not specifically what I was referring to. The big problem with extended precision is unpredictable overflow/underflow across platforms (overflow more so than underflow). Code that assumes IEEE-754 behavior can fail catastrophically, e.g. the Ariane 5 rocket.
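For the overflow case specifically, here's a sketch (my own illustration, not from the Ariane report) of how the same expression can overflow on one platform and not on another:

Code:
#include <cfloat>
#include <cstdio>

// FLT_EVAL_METHOD (C99/C++11) reports how intermediates are evaluated:
// 0 = at the declared type (SSE2, PowerPC), 2 = at long double (x87).
int main() {
    std::printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);

    volatile double big = 1e308;
    double r = big * 10.0 / 10.0;   // big*10 exceeds double's range...
    // Strict IEEE-754 double: big*10 -> inf, and inf/10 stays inf.
    // x87 extended: big*10 fits the 80-bit range, so r comes back 1e308.
    std::printf("r = %g\n", r);
}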
 
Stop being so nitpicky for god's sake! :LOL:

Fact is, you can do reasonable calculations with FP instead of ints, with acceptable results, if ever so needed. Can we agree so far?
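To pin down why that works: a double's 53-bit significand represents every integer up to 2^53 exactly, and +, -, and * stay exact inside that range. A quick sketch:

Code:
#include <cstdio>

int main() {
    // Every integer with magnitude <= 2^53 has an exact double encoding.
    double a = 9007199254740992.0;               // 2^53
    std::printf("2^53     -> %.0f\n", a);
    std::printf("2^53 + 1 -> %.0f\n", a + 1.0);  // rounds back to 2^53

    // Integer-style arithmetic inside the safe range is bit-exact:
    double x = 1234567.0, y = 7654321.0;
    std::printf("x * y = %.0f (exact: %lld)\n",
                x * y, 1234567LL * 7654321LL);
}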
 
Anyway, integer arithmetic was probably not what was meant by the original question concerning FP vs. integer demands.

Cheers
 
I've read (or tried to, lol) the new article from Anand about the new Sun UltraSPARC T1 processor.
Interestingly, it seems they have chosen the same path as Sony+IBM: a lot of simple cores, no OoO logic, short pipelines.
The focus of these cores is throughput, while the SPEs aim at FP power. However, the two don't target the same tasks (i.e. server workloads with light computation vs. intensive FP calculation).
I think there is one main difference: Sun chose a lot of L2, while STI went with local store and huge bandwidth.
The two chips are nearly the same size.
I've heard a lot of talk on this forum about the DMA issue of the SPEs (no real direct access to memory).

I ask the more knowledgeable people here: should Sony and IBM have chosen a more Sun-like implementation of the SPEs in Cell, with roughly 2.5MB of L2 cache (512KB + 8x256KB), all other things being equal?
Here is the link: http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2657&p=1
 
liolio said:
I've read (or tried to, lol) the new article from Anand about the new Sun UltraSPARC T1 processor.
Interestingly, it seems they have chosen the same path as Sony+IBM: a lot of simple cores, no OoO logic, short pipelines.
The focus of these cores is throughput, while the SPEs aim at FP power. However, the two don't target the same tasks (i.e. server workloads with light computation vs. intensive FP calculation).
I think there is one main difference: Sun chose a lot of L2, while STI went with local store and huge bandwidth.
The two chips are nearly the same size.
I've heard a lot of talk on this forum about the DMA issue of the SPEs (no real direct access to memory).

I ask the more knowledgeable people here: should Sony and IBM have chosen a more Sun-like implementation of the SPEs in Cell, with roughly 2.5MB of L2 cache (512KB + 8x256KB), all other things being equal?
Here is the link: http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2657&p=1

From what I understand, the SPE local store provides direct (programmable) access to memory, with fewer penalties (if any) for memory misses (i.e. of the kind seen with cache misses).

I think Sony & IBM's main goal (as far as the PS3 is concerned) in designing the Cell SPEs' local memory was efficiency: to process/execute redundant code more quickly than is seen with cache-based memory. My take, anyway...
 
Nerve-Damage said:
From what I understand, the SPE local store provides direct (programmable) access to memory, with fewer penalties (if any) for memory misses (i.e. of the kind seen with cache misses).

No, all it does is move the main memory accesses around and force the programmer to schedule them manually.

On a cached architecture, when you try to access something that's not in the cache, the CPU stalls until it's loaded from memory. On a hardware-threaded CPU like Xenon, the other thread kicks in and runs while the first thread is stalled.

When the cache loads something from main memory, it loads the stuff around it as well; this unit is called a cache line. This reduces future stalls, because most of the time well-written code tends to access things in an orderly fashion near the thing that caused the miss. This is called the principle of locality. If the access pattern isn't orderly or cache friendly, the programmer can rearrange his data structures or algorithms to match the cache lines more closely, or take manual control and issue a prefetch ahead of when the data is needed to try to avoid a stall, though these techniques are not always possible. On some CPUs like Xenon, it's also possible to reserve cache lines and treat them like a local memory.
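A manual prefetch looks something like this (a sketch assuming GCC/Clang's __builtin_prefetch; the console compilers exposed their own cache-touch intrinsics such as dcbt, but the idea is the same):

Code:
#include <cstddef>

// Start each load kAhead iterations early, so the cache line is
// (hopefully) resident by the time the loop body touches it.
void scale(float* data, std::size_t n, float k) {
    const std::size_t kAhead = 16;   // tune to memory latency / line size
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kAhead < n)
            __builtin_prefetch(&data[i + kAhead]);  // non-binding hint
        data[i] *= k;
    }
}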

On the SPE, if you need something in main memory, then you must request a DMA to transfer it to local memory before you can use it. This is essentially the same as a cache miss. DMAs take a really long time, so the programmer needs to structure his SPE code so he can do something else while the DMA is happening.
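The standard trick is double buffering: process one buffer while the MFC streams the next one in. Here's a rough sketch using the Cell SDK's MFC intrinsics from spu_mfcio.h (process_buffer and the chunk sizes are placeholders, and I'm going from the public SDK docs here):

Code:
#include <spu_mfcio.h>
#include <stdint.h>

enum { CHUNK = 16384 };   // 16KB, the MFC's maximum single-transfer size

static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process_buffer(char* p, int size);   // placeholder kernel

// Pull nchunks * CHUNK bytes from effective address ea, overlapping
// each DMA with computation on the previously fetched buffer.
void stream_in(uint64_t ea, int nchunks) {
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);       // kick off first DMA
    for (int i = 1; i <= nchunks; ++i) {
        int nxt = cur ^ 1;
        if (i < nchunks)                           // start the next transfer
            mfc_get(buf[nxt], ea + (uint64_t)i * CHUNK, CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);              // then wait only on the
        mfc_read_tag_status_all();                 // current buffer's tag
        process_buffer(buf[cur], CHUNK);           // compute overlaps DMA
        cur = nxt;
    }
}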

The point is, if you have 1000 MB of input data to process, then regardless of whether you have a cache or a local memory, you have to schedule time to move 1000 MB of data into the cache or local memory, and time to write the results back out.

All having a local memory does is force the programmer to think about when to schedule the main memory accesses, rather than have the cache implicitly schedule them on his behalf. Though even on a cached architecture, the programmer will have to consider where to put his prefetches, and how to construct his algorithms and lay out his data structures so they're cache friendly.
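For what a "cache friendly" layout means in practice, here's one common illustration (structure-of-arrays vs. array-of-structures; my example, not tied to either console's toolchain):

Code:
// AoS: each particle is 64 bytes, but the loop below only needs x.
struct ParticleAoS { float x, y, z, mass; char name[48]; };

// SoA: the fields you stream over are packed contiguously.
struct ParticlesSoA { float* x; float* y; float* z; int count; };

// Uses 4 bytes out of every 64-byte struct -> most of each fetched
// cache line is wasted bandwidth.
void advance_aos(ParticleAoS* p, int n, float dx) {
    for (int i = 0; i < n; ++i) p[i].x += dx;
}

// Every byte of every fetched line of the x array is useful.
void advance_soa(ParticlesSoA& p, float dx) {
    for (int i = 0; i < p.count; ++i) p.x[i] += dx;
}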
 
There is one other interesting difference between the LS of an SPE and the cache architecture of Xenon. In both the SPE case and the Xenon case, in order to get good performance you want your application to run primarily from the cache or local store, since having to fetch data from main memory is very slow from a latency perspective. The advantage the LS in the SPE has is that it is very low latency, similar to an L1 cache. The Xenon L1 cache is very small (32K), so most apps will run from the 1MB L2 cache. The Xenon L2 cache is much higher latency than the LS of an SPE (30+ cycles vs. 6 cycles, if I recall correctly). The fact that the LS runs at L1 speeds, but is large at 256K, is part of what gives the SPE its high-performance characteristic.
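Running those recalled numbers at the 3.2 GHz clock both chips share (so, speculation on top of recollection):

Code:
#include <cstdio>

int main() {
    const double ghz = 3.2;               // clock shared by both CPUs
    std::printf("SPE LS hit:   6 cycles = %.2f ns\n", 6 / ghz);
    std::printf("Xenon L2 hit: 30 cycles = %.2f ns\n", 30 / ghz);
    // A dependent-load chain (e.g. pointer chasing) pays the full latency
    // on every step, so a 5x latency gap can approach a 5x throughput gap
    // for that access pattern.
}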
 
dcforest said:
There is one other interesting difference between the LS of an SPE and the cache architecture of Xenon. In both the SPE case and the Xenon case, in order to get good performance you want your application to run primarily from the cache or local store, since having to fetch data from main memory is very slow from a latency perspective. The advantage the LS in the SPE has is that it is very low latency, similar to an L1 cache. The Xenon L1 cache is very small (32K), so most apps will run from the 1MB L2 cache. The Xenon L2 cache is much higher latency than the LS of an SPE (30+ cycles vs. 6 cycles, if I recall correctly). The fact that the LS runs at L1 speeds, but is large at 256K, is part of what gives the SPE its high-performance characteristic.

Xenon L1 is 32KB data / 32KB instruction, 64KB in total, per core. The L1 caches together are also much faster than LS on an SPE, per core.

(http://www-128.ibm.com/developerworks/power/library/pa-fpfxbox/)
 
aaaaa00 said:
Xenon L1 is 32KB data / 32KB instruction, 64KB in total, per core. The L1 caches together are also much faster than LS on an SPE, per core.

How much faster? The page you're linking to doesn't list how fast, or does it?
 
Edge said:
How much faster? The page you're linking to doesn't list how fast, or does it?

The following is speculation from public sources only:

http://www-128.ibm.com/developerworks/library/pa-fpfxbox/figure2.gif

Instruction fetch is 4 instructions per clock (2 per thread). 16 bytes per clock * 3.2 GHz = 51.2 GB/s.

Xenon L1 cache is 32KB/32KB I+D. Let's assume the data cache has roughly the same throughput, so I'd say aggregate per-core L1 cache bandwidth is in the range of 100 GB/s.

http://www-128.ibm.com/developerworks/power/library/pa-cellperf/figure1.gif

SPE LS is 51.2 GB/s.
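Spelling out that arithmetic (same assumptions as above: 4 instructions x 4 bytes per clock on the I-side, and a data side guessed to match):

Code:
#include <cstdio>

int main() {
    const double ghz = 3.2;
    const double iside = 16 * ghz;   // 16 bytes/clock of instruction fetch
    const double dside = 16 * ghz;   // assumed comparable data throughput
    std::printf("Xenon L1 per core (I+D): ~%.1f GB/s\n", iside + dside);
    std::printf("SPE local store:          %.1f GB/s\n", 16 * ghz);
}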
 
I don't remember if it was on this board or a comment from a dev, but whatever, someone was saying that the 6-cycle latency could also be minimized by having each of the different SPEs do a task while one is waiting, or something like that.
To the many of you out there who've been doing work on in-order CPUs: how do you manage it?
I remember screwing up so badly! Although the class was only six months of C++, and on a friendly OoO CPU (Athlon MP) :)
 
aaaaa00 said:
The following is speculation from public sources only:

http://www-128.ibm.com/developerworks/library/pa-fpfxbox/figure2.gif

Instruction fetch is 4 instructions per clock (2 per thread). 16 bytes per clock * 3.2 GHz = 51.2 GB/s.

Xenon L1 cache is 32KB/32KB I+D. Let's assume the data cache has roughly the same throughput, so I'd say aggregate per-core L1 cache bandwidth is in the range of 100 GB/s.

http://www-128.ibm.com/developerworks/power/library/pa-cellperf/figure1.gif

SPE LS is 51.2 GB/s.

You are making an assumption about the data cache, and the L1 cache is four times smaller than the SPE LS, so there will be more misses, and thus more fetching out to the L2 cache, which greatly negates the extra 50 GB/s of bandwidth, if it exists at all.

Also, the SPE LS design by its nature forces programmers to carefully optimize for that memory size (especially in terms of instructions), which will not be done for the L1 cache, especially if the L1 caches are dealing with two threads, greatly increasing cache misses.

The 360's CPU has three sets of 64KB L1 cache, for 192KB total, versus 1792KB of SPE local store on CELL plus 64KB of L1 cache on the PPE, for a total of 1856KB. A difference of almost 10 times!!! So CELL has almost 10 times more data and instructions available to its processors at very low latency, compared to the 360's CPU.
 
In Xenon an L1 hit costs 5 cycles, but pipeline offsets often reduce this to a norm of 2 cycles.

With symmetric multi-threading, I presume this means that each thread sees a normal effective 1 cycle L1 hit latency. Or 2/3 cycles if the pipeline offsets didn't help.

It's worth remembering that the L1 I+D caches are shared by both threads, so that halves the effective bandwidth to each thread.

An L1 miss but hit on L2 in Xenon is a minimum of 39 cycles. Core 0 L2 latency is lower than cores 1 and 2 as it's physically closer to the cache on the die. Again, I presume with SMT the effective penalty is halved as long as the other thread on the core can continue running - but that depends on the programmer having issued a pre-fetch for the data, so that the L1-miss hardware thread is flushed, thus allowing the other hardware thread to continue.

L2 miss costs roughly 525 cycles.

All miss costs depend on the busyness of the L2 cache, as the request to L2 will be queued.

Also, it's worth remembering that Xenon supports a data streaming cache model, where L2 is excluded entirely from reading. 128-byte lines are fetched directly into L1 D, thus helping to preserve L2 for other tasks (including the output of such a streaming thread) and also fixing the effective latency at about 1 cycle per cache hit.

etc.

Jawed
 
OK, uhm, sorry to sound on-topic, but after about 40 pages of thread pertaining to the PS3, has the question been answered?

Does Cell Have Any Other Advantages Over XCPU Other Than FLOPS?


:smile:
 