Does Cell Have Any Other Advantages Over XCPU Other Than FLOPS?

ShootMyMonkey said:
Well, vector ops (SSE) are all single precision (and sometimes less than that when getting into *effective* precision). Scalar ops on any IEEE-compliant FPU are inherently double precision. Even if you're using single precision data types, the operations will be done internally in doubles and rounded down when actually stored.

15.2 GFLOPS for P4 comes from SSE (which is single precision). But you can theoretically get half that out of scalar ops which are double.

actually double precision vector ops were introduced in sse2. i believe some of them go at a rate of 2 flops/clock too.
 
ShootMyMonkey - AFAIK, that's incorrect. I'm pretty sure an FPU implementation that does not round to the correct precision (32 bit, 64 bit, 80 bit or 128bit) after every single operation is not fully IEEE compliant. That is, the registers themselves must contain correctly rounded results, and rounding cannot be deferred until a value is stored to memory.
 
psurge said:
ShootMyMonkey - AFAIK, that's incorrect. I'm pretty sure an FPU implementation that does not round to the correct precision (32 bit, 64 bit, 80 bit or 128bit) after every single operation is not fully IEEE compliant. That is, the registers themselves must contain correctly rounded results, and rounding cannot be deferred until a value is stored to memory.

Section 5 (Operations) of the IEEE-754 standard says:
"All conforming implementations of this standard shall provide
operations to add, subtract, ..., convert between floating-point
and integer formats, ... and compare. ... Except for binary<->decimal
conversion, each of the operations shall be performed as if it
first produced an intermediate result correct to infinite precision
and with unbounded range, and then coerced this intermediate result
to fit in the destination's format. .... Normally, a result is rounded to the
precision of its destination."

So, yep you're right - it is possible to store unrounded values if they fit. This brings up a whole host of issues with optimising compilers that potentially raise your single type to a double on the H/W and then round it arbitrarily in either direction (should really be 'round to even') so it appears to match your single type.

Typically though you don't have 80-bit or 128-bit FPUs (except in GPUs; remember that MMX and SSE use SIMD to bundle several floating-point values together, so you don't have a 128-bit FP number, rather a collection of smaller ones) and actually use software to emulate them. In fact, IEEE just states that with more than 79 bits (called 80-bit because the hidden bit becomes explicit) you have some extended type, but its definition is quite lax in comparison to the rest of the standard, which strictly defines bit-for-bit how operations should work (though it is assumed that if these algorithms work for the defined formats, they should also hold for the extra bits you add).

i believe some of them go at a rate of 2 flops/clock too

Perhaps, but it is not their speed that matters, it is loading the SSE values that takes time. Why these overwrite the FPU stack entries I will never know; many are the mysteries of x86 (and why it still exists :devilish: )
 
off-topic

Kryton said:
Perhaps, but it is not their speed that matters, it is loading the SSE values that takes time. Why these overwrite the FPU stack entries I will never know; many are the mysteries of x86 (and why it still exists :devilish: )

IIRC that was originally done to make the MMX register file virtually transparent to the OSes of the day, by letting the new register file be handled by the existing FPU context save/restore (which was often carried out at the task scheduler's discretion) without putting more stress on the already struggling schedulers. I believe it particularly helped one of them - the just-hatched Win95, whose context switching was far from stellar (to put it maximally mildly).
 
AlgebraicRing said:
Out of curiosity, what exactly is enlightened reasoning as opposed to ordinary reasoning?

having a pretty thorough knowledge of the actual issues at hand. Things like process transitions, power, die size, etc aren't exactly rocket science.

Aaron Spink
speaking for myself inc.
 
The key is that Sony and Microsoft have broken the law of what is a FLOPS and what's not.

It is crucial to separate FLOPS (double precision) from VOPS (single precision, executed on a vector unit like a SIMD or VLIW unit).
 
Urian said:
The key is that Sony and Microsoft have broken the law of what is a FLOPS and what's not.

It is crucial to separate FLOPS (double precision) from VOPS (single precision, executed on a vector unit like a SIMD or VLIW unit).

Could someone break down these 2 types of numbers for each console (theoretical of course) and what each are used for?
 
aaronspink said:
having a pretty thorough knowledge of the actual issues at hand. Things like process transitions, power, die size, etc aren't exactly rocket science.

Aaron Spink
speaking for myself inc.

Do you work in the industry, then? What is the basis of your thorough knowledge? How often are you wrong?
 
AFAIK, that's incorrect. I'm pretty sure an FPU implementation that does not round to the correct precision (32 bit, 64 bit, 80 bit or 128bit) after every single operation is not fully IEEE compliant. That is, the registers themselves must contain correctly rounded results, and rounding cannot be deferred until a value is stored to memory.
I never actually said "in memory." I only said when a result is stored -- yes, it applies to storing a result in registers as well. Although it's actually quite common for the intermediate computations in a compound expression to remain in double format until the last operation is done (though this is often the fault of a compiler) -- which in turn creates some occasionally fun errors because of some mismatches.

actually double precision vector ops were introduced in sse2. i believe some of them go at a rate of 2 flops/clock too.
I believe they only get 2 FLOPs/clock on anything that would anyway get 2 per clock if you did it with x87.

Why these overwrite the FPU stack entries I will never know, many are the mysteries of x86 (and why it still exists )
Hm. I guess someone must have figured it's really hard to add registers. Just like 640k ought to be enough for anyone, any program can be just fine with zero registers.
 
Urian said:
The key is that Sony and Microsoft have broken the law of what is a FLOPS and what's not.
Law? :)

To me, IEEE does not define what a floating-point number is - the actual, you know... floating point does.
 
Hey guys, what exactly does this diagram say about the bandwidth?

http://upload.wikimedia.org/wikipedia/en/a/af/X360bandwidthdiagram.jpg

Do the GPU and CPU have to share the 22.4GB/s bandwidth? Meaning that if it were divided equally, it would be 11.2GB/s for each? Or does it somehow have two buses to the RAM, going from the GPU?

http://pc.watch.impress.co.jp/docs/2005/0701/kaigai_6a.gif

Cell and RSX have their own connections to memory.

Might this be the reason Xenon has the eDram?

I know nothing about this so please correct me if I'm wrong. :smile:
 
Yes, the purpose of the eDRAM is to completely avoid polluting that main bandwidth, which must be shared between the CPU and GPU, with framebuffer ops.
 
Yes, the EDRAM makes the design of the interface to the main 512MB viable in overall system performance terms.

Being conservative you could say that the EDRAM provides, in effect, about 64GB/s of bandwidth dedicated to framebuffer operations.

Since framebuffer operations are the single most demanding use of memory bandwidth in games with high-end graphics, it makes sense that the rest of XB360 will be happy with the 22.4GB/s of memory available.

You'll find months and months' worth of mostly tedious discussion on this topic all over the Console forums, if you search :LOL:

Jawed
 
Jawed said:
Being conservative you could say that the EDRAM provides, in effect, about 64GB/s of bandwidth dedicated to framebuffer operations.

Why only 64GB/s? I thought the bandwidth for ops in the daughter die was 256GB/s
 
It's a guess.

In conventional PC GPUs lots of data compression is used to improve the efficacy of memory accesses - 32GB/s of compressed bandwidth might equate to the use of 128GB/s if compression weren't used (I dunno, just a guess.)

Xenos uses a mixed compressed (GPU<->EDRAM bridge, 32GB/s) and uncompressed (ROPs<->EDRAM, 256GB/s) scheme, which makes it fiddlesome to characterise its effective bandwidth for framebuffer operations in terms that are comparable with RSX (or PC GPUs).

In other words, because the EDRAM uses an uncompressed format, it's not fair to say that it's directly comparable to the bandwidth available to other GPUs, as they use compression.

ATI dropped compression because doing so lowers the transistor count (presumably quite heavily), and with the RAM being embedded the performance will be fantastic anyway.

Jawed
 
Compression

Jawed said:
Being conservative you could say that the EDRAM provides, in effect, about 64GB/s of bandwidth dedicated to framebuffer operations.

But the 32GB/s bridge data is not compressed, and also carries no MSAA multi-samples, no? So does your calculation of equivalent bandwidth include the savings a regular GPU gets from compression (which makes the equivalent bandwidth lower), or only the increase from AA samples (which makes it higher)? Thank you.
 