ISSCC 2005

512 MB, my speculation:

[attached image: sp.JPG]
 
Fafalada said:
Well there is another level to the speculation - how often can the lookup be issued in the first place? (we only "know" the throughput for MADDs so far :() If I can issue one every cycle that'd be a major boon :p - not to mention it'd mean the smallest transform would be limited solely by MADDs in that case.
If MfA is right and the reciprocal estimate 'just' performs a look-up in a table, I can't see why it shouldn't be possible (hardware-wise) to issue a reciprocal estimate every clock cycle.
I hope the reciprocal does a bit more than that... it would be very nice indeed if 1/x could be estimated without any intervention from the even pipe.

ciao,
Marco
 
Well the pipeline depth is 3, so you could only do estimates 3/4 of the time at best.

You need one or more single-precision multiply-adds to do the Newton-Raphson iteration; it does not make much sense not to use the even pipeline for that ... you'd risk having floating-point circuitry in the odd pipeline with very low utilization.
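For reference, one Newton-Raphson step refines an estimate x0 ≈ 1/a as x1 = x0 * (2 - a * x0) - two dependent multiply-adds that map naturally onto the even pipeline, roughly squaring the relative error each step. A minimal scalar sketch in C; recip_estimate here is just a stand-in for whatever the hardware lookup actually returns:

```c
#include <stdio.h>

/* Stand-in for the hardware reciprocal estimate (e.g. a table lookup).
   Here we fake a rough estimate by perturbing the exact value. */
static float recip_estimate(float a)
{
    return (1.0f / a) * 1.05f;   /* ~5% error, like a small lookup table */
}

/* One Newton-Raphson step: x1 = x0 * (2 - a * x0).
   Maps onto two multiply-adds: t = 2 - a*x0, then x1 = x0*t. */
static float recip_refine(float a, float x0)
{
    float t = 2.0f - a * x0;
    return x0 * t;
}

int main(void)
{
    float a  = 3.0f;
    float x0 = recip_estimate(a);
    float x1 = recip_refine(a, x0);   /* error roughly squares per step */
    float x2 = recip_refine(a, x1);
    printf("est=%g  one step=%g  two steps=%g  exact=%g\n",
           x0, x1, x2, 1.0f / a);
    return 0;
}
```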
 
MfA said:
Well the pipeline depth is 3, so you could only do estimates 3/4 of the time at best.
Which is still quite dandy. :p

You need one or more single-precision multiply-adds to do the Newton-Raphson iteration; it does not make much sense not to use the even pipeline for that ... you'd risk having floating-point circuitry in the odd pipeline with very low utilization.
Well VU1 had exactly that - they included an additional FMAC and FDIV in the odd pipeline, and the FMAC utilization in particular was quite low, as it wasn't even part of any really commonly used instructions.
Anyway I'm with nAo on this - I would definitely prefer to get the full estimate on the odd pipeline too, but won't hold my breath for it.

I'm already kinda ticked off that we'll be wasting even pipeline cycles to do things such as counter increments :?
 
It's different though, that lowers utilization of the even pipeline ... a Newton-Raphson iteration is pretty close to an ideal use of it.

I don't quite see why separate scalar/vector pipelines/register-sets would not have been better either (using LIW). Having to do flow control with the vector paths and counter arithmetic with the floating-point pipeline doesn't seem to make a whole lot of sense ... but I guess they tried all such configurations and this turned out the best use of resources for average workloads, shrug.
 
It doesn't work like that; lookup tables only give you estimates for elementary functions ... you use iterative algorithms to get precise results. Using exp and log is harder than just approximating 1/x directly.
 
MfA said:
It doesn't work like that; lookup tables only give you estimates for elementary functions ... you use iterative algorithms to get precise results. Using exp and log is harder than just approximating 1/x directly.

I mean a big lookup table in LS :)

SPE1 : matrix-vertex multiply work
SPE2 : lg lookup table
SPE3 : exp lookup table

It would be fast, if 256 KB is enough space for the lg and exp tables.
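For what it's worth, a scalar C sketch of that idea: computing 1/x as 2^(-log2 x) from log2 and exp2 tables. The table size (4096 entries each) and the indexing scheme are my assumptions, purely to illustrate the LS budget question the post raises - indexing by the full 23-bit mantissa would mean 32 MB per table, so the index must be truncated, trading accuracy for space:

```c
#include <math.h>
#include <stdio.h>

#define LOG_BITS   12               /* 4096-entry tables, 16 KB each    */
#define TABLE_SIZE (1 << LOG_BITS)

static float log2_tab[TABLE_SIZE];  /* log2(m) for mantissa m in [1,2)  */
static float exp2_tab[TABLE_SIZE];  /* 2^f     for fraction f in [0,1)  */

static void build_tables(void)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        log2_tab[i] = log2f(1.0f + (float)i / TABLE_SIZE);
        exp2_tab[i] = exp2f((float)i / TABLE_SIZE);
    }
}

/* 1/x = 2^(-log2 x), via two table lookups; assumes x > 0. */
static float recip_via_tables(float x)
{
    int e;
    float m = frexpf(x, &e);        /* x = m * 2^e, m in [0.5, 1)       */
    /* log2(x) = (e - 1) + log2(2m), with 2m in [1, 2) */
    float l  = (float)(e - 1)
             + log2_tab[(int)((2.0f * m - 1.0f) * TABLE_SIZE)];
    float nl = -l;                  /* want 2^(-log2 x)                 */
    float fl = floorf(nl);          /* integer part -> exponent         */
    return ldexpf(exp2_tab[(int)((nl - fl) * TABLE_SIZE)], (int)fl);
}

int main(void)
{
    build_tables();
    printf("1/3 ~ %g (exact %g)\n", recip_via_tables(3.0f), 1.0f / 3.0f);
    return 0;
}
```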
 
JF_Aidan_Pryde said:
Wouldn't it make more sense if the GPU and CPU memory were shared?

In my speculation the memory is shared: vertex data for the CPU, texture data for the GPU, but the CPU can read from the GPU's memory over FlexIO.
 
I like this statement

Members of the CELL processor family share basic building blocks, and depending on the requirement of the application, specific versions of the CELL processor can be quickly configured and manufactured to meet that need. The basic building blocks shared by members of the CELL family of processors are the following:

http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318

and something else of interest, CELL's double precision floating point capability:

As described previously, the prototype CELL processor’s claim to fame is its ability to sustain a high throughput rate of floating point operations. The peak rating of 256 GFlops for the prototype CELL processor is unmatched by any other device announced to date. However, the SPE’s are designed for speed rather than accuracy, and the 8 floating point operations per cycle are single precision (SP) operations. Moreover, these SP operations are not fully IEEE754 compliant in terms of rounding modes. In particular, the SP FPU in the SPE rounds to zero. In this manner, the CELL processor reveals its roots in Sony's Emotion Engine. Similar to the Emotion Engine, the SPE’s single precision FPU also eschewed rounding mode trivialities for speed. Unlike the Emotion Engine, the SPE contains a double precision (DP) unit. According to IBM, the SPE’s double precision unit is fully IEEE854 compliant. This improvement represents a significant capability, as it allows the SPE to handle applications that require DP arithmetic, which was not possible for the Emotion Engine.

Naturally, nothing comes for free and the cost of computation using the DP FPU is performance. Since multiple iterations of the same FPU resources are needed for each DP computation, peak throughput of DP FP computation is substantially lower than the peak throughput of SP FP computation. The estimate given by IBM at ISSCC 2005 was that the DP FP computation in the SPE has an approximate 10:1 disadvantage in terms of throughput compared to SP FP computation. Given this estimate, the peak DP FP throughput of an 8 SPE CELL processor is approximately 25~30 GFlops when the DP FP capability of the PPE is also taken into consideration. In comparison, Earth Simulator, the machine that previously held the honor as the world’s fastest supercomputer, uses a variant of NEC’s SX-5 CPU (0.15um, 500 MHz) and achieves a rating of 8 GFlops per CPU. Clearly, the CELL processor contains enough compute power to present itself as a serious competitor not only in the multimedia-entertainment industry, but also in the scientific community that covets DP FP performance. That is, if the non-trivial challenges of the programming model and memory capacity of the CELL processor can be overcome, the CELL processor may be a serious competitor in applications that its predecessor, the Emotion Engine, could not cover.

http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318&p=4
 
Like Faf, I'd prefer to have an FMADD unit on the odd pipeline that could also be used in the reciprocal estimate. I really would not care if it's not fully pipelined (could that save some die area?) and has a low throughput.
Well, even if it's not done that way and a single FMADD can be used for the first iteration, then maybe the shortest transform takes 6 cycles -> 5.3 GPoly/s (8 SPEs x 4 GHz / 6 cycles) ;)
One more thing: am I wrong, or is there no dot product instruction? :(

ciao,
Marco
 
nAo said:
One more thing: am I wrong, or is there no dot product instruction?
Well that table is horribly vague - it refers to the entire arithmetic set as "Multiply Accumulate".
If I take that literally, there is no add/sub, or madd/msub either :p
Plus no reference to broadcasts or swizzling - or such things as explicit outer product etc. (if there's no real swizzling).

I mean, seriously, if THAT table is ALL there is, this is a seriously handicapped ISA. Now, I'm not saying it has to be as uber-featured as the PSP VFPU, which comes with just about everything but the kitchen sink (seriously, that thing makes every SIMD FPU I've seen to date look primitive), but damn, at least the basics need to be covered if this is to serve as a math processor.
So for now I'll assume there are some things that aren't mentioned, such as vector member access, some kind of swizzle/broadcasts, and hopefully a dotproduct too :p
 
Danack said:
Question from the Ascii24 article:

"The SPE uses a fixed 32-bit instruction length; the instruction format has 3 source registers and 1 target register."

There are 128 registers, ergo 28 bits (4 x 7) are needed to specify four registers, leaving 4 bits for the opcode. This doesn't seem like much - presumably a lot of instructions don't use four registers and so can use the extra bits, but it still seems... odd. Anyone have any clues as to how the instruction set will work?

Oh and if the GPU isn't connected to the EIB on the PS3 I will be surprised.

Most instructions would only need 2 source registers; the only instruction that needs three source registers that I can think of right now is MAC, where I assume the destination register is also one of the source registers.
E.g.:
mac R1, R2, R3 => R3 = R3 + R1 * R2 (totally fictional ISA)
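A quick sketch of the arithmetic Danack describes: 4 opcode bits plus four 7-bit register fields is exactly 32 bits. The field order and widths below are hypothetical (the real SPE encoding hadn't been disclosed at this point); this just shows that the layout fits:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical 32-bit instruction word: 4-bit opcode plus four 7-bit
   register fields (enough for 128 registers). 4 + 4*7 = 32 exactly. */
typedef struct {
    unsigned op, rt, ra, rb, rc;
} insn;

static insn decode(uint32_t w)
{
    insn i;
    i.op = (w >> 28) & 0xF;   /* 4-bit opcode      */
    i.rt = (w >> 21) & 0x7F;  /* target register   */
    i.ra = (w >> 14) & 0x7F;  /* source register a */
    i.rb = (w >>  7) & 0x7F;  /* source register b */
    i.rc =  w        & 0x7F;  /* source register c */
    return i;
}

int main(void)
{
    insn i = decode(0x5ABCDEF0u);
    printf("op=%u rt=%u ra=%u rb=%u rc=%u\n", i.op, i.rt, i.ra, i.rb, i.rc);
    return 0;
}
```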
 
nAo said:
One more thing: am I wrong, or is there no dot product instruction?

IBM considered a high-clock, single-cycle dot-product "impossible", so I'm not surprised that, unless Sony pushed for it, it's not there.

A dot-product consists of a vector FMUL, a permute and a vector FADD. That permute in the middle upsets hardware people.

Not to say it can't be done in a vector unit if you push (i.e. pay) enough... ;)
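To illustrate DeanoC's point, a scalar C sketch of that FMUL/permute/FADD decomposition (the particular lane-fold order is my choice, not anything IBM published):

```c
#include <stdio.h>

typedef struct { float v[4]; } vec4;

/* Horizontal 4-element dot product: one lane-wise multiply, then a
   permute+add to fold the halves and another to fold the pairs. The
   cross-lane permutes in the middle are what make a single-cycle
   version hard at high clock speeds. */
static float dot4(vec4 a, vec4 b)
{
    vec4 p, t, s;
    for (int i = 0; i < 4; i++)          /* vector FMUL: p = a * b   */
        p.v[i] = a.v[i] * b.v[i];
    for (int i = 0; i < 4; i++)          /* permute {2,3,0,1} + FADD */
        t.v[i] = p.v[i] + p.v[i ^ 2];
    for (int i = 0; i < 4; i++)          /* permute {1,0,3,2} + FADD */
        s.v[i] = t.v[i] + t.v[i ^ 1];
    return s.v[0];                       /* sum lands in every lane  */
}

int main(void)
{
    vec4 a = {{1, 2, 3, 4}}, b = {{5, 6, 7, 8}};
    printf("dot = %g (expect 70)\n", dot4(a, b));
    return 0;
}
```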
 
DeanoC said:
IBM considered a high-clock, single-cycle dot-product "impossible", so I'm not surprised that, unless Sony pushed for it, it's not there.
A dot-product consists of a vector FMUL, a permute and a vector FADD. That permute in the middle upsets hardware people.
Not to say it can't be done in a vector unit if you push (i.e. pay) enough...
Well - having to spend the same number of instructions & cycles on a (rotation) matrix*vector transform as on a 2-vector dotproduct is what greatly upsets software people.
Or, for that matter, the squared vector length being as expensive to calculate as a full matrix*vector transform is equally annoying.

Actually we don't need a single-cycle dotproduct - any sequence of consecutive dotproducts can be written out as a series of MADDs.
What we NEED is a single-instruction dotproduct that doesn't stall subsequent instructions other than another dotproduct (no, I didn't try to figure out if that's possible right now, just making a point here :p).

Now - in VU1 there is this thing called the elementary function unit that at least takes some of the frustration off - because even though it's stupidly high latency, I can use the length and element-sum instructions, which are executed entirely on the odd pipeline, in any decently sized loop.
In the SPU it doesn't look like we're gonna be given any such grace, so the missing dotproduct makes things many times worse.
 
Fafalada said:
Well - having to spend the same number of instructions & cycles on a (rotation) matrix*vector transform as on a 2-vector dotproduct is what greatly upsets software people.
Or, for that matter, the squared vector length being as expensive to calculate as a full matrix*vector transform is equally annoying.
Yeah, that's why I asked.
The only way to get around those kinds of problems is to process more stuff at the same time and store your data accordingly (i.e. my vertex normals are stored as a block of 4 transposed normals in a 3x4 matrix; that way a DOT3 takes 0.75 cycles, but coding in this manner is annoying to say the least, it makes everything overcomplex).

ciao,
Marco
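A scalar C sketch of the layout nAo describes, with my own naming: three registers hold the x, y and z components of four normals, so three lane-wise MADDs produce four DOT3 results - 3 cycles for 4 dot products, i.e. the 0.75 cycles each he quotes - with no permute in sight:

```c
#include <stdio.h>

typedef struct { float v[4]; } vec4;

/* Four normals stored transposed (a 3x4 block): nx holds the four x's,
   ny the y's, nz the z's. Dotting all four against one direction
   (lx, ly, lz) is then three lane-wise multiply-adds. */
static vec4 dot3_x4(vec4 nx, vec4 ny, vec4 nz, float lx, float ly, float lz)
{
    vec4 d;
    for (int i = 0; i < 4; i++)          /* MADD 1: d  = nx * lx */
        d.v[i] = nx.v[i] * lx;
    for (int i = 0; i < 4; i++)          /* MADD 2: d += ny * ly */
        d.v[i] += ny.v[i] * ly;
    for (int i = 0; i < 4; i++)          /* MADD 3: d += nz * lz */
        d.v[i] += nz.v[i] * lz;
    return d;                            /* one DOT3 per lane    */
}

int main(void)
{
    vec4 nx = {{1, 0, 0, 1}}, ny = {{0, 1, 0, 1}}, nz = {{0, 0, 1, 1}};
    vec4 d  = dot3_x4(nx, ny, nz, 0.5f, 0.25f, 0.125f);
    printf("%g %g %g %g\n", d.v[0], d.v[1], d.v[2], d.v[3]);
    return 0;
}
```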
 
nAo said:
(i.e. my vertex normals are stored as a block of 4 transposed normals in a 3x4 matrix
Ewww :? just thinking about it makes me go brrr... I prefer to always work with 3-4 lights and rotate them into a matrix myself.
Then again I also have 2 sets of UVs packed into each vector in memory so all my vertex loops have to process a multiple of 2 vertices per iteration and silly things like that.

This is why I love the PSP's vector unit - for instance, with your normals example, you can store them in the normal fashion, but if you load all four of them into registers you can directly access them in whatever transposed fashion suits you best.
Of course, the PSP also has an explicit dotproduct, which includes the ability to dot directly between vertical and horizontal elements of a matrix... oh well...
 
About the GPU:
If Nvidia is not going to adapt their next-generation design to embrace eDRAM, what are they going to do?
It's a given they are modifying the GPU to interface nicely with the CELL CPU.
At this time we know the CELL CPU can have a max of 256 MB at 25.6 GB/s. The amount of RAM can be changed... but what about the external bandwidth?
25.6 GB/s can be doubled with more modules and more channels, or by doubling the channel frequency. One thing is for sure: 25.6 GB/s is too little to sustain a CELL CPU and a modern GPU (without eDRAM).
Is it feasible to have another 256 MB of XDR DRAM for the GPU (that amount of RAM is needed to get decent bandwidth for the GPU)? It seems quite odd to me. Are we going to see a PS3 GPU coupled with GDDR3/4 RAM?
What I mean is that maybe the eDRAM thing wasn't that bad ;)

ciao,
Marco
 