If MfA is right and the reciprocal estimate 'just' performs a look-up in a table, I can't see why it shouldn't be possible (hardware-wise) to issue a reciprocal estimate every clock cycle.

Fafalada said:
Well, there is another level to the speculation - how often can the lookup be issued in the first place? (We only "know" the throughput for MADDs so far.) If I can issue one every cycle, that'd be a major boon - not to mention it'd mean the smallest transform would be limited solely by MADDs in that case. Which is still quite dandy.

MfA said:
Well, the pipeline depth is 3, so you could only do estimates 3/4 of the time at best.
Well, VU1 had exactly that - they included an additional FMAC and FDIV in the odd pipeline, and the FMAC utilization in particular was quite low, as it wasn't even part of any really commonly used instructions.

You need one or more single precision multiply-adds to do the Newton-Raphson iteration; it does not make much sense not to use the even pipeline for that ... you'd risk having floating point circuitry in the odd pipeline with very low utilization.
MfA said:
It doesn't work like that; lookup tables only give you estimates for elementary functions ... you use iterative algorithms to get precise results, and using exp and log is harder than just approximating 1/x directly.
JF_Aidan_Pryde said:
Wouldn't it make more sense if the GPU and CPU memory were shared?
Members of the CELL processor family share basic building blocks, and depending on the requirements of the application, specific versions of the CELL processor can be quickly configured and manufactured to meet that need. The basic building blocks shared by members of the CELL family of processors are the following:
As described previously, the prototype CELL processor's claim to fame is its ability to sustain a high throughput of floating point operations. The peak rating of 256 GFlops for the prototype CELL processor is unmatched by any other device announced to date. However, the SPEs are designed for speed rather than accuracy, and the 8 floating point operations per cycle are single precision (SP) operations. Moreover, these SP operations are not fully IEEE 754 compliant in terms of rounding modes; in particular, the SP FPU in the SPE rounds toward zero. In this manner, the CELL processor reveals its roots in Sony's Emotion Engine. Like the Emotion Engine, the SPE's single precision FPU eschews rounding mode trivialities for speed. Unlike the Emotion Engine, however, the SPE contains a double precision (DP) unit. According to IBM, the SPE's double precision unit is fully IEEE 754 compliant. This is a significant capability, as it allows the SPE to handle applications that require DP arithmetic, something that was not possible on the Emotion Engine.
Naturally, nothing comes for free, and the cost of computation on the DP FPU is performance. Since each DP computation requires multiple passes through the same FPU resources, peak DP FP throughput is substantially lower than peak SP FP throughput. The estimate IBM gave at ISSCC 2005 was that DP FP computation in the SPE carries an approximate 10:1 throughput disadvantage compared to SP FP computation. Given this estimate, the peak DP FP throughput of an 8 SPE CELL processor is approximately 25~30 GFlops once the DP FP capability of the PPE is also taken into account. In comparison, Earth Simulator, the machine that previously held the title of world's fastest supercomputer, uses a variant of NEC's SX-5 CPU (0.15 um, 500 MHz) that achieves a rating of 8 GFlops per CPU. Clearly, the CELL processor contains enough compute power to present itself as a serious competitor not only in the multimedia-entertainment industry, but also in the scientific community that covets DP FP performance. That is, if the non-trivial challenges of the CELL processor's programming model and memory capacity can be overcome, it may be a serious competitor in applications that its predecessor, the Emotion Engine, could not cover.
Well, that table is horribly vague - it refers to the entire arithmetic set as "Multiply Accumulate".

nAo said:
Other case: am I wrong, or isn't there a dot product instruction?
Danack said:
A question from the Ascii24 article:

"The SPE uses a fixed 32-bit instruction length, with an instruction format of 3 source registers and 1 target register."
There are 128 registers, ergo 28 bits are needed to specify four registers, leaving 4 bits for the opcode. This doesn't seem like much - presumably a lot of instructions don't use all four registers and so can use the extra bits, but it still seems ... odd. Does anyone have any clues as to how the instruction set will work?
Oh, and if the GPU isn't connected to the EIB on the PS3, I will be surprised.
nAo said:
Other case: am I wrong, or isn't there a dot product instruction?

Well - having to spend the same number of instructions and cycles on a (rotational) matrix*vector transform as on a 2-vector dot product is what greatly upsets software people.

DeanoC said:
IBM considered a high-clock, single-cycle dot product "impossible", so I'm not surprised that unless Sony pushed for it, it's not there.
A dot product consists of a vector FMUL, a permute, and a vector FADD. That permute in the middle upsets hardware people.
Not to say it can't be done in a vector unit if you push (i.e. pay) enough...
Yeah, that's why I asked.

Fafalada said:
Well - having to spend the same number of instructions and cycles on a (rotational) matrix*vector transform as on a 2-vector dot product is what greatly upsets software people.

Or, for that matter, calculating a vector's squared length being as expensive as a full matrix*vector transform is equally annoying.
Ewww :? Just thinking about it makes me go brrr... I prefer to always work with 3-4 lights and rotate them into a matrix myself.

nAo said:
(ie. my vertex normals are stored as a block of 4 transposed normals in a 3x4 matrix