Cell's PPE and Xenon compared.

Jawed said:
Also, see this thread:

http://www.beyond3d.com/forum/showthread.php?t=23361

I can't believe the continued, idiotic resistance to the idea that Cell PPE and Xenon are 1-VMX designs when we have all this evidence.

There's no doubt that Cell PPE has "something up its sleeve", but it isn't two VMXs.

Jawed

Well, realworldtech compared DD1 and DD2 in their article by examining the die and the transistor logic, and their conclusion was that a large part of the FPU/vector logic has been doubled. So this is most likely a second VMX unit, which would also fit with what Crytek have said about the performance being "better than SMT".
 
Oooops, my bad. I meant sub-units/execution units within the VMX unit: one for arithmetic and one for load/store, or something equivalent. I still think that the XCPU has had to lose something in order to gain something, and that the PPE VMX unit is basically the standard unit as provided by IBM.
 
3N1gM4 said:
Both the PPE and XCPU will probably have 2 VMX units. Having just one unit would be too much of a backward step in terms of performance.
A step back compared to what? The norm is for a CPU to have either NO vector unit or just one. xCPU will have THREE in total.

Just having a lot of computational resources isn't any good if you can't keep them fed, either, and there's a lot to suggest that this will be a bottleneck in MS's chip, even without any mythical 2nd VMX unit for each CPU core.

Besides, MS has already published flops figures for the thing, and they correlate well with one vector processing element per core.
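
(For reference, this is the sort of back-of-the-envelope arithmetic that "correlating" a flops figure with one vector unit per core boils down to. The core count, clock and vector width below are illustrative assumptions for the sketch, not quoted specs.)

```c
/* Back-of-the-envelope peak-flops check: one 4-wide VMX unit per core,
 * counting a fused multiply-add as 2 flops per lane. The core count and
 * clock below are placeholder assumptions, not official figures. */
#include <stdio.h>

int main(void)
{
    const double cores          = 3.0;   /* assumed number of cores        */
    const double clock_ghz      = 3.2;   /* assumed clock in GHz           */
    const double lanes          = 4.0;   /* one 4-wide (vec4) VMX per core */
    const double flops_per_lane = 2.0;   /* MADD = 1 mul + 1 add per lane  */

    double gflops = cores * clock_ghz * lanes * flops_per_lane;
    printf("peak single-precision: %.1f GFLOPS\n", gflops); /* 76.8 with these inputs */
    return 0;
}
```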
 
Guden Oden said:
Besides, MS has already published flops figures for the thing, and they correlate well with one vector processing element per core.
But they seemingly ignore the DP instruction for some reason.

Jawed
 
Jawed said:
But they seemingly ignore the DP instruction for some reason.
Um, MS did state the number of DPs the chip can calculate/sec afair. What makes you say they ignored it, and assuming they did, what effect would that have, in your opinion?
 
My point of view is that the DPs must be multi-vector DPs (rather than performing the DP for one vector at a time).

The reason I suggest this is solely based on the length of the Xenon DP pipeline. It's fourteen stages (excluding register fetches). The Cell SPE floating point pipeline is 6 stages.

Obviously this is just a guess. The cloaks of NDA are pretty annoying about these basic things.

There's little doubt that what appears to be the four FP pipelines (vec4) in Cell PPE (bottom right-hand corner in the first post of this thread) take up about double the area of the four FP pipelines in Xenon. :devilish:

Jawed
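
(To make the "multi-vector DP" versus "one vector at a time" distinction concrete, here is a scalar-C sketch of the two interpretations. The function names and SoA layout are mine, purely for illustration; nothing here is claimed to be the actual VMX implementation.)

```c
/* Two readings of "dot product" on a 4-wide SIMD unit, spelled out in
 * scalar C. Names and data layout are illustrative, not the actual VMX ISA. */

/* Horizontal DP: one AoS vector {x,y,z,w} collapses to a single scalar.
 * 4 muls + 3 adds = 7 flops, and the adds form a dependency chain, which
 * is one reason a DP pipeline tends to be longer than a plain MADD. */
float dp_horizontal(const float v[4], const float u[4])
{
    return v[0]*u[0] + v[1]*u[1] + v[2]*u[2] + v[3]*u[3];
}

/* "Multi-vector" DP: with the data transposed to SoA (all x's in one
 * register, all y's in another, and so on), four independent dot products
 * fall out of ordinary lane-wise multiply-adds; a purely vertical "DP" is
 * really just a MADD sequence, as noted later in the thread. */
void dp_four_at_once(const float x[4],  const float y[4],
                     const float z[4],  const float w[4],
                     const float ux[4], const float uy[4],
                     const float uz[4], const float uw[4],
                     float out[4])
{
    for (int i = 0; i < 4; ++i)          /* each i is one SIMD lane */
        out[i] = x[i]*ux[i] + y[i]*uy[i] + z[i]*uz[i] + w[i]*uw[i];
}
```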
 
Does the DP function run concurrently with other float ops? I was of the opinion it was just an instruction, not an extra processing unit, and so consumed FP capacity of the VMX units with only a little increase in efficiency, but then I'm not at all well read on the XeCPU!
 
Fafalada said:
A 4-component DP is 7 flops/cycle, a MADD is 8. Why would anyone count with DPs when that would reduce the FP rating?
I'm guessing that it's a four-vector DP, not a four-component DP.

I'm trying to rationalise the 29 stages of the FP pipeline in Xenon, that's all. The register fetch prolly takes 3 stages (instead of 2) because of the increased register file size. We also know that the FP pipeline can de-compress (although, frankly, I don't really get this; surely by the time data is in a register it is already decompressed) and re-orient vectors (SoA versus AoS) on the fly. So that will account for some of the 29 stages.

The Instruction Queue and Dependency Check stages appear to be common to both PPE and Xenon. In Xenon it's 8 stages (out of the 29), but we dunno how many it is in PPE. SPEs don't seem to have such a unit, so it makes comparing the FP pipelines of SPE and PPE somewhat difficult :devilish:

Jawed
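
(Since "re-orient vectors (SoA versus AoS) on the fly" comes up above, here is a minimal plain-C sketch of what that re-orientation amounts to, essentially a 4x4 transpose. The struct names and the helper are hypothetical, just to pin down the terminology.)

```c
/* AoS vs SoA, and the 4x4 transpose that converts between them. The types
 * and helper are hypothetical, just to pin down the terminology; on
 * VMX-style hardware the reshuffle would be done with permute/merge ops. */

typedef struct { float x, y, z, w; } Vec4;   /* AoS: one struct per element  */

typedef struct {                             /* SoA: one array per component */
    float x[4], y[4], z[4], w[4];
} Vec4x4SoA;

/* Re-orient four AoS vectors into SoA form (conceptually a 4x4 transpose). */
void aos_to_soa(const Vec4 in[4], Vec4x4SoA *out)
{
    for (int i = 0; i < 4; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
        out->w[i] = in[i].w;
    }
}
```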
 
Shifty Geezer said:
Does the DP function run concurrently with other float ops?
In all these architectures (Cell PPE, Cell SPE, Xenon) the FP pipeline appears to be able to dual-issue a math op with a load/store/permute.

I was of the opinion it was just an instruction, not an extra processing unit, and so consumed FP capacity of the VMX units with only a little increase in efficiency, but then I'm not at all well read on the XeCPU!
DP is just a pipeline. When doing a math op, the Xenon VMX runs down one of the available math pipelines: DP, Vector, Vector Simple (dunno what that means! - add?), scalar.

http://www.beyond3d.com/forum/showthread.php?t=23361

Jawed
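
(A minimal sketch of the kind of code that benefits from pairing a math op with a load/store/permute each cycle, written with AltiVec/VMX C intrinsics. The unrolling, names and loop are illustrative; whether and how the issue slots actually pair up is down to the hardware and the compiler's scheduling, and the arrays are assumed 16-byte aligned.)

```c
/* A dot-product accumulation written so that independent loads and MADDs are
 * interleaved, giving a dual-issue VMX front end a math op and a load to
 * pair up each cycle. The real pairing is up to the hardware and compiler;
 * a and b are assumed 16-byte aligned. */
#include <altivec.h>

/* Sums a[i] * b[i] over n four-float vectors; n must be a multiple of 2 here. */
vector float madd_sum(const float *a, const float *b, int n)
{
    vector float acc0 = (vector float){0.0f, 0.0f, 0.0f, 0.0f};
    vector float acc1 = (vector float){0.0f, 0.0f, 0.0f, 0.0f};

    for (int i = 0; i < n; i += 2) {
        /* loads (load/store slot)... */
        vector float va0 = vec_ld((i + 0) * 16, a);
        vector float vb0 = vec_ld((i + 0) * 16, b);
        vector float va1 = vec_ld((i + 1) * 16, a);
        vector float vb1 = vec_ld((i + 1) * 16, b);
        /* ...can overlap with these MADDs (math slot), since the two
         * accumulators keep the dependency chains independent. */
        acc0 = vec_madd(va0, vb0, acc0);
        acc1 = vec_madd(va1, vb1, acc1);
    }
    return vec_add(acc0, acc1);
}
```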
 
Jawed said:
I'm guessing that it's a four-vector DP, not a four-component DP.
Matrix SIMDs are a rather inefficient use of die space if you're not targeting some extremely specialized application field. And the XeCPU's field isn't nearly that specialized.

SPE's don't seem to have such a unit - so it makes comparing the FP pipeline of SPE and PPE somewhat difficult
True, but we do have other CPUs out there with a DOT instruction, and their DP pipelines are longer than MADD too. It's a horizontal operation and all that; we had quite a bit of discussion on issues with DP before (when the first SPE info was unveiled), especially in regard to high-clocked processors having such operations.
 
jawed,

dp is strictly horizontal. if you had a vertical dp (i.e. multi-vector dp) that would have been a madd/macc, not a dp.

Fafalada said:
Matrix SIMDs are a rather inefficient use of die space if you're not targeting some extremely specialized application field. And the XeCPU's field isn't nearly that specialized.

i actually find the sh4 matrix-vector multiplication op quite clever and universal. it proved to be of use too ;)
 
darkblu said:
jawed,

dp is strictly horizontal. if you had a vertical dp (i.e. multi-vector dp) that would have been a madd/macc, not a dp.



i actually find the sh4 matrix-vector multiplication op quite clever and universal. it proved to be of use too ;)

And it was still sometimes faster to do the individual dotproducts because it gave you more latitude in scheduling.
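
(For the curious, here are the two decompositions being contrasted, written out in scalar C: the broadcast-and-MADD-per-column form that a single matrix-vector instruction effectively repeats internally, versus one horizontal dot product per row. The code is illustrative only and doesn't correspond to any particular ISA.)

```c
/* Two ways to decompose a 4x4 matrix times vec4, written in scalar C for
 * clarity (illustrative code, not any particular ISA):
 *   - the "matrix op" view: broadcast each input component and MADD it
 *     against a matrix column, which is what a single hardware
 *     matrix-vector instruction effectively repeats internally;
 *   - the "four dot products" view: one horizontal DP per row, each
 *     independent of the others and free to be scheduled around. */

/* m is stored column-major: m[c][r] is row r of column c. */
void matvec_column_madds(const float m[4][4], const float v[4], float out[4])
{
    for (int r = 0; r < 4; ++r)
        out[r] = 0.0f;
    for (int c = 0; c < 4; ++c)          /* one broadcast + MADD per column */
        for (int r = 0; r < 4; ++r)
            out[r] += m[c][r] * v[c];
}

void matvec_row_dots(const float m[4][4], const float v[4], float out[4])
{
    for (int r = 0; r < 4; ++r)          /* one 4-component dot product per row */
        out[r] = m[0][r]*v[0] + m[1][r]*v[1] + m[2][r]*v[2] + m[3][r]*v[3];
}
```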
 
darkblu said:
i actually find the sh4 matrix-vector multiplication op quite clever and universal. it proved to be of use too ;)
You know damn well I was referring to execution resources, not damn repeat instructions :p
There are recent CPUs with ISAs offering full matrix support (not just an odd instruction here and there), but they still stick with one vector's worth of execution resources.
 
Fafalada said:
You know damn well I was referring to execution resources, not damn repeat instructions :p
There are recent CPUs with ISAs offering full matrix support (not just an odd instruction here and there), but they still stick with one vector's worth of execution resources.
you know, one of these days we'll corner you and won't let you go until you spill everything you know about that bloody vfpu ; )
 
darkblu said:
you know, one of these days we'll corner you and won't let you go until you spill everything you know about that bloody vfpu ; )

I got the pitchfork, you got the torches... we need a hound-dog... well, that can be nAo (Italian pro programmers, especially from that region of Italy, have a great nose)... we know pretty much where to look... LET THE HUNT BEGIN!!!!
 