Xenon VMX units - what have we learned?

I'm talking about this:

fmad r3, r2, r1, r0.xxxx

hence the ability to splat a scalar without having to explicitly use an additional instruction. It would save register space (no need to keep temporary splatted copies of sub-components of a vector) and it would save additional splat instructions -> big win (but it would cost chip area and ISA 'area' :) )
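For the assembly-averse, here is roughly what that one hypothetical instruction replaces today, as a minimal sketch using the standard AltiVec/VMX C intrinsics vec_splat and vec_madd (the function name is mine, and I'm assuming fmad means r3 = r2 * r1 + r0.xxxx):

#include <altivec.h>

/* On current VMX the broadcast costs an explicit splat into a
   temporary register before the madd can issue. */
vector float madd_with_splat(vector float r2, vector float r1, vector float r0)
{
    vector float r0_xxxx = vec_splat(r0, 0); /* extra instruction + register */
    return vec_madd(r2, r1, r0_xxxx);        /* r3 = r2 * r1 + r0.xxxx */
}

With the proposed instruction the vec_splat line (and its temporary) simply disappears.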

Marco

I assume you mean multiply-adding (etc.) a vector (or 3) by a sub-component of another vector (sorry, I haven't used assembly).

That would be a complex instruction from a hardware point of view; firstly, the instruction may not fit in the standard instruction format, screwing everything up.
A simpler version would be to multiply one or 2 vectors by the sub-part of another. That would fit, but it'll still complicate things. It'll involve having a splat hardware unit in front of the multiplier, which will potentially increase latencies for everything behind it, or alternatively require the clock to be lowered - so you get one instruction faster, but it slows everything else down.

The whole RISC philosophy is to implement the most common instructions in the smallest, simplest hardware possible. It's cheaper to make and easier to clock higher. In the case of the XCPU and Cell, they've used the small core size to increase the number of cores. The downside, of course, is that code size increases...
 
ADEX said:
That would be a complex instruction from a hardware point of view; firstly, the instruction may not fit in the standard instruction format, screwing everything up
A standard MADD is already part of the ISAs we're discussing. And broadcast MADD is 4 extra instructions; I'm sure they have that much space left in the instruction descriptor.

so you get one instruction faster
It's not just instruction count. A trivially dumb example (the second simplest arithmetic operation with SIMD, IMO) - a matrix multiply - has up to 4x worse latency with splats than the equivalent broadcast syntax (and 2x the instruction count, and ~3x more register usage).
Take any algorithm that interleaves arithmetic with conditionals (say, collision tree traversal) and that latency penalty will translate almost directly into performance, because there's nowhere to schedule the splats around.
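To make the matrix case concrete, here is the splat version as a minimal sketch in AltiVec C intrinsics (the function name is mine): 4 splats + 4 madds, where a broadcast madd syntax would need just the 4 madds, and every splat sits on the critical path in front of its madd.

#include <altivec.h>

/* Column-major 4x4 matrix (columns c0..c3) times vector v. */
vector float mat4_mul_vec4(vector float c0, vector float c1,
                           vector float c2, vector float c3,
                           vector float v)
{
    vector float zero = (vector float){0.0f, 0.0f, 0.0f, 0.0f};
    vector float r = vec_madd(c0, vec_splat(v, 0), zero); /* c0 * v.xxxx */
    r = vec_madd(c1, vec_splat(v, 1), r);                 /* + c1 * v.yyyy */
    r = vec_madd(c2, vec_splat(v, 2), r);                 /* + c2 * v.zzzz */
    r = vec_madd(c3, vec_splat(v, 3), r);                 /* + c3 * v.wwww */
    return r;
}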

Other 'smart' stuff includes things like direct component access, swizzles, multi-direction register access, component loads/stores... but broadcast is pretty much the worst omission, since it accounts for the majority of permutes in arithmetic code.
 
So the VMX units are "add-on" units as flec04 stated?

Correction: VMX is something that a lot of PowerPCs have, including - if I'm not mistaken - the Cell PPE. What Xenon has additionally is "VMX128". What it changes is that a) the CPU has more registers of the vector float type, and b) one of the execution blocks inside the CPU, the one responsible for vector math, is modified.

Are you saying that lessening or negating the LHS penalty would mean sacrificing more transistors? Could you elaborate on this, please?

More transistors, of course.

Are there any other negatives to having a lot of registers, other than the loss of space on the chip (trade-offs)?

No idea. I don't understand why they didn't add more int and float registers, as well. Maybe there's something about diminishing returns?
 
Oh sorry :oops:
You even gave me a hint on it and I still missed it.
"a) the CPU has more registers of a the vector float type and b) one of the execution blocks inside the CPU, the one responsible for vector math, is modified."

I take it you mean the Vector Floating Point Unit.
I do not know enough about the differences between that and the Vector Complex Integer Unit. Thanks, I will read up more on it via the links provided by Alstrong.
 
I assume you mean multiply-adding (etc.) a vector (or 3) by a sub-component of another vector (sorry, I haven't used assembly)
exactly
That would be a complex instruction from a hardware point of view
I don't think so; it works exactly how current fmadds are already implemented in VMX units. The only difference is that they would need to sacrifice die area to insert a crossbar that can broadcast one component to the others after a register is read from the register file.
firstly the instruction may not fit in the standard instruction format screwing everything up.
As I previously wrote, I'm aware this would consume 'ISA area', but I'm not asking for a general broadcasting scheme; being able to broadcast a single component to the remaining 3 would consume just a couple of bits.
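A couple of bits really is all it takes to select one of four components. A toy decode in plain C, with the field layout invented purely for illustration:

#include <stdint.h>

typedef struct { float c[4]; } vreg; /* one 4-float vector register */

/* A hypothetical 2-bit 'sel' field from the instruction word picks
   which component of the second source operand is broadcast to all
   four lanes before the multiply. */
static vreg broadcast(vreg src, uint32_t sel)
{
    vreg out;
    out.c[0] = out.c[1] = out.c[2] = out.c[3] = src.c[sel & 3];
    return out;
}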

It'll involve having a splat hardware unit in front of the multiplier, which will potentially increase latencies for everything behind it, or alternatively require the clock to be lowered - so you get one instruction faster, but it slows everything else down.
Well... IBM implemented dot products on the 360 that you really don't want to implement if latency is an issue; I'm sure broadcasting's contribution to latency wouldn't be that bad... I mean, the PS2 EE was clocked very aggressively and it supported that! (and here I declare my love for the PS2 VUs... ;) )

The whole RISC philosophy is to implement the most common instructions in the smallest, simplest hardware possible. It's cheaper to make and easier to clock higher. In the case of the XCPU and Cell, they've used the small core size to increase the number of cores. The downside, of course, is that code size increases...
It's all about trade-offs in the end, as usual.
I think IBM (and Sony and Toshiba) did quite an amazing job with SPUs for example, but I don't agree with all their choices.
 
Re: floating point values as loop counters - it's not the best thing to do, as you have a pipeline flush penalty when branching on the result of a floating-point comparison operation - gee, who would want to branch after a compare?!
I meant a parallel counter. I'm just trying to imagine a situation where you really need integer-to-float conversion in code segments that have been found to be performance critical with a profiler.
 
I'm just trying to imagine a situation where you really need integer-to-float conversion in code segments that have been found to be performance critical with a profiler.
You can use integer to floating point conversion on floating point numbers to perform a very cheap exponential, or convert a floating point number... to a floating point number to perform a cheap logarithm.
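For the curious, here is a minimal scalar sketch of this family of tricks in plain C - one common formulation, so the exact variant being alluded to may differ. It leans on the fact that an IEEE-754 float's bit pattern, read as an integer, is roughly (log2(x) + 127) * 2^23:

#include <stdint.h>
#include <string.h>

/* Bit-level reinterpretation, kept aliasing-safe via memcpy. */
static uint32_t f2bits(float f) { uint32_t u; memcpy(&u, &f, 4); return u; }
static float bits2f(uint32_t u) { float f; memcpy(&f, &u, 4); return f; }

/* Cheap log2: reinterpret the float's bits, convert int->float,
   then scale and remove the exponent bias. */
static float fast_log2(float x)
{
    return (float)f2bits(x) * (1.0f / 8388608.0f) - 127.0f;
}

/* Cheap exp2 (roughly valid for -126 < x < 128): build the bit
   pattern (x + 127) * 2^23 with one float->int conversion and
   reinterpret it as a float. */
static float fast_exp2(float x)
{
    return bits2f((uint32_t)((x + 127.0f) * 8388608.0f));
}

fast_log2(8.0f) returns 3 exactly and fast_exp2(3.0f) returns 8 exactly; in between, the error comes from the mantissa being treated as if it were linear.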
 
You can use integer to floating point conversion on floating point numbers to perform a very cheap exponential, or convert a floating point number... to a floating point number to perform a cheap logarithm.

nAo's MAGIC trick resurfaces once again ;).
 
When I first started reading about Cell, I am sure I read something about some of the registers in the SPE being holdovers from the PS2's vector units - is this true, or did I misread?
 
I don't think so; it works exactly how current fmadds are already implemented in VMX units. The only difference is that they would need to sacrifice die area to insert a crossbar that can broadcast one component to the others after a register is read from the register file.
As I previously wrote, I'm aware this would consume 'ISA area', but I'm not asking for a general broadcasting scheme; being able to broadcast a single component to the remaining 3 would consume just a couple of bits.

There are two issues with this.
1. The latency induced by the MUX-DEMUX required for the crossbar functionality is non-trivial (i.e. look at the latency for permute).
2. To maintain the orthogonality of VMX, you'd ideally want to support the same crossbar functionality for 8-bit entities - that would require 16 4-bit fields instead of the 4 2-bit ones for SP FP.

Separating the shuffling (permute) from the arithmetic is perfectly valid IMO. The real killer is the combination of long-latency operations - dot product (horizontal add), permute, and to some extent even the arithmetic ops - with the in-order nature of the CPU (i.e. very vulnerable to RAW hazards). The solution is, as you already mentioned, to go SOA - with all the pains that brings.
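For reference, 'going SOA' here means keeping x, y and z in separate registers so that dot products become plain vertical madds. A minimal sketch in AltiVec C intrinsics (the type and function names are mine):

#include <altivec.h>

typedef struct {
    vector float x; /* x0 x1 x2 x3 */
    vector float y; /* y0 y1 y2 y3 */
    vector float z; /* z0 z1 z2 z3 */
} vec3_soa;

/* Four dot products at once: three madds, no permutes, no
   horizontal adds, no RAW hazard on a long-latency shuffle. */
vector float dot4(vec3_soa a, vec3_soa b)
{
    vector float zero = (vector float){0.0f, 0.0f, 0.0f, 0.0f};
    vector float d = vec_madd(a.x, b.x, zero);
    d = vec_madd(a.y, b.y, d);
    d = vec_madd(a.z, b.z, d);
    return d; /* { dot0, dot1, dot2, dot3 } */
}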

Cheers
 
Gubbi said:
1. The latency induced by the MUX-DEMUX required for the crossbar functionality is non-trivial (i.e. look at the latency for permute)
But this isn't a crossbar - all we are asking for is a repeat of one of the 4 components of only the second source operand. E.g. the VUs do not support arbitrary permute at all, because, as you note, it can be non-trivial latency-wise, and they were going for 4 clocks across the board - but they did manage rotates, broadcasts, and field masking.
The PSP adds the missing bits at the same clock & latency, but that was 5 years of R&D later (I particularly like their solution for keeping the instruction count down even with arbitrary field accesses and swizzles).

2. To maintain the orthogonality of VMX, you'd ideally want to support the same crossbar functionality for 8-bit entities - that would require 16 4-bit fields instead of the 4 2-bit ones for SP FP.
Broadcasts are 4 extra instructions (12 if you include them for separate add and mul) either way - there's no need to make them work across all data types (especially on the SPE). You aren't seeing the dot product on the 360 working with 16/8/4-bit entities either.
 
But this isn't a crossbar - all we are asking for is a repeat of one of the 4 components of only the second source operand. E.g. the VUs do not support arbitrary permute at all, because, as you note, it can be non-trivial latency-wise, and they were going for 4 clocks across the board - but they did manage rotates, broadcasts, and field masking.
The PSP adds the missing bits at the same clock & latency, but that was 5 years of R&D later (I particularly like their solution for keeping the instruction count down even with arbitrary field accesses and swizzles).

But what you're asking for has more or less the same complexity in hardware: each component input to the ALU for the second source operand can get its value from any one of the four component values in the source register.

The reason why permutes (and broadcasts) are expensive today is that wire delay is a more and more dominant factor in chip design compared to transistor switch speeds. Look back at AltiVec in the G4: permutes were single-cycle, FMADDs were 4 cycles.

Exposing/separating the inherent cost of permutes/broadcasts is a Good Thing IMO, even if it's a PITA for developers.

A little OOO execution would have helped a lot with scheduling around the long data-dependency latencies of the SIMD instructions.

Cheers
 
The reason why permutes (and broadcasts) are expensive today is that wire delay is a more and more dominant factor in chip design compared to transistor switch speeds. Look back at AltiVec in the G4: permutes were single-cycle, FMADDs were 4 cycles.

Wire delay acts over long wires; a permute unit will only send data a fraction of a millimetre. The G4 was quite different from Cell, though, in that it has a very shallow pipeline, i.e. it has a small number of pipeline stages but does more per stage. This has an advantage for latency (in cycles) but also means a lower clock and lower throughput.

Cell is the other way around: it has a longer pipeline but does less per stage. Longer latencies, but the higher clock that makes possible means much higher throughput.

A little OOO execution would have helped alot with scheduling around the long data-dependency latencies of the SIMD instructions.

Probably true, but any advantage gained would likely be lost to a frequency drop. In this case it's probably best done in the compiler.

That said, there are OOO-"lite" approaches which might work without significant side effects - e.g. the later G4s can issue AltiVec instructions OOO, but it's very limited.
 
Wire delay acts over long wires; a permute unit will only send data a fraction of a millimetre. The G4 was quite different from Cell, though, in that it has a very shallow pipeline, i.e. it has a small number of pipeline stages but does more per stage. This has an advantage for latency (in cycles) but also means a lower clock and lower throughput.

Wire delay is worst at the finest geometry, which is at the lower metal levels - and that is local interconnect only.

The G4 was less aggressively pipelined, but its VMX units *are* fully pipelined, and here you have a great increase in latency (measured in cycles) for permutes and only a modest increase in latency for FMADDs.

Cheers
 
Some Slashdot users are claiming that the Xenon VMX unit has most of the innovations/customizations of the Cell SPE...

You could certainly view the XeCPU VMX units as a means of addressing the computational throughput of the SPEs, but they're really not similar to one another at all.
 