Let's move over here and talk NI now.
Oh. I was still thinking CPU; I really have to change my thinking a bit now. I understand now that x,y,z,w,t are basically a bit like an ALU pool that can be connected/configured by the VLIW instruction in quite a flexible way. The DOT4 instruction (for example) is not simply a serial accumulation of its partial products; it's really an instruction that sets up a specific configuration of the network of ALU nodes to accomplish the DOT4 not only in one clock, in order, but also with better precision. Right so far?
Yep.
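To see the precision point concretely, here's a toy C sketch. The double accumulator is only a stand-in for whatever wider internal datapath a fused DOT4 might actually use; that's an assumption on my part, not a statement about the real hardware.
Code:
#include <stdio.h>

/* Serial DOT4: each MUL/ADD rounds to float individually,
   so rounding error can accumulate across the three adds. */
float dot4_serial(const float a[4], const float b[4]) {
    float acc = a[0] * b[0];
    acc += a[1] * b[1];
    acc += a[2] * b[2];
    acc += a[3] * b[3];
    return acc;
}

/* "Fused" DOT4: all four products are summed in a wider format
   and rounded once at the end. The double here only stands in
   for a wider internal datapath. */
float dot4_fused(const float a[4], const float b[4]) {
    double acc = 0.0;
    for (int i = 0; i < 4; i++)
        acc += (double)a[i] * (double)b[i];
    return (float)acc;
}

int main(void) {
    /* Cancellation case where the two variants differ:
       serial prints 1.0, fused prints the exact 2.0. */
    float a[4] = {1e8f, 1.0f, -1e8f, 1.0f};
    float b[4] = {1.0f, 1.0f,  1.0f, 1.0f};
    printf("serial: %.1f  fused: %.1f\n",
           dot4_serial(a, b), dot4_fused(a, b));
}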
Okay, this brings up a truckload of questions and ideas.
The VLIWs, I suppose, are too complex to be decoded into completely distinct signal sets; presumably the bits in the VLIW map almost directly to pathway on/offs.
Wouldn't this be a clear incentive to explore the VLIW instruction space, trying to detect VLIW configurations which are not documented but work (because they follow the mechanics of VLIW instruction encoding)?
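As a pure thought experiment, the exploration could look like the sketch below. submit_bundle and read_result are entirely made up, I know of no real interface for pushing raw VLIW words at the ALUs, and the stubs only exist so the sketch compiles:
Code:
#include <stdint.h>
#include <stdio.h>

/* Entirely hypothetical stand-ins: no such interface exists.
   They model "execute this raw VLIW word and observe the result";
   the stubs just fake a chip so the sketch is runnable. */
static int submit_bundle(uint64_t vliw_word) {
    return (vliw_word & 0x7) != 0x7;   /* fake: some encodings fault */
}
static float read_result(void) {
    return 1.0f;                       /* fake: constant result */
}

/* Probe undocumented encodings: flip the opcode field of a
   known-good bundle and keep candidates that neither fault nor
   behave non-deterministically across two runs. */
static void probe(uint64_t known_good, int shift, int bits) {
    uint64_t mask = ((1ull << bits) - 1) << shift;
    for (uint64_t op = 0; op < (1ull << bits); op++) {
        uint64_t cand = (known_good & ~mask) | (op << shift);
        if (!submit_bundle(cand))
            continue;                  /* faulted or hung: skip  */
        float r1 = read_result();
        submit_bundle(cand);           /* run twice: repeatable? */
        float r2 = read_result();
        if (r1 == r2)
            printf("encoding %#llx looks live\n",
                   (unsigned long long)cand);
    }
}

int main(void) {
    probe(0x123456789abcdef0ull, 32, 5);  /* made-up field position */
    return 0;
}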
It's interesting to consider that basically any permutation of all the ALUs available in the pool could be expressed and executed as a VLIW instruction.
Including possible rerouting of t-unit outputs into the other ALUs' inputs. Something like MULSIN.
If it's not possible yet, it's definitely a great way to generalize the current architecture, making it extremely powerful.
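In dataflow terms, here is a toy C model of a hypothetical fused MULSIN versus issuing the MUL and the SIN in two separate instruction groups (the commented assembly is invented, not real ISA output):
Code:
#include <math.h>
#include <stdio.h>

/* Two instruction groups: the MUL result takes a round trip
   through a register (or the PV latch) before the t-unit sees it. */
static float mul_then_sin(float a, float b) {
    float t = a * b;    /* group N:   x: MUL_e  T0.x, a, b */
    return sinf(t);     /* group N+1: t: SIN    r, PV.x    */
}

/* Hypothetical fused MULSIN: the multiplier output is wired
   straight into the t-unit input inside one bundle. */
static float mulsin(float a, float b) {
    return sinf(a * b);
}

int main(void) {
    printf("%f %f\n", mul_then_sin(0.5f, 3.0f), mulsin(0.5f, 3.0f));
}
Both compute the same value, of course; the difference would be one issue slot saved and no round trip through PV or the register file.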
I think I was still thinking out-of-order, meaning I assumed you'd get the same throughput as a single DOT4 instruction with multiple equivalent MUL/ADD instructions (using distinct outputs, though).
Okay, my brain fires up; better wait for confirmation (GPUs == in-order?). The hailing of ATI's x,y,z,w,t concept will have to come later. No wonder Fermi has such monstrous dimensions ...
Processing of ALU instructions is strictly in-order in ATI.
Okay, that's another thing that could be modified just a little for a big effect.
Though I still can't really connect what I know now with the assembler output:
Code:
60  x: MUL_e  T0.x, PV59.z, R8.x
    y: MUL_e  T1.y, PV59.z, R6.x  VEC_021
    z: MUL_e  T0.z, PV59.z, R7.x  VEC_102
    w: ADD    ____, PV59.w, T0.z
In theory the three MULs within this x,y,z,w,t block are independent of each other, which means a throughput of three instructions per clock should be possible, doing all three MULs in a single cycle (there must be at least four multipliers available to support the one-clock DOT4). It could even be four per clock (if the assembler realizes that T0.z is temporary and trashed directly afterwards), because the last ADD can be folded into a MULADD, leading to a single clock for the entire block.
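In C terms the folding would look like this; fmaf is just a stand-in for the hardware MULADD slot, and the commented assembly is my guess at the fused form, not compiler output:
Code:
#include <math.h>
#include <stdio.h>

/* As emitted: z-lane MUL into T0.z, then the w-lane ADD consumes it. */
static float unfused(float pv_z, float r7_x, float pv_w) {
    float t0_z = pv_z * r7_x;   /* z: MUL_e  T0.z, PV59.z, R7.x */
    return pv_w + t0_z;         /* w: ADD    ____, PV59.w, T0.z */
}

/* If the assembler sees that T0.z dies immediately, the pair could
   collapse into one MULADD on the w lane, freeing the z slot. */
static float fused(float pv_z, float r7_x, float pv_w) {
    return fmaf(pv_z, r7_x, pv_w);  /* w: MULADD ____, PV59.z, R7.x, PV59.w */
}

int main(void) {
    printf("%f %f\n", unfused(2.0f, 3.0f, 1.0f), fused(2.0f, 3.0f, 1.0f));
}
Numerically the two can differ slightly, since fmaf rounds only once; whether the real MULADD slot is fused in that sense I don't know.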
So, what I don't really understand is how the identifiers in front of each line relate to the identifiers on the registers.
The destination registers' channels all appear to match the lane identifier in front of the line; with the t-unit it's different:
Code:
120  x: MUL_e  R5.x, R1.x, PV119.z
     t: MUL_e  R27.x, R0.x, PV119.z
So what I wonder is whether this whole identifier business is basically the assembler's expression of the wiring to apply between the ALUs, with "____" being a buffer-less wiring (the value does not go to the register file and does not collect $100).
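A toy model of how I read that: PV/PS as forwarding latches for the previous group's results, and "____" as "compute but skip the register-file write". The lane-to-latch mapping and the T-register indices below are my assumptions, not anything from the docs:
Code:
#include <stdio.h>

typedef struct { float x, y, z, w; } pv_t;

static pv_t  PV;             /* previous-group vector lanes: PVnn.x..w */
static float PS;             /* previous-group t-lane result: PSnn     */
static float regs[128][4];   /* register file: R0..R127.xyzw           */

/* dst_reg[lane] == -1 encodes a "____" destination: the value still
   lands in the forwarding latch, it just never pays the RF write. */
static void retire_group(const float r[5],
                         const int dst_reg[5], const int dst_chan[5]) {
    for (int lane = 0; lane < 5; lane++)
        if (dst_reg[lane] >= 0)
            regs[dst_reg[lane]][dst_chan[lane]] = r[lane];
    PV.x = r[0]; PV.y = r[1]; PV.z = r[2]; PV.w = r[3];
    PS   = r[4];             /* the forwarding path exists either way */
}

int main(void) {
    /* group 60 from above; T0/T1 mapped to made-up indices 100/101,
       and the w-lane "____" result goes only to PV.w. */
    float r[5]         = { 1.f, 2.f, 3.f, 4.f, 0.f };
    int   dst_reg[5]   = { 100, 101, 100, -1, -1 };
    int   dst_chan[5]  = { 0, 1, 2, 0, 0 };
    retire_group(r, dst_reg, dst_chan);
    printf("PV.w=%f (never touched the register file)\n", PV.w);
}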
I suspect that making the shader-internal execution OOO is not as simple (in terms of additional transistors) as saying it out loud, but it's a very local change with a possibly huge effect.
Once OOO is there, the calculations are basically wire-limited; you could technically do a DOT4 explicitly as MUL/ADD instructions if you had enough wires: x,y,z,u,v,w,a,b.
Well, this is just a crazy outburst without a deep understanding of how a particular x,y,z,w,t shader unit exactly looks and behaves (I mean a real logic plan and an FSM description).
There's some debate about whether NVidia's GPUs actually re-order ALU instructions - I think they do.
Jawed
Isn't it determinable? Let's say you see a value pulled out of the cache before the supposedly following write to global memory (an inversion)?
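On a CPU you'd answer exactly that with a store-buffering litmus test; a GPU version would have the same shape in a shader, with the stores and loads hitting global memory. It only detects reordering that becomes visible through memory, which is the kind of inversion described above. A CPU-side C11 sketch (relaxed atomics, so the hardware is free to reorder):
Code:
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Store-buffering litmus test. If both loads ever see 0, a store
   became visible after the load that followed it in program order:
   the "inversion" in question. */
static atomic_int x, y;
static int r0, r1;

static void *t0(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}
static void *t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    int inversions = 0;
    for (int i = 0; i < 100000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r0 == 0 && r1 == 0)
            inversions++;
    }
    printf("inversion observed %d / 100000 times\n", inversions);
}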