> You might be able to have them take up the same amount of space by just working on more pixels per block.

There you go, answering your own question.
Well, yeah, it requires wider blocks. But since the basic unit is a quad (4 pixels x 4 channels = 16 scalar operations), 16 scalar ALUs grouped into one block would be a sort of natural number.
Now, of course, in theory, that'd quadruple the number of threads in flight in the ALUs at a given time, compared to a Vec4 implementation, so there'd be some overhead there. As I love to repeat, though, there is a very simple way to "fix" that, but people would most likely kill me if I said "the word"...
> Now, of course, in theory, that'd quadruple the number of threads in flight in the ALUs at a given time, compared to a Vec4 implementation, so there'd be some overhead there.

I don't think it does, because it takes longer to complete vector instructions. The number of threads in flight is only there to hide data latency (instruction latency is over an order of magnitude less). You'd only quadruple the number of threads if you want to get data from the texture units between each scalar instruction, which is silly. Do 4 scalar instructions instead.

> The number of threads in flight is only there to hide data latency (instruction latency is over an order of magnitude less).

Certainly; however, if in a Vec4 architecture the instruction latency is 10 cycles and the data latency is 200 cycles, the total latency to hide would be 210 cycles. With a scalar architecture it'd become 240 cycles, a ~14% increase. While that might be quite small, it's still not negligible.
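A quick sketch of that arithmetic, for anyone following along. The 10-cycle and 200-cycle figures are the post's illustrative round numbers, not measured G80 latencies:

#include <stdio.h>

int main(void) {
    /* Illustrative figures from the post above, not measured values */
    const int instr_latency = 10;   /* cycles for one ALU instruction        */
    const int data_latency  = 200;  /* cycles for a texture/data fetch       */
    const int expansion     = 4;    /* one Vec4 op -> 4 dependent scalar ops */

    int hide_vec4   = instr_latency + data_latency;               /* 210 */
    int hide_scalar = instr_latency * expansion + data_latency;   /* 240 */

    printf("Vec4:   %d cycles to hide\n", hide_vec4);
    printf("Scalar: %d cycles to hide (~%.0f%% more)\n", hide_scalar,
           100.0 * (hide_scalar - hide_vec4) / hide_vec4);
    return 0;
}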
> Just a thought I had...if you're going to be doing SIMD on blocks of pixels, why should scalar pipelines need to be any bigger than vector pipelines? You might be able to have them take up the same amount of space by just working on more pixels per block.
Now, the big question is if G80 implements this, and I honestly don't know. I wouldn't be surprised if it didn't, but it'd certainly make sense if it did; and a very basic implementation (KISS!) of this wouldn't be too complex.
Uttar
> From what DemoCoder said in the other thread, the processors are 8-wide SIMD, with 2 groups of these in each cluster.

I forget - what's the thinking behind such a configuration again? What's the advantage vs 16-wide SIMD per cluster?

> What's the advantage vs 16-wide SIMD per cluster?

I'm not really sure. Maybe it helps flexibility in some way.
> Uttar, that 14% difference is certainly smaller than "quadrupling the number of threads in flight".

Obviously; not sure I presented it as something quite that dramatic, though. I did say "that'd quadruple the number of threads in flight in the ALUs at a given time", but heh.
> I'm pretty sure that the compiler talk you speak of has been around since R300. If you want to take advantage of co-issue, you really need to be able to separate independent instruction streams from a shader.

The compiler already had to do it for co-issue to a certain extent, but I would assume that the hardware worked on what basically amounts to instruction "blocks" that included descriptions of what each unit had to do in a given cycle. A relatively good proof of that lies in R300's instruction limits for DX9: ATI claimed they were higher than the specification, because it could handle both 96 Vec3 instructions AND 96 scalar instructions. So you'd clearly assume they were stored together. And anyway, if that wasn't the case, you'd basically have an OoOE engine in hardware...
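For illustration, a rough C sketch of what such an instruction "block" could look like; the layout and field names are invented, not ATI's actual encoding:

/* Hypothetical co-issue instruction "block": one Vec3 slot and one scalar
 * slot packed together and dispatched in the same cycle. */
typedef enum { OP_NOP, OP_ADD, OP_MUL, OP_MAD } opcode_t;

typedef struct {
    /* Vec3 unit: operates on the xyz channels */
    opcode_t vec3_op;
    int      vec3_dst, vec3_src0, vec3_src1;

    /* Scalar unit: an independent operation, typically on the w channel */
    opcode_t scalar_op;
    int      scalar_dst, scalar_src0, scalar_src1;
} instr_block_t;

/* 96 such blocks would hold 96 Vec3 instructions AND 96 scalar
 * instructions, matching the R300 claim, with no OoO machinery:
 * the compiler just has to pair each Vec3 op with an independent
 * scalar op so both slots are filled. */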
archie, G80's "scalar" processors are SIMD.
They do the same scalar instruction on one channel of many pixels (32 pixels per batch) instead of the same vector instruction on all 4 channels of the pixels in a quad. From what DemoCoder said in the other thread, the processors are 8-wide SIMD, with 2 groups of these in each cluster.
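To make that concrete, here's a rough C sketch of AoS (Vec4) versus SoA (scalar) execution; the 32-pixel batch is from the post, everything else is schematic:

#define BATCH 32  /* pixels per batch, per the post */

/* Vec4/AoS: one vector instruction touches all 4 channels of each pixel
 * in a quad. */
void vec4_mul(float dst[][4], const float a[][4], const float b[][4],
              int quad_base) {
    for (int p = quad_base; p < quad_base + 4; p++)   /* the 4 pixels of a quad */
        for (int c = 0; c < 4; c++)                   /* x, y, z, w together    */
            dst[p][c] = a[p][c] * b[p][c];
}

/* Scalar/SoA: the same scalar instruction touches ONE channel of many
 * pixels; an 8-wide SIMD group would walk this loop 8 pixels per clock. */
void scalar_mul(float dst[4][BATCH], const float a[4][BATCH],
                const float b[4][BATCH], int chan) {
    for (int p = 0; p < BATCH; p++)
        dst[chan][p] = a[chan][p] * b[chan][p];
}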
> Performance wise you may be able to make the two equivalent, but implementation wise, scalar seems a lot simpler to implement.

I think it's a combination of both hardware-level and driver-level support for instruction packing. If G80's instruction scheduler (driver side) doesn't implement checkpointing or a small sliding instruction window (less likely) before asking the hardware to do work, I'd be pretty surprised at this point.
> I think it is worth pointing out that while one could consider the G8x to be SIMD (what do you call Xenos then, super-SIMD? SIMD-extreme? Seems like any GPU with a batch size > 1 would be labeled SIMD then), each has different granularity issues. Threads have local storage context, and there are definite differences between allocating scalar registers to thread contexts and allocating vectorized registers. You need extra hardware to implement swizzling and replication, which you do not need on the G8x, for example.

Of course all GPUs are SIMD. Simultaneous instruction, multiple data, right?
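A purely illustrative sketch of the swizzle point: in a Vec4 design a swizzle needs an operand crossbar, while in an SoA scalar design it collapses into register selection.

/* Vec4/AoS: a swizzle such as b.yzxw needs a 4-wide crossbar on the
 * operand path, since all four channels enter the ALU together. */
void mul_swizzled(float dst[4], const float a[4], const float b[4],
                  const int swz[4]) {   /* e.g. swz = {1,2,0,3} for .yzxw */
    for (int c = 0; c < 4; c++)
        dst[c] = a[c] * b[swz[c]];      /* hardware: per-channel mux */
}

/* Scalar/SoA: each channel already lives in its own register, so the
 * "swizzle" is just the compiler picking a different source register:
 *     MUL r_dst_x, r_a_x, r_b_y
 * No extra routing hardware required. */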
> Also, arranging independent instruction streams to optimally pack the ALUs and registers is much more complicated than simply grouping quads into batches of 2 or 4, all sharing the same PC, but being dispatched to 8-16 SPs.

You're referring to the rumours of how R600 works? I'll believe it when I see it, because it seems like a dumb way of increasing utilization given the obvious SOA-esque method of G80.
> Of course all GPUs are SIMD. Simultaneous instruction, multiple data, right?

SIMD => Single Instruction, Multiple Data.
Consider a Vec4 sequence like this:

MUL R0, R1, R2
ADD R3, R0, R2

On a scalar architecture, the same work becomes eight scalar instructions:

MUL R0x, R1x, R2x
MUL R0y, R1y, R2y
MUL R0z, R1z, R2z
MUL R0w, R1w, R2w
ADD R3x, R0x, R2x
ADD R3y, R0y, R2y
ADD R3z, R0z, R2z
ADD R3w, R0w, R2w
Now take a chain of dependent scalar MULs:

MUL r1x, r0x, r0x
MUL r2x, r1x, r1x
MUL r3x, r2x, r2x

With checkpointing, the scheduler would see something like:

MUL r1x, r0x, r0x
*checkpoint, please do not progress until this MUL is done*
MUL r2x, r1x, r1x
*checkpoint, please do not progress until this MUL is done*
MUL r3x, r2x, r2x
*checkpoint...*
I'm not sure which is more expensive, but I think it depends on how everything is implemented.
Checkpointing (if that's even a word) in this instance would also have to take into account multiple thread types and ALU pipelining when helping the hardware schedule threads. Dropping in a checkpoint on dependent instructions for one thread type like that is just automatic, surely?
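For what it's worth, a toy model of that kind of checkpoint scheduling: each thread parks at its checkpoint until the pipelined MUL it depends on would have completed, while other threads keep issuing. The thread count and 10-cycle ALU depth are assumptions for illustration only.

#include <stdio.h>

#define ALU_DEPTH 10   /* assumed pipeline depth; not a real G80 figure */
#define NTHREADS   4
#define CHAIN      3   /* the three dependent MULs from the example above */

typedef struct {
    int pc;            /* next MUL in the dependent chain       */
    int ready_cycle;   /* checkpoint: may not issue before this */
} thread_t;

int main(void) {
    thread_t t[NTHREADS] = {{0}};
    int done = 0;
    for (int cycle = 0; done < NTHREADS; cycle++) {
        /* one issue slot per cycle in this toy model */
        for (int i = 0; i < NTHREADS; i++) {
            if (t[i].pc >= CHAIN || cycle < t[i].ready_cycle)
                continue;                 /* finished, or parked at checkpoint */
            printf("cycle %3d: thread %d issues MUL #%d\n", cycle, i, t[i].pc);
            t[i].ready_cycle = cycle + ALU_DEPTH;   /* next checkpoint */
            if (++t[i].pc == CHAIN)
                done++;
            break;
        }
    }
    return 0;
}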