vector vs scalar shaders

Mat3

Newcomer
I'm interested in knowing more about the move away from vector units in GPUs. I have some questions comparing the 360 GPU, R600, and G80 series.

The 360 shaders are vec4 + 1 scalar. Do pixel shaders ever get full use out of that 4th vector data component?

Comparing the vec4 + 1 to the 5-way superscalar in the R600 architecture, how much denser is the 360 shader array? And how much denser is the vec4 + 1 compared to five individual scalar shaders as in G80? I assume density is the advantage of going superscalar over scalar?

Just curious, would a vec2 + vec2 + 1 organisation work too?

Thanks
 
Comparing the vec4 + 1 to the 5-way superscalar in the R600 architecture, how much denser is the 360 shader array?

I would guess the big difference here is that with vec4 + 1 you could have just two register files, a wide vector one and a scalar one. Each register file would require at most 3 reads when all ALUs are set to a MADD operation.

In the case of the superscalar design, each of the 5 ALUs can address different scalar registers in the register file. You could have 1 large scalar register file, but then, if all ALUs were performing a MADD operation on separate registers, you would need to read up to 15 registers (3 operands * 5 ALUs) for this computation.

Similar things could be said about writing the results of the computation back to the register file.

I am not sure how this translates into hardware density, though; I am just trying to dig a bit deeper into the comparison.
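To put rough numbers on that, here's a toy C sketch of the port arithmetic - the counts are the assumptions from this post, not measured hardware:

```c
#include <stdio.h>

/* Toy operand-read arithmetic for the two organisations above.
 * All counts are assumptions from the discussion, not die data. */
int main(void)
{
    const int operands_per_madd = 3; /* a * b + c */
    const int alus = 5;

    /* vec4 + 1: one wide vector file plus one scalar file; each needs
     * at most 3 reads per issue, and the 4 vector lanes share a read. */
    int vec4_plus_1_ports = operands_per_madd + operands_per_madd;

    /* 5-way superscalar over one scalar file: worst case, every ALU
     * operand is an independent register read. */
    int superscalar_ports = operands_per_madd * alus;

    printf("vec4+1 worst case      : %d read ports\n", vec4_plus_1_ports);
    printf("5-way scalar worst case: %d read ports\n", superscalar_ports);
    return 0;
}
```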
 
Pixel shaders can get use out of the 4th component, depending on what they're doing. Shaders have moved way beyond simply multiplying/adding colors fetched from texture, and are now doing geometric calculations (for per-pixel lighting) and tons of other random math. Also, a clever compiler can combine two vec2 operations (on texcoords, for example) into a single vec4 operation in some cases.
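For instance, here's a minimal C sketch of that vec2-packing idea - the float4 type and mad4 helper are hypothetical stand-ins for the hardware's vec4 MAD slot, not any real API:

```c
#include <stdio.h>

typedef struct { float x, y, z, w; } float4;

/* One vec4 multiply-add: the kind of op a vec4 ALU slot executes. */
static float4 mad4(float4 a, float4 b, float4 c)
{
    return (float4){ a.x * b.x + c.x, a.y * b.y + c.y,
                     a.z * b.z + c.z, a.w * b.w + c.w };
}

int main(void)
{
    /* Two independent vec2 texcoord transforms, packed so uv0 sits in
     * .xy and uv1 in .zw - one vec4 MAD replaces two vec2 MADs and no
     * lane idles. */
    float4 uv    = { 0.25f, 0.50f, 0.75f, 1.00f }; /* uv0 | uv1 */
    float4 scale = { 2.0f,  2.0f,  4.0f,  4.0f  };
    float4 bias  = { 0.1f,  0.1f,  0.2f,  0.2f  };

    float4 r = mad4(uv, scale, bias);
    printf("uv0' = (%g, %g), uv1' = (%g, %g)\n", r.x, r.y, r.z, r.w);
    return 0;
}
```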

The NV4x/G7x shaders could issue vec2+vec2, vec3+scalar, or vec4, so more flexible organizations certainly can work.

Density is a little misleading. One major advantage of going scalar is better utilization, so although vector and/or superscalar will be denser for a given *peak* throughput, scalar can deliver the same actual performance with less peak throughput. So which has better achievable perf/mm^2 isn't obvious.
 
G80 is superscalar, issuing one instruction to each of the MAD and multi-function ALUs per clock.

Jawed
 
AFAIK, each TPC cluster can process two instruction streams from the same batch, each spanning a group of eight MADDs (out of 16 in total). Four clocks per op would then make the batch size 32 pixels.

So, by that reasoning, it should be called "parallel serialization", or something of the sort. :p
 
Superscalar or VLIW? I'd be surprised if it were the former.
G8x/G9x is definitely not VLIW, unlike R6xx which is.

Here's a simple way to look at it: G80 has an 8-wide ALU and a 16-wide scheduler running at half-speed, so they're effectively at the same rate. In the VS/GS, no interpolation is required, so the batch size is 16 and one instruction can be issued per clock.

In the PS, you need interpolation and it's a distinct unit. So, one clock cycle the scheduler works on a batch of 32 pixels for the ALU, and another cycle it works on another batch of 32 pixels for the interpolator. That way, they effectively have a dual-issue pipeline even though their scheduler is only single-issue, saving hardware.
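A toy C model of that alternation, under the assumptions above (single-issue scheduler, two 32-pixel batches, even cycles to the ALU and odd cycles to the interpolator - purely illustrative):

```c
#include <stdio.h>

/* Illustrative only: a single-issue scheduler ping-ponging between
 * two pixel batches gives an effectively dual-issue pipeline. */
int main(void)
{
    for (int cycle = 0; cycle < 8; ++cycle) {
        if (cycle % 2 == 0)
            printf("cycle %d: MAD ALU      <- batch A (32 px)\n", cycle);
        else
            printf("cycle %d: interpolator <- batch B (32 px)\n", cycle);
    }
    return 0;
}
```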

As for the SFU: in the PS, it just takes the scheduling spot of an interpolation instruction. In the VS/GS, it runs *instead* of the ALU instruction, and you are effectively wasting either the ALU or the SFU. This might be a little more subtle than that, however, because some Special Function instructions need to use the ALUs for pre-processing (to put the values in range, for example), so I'm not sure what the throughput is for that in the VS.

And to maximize how much latency G8x/G9x can hide, multiple scalar instructions from the *same* thread/batch can run at the same time. All the scheduler does is check a scoreboard to see whether an instruction's data is available or not.

Obviously, for a stream of dot products or dependent instructions, that won't work any miracles, but that's not a very common case either. In the VS, where your programs might resemble that more, you don't tend to need to hide memory latency, so it doesn't really matter. And if you do, you'll probably have other instructions to hide it anyway.
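As a toy C model of that scoreboarding (the 4-clock latency and the strict single-issue are illustrative assumptions, not G80 figures):

```c
#include <stdio.h>

#define LATENCY 4 /* assumed result latency in clocks - illustrative */
#define NREGS   8

typedef struct { int dst, src0, src1; } Instr;

int main(void)
{
    int ready_at[NREGS] = { 0 }; /* clock at which each reg is readable */

    /* r4 = r0*r1 and r5 = r2*r3 are independent and issue back-to-back;
     * r6 = r4*r5 must wait on the scoreboard for both results. */
    Instr prog[] = { { 4, 0, 1 }, { 5, 2, 3 }, { 6, 4, 5 } };
    int cycle = 0;

    for (int i = 0; i < 3; ++i) {
        Instr in = prog[i];
        int earliest = ready_at[in.src0] > ready_at[in.src1]
                     ? ready_at[in.src0] : ready_at[in.src1];
        if (cycle < earliest)
            cycle = earliest;              /* scoreboard stall */
        printf("instr %d issues at clock %d\n", i, cycle);
        ready_at[in.dst] = cycle + LATENCY;
        cycle += 1;                        /* single-issue */
    }
    return 0;
}
```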
 
G8x/G9x is definitely not VLIW, unlike R6xx which is.

Here's a simple way to look at it: G80 has an 8-wide ALU and a 16-wide scheduler running at half-speed, so they're effectively at the same rate.
All that tells us is that an instruction lasts for a minimum of 2 clocks, so that the intrinsic batch size of G80 is 16 elements.

In the VS/GS, no interpolation is required, so the batch size is 16 and one instruction can be issued per clock.
VS/GS is much subtler than this. In my view most ALU instructions in these shaders are biased towards being vec4 - a vec4 MAD co-issued with an SF means that both instructions produce a completed result after 4 clocks for the element.

For pixel shaders (and CUDA) G80 uses a "convoy"

http://forum.beyond3d.com/showpost.php?p=1032788&postcount=380

of 2 batches because their ALU instructions are biased towards scalar/vec2. Without the use of a convoy, a vec2 co-issued with an SF results in the MAD ALU idling for 2 cycles.

Though some SFs run at half rate, so there is still some lost utilisation in extreme cases.

In the PS, you need interpolation and it's a distinct unit. So, one clock cycle the scheduler works on a batch of 32 pixels for the ALU, and another cycle it works on another batch of 32 pixels for the interpolator. That way, they effectively have a dual-issue pipeline even though their scheduler is only single-issue, saving hardware.
No, it interleaves two 16-element batches in PS/CUDA to get the hardware saving you're talking about.

The important point to bear in mind is that this is statically compiled. There is nothing dynamic about the relationship of a MAD in one batch being issued against an SF (or interpolation) in the other batch.

And to maximize how much latency G8x/G9x can hide, multiple scalar instructions from the *same* thread/batch can run at the same time. All the scheduler does is check a scoreboard to see whether an instruction's data is available or not.
Subject to read-after-write latency posed by the register file.

This is the big difference between G80 and R600. R600 uses in-pipeline temporary registers so that there is no read-after-write hazard. This also saves register file space as not all intermediate results in R600 need to be kept in the register file.
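A toy contrast of the two models - the latencies are made-up numbers purely to show the shape of the difference, not figures for either chip:

```c
#include <stdio.h>

/* Illustrative only: a dependent op either waits out the register
 * file's read-after-write latency, or picks its operand up from an
 * in-pipeline temporary and never touches the register file. */
int main(void)
{
    const int rf_raw_latency  = 4; /* assumed RF round trip, clocks */
    const int forward_latency = 1; /* assumed in-pipeline forward   */

    printf("dependent op via register file : issues at clock %d\n",
           rf_raw_latency);
    printf("dependent op via pipeline temp : issues at clock %d\n",
           forward_latency);
    return 0;
}
```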

Obviously, for a stream of dot products or dependent instructions, that won't work any miracles, but that's not a very common case either. In the VS, where your programs might resemble that more, you don't tend to need to hide memory latency, so it doesn't really matter. And if you do, you'll probably have other instructions to hide it anyway.
Yep, the GS/VS streaming memory access patterns mean that only a few 16-element batches need to be in flight at any time, with the payoff that dynamic branching in these shaders gains from the tighter coherency.

Jawed
 
All that tells us is that an instruction lasts for a minimum of 2 clocks, so that the intrinsic batch size of G80 is 16 elements.
It's not supposed to say anything else.

VS/GS is much subtler than this. In my view most ALU instructions in these shaders are biased towards being vec4 - a vec4 MAD co-issued with an SF means that both instructions produce a completed result after 4 clocks for the element.
There is no evidence of G80 having anything other than scalar instructions. That would needlessly complicate the scheduler.

For pixel shaders (and CUDA) G80 uses a "convoy"
The nomenclature doesn't matter. It is probably fair to say that a convoy is a batch, while a warp represents something a bit more subtle (and badly defined in the CUDA docs).

Without the use of a convoy, a vec2 co-issued with an SF results in the MAD ALU idling for 2 cycles.
You still don't understand how G80 works, do you?

Though some SFs run at half rate, so there is still some lost utilisation in extreme cases.
Some SFs run at half rate? What? Interpolation is 128/hot clock and SF is 32/hot clock, period. There are no special cases.

No, it interleaves two 16-element batches in PS/CUDA to get the hardware saving you're talking about.

The important point to bear in mind is that this is statically compiled. There is nothing dynamic about the relationship of a MAD in one batch being issued against an SF (or interpolation) in the other batch.
No, it is dynamic. A stream of several dependent ALU ops followed by dependent SF ops is going to run faster than you'd expect if it were statically scheduled.

Subject to read-after-write latency posed by the register file.
Which is effectively negligible, and even if it were as high as you claimed, it still wouldn't be a problem for what I am describing.

This is the big difference between G80 and R600. R600 uses in-pipeline temporary registers so that there is no read-after-write hazard. This also saves register file space as not all intermediate results in R600 need to be kept in the register file.
G80 does its fair share of smart things to save space in the register file, but yes, only G86 has 'temporary registers' like R600.

the GS/VS streaming memory access patterns mean that only a few 16-element batches need to be in flight at any time, with the payoff that dynamic branching in these shaders gains from the tighter coherency.
Streaming memory access patterns? What access patterns? A pre-DX10 VS won't access memory at all, except for some very-high-hit-rate constants... And I wouldn't describe DX10 VS/GS memory access patterns as 'streaming'.

Sorry for being a tad aggressive, but some of your assumptions are downright wrong. There's no point arguing based on flawed premises... Anyway, it's probably more fair to say that G8x/G9x is scalar, rather than superscalar, since the scheduler only issues one scalar instruction per clock.
 
For the sake of simplification:

[two explanatory diagrams - images no longer available]
 
G80 is not VLIW. I recall reading an interview or an article with one of the Nvidia chiefs where he stated that NV3x was VLIW, but they were disappointed with the results and dropped that approach with G80. I can't give a link though, so just treat me as an unreliable source :)
 
I have another question about the R600. The shader arrays are fed threads, which are shader routines for a block of pixels. Are these blocks of pixels exclusive from one another? For example, if one block of pixels needs to use 3 of the 5 scalar units, and another block (thread) needs to use 2 of them, can they be combined into a single instruction word to be sent to one of the shader arrays?

Also, going back to my original question, does anyone have a good idea what the difference is, transistor-wise, between the 360 GPU shaders and the R600's?
 
I have another question about the R600. The shader arrays are fed threads, which are shader routines for a block of pixels. Are these blocks of pixels exclusive from one another? For example, if one block of pixels needs to use 3 of the 5 scalar units, and another block (thread) needs to use 2 of them, can they be combined into a single instruction word to be sent to one of the shader arrays?
Yes, they're exclusive. I jumped through hoops dreaming up a way to make this work around a year ago, I think - the main problem is reading from the register file, because you can only get efficient register reads/writes if you group them together into blocks. As soon as you start merging groups of pixels "arbitrarily" you lose the organisation that makes for efficient register file access.

In theory you can use staging memory to solve this problem. In fact it seems both R600 and G80 use a staging memory to help the efficiency of register file accesses. The staging memory works by transforming the dimensional blocking problem into a timing blocking problem. e.g. instead of trying to fetch locations 2, 12, 17 and 31 in one clock for 16 pixels, you can fetch location 2 for 64 pixels, 12 for 64 pixels, etc. in advance, then fetch from the staging memory in the pattern required.
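A toy C version of that transformation - the sizes and register locations are just the ones from the example above, not anything measured:

```c
#include <stdio.h>

#define PIXELS 64
#define REGS   32

float register_file[REGS][PIXELS]; /* one row per register location  */
float staging[4][PIXELS];          /* small, flexible staging memory */

int main(void)
{
    int wanted[4] = { 2, 12, 17, 31 };

    /* Phase 1: wide, regular reads - one register location for all
     * 64 pixels per step, which a banked register file does cheaply. */
    for (int i = 0; i < 4; ++i)
        for (int p = 0; p < PIXELS; ++p)
            staging[i][p] = register_file[wanted[i]][p];

    /* Phase 2: serve whatever awkward per-pixel mix the ALUs want from
     * the staging memory - the blocking problem became a timing one. */
    float operand = staging[0][5] + staging[3][5]; /* arbitrary pattern */
    printf("%g\n", operand);
    return 0;
}
```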

The issue then is to have a large and fast enough staging memory to cope with all the awkward combinations. It ends up being like a mini, high-granularity register file - extremely expensive per unit of memory, but flexible.

So the staging memory we see in R600 and G80 is quite small and not flexible enough to perform the kind of operations you're thinking of.

Jawed
 
The vector approach can leave some elements unused, and the compiler needs to be more complicated to optimize for it. A SIMD approach can help, but for code with many branches and conditionals (I'm thinking about raytracing with packed rays) you would need to play around a lot with masks.
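For instance, a toy masked-SIMD version of `if (x > 0) x = x*2; else x = -x;` across four packed lanes (illustrative only):

```c
#include <stdio.h>

#define LANES 4

int main(void)
{
    float x[LANES]  = { 1.0f, -2.0f, 3.0f, -4.0f };
    int mask[LANES];

    for (int i = 0; i < LANES; ++i)
        mask[i] = x[i] > 0.0f;            /* per-lane predicate */

    for (int i = 0; i < LANES; ++i) {
        float then_val = x[i] * 2.0f;     /* both sides are computed... */
        float else_val = -x[i];
        x[i] = mask[i] ? then_val         /* ...and the mask selects    */
                       : else_val;        /* each lane's result         */
    }

    for (int i = 0; i < LANES; ++i)
        printf("%g ", x[i]);
    printf("\n");
    return 0;
}
```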

Well, it's hard to say, but in theory I would bet on the scalar approach.
 