Change of insight

Frank · Apr 8, 2007

Isn't it interesting, that after the FX/R300 debacle, both companies have more or less embraced the technical modus operandi of the other?

R300 showed us that a single quad pipeline (that held more or less texture units and shaders) was inefficient, by introducing multiple, independent ALUs. nVidia has embraced that idea wholeheartedly, and has even gone as far as breaking the vector units and quads down into individual, scalar ALUs, while ATi/AMD seems to have gone in the opposite direction: multiple, strict quad and vector pipelines (albeit unified).

PeterAce · Apr 8, 2007

In the same vein....

It's the number of schedulers/sequencers/arbiters (and their relative cost in die size) that will be interesting to compare on G80 vs R600.

stepz · Apr 9, 2007

The AMD/ATI and NVidia approaches are not all that different. NVidia is not using individual scalar ALU's, despite what their marketing department is trying to spin. Both companies are using the same basic SIMD vector units (although of different width and count). The main differences are how shader threads are combined into the vectors and the scheduling capability.

silent_guy · Apr 9, 2007

stepz said:
The AMD/ATI and NVidia approaches are not all that different. NVidia is not using individual scalar ALU's, despite what their marketing department is trying to spin. Both companies are using the same basic SIMD vector units (although of different width and count). The main differences are how shader threads are combined into the vectors and the scheduling capability.

If one implementation has a vector ALU and 1 program counter/thread for 4 ALU's, while the other does vector arithmetic by doing 3 or 4 MAD's sequentially and has 1 program counter for 1 ALU, I see no reason why the latter can't reasonably be called scalar. Marketing or no marketing...

trinibwoy · Apr 9, 2007

stepz said:
The main differences are how shader threads are combined into the vectors and the scheduling capability.

So given those advantages you find it dishonest to market it as such? It seems to me that your argument revolves around physical characteristics which aren't particularly relevant.

Frank · Apr 9, 2007

PeterAce said:
In the same vein....

It's the number of schedulers/sequencers/arbiters (and their relative cost in die size) that will be interesting to compare on G80 vs R600.

Or, put another way: does it turn out to be more efficient to run a strict workload (much less scheduling needed, but with a loss of efficiency when branching or when outside a triangle), or is it better to run each pixel (or quad, depending on how you look at it) more or less independently, albeit with a larger granularity, due to scheduling and storage constraints?

stepz · Apr 9, 2007

silent_guy said:
If one implementation has a vector ALU and 1 program counter/thread for 4 ALU's, while the other does vector arithmetic by doing 3 or 4 MAD's sequentially and has 1 program counter for 1 ALU, I see no reason why the latter can't reasonably be called scalar. Marketing or no marketing...

SIMD means 1 program counter per vector. That's why the G80 has branching granularity. If all fragments in the batch branch the same way, then there is no point in having multiple PC's because they inevitably run in lockstep.
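To make the branching-granularity point concrete, here is a minimal Python sketch. It assumes a simplified model that is not taken from any vendor documentation: one program counter per batch, every side of an if/else is walked if at least one fragment in the batch needs it, and masked-off fragments still occupy their slot. The function names and the cycle costs are made up for illustration.

```python
def batch_cost(taken, cost_if, cost_else):
    """Fragment-slots a lockstep batch spends on one if/else.

    taken: one bool per fragment in the batch (True = 'if' side).
    With a single program counter the whole batch walks every side
    that at least one of its fragments needs.
    """
    cycles = 0
    if any(taken):
        cycles += cost_if
    if not all(taken):
        cycles += cost_else
    # Every fragment in the batch occupies a slot for all of those cycles.
    return cycles * len(taken)


def total_cost(fragment_branches, batch_size, cost_if, cost_else):
    """Split the fragments into batches and sum the per-batch cost."""
    total = 0
    for start in range(0, len(fragment_branches), batch_size):
        batch = fragment_branches[start:start + batch_size]
        total += batch_cost(batch, cost_if, cost_else)
    return total


# 64 fragments; only the first 16 take the expensive 'if' side.
fragments = [i < 16 for i in range(64)]
for size in (16, 32, 64):
    print(size, total_cost(fragments, size, cost_if=40, cost_else=4))
```

With these assumed costs the totals are 832, 1536 and 2816 fragment-slots for batch sizes of 16, 32 and 64: the larger the batch, the more fragments get dragged through a path they did not need.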

edit:

trinibwoy said:
So given those advantages you find it dishonest to market it as such? It seems to me that your argument revolves around physical characteristics which aren't particularly relevant.

Would you find it dishonest to market a 2.66GHz quad core processor as a 10.66GHz processor? Or claiming that C2D has 8 floating point units, thanks to SSSE3 shuffling capability? I don't get the point of doing a technical comparison of architectures on arbitrarily defined terms. We might as well discuss whether it is better to increase the amount of flazongs in the zoodad or instead decrease the thungbar wamosity.

Razor1 · Apr 9, 2007

You can't really use CPUs as an analogy; they don't break down work like the G80 would, no matter what kind of vector operation it needs to do.

stepz · Apr 10, 2007

Well the G80 also doesn't break down work, that's the job of a compiler, but point taken - architectural enhancements are a whole lot more important in the GPU world. Mostly thanks to explicitly parallel programming paradigm and JIT compiling.

Anyway, to get closer to the topic at hand, does someone better versed in the art of GPU architecture know what kind of additions to the scheduler are necessary to enable single-component-per-fragment ALU's? That is, why wasn't the G80 approach done earlier? Other than more granular predicate masking I can't imagine why the ALU's would give a damn if the data comes from a larger amount of fragments. In the scheduler there are probably some savings to be made by using fixed fragment allocation and having less per-fragment data in flight, but I don't quite see what they are. I suppose I could figure it out if I really wanted, but I'm too lazy, so I'm hoping that someone will enlighten me.

Razor1 · Apr 10, 2007

True, the driver does the breakdown.

I think it would need a ton more cache to do it on a per-fragment basis like it's done in the G80. Wasn't there some hint of cache sizes being quite large for the G80?

Scheduler-wise, I think you would need a dispatcher robust enough to keep all the threads in flight and properly feed the ALUs. The more threads, the more complex it gets and the more silicon it needs, and of course you need a place to store the info, which is where the cache comes in.

This is general, and I'm not the right person to give a definite answer, but I just wanted to take a crack at it.

trinibwoy · Apr 10, 2007

stepz said:
Would you find it dishonest to market a 2.66GHz quad core processor as a 10.66GHz processor?

Of course I would but that's a pretty poor analogy - describing G80 as scalar isn't artificially inflating its performance. You seem to be saying that G80 processes vectors of elements just like architectures past, which is true. But I don't see what better term there is besides scalar to describe how individual elements are processed by the GPU. All primitives are processed one scalar component at a time and I think that's the message here. I don't think there was any effort to hide the fact that the architecture was still SIMD.

silent_guy · Apr 10, 2007

stepz said:
SIMD means 1 program counter per vector. That's why the G80 has branching granularity. If all fragments in the batch branch the same way, then there is no point in having multiple PC's because they inevitably run in lockstep.

I wasn't talking about SIMD or non-SIMD, which is orthogonal to the concept of scalar/non-scalar. But let's rephrase it without using the word 'program counter': I consider something scalar if its instruction set doesn't have instructions that can issue a vector multiply/add in one go. Fair enough?

As you said later on, I would assume that thread handling is conceptually the same irrespective of them being scalar or not.
And I've been wondering the same thing: why did it take so long to go scalar? It's easy to come up with cases where a scalar architecture is inherently more efficient than a vector one, though they're probably not very common in real GPU life.
Given the same amount of ALUs, the scalar architecture will have more threads, so more overhead because of that? But that argument still holds true now. Programs will require more instructions, so cache memory must be larger?
Maybe the increased emphasis on GPGPU, where not everything can necessarily be modeled in Vec3 or Vec4 operations, played a role?
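One such case can be sketched in a few lines of Python. This assumes a deliberately crude model: the same four ALU lanes either form one vec4 unit that issues one instruction per cycle for one thread (no co-issue or dual-issue), or four scalar units that each run their own thread one component per cycle. The instruction mix is invented for illustration.

```python
def vector_unit_cycles(op_widths, threads, width=4):
    """Cycles for one vec-`width` ALU to run every thread in turn.

    Each instruction costs one cycle per `width` components, however
    few of the lanes it actually uses.
    """
    per_thread = sum(-(-w // width) for w in op_widths)  # ceiling division
    return per_thread * threads


def scalar_units_cycles(op_widths, threads, units=4):
    """Cycles for `units` scalar ALUs, each running one thread at a time."""
    per_thread = sum(op_widths)        # one cycle per component
    rounds = -(-threads // units)      # ceiling division
    return per_thread * rounds


mixed = [3, 1, 4, 2, 1]   # component widths of a hypothetical shader's ops
full = [4, 4, 4]          # a shader made of pure vec4 ops

print(vector_unit_cycles(mixed, threads=4), scalar_units_cycles(mixed, threads=4))
print(vector_unit_cycles(full, threads=4), scalar_units_cycles(full, threads=4))
```

Under these assumptions the mixed shader takes 20 cycles on the vec4 unit and 11 on the scalar units, while the pure vec4 shader takes 12 on both, so the scalar layout only wins when the code is not all full-width vector math.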

PeterAce · Apr 10, 2007

ATi decided that highly threaded designs made sense with the 'SM3.0 inflection point'; both designs (C1/R500/Xenos and R5x0) obviously use a similar threading concept. In fact, it's been mentioned by a certain AMD employee (who used to be head frying-pan waver here at B3D) that there were deep divisions within ATi over whether R520 should have been from the unified-shader branch of design!

http://forum.beyond3d.com/showpost.php?p=746536&postcount=16

Highly threaded architectures (personally I hate marketing terms like Ultra-threaded, Super-threaded and Giga-threaded) make sense when performant dynamic branching is a requirement.

NV4X (including the G7x expansion of the design) had poor dynamic branching performance (because of its huge batch/thread sizes).

Nvidia decided that if they have to go highly threaded with the 'SM4.0 inflection point', they might as well go scalar at the same time.

As information on R600's ALU structure (vector or scalar?) is not confirmed yet, I still have hope that R600 will be 'Unified Shader v2' and therefore be scalar.

Jawed · Apr 10, 2007

Until SM3, you could argue that "pixel shading" was sufficiently texturing-bound (algorithmically) that it didn't make sense to process textures sequentially (per channel, scalar) rather than as vectors. Memory fetches are essentially vector operations. Similarly ROPs are naturally vector operations.

In both TMUs and ROPs, there are certain data formats, e.g. single-channel 32-bit integer/floating-point that align with the kind of hardware available in the TMUs/ROPs. There are also multi-channel versions, i.e. 64-bit or 128-bit texels/pixels, which tend to be treated sequentially (scalar). Though G80 treats fp16 texels in vector fashion (i.e. 64-bit fetching/filtering per clock).

Additionally, TMUs/ROPs also use sequential operations to perform more advanced operations, e.g. 16xAF or 4xAA. Again, G80 shifts things around a bit, ratcheting up the single-cycle capability of its unit. i.e. G80's fixed function units are tending towards "wider vector" whilst at the same time its arithmetic units are sequential.

With TMUs becoming fully-decoupled from the ALUs, the general restriction on the width of pixel shading math disappeared. It was then possible to process texels as vectors (or scalars) while doing arithmetic sequentially (scalar).

Obviously, ATI had its partially decoupled TMUs from R300 onwards, but that has not been about the math/texturing sequencing (scalar versus vector), merely the ratio between them (in code) and simplification of caching (no L2 for texels and distributed L1 for the ROPs) and the use of a tiled-rasteriser.

Jawed

stepz · Apr 10, 2007

silent_guy said:
I consider something scalar if its instruction set doesn't have instructions that can issue a vector multiply/add in one go. Fair enough?

I'm fairly certain that the G80 ALU can only execute vector instructions.

That is of course vectors of 8 fragments. I fully understand what you mean by scalar. I'm nitpicking on the notion of scalar ALU's. The ALU's are still vector ALU's, the scalarity is fully the function of distributing the fragment threads in the rasterizer and gathering them in the ROPs.

Anyway, I just had a slight spark of inspiration. Now the G80 architecture is starting to click together for me (hopefully correctly). By generating scalar threads you effectively quadruple your batch size, unless you compensate by either increasing the scheduling rate (faster, hotter, bigger schedulers needed to maintain clock rate) or decreasing the vector width (more schedulers needed to maintain total width). Given the comments about the G80 shader processor consisting of 8-wide vectors and the double-pumped ALU's, it seems that both have been done to get effectively the same batch size as R520. So a G80 shader processor should then contain 2 schedulers scheduling 8-wide vector instructions once every 2 core clocks or 4 ALU clocks, compared to a single scheduler scheduling 16-way vector instructions every 4 core clocks. At least I can imagine the way such a scheme would work in hardware; hopefully the reality is something similar. It would be nice if someone knew what kinds of limits there are on concurrently live threads in one shader processor, what kinds of state changes trigger shader processor flushes, etc.
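The batch-size trade-off can be written out as arithmetic. The sketch below is only a back-of-the-envelope model: `batch_size` is a made-up helper, and the lane counts and clock figures are illustrative placeholders, not confirmed numbers for G80 or R520.

```python
def batch_size(lanes, components_per_fragment, clocks_per_instruction):
    """Fragments that share one scheduled instruction.

    lanes: ALU lanes fed by one scheduler.
    components_per_fragment: lanes one fragment occupies at once
        (4 for a vec4 organisation, 1 for a scalar one).
    clocks_per_instruction: ALU clocks over which one instruction is issued.
    """
    fragments_per_clock = lanes // components_per_fragment
    return fragments_per_clock * clocks_per_instruction


# Hypothetical baseline: 16 lanes used as 4 vec4 units, one issue per 4 clocks.
vec4_baseline = batch_size(16, 4, 4)
# Same lanes and schedule, but each lane now holds its own fragment.
scalar_same_hw = batch_size(16, 1, 4)
# Compensation 1: halve the width per scheduler (so twice the schedulers).
scalar_narrow = batch_size(8, 1, 4)
# Compensation 2: also schedule twice as often.
scalar_narrow_fast = batch_size(8, 1, 2)

print(vec4_baseline, scalar_same_hw, scalar_narrow, scalar_narrow_fast)
```

With these placeholder figures the batch goes from 16 fragments to 64 when going scalar, back to 32 with half the width, and back to 16 when the issue rate is doubled as well.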

Davros · Apr 10, 2007

"scalar ALU's / vector ALU's"

Can someone explain those terms please ?
tnx in advance..

silent_guy · Apr 11, 2007

stepz said:
I'm fairly certain that the G80 ALU can only execute vector instructions. That is of course vectors of 8 fragments. I fully understand what you mean by scalar. I'm nitpicking on the notion of scalar ALU's.

Ok, so you're making a distinction between a scalar theoretical machine and a scalar ALU. The first is a concept that really explains the philosophy of the machine; the second is related to the way one or more threads are handled in parallel. I think this is highly confusing: it's hard to communicate effectively if one first has to negotiate an alternative meaning to established terminology. A bit of a waste of time, no?

The ALU's are still vector ALU's, the scalarity is fully the function of distributing the fragment threads in the rasterizer and gathering them in the ROPs.

Yes, I see what you meant. But nobody would ever call this a vector ALU. SIMD is the right word for this, as it automatically implies parallel units executing the same instruction.

stepz · Apr 11, 2007

I think the problem is that my terminology comes from my unfinished degree in VLSI and microprocessor design. This differs from the terminology established in this forum. For me it has so far been SIMD ALU = vector ALU, but I'll try to speak the same language.

Simon F · Apr 11, 2007

Frank said:
nVidia has embraced that idea wholeheartedly, and even gone as far as breaking the vector units and quads down into individual, scalar ALUs.

Isn't that a bit more like 3DLabs' approach from a few years back?

Pete · Apr 14, 2007

Davros said:
"scalar ALU's / vector ALU's"

Can someone explain those terms please ?
tnx in advance..

This post finally made the light bulb go off in my head (or at least glow brighter). It seems G80's scalar organization of its ALUs (regardless of their physical grouping) means a single ALU is scheduled to process a single vector's multiple dimensions in series (over multiple clock cycles; e.g., n cycles for a MAD op on an n-dimensional vector), whereas the vector grouping of previous GPU generations' ALUs (and possibly R600) meant a group of ALUs (e.g., "5D," or 4D + scalar) work on a single vector in parallel (though that "+ scalar" is still a mystery to me as to whether it simultaneously works on another vector or something else; time to re-read some B3D reviews).
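For anyone who prefers to see it laid out, here is a toy Python sketch of the two orderings. It is an illustration only: four lanes, four fragments, and two invented helper names; real hardware batches, co-issue and the "+ scalar" unit are ignored.

```python
def vector_schedule(fragments, components="xyzw"):
    """Vector grouping: each cycle, the lanes take all components of ONE fragment."""
    return [[(f, c) for c in components] for f in fragments]


def scalar_schedule(fragments, components="xyzw"):
    """Scalar organization: each cycle, every lane takes ONE component
    of its own fragment, and the components follow one another in time."""
    return [[(f, c) for f in fragments] for c in components]


frags = [0, 1, 2, 3]

# A 3-component op (e.g. on an xyz vector) over four fragments.
for cycle, lanes in enumerate(vector_schedule(frags, "xyz")):
    print("vector cycle", cycle, lanes)   # 4 cycles, 3 of 4 lanes busy
for cycle, lanes in enumerate(scalar_schedule(frags, "xyz")):
    print("scalar cycle", cycle, lanes)   # 3 cycles, all 4 lanes busy
```

Each row is one clock cycle and each entry is a (fragment, component) pair handled by one lane. For a 3-component op the vector ordering needs four cycles with one lane idle in each, while the scalar ordering needs three cycles with every lane busy.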

The discussion in this thread is very helpful in understanding how GPUs process data now. Thanks, guys. Well, assuming I'm not bungling things. I'm still not sure how I needed to be reminded of this after B3D's G80 pieces.
