G80 to have unified shader architecture??!?!

DemoCoder said:
Since graphics is "embarrassingly parallel" and the workloads fit the profile of TLP rather than ILP, that suggests to me that the ideal graphics architecture maximizes concurrent threads, rather than trying to use ILP to increase instruction throughput within a thread. Bulking up each pipeline with more and more execution units will invariably lead to idle units, since the workloads won't always allow optimal scheduling/packing of multiple instructions into a multiple dispatch.
Ah, but what if increasing ILP is cheaper than adding full parallelism? In this case, there's some optimal amount of ILP, which will be architecture-dependent.

Edit: Oh, and this isn't exactly ILP, since the NV4x and G70 have two units in serial, not parallel (with the additional possibility of parallel instructions within each unit, though).

Particularly with the G70, though, the two units are quite similar, and thus can be kept active without worrying quite so much about which order the instructions come in.
 
What does it mean to have two units in "serial" in this regard? Either each clock cycle the two units can execute independent instructions, or they can't. If they can execute two non-dependent instructions at a throughput of 2/cycle, then the units are executing in parallel, regardless of how the block diagram shows them. "Serial" to me suggests a dependence, that the second unit depends on input from the first; otherwise, if there is true independence, then it is parallel. I don't see how calling the NV40 units "serial" creates any meaningful distinction that somehow washes away the difficulty of scheduling instructions optimally and keeping units busy. If you have two ALUs, then you want concurrent execution, and that implies parallelism.
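
To make the throughput point concrete, here's a minimal sketch (my own toy model, not any vendor's actual issue logic) of greedy in-order dual issue: the second unit only fills in a given clock when the next instruction doesn't read the first instruction's destination. The instruction format and register names are invented for illustration.

```python
# Toy dual-issue model: instruction = (dest, src_a, src_b), registers invented.
def cycles_to_issue(instrs):
    """Greedy in-order pairing: co-issue the next instruction on the second
    unit only if it does not read the first instruction's destination."""
    cycles, i = 0, 0
    while i < len(instrs):
        dest = instrs[i][0]
        if i + 1 < len(instrs) and dest not in instrs[i + 1][1:]:
            i += 2          # both units busy this clock
        else:
            i += 1          # dependence: second unit idles this clock
        cycles += 1
    return cycles

independent = [("r0", "r4", "r5"), ("r1", "r6", "r7"),
               ("r2", "r8", "r9"), ("r3", "r10", "r11")]
dependent   = [("r0", "r4", "r5"), ("r1", "r0", "r6"),
               ("r2", "r1", "r7"), ("r3", "r2", "r8")]

print(cycles_to_issue(independent))  # 2 cycles: effective throughput 2/clock
print(cycles_to_issue(dependent))    # 4 cycles: a dependent chain serialises
```

Either way the hardware has two units; whether they pay off depends entirely on whether independent instructions can be found to feed them.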

Secondly, I think the industry has already realized that TLP is cheaper than ILP. That's why server chips are moving in that direction. The only reason desktop CPUs haven't is that server workloads are typically parallelizable, whereas desktop workloads aren't.

For multidispatch to work, the compiler and HW must be clever about the order in which instructions are processed. The design is a lot more complex compared to TLP. NVidia relies on having a sophisticated "driver compiler" to assist with maximizing shader throughput, much like the Itanium was handicapped by compilers. We see how brilliantly (or not) this works in practice, since the compiler still handicaps the true performance of the GPU, which leads to shader replacement techniques.

ILP/OoOE has the advantage that it works better with existing serialized code, but on GPUs, where by definition you can break the problem up into thousands of threads executing the same chunks of code automatically, TLP seems the logical course of action.

At this stage, ILP is not prohibitive because the software (driver) performs the work, but I don't think this will scale up much beyond 2-3 units per pipe.
 
Rockster said:
Doesn't the inefficiency that Xenos is designed to tackle relate more to handling the different latencies associated with each instruction than to ILP efficiency? (i.e. sure, you can get peak-performance single-cycle MADs if your shader just runs MAD after MAD with no dependencies, but real shaders use a mix of different instructions, each with different latencies, that create pipeline bubbles in the ALUs. Xenos attempts to fill those gaps by constantly switching different threads into them.)

Agreed, a lot of Xenos's design focus is on hiding latency.

But current GPUs are also designed to hide latency. The primary sources of latency are texture operations and instruction dependencies (the latter only in the sense that they destroy ILP). Xenos appears to hide both types entirely.

But the long pipeline/large batch of a current GPU means that, as long as triangles are above a certain size, texturing latency is entirely hidden too.
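
A rough way to see why batch depth hides fetch latency (my own simplification, with made-up figures rather than any real GPU's numbers): if the pipe walks a batch of B fragments one per clock, a fragment's texture result only needs to be back by the time the walker returns to that fragment.

```python
# Toy model: texture latency is hidden whenever the batch is at least as deep
# as the fetch latency; otherwise each fetch inserts a bubble.
def stall_cycles(batch_size, tex_latency):
    # The walker revisits a fragment batch_size clocks after issuing its fetch.
    return max(0, tex_latency - batch_size)

for batch in (256, 64, 16):
    print(batch, stall_cycles(batch, tex_latency=128))
# 256 -> 0 stall cycles, 64 -> 64, 16 -> 112 (figures purely illustrative)
```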

Obviously the problem comes when triangles get smaller. G70 addresses this problem by making each quad of fragment pipes independent from the others, so that small triangles can only impact one quad at a time, instead of the whole set of quads as in NV40. (Though I suspect the chances are that if one quad is handling a small triangle, the other quads are, too.)

Xenos's design is geared towards much smaller batches, it seems, i.e. much smaller triangles. The batch size would appear to be 16, but that's not known for sure as far as I can tell. It simply means that Xenos has to handle more concurrent batches of fragments than traditional GPU designs, hence the massive increase in batch-scheduling complexity. Xenos ameliorates that complexity by using 4 pools of pipelines: 3x16 for instructions and 1x16 for texturing.
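
Here's a hedged sketch of that scheduling idea as I read it (the 3 ALU arrays match the speculation above, but the fetch latency, the number of batches in flight and the 1-in-4 fetch rate are all assumed figures, not confirmed Xenos details): batches rotate onto the ALU arrays, and a batch waiting on a texture fetch simply drops out of the rotation until its data returns.

```python
import random
from collections import deque

ALU_ARRAYS = 3          # assumed 3 ALU pools, per the speculation above
TEX_LATENCY = 16        # assumed fetch latency, purely illustrative
random.seed(0)

ready = deque(range(16))  # 16 fragment batches in flight (arbitrary number)
waiting = {}              # batch -> cycle its texture fetch completes
busy_slots = 0

for cycle in range(64):
    # Batches whose fetches have returned become runnable again.
    for b in [b for b, t in waiting.items() if cycle >= t]:
        ready.append(b); del waiting[b]
    # Each ALU array grabs the next runnable batch this cycle.
    for _ in range(min(ALU_ARRAYS, len(ready))):
        b = ready.popleft()
        busy_slots += 1
        if random.random() < 0.25:          # pretend 1 in 4 ops is a fetch
            waiting[b] = cycle + TEX_LATENCY
        else:
            ready.append(b)                 # keeps running on the ALUs

# Utilisation stays high when enough batches are in flight to cover the latency.
print("ALU slot utilisation:", busy_slots / (64 * ALU_ARRAYS))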

Aside from that, I'm interested in discussing how low ILP (but not non-existent: Xenos can perform co-issues) seems optimal for a high-TLP design, since it minimises dependent-instruction latency on top of the gains made in handling smaller-triangle texturing latency.

Jawed
 
Surely weighing up TLP v. ILP depends on what overhead (in terms of extra transistors) is incurred in duplicating pipelines?

I mean, say you have TMUs bound to pipelines (old hat, I know, but bear with me); then, depending on what your shaders are doing, simply cloning that configuration to create yet more under-used TMUs might not give you as much extra bang for your buck as spending the transistors on adding more non-texturing compute capability to the pipelines you've already got.

Saying that TLP is better than ILP because that's the way CPUs are going is a bit disingenuous -- CPUs are already ILP and have been for more than a decade now. Going TLP is a better way to spend transistor budgets *right now* than yet more ILP, but if CPUs weren't already ILP, that's the way the industry would be going, because it gets you more performance for easy-to-write (i.e. single-threaded) code on the desktop.
 
TLP doesn't require duplicating whole pipelines. Simply cut-and-pasting pipelines is a poor man's notion of TLP. TLP simply requires more context "in flight", plus a thread scheduler to parcel out work to non-busy units.
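
A minimal sketch of that claim (the names and field sizes are mine, purely illustrative): the only per-thread cost is saved state, while the execution units themselves are shared, with a scheduler handing runnable threads to whichever units are free.

```python
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    """The per-thread cost under TLP: registers and a program counter,
    not a copy of the execution pipeline."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0.0] * 32)
    blocked_until: int = 0      # e.g. waiting on a texture result

def schedule(contexts, free_units, cycle):
    """Parcel runnable threads out to whichever units are not busy."""
    runnable = [t for t in contexts if t.blocked_until <= cycle]
    return list(zip(free_units, runnable))   # (unit, thread) pairs this clock
```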

As for the CPU analogy, I already explained why it holds for servers but not desktops. Desktops have to have good performance on both single-threaded code and multithreaded code. GPUs, on the other hand, are parallel by design, ridiculously parallel in fact; the whole shader programming model forces parallelism on the developer. So there is no need for good "single thread" performance on a GPU, since a single thread would correspond to rendering a single pixel, and with the exception of esoteric GPGPU algorithms, no one runs extremely long shaders on single-pixel renders.

ILP only makes sense on the GPU if the shader workload consistently consists of instruction streams with ILP of 2 or higher. You've got the TMU example backwards. Having 2 TMUs bound per pipe is like having two shader units per pipe: it is ILP, not TLP. Thus, if your workload doesn't consist of multitexture, but primarily single texture, then one of your TMUs is sitting idle.
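
A back-of-envelope illustration of that point (the 70/30 single- vs multi-texture mix is an invented workload figure): a second TMU bound to the pipe only earns its keep on multitexture fragments, whereas a TMU fed from a pool of threads can always find another fragment's fetch to do.

```python
# Invented workload mix, purely for illustration.
single_tex = 0.7      # fraction of fragments fetching one texture
multi_tex  = 0.3      # fraction of fragments fetching two textures

# ILP-style: 2 TMUs bound to one pipe; the second only works on multitexture.
fetches_per_clock = 1 * single_tex + 2 * multi_tex
ilp_tmu_utilisation = fetches_per_clock / 2        # 2 TMUs provisioned

# TLP-style: 1 TMU per pipe, but always another fragment (thread) to feed it.
tlp_tmu_utilisation = 1.0

print(ilp_tmu_utilisation, tlp_tmu_utilisation)    # 0.65 vs 1.0
```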

Look, I'm not saying the NV4x is strictly an ILP design. All GPUs are highly threaded already. I just think that the current NV4x architecture depends too much on the compiler to take advantage of its ALUs, such that performance is somewhat non-deterministic. Because of the way compilers work (and shader replacement makes it worse), a developer changing a single line of code can cause degenerate performance, which is harder to do on an architecture like Xenos.
 
nutball said:
Surely weighing up TLP v. ILP depends on what overhead (in terms of extra transistors) is incurred in duplicating pipelines?

I think that's fair. I think that's why ATI wanted to wait until 90nm technology was available in order to get the level of integration required (transistor count).

At the same time we have the new PowerVR SGX unified scalable shader engine (USSE) which appears to have a minimum of a single pipe in its architecture :oops: :!: so somehow I don't think IMG was waiting for 90nm.

http://www.beyond3d.com/forum/viewtopic.php?t=25386

Jawed
 
At the same time we have the new PowerVR SGX unified scalable shader engine (USSE) which appears to have a minimum of a single pipe in its architecture so somehow I don't think IMG was waiting for 90nm.

How many "pipes" do you expect to see for low end PDA/mobile devices in the future. Before seing also what that "pipe" consists of, that "single pipe" might be also somewhat misleading. The optional VGP for MBX was already a 4-way SIMD VS1.1.

**edit: a bit more...

http://www.powervr.com/Products/Graphics/SGX/Index.asp#

And yes for that particular market IMG was definitely waiting for 90nm.
 
Ailuros said:
How many "pipes" do you expect to see for low end PDA/mobile devices in the future.

I've got no idea :!:

The resolution of the screens is so low, it's a segment of 3D that I just haven't thought about at all.

Also, before seeing what that "pipe" consists of, the "single pipe" figure might be somewhat misleading. The optional VGP for MBX was already a 4-way SIMD VS1.1.

:oops: Sadly goes right over my head.


Nice find...

And yes for that particular market IMG was definitely waiting for 90nm.

Why do you say that? The smallest is 2mm squared on a 90nm process. I don't think 90nm was required to get a unified core - rather I think 90nm is just desirable from a power/space perspective for the market.

The smallest SGX device would seem to be in the region of 7-8m transistors, just a rough guess...

Jawed
 
Someone will be screaming soon that we're OT (and he'll be right) but I'll take my chances:

A single unified "pipe" would be most likely for the lowest model, incorporating both geometry (HOS) and shading capabilities.

Why do you say that? The smallest is 2mm squared on a 90nm process. I don't think 90nm was required to get a unified core - rather I think 90nm is just desirable from a power/space perspective for the market.

The smallest SGX device would seem to be in the region of 7-8m transistors, just a rough guess...

Because MBX started out at 180nm and is only just now (see TI's plans) scaling down to 90nm. If and when the gate count ever gets announced, you can then compare the two architectures and see why 90nm might have been necessary for SGX: not only to fit all the gates into the specific process, but also to be able to scale further to smaller processes in the future (65nm and beyond), mostly for higher clock speeds.

It's an ultra-small die; what exactly would you expect it to be, especially considering how critical power consumption is in that market?

***edit:

Sadly goes right over my head.

MBX has one "traditional pipeline", whereby partners can also license an optional vertex geometry processor, which is a 4-way SIMD VS1.1 capable of 4 floating-point operations per clock. You are looking at a core with ~DX6.0 capabilities plus an additional DX8.1 VS/T&L unit. Geometry processing on SGX obviously comes from within the core this time, but with the slight difference that this one goes beyond SM3.0. What's the rough ballpark in terms of complexity going from >DX6.0 to >PS3.0, and from DX8.1 to >VS3.0?

I don't think 90nm was required to get a unified core

Depends on exactly what they've squeezed in, doesn't it?
 
Ailuros said:
It's an ultra-small die; what exactly would you expect it to be, especially considering how critical power consumption is in that market?

I simply find it interesting that whatever the transistor-count overheads in terms of scheduling in Xenos, say, here's a product at the other extreme of the market which can withstand these overheads despite being on a tiny scale.

Jawed
 
DemoCoder said:
What does it mean to have two units in "serial" in this regard? Either each clock cycle the two units can execute independent instructions, or they can't. If they can execute two non-dependent instructions at a throughput of 2/cycle, then the units are executing in parallel, regardless of how the block diagram shows them. "Serial" to me suggests a dependence, that the second unit depends on input from the first; otherwise, if there is true independence, then it is parallel. I don't see how calling the NV40 units "serial" creates any meaningful distinction that somehow washes away the difficulty of scheduling instructions optimally and keeping units busy. If you have two ALUs, then you want concurrent execution, and that implies parallelism.
From what I understand, they are in serial in the sense that register file bandwidth is limited, and if there is a dependence between the two instructions, you are more likely to be using fewer registers, and thus the second unit is more likely to be kept busy.
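
A back-of-envelope sketch of that register-bandwidth argument (the port budget and operand counts are assumptions for illustration, not known NV4x/G70 figures): when the second unit consumes the first unit's forwarded result, it needs one fewer operand from the register file, so a dependent pair can fit a read budget that an independent pair would exceed.

```python
READ_PORTS = 4   # assumed register file reads available per clock (illustrative)

def reads_needed(first_srcs, second_srcs, forwarded):
    """Register file reads for a co-issued pair; 'forwarded' operands come
    straight from the first unit's output rather than the register file."""
    return len(first_srcs) + len(second_srcs - forwarded)

# Unit 1 runs a MAD (3 source operands), unit 2 runs a MUL (2 source operands).
indep_pair = reads_needed({"r4", "r5", "r6"}, {"r7", "r8"}, set())     # 5 reads
dep_pair   = reads_needed({"r4", "r5", "r6"}, {"r0", "r8"}, {"r0"})    # 4 reads

print(indep_pair <= READ_PORTS, dep_pair <= READ_PORTS)   # False True
```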
 
Jawed said:
Ailuros said:
It's an ultra-small die; what exactly would you expect it to be, especially considering how critical power consumption is in that market?

I simply find it interesting that whatever the transistor-count overheads in terms of scheduling in Xenos, say, here's a product at the other extreme of the market which can withstand these overheads despite being on a tiny scale.

Jawed

Well, I'd suppose that ATI is also planning a unified shader core for that very same market; before 90nm such a featureset would have been all but impossible for any of the IHVs, IMO.

I don't recall what manufacturing process they used for Imageon, but it falls short in a lot of departments when it comes to 3D capabilities.
 