ExtremeTech NVIDIA & ATI Q&A

Kirk's answer doesn't surprise me, as NVIDIA has recently put a great deal of importance on reducing the number of idle units (that's already quite visible in the NV40 compared to their older architectures, imo).
I'm personally suspecting they plan on using some of their MediaQ IP and engineers to make a "shutdown" system for the idle units, but that's just some insane speculation on my part, really. Still, the result would be a design with excellent performance/watt, something NVIDIA has had problems with in the past, and it'd be quite beneficial if the PS3 chip's technology - which shares the NV50 design ideals - were indeed used in other, more power-starved devices.

The way ATI shares the PS and VS workload is by "giving" whole pipelines, last I heard. As Kirk says, this is likely to result in idle units, mostly for texturing. If this is indeed how ATI is doing things, it seems like a very inefficient approach, and I fully agree with Kirk. The elegant solution is to share UNITS, not pipelines. But then, that's quite a way off; although if ATI manages acceptable results with their approach in the R500, I'd *assume* NVIDIA would try to share arithmetic units between the VS and PS in the NV60.
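To put some toy numbers on the pipeline-vs-unit distinction (everything below is invented for illustration and isn't how either vendor's hardware actually works):

```python
# Toy utilisation model: assume each "pipeline" has one ALU and one
# texture unit; vertex work only needs the ALU, pixel work needs both.

PIPES = 16
vertex_pipes = 4   # whole pipelines handed over to vertex shading

# Pipeline-granularity sharing: the texture units in those 4 pipelines
# have nothing to do while their ALUs run vertex work.
tex_utilisation_pipe_sharing = (PIPES - vertex_pipes) / PIPES   # 0.75

# Unit-granularity sharing: only the ALUs get reassigned, so all 16
# texture units can keep feeding the remaining pixel work.
tex_utilisation_unit_sharing = PIPES / PIPES                    # 1.0

print(tex_utilisation_pipe_sharing, tex_utilisation_unit_sharing)
```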


Uttar
 
The way ATI shares the PS and VS workload is by "giving" whole pipelines, last I heard. As Kirk says, this is likely to result in idle units, mostly for texturing.

Might want to wait until there's some actual discussion / information on how it's achieved.

As Kirk says, this is likely to result in idle units, mostly for texturing.

The flip side to this is, of course, that the vertex shader units require access to texture samples due to vertex texturing, which will mean either larger vertex shader units with individual access to their own texture samplers, as they have at the moment, or some method of passing texture reads back from the PS without killing performance.

I'm wondering if part of this line of thinking may also be due to the development work they are currently doing with Sony - perhaps the Cell CPU is the element that does the vertex processing under that system.
 
My crackpot idea:

Instead of having (for example) 8 vertex shaders and 24 pixel shaders, have 32 unified shaders. The twist, though, is that 8 are optimized for vertex work and 24 are optimized for pixel work. I'm no hardware designer, but it seems to me they could do this and keep compatibility. This way, if you need all pixel shading (e.g. post-processing), you'll get more than 24 pipes' worth of power. It won't be 32 pipes' worth, but it's better than having the vertex shaders sitting around counting their bellybutton hair. And in the general case, you don't lose performance, because your 24 pixel-optimized shaders can keep up with 24 dedicated pixel shaders.

In short, if the shaders are so similar, use them. No one said they had to be great at everything you make them do.

'Course, this does presume a more intelligent scheduling system and that ATi and nVidia can actually do it, but it might work.
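To put rough numbers on the idea (all invented, just to show the shape of the argument):

```python
# Back-of-the-envelope sketch of the asymmetric-unified idea above.
# Assume a pixel-optimised unit runs pixel work at rate 1.0, and a
# vertex-optimised unit can still run pixel work, but only at 0.6 of
# that rate. Both figures are made up.

PIXEL_OPT_UNITS = 24
VERTEX_OPT_UNITS = 8
CROSS_RATE = 0.6   # assumed efficiency of a vertex-optimised unit on pixel work

# All-pixel workload (e.g. a post-processing pass): everything pitches in.
unified_rate = PIXEL_OPT_UNITS * 1.0 + VERTEX_OPT_UNITS * CROSS_RATE
dedicated_rate = 24.0   # the 8 dedicated vertex shaders would just sit idle

print(unified_rate)    # 28.8 "pipes' worth" of pixel shading
print(dedicated_rate)  # 24.0

# In a mixed workload the 24 pixel-optimised units still behave like 24
# dedicated pixel pipes, so the common case shouldn't get any slower.
```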
 
While we're on crackpot ideas:

Let's say there are 16 pixel shader pipelines and 8 pixel shader pipelines that can also do vertex shading.

So you could have 6 quads of full pixel shading (only), or a combination of 5 quads of pixel shading + 1 "quad" of vertex shading, or 4+2 - providing the right proportions of pixel and vertex shading power for current (and near-future) games.

So this is "hybrid" in the sense that it's a hybrid between "dedicated pipelines" of R420 and "unified shaders" of R600 where the transistor budget isn't totally blown on building a vast array of vertex shader power (i.e. where all pipelines are unified) which current games have no use for.

Jawed
 
I think you're going in the wrong direction. I believe there'll be no more specialized pixel or vertex shaders, but some kind of super-shaders which will be quite different from what we know now. It's bound to be much more effective than current architectures, regardless of any potential clockspeed increase. :?:
 
Jawed said:
While we're on crackpot ideas:

Let's say there are 16 pixel shader pipelines and 8 pixel shader pipelines that can also do vertex shading.

So you could have 6 quads of full pixel shading (only), or a combination of 5 quads of pixel shading + 1 "quad" of vertex shading, or 4+2 - providing the right proportions of pixel and vertex shading power for current (and near-future) games.

So this is "hybrid" in the sense that it's a hybrid between the "dedicated pipelines" of R420 and the "unified shaders" of R600, where the transistor budget isn't totally blown on building a vast array of vertex shader power (i.e. making all pipelines unified) that current games have no use for.

Jawed

This is exactly what I suggested in another thread. R520 must involve either a degree of "soft"-hackery (slow) or a blend of R4x0 and R500/R600 in order to support SM3.0. It doesn't make sense for them to develop a ground-up architecture with R600 on the way, and rumour has it that the shader core of R520 is essentially unchanged from R420. Besides, we heard (IIRC) that ATi would be placing a great emphasis on branching in this generation and that would be at odds with having lacklustre performance for VS3.0 ops.
 
MuFu said:
This is exactly what I suggested in another thread. R520 must involve either a degree of "soft"-hackery (slow) or a blend of R4x0 and R500/R600 in order to support SM3.0. It doesn't make sense for them to develop a ground-up architecture with R600 on the way, and rumour has it that the shader core of R520 is essentially unchanged from R420. Besides, we heard (IIRC) that ATi would be placing a great emphasis on branching in this generation and that would be at odds with having lacklustre performance for VS3.0 ops.

I have to say I'm more interested in how to make dynamic branching a win than in the load balancing properties of a hybrid or unified shader architecture.

Apart from the dynamic-branch overhead (6 cycles?) that NV40 suffers when it takes a dynamic branch, there's the issue of a quad of pixels in which some pixels follow the "true" branch while the remaining pixels follow the "false" branch. The net effect is a complete loss of any speed-up from executing fewer instructions, because all pixels in the quad are sent through both branches.
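To make that concrete, here's a toy model of the per-quad cost (the branch lengths are made up):

```python
# Toy model of per-quad divergence: if any pixel in a 2x2 quad takes a
# different branch from its neighbours, the whole quad pays for both
# sides of the if/else.

IF_COST, ELSE_COST = 20, 4   # made-up instruction counts of the two sides

def quad_cost(takes_if):
    """takes_if: four booleans, one per pixel in the quad."""
    if all(takes_if):
        return IF_COST
    if not any(takes_if):
        return ELSE_COST
    return IF_COST + ELSE_COST   # divergent quad executes both sides

print(quad_cost([True] * 4))                  # 20 - coherent, branch pays off
print(quad_cost([False] * 4))                 #  4 - coherent, big win
print(quad_cost([True, False, True, True]))   # 24 - divergence eats the whole gain
```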

The only way to solve this issue seems to be to rid the GPU architecture of quads.

It seems to me that the talk of generalised ALUs (unified shaders) may provide an opportunity to break out of the current architectural constraint (for dynamic branching) of pixel shading in quads. With the loss of quads I think there would be a major conflict with ATI's Z-specific optimisations (erm, I'm not too sure) so ATI might be caught out in that respect.

Also, since dynamic branching is a feature of pixel and vertex shading languages, and presuming that the logic (transistor count) for this will be quite costly, it could be argued that unification is a win.

What kind of branch optimisation is practical in a GPU? Is it possible to make the cost of dynamic branching negligible, or are we simply waiting for a time when raw power is so high that the branching cost in a typical shader falls into the 1-5% range?
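Playing with the numbers on that last point, using only the ~6-cycle figure mentioned above:

```python
# How long does a shader have to be before a fixed ~6-cycle branch
# overhead alone drops into the 1-5% range? Purely illustrative.

BRANCH_OVERHEAD = 6   # cycles per dynamic branch

for shader_cycles in (30, 60, 120, 300, 600):
    print(f"{shader_cycles:4d}-cycle shader: "
          f"{BRANCH_OVERHEAD / shader_cycles:.1%} overhead per branch")

# Roughly: one branch per 120-600 cycles of real work is what it takes
# for the fixed overhead to land between 5% and 1%.
```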

Jawed
 
Would there be any kind of pressure to do the exact opposite of a unified architecture? Since Nvidia is working with Sony on the PS3, could Nvidia obtain any kind of tech that would allow two separate cores, where one would be a vertex core and the other a pixel core?

I'm not sure whether this would be more efficient or less. Just curious if it would even be feasible.
 
geo said:
Could that be right? He supports it with some observations on HDR display tech penetrating the mass market in the near term.

My guess is that affordable HDR displays in the next year or two is not realistic. But then again, he says "more affordable", whatever that means.
 
No one has questioned the logic behind the statement yet?
It's not clear to me that an architecture for a good, efficient, and fast vertex shader is the same as the architecture for a good and fast pixel shader.
Umm. VS needs efficient vector dot products, MULs, ADDs and LERPs.
PS needs ... efficient vector dot products, MULs, ADDs and LERPs.

No fundamental difference really. Just slight differences:
1) DP4 and DPH are far less useful in a PS than in a VS (in the PS you primarily want DP3).
2) NRM is less useful in a VS than in a PS. If you need to normalize, you'd want to do it per fragment for best quality.
3) 3/1 splits aren't really necessary in a VS, because 3-vectors aren't that useful for geometry (4-vectors are what you want). 2/2 splits actually are useful in a VS, because the majority of "real" texcoords are 2D now and for the foreseeable future.
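To illustrate what each workload actually spends its dot products on (plain Python standing in for the shader maths, nothing vendor-specific):

```python
# Vertex work leans on DP4, pixel work leans on DP3 - same ALU family,
# different widths. Numbers are arbitrary.

def dp4(a, b):
    return sum(x * y for x, y in zip(a[:4], b[:4]))

def dp3(a, b):
    return sum(x * y for x, y in zip(a[:3], b[:3]))

# Vertex shader territory: position transform = four DP4s against the
# rows of a 4x4 matrix, so you want DP4 (and 4-vectors) to be fast.
mvp = [[1, 0, 0, 2],
       [0, 1, 0, 3],
       [0, 0, 1, 4],
       [0, 0, 0, 1]]
position = [1.0, 2.0, 3.0, 1.0]
clip_pos = [dp4(row, position) for row in mvp]

# Pixel shader territory: an N.L diffuse term = one DP3 of 3-vectors
# (plus a normalize, which is where NRM earns its keep).
normal = [0.0, 0.0, 1.0]
light_dir = [0.0, 0.6, 0.8]
diffuse = max(dp3(normal, light_dir), 0.0)

print(clip_pos, diffuse)   # [3.0, 5.0, 7.0, 1.0] 0.8
```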
A pixel shader would need far, far more texture math performance and read bandwidth than an optimized vertex shader.
"Read bandwidth" doesn't make sense here if it doesn't refer to texture read bandwidth.
Texture math? What's that supposed to be? Texture coordinate math for lookups is the samplers' job as far as I'm concerned. Using the texture sample is ALU territory. There is no such concern as "texture math" WRT unified shader architectures.
Duh. Sharing texture units in a unified shader scenario is difficult. We knew that already. But there's really no relation to the shader ALU architecture itself. IMO he's just fabricating straw-man arguments.
So, if you used that pixel shader to do vertex shading, most of the hardware would be idle, most of the time.
Plain wrong. A dedicated NRM_pp unit would sit idle in a VS role, but no one forces you to build one, and even if you have it, that's just a small part, and not "most of the hardware".
Which is better—a lean and mean optimized vertex shader and a lean and mean optimized pixel shader or two less-efficient hybrid shaders?
Leading questions are rarely the technically right questions.
An NV40 fragment shading pipe could certainly work reasonably well in a vertex shader role.
 
Considering Microsoft designs the API and they chose ATI for the next Xbox, doesn't that suggest that they prefer ATI's method of having unified shaders?
 