R520 and G70 32 pipes? G70 dual core with turbo cache?

Skinner said:
Well, nv did something with the GF6800Go: with 12 pipes it almost keeps up with a GF6800U. Could be a sign of things to come as far as efficiency is concerned with the G70/80...

Don't mix up GPU efficiency improvements and reviewer's incompetence.
 
GeForce Go 6800 Ultra: 5.4 GP/s, 33.9 GB/s
GeForce 6800 GT: 5.6 GP/s, 32 GB/s
GeForce 6800 Ultra: 6.4 GP/s, 35.2 GB/s

The Go 6800U being in line with a 6800 GT is no surprise; in fact, it's exactly what you would expect from those numbers.
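Just to spell the arithmetic out (the clocks here are back-solved from the quoted fillrates; the snippet is nothing more than pipes × clock and bus width × memory clock):

Code:
# Back-of-envelope check: pixel fillrate = pipelines * core clock,
# bandwidth = effective memory clock * bus width / 8.
# Clock/pipe figures are back-solved from the numbers quoted above.

cards = {
    #                 pipes, core MHz, eff. mem MHz, bus bits
    "Go 6800 Ultra": (12,    450,      1060,         256),
    "6800 GT":       (16,    350,      1000,         256),
    "6800 Ultra":    (16,    400,      1100,         256),
}

for name, (pipes, core, mem, bus) in cards.items():
    fillrate = pipes * core / 1000.0           # GP/s
    bandwidth = mem * (bus / 8) / 1000.0       # GB/s
    print(f"{name:14s} {fillrate:.1f} GP/s  {bandwidth:.1f} GB/s")

Twelve pipes at 450 MHz land essentially at the GT's fillrate, so keeping up with a GT says nothing about extra per-pipe efficiency.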
 
trinibwoy said:
Well if Nvidia could do something about the AA performance hit that could be substantial too.
IIRC the relative drop with 4xAA (without AF) is smaller on NV40 than on R420: NV40 isn't as bandwidth-limited (better memory controller and less raw fillrate to feed). Once AF is added, the scene becomes fillrate-limited, and R420 has roughly 33% more fillrate (because of its higher clock rate).
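To put the two regimes in concrete (if entirely invented) terms: frame time is roughly whichever of the fill cost or the bandwidth cost is larger. 4xAA mostly inflates the bandwidth term, AF mostly inflates the fill term, and the fill term is where R420's clock advantage shows up. The workload numbers below are made up; only which term dominates is meant to be illustrative.

Code:
# Toy bottleneck model: frame time ~ max(fill cost, bandwidth cost).
# Peak rates are the 6800 Ultra figures quoted earlier in the thread;
# the per-frame workload numbers are invented.

FILLRATE  = 6400.0    # MP/s
BANDWIDTH = 35200.0   # MB/s

def bottleneck(fill_mp, traffic_mb):
    fill_ms = fill_mp / FILLRATE * 1000.0
    bw_ms   = traffic_mb / BANDWIDTH * 1000.0
    limit = "fillrate" if fill_ms > bw_ms else "bandwidth"
    return limit, max(fill_ms, bw_ms)

for label, fill_mp, traffic_mb in [
    ("no AA, no AF", 6.0,   30.0),
    ("4xAA        ", 6.0,  100.0),   # MSAA mainly multiplies framebuffer traffic
    ("4xAA + 8xAF ", 20.0, 100.0),   # AF mainly multiplies texture/fill work
]:
    limit, ms = bottleneck(fill_mp, traffic_mb)
    print(f"{label}  {limit}-limited  {ms:.1f} ms/frame")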
 
Dual Core???? - This is the Way to deal with 3D-Technology!!!!!
-------------------------------------------------------------------------------------
6800@16x1,6vp@370/810 :D :D
 
Dual-core on a single die makes no sense to me if it just means an artificial split between everything that is already duplicated (ROPs, texture units, shader units).

However, having two copies of everything else (besides the memory controller and PCI interface) could make some sense.

Code:
Vertex Fetch 1                     Rasterizer + quad pool 1
              \ _ Vertex Shaders _/                         \_ Pixel Shaders - ROPs
              /                   \                         /
Vertex Fetch 2                     Rasterizer + quad pool 2


I guess what I'm thinking of would be closer to 2-thread SMT than "dual-core".

If multiple applications are using the GPU simultaneously (under Longhorn say), selective duplication could very well reduce the penalty of full GPU context switches.
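Very roughly, the saving I have in mind looks like this (all times invented, in microseconds; the only point is that the second context's reload can hide behind the first one's drain):

Code:
# Crude cost model of a full GPU context switch vs. a duplicated front end.
# All timings are invented; only the overlap argument matters.

DRAIN      = 50    # let context A's in-flight work finish
STATE_LOAD = 30    # reload state / refill front-end buffers for context B
WORK_SLICE = 200   # useful rendering per time slice

def single_front_end(slices, switches):
    # Every switch stalls the shader core for the full drain + reload.
    return slices * WORK_SLICE + switches * (DRAIN + STATE_LOAD)

def duplicated_front_end(slices, switches):
    # Context B's front end was filled while A was still rendering, so
    # (ideally) none of the switch cost is exposed to the shader core.
    return slices * WORK_SLICE

print(single_front_end(slices=20, switches=20))      # 4000 + 1600
print(duplicated_front_end(slices=20, switches=20))  # 4000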
 
Dual core only makes sense in the context of improving yields. You can't think of dual core in a GPU the same way as with a CPU; GPUs are already multicore in that sense.

Going true multicore can only decrease performance (or do nothing) in a GPU.
 
No they aren't. Show me a GPU that can render UT and Doom3 simultaneously, and not by multitasking (preemptive or otherwise).

AFAICS almost everything in current GPUs is based on data parallelism and pipelining. This is not the same as control parallelism - so when I say SMT in reference to a GPU, I mean a GPU capable of handling separate command streams from completely different applications in parallel.
 
Dual screens might appreciate dual cores, no? Though it seems like a huge waste of transistors and energy to add another core or card just to make my web browser 3D while I game on another screen. Maybe games themselves could take advantage.

Still, we're probably talking big expense for minimal market.
 
psurge said:
No they aren't. Show me a GPU that can render UT and Doom3 simultaneously, and not by multitasking (preemptive or otherwise).

Show me a gamer who can play UT and Doom3 simultaneously!

AFAICS almost everything in current GPUs is based on data parallelism and pipelining. This is not the same as control parallelism - so when I say SMT in reference to a GPU, I mean a GPU capable of handling separate command streams from completely different applications in parallel.

What you say is probably correct, but is it relevant? Why would such functionality be necessary? GPUs can already support multiple applications apparently simultaneously, in the same way that single-core CPUs apparently multi-task. Why would genuine *multi-application* parallelism in GPUs be necessary? (That's a genuine question, I'm not being facetious!)
 
nutball - :D I admit that was a terrible example. Also - I'm not at all sure that something like this is necessary.

Here's a better use case I was thinking about - the Longhorn 3D GUI:
- you have one high-priority "context/thread" for the application in focus (which is guaranteed rendering resources)
- background applications are assigned time slices on the second context as necessary.

A single application could of course use both contexts - more or less telling the hardware that two command streams are independent (e.g. both correspond to opaque triangles resolved via the z-buffer, or non-overlapping transparent triangles, or different render targets). The single application might be able to get better utilization of functional units and/or hide more latency this way.
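A minimal sketch of the scheduling policy I mean (everything here - class and app names included - is hypothetical, nothing Longhorn-specific):

Code:
from collections import deque

class TwoContextScheduler:
    """Hypothetical sketch: one guaranteed context for the app in focus,
    one time-sliced context shared round-robin by background apps."""

    def __init__(self, focused_app, background_apps):
        self.focused = focused_app
        self.background = deque(background_apps)

    def next_submissions(self):
        """Return (context_id, app) pairs for the next scheduling period."""
        work = [(0, self.focused)]          # context 0: always the focused app
        if self.background:                 # context 1: round-robin the rest
            app = self.background.popleft()
            self.background.append(app)
            work.append((1, app))
        return work

sched = TwoContextScheduler("Doom3", ["mail client", "media player"])
for _ in range(3):
    print(sched.next_submissions())

A single application that knows two of its command streams are independent could likewise just submit itself as both the focused and a background client.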

Anyway, that's my 2c.
 
psurge said:
No they aren't. Show me a GPU that can render UT and Doom3 simultaneously, and not by multitasking (preemptive or otherwise).

AFAICS almost everything in current GPUs is based on data parallelism and pipelining. This is not the same as control parallelism - so when I say SMT in reference to a GPU, I mean a GPU capable of handling separate command streams from completely different applications in parallel.

Longhorn-targeted GPUs are meant to have support for multiple command buffers. They will have hardware support for multiplexing multiple command threads into the actual rendering. Whether this is done via hardware SMP (i.e. these N quads rendering Doom3 and those N quads rendering UE3) or by task switching (much more likely for a single-GPU system IMO, better for cache coherence) is irrelevant to the OS.

To the front-end OS it will be as if there were N GPUs connected to the system; the GPU itself will do all the sharing.
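A rough way to picture that front end, whichever of the two the hardware actually does underneath (every name in this sketch is made up):

Code:
import itertools

class VirtualizedGpu:
    """Sketch of the front end: each app sees its own command queue,
    and the GPU (or its scheduler) decides how the queues interleave."""

    def __init__(self):
        self.queues = {}   # app name -> list of submitted command buffers

    def submit(self, app, command_buffer):
        self.queues.setdefault(app, []).append(command_buffer)

    def drain(self):
        # Round-robin multiplex. To the OS it looks like N independent
        # GPUs, whether the hardware really interleaves quads from both
        # apps or simply task-switches between them.
        for batch in itertools.zip_longest(*self.queues.values()):
            for cmd in batch:
                if cmd is not None:
                    yield cmd

gpu = VirtualizedGpu()
gpu.submit("Doom3", "draw pass 1")
gpu.submit("Doom3", "draw pass 2")
gpu.submit("UE3 app", "draw pass 1")
print(list(gpu.drain()))   # ['draw pass 1', 'draw pass 1', 'draw pass 2']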
 
Rendering multiple 3D applications is not a problem. A "multi-core" GPU wouldn't even be more efficient at it, since the most efficient way of doing it would be to never render for more than one application at any one time. That is to say, you'd render the Doom3 scene, copy it to the desktop's back buffer, then render the UT scene and copy it to the back buffer, then swap. You'd set up a simple short queue to determine which frame to render next.
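In pseudo-code, that scheme is just the following (render_frame/blit/swap are stand-ins, not any real API):

Code:
from collections import deque

class App:
    def __init__(self, name):
        self.name = name
    def render_frame(self):
        return f"<frame from {self.name}>"   # stand-in for a full GPU render

class Desktop:
    def __init__(self):
        self.back_buffer = []
    def blit(self, frame):
        self.back_buffer.append(frame)       # copy into the desktop back buffer
    def swap(self):
        print("present:", self.back_buffer)

def compose(apps, desktop):
    # Never render for more than one application at a time: drain a short
    # queue, one whole frame each, then present the composited desktop.
    queue = deque(apps)
    while queue:
        desktop.blit(queue.popleft().render_frame())
    desktop.swap()

compose([App("Doom3"), App("UT")], Desktop())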
 
I'm talking about a system (see the diagram above) in which the shader units are shared across applications, but some of the caches and FIFOs, the vertex-fetch engine, the rasterizer and the command queues are duplicated, meaning that while the state for context 1 is draining, various buffers for context 2 can be filled. If you constrain the shader units to working on just two programs at a time, cache coherency wouldn't be affected much (assuming you don't swap between them on a cycle-by-cycle basis).

A "multi-core" GPU wouldn't even be more efficient at it, since the most efficient way of doing it would be to never render for more than one application at any one time.

I'm not convinced that this is true. I know you've made this statement many times, but what is your justification?

In the Doom3 + UT example being discussed, why would a scheme in which one application is prioritized and the second is used to fill pipeline bubbles be slower than your solution?

[edit] DeanoC - is this kind of thing part of WGF 1.0 already?
 
It's simple: memory coherency. Additionally, since 3D graphics are by nature separated into discrete frames, it becomes very easy to parallelize multiple programs at the frame level.

The state changes incurred in sharing work between different programs wouldn't allow pipeline bubbles to be filled, as you say.
 
Obviously such a system would require either duplication of units (thus no state change overhead), or units which can have 2 "states" in flight simultaneously.
 
psurge said:
Obviously such a system would require either duplication of units (thus no state change overhead), or units which can have 2 "states" in flight simultaneously.
More states in flight requires more cache, though, leading back to the same coherency problem. In an architecture like the NV4x, for example, the two states would have to share the limited number of registers.
 
Well, the NV4x works on batches of quads, so the quad pools would either have to be duplicated (register wastage, as you say), or a single pool would have to be able to contain quads from different programs. Basically the pixel shader control unit would have to be able to deal with quads that have different program counters, and dispatch from two programs (or program locations) simultaneously. This sounds like something you have to do anyway for efficient branching, and to avoid stalling the pixel shader when you just don't have hundreds of pixels to run the same shader on. It seems to me that this becomes more and more important as the number of execution units grows and larger and larger quad batches are needed for a given latency tolerance.
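A toy version of what I mean by a shared pool (the data structures are invented for illustration only):

Code:
from dataclasses import dataclass, field
from typing import List

@dataclass
class Quad:
    program: str     # which shader program this 2x2 block belongs to
    pc: int          # its current program counter
    waiting: bool    # e.g. blocked on a texture fetch

@dataclass
class QuadPool:
    """Toy model of a single quad pool holding quads from two different
    shader programs, instead of duplicating the pool per context."""
    quads: List[Quad] = field(default_factory=list)

    def dispatch(self, width=4):
        # Pick up to `width` quads that are ready to run, regardless of
        # which program they come from - the same machinery you'd want
        # for divergent branching within one program.
        ready = [q for q in self.quads if not q.waiting]
        return ready[:width]

pool = QuadPool([
    Quad("doom3_wall_shader", pc=12, waiting=False),
    Quad("doom3_wall_shader", pc=12, waiting=True),
    Quad("ut_sky_shader",     pc=3,  waiting=False),
])
for q in pool.dispatch():
    print(q.program, q.pc)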

As for the vertex shaders, Nvidia claims they are MIMD already.

So I guess that leaves texture, z, stencil, and render target coherency/bandwidth. I have no experimental or simulated basis for commenting on these, so I'll accept as plausible the claim that these factors would negate the benefits outlined above.
 
psurge said:
Well, the NV4x works on batches of quads, so the quad pools would either have to be duplicated (register wastage, as you say), or a single pool would have to be able to contain quads from different programs. Basically the pixel shader control unit would have to be able to deal with quads that have different program counters, and dispatch from two programs (or program locations) simultaneously. This sounds like something you have to do anyway for efficient branching, and to avoid stalling the pixel shader when you just don't have hundreds of pixels to run the same shader on.
Sure, but what if each program wants to do branching?

Edit:
Anyway, the #1 thing to keep in mind here is that one frame of a 3D program is effectively like a massively-multithreaded application already. There's no way that adding another program into the mix can possibly improve efficiency. So it makes much more sense to just run one frame at a time, so as not to have to deal with coherency issues.
 