How about a little R700 speculation?

I wonder if that means AMD's on-die GPU will need to have less pixel shading performance than it would have otherwise.

A lot of pixel data could be streamed out to the satellite GPUs without clogging CPU caches, and the crossbar on AMD's die won't need to be widened as much to accommodate a full GPU's pixel traffic.

If they are counting on satellite GPUs to handle the brunt of the post-transform work, then the Fusion GPU could focus more on scheduling.

Where would geometry shading fit in this? It sounds like it could be very helpful to have that close to the CPU.

Since the shader portions are highly parallel, and since they could conceivably get their own sockets (which might allow better memory bandwidth scaling), they could possibly be engineered to tolerate the additional latency.
 
What do you mean by satellite GPUs?
 
The extra GPUs that work alongside the on-CPU GPU. I didn't have any other name for them, and it's kind of awkward using all those GPU, CPU-GPU, GPU/CPU abbreviations in the same sentence.
 
Eh? You've just described a non-unified multi-chip traditional GPU. What's the point of that?

Jawed

At a very high level, yes, but look at what each chip is doing: the master chip gives maximum priority to VS work, and in the best case you don't fully saturate it with VS work, leaving it free to do some FS work as well. A non-unified chip would only have its 4 or 6 VSs to work with while the rest of the chip sits idle. A unified one can have all of its shaders go full steam processing VS work, and then farm out FS work to the other chips, with all of them using all of their shaders for FS work.

A unified shader multi-chip GPU would still allow relatively effective load balancing.

- Being VS bottlenecked would probably be the worst-case scenario, but it would still be better than any alternative I can think of; if, after all, you've fully saturated one unified chip with VS work, I'm not sure other configurations would fare well either.

- Being FS bottlenecked would work well, because you can still use the idle shaders on the master chip to do FS in addition to the shaders on all the other chips. (So perhaps the term "dedicated to VS" wasn't accurate; it just gives VS the highest priority.)

So you still get many of the benefits of load balancing, but it's a way of avoiding the duplicate work you'd get from having each chip do the same VSing.

And I apologize if I'm not very clear; I'm running on only a few hours of sleep. I'll see about drawing up a diagram when I get a chance to better illustrate what I mean.
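In the meantime, here's a rough pseudocode sketch of the priority scheme I'm describing (purely illustrative; the chip and queue objects are made up, not any real driver interface):

    # Toy model: the master (unified) chip gives VS work top priority, fills any
    # leftover capacity with FS work, and the satellite chips spend all of their
    # shaders on the FS work farmed out to them.

    class Chip:
        def __init__(self, name, shader_slots):
            self.name = name
            self.shader_slots = shader_slots
            self.issued = []

        def issue(self, batch):
            self.issued.append(batch)

    def schedule_pass(master, satellites, vs_queue, fs_queue):
        # Master chip: VS gets the highest priority...
        free = master.shader_slots
        while vs_queue and free:
            master.issue(vs_queue.pop(0))
            free -= 1
        # ...and any leftover slots pick up FS work.
        while fs_queue and free:
            master.issue(fs_queue.pop(0))
            free -= 1
        # Satellite chips run nothing but FS work.
        for chip in satellites:
            free = chip.shader_slots
            while fs_queue and free:
                chip.issue(fs_queue.pop(0))
                free -= 1

    # Example: 6 VS batches and 20 FS batches, 8 unified shader slots per chip.
    master = Chip("master", 8)
    satellites = [Chip("sat0", 8), Chip("sat1", 8)]
    schedule_pass(master, satellites,
                  [f"VS{i}" for i in range(6)], [f"FS{i}" for i in range(20)])
    # master ends up with 6 VS + 2 FS batches; each satellite gets 8 FS batches.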
 
Maybe it's a little bit like this in spirit, except that sections of the network used for load balancing are actually off-chip HT links, and that almost all of the work-types are handled by the same kind of unit (unified shader). I imagine that a single GPU core will be small enough to not cannibalize too much room on a CPU die (making Fusion possible), so that the high-end products are both multi-core and multi-chip (just like current CPU scaling strategy).
 
With these plans for DX10.1, it's not hard to conclude that there is something to all this talk about a paradigm shift in the case of R700.

 
If you consider a non-unified architecture, where the VS/GS units are embedded in the CPU and the GPU becomes merely a cluster of chips containing PS units and ROPs, then a "software secret sauce" could work, because SMP CPUs could effectively generate the geometry and transform it to screen space before distributing it to different GPU cores.
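Very roughly, the distribution step might look like this (a toy sketch with made-up names; a real scheme would also have to handle triangles straddling chip boundaries, which this ignores):

    # CPU threads run VS/GS and produce screen-space triangles; each triangle is
    # then handed to the PS/ROP chip that owns its region of the screen (here,
    # simple vertical strips).

    NUM_PS_CHIPS = 4
    SCREEN_WIDTH = 1600

    def owning_chip(tri):
        # tri = [(x0, y0), (x1, y1), (x2, y2)] in screen space; pick the chip by
        # the leftmost vertex.
        strip = SCREEN_WIDTH // NUM_PS_CHIPS
        leftmost_x = min(x for x, _ in tri)
        return min(int(leftmost_x) // strip, NUM_PS_CHIPS - 1)

    def distribute(transformed_tris):
        buckets = [[] for _ in range(NUM_PS_CHIPS)]
        for tri in transformed_tris:
            buckets[owning_chip(tri)].append(tri)
        return buckets      # one work list per PS/ROP chip

    tris = [[(10, 5), (50, 40), (30, 80)], [(900, 10), (950, 60), (880, 90)]]
    print([len(b) for b in distribute(tris)])   # -> [1, 0, 1, 0]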

The problem is with render targets, since you'd have to re-gather the framebuffer back to the CPU, as well as what to do about streamout when you have N streams that need to be merged.

I think until process technology hits a wall, multi-GPU-chip approaches are a losing architecture except at very high end enthusiast levels.
 
Google with :

site:download.microsoft.com winhec 2006 wddm

And, while I'm posting, it seems to me that once GPUs have virtual memory and paging engines, moving pages of data twixt system memory and local memory is no different than moving data between GPU cores.

Jawed
 
Well, except that when moving to system memory, you'll have contention and coherency issues and the need to invalidate caches. I prefer on-chip approaches to SLI-on-steroids. The scheme has one advantage going for it, which is a segmented memory bus and the ability to scale up memory bandwidth in the presence of a pinout wall. The cost issues, well, I don't think it has such a big advantage there. Single chips are subject to Moore's law economics (yields vs. density); multichip setups have a more complicated cost function. It may not necessarily reduce manufacturing costs, but it reduces R&D costs by removing the need to develop, test, and verify a whole range of GPU products. You have one core instead of a half-dozen. It's an attempt to replace on-chip aggregation of pipes with software and board-level integration, and hopefully have favorable economics.

It may work, but I prefer maximizing benefits from Moore's law first, and then scaling horizontally when you can't push it any further. Pack as much parallelism on chip as is possible for a given process node and at the largest chip size you can economically manufacture, and work within power/heat tolerances. Then figure out how to glue multiple copies of this chip together. However, if the approach is instead to make a bunch of small low-end G72/G73-style chips and bundle them together, I don't like it. I think there is more to gain by clustering best-of-breed performance than by clustering median performance. Economically, the latter may make more sense, but from an absolute performance standpoint, I prefer the former.
 
As far as I can see, the memory bus breaking through the pinout wall is the only aspect of this that's compelling.

It may turn out to be an intermediate solution before some radical new memory architecture becomes common (or chips genuinely become 3D with optical buses).

Jawed
 
GPUs will move to finer grained cross-die parallelism over the next few years. The precise details will undoubtedly vary from vendor to vendor but the trend will definitely be there. The reasons are primarily market driven.

The current screen slicing-dicing approaches are too application/graphics specific. As GPUs become more general purpose and more CPU like (the other trend), the need to share that load efficiently and generally across multiple dies regardless of the particular applications becomes more critical.

While the architectural and engineering obstacles are significant, especially the bandwidth and latency issues, they will be met.

Eventually the cross-die sharing will most likely be at the thread level, much like they are with CPUs, with threads dispatched to processor modules across dies much as they are now within a die. This will allow complete transparency to the applications, with close to 100% utilization across processors. In essence, adding another card or die on a card will be no different to an application than using a larger/faster chip with double the ALUs. Gone will be all the application incompatibilities, application specific limitations, vendor tweaks, unaccounted poor performance, etc. currently associated with multiple chips.
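A toy illustration of that kind of die-agnostic dispatch (purely hypothetical, just to show that the dispatcher needn't care which die a processor module lives on):

    # Threads go to the least-loaded processor module, whether it sits on the
    # same die or another one, so two dies look to software like one bigger chip.
    import heapq

    def dispatch(threads, num_dies, modules_per_die):
        # heap entries: (current load, die index, module index)
        heap = [(0, d, m) for d in range(num_dies) for m in range(modules_per_die)]
        heapq.heapify(heap)
        placement = {}
        for t in threads:
            load, die, module = heapq.heappop(heap)
            placement[t] = (die, module)
            heapq.heappush(heap, (load + 1, die, module))
        return placement

    # 64 threads over 2 dies x 8 modules -> 4 threads per module, dies treated alike.
    placement = dispatch(range(64), num_dies=2, modules_per_die=8)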

Another trend will be toward double precision floating point. Currently there is little demand for this, since the GPGPU market is in its infancy. As GPUs become the de facto floating point engines on PCs and engineering-scientific workstations (and they will), the market will require that double precision be available as an option, just as it is with current CPUs. My guess is the GPU manufacturers will follow the same approach as their CPU counterparts and make double use of some of the registers/logic, with the number of double precision ALUs being half the single precision count, running at the same frequency. ATI/AMD are in an excellent position to leverage their SSE-X experience in the GPU arena.
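Put in numbers, half-rate simply means peak DP throughput is half of peak SP at the same clock (the figures below are made up purely for illustration):

    # Hypothetical chip: 128 SP ALUs at 1.5 GHz, each doing a multiply-add (2 flops).
    sp_alus = 128
    clock_ghz = 1.5
    flops_per_alu_per_clock = 2
    peak_sp_gflops = sp_alus * clock_ghz * flops_per_alu_per_clock          # 384
    peak_dp_gflops = (sp_alus // 2) * clock_ghz * flops_per_alu_per_clock   # 192, i.e. half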

The third trend will be toward much higher core frequencies in the floating point ALUs, than has been traditional for GPUs. This is also driven by market dynamics. NVidia has already started in this direction. ATI/AMD are sure to follow suit, given their combined expertise in this area as well as the competitive environment. How soon this happens will be interesting to see.

Whichever vendor meets these requirements first will have a good shot at capturing the lion's share of the future floating point engine market, not only in the entertainment-desktop market but across the entire engineering-scientific-supercomputer market as well.
 
Same SA that used to post here?
 
Another important trend will be the integration of GPUs with standard compilers. For GPUs to become the high performance floating point engine for general purpose computing they must integrate seamlessly with existing languages and compilers. Currently GPUs use proprietary interfaces and hardware models for general purpose computing. In the long run, the industry will need to define an open industry standard abstract machine that compilers can compile to regardless of the model or vendor. Ideally, it would abstract away the actual number of physical dies and ALUs, and appear as a set of n floating point resources that process threads. They will also need to eventually handle the additional details of standard floating point representations.
 
The same. My previous password no longer works, so I registered again.

There is nothing wrong with that account that I can see. If you can validate you're SA, we can reset the password, since from what I see you must have just forgotten it. You can send me an email at geo@beyond3d.com with the registration email addy you used for the SA account, and the typical domain it posted from. . . and if that matches what I see, then we can reset the password for you.
 
I interpret that DX10.1 slide as a multicore CPU statement rather than GPU.

Well, it says multicore for rendering! With DX10 and the Geometry Shader already offloading rendering work from the CPU, there is no point in bringing back the past in DX10.1.

So it's about multi-GPU-core rendering ;)
 
I don't know if the bandwidth issues will be so serious. Look at HT3, where you have a low-latency, low pin count interconnect with 41.6 GB/s of bandwidth for a 32-bit link. If ATI were to use something like this, except in a 2x32-bit variety, I think you have sufficient low-latency bandwidth to move your GPU thread data around. Couple this with AMD's cache coherent HT extensions, some of their special transistor sauce, and you've got something interesting. Now you can give all the processors direct access to each other's RAM in a safe way, with the same individual bandwidth we see today, without the need for a managing processor.
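For reference, that 41.6 GB/s figure falls out of the raw link parameters (assuming HT 3.0's top 2.6 GHz clock and counting both directions):

    # HyperTransport 3.0: 2.6 GHz clock, double-pumped -> 5.2 GT/s per lane.
    transfers_per_sec = 5.2e9
    link_width_bytes = 4                                    # 32-bit link
    per_direction = transfers_per_sec * link_width_bytes    # 20.8 GB/s each way
    aggregate = per_direction * 2                           # 41.6 GB/s both ways
    two_links = aggregate * 2                               # ~83 GB/s for the 2x32-bit case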

Also, once this system is in place, I imagine one could keep the GPUs bought in the past and seamlessly add new GPUs as they become available. At worst, you'd have to buy a new base GPU board to support a new, faster interconnect, but you would maintain backwards compatibility with the older GPUs. Thus, your multi-GPU investment is safe for much longer than it would be otherwise.
 
Well, it says multicore for rendering! With DX10 and the Geometry Shader already offloading rendering work from the CPU, there is no point in bringing back the past in DX10.1.

So it's about multi-GPU-core rendering ;)
In a sense the G80 is already 8-way multicore. Having glanced through the presentation, I'd say it's about being able to have several application contexts running concurrently on the GPU (concurrently in the parallel-cores sense, not in the fast-context-switch timesharing sense).
 