Nvidia GT300 core: Speculation

What do you know about the architecture of the GT300 stream processors... maybe they'll support FP64 in a similar way to the RV770 (i.e. same units, just more clock loops required).
 
Maybe that's what they'll do - maybe DP in GT200 is the way it is because it was bolted on, meaning the SP MAD lanes weren't touched, minimising the design effort and making for an interim solution.

Jawed
 
Say you build a "global illumination" algorithm: you might use low refresh rates for the computations, or even stagger them over successive frames, operating on different areas/LODs in round-robin fashion.

Or you have a particle system built upon an effects-physics simulation.

Or reprojection caching of pixels, using final shaded pixel results from prior frames.

Any kind of technique that makes stuff persist over multiple frames is AFR-unfriendly.
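
As a rough, hypothetical illustration (no real engine or API implied), here's what that kind of round-robin staggering looks like in code, and why the persistent data it relies on fights AFR:

```cpp
// Hypothetical sketch, not taken from any real engine: a round-robin GI update
// where only one region of a persistent irradiance cache is refreshed per frame,
// so shading always consumes results computed up to several frames earlier.
#include <array>
#include <cstddef>

struct Region { /* irradiance probes, bounce lighting, etc. */ };

constexpr std::size_t kRegionCount = 8;
std::array<Region, kRegionCount> gIrradianceCache;   // persists across frames

void UpdateGI(Region&) { /* expensive: recompute lighting for one region */ }
void ShadeFrame(const std::array<Region, kRegionCount>&) { /* reads ALL regions */ }

void RenderFrame(std::size_t frameIndex) {
    // Refresh just one region this frame (round robin over areas/LODs)...
    UpdateGI(gIrradianceCache[frameIndex % kRegionCount]);

    // ...but shading reads the whole cache, i.e. data written up to
    // kRegionCount - 1 frames ago. Under AFR, consecutive frames run on
    // different GPUs, so whatever one GPU wrote has to be copied to the other
    // every frame -- the inter-frame dependency is what kills AFR scaling.
    ShadeFrame(gIrradianceCache);
}

int main() {
    for (std::size_t frame = 0; frame < 16; ++frame) RenderFrame(frame);
    return 0;
}
```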

Jawed

That makes sense. And honestly, if you take the mile-high view of things like I do and then look at the kind of parallelism available in the shading stages, it's a shame that the best mGPU approach is plain AFR. :rolleyes: Surely better techniques have to be available to achieve mGPU scaling, even if it means moving in an LRB-like direction.
 
Maybe that's what they'll do - maybe DP in GT200 is the way it is because it was bolted on, meaning the SP MAD lanes weren't touched, minimising the design effort and making for an interim solution.

Jawed

When I was posting that assumption, I was (rightfully) confronted with the extended DP feature set Nvidia markets right now. So it seems kind of odd if they'd go for a feature-poor implementation by stitching the normal ALU lanes together.
 
Would stitching lanes result in loss of features?

My understanding is that comprehensive IEEE-754 support is going to prove unavoidable, so it then becomes a question of the margin in comprehensiveness between single precision and double precision.

Jawed
 
It's been done before ... but for the moment GPGPU that requires IEEE support is a tiny market, one that will become very hard to compete in with Larrabee around, so meh. NVIDIA can't go back on its support for backward-compatibility reasons, but was it ultimately necessary? I don't think so.

Hell, the exceptions are as important as rounding modes ... they aren't fully compliant to start with, and likely won't be.
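
To make concrete what "fully compliant" covers, here's a small host-side C++ sketch using the standard <cfenv> facilities (GPUs expose these concepts through their own instruction sets, not this API): selectable rounding modes plus sticky exception flags are both part of the package, not just a correctly rounded multiply-add.

```cpp
#include <cfenv>
#include <cstdio>

#pragma STDC FENV_ACCESS ON

int main() {
    volatile float one = 1.0f, three = 3.0f;

    // Rounding modes: the same division legally gives two different results.
    std::fesetround(FE_UPWARD);
    float up = one / three;
    std::fesetround(FE_DOWNWARD);
    float down = one / three;
    std::printf("round up:   %.9g\nround down: %.9g\n", up, down);
    std::fesetround(FE_TONEAREST);

    // Exception flags: a compliant implementation must raise and keep them.
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double tiny = 1e-308;
    volatile double r = tiny * tiny;            // underflows towards zero
    (void)r;
    if (std::fetestexcept(FE_UNDERFLOW)) std::printf("underflow flagged\n");
    if (std::fetestexcept(FE_INEXACT))   std::printf("inexact flagged\n");
    return 0;
}
```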
 
Would stitching lanes result in loss of features?
That I don't know - but I presume it's a question of transistor budget and how much you're going to have to add to each lane to make stitching work with the full (or better) feature set of the current DP unit.
 
Would stitching lanes result in loss of features?

I think so. At least if you wanted to keep the SP ALUs minimal. Right now all the fancy transistors work only for DP; if every two SP ALUs are ganged together to make a DP ALU, then those fancy transistors would have to be replicated several times over, which I think would be prohibitive from a transistor-budget POV. AFAIK, AMD's DP is 3x faster but has fewer features compared to NV's.
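
As a purely software analogue (Knuth/Dekker-style double-single arithmetic, which is not how hardware lane pairing works), the sketch below shows how two FP32 values can be stitched together for extra precision, and why the result is still feature-poor next to a real DP unit:

```cpp
// Software analogue only, NOT hardware lane pairing: you get roughly a 48-bit
// significand, FP32 exponent range, and none of the IEEE rounding or exception
// behavior of a real DP unit -- the equivalent of the "fancy transistors" is
// exactly what's missing.
#include <cstdio>

struct dsfloat { float hi, lo; };   // value = hi + lo, with |lo| << |hi|

// Error-free addition (Knuth two-sum): s + e equals a + b exactly.
static void two_sum(float a, float b, float& s, float& e) {
    s = a + b;
    float v = s - a;
    e = (a - (s - v)) + (b - v);
}

static dsfloat ds_add(dsfloat a, dsfloat b) {
    float s, e;
    two_sum(a.hi, b.hi, s, e);
    e += a.lo + b.lo;
    float hi, lo;
    two_sum(s, e, hi, lo);          // renormalise so hi carries the leading bits
    return {hi, lo};
}

int main() {
    dsfloat x{1.0f, 0.0f}, y{1e-9f, 0.0f};   // 1 + 1e-9 is lost in plain FP32
    dsfloat z = ds_add(x, y);
    std::printf("plain fp32   : %.12g\n", double(1.0f + 1e-9f));
    std::printf("double-single: %.12g\n", double(z.hi) + double(z.lo));
    return 0;
}
```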
 
That makes sense. And honestly, if you take the mile-high view of things like I do and then look at the kind of parallelism available in the shading stages, it's a shame that the best mGPU approach is plain AFR. :rolleyes: Surely better techniques have to be available to achieve mGPU scaling, even if it means moving in an LRB-like direction.

What about LRB makes it potentially more adept at multi-chip scaling? It seems that the fundamental data sharing problem of AFR that Jawed pointed out also applies to SFR and tiling. It seems like the only way forward would be some sort of memory sharing and cache coherent multi-chip architecture.
 
It seems like the only way forward would be some sort of memory sharing and cache coherent multi-chip architecture.

My point exactly. :)

An LRB-based X2 board would be comparable to a multi-socket server architecturally, having shared memory and coherent caches.
 
The problem is bandwidth ... cache coherency Larrabee style causes more problems than it solves when used externally (generating extra bandwidth is what snooping is good at).
 
When comparing against the HD 4850 - no. It's about 40-50 watts higher under load, and I don't want to attribute all of that to the GPU alone, although it does run at a higher voltage.
Voltage does make a large difference. In fact, it's where the majority of the idle power savings come from with the HD 4890.

However, there is nearly 2x the bandwidth between the GDDR5 and GDDR3 HD 4800s, which means 2x the speed on the PHY and memory controller - of course that is going to draw more power (nothing is free in 3D). Run a PCI Express Gen 2 device in Gen 1 mode and there will be a big power difference.

You/AMD have shown that PP 2.0 can work really well in the case of the HD 4670. All I'm saying is that I'd love to see an equivalently impressive implementation on the HD 4770. :)
Load efficiency and idle power savings/features are different things.
 
The problem is bandwidth ... cache coherency Larrabee style causes more problems than it solves when used externally (generating extra bandwidth is what snooping is good at).

Does Larrabee even have the hooks to begin considering that approach? There's Quickpath and Hypertransport for CPUs but has there been any word on any sort of inter-chip link on LRB?
 
The problem is bandwidth ... cache coherency Larrabee style causes more problems than it solves when used externally (generating extra bandwidth is what snooping is good at).

That's where the 3D part comes in. These caches aren't full-blown read/write, truly general memory caches. They are read-only caches (for textures). The tiles that LRB is going to keep in its L2 are non-overlapping (for render targets/color/Z). So the ALUs/cores are not sharing any cache lines (for that part at least). There's not much scope for the kind of cache-coherency traffic that can occur in servers, for instance. Come to think of it, even the on-chip LRB ring bus would saturate if there were too much traffic just to maintain cache coherency.


So there are software side guarantees to keep the cache apocalypse from happening.
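
A minimal, hypothetical sketch (not Intel's actual code) of what such a software-side guarantee could look like: the render target is cut into non-overlapping tiles and each tile is owned by exactly one core, so no render-target cache line is ever shared between cores.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Tile { uint32_t x0, y0, x1, y1; };   // screen-space rectangle

// Static ownership: a pure function of tile position, so no locking and no
// cross-core sharing is ever needed to decide who renders what.
inline uint32_t OwnerCore(uint32_t tileX, uint32_t tileY,
                          uint32_t tilesPerRow, uint32_t coreCount) {
    return (tileY * tilesPerRow + tileX) % coreCount;
}

// Build each core's private tile list; each core rasterises and shades only its
// own tiles, keeping that tile's color/Z resident in its own L2 slice, so the
// coherency protocol has nothing to do for that traffic.
std::vector<std::vector<Tile>> PartitionScreen(uint32_t width, uint32_t height,
                                               uint32_t tileSize, uint32_t coreCount) {
    const uint32_t tilesPerRow = (width  + tileSize - 1) / tileSize;
    const uint32_t tilesPerCol = (height + tileSize - 1) / tileSize;
    std::vector<std::vector<Tile>> perCore(coreCount);
    for (uint32_t ty = 0; ty < tilesPerCol; ++ty)
        for (uint32_t tx = 0; tx < tilesPerRow; ++tx)
            perCore[OwnerCore(tx, ty, tilesPerRow, coreCount)].push_back(
                {tx * tileSize, ty * tileSize,
                 std::min((tx + 1) * tileSize, width),
                 std::min((ty + 1) * tileSize, height)});
    return perCore;
}

int main() {
    auto work = PartitionScreen(1920, 1080, 64, 32);   // e.g. 32 cores, 64x64 tiles
    return work.size() == 32 ? 0 : 1;
}
```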


Does Larrabee even have the hooks to begin considering that approach? There's Quickpath and Hypertransport for CPUs but has there been any word on any sort of inter-chip link on LRB?

They can be added if Intel wants.
 
Then you don't really need that kind of fine-grained coherency to begin with, now do you? It's a poorly scaling scheme, one which even Intel will have to ditch as they increase the number of cores.
 
Why did they go for coherent caches, then? Because they may be helpful in non-graphics workloads, perhaps. For mGPU scaling, however, a fully general, CPU-like cache coherency protocol is unnecessary. Unless, of course, we see some radical changes in the graphics pipeline in the future.
 
Well, if in the future the color of each pixel starts depending on the color of every other pixel, then you'd need LRB-like cache coherency. I doubt things will become as bad (or as radical) as that. The coherent caches appear to be there to handle non-graphics stuff and to provide the familiarity of existing programming models. Whatever changes the future brings, they'd likely respect the inherent parallelism of graphics.
 