Nvidia GT300 core: Speculation

I think the "shader resolve AA" on R6xx is only used for the "xxxx-tent filter AA" modes; MSAA still uses the RBEs on R6xx.
 
I must admit, I've kind of lost the connection between this and the previous (also off-topic) discussion about DX7 hardware and HW/SW TnL. As far as I know, R6xx uses shaders for every kind of AA, including the standard MSAA box filter modes.
 
I think the "shader resolve AA" on R6xx is only used for the "xxxx-tent filter AA" modes; MSAA still uses the RBEs on R6xx.
R6xx uses RBEs, but not for the resolve pass. The resolve pass is always performed by the shader core (for all modes, including box filter).

This changed with R7xx, which is able to correctly perform MSAA resolve via RBEs.
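
For what it's worth, the box-filter resolve itself is trivial to express as a kernel, which is presumably why running it on the shader core was viable at all. A minimal CUDA-style sketch, assuming a made-up plane-sequential sample layout (the names and layout are purely illustrative, not how R6xx actually stores its samples):

```cuda
#include <cuda_runtime.h>

// Box-filter MSAA resolve: the resolved pixel is simply the average of its
// N samples. Samples are assumed stored plane-sequentially, one
// full-resolution plane per sample index.
__global__ void boxFilterResolve(const float4* samples, float4* resolved,
                                 int width, int height, int numSamples)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float4 sum = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    for (int s = 0; s < numSamples; ++s) {
        float4 c = samples[(s * height + y) * width + x];
        sum.x += c.x; sum.y += c.y; sum.z += c.z; sum.w += c.w;
    }
    float inv = 1.0f / numSamples;
    resolved[y * width + x] = make_float4(sum.x * inv, sum.y * inv,
                                          sum.z * inv, sum.w * inv);
}
```

The tent and edge-detect CFAA filters are essentially wider or data-dependent weightings of the same loop, which is why those modes need the shader core regardless.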
 
I know NVIDIA's design decisions haven't always impressed everyone lately, but I hope you're not suggesting they replaced all their engineers with drunk monkeys?

Hi, sorry to butt into the discussion. I don't understand all the deep techspeak of the past pages; forgive me, I am only a normal consumer wishing to buy my first unified GPU. Can you please explain in layman's English why Nvidia's design is bad lately?

I read that at launch the GTX 260 and 280 were overpriced relative to their performance versus ATI's offerings, so they weren't recommended. Now I can get Nvidia GTX 200 cards at prices competitive with their ATI equivalents, and Nvidia supposedly performs better in many games at the same price levels and has better power/thermal management. However, I heard that Nvidia designs its cards around a batch of game engines (they work more closely with developers), while ATI designs its cards more around what 3D should look like given the transistor budget and lithography of the period. Is this the reason why ATI has the better design? Because they design for the evolution of the DX specs, do their cards tend to perform and age better with newer 3D engines?

I hope you can understand what I am trying to find out...
 
=>gongo: You can buy similarly powerful ATI and nVidia hardware for similar prices, but in terms of manufacturing costs, nVidia hardware is more expensive. That's why nVidia's design is "bad", and it's quite important for the manufacturer. You as a consumer, on the other hand, couldn't care less about these things and should just buy whatever gives you more bang for your buck.
 
Lukfi,

It's natural to lose the thread with such an amalgam of different topics. In any case, Carsten's notion was that NV might have ditched the ff ROPs and incorporated their functions elsewhere, and that's what Arun was actually replying to.

And just because I love to drop some oil into the fire while it's still burning: OK, ff ROPs it is. Can I still bet on twice the Pixel/Z throughput compared to today's GPUs? :devilish:
 
Can I still bet on twice the Pixel/Z throughput compared to today's GPUs? :devilish:

Due to an increase in the number of ROPs or per ROP throughput? If Nvidia is currently happy with the bandwidth per ROP then it makes sense that a 512-bit GDDR5 bus will serve the equivalent of 64 GT200 class ROPs.
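
To put rough numbers on that (GTX 280 figures from memory and a guessed ~2x per-pin data rate for GDDR5, so treat it purely as a sanity check):

```cuda
#include <cstdio>

// Back-of-envelope: does a 512-bit GDDR5 bus sustain the same bandwidth per
// ROP for 64 ROPs that GT200 gives its 32? Data rates here are assumptions.
int main()
{
    const double busBits   = 512.0;
    const double gddr3Gbps = 2.214;             // GTX 280 effective rate per pin (Gb/s)
    const double gddr5Gbps = 2.0 * gddr3Gbps;   // assume roughly double for GDDR5

    double gt200Bw  = busBits / 8.0 * gddr3Gbps;  // ~141.7 GB/s
    double bwPerRop = gt200Bw / 32.0;             // ~4.4 GB/s per ROP on GT200

    double gddr5Bw         = busBits / 8.0 * gddr5Gbps;  // ~283 GB/s
    double ropsAtSameRatio = gddr5Bw / bwPerRop;          // ~64

    printf("GT200: %.1f GB/s total, %.2f GB/s per ROP\n", gt200Bw, bwPerRop);
    printf("512-bit GDDR5: %.1f GB/s -> %.0f ROP-equivalents at that ratio\n",
           gddr5Bw, ropsAtSameRatio);
    return 0;
}
```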
 
Due to an increase in the number of ROPs or per ROP throughput? If Nvidia is currently happy with the bandwidth per ROP then it makes sense that a 512-bit GDDR5 bus will serve the equivalent of 64 GT200 class ROPs.

No idea. But yes, you're right, the increase in memory bandwidth according to the most recent rumours sounds quite large. Your latter scenario sounds a tad weird to me as a layman, since I'd figure 8 ROPs/partition unless someone can convince me that 32-bit MCs can be a better idea than 64-bit MCs.
 
Due to an increase in the number of ROPs or per ROP throughput? If Nvidia is currently happy with the bandwidth per ROP then it makes sense that a 512-bit GDDR5 bus will serve the equivalent of 64 GT200 class ROPs.

Or perhaps the number of ROPs is not scheduled to increase but rather their capabilities. I'm sure NV isn't happy to trail ATI in the 8x MSAA performance category as is the case now.
 
Or perhaps the number of ROPs is not scheduled to increase but rather their capabilities. I'm sure NV isn't happy to trail ATI in the 8x MSAA performance category as is the case now.

They may or may not address that particular deficiency. It hasn't been much of an issue for them and there are a lot bigger fish to fry. First to 60fps in Crysis at 4MP wins!
 
The raster back-end hardware already takes quite a large chunk (die-area wise) of the current GT200 architecture. If GT300 is to significantly boost its ALU capacity (and move to an even more CUDA-ish overall design), it will probably be a matter of optimization per transistor/area instead of pumping in more of the same old stuff, not only for the ROP domain but for every other piece of graphics-dedicated hardware in there.
Some of those measures could include mapping suitable fixed-function assets into optimized program kernels to be emulated, sort of.
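
As a toy example of what mapping a fixed-function asset onto a kernel could look like, here's classic SRC_ALPHA / ONE_MINUS_SRC_ALPHA blending (one of the bread-and-butter ROP jobs) written as plain CUDA; this is purely illustrative, not anything known about GT300:

```cuda
#include <cuda_runtime.h>

// Fixed-function blending "emulated" as a read-modify-write kernel over the
// render target. The arithmetic is trivial; what the real ROPs also give you
// is guaranteed primitive ordering and compression, which a naive kernel
// like this one does not.
__global__ void alphaBlend(float4* dst, const float4* src, int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels) return;

    float4 s = src[i];
    float4 d = dst[i];
    float a  = s.w;

    dst[i] = make_float4(s.x * a + d.x * (1.0f - a),
                         s.y * a + d.y * (1.0f - a),
                         s.z * a + d.z * (1.0f - a),
                         s.w * a + d.w * (1.0f - a));
}
```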
 
R6xx uses RBEs, but not for the resolve pass. The resolve pass is always performed by the shader core (for all modes, including box filter).

This changed with R7xx, which is able to correctly perform MSAA resolve via RBEs.

http://www.pcinlife.com/article/graphics/2008-06-30/1214766055d534_4.html

[Attached images: fb2004 results for the R600 XT and RV770 Pro]
 
And just because I love to drop some oil into the fire while it's still burning: OK, ff ROPs it is. Can I still bet on twice the Pixel/Z throughput compared to today's GPUs? :devilish:
And your little birdie didn't tell you what exactly those ff ROPs are still capable of? ;)
 
The raster back-end hardware already takes quite a large chunk (die-area wise) of the current GT200 architecture. If GT300 is to significantly boost its ALU capacity (and move to an even more CUDA-ish overall design), it will probably be a matter of optimization per transistor/area instead of pumping in more of the same old stuff, not only for the ROP domain but for every other piece of graphics-dedicated hardware in there.
Some of those measures could include mapping suitable fixed-function assets into optimized program kernels to be emulated, sort of.

I was thinking in that direction too as a layman. It's just that Arun's drunken monkeys keep me from thinking further in that direction :LOL:

And your little birdie didn't tell you what exactly those ff ROPs are still capable of? ;)

Since I claim to belong to the feline family, birds usually don't have enough time to even squeak :p
 
Not grim IMO, but rather it shows what will become important. For example, note how the BG/P OS doesn't do disc-backed memory; pages are always physically pinned so the DMA engine has low latency and the CPU doesn't touch pages during communication. What I gather from all of it is that eventually the hardware is going to consist of cores plus an interconnect which provides dedicated hardware support for the most important parallel communication patterns, so that the cores aren't involved in communication which is latency-bound. Things like CPUs manually doing all the work on interrupts (preemption) just aren't going to scale ... nor are ALUs doing atomic operations on shared queues between cores ... etc. I think all this goes away at some point in favour of dedicated hardware, and a different model of general purpose computing.
Yeah, in GPUs the hardware is effectively providing a queue for atomics and the pre-emptive scheduler is sleeping the context when it runs out of non-dependent instructions, and then using other contexts. GPUs currently have no useful concept of multi-GPU atomics, which is where AFR comes from.

My query is: why can't software threading (fibre-based) enjoy the same benefit? In software threading the scheduler would sleep/queue fibres.

The way I see it, where the atomic operation is performed doesn't impinge on whether the hardware runs 10s or 100s of contexts in order to hide latency, or whether software threading is used.
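
To make the "shared queue" pattern concrete, this is roughly what it looks like from the software side today (a persistent-threads style sketch; processWork() and the single global counter are placeholders of mine):

```cuda
#include <cuda_runtime.h>

// One global work counter shared by every "fibre"/thread: whoever finishes an
// item grabs the next index with an atomic. The read-modify-write is queued
// and serviced by the memory subsystem, and the scheduler hides its latency
// by running other warps, so the ALUs never spin on a lock.
__device__ int g_nextItem = 0;

__device__ void processWork(int item)
{
    // placeholder for the real per-item work
}

__global__ void persistentWorker(int totalItems)
{
    for (;;) {
        int item = atomicAdd(&g_nextItem, 1);  // hardware-arbitrated
        if (item >= totalItems) break;         // queue drained
        processWork(item);
    }
}
```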

Going further out of my depth, how are atomics implemented in virtualised processors?...

My little brother (James Lottes, different last name) worked at Argonne in the MCS Division on tough scaling issues for Bluegene (until he decided to go back for his PhD this year; now he works there on and off). An interesting paper related to the issues of scaling algorithms in interconnect-limited cases, http://www.iop.org/EJ/article/1742-...quest-id=12293745-5238-4326-9be2-43b91b4c4753, covers how they adjust data exchange strategies for the problem to lower network latency.
Being able to use the dedicated reduction hardware seems to be the biggest win there (it was put there with good reason, eh?) but the crystal routing is none-too-shabby!

If you haven't read this PTX simulator paper, http://www.ece.ubc.ca/~aamodt/papers/gpgpusim.ispass09.pdf, you might find it interesting. Their results showed performance more sensitive to interconnection network bisection bandwidth rather than latency.
Ooh, that's interesting and it's Wilson Fung again. A nice range of applications and not too much low-hanging fruit computationally.

I've got a number of issues with that paper:
  1. it doesn't use the cache sizes that Volkov has indicated exist, as a baseline - though I'm still unclear on whether caching has any effect on global memory fetches (as opposed to fetches through texturing hardware)
  2. it doesn't allow for developers to re-configure their algorithms to the hardware configurations they evaluated - this is particularly serious as CUDA programming is very much about finding a sweet-spot for the hardware in hand
  3. the evaluations all seem too one-dimensional, with the exception of "are more threads better?" which groups several changes, but seemingly doesn't take the criticisms they make as clues for which other variables (e.g. memory controller queue-length) to take into account
  4. PTX is fairly distant from what the processor executes, both because of the dual-ALU configuration, which isn't simulated, and because of the re-ordering of program flow that the hardware can perform. The very low MAD count makes me wonder if the driver-compiler optimisation of MUL + ADD into MAD is one of the things they missed - though I know that some developers steadfastly try to prevent this particular optimisation from occurring (see the sketch just below this list)
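
Here's the MUL + ADD contraction from point 4 in miniature; the two-function example is mine, but __fmul_rn/__fadd_rn are the documented way to keep the compiler from merging the pair into a MAD:

```cuda
#include <cuda_runtime.h>

// By default the compiler is free to contract a*b + c into a single MAD.
__device__ float madAllowed(float a, float b, float c)
{
    return a * b + c;                        // usually emitted as one MAD
}

// The _rn intrinsics are never merged, so this stays a separate MUL and ADD
// (useful when you need the intermediate product rounded exactly as written).
__device__ float madPrevented(float a, float b, float c)
{
    return __fadd_rn(__fmul_rn(a, b), c);
}
```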
There are some truly cruel IPCs shown there :p Warp occupancy (the inverse of branch divergence) seems pretty decent overall, but MUM is a disaster zone.

All the worst-performing applications (BFS, DG, MUM, NN, WP - NQU isn't an application if you ask me - also can't actually see performance there) show that "perfect memory" would make a very substantial difference in performance.

With regard to on-die communication topology I wonder if these GPUs are using a single communications network. The diagrams for ATI GPUs clearly indicate multiple networks and 3dilettante's point:

http://forum.beyond3d.com/showpost.php?p=1292923&postcount=1103

about texture cache traffic being uni-directional is a very powerful point that massively affects the simulations performed.

It's also interesting that the ring doesn't look very good there. Did they give it enough bandwidth?
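
On "enough bandwidth": the usual comparison is bisection bandwidth, and a ring is at a structural disadvantage there because only two links ever cross the cut, however many stops the ring has. A crude illustration (the per-link figure is made up):

```cuda
#include <cstdio>

// Bisection bandwidth: cut the network into two halves and sum the link
// bandwidth crossing the cut. For a ring that's a constant two links; for a
// crossbar it grows with the number of ports on each side of the cut.
int main()
{
    const double linkGBps = 32.0;  // assumed per-link bandwidth, one direction
    for (int nodes = 8; nodes <= 32; nodes *= 2) {
        double ring     = 2.0 * linkGBps;            // two links cross any bisection
        double crossbar = (nodes / 2.0) * linkGBps;  // ~N/2 links cross the cut
        printf("%2d nodes: ring %5.0f GB/s, crossbar %5.0f GB/s\n",
               nodes, ring, crossbar);
    }
    return 0;
}
```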

They also added a cache in their simulation, which indeed helped some of the apps, but also reduced the performance of a lot of them.
"a lot"? CP suffers due to a simulation artefact. RAY and FWT may suffer from cache policy. And without developer-optimisation, as I said earlier, it's not saying much - developers have optimised for whatever cache the apps have on the hardware they've tested, e.g. MUM is 2D texturing and benefits greatly. LIB is making heavy use of local memory (private to the thread in video memory is the definition of local memory, I guess) yet shows a performance decline as more and more cache is added that isn't explained.

Jawed
 
Typically unused features get designed out in the next big silicon rev.

Given that the news about the Advanced Scheduler being dropped came far enough in advance of Vista shipping, ATI might have removed the feature before the 2xxx series.
[...]
this was a case of aiming too far ahead of some sectors of the market. I have to watch what I say here.
Subtle mis-application of this quote, but the overall context is what I want:
Nope. R600 paid the price.
Hmm, now I'm wondering if the combination of scheduling and ring-bus in R600 was the cause of much of the bloat that disappeared with R700 once those features were dropped.

Though I'm still doubtful on the scheduling thing (never could find a sign of it in R600). But at the same time I'm wondering if the ring-bus was implemented in order to support higher costs of moving data around the die and/or on/off die, due to swapping amongst contexts (swapping registers against memory, load-balancing?).

I'm still dubious about how AMD will expand the cluster count, e.g. to 20 clusters. Of course it helps having only 4 MCs/RBEs/L2s in the current architecture, and maybe if the MCs and L2s stay at four in RV870 it's not too bad. RBEs seem to be independent of MC count; they effectively have their own private caches and a completely separate input fed by Shader Export, which handles output from kernels.

Jawed
 