Not grim IMO, but rather it shows what will become important. For example, note how the BG/P OS doesn't do disk-backed memory: pages are always physically pinned, so the DMA engine has low latency and the CPU doesn't touch pages during communication. What I gather from all of it is that eventually the hardware is going to consist of cores plus an interconnect that provides dedicated hardware support for the most important parallel communication patterns, so that the cores aren't involved in communication which is latency bound. Things like CPUs manually doing all the work on interrupts (preemption) just aren't going to scale ... nor are ALUs doing atomic operations on shared queues between cores ... etc. I think all of this goes away at some point in favour of dedicated hardware, and a different model of general-purpose computing.
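Roughly the same idea in CUDA terms (just an illustrative sketch of mine, not the BG/P API): a page-locked allocation gives the DMA engine a stable physical address, so it can run the transfer asynchronously while the CPU stays out of the way.

[code]
// Illustrative sketch only: CUDA's page-locked ("pinned") host memory is the
// closest everyday analogue to always-pinned pages -- the copy engine can DMA
// to/from it without the CPU or the VM system touching the pages mid-transfer.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 1 << 20;
    float *host_buf = 0, *dev_buf = 0;

    // Page-locked allocation: never swapped out, physical address is stable,
    // so the DMA engine can read it directly.
    cudaHostAlloc((void**)&host_buf, bytes, cudaHostAllocDefault);
    cudaMalloc((void**)&dev_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Async copy: the CPU only issues the request and is then free to do other
    // work; the DMA engine owns the transfer.
    cudaMemcpyAsync(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);
    printf("transfer done\n");

    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    cudaStreamDestroy(stream);
    return 0;
}
[/code]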
Yeah, in GPUs the hardware is effectively providing a queue for atomics and the pre-emptive scheduler is sleeping the context when it runs out of non-dependent instructions, and then using other contexts. GPUs currently have no useful concept of multi-GPU atomics, which is where AFR comes from.
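For concreteness, a minimal CUDA sketch of what I mean (a generic example, not from any of the papers): the ALU merely issues the atomic, the memory subsystem resolves it, and the issuing warp can simply be slept until the result comes back.

[code]
// Minimal sketch: per-thread atomic increments on a counter in global memory.
// The hardware serialises these at the memory subsystem; the issuing warp can
// be descheduled while the atomic is in flight and other warps run instead.
__global__ void count_hits(const float *values, int n, unsigned int *counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && values[i] > 0.0f) {
        // The ALU only issues this; completion is handled off-core, so the
        // scheduler is free to switch to another warp in the meantime.
        atomicAdd(counter, 1u);
    }
}
[/code]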
My query is: why can't software threading (fibre-based) enjoy the same benefit? In software threading the scheduler would sleep/queue fibres.
The way I see it, where the atomic operation is performed has no bearing on whether the hardware runs 10s or 100s of contexts in order to hide latency, or on whether software threading is used.
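To make the fibre idea concrete, here's a toy host-side sketch (all names are mine, nothing standard): a fibre that hits a long-latency operation goes back on the run queue and the scheduler picks another, which is exactly the sleep/queue behaviour the hardware gives warps.

[code]
// Toy sketch of fibre-style scheduling (all names hypothetical): a fibre that
// starts a long-latency operation yields, gets re-queued, and the scheduler
// resumes another runnable fibre -- mirroring the hardware warp scheduler.
#include <deque>
#include <functional>
#include <utility>

struct Fibre {
    std::function<bool()> step;   // returns true when the fibre has finished
};

class Scheduler {
public:
    void spawn(std::function<bool()> step) { runnable_.push_back(Fibre{std::move(step)}); }

    void run() {
        while (!runnable_.empty()) {
            Fibre f = runnable_.front();
            runnable_.pop_front();
            // Run the fibre until it finishes or yields (e.g. because it
            // issued a long-latency memory or atomic operation).
            if (!f.step()) {
                runnable_.push_back(f);   // not done: queue it, run another
            }
        }
    }

private:
    std::deque<Fibre> runnable_;
};
[/code]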
Going further out of my depth, how are atomics implemented in virtualised processors?...
My little brother (James Lottes, different last name) worked at Argonne in the MCS Division on tough scaling issues for Blue Gene (until he decided to go back to get his PhD this year; now he works there on and off). An interesting paper related to the issues of scaling algorithms in interconnect-limited cases,
http://www.iop.org/EJ/article/1742-...quest-id=12293745-5238-4326-9be2-43b91b4c4753, covers how they adjust data exchange strategies for the problem to lower network latency.
Being able to use the dedicated reduction hardware seems to be the biggest win there (it was put there with good reason, eh?) but the crystal routing is none-too-shabby!
If you haven't read this PTX simulator paper,
http://www.ece.ubc.ca/~aamodt/papers/gpgpusim.ispass09.pdf, you might find it interesting. Their results showed performance more sensitive to interconnection network bisection bandwidth rather than latency.
Ooh, that's interesting, and it's Wilson Fung again. A nice range of applications and not too much low-hanging fruit computationally.
I've got a number of issues with that paper:
- it doesn't use, as a baseline, the cache sizes that Volkov has indicated exist - though I'm still unclear on whether caching has any effect on global memory fetches (as opposed to fetches through the texturing hardware)
- it doesn't allow for developers to re-configure their algorithms to the hardware configurations they evaluated - this is particularly serious as CUDA programming is very much about finding a sweet-spot for the hardware in hand
- the evaluations all seem too one-dimensional, with the exception of "are more threads better?" which groups several changes, but seemingly doesn't take the criticisms they make as clues for which other variables (e.g. memory controller queue-length) to take into account
- PTX is fairly distant from what the processor executes, both because of the dual-ALU configuration, which isn't simulated, and because of the re-ordering of program flow that the hardware can perform. The very low MAD count makes me wonder if the driver-compiler optimisation of MUL + ADD into MAD is one of the things they missed - though I know that some developers steadfastly try to prevent this particular optimisation from occurring (see the sketch just below)
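By way of illustration (my guess at the sort of thing those developers do, not something from the paper): the compiler is free to contract a multiply and add into a single MAD, and the __fmul_rn/__fadd_rn intrinsics are the usual way to stop it.

[code]
// Illustrative only: two ways of writing x*y + z in CUDA. The first may be
// contracted into a single MAD by the compiler; the second uses the
// __fmul_rn/__fadd_rn intrinsics, which are never contracted, so the MUL and
// ADD stay separate (what some developers do deliberately to control rounding).
__device__ float mad_allowed(float x, float y, float z) {
    return x * y + z;                      // may become one MAD instruction
}

__device__ float mad_prevented(float x, float y, float z) {
    return __fadd_rn(__fmul_rn(x, y), z);  // forced to stay MUL then ADD
}
[/code]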
There are some truly cruel IPCs shown there.
Warp occupancy (inverse of branch-divergence) seems pretty decent overall, but MUM is a disaster zone.
All the worst-performing applications (BFS, DG, MUM, NN, WP - NQU isn't an application if you ask me, and I can't actually see its performance there anyway) show that "perfect memory" would make a very substantial difference in performance.
With regard to on-die communication topology I wonder if these GPUs are using a single communications network. The diagrams for ATI GPUs clearly indicate multiple networks and 3dilettante's point:
http://forum.beyond3d.com/showpost.php?p=1292923&postcount=1103
about texture cache traffic being uni-directional is a powerful one that massively affects the simulations performed.
It's also interesting that the ring doesn't look very good there. Did they give it enough bandwidth?
They also added a cache in their simulation, which indeed helped some of the apps, but also reduced the performance of a lot of them.
"a lot"? CP suffers due to a simulation artefact. RAY and FWT may suffer from cache policy. And without developer-optimisation, as I said earlier, it's not saying much - developers have optimised for whatever cache the apps have on the hardware they've tested, e.g. MUM is 2D texturing and benefits greatly. LIB is making heavy use of local memory (private to the thread in video memory is the definition of local memory, I guess) yet shows a performance decline as more and more cache is added that isn't explained.
Jawed