NVIDIA Fermi: Architecture discussion

I've seen descriptions of it being out-of-order completion; is there a source that confirms out-of-order issue?
You can start here:

Tracking register usage during multithreaded processing using a scoreboard having separate memory regions and storing sequential register size indicators

(Link doesn't actually work, so here's the naked URL:

http://v3.espacenet.com/publication...=B1&FT=D&date=20081007&DB=EPODOC&locale=en_V3

)

and rummage for others. The register dependency scoreboarding is what determines which instructions can issue. So instructions can issue out of order.
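
As a rough illustration of why that implies out-of-order issue (a toy sketch only, not NVidia's actual hardware - the instruction format and scheduler here are made up): the scoreboard marks the destination register of every in-flight instruction as busy, and the issue logic picks any instruction in its window whose operands are clear, which may not be the oldest one.

```python
# Toy register-dependency scoreboard (illustrative only, not a description
# of NVidia's real issue logic). An instruction can issue only when none of
# its source or destination registers have a pending write.
from collections import namedtuple

Instr = namedtuple("Instr", "op dst srcs")


class Scoreboard:
    def __init__(self):
        self.busy = set()                      # registers with a pending write

    def ready(self, instr):
        # RAW hazard on the sources, WAW hazard on the destination.
        return not (({instr.dst} | set(instr.srcs)) & self.busy)

    def issue(self, instr):
        self.busy.add(instr.dst)               # cleared again at writeback

    def complete(self, instr):
        self.busy.discard(instr.dst)


def pick_next(window, sb):
    """Issue the first *ready* instruction in the window - which may not be
    the oldest, i.e. issue can slip past a stalled instruction."""
    for instr in window:
        if sb.ready(instr):
            window.remove(instr)
            sb.issue(instr)
            return instr
    return None                                # everything is blocked this cycle
```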

Jawed
 

So you didn't know that? ... Interesting...
But I'll explain it again. You can run some games maxed out (even using Eyefinity) with an HD 5850, not because the HD 5850 is a very powerful card (although I'm not saying it isn't), but rather because games nowadays don't have hefty requirements to be played: most of them are just console ports, and developers focus on the lowest common denominator (the main SKU is always the console one).
 
That's fine but that raises more questions than it answers. I'm gonna assume that you guys do lots of profiling of existing and future game workloads and use that analysis to determine where to focus with future hardware. So now you're saying that doubling texture units and doubling ALUs did not result in doubled performance because the bottleneck is elsewhere. So why didn't you guys address those bottlenecks instead of doubling up stuff unnecessarily? Honest question.

There is always a bottleneck (or the design is so underachieving it can't accomplish much) but it isn't the same in every situation, and some are not in the hands of the GPU silicon.

Bottlenecks at the system level are not going to change.
Certain bottlenecks such as memory bandwidth cannot be easily fixed, particularly if the design is meant to be an evolution of an older architecture and also facing economic and electrical constraints on the ability to scale external bandwidth.
Triangle setup, if it is the root cause of some of the scaling problems in the Heaven benchmark, would not be a bottleneck in most current games, or would be a bottleneck in only certain situations.

If ALU and TMU resources hadn't been doubled, we'd probably be complaining about the increase in situations where the chip is ALU and TMU limited.
 
I recently saw benchmarks for the Powercolor watercooled version. My 20-30% guesstimate was maybe overblown but 10-20% is certainly within the realm of the possible.
10%-20% sounds doable to me too. Don't forget the 4870 wasn't bandwidth limited (so increasing RAM frequency didn't really help much), but the 5870 could benefit from faster RAM - and it should be possible to get 6 Gbps GDDR5 chips (even though those are overvolted parts) instead of the 5 Gbps parts used now.
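
Just to put rough numbers on the memory side (assuming the 5870's 256-bit bus; the data rates below are nominal, not exact retail clocks):

```python
# Back-of-the-envelope bandwidth comparison for a 256-bit GDDR5 bus
# (assumed 5870 configuration; data rates are nominal, not exact SKUs).
BUS_BITS = 256

def bandwidth_gb_s(data_rate_gbps):
    return BUS_BITS * data_rate_gbps / 8       # GB/s

stock  = bandwidth_gb_s(4.8)                   # ~153.6 GB/s with the ~5 Gbps parts
faster = bandwidth_gb_s(6.0)                   # ~192.0 GB/s with 6 Gbps parts

print(f"{stock:.1f} -> {faster:.1f} GB/s "
      f"({100 * (faster / stock - 1):.0f}% more raw bandwidth)")
```

A ~25% raw bandwidth bump only turns into a 10-20% frame-rate gain to the extent the card is actually bandwidth limited, which is the point above.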
 
Okay, so given the current estimates of the 512-core based GeForce being close to the HD 5970, what would GTX 360 performance be?

Isn't it logical that the GTX 360 would use the same chips that the newly announced Tesla cards use (i.e. 448 cores)? In that case, would it beat the HD 5870?
 
There is always a bottleneck (or the design is so underachieving it can't accomplish much) but it isn't the same in every situation, and some are not in the hands of the GPU silicon. Certain bottlenecks such as memory bandwidth cannot be easily fixed, particularly if the design is meant to be an evolution of an older architecture and also facing economic and electrical constraints on the ability to scale external bandwidth.

Well exactly, my question is why aren't resources directed to resolve those bottlenecks instead of piling on stuff for non-existent workloads?

If ALU and TMU resources hadn't been doubled, we'd probably be complaining about the increase in situations where the chip is ALU and TMU limited.

It would be nice to see an example of that. So far there aren't any outliers, nearly every single game falls in the same 45-55% band in terms of HD5870's advantage over RV770. You can either claim that workloads are not making use of the available horsepower (which then leads back to the above question) or you can claim that the available horsepower isn't being applied efficiently. Can't really have it both ways.
 
http://v3.espacenet.com/publication...=B1&FT=D&date=20081007&DB=EPODOC&locale=en_V3

and rummage for others. The register dependency scoreboarding is what determines which instructions can issue. So instructions can issue out of order.

What I see claimed is a method for what amounts to scoreboard compression, such that contiguous runs of registers that can pose a hazard can be checked more quickly and the scoreboard can scale to higher register and thread counts.
It is possible to scoreboard an in-order design or allow out of order completion.
I'm not seeing the claim that the issue stage--upon encountering a hazard in the scoreboard--will attempt to issue the next instruction.

It is also possible that this patent may not entirely apply to an actual product, as patents do not always wind up in one. The scoreboard scheme could possibly be used for an OoO completion architecture.
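
For what it's worth, here is a hypothetical sketch of what the claimed compression could look like - one (base register, run size) entry per contiguous run of pending writes instead of one flag per register - going purely by the claim language, not by any knowledge of shipped hardware:

```python
# Hypothetical reading of the "sequential register size indicator" idea in
# the patent: one scoreboard entry covers a contiguous run of registers with
# pending writes, so a single range check replaces many per-register flags.
# This is NOT a confirmed description of any actual GPU.
from collections import namedtuple

Entry = namedtuple("Entry", "thread base size")   # run = [base, base + size)


def has_hazard(entries, thread, reg):
    """True if `reg` of `thread` falls inside any pending-write run."""
    return any(e.thread == thread and e.base <= reg < e.base + e.size
               for e in entries)


# A vec4 write to r4..r7 becomes one entry instead of four separate flags.
pending = [Entry(thread=3, base=4, size=4)]
assert has_hazard(pending, thread=3, reg=6)       # r6 is in flight -> hazard
assert not has_hazard(pending, thread=3, reg=8)   # r8 is clear -> no hazard
```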
 
It is also possible that this patent may not entirely apply to an actual product, as patents do not always wind up in one. The scoreboard scheme could possibly be used for an OoO completion architecture.
That's why I suggested you rummage - if you're interested, you'll want to read across a set of patent documents.

Jawed
 
Well exactly, my question is why aren't resources directed to resolve those bottlenecks instead of piling on stuff for non-existent workloads?
When designing a GPU you are still bound by the rest of the system; I can't resolve a CPU bottleneck, or a system RAM bottleneck, or a system bus bottleneck, etc., etc. with GPU engine changes. And when looking at bottlenecks, it's always a shifting thing, even intra-frame.

However, when designing one element of the system you look at a whole range of stuff and come up with a configuration that buys the best bang for the buck across a whole range of apps - lots of stuff is profiled and the best overall configuration (bearing in mind future factors) is picked.

It would be nice to see an example of that. So far there aren't any outliers, nearly every single game falls in the same 45-55% band in terms of HD5870's advantage over RV770. You can either claim that workloads are not making use of the available horsepower (which then leads back to the above question) or you can claim that the available horsepower isn't being applied efficiently. Can't really have it both ways.
That's because games have so many different dependencies. Yet, you go and look at some of the stuff prunedtree is showing and, bang, you see nigh-on perfect scaling, because that's what his work is most bound by.
 
When designing a GPU you are still bound by the rest of the system; I can't resolve a CPU bottleneck, or a system RAM bottleneck, or a system bus bottleneck, etc., etc. with GPU engine changes. And when looking at bottlenecks, it's always a shifting thing, even intra-frame.

Yes Dave but I'm not buying the "system limited" argument because multi-GPU setups continue to scale higher. If we were system limited that would not be possible, therefore there is room to improve performance on the GPU side of things.

However, when designing one element of the system you look at a whole range of stuff and come up with a configuration that buys the best bang for the buck across a whole range of apps - lots of stuff is profiled and the best overall configuration (bearing in mind future factors) is picked.

I can understand that. Maybe the reason is that it's cheaper/easier to simply add more ALUs than to address other more troublesome bottlenecks (bandwidth for one). That's what I'm trying to get some insight into. Performance tuning of anything focuses on the slowest component. Now we keep hearing that games aren't ALU bound yet AMD keeps throwing more ALUs at them, that's the disconnect I'm trying to understand. Why not throw resources at the bottleneck instead?

That's because games have so many different dependencies. Yet, you go and look at some of the stuff prunedtree is showing and, bang, you see nigh-on perfect scaling, because that's what his work is most bound by.

Prunedtree had to squeeze his algorithm to perfectly map against the TMU and ALU structure of one specific GPU. I agree with you that it depends on the workload but his work is hardly a generally applicable example.
 
That's why I suggested you rummage - if you're interested, you'll want to read across a set of patent documents.

That is slim guidance.
It does not address the concern that I am ill-equipped to know which ones Nvidia decided to actually use, as the patents I have seen aren't so bald-faced as to say "this is used in GTxx products", though maybe I did not see those.

Other descriptions and articles by those who have contacts with Nvidia's engineering have described an architecture that can pick ready instructions from different threads and can allow out of order completion.
I do not have the connections to determine if a patent by Nvidia was used, and the patents I have seen today and what I recall from patent-fests in other threads don't explicitly outline the more involved Out of Order issue mechanism.
 
Yes Dave but I'm not buying the "system limited" argument because multi-GPU setups continue to scale higher. If we were system limited that would not be possible, therefore there is room to improve performance on the GPU side of things.

From the GPU perspective you have two of everything there, however you rarely see perfect scaling - why? Usually system limitations. Despite having more engine than an HD 5850, the 5970's average advantage at 2560x1600 is ~70%, so there is a lot lost to system dependencies.

I can understand that. Maybe the reason is that it's cheaper/easier to simply add more ALUs than to address other more troublesome bottlenecks (bandwidth for one). That's what I'm trying to get some insight into. Performance tuning of anything focuses on the slowest component. Now we keep hearing that games aren't ALU bound yet AMD keeps throwing more ALUs at them, that's the disconnect I'm trying to understand. Why not throw resources at the bottleneck instead?
You're making a bad presupposition that other areas weren't addressed - and it would be entirely false to think as such, as some of the work being shown in the GPGPU forum is already demonstrating.

Prunedtree had to squeeze his algorithm to perfectly map against the TMU and ALU structure of one specific GPU. I agree with you that it depends on the workload but his work is hardly a generally applicable example.
He squeezed each ounce of performance out of a particular ALU / TMU organisation (SIMD), yet when run on an architecture that scales up a similar organisation, it shows perfect scaling.
 
That is slim guidance.
It does not address the concern that I am ill-equipped to know which ones Nvidia decided to actually use, as the patents I have seen aren't so bald-faced as to say "this is used in GTxx products", though maybe I did not see those.

Other descriptions and articles by those who have contacts with Nvidia's engineering have described an architecture that can pick ready instructions from different threads and can allow out of order completion.
I do not have the connections to determine if a patent by Nvidia was used, and the patents I have seen today and what I recall from patent-fests in other threads don't explicitly outline the more involved Out of Order issue mechanism.
Well I can't help you then.

Regardless (of whether instruction issue is out of order), scoreboarding every instruction and scoreboarding every operand is considerably more expensive than the approach seen in ATI, where scoreboarding is at the hardware thread level tracking Control Flow instructions (rather than ALU instructions or TEX instructions), which are issued in order. Waterfalling constants, LDS writes/reads and indexed register writes (within an ALU clause) create hazards for ALU instruction issue - and in that case the ALUs stall (though LDS and indexed-register operations don't necessarily stall) - so that's a pipeline state. Not the finely-grained scoreboarding that NVidia indulges in.
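
To make the cost argument concrete, here is a purely illustrative back-of-envelope count (the thread and register figures are invented, not real chip parameters): tracking a pending-write bit per register per hardware thread grows with the register file, whereas clause-level tracking only needs a couple of flags per hardware thread.

```python
# Illustrative bookkeeping comparison only - the figures are invented to show
# how the two schemes scale, not taken from either vendor's hardware.
hw_threads = 512          # hypothetical hardware threads in flight
regs_per_thread = 64      # hypothetical register allocation per thread

# Fine-grained scheme: a pending-write flag per register per thread,
# consulted for every operand of every instruction at issue.
fine_grained_flags = hw_threads * regs_per_thread          # 32768

# Clause-level scheme: the sequencer only tracks whether each thread's
# outstanding clause (e.g. a TEX clause) has completed; hazards inside an
# ALU clause are handled by stalling the pipeline rather than by
# per-register bookkeeping.
flags_per_thread = 2
clause_level_flags = hw_threads * flags_per_thread         # 1024

print(f"{fine_grained_flags} vs {clause_level_flags} tracking flags")
```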

Jawed
 
Regardless (of whether instruction issue is out of order), scoreboarding every instruction and scoreboarding every operand is considerably more expensive than the approach seen in ATI, where scoreboarding is at the hardware thread level tracking Control Flow instructions (rather than ALU instructions or TEX instructions), which are issued in order. Waterfalling constants and indexed register writes (within an ALU clause) both create hazards for ALU instruction issue - and in that case the ALUs stall - so that's a pipeline state. Not the finely-grained scoreboarding that NVidia indulges in.
I agree that the clause-level tracking is significantly simpler and lower in transistor cost.

One consideration about the patent most recently posted is that I am not sure how necessary that part of the patent is.
The desire to compress down runs of sequential RAW hazards seems laudable, but the maximum number of registers addressable in CUDA does not exceed those of already existing and standardly scoreboarded designs.
 
10%-20% sounds doable to me too. Don't forget the 4870 wasn't bandwidth limited (so increasing RAM frequency didn't really help much), but the 5870 could benefit from faster RAM - and it should be possible to get 6 Gbps GDDR5 chips (even though those are overvolted parts) instead of the 5 Gbps parts used now.

Say ... 15% ... well within reach?

That ~= GTX 285 in SLI.

From data to date, Fermi surpassing that seems a bit fantasy islandy.
 
From the GPU perspective you have two of everything there, however you rarely see perfect scaling - why? Usually system limitations. Despite having more engine than an HD 5850, the 5970's average advantage at 2560x1600 is ~70%, so there is a lot lost to system dependencies.

Or it's lost on the inefficiency of scaling an API that wasn't designed to be run on multiple GPUs. But that's beside the point; the fact is that adding GPU power increases performance far above what a single GPU can do, which means there's a lot of room to improve a single-GPU setup without being hindered by the system.

You're making a bad presupposition that other areas weren't addressed - and it would be entirely false to think as such, as some of the work being shown in the GPGPU forum is already demonstrating.

I'm not making any presupposition. The numbers speak for themselves. Let me rephrase the question. Given what you know of graphics workloads and Cypress performance today do you think there's another configuration that would be faster? For example would fewer ALUs and a wider memory bus be a generally faster option? If no, why not?

He squeezed each ounce of performance out of a particular ALU / TMU organisation (SIMD), yet when run on an architecture that scales up a similar organisation, it shows perfect scaling.

Well there's no magic there for exactly that reason. A specific workload targeting a specific architecture scales with that architecture. That doesn't apply to games.
 
I'm not making any presupposition. The numbers speak for themselves. Let me rephrase the question. Given what you know of graphics workloads and Cypress performance today do you think there's another configuration that would be faster? For example would fewer ALUs and a wider memory bus be a generally faster option? If no, why not?
A wider bus would require additional chip perimeter for the pads, which would most likely have added to die size (at which point, why not add even more ALUs for all the free space).
Another option would be to remove other features that take up pad space, but those options would be very limited as memory pads are the dominant feature on the perimeter.
 
Games are fairly CPU bound (even ones that people often think of as GPU killers, like Crysis, are very CPU sensitive).

And even games that show decent resolution (GPU) scaling will be CPU bound in parts of the frame. Games are not single-render-call, FurMark-like workloads ;)
 
THE BOTTOM LINE in my mind is that if the A3 stepping is the one that goes to market, Nvidia now knows exactly how Fermi stacks up against the 5870. Every day hundreds of gamers are choosing NOT to wait any longer, sans concrete data on Fermi, and are opting for the 5800 series, which they do have concrete data on - and which makes them drool.

IF Fermi were bitch-slapping the 5870 into last week (a 30%+ gaming performance advantage), why wouldn't Nvidia be SAYING so in a concrete way? If Fermi COULD do so, why wouldn't they be crowing about it from the rooftops? I am unable to dredge up a single logical reason why they wouldn't be doing so. Any Fermi partisans out there got one?

I consider that the most compelling reason to doubt Fermi has a substantial performance advantage over the 5870, as Nvidia's silence is a reflection of their hard knowledge of how it stacks up.
 