NVIDIA GF100 & Friends speculation

KimB · Mar 24, 2010

fbomber666 said:
No. 8500 was released much earlier than geforce 4. In fact, Nvidia released the geforce 3 Ti500 to compete with the 8500.

Sorry, was going by memory. The 8500 was released 7 months after the GeForce3, and 6 months before the GeForce4. This makes it, I suppose, more similar in terms of execution to the current generation than any of the others. The only major difference, as near as I can tell, is that there wasn't, at the time, any strong reason to believe that the 8500 had been delayed by 6 months.

Sontin · Mar 24, 2010

jimbo75 said:
My personal belief is that nVidia stopped innovating on gaming GPUs the very second AMD bought ATI. A full-on rebranding scheme has got them into this position, while ATI went back to basics and delivered on time. I believe nVidia thought they had much more time than they did.

A shame that everybody is ignoring Tessellation.

Picao84 · Mar 24, 2010

Sontin said:
A shame that everybody is ignoring Tessellation.

Well, it's not really tessellation you mean, is it? Because it was "created" by ATI, AFAIK.
What I think you mean is the reorganisation of the GPU structure to better cope (in theory) with geometry tasks.

Mintmaster · Mar 24, 2010

Picao84 said:
If there were no diminishing returns for a given architecture, no one would ever change it

Yeah, they would. In the past, new features made old architectures obsolete. But that doesn't really apply anymore, as the shader model just undergoes tweaks with new DX revisions.

Regarding scaling, the only things that haven't scaled 2x with Evergreen are portions of the scene that are limited by geometry, bandwidth, or the CPU/PCIe. Anything that needs flops, ROPs, or texture sampling has scaled as expected.

We may see ATI move to a scalar architecture, as I've often advocated in the past, but it's not very difficult to do so while keeping the branching and issue rate the same per SIMD and thus most of the architecture intact.

rpg.314 · Mar 24, 2010

Chalnoth said:
And no, I strongly disagree with this claim. It might be valid in areas where Windows is popular, but most other operating systems have no trouble whatsoever running on other architectures.

Linux is fairly ISA agnostic, but if I may appeal to Top500, x86-64 has it locked down hard. All around me, I see x86 dominating other ISAs by orders of magnitude in no. of cores deployed. And no matter what, performance-porting apps across ISAs is hard. And please don't forget that they'll need all the drivers ported to new isa for all the hw they'll ever use. And sure as hell, that ain't an easy task.

Porting away from x86 is like expecting the entropy of an isolated system to decrease. Theoretically, yes (with a small probability); practically, well, you know... Just ask the designers of PA-RISC/Alpha/PowerPC/MIPS/68000/yadda-yadda...

I'm pretty sure that Linux is the OS of choice in most HPC environments.

Yup, Linux has a monopoly in HPC about as big as Windows has on desktop. Don't know why but it won't go away anytime soon.

Sontin · Mar 24, 2010

Picao84 said:
Well, it's not really tessellation you mean, is it? Because it was "created" by ATI, AFAIK.
What I think you mean is the reorganisation of the GPU structure to better cope (in theory) with geometry tasks.

I could call it "DX 11 tessellation", but yes.

It's ironic that cypress has this huge amount of flops but it's unable to use it with tessellation because of the implementation. And GF100 is so fast with tessellation that it is limited by the pixel calculations...

air_ii · Mar 24, 2010

Picao84 said:
Don't get me wrong, I'm not trying to spin anything. If anything, ATI does it by promoting FLOPS in their gaming PR, knowing it has no meaning for graphics

But I ask you: is it conceivable to imagine an HD5870 with fewer FLOPS and the same performance? If the answer is yes, OK. If the answer is no, then my first sentence stands: Cypress needs more FLOPS to perform.

And then, here we are saying GF100 is mainly directed at GPGPU, when with fewer FLOPS it does more graphically

I think you should look at architectural efficiency as a whole, not some randomly picked numbers (perf/$, perf/sq.mm). Even if nVidia's flops are more "efficient", if AMD can pack more in the same sized chip, which one is better (for the record, I'm not implying either is, just don't get what he's on about)?

air_ii · Mar 24, 2010

Sontin said:
A shame that everybody is ignoring Tessellation.

I think the entire conversation is pointless... It has been rumbled on several times now...

thatdude90210 · Mar 24, 2010

Sontin said:
I could call it "DX 11 tessellation", but yes.
It's ironic that cypress has this huge amount of flops but it's unable to use it with tessellation because of the implementation. And GF100 is so fast with tessellation that it is limited by the pixel calculations...

We don't know for sure that GF100 is some sort of tessellation monster compared to Cypress. If the info in this post is correct, then for all we know this 1.1 is simply optimized for their hardware on a different (other than tessellation) level. It could be that the GF framerates also take big hits in DX 11. Wait for some real benchmarks first.

Picao84 · Mar 24, 2010

Mintmaster said:
Yeah, they would. In the past, new features made old architectures obsolete. But that doesn't really apply anymore, as the shader model just undergoes tweaks with new DX revisions.

I stand corrected on my extreme statement :smile:
But you understood what I meant.

Regarding scaling, the only things that haven't scaled 2x with Evergreen are portions of the scene that are limited by geometry, bandwidth, or the CPU/PCIe. Anything that needs flops, ROPs, or texture sampling has scaled as expected.

So my question about whether the HD5870 has more FLOPS than it really needs may be right after all. Thanks.

And geometry, the thing nVIDIA (theoretically) worked on, seems the right move. Shame they had to spoil it with something else...

KimB · Mar 24, 2010

rpg.314 said:
Linux is fairly ISA agnostic, but if I may appeal to Top500, x86-64 has it locked down hard. All around me, I see x86 dominating other ISAs by orders of magnitude in no. of cores deployed. And no matter what, performance-porting apps across ISAs is hard. And please don't forget that they'll need all the drivers ported to new isa for all the hw they'll ever use. And sure as hell, that ain't an easy task.

As I said, x86 is quite popular right now. But x86 isn't nearly as "locked-in" in the HPC space as it is in the consumer space. Yes, performance-porting is challenging, but usually that's managed by porting a relatively small number of libraries, such as BLAS and LAPACK, where most of the compute time is spent. Such libraries are typically the only parts of HPC applications that are architecture-optimized anyway.

The #1 reason why x86 is popular in the HPC space right now is because AMD and Intel are leveraging the huge (by HPC standards) volume of consumer-space products to basically out-R&D their competitors.

Of course, if nVidia is going to get into the HPC space with a hybrid CPU-GPU part after Fermi, they're going to have the same uphill battle that all non-x86 manufacturers have. Their best bet would probably be to focus on producing effective CPU-GPU parts for markets like cell phones, and, if they can become successful there, use that to leverage larger HPC CPU-GPU parts.

rpg.314 said:
Yup, Linux has a monopoly in HPC about as big as Windows has on desktop. Don't know why but it won't go away anytime soon.

Because Linux is vastly superior for this space. Some of the biggest benefits off the top of my head are the ssh interface (which is essential for working remotely), the more-or-less standardized compiler and library setup (which is really important for porting between different machines), and the extremely powerful commandline (which is essential for saving time in executing large numbers of jobs).

rpg.314 · Mar 24, 2010

Mintmaster said:
We may see ATI move to a scalar architecture, as I've often advocated in the past, but it's not very difficult to do so while keeping the branching and issue rate the same per SIMD and thus most of the architecture intact.

Initially, I used to think like that as well, but now I am definitely on the fence about this one. vec4 totally borks the register allocation, but that could probably be worked around in the compiler. The overall efficiency of VLIW and the cheap (almost free) DP cost make me feel that they should stay the course. Even if you factor out the 5x ILP, AMD's ALUs appear to be more efficient than NV's.

Mintmaster · Mar 24, 2010

Sontin said:
I could call it "DX 11 tessellation", but yes.
It's ironic that cypress has this huge amount of flops but it's unable to use it with tessellation because of the implementation. And GF100 is so fast with tessellation that it is limited by the pixel calculations...

Tessellation doesn't need lots of flops. Even on GF100, I'm sure under 20% of its ALUs will be working on tessellation at any time. Four triangles set up per clock means it can't generate more than one vertex per hot clock. The ALUs have 1024 flops to offer in that time, and probably more, because even NVidia said something like 3.2 tri/clk is the fastest they can achieve.
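
The flops-per-vertex estimate above can be reproduced with a few lines of arithmetic. This is only a rough sketch: the ALU count (512), two flops per ALU per hot clock, a hot clock running at twice the triangle-setup clock, and roughly half a new vertex per triangle are assumptions chosen to match the 1024 figure, not numbers stated in the post.

```python
# Back-of-the-envelope check of the flops available per generated vertex.
# All four constants below are assumptions, not figures from the post.
ALUS = 512                       # assumed number of scalar ALUs
FLOPS_PER_ALU_PER_HOT_CLOCK = 2  # assumed: one multiply-add counts as 2 flops
HOT_CLOCKS_PER_SETUP_CLOCK = 2   # assumed: ALUs run at 2x the setup clock
VERTICES_PER_TRIANGLE = 0.5      # assumed: shared vertices in a dense mesh


def flops_per_vertex(tris_per_setup_clock: float) -> float:
    """Flops the ALUs can offer in the time it takes to generate one vertex."""
    flops_per_hot_clock = ALUS * FLOPS_PER_ALU_PER_HOT_CLOCK
    vertices_per_hot_clock = (
        tris_per_setup_clock * VERTICES_PER_TRIANGLE / HOT_CLOCKS_PER_SETUP_CLOCK
    )
    return flops_per_hot_clock / vertices_per_hot_clock


peak = flops_per_vertex(4.0)      # theoretical four triangles per clock
achieved = flops_per_vertex(3.2)  # the ~3.2 tri/clk rate quoted above

print(f"flops per vertex at 4.0 tri/clk: {peak:.0f}")
print(f"flops per vertex at 3.2 tri/clk: {achieved:.0f}")
```

Under these assumptions, a tessellation workload would need to spend on the order of 200 flops per vertex before it occupied 20% of the ALUs.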

rpg.314 · Mar 24, 2010

Chalnoth said:
As I said, x86 is quite popular right now. But x86 isn't nearly as "locked-in" in the HPC space as it is in the consumer space. Yes, performance-porting is challenging, but usually that's managed by porting a relatively small number of libraries, such as BLAS and LAPACK, where most of the compute time is spent. Such libraries are typically the only parts of HPC applications that are architecture-optimized anyway.

What about all the drivers for all the hw? They are heavily ISA dependent by design.

Of course, if nVidia is going to get into the HPC space with a hybrid CPU-GPU part after Fermi, they're going to have the same uphill battle that all non-x86 manufacturers have. Their best bet would probably be to focus on producing effective CPU-GPU parts for markets like cell phones, and, if they can become successful there, use that to leverage larger HPC CPU-GPU parts.

For that, they'll have to sell soc's with 2 arm A9's and ~30 fermi SM's to the mobile phone industry. :runaway:

KimB · Mar 24, 2010

rpg.314 said:
What about all the drivers for all the hw? They are heavily ISA dependent by design.

Fortunately in the HPC space you can build your machine for the task at hand, which means you can stick to a relatively small variety of hardware that has the drivers you want.

Mintmaster · Mar 24, 2010

rpg.314 said:
Initially, I used to think like that as well, but now I am definitely on the fence about this one. vec4 totally borks the register allocation, but that could probably be worked around in the compiler. The overall efficiency of VLIW and the cheap (almost free) DP cost make me feel that they should stay the course. Even if you factor out the 5x ILP, AMD's ALUs appear to be more efficient than NV's.

Basically, think of AMD retaining the same register structure and modifying the data flow so that it allows dependent data flow in each xyzw instruction group. Instead of a four-instruction group being executed on 16 pixels each clock, a single instruction is executed on 64 pixels each clock. Instead of flipping between two wavefronts to fill the ALU pipeline, it goes round-robin on eight wavefronts. All the issue rates stay the same, register granularity actually decreases, and most of the ALU stays intact.
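
A small sketch of the bookkeeping behind this scheme, using only the numbers in the paragraph above plus one assumption: a wavefront of 64 threads. The function and field names are hypothetical, made up for illustration.

```python
# Compare the two issue schemes described above.
WAVEFRONT_SIZE = 64  # assumption: threads (pixels) per wavefront


def describe(lanes: int, slots_per_lane: int, wavefronts_in_flight: int) -> dict:
    """Summarise issue rate and occupancy for one SIMD configuration."""
    # Clocks needed to push one wavefront through the SIMD once.
    clocks_per_wavefront = WAVEFRONT_SIZE // lanes
    return {
        # Scalar operations issued per clock across the whole SIMD.
        "ops_per_clock": lanes * slots_per_lane,
        # Clocks between two consecutive issues from the same wavefront.
        "reissue_interval": clocks_per_wavefront * wavefronts_in_flight,
        # Minimum threads needed to keep the ALU pipeline full.
        "threads_needed": WAVEFRONT_SIZE * wavefronts_in_flight,
    }


# Four-instruction group on 16 pixels per clock, two wavefronts interleaved.
current = describe(lanes=16, slots_per_lane=4, wavefronts_in_flight=2)
# One instruction on 64 pixels per clock, eight wavefronts round-robin.
proposed = describe(lanes=64, slots_per_lane=1, wavefronts_in_flight=8)

for name, cfg in (("current", current), ("proposed", proposed)):
    print(name, cfg)
```

Both configurations issue the same number of operations per clock and give each wavefront the same gap between issues; what changes is the number of threads needed to fill the pipeline.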

LordEC911 · Mar 24, 2010

Picao84 said:
So my question about whether the HD5870 has more FLOPS than it really needs may be right after all. Thanks.

What? I don't think you understand what you are asking...

Silus said:
NVIDIA struck gold with their G80 design and especially G92, which took ATI almost 3 years to catch up with. As for Fermi, well, new architectures tend to be very hard to get started (just look at what ATI had to deal with on R600), and this is just another example.

G80 = Nov '06
RV770 = June '08

Since RV770 "caught up", how do you get to three years? It is less than two years.
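
For anyone who wants to check the gap, here is a quick calculation from the two dates listed above. Only the month and year come from the post; the day of the month is a placeholder.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end, ignoring the day of month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


g80 = date(2006, 11, 1)   # Nov '06, day is a placeholder
rv770 = date(2008, 6, 1)  # June '08, day is a placeholder

gap = months_between(g80, rv770)
print(f"{gap} months, about {gap / 12:.1f} years")
```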

rpg.314 · Mar 24, 2010

Mintmaster said:
Basically, think of AMD retaining the same register structure and modifying the data flow so that it allows dependent data flow in each xyzw instruction group. Instead of a four-instruction group being executed on 16 pixels each clock, a single instruction is executed on 64 pixels each clock. Instead of flipping between two wavefronts to fill the ALU pipeline, it goes round-robin on eight wavefronts. All the issue rates stay the same, register granularity actually decreases, and most of the ALU stays intact.

IOW, quadruple the per clock throughput of each wavefront to issue it in one clock instead of 4, just like per warp throughput was doubled in fermi to one every 2 clocks. OTOH, eight wavefronts = 512 threads for minimal coverage of simd. That is quite high...

What you just described is your scheme. I want to know the rationale.

Archaeolept · Mar 24, 2010

fbomber666 said:
No. 8500 was released much earlier than geforce 4. In fact, Nvidia released the geforce 3 Ti500 to compete with the 8500.

Just a minor correction of this error: the GeForce 3 predated the 8500 by over 6 months. The 8500 was meant to be a GeForce 3 killer, but early driver issues meant it took a while to achieve its true potential. In the end, at very high AF, it was able to challenge the GF4 4200.

edit: meh, didn't notice chalnoth's post, and can't delete

Silus · Mar 24, 2010

seahawk said:
And? If you look at the results, ATI built a smaller GPU with more Flops and real hardware tessellation that performs equally to the huge monster NV made, and which they could not produce for 6 months after 580X0 came out.

Cypress does not need more Flops, it simply has more Flops.

Do you think that repeating that "real tessellation" thingy somehow makes it true?

It's incredible that at this stage, with architecture specs and all, some people are still hanging on to some FUD-spreading "articles"...
