NVIDIA Fermi: Architecture discussion

Or how much is due to the capacity of other internal blocks that aren't revealed by fillrate, flops or bandwidth numbers?
True, things like setup rate, or, apparently, that ATI GPUs are limited to 512 hardware threads, no matter how many SIMDs there are. (If the latter isn't a clue that R900 needs a serious revision, I don't know what is :p )

Anyway, this means all the bitching over HD5870 scaling was misdirected. It'll replay when GF100 appears for exactly the same reasons. But GF100 should have an advantage if the transition is a bit like HD3870->HD4870, where performance per unit or performance per mm² was seriously tackled. Also, as I said before, D3D11 games change the nature of the baseline.

Jawed
 
Neither am I. I'm not sold on it being much faster than HD5870 even. But you can either go on reasonable assumptions or you can do like neliz and assume the worst possible outcome in all aspects - timing, performance, power consumption etc.

I've always said GT300 will be faster than RV870 ;) But that might not even matter when it finally hits the shops.
 
http://www.brightsideofnews.com/new...mi-is-less-powerful-than-geforce-gtx-285.aspx

The only bit that's interesting, I think:

Update #2, November 18 2009 02:17AM GMT - Following our article, we were contacted by Mr. Andy Keane, General Manager of Tesla Business, and Mr. Andrew Humber, Senior PR Manager for Tesla products. In a long discussion, we went over the topics in this article and the Tesla business in general. First and foremost, Tesla is the slowest-clocked member of the Fermi GPU architecture, as it has to qualify for supercomputers. Winning an HPC contract is a more complex process than it is with CPUs.

Bear in mind that Intel was thrown out of Oak Ridge with its otherwise superior Core 2 architecture after Woodcrest had a reject rate higher than 8% [the results of that Opteron vs. Xeon trial in 2006 are visible today, as Oak Ridge is the site of the AMD-powered Jaguar, the world's most powerful supercomputer]. In order to survive the required multi-year operation under 100% stress, the nVidia Tesla C2050/C2070 went through the following changes compared to the Quadro card [which, again, will be downclocked from the consumer cards]:
  • Memory vendor is providing specific ECC version of GDDR5 memory: ECC GDDR5 SDRAM
  • ECC is enabled on both the GPU side and the memory side; there are significant performance penalties, hence the GFLOPS number is significantly lower than on Quadro / GeForce cards.
  • ECC will be disabled on GeForce cards and most likely on Quadro cards
  • The capacitors used are of highest quality
  • Power regulation is completely different and optimized for usage in rack systems - you can use either a single 8-pin connector or dual 6-pin connectors
  • Multiple fault-protection
  • DVI was brought in on demand from the customers to reduce costs
  • Larger thermal exhaust than Quadro/GeForce to reduce the thermal load
  • Tesla cGPUs differ from GeForce in having transistors activated that significantly increase sustained performance, rather than burst-mode performance.
Since Andy is the General Manager for the Tesla business, he has no contact with the Quadro or GeForce business units, and was thus unable to say what the developments are in those segments.
 
Anyway, I've thought of a work-around for HD5870 scaling/bandwidth-sensitivity questions that makes HD4890 irrelevant. It's bloody simple, compare it with HD5770!
[truncated link to a "Performancerating - Qualität" chart]

Here you can see that, as the rendering workload increases, the performance margin for the HD5870 goes from 71% up to 82% and then falls back to 81%, while the theoretical margin is 100% on every single parameter: texture rate, fillrate, GB/s and FLOPS.
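(For reference, that 100% figure is just the spec sheet, going by the publicly listed numbers: the HD5870 has exactly twice the HD5770's ALUs (1600 vs 800), TMUs (80 vs 40), ROPs (32 vs 16) and memory bus width (256-bit vs 128-bit) at the same 850MHz core and 1.2GHz GDDR5 clocks, so 2.72 vs 1.36 TFLOPS and 153.6 vs 76.8 GB/s.)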
5870 is not double in triangle throughput.
 
That's precisely my point Dave. HD5970 is only 20% faster than Nvidia's old stuff and the advantage is much lower than that in many cases. It all depends on GF100 of course but the current numbers indicate that we won't see a repeat of this generation where GTX295 was later than HD4870X2 and not much faster. It looks like GF100x2 will be much later but could be much faster as well.

GF100x2 will surely need a new process. And given that NV is already struggling with 40nm, I don't think they will be able to move to 32nm or 28nm anytime soon... Remember that they never introduce a new process on a high-end card first, so they would need to launch another card before that...
I mean, it could take a year or so to see that kind of monster; maybe it won't even end up competing with Hemlock, but with the refresh of RV870...

In graphics mode they'll get it for free unless they ask for it to be switched off. It'll be the default. It should be clear from the article that running the old MUL + ADD is two clocks, sorry if it's not.

Is that "for free" that sounds strange to me...
But what will be the flow?
In a real game situation, what would do Fermi if it encounters a series of shaders with some MADD?
Is there anyone patient enough to explain to me the whole "workflow"? :smile:
 
In graphics mode they'll get it for free unless they ask for it to be switched off. It'll be the default. It should be clear from the article that running the old MUL + ADD is two clocks, sorry if it's not.

That's not what I meant. I was thinking more along the lines of persuading devs to use FMA instead of MADD, thus taking 20% off of Cypress' peak throughput while maintaining one's own peak rate. That way, you could give your arch an edge, or start writing papers on how the other guys silently switch FMAs back to MAD, thus lowering "the quality".

As I said - 2010 will be fun.
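For reference, the 20% figure follows straight from the VLIW5 layout, assuming (as claimed later in the thread for RV870) that only the four xyzw lanes can do FMA while the fifth t-unit cannot:

4 FMA-capable lanes / 5 MAD-capable lanes = 0.8, i.e. peak FMA throughput would sit at 80% of the MAD peak, while Fermi would keep issuing FMAs at the same rate as it issues MADs.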
 
GF100x2 will need for sure a new process.

Not sure how you can make that call now. That sounds just like Charlie and his proclamation that Nvidia couldn't make the 295.

In a real game situation, what would Fermi do if it encounters a series of shaders with some MADDs in them?

You're thinking about it all wrong :) Fermi can never "encounter a MADD". The instructions sent to the GPU are based on the instruction set of the hardware. So a MADD hitting the compiler will be submitted to Fermi as an FMA instruction.
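For what it's worth, a toy C++ sketch of that idea; every name here is invented for illustration and corresponds to no real driver API, but it shows the shape of it: a source-level MAD is simply selected as a single FMA on FMA-only hardware, whereas forcing the old MUL + ADD path costs two issues (the "two clocks" mentioned above).

// Toy sketch, hypothetical names only: how a driver's shader compiler could
// lower an IR-level MAD when the target hardware only implements FMA.
#include <cstdio>

enum class IrOp { ADD, MUL, MAD };
enum class HwOp { ADD, MUL, FMA, MUL_THEN_ADD };

// Fermi-like target: there is no MAD in the hardware ISA, so a MAD in the
// source is emitted as a single FMA - one instruction, one issue slot.
HwOp lowerForFmaHardware(IrOp op) {
    switch (op) {
        case IrOp::MAD: return HwOp::FMA;
        case IrOp::MUL: return HwOp::MUL;
        case IrOp::ADD: return HwOp::ADD;
    }
    return HwOp::ADD; // unreachable
}

// Forcing the "old" behaviour instead means two instructions, which is the
// "MUL + ADD is two clocks" case mentioned above.
HwOp lowerAsMulPlusAdd(IrOp op) {
    return (op == IrOp::MAD) ? HwOp::MUL_THEN_ADD : lowerForFmaHardware(op);
}

int main() {
    std::printf("MAD -> FMA? %s\n",
                lowerForFmaHardware(IrOp::MAD) == HwOp::FMA ? "yes" : "no");
}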
 
That's not what I meant. I was thinking more along the lines of persuading devs to use FMA instead of MADD, thus taking 20% off of Cypress' peak throughput while maintaining one's own peak rate. That way, you could give your arch an edge, or start writing papers on how the other guys silently switch FMAs back to MAD, thus lowering "the quality".

As I said - 2010 will be fun.

I don't think it would be a smart move.
It would hit Cypress (do you have a link that says that FMA in RV870 is done only by 4 units out of 5?), but it would hit (much more) all of NVidia's lower-end cards. So a GTX285 or a GTX295, even though they will remain good cards for quite a while yet, would basically stop functioning with a game that has only FMA ops.
The move from MADD to FMA will probably happen, but not in 2010, I think.
 
Not sure how you can make that call now. That sounds just like Charlie and his proclamation that Nvidia couldn't make the 295.

In fact NVidia couldn't make a 295 on 65nm. :LOL:
That's what I'm saying. The GTX280 had a max power draw similar to that of the GTX380... more than 225W, that is.
So why couldn't they create a GTX295 with two GPUs with that kind of power draw back then, while now they can? ;)

You're thinking about it all wrong :) Fermi can never "encounter a MADD". The instructions sent to the GPU are based on the instruction set of the hardware. So a MADD hitting the compiler will be submitted to Fermi as an FMA instruction.

OK.
But I don't get why submitting a MADD as an FMA is for free.
If you have an operation like this in a shader, what would Fermi do?

MAD tTC0.x, R0.x, R2.w, 0.5

It would replace MAD with FMA and then submit it to the CUDA cores, right?

And this happens at no cost, even though it's one more logical step for the compiler to do?

Read->Send
Read->Change->Send

As a disclaimer, I am not a programmer or an engineer; I'm just trying to understand the thing from a logical standpoint... :)
 
I don't think it would be a smart move.
It would hit Cypress (do you have a link that says that FMA in RV870 is done only by 4 units out of 5?), but it would hit (much more) all of NVidia's lower-end cards. So a GTX285 or a GTX295, even though they will remain good cards for quite a while yet, would basically stop functioning with a game that has only FMA ops.
The move from MADD to FMA will probably happen, but not in 2010, I think.

WRT 4-lane FMA: AMD says so itself in
http://www.pcgameshardware.de/aid,6...i/Grafikkarte/News/bildergalerie/?iid=1207291

WRT Nvidia's lower-end cards such as the GTX 285 and 295 (which itself deserves a separate "lol", doesn't it?): you could tie FMA to DX11.
 
But I don't get why submitting a MADD as an FMA is for free.
If you have an operation like this in a shader, what would Fermi do?

MAD tTC0.x, R0.x, R2.w, 0.5

It would replace MAD with FMA and then submit it to the CUDA cores, right?

And this happens at no cost, even though it's one more logical step for the compiler to do?

Read->Send
Read->Change->Send

As a disclaimer, I am not a programmer or an engineer; I'm just trying to understand the thing from a logical standpoint... :)
You (the driver) can change the code however you (it) want(s) at compilation time.

When applications use shaders, they don't send down the whole shader, just a handle. The driver then binds the HW ISA version of that shader to the HW.
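A minimal sketch of that flow in OpenGL terms, purely for illustration (any API with compiled shader objects behaves the same way; context/loader setup is elided, so assume a GL 2.0+ loader such as GLEW):

#include <GL/glew.h>  // any loader exposing GL 2.0+ entry points

// Load time: the shader text goes to the driver exactly once, and the driver
// compiles it down to its own HW ISA here - so any MAD -> FMA rewriting is
// already finished long before the first frame is drawn.
GLuint buildProgram(const char* fragmentSrc) {
    GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(fs, 1, &fragmentSrc, nullptr);
    glCompileShader(fs);

    GLuint program = glCreateProgram();
    glAttachShader(program, fs);
    glLinkProgram(program);
    return program;               // just an integer handle to the compiled blob
}

// Draw time: no shader text is sent at all; the driver binds the already-built
// HW ISA version associated with this handle.
void drawWith(GLuint program) {
    glUseProgram(program);
    glDrawArrays(GL_TRIANGLES, 0, 3);
}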
 
In fact NVidia couldn't make a 295 on 65nm. :LOL:

Actually, he claimed they couldn't make one on 55nm either; then, when it was apparent it was being done, he claimed it would be made in extremely low "halo" quantities; and then, when he realized he was wrong, he shut up about it :)

That's what I'm saying. The GTX280 had a max power draw similar to that of the GTX380... more than 225W, that is.
So why couldn't they create a GTX295 with two GPUs with that kind of power draw back then, while now they can? ;)

You're right...if GTX 380 is 225W then it's not likely there will be a dual-GPU part based on that configuration.
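Spelling out the arithmetic, assuming the usual PCIe board power budget of the time (75W from the slot, 75W per 6-pin, 150W per 8-pin, 300W ceiling for a compliant card): 2 x 225W = 450W, well past the 300W cap, so a dual-GPU card built from such chips would need the same kind of heavy downclocking and binning that the GTX295 needed at 55nm.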

But I don't get why submitting a MADD as an FMA is for free.

Because the compilation doesn't happen on the fly. It's done long before the shader is invoked (usually while the game or level is loading).
 