NVIDIA Fermi: Architecture discussion

One consideration about the patent most recently posted: I'm not sure how necessary part of it is.
The desire to compress down runs of sequential RAW hazards seems laudable, but the maximum number of registers addressable in CUDA does not exceed that of already existing, conventionally scoreboarded designs.
I couldn't be bothered to read that closely, but I suspect what's going on there is that the compression is used to cater for horizontal or vertical register file addressing. i.e. registers can be allocated in either direction, depending on access pattern in instructions. One or the other layout then plays ball with compression.

Oh and the main instruction issue patent document, since I stumbled upon it:

http://v3.espacenet.com/publication...7214343A1&KC=A1&FT=D&date=20070913&DB=&locale=

The title uses "out of order" to refer to hardware threads, rather than intra-thread instruction ordering.

Compilation can be used to re-order instructions for minimal hazards per issue clock:
[0043] The number of threads in a given core may also be varied according to the particular implementation and the amount of latency that is to be hidden. In this connection, it should be noted that in some embodiments, instruction ordering can also be used to hide some latency. For instance, as is known in the art, compilers for graphics processor code can be optimized to arrange the instructions of the program such that if there is a first instruction that creates data and a second instruction that consumes the data, one or more other instructions that do not consume the data created by the first instruction are placed between the first and second instructions. This allows processing of a thread to continue while the first instruction is executing. It is also known in the art that, for instructions with long latencies, it is usually not practical to place enough independent instructions between creator and consumer to fully hide the latency. In determining the number of threads per core, consideration may be given to the availability (or lack thereof) of such optimizations; e.g., the number of threads supported by a core may be decided based on the maximum latency of any instruction and the average (or minimum or maximum) number of instructions that a particular compiler can be expected to provide between a maximum-latency instruction and its first dependent instruction.
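Roughly, the arrangement [0043] describes looks like this - a made-up CUDA fragment of my own, not anything lifted from the patent, with invented names: the long-latency load (producer) is started early and independent arithmetic is slotted between it and its first consumer so the thread keeps issuing while the load is in flight.

Code:
__global__ void shade(const float *tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float t = tex[i];        // long-latency producer (global memory load)

    // independent work that does not consume 't' - the compiler (or the
    // programmer) can place it here to cover part of the load latency
    float a = (float)i * 0.5f;
    float b = a * a + 1.0f;

    out[i] = t * b;          // first consumer of 't'
}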
But, regardless, because instruction issue is keyed upon register-dependency:

[0076] Buffer 510 is advantageously configured to store collected operands together with their instructions while other operands for the instruction are being collected. In some embodiments, issuer 506 is configured to issue instructions to execution units 142 as soon as their operands have been collected. Issuer 506 is not required to issue instructions in the order in which they were dispatched. For example, instructions in buffer 510 may be stored in a sequence corresponding to the order in which they were dispatched, and at each clock cycle issuer 506 may select the oldest instruction that has its operands by stepping through the sequence (starting with the least-recently dispatched instruction) until an instruction that has all of its operands is found. This instruction is issued, and instructions behind it in the sequence are shifted forward; newly dispatched instructions are added at the end of the sequence. The sequence may be maintained, e.g., by an ordered set of physical storage locations in buffer 510, with instructions being shifted to different locations as preceding instructions are removed.

[0077] In one embodiment, an instruction that has been dispatched to issuer 506 remains in buffer 138 until it has been issued to execution module 142. After dispatch, the instruction is advantageously maintained in a valid but not ready state (e.g., the valid bit 210 for a dispatched instruction may remain in the logical true state until the instruction is issued). It will be appreciated that in embodiments where issuer 506 may issue instructions out of the dispatch order, this configuration can help to prevent multiple instructions from the same thread from being concurrently present in buffer 510, thereby preserving order of instructions within a thread.
intra-thread instructions can issue out of order. With the proviso that the document isn't a 100% guarantee of what's inside a GPU...
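For what it's worth, the "pick the oldest instruction whose operands have all been collected" behaviour in [0076] boils down to something like this toy host-side model (my own sketch, obviously not how the hardware is actually built; the struct and function names are invented):

Code:
#include <vector>

struct BufferedInstr {
    int thread_id;          // hardware thread the instruction belongs to
    bool operands_ready;    // all operands collected in buffer 510
};

// Return the index of the least-recently dispatched instruction whose
// operands are all ready, or -1 if nothing can issue this cycle.
// 'buf' is ordered oldest-first, as in the patent's description.
int select_for_issue(const std::vector<BufferedInstr> &buf)
{
    for (size_t i = 0; i < buf.size(); ++i)
        if (buf[i].operands_ready)
            return (int)i;
    return -1;
}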

Actually it's possible to read that as merely stating that the hardware-thread ordering isn't necessarily maintained by issuer 506.

Anyway we agree, whichever way we take it, the fine-grained register-dependency and operand-readiness scoreboarding is relatively costly.

Jawed
 
Yes Dave, but I'm not buying the "system limited" argument, because multi-GPU setups continue to scale higher. If we were system limited, that would not be possible; therefore there is room to improve performance on the GPU side of things.
In fact, that's not totally true.

Cypress shows poor scaling compared to Juniper CF (i.e. AFR), but at the same time playability is more in line with Cypress than with CF, which points to a system limitation.

Try recording frame times on some scene, and you'll probably see huge variations with CF, while the single board will only show one big variation. From my numbers on Heaven, Juniper XT managed to render each frame in between 5 and 200 ms, which is what I call big... imagine what it will give with AFR, since each GPU has to wait for the other one to complete to "finalize" its frame... that will probably give something like 0-200 ms, so perceivable stuttering while the average framerate is almost doubled.

Now, if you take the exact same benchmark with one Cypress, you'll end up with 5-100 ms render time per frame, so the average framerate is almost equal to Juniper CF, but stuttering is much less visible, although it's still there.
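To make the "average hides the stutter" point concrete, here's the sort of throwaway check I mean - the frame-time traces are invented, just loosely shaped like the 5-100 ms and 5-200 ms ranges above:

Code:
#include <algorithm>
#include <cstdio>
#include <vector>

// Average framerate and worst-case frame time for a recorded trace.
// Two traces can be close in average FPS while one stutters far worse.
static void report(const char *name, const std::vector<float> &ms)
{
    float total = 0.0f, worst = 0.0f;
    for (float t : ms) { total += t; worst = std::max(worst, t); }
    std::printf("%s: avg %.1f fps, worst frame %.1f ms\n",
                name, 1000.0f * ms.size() / total, worst);
}

int main()
{
    // hypothetical traces: the AFR trace averages slightly more fps,
    // but its worst-case frame is twice as long
    std::vector<float> single = {25, 30, 28, 32, 100, 27, 26, 30};
    std::vector<float> afr    = {5, 10, 5, 15, 200, 5, 10, 5};
    report("single GPU", single);
    report("AFR pair  ", afr);
    return 0;
}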

A simple 2D example of this situation is a scrolling pattern: if you scroll 1 pixel every 1/60th of a second it's perfect, and if you scroll 2 pixels it's barely acceptable, but if you alternate between 1 and 3 pixels it'll be ugly. Unfortunately, many engines seem to render in a way that causes such a pattern, even with only one GPU (perhaps some shader data is only updated on 50% of the frames or even less), and adding one or more GPUs won't get you anywhere, as the slowest-rendering frame will always produce a stuttering effect even if scaling "seems" to be perfect.

Anyway, the Cypress driver doesn't seem to be mature enough to conclude anything; I ran the old X3-Reunion benchmark and found very disappointing numbers, almost twice as slow as my old RV770 on the exact same PC.

So, while this is quite interesting to investigate, what remains to be seen is how Fermi will manage to handle DX11 rendering compared to Cypress, with a (very probably) slower rasterizer, about half the raw shading power, and its unified L1/shared memory. Note that RAM bandwidth doesn't seem to be a bottleneck for Evergreen GPUs, by the way: doubling it only marginally improves performance, be it on Juniper or Cypress.

I'd be very happy to find a tool to deactivate SIMD blocks in Cypress and Juniper to investigate further, as that would allow testing a GPU with 10 SIMD blocks and 32 ROPs, for example. Talking about SIMDs, I'm still not sure whether NV will go with 14 clusters for the "360" and 16 for the "380"; it could very well be 12/14, with half-year refreshes having 14/16 after more tweaking.
 
Or it's lost on the inefficiency of scaling an API that wasn't designed to be run on multiple GPUs. But that's beside the point; the fact is that adding GPU power increases performance far above what a single GPU can do, which means there's a lot of room to improve the single-GPU setup without being hindered by the system.
Well, the obvious one is triangle setup rate. Making a GPU set up two triangles per clock is a significant architectural change.

Also, adding a second GPU (CrossFire) doesn't necessarily help frame-rate minima, so it's not much of a win and mostly invalidates such comparisons.

Jawed
 
I couldn't be bothered to read that closely, but I suspect what's going on there is that the compression is used to cater for horizontal or vertical register file addressing. i.e. registers can be allocated in either direction, depending on access pattern in instructions. One or the other layout then plays ball with compression.
It occurred to me that it might be because of the sheer number of scoreboard contexts that must be maintained per thread: an implementation's scoreboard hardware could be underspecified for the maximum number of scoreboarded threads each carrying the maximum number of non-contiguous register hazards.
Maybe a full complement of threads whose instructions write to every other register can force a scoreboard stall.

Actually it's possible to read that as merely stating that the hardware-thread ordering isn't necessarily maintained by issuer 506.
Yes, it is somewhat ambiguous.
The later part of paragraph [0077] would indicate that, at least for some embodiments, there is a more explicit attempt to make sure intra-thread ordering is respected.
 
The later part of paragraph [0077] would indicate that, at least for some embodiments, there is a more explicit attempt to make sure intra-thread ordering is respected.
I think that's to reduce consumption of scoreboard entries - if intra-thread re-ordering is done, then minimising that re-ordering will constrain the number of scoreboard entries consumed by the thread.

Jawed
 
THE BOTTOM LINE in my mind is that if the A3 stepping is the one that goes to market, Nvidia now knows exactly how Fermi stacks up against the 5870. Every day, hundreds of gamers are choosing NOT to wait any longer, sans concrete data on Fermi, and are opting for the 5800 series, which they do have concrete data on - and which makes them drool.

IF Fermi was bitch slapping the 5870 into last week (30%+ gaming performance advantage), why wouldn't Nvidia be SAYING so in a concrete way? If Fermi COULD do so, why wouldn't they be crowing about it from the rooftops? I am unable to dredge up a single logical reason why they wouldn't be doing so. Any Fermi partisans out there got one?

I consider that the most compelling reason to doubt Fermi has a substantial performance advantage over the 5870, since their silence reflects hard knowledge of how it performs.

How about the "Damn that card is freakin fast" factor? Nvidia was pretty quiet about G80 beforehand, save for leaks once they started sending out cards to AIBs 2-4 weeks before launch. We could very well be looking at the same thing all over again. And to date, they have been trumpeting the GPGPU side of Fermi, not the gaming side. Two entirely different uses.
 
I agree; if I was NV and Fermi is fast as f***, the best thing would be to shut up about it, let the world think it's gonna suck, then wammo, shock and awe the world into submission (it would create a huge buzz).
 
How about the "Damn that card is freakin fast" factor? Nvidia was pretty quiet about G80 beforehand, save for leaks once they started sending out cards to AIBs 2-4 weeks before launch. We could very well be looking at the same thing all over again. And to date, they have been trumpeting the GPGPU side of Fermi, not the gaming side. Two entirely different uses.

Nvidia wasn't being killed by the competition when they released G80...
 
I agree; if I was NV and Fermi is fast as f***, the best thing would be to shut up about it, let the world think it's gonna suck, then wammo, shock and awe the world into submission (it would create a huge buzz).

I disagree. Every day they don't have a product on the market is money lost. Why not try to prevent potential customers from purchasing a competing product by telling them how awesome your product is? Marketing 101.
 
I agree; if I was NV and Fermi is fast as f***, the best thing would be to shut up about it, let the world think it's gonna suck, then wammo, shock and awe the world into submission (it would create a huge buzz).

Nvidia have already been running spoilers against AMD's DX11 cards. If they could crow about better Fermi performance right now, then that is what they would be doing. For a card they've been claiming will be out in the next month, they must know what they've got. Even if Fermi was three months away, Nvidia should know what they've got - if they are stockpiling for launch. If Nvidia are not stockpiling and don't know performance, then they will be late and/or in severe shortage during their supposed launch period.

I suspect that, with its problems, Fermi's marketing will concentrate on GPGPU, PhysX, CUDA, etc. if it doesn't meet performance expectations. The problem is that if Fermi doesn't manage to hit its performance targets, Nvidia will be forced to sell it more cheaply than they would like against a competitor that already has a die-size advantage.

If AMD drops their price, then Fermi can't just match 5870/5890 performance and still expect to get a higher price - Nvidia will have to drop prices or be substantially faster/better to get a higher price.
 
THE BOTTOM LINE in my mind is that if the A3 stepping is the one that goes to market, Nvidia now knows exactly how Fermi stacks up against the 5870. Every day, hundreds of gamers are choosing NOT to wait any longer, sans concrete data on Fermi, and are opting for the 5800 series, which they do have concrete data on - and which makes them drool.

IF Fermi was bitch slapping the 5870 into last week (30%+ gaming performance advantage), why wouldn't Nvidia be SAYING so in a concrete way? If Fermi COULD do so, why wouldn't they be crowing about it from the rooftops? I am unable to dredge up a single logical reason why they wouldn't be doing so. Any Fermi partisans out there got one?

I consider that the most compelling reason to doubt Fermi has a substantial performance advantage over the 5870, since their silence reflects hard knowledge of how it performs.


They already kinda stated it when they said Fermi is going to have the fastest chip in every segment :D. If you think that doesn't mean performance, well, I don't know what else to say! For concrete numbers, wait till CES; it's just about a week away.

And marketing 101: it's always better to show than to just crow about something. Remember the fake Fermi board.... that kinda backfired; showing leaves no doubts.
 
If, as Jawed points out, it's CPU limited, then the benchmarks are going to have limited use, as each of the architectures you are comparing is going to be limited by the CPU. Games are fairly CPU bound (even ones that people often think of as GPU killers, like Crysis, are very CPU sensitive).

I'm really curious: from your perspective, what is the ideal mixture of CPU and GPU hardware to extract the best performance per $ from AMD hardware? I know this is slightly off topic, but it's also relevant in a grander sense.

So say you've got an HD 5850; what's the AMD CPU to go with that? Is it the Phenom 945?
 
As for the general sense that the performance improvement is unnecessary, I don't believe that can be considered the case. The ideal frame-rate for a mouse-driven interface is roughly 60 frames per second to get a truly responsive game. I don't believe we quite have that across the board on the 24" 1920x1080 or 1920x1200 monitors we have. Some people may like to turn the eye candy up, which kills frame-rate, but I suspect most here would like their cake too, and having 60 FPS is the best of both worlds.
 
They already kinda stated it when they said Fermi is going to have the fastest chip in every segment :D. If you think that doesn't mean performance, well, I don't know what else to say! For concrete numbers, wait till CES; it's just about a week away.

Depends how you define "fastest" and "segment". Fermi could be +30 percent faster than the 5890 in the high end, but if it comes in at $5000, it isn't going to sell. It could be +1 percent faster, in which case a price cut from AMD still makes it a hard choice. It might be "fastest" at DP GPGPU calculations or ECC support - something that does not benefit the mainstream. I'm sure Nvidia PR have their weasel words ready.

And marketing 101: it's always better to show than to just crow about something. Remember the fake Fermi board.... that kinda backfired; showing leaves no doubts.

But marketing 101 tells us it's better to show... but when you've got nothing to show, you crow about what you are going to show when you've got it. You don't just let the competition eat your lunch when they've got a newer, better product selling hand over fist, and you're scrambling around trying to get your late one out the door.

Unless of course, you can't show, and you can't say anything good about your upcoming product - in which case you are right, it's better to keep quiet on the specifics and hope you can BS your way through the launch of an under-performing product with the use of DP, HPC, PhysX, etc.
 
Depends how you define "fastest" and "segment". Fermi could be +30 percent faster than the 5890 in the high end, but if it comes in at $5000, it isn't going to sell. It could be +1 percent faster, in which case a price cut from AMD still makes it a hard choice. It might be "fastest" at DP GPGPU calculations or ECC support - something that does not benefit the mainstream. I'm sure Nvidia PR have their weasel words ready.


That wasn't PR that stated that, it was a product manager :rolleyes:. And it's a very fast card; you wouldn't see them stating that if it wasn't, they would go the route of value for the money or something else. That's marketing and PR for ya. There is a fine line between a lie and spin. If that statement is false, they are just lying.

But marketing 101 tells us it's better to show... but when you've got nothing to show, you crow about what you are going to show when you've got it. You don't just let the competition eat your lunch when they've got a newer, better product selling hand over fist, and you're scrambling around trying to get your late one out the door.

Yes, so you can't wait a week and a half?

Unless of course, you can't show, and you can't say anything good about your upcoming product - in which case you are right, it's better to keep quiet on the specifics and hope you can BS your way through the launch of an under-performing product with the use of DP, HPC, PhysX, etc.

Oh they were talking in the context of gaming ;)
 
I disagree. Every day they don't have a product on the market is money lost. Why not try to prevent potential customers from purchasing a competing product by telling them how awesome your product is? Marketing 101.

Added to that, AMD is soon releasing a dozen Cedar and Redwood boards, which will rapidly take over and own the value segment of the market for the foreseeable future. The top end, small as it is, is the only segment Nvidia will even have a card to compete in, and without a compelling reason (read: substantial performance advantage) to do otherwise, by March another several hundred thousand of those potential (and highest-margin) customers will have spurned Fermi and turned to AMD for their GPU fix.
 
They already kinda stated it when they said Fermi is going to have the fastest chip in every segment :D. If you think that doesn't mean performance, well, I don't know what else to say! For concrete numbers, wait till CES; it's just about a week away.

Except they kinda DIDN'T say Fermi IS going to be the fastest chip in every segment; they said they EXPECT it to be the fastest chip in every segment.

There's a world of difference in that word choice.
 