NVIDIA Fermi: Architecture discussion

XDR2 needs to be over twice as fast to give an advantage in peak bandwidth per data-pin.

Care to elaborate? I feel like I'm missing something... (is it due to micro-threading?)

Edit: I thought GDDR5 was using fully differential signalling for data too.
 
True, but Rambus would allow AMD and NVIDIA to ease the transition to such a radically different (for them) system, which would otherwise be painful and need a lot of resources.

What is it about Rambus IP that Intel's money cannot buy? LRB will debut directly with GDDR5. And since LRB is a more bandwidth-efficient architecture, it'll beat AMD and NV GPUs any day on the bandwidth-efficiency metric. And Intel has the pockets to make sure they won't be competing with inferior memory technology any time soon.

As GPUs generalize, they'll have to give up the luxury of present-day inefficiencies to compete with more efficient chips. And yes, eDRAM/TBDRs seem to be the only solution right now.
 
I just have to bring this up: what about using eDRAM? Sure, putting full depth buffers in there isn't really feasible, but Z buffers are compressed nowadays, so if you put, for instance, only the parts which are fully compressed (8x ratio?) in there, you'd "only" need 16MB for 2560x1600 with 8xAA, which doesn't look unreasonable. You'd still need high-bandwidth memory (to fetch other parts of the Z buffer, color buffers, textures etc.), but surely this should help.
Z isn't compressed. It is fully allocated and static in size for a given render target size.

The data format for Z is optimised to minimise the number of locations that are accessed, e.g. a 4x4 tile of pixels where no triangle edge falls can have its Z written in a single burst of only 16 samples, leaving the remaining 48 samples (4xMSAA) unwritten. On-chip tile tag tables are used to keep track of the status of tiles.
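
Conceptually, something like this toy model (all names, sizes and statuses invented for illustration; not any actual hardware):

Code:
// Toy sketch of an on-chip tile tag table (all names/sizes invented).
// A 4x4-pixel tile at 4xMSAA holds 64 Z samples; if no edge falls in
// the tile, only 16 samples are written and the other 48 stay untouched.
#include <cstdint>
#include <vector>

enum class TileStatus : uint8_t {
    Cleared,       // nothing written since clear
    AnchorOnly,    // 16 of 64 samples written (no edge in the tile)
    FullyWritten   // all 64 samples written (an edge crossed the tile)
};

struct TileTagTable {
    uint32_t tilesWide;
    std::vector<TileStatus> tags;  // one tag per 4x4 tile

    TileStatus& at(uint32_t tx, uint32_t ty) {
        return tags[ty * tilesWide + tx];
    }

    // Bytes the memory controller actually touches for this tile,
    // assuming 4 bytes per Z sample.
    uint32_t bytesAccessed(uint32_t tx, uint32_t ty) {
        switch (at(tx, ty)) {
            case TileStatus::Cleared:      return 0;       // use the clear value
            case TileStatus::AnchorOnly:   return 16 * 4;  // one small burst
            case TileStatus::FullyWritten: return 64 * 4;  // 16 pixels * 4xMSAA
        }
        return 64 * 4;
    }
};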

Jawed
 
Z isn't compressed. It is fully allocated and static in size for a given render target size.

The data format for Z is optimised to minimise the number of locations that are accessed, e.g. a 4x4 tile of pixels where no triangle edge falls can have its Z written in a single burst of only 16 samples, leaving the remaining 48 samples (4xMSAA) unwritten. On-chip tile tag tables are used to keep track of the status of tiles.
Z *is* compressed or else you would see much slower results for single-sample Z fillrate. You still need to allocate the full memory, however, because you can't guarantee a consistent level of compression.
 
Jawed, that's just nitpicking, which invites the meta-nitpick that the individual tile is compressed. Anyway... the term "compression" for this is now so well entrenched that your personal reservations about whether it's applicable or not are quite irrelevant.
 
Z isn't compressed. It is fully allocated and static in size for a given render target size.

The data format for Z is optimised to minimise the number of locations that are accessed, e.g. a 4x4 tile of pixels where no triangle edge falls can have its Z written in a single burst of only 16 samples, leaving the remaining 48 samples (4xMSAA) unwritten. On-chip tile tag tables are used to keep track of the status of tiles.

Jawed

The Z buffer was already compressed on R100, which had no idea what MSAA was...
Granted, it only had a compression ratio of 2:1. You're right that it is always fully allocated, but there's absolutely no reason you'd have to allocate all of it in your shiny eDRAM. (The 8x compression factor I mentioned, btw, isn't quite correct; I think today the max ratio is actually a lot higher with AA.)
As you said, there are on-chip tile tags - if I'm not mistaken, these indicate whether a tile has 8x, 4x, 2x or no compression, and when the chip needs to read/write a Z tile it reads the appropriate amount of data from memory to get the data (decompressed) into the small Z cache (or writes it back, respectively). So you could simply store the first chunk of data of each tile in eDRAM (all the data in the case of 8x compression, half the data if it's 4x, down to 1/8 if it's not compressed at all) and leave the rest in RAM. It would be quite simple to implement - in fact, it's so obvious I'm wondering why no one has implemented something like that yet :).
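
In rough code, the read path could look something like this (just a sketch; the 512-byte tile and all names are invented):

Code:
// Sketch of the idea (sizes/names made up): the first 1/8 of every Z
// tile lives in eDRAM, the remainder spills to external memory. An
// 8:1-compressed tile therefore fits entirely in eDRAM; a 4:1 tile is
// half eDRAM/half RAM; an uncompressed tile is 1/8 eDRAM, 7/8 RAM.
#include <cstdint>
#include <cstring>

constexpr uint32_t kTileBytes      = 512;             // uncompressed tile size
constexpr uint32_t kEdramSlotBytes = kTileBytes / 8;  // per-tile on-chip chunk

enum class ZCompression : uint8_t { X8, X4, X2, None };

static uint32_t compressedBytes(ZCompression tag) {
    switch (tag) {
        case ZCompression::X8:   return kTileBytes / 8;
        case ZCompression::X4:   return kTileBytes / 4;
        case ZCompression::X2:   return kTileBytes / 2;
        case ZCompression::None: return kTileBytes;
    }
    return kTileBytes;
}

// Gather one tile's compressed data: eDRAM chunk first, then the spill.
void readZTile(const uint8_t* edram, const uint8_t* ram,
               uint32_t tileIndex, ZCompression tag, uint8_t* dst) {
    const uint32_t total     = compressedBytes(tag);
    const uint32_t fromEdram = total < kEdramSlotBytes ? total : kEdramSlotBytes;

    std::memcpy(dst, edram + tileIndex * kEdramSlotBytes, fromEdram);
    if (total > fromEdram) {  // overflow lives in external memory
        std::memcpy(dst + fromEdram,
                    ram + tileIndex * (kTileBytes - kEdramSlotBytes),
                    total - fromEdram);
    }
}

The write path would be the mirror image, updating the tag whenever the tile's compression level changes.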
 
There's more to it than that: the hardware also needs to be able to sample the compressed Z representation. No rocket science, though :)
 
EDRAM on the same die as the rest of the GPU logic would incur a cost in additional manufacturing complexity and possible impacts on the design effort and yields.

An upcoming large processor with EDRAM on-die is POWER7, but given the price that thing will go for, the tiny volume, and the service revenues IBM gets, the extra dollars to tens of dollars in per-chip manufacturing costs and possibly less than stellar yields mean very little.

Possibly future nodes will make this design option more feasible. If not, choosing the more expensive option anyway would be a sign of significant competitive pressure.
 
What was the unshiny downside of viewport tiling again? Except for processing some border-crossing vertices potentially multiple times, and giving up some percentage of cache hits by changing your buffer access patterns, that is.
 
In the end, if it's inside the engine? Very little, AFAICS. I expect some devs to experiment with approximate tile rendering on Fermi (i.e. you tile, but you don't clip to tile boundaries, and triangles which straddle boundaries are only rendered once).
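
Something along these lines (a toy sketch, every detail invented):

Code:
// Toy binning pass for "approximate" tiling (every detail invented).
// Each triangle lands in exactly ONE bin: the tile containing its
// bounding-box min corner. An exact tiler would clip the triangle and
// replicate it into every bin it overlaps; here a straddler is drawn
// once, unclipped, and simply bleeds into neighbouring tiles.
#include <algorithm>
#include <vector>

struct Tri { float minX, minY, maxX, maxY; /* plus vertex data */ };

constexpr int   kTilesX = 4, kTilesY = 4;
constexpr float kTileSize = 512.0f;

std::vector<Tri> bins[kTilesX * kTilesY];

void binTriangle(const Tri& t) {
    int tx = std::clamp(static_cast<int>(t.minX / kTileSize), 0, kTilesX - 1);
    int ty = std::clamp(static_cast<int>(t.minY / kTileSize), 0, kTilesY - 1);
    bins[ty * kTilesX + tx].push_back(t);
}

// Rendering then walks the bins tile by tile, with the scissor left at
// full render-target size so straddlers still rasterise correctly.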
 
In the end, if it's inside the engine? Very little, AFAICS. I expect some devs to experiment with approximate tile rendering on Fermi (i.e. you tile, but you don't clip to tile boundaries, and triangles which straddle boundaries are only rendered once).
Having one group of threads reading/writing from/to another group of threads' tile(s) is not a good idea. Apart from the obvious synchronization issues, good luck maintaining the proper primitive submission order without leaving a lot of performance on the table.
 
Jawed, that's just nitpicking, which invites the meta-nitpick that the individual tile is compressed. Anyway... the term "compression" for this is now so well entrenched that your personal reservations about whether it's applicable or not are quite irrelevant.
It's not a nitpick to say that 'you'd "only" need 16MB for 2560x1600 with 8xAA' is wrong. The entire surface needs to exist, which is more than 16MB. Anyway, we've moved past that. I think we've even discussed the partially-in-EDRAM type of solution before...
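
(For scale, assuming 4 bytes per Z/stencil sample: 2560 × 1600 × 8 samples × 4 bytes ≈ 125MB fully allocated, versus ~15.6MB if every tile hit 8:1.)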

Jawed
 
In the end, if it's inside the engine? Very little, AFAICS. I expect some devs to experiment with approximate tile rendering on Fermi (i.e. you tile, but you don't clip to tile boundaries, and triangles which straddle boundaries are only rendered once).

I was thinking of a more software-transparent approach. But then, I'd guess, you could move to a finer-grained TBDR as well.
 
The good thing with GF100 is that you can take a break for a week or so, drool over your newborn like there's no tomorrow, and there still won't be anything worthwhile about it on the net.

Sorry for the OT; as you were ;)
 
The good thing with GF100 is that you can take a break for a week or so, drool over your newborn like there's no tomorrow, and there still won't be anything worthwhile about it on the net.

Sorry for the OT; as you were ;)

I guess "Congrats" is in order :)
 