NVIDIA Fermi: Architecture discussion

XDR2 needs to be over twice as fast to give an advantage in peak bandwidth per data-pin.

Care to elaborate? I feel like I'm missing something... (is it due to micro-threading?)

Edit: I thought GDDR5 was using fully differential signalling for data too.
 
True, but Rambus would allow AMD and NVIDIA to ease the transition to such a radically different (for them) system, which would otherwise be painful and need a lot of resources.

What is it about Rambus IP that Intel's money cannot buy? LRB will debut directly with GDDR5. And since LRB is a more bandwidth-efficient architecture, it'll beat AMD and NV GPUs any day on the bandwidth-efficiency metric. And Intel has the pockets to make sure they won't be competing with inferior memory technology any time soon.

As GPUs generalize, they'll have to give up the luxury of present-day inefficiencies to compete with more efficient chips. And yes, eDRAM/TBDRs seem to be the only solution right now.
 
I just have to bring this up: what about using eDRAM? Sure, putting full depth buffers in there isn't really feasible, but Z buffers are compressed nowadays, so if you put, for instance, only the parts which are fully compressed (8x ratio?) in there, you'd "only" need 16MB for 2560x1600 with 8xAA, which doesn't look unreasonable. You'd still need high-bandwidth memory (to fetch other parts of the Z buffer, color buffers, textures etc.), but surely this should help.
Z isn't compressed. It is fully allocated and static in size for a given render target size.

The data format for Z is optimised to minimise the number of locations that are accessed, e.g. a 4x4 tile of pixels where no triangle edge falls can have its Z written in a single burst of only 16 samples, leaving the remaining 48 samples (4xMSAA) unwritten. On-chip tile tag tables are used to keep track of the status of tiles.
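
Conceptually, something like this toy model (all names, sizes and statuses invented for illustration; not any actual hardware):

Code:
// Toy sketch of an on-chip tile tag table (all names/sizes invented).
// A 4x4-pixel tile at 4xMSAA holds 64 Z samples; if no edge falls in
// the tile, only 16 samples are written and the other 48 stay untouched.
#include <cstdint>
#include <vector>

enum class TileStatus : uint8_t {
    Cleared,       // nothing written since clear
    AnchorOnly,    // 16 of 64 samples written (no edge in the tile)
    FullyWritten   // all 64 samples written (an edge crossed the tile)
};

struct TileTagTable {
    uint32_t tilesWide;
    std::vector<TileStatus> tags;  // one tag per 4x4 tile

    TileStatus& at(uint32_t tx, uint32_t ty) {
        return tags[ty * tilesWide + tx];
    }

    // Bytes the memory controller actually touches for this tile,
    // assuming 4 bytes per Z sample.
    uint32_t bytesAccessed(uint32_t tx, uint32_t ty) {
        switch (at(tx, ty)) {
            case TileStatus::Cleared:      return 0;       // use the clear value
            case TileStatus::AnchorOnly:   return 16 * 4;  // one small burst
            case TileStatus::FullyWritten: return 64 * 4;  // 16 pixels * 4xMSAA
        }
        return 64 * 4;
    }
};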

Jawed
 
Z isn't compressed. It is fully allocated and static in size for a given render target size.

The data format for Z is optimised to minimise the number of locations that are accessed, e.g. a 4x4 tile of pixels where no triangle edge falls can have its Z written in a single burst of only 16 samples, leaving the remaining 48 samples (4xMSAA) unwritten. On-chip tile tag tables are used to keep track of the status of tiles.
Z *is* compressed or else you would see much slower results for single-sample Z fillrate. You still need to allocate the full memory, however, because you can't guarantee a consistent level of compression.
 
Jawed, that's just nitpicking, which invites the meta-nitpick that the individual tile is compressed. Anyway... the term "compression" for this is now so well entrenched that your personal reservations about whether it's applicable or not are quite irrelevant.
 
Z isn't compressed. It is fully allocated and static in size for a given render target size.

The data format for Z is optimised to minimise the number of locations that are accessed, e.g. a 4x4 tile of pixels where no triangle edge falls can have its Z written in a single burst of only 16 samples, leaving the remaining 48 samples (4xMSAA) unwritten. On-chip tile tag tables are used to keep track of the status of tiles.

Jawed

The Z buffer was already compressed on R100, which had no idea what MSAA was...
Granted, it only had a compression ratio of 2:1. You're right that it is always fully allocated, but there's absolutely no reason you'd have to allocate all of it in your shiny eDRAM. (The 8x compression factor I mentioned, btw, isn't quite correct; I think today the max ratio is actually a lot higher with AA.)
As you said, there are on-chip tile tags - if I'm not mistaken, these indicate whether a tile has 8x, 4x, 2x or no compression, and when the chip needs to read/write a Z tile it reads the appropriate amount of data from memory to get the data (decompressed) into the small Z cache (or writes it back, respectively). So you could simply store the first chunk of data of each tile in eDRAM (all the data in the case of 8x compression, half the data if it's 4x, down to 1/8 if it's not compressed at all) and leave the rest in RAM. It would be quite simple to implement - in fact, it's so obvious I'm wondering why no one has implemented something like that yet :).
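
In rough code, the read path could look something like this (just a sketch; the 512-byte tile and all names are invented):

Code:
// Sketch of the idea (sizes/names made up): the first 1/8 of every Z
// tile lives in eDRAM, the remainder spills to external memory. An
// 8:1-compressed tile therefore fits entirely in eDRAM; a 4:1 tile is
// half eDRAM/half RAM; an uncompressed tile is 1/8 eDRAM, 7/8 RAM.
#include <cstdint>
#include <cstring>

constexpr uint32_t kTileBytes      = 512;             // uncompressed tile size
constexpr uint32_t kEdramSlotBytes = kTileBytes / 8;  // per-tile on-chip chunk

enum class ZCompression : uint8_t { X8, X4, X2, None };

static uint32_t compressedBytes(ZCompression tag) {
    switch (tag) {
        case ZCompression::X8:   return kTileBytes / 8;
        case ZCompression::X4:   return kTileBytes / 4;
        case ZCompression::X2:   return kTileBytes / 2;
        case ZCompression::None: return kTileBytes;
    }
    return kTileBytes;
}

// Gather one tile's compressed data: eDRAM chunk first, then the spill.
void readZTile(const uint8_t* edram, const uint8_t* ram,
               uint32_t tileIndex, ZCompression tag, uint8_t* dst) {
    const uint32_t total     = compressedBytes(tag);
    const uint32_t fromEdram = total < kEdramSlotBytes ? total : kEdramSlotBytes;

    std::memcpy(dst, edram + tileIndex * kEdramSlotBytes, fromEdram);
    if (total > fromEdram) {  // overflow lives in external memory
        std::memcpy(dst + fromEdram,
                    ram + tileIndex * (kTileBytes - kEdramSlotBytes),
                    total - fromEdram);
    }
}

The write path would be the mirror image, updating the tag whenever the tile's compression level changes.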
 
There's more to it than that: the hardware also needs to be able to sample the compressed Z representation. No rocket science, though :)
 
EDRAM on the same die as the rest of the GPU logic would incur a cost in additional manufacturing complexity and possible impacts on the design effort and yields.

An upcoming large processor with EDRAM on-die is POWER7, but given the price that thing will go for, the tiny volume, and the service revenues IBM gets, the extra dollars to tens of dollars in per-chip manufacturing costs and possibly less than stellar yields mean very little.

Possibly future nodes will make this design option more feasible. If not, choosing the more expensive option anyway would be a sign of significant competitive pressure.
 
What was the unshiny downside of viewport tiling again? Except for processing some border-crossing vertices potentially multiple times, and giving up some percentage of cache hits by changing your buffer access patterns, that is.
 
In the end, if it's inside the engine? Very little, AFAICS. I expect some devs to experiment with approximate tile rendering on Fermi (i.e. you tile, but you don't clip to tile boundaries, and triangles which straddle boundaries are only rendered once).
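
Something along these lines (a toy sketch, every detail invented):

Code:
// Toy binning pass for "approximate" tiling (every detail invented).
// Each triangle lands in exactly ONE bin: the tile containing its
// bounding-box min corner. An exact tiler would clip the triangle and
// replicate it into every bin it overlaps; here a straddler is drawn
// once, unclipped, and simply bleeds into neighbouring tiles.
#include <algorithm>
#include <vector>

struct Tri { float minX, minY, maxX, maxY; /* plus vertex data */ };

constexpr int   kTilesX = 4, kTilesY = 4;
constexpr float kTileSize = 512.0f;

std::vector<Tri> bins[kTilesX * kTilesY];

void binTriangle(const Tri& t) {
    int tx = std::clamp(static_cast<int>(t.minX / kTileSize), 0, kTilesX - 1);
    int ty = std::clamp(static_cast<int>(t.minY / kTileSize), 0, kTilesY - 1);
    bins[ty * kTilesX + tx].push_back(t);
}

// Rendering then walks the bins tile by tile, with the scissor left at
// full render-target size so straddlers still rasterise correctly.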
 
In the end, if it's inside the engine? Very little, AFAICS. I expect some devs to experiment with approximate tile rendering on Fermi (i.e. you tile, but you don't clip to tile boundaries, and triangles which straddle boundaries are only rendered once).
Having one group of threads reading/writing from/to another group of threads' tile(s) is not a good idea. Apart from the obvious synchronization issues, good luck maintaining the proper primitive submission order without leaving a lot of performance on the table.
 
Jawed, that's just nitpicking, which invites the meta-nitpick that the individual tile is compressed. Anyway... the term "compression" for this is now so well entrenched that your personal reservations about whether it's applicable or not are quite irrelevant.
It's not a nitpick to say that 'you'd "only" need 16MB for 2560x1600 with 8xAA' is wrong. The entire surface needs to exist, which is more than 16MB. Anyway, we've moved past that. I think we've even discussed the partially-in-EDRAM type of solution before...
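
(For scale, assuming 4 bytes per Z/stencil sample: 2560 × 1600 × 8 samples × 4 bytes ≈ 125MB fully allocated, versus ~15.6MB if every tile hit 8:1.)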

Jawed
 
In the end, if it's inside the engine? Very little, AFAICS. I expect some devs to experiment with approximate tile rendering on Fermi (i.e. you tile, but you don't clip to tile boundaries, and triangles which straddle boundaries are only rendered once).

I was thinking of a more software-transparent approach. But then, I'd guess, you could move to a finer-grained TBDR as well.
 
The good thing with GF100 is that you can take a break for a week or so, drool over your newborn like there's no tomorrow, and there still won't be anything worthwhile about it on the net.

Sorry for the OT; as you were ;)
 
The good thing with GF100 is that you can take a break for a week or so, drool over your newborn like there's no tomorrow, and there still won't be anything worthwhile about it on the net.

Sorry for the OT; as you were ;)

I guess "Congrats" is in order :)
 