NVIDIA Fermi: Architecture discussion

Thanks, that answers my refrast question too ... if there is no assembly instruction it's not going to be in refrast either.
 
How do you know it's across a large silicon area? Common sense says they're central, probably in that bit in the middle we've all had a hard time identifying to any precision (probably the front ends for each GPC + memory controller at the very least). No real idea how it affects clock scaling, but I'd take an educated guess that it's not much, given the aggregate performance from the blocks compared to prior designs.
 
I don't know about that. The GTX is less than half of what a full Fermi will be, and it's 30-35% off Cypress (5870) and within 5-10% of the 5850. Half a Fermi would have 256 SPs, GDDR5 with probably a 256-bit bus, and 24 ROPs.

Well it would have considerably less texture capacity when compared even to a Juniper.
 

Yep, that's going to be an issue methinks. With older titles being considerably texture bound and not pushing too many polys I'm not sure how Fermi's derivatives are gonna hold up. I think a comparison to Juniper is setting the bar a bit low though.
 
Interesting. So it decomposes into multiple instructions, one per distinct offset. Makes sense I guess.
It doesn't really ... except for trivial cases it will compile to 4 separate gathers and the assembler has to turn it back into a single instruction (if there are repeated fixed offsets the assembler can recognise that as easily as the compiler). Forcing a decompilation step in the assembler for the hardware for which the HLSL instruction was meant in the first place is retarded. I can come up with reasons to keep it out of the assembler/refrast level, but they aren't technical reasons.
I don't see why the presence of this function would have influenced AMD's design either way though.
You don't see why having timely knowledge of the DirectX specification is important to an IHV? Of course AMD might have had better specs than us ... that remains to be seen though.
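
To make the decomposition described above concrete, here is a rough HLSL sketch (texture, sampler and offset values are made up for illustration, and exactly which texel of each quad ends up in each returned component is glossed over): the SM5 four-offset GatherRed overload on top, and the four single-offset gathers it boils down to when the offsets are distinct.

Code:
// Illustrative sketch only, not real compiler output. Resource names and
// offsets are invented for the example.
Texture2D<float> depthTex     : register(t0);
SamplerState     pointSampler : register(s0);

float4 GatherFourOffsets(float2 uv)
{
    // One HLSL call, four independent per-texel offsets.
    return depthTex.GatherRed(pointSampler, uv,
                              int2(-2, -2), int2( 1, -3),
                              int2(-3,  2), int2( 2,  1));
}

float4 GatherFourOffsetsDecomposed(float2 uv)
{
    // Four single-offset gathers, one per distinct offset, with one texel
    // kept from each quad (.w chosen arbitrarily for the illustration).
    float4 r;
    r.x = depthTex.GatherRed(pointSampler, uv, int2(-2, -2)).w;
    r.y = depthTex.GatherRed(pointSampler, uv, int2( 1, -3)).w;
    r.z = depthTex.GatherRed(pointSampler, uv, int2(-3,  2)).w;
    r.w = depthTex.GatherRed(pointSampler, uv, int2( 2,  1)).w;
    return r;
}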
 
It doesn't really ... except for trivial cases it will compile to 4 separate gathers and the assembler has to turn it back into a single instruction (if there are repeated fixed offsets the assembler can recognise that as easily as the compiler). Forcing a decompilation step in the assembler for the hardware for which the HLSL instruction was meant in the first place is retarded. I can come up with reasons to keep it out of the assembler/refrast level, but they aren't technical reasons.

But what's the end result in terms of the number of gather instructions the hardware issues?

You don't see why having timely knowledge of the DirectX specification is important to an IHV? Of course AMD might have had better specs than us ... that remains to be seen though.

Based on my possibly incorrect understanding of the issue, no, I don't see a difference. What Nvidia is proposing is calculation of two texel addresses per clock, which results in doubled performance of gather4 with 4 distinct offsets. I don't understand why this optimization couldn't work by detecting cases where 4 point samples with different offsets are requested and then packing them into a single gather4 instruction. The 4 offset API call is a convenience, but to me it seems the optimization is feasible without it.
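
For what it's worth, this is the kind of pattern being described, sketched in HLSL with invented names; whether a compiler or driver would actually detect and pack it is exactly the open question here.

Code:
// Hypothetical example: four individually offset point samples of one
// channel around the same base texel. This is information-equivalent to a
// single four-offset gather4; the dedicated HLSL overload just states that
// intent explicitly instead of relying on pattern detection.
Texture2D<float> tex : register(t0);

float4 FourPointSamples(int2 texel)
{
    float4 r;
    r.x = tex.Load(int3(texel, 0), int2(-2, -2));
    r.y = tex.Load(int3(texel, 0), int2( 1, -3));
    r.z = tex.Load(int3(texel, 0), int2(-3,  2));
    r.w = tex.Load(int3(texel, 0), int2( 2,  1));
    return r;
}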
 
The assembler instructions generated by the 4 offset gather HLSL are not something you'd ever actually expect to encounter unless you knew the HLSL instruction existed and would generate them. You have to know something will occur before you can decide to design hardware to accelerate it ... so if keeping this out of the docs and refrast kept AMD from knowing about the existence of the HLSL instruction for X months then that's a competitive advantage (this isn't about Evergreen, this is about future hardware).
 
Well color me confused. I thought jittered point sampling was an old technique, the only difference being that prior hardware would do it one sample at a time instead of two at a time in Fermi.
 
Jittered point sampling is an old technique, but it stopped really making sense for a while when you could gather4 at about the same cost ... it's the changed structure of Fermi texture addressing which makes it relevant again. You'd never use this on Evergreen hardware except if you were lazy or running sponsored code.
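
For reference, a minimal sketch of that old technique, written as HLSL shadow-map filtering with invented names, offsets and tap count: each tap is a single texel fetched at an irregular offset, which is the access pattern a per-texel-offset gather can service with fewer fetches.

Code:
// Minimal jittered point sampling sketch (everything here is illustrative,
// not taken from the thread). Each iteration fetches one texel at a
// jittered offset: 4 taps cost 4 separate fetches.
Texture2D<float> shadowMap    : register(t0);
SamplerState     pointSampler : register(s0);

static const int2 kJitter[4] = { int2(-2, -2), int2(1, -3), int2(-3, 2), int2(2, 1) };

float JitteredShadow(float2 uv, float2 texelSize, float sceneDepth)
{
    float lit = 0.0;
    [unroll]
    for (int i = 0; i < 4; ++i)
    {
        // One point sample per jittered offset.
        float d = shadowMap.SampleLevel(pointSampler, uv + kJitter[i] * texelSize, 0);
        lit += (sceneDepth <= d) ? 1.0 : 0.0;
    }
    return lit / 4.0;
}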
 

Well you've completely lost me now :) Isn't the single offset gather 4 instruction going to return you 4 point samples from a texel quad which defeats the whole reason for jittering the samples in the first place?

If jittered point samples are useless it's curious that Nvidia is making a big deal about it (granted they're using screens from 3dmark06 to demonstrate the effect). I tried doing a quick google but came up empty. What are the new techniques for shadow mapping / SSAO that render jittered sampling irrelevant?
 
Well you've completely lost me now :) Isn't the single offset gather 4 instruction going to return you 4 point samples from a texel quad which defeats the whole reason for jittering the samples in the first place?
It's better to have 16 quads from jittered locations than 16 single texel samples ... that's basically the choice with Evergreen.
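If I read that right, the idea is to jitter the gather locations themselves so that every fetch returns a full 2x2 quad. A sketch of that, in HLSL with invented names and only 4 gathers to keep it short: 4 fetches yield 16 depth values, where a point-sampled equivalent needs 16 fetches for the same texel count.

Code:
// Hedged sketch of "quads from jittered locations" (names and offsets are
// invented). Each fetch is a gather4 at a jittered base coordinate and
// returns 4 texels, so 4 fetches cover 16 shadow-map depths.
Texture2D<float> shadowMap    : register(t0);
SamplerState     pointSampler : register(s0);

static const float2 kQuadJitter[4] = { float2(-1.5, -1.5), float2( 1.5, -1.5),
                                       float2(-1.5,  1.5), float2( 1.5,  1.5) };

float JitteredGatherShadow(float2 uv, float2 texelSize, float sceneDepth)
{
    float lit = 0.0;
    [unroll]
    for (int i = 0; i < 4; ++i)
    {
        // One gather4 per jittered location: 4 texels per fetch.
        float4 quad    = shadowMap.GatherRed(pointSampler, uv + kQuadJitter[i] * texelSize);
        float4 inLight = step(sceneDepth, quad); // 1.0 where the stored depth is not occluding
        lit += (inLight.x + inLight.y + inLight.z + inLight.w) * 0.25;
    }
    return lit / 4.0;
}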
 
One silicon level architecture question for you guys. Given the 'polymorph engines' are all interconnected across large expanses of the die, and need to be kept in sync, anyone want to hazard a guess on how that affects clock scaling? :)
-Charlie
Why would you not want to go through existing caches for this?
 
One silicon level architecture question for you guys. Given the 'polymorph engines' are all interconnected across large expanses of the die, and need to be kept in sync, anyone want to hazard a guess on how that affects clock scaling? :)

-Charlie


Each domain has separate clocks, and each domain has to sync with its internals; this goes for the polymorph engines as well, and the data is then spit out to wherever it needs to go. As long as each domain syncs up within itself there are no problems. It's just like previous architectures; that's why the domain clocks are there. Each domain works independently of the others, and data sent from domain to domain has to be synced within the domain where it's being processed at a given time.

Now with GF100 there is out-of-order processing, but I'm pretty sure it's always within the domain, *this is me guessing*.
 
Right, so where are we on this theory of yours then?
Conspiracy theory! Let's make one thing crystal clear: even though I let myself get drawn into lengthy arguments on this, it is pretty far out there. If it's true we will probably never hear of it; some people at AMD might get mad at Microsoft, but it still would not be in their best interest to antagonize them in public. If it's false and AMD confirms it was only a public documentation error, I will look foolish and we can all quickly forget about it.
Has NVIDIA gained a competitive advantage over ATI.... Or?
There are a couple components to this ... firstly the instruction itself and the kind of acceleration it can get on Fermi. It's a good instruction, with an underlying texture cache better suited to point sampling than the one in Evergreen. It will be a win in some algorithms (it will also leave some resources poorly used on ATI hardware, so ideally you will have two implementations). So in that it's a competitive advantage in a good way, better hardware (arguably depending on cost ... but intuitively I'd say the costs for allowing individually addressed 32 bit samples, as opposed to quads, are small compared to the benefits).

The other component is the center of my conspiracy theory ... the IHVs during DirectX standardization have to put a lot of cards on the table; their competition might not be immediately able to take that into account for their own hardware, but they will take any implicit information about the other's upcoming hardware into consideration for their next generation. If NVIDIA got instructions into HLSL but got Microsoft to keep them out of the documentation, allowing them to simply declare "oh this is part of DirectX 11 too" at their convenience, then yes, they got a clear competitive advantage. In a bad way.
 
Why would MS go to Nvidia after they were royally screwed by them back in the original Xbox days, and why would they leave ATI when ATI delivered a fantastic part in Xenos that has allowed them to stay competitive with the PS3 despite launching a year earlier?
 
@eastmen

Business is business. Sometimes you get stung but others you might get a good deal with no obvious caveats that is cheaper and faster than the competition. Money talks louder than words.
 

That's what I'm saying. I pointed out two reasons why ATI would be more likely than Nvidia, and as we see, ATI's and Nvidia's performance is quite in sync with each other.
 
Why would MS go to Nvidia after they were royally screwed by them back in the original Xbox days, and why would they leave ATI when ATI delivered a fantastic part in Xenos that has allowed them to stay competitive with the PS3 despite launching a year earlier?

Not to get too far off-topic, but it's not like Nvidia sprang something on them - they mutually came to an agreement in the design phase which benefited Nvidia quite a bit in the long run. I don't see sticking to the terms of a contract and taking your profits as an unscrupulous business practice. I seriously doubt that Microsoft would have done anything different if the roles were reversed and they were in Nvidia's position.
 
Yes, let's try to keep the discussion on Fermi's architecture, not the internals of the next Xbox. You are always welcome to make a new thread over in the console forum if you wish to continue the discussion.
 