Old 18-Jan-2010, 22:19   #3426
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,809
Default

Quote:
Originally Posted by MfA View Post
It doesn't really ... except for trivial cases it will compile to 4 separate gathers and the assembler has to turn it back into a single instruction (if there are repeated fixed offsets the assembler can recognise that as easily as the compiler). Forcing a decompilation step in the assembler for the hardware for which the HLSL instruction was meant in the first place is absurd. I can come up with reasons to keep it out of the assembler/refrast level, but they aren't technical reasons.
But what's the end result in terms of the number of gather instructions the hardware issues?

Quote:
You don't see why having timely knowledge of the DirectX specification is important to an IHV? Of course AMD might have had better specs than us ... that remains to be seen though.
Based on my possibly incorrect understanding of the issue, no I don't see a difference. What Nvidia is proposing is calculation of two texel addresses per clock which results in doubled performance of gather4 with 4 distinct offsets. I don't understand why this optimization couldn't work by detecting cases where 4 point samples with different offsets are requested and then packing them into a single gather4 instruction. The 4 offset API call is a convenience but to me it seems the optimization is feasible without it.
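A minimal sketch of the packing idea in the paragraph above, written in Python rather than shader code. Everything here is hypothetical: the request tuples, the `pack_point_samples` name and the `gather4_offsets` label are made up for illustration and do not correspond to any real compiler pass or driver. It only shows that four point samples sharing a texture and a base coordinate can be grouped into one four-offset fetch without the API call existing.

```python
from collections import defaultdict

def pack_point_samples(requests):
    """Group point-sample requests that share a texture and base coordinate.

    Each request is (texture_name, base_xy, offset_xy). Up to four offsets
    with the same texture and base are packed into one four-offset gather.
    """
    groups = defaultdict(list)
    for tex, base, offset in requests:
        groups[(tex, base)].append(offset)

    packed = []
    for (tex, base), offsets in groups.items():
        # Emit one gather per chunk of up to four offsets.
        for i in range(0, len(offsets), 4):
            packed.append(("gather4_offsets", tex, base, tuple(offsets[i:i + 4])))
    return packed

# Four jittered point samples around the same base texel (hypothetical values).
requests = [
    ("shadow_map", (10, 20), (-3, 1)),
    ("shadow_map", (10, 20), (2, -2)),
    ("shadow_map", (10, 20), (0, 3)),
    ("shadow_map", (10, 20), (1, 0)),
]
packed = pack_point_samples(requests)
print(len(requests), "point samples ->", len(packed), "gather instruction(s)")
```

Whether real hardware or a real assembler could do this grouping cheaply is exactly what is being argued in this thread; the sketch only shows the grouping is mechanically simple when the base coordinate is shared.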
__________________
What the deuce!?
Old 18-Jan-2010, 22:31   #3427
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,220
Default

The assembler instructions generated by the 4 offset gather HLSL are not something you'd ever actually expect to encounter unless you knew the HLSL instruction existed and would generate them. You have to know something will occur before you can decide to design hardware to accelerate it ... so if keeping this out of the docs and refrast kept AMD from knowing about the existence of the HLSL instruction for X months then that's a competitive advantage (this isn't about Evergreen, this is about future hardware).
Old 18-Jan-2010, 22:39   #3428
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,809
Default

Well color me confused. I thought jittered point sampling was an old technique, the only difference being that prior hardware would do it one sample at a time instead of two at a time in Fermi.
__________________
What the deuce!?
Old 18-Jan-2010, 22:41   #3429
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,220
Default

Jittered point sampling is an old technique, but it stopped really making sense for a while when you could gather4 at about the same cost ... it's the changed structure of Fermi texture addressing which makes it relevant again. You'd never use this on Evergreen hardware except if you were lazy or running sponsored code.
Old 18-Jan-2010, 22:58   #3430
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,809
Default

Quote:
Originally Posted by MfA View Post
Jittered point sampling is an old technique, but it stopped really making sense for a while when you could gather4 at about the same cost ... it's the changed structure of Fermi texture addressing which makes it relevant again. You'd never use this on Evergreen hardware except if you were lazy or running sponsored code.
Well, you've completely lost me now. Isn't the single-offset gather4 instruction going to return you 4 point samples from a texel quad, which defeats the whole reason for jittering the samples in the first place?

If jittered point samples are useless it's curious that Nvidia is making a big deal about it (granted they're using screens from 3DMark06 to demonstrate the effect). I tried doing a quick Google search but came up empty. What are the new techniques for shadow mapping / SSAO that render jittered sampling irrelevant?
__________________
What the deuce!?
Old 18-Jan-2010, 23:40   #3431
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,220
Default

Quote:
Originally Posted by trinibwoy View Post
Well, you've completely lost me now. Isn't the single-offset gather4 instruction going to return you 4 point samples from a texel quad, which defeats the whole reason for jittering the samples in the first place?
It's better to have 16 quads from jittered locations than 16 single texel samples ... that's basically the choice with Evergreen.
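To make the "16 quads versus 16 single texels" trade-off concrete, here is a rough Python sketch of a percentage-closer style shadow test. It assumes a toy depth map of random values, wrap addressing, and a `gather4` that returns the 2x2 quad starting at the sample location; all function names are made up for this illustration and the texel ordering is not meant to match any real API.

```python
import random

def make_shadow_map(size, seed=0):
    """Toy depth map filled with pseudo-random depths in [0, 1)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(size)] for _ in range(size)]

def point_sample(tex, x, y):
    """One fetch returns one texel (wrap addressing, an assumption)."""
    size = len(tex)
    return [tex[y % size][x % size]]

def gather4(tex, x, y):
    """One fetch returns the 2x2 quad whose first texel is (x, y)."""
    size = len(tex)
    return [tex[(y + dy) % size][(x + dx) % size]
            for dy in (0, 1) for dx in (0, 1)]

def shadow_factor(tex, locations, fetch, receiver_depth):
    """Fraction of fetched depths that are not closer than the receiver."""
    depths = []
    for x, y in locations:
        depths.extend(fetch(tex, x, y))
    lit = sum(1 for d in depths if d >= receiver_depth)
    return lit / len(depths), len(depths)

shadow_map = make_shadow_map(64)
rng = random.Random(1)
# 16 jittered sample locations (hypothetical pattern).
locations = [(rng.randrange(64), rng.randrange(64)) for _ in range(16)]

f_point, n_point = shadow_factor(shadow_map, locations, point_sample, 0.5)
f_quad, n_quad = shadow_factor(shadow_map, locations, gather4, 0.5)
print(n_point, "depths from 16 point fetches")
print(n_quad, "depths from 16 gather4 fetches")
```

Same number of fetches, four times the depth comparisons. The extra three texels per fetch are not ideally placed, but if the hardware returns them at about the same cost they still add to the estimate.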
Old 19-Jan-2010, 00:03   #3432
CarstenS
Senior Member
 
Join Date: May 2002
Location: Germany
Posts: 2,842
Default

Quote:
Originally Posted by Groo The Wanderer View Post
One silicon level architecture question for you guys. Given the 'polymorph engines' are all interconnected across large expanses of the die, and need to be kept in sync, anyone want to hazard a guess on how that affects clock scaling?
-Charlie
Why would you not want to go through existing caches for this?
__________________
English is not my native tongue. Before flaming please consider the possibility that I did not mean to say what you might have read from my posts.
Work| Recreation
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration!
Old 19-Jan-2010, 00:58   #3433
Razor1
Senior Member
 
Join Date: Jul 2004
Location: NY, NY
Posts: 2,680
Default

Quote:
Originally Posted by Groo The Wanderer View Post
One silicon level architecture question for you guys. Given the 'polymorph engines' are all interconnected across large expanses of the die, and need to be kept in sync, anyone want to hazard a guess on how that affects clock scaling?

-Charlie

Each domain has separate clocks, and each domain has to sync with its internals. This goes for the polymorph engines as well; the data is then spit out to wherever it needs to go. As long as each domain syncs up internally there are no problems. It's just like previous architectures, and that's why the domain clocks are there: each domain works independently of the others, and data sent from domain to domain only has to be synced within the domain where it is being processed at a given time.

Now with the GF100 there is out-of-order processing, but I'm pretty sure it's always within the domain, *this is me guessing*.

Last edited by Razor1; 19-Jan-2010 at 01:06.
Old 19-Jan-2010, 01:15   #3434
Tahir2
Itchy
 
Join Date: Feb 2002
Location: United Queendom
Posts: 2,858
Default

Quote:
Originally Posted by MfA View Post
... If not then I personally will stake my bet on NVIDIA having received the XBOX720 contract already
Right, so where are we on this theory of yours then? Has NVIDIA gained a competitive advantage over ATI.... Or?
Old 19-Jan-2010, 01:46   #3435
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,220
Default

Quote:
Originally Posted by Tahir2 View Post
Right, so where are we on this theory of yours then?
Conspiracy theory! Let's make one thing crystal clear: even though I let myself get drawn into lengthy arguments on this, it is pretty far out there. If it's true we will probably never hear of it; some people at AMD might get mad at Microsoft, but it still would not be in their best interest to antagonize them in public. If it's false and AMD confirms it was only a public documentation error, I will look foolish and we can all quickly forget about it.
Quote:
Has NVIDIA gained a competitive advantage over ATI.... Or?
There are a couple components to this ... firstly the instruction itself and the kind of acceleration it can get on Fermi. It's a good instruction, with an underlying texture cache better suited to point sampling than the one in Evergreen. It will be a win in some algorithms (it will also leave some resources poorly used on ATI hardware, so ideally you will have two implementations). So in that it's a competitive advantage in a good way, better hardware (arguably depending on cost ... but intuitively I'd say the costs for allowing individually addressed 32 bit samples, as opposed to quads, are small compared to the benefits).

The other component is the center of my conspiracy theory ... the IHVs during DirectX standardization have to put a lot of cards on the table. Their competition might not be immediately able to take that into account for their own hardware, but they will take any implicit information about the other's upcoming hardware into consideration for their next generation. If NVIDIA got instructions into HLSL but got Microsoft to keep them out of the documentation and allow them to simply declare "oh this is part of DirectX 11 too" at their convenience, then yes they got a clear competitive advantage. In a bad way.
Old 19-Jan-2010, 02:16   #3436
eastmen
Senior Member
 
Join Date: Mar 2008
Posts: 4,917
Default

Why would MS go to Nvidia after they were royally screwed by them back in the original Xbox days, and why would they leave ATI when ATI delivered a fantastic part in the Xenos that has allowed them to stay competitive with the PS3 despite launching a year earlier?
Old 19-Jan-2010, 02:24   #3437
Tahir2
Itchy
 
Join Date: Feb 2002
Location: United Queendom
Posts: 2,858
Default

@eastmen

Business is business. Sometimes you get stung, but other times you might get a good deal with no obvious caveats that is cheaper and faster than the competition. Money talks louder than words.
Old 19-Jan-2010, 02:35   #3438
eastmen
Senior Member
 
Join Date: Mar 2008
Posts: 4,917
Default

Quote:
Originally Posted by Tahir2 View Post
@eastmen

Business is business. Sometimes you get stung, but other times you might get a good deal with no obvious caveats that is cheaper and faster than the competition. Money talks louder than words.
That's what I'm saying. I pointed out two reasons why ATI would be more likely than Nvidia, and as we see, ATI's and Nvidia's performance is quite in sync with each other.
Old 19-Jan-2010, 02:43   #3439
sethk
Junior Member
 
Join Date: May 2004
Posts: 91
Default

Quote:
Originally Posted by eastmen View Post
Why would MS go to Nvidia after they were royally screwed by them back in the original Xbox days, and why would they leave ATI when ATI delivered a fantastic part in the Xenos that has allowed them to stay competitive with the PS3 despite launching a year earlier?
Not to get too far off-topic, but it's not like Nvidia sprang something on them - they mutually came to an agreement in the design phase which benefited Nvidia quite a bit in the long run. I don't see sticking to the terms of a contract and taking your profits as an unscrupulous business practice. I seriously doubt that Microsoft would have done anything different if the roles were reversed and they were in Nvidia's position.
Old 19-Jan-2010, 03:21   #3440
willardjuice
super willyjuice
 
Join Date: May 2005
Location: Astoria, NY
Posts: 986
Default

Yes, let's try to keep the discussion on Fermi's architecture, not the internals of the next Xbox. You are always welcome to make a new thread over in the console forum if you wish to continue the discussion.
Old 19-Jan-2010, 05:04   #3441
Bob
Member
 
Join Date: Apr 2004
Posts: 416
Default

GF100 Graphics Architecture Whitepaper
__________________
Vincent: G80 is designed for time to market, whereas the R600 is specialized in the rich feature.
Old 19-Jan-2010, 05:23   #3442
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,816
Default

I still don't get it: why is there a dedicated tessellator for each SM?
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
Old 19-Jan-2010, 07:28   #3443
Squilliam
Beyond3d isn't defined yet
 
Join Date: Jan 2008
Location: New Zealand
Posts: 3,037
Default

Quote:
Originally Posted by MfA View Post
If Microsoft comes back with the answer "oops, this should be in there too" it would be due diligence to ask AMD if they were ever made aware of that ahead of time. If not then I personally will stake my bet on NVIDIA having received the XBOX720 contract already
What a bad tease

But they have always been naughty like that, when Nvidia didn't play ball, they grabbed them by the balls and partially made R300 the runaway success it was.

What are the chances that the actual feature doesn't exist in the DX11 specification and the feature they implemented here is an intercept which nets the same result anyway?
Old 19-Jan-2010, 10:56   #3444
Neb
Iron "BEAST" Man
 
Join Date: Mar 2007
Location: NGC2264
Posts: 8,382
Default

Just read the whitepaper and it sure looks innovative, with great improvements in different areas that are IMO key for pushing graphics much further in games and in professional work. Glad I waited, as a GTX380 will be the deal for me. Pretty sure about this.
Old 19-Jan-2010, 11:44   #3445
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,220
Default

Quote:
Originally Posted by Squilliam View Post
What are the chances that the actual feature doesn't exist in the DX11 specification and the feature they implemented here is an intercept which nets the same result anyway?
Read the bit about the code: it exists in the HLSL, but not in the docs or the assembler/refrast. That matters because the assembler/refrast are more strictly documented, since drivers are written against them.
Old 19-Jan-2010, 13:17   #3446
CarstenS
Senior Member
 
Join Date: May 2002
Location: Germany
Posts: 2,842
Default

Quote:
Originally Posted by MfA View Post
It's a good instruction, with an underlying texture cache better suited to point sampling than the one in Evergreen.
Since it's apparently only using L/S, I'd assume it wouldn't go through the Tex-Cache at all, but rather use the (larger) L1/Shared-Memory pool.
__________________
English is not my native tongue. Before flaming please consider the possibility that I did not mean to say what you might have read from my posts.
Work| Recreation
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration!
Old 19-Jan-2010, 15:17   #3447
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,779
Default

Quote:
Originally Posted by trinibwoy View Post
Well, you've completely lost me now. Isn't the single-offset gather4 instruction going to return you 4 point samples from a texel quad, which defeats the whole reason for jittering the samples in the first place?
I'm not sure, but I think MfA is saying that on ATI hardware you would fetch, say, all 16 samples from a 4x4 footprint using four gather4 instructions, but without gather4 you would use four jittered point sample instructions.

I think NVidia is saying that they can do the latter in one or two instructions, or equivalently they can gather 8-16 jittered point samples with the same four instructions. That wouldn't make a difference in a 4x4 sampling footprint, but it would if your samples were farther apart.
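A small Python sketch of the instruction counting in the two paragraphs above, under the assumption that one gather4 returns a 2x2 quad and one point sample returns one texel. The function names and the jitter offsets are invented for this example; nothing here models real hardware timing.

```python
def gather4_texels(x, y):
    """Texel coordinates returned by one single-offset gather4 (a 2x2 quad)."""
    return {(x + dx, y + dy) for dy in (0, 1) for dx in (0, 1)}

def cover_4x4_with_gather4(x0, y0):
    """Cover a 4x4 footprint with four gather4 fetches, one per 2x2 quad."""
    texels = set()
    instructions = 0
    for oy in (0, 2):
        for ox in (0, 2):
            texels |= gather4_texels(x0 + ox, y0 + oy)
            instructions += 1
    return texels, instructions

def jittered_points(x0, y0, offsets):
    """Texel coordinates hit by individually offset point samples."""
    return {(x0 + ox, y0 + oy) for ox, oy in offsets}

dense, dense_instr = cover_4x4_with_gather4(8, 8)
# Four jittered offsets spread wider than the 4x4 block (hypothetical values).
sparse = jittered_points(8, 8, [(-5, 2), (3, -6), (7, 4), (-1, 9)])

print(dense_instr, "gather4 instructions ->", len(dense), "texels, dense 4x4 block")
print(len(sparse), "point sample instructions ->", len(sparse), "texels, spread out")
```

With four instructions either way, gather4 yields 16 texels packed into a 4x4 block while point sampling yields 4 texels that can be anywhere. A four-offset gather would, on this counting, give up to 16 freely placed texels for the same four instructions, which only matters once the samples are farther apart than the 4x4 block.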
Old 19-Jan-2010, 15:27   #3448
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,220
Default

More samples are better even if they aren't ideally distributed, and on ATI you get the extra samples from a quad virtually for free with gather4, so you should simply try to make use of them ... the optimal algorithms for both architectures are neither here nor there though.

What NVIDIA is saying is that there is an instruction in HLSL which up to this point has remained hidden, and which, if you know it exists, you can decompile from assembler level and design hardware to run efficiently. Knowing it exists is a rather important step though; without that knowledge you simply wouldn't expect those types of assembly instructions. They make absolutely no sense on the original hardware from which gather4 came (HD3/4/5, where you will just take all the samples).

Last edited by MfA; 19-Jan-2010 at 15:33.
Old 19-Jan-2010, 15:43   #3449
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,809
Default

Quote:
Originally Posted by fellix View Post
I still don't get it: why is there a dedicated tessellator for each SM?
Good question, been wondering the same myself. It could be that each tessellator is assigned a different screen tile to minimize inter-SM communication. So some of them would see less work than others in a given frame but the peak throughput for tessellation heavy scenes would be high.

I still don't know why that matters if you can only cull / setup / rasterize 4 triangles per clock though. Maybe there's something they're doing in the GS to discard primitives before they even get to the setup/rasterization stages.

Quote:
Originally Posted by Mintmaster View Post
I think NVidia is saying that they can do the latter in one or two instructions, or equivalently they can gather 8-16 jittered point samples with the same four instructions. That wouldn't make a difference in a 4x4 sampling footprint, but it would if your samples were farther apart.
Right, but my question is whether or not developers are using point sampling. If they are, then AMD could certainly have implemented this optimization in hardware and simply grouped individual point sampling requests together. MfA is arguing that the reason they didn't consider it is that they may not have known about the four-offset gather HLSL instruction, but I'm not convinced.
__________________
What the deuce!?
Old 19-Jan-2010, 15:59   #3450
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,862
Default

Quote:
Originally Posted by MfA View Post
What NVIDIA is saying is that there is an instruction in HLSL which up to this point has remained hidden, and which, if you know it exists, you can decompile from assembler level and design hardware to run efficiently. Knowing it exists is a rather important step though; without that knowledge you simply wouldn't expect those types of assembly instructions. They make absolutely no sense on the original hardware from which gather4 came (HD3/4/5, where you will just take all the samples).
As you've been shown, ATI handles the individually-offset gather fine. Gather4 is just ATI's optimisation for when the data aligns within 128-bit buckets.

ATI always samples 128 bits at a time. If you choose to throw away 96 bits, then so be it. The hardware will not bother loading the 96 bits into registers if you choose not to use them. But the memory transaction is 128 bits.

The compiler should be able to coalesce distinct fetches when it sees they are using a common sample address, resulting in a 128-bit fetch, rather than several 32-bit fetches. That's very much dependent on how the code's written though, and wouldn't apply when the developer chooses to use nicely jittered samples, whose average is no greater than 32-bits of data per 128-bit sampling address.

Of course, looking at all the pixels, the average fetch per 128-bit address is likely to be higher. So global memory traffic won't show such a severe disparity in effort versus results. But there's no doubt that slowing the samplers down to one 32-bit result per pixel per clock is going to make ATI slow here (actually, 1/4 of that, once ALU:fetch is taken into account).
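The arithmetic behind "throw away 96 bits" can be sketched in a few lines of Python. This assumes 32-bit texels and that each 128-bit bucket is an aligned 2x2 quad of them; the bucket layout and the sample coordinates are assumptions for illustration, not a description of the actual memory layout on any ATI part.

```python
TEXEL_BITS = 32
BUCKET_BITS = 128

def bucket_of(x, y):
    """Assumed layout: each 128-bit bucket holds an aligned 2x2 texel quad."""
    return (x // 2, y // 2)

def fetch_stats(texels):
    """Return (buckets fetched, useful fraction of the fetched bits)."""
    buckets = {bucket_of(x, y) for x, y in texels}
    fetched_bits = len(buckets) * BUCKET_BITS
    useful_bits = len(set(texels)) * TEXEL_BITS
    return len(buckets), useful_bits / fetched_bits

# Four texels that fall inside one aligned quad: a single coalesced fetch.
quad = [(4, 4), (5, 4), (4, 5), (5, 5)]
# Four nicely jittered texels, each landing in a different bucket.
jittered = [(1, 7), (6, 2), (12, 9), (3, 14)]

for name, texels in (("quad", quad), ("jittered", jittered)):
    n_buckets, useful = fetch_stats(texels)
    print(name, "->", n_buckets, "bucket(s),", f"{useful:.0%}", "of fetched bits used")
```

The aligned quad uses all 128 bits of its one transaction, while the jittered pattern uses 32 of every 128 bits, i.e. 96 bits discarded per fetch. Any two jittered samples that happen to share a bucket would raise that fraction, which is the coalescing case described above.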

Jawed
__________________
Can it play WoW?
