Old 20-Jan-2010, 15:15   #3501
ShaidarHaran
hardware monkey
 
Join Date: Mar 2007
Posts: 3,900
Default

Re: triangle setup rate

are we 100% sure Fermi sets up 4x tri/clk (assuming 4 GPCs) in all situations?

I'm curious because this could finally be the answer to increased performance in MSFS.
Old 20-Jan-2010, 15:26   #3502
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,221
Default

Meh, I'm no journalist ... I wouldn't even know who to ask for official statements on this, and even if I did I don't think "to satisfy my personal curiosity" will convince them to spend time on me. So I just threw the conspiracy theory out there and hoped someone else would find out whether this is just a public document error or if it runs deeper. Someone who does run a site perhaps. It would be fine and dandy too if some of the AMD/IMG lurkers on this board could just say something like "Microsoft told us about this, the public documents are just a mess" (nudge nudge).
Old 20-Jan-2010, 15:33   #3503
CarstenS
Senior Member
 
Join Date: May 2002
Location: Germany
Posts: 2,842
Default

Quote:
Originally Posted by ShaidarHaran View Post
Re: triangle setup rate

are we 100% sure Fermi sets up 4x tri/clk (assuming 4 GPCs) in all situations?

I'm curious because this could finally be the answer to increased performance in MSFS.
Official docs say "up to" - as always. My question wrt exactly this hasn't been answered yet. Makes me wonder…
__________________
English is not my native tongue. Before flaming please consider the possibility that I did not mean to say what you might have read from my posts.
Work| Recreation
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration!
Old 20-Jan-2010, 15:51   #3504
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

Quote:
Originally Posted by CarstenS View Post
Official docs say "up to" - as always. My question wrt exactly this hasn't been answered yet. Makes me wonder…
My understanding of this is that they have four parallel units, but those units may at times be stalled waiting for the results of other units. They've put a lot of work into attempting to make sure that these four parallel geometry units are used as optimally as possible, but in reality we can hardly expect a 4x increase in any aspect of geometry performance.
Old 20-Jan-2010, 15:52   #3505
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 7,762
Default

Quote:
Originally Posted by Razor1 View Post
I'm not sure just a guess.
Of course claimed values are usually peak values, but I can't figure out at the moment (since I'm way too tired) why you couldn't process at least 2 tris/clock to feed four raster units.
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs.
Old 20-Jan-2010, 15:56   #3506
DavidGraham
Member
 
Join Date: Dec 2009
Posts: 581
Default

Quote:
Originally Posted by ShaidarHaran View Post
Ok, so we're still stuck at 1 tri/clk unless tessellating.

Too bad.
I must say that left me severely disappointed too! That means the chance of GF100 improving its overall performance due to the enhanced geometric design is low, in normal cases of course!

Add that to the not-so-huge texture improvements and possibly low clock speeds, and you get the picture of GF100 making about 80%, best case, of GTX285's performance, and hence an even lower advantage over HD5870 (possibly 20%). However, in tessellation it will trounce GTX 285 by a huge margin.

Unless that is wrong, and there are 4 tri/clk in normal rendering.
Old 20-Jan-2010, 15:57   #3507
ShaidarHaran
hardware monkey
 
Join Date: Mar 2007
Posts: 3,900
Default

Hmm, perhaps I should rephrase my question then.

Can Fermi set up more than 1 non-tessellated tri/clk?

edit: I have a feeling this discussion should be in the other thread.
Old 20-Jan-2010, 16:01   #3508
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

Quote:
Originally Posted by ShaidarHaran View Post
Hmm, perhaps I should rephrase my question then.

Can Fermi set up more than 1 non-tessellated tri/clk?
From the architecture layouts, it seems it should be able to do better than 1/clock. Because it has four parallel geometry units as opposed to one monolithic unit, it seems extremely doubtful that there could be any hardwired limitation to one triangle per clock.

That said, other limitations, such as bandwidth or ability to parallelize non-tessellated triangles between the units, may prevent much performance improvement from having the additional geometry units.
Old 20-Jan-2010, 16:15   #3509
CRoland
Member
 
Join Date: Jan 2010
Posts: 114
Default

Quote:
Originally Posted by MfA View Post
Meh, I'm no journalist ... I wouldn't even know who to ask for official statements on this, and even if I did I don't think "to satisfy my personal curiosity" will convince them to spend time on me. So I just threw the conspiracy theory out there and hoped someone else would find out whether this is just a public document error or if it runs deeper. Someone who does run a site perhaps. It would be fine and dandy too if some of the AMD/IMG lurkers on this board could just say something like "Microsoft told us about this, the public documents are just a mess" (nudge nudge).
Am I missing something or does it seem at least as likely that:
a) it was an honest omission and
b) the omission could hinder its use and actually hurt GF100's potential edge?
Old 20-Jan-2010, 16:16   #3510
DavidGraham
Member
 
Join Date: Dec 2009
Posts: 581
Default

Hey guys, could someone explain to me whether GF100 could output more than 1 tri/clk in non-tessellated situations or not, and why?

This question is in the other thread too ..
Old 20-Jan-2010, 16:35   #3511
Rys
Tiled
 
Join Date: Oct 2003
Location: Kings Langley, UK
Posts: 2,675
Default

Quote:
Originally Posted by ShaidarHaran View Post
are we 100% sure Fermi sets up 4x tri/clk (assuming 4 GPCs) in all situations?
For GF100, yes, but the aggregate rasterisation area is no bigger than prior hardware could rasterise in a clock.
__________________
A major redesign of the core ALU pineapple boomerang fortress.
Old 20-Jan-2010, 16:37   #3512
3dcgi
Senior Member
 
Join Date: Feb 2002
Posts: 2,019
Default

Quote:
Originally Posted by DemoCoder View Post
Well, because if each clock exactly one (u,v) coordinate is sent for domain shading, then the amplification amounts to bandwidth saving only, and tessellation into many small triangles won't be able to keep 1600 ALUs busy, since they'll all be waiting for rasterization of the next small triangle, which is going to take a few clocks to pop out. What you'd want is for the (u,v) to be sent to 64 different groups of domain shading ALUs, so that you can parallelize the tessellation as much as possible and not have those ALUs sitting idle.

On Fermi, you can be working on 16 different (u,v) values, separately domain shading, setting up the triangles, and rasterizing. So even if the polymorph engine can only tessellate one set of coordinates per clock, it can do 16 of them, as well as setup 4 outputs from domain shaders each clock. It's less bottlenecked.
I agree it's less bottlenecked with more (u,v) generation, but that's really a latency issue. If all the ALUs are needed to balance the DS and setup, the one-(u,v)-per-cluster option would take a little longer to reach steady state, but it would still reach it. How much of a performance impact this has will be determined by the amount of work done before switching to a workload that requires a different balance. This is still on the assumption that there's close to one new (u,v) per setup primitive.

Obviously more (u,v)'s per clock is better, but determining how much better is the tricky part.


Quote:
Originally Posted by ShaidarHaran View Post
Re: triangle setup rate

are we 100% sure Fermi sets up 4x tri/clk (assuming 4 GPCs) in all situations?

I'm curious because this could finally be the answer to increased performance in MSFS.
It's unclear if GF100 can achieve full rate without tessellation, but it should be easy to test with a custom app. The potential limitation is that there is a single index buffer per draw call. So they need to parallelize processing of the index buffer to make a single draw command run faster than 1x. This is non-trivial.
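
To illustrate the index-buffer point, here is a minimal Python sketch. It assumes a plain triangle list and a hypothetical scheme that deals fixed-size batches of triangles round-robin to four setup units. The names `split_index_buffer`, `shared_vertices`, `n_units` and `batch_tris` are made up for illustration; nothing here is known about how GF100 actually distributes work.

```python
from typing import List, Set, Tuple

Triangle = Tuple[int, ...]


def split_index_buffer(indices: List[int], n_units: int = 4,
                       batch_tris: int = 2) -> List[List[Tuple[int, Triangle]]]:
    """Deal batches of triangles from one index buffer to n_units queues.

    Every triangle keeps its original sequence number, so results could be
    put back into submission order afterwards (hypothetical scheme).
    """
    if len(indices) % 3 != 0:
        raise ValueError("a triangle list needs a multiple of 3 indices")
    tris = [tuple(indices[i:i + 3]) for i in range(0, len(indices), 3)]
    queues: List[List[Tuple[int, Triangle]]] = [[] for _ in range(n_units)]
    for seq, tri in enumerate(tris):
        unit = (seq // batch_tris) % n_units  # round-robin by batch
        queues[unit].append((seq, tri))
    return queues


def shared_vertices(queues: List[List[Tuple[int, Triangle]]]) -> Set[int]:
    """Vertex indices referenced by more than one unit's queue."""
    users = {}
    for unit, queue in enumerate(queues):
        for _, tri in queue:
            for v in tri:
                users.setdefault(v, set()).add(unit)
    return {v for v, units in users.items() if len(units) > 1}


# A strip-like mesh: triangle i uses vertices (i, i+1, i+2).
indices = [v for i in range(8) for v in (i, i + 1, i + 2)]
queues = split_index_buffer(indices)
print([len(q) for q in queues])
print(sorted(shared_vertices(queues)))
```

Even in this toy case, vertices on batch boundaries are needed by two units, and the sequence numbers have to be carried along to restore ordering. Those are the kinds of details that make splitting a single draw call harder than it looks.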
Old 20-Jan-2010, 16:39   #3513
Rys
Tiled
 
Join Date: Oct 2003
Location: Kings Langley, UK
Posts: 2,675
Default

It doesn't matter if it's tessellated by hardware or not, since you can draw tiny triangles all by yourself if you so wish.
__________________
A major redesign of the core ALU pineapple boomerang fortress.
Old 20-Jan-2010, 16:41   #3514
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,221
Default

Quote:
Originally Posted by CRoland View Post
Am I missing something or does it seem at least as likely that:
a) it was an honest omission and
From the public docs ... sure. If the other IHVs weren't made aware of it, then in combination with the fact that it's only at the HLSL level (not the assembly level, where nothing can be hidden because it's used for drivers), it becomes rather hard to believe.

This is what I basically said in the first post about this ... still as valid as then. Stop making me repeat myself ... are you people just tag teaming to make me dig myself in ever deeper or what?
Quote:
b) the omission could hinder its use and actually hurt GF100's potential edge?
I don't think DX11 engines are far enough in development for it to be an obstacle, especially if NVIDIA volunteers the work/code necessary to integrate it.
Old 20-Jan-2010, 16:44   #3515
DavidGraham
Member
 
Join Date: Dec 2009
Posts: 581
Default

Quote:
Originally Posted by Rys View Post
It doesn't matter if it's tessellated by hardware or not, since you can draw tiny triangles all by yourself if you so wish.
Thanks Mr. Rys, but I have to wonder: you guys said that the reason why nobody cared to double the number of hardware rasterizers is that you have to figure out what to do when triangles overlap or share vertices. How is that different in GF100's situation? How did Nvidia overcome this seemingly difficult obstacle?
Old 20-Jan-2010, 16:51   #3516
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

Quote:
Originally Posted by DavidGraham View Post
Thanks Mr. Rys, but I have to wonder: you guys said that the reason why nobody cared to double the number of hardware rasterizers is that you have to figure out what to do when triangles overlap or share vertices. How is that different in GF100's situation? How did Nvidia overcome this seemingly difficult obstacle?
Basically it's a problem of out-of-order execution. Anand goes into it a little bit here:
http://www.anandtech.com/video/showdoc.aspx?i=3721&p=2

Though I must say that I was mistaken. The GF100 has 16 geometry units, not 4. So I think we can definitely expect faster geometry throughput all around. That said, the triangle setup is in the raster engine, of which there are four, so we should expect, in ideal conditions, that the GF100 can do 4 triangles/clock (I don't think the raster engine has the same out-of-order execution problems as the PolyMorph engine).
Old 20-Jan-2010, 16:53   #3517
ShaidarHaran
hardware monkey
 
Join Date: Mar 2007
Posts: 3,900
Default

Quote:
Originally Posted by Rys View Post
For GF100, yes, but the aggregate rasterisation area is no bigger than prior hardware could rasterise in a clock.
I don't follow you here. To me this sounds like you are saying there is no benefit to this implementation.
Old 20-Jan-2010, 17:07   #3518
Alexko
Senior Member
 
Join Date: Aug 2009
Posts: 2,019
Default

Quote:
Originally Posted by Rys View Post
For GF100, yes, but the aggregate rasterisation area is no bigger than prior hardware could rasterise in a clock.
Does this mean that the aggregate rasterisation area will be smaller than GT200's on mainstream derivatives?
Old 20-Jan-2010, 17:18   #3519
CarstenS
Senior Member
 
Join Date: May 2002
Location: Germany
Posts: 2,842
Default

It's currently 8 ppc per raster unit. If triangles are larger than 32 pixels, you don't necessarily benefit; you only move the bottleneck from the tri setup to the rasterisers. Mainstream parts will be affected based on Nvidia's choice of implementation, i.e. their number of GPCs.
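
A back-of-the-envelope sketch of that argument in Python. It assumes an idealised pipeline where throughput is simply the minimum of the setup rate and the raster rate, with 4 raster units at 8 pixels per clock each (32 pixels per clock in aggregate) and perfect load balancing. The function names are made up for illustration, and the model ignores everything else in the chip.

```python
def triangles_per_clock(area_px: float, setup_rate: float,
                        raster_px_per_clock: float = 32.0) -> float:
    """Idealised throughput: limited by setup or by rasterisation."""
    if area_px <= 0:
        raise ValueError("triangle area must be positive")
    raster_limit = raster_px_per_clock / area_px  # triangles/clock the rasterisers can absorb
    return min(setup_rate, raster_limit)


def setup_speedup(area_px: float, old_setup: float = 1.0,
                  new_setup: float = 4.0) -> float:
    """Gain from raising the setup rate while raster throughput stays fixed."""
    return (triangles_per_clock(area_px, new_setup)
            / triangles_per_clock(area_px, old_setup))


for area in (64, 32, 16, 8, 4):
    print(area, setup_speedup(area))
```

Under these assumptions, going from 1 to 4 triangles per clock of setup gives nothing for triangles of 32 pixels or more, 2x at 16 pixels, and the full 4x only at 8 pixels or fewer.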
__________________
English is not my native tongue. Before flaming please consider the possibility that I did not mean to say what you might have read from my posts.
Work| Recreation
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration!
Old 20-Jan-2010, 17:22   #3520
Psycho
Member
 
Join Date: Jun 2008
Location: Copenhagen
Posts: 554
Default

Quote:
Originally Posted by Rys View Post
For GF100, yes, but the aggregate rasterisation area is no bigger than prior hardware could rasterise in a clock.
So only half that of Cypress (per clock)?
Strange that they can think so differently about the balance.
Old 20-Jan-2010, 17:24   #3521
Rys
Tiled
 
Join Date: Oct 2003
Location: Kings Langley, UK
Posts: 2,675
Default

Quote:
Originally Posted by Alexko View Post
Does this mean that the aggregate rasterisation area will be smaller than GT200's on mainstream derivatives?
We don't know yet, raster area may be variable (I'd expect so). If I'm honest, my biggest wonder in the last few days is whether there's really more than one unit at all. Parallelisable setup into one fixed-area rasteriser makes some sense, where it can work on up to four input triangles in a clock.
__________________
A major redesign of the core ALU pineapple boomerang fortress.
Old 20-Jan-2010, 17:30   #3522
ShaidarHaran
hardware monkey
 
Join Date: Mar 2007
Posts: 3,900
Default

Quote:
Originally Posted by CarstenS View Post
It's currently 8 ppc per raster unit. If triangles are larger than 32 pixels, you don't necessarily benefit; you only move the bottleneck from the tri setup to the rasterisers. Mainstream parts will be affected based on Nvidia's choice of implementation, i.e. their number of GPCs.
Ok, now I get it.

Thanks!
Old 20-Jan-2010, 17:31   #3523
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

Quote:
Originally Posted by XMAN26 View Post
I believe that if they keep this design going forward from here, then if and when they put support for 3+ monitors on 1 card, they'd have a leg up on ATI unless they change the setup rate as well. The number of triangles being rendered across, say, 3 24" WS LCDs is 3x as many as on a single 24". Take Crysis: 5-6M triangles per scene, so across 3 monitors you get 15-18M. For a good framerate of, say, an average 60 FPS, you're talking 900M to 1.08B triangles per second needed for that. So if GTX3xx can do 2400-2800M triangles/sec, had they done even just 3 monitors on one single-GPU card, I'd say they'd have a very clear and distinct advantage.
I don't think the setup rate will in any way help to run more displays, as the primary limitation there is simply the fill rate. Basically, yes, you're drawing more triangles, but you're also drawing more pixels, so the pixel/triangle ratio doesn't change.
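
A quick Python sketch of the arithmetic. It takes the quoted premise at face value (triangle count scales with the number of monitors) and assumes 1920x1200 per 24" display, which is not stated anywhere in the thread. The function names are made up for illustration.

```python
def triangles_per_second(tris_per_frame: int, monitors: int, fps: int) -> int:
    """Triangle rate if every monitor adds a full scene's worth of triangles."""
    return tris_per_frame * monitors * fps


def pixels_per_triangle(pixels_per_monitor: int, tris_per_monitor: int,
                        monitors: int) -> float:
    """Average pixels per triangle; the monitor count cancels out."""
    return (pixels_per_monitor * monitors) / (tris_per_monitor * monitors)


ASSUMED_PIXELS = 1920 * 1200  # assumed resolution of one 24" widescreen panel

print(triangles_per_second(5_000_000, 3, 60))  # low end of the quoted range
print(triangles_per_second(6_000_000, 3, 60))  # high end of the quoted range
print(pixels_per_triangle(ASSUMED_PIXELS, 5_000_000, 1))
print(pixels_per_triangle(ASSUMED_PIXELS, 5_000_000, 3))
```

With those inputs the triangle rate comes to 900M to 1.08B per second, while the pixels-per-triangle figure is the same for one monitor as for three. Pixel work scales up by the same factor as setup work, so a faster setup unit alone doesn't change which stage is the limit.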
Old 20-Jan-2010, 17:31   #3524
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
Default

I interpreted that the aggregate raster output rate is 32 pixels per 1/2 shader clock. Depending on where the shader clocks go and what the L2/ROP domain is set to, the bottleneck could go between raster and ROP depending on the exact configuration and clocks of a given derivative.
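
A rough sketch of how that comparison could be worked out in Python. The clocks and ROP counts below are placeholders and not known figures for any part, each ROP is assumed to write one pixel per clock of its own domain, and all names are hypothetical.

```python
def raster_rate_mpix(shader_clock_mhz: float,
                     px_per_half_clock: int = 32) -> float:
    """Raster output in Mpixels/s at 32 pixels per half-rate shader clock."""
    return px_per_half_clock * shader_clock_mhz / 2


def rop_rate_mpix(n_rops: int, rop_clock_mhz: float) -> float:
    """ROP output in Mpixels/s, assuming 1 pixel per ROP per clock."""
    return n_rops * rop_clock_mhz


def bottleneck(shader_clock_mhz: float, n_rops: int,
               rop_clock_mhz: float) -> str:
    """Name the slower of the two stages in this simplified model."""
    raster = raster_rate_mpix(shader_clock_mhz)
    rop = rop_rate_mpix(n_rops, rop_clock_mhz)
    if raster < rop:
        return "raster"
    if rop < raster:
        return "rop"
    return "balanced"


# Placeholder configurations, purely for illustration.
print(bottleneck(shader_clock_mhz=1400, n_rops=48, rop_clock_mhz=600))
print(bottleneck(shader_clock_mhz=1400, n_rops=16, rop_clock_mhz=600))
```

With the same placeholder clocks, the limit flips from raster to ROP as the ROP count drops, which is the kind of shift a cut-down derivative could see.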
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 20-Jan-2010, 18:01   #3525
Bouncing Zabaglione Bros.
Regular
 
Join Date: Jun 2003
Posts: 6,160
Default

Quote:
Originally Posted by Rys View Post
For GF100, yes, but the aggregate rasterisation area is no bigger than prior hardware could rasterise in a clock.
So... It's actually doing four quarter-triangles per clock? More detail, but not faster...?