Old 10-Jan-2010, 14:26   #3326
hatter
Junior Member
 
Join Date: Dec 2009
Posts: 31
Default

Quote:
Originally Posted by Sontin View Post
And anandtech said in october, that GT200 is EOL.
That's not where I focused. And as many said here, GT200 cards are indeed difficult to come by. For me the highlight was:

Quote:
We're hearing that the rumors of a March release are accurate, but despite the delay Fermi is supposed to be very competitive (at least 20% faster than 5870?)
I doubt that Anand will put something in his piece after hearing it from people whispering in a crowd. You can say many things about AnandTech but I have found that mostly they don't pay heed to rumours. He must have got the figure from someone reliable or connected to Fermi.
Old 10-Jan-2010, 16:42   #3327
A.L.M.
Member
 
Join Date: Jun 2008
Location: Looking for a place to call home
Posts: 144
Default

Quote:
Originally Posted by Sontin View Post
Really, what's the problem? He is counting the two L2 Cache together. Yeah, it's not right but AMD did the same with Hemlock: http://forum.beyond3d.com/showpost.p...postcount=4698
The problem is that L2 can't be shared, unless they created a dual core GPU. Which is quite hard to believe, due to the area of each core... 1000mm2 chip?
Old 10-Jan-2010, 16:47   #3328
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,079
Default

Quote:
Originally Posted by A.L.M. View Post
The problem is that L2 can't be shared, unless they created a dual core GPU. Which is quite hard to believe, due to the area of each core... 1000mm2 chip?
What that means is that L2 is coherent with respect to the 2 cores. This is obviously referring to a dual-chip card. TSMC's reticle limit is ~600 mm2, so they can't make chips bigger than that.
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
Old 11-Jan-2010, 06:29   #3329
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,321
Default

Quote:
Originally Posted by trinibwoy View Post
Full vertex buffer replication seems to be going in the opposite direction to where you want to be.
I meant the ones uploaded from the host. I didn't make it super explicit, but HOS would be tiled at the pre-tessellated level; as I said, "All transformed vertices get tiled and then written to buffers in the memory of the relevant GPUs for tessellation or rasterization" (should have said triangles and patches, not vertices). You obviously have to make some assumptions about the bounding volumes of patches, which developers could break and for which you would need application-specific fixes; application-specific code for SLI is hardly new, though. Also, you would need to tag all vertices with a 32-bit integer and have one queue per GPU so the rasterizer can resequence them into an ordered stream, but that's no different from what the hardware has to do internally now.
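
A minimal sketch of the resequencing step described above, assuming each rasterizing GPU holds one queue per source GPU and that every queue is already sorted by its sequence tag. The function name `resequence` and the sample tags are made up for illustration, and 32-bit tag wraparound is ignored.

```python
import heapq

def resequence(per_gpu_queues):
    """Merge queues of (sequence_tag, primitive) pairs into submission order.

    Each queue is assumed to be sorted by tag already, which holds if every
    source GPU emits its share of the stream in order.
    """
    merged = heapq.merge(*per_gpu_queues, key=lambda item: item[0])
    return [primitive for _, primitive in merged]

# Hypothetical queues seen by one rasterizing GPU: only primitives that
# landed in its tiles arrive, so the tags have gaps.
from_gpu0 = [(0, "tri_a"), (4, "tri_c")]
from_gpu1 = [(1, "tri_b"), (3, "patch_d"), (7, "tri_e")]
ordered = resequence([from_gpu0, from_gpu1])
print(ordered)
```

The merge only needs the head of each queue at any time, which is why one FIFO per source GPU is enough to restore the original order.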

It would be nice if sets of vertices came with bounding boxes so you could tile them at the start of the pipeline and get rid of all that communication, but they don't, so you can't ... you can multiply the vertex load to avoid the communication by simply transforming everything on all GPUs, but if you don't want to do that you are stuck doing tiling relatively late in the pipeline with only a 1/#GPU chance of a triangle/patch being local (ignoring overlap).
Quote:
For other buffers wouldn't you tag shared resources at the driver level and do replication in a push model instead of an on-demand pull? That seems like the best way to go given that you know up front which buffers are shared.
The problem with that is that you potentially have to transfer a whole lot more (let's say a buffer is used in some MRT technique: because of the correlation between the pixels being rendered and the texels needed, you are not going to need the entire buffer), and that it introduces a different kind of latency (the latency needed to transfer the whole buffer), which is potentially much larger than normal texture access latency (and could thus stall the rendering). If you use a modest tile size for replication (say 8x8) and give remote data accesses priority in the memory controllers, I think you could get latency close enough to normal texture access latency for normal multithreading to cover it.
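
A rough sketch of the transfer-volume side of that argument, comparing a full push of a shared buffer with pulling only the 8x8 tiles that are actually sampled. The buffer size, the sampled region and 4 bytes per texel are assumptions picked for illustration; latency is not modelled.

```python
TILE = 8  # tile edge in texels, the 8x8 figure from the post

def tiles_touched(accesses, tile=TILE):
    """Set of tile coordinates covering the sampled texels."""
    return {(x // tile, y // tile) for x, y in accesses}

def bytes_on_demand(accesses, bytes_per_texel=4, tile=TILE):
    """Bytes moved if only the touched tiles are replicated."""
    return len(tiles_touched(accesses, tile)) * tile * tile * bytes_per_texel

def bytes_full_push(width, height, bytes_per_texel=4):
    """Bytes moved if the whole buffer is pushed up front."""
    return width * height * bytes_per_texel

# Hypothetical case: a 1024x1024 buffer of which this GPU samples only
# the 256x256 corner that matches the pixels it is rendering.
accesses = [(x, y) for y in range(256) for x in range(256)]
pulled = bytes_on_demand(accesses)
pushed = bytes_full_push(1024, 1024)
print(pulled, pushed, pushed // pulled)
```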

Last edited by MfA; 11-Jan-2010 at 06:45.
Old 11-Jan-2010, 20:14   #3330
AnarchX
Senior Member
 
Join Date: Apr 2007
Posts: 1,396
Default

Maybe a part of Dual-GF100?
Efficient multi-chip GPU - United States Patent 7616206
(According to page 11: 35-50% more performance than a usual AFR SLI setup at 87.5% of its cost, through a reduced/shared framebuffer.)
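
Taking those patent figures at face value, the implied performance per unit cost relative to AFR SLI is simple arithmetic. This is only a sketch; `perf_per_cost` is a hypothetical helper and the inputs are the claimed numbers, not measurements.

```python
def perf_per_cost(relative_perf, relative_cost):
    """Performance per unit cost, both relative to the AFR SLI baseline."""
    return relative_perf / relative_cost

low = perf_per_cost(1.35, 0.875)   # +35% performance at 87.5% of the cost
high = perf_per_cost(1.50, 0.875)  # +50% performance at 87.5% of the cost
print(round(low, 2), round(high, 2))
```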

Last edited by AnarchX; 11-Jan-2010 at 20:46.
Old 11-Jan-2010, 20:45   #3331
OlegSH
Member
 
Join Date: Jan 2010
Posts: 119
Default

Quote:
Originally Posted by AnarchX View Post
Last time I checked the espacenet site there were many patents from NVIDIA related directly to multi-chip configurations, like the one you posted or like "Apparatus, system, and method for joint processing in graphics processing units" (http://v3.espacenet.com/publicationDetails/originalDocument?CC=US&NR=7633505B1&KC=B1&FT=D&date=20091215&DB=EPODOC&locale=ru_ru), http://v3.espacenet.com/publicationD...C&locale=ru_ru and others. Many of them will never be realized in practice.
Old 11-Jan-2010, 20:56   #3332
psurge
Member
 
Join Date: Feb 2002
Location: LA, California
Posts: 826
Default

Looks very similar to what MfA proposed, minus the vertex load round-robin step and tiling of patches/tris. Maybe the latter isn't worth it for screen-tile sizes that give good pixel load-balancing?
Old 11-Jan-2010, 21:57   #3333
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,809
Default

Quote:
Originally Posted by psurge View Post
Looks very similar to what MfA proposed, minus the vertex load round-robin step and tiling of patches/tris. Maybe the latter isn't worth it for screen-tile sizes that give good pixel load-balancing?
When I first read that patent I thought it was another brain dump that would never actually make it to market. It presumes higher scalability than AFR yet we see AFR setups achieving 80-100% scaling nowadays. It seems to me that AFR could benefit just as much from this higher bandwidth bus. Processing the same geometry on both chips isn't going to scale in this brave new world of tessellation.
__________________
What the deuce!?
Old 11-Jan-2010, 23:27   #3334
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,321
Default

Quote:
Originally Posted by psurge View Post
Looks very similar to what MfA proposed, minus the vertex load round-robin step and tiling of patches/tris. Maybe the latter isn't worth it for screen-tile sizes that give good pixel load-balancing?
If you don't equally distribute the vertex load, let's say you let one GPU handle it all, you have to rely on frame-to-frame coherence to determine the vertex/pixel load ratio so you know how many tiles each GPU gets to render. The round-robin vertex shading ensures that giving each GPU an equal share of the tiles will result in virtually perfect load balancing (with identical GPUs). That said, NVIDIA suggests doubling the vertex load in the most explicit patent (7616206), which AnarchX linked ... it ain't going to outscale AFR like that in modern games.

The only way this kind of rendering will beat AFR is by convincing consumers it's superior regardless of benchmarks. I'm convinced they could manage that with equal benchmark results, though I doubt they would get there while doubling the vertex load; in general it might be a little harder.
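
The load-balancing argument in this post can be put into a toy model. It assumes vertex and pixel work are expressed as single-GPU time per frame, ignores all inter-GPU communication, and uses made-up numbers (30% vertex, 70% pixel); the function names are hypothetical.

```python
def frame_time_round_robin(vertex, pixel, gpus):
    """Vertex work shared round robin, tiles split evenly."""
    return (vertex + pixel) / gpus

def frame_time_duplicated(vertex, pixel, gpus):
    """Every GPU transforms all geometry, tiles split evenly."""
    return vertex + pixel / gpus

def speedup(single_gpu_time, multi_gpu_time):
    return single_gpu_time / multi_gpu_time

vertex, pixel, gpus = 3.0, 7.0, 2  # arbitrary time units
single = vertex + pixel
shared = speedup(single, frame_time_round_robin(vertex, pixel, gpus))
doubled = speedup(single, frame_time_duplicated(vertex, pixel, gpus))
print(shared, round(doubled, 3))  # ideal AFR would also reach a factor of `gpus`
```

With these inputs the duplicated-vertex scheme tops out near 1.54x on two GPUs, and the gap widens as the vertex share grows, which is the tessellation concern raised earlier in the thread.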

Last edited by MfA; 11-Jan-2010 at 23:50.
Old 12-Jan-2010, 02:48   #3335
psurge
Member
 
Join Date: Feb 2002
Location: LA, California
Posts: 826
Default

MfA,
That makes sense. According to the patent though, processing tiles in a checkerboard pattern gives good pixel load balancing since workloads for adjacent tiles are statistically similar (so I'm guessing the tiles are pretty small). Perhaps the ability to tessellate only primitives covering tiles owned by a GPU is present; if not, I agree that AFR will be difficult to beat in geometry-heavy scenarios.
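
A small sketch of why a checkerboard assignment balances well when neighbouring tiles cost about the same. The 8x8 tile grid and the synthetic cost map (cost grows from left to right) are assumptions for illustration, not taken from the patent.

```python
def checkerboard_owner(tx, ty, num_gpus=2):
    """GPU that owns tile (tx, ty) in a checkerboard layout."""
    return (tx + ty) % num_gpus

def load_per_gpu(cost, owner, num_gpus=2):
    """Sum per-tile cost for each GPU under a given ownership function."""
    loads = [0] * num_gpus
    for ty, row in enumerate(cost):
        for tx, tile_cost in enumerate(row):
            loads[owner(tx, ty)] += tile_cost
    return loads

# Synthetic workload: tiles get more expensive toward the right edge.
cost = [[tx for tx in range(8)] for ty in range(8)]
checker = load_per_gpu(cost, checkerboard_owner)
halves = load_per_gpu(cost, lambda tx, ty: 0 if tx < 4 else 1)
print(checker, halves)
```

The left/right split puts most of the expensive tiles on one GPU, while the checkerboard gives both GPUs the same total.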

To scale from 2 GPUs to 4, the patent seems to imply that standard AFR will be used, which is a bit disappointing. I know squat about bus PHY, but maybe their ambition was limited by the difficulty of creating 25GB/s+ connections between GPUs on separate PCBs? I'm thinking the whole "reuse a memory controller for inter-chip communication" approach might mean that there are some pretty tight constraints on GPU placement for things to work. I guess the attraction must be that such reuse means you don't have HT link(s)/Intel-equivalent sitting around unused on the single GPU cards (or you get extra redundancy).
Old 16-Jan-2010, 07:33   #3336
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 7,821
Default

Quote:
Originally Posted by DemoCoder View Post
What do you think about running the setup in its own clock domain (e.g. hot clock) rather than using multiple units? That would seem like the simplest way to do it.
Dumb question: irrespective of the frequency used, isn't that more a dilemma of whether to have 1 setup unit with >1 tri/clock vs. 2 setup units with 1 tri/clock?
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs.
Old 16-Jan-2010, 08:44   #3337
DemoCoder
Regular
 
Join Date: Feb 2002
Location: California
Posts: 4,732
Default

Quote:
Originally Posted by Ailuros View Post
Dumb question: irrespective of the frequency used, isn't that more a dilemma of whether to have 1 setup unit with >1 tri/clock vs. 2 setup units with 1 tri/clock?
My thoughts were, if you have two units, then all of a sudden you need extra logic to dispatch tris between them, buffer/route the results, etc = more complexity.
Old 16-Jan-2010, 11:21   #3338
sethk
Junior Member
 
Join Date: May 2004
Posts: 92
Default

The problem with AFR is the frame latency, which gets annoying when you're talking about sub-60 FPS dips during gameplay, where the latency and micro-stutter become very noticeable. It's one of the main reasons why I dislike multi-GPU setups in practice, regardless of synthetic numbers and benchmarks. Even if split-frame techniques displayed lower scaling than AFR, when it came to actually playing games, split-frame would be the mode I would pick. Now if it was actually faster than AFR even in synthetic benchmarks, that would really be something.
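
A toy model of the latency point, under simplifying assumptions: each GPU needs a fixed time for a whole frame, AFR overlaps frames perfectly, and split-frame rendering reaches only a fraction of ideal scaling. The 40 ms frame time and 70% efficiency are invented numbers, and micro-stutter and driver frame queuing are not modelled.

```python
def afr(single_gpu_frame_ms, gpus):
    """AFR: throughput scales, but each frame still takes a full frame time."""
    return {
        "fps": 1000.0 * gpus / single_gpu_frame_ms,
        "latency_ms": single_gpu_frame_ms,
    }

def split_frame(single_gpu_frame_ms, gpus, efficiency):
    """Split frame: all GPUs work on one frame, with imperfect scaling."""
    frame_ms = single_gpu_frame_ms / (gpus * efficiency)
    return {"fps": 1000.0 / frame_ms, "latency_ms": frame_ms}

afr_result = afr(40.0, 2)                # 25 fps on one GPU
sfr_result = split_frame(40.0, 2, 0.7)   # assumed 1.4x scaling
print(afr_result, sfr_result)
```

In this model split-frame loses on frames per second (35 vs 50) but each frame reaches the screen sooner (about 28.6 ms vs 40 ms).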
Old 16-Jan-2010, 11:48   #3339
OlegSH
Member
 
Join Date: Jan 2010
Posts: 119
Default

Quote:
Originally Posted by sethk View Post
Even if split frame techniques displayed lower scaling than AFR, when it came to actually playing games, split-frame would be the mode I would pick. Now if it was actually faster than AFR even in synthetic benchmarks, that would really be something.
And what about algorithms that accumulate into a render target? You need to hold a copy of that RT in the local memory of each GPU; that's also a case where AFR doesn't scale perfectly.
Old 16-Jan-2010, 11:54   #3340
OlegSH
Member
 
Join Date: Jan 2010
Posts: 119
Default

I mean accumulation from previous frames and other frame-dependent algorithms.
Old 16-Jan-2010, 17:45   #3341
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,786
Default

Quote:
Originally Posted by DemoCoder View Post
What do you think about running the setup in its own clock domain (e.g. hot clock) rather than using multiple units? That would seem like the simplest way to do it.
Seems like a dead end investment. If you're going to bother with improving setup, might as well use an approach that scales with future architectures rather than just get a factor of 2 improvement.

If you look back to the GF3->GF4 transition, NVidia could have tried to use a hot clock for the vertex engine, but decided that parallelizing made far more sense. With DX11 tessellation, this was the perfect time to upgrade the setup engine. That's why I was dismayed that AMD didn't do it with R8xx.
Old 16-Jan-2010, 18:07   #3342
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,786
Default

Quote:
Originally Posted by MfA View Post
There is complexity, but is there expensive complexity? I still fail to see the big deal.
We all have trouble seeing the big deal, but the fact is that the brilliant design teams at ATI and NVidia have not done more than one tri per clock up to this point. AFAIK, they don't even discard culled triangles faster than that, and they are a huge source of pipeline bubbles on the pixel side. This lesser optimization doesn't even require any changes to the rasterizer.

There must be something that we're missing, because even if the average speed gain is only 5%, remember that R300 does one tri per clock and is only 100M transistors. We're looking at under 1% board cost to get that 5%, and you probably need 15% faster RAM to get that same boost.

I guess it could just be an "if it ain't broke, don't fix it" situation. I did a lot of analysis on R200 at ATI, and I found a bug originating in the setup engine that cost 90 clocks for each culled triangle needing dependent texturing.
Old 16-Jan-2010, 18:36   #3343
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,321
Default

Quote:
Originally Posted by Mintmaster View Post
We all have trouble seeing the big deal, but the fact is that the brilliant design teams at ATI and NVidia have not done more than one tri per clock up to this point. AFAIK, they don't even discard culled triangles faster than that, and they are a huge source of pipeline bubbles on the pixel side.
Half or quarter of huge is still huge ... either way you need buffering to take the edge off that effect. The solution is hierarchical culling, not parallel setup.

PS. I think the splitup of this and the old thread is not going as planned.
Old 16-Jan-2010, 19:02   #3344
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,786
Default

Quote:
Originally Posted by MfA View Post
Half or quarter of huge is still huge ... either way you need buffering to take the edge off that effect. The solution is hierarchical culling, not parallel setup.
I'm not really talking about huge, though, I'm talking about significant. And half/quarter of that would definitely be far less significant. Remember that visible triangles keep the shaders busy, and speeding up setup at that time is less useful. I'm sure it's the large clumps of culled triangles that occupy the majority of the time that the shaders are idling, and small triangles resulting in 1-2 quads per clock with simple shaders are a secondary problem.

FYI, when I say setup, I mean everything between the vertex shaders (well, I guess geometry shaders) and the pixel shaders, so I'm including culling/clipping. When you say hierarchical culling, are you talking about the software side?
Quote:
PS. I think the splitup of this and the old thread is not going as planned.
My bad. Maybe a mod can move all the setup discussion to the architecture thread.
Old 16-Jan-2010, 19:27   #3345
sethk
Junior Member
 
Join Date: May 2004
Posts: 92
Default

I can't think of any scenario where split frame would have worse latency than AFR, can you?
Old 16-Jan-2010, 20:14   #3346
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,786
Default

Quote:
Originally Posted by MfA View Post
Nah, was talking about backface culling on the GPU ... without hierarchical backface culling you are always going to get huge spans of vertex shading producing bugger all for the pixel shaders. Even if setup wasn't a problem that isn't a nice thing to do.
Do you have any links to the kind of hierarchical algorithm you're envisioning?

Personally, I think parallelized culling is adequate. Even if it was fixed function, the cost should be minimal to test, say, 8 tris per clock with fixed data paths. No need to go beyond that until you boost rasterizer speed. You'll still get spans of producing bugger all for the pixel shaders, but they'll be 8 times smaller and that should be good enough, IMO.
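
A minimal sketch of the kind of parallel cull test discussed here: a screen-space signed-area check applied to a batch, plus the clock count for a run of culled triangles at a given test width. Counter-clockwise winding as front-facing and the width of 8 are assumptions; real hardware details are not known from this thread.

```python
def signed_area2(tri):
    """Twice the signed area of a 2D triangle; positive means counter-clockwise."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)

def cull_batch(tris):
    """Keep only front-facing triangles (counter-clockwise, by assumption)."""
    return [tri for tri in tris if signed_area2(tri) > 0]

def clocks_for_culled_run(num_tris, tests_per_clock=8):
    """Clocks spent on a run of culled triangles (ceiling division)."""
    return -(-num_tris // tests_per_clock)

front = ((0, 0), (1, 0), (0, 1))  # counter-clockwise
back = ((0, 0), (0, 1), (1, 0))   # clockwise
kept = cull_batch([front, back, back, front])
print(len(kept), clocks_for_culled_run(1000, 1), clocks_for_culled_run(1000, 8))
```

A run of 1000 back-facing triangles drops from 1000 clocks to 125 in this model, which is the "8 times smaller" span mentioned above.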
Old 16-Jan-2010, 20:17   #3347
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,786
Default

Quote:
Originally Posted by trinibwoy View Post
One practical optimization would be to cull backfacing HOS primitives to avoid the unnecessary tessellation hit. But obviously you need higher setup anyway if the number of visible polys is going to increase substantially with tessellation.
I don't think that's practical to do, because backfacing HOS primitives can generate frontfacing triangles after tessellation, especially when doing displacement.
Old 16-Jan-2010, 20:20   #3348
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,809
Default

Good point. Guess not.
__________________
What the deuce!?
Old 17-Jan-2010, 00:30   #3349
DemoCoder
Regular
 
Join Date: Feb 2002
Location: California
Posts: 4,732
Default

Quote:
Originally Posted by ChrisRay View Post
Specifically, I said Fermi's tessellation engine is impressive. I think it's the biggest investment Nvidia has put into accelerating new API features in a very long time. And what I mean by that is that Nvidia's tessellation engine is certainly not implemented in a half-assed way. And I stand by that statement. It's not long from now when everyone will have all their information about it.
There's an interesting thing happening in forums as these revelations come out. Months ago, there was much optimism and props given to AMD for their focus on tessellation in DX11, and from that came the assumption that NVidia put no work into it, and if they supported it at all, it would be some late addition, half-assed, bolted-on, or emulated tessellation and would not perform as well as AMD's. I'll note for the record that much the same story was repeated with Geometry Shaders (speculation that NVidia would suck at it, and that the R600 would be the only 'true' DX10 chip). AMD has had some form of tessellation for several generations, all the way back to N-patches, so there's some logic to these beliefs. Also, the Fermi announcement mentioned nothing about improvements to graphics (only compute), so there has been a tacit assumption that the rest of the chip is basically a G8x with Fermi CUDA tacked on.

But as more and more leaks seem to indicate that NVidia has invested significant design work into making tessellation run very fast, it seems like some are in disbelief, while others are now starting to downplay the importance of tessellation performance and benchmarks (whereas once it was taken for granted that this was AMD's strong point). If indeed NVidia has significantly boosted triangle setup, culling, and tessellation, this could be like G80 all over again, where the total lack of information caused people to assume the worst, and the final chip came as a big surprise. I think they deserve much props if they did increase setup rate.

As Mint said, it's been far too long to leave this part of the chips unchanged. Setup seems exactly where it was 10 years ago.
Old 17-Jan-2010, 15:31   #3350
PSU-failure
Member
 
Join Date: May 2007
Posts: 249
Default

Quote:
Originally Posted by Jawed View Post
Setup in RV670 is 1 triangle per clock isn't it?
Take a look here: http://www.behardware.com/articles/7...x-280-260.html

The results of RightMark's VS tests point to a 0.5 triangle/clock setup rate for RV670, which scores ~270 at 775MHz while R600 scores ~600 at 750MHz. RV770 scores ~650, probably because it is only partially setup limited.
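
The per-clock comparison behind that estimate can be worked through directly from the quoted scores. This sketch assumes the test is purely setup limited and that R600 sustains 1 triangle/clock in it; both are assumptions, and the scores are the approximate figures quoted above.

```python
def score_per_mhz(score, clock_mhz):
    """Benchmark score normalised by core clock."""
    return score / clock_mhz

r600 = score_per_mhz(600, 750)   # ~600 at 750MHz
rv670 = score_per_mhz(270, 775)  # ~270 at 775MHz

# Setup rate of RV670 relative to R600, clock for clock.
relative_rate = rv670 / r600
print(round(r600, 3), round(rv670, 3), round(relative_rate, 3))
```

The ratio comes out at roughly 0.44, which is in the neighbourhood of the 0.5 triangle/clock reading given the rounded scores.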

Tags
delay, fermi, geforce, gf100

All times are GMT +1. The time now is 17:52.

