#3326 | |
|
Junior Member
Join Date: Dec 2009
Posts: 31
|
That's not where I focused. And as many said here, GT200 cards are indeed difficult to come by. For me the highlight was:
Quote:
|
|
|
|
|
|
|
#3327 | |
|
Member
Join Date: Jun 2008
Location: Looking for a place to call home
Posts: 144
|
Quote:
|
|
|
|
|
|
|
#3328 |
|
Senior Member
|
What that means is that L2 is coherent with respect to the 2 cores. This is obviously referring to a dual-chip card. TSMC's reticle limit is ~600 mm², which means they can't make chips bigger than that.
|
|
|
|
|
|
#3329 | ||
|
Regular
|
Quote:
It would be nice if sets of vertices came with bounding boxes, so you could tile them at the start of the pipeline and get rid of all that communication, but they don't, so you can't. You can avoid the communication by simply transforming everything on all GPUs, at the cost of multiplying the vertex load. If you don't want to do that, you are stuck doing the tiling relatively late in the pipeline, with only a 1/#GPU chance of a triangle/patch being local (ignoring overlap).
Quote:
Last edited by MfA; 11-Jan-2010 at 06:45. |
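The 1/#GPU locality figure above can be checked with a rough sketch. It assumes vertices are handed out round-robin and that screen tiles are interleaved across GPUs; the tile size, the function names, and the random "scene" are all made up for illustration, and each triangle is treated as a single point (so overlap is ignored, as in the post).

```python
import random

def tile_owner(x, y, n_gpus, tile=16):
    """GPU owning the screen tile that contains pixel (x, y).

    Interleaved tile ownership is an assumption for this sketch.
    """
    return (x // tile + y // tile) % n_gpus

def local_fraction(n_tris, n_gpus, width=1920, height=1200, seed=0):
    """Fraction of triangles that land on the same GPU that transformed them."""
    rng = random.Random(seed)
    local = 0
    for i in range(n_tris):
        transformed_on = i % n_gpus                 # round-robin vertex load
        x = rng.randrange(width)                    # triangle treated as a point
        y = rng.randrange(height)
        if tile_owner(x, y, n_gpus) == transformed_on:
            local += 1
    return local / n_tris

if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "GPUs:", round(local_fraction(20000, n), 3))
```

With 2 GPUs about half of the triangles have to cross the link, and with 4 GPUs about three quarters do, which is why the communication cost matters.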
||
|
|
|
|
|
#3330 |
|
Senior Member
Join Date: Apr 2007
Posts: 1,396
|
Maybe a part of Dual-GF100?
Efficient multi-chip GPU - United States Patent 7616206 (According to page 11, 35-50% more performance than a usual AFR SLI setup at 87.5% of its cost, through a reduced/shared framebuffer.) Last edited by AnarchX; 11-Jan-2010 at 20:46. |
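For what the quoted patent figures would mean in performance per cost, here is a small back-of-the-envelope sketch. It only uses the numbers cited above (35-50% more performance, 87.5% of the cost, both relative to an AFR SLI setup); the function name is made up.

```python
def perf_per_cost(perf_gain, relative_cost):
    """Performance per unit cost relative to the AFR baseline (baseline = 1.0)."""
    return (1.0 + perf_gain) / relative_cost

low = perf_per_cost(0.35, 0.875)    # pessimistic end of the quoted range
high = perf_per_cost(0.50, 0.875)   # optimistic end of the quoted range

if __name__ == "__main__":
    print(f"perf/cost vs AFR: {low:.2f}x to {high:.2f}x")
```

Taken at face value, the patent's claim works out to roughly 1.54x to 1.71x the performance per cost of the AFR setup it compares against.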
|
|
|
|
|
#3331 | |
|
Member
Join Date: Jan 2010
Posts: 119
|
Quote:
|
|
|
|
|
|
|
#3332 |
|
Member
Join Date: Feb 2002
Location: LA, California
Posts: 826
|
Looks very similar to what MfA proposed, minus the vertex load round-robin step and tiling of patches/tris. Maybe the latter isn't worth it for screen-tile sizes that give good pixel load-balancing?
|
|
|
|
|
|
#3333 |
|
Meh
Join Date: Mar 2004
Location: New York
Posts: 9,809
|
When I first read that patent I thought it was another brain dump that would never actually make it to market. It presumes higher scalability than AFR yet we see AFR setups achieving 80-100% scaling nowadays. It seems to me that AFR could benefit just as much from this higher bandwidth bus. Processing the same geometry on both chips isn't going to scale in this brave new world of tessellation.
__________________
What the deuce!? |
|
|
|
|
|
#3334 | |
|
Regular
|
Quote:
The only way this kind of rendering will beat AFR is by convincing consumers it's superior regardless of benchmarks. I'd be convinced if they managed it with equal benchmarks, though I doubt they would get there while doubling the vertex load; convincing people in general might be a little harder. Last edited by MfA; 11-Jan-2010 at 23:50. |
|
|
|
|
|
|
#3335 |
|
Member
Join Date: Feb 2002
Location: LA, California
Posts: 826
|
MfA,
That makes sense. According to the patent though, processing tiles in a checkerboard pattern gives good pixel load balancing since workloads for adjacent tiles are statistically similar (so I'm guessing the tiles are pretty small). Perhaps the ability to tessellate only primitives covering tiles owned by a GPU is present; if not, I agree that AFR will be difficult to beat in geometry-heavy scenarios. To scale from 2 GPUs to 4, the patent seems to imply that standard AFR will be used, which is a bit disappointing. I know squat about bus PHY, but maybe their ambition was limited by the difficulty in creating 25GB/s+ connections between GPUs on separate PCBs? I'm thinking the whole "reuse a memory controller for inter-chip communication" approach might mean that there are some pretty tight constraints on GPU placement for things to work. I guess the attraction must be that such reuse means you don't have HT link(s)/Intel-equivalent sitting around unused on the single-GPU cards (or you get extra redundancy). |
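The point about small checkerboard tiles and load balancing can be illustrated with a toy model. This is only a sketch under made-up assumptions: a 256x256 screen where the top-left quadrant is four times as expensive per pixel as the rest, and a `(tx + ty) % n_gpus` ownership rule that is not taken from the patent.

```python
def checkerboard_owner(tx, ty, n_gpus=2):
    """GPU owning tile (tx, ty) in a checkerboard layout (assumed rule)."""
    return (tx + ty) % n_gpus

def hotspot_scene(width=256, height=256):
    """Per-pixel cost map: top-left quadrant costs 4.0, everything else 1.0."""
    return [
        [4.0 if (x < width // 2 and y < height // 2) else 1.0 for x in range(width)]
        for y in range(height)
    ]

def load_per_gpu(cost, tile, n_gpus=2):
    """Total pixel cost assigned to each GPU for a given tile size."""
    loads = [0.0] * n_gpus
    for y, row in enumerate(cost):
        for x, c in enumerate(row):
            loads[checkerboard_owner(x // tile, y // tile, n_gpus)] += c
    return loads

if __name__ == "__main__":
    scene = hotspot_scene()
    print("tile=16: ", load_per_gpu(scene, 16))
    print("tile=128:", load_per_gpu(scene, 128))
```

With 16-pixel tiles each GPU gets an equal share of the expensive quadrant, while with 128-pixel tiles one GPU ends up owning the whole hotspot, which fits the guess that the tiles have to be fairly small.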
|
|
|
|
|
#3336 |
|
Epsilon plus three
Join Date: Feb 2002
Location: Chania
Posts: 7,821
|
Dumb question: irrespective of the frequency used, isn't it more of a dilemma whether to have 1 setup unit with >1 tri/clock vs. 2 setup units with 1 tri/clock each?
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs. |
|
|
|
|
|
#3337 |
|
Regular
Join Date: Feb 2002
Location: California
Posts: 4,732
|
My thoughts were, if you have two units, then all of a sudden you need extra logic to dispatch tris between them, buffer/route the results, etc = more complexity.
|
|
|
|
|
|
#3338 |
|
Junior Member
Join Date: May 2004
Posts: 92
|
The problem with AFR is the frame latency, which gets annoying when you're talking about sub-60 FPS dips during gameplay, where the latency and micro-stutter become very noticeable. It's one of the main reasons why I dislike multi-GPU setups in practice, regardless of synthetic numbers and benchmarks. Even if split-frame techniques displayed lower scaling than AFR, when it came to actually playing games, split-frame would be the mode I would pick. Now if it was actually faster than AFR even in synthetic benchmarks, that would really be something.
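To put rough numbers on the latency argument, here is a deliberately simplified model. It assumes perfect scaling, that in AFR each GPU spends `n_gpus` frame intervals on its own frame, and that in split-frame mode all GPUs finish the same frame within one interval. CPU queuing, driver buffering, and micro-stutter are ignored, and the function name is made up.

```python
def render_latency_ms(fps, n_gpus, mode):
    """Approximate time from the start of rendering a frame until it is finished."""
    frame_interval = 1000.0 / fps           # time between displayed frames
    if mode == "afr":
        return n_gpus * frame_interval      # each GPU works alone on one frame
    if mode == "split":
        return frame_interval               # all GPUs share the same frame
    raise ValueError(f"unknown mode: {mode}")

if __name__ == "__main__":
    print(render_latency_ms(40, 2, "afr"))    # 50.0
    print(render_latency_ms(40, 2, "split"))  # 25.0
```

At a displayed 40 FPS on two GPUs, this model gives 50 ms per frame under AFR against 25 ms for split-frame, i.e. AFR feels like a single GPU running at 20 FPS in terms of render latency.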
|
|
|
|
|
|
#3339 |
|
Member
Join Date: Jan 2010
Posts: 119
|
And what about algorithms that accumulate into a render target? You need to hold a copy of that RT in the local memory of each GPU, and that's also a case where AFR doesn't scale perfectly.
|
|
|
|
|
|
#3340 |
|
Member
Join Date: Jan 2010
Posts: 119
|
I mean accumulation from previous frames and other frame-dependent algorithms.
|
|
|
|
|
|
#3341 | |
|
Senior Member
Join Date: Mar 2002
Posts: 3,786
|
Quote:
If you look back to the GF3->GF4 transition, NVidia could have tried to use a hot clock for the vertex engine, but decided that parallelizing made far more sense. With DX11 tessellation, this was the perfect time to upgrade the setup engine. That's why I was dismayed that AMD didn't do it with R8xx. |
|
|
|
|
|
|
#3342 | |
|
Senior Member
Join Date: Mar 2002
Posts: 3,786
|
Quote:
There must be something that we're missing, because even if the average speed gain is only 5%, remember that R300 does one tri per clock and is only 100M transistors. We're looking at under 1% board cost to get that 5%, and you probably need 15% faster RAM to get that same boost. I guess it could just be an "if it ain't broke, don't fix it" situation. I did a lot of analysis on R200 at ATI, and I found a bug originating in the setup engine that cost 90 clocks for each culled triangle needing dependent texturing. |
|
|
|
|
|
|
#3343 | |
|
Regular
|
Quote:
PS. I think the splitup of this and the old thread is not going as planned. |
|
|
|
|
|
|
#3344 | ||
|
Senior Member
Join Date: Mar 2002
Posts: 3,786
|
Quote:
FYI, when I say setup, I mean everything between the vertex shaders (well, I guess geometry shaders) and the pixel shaders, so I'm including culling/clipping. When you say hierarchical culling, are you talking about the software side?
Quote:
|
||
|
|
|
|
|
#3345 |
|
Junior Member
Join Date: May 2004
Posts: 92
|
I can't think of any scenario where split frame would have worse latency than AFR, can you?
|
|
|
|
|
|
#3346 | |
|
Senior Member
Join Date: Mar 2002
Posts: 3,786
|
Quote:
Personally, I think parallelized culling is adequate. Even if it was fixed function, the cost should be minimal to test, say, 8 tris per clock with fixed data paths. No need to go beyond that until you boost rasterizer speed. You'll still get spans of producing bugger all for the pixel shaders, but they'll be 8 times smaller and that should be good enough, IMO. |
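The "8 times smaller" spans can be sketched with a toy clock count. This assumes visible triangles go through the rasterizer at 1 per clock while runs of culled triangles are rejected at `cull_rate` per clock; the stream and the function name are made up for illustration.

```python
from itertools import groupby
import math

def setup_clocks(visible_flags, cull_rate):
    """Clocks needed to get a triangle stream through setup in this toy model."""
    clocks = 0
    for visible, run in groupby(visible_flags):
        n = len(list(run))
        if visible:
            clocks += n                          # rasterizer: 1 tri per clock
        else:
            clocks += math.ceil(n / cull_rate)   # pixel shaders starve meanwhile
    return clocks

if __name__ == "__main__":
    # 100 visible tris, then a run of 800 culled ones, then 100 visible again.
    stream = [True] * 100 + [False] * 800 + [True] * 100
    print(setup_clocks(stream, 1))  # 1000
    print(setup_clocks(stream, 8))  # 300
```

In this example the stall in front of the pixel shaders shrinks from 800 clocks to 100, without touching the rasterizer rate at all.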
|
|
|
|
|
|
#3347 |
|
Senior Member
Join Date: Mar 2002
Posts: 3,786
|
I don't think that's practical to do, because backfacing HOS primitives can generate frontfacing triangles after tessellation, especially when doing displacement.
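A minimal numeric example of that effect, assuming an orthographic view and a facing test based on the sign of the dot product between the triangle normal and the direction towards the camera. The coordinates and helper names are invented for the sketch.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def is_front_facing(tri, to_camera):
    """True if the triangle's normal (from its winding) points towards the camera."""
    a, b, c = tri
    normal = cross(sub(b, a), sub(c, a))
    return dot(normal, to_camera) > 0

# Camera direction grazing the patch from slightly behind it (orthographic view).
to_camera = (1.0, 0.0, -0.2)

# Flat patch triangle in the z=0 plane, normal (0, 0, 1): backfacing here.
flat = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))

# Same triangle after displacing one vertex by 1 along the patch normal.
displaced = ((0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))

if __name__ == "__main__":
    print(is_front_facing(flat, to_camera))       # False
    print(is_front_facing(displaced, to_camera))  # True
```

The undisplaced triangle would be culled, yet the displaced one faces the camera, so culling the patch before tessellation would drop visible geometry.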
|
|
|
|
|
|
#3348 |
|
Meh
Join Date: Mar 2004
Location: New York
Posts: 9,809
|
Good point. Guess not.
__________________
What the deuce!? |
|
|
|
|
|
#3349 | |
|
Regular
Join Date: Feb 2002
Location: California
Posts: 4,732
|
Quote:
But as more and more leaks seem to indicate that NVidia has invested significant design work into making tessellation run very fast, it seems like some are in disbelief, while others are now starting to downplay the importance of tessellation performance and benchmarks (whereas once it was taken for granted that this was AMD's strong point). If indeed NVidia has significantly boosted triangle setup, culling, and tessellation, this could be like G80 all over again, where the total lack of information caused people to assume the worst, and the final chip came as a big surprise. I think they deserve major props if they did increase the setup rate. As Mint said, this part of the chips has been left unchanged for far too long. Setup seems to be exactly where it was 10 years ago. |
|
|
|
|
|
|
#3350 |
|
Member
Join Date: May 2007
Posts: 249
|
Take a look here: http://www.behardware.com/articles/7...x-280-260.html
The results of RightMark's VS tests point to a 0.5 triangle/clock setup rate for RV670, which scores ~270 at 775MHz, while R600 scores ~600 at 750MHz. RV770 scores ~650, probably because it is only partially setup limited. |
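The arithmetic behind that estimate can be sketched by normalising the quoted scores by clock speed. Treating the R600 result as a 1 triangle/clock baseline is an assumption made here, not something stated in the post, and the variable names are invented.

```python
def score_per_mhz(score, mhz):
    """RightMark VS score normalised by core clock."""
    return score / mhz

rv670 = score_per_mhz(270, 775)   # scores and clocks as quoted in the post
r600 = score_per_mhz(600, 750)

# Assumption: R600's per-clock result corresponds to 1 triangle/clock.
rv670_tris_per_clock = rv670 / r600

if __name__ == "__main__":
    print(f"estimated setup rate: {rv670_tris_per_clock:.2f} tri/clock")
```

Under that assumption the per-clock ratio comes out at about 0.44, which is broadly in line with the 0.5 triangle/clock reading.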
|
|
|
| Tags |
| delay, fermi, geforce, gf100 |