NVIDIA Fermi: Architecture discussion

Take this with a grain of salt, but did anyone notice the Anonymous post on BSON with GF104 specs:

NVIDIA GeForce GTX 395 Specs:

- Codename "GF104".
- Dual Core GPU Design (Two GF100 "Fermi" Cores).
- 6.4 Billion Transistors In Total (TSMC 40nm Process).
- 32 Streaming Multiprocessors (SM).
- Each SM has 2x16-wide groups of Scalar ALUs (IEEE754-2008; FP32 and FP64 FMA).
- The 32 SMs Have 1536KB Shared L2 Cache.
- 1024 Stream Processors (1-way Scalar ALUs) at 1350MHz.
- 1024 ALUs In Total.
- 1024 FP32 FMA Ops/Clock.
- 512 FP64 FMA Ops/Clock.
- Single Precision (SP; FP32) FMA Rate : 2.76 Tflops.
- Double Precision (DP; FP64) FMA Rate : 1.38 Tflops.
- 256 Texture Address Units (TA).
- 256 Texture Filtering Units (TF).
- INT8 Bilinear Texel Rate : 153.6 Gtexels/s.
- FP16 Bilinear Texel Rate : 76.8 Gtexels/s.
- 80 Raster Operation Units (ROPs).
- ROP Rate : 48 Gpixels/s.
- 600MHz Core.
- 640bit (2x320bit) Memory Subsystem.
- 4200 MHz Memory Clock.
- 336 GB/s Memory Bandwidth.
- 2560MB (1280MB Effective) GDDR5 Memory.
- New Cooling Design.
- High Power Consumption.

GTX395 is 60% faster than GTX380
GTX395 is 70% faster than HD5970
Release Date : May 2010, Price : 499-549 USD.
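
For what it's worth, the listed rates are at least internally consistent. A quick sanity check of the arithmetic, using nothing but the rumored unit counts and clocks (none of this is official):

```cpp
// Sanity check of the rumored GTX 395 throughput figures.
// All inputs are taken straight from the spec list above; nothing here is official.
#include <cstdio>

int main() {
    const double shader_ghz = 1.35;   // 1350 MHz shader (hot) clock
    const double core_ghz   = 0.60;   // 600 MHz core clock
    const double mem_gtps   = 4.2;    // 4200 MT/s effective GDDR5 data rate
    const int    fp32_alus  = 1024;   // 32 SMs x 32 ALUs
    const int    tf_units   = 256;    // texture filtering units
    const int    rops       = 80;
    const int    bus_bits   = 640;    // 2 x 320-bit

    // FMA = 2 flops per ALU per clock; the list claims FP64 at half the FP32 rate.
    printf("FP32 FMA : %.2f Tflops\n", fp32_alus * 2 * shader_ghz / 1000.0);
    printf("FP64 FMA : %.2f Tflops\n", fp32_alus * 2 * shader_ghz / 2000.0);
    // One INT8 bilinear texel per TF unit per core clock, FP16 at half rate.
    printf("INT8 tex : %.1f Gtexels/s\n", tf_units * core_ghz);
    printf("FP16 tex : %.1f Gtexels/s\n", tf_units * core_ghz / 2.0);
    // One pixel per ROP per core clock.
    printf("ROP rate : %.0f Gpixels/s\n", rops * core_ghz);
    // Bandwidth = bus width in bytes x effective data rate.
    printf("Bandwidth: %.0f GB/s\n", bus_bits / 8.0 * mem_gtps);
    return 0;
}
```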

Pros:
Two-way coherent caches using 64 bits of the memory bus looks to be a BIG improvement. Otherwise, it seems rather sensible. Clocked a bit lower than expected. If some SMs had been fused off, it would have been quite believable.

Cons:
A downclocked, castrated, cherry-picked GF100 hits 225W, so power is a big question here.
Can't shake off the feeling that this thing was derived/made up by applying the 285->295 formula.
 
And AnandTech said in October that GT200 is EOL. :LOL:
Yes, it's more or less a catalogue product now. In my country GTX285 cards are often more expensive and harder to find than the HD5870, and the same goes for the GTX275/HD5850. Only a few overpriced ~1.8-2GB models are available. GT200 is EOL and availability is really poor, so it's logical to expect that the GPU is no longer being produced. nVidia has its reasons to keep it in price lists: maybe it's better to offer a virtual competitor than nothing.

I'd expect this situation to last until the launch of mainstream Fermi parts.
 
Not that it isn't pure fantasy, but why would inter-GPU bandwidth need to be equal when the load on that path would be far lower than GPU<->memory? I'm not sure what purpose it would serve anyway; didn't both AMD and Nvidia claim that their current proprietary links have sufficient bandwidth for their purposes?
It has sufficient bandwidth for AFR.

The problem with side busses and non-AFR parallel rendering is that load balancing is rather difficult ... the naive approach is simple round-robin, but then all framebuffer writes are 50/50 local/remote ... which is going to take a whole lot of bandwidth.

Personally I would do things like this ...

- Vertex processing is divided round-robin (vertex buffers are fully replicated)
- All write buffers are roughly tiled (say 64x64 or more) and checkerboard-divided between the GPUs (see the sketch at the end of this post)
- All transformed vertices get tiled and then written to buffers in the memory of the relevant GPUs for tessellation or rasterization (icky, but the writes would be done with special types of non-temporal load/stores ... if the vertices get consumed while not yet evicted from L2 they never have to be written to external memory)
- All read buffers (including former write buffers) are replicated on demand on a tile-by-tile basis, which is to say that if a tile from a buffer is accessed which is not stored locally, that tile gets replicated

(My thinking on as-needed replication is that it will be as efficient as or more efficient than NUMA with caching; for instance, with dynamic textures reused across multiple frames it is clearly superior, and it is certainly more efficient than doing full buffer replication all the time, since that introduces too much latency between rendering steps.)

How much bandwidth would that consume? Hell if I know; I'd have to implement it in a simulator and run traces (lossless compression could probably cut the data for the tiled triangle writes by 66%, but because you are working with floating-point numbers it isn't cheap).
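
To make that a bit more concrete, here's a toy sketch of the split for two GPUs. The 64x64 tile size is from the list above; the batch granularity and the bounding-box routing of triangles are my own assumptions, not anything from a real driver:

```cpp
// Toy sketch of the scheme above for two GPUs: round-robin vertex batches
// plus checkerboard ownership of 64x64 screen tiles. Tile size comes from the
// post; batch granularity and the routing rule are assumptions.
#include <cstdint>

constexpr int kTileSize = 64;   // pixels per tile edge
constexpr int kNumGpus  = 2;

// Vertex batches are shaded round-robin, independent of screen position.
int VertexBatchOwner(uint32_t batch_index) {
    return static_cast<int>(batch_index % kNumGpus);
}

// Render-target tiles are divided in a checkerboard pattern so adjacent
// tiles (which tend to have similar cost) land on different GPUs.
int TileOwner(int pixel_x, int pixel_y) {
    int tx = pixel_x / kTileSize;
    int ty = pixel_y / kTileSize;
    return (tx + ty) % kNumGpus;
}

// After transform/tessellation, each triangle or patch is written to every GPU
// whose tiles its screen-space bounding box touches (conservative; ignores
// the extra cost of overlap).
uint32_t TriangleOwnerMask(int min_x, int min_y, int max_x, int max_y) {
    uint32_t mask = 0;
    for (int ty = min_y / kTileSize; ty <= max_y / kTileSize; ++ty)
        for (int tx = min_x / kTileSize; tx <= max_x / kTileSize; ++tx)
            mask |= 1u << ((tx + ty) % kNumGpus);
    return mask;
}
```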
 
Really, what's the problem? He is counting the two L2 caches together. Yeah, it's not right, but AMD did the same with Hemlock: http://forum.beyond3d.com/showpost.php?p=1359262&postcount=4698
No, they added the GPUs' processing units together, which is 100% valid; they added the memory bandwidth together, which is quite valid too (not 100%, but bandwidth really does double when there's no duplicate access); but they didn't add the "L2 cache" together in your link's pics.

These specs are one of the worst fakes ever; except perhaps the G80 hybrid water/air cooling ones, I don't recall anything worse.
 
No, they added the GPUs' processing units together, which is 100% valid; they added the memory bandwidth together, which is quite valid too (not 100%, but bandwidth really does double when there's no duplicate access); but they didn't add the "L2 cache" together in your link's pics.

These specs are one of the worst fakes ever, except perhaps the G80 hybrid water/air cooling ones.

They added the memory together - which is 100% not valid. :rolleyes:
 
And AnandTech said in October that GT200 is EOL. :LOL:
That's not what I focused on. And as many have said here, GT200 cards are indeed difficult to come by. For me the highlight was:

We're hearing that the rumors of a March release are accurate, but despite the delay Fermi is supposed to be very competitive (at least 20% faster than 5870?)

I doubt that Anand would put something in his piece after hearing it from people whispering in a crowd. You can say many things about AnandTech, but I have found that they mostly don't pay heed to rumours. He must have got the figure from someone reliable or connected to Fermi.
 
The problem is that the L2 can't be shared unless they created a dual-core GPU, which is quite hard to believe given the area of each core... a 1000mm² chip? :LOL:

What that means is that the L2 is coherent with respect to the two cores. This is obviously referring to a dual-chip card. TSMC's reticle limit is ~600 mm², which means they can't make chips bigger than that.
 
Full vertex buffer replication seems to be going in the opposite direction to where you want to be.
I meant the ones uploaded from the host. I didn't make it super explicit, but HOSs would be tiled at the pre-tessellated levels; as I said, "All transformed vertices get tiled and then written to buffers in the memory of the relevant GPUs for tessellation or rasterization" (I should have said triangles and patches, not vertices). You obviously have to make some assumptions about the bounding volumes of patches, which developers could break and for which you would need application-specific fixes; application-specific code for SLI is hardly new, though. Also, you would need to tag all vertices with a 32-bit integer and have one queue per GPU so the rasterizer can resequence them into an ordered stream, but that's no different from what the hardware has to do internally now.

It would be nice if sets of vertices came with bounding boxes so you could tile them at the start of the pipeline and get rid of all that communication, but they don't, so you can't ... you can multiply the vertex load to avoid the communication by simply transforming everything on all GPUs, but if you don't want to do that you are stuck doing tiling relatively late in the pipeline with only a 1/#GPU chance of a triangle/patch being local (ignoring overlap).
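
Just to illustrate the resequencing step: something like the merge below, where each source GPU's triangles arrive tagged with a sequence number and already in order within their own queue. The Triangle struct and queue types are purely hypothetical, this is just the "merge by sequence number" idea in code:

```cpp
// Rough sketch of resequencing: each source GPU emits its transformed
// triangles tagged with a 32-bit sequence number into its own queue, and the
// rasterizer front-end merges the queues back into submission order.
#include <cstdint>
#include <deque>
#include <vector>

struct Triangle { uint32_t seq; /* plus positions, attributes, ... */ };

// Always consume the smallest sequence number among the queue heads; each
// per-source queue is already internally ordered.
std::vector<Triangle> Resequence(std::vector<std::deque<Triangle>>& per_source) {
    std::vector<Triangle> ordered;
    for (;;) {
        int best = -1;
        for (int i = 0; i < static_cast<int>(per_source.size()); ++i) {
            if (per_source[i].empty()) continue;
            if (best < 0 || per_source[i].front().seq < per_source[best].front().seq)
                best = i;
        }
        if (best < 0) break;  // all queues drained
        ordered.push_back(per_source[best].front());
        per_source[best].pop_front();
    }
    return ordered;
}
```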
For other buffers wouldn't you tag shared resources at the driver level and do replication in a push model instead of an on-demand pull? That seems like the best way to go given that you know up front which buffers are shared.
The problem with that is that you potentially have to transfer a whole lot more (let's say a buffer is used in some MRT technique; because of the correlation between the pixels being rendered and the texels needed, you are not going to need the entire buffer), and that it introduces a different kind of latency (the latency needed to transfer the whole buffer), which is potentially much larger than normal texture access latency (and could thus stall rendering). If you use a modest tile size for replication (say 8x8) and give remote data accesses priority in the memory controllers, I think you could get latency close enough to normal texture access latency for normal multithreading to cover it.
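
Very roughly, the pull model I have in mind is something like this. The 8x8 tile size is from the paragraph above; the bookkeeping (a residency flag per tile and a blocking per-tile fetch) is just illustrative:

```cpp
// Rough sketch of on-demand tile replication for remote buffer reads.
#include <cstdint>
#include <vector>

struct ReplicatedBuffer {
    int width_tiles  = 0;            // buffer width in 8x8 tiles
    int height_tiles = 0;            // buffer height in 8x8 tiles
    std::vector<bool> resident;      // which tiles are already local
    // ... local tile storage, peer GPU handle, etc.
};

// Hypothetical transfer of one 8x8 tile over the inter-GPU link; the idea is
// that these requests get priority in the memory controllers so their latency
// stays close to a normal texture fetch. Stubbed out here.
void FetchTileFromPeer(ReplicatedBuffer& buf, int tile_index) {
    (void)buf; (void)tile_index;     // real copy over the side bus goes here
}

// Called on a remote-buffer texel access: replicate just the touched tile,
// not the whole buffer, so there is no big up-front transfer latency.
void EnsureTileResident(ReplicatedBuffer& buf, int texel_x, int texel_y) {
    int tile = (texel_y / 8) * buf.width_tiles + (texel_x / 8);
    if (!buf.resident[tile]) {
        FetchTileFromPeer(buf, tile);
        buf.resident[tile] = true;
    }
}
```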
 
Looks very similar to what MfA proposed, minus the vertex load round-robin step and tiling of patches/tris. Maybe the latter isn't worth it for screen-tile sizes that give good pixel load-balancing?
 
Looks very similar to what MfA proposed, minus the vertex load round-robin step and tiling of patches/tris. Maybe the latter isn't worth it for screen-tile sizes that give good pixel load-balancing?

When I first read that patent I thought it was another brain dump that would never actually make it to market. It presumes higher scalability than AFR, yet we see AFR setups achieving 80-100% scaling nowadays. It seems to me that AFR could benefit just as much from this higher-bandwidth bus. Processing the same geometry on both chips isn't going to scale in this brave new world of tessellation.
 
Looks very similar to what MfA proposed, minus the vertex load round-robin step and tiling of patches/tris. Maybe the latter isn't worth it for screen-tile sizes that give good pixel load-balancing?
If you don't equally distribute the vertex load (let's say you let one GPU handle it all), you have to rely on frame-to-frame coherence to determine the vertex/pixel load ratio so you know how many tiles each GPU gets to render. The round-robin vertex shading ensures that giving each GPU an equal share of the tiles will result in virtually perfect load balancing (with identical GPUs). That said, NVIDIA suggests doubling the vertex load in the most explicit patent (7616206) which OlegSH linked ... it ain't going to outscale AFR like that in modern games.

The only way this kind of rendering will beat AFR is by convincing consumers it's superior regardless of benchmarks. I'd be convinced if they managed it with equal benchmarks (though I doubt they would get there while doubling the vertex load), but convincing consumers in general might be a little harder.
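
For contrast, the frame-to-frame rebalancing you'd need without the round-robin vertex split could look roughly like this; the damping factor and the timing inputs are made up for illustration, not taken from the patent:

```cpp
// Rough sketch of frame-to-frame load balancing for a split-frame scheme where
// one GPU does all the vertex work: shift screen tiles between GPUs based on
// how long each one took last frame. Damping and granularity are arbitrary.
#include <algorithm>

struct TileSplit {
    int gpu0_tiles;     // tiles assigned to GPU 0 next frame
    int total_tiles;    // GPU 1 renders the rest
};

// With round-robin vertex shading (as in the scheme above) this feedback loop
// is unnecessary: an equal, fixed split already balances for identical GPUs.
void Rebalance(TileSplit& split, double gpu0_ms, double gpu1_ms) {
    const double damping = 0.5;  // arbitrary, to avoid oscillation
    double imbalance = (gpu1_ms - gpu0_ms) / (gpu0_ms + gpu1_ms);
    int shift = static_cast<int>(damping * imbalance * split.total_tiles);
    split.gpu0_tiles = std::clamp(split.gpu0_tiles + shift, 1, split.total_tiles - 1);
}
```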
 
MfA,
That makes sense. According to the patent, though, processing tiles in a checkerboard pattern gives good pixel load balancing since workloads for adjacent tiles are statistically similar (so I'm guessing the tiles are pretty small). Perhaps the ability to tessellate only the primitives covering tiles owned by a GPU is present; if not, I agree that AFR will be difficult to beat in geometry-heavy scenarios.

To scale from 2 GPUs to 4, the patent seems to imply that standard AFR will be used, which is a bit disappointing. I know squat about bus PHY, but maybe their ambition was limited by the difficulty of creating 25GB/s+ connections between GPUs on separate PCBs? I'm thinking the whole "reuse a memory controller for inter-chip communication" approach might mean that there are some pretty tight constraints on GPU placement for things to work. I guess the attraction must be that such reuse means you don't have HT link(s)/Intel equivalent sitting around unused on the single-GPU cards (or you get extra redundancy).
 
What do you think about running the setup in its own clock domain (e.g. the hot clock) rather than using multiple units? That would seem like the simplest way to do it.

Dumb question: irrespective of the frequency used, isn't it more a dilemma of whether to have 1 setup unit at >1 tri/clock vs. 2 setup units at 1 tri/clock?
 
Dumb question: irrespective of the frequency used, isn't it more a dilemma of whether to have 1 setup unit at >1 tri/clock vs. 2 setup units at 1 tri/clock?

My thought was that if you have two units, then all of a sudden you need extra logic to dispatch tris between them, buffer/route the results, etc. = more complexity.
 
The problem with AFR is the frame latency, which gets annoying when you're talking about sub-60 FPS dips during gameplay, where the latency and micro-stutter become very noticeable. It's one of the main reasons why I dislike multi-GPU in practice, regardless of synthetic numbers and benchmarks. Even if split-frame techniques showed lower scaling than AFR, when it came to actually playing games, split-frame would be the mode I would pick. Now if it were actually faster than AFR even in synthetic benchmarks, that would really be something.
 
Even if split-frame techniques showed lower scaling than AFR, when it came to actually playing games, split-frame would be the mode I would pick. Now if it were actually faster than AFR even in synthetic benchmarks, that would really be something.
And what about algorithms that use accumulation into a render target? You need to hold a copy of that RT in the local memory of each GPU; that's also a case where AFR doesn't scale perfectly.
 