NVIDIA Fermi: Architecture discussion

http://www.beyond3d.com/content/articles/19/
Deferred Rendering or Deferred Shading or Deferred Lighting?

The term deferred rendering is used to describe a number of related techniques; they all share a deferment stage but differ in what portion of the pipeline is deferred. This article only defers the lighting portion of the pipeline; all other parts can be done in whatever way you like, and the only requirement is that the G-Buffers are filled prior to the lighting stage. Deferred Shading is typically where the actual surface shader execution is deferred; this is the model presented by the UNC Pixel Plane project [4].
 
Looks like there's been some confusion between TBDR (which is what I was thinking of) and deferred shading, which are two somewhat different concepts.

As mentioned above, TBDR bins up all the geometry for a scene, sorts it, and only renders visible pixels.

Deferred shading first goes through the scene once, performing a minimal amount of work and storing any necessary per-pixel values. Then there is a later pass that inputs those per-pixel values to compute the final color of each pixel.
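Here's a minimal sketch of that second pass, written as a CUDA-style kernel (the buffer layout, the names and the single hard-coded light are just assumptions for illustration): the first pass has already filled per-pixel normal/depth and albedo buffers, and this pass reads them back to produce the final color.

```
// Hypothetical G-buffer layout: one float4 per pixel holding normal.xyz + depth,
// one float4 holding albedo. A single fixed directional light stands in for
// whatever lighting an engine would actually evaluate here.
__global__ void lightingPass(const float4* gNormalDepth,
                             const float4* gAlbedo,
                             float4* outColor,
                             int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * width + x;
    float4 nd     = gNormalDepth[idx];   // written by the geometry pass
    float4 albedo = gAlbedo[idx];

    // Simple N.L term against a fixed directional light.
    float3 n = make_float3(nd.x, nd.y, nd.z);
    float3 l = make_float3(0.577f, 0.577f, 0.577f);
    float ndotl = fmaxf(n.x * l.x + n.y * l.y + n.z * l.z, 0.0f);

    outColor[idx] = make_float4(albedo.x * ndotl,
                                albedo.y * ndotl,
                                albedo.z * ndotl,
                                1.0f);
}
```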
 
As mentioned above, TBDR bins up all the geometry for a scene, sorts it, and only renders visible pixels.
Yes, although note that you can still have tile-based rendering that is not particularly deferred, a la. Larrabee.

Deferred shading first goes through the scene once, performing a minimal amount of work and storing any necessary per-pixel values. Then there is a later pass that inputs those per-pixel values to compute the final color of each pixel.
Right, although I use the term "deferred rendering" in general because it really covers the gamut of storing *anything* up front (even just depth, a la. pre-z pass) to *everything* (i.e. the entire output of the rasterizer). Normally engines do something in between, but there are many options. Thus I just use the term "deferred rendering" to generally cover them all.
 
Yes, although note that you can still have tile-based rendering that is not particularly deferred, a la. Larrabee.

That shouldn't mean though IMO that the driver won't defer the rendering in some cases.

Right, although I use the term "deferred rendering" in general because it really covers the gamut of storing *anything* up front (even just depth, a la. a pre-z pass) to *everything* (i.e. the entire output of the rasterizer). Normally engines do something in between, but there are many options. Thus I just use the term "deferred rendering" to generally cover them all.

I'm wondering myself what the hair splitting on that topic is for. Given that IMG concentrates its graphics IP exclusively on embedded markets (and most likely will continue to do so), there's no real relevance to anything GPU here, much less anything high end.
 
That shouldn't mean though IMO that the driver won't defer the rendering in some cases.
Sure thing - there's always the option to do it deferred or not or mixed with tiling. I was merely pointing out that tile-based rendering does not necessarily imply deferred as well, even though this has often been true historically.
 
Sure thing - there's always the option to do it deferred or not or mixed with tiling. I was merely pointing out that tile-based rendering does not necessarily imply deferred as well, even though this has often been true historically.

An IMR doesn't process everything immediately nowadays, just as a TBDR doesn't defer everything either. It might be a large simplification, but especially with the advent of deferred rendering engines, early-Z on IMRs, etc., the boundaries between the two aren't as large as they used to be years ago. Sorry for the OT, by the way.
 
Fermi's memory bus has an odd size of 384 bits (GDDR5). NVIDIA has used this width before back with the 8800GTX, and they've had cards with other odd sizes like the GTX260 with 448 bits (GDDR3).

CUDA deals with memory transactions of 32, 64, and 128 bytes = 256, 512, 1024 bits.
How does the hardware fit these evenly sized memory requests onto the oddly sized memory bus without wasted bandwidth?
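For reference on where those sizes come from: 32 threads in a warp each reading a consecutive 4-byte float touch exactly 32 x 4 B = 128 B = 1024 bits, which the hardware can service as one 128-byte transaction (or a couple of smaller ones, depending on the chip). A minimal sketch, with a made-up kernel name:

```
// Fully coalesced access: consecutive threads read consecutive floats, so a
// warp's 32 x 4-byte loads span one contiguous, aligned 128-byte region.
__global__ void coalescedCopy(const float* __restrict__ in,
                              float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // consecutive threads -> consecutive addresses
}
```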
 
Fermi's memory bus has an odd size of 384 bits (GDDR5). NVIDIA has used this width before back with the 8800GTX, and they've had cards with other odd sizes like the GTX260 with 448 bits (GDDR3).

CUDA deals with memory transactions of 32, 64, and 128 bytes = 256, 512, 1024 bits.
How does the hardware fit these evenly sized memory requests onto the oddly sized memory bus without wasted bandwidth?
The bus has multiple channels, likely 32 or 64 bits wide. Each channel will want a minimum burst size, say 128 or 256 bits. So your 1024-bit transaction could span multiple channels without wasting any bandwidth at all.
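To put numbers on it (taking the 64-bit channel / 256-bit burst case, both just assumptions): 384 / 64 = 6 channels, and a 1024-bit (128-byte) request is exactly four 256-bit bursts, so it can be spread over four of the six channels with nothing left over while the remaining two service other requests.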
 
How does the hardware fit these evenly sized memory requests onto the oddly sized memory bus without wasted bandwidth?

GPU memory buses have been built from multiple narrower and independent memory controllers since 2001 :)

In order to maximize the utilization of data to and from the memory controller to the frame buffer, NVIDIA has implemented a crossbar memory architecture on the GeForce3. The 256-bit graphics controller on the GeForce3 has been partitioned into four 64-bit memory controllers each of which communicate with each other and the load and store units on the graphics processing unit.
 
I see, so the 384 bit bus might be multiplexing independent 256 and 128 bit reads simultaneously.

So when a CUDA app queries 512 bits, you may get 256 on the first clock and 256 on the second one, or you may get 256+128 on the first one and 128 on the second, or 128 on the first and 256+128 on the second. The memory controller stitches these all together as needed and releases each transaction after all of its bits have been assembled, sort of like TCP/IP packet assembly.

In CUDA with G200, the smallest read transaction is 32 bytes = 256 bits. So if Fermi has six 64-bit memory controllers, is the smallest read transaction a mere 64 bits?

Are the memory controllers independent and can access any desired bit of the device memory, or does each one handle a bank of memory and the controller has to queue up requests for each needed bank?
 
Are the memory controllers independent and can access any desired bit of the device memory, or does each one handle a bank of memory and the controller has to queue up requests for each needed bank?

A memory controller can access only those memory devices which are attached to it. Generally there is some sort of "spreading" algorithm to avoid congestion on a single memory controller.
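A toy version of such a spreading scheme, purely for illustration (real controllers hash addresses far more cleverly, and the channel count and burst size here are only assumptions): interleave the address space at burst granularity so that a long linear access rotates across all the controllers instead of hammering one of them.

```
// Toy address-spreading sketch, not any real GPU's hash: interleave memory at
// 32-byte granularity across six assumed 64-bit channels.
#include <cstdio>
#include <cstdint>

const int kChannels  = 6;    // assumed: six independent memory controllers
const int kBurstSize = 32;   // assumed: 32-byte minimum burst per channel

int channelForAddress(uint64_t addr)
{
    return (int)((addr / kBurstSize) % kChannels);
}

int main()
{
    // A 128-byte (1024-bit) request starting at address 0 lands on
    // channels 0..3: four bursts issued in parallel, none wasted.
    for (uint64_t a = 0; a < 128; a += kBurstSize)
        printf("bytes %3llu-%3llu -> channel %d\n",
               (unsigned long long)a,
               (unsigned long long)(a + kBurstSize - 1),
               channelForAddress(a));
    return 0;
}
```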
 
GPU memory buses have been built from multiple narrower and independent memory controllers since 2001 :)
Even back in 1998 (Matrox G200), in fact, and ATI released something like this with the R100, with the exact same goal as nVidia with the GeForce 3.

And we don't know everything; perhaps that was the case for other processors before, and we just didn't hear about it back then because there wasn't anything to gain by promoting it.

I wonder if it's still efficient with billion-transistor chips though; a write buffer, a decoupled RAM controller and some scheduling could improve performance. It's not as if a few kB of SRAM still require a substantial die area.
 
Is Fermi's Architecture considered von Neumann or Harvard ?

And what about the RV870 ?

I think on current GPUs, code and data share the same physical memory (i.e. the video memory). However, programs running on the GPU don't have the ability to access the code memory (e.g. a program can't "generate" another program on the GPU). So in this sense it behaves more like Harvard than von Neumann.
 
Ati chips from the past few years have had instruction caches that are distinct from data caches. There's no reason code must be in video memory. It could be loaded straight from system (CPU) memory if the latency can be tolerated.

I would guess Nvidia is the same.
 
Ati chips from the past few years have had instruction caches that are distinct from data caches. There's no reason code must be in video memory. It could be loaded straight from system (CPU) memory if the latency can be tolerated.

I would guess Nvidia is the same.
NV has had instruction caches since NV40 with SM3.0, driven by the demand for a huge maximum instruction count per shader. These are very small caches that can effectively cache only a small area of local GPU memory - the buffer of instructions and constants - so I don't see an opportunity for these caches to cache system memory.
 
Why don't they just let the Intel chipset handle all the display tasks but intercept 3D rendering calls to render frames on the discrete GPU? Then you can just blit the framebuffer into the integrated graphics chip's memory when doing 3D, and when not doing 3D turn off the discrete graphics chip without a care in the world.
And that's exactly what they did :)
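For what it's worth, the data flow they're describing amounts to something like the sketch below (illustration only: the real copy is done inside the driver, not with CUDA calls, and the frame size is an arbitrary assumption): render on the discrete GPU, copy the finished frame into memory the integrated GPU can scan out, then power the discrete chip down.

```
// Illustration of the data movement only; Optimus does this inside the driver.
#include <cuda_runtime.h>
#include <cstdlib>

int main()
{
    const size_t frameBytes = 1920 * 1080 * 4;   // assumed 1080p RGBA8 frame

    // Framebuffer in the discrete GPU's memory, where the 3D frame is rendered.
    void* discreteFb = nullptr;
    cudaMalloc(&discreteFb, frameBytes);

    // System-memory buffer that the integrated graphics could scan out from.
    void* sharedFb = malloc(frameBytes);

    // ... discrete GPU renders the frame into discreteFb ...

    // "Blit" the finished frame across PCIe into memory the IGP can display.
    cudaMemcpy(sharedFb, discreteFb, frameBytes, cudaMemcpyDeviceToHost);

    // With no 3D work pending, the discrete GPU can be powered down.
    cudaFree(discreteFb);
    free(sharedFb);
    return 0;
}
```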
 