ATI loves L2 caches and NVIDIA loves to deactivate pipelines..

nAo

A couple of interesting and recent (filed last year) patents; the first one is from ATI:

Two level cache memory architecture

A memory architecture for use in a graphics processor including a main memory, a level one (L1) cache and a level two (L2) cache, coupled between the main memory and the L1 cache is disclosed. The L2 cache stores overlapping requests to the main memory before the requested information is stored in the L1 cache. In this manner, overlapping requests for previously stored information is retrieved from the faster L2 cache as opposed to the relatively slower main memory.
I think the abstract makes it pretty clear, so it seems their pipes assigned to different tiles are going to share/reuse some texture data after all.
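To make my reading of it concrete, here is a minimal sketch of the arrangement as I understand it (all names, containers and the lack of eviction are made up; this is only an interpretation of the abstract): each pipe keeps a small private L1, and misses fall through to a single shared L2 that catches the overlapping requests before they reach main memory.

Code:
#include <cstdint>
#include <unordered_map>

// Toy model of the two-level arrangement in the abstract: a small private L1
// per pipe and one L2 shared by all pipes, sitting in front of main memory.
struct TexelBlock { unsigned char data[256]; };              // one cache line worth of texels

class SharedL2 {
    std::unordered_map<uint64_t, TexelBlock> lines;          // block address -> cached block
public:
    const TexelBlock& fetch(uint64_t addr) {
        auto it = lines.find(addr);
        if (it == lines.end())                               // L2 miss: take the slow path
            it = lines.emplace(addr, readFromMainMemory(addr)).first;
        return it->second;                                   // L2 hit: main memory is never touched
    }
private:
    static TexelBlock readFromMainMemory(uint64_t) { return {}; }  // stand-in for a DRAM request
};

class PipeL1 {
    std::unordered_map<uint64_t, TexelBlock> lines;
    SharedL2& l2;
public:
    explicit PipeL1(SharedL2& shared) : l2(shared) {}
    const TexelBlock& fetch(uint64_t addr) {
        auto it = lines.find(addr);
        if (it == lines.end())                               // L1 miss: ask the shared L2,
            it = lines.emplace(addr, l2.fetch(addr)).first;  // which another pipe may already have filled
        return it->second;
    }
};

The point being that when two pipes working on adjacent tiles both miss on the same border block, only the first miss goes all the way to memory.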
The second patent is from NVIDIA:

SCALABLE SHADER ARCHITECTURE

A scalable shader architecture is disclosed. In accord with that architecture, a shader includes multiple shader pipelines, each of which can perform processing operations on rasterized pixel data. Shader pipelines can be functionally removed as required, thus preventing a defective shader pipeline from causing a chip rejection. The shader includes a shader distributor that processes rasterized pixel data and then selectively distributes the processed rasterized pixel data to the various shader pipelines, beneficially in a manner that balances workloads.
Nothing groundbreaking here, even if I wonder whether this one is about the NV40/G70 family or about future GPUs such as G80.
Interesting quote:
This application claims benefit of United States provisional patent application serial number 60/561,617 (Attorney docket number P001278), entitled "GRAPHICS SHADER ARCHITECTURE", which was filed on April 20, 2004, and which is herein incorporated by reference.
Unfortunately this patent isn't available yet...
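For what it's worth, this is roughly what the distributor part sounds like to me; just a sketch of my reading of the abstract, where every name and the load-balancing rule are invented: work only goes to the pipelines that are still enabled, so a defective, fused-off pipe simply drops out of the rotation.

Code:
#include <cstddef>
#include <vector>

// Toy shader distributor in the spirit of the abstract: rasterized pixel work
// is handed only to enabled pipelines (defective ones are functionally removed),
// picking the least loaded survivor as a crude form of load balancing.
struct PixelWork { };                                        // a packet of rasterized pixel data

struct ShaderPipe {
    bool enabled = true;                                     // false if this pipeline was fused off
    std::size_t queued = 0;                                  // pending packets, used as the load metric
    void submit(const PixelWork&) { ++queued; }
};

class ShaderDistributor {
    std::vector<ShaderPipe>& pipes;
public:
    explicit ShaderDistributor(std::vector<ShaderPipe>& p) : pipes(p) {}
    bool distribute(const PixelWork& work) {
        ShaderPipe* best = nullptr;
        for (auto& pipe : pipes) {
            if (!pipe.enabled) continue;                     // disabled pipes never see work
            if (!best || pipe.queued < best->queued)         // keep the least loaded candidate
                best = &pipe;
        }
        if (!best) return false;                             // every pipeline disabled
        best->submit(work);
        return true;
    }
};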

ciao,
Marco
 
nAo said:
The second patent is from NVIDIA:

SCALABLE SHADER ARCHITECTURE


Nothing groundbreaking here, even if I wonder whether this one is about the NV40/G70 family or about future GPUs such as G80.
Interesting quote:

Unfortunately this patent isn't available yet...

ciao,
Marco

This is about NV40/G70 but may still be used for G80.
 
It's incredible how well-known and often-implemented methods can be filed as patents again as long as you put them in another context. I wonder who will be first to patent a GPU stack register.
 
I can't see anything special about the ATI two level cache patent that makes it any different from how I imagine NVidia's existing GPUs utilise their 2-level cache.

Jawed
 
Demirug said:
This is about NV40/G70 but will maybe still used for G80.
Are you sure? I'm waiting for the incorporated application to become available before making up my mind :)
Jawed said:
I can't see anything special about the ATI two level cache patent that makes it any different from how I imagine NVidia's existing GPUs utilise their 2-level cache.
AFAIK ATI has never used an L2 cache, so it would be nice to understand whether this patent refers to some GPU ATI is already selling on the market or to some GPU that is not available yet.

ciao,
Marco
 
The filing date (April 2004) seems too late for Xenos, so I imagine it'll be R600. But that's not a reliable clue...

Jawed
 
Well, R300->R590 all work in tiles, with each quad being reserved a tile, so an L2 makes no sense in that context. Considering the Xenos and R600 architectures, this would seem to make the most sense on the R600, yeah.

Uttar
 
I have just been playing with tile distribution and I can assure you that an L2 will help. Take this example: a quad covering the whole viewport is drawn combining two textures the same size as the viewport, with bilinear filtering enabled.

For this example I used two 256x256 RGBA textures (uncompressed) and the viewport is also set to 256x256. I tested 8x8 and 16x16 tiles as the work distribution unit for 4 quad pipelines (shader + ROP) and two texture cache line sizes (256 and 64 bytes; 256 is what the simulator usually uses due to limitations in the current implementation of the memory controller, but 64 works fine for uncompressed textures). The fragment shader used has 2 texture instructions and 5 arithmetic instructions doing the combining (so it should be fragment shader limited and not bandwidth limited).

How much texture data do you think is read by the GPU to render the quad covering the viewport in each case?

a) all cases read 512 KBs of texture data
b) 8x8/256B reads 1.7 MB, 8x8/64B reads 800 KB, 16x16/256B reads 1 MB and 16x16/64B reads 800 KB
c) no texture data is read

Second question: which scenario do you think would benefit from using an L2 cache to feed blocks of texture data requested by one texture unit to the other texture units, rather than requesting them again from the memory controller?

Remember: when you aren't point sampling a texture, even in the best case you could imagine (the texture properly aligned with the screen pixels), you have to fetch data from texture blocks that are also accessed by other texture units.
 
RoOoBo said:
How much texture data do you think is read by the GPU to render the quad covering the viewport in each case?

a) all cases read 512 KBs of texture data
b) 8x8/256B reads 1.7 MB, 8x8/64B reads 800 KB, 16x16/256B reads 1 MB and 16x16/64B reads 800 KB
c) no texture data is read

d) none of the above?

My guess would be:
8x8/256B reads 2 MB
8x8/64B reads 1.125 MB
16x16/256B reads 1.125 MB
16x16/64B reads 800 KB
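
Rough working behind those figures, under my own assumptions (not stated in the thread): each cache line maps to a square block of RGBA texels (256 B = 8x8, 64 B = 4x4), the textures are aligned 1:1 with the screen, bilinear adds a one-texel border to every tile's footprint, and the per-pipe caches share nothing, so border blocks are refetched by every tile that touches them.

Code:
#include <cstdio>

// Bytes of texture data fetched to render a 256x256 viewport sampling two
// 256x256 uncompressed RGBA textures bilinearly, counting one fetch per
// texel block per screen tile (private per-pipe caches, no sharing).
long bytesRead(int tileDim, int lineBytes) {
    const int  texelBytes = 4;                               // uncompressed RGBA
    const int  blockDim   = (lineBytes == 256) ? 8 : 4;      // 256 B line = 8x8 texels, 64 B = 4x4
    const long tiles      = (256L / tileDim) * (256 / tileDim);
    const long footprint  = tileDim + 1;                     // bilinear needs one extra row/column
    const long blocksAxis = (footprint + blockDim - 1) / blockDim;
    const long textures   = 2;
    return tiles * blocksAxis * blocksAxis * blockDim * blockDim * texelBytes * textures;
}

int main() {
    printf("8x8 / 256 B  : %ld KB\n", bytesRead( 8, 256) / 1024);  // 2048 KB (2 MB)
    printf("8x8 / 64 B   : %ld KB\n", bytesRead( 8,  64) / 1024);  // 1152 KB (1.125 MB)
    printf("16x16 / 256 B: %ld KB\n", bytesRead(16, 256) / 1024);  // 1152 KB (1.125 MB)
    printf("16x16 / 64 B : %ld KB\n", bytesRead(16,  64) / 1024);  // 800 KB
}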
 
You are obviously right.

In fact I wrote the post at home, mixing results from other tests. And obviously, as usual, I hadn't tried to check the results I got against the theoretical model (a bad research practice even when the results sound half right), so I couldn't see at first glance that the numbers I was pulling out of my head were wrong. In the tests I was more interested in getting the proper rendering time (and shader utilization) out of the simulator rather than checking how much data was being read, as I had problems with non-memory-related parameters and code, so I wasn't really paying attention to the matter.

But no, the simulator isn't wrong (this time I did check the amount of texture data read for all the cases), it's just my brain :LOL:. The 8x8/64B one is an obvious error that I already suspected was wrong after writing it, but didn't change as I didn't remember a different, correct number. The 1.7 MB is likely something I got testing a non-tiled distribution scheme, and 1 MB is just laziness + bad memory = rounding.

But well, even with a 15% error in the numbers (can't be that bad for simulation-driven research, can it? ;)) the meaning of the post is the same.
 
Ok, I understand the message.
However I was under the impression that the shortest reads of GPUs are usually 16 bytes.

That's 648 KB for 16x16 tiles, which is about 26% above the optimum - I wonder if <=X850 cards have this inefficiency, or if it is solved in the X1000 series.
26% is still a large number.
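(That figure follows from the same block counting as above if a 16-byte line holds a 2x2 texel block: 256 tiles x 9x9 blocks x 16 bytes x 2 textures = 648 KB, against the 512 KB of unique texture data.)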
 
Hyp-X said:
However I was under the impression that the shortest reads of GPUs are usually 16 bytes.
That depends on the DRAM width and the burst length.

For example, a 32-bit wide DRAM with a burst length of 4 (like GDDR3), gives you 16 bytes/request. If you can only make accesses at the full size of the memory interface, then 256-bit GPUs coupled with DRAM with burst lengths of 4 give you 128 bytes/request.

Different DRAM have different burst lengths, and different GPUs have different memory access configurations.
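
Written out as a formula (my shorthand, nothing official): bytes per request = channel width in bits / 8 x burst length, per independent channel; the figures above fall straight out of it.

Code:
#include <cstdio>

// Access granularity per independent channel: width (bits) / 8 * burst length.
int bytesPerRequest(int channelBits, int burstLength) {
    return channelBits / 8 * burstLength;
}

int main() {
    printf("32-bit channel, burst 4 : %d bytes\n", bytesPerRequest(32, 4));   // 16
    printf("64-bit channel, burst 4 : %d bytes\n", bytesPerRequest(64, 4));   // 32
    printf("256-bit access, burst 4 : %d bytes\n", bytesPerRequest(256, 4));  // 128
}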
 
16 bytes? The smallest configurable burst access with a 64-bit DDR bus (two chips) is 32 bytes (4 x 64 bits). Only the new 32-bit DDR bus based memory controller that ATI implements could be configured for 16-byte accesses.

In any case I find it quite weird that ATI GPUs don't implement an L2 or another method to solve this problem. I did some texture fillrate tests long ago using different texture sizes with my R350, and the effective texture bandwidth showed more than one step. It had a first bandwidth step at the limit of the texture cache size (8 KB textures), and a couple of additional steps, maybe around 64 KB or 256 KB texture sizes, before reaching the real GPU memory bandwidth level. The last big bandwidth step was the AGP bandwidth, for texture sizes over 128 MB. I wonder if someone else has done similar tests or has implemented a similar benchmark. Or I could try to dig up my small benchmark program ...
 
Bob said:
That depends on the DRAM width and the burst length.

For example, a 32-bit wide DRAM with a burst length of 4 (like GDDR3), gives you 16 bytes/request. If you can only make accesses at the full size of the memory interface, then 256-bit GPUs coupled with DRAM with burst lengths of 4 give you 128 bytes/request.

Different DRAM have different burst lengths, and different GPUs have different memory access configurations.

As far as I know only five different configurations have been implemented in ATI or NVidia GPUs. Until the R5xx memory controller the configurations were: one 64-bit DDR bus (all those low budget cards), two 64-bit DDR buses (the middle level cards) and four 64-bit buses (the 'too expensive' cards ;)). Each of those 64-bit buses accesses two 32-bit DDR chips in parallel, but the buses are completely independent from each other. The new ATI MC supports 32-bit DDR buses, 8 for the R520 and 8 for the RV530. The RV515 seems to be using the old 64-bit DDR MC implementing a single bus.
 
I think you meant 4 for RV530.

One thing the patent describes, which I think is strange, is how each individual pipe (per pixel) has an L1 cache.

Paragraph 23 of Description said:
In other words, the texture information for pixel P0 will be transferred to first texture cache 202; the texture information for pixel P1 will be transferred to second texture cache 204; the texture information for pixel P2 will be transferred to third texture cache 206 and the texture information for pixel P3 will be transferred to fourth texture cache 208.

I would have expected the quad-pipe, as a unit, to share an L1 cache.

But, instead, it appears that the patent describes an L1 cache structure in which a texel will appear multiple times, within a quad.

Strange.
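
To put a rough number on the duplication (my own back-of-the-envelope, assuming a 2x2 bilinear footprint per pixel): the four pixels of a quad touch a 3x3 patch of 9 unique texels, but four private caches end up holding 4 x 4 = 16 texel entries for it, so nearly half of the stored texels are copies that a quad-shared L1 wouldn't need.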

Jawed
 
RoOoBo said:
As far as I know only five different configurations have been implemented in ATI or NVidia GPUs. Until the R5xx memory controller the configurations were: one 64-bit DDR bus (all those low budget cards), two 64-bit DDR buses (the middle level cards) and four 64-bit buses (the 'too expensive' cards ;)). Each of those 64-bit buses accesses two 32-bit DDR chips in parallel, but the buses are completely independent from each other. The new ATI MC supports 32-bit DDR buses, 8 for the R520 and 8 for the RV530. The RV515 seems to be using the old 64-bit DDR MC implementing a single bus.

I know that NV2x cards had 4 32-bit DDR buses and they had a smallest access of 64 bits (8 bytes). Their Z compression method was optimized for a 64-bit access size.

R300 introduced 4 64-bit DDR buses, so I assumed they had a smallest access of 128 bits (16 bytes).

Was a burst length of 4 introduced with DDR2?
Because that would mean that the FX 5800 had 4 32-bit DDR2 buses w/ burst 4 (4 x 128-bit bursts),
while the FX 5900 had 2 64-bit DDR buses w/ burst 2 (4 x 128-bit bursts).
That might explain their choice of RAM and their ability to change the memory controller that easily.
 
I had only read Samsung GDDR2 and GDDR3 specifications until now and those only support burst 4 and 8 (in fact until a few months ago GDDR3 only supported burst 4).

I downloaded Samsung's graphics memory guide, and what they list as GDDR seems more or less like 'normal' DDR memory; the chip specification I downloaded and looked at lists four burst modes: 2, 4, 8 and whole page.

I wasn't even thinking about pre-R300 GPUs, as I started researching graphics just around the time NV3x and R3xx were released (or at least paper launched; I still remember reading the NV_fragment_program extension spec months before they released the card :LOL: ). But not remembering that NV3x had a 256-bit bus and four memory controllers was for sure a mistake.
 