Modern GPUs and on-chip caches?

arjan de lumens · May 8, 2003

Assuming single-port SRAM (6 transistors per bit; add 2 more transistors per bit for dual-port SRAM) and about 1 transistor of overhead per bit, 25 million transistors would give 25 Mt / (7 × 8 t/byte) = about 436 KiB.
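
To make the arithmetic above easy to re-run with other budgets, here is a minimal sketch. The function name `sram_capacity_kib` and its parameters are made up for illustration; the 6T cell, the 2 extra transistors for dual-port, and the 1 transistor of overhead per bit are the assumptions stated in the post.

```python
def sram_capacity_kib(transistor_budget, cell_transistors=6, overhead_per_bit=1):
    """Rough SRAM capacity in KiB for a given transistor budget.

    cell_transistors: transistors in the bit cell itself (6 for single-port).
    overhead_per_bit: decoders, sense amps, etc., amortized per bit (assumed 1).
    """
    bits = transistor_budget / (cell_transistors + overhead_per_bit)
    return bits / 8 / 1024  # bits -> bytes -> KiB


single_port = sram_capacity_kib(25_000_000)                    # 6T + 1 overhead
dual_port = sram_capacity_kib(25_000_000, cell_transistors=8)  # 8T + 1 overhead

print(f"single-port: {single_port:.0f} KiB")
print(f"dual-port:   {dual_port:.0f} KiB")
```

The single-port case reproduces the roughly 436 KiB figure; the dual-port case comes out noticeably lower because each bit costs 9 transistors instead of 7.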

As for things other than texture maps to cache: what about -- framebuffer? Z/Stencil buffers? Vertex arrays? Also, with pixel shaders, you get a large number of pixels in flight * a large number of registers per pixel = a fairly large amount of SRAM as well.

ram · May 8, 2003

mczak said:
Well 20% in die area is probably about 40% in transistor count.

Could be. Does anyone have some facts about the transistor density of SRAM cells vs logic gates?

Hyp-X · May 8, 2003

UberLord said:
However, I'm at a total loss to see why a cache would benefit anything else for a 3D GPU.

Indexed primitives use a vertex buffer holding the vertices, and an index buffer containing indices referencing those vertices.
During drawing, a vertex can be referenced multiple times (expect around 6 times), so it makes sense to keep a cache for them.
This is especially important when the vertex buffer is in AGP memory, which is quite slow.
Hence the pre-transformed vertex cache.

Because the vertices are transformed on demand, those 6 occurrences of a vertex can mean that it is transformed multiple times just to reach the same result.
Hence the post-transformed vertex cache.
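
A toy model of the post-transformed vertex cache, to show how reuse cuts down on transforms. This is only a sketch: real hardware cache sizes and replacement policies are not given in this thread, so the FIFO policy, the function names `count_transforms` and `grid_indices`, and the grid mesh are all assumptions for illustration.

```python
from collections import deque


def count_transforms(indices, cache_size):
    """Count how many vertices get transformed for an index stream,
    assuming a FIFO cache of already-transformed vertices."""
    cache = deque(maxlen=cache_size)  # oldest entry drops out when full
    transforms = 0
    for idx in indices:
        if idx not in cache:          # miss: the vertex must be transformed
            transforms += 1
            cache.append(idx)
    return transforms


def grid_indices(cols, rows):
    """Triangle-list indices for a grid of cols x rows quads (2 triangles each)."""
    stride = cols + 1                 # vertices per row
    out = []
    for y in range(rows):
        for x in range(cols):
            a = y * stride + x        # top-left corner of the quad
            b, c, d = a + 1, a + stride, a + stride + 1
            out += [a, b, c, b, d, c]
    return out


indices = grid_indices(8, 8)
for size in (0, 4, 16, 128):
    print(f"cache size {size:3d}: {count_transforms(indices, size)} transforms "
          f"for {len(indices)} indices")
```

With no cache every index costs a transform; with a cache large enough to hold the whole mesh each vertex is transformed exactly once. Sizes in between depend heavily on the order of the indices.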

3dcgi · May 8, 2003

UberLord said:
However, I'm at a total loss to see why a cache would benefit anything else for a 3D GPU.

Color and depth caches are used to allow memory prefetching. This is important because the latency to local memory is a number of clocks: 25, 50, 100, etc., depending on the implementation.

The values that are needed are requested and space is allocated in the cache. In the meantime, data in the pipeline is stored in FIFOs. By the time the pipeline data exits the FIFOs, the requested data from memory is waiting in the cache.

This is why memory latency is relatively unimportant for graphics chips. Within reason, of course, because FIFOs do cost gates.
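
The FIFO/prefetch idea can be sketched with a tiny clock-by-clock model. Everything here is an assumption for illustration (one request per clock, fixed latency, a hypothetical `simulate` function); it is not how any particular chip works, it just shows why the FIFO has to be about as deep as the latency in clocks.

```python
from collections import deque


def simulate(latency, fifo_depth, clocks):
    """Return how many pixels complete in `clocks` cycles.

    Each clock the pipeline may issue one memory request, provided the FIFO
    has room. A request issued at clock t has its data ready at t + latency.
    The pixel at the head of the FIFO retires once its data has arrived.
    """
    fifo = deque()  # arrival clock for each in-flight pixel, oldest first
    retired = 0
    for clock in range(clocks):
        if fifo and fifo[0] <= clock:   # data for the oldest pixel is ready
            fifo.popleft()
            retired += 1
        if len(fifo) < fifo_depth:      # room to prefetch for another pixel
            fifo.append(clock + latency)
    return retired


for depth in (64, 50, 25, 10):
    done = simulate(latency=50, fifo_depth=depth, clocks=1000)
    print(f"FIFO depth {depth:2d}: {done} pixels in 1000 clocks")
```

Once the FIFO covers the latency, throughput stays at one pixel per clock after the initial fill; a FIFO half as deep as the latency gives roughly half the throughput in this model.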

asicnewbie · May 9, 2003

Reverend said:
In reply to the original post -- this is usually a closely guarded internal secret and is considered proprietary information.

But the funny thing is, I bet it's like an 'open secret.' Embedded RAMs, whether they be SRAM, 1T-SRAM, eDRAM, or flash, present *unmistakable* visual signatures on die photographs. Even an untrained observer (like myself) could easily identify a RAM block (larger than 1024 bytes) on a blown-up die photograph, thanks to its rectangular shape and regular (repetitive) structure. Once the blocks are identified, it's a simple matter of taking a ruler or other straight-edge and tabulating the size of all such arrays. From here, one could deduce a very good estimate of the RAM's transistor count based on known public info, like the process lithography size and RAM technology (SRAM, 1T-SRAM, eDRAM). So basically, companies who include die photographs in their PR kit are tacitly giving away this 'secret.'
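
The ruler-and-die-photo estimate could be sketched like this. The bit-cell area and array efficiency below are placeholder numbers, not real process data (the real values come from the foundry design kit or published cell sizes), and `estimate_sram_bits` is a made-up helper name.

```python
def estimate_sram_bits(width_mm, height_mm, cell_area_um2, array_efficiency):
    """Estimate the bits in a RAM block measured off a die photograph.

    width_mm, height_mm: block dimensions, scaled from the photo to the real die.
    cell_area_um2:       area of one bit cell for the process (assumed value).
    array_efficiency:    fraction of the block that is actual bit cells rather
                         than decoders, sense amps, and wiring (assumed value).
    """
    block_area_um2 = width_mm * height_mm * 1_000_000  # mm^2 -> um^2
    return int(block_area_um2 * array_efficiency / cell_area_um2)


# Placeholder inputs: a 1 mm x 1 mm block, 2.5 um^2 cell, 50% array efficiency.
bits = estimate_sram_bits(1.0, 1.0, cell_area_um2=2.5, array_efficiency=0.5)
print(f"about {bits} bits = {bits / 8 / 1024:.1f} KiB")
print(f"about {bits * 6} transistors in the cells alone, assuming 6T SRAM")
```

Summing this over every block spotted on the photo gives the total on-chip RAM estimate.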

The core (digital-logic) datapath will NOT have the same kind of regular structure. And analog/mixed-signal circuits (PLLs, RAMDACs) have very large feature sizes that easily rule them out as anything except analog blocks.

ram said:
Does anyone have some facts about the transistor density of SRAM cells vs logic gates?

If NVidia's using the standard-cell Artisan library (given to TSMC customers free of charge), the Artisan library includes an SRAM memory-macro compiler. The macro-compiler auto-generates EDA/tool-views for layout and routing. All the info needed to answer your question is in the design kit, but unfortunately the customer must sign an NDA to acquire the design kit (even though it doesn't cost anything.) The memory-compiler for 0.18u (which I've used) can generate a variety of SRAM configs - single-port (R/W), two-port (R + R/W), and true dual-port (R/W + R/W.)

Asking about logic gate size is like asking about the length of an x86-instruction ... depends on the gate's function (OR, NOT, AND, etc.) and its drive-strength (X1, X2, X4, X8, etc.) And finally, standard-cell gates are laid out on a 'grid', so the gate's active area can be somewhat smaller than the grid's granularity. ATI has already said they use custom-design techniques in their digital-logic synthesis, meaning speed (or area) critical blocks are hand-designed (allowing the designer to defeat the grid restrictions.)

MfA · May 9, 2003

There are companies who sell reverse-engineered info on "popular" ICs ... I am pretty sure I have seen such info offered for NVIDIA chips, with some global info for free as a teaser, although I don't remember where.

Anyway, competitors know as much about the chip as they are willing to pay for.

RussSchultz · May 9, 2003

It only takes a few hundred dollars (plus a chip) to take it to your local FA test lab and have it decapped. Then, as asicnewbie says, it's all a matter of measuring the thing. With the new bare-die flip-chip packaging, you might not even have to decap it (since it's not embedded in epoxy). Just pry the thing loose from the substrate/package and you can see it all.

shaderman · May 9, 2003

arjan de lumens said:
Assuming single-port SRAM (6 transistors per bit; add 2 more transistors per bit for dual-port SRAM) and about 1 transistor of overhead per bit, 25 million transistors would give 25 Mt / (7 × 8 t/byte) = about 436 KiB.

As for things other than texture maps to cache: what about -- framebuffer? Z/Stencil buffers? Vertex arrays? Also, with pixel shaders, you get a large number of pixels in flight * a large number of registers per pixel = a fairly large amount of SRAM as well.

add one for SCAN.

- SM

arjan de lumens · May 9, 2003

shaderman said:
add one for SCAN.

- SM

Nope. Scan applies to D flip-flops, not SRAM cells - there is a rather big difference between them. Scan costs about 6-8 transistors per bit. For SRAMs, you usually need to add test logic that is rather different from standard scan.

UberLord · May 9, 2003

Thanks for those answers guys - I think I know what you both meant

shaderman · May 14, 2003

arjan de lumens said:
Nope. Scan applies to D flip-flops, not SRAM cells - there is a rather big difference between them. Scan costs about 6-8 transistors per bit. For SRAMs, you usually need to add test logic that is rather different from standard scan.

duh
