How do I get into the field of 3D ASICs?

You can do some reverse engineering of GPU designs using GPUBench. There are lots of knobs on the applications to "explore" the characteristics of the hardware, at least as exposed through OpenGL. There is also a DX9 port, but it hasn't been promoted as we still have a few bugs in the tests. (For example, neither ATI nor Nvidia seems to like our DX branching tests after a spec change in DX9c.)

Now, you can more easily do cache analysis when the caches are not fully associative, as on the current ATI chips.

Also remember when studying cache designs, and memory systems in general, to take a careful look at the latency behavior.

As was said, the memory system as a whole is set up for lots of references going on at the same time. You can really see this as you start drawing very little in a scene with memory-intensive shaders. On the current crop of boards, we don't start getting good efficiency out of the memory or ALUs until we have ~4K fragments being executed.
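As a back-of-the-envelope way to see why the number has to be in the thousands, here is a minimal Little's-law style Python sketch; the latency and issue-rate numbers are assumptions picked for illustration, not measured figures:

```
# Rough Little's-law estimate of how many fragments must be in flight to keep
# the ALUs fed.  All numbers below are illustrative assumptions, not real specs.

mem_latency_cycles = 500      # assumed round-trip latency to DRAM
fragments_per_clock = 8       # assumed fragment issue rate of the shader core
fetches_per_fragment = 1      # assume one outstanding texture fetch per fragment

# Every fragment issued during one full latency window has to be "in flight"
# before the first fetch returns, otherwise the ALUs sit idle.
in_flight = mem_latency_cycles * fragments_per_clock * fetches_per_fragment
print(f"fragments needed in flight: ~{in_flight}")   # ~4000 with these numbers
```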
 
The caches present in a GPU are typically not very large (with the Gamecube "Flipper" chip being an extreme exception, with its 1MB texture-cache). According to a quick google search, the Radeon 8500 has a 4 KByte texture cache, and that appears to be the last PC GPU with a publicly-known texture cache size. For vertex data, there are typically a pre-transform and a post-transform cache, both sized to be able to hold a few dozen vertices.
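For the post-transform cache in particular, a small FIFO model in Python is enough to see the effect; the 24-entry size and the FIFO replacement policy below are assumptions, not a description of any specific chip:

```
# Minimal sketch of a post-transform vertex cache, modelled as a FIFO of
# recently shaded vertex indices.
from collections import deque

def shaded_vertices(indices, cache_entries=24):
    cache = deque(maxlen=cache_entries)   # FIFO: oldest entry falls out first
    misses = 0
    for idx in indices:
        if idx not in cache:              # miss: this vertex must be (re)shaded
            misses += 1
            cache.append(idx)
    return misses

# Two triangles sharing an edge: the shared vertices hit in the cache.
strip_like = [0, 1, 2,  2, 1, 3]
print(shaded_vertices(strip_like))        # 4 vertices shaded instead of 6
```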

For texture caches, it may additionally be noted that the cache needs to serve an extremely large number of accesses per clock cycle (about 4 per pipeline), which is most easily fixed by just replicating the texture cache unit many times. This is however not cheap. In some modern GPUs, this problem is addressed by having many small, distributed L1 texture caches, all of which feed from a single, larger L2 texture cache (although actual sizes are still not available).
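A toy two-level Python model of that arrangement, just to make the idea concrete; the sizes, the direct-mapped organisation, and the 32-byte line are all invented numbers:

```
# Rough two-level model of the "many small L1s fed by one shared L2" idea.

LINE_BYTES = 32

class DirectMappedCache:
    def __init__(self, size_bytes):
        self.lines = size_bytes // LINE_BYTES
        self.tags = [None] * self.lines
        self.hits = self.misses = 0

    def access(self, addr):
        line = addr // LINE_BYTES
        slot = line % self.lines
        if self.tags[slot] == line:
            self.hits += 1
            return True
        self.misses += 1
        self.tags[slot] = line
        return False

l2 = DirectMappedCache(64 * 1024)                     # one shared L2
l1s = [DirectMappedCache(2 * 1024) for _ in range(4)]  # one tiny L1 per pipe

def texel_fetch(pipe, addr):
    if not l1s[pipe].access(addr):   # L1 miss...
        l2.access(addr)              # ...falls through to the shared L2

# Four pipes walking neighbouring addresses: most fetches hit their little L1,
# and only line fills reach the shared L2.
for x in range(4096):
    texel_fetch(x % 4, x)            # simple byte-by-byte linear walk
print([c.hits for c in l1s], l2.hits, l2.misses)
```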

Makes sense.
But I am thinking about unified shaders, such as the ATI chip in the XBOX2.
I assume it must be very similar to the CPU world's
Instruction cache and
Data cache (not sure about this though, since I don't know what the shader instruction format is; it must look very odd from the point of view of a RISC instruction format ;) )
 
Makes sense.
But I am thinking about unified shaders, such as the ATI chip in the XBOX2.
I assume it must be very similar to the CPU world's
Instruction cache and
Data cache (not sure about this though, since I don't know what the shader instruction format is; it must look very odd from the point of view of a RISC instruction format ;) )

In a shader architecture, you don't really need a very large instruction cache; just large enough to hold a pair of common-case shader programs (1 vertex shader program + 1 pixel shader program) and provide some buffering for uncommonly large programs (in which case you execute 1 instruction for a very large number of vertices/pixels, then proceed to the next instruction and so on). Random branching is generally much less common in shaders than in CPUs, and as such shaders do not benefit from large ICaches the way CPUs do.
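In loop form, the execution order described above looks roughly like this Python sketch (the program and batch contents are made up for illustration):

```
# Sketch of the "one instruction across a whole batch" execution order,
# which is why a huge instruction cache buys little.

program = ["tex r0, t0", "mul r0, r0, c0", "add r0, r0, c1"]   # hypothetical shader
batch = [f"fragment_{i}" for i in range(1024)]

for instruction in program:          # outer loop: walk the program once
    for fragment in batch:           # inner loop: apply it to every fragment
        pass                         # execute `instruction` for `fragment` here
# Only one instruction is "live" at a time, so a small instruction buffer suffices.
```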

GPUs do not normally contain general-purpose Data-Caches the way CPUs do. The memory regions an ordinary GPU touches (vertex-arrays, textures, framebuffers) all have highly specialized uses and usually have separate caches, carefully adapted to the most common access patterns for these memory regions.
 
By that, do you mean monitoring latency behavior to guess hit/miss, or more in-depth issues? :rolleyes:

A CPU, when running into a cache miss, will usually stall and be unable to do anything until the miss is resolved. A GPU, when running into a cache miss, does NOT stall - rather, it just keeps processing pixels/vertices that weren't directly hit by the cache miss, potentially detecting additional cache misses in the process. The optimal way to handle this proliferation of cache misses depends strongly on the latency and pipelining characteristics of the memory that you are fetching data into the cache from - the difference between a "good" and a "bad" cache design here can mean an order-of-magnitude difference in performance, even while keeping cache size the same.
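A crude way to picture the difference is the Python toy comparison below; the hit rate, latency, and fragment count are invented, and the overlap model is deliberately simplistic:

```
# Toy comparison of "stall on miss" vs "switch to another fragment on miss".
import random
random.seed(0)

MISS_LATENCY = 200          # assumed cycles to service a miss
HIT_RATE = 0.9
N_FRAGMENTS = 1000

accesses = [random.random() < HIT_RATE for _ in range(N_FRAGMENTS)]  # True = hit

# CPU-style: every miss stalls the pipe for the full latency.
stall_cycles = sum(1 if hit else MISS_LATENCY for hit in accesses)

# GPU-style: a miss parks that fragment and the core issues another one;
# misses overlap, so (roughly) only the last outstanding miss is exposed.
issue_cycles = N_FRAGMENTS                      # one issue per cycle
exposed_latency = 0 if all(accesses) else MISS_LATENCY
overlap_cycles = issue_cycles + exposed_latency

print(f"stall-on-miss : {stall_cycles} cycles")
print(f"miss-overlap  : {overlap_cycles} cycles")
```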
 
In a shader architecture, you don't really need a very large instruction cache; just large enough to hold a pair of common-case shader programs (1 vertex shader program + 1 pixel shader program) and provide some buffering for uncommonly large programs (in which case you execute 1 instruction for a very large number of vertices/pixels, then proceed to the next instruction and so on). Random branching is generally much less common in shaders than in CPUs, and as such shaders do not benefit from large ICaches the way CPUs do.

With D3D10, shader programs can be very large. But even in its D3D9 generation, nVidia doesn't store the full pixel shader program on the chip. Only a few instructions are held there. To execute larger programs, they use the texture unit to fetch the next instructions for a pixel batch.
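A Python sketch of that scheme, treating the shader as something you page in a window at a time per batch; the window size and the 200-instruction program are made-up values:

```
# Executing a long shader with only a small on-chip instruction window,
# refilled per pixel batch.

WINDOW = 32                                 # assumed instructions held on chip
program = [f"op_{i}" for i in range(200)]   # hypothetical 200-instruction shader
batch = range(1024)                         # one batch of pixels

fetches = 0
for start in range(0, len(program), WINDOW):
    window = program[start:start + WINDOW]  # "texture fetch" of the next chunk
    fetches += 1                            # one fetch per window for the whole batch
    for pixel in batch:
        for op in window:
            pass                            # run this window for every pixel in the batch
print(f"{fetches} instruction fetches for a batch of {len(batch)} pixels")
```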
 
A CPU, when running into a cache miss, will usually stall and be unable to do anything until the miss is resolved. A GPU, when running into a cache miss, does NOT stall - rather, it just keeps processing pixels/vertices that weren't directly hit by the cache miss, potentially detecting additional cache misses in the process. The optimal way to handle this proliferation of cache misses depends strongly on the latency and pipelining characteristics of the memory that you are fetching data into the cache from - the difference between a "good" and a "bad" cache design here can mean an order-of-magnitude difference in performance, even while keeping cache size the same.

Well said. Also remember that if you are really clever, you can also reorder requests and deliver out of order to maximize address and data line usage, as well as potentially reorder to avoid bank conflicts and the like off-chip, and maybe even the associativity conflicts in the cache. Memory system and cache design is now one of the most difficult things in modern parallel architecture design.
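As a minimal Python sketch of the bank-conflict part, here is a reordering pass that round-robins pending requests across banks; the bank count and the address-to-bank mapping are assumptions:

```
# Reorder pending requests so consecutive issues rarely hit the same DRAM bank.
from collections import defaultdict
from itertools import zip_longest

N_BANKS = 8
BANK_SHIFT = 8          # assume the bank index sits in bits [10:8] of the address

def bank_of(addr):
    return (addr >> BANK_SHIFT) % N_BANKS

def reorder(requests):
    queues = defaultdict(list)
    for addr in requests:           # bucket per bank, keeping order inside a bank
        queues[bank_of(addr)].append(addr)
    # Issue round-robin across banks so back-to-back requests rarely share a bank.
    interleaved = zip_longest(*queues.values())
    return [addr for group in interleaved for addr in group if addr is not None]

pending = [0x0000, 0x0010, 0x0020, 0x0100, 0x0110, 0x0200]   # made-up addresses
print([hex(a) for a in reorder(pending)])
```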
 
The method I used to try to determine cache sizes back in the R350 days was to maximize BW usage until the fill rate was hit, and then apply textures of different sizes (1 KB, 2 KB, 4 KB, etc.) replicated over a whole-screen quad. I was using a low-end ATI 9800, so it was already quite BW-starved relative to the GPU frequency, but reducing the memory frequency with an overclocking utility may also work.
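The analysis step of that method boils down to finding the knee in the fill rate curve; a quick Python sketch with invented measurements:

```
# Find the first texture size where the fill rate drops sharply.
# The numbers below are invented, not real measurements.

measurements = {            # texture size in KB -> measured fill rate (arbitrary units)
    1: 1000, 2: 995, 4: 990, 8: 985, 16: 610, 32: 600,
}

DROP_THRESHOLD = 0.9        # flag a drop to below 90% of the previous fill rate

sizes = sorted(measurements)
for prev, size in zip(sizes, sizes[1:]):
    if measurements[size] < DROP_THRESHOLD * measurements[prev]:
        print(f"fill rate falls off at {size} KB -> cache is likely ~{prev} KB")
        break
```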

To get line sizes or other info you can play with the stride of the texture accesses, using texture coordinates that apply more or less minification to the texture. Something I discovered for that generation is that ATI's caches were optimized for mipmapping (which is the common case), and the tests reported 1 to 3 cycle penalties when the minification factor was larger than 2 or 4 (depending on the screen axis). The NV35 caches didn't show this kind of behaviour.

I never tried to use the same method for the ROP caches. You may want to try quads of different sizes over the same screen region over and over (using the vertex cache to avoid being bound by vertex shading), and take into account ATI's distribution of the screen tiles into different pipelines.

At that time the tests were reporting fill rate reductions when the texture size was 2 KB for my Radeon 9000, at 4 KB for a GeForce 5850 (NV35), and at 8 KB for the ATI Radeon 9800. The X1600 I have now at my work PC shows the reduction at 8 KB too. I think I tested the GeForce 6600, but I don't remember at what size the reduction started. However, taking into account distributed caches and multiple levels, I can't be sure whether those were proper cache size numbers or whether the methodology was correct enough.
 
Where can I find some "typical shader programs" to read so I can understand more about how the hardware will implement them efficiently? I assume it is just DX9/DX10 programs or Cg?
It looks like there are too many code examples on Google; could anyone please give me a hint about which ones to follow?

The caches present in a GPU are typically not very large (with the Gamecube "Flipper" chip being an extreme exception, with its 1MB texture-cache). According to a quick google search, the Radeon 8500 has a 4 KByte texture cache, and that appears to be the last PC GPU with a publicly-known texture cache size. For vertex data, there are typically a pre-transform and a post-transform cache, both sized to be able to hold a few dozen vertices.

For texture caches, it may additionally be noted that the cache needs to serve an extremely large number of accesses per clock cycle (about 4 per pipeline), which is most easily fixed by just replicating the texture cache unit many times. This is however not cheap. In some modern GPUs, this problem is addressed by having many small, distributed L1 texture caches, all of which feed from a single, larger L2 texture cache (although actual sizes are still not available).
 
Where can I find some "typical shader programs" to read so I can understand more about how the hardware will implement them efficiently? I assume it is just DX9/DX10 programs or Cg?
It looks like there are too many code examples on Google; could anyone please give me a hint about which ones to follow?

I find that the examples in Microsoft's DirectX SDK (they use HLSL as the shading language) and in NVIDIA's SDK (they use Cg) are quite useful. Then there is also another shading language, GLSL, but I haven't seen any real code yet. ;) However, I don't know if you will be able to learn much about how the hardware will implement the shader programs. Maybe it helps to examine the shaders after they have been translated to asm.
 
GS shader vs Vertex Shader

In DX10, a new shader, the GS (geometry shader), is introduced, which was not in the old graphics pipeline.
What is the major difference between the vertex shader and the GS shader?
I really could not figure out the point of introducing them.
 
In DX10, a new shader, the GS (geometry shader), is introduced, which was not in the old graphics pipeline.
What is the major difference between the vertex shader and the GS shader?
I really could not figure out the point of introducing them.
They are there to support "higher order" primitives (i.e. curved/displacement mapped surfaces).
 
To me they also seem much more amenable to being abused for other tasks, since they finally break the single input - single output limitation.
 
To me they also seem much more amenable to being abused for other tasks, since they finally break the single input - single output limitation.

Yes, that's more correct, although you can do HOS on them to some degree, e.g. N-patches are doable.

John.
 
I thought you needed more than three adjacent vertices for Catmull-Clark.

You just need the one-ring if my memory serves, but it's been a long time.
Edit: You actually need the one-ring faces, so it doesn't work.

I assume there is no limit of 3 in the final spec; otherwise it's pretty much useless, since it would be a pretty limited topology where verts have only 3 neighbors.
 
I think you could still do it; you just won't be provided the verts you need by the API. You can fetch adjacent vertices from a vertex buffer and then amplify the geometry with the GS. This method might impose some limits on the mesh, though; I've never tried it.
 
You just need the one-ring if my memory serves, but it's been a long time.
Edit: You actually need the one-ring faces, so it doesn't work.

I assume there is no limit of 3 in the final spec; otherwise it's pretty much useless, since it would be a pretty limited topology where verts have only 3 neighbors.

No, it's still limited to three. As indicated by MfA, it isn't really intended for HOS; it's more for basic geometry expansion such as point sprite expansion, or techniques like shadow volume generation (1 vert per edge allows for silhouette detection).

John.
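For what it's worth, here is a CPU-side Python toy version of the silhouette test that the adjacency vertices make possible; the mesh (a triangular "cap" with three vertical neighbouring faces) and the light direction are made up:

```
# An edge is on the silhouette if its triangle faces the light and the
# neighbouring triangle (reached through the adjacency vertex) does not.

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])

def sub(a, b):
    return (a[0]-b[0], a[1]-b[1], a[2]-b[2])

def dot(a, b):
    return sum(x*y for x, y in zip(a, b))

def faces_light(p0, p1, p2, light_dir):
    # A face is lit if its (unnormalised) normal points toward the light.
    return dot(cross(sub(p1, p0), sub(p2, p0)), light_dir) > 0

def silhouette_edges(tri, adj, light_dir):
    """tri: the triangle's 3 positions; adj: one extra adjacency vertex per edge."""
    edges = []
    lit = faces_light(tri[0], tri[1], tri[2], light_dir)
    for i in range(3):
        a, b = tri[i], tri[(i + 1) % 3]
        neighbour_lit = faces_light(a, adj[i], b, light_dir)   # consistent winding
        if lit and not neighbour_lit:
            edges.append((a, b))        # extrude a shadow-volume quad from this edge
    return edges

# A cap facing the light from above, with three vertical side faces:
# every cap edge ends up on the silhouette.
tri = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
adj = [(0.5, 0, -1), (0.5, 0.5, -1), (0, 0.5, -1)]
print(silhouette_edges(tri, adj, light_dir=(0, 0, 1)))
```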
 
GPU Drivers

To me, it looks like the GPU's driver stack is far more complex than that of the normal ASICs I worked on before. However, theoretically, the GPU's driver also just controls some configuration registers, doesn't it?

On the other hand, if we think of the OS as the "driver" of the general CPU, the OS is responsible for handling processes, threads, VM, etc.

What should the GPU's driver handle? Is there any detailed tutorial for this?

Thanks.
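To make the "just controls some configuration registers" question concrete, here is a toy Python sketch of how a driver typically splits its work between building command buffers in memory and a handful of register writes; the register names and packet formats are entirely invented:

```
# Toy model: most driver work is validating state and encoding command
# packets; only a couple of register writes actually "kick" the GPU.

class ToyGpu:
    def __init__(self):
        self.registers = {}          # the MMIO window the driver can write
        self.ring = []               # command buffer the GPU pulls from memory

    def write_reg(self, name, value):
        self.registers[name] = value

class ToyDriver:
    def __init__(self, gpu):
        self.gpu = gpu

    def draw(self, vertex_buffer_addr, vertex_count, shader_addr):
        # Most of the work: translate API state into command packets...
        self.gpu.ring += [
            ("SET_SHADER", shader_addr),
            ("SET_VERTEX_BUFFER", vertex_buffer_addr),
            ("DRAW", vertex_count),
        ]
        # ...and only then a couple of register writes telling the GPU
        # where the commands live and how far it may read.
        self.gpu.write_reg("RING_BASE", 0x100000)        # pretend ring address
        self.gpu.write_reg("RING_TAIL", len(self.gpu.ring))

gpu = ToyGpu()
ToyDriver(gpu).draw(vertex_buffer_addr=0x1000, vertex_count=3, shader_addr=0x2000)
print(gpu.ring, gpu.registers)
```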
 