How to get into the field of 3D ASICs?

nobond

Newcomer
I am just a newcomer to this field. I used to work in VLSI design (but only in the wireless communication area) during my postgraduate study. I do not have any serious knowledge of 3D, graphics cores, GPUs, etc ...

Now I am about to start a job as a graphics ASIC designer. Any ideas on how to learn this field quickly and efficiently?
 
For a quick overview of 3d graphics algorithms, you may perhaps start with books like "Real-Time Rendering". To learn about the programming model that will be expected from the graphics hardware, you can try searching around on places like http://www.opengl.org and http://msdn.microsoft.com/directx/ .

There are not many good sources of information on the actual implementation of GPUs. There appear to be several good books on CPU design, but none that I know of that target GPU design; there are enough similarities that understanding everything in a CPU-oriented book will be helpful, but there are also major differences, in particular in what assumptions you can make about parallelism, throughput and latency, and what you do and don't need specialized circuits for. You can find plenty of pretty diagrams with some amount of explanation at pretty much any site that does GPU previews/reviews (beyond3d, anandtech, hardocp, etc.), but these do not normally drill very deep into the nitty-gritty details.

Expect the learning curve to be the steepest one you will ever run into.
 
nobond said:
I am just a newcomer to this field. I used to work in VLSI design (but only in the wireless communication area) during my postgraduate study. I do not have any serious knowledge of 3D, graphics cores, GPUs, etc ...

Now I am about to start a job as a graphics ASIC designer. Any ideas on how to learn this field quickly and efficiently?

I would have thought the best way is to get employed by one of the companies actually designing 3D graphics chips, though that rather limits your choices of location!
 
Thanks for the help.

The truth is I got employed by a 3D graphics corp, so I have to learn it.

I still prefer general processors or signal processing circuits, though ... :cry:
 
You could start with just reading whatever looks interesting in the conference proceedings of the Graphics Hardware Workshop.
 
nobond said:
Now I am about to start a job as a graphics ASIC designer. Any ideas on how to learn this field quickly and efficiently?
My experience tells me that in this particularly wide field, you'll find yourself most comfortable and most interested in a specific area. Once you realize this, read up to get the essential basics and, more importantly, just experiment like crazy -- the books should be there for the time when you need them most, i.e. "where could I have gone wrong?"
 
nobond said:
Thanks for the help.

The truth is I got employed by a 3D graphics corp, so I have to learn it.

I still prefer general processors or signal processing circuits, though ... :cry:
Which one?


Whilst not a hardware guy myself, the impression I get from seeing our guys work is that you will be working on a fairly small piece of the whole thing, and it is the interaction with the other hardware modules that will matter most. But a good understanding of basic 3d principles certainly won't hurt. I think Foley and Van Dam's 'Computer Graphics' is still considered the reference book to have. http://www.amazon.co.uk/exec/obidos/ASIN/0201848406/203-5710545-2525539
I would be amazed if there isn't a copy you can borrow in your company.

If you have the option, talking with the software guys who are going to be programming the thing is a good idea. Things that are easy for you to change (such as the placement of fields within registers) can make the hardware easier and more efficient for the software guy to use.
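As a purely hypothetical illustration of the kind of decision involved (every register and field name below is made up, not from any real part):

Code:
#include <cstdint>

// Hypothetical 32-bit blend-control register; the names and field widths are
// invented purely to illustrate the point about field placement.
constexpr uint32_t BLEND_ENABLE_SHIFT   = 0;  // 1 bit
constexpr uint32_t BLEND_SRC_FUNC_SHIFT = 1;  // 3 bits
constexpr uint32_t BLEND_DST_FUNC_SHIFT = 4;  // 3 bits
constexpr uint32_t BLEND_OP_SHIFT       = 7;  // 2 bits

// Keeping all the blend-related fields in one register means the driver can
// program blending with a single register write instead of several.
constexpr uint32_t blend_ctrl(uint32_t enable, uint32_t src_func,
                              uint32_t dst_func, uint32_t op)
{
    return (enable   << BLEND_ENABLE_SHIFT)   |
           (src_func << BLEND_SRC_FUNC_SHIFT) |
           (dst_func << BLEND_DST_FUNC_SHIFT) |
           (op       << BLEND_OP_SHIFT);
}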

CC
 
The only place where you can really find low level implementation details is in the IHV secret document vaults :p This is different from the CPU world where Intel, AMD or IBM allow writing papers about (some of) the internal workings of their CPUs.

Aside from that, good places to look are the Graphics Hardware proceedings and other miscellaneous research papers from here and there: for example, the three or so classic papers on texture caches, the ones about Compaq's Neon rasterizer, some chapters on graphics hardware in the Real-Time Rendering book and other books, and some online PhD courses, for example the now relatively old one by Akeley and Hanrahan and the more up-to-date ones from John Owens (which looks like it has been updated since the time I discovered it) and Akenine-Moller. The one from Bill Mark (linked in Owens' course) looked good too.
 
Really? I'm still downloading the slides ... Well, then I'm not the only 'academic' considering B3D as a source of information ;)
 
Right.
Actually, I have skimmed through the cs448 one, which looks pretty good, but I really do not know how much of it I understand exactly.

For a general processor, the basic model is well known: starting from the classic textbook five-stage pipeline, the other variants are not hard to understand by playing around with the number of stages and the parallelism. In the GPU field, it looks like there is no typical model comparable to the five-stage RISC pipeline. The notion of a geometry processor and a rasterization processor seems too rough to provide much insight, I think.

RoOoBo said:
The only place where you can really find low level implementation details is in the IHV secret document vaults :p This is different from the CPU world where Intel, AMD or IBM allow writing papers about (some of) the internal workings of their CPUs.

Aside from that, good places to look are the Graphics Hardware proceedings and other miscellaneous research papers from here and there: for example, the three or so classic papers on texture caches, the ones about Compaq's Neon rasterizer, some chapters on graphics hardware in the Real-Time Rendering book and other books, and some online PhD courses, for example the now relatively old one by Akeley and Hanrahan and the more up-to-date ones from John Owens (which looks like it has been updated since the time I discovered it) and Akenine-Moller. The one from Bill Mark (linked in Owens' course) looked good too.
 
nobond said:
In the GPU field, it looks like there is no typical model comparable to the five-stage RISC pipeline. The notion of a geometry processor and a rasterization processor seems too rough to provide much insight, I think.
The first observation that motivates having a GPU in the first place is parallelism. Each pixel is essentially independent of every other pixel, which means that for a 1280x1024 framebuffer, you have a bit over 1 million sequences of calculations that are pretty much independent of each other. This has some rather immediate implications:
  • You will want to keep track of execution state for more than 1 pixel at a time, or else you are f***ed from the get-go; think multithreading and multi-core - a modern GPU keeps track of several hundred to a few thousand such states and can have dozens of execution units.
  • If you think that you might be unable to keep an execution unit busy because of a data dependency, a cache miss, a branch mispredict or whatnot, don't stall the processor. Instead, make sure you have instructions ready for other pixels, so that you can continue feeding the execution unit for 100% utilization.
  • Instruction latencies should not affect performance: if your execution unit has 100 cycles of latency but can accept 1 instruction per clock, you collect 100 pixels and interleave execution between them, so that your execution unit stays 100% busy all the time (a toy version of this interleaving is sketched below).
That should give you some hints about the overall architecture of a pixel shader processing unit - vertex shader units are similar, but tend to have fewer high-latency operations (in particular texturing). There is a lot of stuff to learn about the internal workings of the various execution units (iterators, texture mappers, arithmetic circuits) which will give you something to chew on for a very long time. You can generally make the assumption that every execution unit will be fully pipelined.
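If it helps, here is a toy software model of that interleaving; it only illustrates the scheduling idea, not how any real hardware is built:

Code:
// Toy model of latency hiding: one execution unit with a 100-cycle result
// latency, kept busy by round-robining over 100 independent pixel contexts.
#include <cstdio>

int main()
{
    const int kLatency  = 100;      // cycles before an instruction's result is ready
    const int kContexts = 100;      // pixels kept "in flight"
    int issued[kContexts] = {0};    // instructions issued per pixel context

    if (kContexts < kLatency)
        std::printf("warning: not enough contexts to hide the latency\n");

    for (int cycle = 0; cycle < 1000; ++cycle) {
        int ctx = cycle % kContexts;   // pick the next pixel, round-robin
        ++issued[ctx];                 // issue one instruction for that pixel
        // By the time this context comes around again (kContexts cycles later),
        // the kLatency-cycle result of the instruction just issued is ready,
        // so the unit accepts a new instruction every cycle and never stalls.
    }
    std::printf("issued one instruction per cycle for 1000 cycles, no stalls\n");
    return 0;
}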

In addition, there are a substantial number of subunits that do not look like processors, but serve more specialized purposes in the general 3d graphics data flow:
  • Vertex shader pre-transform and post-transform caches
  • Triangle Setup Unit
  • Scan-converter
  • Z and Stencil test units
  • Framebuffer blend units
These units are conceptually not very complicated, but they have room for large amounts of optimization at all levels from the gate level to the algorithm level, which often makes them extremely complex in practice. One example of such an optimization would be Hierarchical Z; there are many, many others.
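To see how those units hang together conceptually (ignoring all of the optimization that makes them hard in practice), here is a deliberately naive software analogue; every type and helper in it is invented for illustration:

Code:
// Naive software analogue of the fixed-function path. Real hardware reorders,
// parallelizes and optimizes all of this heavily.
#include <cstdio>
#include <vector>

struct Vertex   { float x, y, z; };
struct Triangle { Vertex v[3]; };
struct Fragment { int x, y; float z; };

const int W = 4, H = 4;              // tiny framebuffer
float    zbuf[W * H];
unsigned color_buf[W * H];

// "Scan-converter": emit one fragment per pixel. A real rasterizer walks the
// triangle's edges; this toy version just covers the whole framebuffer.
std::vector<Fragment> scan_convert(const Triangle& t)
{
    std::vector<Fragment> out;
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            out.push_back({x, y, t.v[0].z});
    return out;
}

// "Z test unit": keep the fragment only if it is nearer than what is stored.
// (Hierarchical Z would reject whole tiles before ever reaching this point.)
bool z_test(const Fragment& f)
{
    int i = f.y * W + f.x;
    if (f.z >= zbuf[i]) return false;
    zbuf[i] = f.z;
    return true;
}

unsigned pixel_shader(const Fragment&) { return 0x00ff00u; }            // flat green
void blend(int x, int y, unsigned c)   { color_buf[y * W + x] = c; }    // "blend unit"

int main()
{
    for (int i = 0; i < W * H; ++i) zbuf[i] = 1.0f;   // clear Z to "far"

    // One post-transform triangle, as if it came out of the vertex shader
    // and the Triangle Setup Unit.
    Triangle tri = {{{0.f, 0.f, 0.5f}, {3.f, 0.f, 0.5f}, {0.f, 3.f, 0.5f}}};

    int shaded = 0;
    for (const Fragment& f : scan_convert(tri))
        if (z_test(f)) {
            blend(f.x, f.y, pixel_shader(f));
            ++shaded;
        }
    std::printf("shaded %d pixels\n", shaded);
    return 0;
}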

Finally, there is the memory subsystem, which supplies all the various units with the data they need and returns to memory the data that the units produce (the data produced should be limited to framebuffer data - that is, color, Z, and stencil). This subsystem needs to be rather deeply pipelined and highly parallel as well, to allow it to serve/prioritize a large number of execution units efficiently; a GDDR3 memory chip can easily transfer data twice per clock cycle and perform random memory accesses at a throughput of one every 2 cycles, but you can safely expect its latency to be on the order of 30 cycles (for the chip itself; this comes in addition to any latency imposed by the memory controller).
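To put rough numbers on that, using only the figures above rather than any particular chip:

Code:
#include <cstdio>

int main()
{
    // Figures quoted above: ~30 cycles of DRAM latency (chip only), and one
    // random access accepted roughly every 2 cycles.
    const int latency_cycles    = 30;
    const int cycles_per_access = 2;

    // Little's law: to keep the part fully busy, roughly
    // latency / issue interval accesses must be in flight at once.
    const int accesses_in_flight = latency_cycles / cycles_per_access;
    std::printf("~%d accesses must be outstanding to hide the DRAM latency\n",
                accesses_in_flight);
    return 0;
}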


Of course, as Captain Chickenpants notes, what probably will happen in practice is that you will be assigned to work on just a small, well-defined part of this whole. If you want more than that, you will likely need a substantial aptitude for algorithm and GPU architecture development.
 
Just one basic question

I know that for a general CPU, the "main input element" is the instruction memory, or more realistically the instruction cache. What is the counterpart in the GPU? For example, is it just OpenGL (or D3D) commands that are fed into the GPU pipeline?
 
nobond said:
I know that for a general CPU, the "main input element" is the instruction memory, or more realistically the instruction cache. What is the counterpart in the GPU? For example, is it just OpenGL (or D3D) commands that are fed into the GPU pipeline?
For the vertex and pixel shaders, there are usually small on-chip instruction memories that they execute from. These have traditionally been purely local memories, which you initialize and then tell the GPU to run a shader program from; for more recent GPUs, however, they may take the form of more traditional instruction caches, loading sections of a shader program upon cache misses like an ordinary CPU instruction cache.

Note, however, that the vertex/pixel shader programs only tell the GPU what to do with the vertices input to the API and the pixels produced by the scan conversion; there is also usually a command queue. This queue, if present, receives commands one by one which tell the GPU which shader programs to run, what other render state to use with those shaders (blend state, Z/stencil test state, framebuffer configuration etc; these are not considered part of the shaders themselves) and which vertex arrays to run these shaders on. OpenGL/D3D draw commands are usually converted by the driver to a form suitable to be presented to such a queue.
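As a purely illustrative sketch (the packet format and names below are invented, not any real hardware's command stream), the kind of thing such a queue carries might look like:

Code:
#include <cstdint>
#include <vector>

// Invented, simplified command-queue packets; no real GPU uses exactly this
// format, but the driver translates OpenGL/D3D calls into something of this flavor.
enum class Cmd : uint8_t {
    SetShaderProgram,   // which vertex/pixel program to run next
    SetRenderState,     // blend state, Z/stencil state, framebuffer config, ...
    Draw                // push a vertex array through the pipeline
};

struct CommandPacket {
    Cmd      cmd;
    uint32_t arg0;      // e.g. shader address, state value, vertex count
    uint64_t arg1;      // e.g. vertex array base address
};

// The driver appends packets at one end; the GPU front end consumes them
// one by one at the other.
std::vector<CommandPacket> command_queue;

void set_state(uint32_t state_word)
{
    command_queue.push_back({Cmd::SetRenderState, state_word, 0});
}

void draw(uint64_t vertex_array_addr, uint32_t vertex_count)
{
    command_queue.push_back({Cmd::Draw, vertex_count, vertex_array_addr});
}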
 
A detailed question

A bit of learning progress. :)

It looks like the vertex part of the GPU is relatively simple compared to the pixel shader part.
The vertex shader is just like an enhanced computation-intensive processor (transformation etc.).
The pixel shader has too many new words to understand.

I am just confused by translucent objects. Can they be rendered with the normal Z-buffer algorithm? Are there any rules for translucent objects?
 
nobond said:
A bit of learning progress. :)

It looks like the vertex part of the GPU is relatively simple compared to the pixel shader part.
The vertex shader is just like an enhanced computation-intensive processor (transformation etc.).
The pixel shader has too many new words to understand.
That is to be expected; the computational load on the vertex shader is usually much lower than on the pixel shader, and there is not usually any need to do fast texturing during vertex processing - this makes things a lot simpler.
I am just confused by translucent objects. Can they be rendered with the normal Z-buffer algorithm? Are there any rules for translucent objects?
Translucent objects? Unless you draw them strictly in back-to-front order (or front-to-back if you use destination alpha), the ordinary Z-buffer algorithm will not handle these correctly. In nearly all hardware 3d accelerators (except for a few older PowerVR designs), it is left to the programmer/game-developer rather than the GPU itself to impose such an ordering.
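A minimal sketch of what that ordering looks like on the application side, with a made-up Object type and the usual 'over' blend noted in a comment:

Code:
#include <algorithm>
#include <vector>

// The Object struct and the drawing details are made up; only the ordering matters.
struct Object {
    float view_depth;   // distance from the camera in view space
    // ... geometry, material, etc.
};

void draw_translucent(std::vector<Object>& objects)
{
    // Sort farthest-first so that each object blends over everything behind it.
    std::sort(objects.begin(), objects.end(),
              [](const Object& a, const Object& b) {
                  return a.view_depth > b.view_depth;
              });

    for (const Object& obj : objects) {
        // Draw with the Z-test on but Z-writes off, and blending set to the usual
        //   result = src_alpha * src_color + (1 - src_alpha) * dst_color
        (void)obj;  // placeholder for the actual draw call
    }
}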
 
To work with the Z-buffer algorithm, translucent objects must be ordered back to front so they can be alpha-blended into the frame buffer. Also, these days the complexity of the vertex and pixel shaders in hardware is similar, as for DX10 they share the same instruction set.

Edit: too slow. arjan beat me to it.
 
Further question: caches

It looks like the cache parameters are not public in the GPU world.
I do not know whether anyone can tell me where I can find them for a typical GPU.
I assume the exact configuration (associativity, cache line size) might not be possible to find out.
But at least the size?
 
It looks like the cache parameters are not public in the GPU world.
I do not know whether anyone can tell me where I can find them for a typical GPU.
I assume the exact configuration (associativity, cache line size) might not be possible to find out.
But at least the size?
The caches present in a GPU are typically not very large (with the Gamecube "Flipper" chip being an extreme exception, with its 1MB texture-cache). According to a quick google search, the Radeon 8500 has a 4 KByte texture cache, and that appears to be the last PC GPU with a publicly-known texture cache size. For vertex data, there are typically a pre-transform and a post-transform cache, both sized to be able to hold a few dozen vertices.

For texture caches, it may additionally be noted that the cache needs to serve an extremely large number of accesses per clock cycle (about 4 per pipeline), which is most easily fixed by just replicating the texture cache unit many times. This is however not cheap. In some modern GPUs, this problem is addressed by having many small, distributed L1 texture caches, all of which feed from a single, larger L2 texture cache (although actual sizes are still not available).
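For a feel of what even a cache that small does, here is a toy direct-mapped lookup; the 4 KB total echoes the Radeon 8500 figure above, but the line size and organization are invented, since the real details are not public:

Code:
#include <cstdint>
#include <cstdio>

// Toy direct-mapped texture cache: 4 KB total with 64-byte lines. Shows only
// the tag/index split; real texture caches are organized around 2D locality.
const uint32_t kLineSize = 64;
const uint32_t kNumLines = 4096 / kLineSize;      // 64 lines

struct Line { bool valid; uint32_t tag; };
Line cache[kNumLines];                            // zero-initialized: all invalid

bool lookup(uint32_t texel_address)
{
    uint32_t line_addr = texel_address / kLineSize;
    uint32_t index     = line_addr % kNumLines;   // which cache line to check
    uint32_t tag       = line_addr / kNumLines;   // identifies the memory line
    if (cache[index].valid && cache[index].tag == tag)
        return true;                              // hit: texels already on chip
    cache[index] = {true, tag};                   // miss: fetch the line (from L2/DRAM)
    return false;
}

int main()
{
    bool first  = lookup(0x1234);   // cold cache: miss
    bool second = lookup(0x1234);   // same line again: hit
    std::printf("first access: %s, second access: %s\n",
                first ? "hit" : "miss", second ? "hit" : "miss");
    return 0;
}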
 