Silhouette extraction can be done on the GPU, but since we can't create geometry on the GPU at this time, you have to attach two degenerate triangles to each edge of a 2-manifold mesh.
NVIDIA describes a way to use one degenerate triangle per edge and exploit the fact that a projection with w=0 can generate the same shape as a quad.
A 2-manifold mesh with N triangles has 3N/2 edges and about N/2 vertices, so the 'new' mesh will have N + (2 * 3N/2) = 4N triangles and about 3N vertices. I don't like that.
To extract the shadow volume you need a normal for each triangle, and AFAIK you can't compute that normal on the GPU unless you store extra data in each vertex.
Yes, you need extra geometry, but the geometry is reasonably compact.
For non-skinned models, the position vector and the plane equation would suffice (or even only the normal, but that would take some more per-vertex processing). So that would be 28 (or 24) bytes per vertex, plus a set of indices. You don't have to store indices for the caps if you don't mind rendering in two calls instead of one: you can arrange the edge quads so that the vertex order forms the original mesh, so rendering without an index buffer would generate the caps, and the indices would create the rest of the volume. You lose some efficiency that way, though, also in the vertex caching.
For skinned models you'd need to store the three positions of the triangle for each vertex, plus the blend weight (and possibly the bone index), so that would be 36 + (B-1)*4 (or 40 + (B-1)*8) bytes per vertex for B bones.
So the amount of extra geometry shouldn't be all that high (less than 10 MB for the average game?), especially if you subtract the amount of geometry that a CPU-based approach would send over per frame on average (you would need to allocate buffers for that too). And the advantage, of course, is that the data here is static (this should also allow the driver to 'swap out' unused geometry from video memory to main memory, or you could do this manually).
Also, you could use lower-resolution meshes as shadow casters if you like (extruding frontfaces instead of backfaces will move the self-shadowing bugs to the back of the object, which is unlit anyway, so they won't be noticeable).
And if we compare to shadow maps, well... if you need six maps per light source, you will quickly run into higher memory demands than with shadow volumes.
Even though silhouette extraction can be done on the GPU at this time, I wouldn't advocate it, imho.
With the results I've seen, compared to what Doom3 does, I would certainly pick the GPU-based method on all R3x0 cards and up (most DX8-generation cards don't have enough shader power to beat the CPU, and on DX7 cards the method would only work with shader emulation, which is definitely a lot slower than a proper CPU-based approach).
I haven't decided yet whether I would favour shadow maps over shadow volumes on R3x0 and up, though. My decision at the time to go with shadow volumes was based on better compatibility with DX7 hardware, and shadow maps still have some open issues that keep them from being a robust solution.
And my experiment with cubemap shadow maps was disappointing: it wasn't faster than the same scene with shadow volumes, but it was much lower quality. On R3x0 you could increase the resolution and accuracy, though, so that would leave mainly the speed and robustness issues.