I guess a lot of it's black boxes. Once you have the hardware set up to receive polygons, you just send "set render state" commands and vertex commands, and it does everything automatically. The advantage of knowing how it works is that you get a better idea of how to optimize for the hardware. By measuring timing and reading some internal registers, you can get an idea of how the hardware works.
For example, there's a register that you can use to measure how much RAM is used by polygon data. So you can send a single triangle and see that it takes up 84 bytes of space. Then you try sending a 2-triangle strip and it takes up 108 bytes of space, 24 bytes more. You can guess that the extra 24 bytes are the extra vertex: position (12 bytes), UVs (8 bytes), and color (4 bytes). So the formula for space used for a strip is probably 12 + 24 * vert_count bytes. You try a 3-triangle strip, and it takes 192 bytes. That's odd, the formula didn't work: 5 vertices should have come to 132 bytes. Then you try a 4-triangle strip. That one follows the formula. Oh, that document on the GPU mentioned something about strip sizes of 1, 2, 4, and 6, but not 3. Maybe the rendering side of the hardware doesn't support 3-triangle strips. If you treat the 3-triangle strip as a 2-triangle strip plus a 1-triangle strip, the formula works again (108 + 84 = 192).
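To make the arithmetic concrete, here's a small sketch of that guessed size model checked against the three sizes mentioned above. The 12-byte per-strip overhead and the function names are my own assumptions for illustration, inferred from the measurements rather than taken from any hardware documentation.

```python
# Guessed from the measurements above; not taken from hardware documentation.
HEADER_BYTES = 12   # assumed fixed per-strip overhead
VERTEX_BYTES = 24   # position (12) + UVs (8) + color (4)

def strip_bytes(triangles):
    """Predicted size of one strip stored as-is (N triangles = N + 2 vertices)."""
    return HEADER_BYTES + VERTEX_BYTES * (triangles + 2)

def split_bytes(substrips):
    """Predicted size when a strip is stored as several substrips.

    `substrips` is a list of triangle counts, e.g. [2, 1].
    """
    return sum(strip_bytes(t) for t in substrips)

# Sizes reported in the text, keyed by triangle count.
MEASURED = {1: 84, 2: 108, 3: 192}

if __name__ == "__main__":
    for triangles, observed in MEASURED.items():
        print(triangles, "triangles: measured", observed,
              "predicted", strip_bytes(triangles))
    # The 3-triangle case only matches when treated as a 2 + 1 split.
    print("2 + 1 split:", split_bytes([2, 1]))
```

The 1- and 2-triangle predictions match the measurements, the plain 3-triangle prediction doesn't, and the 2 + 1 split does.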
You can figure out a lot by doing experiments like that.
It probably starts writing stuff to memory as soon as it either gets an end-of-strip vertex or the buffer fills up enough that a 6-triangle substrip is ready. It would be tricky to measure exactly when it triggers a write, but I think it's pretty likely that's how it works. There's actually a "finish this list" command that you need to send once you've submitted everything, which causes it to flush some unwritten data before you render the list, but I'm not sure exactly what it writes.
I think substrip generation was just "keep using the largest substrip size we can until we're done", so on-screen position doesn't really matter. You don't really need to do more than that to be optimal within a single strip. When you generate the "full" strips when creating an entire model, it's probably worth trying to avoid strips with an odd number of triangles, so that it doesn't need to generate single-triangle substrips.
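Here's a rough sketch of that "largest substrip first" rule, assuming the substrip sizes are the 1, 2, 4, and 6 triangles mentioned in the GPU document and that each substrip costs 12 + 24 * vert_count bytes as guessed earlier. The function names are made up for illustration, and the real hardware may well do it differently.

```python
# Substrip sizes in triangles, largest first (as recalled from the GPU document).
SUBSTRIP_SIZES = (6, 4, 2, 1)

def split_strip(triangles):
    """Greedily split a strip into the largest substrips that still fit."""
    substrips = []
    remaining = triangles
    for size in SUBSTRIP_SIZES:
        while remaining >= size:
            substrips.append(size)
            remaining -= size
    return substrips

def stored_bytes(triangles):
    """Guessed storage cost: every substrip pays 12 bytes plus 24 per vertex."""
    return sum(12 + 24 * (size + 2) for size in split_strip(triangles))

if __name__ == "__main__":
    for triangles in range(1, 9):
        print(triangles, split_strip(triangles), stored_bytes(triangles))
```

Under this guessed model, any strip with an odd triangle count ends in a single-triangle substrip, and a 5-triangle strip (4 + 1, 240 bytes) actually comes out bigger than a 6-triangle strip (204 bytes).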
The polygon data is double buffered, so while you're submitting data for one frame, it can also be rendering the previous frame.
I'm not sure I understand what you're getting at. Er, the hardware runs when you ask it to? You aren't forced to immediately start rendering when a list finishes, or submit polygons when rendering finishes.
To build a display list, you initialize some stuff in video RAM, set some registers to tell the hardware where you want the list written to, then write a register to tell it to get ready to accept polygon data. Then you feed it vertices, and when you're done, you tell it to finish the list. Then you get an interrupt when the list is ready to render.
To render a display list, you set some registers to tell it where the list is, then write another register to tell it to start. When it finishes, you get another interrupt.
Both of these can be happening at the same time.
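To picture how building and rendering overlap, here's a toy schedule for a double-buffered setup. It's only a model: it pretends building and rendering each take exactly one step, whereas the real hardware signals completion with interrupts and you decide when to kick off each stage. The function name and the step-based timing are assumptions for illustration.

```python
def frame_schedule(frame_count, buffer_count=2):
    """Return, per step, which (frame, buffer) is being built and rendered.

    Frame N is built during step N and rendered during step N + 1, so the
    build of one frame overlaps the render of the previous one.
    """
    steps = []
    for step in range(frame_count + 1):
        build = step if step < frame_count else None
        render = step - 1 if step >= 1 else None
        steps.append({
            "build": None if build is None else (build, build % buffer_count),
            "render": None if render is None else (render, render % buffer_count),
        })
    return steps

if __name__ == "__main__":
    for number, step in enumerate(frame_schedule(4)):
        print(number, step)
```

With two buffers, the list being built and the list being rendered never share a buffer, which is what lets both run at once.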