Asynchronous Hardware Communication

trinibwoy

Guys, quick question -

Communication between components running at different clocks/speeds is taken for granted in hardware today. But how exactly does it work? For example, if the ROPs on a GPU are all busy, are the shader pipelines stalled, or is there some sort of intermediate buffer? This is just one example, but is that how it works in general?
 
Generally speaking, elements have some sort of FIFOs/buffers in between. Should they fill or starve, then the elements will slow down.
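A toy cycle-by-cycle model makes the fill/starve behaviour concrete: a producer that emits one item per cycle feeds a slower consumer through a bounded FIFO, and once the FIFO fills, the producer stalls no matter how deep the FIFO is. (This is purely an illustrative software sketch; real hardware FIFOs are registers plus full/empty flags, and the numbers below are made up.)

```python
from collections import deque

def simulate(cycles, fifo_depth, consumer_rate):
    """Producer writes 1 item/cycle; consumer pops 1 item every
    `consumer_rate` cycles. Returns (produced, stalls, starves)."""
    fifo = deque()
    produced = stalls = starves = 0
    for cycle in range(cycles):
        # Producer: one item per cycle unless the FIFO is full.
        if len(fifo) < fifo_depth:
            fifo.append(produced)
            produced += 1
        else:
            stalls += 1
        # Consumer: pops one item every `consumer_rate` cycles.
        if cycle % consumer_rate == 0:
            if fifo:
                fifo.popleft()
            else:
                starves += 1
    return produced, stalls, starves

# A fast producer behind a slow consumer fills any finite FIFO;
# after that, throughput is set by the consumer, not the FIFO depth.
print(simulate(1000, 8, 2))
print(simulate(1000, 32, 2))  # deeper FIFO: same steady-state stalls
```

This is also why FIFO sizing matters: extra depth only helps absorb *bursts* of mismatch, not a sustained rate difference.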
 
I just found out fairly recently that the ROPs on NV40 are running at memory frequency. Is it the same on Radeons?
 
To expand on Dave's answer. Incorrectly sizing FIFOs can lead to big performance problems or a lot of wasted transistors. A big goal of performance testing is to figure out ideal FIFO sizes.

By this I mean a FIFO that is too small will result in a lot of stalls. In contrast, using a 32-deep FIFO when only 8 locations are typically needed is a waste of transistors.
 
One thing to note is that most of what you've talked about here is not really asynchronous. Rather this is just communication between blocks in the graphics pipeline that take differing numbers of cycles to complete their work.

Most of the time all of the parts of that pipeline run on the same clock ("engine" or "core" clock), so the data passing between them is still synchronous. The FIFOs are there to even out the workflow between blocks. They let the source block continue on with computations and generating data for the next block, until the FIFO is nearly full, and then they provide the destination block with a steady stream of data to work on.

Asynchronous communication is trickier. Depending on how much data needs to be sent across clock boundaries, you might put all of the actual data into a FIFO that is written with one clock and read with another, while having a parallel set of gray-coded signals communicating that data was added or removed. There is extra delay in this while you make sure everything is synchronized, so most of the time such async boundaries are only used when necessary. You must have one somewhere between the block running on the AGP/PCIE clock and the engine clock, and you must have one somewhere in the path between the engine clock and the external memory clock.
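The reason the pointers are gray-coded is that successive gray codes differ in exactly one bit. If the other clock domain samples a pointer mid-transition, it sees either the old value or the new value, never a garbage third value, whereas a binary counter flips several bits at once (e.g. 7 → 8 flips four bits). A quick sketch of the standard binary-to-gray conversion demonstrates the property:

```python
def bin_to_gray(n):
    """Standard binary-to-Gray-code conversion."""
    return n ^ (n >> 1)

for n in range(8):
    g0, g1 = bin_to_gray(n), bin_to_gray(n + 1)
    flipped = bin(g0 ^ g1).count("1")
    # Exactly one bit changes between consecutive codes.
    print(f"{n} -> {n + 1}: {g0:04b} -> {g1:04b} ({flipped} bit flipped)")
```

Because only one bit is ever in flight, the worst a metastable sample can do is lag one count behind, which for a FIFO pointer is conservative and therefore safe.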
 
BobbleHead said:
You must have one somewhere between the block running on AGP/PCIE clock and the engine clock, and you must have one somewhere in the path between engine clock and external memory clock.

And between the core and ROPs/Memory Controller as well?
 
BobbleHead said:
you might put all of the actual data into a FIFO that is written with one clock and read with another, while having a parallel set of gray-coded signals communicating that data was added or removed. There is extra delay in this while you make sure everything is synchronized.
Why not just have a FIFO that signals the supply end when it is full to not write any more to it? Why this other elaborate stuff, I can see no reason why it should be needed. :)
 
Guden Oden said:
BobbleHead said:
you might put all of the actual data into a FIFO that is written with one clock and read with another, while having a parallel set of gray-coded signals communicating that data was added or removed. There is extra delay in this while you make sure everything is synchronized.
Why not just have a FIFO that signals the supply end when it is full to not write any more to it? Why this other elaborate stuff, I can see no reason why it should be needed. :)

Try implementing a multi-threaded FIFO in C/C++ and you'll understand why... ;)
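The software analogue makes the problem clear: a naive "is it full?" check followed by a write races, because the answer can change between the check and the write. The check and the update have to be made atomic, here sketched with a lock and condition variables (a Python stand-in for the hardware's synchronized gray-coded pointers, not how silicon does it):

```python
import threading
from collections import deque

class BoundedFifo:
    """Bounded FIFO shared by a writer and a reader thread."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()
        lock = threading.Lock()
        self.not_full = threading.Condition(lock)
        self.not_empty = threading.Condition(lock)

    def push(self, item):
        with self.not_full:
            # Re-check under the lock after every wakeup: another
            # writer may have filled the slot first.
            while len(self.items) >= self.depth:
                self.not_full.wait()
            self.items.append(item)
            self.not_empty.notify()

    def pop(self):
        with self.not_empty:
            while not self.items:
                self.not_empty.wait()
            item = self.items.popleft()
            self.not_full.notify()
            return item

fifo = BoundedFifo(depth=4)
results = []
reader = threading.Thread(
    target=lambda: results.extend(fifo.pop() for _ in range(100)))
reader.start()
for i in range(100):
    fifo.push(i)
reader.join()
print(results[:5], results[-1])
```

Everything a simple "full" flag seems to make unnecessary (the lock, the wait loops, the notifications) is exactly the machinery that makes the flag trustworthy across the two "clock domains".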
 
trinibwoy said:
BobbleHead said:
You must have one somewhere between the block running on AGP/PCIE clock and the engine clock, and you must have one somewhere in the path between engine clock and external memory clock.
And between the core and ROPs/Memory Controller as well?

I don't know the details of that design, but that would fall under "between engine clock and external memory clock." Given the extra time cost of the asynchronous crossing, they probably do it as little as possible. So if the ROPs are running on the memory clock, there is one place where the data crosses from engine to memory clock. There is some piece of the chip running at the memory clock (or some nice 2^n or 1/2^n multiple of it), and once the data makes it there, it moves along a synchronous path to the external memory. Again, there are bound to be some synchronous FIFOs in that path, just like there are between parts of the graphics pipeline.

Guden Oden said:
BobbleHead said:
you might put all of the actual data into a FIFO that is written with one clock and read with another, while having a parallel set of gray-coded signals communicating that data was added or removed. There is extra delay in this while you make sure everything is synchronized.
Why not just have a FIFO that signals the supply end when it is full to not write any more to it? Why this other elaborate stuff, I can see no reason why it should be needed. :)

All that "elaborate stuff" is the method to properly signal that it is full. :)
 
Guden Oden said:
Why not just have a FIFO that signals the supply end when it is full to not write any more to it? Why this other elaborate stuff, I can see no reason why it should be needed. :)
Well, the only issue with doing this is: when do you tell the threads to start up again? In what order?
 