I have been thinking about the triangle setup stage of the pipeline for a while now, and I am curious why it has not been sped up beyond 1 tri/clk. Even GF100's 4 tris/clk seems rather low when ALU throughput, bandwidth, etc. have increased by two orders of magnitude over the last decade.
A little bit of googling led me to this:
http://www.extremetech.com/article2/0,2845,1155159,00.asp
It seems to suggest that triangle setup is just the calculation of edge slopes. That means two subtractions and one division per edge. Big deal. That is 3 flops so far. Let's multiply that by 10 to be generous: 30 flops for one edge, 90 flops for one triangle. Let's round that up to 100 flops per triangle.
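To make the "two subtractions and one division" concrete, here is a minimal sketch of what I am assuming the per-edge slope calculation looks like. The function names and the (x, y) tuple layout are my own for illustration, not anything taken from the article or from real hardware, and this ignores anything else setup might be doing.

```python
def edge_slope(p0, p1):
    """Inverse slope (dx/dy) of the edge p0 -> p1.

    Costs two subtractions and one division, i.e. the 3 flops
    counted above. Points are hypothetical (x, y) tuples.
    """
    dx = p1[0] - p0[0]  # subtraction 1
    dy = p1[1] - p0[1]  # subtraction 2
    if dy == 0:
        return None     # horizontal edge: no x step per scanline
    return dx / dy      # the one division


def triangle_edge_slopes(v0, v1, v2):
    """Slopes for the three edges of a triangle: 3 edges x 3 flops."""
    return [edge_slope(v0, v1), edge_slope(v1, v2), edge_slope(v2, v0)]


slopes = triangle_edge_slopes((0.0, 0.0), (4.0, 2.0), (0.0, 4.0))
```

That is 9 flops of actual math per triangle in this sketch, which is why padding it by 10x still feels generous to me.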
Cypress can do 1600 FMAs per clock. So, even with this loose estimate (and counting each FMA as a single flop), it should be able to set up 16 tris per clock.
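Spelling the back-of-envelope estimate out as arithmetic, so the assumptions are easy to poke at. The constant names are mine, the 10x padding and the rounding to 100 are just my guesses, and each FMA is counted as one flop.

```python
FLOPS_PER_EDGE_RAW = 3        # two subtractions + one division
PADDING_FACTOR = 10           # deliberately generous fudge factor
EDGES_PER_TRIANGLE = 3
FMA_PER_CLOCK_CYPRESS = 1600  # peak, counting one FMA as one flop

flops_per_edge = FLOPS_PER_EDGE_RAW * PADDING_FACTOR      # 30
flops_per_triangle = flops_per_edge * EDGES_PER_TRIANGLE  # 90
flops_per_triangle_rounded = 100                          # round up

# Peak ALU ops per clock divided by the padded per-triangle cost.
tris_per_clock = FMA_PER_CLOCK_CYPRESS // flops_per_triangle_rounded

print(tris_per_clock)  # 16
```

This assumes perfect ALU utilization and nothing else running on the shaders, so it is an upper bound on what the math alone would allow, not a prediction.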
With this kind of disparity between the fixed-function hardware and doing it in shaders, I wonder why setup has not been made into a kernel. I am sure the disparity (math-wise) is there in GF100 too. Then why have 4 hardware setup units?
What am I missing?