A bunch of questions.

jolle said:
It does sound slightly troubling, what with the history of NV3x: devs had to spend a lot of extra time to make it work as it should, for totally different reasons of course..
but if this ends up being up to the game developers, it sounds as if it could potentially end up the same way... or be skipped entirely, perhaps..
If the time spent can't be justified by the result or something..

Well, hopefully Nvidia will have taken measures not to run into the same sort of situation they did with NV3x..


From the looks of it, the NV40 provides a relatively easy platform to develop shaders for, if you use SM3.0. If SM3.0 is fast enough on the NV40, it could definitely simplify things; e.g. at http://www.beyond3d.com/forum/viewtopic.php?t=11752 you can see some examples where you can write what you want in SM3.0, but doing it in SM2.0 would require static approximations and significantly more instructions.
 
Some more questions

From one of the threads in this forum discussing the NV40 architecture, I have understood that fragments have to be FIFOed at each pipeline (or rather, quads may be FIFOed at each pipeline).

A) I'd like to know whether all these fragments must come from the same batch of primitives (given that there are no state changes within a single batch).

B) Can anyone give me an idea of how long (in number of fragments/quads) the FIFOs have to be, and how many internal pipeline stages the ALU/TMU needs, so that the latencies can be absorbed?

Thanks!
 
Re: Some more questions

I can answer B:

krychek said:
B) Can anyone give me an idea of how long (in number of fragments/quads) the FIFOs have to be, and how many internal pipeline stages the ALU/TMU needs, so that the latencies can be absorbed?

You can only give an absolute number for the length of the FIFO if you don't use flow control, because with flow control you cannot predict how long the current fragment will take to process. You also need a buffer for the fragments that are "on hold"; you simply add fresh quads to that buffer whenever there is space.

The same goes for the texture lookup latency: you probably need to store the fragment after a texture lookup request anyway, since the texel can reside in the L1 cache, the L2 cache, local memory, or system memory. So there is no way to know in advance how long the result will take to arrive.
 
joe emo said:
jolle said:
It does sound slightly troubling, what with the history of NV3x: devs had to spend a lot of extra time to make it work as it should, for totally different reasons of course..
but if this ends up being up to the game developers, it sounds as if it could potentially end up the same way... or be skipped entirely, perhaps..
If the time spent can't be justified by the result or something..

Well, hopefully Nvidia will have taken measures not to run into the same sort of situation they did with NV3x..


From the looks of it, the NV40 provides a relatively easy platform to develop shaders for, if you use SM3.0. If SM3.0 is fast enough on the NV40, it could definitely simplify things; e.g. at http://www.beyond3d.com/forum/viewtopic.php?t=11752 you can see some examples where you can write what you want in SM3.0, but doing it in SM2.0 would require static approximations and significantly more instructions.

My question to that is: why would it take significantly more instructions (instruction slots), since the compiler has to roll out the code anyway, no matter how nice and neat the HLSL source is? I haven't looked at the 3.0 spec, so maybe I'm missing something, but in general flow control is just for organizational purposes, not necessarily to reduce actual instruction-slot usage, which is the instruction count that really matters.
 
consider DemoCoder's example. I hope he isn't offended by this, and I'm not trying to put words in his mouth...

Code:
upper_bound = f(x);

for (i = 0; i < 255; i++) {
  if (i >= upper_bound) {
    break;
  }
  // do something useful
}

upper_bound is dynamic, based on the return value of f(x). Without SM3.0 you would have to approximate upper_bound with a static quantity, or perhaps define several bounds upper_bound1, upper_bound2, ... upper_boundn, do a series of if (test >= upper_boundn) checks, and write out all the statements. If one run of f(x) returned a large upper_bound, you would probably run out of slots doing it the hard way on SM2.0, while SM3.0 should handle it. If f(x) returned a lower value, SM3.0 executes fast, exiting the for loop when it should, but on SM2.0 you might not know which static upper_boundn to use and would therefore be SOL.
 
Or, in other words (if you don't mind, joe): Megadeath, you are correct. But there may be no way to know when to stop executing instructions without checking a condition and breaking out of the program once it is met.

Without flow control, your program is limited to pure math. With it, you can react and switch the calculation to a better one if needed, or stop and execute the next one.
 
Even without flow control you can achieve this, it just comes at a potentially large cost: you can fully unroll the loop and use predication or disabled writes to emulate BREAK.
 