The Experts Speak... "Automatic Shader Optimizer"

Simon F said:
Oh I can't take it anymore. I confess. We're putting a Cray Y-MP on a chip which will emulate the x86 in microcode, which in turn runs the refrast.
That is a wily idea. You should give a large bonus to the layout engineer who can fit that in your silicon budget. ;)
 
RussSchultz said:
Simon F said:
Oh I can't take it anymore. I confess. We're putting a Cray Y-MP on a chip which will emulate the x86 in microcode, which in turn runs the refrast.
That is a wily idea. You should give a large bonus to the layout engineer who can fit that in your silicon budget. ;)
:LOL: :LOL: :LOL:

Now you've gone and done it!
 
Simon F said:
arjan de lumens said:
Umm, no - GPUs are very different from CPUs in this respect. In a CPU, a cache miss usually means that you stall the instruction stream and thus the entire processor - in a GPU, a texture cache miss only means that you switch processing to another pixel (modern GPUs are easily capable of holding several tens or hundreds of pixels in flight and swap processing freely between them to maximize efficiency). If several pixels suffer texture cache misses, the memory controllers in modern GPUs are easily capable of pipelining the memory requests for the texture cache misses, usually to such an extent that you sustain ~90-95% of either theoretical texel fillrate or effective memory bandwidth, whichever limitation you hit first.
You're telling me how GPUs work? To think, I must have been asleep here for the past 10 years...
OK - then please explain: on synthetic tests (like this one), modern GPUs are capable of reaching >98% of theoretical maximum texel fillrate, even with multiple textures that are much too large to fit into the texture cache. How is this at all remotely possible if the GPU stalls on texture cache misses??
 
arjan de lumens said:
OK - then please explain: on synthetic tests (like this one), modern GPUs are capable of reaching >98% of theoretical maximum texel fillrate, even with multiple textures that are much too large to fit into the texture cache. How is this at all remotely possible if the GPU stalls on texture cache misses??
Latency.
 
OpenGL guy said:
arjan de lumens said:
OK - then please explain: on synthetic tests (like this one), modern GPUs are capable of reaching >98% of theoretical maximum texel fillrate, even with multiple textures that are much too large to fit into the texture cache. How is this at all remotely possible if the GPU stalls on texture cache misses??
Latency.
Umm, could you explain that a little further? Latency in what component?
 
arjan de lumens said:
OpenGL guy said:
arjan de lumens said:
OK - then please explain: on synthetic tests (like this one), modern GPUs are capable of reaching >98% of theoretical maximum texel fillrate, even with multiple textures that are much too large to fit into the texture cache. How is this at all remotely possible if the GPU stalls on texture cache misses??
Latency.
Umm, could you explain that a little further? Latency in what component?
Texture requests don't happen instantaneously. Chips are designed so that the data is available when it's needed. In other words, there is latency between when the data is fetched and when it is actually used. The trick is to hide as much latency as possible so that you stall as little as possible.
 
OK... it seems to me that one way to do it would be to insert into the texture cache a number of pipeline steps, so that it takes a little more than one memory latency for data to trickle through them: record and gather cache-miss addresses at the beginning of the pipeline, fetch the texture data from memory while the other data trickle down the pipe, and get the data safely into the cache just before the other data reach the end of the pipeline, so that at the end we get a cache hit (almost) all of the time. Am I making sense?
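
In case it helps, here is a throwaway little simulation of that idea in C. Everything in it - FIFO depth, memory latency, cache size, access pattern - is made up purely for illustration, not taken from any real chip:

    /* Toy model of the pipeline described above: the FIFO is a little deeper
     * than the memory latency.  A miss is recorded when a pixel ENTERS the
     * FIFO and the fetch is issued right away; by the time the pixel reaches
     * the far end the data has arrived, so the final read (almost) always
     * hits.  All sizes and latencies are invented. */
    #include <stdio.h>

    #define MEM_LATENCY 100   /* cycles from request to data return          */
    #define FIFO_DEPTH  120   /* a little more than one memory latency       */
    #define CACHE_LINES  64   /* tiny direct-mapped "texture cache"          */
    #define NUM_PIXELS 5000

    int main(void)
    {
        int cache_tag[CACHE_LINES];  /* which texture line each slot holds   */
        int fifo_addr[FIFO_DEPTH];   /* texture line needed by each pixel    */
        int fill_time[FIFO_DEPTH];   /* cycle at which that line has arrived */
        int hits = 0, stalls = 0;

        for (int i = 0; i < CACHE_LINES; i++) cache_tag[i] = -1;
        for (int i = 0; i < FIFO_DEPTH; i++) fifo_addr[i] = -1;

        for (int cycle = 0; cycle < NUM_PIXELS + FIFO_DEPTH; cycle++) {
            int slot = cycle % FIFO_DEPTH;

            /* Pixel leaving the far end: its line should be resident by now. */
            if (fifo_addr[slot] >= 0) {
                if (cache_tag[fifo_addr[slot] % CACHE_LINES] == fifo_addr[slot]
                    && cycle >= fill_time[slot])
                    hits++;
                else
                    stalls++;   /* would have to stall here - rare by design  */
            }

            /* New pixel entering: probe the cache; on a miss, issue the fetch
             * now so the data is there before this pixel reaches the far end. */
            if (cycle < NUM_PIXELS) {
                int line = (cycle / 4) % 4096;   /* fake, mostly coherent accesses */
                fifo_addr[slot] = line;
                if (cache_tag[line % CACHE_LINES] != line) {
                    cache_tag[line % CACHE_LINES] = line;     /* reserve the slot  */
                    fill_time[slot] = cycle + MEM_LATENCY;    /* data lands later  */
                } else {
                    fill_time[slot] = cycle;                  /* already resident  */
                }
            } else {
                fifo_addr[slot] = -1;
            }
        }
        printf("final-stage hits: %d, stalls: %d\n", hits, stalls);
        return 0;
    }

Run it and the final stage sees nothing but hits, even though every new line costs a full memory latency - which is the whole point.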
 
arjan de lumens said:
OK... it seems to me that one way to do it would be to insert into the texture cache a number of pipeline steps, so that it takes a little more than one memory latency for data to trickle through them: record and gather cache-miss addresses at the beginning of the pipeline, fetch the texture data from memory while the other data trickle down the pipe, and get the data safely into the cache just before the other data reach the end of the pipeline, so that at the end we get a cache hit (almost) all of the time. Am I making sense?
So what's your shader doing while you're waiting on the texture data? See? This is all very complicated :)
 
And don't forget:

- memory latency is highly variable
- it could be across AGP
- this may happen multiple times during the processing of a single pixel
 
RussSchultz said:
Simon F said:
Oh I can't take it anymore. I confess. We're putting a Cray Y-MP on a chip which will emulate the x86 in microcode, which in turn runs the refrast.
That is a wily idea. You should give a large bonus to the layout engineer who can fit that in your silicon budget. ;)
Strange.. he said it'd be easy just as soon as those 1 metre diameter wafers come on line...
 
Arjan,
Just saw this quote from John Carmack:
Light and view vectors normalized with math, rather than a cube map. On
future hardware this will likely be a performance improvement due to the
decrease in bandwidth, but current hardware has the computation and bandwidth
balanced such that it is pretty much a wash.
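
For anyone who hasn't seen it spelled out, the two alternatives Carmack is weighing look roughly like this in plain C. The cube-map path below is only a crude software stand-in for the hardware fetch - the face resolution, nearest sampling and face orientation are all made up for the example:

    #include <math.h>
    #include <stdio.h>

    typedef struct { float x, y, z; } vec3;

    /* (a) Normalize with math: a dot product, a reciprocal square root and a
     *     scale - a few ALU ops, no memory traffic. */
    static vec3 normalize_math(vec3 v)
    {
        float inv = 1.0f / sqrtf(v.x * v.x + v.y * v.y + v.z * v.z);
        return (vec3){ v.x * inv, v.y * inv, v.z * inv };
    }

    /* (b) Normalize with a "normalization cube map": every texel stores the
     *     unit vector pointing through it, so normalizing becomes one
     *     (filtered) texture fetch - bandwidth instead of math.  Emulated in
     *     software here with a made-up resolution and nearest sampling. */
    #define RES 64
    static vec3 face_px[6][RES][RES];

    static void build_normalization_cube_map(void)   /* done once, offline */
    {
        for (int f = 0; f < 6; f++)
            for (int t = 0; t < RES; t++)
                for (int s = 0; s < RES; s++) {
                    float u = 2.0f * (s + 0.5f) / RES - 1.0f;
                    float w = 2.0f * (t + 0.5f) / RES - 1.0f;
                    vec3 d = f == 0 ? (vec3){ 1, w, u} :   /* +X */
                             f == 1 ? (vec3){-1, w, u} :   /* -X */
                             f == 2 ? (vec3){ u, 1, w} :   /* +Y */
                             f == 3 ? (vec3){ u,-1, w} :   /* -Y */
                             f == 4 ? (vec3){ u, w, 1} :   /* +Z */
                                      (vec3){ u, w,-1};    /* -Z */
                    face_px[f][t][s] = normalize_math(d);
                }
    }

    static vec3 normalize_cube_lut(vec3 v)            /* the "texture fetch" */
    {
        float ax = fabsf(v.x), ay = fabsf(v.y), az = fabsf(v.z);
        int f; float ma, u, w;
        if (ax >= ay && ax >= az) { f = v.x > 0 ? 0 : 1; ma = ax; u = v.z; w = v.y; }
        else if (ay >= az)        { f = v.y > 0 ? 2 : 3; ma = ay; u = v.x; w = v.z; }
        else                      { f = v.z > 0 ? 4 : 5; ma = az; u = v.x; w = v.y; }
        int s = (int)((u / ma * 0.5f + 0.5f) * (RES - 1) + 0.5f);
        int t = (int)((w / ma * 0.5f + 0.5f) * (RES - 1) + 0.5f);
        return face_px[f][t][s];
    }

    int main(void)
    {
        build_normalization_cube_map();
        vec3 v = { 0.3f, -1.7f, 0.6f };
        vec3 a = normalize_math(v), b = normalize_cube_lut(v);
        printf("math: %.3f %.3f %.3f   cube LUT: %.3f %.3f %.3f\n",
               a.x, a.y, a.z, b.x, b.y, b.z);
        return 0;
    }

The first version is a handful of shader ALU instructions; the second is a single texture fetch, i.e. it spends bandwidth (and a texture unit) instead of math - which is the computation/bandwidth balance Carmack is referring to.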
 
Simon F said:
RussSchultz said:
Simon F said:
Oh I can't take it anymore. I confess. We're putting a Cray Y-MP on a chip which will emulate the x86 in microcode, which in turn runs the refrast.
That is a wily idea. You should give a large bonus to the layout engineer who can fit that in your silicon budget. ;)
Strange.. he said it'd be easy just as soon as those 1 metre diameter wafers come on line...

Cool. Does it get to be Fluorinert-cooled like the Y-MP? :)

Cheers
Gubbi
 
Simon F said:
Strange.. he said it'd be easy just as soon as those 1 metre diameter wafers come on line...

Bah...you people need to think outside the box once in a while.

Three 300mm wafers + a little duct tape = problem solved.
 
ROFL. You guys really know how to make a serious thread degenerate into pure insanity ;)

Anyway...

Simon F said:
Enbar said:
However if nvidia can do a cos as fast as they can do a mad I'll eat my socks. It just wouldn't make any sense to use all the necessary transistors for that.
Don't see why not. Cosine (restricted over a sensible domain) is not terribly difficult.

I just thought I'd make a comment about this:
While that's true for the NV30, it's not for the NV35: the NV35 is 8 MAD/... + 4 COS/SIN/... (I'm not sure if it's AND or OR, actually - I don't think this was ever tested here)


Uttar
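
To illustrate Simon F's point that a restricted-domain cosine isn't hard: the usual recipe is range reduction followed by a short even polynomial, i.e. a handful of multiply-adds. A rough sketch in C (the coefficients below are just the truncated Taylor series, not what any actual chip uses, and fixed-point/precision issues are ignored):

    #include <math.h>
    #include <stdio.h>

    static float cheap_cos(float x)
    {
        const float PI = 3.14159265f;
        /* Range-reduce to [-pi, pi], then fold into [0, pi/2] by symmetry. */
        x = fabsf(x - 2.0f * PI * floorf(x / (2.0f * PI) + 0.5f));
        float sign = 1.0f;
        if (x > 0.5f * PI) {            /* cos(x) = -cos(pi - x) */
            x = PI - x;
            sign = -1.0f;
        }
        /* Even polynomial in x^2: four multiply-adds in Horner form. */
        float x2 = x * x;
        float p = 1.0f + x2 * (-0.5f + x2 * (1.0f / 24
                        + x2 * (-1.0f / 720 + x2 * (1.0f / 40320))));
        return sign * p;
    }

    int main(void)
    {
        for (float a = -7.0f; a <= 7.0f; a += 1.3f)
            printf("x=%5.2f  cheap=% .5f  libm=% .5f\n", a, cheap_cos(a), cosf(a));
        return 0;
    }

Over the reduced range the truncated series is good to roughly 1e-5, and the whole thing is only a few MAD-class operations plus the range reduction.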
 
OpenGL guy said:
arjan de lumens said:
OK... it seems to me that one way to do it would be to insert into the texture cache a number of pipeline steps, so that it takes a little more than one memory latency for data to trickle through them: record and gather cache-miss addresses at the beginning of the pipeline, fetch the texture data from memory while the other data trickle down the pipe, and get the data safely into the cache just before the other data reach the end of the pipeline, so that at the end we get a cache hit (almost) all of the time. Am I making sense?
So what's your shader doing while you're waiting on the texture data? See? This is all very complicated :)
Basically, my shader tries to texture other pixels, possibly causing additional texture cache misses, which I then serve in parallel with the first cache miss (4-way crossbar = 4 misses served in parallel * effects of pipelining in the DRAMs themselves). As for when pixels leave the texture cache and enter the rest of the shader, they should probably do so once all the necessary texture data have been fetched (possibly with checking whenever you receive texture data from memory ...)
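
Put as a toy model (channel count, latency and per-pixel miss counts are all invented): each in-flight pixel just carries a count of outstanding fetches, the misses are serviced in parallel over the memory channels, and a pixel is handed on to the rest of the shader as soon as its own data is complete, regardless of what the other pixels are waiting for:

    #include <stdio.h>

    #define PIXELS    8
    #define CHANNELS  4    /* misses serviced in parallel (the 4-way crossbar) */
    #define LATENCY  20    /* cycles from request to data return (invented)    */

    int main(void)
    {
        int busy_until[CHANNELS] = {0}; /* next cycle a channel takes a request  */
        int ready_at[PIXELS];           /* cycle when a pixel's last texel lands */

        for (int i = 0; i < PIXELS; i++) {
            int fetches = 1 + (i % 3);  /* this pixel misses 1..3 times (made up) */
            ready_at[i] = 0;
            for (int f = 0; f < fetches; f++) {
                int ch    = (i + f) % CHANNELS;   /* spread misses over channels   */
                int issue = busy_until[ch]++;     /* pipelined: 1 request/cycle/ch */
                if (issue + LATENCY > ready_at[i])
                    ready_at[i] = issue + LATENCY;
            }
        }

        /* No pixel ever blocks another: each is released to the shader the
         * moment its own fetches are complete. */
        for (int i = 0; i < PIXELS; i++)
            printf("pixel %d: ready for shading at cycle %d\n", i, ready_at[i]);
        return 0;
    }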
 
Simon F said:
Arjan,
Just saw this quote from John Carmack:
Light and view vectors normalized with math, rather than a cube map. On
future hardware this will likely be a performance improvement due to the
decrease in bandwidth, but current hardware has the computation and bandwidth
balanced such that it is pretty much a wash.

Came to think of another reason why using textures as LUTs may be slow in GPUs: If you have, say, 4 pipelines working on 2x2 pixel patches and do e.g. bilinear texturing with mipmapping, all the texel accesses of the pipelines in a given clock cycle will hit within a very small block of texels within the same mipmap - around 4x4 texels or so. This is a very common scenario, and as such I expect texture caches to be optimized for it. If you then instead do LUT lookups in the 4 pipelines, you will be hitting 4 unrelated locations in the texture map, any 4x4-texel optimizations will fail, and you will spend something like 4 cycles just reading out the texels you need from the cache.

If this is the case, then the performance hit of LUT lookups relative to standard texturing will be roughly fixed at 4x, almost unaffected by LUT size, texture cache miss rate, and memory latency.
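
A quick back-of-the-envelope version of that 4x figure (the 4x4-texel block size and the one-block-read-per-cycle assumption are just for illustration; bilinear footprints are ignored for brevity):

    #include <stdio.h>

    #define TEX_W 256
    /* Which 4x4-texel cache block a texel falls in. */
    static int block_of(int x, int y) { return (y / 4) * (TEX_W / 4) + (x / 4); }

    static int distinct_blocks(const int x[4], const int y[4])
    {
        int ids[4], n = 0;
        for (int i = 0; i < 4; i++) {
            int b = block_of(x[i], y[i]), seen = 0;
            for (int j = 0; j < n; j++) if (ids[j] == b) seen = 1;
            if (!seen) ids[n++] = b;
        }
        return n;
    }

    int main(void)
    {
        /* Ordinary texturing: the four pixels of a 2x2 quad sample adjacent texels. */
        int cx[4] = {100, 101, 100, 101}, cy[4] = {40, 40, 41, 41};
        /* LUT use: the "coordinates" are data-dependent values, so the four
         * pixels hit unrelated places in the map. */
        int lx[4] = {3, 177, 92, 240},    ly[4] = {200, 15, 130, 77};

        int nc = distinct_blocks(cx, cy), nl = distinct_blocks(lx, ly);
        printf("coherent quad : %d cache block(s) -> ~%d read cycle(s)\n", nc, nc);
        printf("scattered LUT : %d cache block(s) -> ~%d read cycle(s)\n", nl, nl);
        return 0;
    }

If the cache can hand one such block to the pipelines per cycle, the coherent quad is serviced in a single cycle while the scattered lookups take four.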
 
It is better if there is coherence in the data fetches - random fetches cause a lot of cache misses. Fortunately, even LUT data is often reasonably spatially coherent, and small LUTs (particularly 1D ones) help a lot.

I'd say it's a general optimisation (all architectures) to aim to reduce the number of very scattered LUT accesses. If it can't be reduced - well, just do it, the performance hit might be small anyway.
 