How to implement a software pre- and post-T&L vertex cache

vrmm

Newcomer
Dear all,
I am doing some work on a software OpenGL ES implementation, but the performance is not good. :) So I want to do some optimizations, starting with a vertex cache: both a pre-T&L and a post-T&L vertex cache. But I don't know much about them yet. I know NVIDIA has "NvTriStrip" for cache optimization, but how is it used? Could I add it to my pre- and post-T&L caches? I am very eager to know.

I don't think I have a deep knowledge of the pre- and post-T&L cache yet. Can someone give me some suggestions or good information about this vertex cache technology?
Thank you very much!
 
There was a big discussion on implementing a vertex cache in a software renderer on these very boards perhaps 2 years back.
 
There was a big discussion on implementing a vertex cache in a software renderer on these very boards perhaps 2 years back.

Where could I find these discussions? Thank you very much!
 
It's nice to see people remember my discussions. :D

Having a pre-T&L cache is not useful for software. In a GPU, it's used to bring unprocessed vertices into the core early. A CPU already has a cache that works automatically for all data. The closest equivalent is prefetching, but you need special instructions for that. What is the target platform?

A post-T&L cache is very useful. In my implementation it stores completely processed vertices, in a format with all components. Just 16 or 32 of these vertices is enough. It works very simply: when a vertex is requested, its index is looked up in the cache. When it's found, that completely processed vertex is returned. Otherwise, it is sent through the T&L pipeline and the cache is updated. Many update mechanisms can be used, but I use a very simple circular buffer.
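A minimal sketch of what such a cache might look like in plain C++ (the vertex layout, the 32-entry size and the transformAndLight() helper are illustrative, not taken from any particular renderer):

#include <cstdint>
#include <cstddef>

struct TransformedVertex
{
    float position[4];
    float color[4];
    float texCoord[2];
};

// Stands in for the real pipeline: reads the source vertex at 'index',
// transforms and lights it, and returns the fully processed result.
TransformedVertex transformAndLight(uint32_t index);

class PostTnLCache
{
    static const size_t SIZE = 32;       // 16 or 32 entries is usually enough

    uint32_t          tag[SIZE];         // source index of each cached vertex
    TransformedVertex data[SIZE];        // the completely processed vertices
    size_t            next;              // next slot to overwrite (circular buffer)

public:
    PostTnLCache() : next(0)
    {
        for(size_t i = 0; i < SIZE; ++i) tag[i] = 0xFFFFFFFFu;   // mark all slots empty
    }

    const TransformedVertex &fetch(uint32_t index)
    {
        for(size_t i = 0; i < SIZE; ++i)          // hit: return the cached vertex
            if(tag[i] == index) return data[i];

        data[next] = transformAndLight(index);    // miss: run T&L and update the cache
        tag[next] = index;

        const TransformedVertex &result = data[next];
        next = (next + 1) % SIZE;
        return result;
    }
};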

Let me know if you need more details!
 
Thanks a lot!

The closest equivalent is prefetching, but you need special instructions for that. What is the target platform?

If I want to implement the software renderer on WinCE, what are the special instructions for prefetching? I do not know them well. What's more, how effective is prefetching in software? I think I will not use it if it does not work well.

When it's found, that completely processed vertex is returned. Otherwise, it is sent through the T&L pipeline and the cache is updated.

Is the requested vertex sent to the T&L pipeline if it is not found in the post-T&L cache? Is that true? I thought the requested vertex might be sent to the next stage, rasterization.


Another question is about the efficiency of the post-T&L cache. Have you compared the performance with and without the cache? How much does the cache improve performance?

Thanks again!
 
vrmm said:
If I want to implement the software renderer on WinCE, what are the special instructions for prefetching? I do not know them well. What's more, how effective is prefetching in software? I think I will not use it if it does not work well.
Windows CE is not a processor. And that's what matters here ;)
Some architectures (like the K6-2/3, Athlon and Pentium III) have these prefetch instructions. PREFETCH/PREFETCHW are part of 3DNow!, and PREFETCHT{0|1|2}/PREFETCHNTA are part of SSE, also available on the Athlon. It depends on your compiler whether you need to get your hands dirty with (inline) assembly or intrinsics to use them. On some compilers you can't use them at all.
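For example, on an x86 compiler that exposes the SSE intrinsics through <xmmintrin.h> (MSVC and GCC both do), it might look like this; the vertex layout and the prefetch distance of four vertices ahead are only illustrative and would need tuning for the actual processor:

#include <xmmintrin.h>

struct SourceVertex
{
    float x, y, z;
    float nx, ny, nz;
    float u, v;
};

void processVertices(const SourceVertex *src, SourceVertex *dst, int count)
{
    for(int i = 0; i < count; ++i)
    {
        // Request the vertex a few iterations ahead; PREFETCHT0 pulls the
        // cache line into all cache levels without stalling the pipeline.
        if(i + 4 < count)
            _mm_prefetch(reinterpret_cast<const char *>(&src[i + 4]), _MM_HINT_T0);

        dst[i] = src[i];   // placeholder for the real per-vertex work
    }
}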

Now, if you're targeting embedded processors, you probably won't have these instructions available to you. It might be that some of your specific target processors have something similar. I don't know.

Another way to get something similar to software prefetch is to just load a value from the cache line you want to prefetch into a register. Don't use it, just load it to a variable, and the processor will load the entire cache line. This is problematic, because it can stall the processor until the load has finished. If your target is strictly an in-order processor, this is entirely counter-productive. Even if it does some form of OOOE, this is very hard to pull off without intricate knowledge about pipeline lengths, scheduler depths, cache line sizes and expected memory latency and bandwidth. I'd say that's hopeless. Embedded devices have lots of variation in these parameters. "Emulated" prefetching just isn't worth it for this kind of moving target.
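Just to illustrate the idea (with all the caveats above), such an "emulated" prefetch can be as small as this; the volatile access is only there to keep the compiler from removing the unused load:

// Touch one byte of the given address so the hardware fills the whole
// cache line. On an in-order processor this load can stall the pipeline,
// which is exactly the problem described above.
inline void touchCacheLine(const void *p)
{
    volatile const char *line = static_cast<volatile const char *>(p);
    (void)*line;
}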

My advice would be to have a look at the docs for the target processors you're aware of and try to find out if they support some prefetch instructions, and if they do, how you'd go about using them -- without breaking compatibility with other processors!

If you don't find anything suitable, just leave it alone.
 
vrmm said:
Is the requested vertex sent to the T&L pipeline if it is not found in the post-T&L cache? Is that true? I thought the requested vertex might be sent to the next stage, rasterization.
If the requested vertex is not found in the cache, the unprocessed vertex is read from the vertex buffer, sent through the T&L pipeline, and then stored in the cache. When all three vertices of a triangle are in the cache, they go to rasterization.

What I do in my implementation is not send them directly to rasterization. The problem with that is that you'd constantly be switching between different code (vertex/pixel). So to avoid that, I first process the vertices for several triangles. I call this a batch. It holds the vertices of about 16 triangles, thus 48 vertices. Once the batch is filled, all 16 triangles are rasterized. This approach might not be worth it if your processor isn't instruction-cache sensitive.
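A rough sketch of that batching, reusing the hypothetical PostTnLCache and TransformedVertex from the earlier sketch; rasterizeTriangle() stands for your existing rasterizer:

const int TRIANGLES_PER_BATCH = 16;

struct Triangle
{
    TransformedVertex v[3];
};

void rasterizeTriangle(const Triangle &tri);   // your existing rasterizer

void drawIndexedTriangles(const uint32_t *indices, int triangleCount,
                          PostTnLCache &cache)
{
    Triangle batch[TRIANGLES_PER_BATCH];
    int batched = 0;

    for(int t = 0; t < triangleCount; ++t)
    {
        // Vertex phase: every index goes through the post-T&L cache.
        for(int k = 0; k < 3; ++k)
            batch[batched].v[k] = cache.fetch(indices[t * 3 + k]);

        // Pixel phase: once the batch is full (or we run out of triangles),
        // rasterize all of them in one go.
        if(++batched == TRIANGLES_PER_BATCH || t == triangleCount - 1)
        {
            for(int b = 0; b < batched; ++b)
                rasterizeTriangle(batch[b]);

            batched = 0;
        }
    }
}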
vrmm said:
Another question is about the efficiency of the post-T&L cache. Have you compared the performance with and without the cache? How much does the cache improve performance?
In theory a rectangular patch reuses the same vertex six times (each interior vertex of a regular triangulated grid is shared by six triangles). In practice, about four times higher vertex processing performance is quite common. But it doesn't help total performance all that much. Vertex processing is not the biggest bottleneck for software rendering. The number of pixels is much bigger than the number of vertices. So I advise you not to put more time into this than necessary. Implement it, test it, and then spend your time on more important bottlenecks.
 
vrmm said:
Another way to get something similar to software prefetch is to just load a value from the cache line you want to prefetch into a register. Don't use it, just load it to a variable, and the processor will load the entire cache line. This is problematic, because it can stall the processor until the load has finished. If your target is strictly an in-order processor, this is entirely counter-productive. Even if it does some form of OOOE, this is very hard to pull off without intricate knowledge about pipeline lengths, scheduler depths, cache line sizes and expected memory latency and bandwidth. I'd say that's hopeless. Embedded devices have lots of variation in these parameters. "Emulated" prefetching just isn't worth it for this kind of moving target.

If you load a value into a register and ignore it, the processor will still have to wait for the load to finish before it can retire it, and this will cause bubbles in your pipeline nevertheless; OOO won't help you much. The good thing about prefetch instructions is that they can be retired even if the load hasn't completed. If his targets are embedded processors, which are usually in-order processors with short pipelines (and a small instruction window), it wouldn't be very useful to simulate prefetch this way, though it could bring marginal gains on processors with a large instruction window (a la P4). If regular prefetch instructions are not available, I'd suggest grouping loads which are likely to cause cache misses together, to maximize pipelining on the external bus.
 
vrmm said:
The closest equivalent is prefetching, but you need special instructions for that. What is the target platform?

If I want to implement the software renderer on WinCE, what are the special instructions for prefetching? I do not know them well. What's more, how effective is prefetching in software? I think I will not use it if it does not work well.
You may have to be more specific about the target platform - Windows CE exists for at least 5 different processor families (ARM, MIPS, PowerPC, SuperH and x86). While prefetch instructions are defined for all these families, not all processors support them, and the behaviour and performance boost on processors that do support them may vary greatly.

On processors that don't support multiple outstanding transfers (e.g. the AMD K6 and ARM9E), the benefit of prefetching is often not worth the effort, as the prefetching blocks the memory bus for actual cache misses. OTOH, if the processor does support multiple outstanding transfers, the performance boost from carefully done prefetching may well be quite large, like 20 to 100%.
 
Vertex processing is not the biggest bottleneck for software rendering. The number of pixels is much bigger than the number of vertices.

So the biggest bottleneck for software is the pixel shading. Could you tell me some techniques to improve its performance?

Thanks all!
 
vrmm said:
So the biggest bottleneck for software is the pixel shading. Could you tell me some techniques to improve its performance?
That depends a lot on how far you want to go... The two most successful technologies I'm using in swShader both require assembly programming. But the same rules apply: you want to do as much as possible in parallel, and you want to avoid state checks...

Windows CE runs on many different processors, so using assembly code is out of the question (unless you really want to write a back-end for every processor). So I assume you're using a C++ compiler? One way to construct many different pixel pipelines is to use preprocessor directives to conditionally include certain parts of the pipeline (how many textures to sample, which blending operations, etc.). Make it compile every possible variant, and use a hashing system to quickly find the correct function that performs the requested rendering operations. The problem with this is that there can be a very large number of variants, which results in a big executable and long compilation times.
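A compressed illustration of that idea; to keep the sketch self-contained it uses C++ templates instead of per-variant #include files, and the two state bits (texturing and fog) plus all names are made up for the example:

template<bool TEXTURED, bool FOGGED>
void scanline(int x0, int x1, int y)
{
    for(int x = x0; x < x1; ++x)
    {
        unsigned color = 0xFFFFFFFF;

        if(TEXTURED) color = 0xFF808080;                   // placeholder: sample the texture here
        if(FOGGED)   color = (color >> 1) & 0x7F7F7F7F;    // placeholder: blend in the fog here

        // writePixel(x, y, color);                        // hypothetical framebuffer write
        (void)y; (void)color;
    }
}

typedef void (*ScanlineFunc)(int x0, int x1, int y);

// Every variant is compiled up front; the render-state bits index the table,
// which plays the role of the "hashing system" in this tiny example.
static ScanlineFunc scanlineTable[2][2] =
{
    { &scanline<false, false>, &scanline<false, true> },
    { &scanline<true,  false>, &scanline<true,  true> },
};

ScanlineFunc selectScanline(bool textured, bool fogged)
{
    return scanlineTable[textured][fogged];
}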

Another approach is to use a deferred rendering technique in tiles. Break the polygons into tiles and process them separately. For every tile, first sample all the textures you need, and store that into buffers the size of a tile. Do all blending operations for a whole tile, so you avoid checking the render states for every pixel. Given the restrictions of your platform, I think this might be the most successful method.
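A rough sketch of the per-tile loop; sampleTexture(), blend() and writeTileToFramebuffer() are hypothetical stand-ins for the corresponding stages, and the 16x16 tile size is only an example:

#include <cstdint>

const int TILE = 16;   // 16x16 pixels per tile (example size)

uint32_t sampleTexture(int x, int y);                                     // hypothetical
uint32_t blend(uint32_t texel);                                           // hypothetical
void     writeTileToFramebuffer(int tileX, int tileY, const uint32_t *);  // hypothetical

void shadeTile(int tileX, int tileY)
{
    uint32_t texels[TILE * TILE];
    uint32_t output[TILE * TILE];

    // Pass 1: sample the texture(s) for every pixel of the tile.
    for(int y = 0; y < TILE; ++y)
        for(int x = 0; x < TILE; ++x)
            texels[y * TILE + x] = sampleTexture(tileX * TILE + x, tileY * TILE + y);

    // Pass 2: run the blending stage over the whole tile, so the render
    // states are examined once per tile instead of once per pixel.
    for(int i = 0; i < TILE * TILE; ++i)
        output[i] = blend(texels[i]);

    writeTileToFramebuffer(tileX, tileY, output);
}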

Anyway, don't look for optimizations too soon, or you'll trip over your own feet. I know it's tempting to try to do things 'the right way' from the start, but as soon as you change something this becomes a mess. So implement a reliable reference first. Always make sure you have something to go back to that works.
 
Thank you very much! Your suggestions are very good. :D



Another approach is to use a deferred rendering technique in tiles. Break the polygons into tiles and process them separately. For every tile, first sample all the textures you need, and store that into buffers the size of a tile. Do all blending operations for a whole tile, so you avoid checking the render states for every pixel. Given the restrictions of your platform, I think this might be the most successful method.

I also think deferred rendering is a good method for my platform, from your description. But I don't know it well. Could you explain it in more detail, or point me to some references about this method? Thanks a lot!



The two most successful technologies I'm using in swShader

I have visited your web site. These technologies are very successful. I think they can also be used in my software renderer.
 