Unified Shader Architecture: Point sampling in addition to Bilinear Texturing

Jawed

In Xenos there's point-sampling functionality (primarily aimed at vertex fetching) in addition to the normal bilinear (or better) texture fetching/filtering.

Since it's a unified architecture, these point-sampling units are available to any shader concurrently with the bilinear texturing units (erm, I presume they are!). What I'm wondering is, what's the impact going to be on the performance of pixel shading?

It's my understanding that there's generally a fair amount of point-sampled texturing used in pixel shaders, to perform "look-ups". At the same time (not being a dev) I don't know the degree to which point sampling is typically used.

So, will the ability of a pixel shader to perform bilinear (or better) texturing concurrently with point-sampled texturing make a significant performance difference? If there are 16 TMUs and 16 TPSs available (texture point samplers, for want of a better abbreviation - perhaps I should just stick to VTF, vertex texture fetch, even if it's a bit confusing in this context), will this make a significant difference to the performance of pixel shading?

One caveat, as I understand it from Xenos, is that the point-sampling units are really meant for 1D access. I don't know why this constraint exists, or whether it's enforced in any way - it might simply be about performance (e.g. half speed address calculation for 2D textures). I don't know whether it's reasonable to expect this constraint to carry over to future USAs, such as R600.

Jawed
 
Jawed said:
It's my understanding that there's generally a fair amount of point-sampled texturing used in pixel shaders, to perform "look-ups". At the same time (not being a dev) I don't know the degree to which point sampling is typically used.
I doubt it's what I would call "a fair amount". Linear interpolation is often better than none at all. The most important exceptions are:
- lookup textures that contain indices
- shadow mapping on hardware without PCF
- custom filter kernels (although some can take advantage of bilinear filtering)

So, will the ability of a pixel shader to perform bilinear (or better) texturing concurrently with point-sampled texturing make a significant performance difference? If there are 16 TMUs and 16 TPSs available (texture point samplers, for want of a better abbreviation - perhaps I should just stick to VTF, vertex texture fetch, even if it's a bit confusing in this context), will this make a significant difference to the performance of pixel shading?
Likely not for the majority of shaders - probably only in the shadow-mapping-with-custom-filter case, which can be quite important, of course.

One caveat, as I understand it from Xenos, is that the point-sampling units are really meant for 1D access. I don't know why this constraint exists, or whether it's enforced in any way - it might simply be about performance (e.g. half speed address calculation for 2D textures). I don't know whether it's reasonable to expect this constraint to carry over to future USAs, such as R600.
If that's true, they are really simple memory load units (with format conversion, and possibly wrap). Far cheaper than a full TMU, but also far less useful.
 
Shadow mapping is definitely something I was thinking about when reading about this feature in the B3D Xenos article. Imagine 8-16 jittered samples of a high precision format, especially if selectively sampled near edges via shadowmap post processing and dynamic branching during scene rendering. You could get very nice shadows, and your filtered texture units would be free for other uses.

However, I heard that the point samplers have a different cache structure. So Jawed, you may be right about the performance implications.

While on this topic, does anyone know Xenos' filtering abilities? FP10? FP16? I16?

EDIT: God, I'm sorry Jawed. I knew it was you, just wrote it wrong.
 
Last edited by a moderator:
Hey! It's Jawed, not Jaws!

Textures are fetched from main memory through a 32-KB, 16-way set-associative texture cache. The texture cache is optimized for 2D and 3D, single-element, high-reuse data types. The purpose of the texture cache is to minimize redundant fetches when bilinear-filtering adjacent samples, not to hold entire textures.
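To make those cache parameters concrete, here is a minimal C sketch of how a texel address maps to a cache set in a 32-KB, 16-way set-associative cache. The 64-byte line size is an assumption - the article only gives the total size and associativity.

```c
#include <stdint.h>

/* Hypothetical parameters: the article states only the 32-KB size and
   16-way associativity; the 64-byte line size is an assumption. */
#define CACHE_BYTES 32768
#define WAYS        16
#define LINE_BYTES  64
#define NUM_SETS    (CACHE_BYTES / (WAYS * LINE_BYTES))  /* = 32 sets */

/* Map a texel address to its cache set: adjacent lines land in
   consecutive sets, so the small footprint of a bilinear fetch
   (adjacent samples) stays resident and redundant fetches are avoided. */
static uint32_t cache_set(uint32_t address)
{
    return (address / LINE_BYTES) % NUM_SETS;
}
```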

The texture samplers support per-pixel mipmapping with bilinear, trilinear, and anisotropic filtering. Trilinear filtering runs at half the rate of bilinear filtering. Anisotropic filtering is adaptive, so its speed varies based on the level of anisotropy required. Textures that are not powers of two in one or more dimensions are supported with mipmapping and wrapping.

The texture coordinates can be clamped inside the texture polygon when using multisample antialiasing to avoid artifacts that can be caused by sampling at pixel centers. This is known as centroid sampling in Direct3D 9.0 and can be specified per interpolator by the pixel shader writer.

The following texture formats are supported:
  • 8, 8:8, 8:8:8:8
  • 1:5:5:5, 5:6:5, 6:5:5, 4:4:4:4
  • 10:11:11, 11:11:10, 2:10:10:10
  • 16-bit per component fixed point (one-, two-, and four-component)
  • 32-bit per component fixed point (one-, two-, and four-component)
  • 16-bit per component floating point (limited filtering)
  • 32-bit per component floating point (no filtering)
  • DXT1, DXT2, DXT3, DXT4, DXT5
  • 24:8 fixed point (matches z-buffer format)
  • 24:8 floating point (matches z-buffer format)
  • New compressed formats for normal maps, luminance, and so on, as described following
Fetching up to 32-bit deep textures runs at full speed (one bilinear cycle), fetching 64-bit deep textures runs at half-speed, and fetching 128-bit deep textures runs at quarter speed. A special fast mode exists that allows four-component, 32-bit-per-component floating-point textures to be fetched at half-speed rather than quarter-speed. The packed 32-bit formats (10:11:11, 11:11:10, 2:10:10:10) are expanded to 16 bits per component when filtered so they run at half speed. Separate nonfilterable versions of these formats exist that run at full speed.

When filtering 16-bit per component floating-point textures, each 16-bit value is expanded to a 16.16 fixed-point value and filtered, potentially clamping the range of the values. The total size of the expanded values determines at which rate the sampling operates (full speed for one-component sampling, half speed for two-component sampling, and quarter speed for four-component sampling). Separate nonfilterable 16-bit-per-component floating-point formats also exist.
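The FP16-to-16.16 expansion above can be sketched in C. The IEEE 754 half-precision decoding is standard; the clamp bounds follow from the 16.16 format itself, but the hardware's exact rounding rules are not documented here, so treat this as an illustration only.

```c
#include <stdint.h>
#include <math.h>

/* Expand an IEEE 754 half-precision value to 16.16 signed fixed point,
   clamping to the representable range, as described for filterable FP16
   textures. Note FP16's max of 65504 overflows 16.16 and clamps. */
static int32_t half_to_fixed_16_16(uint16_t h)
{
    int    sign = (h >> 15) & 1;
    int    exp  = (h >> 10) & 0x1F;
    int    man  =  h        & 0x3FF;
    double v;

    if (exp == 0x1F)            /* Inf/NaN: saturate */
        v = sign ? -INFINITY : INFINITY;
    else if (exp == 0)          /* subnormal: man * 2^-24 */
        v = ldexp((double)man, -24) * (sign ? -1.0 : 1.0);
    else                        /* normal: (1 + man/1024) * 2^(exp-15) */
        v = ldexp((double)(man | 0x400), exp - 25) * (sign ? -1.0 : 1.0);

    double fixed = v * 65536.0;  /* scale to 16.16 */
    if (fixed >  2147483647.0) return INT32_MAX;  /* clamp high */
    if (fixed < -2147483648.0) return INT32_MIN;  /* clamp low  */
    return (int32_t)fixed;
}
```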

DXT1 compressed textures are expanded to 32 bits per pixel, resulting in a significant improvement in quality over the expansion to 16 bits per pixel that existed on Xbox.

The following new compressed texture formats are available:
  • DXN—a two-component 8-bit-per-pixel format made up of two DXT4/5 alpha blocks
  • DXT3A—a single-component 4-bit-per-pixel format made up of a DXT2/3 alpha block
  • DXT5A—a single-component 4-bit-per-pixel format made up of a DXT4/5 alpha block
  • CTX1—a two-component 4-bit-per-pixel format similar to DXT1 but with 8:8 colors instead of 5:6:5 colors
  • DXT3A_AS_1_1_1_1—a four-component format encoded in a DXT2/3 alpha block where each bit is expanded into a separate channel
The texture formats with eight or fewer bits per component, including the compressed texture formats, can be gamma corrected using an approximation to the sRGB gamma 2.2 curve to convert from gamma space to linear light space. This correction is free and is applied to sampled data before any texture filtering.

A special type of texture fetch can index into a texture "stack" that is up to 64 textures deep. A texture stack is a set of 2D textures (potentially with mipmaps) that are stored contiguously in memory.
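Because the stack is stored contiguously, the address of a slice is just base plus index times slice size. A minimal C sketch, assuming tightly packed, unmipmapped, untiled slices (real layouts with tiling and mip chains differ):

```c
#include <stdint.h>

/* Sketch of indexing into a "texture stack" of up to 64 contiguous 2D
   textures. Assumes tightly packed, unmipmapped slices; actual memory
   layouts (tiling, mip chains) would change the slice-size computation. */
static uint64_t stack_slice_address(uint64_t base,
                                    uint32_t width, uint32_t height,
                                    uint32_t bytes_per_texel,
                                    uint32_t index)  /* 0..63 */
{
    uint64_t slice_bytes = (uint64_t)width * height * bytes_per_texel;
    return base + (uint64_t)index * slice_bytes;
}
```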
Jawed
 
Dude, Xenos supports 32-bit fixed point filtering? Judging by how FP16 is converted to 16.16 and then filtered, it certainly seems like it. That's awesome.

From the sounds of it, they have special 8-bit filtering units that are capable of chaining together. The filtering weights are probably only 8 bits at most, so the TMUs will need to multiply an 8-bit number by a 32-bit number, which I think can be done with four 8-bit MADD operations. Adding the four weighted samples together can also be pipelined this way if your adders have the necessary carry ports.
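The idea in that paragraph - an 8-bit weight times a 32-bit sample built from four narrow multiplies - can be sketched in C. This is just a demonstration of the arithmetic decomposition being speculated about, not a claim about the actual hardware:

```c
#include <stdint.h>

/* Multiply a 32-bit sample by an 8-bit filter weight using four 8x8-bit
   multiplies whose partial products are shifted and accumulated - the
   kind of chaining of narrow MADD units the post speculates about. */
static uint64_t weight_mul_32(uint32_t sample, uint8_t weight)
{
    uint64_t acc = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t byte = (sample >> (8 * i)) & 0xFF;   /* i-th byte */
        acc += (uint64_t)(weight * byte) << (8 * i); /* 8x8 MADD + shift */
    }
    return acc;  /* equals (uint64_t)sample * weight by distributivity */
}
```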

Why didn't they put this in the PC cards? Man, I would love to get my hands on an XB360 for coding. Too bad the dev kits cost an arm and a leg.

BTW, apologies again for the name mixup.
 
I'm not too familiar with DX10 (I think JHoxley is the resident expert on that), but I'm hoping a good chunk of these features will at least be in ATI's DX10 parts. They will be leveraging Xenos tech, after all.

I'm planning on getting an X1600 Pro to tide me over until then. It won't be much of a performance boost, if any, over my 9800 Pro, but in terms of coding there's a lot of stuff I can do in the meantime.
 
Some more stuff, which seems to be along precisely the lines you were talking about:

[Slide: b3d49.jpg]

[Slide: b3d50.jpg]

Here the same sampler is used by both the getWeights2D instruction and the tfetch2D instruction.
Note that the sampler is overridden to use bilinear filtering for getWeights and point filtering for tfetch.
This would be useful for working with depth map sampling, which cannot be filtered in hardware.


[Slide: b3d51.jpg]

So, the benefit of using the Xbox 360 instructions is twofold: you get a few ALU ops for free, and you don’t have to pass the (inverse) texture size down as a shader constant.
This can be especially useful if you’re sampling from multiple textures, each with their own sizes.

getWeights and tFetch on 360 have a lot more features that cannot easily be emulated on other platforms like the PC. For example, getWeights also returns the mip factor, which is difficult to obtain on the PC. And you can’t do texture sampler overrides on the PC right now.

Jawed
 
Jawed said:
Since it's a unified architecture, these point-sampling units are available to any shader concurrently with the bilinear texturing units (erm, I presume they are!). What I'm wondering is, what's the impact going to be on the performance of pixel shading?
I haven't read up much on Xenos. But from the text above, I don't see anything saying that bilinear and point sampling units are available in parallel. I'd say the most likely way is that it's the same unit, but that it runs faster with point sampling. (Because it removes some arithmetic and internal bandwidth bottlenecks.)
 
I think this is fairly explicit:

The GPU can perform the following operations per clock cycle:
  • Write eight pixels (up to 32 samples with MSAA)
  • Write 16 depth-stencil-only pixels (up to 64 samples with MSAA)
  • Reject up to 64 pixels that fail the hierarchical depth-stencil test
  • Fetch 16 32-bit words of vertex data from up to two vertex streams
  • Execute 16 bilinear filtered texture fetches
  • Execute 48 vector and scalar ALU operations
  • Interpolate 16 four-component pixel shader input vectors
  • Process one triangle
  • Process one vertex
  • Resolve eight pixels from EDRAM to main memory
All of the operations happen in parallel except for the resolve, which cannot happen in parallel with writing pixels to the EDRAM frame buffer. Pixels that are 64 bits deep can only be written and resolved at half rate (4 pixels/cycle).
I've emboldened the relevant bits.

Jawed
 
Jawed said:
Some more stuff, which seems to be along precisely the lines you were talking about:
Actually, I was just talking about how the hardware implementation of 32-bit fixed point filtering can be very cheap, provided you only need either a quarter of the channels or a quarter of the speed of normal filtering. Why didn't they do this for the PC parts? :cry:

For shadow mapping I was talking about the point samplers, but those slides are referring to the filtered samplers. Weights are nice for fast bilinear PCF, but that's a rather crappy way of doing shadow mapping given recent developments. Does Xenos support Fetch4 as well?

Getting access to the mip-level is neat, and quite useful. The fractional part there is probably the 3rd weight for blending between mipmaps that's used in trilinear filtering. I don't know how to calculate this in a pixel shader. Anyone know how?

In the last slide, it's rather odd they chose to divide by invTexSize rather than just multiply by TexSize. Maybe to save constants? Bah, I'm nitpicking.
 
It seemed to me that the second slide uses point sampling on the tfetch2D instruction :???: If it does, I don't know which texture unit (TMU or VTF) would execute it, though...

As for Fetch4:

Percentage-closest filtering of shadow maps can be done in a pixel shader program reasonably efficiently. The shader instruction set has extensions to sample at an offset from a texture coordinate and to get the bilinear filter coefficients of a texture fetch. Using these extensions, it is possible to implement bilinear percentage-closest filtering of shadow maps using seven shader instructions.
which seems to imply no, it's completely manual.
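The manual path the quote describes - sample the four neighbours, compare each against the receiver depth, then blend the binary results with the bilinear weights - can be sketched in C. The depth_at() point-sample helper and the toy depth map are hypothetical stand-ins for the hardware fetch:

```c
/* Sketch of manual bilinear percentage-closer filtering: compare first,
   then filter the 0/1 results with the bilinear weights. depth_at() is
   a hypothetical point-sample helper over a toy 4x4 shadow map. */
static float depth_map[4][4];  /* zero-initialized toy shadow map */

static float depth_at(int x, int y) { return depth_map[y & 3][x & 3]; }

static float pcf_bilinear(float u, float v, float receiver_depth)
{
    int   x0 = (int)u,        y0 = (int)v;
    float fu = u - (float)x0, fv = v - (float)y0;  /* bilinear weights */

    float s00 = depth_at(x0,     y0)     >= receiver_depth ? 1.0f : 0.0f;
    float s10 = depth_at(x0 + 1, y0)     >= receiver_depth ? 1.0f : 0.0f;
    float s01 = depth_at(x0,     y0 + 1) >= receiver_depth ? 1.0f : 0.0f;
    float s11 = depth_at(x0 + 1, y0 + 1) >= receiver_depth ? 1.0f : 0.0f;

    return (s00 * (1 - fu) + s10 * fu) * (1 - fv)
         + (s01 * (1 - fu) + s11 * fu) * fv;
}
```

Comparing before filtering (rather than filtering the depths and comparing once) is what makes this PCF rather than an ordinary depth lookup.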

Jawed
 
Jawed said:
I think this is fairly explicit:


I've emboldened the relevant bits.

Jawed
Interesting. So it looks like instead of hardwiring the way an index buffer is used to retrieve vertex data, Xenos exposes it in the shader. Cool.

Chances are it's just as you speculated: a 1D array lookup. I still think it could be used for fetching shadow map samples if you were clever, leaving the filtered samplers free for everything else. Not sure how cache friendly it would be, but it seems reasonable at first glance.

As for Fetch4:

which seems to imply no, it's completely manual.
Oh well. Like I said before, PCF doesn't look that good anyway. I don't know of any other major uses.
 
Are you sure that they shouldn't be matched as:
The GPU can perform the following operations per clock cycle:
  • Write eight pixels (up to 32 samples with MSAA)
  • Write 16 depth-stencil-only pixels (up to 64 samples with MSAA)
  • Reject up to 64 pixels that fail the hierarchical depth-stencil test
  • Fetch 16 32-bit words of vertex data from up to two vertex streams
  • Execute 16 bilinear filtered texture fetches
  • Execute 48 vector and scalar ALU operations
  • Interpolate 16 four-component pixel shader input vectors
  • Process one triangle
  • Process one vertex
  • Resolve eight pixels from EDRAM to main memory
All of the operations happen in parallel except for the resolve, which cannot happen in parallel with writing pixels to the EDRAM frame buffer. Pixels that are 64 bits deep can only be written and resolved at half rate (4 pixels/cycle).
I.e., that those bolded lines are kind of the equivalents for VS and PS.
And that both happen outside the unified part of the shader, aren't controlled from inside the shader, and their results just appear in the input registers of the respective shader. Thus, vertex textures and non-filtered textures run through the 16 bilinear TMUs, but with filtering turned off.

Another argument is that the lines explicitly talk about "vertex data" and "pixel shader", which doesn't fit the unified shader architecture. So it's likely that they describe something outside it.
 
One more question I have about Xenos is regarding multisampling. Can you control the sampling positions? And can you get access to the unresolved MSAA buffer? If you could revert to a square grid, you'd get pseudo-high-res rendering for free. Great for shadow maps.

I'm guessing no because there would have to be some synchronization with the eDRAM for it to do the Z interpolation. Not a deal breaker, but the eDRAM logic is pretty basic.
 
Note the separate caches and that the texture pipes (TP) only process normal textures:

[Block diagram: 012l.jpg]

I'm thinking I shouldn't be calling it vertex texture fetch - argh, that implies something else. I should have just called it vertex fetch. Sorry Basic, I think that's the source of the confusion.

Jawed
 
Basic said:
Ar you sure that they shouldn't be matched as:
Ie: that those bolded lines are kind of the equivalent for VS and PS.
I don't think so, because the second line you put in bold simply refers to the iterators. They just interpolate between already calculated values.

What do you think of my theory above?



Basic said:
Another argument is that the lines explicitly talks about "vertex data" and "pixel shader", which doesn't fit in the unified shader architecture. So it's likely it describes something outside it.
Well, this is what's in the B3D article:
Dave Baumann said:
There are both 16 texture fetch units (filtered texture units, with LOD) and 16 vertex fetch units (unfiltered / point sample units), giving 16 of each type of texture sampler. Note that as the output data from the texture samplers is supplied to the unified shader arrays, both types of texture lookups are available to either Vertex or Pixel Shader programs, if needed, and there are no limitations on the number of dependent texture reads.
 
Yes, as the article states, they were described to me as independent units, both available simultaneously if needed (it's perfectly feasible to do vertex texture lookups on one shader array while doing filtered texturing for PS in another). It's described as "Vertex Fetch Units" for want of a better description; I think this is for any type of fixed point or float texture sampling, really.

WRT the point on shadowmap sampling, though, isn't this done in a separate pass? [Edit: scratch that; generation of it will be, sampling of it won't.]
 