*spin* GPU Capabilities (API-spec) & Efficiency

Doesn't DirectX 10 compliance require the GPU to process integers at some precision (I don't remember which) that Xenos doesn't support?
Yes. DX10 requires 32-bit integer processing, and HLSL has a robust integer instruction set. However, integer processing is not that useful on pure DX10 hardware, because DX10 didn't support compute shaders. In compute shaders you need integers for address calculation (complex data structures, array indexing, thread block local memory addressing, etc).
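As a rough sketch of what that means (the buffer names, the 2D-to-1D layout and the cs_5_0 target here are just assumptions for illustration, not from any particular game):

```hlsl
// Hedged sketch, D3D11 / cs_5_0: flattening a 2D thread coordinate into a
// 1D buffer index needs plain 32-bit integer multiply + add.
StructuredBuffer<float>   gInput  : register(t0);
RWStructuredBuffer<float> gOutput : register(u0);

cbuffer Params : register(b0)
{
    uint gWidth;   // row pitch of the 2D data stored in the 1D buffer
};

[numthreads(8, 8, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    uint index = dtid.y * gWidth + dtid.x;   // integer address calculation
    gOutput[index] = gInput[index] * 2.0f;   // arbitrary work on the element
}
```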

32-bit floating point (supported by Xenos as well) has a 24-bit mantissa. You can do integer calculations in these 24 bits (with bit-perfect results) if you are careful and know exactly what you are doing. Most common integer operations can be emulated by floating point operations. For example, if you want to shift a value up and add it to a bit mask, you can do that with a single floating point multiply-add instruction (multiply by a power of two). Some operations (for example shift down) need an extra floor instruction added after the operation to guarantee the precision. In general, 24-bit integers are more than enough for pure graphics rendering (in pixel and vertex shaders). For complex compute shader code you would need full integer support, but that's not a big deal, since DX10 doesn't support compute shaders.
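To make that concrete, here is a hedged sketch (the helper names are mine, not from any shipping code) of how those two cases map to float math, assuming every intermediate value fits in the 24-bit mantissa:

```hlsl
// Emulating integer ops with 32-bit floats. Exact as long as all
// intermediate values stay within 24 bits.

// "Shift value up by s bits and add it to a bit mask": a single multiply-add
// (multiply by a power of two). If s is a compile-time constant, exp2 folds away.
float ShiftUpAndAdd(float value, float mask, float s)
{
    return value * exp2(s) + mask;
}

// "Shift value down by s bits": the power-of-two multiply is exact, but an
// extra floor is needed to drop the bits that were shifted out.
float ShiftDown(float value, float s)
{
    return floor(value * exp2(-s));
}
```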

The addition of full integer support in DX10 was a good stepping stone towards GPU compute (in DX11), but it didn't help DX10 games that much. I have written a real-time DXT compression algorithm purely using floating point for Xbox 360. The same floating point algorithm actually runs exactly as fast on DX10 PC hardware as the comparable algorithm written with integer instructions (both are bandwidth bound). With floating point code you can abuse the full-speed multiply-adds to do two things at once. In comparison, 32-bit integer multiply (or multiply-add) is quite slow on most GPUs (1:6 rate on Kepler). So if 24 bits is enough (and your algorithm does not need overflow/underflow support), the floating point unit is often good enough for simple integer processing, and should perform similarly to real integer processing.

(maybe this discussion could be moved to a separate thread... it's starting to be OT here)
 
Yes. DX10 requires 32-bit integer processing, and HLSL has a robust integer instruction set. However, integer processing is not that useful on pure DX10 hardware, because DX10 didn't support compute shaders. In compute shaders you need integers for address calculation (complex data structures, array indexing, thread block local memory addressing, etc).
Wii U having basic compute shader support is often pointed out as an advantage over the 360/PS3 and a possible saving grace for the weaker CPU. Is CS4.1 flexible enough to be useful or is it too gimped? Off the top of my head, I can't think of a shipping game on PC using CS4.x, which isn't encouraging. With low-level access, there's presumably more functionality and optimization exposed on the Wii U than the CS4.1 spec implies, but I'm guessing it won't significantly close the gap with CS5.0.
 
I don't know what functionality the WiiU supports, but the 4.x compute shaders in D3D11 are pretty gimped. You can't write to textures, you only get one output, you're limited in which regions of shared memory you can write to, and there are no atomics.
 
Didn't think it was controversial.

http://forum.beyond3d.com/showpost.php?p=1679112&postcount=17

Not the first time he's said this, and not the only developer that's said the like.

He may say that, but Microsoft says otherwise:
[Attached: two screenshots of Microsoft's Xbox 360 developer documentation (msdirectx3602xvr0c.png, msdirectx3603eq3e.png)]

Now, the Wii U obviously doesn't use DirectX, but its OpenGL equivalent is higher than DX10.1.
 
He may say that, but Microsoft says otherwise:

<pic snip>

Now, the Wii U obviously doesn't use DirectX, but its OpenGL equivalent is higher than DX10.1.

Microsoft certainly don't say otherwise. Even the thing you quote says "a DX9/10 rendering core", and it's by no means trying to be a comprehensive description of XGPU features.

If you really want to know how awesome XGPU is, and what kind of additional features beyond DX 10.1 XGPU provides, you should check out this thread:

http://forum.beyond3d.com/showthread.php?t=63732

It was spun off from this thread. Ignore the part where I got annoyed at Willard. Point is that you can't assume a DX9 vs DX10.1 style advantage for the Wii U, because it probably doesn't have one.
 
Microsoft certainly don't say otherwise. Even the thing you quote says "a DX9/10 rendering core", and it's by no means trying to be a comprehensive description of XGPU features.

They absolutely do. That's straight out of the developer docs. Read the first line in the first pic, too. Anyway, this is getting off topic and is better served in the other thread.
 
For all I've read about Xenos, its shading capabilities in general lie somewhere between D3D9 and D3D10, but it features some things that not even DX10.1 chips can do in "other areas".
 
They absolutely do. That's straight out of the developer docs. Read the first line in the first pic, too. Anyway, this is getting off topic and is better served in the other thread.

No, the developer docs do not say that XGPU offers no functionality beyond the scope of DX10/10.1. It is a fact that it does.

This is getting tiresome. It is sad that Wii U tech threads have gone from hunting for "mega power" that doesn't exist to trying to play down the Xbox 360's GPU features, all in the hope that it might make the Wii U look less weak.
 
I don't know what functionality the WiiU supports, but the 4.x compute shaders in D3D11 are pretty gimped. You can't write to textures, you only get one output, you're limited in which regions of shared memory you can write to, and there are no atomics.
Yes, CS 4.X is just a convenience API. It offers no extra functionality or efficiency over DX10.1 pixel shaders.

Because threads can only write to their own regions in groupshared memory(*), and there are no atomics, there's simply no way to do any cooperative work among multiple threads. This is the main purpose of compute shaders. CS 4.X is useless.

Full list of CS 4.X limitations can be found here:
http://msdn.microsoft.com/en-us/library/windows/desktop/ff476331(v=vs.85).aspx

(*) A CS 4.X implementation doesn't require any real GPU groupshared memory, since it limits the access to a 256 byte region per thread. That is equal to 16 GPU (vec4) registers. CS 4.0 compute shaders must be able to be compiled as DirectX 10.1 compatible pixel shaders, as there is no guarantee that all existing DirectX 10.1 hardware has any compute extensions beyond that (the hardware existed before the DX11 compute shader API).
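To show what those restrictions look like in practice, here's a hedged sketch of a cs_4_x-style shader (names and sizes are just illustrative, and it is written from the MSDN restrictions linked above rather than re-tested against the compiler):

```hlsl
RWStructuredBuffer<float4> gOutput : register(u0);  // the single UAV cs_4_x allows

groupshared float4 sData[64];   // 16 bytes per thread, inside the 256 byte/thread region

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    // Allowed: each thread writes only to its own slot, indexed by SV_GroupIndex.
    sData[gi] = float4((float)dtid.x, 0, 0, 0);

    // Not allowed under cs_4_x: writing through an arbitrary computed index,
    // e.g. sData[someDynamicIndex] = ...; will not compile.

    GroupMemoryBarrierWithGroupSync();

    // Reading any location of groupshared memory is allowed.
    gOutput[dtid.x] = sData[gi] + sData[(gi + 1) & 63];
}
```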

As MJP said above, we have no (public) information about the WiiU GPU. It might have extensions over the DirectX 10.1 feature set, and those extensions might extend its compute capability. I just wish Nintendo had as open an approach as Microsoft does regarding console technology details. Microsoft has basically revealed all the low level x360 hardware / API details in their Gamefest whitepapers and presentations. The XNA community website also has GPU microcode programming guides and other low level optimization details for indie developers / hobbyists (XBLIG).

If anyone wants to dig deeper in AMD Radeon 3000/4000 series compute capabilities, there might be additional details in AMD CAL documents: http://developer.amd.com/wordpress/media/2012/10/AMD_CAL_Programming_Guide_v2.0.pdf
 
Yes, CS 4.X is just a convenience API. It offers no extra functionality or efficiency over DX10.1 pixel shaders.

Because threads can only write to their own regions in groupshared memory(*), and there are no atomics, there's simply no way to do any cooperative work among multiple threads. This is the main purpose of compute shaders. CS 4.X is useless.
But you can read the data of other threads (which is otherwise impossible). There are definitely algorithms where this is enough for collaboration within a threadgroup.
(*) A CS 4.X implementation doesn't require any real GPU groupshared memory, since it limits the access to a 256 byte region per thread. That is equal to 16 GPU (vec4) registers.
But you usually can't access the registers of another thread (the hardware hardly provides any means to do so in DX10-generation GPUs; that changed slightly in the DX11 generation, but it is generally not exposed through OpenCL or DX CS). You can only mimic it with global memory accesses, which would be dead slow in comparison without a general cache architecture like GCN, Fermi or Kepler (the latter two basically allocate the shared memory in the L1 cache, saving some address calculations). OpenCL on the HD4000 series actually does this (as OpenCL requires a less limited access scheme than the R700 hardware provides). It's not what you want to do.
CS 4.0 compute shaders must be able to be compiled as DirectX 10.1 compatible pixel shaders, as there is no guarantee that all existing DirectX 10.1 hardware has any compute extensions beyond that (the hardware existed before the DX11 compute shader API).
The cs_4_x profiles are optional. DX10.x hardware is not required to support them (the profiles are defined so that they match the common subset of AMD's and nV's DX10.0/10.1 hardware capabilities, but other vendors' not so much). CS4.x compute shaders definitely do offer additional features over PS4.0/4.1, and it's of course not required that they can be compiled as 4.1 pixel shaders (which would be weird anyway, as the thread creation/enumeration works differently between pixel and compute shaders). Where did you get this from?
 
CS4.x compute shaders definitely do offer additional features over PS4.0/4.1, and it's of course not required that they can be compiled as 4.1 pixel shaders (which would be weird anyway, as the thread creation/enumeration works differently between pixel and compute shaders). Where did you get this from?
Thanks for the correction.

I did some experiments a few years ago with CS 4.1. Basically everything I tried to do that exceeded the pixel shader model gave me a compilation error. The biggest limitation is that you cannot write values to programmable addresses in shared memory (indexed by a dynamic variable), not even inside your own 256 byte region. You need to use SV_DispatchThreadID/SV_GroupIndex as the index. That is a huge limitation, and most of the algorithms I tried to write didn't compile because of it. I didn't spend much time trying to reformulate the algorithms (as I was writing a radix sorter, and it is quite awkward to implement without indexed writes :) ). I quickly moved to CS 5.0, as I had a brand new Radeon 5850, and didn't look back after that, because CS 5.0 did everything I asked (all my test programs compiled without much trouble).

With that new information, I must agree that the limited sharing (only read access to other threads' data) gives you some benefits over pixel shaders (even without shared memory indexing and atomics). However, the cases are considerably more limited compared to CS 5.0, and it is more difficult to formulate the algorithms around the limitations. And of course the DX10.1-era hardware without general purpose read & write caches would be another big roadblock to getting compute shaders running at any decent performance (and that's a bad combination with the limited data movement/indexing in shared memory, as you'd have to aim for perfect coalescing without caches).
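For example, something like a per-group sum reduction should still be expressible, since every groupshared write can be indexed by SV_GroupIndex while the reads come from other threads' slots. This is only a sketch under those assumptions (the group size and buffer names are made up, and I haven't verified it against the cs_4_x profiles):

```hlsl
#define GROUP_SIZE 64

StructuredBuffer<float>   gInput     : register(t0);
RWStructuredBuffer<float> gGroupSums : register(u0);   // one result per thread group

groupshared float sSum[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void ReduceCS(uint3 dtid : SV_DispatchThreadID,
              uint3 gid  : SV_GroupID,
              uint  gi   : SV_GroupIndex)
{
    sSum[gi] = gInput[dtid.x];              // write only to the thread's own slot
    GroupMemoryBarrierWithGroupSync();

    [unroll]
    for (uint stride = GROUP_SIZE / 2; stride > 0; stride /= 2)
    {
        if (gi < stride)
            sSum[gi] += sSum[gi + stride];  // read anywhere, write own slot only
        GroupMemoryBarrierWithGroupSync();
    }

    if (gi == 0)
        gGroupSums[gid.x] = sSum[0];
}
```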
The cs_4_x profiles are optional.
Didn't know that. Do both 3000 and 4000 series Radeons support CS4.1, or just the 4000 series? I assume both 8800 and GTX 280 support it (because they had a superset of functionality in CUDA).
 
Didn't know that. Do both 3000 and 4000 series Radeons support CS4.1, or just the 4000 series?
Just the HD4000 series (the HD3000 lacks the shared memory).
I assume both 8800 and GTX 280 support it (because they had a superset of functionality in CUDA).
AFAIK the GT8800 and GTX280 support cs_4_0 but not cs_4_1 (the latter requires SM4.1 capabilities). The better shared memory implementation of nV's GPUs in the DX10 generation doesn't help them here (as MS set the profiles to the lowest common denominator between AMD and nV).
 