A deep look into NAO32 by Christer Ericson

wowfactor

I stumbled upon DeanoC's blog and followed his link to Christer Ericson's blog, where he talks about NAO32. I think it's a good idea for all of you developers to have a look at it if you haven't already.

http://realtimecollisiondetection.net/blog/?p=15

There are a few places on the net that I visit pretty regularly. One such place is the forums at Beyond3D where several cool developers hang around and post good stuff. Some of my favourite Beyond3D contributors are DeanA (Dean Ashton, a colleague over at SCEE ATG, with a never-updated blog), and DeanoC (Dean Calver) and nAo (Marco Salvi) who happen to be lead and graphics programmers, respectively, on Heavenly Sword over at Ninja Theory. All fellow PS3 programmers in other words.

Of note in recent times is Dean C talking about the Atomic Cache facility of the SPEs both on his blog and on the forums. Though here I'm going to talk about something both Dean C and Marco have talked a lot about in the past, namely how they use LogLuv encoding as part of their HDR solution. (Lots of funny speculation ensued on the forums as they didn't quite give enough info for people to connect the dots, even to the extent where websites felt they needed to conduct interviews on the topic.)

As most (good) developers do, when respected people talk about a piece of tech of theirs, which you are not currently employing, you investigate. So, back then, I decided to look into (amongst other things) what it would take to encode RGB into LogLuv in a pixel shader. Here's what I found.

The canonical reference for LogLuv is Greg Ward's paper The LogLuv Encoding for Full Gamut, High Dynamic Range Images. His paper talks about both 24-bit and 32-bit LogLuv encodings, but here we're only interested in the 32-bit one, which uses 16 bits for luminance information and 16 bits for chrominance. Figure 1 of Ward's paper gives the representation as:

bit 31: flag negative luminances (1 bit)
bit 30..16: log encoding of luminance (15 bits)
bit 15..8: u coordinate (8 bits)
bit 7..0: v coordinate (8 bits)
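
To make the layout concrete, here is a minimal Python sketch of packing and unpacking such a 32-bit word. The function names `pack_logluv32` and `unpack_logluv32` are made up for this illustration and do not come from Ward's paper or any library; the inputs are assumed to be already-quantized integers.

```python
def pack_logluv32(sign, le, ue, ve):
    """Pack the four fields into one 32-bit word using the layout above."""
    assert sign in (0, 1)                    # bit 31: negative-luminance flag
    assert 0 <= le < (1 << 15)               # bits 30..16: 15-bit log luminance
    assert 0 <= ue < 256 and 0 <= ve < 256   # bits 15..8 and 7..0: u and v
    return (sign << 31) | (le << 16) | (ue << 8) | ve


def unpack_logluv32(word):
    """Inverse of pack_logluv32: returns (sign, le, ue, ve)."""
    return ((word >> 31) & 0x1,
            (word >> 16) & 0x7FFF,
            (word >> 8) & 0xFF,
            word & 0xFF)
```

Packing and then unpacking returns the original four fields, which is an easy sanity check on the shifts and masks.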

The pertinent information on converting from [R,G,B] to [Le,Ue,Ve] (i.e. LogLuv) is spread over multiple sections of Ward's paper, so to save you some work, I'll summarize. The conversion is done as follows:

[X,Y,Z] = [R,G,B]*M
x = X/(X+Y+Z)
y = Y/(X+Y+Z)
u'=4*x/(-2*x+12*y+3)
v'=9*y/(-2*x+12*y+3)
Ue = floor(410*u')
Ve = floor(410*v')
Le = floor(256*(log2(Y)+64))

where M is the 3×3 matrix [0.497,0.339,0.164; 0.256,0.678,0.066; 0.023,0.113,0.864].
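
As a rough illustration, the conversion above can be sketched in Python. This is a minimal sketch, not Ward's reference code: the function name `rgb_to_logluv_ints` is hypothetical, the input components are assumed to be strictly positive (zero or negative luminance is not handled), and M is applied exactly as the text writes it, as a row vector times M. If M is meant for column vectors instead, it would need to be transposed.

```python
import math

# Matrix M as quoted in the text, used with the row-vector convention
# [X,Y,Z] = [R,G,B]*M (an assumption taken from the text as written).
M = [
    [0.497, 0.339, 0.164],
    [0.256, 0.678, 0.066],
    [0.023, 0.113, 0.864],
]


def rgb_to_logluv_ints(r, g, b):
    """Return (Le, Ue, Ve) as integers, following the formulas above."""
    rgb = (r, g, b)
    # Component j of the result is sum_i rgb[i] * M[i][j].
    X, Y, Z = (sum(rgb[i] * M[i][j] for i in range(3)) for j in range(3))
    s = X + Y + Z
    x, y = X / s, Y / s
    d = -2.0 * x + 12.0 * y + 3.0
    u, v = 4.0 * x / d, 9.0 * y / d
    Le = math.floor(256.0 * (math.log2(Y) + 64.0))
    Ue = math.floor(410.0 * u)
    Ve = math.floor(410.0 * v)
    return Le, Ue, Ve
```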

To explain the "magic" constants in Ward's math, we note that we support Y in the range (5.4*10^-20, 1.8*10^19), because log2() of these values gives the range (-64.0, 64.0), which for Ward's Le calculation brings Le into the desired integer range [0, 2^15-1] (a 15-bit integer).
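
The endpoints are easy to check numerically. The helper name `ward_le` below is made up for this check; it just evaluates the Le formula from the conversion above.

```python
import math


def ward_le(Y):
    """Ward's Le formula: floor(256*(log2(Y)+64))."""
    return math.floor(256.0 * (math.log2(Y) + 64.0))


le_min = ward_le(2.0 ** -64)  # Y of about 5.4e-20, the lower end of the range
le_mid = ward_le(1.0)         # Y = 1 lands in the middle of the range
le_max = ward_le(2.0 ** 64)   # Y of about 1.8e19, the excluded upper end
```

The lower endpoint maps to 0 and the upper endpoint maps to 2^15, one past the largest 15-bit value, which is why the supported Y range is open at the top.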

Ward states the gamut of perceivable u and v values lies in the range [0, 0.62] and he therefore scales the u and v values by 410 to result in an integer [0, 255]. For a fragment shader we need Le, Ue, and Ve to lie in the [0, 1] range, as the hardware will automatically turn floats in that range into a [0, 255] integer (clamped). However, we will in the end be splitting Le over two such integers, so we'll turn Le into a float of range [0,256). Making the appropriate changes turns the math into:

[X,Y,Z] = [R,G,B]*M
x = X/(X+Y+Z)
y = Y/(X+Y+Z)
u'= 4*x/(-2*x+12*y+3)
v'= 9*y/(-2*x+12*y+3)
Ue = (1/0.62)*u'
Ve = (1/0.62)*v'
Le = 2*(log2(Y)+64)

There are quite a few optimizations we can do at this point. In an attempt at being educational, I'll apply them one by one. First, substitute the expressions for x and y in the expressions for u' and v' and simplify, to obtain this calculation:

[X,Y,Z] = [R,G,B]*M
u' = 4*X/(X+15*Y+3*Z)
v' = 9*Y/(X+15*Y+3*Z)
Ue = (1/0.62)*u'
Ve = (1/0.62)*v'
Le = 2*(log2(Y)+64)
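
For readers who want the intermediate algebra, the substitution can be written out as follows. The symbol S is introduced here only as shorthand for X+Y+Z and does not appear in the original text.

```latex
\begin{aligned}
S &= X + Y + Z, \qquad x = X/S, \quad y = Y/S \\
-2x + 12y + 3 &= \frac{-2X + 12Y + 3S}{S} = \frac{X + 15Y + 3Z}{S} \\
u' &= \frac{4X/S}{(X + 15Y + 3Z)/S} = \frac{4X}{X + 15Y + 3Z} \\
v' &= \frac{9Y/S}{(X + 15Y + 3Z)/S} = \frac{9Y}{X + 15Y + 3Z}
\end{aligned}
```

The common factor 1/S cancels, so the normalization by X+Y+Z never has to be computed.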

Next we fold the computations for u', v', Ue, and Ve:

[X,Y,Z] = [R,G,B]*M
Ue = (4/0.62)*X/(X+15*Y+3*Z)
Ve = (9/0.62)*Y/(X+15*Y+3*Z)
Le = 2*(log2(Y)+64)

Here we note that it is possible to fold the dot product dot([1,15,3], [X,Y,Z]) into the vector-matrix multiplication so that it ends up in the Z component of the result (which I'll call XYZ). The new math is then

[X,Y,XYZ] = [R,G,B]*M'
Ue = (4/0.62)*X/XYZ
Ve = (9/0.62)*Y/XYZ
Le = 2*(log2(Y)+64)

where M' = M * [1,0,1; 0,1,15; 0,0,3]. We can now also fold the (4/0.62) and (9/0.62) constants into the matrix multiply:

[X',Y,XYZ'] = [R,G,B]*M''
Ue = X'/XYZ'
Ve = Y /XYZ'
Le = 2*(log2(Y)+64)

The new matrix is M'' = M * [1,0,1; 0,1,15; 0,0,3] * [4/9,0,0; 0,1,0; 0,0,0.62/9]. At this point, there's hardly any math left and no(?) optimizations left to apply, so now it's time to code. However, turning this into production code we have two potential problem sources:

1. Division by zero.
2. log2() arguments less than or equal to zero.

To avoid visible glitches both issues must be handled, which we can do by strategically adding in some small epsilons to force values to be strictly positive where it matters. When all that is done, we get the following code (Cg code, of course) as a result:

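// Folded matrix: M * [1,0,1; 0,1,15; 0,0,3] * [4/9,0,0; 0,1,0; 0,0,0.62/9]
const static float3x3 m = float3x3(
    0.2209, 0.3390, 0.4184,
    0.1138, 0.6780, 0.7319,
    0.0102, 0.1130, 0.2969);

inline float4 PS3_LogLuv_Encode(in float3 rgb) {
    float4 res; // float4(Ue, Ve, LeHigh, LeLow)
    float3 Xp_Y_XYZp = mul(rgb,m);
    // Clamp to a small epsilon to avoid division by zero and log2() of values <= 0
    Xp_Y_XYZp = max(Xp_Y_XYZp, float3(1e-6, 1e-6, 1e-6));
    res.xy = Xp_Y_XYZp.xy / Xp_Y_XYZp.z;     // Ue = X'/XYZ', Ve = Y/XYZ'
    float Le = 2 * log2(Xp_Y_XYZp.y) + 128;  // Le = 2*(log2(Y)+64), range [0,256)
    res.z = Le / 256;
    res.w = frac(Le);
    return res;
}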
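
The matrix constants in the shader can be reproduced from M and the two folding matrices given earlier. The following Python sketch does this with a hand-written 3x3 multiply; the names `T`, `S`, `matmul`, and `folded` are made up for this check and are not part of the original code.

```python
# M as quoted in the text (row-vector convention, as the text uses it).
M = [[0.497, 0.339, 0.164],
     [0.256, 0.678, 0.066],
     [0.023, 0.113, 0.864]]

# Puts dot([1,15,3], [X,Y,Z]) into the third component of the result.
T = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 15.0],
     [0.0, 0.0, 3.0]]

# Folds the (4/0.62) and (9/0.62) factors into the first and third components.
S = [[4.0 / 9.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 0.62 / 9.0]]


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


folded = matmul(matmul(M, T), S)

# Constants as they appear in the Cg listing.
shader_m = [[0.2209, 0.3390, 0.4184],
            [0.1138, 0.6780, 0.7319],
            [0.0102, 0.1130, 0.2969]]

# Largest difference between the computed matrix and the shader constants.
max_error = max(abs(folded[i][j] - shader_m[i][j])
                for i in range(3) for j in range(3))
```

The computed entries agree with the shader's constants to the four decimal places shown there.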

Running this code through NVShaderPerf gives (from memory) 5 cycles for 9 instructions. When inserted at the end of a longer shader where there is plenty of room for instruction pairing, the total overhead for the LogLuv conversion will be less than this, perhaps around 3 cycles. I haven't checked with Marco to see how this compares to what he's doing, but it matches the cycle numbers he mentioned in various posts so it'll be pretty close.

As Marco discusses on e.g. Dean's blog, you might want to adjust this representation a little to avoid getting carry problems during interpolation, which I haven't done here but left as an exercise for the reader. Another exercise is to do the conversion from LogLuv back to RGB. Enjoy!
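
One possible take on the decoding exercise, sketched in Python rather than Cg so it can be run anywhere. This is an assumption-laden illustration, not the author's or Marco's decoder: it works on unquantized floats (Ue, Ve, Le) and ignores the 8-bit storage and the LeHigh/LeLow split, and the names `encode`, `decode`, `vec_mat`, and `inverse3` are made up here. The inverse matrix is computed from the shader's constants rather than hard-coded.

```python
import math

# Folded encode matrix copied from the Cg listing (row vector times matrix).
M_ENC = [[0.2209, 0.3390, 0.4184],
         [0.1138, 0.6780, 0.7319],
         [0.0102, 0.1130, 0.2969]]


def vec_mat(v, m):
    """Row vector times 3x3 matrix."""
    return [sum(v[i] * m[i][j] for i in range(3)) for j in range(3)]


def inverse3(m):
    """Inverse of a 3x3 matrix via the adjugate divided by the determinant."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]


M_DEC = inverse3(M_ENC)


def encode(rgb):
    """Unquantized (Ue, Ve, Le), mirroring the math of the Cg encoder."""
    xp, y, xyzp = (max(c, 1e-6) for c in vec_mat(rgb, M_ENC))
    return xp / xyzp, y / xyzp, 2.0 * math.log2(y) + 128.0


def decode(ue, ve, le):
    """Invert encode(): recover Y from Le, then XYZ' and X', then RGB."""
    y = 2.0 ** ((le - 128.0) / 2.0)
    xyzp = y / ve   # since Ve = Y / XYZ'
    xp = ue * xyzp  # since Ue = X' / XYZ'
    return vec_mat([xp, y, xyzp], M_DEC)


rgb_in = [0.25, 1.5, 4.0]
rgb_out = decode(*encode(rgb_in))
```

Without quantization the round trip reproduces the input up to floating-point error; a real decoder would also have to reassemble Le from the two stored bytes and would see quantization error in all four channels.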
 
An excellent and comprehensive analysis.

I don't think I have ever seen the technique well explained all in one place. I know it's kind of scattered all over the place on B3D in many different posts. So it's nice that someone put it all together and even provided code giving an approximation of the technique for everyone to enjoy.
 