Nvidia GT300 core: Speculation

You would think that if your theory is correct and they had a solution more capable than CUDA and the G80-and-up tech, they would show it, right? Or do you just think AMD is playing dumb so they can blindside nV in the near future?
No, I'm not arguing AMD's entire solution is better/more-advanced than CUDA - the software side is useless for most people - it's very much alpha/beta quality, about 2 years behind NVidia (been saying this for a long time now). Maybe I need to put it in my sig?

Jawed
 
No, I'm not arguing AMD's entire solution is better/more-advanced than CUDA - the software side is useless for most people - it's very much alpha/beta quality, about 2 years behind NVidia (been saying this for a long time now). Maybe I need to put it in my sig?

Jawed


Personally I think it's more of a software issue for AMD, but from a business standpoint it's also possible they dropped the ball because of the buyout and because of a lack of funds. I have to say the software side is very useful for most people: building the software needed on AMD's products to get the performance nV has right now would probably cost two or more times as much. So it really comes down to immediate monetary considerations. Added to this, what would the performance difference be if the software were tailored for maximum performance on AMD hardware, and would it be cost effective?
 
Can you explain at least what the starting point is? I can't see it :oops:
Jawed
The first thing you want to do is to pair (where possible) scalar and vector instructions. You can also amortize the cost of scalar instructions that cannot be paired by doing more work per loop (process 32, 48, 64, etc. pixels at once).
In the end it might actually be better to have a loop that is slightly slower in theory, but faster in practice (given how spatially incoherent fractals are).
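
Something like this rough C sketch is what I have in mind (widths and names are just illustrative, it's not Larrabee code): each trip round the loop handles several 16-wide blocks, so the scalar counter/branch overhead is paid once per trip rather than once per block:

Code:
/* Illustrative C only, not LRBni: amortise the per-trip scalar overhead
   (counter update, test, branch) over several 16-wide blocks of pixels. */
#define VEC_WIDTH 16
#define N_BLOCKS  4   /* 64 pixels per trip; could be 2 or 3 for 32/48 */

static void mandel_blocks(float x[N_BLOCKS][VEC_WIDTH], float y[N_BLOCKS][VEC_WIDTH],
                          const float cx[N_BLOCKS][VEC_WIDTH],
                          const float cy[N_BLOCKS][VEC_WIDTH], int max_iter)
{
    for (int iter = 0; iter < max_iter; ++iter) {          /* scalar work: once per trip */
        for (int b = 0; b < N_BLOCKS; ++b) {               /* unrolled in real code */
            for (int lane = 0; lane < VEC_WIDTH; ++lane) { /* stands in for one vector op */
                float xn = x[b][lane] * x[b][lane] - y[b][lane] * y[b][lane] + cx[b][lane];
                float yn = 2.0f * x[b][lane] * y[b][lane] + cy[b][lane];
                x[b][lane] = xn;
                y[b][lane] = yn;
            }
        }
    }
}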

Marco
 
The first thing you want to do is to pair (where possible) scalar and vector instructions.
That's the problem, I can't find any scalar instructions in this loop that could be parallelised.

You can also amortize the cost of scalar instructions that cannot be paired by doing more work per loop (process 32, 48, 64, etc. pixels at once).
I decided against this - to keep Larrabee's advantage of 16-way branch divergence penalty versus the 64/128-way penalty that ATI pays.

Though I did in effect amortise in the DP case, by keeping the Larrabee code at 16 points instead of 8. Thinking about it, I now realise that the mask registers are effectively only 8-wide in this, so I didn't count properly (needs a pair of mask registers). For the sake of divergence penalty I guess 8-wide would be better.

Jawed
 
That's the problem, I can't find any scalar instructions in this loop that could be parallelised.
Then look harder ;) Hint: as long as you don't modify the result at iteration N, you can overlap the code that handles the loop counter with the code that runs at iteration N+1.
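
In plain C the idea looks roughly like this (just a sketch, nothing Larrabee-specific): the counter bookkeeping for the next trip only depends on the counter itself, so a dual-issue core can retire it on the scalar pipe while the VPU is still chewing on this trip's math:

Code:
/* Illustrative C only: the scalar counter update/test has no data dependency
   on the vector work of the same trip, so the two can overlap. */
static void iterate(float acc[16], const float a[16], const float b[16], int max_iter)
{
    int iter = 0, hit_limit = 0;
    while (!hit_limit) {
        int next_iter = iter + 1;               /* scalar pipe: independent of the math below */
        hit_limit = (next_iter >= max_iter);

        for (int lane = 0; lane < 16; ++lane)   /* stands in for the 16-wide VPU work */
            acc[lane] += a[lane] * b[lane];

        iter = next_iter;                       /* commit the scalar result afterwards */
    }
}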

I decided against this - to keep Larrabee's advantage of 16-way branch divergence penalty versus the 64/128-way penalty that ATI pays.
128? I thought latest ATI hardware always works on 64 pixel batches.
 
Then look harder ;) Hint: as long as you don't modify the result at iteration N, you can overlap the code that handles the loop counter with the code that runs at iteration N+1.
The ATI code does this (the compiler did that, not me!) - but the problem I see is that moving the masks to the scalar pipe and then back again takes too long :???:

Hmm, now I'm wondering if the scalar pipe can manipulate mask registers in place. I thought there were specific instructions to move masks twixt scalar and VPU... Yeah, MASK2INT and INT2MASK.

128? I thought latest ATI hardware always works on 64 pixel batches.
My theory is that since batches run paired in ATI, divergence affecting one batch affects the other in the pair.

I'm assuming that a pair of batches is defined for their lifetime - due to banking/operand-fetch bandwidth in the register file. i.e. this is a compromise to keep the cost of the register file and operand fetching to a minimum.

Wish there were some decent test results out there.

Jawed
 
The ATI code does this (the compiler did that, not me!) - but the problem I see is that moving the masks to the scalar pipe and then back again takes too long :???:
The loop counter is stored in a scalar reg, not in a mask reg. Also, LRBni instructions that operate on mask regs are pairable with vector instructions (read Forsyth's presentation).

Hmm, now I'm wondering if the scalar pipe can manipulate mask registers in place. I thought there were specific instructions to move masks twixt scalar and VPU... Yeah, MASK2INT and INT2MASK.
See Abrash's article in Dr. Dobb's Journal for more details about mask registers:
Now that we've seen how predication works, let's look at how vector masks get set. They are primarily either generated by vector compares or copied from general-purpose registers (general-purpose registers are the familiar x86 scalar registers -- rax, ecx, and so on), although they can also come from add-and-generate-carry and subtract-and-generate-borrow instructions, or from a couple of special add-and-set-vector-mask-to-sign instructions designed for rasterization. Vector mask registers can also be operated on by a set of vector mask instructions.
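To make the quoted bit concrete, here's a scalar C emulation of the concept (not LRBni syntax): a 16-bit mask with one bit per lane, produced by a vector compare, then used to predicate an update so that disabled lanes keep their old values.

Code:
/* Scalar C emulation of vector-mask predication; purely illustrative. */
#include <stdint.h>

static uint16_t vcmp_lt(const float a[16], const float b[16])   /* mask from a vector compare */
{
    uint16_t k = 0;
    for (int lane = 0; lane < 16; ++lane)
        if (a[lane] < b[lane])
            k |= (uint16_t)(1u << lane);
    return k;
}

static void vadd_masked(float dst[16], const float src[16], uint16_t k)  /* predicated add */
{
    for (int lane = 0; lane < 16; ++lane)
        if (k & (1u << lane))
            dst[lane] += src[lane];   /* lanes with a clear mask bit are left untouched */
}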
 
Ah, I see now thanks. KORTEST is a scalar instruction in effect - I was reading it as a purely vector instruction (which is VKORTEST) - and so the "move" from VPU to scalar is implicit, no latency.

Then, as you say, KORTEST can be moved up and those first 4 scalar instructions can be executed in parallel with the math :cool:

So that makes 10 cycles.

Jawed
 
No :) I was talking about overlapping the counter-scalar code with vector code.
kortest works over the predication masks, not the loop counter.
 
No :) I was talking about overlapping the counter-scalar code with vector code.
kortest works over the predication masks, not the loop counter.
Isn't KORTEST k2, k2 deciding whether the entire set of 16 strands has gone out of bounds, and therefore ending the loop? So when they all go out of bounds the jnz fails. KORTEST is effectively a dummy to set the scalar predicate for jnz to act upon. That's my interpretation, anyway...

Jawed
 
Isn't KORTEST k2, k2 deciding whether the entire set of 16 strands has gone out of bounds, and therefore ending the loop? So when they all go out of bounds the jnz fails. KORTEST is effectively a dummy to set the scalar predicate for jnz to act upon. That's my interpretation, anyway...

Jawed
Yes, but (at least at first sight) you can't overlap that code without partially unrolling the loop, due to a dependency between the mask register and the vector instructions, while the global loop counter math can easily be overlapped because there are no dependencies.

That loop can end either because every strand's point escaped (i.e. turned out not to belong to the Mandelbrot set) or because you reached the max number of iterations.
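
Put another way, in scalar C the whole thing boils down to something like this (only an emulation of the semantics, not Larrabee or ATI code): a per-strand active mask, a kortest-style "is anything still alive?" test, and the iteration limit as the second exit:

Code:
/* Scalar C emulation of the loop's exit logic; purely illustrative. */
#include <stdint.h>

static int mandel_iters(const float cx[16], const float cy[16], int max_iter, float bound)
{
    float x[16] = {0}, y[16] = {0};
    uint16_t k_active = 0xFFFFu;                  /* all 16 strands still iterating */
    int iter = 0;

    while (k_active != 0 && iter < max_iter) {    /* the "kortest k2,k2 ; jnz" analogue */
        for (int lane = 0; lane < 16; ++lane) {
            if (!(k_active & (1u << lane)))
                continue;                          /* predicated off: result already final */
            float xn = x[lane] * x[lane] - y[lane] * y[lane] + cx[lane];
            float yn = 2.0f * x[lane] * y[lane] + cy[lane];
            x[lane] = xn;
            y[lane] = yn;
            if (xn * xn + yn * yn >= bound)        /* escaped: clear this strand's mask bit */
                k_active &= (uint16_t)~(1u << lane);
        }
        ++iter;
    }
    return iter;    /* trips taken before one of the two exit conditions fired */
}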

Marco
 
As long as you test for both loop exit conditions before incrementing there's no harm done. The result is very similar in structure to the ATI code then:

Code:
01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX 
    02 ALU_BREAK: ADDR(50) CNT(5) KCACHE0(CB0:0-15) 
         10  y: SETGT_INT   T0.y,  KC0[1].x,  R0.x      
             w: ADD         ____,  R5.x,  R4.x      VEC_021 
         11  z: SETGT_DX10  ____,  R1.x,  PV10.w      
         12  x: AND_INT     R8.x,  PV11.z,  T0.y      
         13  x: PREDNE_INT  ____,  R8.x,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED 
    03 ALU: ADDR(55) CNT(7) 
         14  x: ADD_INT     R0.x,  R0.x,  1      
             z: MUL_e*2     ____,  R6.x,  R7.x      VEC_120 
         15  x: ADD         R6.x,  R3.x,  PV14.z      
             y: ADD         ____, -R5.x,  R4.x      VEC_120 
         16  x: ADD         R7.x,  R2.x,  PV15.y      
             t: MUL_e       R5.x,  PV15.x,  PV15.x      
         17  x: MUL_e       R4.x,  PV16.x,  PV16.x      
04 ENDLOOP i0 PASS_JUMP_ADDR(2)

It always loops back at CF-04 (control flow instruction 04) and can only exit after ALU instruction 13, which causes CF-01 to jump to 05 (beyond the end of the loop) as soon as that predicate says stop.

Jawed
 
As long as you test for both loop exit conditions before incrementing there's no harm done. The result is very similar in structure to the ATI code then:

Code:
01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX 
    02 ALU_BREAK: ADDR(50) CNT(5) KCACHE0(CB0:0-15) 
         10  y: SETGT_INT   T0.y,  KC0[1].x,  R0.x      
             w: ADD         ____,  R5.x,  R4.x      VEC_021 
         11  z: SETGT_DX10  ____,  R1.x,  PV10.w      
         12  x: AND_INT     R8.x,  PV11.z,  T0.y      
         13  x: PREDNE_INT  ____,  R8.x,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED 
    03 ALU: ADDR(55) CNT(7) 
         14  x: ADD_INT     R0.x,  R0.x,  1      
             z: MUL_e*2     ____,  R6.x,  R7.x      VEC_120 
         15  x: ADD         R6.x,  R3.x,  PV14.z      
             y: ADD         ____, -R5.x,  R4.x      VEC_120 
         16  x: ADD         R7.x,  R2.x,  PV15.y      
             t: MUL_e       R5.x,  PV15.x,  PV15.x      
         17  x: MUL_e       R4.x,  PV16.x,  PV16.x      
04 ENDLOOP i0 PASS_JUMP_ADDR(2)

It always loops back at CF-04 (control flow instruction 04) and can only exit after ALU instruction 13, which causes CF-01 to jump to 05 (beyond the end of the loop) as soon as that predicate says stop.

Jawed

This particular code doesn't map very well to AMD hardware, while you should be able to get close to 100% efficiency on LRB.
You could also compile your little shader with NVShaderPerf; I am curious to know how many cycles it would take on G80.
 
This particular code doesn't map very well to AMD hardware, while you should be able to get close to 100% efficiency on LRB.
The low utilisation on ATI is precisely the reason I chose this - and the dynamic branching adds extra value because of the divergence penalty which is worst on ATI ;)

But the DP rather improves the picture I reckon...

You could also compile your little shader with NVShaderPerf; I am curious to know how many cycles it would take on G80.
Last time I tried NVShaderPerf there was no support for G80 and later; it stops at G71 :cry:

I think you need to have the CUDA SDK installed, which requires GeForce drivers, and then you have to run the unofficial disassembler on the compiled binary... and, well, I'm stuck.

The CUDA SDK actually includes a mandelbrot sample - it's my understanding that it runs SP until it needs the precision, then runs DP.

I think a comparison of SP and DP would be particularly interesting as this is a great example of something that's near useless on SP. OK, so it's pretty much useless anyway, but ...

Jawed
 
Last time I tried NVShaderPerf there was no support for G80 and later; it stops at G71 :cry:

I think you need to have the CUDA SDK installed, which requires GeForce drivers, and then you have to run the unofficial disassembler on the compiled binary... and, well, I'm stuck.
The latest NVShaderPerf definitely supports G80 (I played with it a few weeks ago), and you don't need to install CUDA or any other NVIDIA driver, AFAIK.
Give it a try or post your HLSL code...
 
Wait a minute! Isn't that what's called Global Data Share in the RV770 documents?
LDS is for intra-work-group (wavefront) sharing, supporting a programming model similar to what NVidia calls shared memory and what OpenCL calls local memory.

GDS is, erm, some data that can be shared by anything. I still don't understand how it's accessed and fenced - i.e. how threads synchronise on update of data in GDS.

---

It's worth pointing out that LDS has a different programming model than shared memory. It's really best thought of as a communication channel rather than a blob of memory. As a channel it can hold data for an arbitrary length of time, but the intention is that the same data is not repeatedly fetched from it. In fact an operand in an ordinary instruction cannot specify an LDS address.

Instead LDS data has to be copied into a work-item's context for further use (i.e. copied into a register). This copy mechanism (in both directions) effectively defines when fences for work-group-wide synchronisation should be actuated.
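
As a sketch of how I read that model (all of these names are made up, they're not real IL or ISA), the pattern is: export from registers into LDS, synchronise at the copy point, then explicitly import back into a register before an ordinary ALU op can touch the data:

Code:
/* Hypothetical pseudo-C for the LDS-as-channel model; none of these names are real.
   The key point: LDS is never an ALU operand, so data is staged in and out of it. */
static float lds[64];                               /* stand-in for the on-chip LDS */

static void  lds_export(int slot, float v) { lds[slot] = v; }    /* register -> LDS */
static float lds_import(int slot)          { return lds[slot]; } /* LDS -> register */
static void  workgroup_barrier(void)       { /* every work-item reaches the copy point */ }

static float share_with_neighbour(float my_value, int my_id, int group_size)
{
    lds_export(my_id, my_value);                    /* write phase */
    workgroup_barrier();                            /* the copies define where the fence sits */
    float neighbour = lds_import((my_id + 1) % group_size);  /* read phase, back to a register */
    return my_value + neighbour;                    /* only now can an ordinary op use it */
}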

As far as I can tell this is more similar to the D3D11-CS approach than the OpenCL approach.

Other interpretations sought :smile:

Jawed
 
Give it a try or post your HLSL code...
Brook+ is the source I'm playing with here:

Code:
kernel void
mandelbrot(float scale, int maxIterations, float size, out float mandelbrotStream<>)
{
    // Map this output element's position onto the complex plane
    float2 vPos = (float2)instance().xy;
    float2 pointt = vPos;
    float x, y, x2, y2;
    int iteration;
    float scaleSquare;
    pointt.x = (pointt.x - size/2.0f) * scale;
    pointt.y = (pointt.y - size/2.0f) * scale;
 
    x = pointt.x;
    y = pointt.y;
    x2 = x*x;
    y2 = y*y;
    scaleSquare = scale * scale * size * size;
    // Escape-time loop: iterate until the point escapes or the limit is hit
    for(iteration = 0; (x2+y2 < scaleSquare) && (iteration < maxIterations); iteration += 1)
    {
        y = 2.0f*(x*y) + pointt.y;
        x = (x2 - y2) + pointt.x;
        x2 = x*x;
        y2 = y*y;
    }
    // Cast to avoid integer division now that iteration is an int
    mandelbrotStream = (float)iteration/(float)maxIterations;
}
That vPos makes me think someone took some HLSL and just tweaked for Brook+ :!:

I should point out I changed the iteration parameter and variable to ints for the sake of "purity" and to see if that caused hiccups in compilation.

If you (or anyone) can do a double version too, that'd be cool.

To make the double version I had to change the loop condition to this:

Code:
(float)(x2+y2 - scaleSquare)<0 && iteration < maxIterations

because Brook+ doesn't support conditionals on doubles! I had so much grief discovering this, because the compiler just totally falls over with fatal exceptions that are total gibberish.

I dare say what would be most interesting is to compare actual performance to see the effect of divergence penalty.

Jawed
 
The CUDA SDK actually includes a mandelbrot sample - it's my understanding that it runs SP until it needs the precision, then runs DP.

I believe the standard behaviour is to run completely SP or DP, depending on what the user selected.
But there is a flag to toggle DP for a certain zoom level.
On my G80 the performance completely goes to crap with DP anyway (goes from about 70 fps to about 7 fps :)), because DP is emulated. But zooming is smooth as long as I keep it set to SP :)
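
If that's right, the selection logic presumably amounts to something like this (completely schematic; none of these names come from the SDK sample): stay in SP until the per-pixel step gets close to float precision, or until the user forces DP:

Code:
/* Schematic only; not taken from the CUDA SDK sample. Single precision runs out of
   resolution once the pixel spacing approaches float epsilon, so switch to DP there. */
#include <float.h>

enum precision { USE_FLOAT, USE_DOUBLE };

static enum precision pick_precision(double pixel_scale, int force_double)
{
    if (force_double)                          /* the user-toggled flag */
        return USE_DOUBLE;
    /* once the per-pixel step is within a few ulps of float precision, SP pixelates */
    return (pixel_scale < 16.0 * FLT_EPSILON) ? USE_DOUBLE : USE_FLOAT;
}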
 
So how does all this latest mumbo jumbo relate to Nvidia GT300 core Speculation?
 