HLSL Array reference issue

vindos

Newcomer
Helloo..

I've got some shader code that tries to find an element in an array:
for (i = 0; i < 13; i++)
{
    mini = i;
    for (j = i + 1; j < 25; j++)
    {
        if (b[j] < b[mini])
        {
            mini = j;
        }
    }
    minor = min(b[i], b[mini]);
    major = max(b[i], b[mini]);
    b[i] = minor;
    b[mini] = major;
}
It ended up giving a compilation error:
error X3500: Array reference cannot be used as an l-value, not natively addressable
The statement b[mini] = major is what the compiler has a problem with.
My question is: if b[i] = minor is a valid operation, then why not the other one?
I'm confused. :oops:
I'm using PS 3.0.
Please help! :?:
 
I couldn't find anything that documents this directly, but I think you'll find that you can only index the "array" with a loop counter (i.e. i or j).
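
If that is right, one way to still get the effect of b[mini] = major on PS 3.0 is to turn the dynamic write into a loop over every element, so that each store is indexed by the loop counter. A minimal sketch, assuming the array is a float b[25] as in the question; StoreAt is a made-up helper name, and this has not been checked against any particular compiler version:

```hlsl
// Hypothetical helper: emulates b[idx] = value without a dynamically
// indexed store. Assumes a 25-element float array, as in the question.
void StoreAt(inout float b[25], int idx, float value)
{
    [unroll]
    for (int k = 0; k < 25; k++)
    {
        // Once the loop is unrolled, every b[k] has a fixed index.
        // Only the element whose index equals idx receives the new value.
        b[k] = (k == idx) ? value : b[k];
    }
}
```

The cost is one compare/select per element on every call, so inside a doubly nested loop this adds up quickly on a PS 3.0 instruction budget.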


[Speculation mode]
The reason behind this is that perhaps some hardware does not have true array indexing or looping, so the compiler simply unrolls the loops. That's doable for the incrementing index but not for the data-dependent "mini" variable.
[/Speculation mode]
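
If the unrolling guess is right, the loop from the question can be restructured so that it never indexes with mini at all. The sketch below is one possible rewrite, not a tested fix: it assumes the goal is the 13th smallest of the 25 values (the median), and MedianOf25 is a hypothetical name. Instead of remembering the position of the minimum, it does a compare-exchange between b[i] and b[j] on every step, so every index is a loop counter.

```hlsl
// Sketch only: partial selection sort using loop-counter indices only.
// Assumes b holds the 25 input values; MedianOf25 is a made-up name.
float MedianOf25(float b[25])
{
    [unroll]
    for (int i = 0; i < 13; i++)
    {
        [unroll]
        for (int j = i + 1; j < 25; j++)
        {
            // Compare-exchange: the smaller value ends up in b[i],
            // the larger one in b[j].
            float lo = min(b[i], b[j]);
            float hi = max(b[i], b[j]);
            b[i] = lo;
            b[j] = hi;
        }
    }
    // After 13 passes, b[0..12] hold the 13 smallest values in order,
    // so b[12] is the median of the 25 inputs.
    return b[12];
}
```

This is 234 compare-exchanges, so roughly 468 min/max instructions once unrolled. That is close to the 512 instruction slots of ps_3_0, so whether it fits together with the rest of the shader would need to be checked.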
 
SM4.0 (Direct3D 10) should definitely allow indexed temporaries though.
Really? Wow, I never knew that. I was trying to see if this was a feature of SM4.0, but everywhere I looked it seemed like you could only index the constant register file with a non-loop value. I could never find a clear and 100% complete explanation of indexing restrictions, though.

Architecturally, the ability to index temporaries is really nasty. The hardware can no longer organize and access the register file in large blocks (e.g. 256-bit for scalar 8-SIMD) unless you want dramatic slowdown with indexing. Even worse is the fact that you can't tell how much register space each pixel needs at compile time!
 
I could never find a clear and 100% complete explanation of indexing restrictions, though.
It certainly isn't discussed explicitly anywhere that I know of! There are also definitely restrictions on non-constant indexing of output arrays (render targets), but indexed temporaries do work. I can guarantee it in GLSL since that's what I've been working on today (although it turns out that NVIDIA has an optimizer bug related to indexed temporaries and control flow :S).

Certainly quite useful for stuff like a "real stack" for data structure traversal.
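
To make the "real stack" idea concrete, here is a rough SM4.0-style sketch of a depth-first traversal that keeps its stack in a local array indexed by a runtime value. Everything in it is an assumption for illustration: the Children and Values buffers, their layout, MAX_DEPTH, and the SumTree name are all invented, and it has not been run on real hardware.

```hlsl
// Sketch only: explicit stack held in an indexable temporary array.
// Children, Values, MAX_DEPTH and SumTree are hypothetical names.
#define MAX_DEPTH 16

Buffer<int2>  Children; // x = left child index, y = right child index, -1 = none
Buffer<float> Values;   // one value per node

float SumTree(int root)
{
    int stack[MAX_DEPTH]; // indexed by the runtime value "top"
    int top = 0;
    float sum = 0;

    stack[top] = root;
    top++;

    while (top > 0)
    {
        top--;
        int node = stack[top];
        sum += Values.Load(node);

        int2 c = Children.Load(node);
        // Push children; if the stack is full the child is silently
        // skipped, so MAX_DEPTH must be chosen large enough for the tree.
        if (c.x >= 0 && top < MAX_DEPTH) { stack[top] = c.x; top++; }
        if (c.y >= 0 && top < MAX_DEPTH) { stack[top] = c.y; top++; }
    }
    return sum;
}
```

Both the reads and the writes of stack[] use a non-constant index, which is exactly what PS 3.0 rejects for temporaries.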

Architecturally, the ability to index temporaries is really nasty. The hardware can no longer organize and access the register file in large blocks (e.g. 256-bit for scalar 8-SIMD) unless you want dramatic slowdown with indexing. Even worse is the fact that you can't tell how much register space each pixel needs at compile time!
Yeah, I'm not totally sure how they do it architecturally... as forthcoming as AMD/NVIDIA have been about the architectures themselves, they are still very secretive about how they *use* those architecture features to implement graphics API functionality. I've heard anecdotal evidence that local arrays are stored in the Parallel Data Cache on G80, but I have no idea whether they simply store all register data there, or how they manage that across different threads. I believe there is some guarantee that a single kernel "warp" will always be run on the same "streaming multiprocessor" - even if it gets context switched due to stalls, etc - which means that local data can stick around (space permitting).

In any case, that's an aspect of the hardware/software that I'd really be interested to know more about. The equivalent of "how to implement D3D10/GL in CUDA" :)
 
It certainly doesn't discuss it explicitly anywhere that I know of!
This is all I could find, and it's PS3.0 stuff:
http://msdn2.microsoft.com/en-us/library/bb172921.aspx
http://msdn2.microsoft.com/en-us/library/bb172920.aspx

Too bad there's no PS4 asm and equivalent info for it.
Certainly quite useful for stuff like a "real stack" for data structure traversal.
Definitely useful for a stack, though sometimes hierarchical structures can be traversed without one (see the NVIDIA dynamic ambient occlusion example).

I've heard anecdotal evidence that local arrays are stored in the Parallel Data Cache on G80, but I have no idea whether they simply store all register data there, and how they manage that across different threads.
I'm pretty sure G80 has a register file on top of the PDC, but I have a feeling you're right about arrays being done there.

Indexable temporaries is a feature that's probably responsible for a lot of the die space increase of DX10 vs. DX9. Most of the other things can probably be done with minor tweaks of DX9 hardware, but this requires a big change in the structure and data flow of the shader units.
 
Too bad there's no PS4 asm and equivalent info for it.
Yeah documentation seems a bit lacking so far. The other thing I can't seem to find is a spec about the integer overflow behaviour of ALUs in DX10. Seems like NVIDIA uses wraparound (which is convenient for me) but AMD does something different, at least if I'm correct in my assumption that that is what's buggering up SAVSM on R600 (don't have a card myself to test on).

I found a post by Humus that seems to confirm indexable temporaries here.

I'm pretty sure G80 has a register file on top of the PDC, but I have a feeling you're right about arrays being done there.
Yeah, I definitely think there's a register file since I've heard from several NVIDIA people that the PDC is "basically as fast as registers". It would be *really* nice to know how some of this is implemented though, as it has non-trivial effects on how we want to do things - for example - when compiling arbitrary code @ RapidMind.

Indexable temporaries is a feature that's probably responsible for a lot of the die space increase of DX10 vs. DX9. Most of the other things can probably be done with minor tweaks of DX9 hardware, but this requires a big change in the structure and data flow of the shader units.
Fair enough. They're pretty useful in practice, although NVIDIA seems to have pulled another "256-loop limit" with local arrays in that as soon as you go over about 512 float4's (admittedly quite a lot, and more than just chopping up the PDC), things just silently get the wrong answers...

Clearly not a feature that they've advertised a lot, and not one which is probably being used much yet (judging from the significant compiler bugs, etc. that I've found so far even with trivial code).
 
Even worse is the fact that you can't tell how much register space each pixel needs at compile time!

Well, you declare the size of the array, so that gives you the number of registers needed, or are you thinking of something else entirely?
 
Too bad there's no PS4 asm and equivalent info for it.

Well, here's an example shader just for a little bit of insight:
Code:
HLSL:
 
float main(int coord: TEXCOORD) : SV_Target {
    float array[2];
 
    array[coord] = 1;
    array[coord + 1] = 0;
 
    return array[coord & 1]; 
}
 
Asm:
 
ps_4_0
dcl_input constant v0.x
dcl_output o0.x
dcl_temps 1
dcl_indexableTemp x0[2], 4
mov r0.x, v0.x
mov x0[r0.x + 0].x, l(1.000000)
mov x0[r0.x + 1].x, l(0)
and r0.x, v0.x, l(1)
mov o0.x, x0[r0.x + 0].x
ret
 
Well, you declare the size of the array, so that gives you the number of registers needed, or are you thinking of something else entirely?
Yeah, that was a bit of a brainfart on my part. I was thinking in terms of ps3.0 asm for some reason. The "dcl_indexableTemp x0[2], 4" line in your next post explains a lot.

What kinds of limits are there on indexable arrays? How many? How big? Do they impact performance in the same way as regular registers?
 
What kinds of limits are there on indexable arrays? How many? How big? Do they impact performance in the same way as regular registers?
I'd definitely like to know this sort of information as well. Particularly I would have expected them to impact performance the same way as registers, but that doesn't seem to be the case at least on G80. For instance implementing a reduction using a large local array (Cell-like) is a *bit* slower on G80 than just doing it the normal way (reducing as you read), but not by much...
 
DX10 gives you 4096 vec4 temps, and both normal temps (r#) and temp arrays (x#[]) count against that limit.

I don't have experience with how temp arrays perform, but in this post I speculated Nvidia used thread-local off-chip memory; CUDA apparently spills registers into this region.

I don't think the PDC would work: each thread gets up to 64kB of temps (4096 vec4s at 16 bytes each), which is too big for the PDC. Maybe they use it as a cache though, before going out to memory.
 