The nvidia future architecture thread (G100/GT300 and such)

Only because, as far as we know, that's the only difference between the architectures.

Then we would have seen R600 outdo RV670 in other texturing-heavy tests. This deficit only applies when branching is involved.
But this is texturing with dynamic branching (DB), so the incoherency/cache-thrashing isn't like in other tests.

On this page:

http://www.digit-life.com/articles2/video/rv630-part2-page1.html

HD2600XT and HD2400XT are shown:


Compared with HD2900XT, HD2600XT is:
  • FLOPs 40%
  • TEX 54%
  • BW 33%
  • Steep Parallax Mapping 36%
Compared with HD2900XT, HD2400XT is:
  • FLOPs 12%
  • TEX 24%
  • BW 12%
  • Steep Parallax Mapping 16%
I can't determine what clocks they used for HD2600Pro on that page's tests. The article says that SPM has a heavy TEX workload as well as heavy ALU and DB.
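
Those figures line up with the paper specs, by the way. A minimal sanity check, assuming the commonly quoted reference clocks (the article doesn't state them, so treat the numbers below as my assumption):

Code:
#include <stdio.h>

/* Assumed reference specs (not from the article):
   HD2900XT: 320 SPs @ 742 MHz, 16 TUs, 512-bit bus @ 825 MHz
   HD2600XT: 120 SPs @ 800 MHz,  8 TUs, 128-bit bus @ 1100 MHz
   HD2400XT:  40 SPs @ 700 MHz,  4 TUs,  64-bit bus @ 800 MHz  */
static void ratios(const char *name, double sp, double eng,
                   double tu, double bus, double mem)
{
    printf("%s: FLOPs %.0f%%  TEX %.0f%%  BW %.0f%%\n", name,
           100.0 * sp  * eng / (320.0 * 742.0),
           100.0 * tu  * eng / ( 16.0 * 742.0),
           100.0 * bus * mem / (512.0 * 825.0));
}

int main(void)
{
    ratios("HD2600XT", 120, 800, 8, 128, 1100);  /* ~40% / ~54% / ~33% */
    ratios("HD2400XT",  40, 700, 4,  64,  800);  /* ~12% / ~24% / ~12% */
    return 0;
}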

It still seems like a split between TEX rate and BW is what determines the performance of this shader.

Yup, and even some of that is due to driver/compiler improvements (RV670 scores are improved a few percent in the 4850 review). Or did you already account for that?
I wasn't using RV670 scores for the figures I posted; I was using R600's. However, I didn't properly account for R600's ALU clock, so the percentages I listed are 1% too high. But of course there's a high chance that R600's scores are better with later drivers; I just haven't seen them.

Jawed
 
But this is texturing with dynamic branching (DB), so the incoherency/cache-thrashing isn't like in other tests.
I suppose, but if it were a cache issue we should see at least one other texturing test without DB that does better on R600 than RV670.

BTW, steep parallax mapping doesn't cause thrashing, despite the branching. Texture samples are all very close to each other and are only taken from the height map until the final intersection is found. You're just perturbing the location for parallax.
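
A rough upper bound on how far apart successive height-map fetches land (the offset, view angle and map size here are hypothetical, just to show the order of magnitude; the variable names follow the shader quoted below):

Code:
#include <stdio.h>

int main(void)
{
    double fParallaxOffset = 0.05;  /* hypothetical height scale */
    double eye_z           = 0.5;   /* fairly oblique view       */
    double fTexSize        = 512.0; /* height-map resolution     */
    int    nNumSteps       = 50;    /* worst-case step count     */

    /* per-step UV stride; |vEye.xy| <= 1, so this is an upper bound */
    double stride = fParallaxOffset / eye_z / nNumSteps;
    printf("stride: %.4f UV = %.1f texels per step\n",
           stride, stride * fTexSize);
    /* ~0.002 UV = about one texel per step: consecutive fetches hit
       the same or neighbouring cache lines, so no thrashing */
    return 0;
}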

But of course there's a high chance that R600's scores are better with later drivers; I just haven't seen them.
Yup, that's what I meant. RV670 improved between those reviews, so it stands to reason that R600 would too.
 
Dunno, the code would help. This is the best I can find:
You mean like this?
Code:
// Steep parallax mapping shader
// Based on shaders:
//   Steep parallax mapping     (c) 2005 Morgan McGuire and Max McGuire (Iron Lore Entertainment)
//   Parallax occlusion mapping (c) 2005 Natalya Tatarchuk (ATI Research, Inc.)
//
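
// Assumed externals (plausible guesses, NOT part of the original listing,
// which is missing some definitions; supplied here only so the listing
// is self-contained):

float4x4 mW;				// world matrix
float4x4 mWVP;				// world-view-projection matrix
float4   vCameraPos;		// camera position, world space
float    fGeometryScale;	// world-space geometry scale
float    fParallaxOffset;	// parallax height scale
float    fMaterialPower;	// specular exponent

struct LIGHT
{
	float4	vPosition;
	float	fAttenuation;
	float4	vDiffuse;
	float4	vSpecular;
};
LIGHT Lights[1];

sampler HeightMapSampler;
sampler NormalMapSampler;
sampler DiffuseMapSampler;
sampler SpecularMapSampler;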

struct PARALLAX_DATA
{
	float4 vPosition	: POSITION;
	float2 vTexCoord	: TEXCOORD0;
	float3 vEye			: TEXCOORD1;
	float3 vLight		: TEXCOORD2;
};

PARALLAX_DATA ParallaxVS(
	float4 vPosition	: POSITION,
	float4 vNormal		: NORMAL,
	float2 vTexCoord	: TEXCOORD0,
	float3 vTangent		: TEXCOORD1,
	float3 vBinormal	: TEXCOORD2
)
{
	PARALLAX_DATA o;

	// Tangent Space

	float3 binormal = mul( vBinormal.xyz, mW );
	float3 tangent  = mul( vTangent.xyz,  mW );
	float3 normal	= mul( vNormal.xyz,	mW );

	// Vertex Position -> World Space

	float4 vPositionWorld = mul( vPosition, mW );// * fGeometryScale;

	// Eye Vector

	float3 eye = vCameraPos.xyz - vPositionWorld.xyz;
	o.vEye.x = dot( eye, binormal ); 
	o.vEye.y = dot( eye, tangent ); 
	o.vEye.z = dot( eye, normal ); 

	vPositionWorld *= fGeometryScale;

	// Lighting

	float3 light = (Lights[0].vPosition.xyz - vPositionWorld.xyz)*Lights[0].fAttenuation;
	o.vLight.x = dot( light, binormal );
	o.vLight.y = dot( light, tangent );
	o.vLight.z = dot( light, normal );

	// Texture Coords

	o.vTexCoord  = vTexCoord;

	// Position

	o.vPosition = mul( vPosition, mWVP );

	// Finalize

	return o;
}

const float fThreshold = 4;
const float fTexSize = 512;
const int maxSamples = 50;
const int minSamples = 8;

float4 ParallaxPS ( PARALLAX_DATA In ) : COLOR
{
	float fHeight = 0.0;
	float2 vTexCoord = In.vTexCoord;
	float3 vEye = normalize( In.vEye );

	// Compute current gradients:
	float2 fTexCoordsPerSize = vTexCoord * fTexSize;

	// Compute all 4 derivatives in x and y in one instruction:
	float2 dxSize, dySize;
	float2 dx, dy;

	float4( dxSize, dx ) = ddx( float4( fTexCoordsPerSize, vTexCoord ) );
	float4( dySize, dy ) = ddy( float4( fTexCoordsPerSize, vTexCoord ) );
					
	// Find min of change in u and v across quad: compute du and dv magnitude across quad
	float2 dTexCoords = dxSize * dxSize + dySize * dySize;
	// standard mipmapping uses max here
	float fMinTexCoordDelta = max( dTexCoords.x, dTexCoords.y );
	// Compute mip level (the * 0.5 effectively computes a square root before the log,
	// since 0.5 * log2(x) == log2(sqrt(x)))
	float fMipLevel = max( 0.5 * log2( fMinTexCoordDelta ), 0 );
	
	if ( fMipLevel <= fThreshold )
	{
		int nNumSteps = (int) lerp( maxSamples, minSamples, vEye.z );
	
		float fStep = 1.0 / (float)nNumSteps;
		float2 vDelta = float2( In.vEye.x, In.vEye.y ) * fParallaxOffset * fStep / In.vEye.z;

		float fCurHeight = 1.0;
		int nStepIndex = 0;

		while ( nStepIndex < nNumSteps ) 
		{
			vTexCoord += vDelta;
			fHeight = tex2Dgrad( HeightMapSampler, vTexCoord, dx, dy ).x;
			fCurHeight -= fStep;
			if ( fHeight > fCurHeight ) 
				nStepIndex = nNumSteps + 1;
			else
				nStepIndex++;
		}
	}

	// Bump Mapping

	float3 vN = tex2D( NormalMapSampler, vTexCoord ) * 2.0 - 1.0;

	// Lighting
	float3 vLight = normalize( In.vLight );
	float3 vHalfAngle = vLight + vEye;
	float3 vHalf = normalize( vHalfAngle );

	float fNdotL = saturate( dot( vN.xyz, vLight.xyz ) );
	float fNdotH = saturate( dot( vN.xyz, vHalf.xyz ) );

	float fSpec = pow( fNdotH, fMaterialPower );
	float vAttenuation = saturate( 1.0 - dot( In.vLight, In.vLight ) );
	float4 vDiffuse  = Lights[0].vDiffuse * fNdotL * vAttenuation;
	float4 vSpecular = Lights[0].vSpecular * fSpec * vAttenuation;

	float selfShadow = 1.0;

	if ( fMipLevel <= fThreshold && fNdotL > 0 )
	{
		// Trace a shadow ray along the light vector.
		int nNumShadowSteps = (int) lerp( maxSamples, minSamples, vLight.z );
		float fStep = 1.0 / (float)nNumShadowSteps;
		float2 vShadowCoord = vTexCoord;
		float2 vDelta = float2( vLight.x, vLight.y ) * fParallaxOffset * fStep / vLight.z;

		float fCurHeight = fHeight + fStep * 0.1;
		int nStepIndex = 0;

		while ( nStepIndex < nNumShadowSteps ) 
		{
			vShadowCoord -= vDelta;
			fHeight = tex2Dgrad( HeightMapSampler, vShadowCoord, dx, dy ).x;
			fCurHeight += fStep;
			if ( fHeight > fCurHeight || fCurHeight >= 1.0 ) 
				nStepIndex = nNumShadowSteps + 1;
			else
				nStepIndex++;
		}

		// We are in shadow if we left the loop early, i.e. because the
		// height field rose above the shadow ray (fHeight > fCurHeight)
		selfShadow = fHeight < fCurHeight;
	}

	// Finalize

	return  vDiffuse * selfShadow * tex2D( DiffuseMapSampler, vTexCoord ) +
			vSpecular * tex2D( SpecularMapSampler, vTexCoord );
}
It's included in Rightmark 3D.
 
Awesome, thanks! I couldn't find version 2.0 to download anywhere :cry:

I've had to wrangle it to get it to compile - missing some definitions. I shall play more tomorrow :D

Jawed

All versions are right here under our very noses. I myself needed a nudge in the right direction, though. :)

http://www.ixbt.com/video3/rv770-2-part2.shtml

Just below this line (Russian for "Synthetic tests"):
"Синтетические тесты"

http://www.ixbt.com/video/itogi-video/ini/rm1050new.rar
http://www.ixbt.com/video/itogi-video/ini/rmdx10.rar http://www.ixbt.com/video/itogi-video/ini/rmdx101.rar
 
Have we any idea what the G100 is?

I mean something other than "Nvidia G100, TeraFLOPS Visual Computing".

A DX11 next-gen part? Or another codename referring to the current GT200 / GT206 / GT216?
 
Have we any idea what the G100 is?

I mean something other than "Nvidia G100, TeraFLOPS Visual Computing".

A DX11 next-gen part? Or another codename referring to the current GT200 / GT206 / GT216?
Real World Tech's GT200 writeup was originally named a G100 writeup (as mentioned in the b3d thread). If you go to the writeup and view the page's source, it has this in the header:
<meta name="description" content="Our analysis of NVIDIA's latest GPU, the G100 (also known as the GT200 or GTX280)">
 
Real World Tech's GT200 writeup was originally named a G100 writeup (as mentioned in the b3d thread). If you go to the writeup and view the page's source, it has this in the header:

RWT doesn't exactly cover GPUs on a regular basis though. I very much trust DK's architectural analysis, but doubt he has any insight as to the codename history for GT200.
 
Hmm, is it possible to eliminate bandwidth though? Bandwidth is the only factor that differs by an equivalent magnitude between R600 and RV670.
Jawed
Sorry for the late reply, but as it happens, I've tested just this portion of Rightmark 3D with RV670 and R600 with quite recent drivers.

There's a gap in performance (PS3.0 Steep Parallax Mapping test) which, in 1920x1200, gives R600 an advantage of 77.599%, almost identical to its advantage in theoretical bandwidth (77.6%; I've used an HD 3870 GDDR4 @ 1125 MHz vs. an HD 2900 XT/1G @ 999 MHz).

So, in this case, I tend to think that there's something going on which really taxes bandwidth a lot on AMD's sixth-generation hardware. Maybe for some reason those lookups inside loops cannot be served from cache, or whatever.
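
For reference, the bandwidth arithmetic behind that 77.6% figure (a minimal check, using the bus widths and the clocks stated above):

Code:
#include <stdio.h>

int main(void)
{
    /* GB/s = (bus bits / 8) * 2 (DDR) * MHz / 1000 */
    double r600  = 512.0 / 8.0 * 2.0 *  999.0 / 1000.0;  /* HD 2900 XT/1G */
    double rv670 = 256.0 / 8.0 * 2.0 * 1125.0 / 1000.0;  /* HD 3870 GDDR4 */
    printf("R600 %.1f GB/s vs RV670 %.1f GB/s: +%.1f%%\n",
           r600, rv670, (r600 / rv670 - 1.0) * 100.0);   /* +77.6% */
    return 0;
}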
 
Sorry for the late reply, but as it happens, I've tested just this portion of Rightmark 3D with RV670 and R600 with quite recent drivers.
:cool:

There's a gap in performance (PS3.0 Steep Parallax Mapping test) which, in 1920x1200, gives R600 an advantage of 77.599%, almost identical to its advantage in theoretical bandwidth (77.6%; I've used an HD 3870 GDDR4 @ 1125 MHz vs. an HD 2900 XT/1G @ 999 MHz).

So, in this case, I tend to think that there's something going on which really taxes bandwidth a lot on AMD's sixth-generation hardware. Maybe for some reason those lookups inside loops cannot be served from cache, or whatever.
Earlier, apart from theoretical bandwidth, I also hypothesised that the ring-bus could be getting in the way.

Well, maybe it's a ring-bus bandwidth issue then, since the ring bus scales with the size of the memory bus. Remember that the TUs in R6xx are shared by all SIMDs, and it seems that texture results are distributed to the SIMDs by the ring bus. So if RV670 has "half" the ring-bus bandwidth of R600, then this might be the bandwidth bottleneck, which is a function of the kind of dependent texturing in this test.
This could be getting in the way of copying texels from L2 to L1. Or it could be getting in the way of moving texture results from TUs to ALUs. Or it could be both.
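
To put rough numbers on that hypothesis (a sketch only: it assumes R600's quoted 1024-bit internal ring, half that for RV670, and a ring running at core clock, as discussed below):

Code:
#include <stdio.h>

int main(void)
{
    /* assumed: ring width tracks the external bus, ring runs at core clock */
    double r600_ring  = 1024.0 / 8.0 * 742.0 / 1000.0;  /* GB/s @ 742 MHz */
    double rv670_ring =  512.0 / 8.0 * 775.0 / 1000.0;  /* GB/s @ 775 MHz */
    printf("ring BW: R600 %.0f GB/s vs RV670 %.0f GB/s (%.2fx)\n",
           r600_ring, rv670_ring, r600_ring / rv670_ring);  /* ~1.9x */
    return 0;
}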

Though having said that, I think the ring bus is clocked at core speed, not memory clock speed - dunno how easy it would be to play with clock speeds (either memory or core) to see how it affected the scaling of this test.

Jawed
 
Overall, I think NVidia's going to stick with its ALU architecture.

EDITED: NVidia could "easily" go with 8-clock instructions instead of 4-clock instructions, to arrive at 64-element batches.
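
The batch-size arithmetic, assuming the ALU SIMDs stay 8 wide:

Code:
#include <stdio.h>

int main(void)
{
    int simd_width = 8;  /* assumed lanes per SIMD, G80/GT200 style */
    printf("4 clocks/instruction: %d-element batches\n", simd_width * 4);
    printf("8 clocks/instruction: %d-element batches\n", simd_width * 8);
    return 0;
}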

Increasing the ALU:TEX ratio also reduces the per-FLOP control overhead, since each cluster appears to have some control logic common to all of its SIMDs.

Apart from that, I think as far as ALUs are concerned, it's a case of getting them to 2GHz and beyond...

Jawed

The CUDA docs refer to current SIMD granularity at half warp (ie 16-wide), but say that programs should plan for full warp (ie 32-wide) granularity to be portable to future hardware. So perhaps an increase in internal batch size for future hardware (16 to 32 for memory access, and 8 to 16 or to 32 for ALUs) is already on the table.
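
To illustrate why that guidance matters, here's a hypothetical kernel (not from the CUDA docs) in the warp-synchronous style of the era: the tail of a reduction drops __syncthreads() on the assumption that a full warp executes in lockstep. Code that instead baked in 16-wide half-warp lockstep would be the non-portable variant.

Code:
// Shared-memory reduction for one 64-thread block; the 'if (t < 32)'
// tail relies on full-warp (32-wide) lockstep, per the portability
// guidance quoted above.
__global__ void reduce64(float *data)
{
    __shared__ volatile float s[64];
    unsigned t = threadIdx.x;
    s[t] = data[t];
    __syncthreads();

    if (t < 32)  // no further barriers: warp-synchronous
    {
        s[t] += s[t + 32];
        s[t] += s[t + 16];
        s[t] += s[t + 8];
        s[t] += s[t + 4];
        s[t] += s[t + 2];
        s[t] += s[t + 1];
    }
    if (t == 0) data[0] = s[0];
}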

Also, I wonder if there is anything to be learned from Intel going with an NVidia-esque scalar model as well for shader computation on Larrabee?
 
The CUDA docs refer to current SIMD granularity at half warp (ie 16-wide), but say that programs should plan for full warp (ie 32-wide) granularity to be portable to future hardware.
The very latest CUDA docs still say that? Blimey.

With GT200 I thought the only bits that are 16-wide are the memory and register file accesses.

So perhaps an increase in internal batch size for future hardware (16 to 32 for memory access, and 8 to 16 or to 32 for ALUs) is already on the table.
I'm pretty sure this quote originally came from the G80 context (which has half-warps), implying that 32-wide was incoming and that it would stay this way for quite a long time.

Larger than 32-wide could be years away, I reckon.

Also, I wonder if there is anything to be learned from Intel going with an NVidia-esque scalar model as well for shader computation on Larrabee?
NVidia's model isn't scalar though - it's issuing to MAD + SF/MUL + DP ALUs all of which have differing widths (8, 2/8, 1) and doing so at various frequencies (+ TMUs as a fourth kind of unit, as far as I can tell).

Larrabee's vector unit seems to be a truly scalar, 16-wide unit. There's no sign of how transcendentals will be computed, I admit.

Intel definitely made a comparison with NVidia's design. But the control logic overheads are wildly different in comparing the two architectures. And that's where I think the key to NVidia's very serious disadvantage lies.

:LOL: though if you label the scalar part of each of Larrabee's x86 cores as "control logic overhead", maybe that particular comparison isn't so lopsided :LOL:

Jawed
 
Intel definitely made a comparison with NVidia's design. But the control logic overheads are wildly different in comparing the two architectures. And that's where I think the key to NVidia's very serious disadvantage lies.

:LOL: though if you label the scalar part of each of Larrabee's x86 cores as "control logic overhead", maybe that particular comparison isn't so lopsided :LOL:

I don't really think that's a joke; that's how I've been thinking of them. Nvidia has all the branch divergence, warp tracking, dependency tracking, etc. as a big blob of control hardware. Intel has none of that, but they have this simple in-order x86 core that'll probably be doing nothing but control-flow stuff. LRB really is taking as much as possible from GPU architecture, including flow control, and moving it into software.

It's an interesting tradeoff. GT200 can have 32 warps (threads) in flight at once, and manages all of the control logic stuff transparently for software. LRB is limited to only 4 threads at once and all the control logic has to be done by hand, but it is obviously far more flexible. I wonder how the areas compare between the two designs.
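
The in-flight context arithmetic from those numbers (GT200's count is per SM, LRB's per core; lane widths as discussed above):

Code:
#include <stdio.h>

int main(void)
{
    /* GT200: 32 warps x 32 lanes, all scheduled by hardware */
    printf("GT200 SM: %4d strands in flight\n", 32 * 32);
    /* LRB: 4 hardware threads x 16 lanes; software fibres multiply this */
    printf("LRB core: %4d strands in flight (before fibres)\n", 4 * 16);
    return 0;
}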
 
I don't really think that's a joke; that's how I've been thinking of them. Nvidia has all the branch divergence, warp tracking, dependency tracking, etc. as a big blob of control hardware. Intel has none of that, but they have this simple in-order x86 core that'll probably be doing nothing but control-flow stuff. LRB really is taking as much as possible from GPU architecture, including flow control, and moving it into software.
But GPUs have other dedicated hardware to manage thread creation and load-balancing, to interact with the CPU/PCI-Express, and to do other hidden stuff. This is just more grist for the scalar parts of Larrabee. Then there are all the task-parallel thread types in rendering (or the kind of new rendering algorithms that Larrabee laps up) which are an awkward match for current GPUs.

It's an interesting tradeoff. GT200 can have 32 warps (threads) in flight at once, and manages all of the control logic stuff transparently for software. LRB is limited to only 4 threads at once and all the control logic has to be done by hand, but it is obviously far more flexible. I wonder how the areas compare between the two designs.
Larrabee's manual threading is certainly costly, so the pressure will be on developers to minimise threads (fibres in Intel speak) at all costs, it seems.

Jawed
 
[…]dunno how easy it would be to play with clock speeds (either memory or core) to see how it affected the scaling of this test.
With regular clock speeds (core/mem) that should be no problem; this weekend at the latest, though I'll have to try to get hold of another HD3870, because I only own the R600 myself.

Just played with my GTX280 in this particular test, and fillrates don't start to drop significantly (>3%) until I reach about 550 MHz memclock. So this doesn't seem to be a general problem with this test, as RV770 was already able to indicate.
 
I am more interested in the scalar vs. superscalar side, which is what fundamentally sets RV770 and GT200 apart.

Is NVidia's solution able to sustain into the future?

And since Apple has chosen NVidia as their partner, I really hope they do well.
 
Sorry, this is the last OT posting from me in here. :(

I've switched mem clocks around somewhat on both GDDR4 variants of R600 and RV670. Unless the mem clock also dictates the speed of the internal ring bus, I'd conclude that the vastly different performance of the two chips in Rightmark3D's Steep Parallax Mapping test (DX9) is not a function of external memory bandwidth (and it hasn't been fixed in a newer driver up to now, either).

All of the following results are fillrates in Mpix at Full HD resolution (1920x1200):
Code:
R600, Cat 8.10 WHQL (def.)          RV670, Cat 8.10 WHQL
eng/mem (MHz)   Mpix                eng/mem (MHz)   Mpix
776/1053        138.4
776/999         138.5
776/945         138.0
776/891         138.1
776/855         137.5
776/801         137.7
776/749         136.8
776/693         136.2
776/648         135.3
776/603         134.6               776/1206        73.9
776/549         133.1               776/1098        73.8
776/504         132.0               776/0999        73.8
776/450         129.9               776/0900        73.5
776/396         127.5               776/0801        72.6
776/351         124.6               776/0702        71.7
776/297         120.2               776/0603        70.2
776/252         114.7               776/0504        68.3
776/225         110.7               NA
776/198         105.7               776/0405        65.2
776/185         102.7               NA
776/153         93.7                776/0297        60.3
Note that I've inserted the RV670 results into the rows where its external memory bandwidth most closely matches R600's (RV670's bus is half as wide, hence the doubled memory clocks), and I've taken the liberty of clocking both chips' engines the same.
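
A quick read of the endpoints of each column (scaling of external bandwidth vs. measured fillrate):

Code:
#include <stdio.h>

int main(void)
{
    /* R600:  mem 1053 -> 153 MHz cuts bandwidth 6.9x,
       yet fillrate only falls 138.4 -> 93.7 Mpix (1.48x) */
    printf("R600:  BW /%.1f, fillrate /%.2f\n", 1053.0 / 153.0, 138.4 / 93.7);
    /* RV670: mem 1206 -> 297 MHz cuts bandwidth 4.1x,
       fillrate falls 73.9 -> 60.3 Mpix (1.23x) */
    printf("RV670: BW /%.1f, fillrate /%.2f\n", 1206.0 / 297.0, 73.9 / 60.3);
    /* neither chip loses anywhere near as much fillrate as bandwidth,
       consistent with the conclusion above */
    return 0;
}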
 