FULL-HD GPU raytracing demo<dx9>

RacingPHT · May 20, 2011

Hi,

Here's a really fast raytracing demo for a game-like car showcase.
It should be run on any DX10 hardware, although it actually used dx9 api with some hacks and precision assumptions.
You could expect it to run very smoothly at 1920x1080 on a AMD gpu with 400sp and up and NVidia GPUs with 96sp and up.
If you're using a lower end GPU, please try low resolution first as the demo might cause GPU reset on these platforms.

Sorry for the unprofessional file uploading platform and no source code available for now. The code is not general enough to be shared.

Here's the link:
http://www.freefilehosting.net/raytrace-hlsl-merge_3
http://www.mediafire.com/?zbmgw9v84lzsddw

Hope you enjoy it!

Any thoughts:
racingpht@gmail.com

Mintmaster · May 20, 2011

Pretty nice, but it could definitely use vertex normals and interpolation across the face instead of flat shading.

How well does it scale with polygon count? This model seems to use very few polygons.

homerdog · May 20, 2011

"The program can't start because VCOMP90.dll is missing from your computer. Try reinstalling the program to fix this problem."

I think I know how to fix this but is this error message normal?

Edit Well I downloaded the missing dll, placed it in the bin directory and it works now. Getting around or right under 200fps on my GTX260.

Here since I'm a super cool and nice guy I'll attach the file :smile:

RacingPHT · May 21, 2011

Mintmaster said:
Pretty nice, but it could definitely use vertex normals and interpolation across the face instead of flat shading.

How well does it scale with polygon count? This model seems to use very few polygons.

Sorry for the programmer's art

the model was downloaded somewhere on the internet, And i'm not lucky enough to have a artist to fix the normal for me. And the reflected normal looks so cheap when interpolated.

As long as polygon scalability, I think this algorithm is not very scalable as I used uniform grid to avoid stack storage(and thus per-pixel register pressure). However, the base mesh could be in arbitrary density(compared to the simplified reflection mesh)

The idea of this method is to use ray tracing only when absolutely needed, and does so in the local range. Larger scale reflections are faked very well with dynamic cube maps, so why not just use it and combined with true local ray tracing. I believe what people will notice is the rough shape of reflection(thus why some even cheaper method such as height field reflection works), but this shape can't be rendered correctly without ray tracing.

At least for the near future hardware, in the context of real-time game rendering, pure ray tracing engine makes little sense to me.

RacingPHT · May 21, 2011

homerdog said:
"The program can't start because VCOMP90.dll is missing from your computer. Try reinstalling the program to fix this problem."

I think I know how to fix this but is this error message normal?

Edit Well I downloaded the missing dll, placed it in the bin directory and it works now. Getting around or right under 200fps on my GTX260.

Here since I'm a super cool and nice guy I'll attach the file :smile:

Oh it's the OpenMP dll, thanks for fixing it!:smile:
I also updated the link, it should not need the dll anymore.

Davros · May 21, 2011

Can I ask an offbeat question
could a gpu do wavetracing (like aureal vortex 2 soundcards used to do)

RacingPHT · May 22, 2011

Not sure what type of workload wavetracing has. But personally I don't buy the idea of "GPU can do all".

For example, in a game, you want to use GPU to do something other than graphics. You are paying the price of context switching between graphics APIs and compute, and it's idling the GPU which is simply a waste of time. Today's CPU's are quite fast, I wrote the ray tracing algorithm with a little effort to get similar to 96 nvidia sp's performance on a quad core Phenom2, with much better debug environment.

On the other hand, you want to wave trace on the GPU for a desktop program. Now you are facing random device lost problem and you are expecting glitches/interrupts in your sound playback when that happens.

Davros · May 22, 2011

ok thanks for that.....

Mintmaster · May 22, 2011

RacingPHT said:
For example, in a game, you want to use GPU to do something other than graphics. You are paying the price of context switching between graphics APIs and compute, and it's idling the GPU which is simply a waste of time. Today's CPU's are quite fast, I wrote the ray tracing algorithm with a little effort to get similar to 96 nvidia sp's performance on a quad core Phenom2, with much better debug environment.

Wow, that's a lot different from my experience with path tracing. I did a CUDA port of Smallpt after being inspired by Dade's OpenCL implementation. I got over 500x speedup from a Core2 Duo (I multithreaded the smallpt code) to a 8800 GTS (96 sp). I didn't use any vector libraries with the CPU code, so I only got compiler optimizations, but that's still an enormous increase. I do, however, agree with you that it's a good idea to stick with a CPU when possible.

Any plans on making the source code available?

RacingPHT · May 23, 2011

Mintmaster said:
Wow, that's a lot different from my experience with path tracing. I did a CUDA port of Smallpt after being inspired by Dade's OpenCL implementation. I got over 500x speedup from a Core2 Duo (I multithreaded the smallpt code) to a 8800 GTS (96 sp). I didn't use any vector libraries with the CPU code, so I only got compiler optimizations, but that's still an enormous increase. I do, however, agree with you that it's a good idea to stick with a CPU when possible.

Any plans on making the source code available?

Well, I'm a bit lying.
The performance number's real, but when I reveiw the code, I realized that the CPU porting I did half year ago did put quite a bit efforts into it.

let me show you a fragment of the CPU code:

Code:

			if (_mm_movemask_ps(finishMsk) != 0xf)
			{
				Vector4 flowCounter(0, 0, 0, 0);

				__m128* prayOrgs = (__m128*)&rayOrgs[i];
				__m128* prayDirs = (__m128*)&rayDirs[i];

				__m128 rayOrgX = *prayOrgs;
				__m128 rayOrgY = *(prayOrgs + 1); prayOrgs += winWidth;
				__m128 rayOrgZ = *prayOrgs;
				__m128 rayOrgW = *(prayOrgs + 1);
				_MM_TRANSPOSE4_PS(rayOrgX, rayOrgY, rayOrgZ, rayOrgW);

				__m128 rayDirX = *prayDirs;
				__m128 rayDirY = *(prayDirs + 1); prayDirs += winWidth;
				__m128 rayDirZ = *prayDirs;
				__m128 rayDirW = *(prayDirs + 1);
				_MM_TRANSPOSE4_PS(rayDirX, rayDirY, rayDirZ, rayDirW);

				__m128 rcpDirX = rcp_ps(rayDirX);
				__m128 rcpDirY = rcp_ps(rayDirY);
				__m128 rcpDirZ = rcp_ps(rayDirZ);

				__m128 stepsX = _mm_and_ps(_mm_cmpgt_ps(rayDirX, ZERO), ONE);
				__m128 stepsY = _mm_and_ps(_mm_cmpgt_ps(rayDirY, ZERO), ONE);
				__m128 stepsZ = _mm_and_ps(_mm_cmpgt_ps(rayDirZ, ZERO), ONE);

				__m128 signsAdvX = _mm_mul_ps(_mm_or_ps(stepsX, _mm_and_ps(_mm_cmplt_ps(rayDirX, ZERO), Neg_ONE)), VEC_EPSILON);
				__m128 signsAdvY = _mm_mul_ps(_mm_or_ps(stepsY, _mm_and_ps(_mm_cmplt_ps(rayDirY, ZERO), Neg_ONE)), VEC_EPSILON);
				__m128 signsAdvZ = _mm_mul_ps(_mm_or_ps(stepsZ, _mm_and_ps(_mm_cmplt_ps(rayDirZ, ZERO), Neg_ONE)), VEC_EPSILON);

				__m128 texBaseX = _mm_mul_ps(rayOrgX, VEC_GRIDSZ);
				__m128 texBaseY = _mm_mul_ps(rayOrgY, VEC_GRIDSZ);
				__m128 texBaseZ = _mm_mul_ps(rayOrgZ, VEC_GRIDSZ);

				__m128i voxInfo4, normCone4i;
				fetch3DiVector(volGridStartSize, rayOrgX, rayOrgY, rayOrgZ, voxInfo4, normCone4i);

				__m128 normCone4x = _mm_cvtepi32_ps(_mm_srai_epi32(_mm_slli_epi32(normCone4i, 8),  24));
				__m128 normCone4y = _mm_cvtepi32_ps(_mm_srai_epi32(_mm_slli_epi32(normCone4i, 16), 24));
				__m128 normCone4z = _mm_cvtepi32_ps(_mm_srai_epi32(_mm_slli_epi32(normCone4i, 24), 24));
				__m128 normCone4w = _mm_cvtepi32_ps(_mm_srli_epi32(normCone4i,                     24));
				normCone4x = _mm_mul_ps(normCone4x, _mm_set1_ps(1.f / 255.0));
				normCone4y = _mm_mul_ps(normCone4y, _mm_set1_ps(1.f / 255.0));
				normCone4z = _mm_mul_ps(normCone4z, _mm_set1_ps(1.f / 255.0));
				normCone4w = _mm_mul_ps(normCone4w, _mm_set1_ps(1.f / 510.0));

				__m128 dot4 = _mm_mul_ps(rayDirX, normCone4x);
				dot4 = _mm_add_ps(dot4, _mm_mul_ps(rayDirY, normCone4y));
				dot4 = _mm_add_ps(dot4, _mm_mul_ps(rayDirZ, normCone4z));

				__m128_ms normalConeMsk = _mm_andnot_ps(finishMsk, _mm_cmplt_ps(dot4, normCone4w));
				__m128i geoInfo4 = _mm_and_si128(voxInfo4, normalConeMsk);

				__m128 tMaxX = _mm_sub_ps(_mm_add_ps(floorPositive(texBaseX), stepsX), texBaseX);
				__m128 tMaxY = _mm_sub_ps(_mm_add_ps(floorPositive(texBaseY), stepsY), texBaseY);
				__m128 tMaxZ = _mm_sub_ps(_mm_add_ps(floorPositive(texBaseZ), stepsZ), texBaseZ);
				tMaxX = _mm_mul_ps(tMaxX, rcpDirX);
				tMaxY = _mm_mul_ps(tMaxY, rcpDirY);
				tMaxZ = _mm_mul_ps(tMaxZ, rcpDirZ);
				stepsX = _mm_mul_ps(stepsX, _mm_set1_ps(255.f));
				stepsY = _mm_mul_ps(stepsY, _mm_set1_ps(255.f));
				stepsZ = _mm_mul_ps(stepsZ, _mm_set1_ps(255.f));

				__m128 lastHitU = ZERO;
				__m128 lastHitV = ZERO;
				__m128 lastHitID = ZERO;
				__m128 lastHitT = _mm_min_ps(tMaxX, _mm_min_ps(tMaxY, tMaxZ));

				int finished4 = _mm_movemask_ps(finishMsk);
				while (finished4 != 0xf)
				{
					flowCounter.z++;
					__m128i geoInfoW = _mm_andnot_si128(finishMsk, _mm_and_si128(geoInfo4, _mm_set1_epi32(0xff000000)));
					__m128_ms primLeftMsk = _mm_cmpgt_epi32(geoInfoW, _mm_set1_epi32(0));
					int anyPrims = _mm_movemask_epi8(primLeftMsk);
					while (anyPrims)
					{
						flowCounter.x++;
						geoInfoW = _mm_subs_epu8(geoInfoW, _mm_set1_epi32(0x01000000));
						__m128_ms id4 = _mm_and_si128(primLeftMsk, _mm_add_epi16(geoInfo4, _mm_srli_epi32(geoInfoW, 24)));

						int pointer4[4];
						pointer4[0] = mesh.pointers[id4.m128i_u16[0]];
						pointer4[1] = mesh.pointers[id4.m128i_u16[2]];
						pointer4[2] = mesh.pointers[id4.m128i_u16[4]];
						pointer4[3] = mesh.pointers[id4.m128i_u16[6]];
						__m128 id4f = _mm_cvtepi32_ps(_mm_and_si128(id4, _mm_set1_epi32(0x0000FFFF)));

						__m128 faceNormalX = _mm_loadu_ps(mesh.faceNormals[pointer4[0]]);
						__m128 faceNormalY = _mm_loadu_ps(mesh.faceNormals[pointer4[1]]);
						__m128 faceNormalZ = _mm_loadu_ps(mesh.faceNormals[pointer4[2]]);
						__m128 faceNormalW = _mm_loadu_ps(mesh.faceNormals[pointer4[3]]);
						_MM_TRANSPOSE4_PS(faceNormalX, faceNormalY, faceNormalZ, faceNormalW);

						__m128 dotProd4 = _mm_mul_ps(faceNormalX, rayDirX);
						dotProd4 = _mm_add_ps(dotProd4, _mm_mul_ps(faceNormalY, rayDirY));
						dotProd4 = _mm_add_ps(dotProd4, _mm_mul_ps(faceNormalZ, rayDirZ));
						__m128 dpTestMsk = _mm_and_ps(primLeftMsk, _mm_cmplt_ps(dotProd4, ZERO));
						primLeftMsk = _mm_cmpgt_epi32(geoInfoW, _mm_set1_epi32(0));
						anyPrims = _mm_movemask_epi8(primLeftMsk);
						int dpTest4 = _mm_movemask_ps(dpTestMsk);

						if (dpTest4)
						{
							const DoubleTriangle* quad4[4];
							quad4[0] = &mesh.quads[pointer4[0]];
							quad4[1] = &mesh.quads[pointer4[1]];
							quad4[2] = &mesh.quads[pointer4[2]];
							quad4[3] = &mesh.quads[pointer4[3]];

							__m128 e1X = _mm_loadu_ps(quad4[0]->posNorm[0]);
							__m128 e1Y = _mm_loadu_ps(quad4[1]->posNorm[0]);
							__m128 e1Z = _mm_loadu_ps(quad4[2]->posNorm[0]);
							__m128 vBX = _mm_loadu_ps(quad4[3]->posNorm[0]);
							_MM_TRANSPOSE4_PS(e1X, e1Y, e1Z, vBX);

							__m128 e2X = _mm_loadu_ps(quad4[0]->posNorm[1]);
							__m128 e2Y = _mm_loadu_ps(quad4[1]->posNorm[1]);
							__m128 e2Z = _mm_loadu_ps(quad4[2]->posNorm[1]);
							__m128 vBY = _mm_loadu_ps(quad4[3]->posNorm[1]);
							_MM_TRANSPOSE4_PS(e2X, e2Y, e2Z, vBY);

							__m128 e3X = _mm_loadu_ps(quad4[0]->posNorm[2]);
							__m128 e3Y = _mm_loadu_ps(quad4[1]->posNorm[2]);
							__m128 e3Z = _mm_loadu_ps(quad4[2]->posNorm[2]);
							__m128 vBZ = _mm_loadu_ps(quad4[3]->posNorm[2]);
							_MM_TRANSPOSE4_PS(e3X, e3Y, e3Z, vBZ);

							__m128 vaX = _mm_sub_ps(rayOrgX, vBX);
							__m128 vaY = _mm_sub_ps(rayOrgY, vBY);
							__m128 vaZ = _mm_sub_ps(rayOrgZ, vBZ);

							__m128 qX = _mm_sub_ps(_mm_mul_ps(vaY, e2Z), _mm_mul_ps(vaZ, e2Y));
							__m128 qY = _mm_sub_ps(_mm_mul_ps(vaZ, e2X), _mm_mul_ps(vaX, e2Z));
							__m128 qZ = _mm_sub_ps(_mm_mul_ps(vaX, e2Y), _mm_mul_ps(vaY, e2X));

							VecIntersects(	vaX, vaY, vaZ,
											e2X, e2Y, e2Z,
											e1X, e1Y, e1Z,
											qX,  qY,  qZ,
											rayDirX, rayDirY, rayDirZ,
											lastHitU, lastHitV, lastHitID, lastHitT, 
											id4f, dpTestMsk);

							VecIntersects(	vaX, vaY, vaZ,
											e2X, e2Y, e2Z,
											e3X, e3Y, e3Z,
											qX,  qY,  qZ,
											rayDirX, rayDirY, rayDirZ,
											lastHitU, lastHitV, lastHitID, lastHitT, 
											_mm_or_ps(id4f, NEG_SIGN), dpTestMsk);

							__m128 isHitMsk = _mm_cmpneq_ps(lastHitID, ZERO);
							finishMsk = _mm_or_ps(finishMsk, isHitMsk);
							finished4 = _mm_movemask_ps(finishMsk);
						}
					}
					int foundVox4 = finished4;

					__m128 foundVoxMsk = finishMsk;
					while (foundVox4 != 0xf)
					{
						flowCounter.y++;
						__m128 intersectPointX = _mm_add_ps(_mm_mul_ps(lastHitT, rayDirX), texBaseX);
						__m128 intersectPointY = _mm_add_ps(_mm_mul_ps(lastHitT, rayDirY), texBaseY);
						__m128 intersectPointZ = _mm_add_ps(_mm_mul_ps(lastHitT, rayDirZ), texBaseZ);

						__m128 curCellX = _mm_add_ps(intersectPointX, signsAdvX);
						__m128 curCellY = _mm_add_ps(intersectPointY, signsAdvY);
						__m128 curCellZ = _mm_add_ps(intersectPointZ, signsAdvZ);
						
						__m128 texCoordX = _mm_mul_ps(curCellX, _mm_set1_ps(1.f / GRID_SIZE));
						__m128 texCoordY = _mm_mul_ps(curCellY, _mm_set1_ps(1.f / GRID_SIZE));
						__m128 texCoordZ = _mm_mul_ps(curCellZ, _mm_set1_ps(1.f / GRID_SIZE));

						__m128i voxInfo4, gridSize_Conei;
						fetch3DiVector(volGridStartSize, texCoordX, texCoordY, texCoordZ, voxInfo4, gridSize_Conei);

						__m128 gridStartX = _mm_cvtepi32_ps(_mm_srli_epi32(_mm_slli_epi32(voxInfo4, 8),  24));
						__m128 gridStartY = _mm_cvtepi32_ps(_mm_srli_epi32(_mm_slli_epi32(voxInfo4, 16), 24));
						__m128 gridStartZ = _mm_cvtepi32_ps(_mm_srli_epi32(_mm_slli_epi32(voxInfo4, 24), 24));
						__m128i bCont4 = _mm_srli_epi32(voxInfo4, 24);
						__m128 contMsk = _mm_andnot_ps(foundVoxMsk, _mm_cmpneq_ps(_mm_castsi128_ps(bCont4), ZERO));
						int bCont = _mm_movemask_ps(contMsk);

						__m128 gridSize_ConeX = _mm_cvtepi32_ps(_mm_srai_epi32(_mm_slli_epi32(gridSize_Conei, 8),  24));
						__m128 gridSize_ConeY = _mm_cvtepi32_ps(_mm_srai_epi32(_mm_slli_epi32(gridSize_Conei, 16), 24));
						__m128 gridSize_ConeZ = _mm_cvtepi32_ps(_mm_srai_epi32(_mm_slli_epi32(gridSize_Conei, 24), 24));
						gridSize_ConeX = _mm_mul_ps(gridSize_ConeX, _mm_set1_ps(1.f / 255.0));
						gridSize_ConeY = _mm_mul_ps(gridSize_ConeY, _mm_set1_ps(1.f / 255.0));
						gridSize_ConeZ = _mm_mul_ps(gridSize_ConeZ, _mm_set1_ps(1.f / 255.0));

						__m128 backFoundVoxMsk = foundVoxMsk;
						if (bCont)
						{
							__m128 dotProd4 = _mm_mul_ps(gridSize_ConeX, rayDirX);
							dotProd4 = _mm_add_ps(dotProd4, _mm_mul_ps(gridSize_ConeY, rayDirY));
							dotProd4 = _mm_add_ps(dotProd4, _mm_mul_ps(gridSize_ConeZ, rayDirZ));

							gridSize_ConeX = select(contMsk, _mm_set1_ps(1.f / 255.0), gridSize_ConeX);
							gridSize_ConeY = select(contMsk, _mm_set1_ps(1.f / 255.0), gridSize_ConeY);
							gridSize_ConeZ = select(contMsk, _mm_set1_ps(1.f / 255.0), gridSize_ConeZ);

							__m128 normalConeW = _mm_cvtepi32_ps(_mm_srli_epi32(gridSize_Conei, 24));
							normalConeW = _mm_mul_ps(normalConeW, _mm_set1_ps(1.f / 510.0));

							__m128_ms normalConeMsk = _mm_and_ps(contMsk, _mm_cmplt_ps(dotProd4, normalConeW));
							geoInfo4 = select(normalConeMsk, voxInfo4, geoInfo4);
							foundVoxMsk = _mm_or_ps(normalConeMsk, foundVoxMsk);

							gridStartX = select(contMsk, floorPositive(curCellX), gridStartX);
							gridStartY = select(contMsk, floorPositive(curCellY), gridStartY);
							gridStartZ = select(contMsk, floorPositive(curCellZ), gridStartZ);
						}

						__m128 targetConerX = _mm_add_ps(_mm_mul_ps(gridSize_ConeX, stepsX), gridStartX);
						__m128 targetConerY = _mm_add_ps(_mm_mul_ps(gridSize_ConeY, stepsY), gridStartY);
						__m128 targetConerZ = _mm_add_ps(_mm_mul_ps(gridSize_ConeZ, stepsZ), gridStartZ);

						__m128 tMaxX = _mm_mul_ps(_mm_sub_ps(targetConerX, texBaseX), rcpDirX);
						__m128 tMaxY = _mm_mul_ps(_mm_sub_ps(targetConerY, texBaseY), rcpDirY);
						__m128 tMaxZ = _mm_mul_ps(_mm_sub_ps(targetConerZ, texBaseZ), rcpDirZ);

						lastHitT = select(backFoundVoxMsk, lastHitT, _mm_min_ps(_mm_min_ps(tMaxX, tMaxY), tMaxZ));

						__m128 outSideMsk = _mm_cmplt_ps(curCellX, ONE);
						outSideMsk = _mm_or_ps(outSideMsk, _mm_cmplt_ps(curCellY, ONE));
						outSideMsk = _mm_or_ps(outSideMsk, _mm_cmplt_ps(curCellZ, ONE));
						outSideMsk = _mm_or_ps(outSideMsk, _mm_cmpgt_ps(curCellX, _mm_set1_ps(GRID_SIZE-1)));
						outSideMsk = _mm_or_ps(outSideMsk, _mm_cmpgt_ps(curCellY, _mm_set1_ps(GRID_SIZE-1)));
						outSideMsk = _mm_or_ps(outSideMsk, _mm_cmpgt_ps(curCellZ, _mm_set1_ps(GRID_SIZE-1)));
						finishMsk = _mm_or_ps(finishMsk, outSideMsk);
						foundVoxMsk = _mm_or_ps(foundVoxMsk, finishMsk);
						foundVox4 = _mm_movemask_ps(foundVoxMsk);
					}
					finished4 = _mm_movemask_ps(finishMsk);
				}
}

Compared to the GPU code:

Code:

	float4 flowCounter = 0;
	const float3 steps = rayDir > 0;
	const float3 signsAdv = sign(rayDir) * SAMP_EPSILON;
	const float3 texBase = posOrg * mapSize;

	float4 curCell = float4(floor(texBase) + 0.5, 0);
	float4 voxInfo = tex3Dlod(sampGridStart, curCell * cellSize);
	float4 normCone = tex3Dlod(sampGridSize, curCell * cellSize);
	
	float2 geoInfo = 0;
	float dotProd = dot(normCone.xyz * 2 - 1, rayDir);
	if (dotProd < normCone.w)
	{
		geoInfo.y = voxInfo.w * 255;
		geoInfo.x = voxInfo.z * 255 + voxInfo.y * 256 * 255;
	}
	
	float3 tMax = abs((floor(texBase) + steps - texBase) / rayDir);	
	float4 lastHit = float4(0, 0, 0, min(tMax.x, min(tMax.y, tMax.z)));

	float finished = 0;	
	FASTOPT
	while(!finished)
	{
		flowCounter.z ++;
		FASTOPT
		while (geoInfo.y > SAMP_EPSILON)
		{
			flowCounter.x++;
			geoInfo.y -= 1;
			float id = geoInfo.x + geoInfo.y + 0.125;
			float2 crd = id_to_uv(id);
			float4 norm = tex2Dlod(sampNormal, float4(crd, 0, 0));
			if (dot(norm.xyz * 2 - 1, rayDir) < norm.w)
			{
				flowCounter.w++;
				float3 vertA = tex2Dlod(sampGeo, float4(crd.x + 0.f / 1024, crd.y, 0, 0));
				float3 vertB = tex2Dlod(sampGeo, float4(crd.x + 1.f / 1024, crd.y, 0, 0));
				float3 vertC = tex2Dlod(sampGeo, float4(crd.x + 2.f / 1024, crd.y, 0, 0));
				float3 vertD = tex2Dlod(sampGeo, float4(crd.x + 3.f / 1024, crd.y, 0, 0));
				
				float3 e1 = vertA - vertB;
				float3 e2 = vertC - vertB;
				float3 e3 = vertD - vertB;
				float3 va = posOrg - vertB;
				float3 q = cross(va, e2);

				Intersects(va, e2, e1, q, rayDir, lastHit, id);
				Intersects(va, e2, e3, q, rayDir, lastHit, -id);
				finished += lastHit.x;
			}
		}
		float foundVox = finished;

		FASTOPT
		while (!foundVox)
		{
			flowCounter.y ++;
			
			float3 intersectPoint = texBase + lastHit.w * rayDir;
			curCell.xyz = intersectPoint.xyz + signsAdv;
			
			float4 texCoord = floor(curCell) * cellSize; //floor() For RV630
			float4 voxInfo = floor(tex3Dlod(sampGridStart, texCoord) * I8DENORM); //floor() For RV530, maybe needed for the next line as well
			float4 gridSize = tex3Dlod(sampGridSize, texCoord);
						
			if (voxInfo.w > 0)
			{
				float4 normCone = gridSize;
				float dotProd = dot(normCone.xyz * 2 - 1, rayDir);
				gridSize.xyz = 1.f / 255;
				if (dotProd < normCone.w)
				{
					geoInfo.y = voxInfo.w;
					geoInfo.x = voxInfo.z + voxInfo.y * 256;
					foundVox = 1;
					//break;
				}
				voxInfo.xyz = floor(curCell);
			}

			float3 targetConer = voxInfo + gridSize * steps * 255;
			float3 tMax = abs((targetConer - texBase) / rayDir);
			lastHit.w = min(tMax.x, min(tMax.y, tMax.z));
			
			float3 outside = saturate(curCell.xyz + 1 - mapSize) + saturate(1 - curCell.xyz);
			finished = dot(float4(outside, finished), 1);
			foundVox += finished;			
		}
	}

I have 2 conclusions:
1: The CPU code is much bigger. But coding is much more straightforward.
2: The GPU code is much shorter. But when a single small bug can easily take you 2 days to fix (especially precision related)

The 500x number sounds like a magic to me. The Core2 DUO should be within 10x compared to a 96sp GPU IMO. Maybe path tracing is thrashing your CPU's cache, and the CPU doesn't have the luck to have thousands of threads when cache didn't helps. That's the reason I can think of. In my case however, the entire data set is well within a decent 6MB cache.

SIMD does not help that much, I'm getting around 2x for SSE port. Maybe I have math dependancy issue.

About source code, I've already show you the inner loop(although it's an older version because for which I got the corresponding CPU part). Maybe I'll have a chance to talk about it later.

RacingPHT · May 23, 2011

I reallized that the number of non-blocking memory contexts are the key for thing like path tracing to be efficient. So such task greatly favors GPU's thread model. It also benefits CELL, because each SPE has up to 16 non-blocking DMA streams, and each stream can bave thousands of non-blocking memory request that are executing serially.

Sadly, x86 does not have this. We have 2 thead at most for a core. It's not the math capability that makes the CPU lose in this case. Even SandyBridge didn't fill the gap.

There're prefetch instructions though, but it's hardly as useful. Anyway, maybe you should try it in your CPU implementations.

Mintmaster · May 23, 2011

RacingPHT said:
The 500x number sounds like a magic to me. The Core2 DUO should be within 10x compared to a 96sp GPU IMO.

Well, aside from the output buffer, the data set in this test is really, really small (less than half a kilobyte). Take a look at the smallpt C code I linked to. Everything sits in the cache, so it's probably not very representative of general path tracing performance. The only thing keeping the GPU from running at full instruction throughput is branching granularity, and that's probably a minor issue.

So flop comparisons should give you give you the right ratio of performance, and without SSE that is 100x. IMO, 500x is plausible.

Mintmaster · May 23, 2011

Oh yeah, meant to talk about your earlier post:

RacingPHT said:
Sorry for the programmer's art the model was downloaded somewhere on the internet, And i'm not lucky enough to have a artist to fix the normal for me. And the reflected normal looks so cheap when interpolated.

I was thinking about procedural processing of the mesh. Each vertice's normal would be a weighted average of touching faces. Not quite sure what you're saying about the reflected normal.

At least for the near future hardware, in the context of real-time game rendering, pure ray tracing engine makes little sense to me.

Agreed, but path tracing could have some use. This is what wrote in the smallpt thread:

IMO secondary diffuse lighting is the only place that raytracing has a chance to outdo rasterization for realtime graphics, and once you aim for a renderer with that quality, speeding up primary rays or coherent shadow rays does very little for you. People are doing things like spherical harmonics and light propogations volumes to help rasterization to deal with this deficiency, but figuring out visibility along arbitrary paths is a fundamental weakness of rasterization while being a strength of RT.

RacingPHT · May 23, 2011

Hmm, but the flops number is far from 100x.

Assuming a 3Ghz Core 2 Duo: 3(Ghz) * 2(Cores) * 4(SSE width) * 2(Math issue rate) = 48 GFlops.
Assuming a 1.5Ghz 96sp GPU: 1.5(Ghz) * 96(sp) * 2(mad) = 288 GFlops.
In reality, the NVidia is capable is issue more muls. I tested the co-issue shows the effect is about 1.3x more. So it's about 374 GFlops. Less than 10x.

Some details to consider: A single core in Core2 Duo is capable of issuing 3x128bit SSE instructions per-cycle(so did phenom, too). Typically FPAdd + FPMul + FPLoadStore. So it has an even higher SSE rate considering these misc instructions are indeed a lot.

Without SSE, the ratio is higher and can be up to 30x(maybe 40x for different frequency). But at this time, the branch granularity is single pixel on CPU and this can help the CPU A LOT in raytracing.

RacingPHT · May 23, 2011

Mintmaster said:
Oh yeah, meant to talk about your earlier post:
I was thinking about procedural processing of the mesh. Each vertice's normal would be a weighted average of touching faces. Not quite sure what you're saying about the reflected normal.

Not good because the position itself is not smooth enough. Just imagine a car body gets fixed right after an big accident, that's what you expect to see in the demo when the normal is interpolated.

Agreed, but path tracing could have some use. This is what wrote in the smallpt thread:

Good argument. Now we have 2 reasons to do raytracing

Mintmaster · May 25, 2011

RacingPHT said:
Without SSE, the ratio is higher and can be up to 30x(maybe 40x for different frequency). But at this time, the branch granularity is single pixel on CPU and this can help the CPU A LOT in raytracing.

Okay, gotta plead mea culpa on a couple fronts. First, I didn't really look very hard for the peak x87 math rate. Just found a benchmark getting 2.5Gflops for an E6400.

More importantly, I didn't realize how much the GPU code and CPU code differed, and I was counting stuff all wrong. Now that I've updated the CPU code, I'm pretty sure the correct figures are now ~1pass/s on the CPU and 100pass/s on the GPU. So it's 100x, not 500x. Oddly, it runs slower when I use floats instead of doubles.

Naturally, not everything in path tracing is flops, as there are lots of loops, comparisons, etc. So I think the results are sensible. Branch granularity won't help the CPU much for the smallpt scene (cornell box with two spheres).

hoho · May 25, 2011

I hope the OP won't mind if I link to another interesting GPU (path) tracer

http://code.google.com/p/kajiya-gpu/

FULL-HD GPU raytracing demo<dx9>

RacingPHT

Mintmaster

homerdog

donator of the year

Attachments

RacingPHT

RacingPHT

Davros

RacingPHT

Davros

Mintmaster

RacingPHT

RacingPHT

Mintmaster

Mintmaster

RacingPHT

RacingPHT

Mintmaster

hoho