Quote: "We currently are talking ~19Mray/s on an X1900XTX (Conference room, shadow rays), and about the same on a G80 with DirectX and the current state of the drivers and shader compilers."

The R580 uses four-wide in your implementation, while the G80 would be full-speed at scalar. Are you saying the state of the G80 drivers was so bad that it wasn't faster, even though its coherence requirements were 6 times lower than the R580's?!
Quote: "Are you saying the state of the G80 drivers was so bad that it wasn't faster, even though its coherence requirements were 6 times lower than the R580's?!"

Yeah, it kind of confuses me that the G80 is slower as well, as even in the most R580-friendly cases that I've tested, the G80 wins, excepting CTM of course, although CUDA promises something similar.
Quote: "We have other GPGPU apps where we have seen a similar trend of the G80 underperforming expectations but often still running faster than R580; some, like those that are purely scalar, are sometimes performing >2X the R580 (as expected, really)."

Yeah, that's odd. In almost all of the examples that we've tried, the G80 outperforms the R580 by at least 2x - in one benchmark by as much as 10x. There are a few cases where the race is closer, but the G80 is always at least slightly faster.
Quote: "How 'big' are your shaders/apps and how many registers do you use?"

Don't know the precise register counts, but the different benchmarks vary widely in shader size and complexity.
Quote: "And, what drivers are you running? We are running the latest official public XP drivers and using DX9."

Latest drivers, using OpenGL on Linux and Windows. Admittedly, ATI doesn't have terribly good OpenGL drivers (especially on Linux), but after working around some bugs, etc., the R580 numbers are reasonably good and comparable to a 7900GTX when the latter isn't getting killed by branching, etc.
Quote: "Also, are your apps bandwidth bound into the shader, or can they be on R580?"

Some of them are bandwidth bound, but some are certainly not. There's a good range of tests.
Quote: "Do you use filtering on the textures? G80 has much more bandwidth from texture than R580, unless you use R580's Fetch4 feature, which is tricky to use."

No texture filtering in any of the current benchmarks, I don't think.
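As an aside, since Fetch4 came up: as far as I know it isn't exposed through the D3D9 API proper, but toggled by writing magic FOURCC values into a sampler state that ATI's driver intercepts, which is part of why it counts as tricky. A rough sketch from memory (the helper name is my own invention, and the GET4/GET1 values should be verified against ATI's docs before relying on them):

    #include <d3d9.h>

    // Hypothetical helper: toggle Fetch4 on a sampler. On R5xx-class
    // hardware the driver watches for magic FOURCC values written into
    // the MIPMAPLODBIAS sampler state.
    void SetFetch4(IDirect3DDevice9* device, DWORD stage, bool enable)
    {
        const DWORD kFetch4On  = MAKEFOURCC('G', 'E', 'T', '4');
        const DWORD kFetch4Off = MAKEFOURCC('G', 'E', 'T', '1');

        // Fetch4 applies to point-sampled, single-channel textures; one
        // fetch then returns the four neighbouring texels in .rgba.
        device->SetSamplerState(stage, D3DSAMP_MINFILTER, D3DTEXF_POINT);
        device->SetSamplerState(stage, D3DSAMP_MAGFILTER, D3DTEXF_POINT);
        device->SetSamplerState(stage, D3DSAMP_MIPMAPLODBIAS,
                                enable ? kFetch4On : kFetch4Off);
    }

The other half of the trickiness is that the shader then gets four unfiltered neighbouring texels and has to do any filtering math itself.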
Quote: "What drivers are you running exactly?"

That I don't know exactly... it would have been a 9-series driver. In any case, I'm quite sure that the drivers are pretty fragile right now and should be a lot better in a few months. Still, by then we'll hopefully have the R600, albeit probably with similarly messed-up drivers.
Quote: "The coherence isn't that much better (32 vs 48), and we have high register requirements: 31 packetized, and ~20 non-packetized."

20 scalar registers, or 20 Vec4 ones? The former shouldn't be a problem, while the latter would be. Let us consider it this way: the G80 is divided in 16 parts, each with its own register file and parallel data cache. Since the latter is presumably 16KiB, you'd assume the former to be between 16KiB and 64KiB, so let's guesstimate 32KiB.
Quote: "18 float4's via fxc, 16 if written by hand."

Ouch! Assuming a 48KiB register file, that's 192 threads, which is borderline, I guess. I'm not sure how it'd affect performance with likely very unpredictable data fetches...
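Spelling out that thread-count arithmetic (the 48KiB register file is the guess from the previous posts, and 16 bytes per float4 register is just 4 x 32-bit floats):

    #include <cstdio>

    int main()
    {
        // Assumptions from the posts above: a guessed 48KiB register
        // file per cluster, float4 registers of 4 x 4 bytes each.
        const int regfile_bytes    = 48 * 1024;
        const int bytes_per_float4 = 4 * 4;

        const int regs_fxc  = 18;  // float4 registers as compiled by fxc
        const int regs_hand = 16;  // float4 registers when hand-written

        // Threads in flight = register file size / per-thread footprint.
        printf("fxc : %d threads\n", regfile_bytes / (regs_fxc  * bytes_per_float4));  // 170
        printf("hand: %d threads\n", regfile_bytes / (regs_hand * bytes_per_float4));  // 192
        return 0;
    }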
Quote: "As can be seen, as you increase ray divergence and hence instruction divergence, the non-packetized ray tracer is actually faster on R580, and is on G80 as well."

Shouldn't the G80 *always* be (a fair bit) faster with the non-packetized version than with the packetized version, though? Since it has no benefit from Vec4 instructions in theory. Although, in practice, it could use those Vec4 instructions to extract instruction-level parallelism and improve effective latency tolerance...
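To make the divergence argument concrete, here's a toy model (mine, not anyone's actual tracer): a 4-wide packet must keep stepping until its slowest ray terminates, so one incoherent ray occupies all four lanes for the duration, while independent scalar rays each pay only their own traversal cost.

    #include <algorithm>
    #include <cstdio>

    int main()
    {
        // Hypothetical traversal depths for the 4 rays of one packet;
        // one incoherent ray is enough to ruin the whole packet.
        const int depth[4] = { 12, 15, 48, 14 };

        int scalar_cost = 0, worst = 0;
        for (int i = 0; i < 4; ++i) {
            scalar_cost += depth[i];            // each scalar ray pays its own way
            worst = std::max(worst, depth[i]);  // the packet pays for its worst ray
        }
        const int packet_cost = 4 * worst;      // all 4 lanes busy for max(depth) steps

        printf("scalar: %d steps, packet: %d steps\n", scalar_cost, packet_cost);  // 89 vs 192
        return 0;
    }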
Quote: "Let us consider it this way: the G80 is divided in 16 parts, each with its own register file and parallel data cache. Since the latter is presumably 16KiB, you'd assume the former to be between 16KiB and 64KiB, so let's guesstimate 32KiB."

According to Buck_NVIDIA_Cuda.pdf, the PDC is 16KB per cluster, i.e. 8KB per 1/16th part.
Quote: "Where do you get 'divided in 16 parts' from? All the diagrams I've seen show 8 clusters * 16 ALUs/cluster = 128 ALUs."

The diagram shows that the 16 ALUs are arranged as two SIMD arrays, each of 8 ALUs.
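For reference, putting the figures quoted in this sub-thread side by side (nothing here beyond the in-thread numbers):

    #include <cstdio>

    int main()
    {
        // As quoted above: 8 clusters of 16 ALUs, each cluster's ALUs
        // arranged as two 8-wide SIMD arrays, and a 16KB parallel data
        // cache (PDC) per cluster per Buck_NVIDIA_Cuda.pdf.
        const int clusters         = 8;
        const int alus_per_cluster = 16;
        const int pdc_per_cluster  = 16 * 1024;

        printf("total ALUs: %d\n", clusters * alus_per_cluster);  // 128

        // Modelling the chip as 16 half-cluster "parts" (one per 8-wide
        // SIMD array) halves the per-part PDC:
        printf("PDC per 1/16th part: %dKB\n", pdc_per_cluster / 2 / 1024);  // 8KB
        return 0;
    }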
mhouston said: "Cell is actually a raytracing monster, compared to other non-custom architectures, in certain situations. The Saarland folks (and others including Stanford) have Cell raytracers >60Mrays/s for primary rays."

Wow, that's really impressive. I keep wondering whether any future console will be bold enough to go with a raytracing-based graphics subsystem, because the rasterization pipeline has already shown too many innate deficiencies. Besides the raytracing hardware prototype shown at a recent SIGGRAPH, it seems the Cell processor also has the potential to do it. In my opinion, Sony should have just reused their PS2 architecture for the PS3: RSX only for rasterization, pixel shading and ROP, and let the SPUs handle all vertex computation, animation, physics and even raytracing. Of course, that requires fast data access between the SPUs and RSX, like VU->GS.
Quote: "I believe the smoke demo from the G80 launch was marketed as 'raytraced'."

A few years ago I wrote a similar 3D smoke demo based on the Navier-Stokes equations, on a 6800 card. It ran at about 50fps for simulating and rendering the smoke object alone. The most expensive part was not the simulation, but the volumetric lighting. I implemented slice-based volume rendering at first, then tried to move on to raytraced volume rendering. Unfortunately, the result was frustrating: hardware-raytraced volume rendering tended to show more artifacts than slice-based methods, and there was almost no good way to calculate volumetric lighting. So I guess that NVIDIA smoke demo is based on raytraced volume rendering, which means tracing a pixel through a volume texture and accumulating sampled values along the ray.
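For concreteness, a minimal sketch of that kind of ray-marched accumulation (the structure and compositing details are illustrative, not the demo's actual code):

    #include <cstdio>

    struct Vec3 { float x, y, z; };

    // Placeholder density field; a real renderer samples a 3D texture here.
    static float SampleVolume(const Vec3& p)
    {
        return (p.x * p.x + p.y * p.y + p.z * p.z < 1.0f) ? 0.5f : 0.0f;
    }

    // March a ray through the volume at fixed steps, compositing front
    // to back; returns the accumulated brightness for one pixel.
    static float TraceRay(Vec3 p, Vec3 dir, int steps, float dt)
    {
        float accum = 0.0f, transmittance = 1.0f;
        for (int i = 0; i < steps && transmittance > 0.01f; ++i) {
            const float alpha = SampleVolume(p) * dt;   // opacity of this segment
            accum         += transmittance * alpha;     // front-to-back compositing
            transmittance *= 1.0f - alpha;
            p.x += dir.x * dt; p.y += dir.y * dt; p.z += dir.z * dt;
        }
        return accum;
    }

    int main()
    {
        const float v = TraceRay({ -2.0f, 0.0f, 0.0f }, { 1.0f, 0.0f, 0.0f }, 256, 0.02f);
        printf("pixel value: %f\n", v);
        return 0;
    }

Computing per-pixel lighting on top of this is what gets expensive: each sample point along the ray needs its own accumulated transmittance toward the light, which is effectively a second march per sample.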