TimothyFarrar
Some points are valid and good to remember; it's not as if you're going to get good performance on Larrabee without good vectorization of the code...
And since it looks like Larrabee will be "serial-scalar done right", it'll prolly work quite well :smile:

Shader-wise I expect them to do a very good job [...]
I was just waiting for a post from Aaron about this matter.

"NVIDIA's approach to parallel computing has already proven to scale from 8 to 240 GPU cores."
How seriously can you take Nvidia's comments when they bend terminology like that? Change it to 30 cores, or 10 three-processor clusters, and it doesn't sound quite so impressive.
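For reference, the arithmetic behind that objection, using the standard GT200 breakdown (10 clusters of 3 multiprocessors, each with 8 scalar ALUs):

```c
#include <stdio.h>

/* GT200 organisation: the "240 cores" are scalar ALUs (SPs),
   grouped 8 to a multiprocessor (SM), 3 SMs to a cluster (TPC). */
int main(void)
{
    const int sps_per_sm  = 8;
    const int sms_per_tpc = 3;
    const int tpcs        = 10;

    int sms = tpcs * sms_per_tpc;   /* 30  */
    int sps = sms * sps_per_sm;     /* 240 */

    printf("%d clusters = %d multiprocessors = %d 'cores'\n",
           tpcs, sms, sps);
    return 0;
}
```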
I certainly wouldn't be surprised if ATI gets near twice the FLOPs per mm2 in the end, even with a process disadvantage.
I think that's quite probable, but it's delivered performance, not peak, that matters.
If Larrabee does more than play games, it can succeed even if it isn't the fastest or most efficient 'GPU'. A 4 TFLOP GPU that you use 10% of the time, or a 2 TFLOP CPU that you use 50% of the time: which would you buy? I know most of you guys would say both.
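A quick back-of-the-envelope on those (purely hypothetical) utilisation figures shows why the slower-but-busier chip can come out ahead:

```c
#include <stdio.h>

/* Average delivered throughput = peak throughput x fraction of time
   the part is actually doing useful work.  Numbers taken straight
   from the post above; they're illustrative, not measurements. */
int main(void)
{
    double gpu = 4.0 * 0.10;   /* 4 TFLOPS peak, busy 10% of the time */
    double cpu = 2.0 * 0.50;   /* 2 TFLOPS peak, busy 50% of the time */

    printf("GPU delivers %.1f TFLOPS on average\n", gpu);   /* 0.4 */
    printf("CPU delivers %.1f TFLOPS on average\n", cpu);   /* 1.0 */
    return 0;
}
```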
A killer video or killer photo* app could sell a LOT of chips, and there are applications that aren't practical today that could become mass-market.
*Have a look at recent SIGGRAPH papers that deal with imaging rather than rendering; there are some amazing things that could be done with lots of FLOPS.
I was just waiting for a post from Aaron about this matter.
GPUs are still too rooted, at a base architectural level, in the idea that what they compute doesn't need to be taken seriously.
"With 16 elements per hardware thread, Larrabee will be paying the lowest cost of any GPU for DB incoherency. Additionally, with an x86 thread per core that's able (amongst other things) to "re-pack" elements to minimise incoherency, DB should end up reasonably useful - though I will admit non-D3D programmers will prolly find themselves having to implement re-packing."

There's a lot of unknowns about Larrabee and DB. That paper just uses simple cycling between the batches, so there's no real scheduling. Who knows how well the x86 thread can handle this task.
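For the curious, here's a minimal sketch of what that "re-packing" could look like; plain C rather than anything Larrabee-specific, and just one possible approach (partitioning element indices by branch direction so that each 16-wide batch is coherent):

```c
#include <stddef.h>

#define VEC_WIDTH 16  /* assumed Larrabee-style vector width */

/* Partition element indices by the branch they will take, so that each
 * group of VEC_WIDTH elements is coherent.  Returns the number of
 * elements that took the branch; they occupy packed[0..n-1], and the
 * rest occupy packed[n..count-1] (in reverse encounter order). */
size_t repack(const int *take_branch, size_t count, size_t *packed)
{
    size_t front = 0, back = count;
    for (size_t i = 0; i < count; ++i) {
        if (take_branch[i])
            packed[front++] = i;   /* branch-taken elements at the front */
        else
            packed[--back] = i;    /* branch-not-taken at the back       */
    }
    return front;
}

/* The caller then walks packed[] in chunks of VEC_WIDTH, running the
 * "taken" code on the first chunks and the "not taken" code on the
 * rest, instead of predicating both sides of the branch in every batch. */
```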
"The SW is responsible for deciding when to switch and for actually performing the switch (consuming issue slots, etc.)."

This operation is likely to be so rare that I wouldn't worry about it consuming slots.
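Purely as an illustration of what "the SW performing the switch" implies (this is not Larrabee code, just the general shape of a software-scheduled loop with made-up names): both the decision of when to switch and the switch itself are ordinary instructions, so they occupy issue slots that could otherwise do shader work.

```c
#include <stdio.h>

#define NUM_BATCHES 4

/* Illustrative sketch only.  The point: choosing the next batch and
 * swapping its state in and out is plain code, executed on the same
 * scalar unit that could otherwise be doing useful work. */
struct batch {
    int pc;        /* where this batch resumes                  */
    int waiting;   /* e.g. blocked on a texture or memory fetch */
    /* ...vector registers, predicate masks, etc. ...           */
};

static struct batch batches[NUM_BATCHES];

/* Stub standing in for "run this batch until it blocks or finishes". */
static void run_batch(struct batch *b)
{
    printf("running batch at pc=%d\n", b->pc);
    b->waiting = 1;   /* pretend it hit a long-latency fetch */
}

int main(void)
{
    /* Simple round-robin: every iteration of this loop is scheduler
     * overhead paid for in instruction issue slots. */
    for (int pass = 0; pass < 2; ++pass)
        for (int i = 0; i < NUM_BATCHES; ++i)
            if (!batches[i].waiting)     /* decide when to switch */
                run_batch(&batches[i]);  /* perform the switch    */
    return 0;
}
```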
"Larrabee (and SSE/AVX) on the other hand expose the SIMD-ness and you have to manage it yourself. The width is baked into your code: they can change the number of execution units (at which point they're relying on ILP with the associated costs) but code doesn't automatically scale. We'll see this with AVX: all the existing SSE code won't magically take advantage of the wider vectors. If you ignore SIMD, you run on the scalar x86 units only, and get none of the benefit of all that throughput. This is why I think Intel's noise about Larrabee being x86 is pointless -- running pure x86 code on it is a waste, and SSE code won't even run (AFAIK) and would waste 3/4 of the throughput even if it did."

Isn't it a bit far-fetched to say that the programming model you describe here will be the only one available? AFAIK Intel is part of the group working on OpenCL, and they also have Ct, which I hope will see the light of day at some point in the future.
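A small example of what "the width is baked into your code" means in practice, using 4-wide SSE intrinsics; recompiling this for a wider machine changes nothing, because the 4 is written into both the loop stride and the instructions chosen:

```c
#include <xmmintrin.h>   /* SSE: 128-bit vectors, i.e. 4 floats */

/* c[i] = a[i] + b[i].  The "4" is the SIMD width, hard-coded into the
 * stride and the intrinsics; an AVX or Larrabee part won't magically
 * run this 8 or 16 wide. */
void add_vec4(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)    /* scalar tail: runs on the x86 scalar unit */
        c[i] = a[i] + b[i];
}
```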
"EDIT: Oops, forgot one point I wanted to make. For the reasons above, I'd want to use the CUDA programming model even when programming on Larrabee, and have a compiler+runtime map it to Larrabee vector instructions. Which shouldn't actually be too hard; hopefully someone implements this so I don't have to."

Exactly my thought. I'm no compiler expert, but having CUDA or something CUDA-like sounds quite feasible.
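A rough sketch of that idea in plain C (hypothetical names, no actual compiler behind it): the programmer writes a scalar per-element kernel, CUDA-style, and the "compiler+runtime" layer is what picks the machine width and strides the kernel across the data.

```c
#include <stddef.h>

/* CUDA-style: the programmer writes scalar, per-element code and never
 * mentions a vector width. */
static inline void saxpy_kernel(size_t i, float a,
                                const float *x, float *y)
{
    y[i] = a * x[i] + y[i];
}

#define MACHINE_WIDTH 16   /* assumed Larrabee-style vector width */

/* Hypothetical "runtime": the part a CUDA-like toolchain would
 * generate, mapping the scalar kernel onto the machine width.
 * Written as a plain loop here; a real backend would emit 16-wide
 * vector instructions for the inner loop instead. */
void launch_saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t base = 0; base < n; base += MACHINE_WIDTH) {
        size_t end = base + MACHINE_WIDTH < n ? base + MACHINE_WIDTH : n;
        for (size_t i = base; i < end; ++i)
            saxpy_kernel(i, a, x, y);
    }
}
```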
"There's a lot of unknowns about Larrabee and DB. That paper just uses simple cycling between the batches, so there's no real scheduling. Who knows how well the x86 thread can handle this task."

I think it's reasonable to assume the x86 part of each core (split across 4 hardware threads) is close to being fully dedicated to "managing" the execution of the 3D pipeline upon the VPU and all associated housekeeping.
"When running with 16 elements per hardware thread, remember that Larrabee switches batches every clock. Achieving that kind of scheduling throughput with a dynamic algorithm (i.e. not a predefined sequence) may not be possible."

I'm not sure how you're defining a batch here, because there is no "per-clock" switching.
"CUDA hides the SIMD-ness of the hardware [...] Larrabee (and SSE/AVX) on the other hand expose the SIMD-ness and you have to manage it yourself."

But doesn't this distinction become moot as soon as you start programming the memory system for maximum performance - now you're trying to maximise cache hits, maximise scatter/gather bandwidth or traverse sparse data sets with maximum coherence. All those tasks become intimately tied into the SIMD width and corresponding path sizes/latencies within the memory hierarchy.
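A toy way to see how the memory side gets tied to the SIMD width (assuming 64-byte cache lines and a 16-wide gather): count the distinct cache lines one vector's worth of indices touches. Coherent indices hit one line; fully scattered ones can hit sixteen, for exactly the same amount of arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

#define SIMD_WIDTH 16   /* assumed vector width     */
#define LINE_BYTES 64   /* assumed cache-line size  */

/* Count the distinct cache lines touched when gathering SIMD_WIDTH
 * floats through idx[].  Sixteen contiguous indices land on one line;
 * sixteen scattered indices can land on sixteen, i.e. 16x the memory
 * traffic for the same vector of work. */
int lines_touched(const float *base, const size_t idx[SIMD_WIDTH])
{
    uintptr_t lines[SIMD_WIDTH];
    int n = 0;
    for (int lane = 0; lane < SIMD_WIDTH; ++lane) {
        uintptr_t line = (uintptr_t)(base + idx[lane]) / LINE_BYTES;
        int seen = 0;
        for (int j = 0; j < n; ++j)
            if (lines[j] == line) { seen = 1; break; }
        if (!seen)
            lines[n++] = line;
    }
    return n;
}
```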