TimothyFarrar
Regular
I thought warps had 16 "threads"? The G80's units are 8-wide SIMD cores (as you say), and I thought it double-pumped them (not quad-pumped). I guess it isn't a big deal either way.
Actually, you are half right: a half-warp is 16 threads, and the core arch currently does seem to be double-pumped, with 16 banks for memory access. But in CUDA the warp size is 32 threads, and fragment programs also run 32 pixels at a time. Also, according to the CUDA docs, it seems a move to 32 banks for memory access is in the works for future hardware.
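To make the numbers concrete, here's a minimal CUDA sketch of how a 1D thread index maps onto warps, half-warps, and shared-memory banks, assuming 32-thread warps and 16 banks as in the current docs (the kernel and array names are just for illustration):

```cuda
// Illustrative only: maps each thread's index to its warp, its
// half-warp, and the shared-memory bank a matching word address
// would hit. Assumes 32-thread warps and 16 banks (G80-era).
__global__ void laneInfo(int *warpId, int *halfWarpId, int *bank)
{
    int tid = threadIdx.x;
    warpId[tid]     = tid / 32;   // 32 threads execute in lockstep as one warp
    halfWarpId[tid] = tid / 16;   // shared memory is serviced per half-warp
    // word i of a shared array lands in bank i % 16, so a half-warp
    // reading consecutive words hits all 16 banks with no conflict
    bank[tid]       = tid % 16;
}
```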
Also I agree that it looks like with the type of GP programs Larrabee is designed to run well, there should be no problem keeping ALUs busy.
One thing I'm wondering about with Larrabee is what are they going to do driver side for scheduling tasks on the chip's cores.
With CUDA, all programs (kernels) are currently serialized. I would assume that kernels do overlap as one finishes and the next starts, but I haven't seen anything that verifies this (it would just seem stupid not to). Not being able to run multiple programs in parallel is a serious limitation for GPGPU stuff, and a serious advantage for Larrabee. I also think, but could be very wrong, that the graphics drivers don't run CUDA and DX/GL in parallel, and have to switch between those two modes of computation.
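For example, on the host side with the CUDA runtime API, two launches issued back-to-back are asynchronous with respect to the CPU but still execute one after the other on the device (the kernels here are hypothetical placeholders):

```cuda
// Two trivial kernels, just so the launches below have something to run.
__global__ void kernelA(float *d) { d[threadIdx.x] *= 2.0f; }
__global__ void kernelB(float *d) { d[threadIdx.x] += 1.0f; }

void runBoth(float *d_data, int n)
{
    // Both launches return immediately on the CPU, but the device
    // serializes them: kernelB cannot start until every block of
    // kernelA has retired -- there is no kernel-level parallelism.
    kernelA<<<1, n>>>(d_data);
    kernelB<<<1, n>>>(d_data);
    cudaThreadSynchronize(); // block the CPU until both finish
}
```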
For graphics, it seems as if the drivers overlap the vertex and fragment stages by starting the fragment stage once they can be sure it won't wait on vertex input. However (and perhaps someone who actually knows could correct me here), I'm thinking that, like CUDA, draw calls are serialized, with overlap between different programs only at the end and beginning. And would this only be overlap between different stages of two different programs, or do, for example, two different fragment programs ever run in parallel?
Right now, GPGPU stuff using DX/GL is limited by having to divide algorithms into very large batch chunks of fully parallel operations. Even if CUDA gets double precision and read/write access to a 2D surface cache, it seems like it's going to need the ability to run multiple kernels in parallel to be competitive with Larrabee in terms of GPGPU usefulness.
Larrabee's biggest strength, in my eyes, is being able to easily intermix lots of GPGPU and GPU stuff in parallel. Just the thought of being able to spawn new tasks from code running on the GPU is awesome.