Larrabee delayed to 2011?

Texture filtering isn't the thing which needs hardware support ... anisotropic is already mostly in the shaders, doing bilinear in there isn't a big thing. Decompression though belongs in hardware, and the actual cache ... 64 bit unaligned accesses don't suit normal caches very well (32 bit banked would do better).
Anisotropic done in the shaders? Never heard about it. Referring to some mobile part?
Doing bilinear in the shaders also doesn't make much sense (power inefficient).
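
(For reference, the arithmetic side of bilinear really is tiny: four taps and three lerps per channel. A minimal scalar sketch, leaving out the fetch, decompression and format conversion, which is the part the quote above puts in hardware anyway:)

```cpp
// Minimal scalar bilinear filter: four neighbouring texels t00..t11 and the
// fractional position (fx, fy) inside that 2x2 footprint. Fetching, unpacking
// and decompression are deliberately left out.
struct RGBA { float r, g, b, a; };

static RGBA lerp(const RGBA& a, const RGBA& b, float t)
{
    return { a.r + (b.r - a.r) * t,
             a.g + (b.g - a.g) * t,
             a.b + (b.b - a.b) * t,
             a.a + (b.a - a.a) * t };
}

RGBA bilinear(const RGBA& t00, const RGBA& t10,
              const RGBA& t01, const RGBA& t11,
              float fx, float fy)
{
    RGBA top    = lerp(t00, t10, fx);   // blend along x, top row
    RGBA bottom = lerp(t01, t11, fx);   // blend along x, bottom row
    return lerp(top, bottom, fy);       // blend the two rows along y
}
```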
 
Another reason Larrabee had TEX units would be that without the decoupled scheme the in-order cores do not have the latency-hiding ability to handle even minor texture activity.
Larrabee's rasterizer scheme depended on software strands to hide the latency, statically scheduling a run of other instructions long enough to cover the texturing latency before performing a load to pull the results in.

If the core had to run the texture load itself, it would need to stall the thread, and there are only 4 of them.
The decoupled scheme put the onus on the fetch unit, the details of which I didn't see publicized.
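
Roughly, the pattern each strand would follow looks like the sketch below; tex_issue() and tex_read_result() are invented stand-ins for the undisclosed core-to-texture-unit interface, so treat this as an illustration of the scheduling, not of the real ISA.

```cpp
// Hypothetical sketch of the decoupled fire-and-forget pattern. tex_issue()
// and tex_read_result() are stand-ins for Larrabee's unpublished interface
// between the cores and the texture units, not real intrinsics.
struct Float4 { float x, y, z, w; };
struct TexRequest { int slot; };

TexRequest tex_issue(int sampler, float u, float v);  // fire-and-forget: queue fetch + filter
Float4     tex_read_result(TexRequest r);             // consume; only safe after the target latency

Float4 shade_fragment(float u, float v, float n_dot_l)
{
    // 1. Kick the texture request off as early as possible.
    TexRequest albedo = tex_issue(0, u, v);

    // 2. A statically scheduled run of independent ALU work, sized by the
    //    compiler/rasterizer to cover the texture unit's target latency.
    float diffuse = n_dot_l > 0.0f ? n_dot_l : 0.0f;
    // ... more math with no dependency on the texture result ...

    // 3. Only now pull the result in. If the texture unit met its latency
    //    target, this load does not stall the hardware thread.
    Float4 c = tex_read_result(albedo);
    return { c.x * diffuse, c.y * diffuse, c.z * diffuse, c.w };
}
```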
 
Another reason Larrabee had TEX units would be that without the decoupled scheme the in-order cores do not have the latency-hiding ability to handle even minor texture activity
The latency has to be hidden by scheduling in either case. What you need to suspend is the actual instruction stream that depends on the texture request (i.e. the shader) by switching to doing something else, so that has to be handled in the cores. In fact it's less of an issue if filtering was being done in the cores since then you're only hiding the latency of the tap fetches rather than the entire latency of the filtering operation. As with all long latency memory accesses, you should be prefetching and switching to other work before issuing the load/store to avoid stalling the thread.

Texture sampling is just another long latency event that needs hiding... it's irrelevant whether or not it is being done in a separate unit except that if it is it's actually harder to keep the main cores busy, not easier.
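
In code terms, "switching to doing something else" on an in-order core is the usual software-fiber pattern, something like the sketch below (fiber_yield() and issue_texture_prefetch() are made-up names; the real renderer's interface isn't public):

```cpp
// Rough illustration of "suspend the dependent instruction stream and do
// something else": a cooperative fiber scheme. Both functions below are
// hypothetical; fiber_yield() just has to switch to another ready shader
// instance on the same hardware thread.
void fiber_yield();                        // hypothetical: run other ready fibers for a while
void issue_texture_prefetch(const void*);  // hypothetical: start pulling the taps on-chip

struct Fragment { const float* taps; float result; };

void shade(Fragment& f)
{
    issue_texture_prefetch(f.taps);  // start the long-latency access...
    fiber_yield();                   // ...and hand the core to other fragments meanwhile

    // By the time this fiber is resumed the taps should be (close to) resident,
    // so the dependent loads below don't stall the hardware thread.
    f.result = 0.25f * (f.taps[0] + f.taps[1] + f.taps[2] + f.taps[3]);
}
```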
 
Nick, I am curious to know your views on implementing sw rasterizers on Fermi. Is there any way in which Fermi is lacking vis-à-vis LRB1? Modulo some oddball instructions, of course, they can be implemented without mucking around with the overall architecture.
 
The latency has to be hidden by scheduling in either case. What you need to suspend is the actual instruction stream that depends on the texture request (i.e. the shader) by switching to doing something else, so that has to be handled in the cores.
It's done that way by the software rasterizer; it's just that the separate unit makes it easier to do.
The texture ops become fire-and-forget with a better static target latency, uniform handling of alignment issues, and a texture unit that would hopefully do a better job of not discarding its prefetches before filtering, and of not thrashing the limited L1 or valuable L2 tile space the way a long stream of prefetches from the core would.

In fact it's less of an issue if filtering was being done in the cores since then you're only hiding the latency of the tap fetches rather than the entire latency of the filtering operation.
I was under the impression that the filtering portion of the operation was significantly less than the worst-case fetch latency.

As with all long latency memory accesses, you should be prefetching and switching to other work before issuing the load/store to avoid stalling the thread.
Prefetch with the scalar or vector prefetches?
The vector ones have the downside of hitting the L1 and they do leave the VPU's FP resources unused for one or more cycles.
The scalar ones can fetch to the larger L2, though the 1:16 disadvantage they'd face with the software rasterizer's strand-based organization, and the back and forth between the vector and scalar sides, may make this less than universally useful.
Either way, this is a dual-issue core. There are only so many slots to burn.
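
For what it's worth, the scalar variant is just the familiar software-pipelined prefetch loop. The sketch below uses the plain x86 _mm_prefetch intrinsic rather than anything Larrabee-specific, and it shows exactly why the slot cost matters: every prefetch is one more instruction competing for the two issue ports.

```cpp
#include <xmmintrin.h>   // _mm_prefetch / _MM_HINT_T1
#include <cstddef>

// Plain scalar software-pipelined prefetch: start pulling texels for
// iteration i + DIST toward the cache while the math for iteration i runs.
// Nothing here is Larrabee-specific; it just shows where the extra
// instructions land on a dual-issue core.
void sum_texels(const float* texels, const int* indices, float* out, std::size_t n)
{
    const std::size_t DIST = 16;  // tuning knob: how far ahead to prefetch
    for (std::size_t i = 0; i < n; ++i)
    {
        if (i + DIST < n)
            _mm_prefetch(reinterpret_cast<const char*>(&texels[indices[i + DIST]]),
                         _MM_HINT_T1);          // aim past L1 at the larger L2

        out[i] = texels[indices[i]] * 0.5f;     // the "useful" work for this slot
    }
}
```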

Texture sampling is just another long latency event that needs hiding... it's irrelevant whether or not it is being done in a separate unit except that if it is it's actually harder to keep the main cores busy, not easier.
The math the cores would be doing would be shovel work, and the cache would be tracking dozens to hundreds of outstanding fetches. It's not a workload the P54 was meant to handle.
A custom texture unit could do all those things in its own way, and merely have to produce the results within a given time window.
The main cores could spend their time doing something useful.
 
I think you misinterpreted what I was saying... I agree that a separate unit for texturing fetching, decompression and filtering makes sense for the foreseeable future. I was merely noting that it doesn't actually change much about the latency hiding requirements of the core. i.e. latency hiding is not a major reason for having a separate unit, it's a result of it.
 
Nick, I am curious to know your views on implementing sw rasterizers on Fermi. Is there any way in which Fermi is lacking vis-à-vis LRB1? Modulo some oddball instructions, of course, they can be implemented without mucking around with the overall architecture.
It's not lacking anything critical, but it's still not well tailored toward software implementations either. On the one hand, Fermi is an impressive step forward for GPU programmability. On the other hand, each new feature has been added through a relatively small change to the GT200 architecture. Fermi is still a GPU with a restricted programming model, and not a generic multi-core CPU. So although it's a definite improvement, it's still very hard to write efficient software for this architecture.

Fermi, like any GPU, has terrible single-threaded performance. So it's not capable of compiling its own code. That's a crucial feature for creating a versatile rasterizer that doesn't perform like an emulator. Fermi is still incredibly dependent on the CPU, which has implications on what sort of applications you can write for it. Amdahl's Law is a bitch.

And while Fermi's ability to run 16 kernels simultaneously is a welcome improvement, Larrabee can essentially run 128 kernels and manage them on the device. Fermi is lacking the synchronization primitives to do that. Again it's relying on the CPU for scheduling kernels, and pays the price of round-trip latency for dependencies. Rasterization doesn't suffer so much from this as it has few dependencies, but other applications may not be so fortunate.
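
To make that concrete, "managing kernels on the device" doesn't require anything exotic: just a shared queue guarded by ordinary atomics that the cores drain themselves, with no host round trip between dependent launches. A minimal sketch of the pattern in plain C++ running on CPU threads (not an actual Larrabee or CUDA API):

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Minimal illustration of on-device kernel management: a shared queue of
// "kernels" (here just std::functions) drained by the cores themselves via
// an atomic ticket -- no host round trip between dependent launches.
int main()
{
    std::vector<std::function<void()>> kernels(128, []{ /* ... kernel body ... */ });
    std::atomic<std::size_t> next{0};

    auto worker = [&]{
        for (;;) {
            std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
            if (i >= kernels.size()) return;   // queue drained
            kernels[i]();                      // "launch" the kernel locally
        }
    };

    std::vector<std::thread> cores;
    for (int c = 0; c < 4; ++c) cores.emplace_back(worker);
    for (auto& t : cores) t.join();
}
```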

Last but not least, I don't think Fermi will be successful at creating a software ecosystem. I don't expect people will start exchanging a lot of CUDA, PTX and/or cubin code. Each application is pretty much written from scratch and delivered as an executable, not as APIs, libraries or source code. CUDA already has five different levels of compute capabilities, and more will be needed to achieve fully generic capabilities. That's very unattractive from a compatibility and code manageability point of view. The code length limitation also comes into play at some point. Larrabee will support unmodified C++ from day one, and additional features can be supported in a fully abstracted way using JIT compilation.

Does this answer your question? I may have digressed a little. ;)
 
It's not lacking anything critical, but it's still not well tailored toward software implementations either. On the one hand, Fermi is an impressive step forward for GPU programmability. On the other hand, each new feature has been added through a relatively small change to the GT200 architecture. Fermi is still a GPU with a restricted programming model, and not a generic multi-core CPU. So although it's a definite improvement, it's still very hard to write efficient software for this architecture.

Fermi, like any GPU, has terrible single-threaded performance. So it's not capable of compiling its own code. That's a crucial feature for creating a versatile rasterizer that doesn't perform like an emulator. Fermi is still incredibly dependent on the CPU, which has implications on what sort of applications you can write for it. Amdahl's Law is a bitch.

And while Fermi's ability to run 16 kernels simultaneously is a welcome improvement, Larrabee can essentially run 128 kernels and manage them on the device. Fermi is lacking the synchronization primitives to do that. Again it's relying on the CPU for scheduling kernels, and pays the price of round-trip latency for dependencies. Rasterization doesn't suffer so much from this as it has few dependencies, but other applications may not be so fortunate.

The issues you pointed out seem to be perfectly curable with a hypothetical Fermi-Llano. In particular, generic rasterizers seem perfectly doable with a Fermi-Llano.

Last but not least, I don't think Fermi will be successful at creating a software ecosystem. I don't expect people will start exchanging a lot of CUDA, PTX and/or cubin code. Each application is pretty much written from scratch and delivered as an executable, not as APIs, libraries or source code. CUDA already has five different levels of compute capabilities, and more will be needed to achieve fully generic capabilities. That's very unattractive from a compatibility and code manageability point of view. The code length limitation also comes into play at some point. Larrabee will support unmodified C++ from day one, and additional features can be supported in a fully abstracted way using JIT compilation.

I think a vendor-neutral binary IR can fix this.
 
The issues you pointed out seem to be perfectly curable with a hypothetical Fermi-Llano. In particular, generic rasterizers seem perfectly doable with a Fermi-Llano.
Generic rasterizers, yes, they could benefit from a fusion between a CPU and a GPU with Fermi-like capabilities. Actually, I'm under the impression that Fermi already has a processor for kernel scheduling: the GigaThread Hardware Thread Scheduler, in marketing speak. But it isn't under the control of the developer and doesn't help in compilation. A full-blown CPU tagged onto the GPU would fix that.

But it would still be an intermediate step, only suited for rasterizers and other applications with a similar workflow. As much as NVIDIA would like it, supercomputers don't consist of tightly coupled cores executing the same instructions. They consist of clusters of cores that can each execute fully independent threads. That's a more flexible architecture that doesn't put any restrictions on the application. Add to this the ability to program it in regular C++ (or Fortran or whatever), and you can see why an architecture like Larrabee is far more likely to dictate the future of HPC. Developers don't like hybrid solutions.
I think a vendor-neutral binary IR can fix this.
Certainly, but that means CUDA is out. DirectX 12 and OpenCL 2.0 are also unlikely to let go of the kernel programming model, which makes exchanging code on a large scale unlikely. The best candidate to create a strong software ecosystem, in my opinion, would be x86. It's not vendor neutral, but it already has a massive existing ecosystem it can borrow from. A lot of multi-threaded applications, APIs and libraries can easily be ported to Larrabee. NVIDIA hasn't even reached square one when it comes to creating momentum in the software market.
 
Damn it, no edit yet... Anyhow, a little dig at MS for Vista at the end, and I kind of agree with Tim for the most part (for once).
 
Thanks for the links!

I'm not sure I totally followed the argument about explicit locality being better for power reasons... obviously going to off-chip memory costs a lot of power, but that's a point for making your algorithms have efficient working sets, not for *explicit* working sets. In fact, the latter often *wastes* a lot of power when you have anything even slightly data-dependent because you end up pulling in a whole lot more data than you end up using just because you *might* need it.

In these cases you really need a cache and hardware caches are a hell of a lot more "power efficient" than software caches. Even for more statically predictable stuff (read: easy everywhere already) I don't necessarily buy that explicit local memories are overall better than caches. I think it's clear (to me at least!) that we're going to need some form of hardware caches going forward and given that investment it's not clear that its worth having explicit local memories as well.
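
The data-dependent case is easy to see in code: with an explicit local memory you have to stage a conservative superset of what you might touch, while a cache only pulls in the lines you actually hit. A toy sketch of the two shapes (the tile size and names are made up):

```cpp
#include <cstddef>
#include <cstring>

// Toy contrast for the data-dependent case. TILE and the layout are made up;
// only the access patterns matter.
constexpr std::size_t TILE = 4096;
static float local_mem[TILE];   // stand-in for an explicit local store

// Explicit local memory: we don't know which texels the indices will hit,
// so we copy in the whole conservative tile up front -- spending bandwidth
// (and power) on data we may never read.
float sum_explicit(const float* global_tile, const int* idx, std::size_t n)
{
    std::memcpy(local_mem, global_tile, TILE * sizeof(float));
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += local_mem[idx[i]];
    return s;
}

// Hardware cache: just issue the loads; only the lines actually touched get
// pulled on-chip, and reuse is captured automatically.
float sum_cached(const float* global_tile, const int* idx, std::size_t n)
{
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += global_tile[idx[i]];
    return s;
}
```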
 
I hope Charlie will have the chance to re-encode the vids; I'm having a tough time understanding their talk :(
 
Thanks for the links!

I'm not sure I totally followed the argument about explicit locality being better for power reasons... obviously going to off-chip memory costs a lot of power, but that's a point for making your algorithms have efficient working sets, not for *explicit* working sets. In fact, the latter often *wastes* a lot of power when you have anything even slightly data-dependent because you end up pulling in a whole lot more data than you end up using just because you *might* need it.

In these cases you really need a cache and hardware caches are a hell of a lot more "power efficient" than software caches. Even for more statically predictable stuff (read: easy everywhere already) I don't necessarily buy that explicit local memories are overall better than caches. I think it's clear (to me at least!) that we're going to need some form of hardware caches going forward and given that investment it's not clear that its worth having explicit local memories as well.

I agree. Some amount of hw cache is going to be necessary. Although local memories can be more area- and power-efficient, I think in the interest of unification we'll end up like LRB: just a general-purpose cache, repartitioned by the compiler to act like private, local and global memory.
 
The issues you pointed out seem to be perfectly curable with a hypothetical Fermi-Llano. In particular, generic rasterizers seem perfectly doable with a Fermi-Llano.

Fixed-functions. Without fixed-functions, what you are going to see is this:

http://forums.arm.com/index.php?showtopic=14268

Note why I said ARM instead of the usual suspects.

Regarding their debate, it looks to me as if Tim Sweeney is very bored and Andrew Richards is talking to himself. One of them doesn't see how the world needs to be, and the other does. Fixed functions, the bane of all things. The fact that many people (Tim Sweeney among them) root for getting rid of fixed functions isn't down to a mere psychological effect, but to how the world actually works.
 
In these cases you really need a cache and hardware caches are a hell of a lot more "power efficient" than software caches. Even for more statically predictable stuff (read: easy everywhere already) I don't necessarily buy that explicit local memories are overall better than caches. I think it's clear (to me at least!) that we're going to need some form of hardware caches going forward and given that investment it's not clear that its worth having explicit local memories as well.
Scatter/gather without just slow serialization is really hard to do with caches ... and with snooping you can just plain forget about it (scaling the snoop filters, invalidation ports, etc. for the order-of-magnitude higher traffic it can cause is not an option).
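
Concretely, a wide gather whose lanes land on arbitrary lines degenerates into one cache access per lane; the "single instruction" effectively expands into the scalar loop below, which is the serialization being referred to:

```cpp
// What a 16-wide gather can cost a cache that services one line per port
// per cycle: in the worst case every lane hits a different line, so the
// single instruction expands into 16 serial cache accesses.
void gather16(const float* base, const int idx[16], float out[16])
{
    for (int lane = 0; lane < 16; ++lane)   // one cache access per lane
        out[lane] = base[idx[lane]];
}
```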
 
But it would still be an intermediate step, only suited for rasterizers and other applications with a similar workflow. As much as NVIDIA would like it, supercomputers don't consist of tightly coupled cores executing the same instructions. They consist of clusters of cores that can each execute fully independent threads. That's a more flexible architecture that doesn't put any restrictions on the application. Add to this the ability to program it in regular C++ (or Fortran or whatever), and you can see why an architecture like Larrabee is far more likely to dictate the future of HPC. Developers don't like hybrid solutions.
Larrabee is a hybrid solution.
 