NVIDIA Fermi: Architecture discussion

Interesting proposal although I'm not sure this is as simple as it sounds either. As you mention in your edit I think this still breaks down to being able to split up the "threads" in the lanes after some timeout, rather than just always doing it at control flow. In either case it does involve tracking additional program counters and potentially registers (unless you have some indirection map from the new warps to the old register lanes they came from I guess) when the split happens.
It can still be done at the first control flow point reached after the timeout while solving your deadlock problem, because deadlock isn't possible w/o control flow. Some hardware may already have a per-thread program counter (didn't Bob mention this?), as it's not clear to me that such an implementation for control flow is any harder than using a stack of predication masks.

As for new registers and indirection maps, you lost me, as I don't see why they're necessary. Remember that all that we're doing is changing the order of thread execution. Dealing with deadlock could be as simple as this:
1. Initialize execution mask
2. Set n to the index of the first bit set in the mask
3. Find all threads in the batch with the same instruction pointer as thread n, use these as the predicate mask, and clear those bits from the execution mask
4. Execute the program on these threads until the timeout (or syncthreads, or fetch)
5. Go to step 2 until the mask is clear
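Written out as host-side C++ pseudocode, the loop would look something like this - a minimal sketch, with the WarpState struct, the selectNextGroup name, and the fixed 32-lane width all made up for illustration:

[code]
#include <cstdint>

// Illustrative software model of the per-warp scheduling loop described above.
struct WarpState {
    uint32_t pc[32];   // per-lane program counters
};

// Step 1 happens outside this function: execMask starts as the set of live lanes.
// Each call performs steps 2-3 and returns the predicate mask to issue under
// (step 4: run until timeout/syncthreads/fetch); repeat while execMask != 0 (step 5).
uint32_t selectNextGroup(const WarpState& w, uint32_t& execMask)
{
    // Step 2: n = index of the first set bit in the execution mask (assumes execMask != 0).
    int n = 0;
    while (((execMask >> n) & 1u) == 0) ++n;

    // Step 3: gather every still-pending lane whose PC matches lane n's PC...
    uint32_t predicate = 0;
    for (int lane = 0; lane < 32; ++lane)
        if (((execMask >> lane) & 1u) && w.pc[lane] == w.pc[n])
            predicate |= 1u << lane;

    // ...and clear those bits from the execution mask.
    execMask &= ~predicate;
    return predicate;
}
[/code]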

The biggest issue is, as you mentioned, the lack of thread reconvergence. However, if you want to guarantee being deadlock free (for appropriate code, of course), then I don't know if it's possible to run threads in a way that will let threads reconverge. The control path dependent partial-warp syncs that you mentioned can also deadlock if inserted by the compiler, because maybe some threads need to run through a section of code common to all threads before others to break the deadlock, thus mandating divergence.

I think the solution has to involve both HW and language changes. HW-only changes could at best heuristically weigh that timer I mentioned to encourage reconvergence, and I can't see it working too well. CPUs don't have to worry about reconvergence at all beyond regular syncs, so how can we expect GPUs to efficiently tackle this problem? What you need is subwarp desync-sync pairs in the language (e.g. syncthreads() takes a mask parameter which you initialize with desyncthreads(), with all code between being reducible) that the coder explicitly uses to prevent deadlock arising from the compiler-defined static ordering of branch evaluation. This way you only use that timer when absolutely necessary, because otherwise divergence screws performance up.
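To make that concrete, here's roughly what I'm picturing - note that desyncthreads() and a mask-taking syncthreads() are made-up intrinsics (real CUDA only has __syncthreads() with no mask), and memory fencing is glossed over:

[code]
// Hypothetical intrinsics: desyncthreads() returns a mask naming the lanes that
// are now allowed to run fully independently; syncthreads(mask) later reconverges
// exactly those lanes. Neither exists today - this is a language sketch only.
__global__ void producerConsumer(volatile int* flag, volatile int* data)
{
    unsigned mask = desyncthreads();   // coder explicitly opts into divergence here

    if (threadIdx.x == 0) {
        while (*flag == 0) ;           // consumer: spins until the producer runs;
                                       // would deadlock under a fixed branch ordering
        int value = data[0];           // ... go use value ...
    } else if (threadIdx.x == 1) {
        data[0] = 42;                  // producer
        *flag = 1;
    }

    syncthreads(mask);                 // reconverge only the lanes split above
}
[/code]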

Until we're ready to incorporate something like this into the language, IMO it's not worth it to do anything about this problem.

BTW, what are all the advantages you're thinking of with DWF? And how general of a solution are you thinking of?
 
Dunno how workable this is, but instead of adding a mask parameter to a hypothetical desyncthreads(), would it be enough to delay lanes at each control flow join by some (perhaps compiler determined) # of cycles (and only if divergence is detected) so that other lanes in a warp have a chance to catch up?
 
Dunno how workable this is, but instead of adding a mask parameter to a hypothetical desyncthreads(), would it be enough to delay lanes at each control flow join by some (perhaps compiler determined) # of cycles (and only if divergence is detected) so that other lanes in a warp have a chance to catch up?
You can't determine how to delay lanes at compile time because you don't know which lane is going to be behind. Branches get evaluated at run time.

Trying to delay lanes during run time is what I was talking about with the timer heuristic. It may be possible to give more cycles until timeout to the lanes that are behind (and fewer to those that are ahead) so that they can catch up. It's all very iffy, though, because the lanes that are behind may never catch up, and avoiding deadlock may in fact require other lanes to proceed.

Imagine, for example, if Andrew's example didn't have the "else" and there were a bunch of common instructions between the two if statements. The compiler and/or the scheduler may want to execute those threads together for that segment, but you can't wait for thread 0 to catch up. Thread 1 must proceed alone through those instructions to avoid deadlock.
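To spell out my guess at the shape of that variant (the flag-based spin is my own stand-in, not Andrew's actual code):

[code]
// Hypothetical stand-in for the example under discussion.
__global__ void noElseVariant(volatile int* flag)
{
    if (threadIdx.x == 0) {
        while (*flag == 0) ;   // thread 0 spins here waiting on thread 1
    }

    // ... a bunch of instructions common to all threads ...
    // A scheduler that tries to hold the warp together here, waiting for
    // thread 0 to catch up, waits forever: thread 1 has to run through this
    // section alone to reach the store below and release thread 0.

    if (threadIdx.x == 1) {
        *flag = 1;
    }
}
[/code]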

I think the best way is to keep today's execution model (which deadlocks with that example) because it maximizes performance, and only go to a model that guarantees execution time for each thread when directed to do so by the programmer.
 
As for new registers and indirection maps, you lost me, as I don't see why they're necessary.
They may not be - I was thinking that spills/renaming are allowed across basic blocks, but you're probably right that they simply aren't.

However, if you want to guarantee being deadlock free (for appropriate code, of course), then I don't know if it's possible to run threads in a way that will let threads reconverge.
I feel like given sufficient cleverness you can always detect reconvergence when it happens, although it obviously may never. That said, it may not be worth it in complexity.

The control path dependent partial-warp syncs that you mentioned can also deadlock if inserted by the compiler, because maybe some threads need to run through a section of code common to all threads before others to break the deadlock, thus mandating divergence.
Yes, as I mentioned, the barrier/sync model breaks down a little bit with this capability. Your generalization may work (I'd have to think it through), but it's not clear to me that, if you really are going for making the threads look and act independent, you want that at all. It's probably best to handle it more generally with typical communication through memory with yields, etc.

I think the solution has to involve both HW and language changes.
Yes definitely. That said let's take a step back: I am not proposing that we need to necessarily move in this direction. All I was saying is that these differences make "CUDA threads" fundamentally different in both theory and practice from the typical usage of the term thread, where producer/consumer models are basically page 1 of the textbook :) Hence calling them "work items" instead is justified.

The one bit that's somewhat unfortunate is that the CUDA/ComputeShader/OpenCL model takes us a bit away from being able to really write fully ideal code for these architectures, as I mentioned earlier (with respect to braided parallelism). This is mostly the cost of abstracting the hardware SIMD width, but it sucks that you can't write producer/consumer/task-parallel code that communicates at all, even though it should work fine at some granularity on these chips. I'm looking forward to seeing if the inevitable new CUDA release with Fermi starts to expose this more, or whether they choose to just keep the "multi-kernel execution" completely hidden from the programmer and only enabled when independent kernels are submitted.

BTW, what are all the advantages you're thinking of with DWF? And how general of a solution are you thinking of?
Well fully-general DWF would help to solve problems like ray divergence in a ray tracer, for instance. If the scheduler was allowed to look at a somewhat wider number of "threads in flight" to form warps, and dynamically re-evaluate the formations on the fly at least within one group/cluster - then a lot of coherence could be captured that is otherwise lost to divergence, which is the #1 performance problem with GPU raytracing at the moment. To some extent you can cleverly reformulate the control flow to make it a bit more friendly to the scheduler at some cost in readability (see Samuli Laine's latest paper on the topic), but it would be nice to have something fully general. Again though, the cost may well be more than it is worth.
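For a flavor of what I mean by reformulating the control flow - a toy of my own, not the paper's kernel - the trick is to fold variable-length phases into a single flat loop so lanes that are in different phases still advance in the same iterations, instead of idling through each other's inner loops:

[code]
// Toy illustration only: two variable-length phases per thread (think node
// traversal and triangle intersection), folded into one loop.
__global__ void flattenedPhases(const int* stepsA, const int* stepsB, int* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int a = stepsA[tid];
    int b = stepsB[tid];
    int acc = 0;

    // The nested form - while (a-- > 0) {...} then while (b-- > 0) {...} -
    // keeps lanes that finished phase A masked off until the slowest lane in
    // the warp exits it. The flat form below lets them start phase B instead.
    while (a > 0 || b > 0) {
        if (a > 0) { acc += 1; --a; }   // one "phase A" step
        else       { acc += 2; --b; }   // one "phase B" step
    }
    out[tid] = acc;
}
[/code]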
 
I'm looking forward to seeing if the inevitable new CUDA release with Fermi starts to expose this more, or whether they choose to just keep the "multi-kernel execution" completely hidden from the programmer and only enabled when independent kernels are submitted.
All they need to expose are kernels that see the same address space. I wonder if Fermi handles page faults...

Well fully-general DWF.... Again though, the cost may well be more than it is worth.
Exactly. DWF is a double edged sword and perhaps a not entirely known beast at this point in time. We shall see..
 
I feel like given sufficient cleverness you can always detect reconvergence when it happens, although it obviously may never. That said, it may not be worth it in complexity.
I don't think detecting it is a problem. The hard part is making it happen more often than by pure coincidence.

Yes definitely. That said let's take a step back: I am not proposing that we need to necessarily move in this direction. All I was saying is that these differences make "CUDA threads" fundamentally different in both theory and practice from the typical usage of the term thread, where producer/consumer models are basically page 1 of the textbook :) Hence calling them "work items" instead is justified.
I figured as much and was going to agree with your earlier stance, but I thought this tangent was worth some discussion. In particular, it's not the SIMD nature of the work item processing that makes them different from normal threads, but rather the desire to run them as fast as possible by maximizing coherency.

Well fully-general DWF would help to solve problems like ray divergence in a ray tracer, for instance. If the scheduler was allowed to look at a somewhat wider number of "threads in flight" to form warps, and dynamically re-evaluate the formations on the fly at least within one group/cluster - then a lot of coherence could be captured that is otherwise lost to divergence, which is the #1 performance problem with GPU raytracing at the moment. To some extent you can cleverly reformulate the control flow to make it a bit more friendly to the scheduler at some cost in readability (see Samuli Laine's latest paper on the topic), but it would be nice to have something fully general. Again though, the cost may well be more than it is worth.
I haven't read the paper, but I assumed something like that was developed a long time ago. I always thought that the bigger problem is data incoherency rather than divergence in the instruction streams, particularly with secondary rays (which, IMO, are needed for realtime RT to have any chance of taking over rasterization). You're going to need clever coding to really address that, not DWF.

I'm not really sold on it. Keep GPUs excelling at fairly coherent loads, and use CPUs for the rest.
 
Ignore Fermi's L2 size for a second, and don't think of a cache in terms of a local store - otherwise you'd be better off using a local store (and not waste area and power on something you don't need). Just because your entire data set doesn't fit in your cache, it doesn't automatically mean that the cache cannot help you. Texture caches are a perfect example of that.

Right, but my question was more about what specific things might now be possible or changed significantly based on the new memory hierarchy.
 
Mintmaster, sure, the compiler cannot know how much to delay lanes except in very simple cases, nor can it statically know which lane to delay (and I do understand that one cannot always reconverge without reintroducing the deadlock problem). My thought was that the compiler would estimate how far a particular lane could be ahead of the others; at run time, if the lane actually is ahead, then it gets delayed by that number of cycles. For complicated cases, the compiler would just punt and some default delay would get used. The goal is really just to allow reconvergence for at least the simple cases. That said, I think you are right that this wouldn't work out too well in practice.

BTW, an interesting hybrid vector/thread design is the vector-thread architecture - it features a control processor that can issue vector instructions to a set of lanes a la Larrabee (but with no SMT), and the ALU lanes are augmented with tiny instruction caches and are also able to fetch instruction blocks for themselves.
 
Sigh, based on what?

1. Semi-accurate says they're clearing shelves before a price drop
2. Semi-accurate says they're going EOL
3. Digitimes says 55nm is in short supply from both Nvidia and ATi
4. Fudzilla says partners can't get parts
5. BSN claims partners say the parts are still in full production
6. Newegg has lots of 260s, 275s, 285s and 295s in stock

Take your pick.


Anand now says Nvidia says GT200 is EOL.

The prices on 260, 275, 285, 295 at newegg make them a nonstarter. Yes, technically they are available, and they will probably never sell out either because who would buy them.

For example, does this look like a competitive lineup of GTX275 to you? http://www.newegg.com/Product/Produ...048 106792634 1067947241&name=GeForce GTX 275, or does it look like a dying product line?

You've got one part under $250, from someone named "Galaxy"...when for $259, theoretically at least, you could get a 5850.
 
Anand now says Nvidia says GT200 is EOL.

The prices on 260, 275, 285, 295 at newegg make them a nonstarter. Yes, technically they are available, and they will probably never sell out either because who would buy them.

For example, does this look like a competitive lineup of GTX275 to you? http://www.newegg.com/Product/Produ...048 106792634 1067947241&name=GeForce GTX 275, or does it look like a dying product line?

You've got one part under $250, from someone named "Galaxy"...when for $259, theoretically at least, you could get a 5850.

Newegg actually has hardware they purchased, so it would be better for them to sell it for $10 than to store it for ages. The out-of-stock listings make it seem much more like EOL than the high prices do. As long as the prices stay that high, there must still be people buying at that price - I don't know who they are, though.
 
HP, Nvidia Team Up in HPC Project

http://www.eweek.com/c/a/IT-Infrastructure/HP-Nvidia-Team-Up-in-HPC-Project-658727/

The two technology vendors are part of a project being overseen by the Georgia Institute of Technology, which announced Oct. 21 that it had received a five-year, $12 million Track 2 award from the National Science Foundation’s Office of Cyberinfrastructure.

The goal of the program is to create two heterogeneous HPC systems for research work in such areas as computational biology, combustion, materials science and massive visual analytics.

In the initial system, which is scheduled to be deployed in early 2010, HP and Nvidia will provide the processing power and computing systems. The project will pair the CPU capabilities of Intel with GPUs (graphics processing units) from Nvidia.

HP will supply hundreds of Intel-based systems, while Nvidia will bring its next-generation CUDA architecture, which is code-named “Fermi.” CUDA is the computing engine for Nvidia’s GPUs. CUDA also makes it easier for researchers to run GPUs and CPUs in a co-processing way. Fermi is designed specifically for HPC environments.
 
Is Huddy getting a bit Fuddy?

AMD's senior manager of developer relations, Richard Huddy, told HEXUS it looks to him as if NVIDIA is "somewhat abandoning the gaming market."

Huddy explained how even his mother had found benefits in using DX11 to speed up home video transcoding on Windows Media player, but concluded "Maybe NVIDIA just doesn't care enough about my mother."

That second one is priceless :LOL:
 
right up there with Dual Precision, ECC and Cuda...
It's funny you're saying this, because AMD had support for DP way ahead of NV, and CUDA is essentially NV's version of OpenCL and DX11CS, both of which are supported (even evangelised) by AMD.
As for ECC - it doesn't cost much in terms of transistors and won't be used on gaming boards so there won't be any impact on performance.
I'd say that AMD is "abandoning gaming" almost as much as NV is. After all, the main feature of DX11 is Compute Shaders, right?
 
right up there with Dual Precision, ECC and Cuda...

Not sure what point you're trying to make. All those things line up perfectly with Nvidia's current compute push, i.e. they're not contradictory (which is what I assume Degustator was pointing out with regard to Huddy's tangential comments on gaming and video acceleration).
 