NVIDIA GF100 & Friends speculation

Wasn't Cypress being mocked earlier as being "just" 30% faster than the 285, and how it was so disappointing.. :rolleyes: And that was based on numbers supposedly from Nvidia. :LOL:

Nah, the mocking was more to do with comparisons between Cypress and RV770. That was real easy because of the symmetric doubling of units - things aren't quite so cut and dried with Fermi vs GT200. In any case we won't have a full picture of Fermi's advantages (or lack thereof) over GT200 until everything is fleshed out - PhysX and CSAA performance are two outstanding questions.
 
The y unit can use the result of the x unit, z can use the result of the y unit, etc., for a single xyzw-instruction group. You know how the ALUs work on two wavefronts in an AAAABBBB pattern? Just use 8 batches in an ABCDEFGHAB... 32-cycle pattern. Same net instruction throughput, branch throughput, register access scheme, wavefront throughput, etc., except you get higher ALU utilization.
You will run out of register file ports, will spend more on control logic, will have to add more registers, will have to load more threads onto the GPU, will... Well, not really a good idea.
 
The y unit can use the result of the x unit, z can use the result of the y unit, etc., for a single xyzw-instruction group. You know how the ALUs work on two wavefronts in an AAAABBBB pattern? Just use 8 batches in an ABCDEFGHAB... 32-cycle pattern. Same net instruction throughput, branch throughput, register access scheme, wavefront throughput, etc., except you get higher ALU utilization.
I see what you mean. Not sure it would be so free from an area standpoint; for instance, you might need a higher-rate instruction decoder. It could also affect performance when there are not many threads to run..

Yeah, but for a massively parallel processor, it's probably more costly and difficult. Out of curiosity, what applications are you thinking of? I figured that shared memory and atomics help reduce the need for coherency a fair amount.
Building "complex" data structures via UAVs (yes you can do it on shared memory but in some cases it's simply too small..), faster access to arrays of temps, faster registers spill/fill, etc..
 
Maybe I'm not catching your drift, but semaphores via atomics are the key here, aren't they? Those plus append/consume.
Right, but the problem is you're never guaranteed that more than a single CS group (or thread, for that matter) is ever actually running concurrently, thus you can never write "safe" producer/consumer code, since you can never actually *wait* on results from another thread or group (that may deadlock if they are not running in parallel). Similarly for append/consume, the API doesn't allow you to do both in a single kernel for the same reason (you must be appending *or* consuming).
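
A tiny CUDA sketch of the hazard (flag/data buffer names are hypothetical): nothing guarantees the two blocks below are ever resident at the same time, so the spinning block can hang forever.

```cpp
// Purely illustrative CUDA sketch of the deadlock hazard: nothing guarantees
// block 0 and block 1 are ever resident on the GPU at the same time, so if
// block 1 happens to be scheduled first (or alone), it spins forever waiting
// for a producer that never runs.
__global__ void unsafe_producer_consumer(volatile int* flag, volatile int* data)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *data = 42;            // produce a result
        __threadfence();       // make the store visible before the flag
        *flag = 1;             // signal the consumer
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (*flag == 0) {   // "wait" on another group: may never terminate
            /* spin */
        }
        int v = *data;         // consume
        (void)v;
    }
}
```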

I think it's still fairly important. Fermi's caches are small, if you don't use them carefully they will be effectively worthless due to capacity misses. The local store lets you protect highly reused data from eviction as you stream other data through.
True, but then if that's the functionality that you want and you've already paid the cost for coherence logic, why not simply expose the ability to pin cache lines or similar? That would give the programmer the control to use it like a cache, or like a LDS.

I imagine you're right in that for now there will still be decent benefits from using the LDS, but in the long run the question needs to be addressed... Presumably Fermi unifying the address spaces and load/store ops is a step in the direction of at least partially coherent caches too.
 
You will run out of register file ports, will spend more on control logic, will have to add more registers, will have to load more threads onto the GPU, will... Well, not really a good idea.
No, you don't. The additional hardware is very minimal, because I'm not looking for general scalar computation like NVidia. It's just executing the same instructions, using the same register loads, in a different static order to provide the option for dependencies within an instruction group.
 
Anybody know if some Australian site is going to jump the gun? And if we might get some real benchmarks within the next couple of hours...
 
I see what you mean. Not sure it would be so free from an area standpoint; for instance, you might need a higher-rate instruction decoder. It could also affect performance when there are not many threads to run..
I've thought it through, and instruction rate is the same. Instead of issuing 2 instruction groups every 8 cycles, you'd need to issue 8 instruction groups every 32 cycles. Area-wise, it's pretty minimal, maybe needing a couple kilobytes of SRAM per SIMD to hold some temporary values. If you only have 2 or 4 wavefronts in flight due to obscene register requirements, you can always just fall back to the old way of doing things, and it can even be determined at compile time.

I'm thinking of making an animation at some point to illustrate the idea.
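
In lieu of the animation, here's a throwaway host-side C++ toy of one possible reading of the scheme (my own illustration, not anything AMD actually does): each VLIW slot lags the previous one by a quarter-wavefront, so the y slot can see the x result of the same instruction group, while total throughput stays at 8 instruction groups per 32 cycles instead of 2 per 8.

```cpp
// Host-side toy: 8 wavefronts A..H staggered across the xyzw slots so that
// each slot lags the previous one by 4 cycles (one quarter-wavefront).
#include <cstdio>

int main()
{
    const int units      = 4;   // x, y, z, w slots of the VLIW group
    const int wavefronts = 8;   // A..H resident on the SIMD

    printf("cycle   x  y  z  w\n");
    for (int cycle = 0; cycle < 32; ++cycle) {
        printf("%5d ", cycle);
        for (int u = 0; u < units; ++u) {
            // Each unit lags the previous slot by one quarter-wavefront, so
            // it can consume that slot's result for the same group.
            int slot = cycle - 4 * u;
            if (slot < 0)
                printf("  -");   // pipeline still filling
            else
                printf("  %c", 'A' + (slot / 4) % wavefronts);
        }
        printf("\n");
    }
    return 0;
}
```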

Building "complex" data structures via UAVs (yes you can do it on shared memory but in some cases it's simply too small..), faster access to arrays of temps, faster registers spill/fill, etc..
Okay, but it just seems to me that often when you need r/w coherency in large structures then the algorithm is usually very difficult to parallelize, in which case you'd be better off just using the CPU. Am I grossly mistaken?
 
Right, but the problem is you're never guaranteed that more than a single CS group (or thread, for that matter) is ever actually running concurrently, thus you can never write "safe" producer/consumer code, since you can never actually *wait* on results from another thread or group (that may deadlock if they are not running in parallel).
Agreed.

In CUDA you know the "size" of the device, the count of SIMDs, the number of blocks/threads per SIMD, etc. (which are a result of the specific kernel you plan to execute), so you can size your domain of execution to fit "precisely" on the GPU. That's the "persistent kernel" programming model, as opposed to the data-parallel domain-of-execution model (or stream model, if you prefer).
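
For flavour, a minimal CUDA sketch of that sizing (my own example, not code from the thread). BLOCKS_PER_SM is an assumed occupancy figure; in practice you derive it from the kernel's register and shared-memory footprint.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void persistent_kernel(const float* in, float* out, int n)
{
    // Each resident thread strides over the whole domain instead of the
    // launch creating one thread per work item.
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = in[i] * 2.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int BLOCKS_PER_SM     = 8;    // assumption for this toy kernel
    const int THREADS_PER_BLOCK = 256;
    const int n                 = 1 << 20;

    // Launch exactly enough blocks to fill the chip, no more.
    int blocks = prop.multiProcessorCount * BLOCKS_PER_SM;

    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    persistent_kernel<<<blocks, THREADS_PER_BLOCK>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("%d SMs -> %d persistent blocks\n", prop.multiProcessorCount, blocks);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```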

I haven't looked that closely at DC to see if that kind of querying is possible, specifically the count of threads that can be launched to occupy the entire GPU (or some subset).

Similarly for append/consume, the API doesn't allow you to do both in a single kernel for the same reason (you must be appending *or* consuming).
Yeah, so a set of sequentially invoked kernels is implied.

A combination of a persistent kernel and append, followed by a persistent kernel and consume is what I'm envisaging. I haven't come close to messing about with complex data structures in GPGPU though, so...

Jawed
 
From the Fermi docs it appears that caching will be enabled by default on DirectCompute global UAV accesses, which is a big deal. Beyond helping algorithms that actually have unpredictable memory access patterns, it raises the question of how important the local data store is now.
Since the accesses are unordered, doesn't that mean ATI can still use its L2 for reads? It seems to me that any example that winds up using the r/w cache in the backdoor way you mention wouldn't actually be a compliant DirectCompute program.

Or are you thinking of a program with a truckload of memory barriers that Fermi basically ignores due to the coherency of its cache?

BTW, where can I find documentation on DirectCompute? I'm basically relegated to looking at NVidia's GPU SDK examples.
 
Okay, but it just seems to me that often when you need r/w coherency in large structures then the algorithm is usually very difficult to parallelize, in which case you'd be better off just using the CPU. Am I grossly mistaken?
You are certainly not mistaken in the general case, but large structures don't automatically mean bad data locality :) The problem now is that as you go off chip for r/w buffers you get no caching at all, which is fine as long as you are not bw limited. As usual, one has to think about caches on GPUs as a way to amplify bw, not to drastically reduce latency (which is not really interesting).
 
Since the accesses are unordered, doesn't that mean ATI can still use its L2 for reads?
Only if you are exclusively performing reads or you have bound a read-only texture/buffer.
If you are performing gather/scatter from/to a UAV you can still use the read-only L2, but you have to issue cache line evictions before every gather/scatter to make sure data in the L2 reflects data in memory. The other choice is to use uncached gather/scatter. In both cases the L2 is not really used to save bw.
 
This might sound crazy (and it probably is..) but I wonder if in some cases it would be faster on an AMD GPU to build data structures in global memory via atomic ops, even if you don't need atomic ops, since they are supposedly performed on a globally shared and coherent (with respect to other atomic ops) r/w cache.
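
In CUDA terms the idea would look something like the sketch below (purely illustrative, buffer names invented; whether routing plain stores through the atomic path actually wins on AMD hardware is exactly the speculation above).

```cpp
// Assuming the indices are unique, no two threads ever touch the same
// element, so the atomic is functionally redundant; it's only there to push
// the traffic through the atomic path instead of plain uncached stores.
__global__ void scatter_via_atomics(const int* __restrict__ values,
                                    const int* __restrict__ indices,
                                    int* table, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Functionally equivalent to: table[indices[i]] = values[i];
    // (the old value returned by atomicExch is simply discarded).
    atomicExch(&table[indices[i]], values[i]);
}
```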
 
Since the accesses are unordered, doesn't that mean ATI can still use its L2 for reads?
It's possible that ATI can use its L2$ (and maybe already does)... I don't recall whether it is already coherent across the chip, but in my brief testing putting "globallycoherent" or not didn't seem to make any difference to performance or functionality, implying that they're only ever using globally coherent caches. The difference here is that NVIDIA has an L1$ that they imply will be used for both global and local memory caching (CUDA allows you to play with some of these parameters). This is interesting because, since the L1$ is the same memory they use for the LDS (they partition it), it should be low latency, and that casts doubt on whether or not it's worth explicitly loading the LDS, which basically now amounts to cache line pinning as I understand it. ATI on the other hand has a pure LDS for L1, and I imagine the L2$ is a fair distance away in terms of latency. Thus I would imagine that you'd always want to explicitly load into LDS on their cards if you intend to reuse any global memory results, coherent or not.
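
For reference, the CUDA knob in question is the per-kernel L1/shared split preference. The stub kernels below are placeholders of my own; the cudaFuncSetCacheConfig calls are the real Fermi-era API.

```cpp
// Each Fermi SM has 64 KB of on-chip memory that can be split 48/16 or 16/48
// between shared memory (LDS) and L1, chosen per kernel.
#include <cuda_runtime.h>

__global__ void staged_kernel(float* data)    { /* stages tiles in __shared__ */ }
__global__ void streaming_kernel(float* data) { /* relies on L1 for reuse */ }

void configure_caches()
{
    // 48 KB shared / 16 KB L1 for the kernel that explicitly stages data...
    cudaFuncSetCacheConfig(staged_kernel, cudaFuncCachePreferShared);
    // ...and 16 KB shared / 48 KB L1 for the one that trusts the cache.
    cudaFuncSetCacheConfig(streaming_kernel, cudaFuncCachePreferL1);
}
```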

It seems to me that any example that winds up using the r/w cache in the backdoor way you mention wouldn't actually be a compliant DirectCompute program.
Certainly for R/W it's not clear, but for R/O it's definitely possible to write an application that runs well with a cache and not with an explicit LDS, and I expect people will write these without knowing it if they develop on NVIDIA hardware. Your question relates to my earlier aside about the usefulness of globallycoherent at all, though... with the very few guarantees that DC gives you, I'm not sure that you can ever actually make use of this coherence safely.

Or are you thinking of a program with a truckload of memory barriers that Fermi basically ignores due to the coherency of its cache?
I don't think Fermi's caches are necessarily fully coherent on *write*, but I could be mistaken. Granted, the rather relaxed execution model of DC in particular implies that you might still be able to make a compliant implementation without respecting this. Things like CMPEXCH may cause trouble here, but for most operations if the order of execution isn't defined then coherence doesn't mean a whole lot... you can always argue that the discrete "views" that a CS group sees on memory are due to execution ordering even if they are actually due to loose memory coherence in practice (again, excepting some of the atomic-with-return cases).

BTW, where can I find documentation on DirectCompute? I'm basically relegated to looking at NVidia's GPU SDK examples.
It's not well-documented, but MSDN/help file in the latest DXSDK (Feb 2010) has some basics.

As usual, one has to think about caches on GPUs as a way to amplify bw, not to drastically reduce latency (which is not really interesting).
I'd argue that reducing latency is definitely interesting... you don't always have enough parallelism throughout your whole algorithm to keep the entire chip busy, *especially* if you have to cover long memory access latencies too. I think we're going to increasingly see algorithms with tighter feedback loops (iterative optimization stuff especially) and low-latency LDS/caches are going to be critical to these running fast. There's also always a tradeoff between the latency and storage that you need... the more latency your stuff has the more cache/storage you spend on thread contexts. At some point it crosses over in terms of hardware cost.

This might sound crazy (and it probably is..) but I wonder if in some cases it would be faster on an AMD GPU to build data structures in global memory via atomic ops, even if you don't need atomic ops, since they are supposedly performed on a globally shared and coherent (with respect to other atomic ops) r/w cache.
I'm sure it could be useful in some cases, but these operations still have a high overhead even without return values and even with fairly little contention. Excepting the (ridiculously) fast append/consume path, general purpose atomics to global memory are still pretty costly. That said, I'm sure there are cases where the alternatives are *more* costly :p
 
With the exception of software optimizations like memory allocation, shader compilers, or alteration of the workload, drivers have made fairly minimal differences to performance for the last couple of years.

That depends. The 5830 (many consider it a rather poor showing) with 10.3 drivers is as fast as, and sometimes faster than, the 5870 with launch drivers in some games.

I'd say there have been quite a few performance gains for the 5870. The X1800 XT also had quite beefy gains in the first 3-6 months.

But yes, you can't use past driver speed-ups to predict future speed-ups for future cards. That said, I wouldn't be surprised if over the course of 6 months there's some performance gain to be had with Fermi. Whether it's a lot or a little, who knows, however.

Also have to consider the fact that Nvidia have had their hands on working Fermi boards since at least December or January. So performance gains may or may not come more slowly than with past cards.

Regards,
SB
 
It's possible that there will be no real performance jumps from further GF100 drivers, because 60/64 TMUs are just too few to feed the chip.
Maybe Fermi will only speed up with upcoming titles that have higher shader power requirements.
 
With Fermi nearly here, I'm looking forward to a bit of fun with the histogramming "the words in Dickens' novels" problem that floored GTX285 a while back:
Yeah, looks like that'll run insanely fast on Fermi.
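
For anyone who hasn't seen the problem, the hot loop is basically the textbook atomic histogram below (a generic CUDA sketch of my own, not the actual benchmark code): contended shared-memory atomics plus global atomics, i.e. exactly the operations Fermi is supposed to have sped up relative to GT200.

```cpp
#define NUM_BINS 256

__global__ void byte_histogram(const unsigned char* __restrict__ data,
                               int n, unsigned int* global_hist)
{
    __shared__ unsigned int local_hist[NUM_BINS];

    // Zero the per-block histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local_hist[b] = 0;
    __syncthreads();

    // Accumulate into shared memory with (heavily contended) atomics.
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local_hist[data[i]], 1u);
    __syncthreads();

    // Fold the per-block counts into the global histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&global_hist[b], local_hist[b]);
}
```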

That depends. The 5830 (many consider it a rather poor showing) with 10.3 drivers is as fast as, and sometimes faster than, the 5870 with launch drivers in some games.
Are there any confirmations of that performance gain on sites other than the widescreengamingforum link in the other thread? You're right. That was an unusually large jump. I think the sandbagging theory may hold some water, as ATI wasn't even doing the regular AI mipmap optimizations on Evergreen, IIRC.
 
If you are performing gather/scatter from/to a UAV you can still use the read-only L2, but you have to issue cache line evictions before every gather/scatter to make sure data in the L2 reflects data in memory.
I thought the meaning of 'unordered' was that you don't have any guarantee about how stale the data is that you're reading.
It's possible that ATI can use its L2$ (and maybe already does)... I don't recall whether it is already coherent across the chip, but in my brief testing putting "globallycoherent" or not didn't seem to make any difference to performance or functionality
"globallycoherent" just means that it's the same value across the chip, right? It doesn't necessarily mean that it's sync'd with memory, does it?

Certainly for R/W it's not clear, but for R/O it's definitely possible to write an application that runs well with a cache and not with an explicit LDS, and I expect people will write these without knowing it if they develop on NVIDIA hardware.
R/O doesn't need LDS to run fast, does it? Isn't L1/L2 just as good? (R/O means 'read-only', right?)

I don't think Fermi's caches are necessarily fully coherent on *write*, but I could be mistaken.
Okay, now I'm confused, because I thought coherence is all about read-after-write. What does Fermi guarantee with its cache?

It's not well-documented, but MSDN/help file in the latest DXSDK (Feb 2010) has some basics.
Ah, okay. I'm still on August 2009, and that's the online library version, too.
 
"globallycoherent" just means that it's the same value across the chip, right? It doesn't necessarily mean that it's sync'd with memory, does it?
It's unclear what it means... what does the "same value across the chip" even mean if there are no guarantees on how stale it is? How does this provide anything over *not* declaring it that way? The docs say:
RWBuffer objects can be prefixed with the storage class globallycoherent. This storage class cuases[sic] memory barriers and syncs to flush data across the entire gpu such that other groups can see writes. Without this specifier, a memory barrier or sync will only flush a UAV within the current group.
So it looks like the intention is to allow incoherent "per-core" caches (i.e. at the group level) when this is not specified. Still, given that there appears to be no guarantee on the parallelism or ordering of how multiple groups behave, I don't see how the attribute is interesting at all... I'm clearly missing something.

R/O doesn't need LDS to run fast, does it? Isn't L1/L2 just as good? (R/O means 'read-only', right?)
Yes definitely and that's the point. I could write an app that doesn't use LDS at all but consistently reads randomly from global memory. This would probably run decently on Fermi but not so well on ATI (no L1$, only LDS).

Okay, now I'm confused, because I thought coherence is all about read-after-write. What does Fermi guarantee with its cache?
I'd love to hear the answer to this too :)

Ah, okay. I'm still on August 2009, and that's the online library version, too.
I think only PIX was really updated in Feb 2010, so that should actually give you the latest docs... it's indeed pretty sparse though especially on issues like this. OpenCL is even worse (or rather, a strict reading of the spec implies that basically *nothing* is allowed).
 