#1 |
|
Junior Member
Join Date: May 2006
Location: Shanghai
Posts: 90
|
Not sure if this has been discussed before.
I found this document, http://www.mathematik.uni-dortmund.d...08/C1_CUDA.pdf, which says that compute capability 1.2 adds shared memory atomics. I suspect that's GT2xx, right? Is there any additional info?
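For readers unfamiliar with what "shared mem atomics" buys you, here is a minimal sketch, not taken from the linked document. It assumes a device and toolkit target of compute capability 1.2 or later, and the names `histogram_shared`, `run_histogram_demo`, and `NUM_BINS` are made up for illustration. Each block accumulates a private histogram in shared memory with `atomicAdd`, then merges it into global memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define NUM_BINS 64  // illustrative bin count, not from the thread

__global__ void histogram_shared(const unsigned char *data, int n,
                                 unsigned int *global_hist)
{
    __shared__ unsigned int local_hist[NUM_BINS];

    // Clear the block-private histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local_hist[b] = 0;
    __syncthreads();

    // Atomic add on a shared-memory address: the feature in question.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local_hist[data[i] % NUM_BINS], 1u);
    __syncthreads();

    // Merge the block's counts into the global histogram (global atomics).
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&global_hist[b], local_hist[b]);
}

// Hypothetical host helper: returns true when the bins add up to n (n > 0).
bool run_histogram_demo(int n)
{
    std::vector<unsigned char> host(n);
    for (int i = 0; i < n; ++i) host[i] = (unsigned char)(i * 7);

    unsigned char *d_data = 0;
    unsigned int *d_hist = 0;
    if (cudaMalloc(&d_data, n) != cudaSuccess) return false;
    if (cudaMalloc(&d_hist, NUM_BINS * sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(d_data);
        return false;
    }
    cudaMemcpy(d_data, &host[0], n, cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, NUM_BINS * sizeof(unsigned int));

    histogram_shared<<<32, 128>>>(d_data, n, d_hist);

    unsigned int hist[NUM_BINS];
    cudaError_t err = cudaMemcpy(hist, d_hist, sizeof(hist),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_hist);
    if (err != cudaSuccess) return false;

    unsigned long long total = 0;
    for (int b = 0; b < NUM_BINS; ++b) total += hist[b];
    return total == (unsigned long long)n;
}
```

Without shared memory atomics, every increment would have to go to global memory or be restructured as a per-thread histogram followed by a reduction.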
|
|
|
|
|
#2 |
|
Member
Join Date: Nov 2007
Location: Santa Clara, CA
Posts: 427
|
An interesting note in that document is that it says "Global memory not cached on G8x GPUs". Is that a slip hinting that G9x or beyond might cache some global memory accesses?
It looks like G84 and beyond can run a CUDA kernel and a CPU<->GPU memory copy in parallel, but G80 cannot. Perhaps even with the stream interface CUDA can still only run one kernel at a time (it serializes kernel calls). Who knows whether it can overlap execution as each multiprocessor runs out of thread blocks to run?
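The copy/kernel overlap mentioned above is exposed through streams. The following is a rough sketch under stated assumptions: it uses the device attributes `cudaDevAttrAsyncEngineCount` and `cudaDevAttrConcurrentKernels` from later CUDA toolkits than the ones discussed in this thread, and the function names `scale`, `query_overlap_support`, and `run_overlap_demo` are hypothetical. Whether anything actually overlaps depends on the hardware; the code only makes overlap possible.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// Reports what the current device claims to support (0 means "no").
void query_overlap_support(int *copy_engines, int *concurrent_kernels)
{
    int dev = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(copy_engines, cudaDevAttrAsyncEngineCount, dev);
    cudaDeviceGetAttribute(concurrent_kernels, cudaDevAttrConcurrentKernels, dev);
}

// Hypothetical demo: two chunks, each with its own stream, so the copy of
// one chunk may overlap the kernel of the other. Returns true on success.
bool run_overlap_demo(int n)
{
    const int kChunks = 2;
    const size_t bytes = n * sizeof(float);
    float *h = 0, *d = 0;

    // Asynchronous copies need page-locked (pinned) host memory.
    if (cudaMallocHost(&h, kChunks * bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d, kChunks * bytes) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }
    for (int i = 0; i < kChunks * n; ++i) h[i] = 1.0f;

    cudaStream_t streams[kChunks];
    for (int c = 0; c < kChunks; ++c) cudaStreamCreate(&streams[c]);

    for (int c = 0; c < kChunks; ++c) {
        float *hc = h + c * n;
        float *dc = d + c * n;
        cudaMemcpyAsync(dc, hc, bytes, cudaMemcpyHostToDevice, streams[c]);
        scale<<<(n + 255) / 256, 256, 0, streams[c]>>>(dc, n, 2.0f);
        cudaMemcpyAsync(hc, dc, bytes, cudaMemcpyDeviceToHost, streams[c]);
    }

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < kChunks * n; ++i) ok = (h[i] == 2.0f);

    for (int c = 0; c < kChunks; ++c) cudaStreamDestroy(streams[c]);
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}
```

Operations within one stream run in order; only operations in different streams are candidates for overlap, which is why each chunk gets its own stream.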
__________________
Timothy Farrar :: blog |
|
|
|
|
|
#3 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
Since it is just a general presentation on CUDA, I don't think a line like "global memory not cached on G8x GPU" says anything about future GPUs. Furthermore, a read-write cache would require a cache coherence protocol, which is not trivial to implement for 16 MPs. A read-only cache is already available through the texture cache.
Right now CUDA can only run one kernel at a time (except in the rare case when two kernels overlap). However, since MPs are quite independent, it may be possible to schedule different kernels concurrently, such as 10 MPs running kernel A and 6 MPs running kernel B. Alternatively, if your kernels are small enough, you can put both of them into the same kernel and dispatch between them inside it, but you won't have any control over scheduling.
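The "put both kernels into the same kernel" idea can be sketched as below. This is only an illustration: the two tasks (add one, multiply by two) and the names `fused`, `run_fused_demo`, and `blocks_for_a` are invented here. The block index selects the task, so the branch is uniform within a block, and the hardware decides which MP runs which block.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One launch, two logical kernels: the first blocks_for_a blocks run
// "task A" (a[i] += 1), the remaining blocks run "task B" (b[i] *= 2).
__global__ void fused(float *a, int na, float *b, int nb, int blocks_for_a)
{
    if (blockIdx.x < blocks_for_a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < na) a[i] += 1.0f;
    } else {
        int i = (blockIdx.x - blocks_for_a) * blockDim.x + threadIdx.x;
        if (i < nb) b[i] *= 2.0f;
    }
}

// Hypothetical host helper: returns true if both tasks produced the
// expected values (na > 0 and nb > 0).
bool run_fused_demo(int na, int nb)
{
    const int kThreads = 128;
    std::vector<float> ha(na, 1.0f), hb(nb, 3.0f);
    float *da = 0, *db = 0;
    if (cudaMalloc(&da, na * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&db, nb * sizeof(float)) != cudaSuccess) {
        cudaFree(da);
        return false;
    }
    cudaMemcpy(da, &ha[0], na * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, &hb[0], nb * sizeof(float), cudaMemcpyHostToDevice);

    int blocks_a = (na + kThreads - 1) / kThreads;
    int blocks_b = (nb + kThreads - 1) / kThreads;
    fused<<<blocks_a + blocks_b, kThreads>>>(da, na, db, nb, blocks_a);

    cudaError_t e1 = cudaMemcpy(&ha[0], da, na * sizeof(float),
                                cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(&hb[0], db, nb * sizeof(float),
                                cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return false;

    for (int i = 0; i < na; ++i) if (ha[i] != 2.0f) return false;
    for (int i = 0; i < nb; ++i) if (hb[i] != 6.0f) return false;
    return true;
}
```

As the post says, this gives no control over scheduling: nothing in the code pins task A to particular MPs, and the fused kernel uses the larger register and shared memory footprint of the two tasks.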
|
|
|
|
|
#4 |
|
Member
Join Date: Nov 2006
Posts: 128
|
A read-write cache doesn't require a coherence protocol, actually. There's always the option of putting the coherence burden on the programmer with memory barriers and explicit invalidate or uncached load commands. General-purpose processors (and the programmers who write for them) have always assumed cache coherence, but it's not as universal on more exotic high-performance processors.
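As an illustration of programmer-managed visibility, here is a sketch that uses CUDA's `__threadfence()` as the memory barrier and `volatile` reads as the "uncached load". It follows the well-known "last block finishes the job" pattern; the names `sum_with_fence`, `run_fence_sum_demo`, and `blocks_done` are hypothetical, and `__threadfence()` itself arrived in toolkits newer than those discussed in this thread.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define FENCE_THREADS 256  // block size; must be a power of two here

__device__ unsigned int blocks_done = 0;  // how many blocks have published

__global__ void sum_with_fence(const int *in, int n,
                               volatile int *partial, int *result)
{
    __shared__ int sdata[FENCE_THREADS];
    __shared__ bool is_last;
    int tid = threadIdx.x;

    // Per-block partial sum (grid-stride loop plus tree reduction).
    int sum = 0;
    for (int i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];
    sdata[tid] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) {
        partial[blockIdx.x] = sdata[0];
        // Barrier: make the partial sum visible to other blocks
        // before announcing that this block is done.
        __threadfence();
        unsigned int ticket = atomicInc(&blocks_done, gridDim.x);
        is_last = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    // The last block to finish combines all partial sums. The volatile
    // pointer forces real memory reads instead of reusing stale values.
    if (is_last && tid == 0) {
        int total = 0;
        for (unsigned int b = 0; b < gridDim.x; ++b) total += partial[b];
        *result = total;
        blocks_done = 0;  // reset so the kernel can be launched again
    }
}

// Hypothetical host helper: sums n ones on the GPU, returns -1 on failure.
int run_fence_sum_demo(int n)
{
    const int kBlocks = 16;
    std::vector<int> host(n, 1);
    int *d_in = 0, *d_partial = 0, *d_result = 0;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, kBlocks * sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_in, &host[0], n * sizeof(int), cudaMemcpyHostToDevice);

    sum_with_fence<<<kBlocks, FENCE_THREADS>>>(d_in, n, d_partial, d_result);

    int result = -1;
    if (cudaMemcpy(&result, d_result, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        result = -1;
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_result);
    return result;
}
```

The ordering is entirely the programmer's responsibility here: drop the fence and the last block may read a partial sum that has not yet become visible.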
|
|
|
|
|
|
#5 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
I don't think it's worth the trouble to implement a non-coherent read-write cache on a GPU (or almost anything), as it's much more error-prone. Just IMHO though.
|
|
|
|
|
|
#6 | |
|
Regular
|
Quote:
|
|
|
|
|
|
|
#7 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
I don't know what you mean by "single cache." To my understanding, if every MP has its own cache and you want an efficient cache coherence protocol (not just some simple broadcasting scheme), then it's not trivial.
|
|
|
|
|
|
#8 |
|
Regular
|
A single shared cache.
|
|
|
|
|
|
#9 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
Ok. But making a single shared cache multi-ported for 16 processors is not going to be easy.
|
|
|
|
|
|
#10 |
|
Registered
Join Date: Jun 2008
Posts: 1
|
Using the texture cache is complicated in real workloads, since it has a relatively small footprint and requires 2D spatial locality of accesses. There's also constant memory, which is cached as well but is specialized for a very specific usage pattern.
Other than that, G80 indeed doesn't have a global memory cache. The fast on-chip memory of each MP (shared memory) can be used as a user-managed cache, though doing so is not straightforward. The clear advantage of this memory being user-managed rather than hardware-managed is that data replacement is coordinated by the kernel, in conjunction with the computation algorithm executed by all 512 threads together. Replacing it with a hardware cache would likely reduce performance. That said, adding a hardware read-only cache would improve the usability of CUDA, although at the expense of some performance. It's hard to believe that a write cache will be added in the new generation, since coherence would be crucial for it but is obviously totally impractical.
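A minimal sketch of shared memory used as a user-managed cache, assuming a simple 3-point stencil chosen only for illustration (the names `stencil3`, `run_stencil_demo`, and `TILE` are hypothetical). Each block stages its tile plus one halo element per side in shared memory, so every input element is fetched from global memory once per block instead of up to three times.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 256  // elements per block; the kernel assumes blockDim.x == TILE

// out[i] = in[i-1] + in[i] + in[i+1], with out-of-range neighbours as 0.
__global__ void stencil3(const int *in, int *out, int n)
{
    __shared__ int tile[TILE + 2];  // tile plus one halo cell on each side
    int g = blockIdx.x * TILE + threadIdx.x;  // global index
    int l = threadIdx.x + 1;                  // local index, past left halo

    // The kernel, not the hardware, decides what is brought on chip.
    tile[l] = (g < n) ? in[g] : 0;
    if (threadIdx.x == 0)
        tile[0] = (g > 0 && g - 1 < n) ? in[g - 1] : 0;
    if (threadIdx.x == TILE - 1)
        tile[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0;
    __syncthreads();

    if (g < n) out[g] = tile[l - 1] + tile[l] + tile[l + 1];
}

// Hypothetical host helper: returns true if the GPU result matches a
// straightforward CPU loop (n > 0).
bool run_stencil_demo(int n)
{
    std::vector<int> in(n), out(n, 0);
    for (int i = 0; i < n; ++i) in[i] = i % 17;

    int *d_in = 0, *d_out = 0;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, &in[0], n * sizeof(int), cudaMemcpyHostToDevice);

    stencil3<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);

    cudaError_t err = cudaMemcpy(&out[0], d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        int left = (i > 0) ? in[i - 1] : 0;
        int right = (i + 1 < n) ? in[i + 1] : 0;
        if (out[i] != left + in[i] + right) return false;
    }
    return true;
}
```

The halo handling is the "not straightforward" part the post alludes to: the programmer has to work out exactly which data the block will reuse and load it explicitly.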
|
|
|
|
|
#11 |
|
Regular
|
Without scatter it is practical, seeing as ATI is practicing it.
|
|
|
|
|
|
#12 |
|
chaos dunk
Join Date: May 2003
Location: Mountain View, CA
Posts: 3,274
|
http://forums.nvidia.com/index.php?showtopic=70171
Double precision, shared memory atomics, long (64-bit) global memory atomics, double the number of registers, up to 32 active warps per SM, and up to 1024 active threads per SM. |
|
|
|
|
|
#13 |
|
Regular
|
Shame section 5.1 doesn't describe the throughputs for double instructions.
Ooh, interesting: appendix A lists the Compute Capabilities, showing a nice big gap for 1.2 GPUs, i.e. those GPUs that are the same as 1.3 but don't have double precision.
Jawed |
|
|
|
|
|
#14 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
It's interesting to see that the basic double precision operations (add, sub, mul, div, and sqrt) are IEEE 754 compliant. Also, add and mul support all four rounding modes. Single precision operations are still not IEEE 754 compliant, though (div is implemented as a multiplication by the reciprocal, so it's not accurate to the 754 standard).
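To see why division implemented as multiplication by the reciprocal cannot match IEEE 754, here is a small sketch using the device intrinsics `__fdiv_rn`, `__frcp_rn`, and `__fmul_rn` (which need newer hardware than the GPUs discussed here). The kernel and helper names are hypothetical. Even when the reciprocal and the product are each correctly rounded, the two roundings can compound; an approximate hardware reciprocal would only make this worse.

```cuda
#include <cuda_runtime.h>

__global__ void divide_two_ways(float a, float b, float *out)
{
    out[0] = __fdiv_rn(a, b);             // correctly rounded quotient
    out[1] = __fmul_rn(a, __frcp_rn(b));  // a times rounded reciprocal of b
}

// Hypothetical host helper: returns true if the two results differ.
// Returns false if the device could not be used.
bool division_results_differ(float a, float b)
{
    float *d_out = 0;
    float h_out[2] = {0.0f, 0.0f};
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return false;
    divide_two_ways<<<1, 1>>>(a, b, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out[0] != h_out[1];
}
```

For example, 5/3 rounds to 13981013 × 2^-23 in single precision, while 5 × fl(1/3) equals exactly 13981013.75 × 2^-23 and therefore rounds up to 13981014 × 2^-23, one unit in the last place too high. With 6 and 3 both routes give exactly 2.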
|
|
|
|