Old 10-Jun-2008, 16:42   #1
RacingPHT
Junior Member
 
Join Date: May 2006
Location: Shanghai
Posts: 90
What's new in the upcoming GT2XX? Atomics in shared memory and more?

Not sure if this has been discussed before.
I found this document: http://www.mathematik.uni-dortmund.d...08/C1_CUDA.pdf
which says:
"Compute capability 1.2 adds shared mem atomics"
I suspect that means GT2xx, right? And is there any additional info?
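For context, shared memory atomics enable things like per-block histogramming without round trips to global memory. A minimal sketch, assuming a compute capability 1.2 device (all names here are illustrative, not from the document):

```cuda
// Sketch: per-block histogram using shared memory atomics (CC 1.2+).
#define NUM_BINS 64

__global__ void histogram(const unsigned int *data, unsigned int *bins, int n)
{
    __shared__ unsigned int s_bins[NUM_BINS];

    // Cooperatively zero the shared bins.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        s_bins[i] = 0;
    __syncthreads();

    // Accumulate into fast on-chip shared memory instead of global memory.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(&s_bins[data[idx] % NUM_BINS], 1u);
    __syncthreads();

    // Flush the block's partial histogram to the global result.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&bins[i], s_bins[i]);
}
```

On pre-1.2 hardware every increment would have to be a global memory atomic (CC 1.1) or a multi-pass reduction.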
Old 10-Jun-2008, 22:16   #2
TimothyFarrar
Member

Join Date: Nov 2007
Location: Santa Clara, CA
Posts: 427

An interesting note in that document is that it says "Global memory not cached on G8x GPUs". Is that a slip suggesting that G9x or beyond might cache some global memory accesses?

Looks like G84 and beyond can run a CUDA kernel and a CPU<->GPU memory copy in parallel, but G80 cannot. Perhaps even with the stream interface CUDA can still only run one kernel at a time (it serializes kernel calls); who knows whether they can overlap execution as each multiprocessor runs out of thread blocks to run?
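A rough sketch of what that copy/kernel overlap looks like through the stream API. `my_kernel`, the pointers, and the launch configuration are illustrative; `h_a` must be page-locked (allocated with `cudaMallocHost`) for the copy to actually be asynchronous:

```cuda
// Sketch: overlapping a host->device copy with kernel execution using two
// streams (possible on G84+ as noted above).
cudaStream_t copyStream, execStream;
cudaStreamCreate(&copyStream);
cudaStreamCreate(&execStream);

// Queue the copy in one stream...
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, copyStream);
// ...and a kernel in the other; the hardware may overlap the two.
my_kernel<<<grid, block, 0, execStream>>>(d_b);

cudaStreamSynchronize(copyStream);
cudaStreamSynchronize(execStream);
```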
__________________
Timothy Farrar :: blog
Old 10-Jun-2008, 23:50   #3
pcchen
Moderator

Join Date: Feb 2002
Location: Taiwan
Posts: 2,467

Since it is just a general presentation on CUDA, I don't think a line like "global memory not cached on G8x GPUs" says anything about future GPUs. Furthermore, a read-write cache would require a cache coherence protocol, which is not a trivial thing to do across 16 MPs. A read-only cache is already available through the texture cache.

Right now CUDA can only run one kernel at a time (except the rare case when two kernels overlap). However, since the MPs are quite independent, maybe it's possible to schedule different kernels, e.g. 10 MPs running kernel A and 6 MPs running kernel B. If your kernels are small enough, you can also put both of them into the same kernel and dispatch them inside the kernel, but then you have no control over scheduling.
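The "dispatch inside the kernel" idea can be sketched as fusing the two kernels and branching on the block index. A hypothetical example (the work done in each branch is just a placeholder); note the hardware still decides which blocks run when:

```cuda
// Sketch: two "kernels" fused into one, dispatched by block index.
__global__ void fused(float *a, float *b, int n, int blocksForA)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (blockIdx.x < blocksForA) {
        // "Kernel A": e.g. scale array a.
        if (tid < n) a[tid] *= 2.0f;
    } else {
        // "Kernel B": recompute a linear thread id relative to B's blocks.
        int btid = threadIdx.x + blockDim.x * (blockIdx.x - blocksForA);
        if (btid < n) b[btid] += 1.0f;
    }
}
```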
Old 11-Jun-2008, 05:46   #4
armchair_architect
Member

Join Date: Nov 2006
Posts: 128

Quote:
Originally Posted by pcchen View Post
Furthermore, a read-write cache would require a cache coherence protocol which is not a trivial thing to do for 16 MPs.
It doesn't, actually. There's always the option of putting the coherence burden on the programmer with memory barriers and explicit invalidate or uncached load commands. General-purpose processors (and the programmers that write for them) have always assumed cache coherence, but it's not as universal on more exotic high-performance processors.
Old 11-Jun-2008, 18:36   #5
pcchen
Moderator

Join Date: Feb 2002
Location: Taiwan
Posts: 2,467

I don't think it's worth the trouble to implement a non-coherent read-write cache on a GPU (or on almost anything else), as it's much more error-prone. Just IMHO, though.
Old 11-Jun-2008, 21:46   #6
MfA
Regular

Join Date: Feb 2002
Posts: 5,520

Quote:
Originally Posted by pcchen View Post
Since it is just a general presentation for CUDA, I don't think a line like "global memory not cached on G8x GPU" means anything about future GPUs. Furthermore, a read-write cache would require a cache coherence protocol which is not a trivial thing to do for 16 MPs.
It's very trivial when there is only a single cache (or a single cache per partition).
Old 12-Jun-2008, 02:55   #7
pcchen
Moderator

Join Date: Feb 2002
Location: Taiwan
Posts: 2,467

I don't know what you mean by "single cache." To my understanding, if every MP has its own cache and you want an efficient cache coherence protocol (not just some simple broadcasting scheme), then it's not trivial.
Old 12-Jun-2008, 03:53   #8
MfA
Regular

Join Date: Feb 2002
Posts: 5,520

A single shared cache.
Old 12-Jun-2008, 04:03   #9
pcchen
Moderator

Join Date: Feb 2002
Location: Taiwan
Posts: 2,467

Ok. But making a single shared cache multi-ported for 16 processors is not going to be easy.
Old 17-Jun-2008, 11:37   #10
Marks
Registered

Join Date: Jun 2008
Posts: 1
Cache

Using the texture cache is complicated in real workloads, since it has a relatively small footprint and requires 2D spatial locality of accesses. There's also the very specialized constant memory, which is also cached but is designed for a very specific access pattern...

Other than that, G80 indeed has no global memory cache. The fast on-chip memory on each MP (shared memory) can be used as a user-managed cache, though that's not straightforward. The clear advantage of this memory being user-managed rather than hardware-managed is that data replacement is coordinated by the kernel itself, in conjunction with the computation algorithm executed by all 512 threads together. Adding a hardware cache instead would likely reduce performance. That said, adding a hardware read-only cache would improve the usability of CUDA, although at the expense of some performance.
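The classic form of that user-managed caching is tile staging: each block loads a slice of global memory into shared memory once, then every thread reuses it on-chip. A minimal sketch (a 3-point moving average; the kernel name and tile size are mine, and blockDim.x is assumed equal to TILE):

```cuda
// Sketch: shared memory as a user-managed cache. The kernel, not the
// hardware, decides what gets staged and when it is replaced.
#define TILE 256

__global__ void blur3(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];           // +2 for halo elements

    int g = blockIdx.x * TILE + threadIdx.x;   // global index
    // Cooperative load; out-of-range slots are zero-filled.
    tile[threadIdx.x + 1] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == TILE - 1)
        tile[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();

    // Each input element is read from DRAM once but used three times.
    if (g < n)
        out[g] = (tile[threadIdx.x] + tile[threadIdx.x + 1] +
                  tile[threadIdx.x + 2]) / 3.0f;
}
```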

It's hard to believe that a write cache will be added in the new generation: coherence would be crucial there, but it's obviously totally impractical.
Old 17-Jun-2008, 16:11   #11
MfA
Regular

Join Date: Feb 2002
Posts: 5,520

Without scatter it is practical, seeing as ATI is already doing it.
Old 19-Jun-2008, 00:08   #12
Tim Murray
the Windom Earle of GPUs

Join Date: May 2003
Location: Mountain View, CA
Posts: 3,277

http://forums.nvidia.com/index.php?showtopic=70171

Double precision, shared memory atomics, 64-bit global memory atomics, double the number of registers, up to 32 active warps per SM, and up to 1024 active threads per SM.
Old 19-Jun-2008, 01:46   #13
Jawed
Regular

Join Date: Oct 2004
Location: London
Posts: 9,948

Shame that section 5.1 doesn't describe the throughputs for double-precision instructions.

Ooh, interesting: appendix A lists the compute capabilities, showing a nice big gap for 1.2 GPUs, i.e. GPUs that are the same as 1.3 but don't have double precision.

Jawed
Old 19-Jun-2008, 11:08   #14
pcchen
Moderator

Join Date: Feb 2002
Location: Taiwan
Posts: 2,467

It's interesting to see that the basic double-precision operations (add, sub, mul, div, and sqrt) are IEEE 754 compliant, and that add and mul support all four rounding modes. Single-precision operations are still not 754 compliant, though (div is implemented as multiplication by the reciprocal, so it's not accurate to the 754 standard).
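Those rounding modes are exposed per-operation through device intrinsics rather than a global rounding-mode register. A minimal sketch, assuming a CC 1.3 device (the kernel name is mine):

```cuda
// Sketch: double-precision add with each of the four IEEE 754 rounding
// modes, selected per instruction via CUDA intrinsics (CC 1.3+).
__global__ void rounding_demo(double a, double b, double *out)
{
    out[0] = __dadd_rn(a, b);  // round to nearest even
    out[1] = __dadd_rz(a, b);  // round toward zero
    out[2] = __dadd_ru(a, b);  // round toward +infinity
    out[3] = __dadd_rd(a, b);  // round toward -infinity
}
```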
