nvidia 2017 gpu speculation :)
rpg.314
09-Dec-2009, 14:16
Nvidia 2017 gpu speculation by Bill Dally (http://www.nvidia.com/content/GTC/documents/SC09_Dally.pdf)
Slide 34
A few latency-optimized processors
2,400 throughput cores (7,200 FPUs), 16 CPUs – single chip
40TFLOPS (SP) 13TFLOPS (DP)
Deep, explicit on-chip storage hierarchy
Fast communication and synchronization
512GB Phase-change/Flash for checkpoint and scratch
So a bunch of on-chip ARM cores, lots of on-board flash for fast hard disk emulation.
trinibwoy
09-Dec-2009, 19:39
Trying to understand exactly what they're getting at here. They claim that higher efficiency comes from locality of data and to achieve that the recommended steps are to:
Provide rich, exposed storage hierarchy
Explicitly manage data movement on this hierarchy
Is that just a fancy way of saying software managed cache? Or is there something about the new setup that also improves off-chip bandwidth efficiency?
rpg.314
09-Dec-2009, 20:48
Trying to understand exactly what they're getting at here. They claim that higher efficiency comes from locality of data and to achieve that the recommended steps are to:
Provide rich, exposed storage hierarchy
Explicitly manage data movement on this hierarchy
Is that just a fancy way of saying software managed cache? Or is there something about the new setup that also improves off-chip bandwidth efficiency?
I think with the recent emergence of GPUs and LRB's extensive cache management instructions, it is clear that data locality will come to be managed in software, either at the programmer level or at the runtime level.
trinibwoy
09-Dec-2009, 23:35
That's an interesting dilemma given that many workloads are large enough that items are evicted long before they are requested again, which sort of makes locality a moot point.
Nvidia has a patent that tries to address this by profiling memory requests over the course of a frame in the case of 3D rendering. The aim is to use this metadata to know what will be needed at the start of the next frame and precache it before it's needed. Can't see how that would work in the general case though.
rpg.314
10-Dec-2009, 09:52
That's an interesting dilemma given that many workloads are large enough that items are evicted long before they are requested again, which sort of makes locality a moot point.
It seems that nv is going forward with its approach of registers+shared+cached global memory. This is like the vertex/pixel shader divide all over again. There are 3 on-chip memory pools, all doing basically the same thing, and yet not unified. LRB was a giant leap forward in this regard.
What should I do if I need 18 KB of shared mem per block? Use the 48 KB shared mem option on Fermi and you just lost 30 KB of L1 cache. :roll: It's right there and yet you can't touch it. :evil:
What if I use 1 more register per thread? The number of warps in flight crashes, taking (a good chunk of?) the latency hiding with it.
As for arrays private to a thread, it is even more troublesome. All of it is spilled to off-chip memory. Fermi helps with caches, but there is 128 KB worth of register file idling there. All you can use is the 48 KB of L1 cache.
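The shared memory complaint above can be put into numbers. This is a rough sketch, assuming the two Fermi splits of the 64 KB per-SM array (16 KB shared + 48 KB L1, or 48 KB shared + 16 KB L1) and a single resident block per SM; `pick_split` is a made-up helper name, not a CUDA API.

```python
# Assumed Fermi-style splits of the 64 KB per-SM array: (shared KB, L1 KB).
FERMI_SPLITS = [(16, 48), (48, 16)]


def pick_split(shared_kb_needed):
    """Return (shared_kb, l1_kb, stranded_kb) for the smallest split that fits.

    stranded_kb is shared memory that is reserved but unused, and that
    cannot be handed back to the L1 cache.
    """
    for shared_kb, l1_kb in FERMI_SPLITS:
        if shared_kb_needed <= shared_kb:
            return shared_kb, l1_kb, shared_kb - shared_kb_needed
    raise ValueError("request exceeds the largest shared memory split")


if __name__ == "__main__":
    print(pick_split(16))  # fits the small split, nothing stranded
    print(pick_split(18))  # 2 KB over: forced into the 48 KB split
```

Going from 16 KB to 18 KB of shared memory flips the configuration and leaves 30 KB that is neither used as shared memory nor available as cache, which is the cliff being described.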
It seems that nv is going forward with its approach of registers+shared+cached global memory. This is like the vertex/pixel shader divide all over again. There are 3 on-chip memory pools, all doing basically the same thing, and yet not unified. LRB was a giant leap forward in this regard.
But can it scale? Do you think it's economical to make a, say, 128 core CPU, with complete cache coherence?
How many supercomputers are cache coherent? To my understanding, only SGI's Altix are single image cache coherent design. And they are really expensive (compared to other supercomputers with similar number of cores).
What should I do if I need 18 KB of shared mem per block? Use the 48 KB shared mem option on Fermi and you just lost 30 KB of L1 cache. :roll: It's right there and yet you can't touch it. :evil:
Isn't this just a design decision? I mean, you can always modify your algorithm to use more (or less) shared memory, and you can easily test which one is faster.
What if I use 1 more register per thread? The number of warps in flight crashes, taking (a good chunk of?) the latency hiding with it.
And a normal, cache coherent CPU is different? I mean, if your algorithm uses more localized memory than the cache size of the CPU, you are probably going to lose a lot of performance. How's that different?
As for arrays private to a thread, it is even more troublesome. All of it is spilled to off-chip memory. Fermi helps with caches, but there is 128 KB worth of register file idling there. All you can use is the 48 KB of L1 cache.
This is a design problem unrelated to cache coherence (or "unified memory"). To my understanding, RV770/RV870 support indexed registers. Also, you can still use shared memory for indexed array.
rpg.314
10-Dec-2009, 12:32
But can it scale? Do you think it's economical to make a, say, 128 core CPU, with complete cache coherence?
Caches != coherence, :wink:, as shown by fermi. You can have semi coherent caches.
How many supercomputers are cache coherent? To my understanding, only SGI's Altix are single image cache coherent design. And they are really expensive (compared to other supercomputers with similar number of cores).
FWIW, I am doubtful of scalability of full coherence in hw.
Isn't this just a design decision? I mean, you can always modify your algorithm to use more (or less) shared memory, and you can easily test which one is faster.
Yes, I can do that. And that is one of the reasons (if not the reason) why gpu programming is full of steep performance cliffs. A good architecture should allow for a graceful degradation of performance.
And a normal, cache coherent CPU is different? I mean, if your algorithm uses more localized memory than the cache size of the CPU, you are probably going to lose a lot of performance. How's that different?
Yes, but the falloff is much less rapid than on GPUs. The bigger point is that by unifying these 3 memory pools, overall utilization and hence efficiency will be higher. How many gpu programs of today max out both the shared mem and the register file? The number of blocks per SM in today's programs is the min of those allowed by reg file usage and shared mem usage, so >80% of the time, one of the two is going to waste (in whatever amount). Unifying them will allow the last few bits left over to be used by the hw managed cache (to whatever degree of coherence).
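The "min of the two limits" argument can be sketched as a small calculator. The defaults are assumptions for illustration: a 128 KB register file counted as 32768 32-bit registers, and 48 KB of shared memory per SM. Real hardware has further limits (maximum warps and blocks per SM, allocation granularity) that this deliberately ignores.

```python
def blocks_per_sm(regs_per_thread, threads_per_block, shared_per_block,
                  regfile_regs=32768, shared_bytes=49152):
    """Return (resident_blocks, idle_registers, idle_shared_bytes).

    Resident blocks are the minimum of what the register file allows and
    what shared memory allows; whichever pool is not the binding limit
    ends up partly idle.
    """
    regs_per_block = regs_per_thread * threads_per_block
    by_regs = regfile_regs // regs_per_block
    # A kernel that uses no shared memory is limited by registers only.
    by_shared = shared_bytes // shared_per_block if shared_per_block else by_regs
    blocks = min(by_regs, by_shared)
    idle_regs = regfile_regs - blocks * regs_per_block
    idle_shared = shared_bytes - blocks * shared_per_block
    return blocks, idle_regs, idle_shared


if __name__ == "__main__":
    # Hypothetical kernel: 20 registers/thread, 256 threads/block, 18 KB shared.
    print(blocks_per_sm(20, 256, 18 * 1024))
```

For the hypothetical kernel shown, shared memory is the binding limit at 2 blocks, so more than two thirds of the register file sits unused.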
This is a design problem unrelated to cache coherence (or "unified memory"). To my understanding, RV770/RV870 support indexed registers. Also, you can still use shared memory for indexed array.
Yes, that is one of the great things about amd gpu's. Just because I need a local array, shouldn't mean that all of my context is spilled to off-chip ram.
2,400 throughput cores (7,200 FPUs)
Oh hey, 4 wide cores.
rpg.314
10-Dec-2009, 13:34
Oh hey, 4 wide cores.
Wouldn't that be 3 wide?
The bigger point is that by unifying these 3 memory pools, overall utilization and hence efficiency will be higher. How many gpu programs of today max out both the shared mem and the register file? The number of blocks per SM in today's programs is the min of those allowed by reg file usage and shared mem usage, so >80% of the time, one of the two is going to waste (in whatever amount). Unifying them will allow the last few bits left over to be used by the hw managed cache (to whatever degree of coherence).
Even a register-only GPU has wastage - e.g. in RV670 the granularity of register allocation means that if 4 hardware threads are allocated, 64 vec4 registers are available, but if 5 hardware threads are allocated 51 vec4 registers are available.
So if a kernel allocates 52 vec4 registers, then the hardware can only have 4 threads in flight. This then means that 12 registers per thread are going to waste (since 64 are available and only 52 are used).
Jawed
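The allocation-granularity arithmetic can be checked with a short sketch. It assumes a pool of 256 vec4 registers divided evenly among hardware threads, a figure inferred from the numbers quoted (256 // 4 = 64, 256 // 5 = 51) rather than stated in the thread.

```python
POOL = 256  # assumed vec4 registers shared by all hardware threads


def regs_available(threads):
    """Registers each hardware thread gets when `threads` are allocated."""
    return POOL // threads


def threads_in_flight(regs_per_thread):
    """Most hardware threads that fit for a kernel of this register count."""
    return POOL // regs_per_thread


def slack_per_thread(regs_per_thread):
    """Registers per thread that are allocated but never used."""
    threads = threads_in_flight(regs_per_thread)
    return regs_available(threads) - regs_per_thread


if __name__ == "__main__":
    for regs in (51, 52, 64):
        print(regs, threads_in_flight(regs), slack_per_thread(regs))
```

Under that assumption a 51-register kernel runs 5 threads with no slack, while a 52-register kernel drops to 4 threads and leaves 64 - 52 = 12 registers idle in each.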
rpg.314
10-Dec-2009, 14:14
Even a register-only GPU has wastage - e.g. in RV670 the granularity of register allocation means that if 4 hardware threads are allocated, 64 vec4 registers are available, but if 5 hardware threads are allocated 51 vec4 registers are available.
So if a kernel allocates 52 vec4 registers, then the hardware can only have 4 threads in flight. This then means that 12 registers per thread are going to waste (since 64 are available and only 52 are used).
Jawed
Good point. This means that if shared mem is added to the mix, the wastage (or possibilities for wastage) only increase.
I wonder what amd has in mind to fix this, nv seems to be pushing forward with what they have now.
trinibwoy
10-Dec-2009, 14:32
The bigger point is that by unifying these 3 memory pools, overall utilization and hence efficiency will be higher.
You seem to be glossing over the unpredictability and inefficiencies of hardware managed caches. Capacity utilization doesn't really mean squat if you're constantly churning the cache. It's bandwidth utilization that matters.
Ugh, yes ... I should have said 3.
A quote from "The Case for Simple, Visible Cache Coherency" (Appeal to authority, I know.)
Yet given the scaling problems mentioned above, it is unlikely
that protocols alone will solve this scaling problem.
Worse is the fact that more complex protocols often use
additional resources or add new features that can have unintended
consequences in the hardware design. We believe
that scaling applications to a large number of processors will
always require tuning the applications, and we should focus
our system designs to make this tuning process easier and
create machines optimized to run these tuned applications.
Both of these goals favor simpler protocols.
The problem with snooping is that it turns into hidden directories when scaled ... that kind of opaque, heavy-handed architecture is the opposite of what is needed for high throughput computing. Once you have virtual memory, keeping things coherent is pretty trivial, though at low performance. The trivial case simply assigns a page to a core (or an edge cache connected to a memory bus) and routes all reads & writes through it. Coherence itself should not be given up.
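The "assign a page to a core and route everything through it" scheme might look like the toy model below. The modulo mapping from page to home core, the 4 KB page size, and the class name are all assumptions made for illustration; the post does not say how pages would be assigned.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageHomedMemory:
    """Toy model: every page has one home core that serves all accesses.

    Only one copy of any address exists, so reads always see the latest
    write. The cost is that non-home cores pay a remote hop every time.
    """

    def __init__(self, n_cores):
        self.n_cores = n_cores
        self.store = [dict() for _ in range(n_cores)]  # one store per home
        self.remote_hops = 0

    def home(self, addr):
        # Hypothetical policy: interleave pages across cores.
        return (addr // PAGE_SIZE) % self.n_cores

    def _route(self, core, addr):
        home = self.home(addr)
        if home != core:
            self.remote_hops += 1
        return self.store[home]

    def write(self, core, addr, value):
        self._route(core, addr)[addr] = value

    def read(self, core, addr):
        return self._route(core, addr).get(addr, 0)


if __name__ == "__main__":
    mem = PageHomedMemory(n_cores=4)
    mem.write(0, 8192, 7)     # core 0 writes a word homed on core 2
    print(mem.read(3, 8192))  # core 3 reads the same word via core 2
    print(mem.remote_hops)
```

Correctness is trivial because nothing is ever replicated; the remote hop counter is where the "low performance" shows up.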
rpg.314
10-Dec-2009, 14:49
You seem to be glossing over the unpredictability and inefficiencies of hardware managed caches. Capacity utilization doesn't really mean squat if you're constantly churning the cache. It's bandwidth utilization that matters.
Fair enough, but registers/shared mem/whatever can be locked down using lrb-like cache line locking, which can be done at compile time.
3dilettante
10-Dec-2009, 17:10
Fair enough, but registers/shared mem/whatever can be locked down using lrb-like cache line locking, which can be done at compile time.
Are you sure?
I've seen hints for marking a fetched line as LRU and requests for exclusivity. However, I'm not sure the exclusivity in this case is the same as locking, rather it would put the line in the E state in a MESI protocol, which would invalidate other cached copies.
If the latter case is true it wouldn't stop another core from interfering with that line at some later point, causing the line to either be invalidated or coherency snooping to be turned back on.
Caches != coherence, :wink:, as shown by fermi. You can have semi coherent caches.
By semi-coherent cache I think you mean the delayed write L1 cache? The problem is, how loose you want (or can afford) the memory access rule to be...
Yes, I can do that. And that is one of the reasons (if not the reason) why gpu programming is full of steep performance cliffs. A good architecture should allow for a graceful degradation of performance.
Right now, the most "steep" performance cliff on GPU is basically the global memory. On GT200 it's not that bad anymore. I'd say Fermi will be much better.
Yes, but the falloff is much less rapid than on gpu's. The bigger point is that by unifying these 3 memory pools, overall utilization and hence efficiency will be higher.
By "efficiency" I think you ignore the cost of implementing a real cache instead of shared memory. I mean, if by using shared memory you can have twice the size compared to a cache, I think it's probably not that inefficient. Furthermore, isn't that why Fermi provides the ability to change the configuration of the L1 cache/shared memory system?
3dilettante
10-Dec-2009, 18:17
There are also granularity and addressing concerns with caches as implemented in most CPUs.
The smallest unit of transfer for a cache is a line, which for Larrabee is 64 bytes.
For unaligned items, it may require 2 separate lines and 2 separate accesses.
Aligning data can involve padding the data structure, which may make a cache line appear "unwasted", but also effectively have less algorithmic meaning.
As people enjoy bringing up pathological cases, a data set that happens to have a stride that matches the indexing of the cache can effectively cut capacity for a given load down to that single associative set for anything but a fully associative cache.
The register file has wastage, but it already enforces a rigid level of alignment and it has a finer granularity.
Shared memory as currently implemented has that finer granularity, though it lacks the raw bandwidth and depending on the situation can have longer latency.
The explicit memory pools give up some flexibility and grace for potentially cleaner scaling in capacity, transistor cost, latency, and power consumption.
With power draw becoming the ceiling, a failure to mitigate the big active source of it simply negates the point to having the large peak resources there in any form.
CarstenS
12-Dec-2009, 11:21
Ugh, yes ... I should have said 3.
And it's one third DP - which is NOT Fermi :) So, are they going to scale it backwards a bit?
rpg.314
12-Dec-2009, 13:10
There are also granularity and addressing concerns with caches as implemented in most CPUs.
The smallest unit of transfer for a cache is a line, which for Larrabee is 64 bytes.
This can be fixed by making sizeof(float)*vector-width=cacheline size (as it probably is on lrb). This way, all "register" access will be fast and only addressing local arrays will possibly incur a warp serialization penalty.
As people enjoy bringing up pathological cases, a data set that happens to have a stride that matches the indexing of the cache can effectively cut capacity for a given load down to that single associative set for anything but a fully associative cache.
This sort of thing happens often enough with shared mem bank conflicts in CUDA.
rpg.314
12-Dec-2009, 13:13
Are you sure?
I've seen hints for marking a fetched line as LRU and requests for exclusivity. However, I'm not sure the exclusivity in this case is the same as locking, rather it would put the line in the E state in a MESI protocol, which would invalidate other cached copies.
If the latter case is true it wouldn't stop another core from interfering with that line at some later point, causing the line to either be invalidated or coherency snooping to be turned back on.
How about having an instruction which marks these lines as most recently used, until flushed by the driver?
Or whatever lrb uses? I am sure they can lock cachelines. Also the xbox cpu can lock cache lines to provide fast memory access. I don't know all the details, but I am sure it is a worked out problem.
rpg.314
12-Dec-2009, 13:16
By semi-coherent cache I think you mean the delayed write L1 cache? The problem is, how loose you want (or can afford) the memory access rule to be...
I had something like fermi's caches in mind. I admit I haven't really understood fermi's cache mechanism fully.
To my understanding, Fermi's L2 cache is relatively simple. It's bound to the memory controller, so there's no cache coherence problem (each memory controller has its own cache). The L1 cache is mostly read-only, but it seems to have a loose write-back policy (such as only performing write-back when thread synchronization is requested). Since the thread execution order is undefined (i.e. you can't depend on program execution being consistent if it depends on thread execution order), this memory access rule is good enough. Such a loose memory access rule is not uncommon among RISC CPUs.
x86 CPUs, on the other hand, have a pretty strict memory access rule.
3dilettante
14-Dec-2009, 13:40
This can be fixed by making sizeof(float)*vector-width=cacheline size (as it probably is on lrb). This way, all "register" access will be fast and only addressing local arrays will possibly incur a warp serialization penalty.
This would not be the case for the scalar side, or anything but a trivial gather/scatter to one cache line.
This sort of thing happens often enough with shared mem bank conflicts in CUDA.
Bank conflicts cut bandwidth for shared memory, but you can still use all of it.
The rare pathological case where data access strides hit the same set make it effectively no larger than that single set.
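The difference between the two failure modes can be sketched numerically. The geometry is hypothetical: a 32 KB, 8-way cache with 64-byte lines (so 64 sets), and 16 shared memory banks of 4-byte words serving 16 threads at a time. Neither is claimed to match any specific chip.

```python
from collections import Counter

LINE_BYTES = 64
N_SETS = 64
WAYS = 8  # 64 sets * 8 ways * 64 B = 32 KB


def sets_touched(stride_bytes, n_accesses):
    """Number of distinct cache sets hit by a strided access stream."""
    return len({(i * stride_bytes // LINE_BYTES) % N_SETS
                for i in range(n_accesses)})


def usable_cache_bytes(stride_bytes, n_accesses):
    """Capacity actually reachable by that stream."""
    return sets_touched(stride_bytes, n_accesses) * WAYS * LINE_BYTES


def bank_conflict_degree(stride_words, n_threads=16, n_banks=16):
    """Worst-case number of threads that collide on one bank."""
    banks = Counter((t * stride_words) % n_banks for t in range(n_threads))
    return max(banks.values())


if __name__ == "__main__":
    print(usable_cache_bytes(64, 64))    # unit-line stride: whole cache
    print(usable_cache_bytes(4096, 64))  # stride = sets * line: one set only
    print(bank_conflict_degree(1))       # conflict-free
    print(bank_conflict_degree(16))      # fully serialized
```

A bad stride in shared memory serializes the access (16 threads on one bank) but every byte stays addressable, whereas the bad cache stride shrinks the reachable capacity from 32 KB to 8 lines.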
rpg.314
14-Dec-2009, 14:28
This would not be the case for the scalar side, or anything but a trivial gather/scatter to one cache line.
I didn't get your point. Wouldn't my suggestion make register access a "trivial gather/scatter to one cache line"?
3dilettante
14-Dec-2009, 14:51
A scatter/gather instruction wants to find 16 separate data words, which in the worst case for Larrabee can mean 16 separate cache lines must be filled to complete the instruction.
The trivial case is when the various target addresses happen to fall on the same cache line.
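A quick way to see the best and worst cases is to count the cache lines a 16-element gather touches. This sketch assumes 64-byte lines and 4-byte elements, as discussed above; `lines_touched` is an illustrative helper, not part of any instruction set.

```python
LINE_BYTES = 64


def lines_touched(addresses, elem_bytes=4):
    """Count distinct cache lines covered by a set of element accesses.

    An element that straddles a line boundary counts against both lines.
    """
    lines = set()
    for addr in addresses:
        lines.add(addr // LINE_BYTES)
        lines.add((addr + elem_bytes - 1) // LINE_BYTES)
    return len(lines)


if __name__ == "__main__":
    aligned = [4 * i for i in range(16)]        # one aligned vector
    shifted = [4 + 4 * i for i in range(16)]    # same vector, off by 4 bytes
    scattered = [64 * i for i in range(16)]     # one element per line
    print(lines_touched(aligned), lines_touched(shifted),
          lines_touched(scattered))
```

The aligned vector is the trivial single-line case, a 4-byte misalignment already doubles the line count, and the fully scattered pattern needs all 16 lines.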
rpg.314
14-Dec-2009, 17:37
A scatter/gather instruction wants to find 16 separate data words, which in the worst case for Larrabee can mean 16 separate cache lines must be filled to complete the instruction.
The trivial case is when the various target addresses happen to fall on the same cache line.
Well that's all you need to "acceptably" push the registers to the cache, right?
GPU's too have bandwidth penalties if the access patterns are not regular.
3dilettante
14-Dec-2009, 18:23
Well that's all you need to "acceptably" push the registers to the cache, right?
If you want to fill/spill registers to the L1, I think you'd go for an aligned vector store or load, which works at the granularity of the SIMD registers and cache and doesn't cross line boundaries.
Accesses that do not align will inflict storage and bandwidth penalties in a single store or load situation, which can be a doubled cost that will probably be avoided as much as possible.
If for some reason someone did do a scatter to push out a register, it could inflict up to a 16x capacity penalty, as well as waste a lot of bandwidth (up to 15x16 bytes of bandwidth in total) that could have been used elsewhere.
The capacity penalty could be mitigated if the structure is packed with 15 other threads' storage, but that's a very roundabout way of doing the same thing as mentioned in the first paragraph.
GPU's too have bandwidth penalties if the access patterns are not regular.
I'm not saying they don't. I'm saying the cache line-based method with Larrabee is coarser, and that a more complex hierarchy of register/shared memory/cache allows for finer granularity at the cost of flexibility and additional demands on the software.
rpg.314
14-Dec-2009, 19:39
If you want to fill/spill registers to the L1, I think you'd go for an aligned vector store or load, which works at the granularity of the SIMD registers and cache and doesn't cross line boundaries.
Accesses that do not align will inflict storage and bandwidth penalties in a single store or load situation, which can be a doubled cost that will probably be avoided as much as possible.
I think lrb shows quite clearly that aligned vector load-store is fine for storing thread context in cache instead of registers.
3dilettante
14-Dec-2009, 20:01
I think I've possibly lost track of your point.
rpg.314
15-Dec-2009, 15:43
My point was that by unifying the 3 on-chip memory pools flexibility and utilization would increase. The registers and shared memory can be conveniently stored in caches with lrb style cache locking instructions. Fermi like semi-coherent caches will allow for future scalability.
Later on, I was making the point that by making a vectorwidth equal to a cacheline size, one can ensure fast register loads-stores.
Since storing context in L1 instead of gpu registers will be managed by the compiler, it can be ensured that it is an aligned scatter gather from a single cache line.
Since execution order of threads within a block is undefined, a loose memory consistency model is fine.
My point was that by unifying the 3 on-chip memory pools flexibility and utilization would increase. The registers and shared memory can be conveniently stored in caches with lrb style cache locking instructions. Fermi like semi-coherent caches will allow for future scalability.
You can't really replace registers with cache if you want very fast thread switching time in order to hide latency. In current GPU, context switch is very fast because there is almost no data movement (everything is stored in registers). However, if you have to spill thread contexts into cache, you are going to have at least 2 or 3 cycles of delay (or much more if you are using more registers). That's like 2 or 3 times worse performance. Or of course you can make the cache bandwidth extremely large, but that will increase cost further.
As for using cache instead of shared memory, again it's possible but with added costs. Also, it's going to be slower compared to a banked shared memory.
3dilettante
15-Dec-2009, 18:10
My point was that by unifying the 3 on-chip memory pools flexibility and utilization would increase. The registers and shared memory can be conveniently stored in caches with lrb style cache locking instructions. Fermi like semi-coherent caches will allow for future scalability.
You assert Larrabee has line-locking instructions, can you point out which ones those are in the list Intel has published?
Later on, I was making the point that by making a vectorwidth equal to a cacheline size, one can ensure fast register loads-stores.
That removes the finer granularity possible for the register file and shared memory. It makes everything bow down to the lowest common denominator.
SIMD lanes still operate on 32 or 64 bit values, as will the scalar pipeline and gather/scatter.
Using something like shared memory to communicate between threads on different parts of the chip can become much more expensive in terms of bandwidth and power if it has to use a cache line to do so.
Since storing context in L1 instead of gpu registers will be managed by the compiler, it can be ensured that it is an aligned scatter gather from a single cache line.
You wouldn't use a scatter or gather for moving registers to memory. That's using a more complicated instruction when a regular vector load or store would suffice.
You can't really replace registers with cache if you want very fast thread switching time in order to hide latency. In current GPU, context switch is very fast because there is almost no data movement (everything is stored in registers). However, if you have to spill thread contexts into cache, you are going to have at least 2 or 3 cycles of delay (or much more if you are using more registers). That's like 2 or 3 times worse performance. Or of course you can make the cache bandwidth extremely large, but that will increase cost further.
That's why there's 4 hardware threads.
Fact is, GPUs fall off a cliff once thread context exceeds a certain proportion (total shader latency derived) of register file capacity. As far as I can tell there's been zero effort in the compilers to optimise register spill - so it's hard to say how well GPUs can be made to cope with this serious problem.
In my view the huge register file, high hardware thread count switching model of current GPUs is doomed. That is, the costs of supporting a huge register file and anywhere from 1 to ~100 hardware threads become a burden as kernel context size increases - e.g. you have a scoreboard for a hundred threads per core that mostly idles when the core can only support a few threads.
In theory current GPUs can hide the latency of spills (and re-fills) but since the memory hierarchy has to be generalised for other reasons, it seems to me newer GPUs might as well use the Larrabee model of a true "pipeline working set" register file backed by a cache hierarchy that can hide systematic spill/re-fill latencies.
Jawed
You assert Larrabee has line-locking instructions, can you point out which ones those are in the list Intel has published?
Here's everything related to cache:
_MM_MEM_HINT_ENUM – Constants used by all operations that read or write memory for non-temporal hint:
_MM_HINT_NONE No memory hint
_MM_HINT_NT Nontemporal memory hint
Cache instructions:
CLEVICT1
CLEVICT2
PREFETCH1
PREFETCH2
_MM_PREFETCH_HINT_ENUM – Constants used by PREFETCH1 and PREFETCH2:
_MM_PFHINT_NONE No prefetch hint
_MM_PFHINT_EX Mark cacheline exclusive
_MM_PFHINT_NT Nontemporal data hint
_MM_PFHINT_EX_NT Mark cacheline exclusive and load with nontemporal data hint
_MM_PFHINT_MISS Miss hint
_MM_PFHINT_EX_MISS Mark cacheline exclusive and load with miss hint
_MM_PFHINT_NT_MISS Load with nontemporal data and miss hints
_MM_PFHINT_EX_NT_MISS Mark cacheline exclusive and load with nontemporal data and miss hints
Jawed
3dilettante
15-Dec-2009, 19:25
I made comment on that terminology earlier, in that the word exclusive already has a meaning within a MESI cache protocol, and that established meaning is not equivalent to locking the line.
rpg.314
16-Dec-2009, 05:29
You can't really replace registers with cache if you want very fast thread switching time in order to hide latency. In current GPU, context switch is very fast because there is almost no data movement (everything is stored in registers). However, if you have to spill thread contexts into cache, you are going to have at least 2 or 3 cycles of delay (or much more if you are using more registers). That's like 2 or 3 times worse performance. Or of course you can make the cache bandwidth extremely large, but that will increase cost further.
As for using cache instead of shared memory, again it's possible but with added costs. Also, it's going to be slower compared to a banked shared memory.
No, if you can dual issue a vector store with a vector alu op, you can implement zero-overhead software thread switch with just a 4 cycle cache.
rpg.314
16-Dec-2009, 05:34
Are you sure?
I've seen hints for marking a fetched line as LRU and requests for exclusivity. However, I'm not sure the exclusivity in this case is the same as locking, rather it would put the line in the E state in a MESI protocol, which would invalidate other cached copies.
If the latter case is true it wouldn't stop another core from interfering with that line at some later point, causing the line to either be invalidated or coherency snooping to be turned back on.
Why should the other cores interfere on that cache line? It holds thread specific data which is accessed by only the owner thread. So other cores never read or write from that line.
That's why there's 4 hardware threads.
And what happens if you need to hide latency with more than 4 hardware threads? Won't there be a "cliff" here?
Fact is, GPUs fall off a cliff once thread context exceeds a certain proportion (total shader latency derived) of register file capacity. As far as I can tell there's been zero effort in the compilers to optimise register spill - so it's hard to say how well GPUs can be made to cope with this serious problem.
I keep hearing about this "fall off a cliff" on GPUs. But in reality it's not as steep as many may think. I've seen a CUDA kernel with only 32 threads running fine (because it doesn't have to hide too much latency). The requirement of "at least 192 threads" is there to make sure that all ALU latency can be hidden; it's not always a requirement for good performance. Heck, most current OOOE superscalar CPUs can't even hide this kind of latency very well.
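Where a figure like 192 could come from can be sketched with back-of-the-envelope arithmetic. The inputs are assumptions not stated in the thread: a dependent-instruction latency of roughly 24 cycles, 32-thread warps, and 8 ALU lanes per multiprocessor, so one warp instruction occupies the ALUs for 4 cycles.

```python
import math


def threads_to_hide(latency_cycles=24, warp_size=32, lanes=8):
    """Threads needed so the ALUs always have an independent warp to issue.

    Each warp instruction keeps the lanes busy for warp_size / lanes cycles,
    so enough warps are needed to cover the full latency with issue slots.
    """
    issue_cycles = warp_size // lanes
    warps = math.ceil(latency_cycles / issue_cycles)
    return warps * warp_size


if __name__ == "__main__":
    print(threads_to_hide())                  # assumed 24-cycle latency
    print(threads_to_hide(latency_cycles=4))  # little latency to hide
```

With these assumed numbers, six warps cover the pipeline; a kernel with little dependent latency to hide gets by with a single warp, which fits the 32-thread observation.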
In my view the huge register file, high hardware thread count switching model of current GPUs is doomed. That is, the costs of supporting a huge register file and anywhere from 1 to ~100 hardware threads become a burden as kernel context size increases - e.g. you have a scoreboard for a hundred threads per core that mostly idles when the core can only support a few threads.
The reality is that the so-called "huge register file" is actually not a huge register file. I don't think GT200 or any other GPU actually has an 8192 or 16384 entry register file, because that's too big to be economical. It's most likely implemented with a quick SRAM which supports fast load/store with a real, smaller register file. So in a sense it's a fast register spill buffer.
In theory current GPUs can hide the latency of spills (and re-fills) but since the memory hierarchy has to be generalised for other reasons, it seems to me newer GPUs might as well use the Larrabee model of a true "pipeline working set" register file backed by a cache hierarchy that can hide systematic spill/re-fill latencies.
I don't think it's better to handle register spill with a cache, rather than current internal fast pseudo register buffer. Basically, if you need many threads to hide a certain amount of latency, but you don't have enough registers (or register spill buffers) to handle this amount of threads, it's going to be slow no matter how you handle the register spill (or reduce the amount of threads).
No, if you can dual issue a vector store with a vector alu op, you can implement zero-overhead software thread switch with just a 4 cycle cache.
Suppose that a kernel uses 16 registers (for each thread or "work item"), you'll need to write 16 vector registers AND read 16 vector registers to perform a basic context switch. That takes at least 16 cycles.
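The cycle count above follows from simple port arithmetic, sketched here. It assumes the cache can accept at most one vector store and one vector load per cycle; the port counts are parameters for illustration, not known figures for any chip.

```python
def switch_cycles(live_regs, stores_per_cycle=1, loads_per_cycle=1):
    """Return (overlapped, serial) cycle counts for one context switch.

    overlapped: the outgoing thread's stores and the incoming thread's
                loads use separate ports and proceed in parallel.
    serial:     both streams share a single cache port.
    """
    spill = -(-live_regs // stores_per_cycle)  # ceiling division
    fill = -(-live_regs // loads_per_cycle)
    return max(spill, fill), spill + fill


if __name__ == "__main__":
    print(switch_cycles(16))  # the 16-register kernel from the post
    print(switch_cycles(32))  # larger contexts scale linearly
```

The "at least 16 cycles" figure is the overlapped case; with a single shared port the same switch would take twice as long.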
You'd think they would have a special instruction to push/pop sets of registers so it could be overlapped with some useful work ... shame to waste cycles on that. That said, they never gave the remotest hint that they did.
rpg.314
16-Dec-2009, 13:15
The reality is that the so-called "huge register file" is actually not a huge register file. I don't think GT200 or any other GPU actually has an 8192 or 16384 entry register file, because that's too big to be economical. It's most likely implemented with a quick SRAM which supports fast load/store with a real, smaller register file. So in a sense it's a fast register spill buffer.
Well, LRB has 192k of usable data cache per core and fermi is "supposed" to have 128K reg file + 64K (shared mem/L1 cache). So the capacity crossover has already happened. I don't know how ports affect the size of a sram block, but capacity wise they are already there.
3dilettante
16-Dec-2009, 14:05
Why should the other cores interfere on that cache line? It holds thread specific data which is accessed by only the owner thread. So other cores never read or write from that line.
Marking a cache line exclusive involves allocating the cache line in an exclusive status and invalidating any other cached copies of said line.
Potentially, the prefetch with this hint is the equivalent of a Load+Store operation where the line has an element loaded and then written immediately back, with the added bonus of not setting a dirty bit to save some writeback bandwidth on eviction.
If the chance that any other core will have had the line or ever will is zero, then what is the point of locking anything at all?
It is also the case that the cache line status is not thread-aware, and there are 4 threads that will see the cache line in the same status. This line can still be evicted, and still have its exclusive status modified if another core touches it.
If no other core touches it, then it is more performant to not try to "lock" it.
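The distinction between "exclusive" and "locked" can be illustrated with a toy textbook MESI model of a single line. This is a generic sketch, not Larrabee's actual protocol; the `prefetch_exclusive` method only stands in for what a hint like _MM_PFHINT_EX would plausibly do under that reading.

```python
class Line:
    """MESI state of one cache line as seen by each core."""

    def __init__(self, n_cores):
        self.state = ["I"] * n_cores

    def prefetch_exclusive(self, core):
        # Invalidate every other copy, then hold the line in E.
        self.state = ["I"] * len(self.state)
        self.state[core] = "E"

    def read(self, core):
        if self.state[core] != "I":
            return  # hit, no state change
        # Any M or E holder is downgraded to S (M would write back first).
        for other, s in enumerate(self.state):
            if s in ("M", "E"):
                self.state[other] = "S"
        shared = any(s != "I" for s in self.state)
        self.state[core] = "S" if shared else "E"

    def write(self, core):
        # A write invalidates all other copies, whatever state they held.
        self.state = ["I"] * len(self.state)
        self.state[core] = "M"


if __name__ == "__main__":
    line = Line(n_cores=2)
    line.prefetch_exclusive(0)
    print(line.state)  # core 0 holds the line exclusively
    line.read(1)
    print(line.state)  # exclusivity is gone after another core reads
    line.write(1)
    print(line.state)  # and core 0's copy is invalidated by a write
```

Nothing in the E state stops a later access from another core from downgrading or invalidating the line, and the model has no notion of eviction protection at all, so exclusivity alone does not amount to a lock.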
rpg.314
16-Dec-2009, 14:50
It is also the case that the cache line status is not thread-aware, and there are 4 threads that will see the cache line in the same status. This line can still be evicted, and still have its exclusive status modified if another core touches it.
If no other core touches it, then it is more performant to not try to "lock" it.
The point of locking a line is to permanently mark it as most-recently-used (or some such), so that it is never evicted.
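As a rough illustration of what "locking" would mean in replacement-policy terms, here is a toy LRU model in which pinned lines are simply skipped by eviction. This is only a sketch: the `PinnedLRUCache` class and its `access` method are hypothetical names invented for this example, and nothing here reflects an actual Larrabee instruction or documented hardware behaviour.

```python
from collections import OrderedDict


class PinnedLRUCache:
    """Toy LRU cache; pinned ("locked") lines are never chosen as victims.

    Hypothetical model for illustration only, not real hardware behaviour.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> pinned flag, oldest first

    def access(self, addr, pin=False):
        """Touch a line and return the evicted address, or None."""
        pinned = self.lines.pop(addr, False) or pin
        self.lines[addr] = pinned  # re-insert as most recently used
        if len(self.lines) <= self.capacity:
            return None
        # Victim: the least recently used unpinned line other than addr.
        # If every other line is pinned, the incoming line itself is dropped.
        victim = next(
            (a for a, p in self.lines.items() if not p and a != addr), addr
        )
        del self.lines[victim]
        return victim


cache = PinnedLRUCache(capacity=2)
cache.access(0xA, pin=True)    # oldest line, but pinned
cache.access(0xB)
evicted = cache.access(0xC)    # 0xB goes, even though 0xA is older
```

Note that pinning here only protects against capacity eviction. It says nothing about coherence traffic from other cores, which is the separate point raised above.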
3dilettante
16-Dec-2009, 15:09
The point of locking a line is to permanently mark it as most-recently-used (or some such), so that it is never evicted.
Which instruction or prefix that Intel has listed does that?
You'd think they would have a special instruction to push/pop sets of registers so it could be overlapped with some useful work ... shame to waste cycles on that. That said, they never gave the remotest hint that they did.
Even if they have specific instructions for doing that, it will still be limited by cache bandwidth. So you can only switch context every 16 cycles at best, and there can't be any memory access in these cycles. Basically I don't think this is a good idea. It's probably better to just forget about context switch, and just do plain old software pipelining or similar tricks (such as using loop unrolling to hide latency).
Well, LRB has 192K of usable data cache per core and Fermi is "supposed" to have a 128K reg file + 64K (shared mem/L1 cache). So the capacity crossover has already happened. I don't know how ports affect the size of an SRAM block, but capacity-wise they are already there.
I don't know how large Larrabee's L1 cache is, but it certainly is not 192KB per core. In the original Intel paper, the model they used is 32KB I + 32KB D. The real thing is not going to be vastly different from that.
If you count the L2 cache, then you should consider Fermi's L2 cache too.
rpg.314
16-Dec-2009, 16:32
I don't know how large Larrabee's L1 cache is, but it certainly is not 192KB per core. In the original Intel paper, the model they used is 32KB I + 32KB D. The real thing is not going to be vastly different from that.
If you count the L2 cache, then you should consider Fermi's L2 cache too.
I was counting the total cache available to each core in LRB. There is 32K I + 32K data L1 and 256K of L2 cache. And since it is an inclusive cache (99% sure of it), the effective per-core storage in LRB is 192K.
Which is the same as Fermi's per-core storage (with Fermi's L2 extra).
Which is the same as Fermi's per-core storage (with Fermi's L2 extra).
Fermi has 32K registers per core (that's 128KB) and 64KB of L1/shared memory. That's already 192KB. L2 cache is 128KB per memory controller, and Fermi has 6 of them, for a total of 768KB. If you average that "per core" then it's an additional 48KB.
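The arithmetic behind those figures can be checked with a few lines. This is just a sketch of the sums in the post; the 4-byte register width and the count of 16 cores are assumptions (the latter is implied by 768KB / 48KB rather than stated), and the variable names are made up for the example.

```python
KB = 1024

# Register file: 32K registers per core, assumed 32 bits (4 bytes) each.
registers_per_core = 32 * 1024
bytes_per_register = 4  # assumption
reg_file_kb = registers_per_core * bytes_per_register // KB

# Configurable L1 / shared memory per core, as quoted in the post.
l1_shared_kb = 64
per_core_kb = reg_file_kb + l1_shared_kb

# L2: 128KB per memory controller, 6 controllers.
l2_total_kb = 128 * 6
cores = 16  # assumption, implied by the 48KB average
l2_share_kb = l2_total_kb // cores

print(reg_file_kb, per_core_kb, l2_total_kb, l2_share_kb)
```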
trinibwoy
16-Dec-2009, 17:11
Interesting, in previous comparisons I never considered Fermi's register file. So LRB doesn't have an advantage in on-chip storage after all.
To be fair, Larrabee, with its 4-way multi-threading, should be considered as having 4 sets of register files. LRBni supports 32 vector registers (and they should be real registers instead of spill buffers), so there are actually 32x4x16 = 2K registers. That's 8KB. And of course there are also the plain old scalar registers, but they are tiny compared to LRBni registers.
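Spelled out, the register count above works as follows. This is a sketch only: it assumes each of the 16 vector lanes is 32 bits wide (i.e. 512-bit vector registers), and the variable names are invented for the example.

```python
vector_registers = 32   # LRBni vector registers per thread
hw_threads = 4          # hardware threads per core
lanes = 16              # 32-bit lanes per vector register (assumption)
bytes_per_lane = 4

scalar_equivalents = vector_registers * hw_threads * lanes
vector_rf_bytes = scalar_equivalents * bytes_per_lane
vector_rf_kb = vector_rf_bytes // 1024

print(scalar_equivalents, vector_rf_kb)
```

Counted this way, the vector register state per core is small next to the 128KB register file figure quoted for Fermi.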
Even if they have specific instructions for doing that, it will still be limited by cache bandwidth. So you can only switch context every 16 cycles at best, and there can't be any memory access in these cycles. Basically I don't think this is a good idea. It's probably better to just forget about context switch, and just do plain old software pipelining or similar tricks (such as using loop unrolling to hide latency).
In the end the compiler will have to do the same thing anyway: it won't generally have enough registers to retain the state for the shader invocations long enough to cover external memory latency, so one way or another the shader context will get piped through cache. Also, you won't always have enough invocations to hide the latency if you don't mix and match with other shaders, and that's unrealistic without the fiber approach.
spacemonkey
18-Dec-2009, 22:25
Don't know about 2017, but according to Kirk: in 2015 we can expect "570X" :runaway:
http://pc.watch.impress.co.jp/img/pcw/docs/336/837/ph24.jpg
I love slides with X's in them; so full of hope. :grin:
http://pc.watch.impress.co.jp/docs/news/event/20091218_336837.html
green.pixel
19-Dec-2009, 01:51
They have already done the slideshow game. :D
http://blogs.nvidia.com/ntersect/2009/08/hot-chips-2009-keynote-by-jen-hsun-huang.html