AMD: R8xx Speculation

Jawed · May 6, 2009

MfA said:
No, they put it in there because they wanted to present developers with a x86 processor supporting classical multithreaded development ... but it does tie them down a bit.

I have to admit I still can't work out what it is you think Intel should have done instead.

Jawed

trinibwoy · May 6, 2009

Jawed said:
I have to admit I still can't work out what it is you think Intel should have done instead.

They had all the same options available to them as AMD and Nvidia do. You make it sound as if their hands were tied from the beginning.

Jawed · May 6, 2009

trinibwoy said:
They had all the same options available to them as AMD and Nvidia do. You make it sound as if their hands were tied from the beginning.

I'm talking about MfA's comments on the cache architecture.

Jawed

3dilettante · May 6, 2009

The cache architecture was already mostly there the instant they settled on an already existing x86 core.
It would have taken more work to remove it.

MfA · May 6, 2009

Jawed said:
I have to admit I still can't work out what it is you think Intel should have done instead.

Jawed

All caches lockable, 32 bit wide banked ports for L1 and L2.

Jawed · May 6, 2009

MfA said:
All caches lockable,

So associativity is not good enough?

32 bit wide banked ports for L1 and L2.

So how much more costly does that make them, and how much extra latency would that add?

What you're asking for, apparently, is >256KB of register file per core :???:

Jawed

3dilettante · May 6, 2009

Jawed said:
So associativity is not good enough?

Locking would prevent cache line eviction, so the data would always be where it was last loaded.
Increasing associativity would reduce the chance of mapping conflicts, though a thread could just be unlucky.

Since Larrabee is multithreaded, an increase in associativity would be needed just to keep miss rates within the realm of where they were originally.

So how much more costly does that make them, and how much extra latency would that add?

There would be a hefty amount of logic to route between all the banks, which would add latency.
The big overhead is going to be that this is a cache, which means that it is handling 32 full memory accesses.
Unless the accesses are always to the same locality, which negates the need for banking, that's 32 address calculations and 32 TLB accesses, 32 read requests, and the coherency traffic to match.

nAo · May 6, 2009

If a software engineer is willing to use cache locking to speed up his code, then he can achieve the same result tagging everything that doesn't need to be locked as LRU. It's obviously not a symmetrical choice/situation, but the latter method is obviously much more flexibile and doesn't waste any cache memory you have locked and you are not going to use. Oh, it scales much better too.

3dilettante · May 6, 2009

That does reduce the probability of needed data being evicted to a low level.
It's not guaranteed, of course.

There are other threads contending for the cache, whose associativity is not quite known.
Locking a whole section of cache might be accomplished in a few instructions.
Marking every cache line not to be locked might be bigger hit to instruction bandwidth.

nAo · May 6, 2009

3dilettante said:
Locking a whole section of cache might be accomplished in a few instructions.
Marking every cache line not to be locked might be bigger hit to instruction bandwidth.

How does that effect instruction bandwidth?
LRBni allows you to specify cache behaviour or any prefetch/load/store instruction.

Jawed · May 6, 2009

3dilettante said:
Locking would prevent cache line eviction, so the data would always be where it was last loaded.

Reading the Siggraph paper and the Larrabee New Instructions Overview, it seems that marking cache-lines for eviction is the technique Intel chose for controlling cache-line usage, i.e. when L2 and/or L1 lines are marked for eviction, other lines are "preferentially" preserved. Lines aren't explicitly lockable as far as I can tell, so it's up to the program to ensure there's always enough "evict upon use" lines.

Hmm, seems I missed the attributes for memory-related hints and prefetch, too:

_MM_MEM_HINT_ENUM – Constants used by all operations that read or write memory for non-temporal hint
_MM_HINT_NONE No memory hint
_MM_HINT_NT Nontemporal memory hint

_MM_PREFETCH_HINT_ENUM – Constants used by PREFETCH1 and PREFETCH2
_MM_PFHINT_NONE No prefetch hint
_MM_PFHINT_EX Mark cacheline exclusive
_MM_PFHINT_NT Nontemporal data hint
_MM_PFHINT_EX_NT Mark cacheline exclusive and load with nontemporal data hint
_MM_PFHINT_MISS Miss hint
_MM_PFHINT_EX_MISS Mark cacheline exclusive and load with miss hint
_MM_PFHINT_NT_MISS Load with nontemporal data and miss hints
_MM_PFHINT_EX_NT_MISS Mark cacheline exclusive and load with nontemporal data and miss hints

There would be a hefty amount of logic to route between all the banks, which would add latency.
The big overhead is going to be that this is a cache, which means that it is handling 32 full memory accesses.
Unless the accesses are always to the same locality, which negates the need for banking, that's 32 address calculations and 32 TLB accesses, 32 read requests, and the coherency traffic to match.

Sounds untenable to me. Particularly as the TLBs are shared by both the cores and the texture units.

Jawed

3dilettante · May 6, 2009

nAo said:
How does that effect instruction bandwidth?
LRBni allows you to specify cache behaviour or any prefetch/load/store instruction.

In cache shared between multiple threads, it may not be enough for a thread to say that some cache line it hits is LRU.
To more closely emulate locking, it has to somehow make a line it does need be MRU relative to all other lines in its set.
(edit: That is, make every other line less recently used.)
That is more work.

Jawed · May 6, 2009

I guess Intel had this lying around:

Shared cache structure for temporal and non-temporal instructions

Not sure if that's part of the Pentium core that Larrabee is supposed to be based-upon. I don't know much about x86 caching, so I've got no idea if there's much experience with this approach. Maybe it's an Itanium thing?

http://software.intel.com/en-us/blo...rmance-analysis-and-optimizationwhich-failed/

The surprising find during this experiment was that temporal locality hints were responsible for all of the gains on the gzip binary. This is one of the “aha” moments that I described at the beginning of the blog. Temporal locality hints are fairly useful on the Itanium architecture, it is difficult to determine where to place them unless a savvy developer is familiar with how often data within his or her program is touched.

Jawed

TimothyFarrar · May 6, 2009

Jawed said:
Any global atomic requires global thread serialisation, not merely global memory serialisation.

I was suggesting the possibility that a GPU could be (likely is) built to do the global atomic operation NOT in the SIMD cores, but rather much closer to the memory subsystem itself, perhaps similar to how ROP operations work and are very efficient.

CUDA style GPU global atomic operations can happen out of order. With what I'm suggesting SIMD cores issue the atomic operation just like any other global memory operation (well except you would have to transfer the extra operands to the atomic operation), then the SIMD "warp" goes to sleep, and the ALU keep going on another thread. About the only extra serialization you need here is if multiple atomic operations apply to an address in one single coalesced memory segment transaction. Then you'd serialize the atomic operations (but not in the normal SIMD ALUs) which happen to data in that segment, then return the atomic result to the SIMD ALUs as if it was a normal global memory fetch. Since atomic operations return the value in memory before the atomic operation, you wouldn't even have to incur any extra latency

This quite a bit different than things are done in CPU land.

The sensible way to serialise in this situation is to distribute serialisation of addresses across extant threads, e.g. if Larrabee has 32 cores, then across 128 threads. Simple tiling of the atomic variables will ensure that cache-line/RAM-bank collisions don't occur.

Yeah, with coherent caches, the "sensible" way to use atomic operations on shared cache lines is to use them only sparingly. Or use them only for atomic operations between threads on the same core (as you suggested). Even then they are mighty expensive. Not sure what the cost will be on Larrabee but InterlockedIncrement can range from 30-90 cycles on PC and over 200 cycles on something like the 360 (PPC chips use retry loops if they loose the cache line reservation on the conditional store).

What "goodness" about atomics can you discern from these performance figures?

That paper left out way too many details about the algorithm to infer much about actual atomic costs beyond the "15% more expensive than global writes" figure that they found in their algorithm.

Anyway, I should be able to provide actual timings on global atomic operations (including CUDA hardware counters) next week when my GT275 arrives so we have something real to talk about.

3dilettante · May 6, 2009

Jawed said:
I guess Intel had this lying around:

Shared cache structure for temporal and non-temporal instructions

Not sure if that's part of the Pentium core that Larrabee is supposed to be based-upon. I don't know much about x86 caching, so I've got no idea if there's much experience with this approach. Maybe it's an Itanium thing?

The P54 core was introduced in 1994. There's no law saying it couldn't be added for Larrabee, I suppose.

I've never run across any article that stated a processor used this scheme, but it's the sort of unglamorous esoteric implementation detail that few would comment on.

Jawed · May 6, 2009

TimothyFarrar said:
I was suggesting the possibility that a GPU could be (likely is) built to do the global atomic operation NOT in the SIMD cores, but rather much closer to the memory subsystem itself, perhaps similar to how ROP operations work and are very efficient.

The threads in all the multiprocessors become subject to a single fence. If you could guarantee that all the atomics for a given address happen only within a single multiprocessor, then it would be a shared-memory atomic, not a global one. But that would give you way more performance.

CUDA style GPU global atomic operations can happen out of order. With what I'm suggesting SIMD cores issue the atomic operation just like any other global memory operation (well except you would have to transfer the extra operands to the atomic operation), then the SIMD "warp" goes to sleep, and the ALU keep going on another thread.

The multiprocessors are not designed to hide the latency of foreign multiprocessors' memory operations, only local ones, so the fence will have a huge impact on performance.

About the only extra serialization you need here is if multiple atomic operations apply to an address in one single coalesced memory segment transaction. Then you'd serialize the atomic operations (but not in the normal SIMD ALUs) which happen to data in that segment, then return the atomic result to the SIMD ALUs as if it was a normal global memory fetch. Since atomic operations return the value in memory before the atomic operation, you wouldn't even have to incur any extra latency.

This is why I was originally thinking in terms of some kind of network amongst multiprocessors. It turns out that the atomic operations in NVidia are integer-only operations, which re-inforces the general suspicion that atomics are actually manipulated in the ROPs. So when the ROPs gain full floating-point capability, operands of atomics can be floating point, it seems. Or, if ROPs disappear in favour of "surfaces" close to the ALUs ...

Yeah, with coherent caches, the "sensible" way to use atomic operations on shared cache lines is to use them only sparingly.

Thinking about this some more, perhaps you end up building a task-pool based scheduler, perhaps with work stealing and dole-out atomics as a special kind of task. I'm way out of my depth on this stuff

Or use them only for atomic operations between threads on the same core (as you suggested). Even then they are mighty expensive. Not sure what the cost will be on Larrabee but InterlockedIncrement can range from 30-90 cycles on PC and over 200 cycles on something like the 360 (PPC chips use retry loops if they loose the cache line reservation on the conditional store).

It seems Larrabee has ultra-high-bandwidth inter-core coherence, SMT and locality+temporality cache control to help mitigate the horrors of cache latency.

That paper left out way too many details about the algorithm to infer much about actual atomic costs beyond the "15% more expensive than global writes" figure that they found in their algorithm.

Yes, it's hard to know if the cost of that operation (atomic or not) is buried under the costs of all the work undertaken per frame. Is divergence penalty why it's so slow or is it memory operations?

"Ray-tracing sucks on GPUs".

Anyway, I should be able to provide actual timings on global atomic operations (including CUDA hardware counters) next week when my GT275 arrives so we have something real to talk about.

I imagine you'll be spoilt very quickly

Jawed

Jawed · May 6, 2009

3dilettante said:
I've never run across any article that stated a processor used this scheme, but it's the sort of unglamorous esoteric implementation detail that few would comment on.

Section 5.4 Data Cache Prefetching and Load Hints (stupidly doesn't allow copy to clipboard):

http://download.intel.com/design/Itanium2/manuals/25111003.pdf

Seems Larrabee is an expansion upon Itanium, with a larger set of hints. Seems pretty likely that few people will engage in the kind of hand-coding that these hints appear to demand - so then it's a question of whether a library can be built up. e.g. this atomics question, is it possible to construct an atomics library that will hold-up to more than light usage of atomics in high-level code?

Jawed

TimothyFarrar · May 6, 2009

So Jawed, what am I missing here?

Multiprocessors are designed to hide latency, regardless of source.

1. A SIMD group issues a memory transaction, then stops getting scheduled for execution. Other SIMD groups run.

2. The memory transaction gets routed (through whatever device interconnect network) to what I'm going to label as the memory hub. This hub might have multiple interfaces to DRAM or an interface through the PCI-E bus. These interface(s) queue up, coalesce and service requests and send the results back through the interconnect network to the multiprocessors.

So now toss in atomic operations happening in the unit doing the coalescing for the given interface to DRAM.

1. Each atomic operation is composed of a {read,op,write} sequence.

2. Note that the result of the atomic operation is the first read, so this can immediately be returned with the same latency as if it was just a read.

3. After this read, a parallel unit (I'm going to refer to as the atomic unit) in the interface could service the atomic operations on the segment of DRAM which was read. Then later re-issue the write back to DRAM. Intermediate results could be cached (right at the interface) so that repeated operations to the same segments don't eat repeated read/write cycle of the same segments of DRAM. This in interface caching would make (2) above fast even in the case of repeated address collisions.

What about the worst case scenerio, all multiprocessors repeated issue atomic operations to the same address?

1. There are two cases here,

1a. One thread of the SIMD group is requesting the atomic operation (this would be the common case, for example writing to a global queue : a shared-memory scan/scatter to group, then one thread does a global atomicAdd to get a global position to do a fully coalesced non-atomic global scatter of the group into the queue). Performance in this case would be similar to issuing a non-atomic operation with the addition of the possible latency of the atomic unit updating the read DRAM segment, and the bandwidth cost of the eventual write back (which could infact happen each operation in the worst case).

The real cost here is similar to a single non-coalesced 32-bit read, ie having to read the minimum sized segment (which in cuda is 32-bytes). However this runs in parallel with the regular SIMD ALUs, so assuming you don't saturate the memory bus, then in theory all is well.

1b. Second case, each thread of the SIMD unit is requesting an individual global atomic operation. In the worst case all these addresses collide. I cannot think of any case were a program would do that. More likely case is that these atomic operations are either semi-random (of which you end up with the same performance as case 1a) or have some locality and slightly collide (at which case perhaps you get more latency at the atomic unit), or perhaps all hit a separate coalesce able address (like if you had say 32 separate global queue pointers for each thread of a warp, don't have any usage cases for that one yet).

The end result of all this as seen by the multiprocessors is a slightly longer latency and reduced throughput of atomic operations compared to non-atomic operations, but nothing awful.

EDIT: As a side note, yes I am suggesting something of a cache here, but the cache isn't in the multiprocessors, but instead in the coalescing unit at the DRAM interface(s) themselves.

3dilettante · May 6, 2009

Jawed said:
Section 5.4 Data Cache Prefetching and Load Hints (stupidly doesn't allow copy to clipboard):

http://download.intel.com/design/Itanium2/manuals/25111003.pdf

The method described there is a non-temporal hint that allocates a line but neglects to update the LRU bits. This would make this line more likely to be evicted, but as stated it is not a guarantee.

Other than the L1 I-cache, it appears that the replacement policy is probably the reason why this is the case, as the lower levels use NRU.

The patent had a flag that explicitely pointed out there was non-temporal data in a set.

Seems Larrabee is an expansion upon Itanium, with a larger set of hints. Seems pretty likely that few people will engage in the kind of hand-coding that these hints appear to demand - so then it's a question of whether a library can be built up. e.g. this atomics question, is it possible to construct an atomics library that will hold-up to more than light usage of atomics in high-level code?

Standard x86 has a selection of non-temporal hints as well.
These hints appear to be orthogonal to what would be needed for an atomic operation. Hints don't prevent some inopportune memory traffic pattern from interfering.

Jawed · May 7, 2009

TimothyFarrar said:
So Jawed, what am I missing here?

OK, quick post as I'm about to go to bed...

So now toss in atomic operations happening in the unit doing the coalescing for the given interface to DRAM.

1. Each atomic operation is composed of a {read,op,write} sequence.

2. Note that the result of the atomic operation is the first read, so this can immediately be returned with the same latency as if it was just a read.

3. After this read, a parallel unit (I'm going to refer to as the atomic unit) in the interface could service the atomic operations on the segment of DRAM which was read. Then later re-issue the write back to DRAM. Intermediate results could be cached (right at the interface) so that repeated operations to the same segments don't eat repeated read/write cycle of the same segments of DRAM. This in interface caching would make (2) above fast even in the case of repeated address collisions.

I think it's safe to liken this to blending in the ROPs as each fragment arrives. If multiple fragments for a given pixel arrive within a short time span, then the data is likely in-cache.

What about the worst case scenerio, all multiprocessors repeated issue atomic operations to the same address?

I think my idea of the worst case is worse, and I think it's why we're somewhat divergent: I'm thinking of all threads in a grid atomically updating an address and then all threads using the final value for immediately-dependent computation. It's actually stupid, and logically equivalent to issuing two kernels one after the other.

On that basis my worst case is a straw man.

The only mitigation might be that multiple atomic variables are all independently updated by "random" threads from the grid. Even then, trying to justify running this as one kernel rather than two successive kernels seems difficult.

1. There are two cases here,

1a. One thread of the SIMD group is requesting the atomic operation (this would be the common case, for example writing to a global queue : a shared-memory scan/scatter to group, then one thread does a global atomicAdd to get a global position to do a fully coalesced non-atomic global scatter of the group into the queue). Performance in this case would be similar to issuing a non-atomic operation with the addition of the possible latency of the atomic unit updating the read DRAM segment, and the bandwidth cost of the eventual write back (which could infact happen each operation in the worst case).

The real cost here is similar to a single non-coalesced 32-bit read, ie having to read the minimum sized segment (which in cuda is 32-bytes). However this runs in parallel with the regular SIMD ALUs, so assuming you don't saturate the memory bus, then in theory all is well.

1b. Second case, each thread of the SIMD unit is requesting an individual global atomic operation. In the worst case all these addresses collide. I cannot think of any case were a program would do that. More likely case is that these atomic operations are either semi-random (of which you end up with the same performance as case 1a) or have some locality and slightly collide (at which case perhaps you get more latency at the atomic unit), or perhaps all hit a separate coalesce able address (like if you had say 32 separate global queue pointers for each thread of a warp, don't have any usage cases for that one yet).

The immediately obvious issue here is that without extremely high locality of the addresses, per-thread atomics will swamp the atomic units, i.e. 16 atomic requests per core clock by each of 30 multiprocessors in GTX280, being serviced by 32 atomic units (GT200 can blend at 32 pixels per clock, can't it?). Worst case the "blend rate" of the atomic is 1/32 of full speed.

Thinking about graphics, this is the primary reason that rasterisation is highly localised with space-filling curves.

So, ahem, it might be best to model the cost of atomic operations in terms of their distance from the highly regularised form of blending in rasterised graphics. So thinking in terms of:

address radii per warp - page divergence
count of overlapping radii by distinct warps within a multiprocessor - intra-multiprocessor serialisation factor
count of overlapping radii by distinct multiprocessor (if any) - inter-multiprocessor serialisation factor
available cycles to hide atomics, per kernel invocation or twixt fences - ALU:atomic

In graphics all of these are sane

I still haven't fully understood the nature of the antiradiance algorithm in that paper, to get an idea of a practical instance of these factors, e.g. I'm unclear on the mapping of links to bins - specifically whether all bins in the destination are updated or only a selection.

Also the algorithm appears to have no need for an atomic, as a non-atomic version runs. But I don't understand how the non-atomic version works. Does it use considerably more memory? Is atomic required for more complex scenes. The talk of a breadth-first version at the end of the paper implies that memory wouldn't be problematic, I guess.

Jawed

AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to upcoming ATI-RV870 lineup GPUs

Within 1 or 2 weeks

Within a month

Within couple months

Very late this year

Not until next year

Jawed

trinibwoy

Meh

Jawed

3dilettante

MfA

Jawed

3dilettante

nAo

Nutella Nutellae

3dilettante

nAo

Nutella Nutellae

Jawed

3dilettante

Jawed

TimothyFarrar

3dilettante

Jawed

Jawed

TimothyFarrar

3dilettante

Jawed

Similar threads