Nvidia GT300 core: Speculation

ALU:TMU frequency ratio

G80 = 2.35x
G92 = 2.49x
GT200b = 2.27x
(hypothetical G300) = 2.29x
It's even more interesting to take units ratio into account... total ratio:

G80 ~ 1:4.7 (or 1:9.4 for bilinear filtered / unfiltered texels)
G80U ~ 1:4.9 (or 1:9.8 for bilinear filtered / unfiltered texels)
G84 ~ 1:4.3
G92 ~ 1:5
G92b ~ 1:5
G94 ~ 1:5
GT200 ~ 1:6.5
GT200b ~ 1:6.8
G300(?) ~ 1:9.1 (for 128 TMUs at 700MHz and 512 SPs at 1600MHz)

If so, it could be a step in the right direction...
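
Sanity-checking that last ratio against the quoted (hypothetical) unit counts and clocks, assuming one ALU op and one bilinear texel per clock per unit:

```python
# Rough ALU:TEX throughput ratio check for the hypothetical G300 figures quoted
# above (128 TMUs at 700MHz, 512 SPs at 1600MHz). Assumed specs, not confirmed.
tmus, tmu_clk_mhz = 128, 700       # texture units and their clock
sps, alu_clk_mhz = 512, 1600       # scalar ALUs ("SPs") and shader clock

alu_rate = sps * alu_clk_mhz       # ALU ops per microsecond (1 op/clock assumed)
tex_rate = tmus * tmu_clk_mhz      # bilinear texels per microsecond (1 texel/clock assumed)

print(f"ALU:TEX ~ 1:{alu_rate / tex_rate:.1f}")   # -> 1:9.1
```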
 
The page table is just data in memory, granted it has system-level significance.
The various cores and tex units might cache parts of it at any given time in their TLBs. There's no need in x86 for a TLB fabric, as they are caches that can stay coherent like their neighbor L1 and L2s.
I was just thinking of mirroring, or marking entries as dirty, generating traffic that crosses the ring bus.

Also I was thinking of:

Virtual memory fragment aware cache

specifically figure 8 (paragraphs 109 onwards), where the TLBs are communicating with each other via L2. Also note this talks about 8 page table contexts, although distribution of contexts by type of client may not marry well with my thoughts on dividing contexts across tasks/cores.

Since it is system-critical, changing the page table would require additional work that would be serializing. As AMD's Barcelona chip showed, there are a number of TLB-related actions that are assumed to involve atomic operations on page table structures. The failure to handle them atomically is basically game-over.

It might be fun to see how much work Intel has done in verifying Larrabee can handle TLB updates across so many cores, given what AMD experienced with just four. Documented TLB-related errata concerning non-atomic updates are still present in Intel and AMD designs, but are typically worked around by microcode patches.
Well, in theory Intel knows how to do that.

One thing occurred to me: AMD and NVidia are effectively going to spend billions over the next 5 years of GPU design, with iterations of refinement towards cGPU. How much Larrabee investment over the same time is required? What architectural gotchas need fixing and what scaling issues need to be worked around? Is Intel going to be paying mostly for shrinking to each node? It seems to me Larrabee should at least be cheap in this sense (even if the thing is so monstrous it isn't profitable for a fair while).

If the current architecture won't scale at >64 cores then sure that's a lot of fixing. Seems unlikely, at least for rasterisation.

It would be interesting to see just how much the texture units can modify the page tables. Maybe they can update some of the bookkeeping bits per entry. If they write to the L2 buffer, it might have to be labelled dirty, or maybe it has to be already initialized to a fixed status by the core.
That the texture unit gives up on a fault makes sense as modifying the actual page table is an OS-level operation.
I would assume Larrabee's software sets up as much as it can ahead of time and tries to keep it as unchanged as possible, given the overhead.
The thing is, the vast majority of page-table traffic will relate to textures. 80%? 90%?

Also, when a TU faults and the owning thread stalls/sleeps this will presumably start to cause the L2 lines preferred by that thread to leak away in favour of threads that are active. On the other hand, maybe the entire core quickly falls inactive since textures are usually pretty coherent and these higher mip levels will affect a lot of pixels in the core's screen-space tile (e.g. prolly all of them?).

I wouldn't expect the core to receive mirrored TLB entries, if the texture unit somehow modifies them.
An alteration would invalidate all cached copies of the page table entry. The CPU, if it were to require that entry, would miss and have to fill the TLB before trying to complete the memory access.
It should be on-chip, as the texture unit would have a copy. It might be that the entries are mostly not modified by the texture units to avoid TLB thrashing.
At the very least the cores' TLB updates need to be mirrored into TUs though, otherwise they'll be spending a lot of time stale (suppose it depends on how quickly the TUs exhaust the content of higher MIP level pages). One-way mirroring, from cores to TUs?

My take was that texture units have more specialized TLB behavior that allows them to behave in a rather non-x86 manner. The texture unit can, at programmer discretion, give up when a full core would be required to service a miss or fault.
This might make good performance sense, as fiddling with the TLB can inject unpredictable latencies.
I would think that the texture unit with hard faults enabled would still defer to the core if it encounters a fault that invokes an OS routine.
Did you mean soft faults? Hard faults makes the TU always defer don't they?

I didn't get the impression that there was any mirroring of TLBs in the pdf. What leads to this conclusion? TLB entries can be shared, but an update to one TLB's entry will typically only lead to the invalidation of old copies cached elsewhere. Broadcasting an update wouldn't normally be done. The other cores would just have to service a miss, if they happen to need the address.
It's not a conclusion so much as a guess/query.

TLBs are typically a shared resource local to a core. Given how intimately they are tied to the memory pipeline and how many there would be if they were per-thread, I'd bet this hasn't changed.
The slide-deck I linked earlier certainly hints that per-thread TLB contexts are not implemented and that keeping similar tasks on a core is a deliberate way of minimising TLB thrashing. So overall I have to admit it does seem like the TLBs could easily turn into a nightmare thrash-bottleneck.

Leave it to Sweeney, eh? Version 2.0?

Jawed
 
I was just thinking of mirroring, or marking entries as dirty, generating traffic that crosses the ring bus.
The Pentium caches wouldn't push data updates, as far as I know.
The modification of a table entry would follow an invalidation of all other caches' copies, along with a write of the new data to main memory.
Without some other modification to the "it's x86" mantra, the traffic over the bus would be "pull" traffic where other cores or texture units generate TLB misses and then need to pull the latest version in.
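
A toy sketch of that invalidate-then-pull behaviour, with made-up names just to illustrate the flow (nothing here is from an actual spec):

```python
# Toy model of invalidate-on-update TLB handling: changing a page table entry
# invalidates every cached copy; other agents simply refill on their next miss.
# Agent names and structures are invented for illustration only.

page_table = {}                                   # virtual page -> physical frame
tlbs = {"core0": {}, "core1": {}, "tex0": {}}     # per-agent cached entries

def update_pte(vpage, new_frame):
    page_table[vpage] = new_frame
    for tlb in tlbs.values():                     # broadcast invalidate, no data pushed
        tlb.pop(vpage, None)

def translate(agent, vpage):
    tlb = tlbs[agent]
    if vpage not in tlb:                          # miss: "pull" the latest entry in
        tlb[vpage] = page_table[vpage]
    return tlb[vpage]

page_table[0x10] = 0xA0
print(translate("tex0", 0x10))                    # fills tex0's TLB
update_pte(0x10, 0xB0)                            # invalidates all cached copies
print(translate("tex0", 0x10))                    # refilled with the new mapping
```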

Also I was thinking of:

Virtual memory fragment aware cache

specifically figure 8 (paragraphs 109 onwards), where the TLBs are communicating with each other via L2. Also note this talks about 8 page table contexts, although distribution of contexts by type of client may not marry well with my thoughts on dividing contexts across tasks/cores.

It's an interesting way of compressing page table data to prevent excessively redundant page table entries from overwhelming the small TLBs.
The communication through L2 is something similar to the way Larrabee has local L2 subsets per core, with a core treating every other core's subset as an L3.
It's not a proactive form of communication, just that every L2 is allowed to fail back to the shared superset of all other L2s.

Well, in theory Intel knows how to do that.
The concepts are well-understood. Validating them over coherent caches, many memory clients, and other system details does get more expensive.
So far, no implementation on the most modern cores has come without bugs, though they can be caught by microcode patches.
Larrabee's core is not the most modern core, and its baseline infrastructure is an FSB and a single core per socket. There would be extra effort to make sure there wasn't some nasty surprise hiding in the melding of a new interconnect with the older core, particularly since incorrect TLB behavior that can't be patched would be a very bad thing.

One thing occurred to me: AMD and NVidia are effectively going to spend billions over the next 5 years of GPU design, with iterations of refinement towards cGPU. How much Larrabee investment over the same time is required?
That would depend significantly on how we account for the research Intel has put into its many other manycore initiatives.
Nobody has a good answer right now, so they're all spending massive amounts to find an acceptable one.

Also, when a TU faults and the owning thread stalls/sleeps this will presumably start to cause the L2 lines preferred by that thread to leak away in favour of threads that are active. On the other hand, maybe the entire core quickly falls inactive since textures are usually pretty coherent and these higher mip levels will affect a lot of pixels in the core's screen-space tile (e.g. prolly all of them?).
There's probably a pool of threads pinned to each core that is >4 for this reason.
The slides indicate Larrabee uses a scheme similar to Niagara, with modified round-robin threading.

At the very least the cores' TLB updates need to be mirrored into TUs though, otherwise they'll be spending a lot of time stale (suppose it depends on how quickly the TUs exhaust the content of higher MIP level pages). One-way mirroring, from cores to TUs?
The architecturally simplest answer would be to do what x86 does already and invalidate the TU's PTEs.
I think having stale copies anywhere being treated as if they were valid is a potential system crash.
However, since the texture units have their own TLB hardware and are allowed to perform fills on their own, mirroring would be more complicated.

Did you mean soft faults? Hard faults makes the TU always defer don't they?
My interpretation is that setting a texture unit for soft faults means that it will send a message back to the core that something went wrong with paging, but would not generate an actual fault, leaving it up to the shader code to decide what to do next.

With hard faulting, the texture unit would defer also, but this time to the core's fault handler.

My earlier statement was colored by my impression that it might also be possible to disable the texture unit's ability to automatically fill its TLB, and just require it to defer back to a shader.
It might keep the texture unit's behavior more deterministic as invoking a TLB fill is a hardware exception that blocks all use of the memory pipeline while in progress, but on reflection the complexity might not be worth it.
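
A rough sketch of how those two paths might look from the shader's point of view; none of these names come from a real API, it's just the control flow as I read it:

```python
# Hypothetical sketch of soft-fault vs hard-fault texture paths as discussed above.
# Everything here is invented for illustration, not a real Larrabee interface.

class PageFault(Exception):
    """Hard fault: deferred to the core's fault handler / OS paging path."""

class TextureUnit:
    def __init__(self, resident_pages, hard_faults=False):
        self.resident = set(resident_pages)
        self.hard_faults = hard_faults

    def sample(self, page, coord):
        if page in self.resident:
            return ("texel", coord)
        if self.hard_faults:
            raise PageFault(page)          # defer to the core/OS to bring the page in
        return ("miss", page)              # soft fault: just report back to the shader

def sample_lower_mip(coord):
    return ("texel_from_resident_mip", coord)   # stand-in fallback sample

def shader(tu, page, coord):
    result = tu.sample(page, coord)
    if result[0] == "miss":
        # shader-level policy: fall back to an already-resident mip and
        # queue the missing page for the core to page in later
        return sample_lower_mip(coord)
    return result

tu = TextureUnit(resident_pages={1, 2}, hard_faults=False)
print(shader(tu, page=7, coord=(0.5, 0.5)))     # soft fault -> fallback sample
```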
 
Wonder if with GT300 we will see any important changes (beyond GT200 coalescing) to the MC, or will it likely end up as mostly a performance update (and GDDR5 update) of the MC arch seen in GT200.

I'm very cloudy on various aspects of this part of the hardware, anyone want to help me out a little? How many physical GDDR3 interfaces do high end GT200 chips have? I'm assuming the limit is due to chip pin limitations and isn't something we should assume will scale?

Also does anyone have any rough speculation as to the breakdown of latencies involved in NVidia's memory arch? Such as percentage of latency from core to memory controller (on-chip network) vs memory controller to GDDR (off-chip).
 
There's probably a pool of threads pinned to each core that is >4 for this reason.
The slides indicate Larrabee uses a scheme similar to Niagara, with modified round-robin threading.
I don't understand how there can be more than 4 threads (hardware threads) per core.

My interpretation is that setting a texture unit for soft faults means that it will send a message back to the core that something went wrong with paging, but would not generate an actual fault, leaving it up to the shader code to decide what to do next.
Hmm, yeah I suppose it's a question of volumes. e.g. if there's 16GB of texture data for a game and 2GB of card memory, you prolly want to hard-fault as the graphics engine starts up or just get the cores to suck in textures to pages. Then maybe switch to soft faults if textures are mostly slowly paging after that.

Jawed
 
Wonder if with GT300 we will see any important changes (beyond GT200 coalescing) to the MC, or will it likely end up as mostly a performance update (and GDDR5 update) of the MC arch seen in GT200.
Maybe they'll do something radical with the cache architecture? L2 in ATI appears to be flat, i.e. all on-die clients go through L2. Dunno what NVidia's doing.

I'm very cloudy on various aspects of this part of the hardware, anyone want to help me out a little? How many physical GDDR3 interfaces do high end GT200 chips have? I'm assuming the limit is due to chip pin limitations and isn't something we should assume will scale?
8x 64-bit controllers. NVidia says that bandwidth is fine on GT200 for the units on the die. I think it'll be a long long time before a chip goes beyond 512-bit DDR. Hell, the DDR roadmap apparently hits the wall at around 7-8Gb/s per pin and we'll all be witness to Rambus getting their day with a change to XDR something or other. Sigh.
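
For reference, peak bandwidth out of those eight controllers, assuming GTX 280's ~1107MHz GDDR3 (roughly 2.2Gb/s per pin):

```python
# Peak bandwidth of GT200's 8x 64-bit GDDR3 interfaces, assuming a GTX 280-style
# memory clock of ~1107MHz (2214 MT/s effective on DDR signalling).
controllers, width_bits = 8, 64
data_rate_gbps_per_pin = 2.214          # assumption: 1107MHz GDDR3, double data rate

bus_width = controllers * width_bits    # 512 bits
bandwidth_gbs = bus_width / 8 * data_rate_gbps_per_pin
print(f"{bus_width}-bit bus -> ~{bandwidth_gbs:.0f} GB/s")   # ~142 GB/s
```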

Also does anyone have any rough speculation as to the breakdown of latencies involved in NVidia's memory arch? Such as percentage of latency from core to memory controller (on-chip network) vs memory controller to GDDR (off-chip).
~250 ALU cycles typical, ~500 worst case, extremes beyond that? Volkov has reasonable looking measurements.

10s of cycles worst case to get data around the die?

GDDR5 is a bit slower than GDDR3, 10%?

Jawed
 
I don't understand how there can be more than 4 threads (hardware threads) per core.
Thread affinity can be set to force a given thread to only execute on a single core, so there can be more threads pinned to a core than the CPU can execute simultaneously.
If a hardware thread hits some extremely long latency event, it can signal the software scheduler to swap in a queued thread that hasn't hit a roadblock.
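
A minimal sketch of that over-subscription idea, purely illustrative (this isn't Larrabee's actual scheduler):

```python
# Minimal sketch of over-subscribing a core: more software threads pinned to it
# than the 4 hardware contexts, with a stalled thread swapped out for a runnable one.
from collections import deque

HW_CONTEXTS = 4

class Core:
    def __init__(self, pinned_threads):
        self.running = list(pinned_threads[:HW_CONTEXTS])   # occupying hardware contexts
        self.queued = deque(pinned_threads[HW_CONTEXTS:])   # pinned but not resident

    def on_long_latency_stall(self, thread):
        # e.g. a texture fault: park the stalled thread and, if another pinned
        # thread is waiting, swap it into the freed hardware context
        self.running.remove(thread)
        if self.queued:
            self.running.append(self.queued.popleft())
        self.queued.append(thread)

core = Core(pinned_threads=[f"t{i}" for i in range(8)])   # 8 threads pinned to one core
core.on_long_latency_stall("t0")
print(core.running, list(core.queued))
```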
 
Thanks Jawed.

Maybe they'll do something radical with the cache architecture? L2 in ATI appears to be flat, i.e. all on-die clients go through L2. Dunno what NVidia's doing.

I thought (as per the R700 ISA doc drawings) R700 had the following:

(1.) Read only L2s connected via crossbar to a L1 texture cache per SIMD unit.

(2.) Write combine buffers per MC connected via crossbar to an export buffer per SIMD unit.

~250 ALU cycles typical, ~500 worst case, extremes beyond that? Volkov has reasonable looking measurements. 10s of cycles worst case to get data around the die?

So perhaps 10x or so latency off chip compared to on chip? Let's just say for the sake of argument that NVidia does end up with some writable caches. Would the following architecture make sense?

(1.) Each MC has a Read/Write "L2" connected via crossbar (or some other network as chips scale) to SIMD units. Possibly each MC+cache unit has a dedicated small atomic ALU unit.

(2.) SIMD units still have a dedicated read only L1 (texture cache).

The idea being that the internal on-chip network routes requests to MC+cache units each of which serialize (+coalesce?) requests so there is no need for cache coherency between MC+cache units. Also if MC+cache units do atomic operations, then atomic operations scale as long as they are distributed well across MC+cache units.
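
To sketch why no coherence would be needed between MC+cache units, assuming simple static address interleaving (the 256-byte granularity is just a guess for illustration):

```python
# Sketch of static address interleaving across MC+cache units: every address has
# exactly one "home" controller, so no two controller-side caches can ever hold
# the same line and no coherence protocol is needed between them.
# The 256-byte interleave granularity is an assumption for illustration.

NUM_MCS = 8
INTERLEAVE_BYTES = 256

def home_mc(address):
    return (address // INTERLEAVE_BYTES) % NUM_MCS

def atomic_add(address, value, mc_caches):
    # atomics are serialized at the owning MC's cache, so they scale as long as
    # traffic is spread across controllers
    mc = home_mc(address)
    mc_caches[mc][address] = mc_caches[mc].get(address, 0) + value
    return mc_caches[mc][address]

caches = [dict() for _ in range(NUM_MCS)]
print(home_mc(0x0000), home_mc(0x0100), home_mc(0x0800))   # 0, 1, 0
print(atomic_add(0x0100, 5, caches))                       # serialized at MC 1
```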
 
I don't think that writable caches will be present. Then you effectively run into LRB-like cache coherency. Even if the cache size is kept small, it won't scale, and the small size will defeat the point of writable caches anyway.

Read only texture caches dramatically simplify cache design, AFAICS.

EDIT: and they also simplify multi gpu shared memory scaling (should that come to pass this gen).
 
If I were going for the simplest thing to tack on, I'd say add write-through caches on the opposite side of every memory controller from the rest of the on-chip network.

The memory controllers by their physical separation mean writes to an address already have a known destination that will not somehow magically wind up in another controller's cache, and with the already present coalescing and conflict checking done by the hardware on the other side of the memory controller, no coherence is needed at all between writable caches.

Each memory controller could be upscaled to support one physical memory read and one or more reads to the cache at the same time. Writes would be unchanged for a write-through scheme.

This would be a bandwidth and latency abatement scheme of moderate benefit with much reduced impact on the overall design.
Write-through would also make the simplest design more compact, with a single read/write port.
Some decent density could be arrived at for this banked single-port scheme.
If I had the luxury, a few megs on-die at least would be nice.
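
A minimal model of the write-through idea behind a single controller (direct-mapped, line-granular, all parameters illustrative):

```python
# Minimal model of a write-through cache sitting behind one memory controller.
# Writes always go to DRAM (so the cache never holds dirty data); reads are served
# from the cache when they hit. Direct-mapped and line-granular for simplicity.

LINE_BYTES = 64
NUM_LINES = 4096          # ~256KB per controller in this toy example

class WriteThroughCache:
    def __init__(self, dram):
        self.dram = dram                      # dict: line address -> data
        self.lines = {}                       # index -> (tag, data)

    def _index_tag(self, addr):
        line = addr // LINE_BYTES
        return line % NUM_LINES, line // NUM_LINES

    def write(self, addr, data):
        idx, tag = self._index_tag(addr)
        self.lines[idx] = (tag, data)         # update (or allocate) the cached copy
        self.dram[addr // LINE_BYTES] = data  # and always write through to DRAM

    def read(self, addr):
        idx, tag = self._index_tag(addr)
        cached = self.lines.get(idx)
        if cached and cached[0] == tag:
            return cached[1]                  # hit: no DRAM access needed
        data = self.dram.get(addr // LINE_BYTES)
        self.lines[idx] = (tag, data)         # fill on miss
        return data

dram = {}
cache = WriteThroughCache(dram)
cache.write(0x1000, "tile A")
print(cache.read(0x1000), dram[0x1000 // LINE_BYTES])
```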

The question is how to handle what is in the read-only caches on the other side.
The easiest is to set up the scheme so that coherence is not defined within the chip without some kind of fence or global synch.
The next step up would be a global broadcast of an invalidate on that address on all read-only caches.
For more control, it might even be desirable to make certain outputs non-cacheable to save room for other more important data.

Atomics could be labeled cacheable, which would cut latency there.
If tiled rendering is used, framebuffer tiles could be kept on-die to avoid excess read traffic.
With sufficient size, the cache might even be able to catch a lot of excess traffic without explicit tiling.

Write-back would save even more bandwidth, but it could come at the cost of greater physical complexity or extra latency, as evictions would require a read then write of lines.

Multi-chip can lead to more challenges, but this would depend on how stringent the coherence scheme is.
The more forgiving, the cheaper the implementation.
 
The memory controllers by their physical separation mean writes to an address already have a known destination that will not somehow magically wind up in another controller's cache, and with the already present coalescing and conflict checking done by the hardware on the other side of the memory controller, no coherence is needed at all between writable caches.

Exactly what I was thinking. I'm guessing more likely to be one of the primary differences between the approaches taken by Intel and NVidia.

The question is how to handle what is in the read-only caches on the other side. The easiest is to set up the scheme so that coherence is not defined within the chip without some kind of fence or global synch.

Other side read-only caches (instruction cache, constant cache, and texture cache) are all relatively small "L1" sized caches. Seems as if by design all major APIs (CUDA, OpenCL, and DX11) only support read-only tex+constant cache coherence at draw call (or compute kernel) boundaries. Likely with a command buffer command (hidden to programmer, but inserted at driver level) to flush those smaller read only caches when necessary.
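
A sketch of what that driver-level behaviour might look like; the command names are invented, since real command-buffer packets are vendor-specific:

```python
# Sketch of a driver inserting a flush of the read-only L1s (texture/constant caches)
# at draw-call boundaries. Command names are hypothetical, purely for illustration.

def record_draw(cmd_buffer, draw, resources_written_since_last_draw):
    if resources_written_since_last_draw:
        cmd_buffer.append(("FLUSH_RO_L1", None))   # invalidate stale texture/constant lines
    cmd_buffer.append(("DRAW", draw))

cmds = []
record_draw(cmds, "draw0", resources_written_since_last_draw=False)
record_draw(cmds, "draw1", resources_written_since_last_draw=True)
print(cmds)
```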

The "Apparatus, System, and Method For Graphics Memory Hub" NVidia patent, filed Dec 2008, might provide a few clues on GT300 or later designs. If I am reading this right, seems as if they are looking to provide more bandwidth per pin by having on-chip MCs communicate more efficiently with an off-chip memory hub which would interface with PCI-E and DRAM (for a small extra cost in latency). Also the patent covers optionally placing ROP in the hub. So perhaps my speculation on atomic ALU operations happening out of SIMD units (and post routing to individual MCs) isn't too far off.
 
If so, that strangely sounds a bit like the northbridges of conventional motherboard chipsets - something which CPUs have just gotten rid of for performance purposes. Funny thing.
 
Sounds like a patent for FB-DIMMs for GPUs. A specialized bus leads to a control chip that aggregates one or more channels of DRAM chips.

I'm not sure what's new about this. It seems almost trivial to change the chip at the end from a CPU to a GPU.
 
But that's already the case today right?

I'm not sure what mechanisms are in place currently for this in the latest chips. I can only comment on the slope of complexity of possible design choices.

If there is already a broadcast invalidate in hardware for all writes, it would just mean that the designers would have less work to do.
 
An nVidia beta tester has confirmed on the hardware.no forum that the GT300 has been taped out and the first samples are up and running in nVidia's labs at 700/1600/2100.
 
An nVidia beta tester has confirmed on the hardware.no forum that the GT300 has been taped out and the first samples are up and running in nVidia's labs at 700/1600/2100.

2100 = 1050 GDDR5 base clock, I presume (so actually it should read 4200, not 2100)
 