Nvidia GT300 core: Speculation

Discussion in 'Architecture and Products' started by Shtal, Jul 20, 2008.

Thread Status:
Not open for further replies.
  1. KonKort

    Newcomer

    Joined:
    Dec 29, 2008
    Messages:
    89
    Likes Received:
    0
    Location:
    Germany, Ennepetal
    I can inform you that the current A1 samples are clocked at 700/1600/1100 MHz. That works out to an impressive 2,457 GFLOPS and 281 GB/s of memory bandwidth.
    Nice, isn't it? ;)

    Source: Hardware-Infos
     
  2. Pressure

    Veteran

    Joined:
    Mar 30, 2004
    Messages:
    1,655
    Likes Received:
    593
    I'm not convinced GFLOPS numbers and the amount of memory bandwidth are a good indication of performance.

    Just take a look at the R600 (a massive amount of memory bandwidth at the time) and the Radeon HD 4870 X2, which pushes 2.4 TFLOPS.
     
  3. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    IMO, if you are talking about fibers as in lightweight threads, the second a fiber has to check, via software, whether it can run, it has too much overhead. Also, a fiber context switch still has to restore its context, which could well be a full trip to/from L2. Switch granularity likely has to be coarse for good performance (in order to amortize the switch cost).

    If you are talking about fibers as per LRB, it seems to me that LRB fibers are more like compiler loop unrolling. I thought the idea here was to compute, at compile time, an order of operations that would best ensure that when the fiber returns to execution it doesn't have to check anything and can just continue executing (in the worst case it simply stalls).

    I look at LRB (in terms of NVidia's hardware) as having a fixed set of 4 blocks (or work groups), with fixed round-robin scheduling between the warps of a block. Anything outside of this formula (besides just round-robin of more blocks) requires software overhead. With NVidia's hardware you have the flexibility of warps of a block running out of order without software overhead.

    If we speculate that GT300 goes the route of having the ability to lock down a section of registers, shared memory, and warps (out of the 32/core maximum in GT200) for a "pinned block", then I'd bet they add something new in the hardware to ensure warps of a pinned block don't have to poll (ie wake up, check if they should run, then either run or sleep again). Perhaps a dedicated instruction to wait on data from a hardware queue, or to wait until a reserved cache line gets modified (in the case that they implemented writable caches).
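    To put a number on that polling overhead, here's a toy Python model (entirely my own construction, nothing from NVidia's hardware) comparing fibers that wake every tick to poll a flag against fibers that are only scheduled once their data has arrived:

```python
# Toy model: N fibers each wait for data that arrives at a known tick.
# Polling: every tick, every sleeping fiber wakes to check its flag.
# Hardware wait: a fiber is only made runnable once its data is ready.

def polling_checks(arrival_ticks):
    """Total wake-up/check events when every waiting fiber polls each tick."""
    checks = 0
    for tick in range(max(arrival_ticks) + 1):
        # Each fiber whose data hasn't arrived yet wakes, checks, sleeps.
        checks += sum(1 for t in arrival_ticks if t > tick)
    return checks

def hardware_wait_checks(arrival_ticks):
    """With a dedicated wait instruction, each fiber wakes exactly once."""
    return len(arrival_ticks)

arrivals = [3, 7, 7, 12]               # ticks at which each fiber's data lands
print(polling_checks(arrivals))        # 29 wasted wake-ups
print(hardware_wait_checks(arrivals))  # 4: one wake-up per fiber
```

    The gap grows with the number of sleeping fibers and the length of the wait, which is why a dedicated wait-on-queue/wait-on-line instruction looks attractive.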

    Where the atomic operation is performed is very important because it factors greatly into the overall throughput of atomic operations.

    Yeah, part of the problem here is that even beyond what you have mentioned and testing applications re-written for the different hardware options (cache/nocache, cache size, etc), one would also have to compare everything assuming the same base amount of silicon to implement all the hardware options...
     
  4. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,062
    Likes Received:
    3,119
    Location:
    New York
    Well, that's true when comparing architectures from different companies. But it's a safe bet that GT300 will scale at least linearly with the theoretical increases vs GT200 (in shader-bound scenarios).
     
  5. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    195
    Location:
    Stateless
    Is it a safe bet to think that GT300 and GT200 are the same architecture?
    I mean, ~2.5 TFLOPS is a 2.5x improvement over GT200 parts; that's a huge improvement.
    There are two possibilities: either the chip is really, really huge, or Charlie has got some stuff right and Nvidia said goodbye to a bunch of fixed-function hardware.
     
  6. KonKort

    Newcomer

    Joined:
    Dec 29, 2008
    Messages:
    89
    Likes Received:
    0
    Location:
    Germany, Ennepetal
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Clearly Larrabee is doing this already since 32 registers can't hold the entire state of a reasonably complex pixel shader for much more than a quad of pixels. So L1/L2 traffic is "continuous" from the point of view of a Larrabee core.

    ATI also has coarse context switching - i.e. a preference for long clauses of TEX instructions and a minimal number of clauses, in total, over the lifetime of the shader. That's because switching adds latency to the duration of the shader (don't know how much).

    The overhead cost for software scheduling is a trade of utility versus area to implement the funny stuff. e.g. for all we know Intel's put dedicated atomic functionality in the cache/bus as part of the cross-core TLB fabric.

    There's basic hardware scheduling to cover latencies such as texture fetches, cache fetches (not sure) and branch misprediction.

    Fragment shading is atomic per pixel. The scheduler task that dishes out pixel-shading/output-merger work has a scoreboard for these atomics. So between rasterisation and shading there's an "atomic queue". This is obviously fairly brutal and nowhere near as fine-grained as fibre-level sleeping, of course.

    I don't know if Intel plans to routinely sleep individual fibres rather than threads (contexts). I can't tell how they've implemented global atomics. Hell, maybe there's no hardware support at all.

    I don't really understand how texturing faults in Larrabee, but apparently it does, seemingly through the core's own TLB. I'm unclear whether the thread sleeps or whether the thread's expected to sleep fibres, or how things wake up - seemingly the thread issues a request to memory, and perhaps the return of the data simply wakes the thread. Maybe the thread builds a list of faults before submitting them as a burst and then sleeping - so it doesn't attempt to run any fibres until they have all successfully gotten data.

    Or maybe it can wake as texture results return then go back to sleep? Using a gather-register full of addresses where texture results are expected, the thread is woken when any of that data appears in L2/L1?:

    Code:
    // clause 1
    TEX: x1   
    repeat until TEXGatherAddressList is empty
        scan for fibre to do
        ALU: x6 // dependent instructions executed for the fibre
     
    // clause 2
    TEX: x3   
    repeat until TEXGatherAddressList is empty
        scan for fibre to do
        ALU: x25
    I don't know if gather can effectively drive this sleep-until-woken thing. If it's possible then it seems it's a technique applicable to atomics too, wherever it is (out of pipeline) that the atomics are actuated.

    This is purely about load-balancing in my view. A persistent kernel wakes up when the scoreboarder determines that overall throughput will suffer.

    So I doubt the kernel does any polling as such.

    Not to mention what clocks you can run it at.

    Jawed
     
  8. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,062
    Likes Received:
    3,119
    Location:
    New York
    Well I think it's a safe bet that "512" refers to the number of scalar MADD ALUs. So in that regard it should scale linearly. But that says nothing about the other bottlenecks that may have shifted.
     
  9. KonKort

    Newcomer

    Joined:
    Dec 29, 2008
    Messages:
    89
    Likes Received:
    0
    Location:
    Germany, Ennepetal
    The news about G300 clocks at 700/1600/1100 MHz is now available in English.
     
  10. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    With LRB I'd suspect that the actual register-to-SIMD-group (for example, 4 quads) allocation changes throughout the shader, with a mix of areas (you could think of these as clauses) that are separately loop-unrolled and get different kinds of register allocation over the duration of the shader. Just a guess. But it might be good to think of register allocation as cross-clause (many fibers) rather than per-shader.

    If GT300 is some kind of MIMD or DWF however, I don't see this applying.

    I wonder if they are using the x86 LOCK prefix for atomic operations.

    In CUDA one can take advantage of out-of-order warp execution. SIMD groups of quads could also execute out of order (though you might have to sync the block prior to output?). Also, if NVidia does provide "warp as a task" functionality for GT300, then the finer warp-granularity scheduling could indeed be useful.

    What about submitting entire cache lines full of offsets (and LOD, etc.) for 16 fetches as one unit? This would make sense with the idea of compiler cross-fiber register allocation (which I talked about above). Then have enough fibers issuing groups of 16 fetches that, when you return to the first fiber again, you can assume the resulting TEX data is in L2 (if it isn't there, the core takes an L2 miss and waits for the TEX result only on that hyperthread).
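    A sketch of that batching idea in Python (sizes and structure are my guesses, not anything disclosed about LRB): pack fetch offsets into cache-line-sized groups of 16 and issue them round-robin across fibers, so each fiber's data has had time to land in L2 before it runs again:

```python
# Toy sketch: pack texture-fetch descriptors into groups of 16 (one cache
# line of offsets), issue one group per fiber, and only revisit a fiber
# after every other fiber has issued - by then its data is assumed in L2.
# All names/sizes here are my own illustration.

GROUP = 16  # fetches batched into one cache line of offsets

def batch_fetches(offsets):
    """Split a flat list of fetch offsets into GROUP-sized units."""
    return [offsets[i:i + GROUP] for i in range(0, len(offsets), GROUP)]

def issue_order(num_fibers, batches_per_fiber):
    """Round-robin issue: a fiber's next batch runs only after every
    other fiber has issued one, hiding the L2 fill latency."""
    order = []
    for b in range(batches_per_fiber):
        for f in range(num_fibers):
            order.append((f, b))
    return order

batches = batch_fetches(list(range(64)))  # 64 fetches -> 4 groups of 16
print(len(batches))                       # 4
print(issue_order(2, 2))                  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```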
     
  11. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Fibering != unrolling, unless you want to have fun constantly thrashing the instruction cache.
     
  12. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Isn't the shader clock supposed to be twice the core clock in G80 and GT200? Clearly not here, if this is true. I am a bit sceptical that it has been ditched. :???:
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I'm not sure what dedicated functionality there would be. Atomic operations resolve to control of cache lines, and this has been rolled into a CPU architecture's coherence scheme. The dedicated logic in this case would be the cache controller. As far as the actual ALU work is concerned, atomic operations are nothing special once the cache line statuses have been handled.

    Usually, whatever the TLB would be doing with regards to page status has already been done, as the CPU first loads the cache line with the atomic variable, then tries to assert ownership. Involving the TLB past the initial load attempt is begging to cause an exception or interrupt related to page table or the TLB needing to be filled, which would run counter to the desire to keep atomic ops short.
    I'm not sure what a TLB fabric would entail.

    The statements concerning this indicate the core is in charge of handling pages not in memory, hopefully a rare occurrence for Larrabee. My impression was more that the core itself handles paging things in, which is more than just the TLB, as this involves creating page table entries - something a simple texture unit might not be able to do.

    Data doesn't just appear in the L2/L1 in a coherent cache scheme, barring some very non-x86 and possibly no-CPU type cache behavior.
    The data would have to have been pulled in by the core, which is currently only guessing when the data will have actually been added - absent some kind of special interrupt that has not yet been disclosed.
    A low-rent scheme would be for the texture units to write out a 16-bit error flag word for each qquad's texture access.
    Set that flag as a bit mask for a standard gather using the base addresses of each texture access.

    If all bits are 0, the operation is skipped.
    For all bits set to 1, each gather load is a standard x86 load, which will hit the fault and initiate the standard x86 handler.
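    A quick Python model of that low-rent scheme (illustrative only; nothing here is disclosed Larrabee behavior): the texture unit leaves a 16-bit fault mask per qquad, and the core re-issues loads only for the masked lanes:

```python
# Toy model of the fault-mask scheme: the texture unit writes a 16-bit
# mask per qquad with bit i set if lane i's fetch faulted; the core then
# re-issues a gather covering only the masked lanes, where each faulting
# load takes the normal x86 path and hits the standard fault handler.

def fault_mask(results):
    """Build a 16-bit mask with bit i set if lane i's fetch faulted."""
    mask = 0
    for i, ok in enumerate(results):
        if not ok:
            mask |= 1 << i
    return mask

def lanes_to_retry(mask):
    """Decode the mask back into the lane indices needing a real load."""
    return [i for i in range(16) if mask & (1 << i)]

# 16 lanes: lanes 2 and 9 faulted, the rest hit.
ok = [True] * 16
ok[2] = ok[9] = False
m = fault_mask(ok)
print(hex(m))             # 0x204
print(lanes_to_retry(m))  # [2, 9]
# If the mask is 0, the whole retry gather is skipped.
```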
     
  14. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Oops, so perhaps scratch my idea of grouped clause register allocation... seemed as if it might work when all hyperthreads are running the same shader, but perhaps not!
     
  15. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,062
    Likes Received:
    3,119
    Location:
    New York
    Nope, it's at least twice the core clock. Most configs have shaders running at 2.2-2.5x the core clock. For example my GTX 285 is at 700/1585 right now.
     
  16. crystall

    Newcomer

    Joined:
    Jul 15, 2004
    Messages:
    149
    Likes Received:
    1
    Location:
    Amsterdam
    The first Larrabee paper stated that communication among the four hardware threads of a core went through a queue updated with the CMPXCHG instruction without using the LOCK prefix. This is possible because the four logical threads running on the core's hardware contexts (1 FE and 3 BE in Intel's nomenclature) are pinned to the core, and thus CMPXCHG works atomically even without the LOCK prefix. This was done to avoid the unnecessary cache-coherency overhead of such an operation. Communication with the outside world, on the other hand, requires primitives involving the LOCK prefix and suffering the full cost of a cache-coherency broadcast.
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I was dallying with the thought of the programmer having to roll their own global atomics :shock: forgetting that cache coherence + line-locking should lie at the heart of this.

    :oops: I should have just referred to the cache-coherency fabric. I was mixing the need for virtualised addressing common to all cores into cache-coherency communication, but referring to the former.

    For 32 cores on the same die to all cooperate in a single virtual address space there does need to be some kind of TLB fabric though, doesn't there? I suppose they're all caching the same page table, which implies changes to the page table have to be atomic, but that's a whole other kettle of fish. I don't know how the "Pentium core" that Larrabee's based on does this, or how scalable it is.

    Hmm, I was forgetting that D3D programmers currently have to implement their own virtualised texture scheme, and it wouldn't work like this for normal textures, so this is looking at a future D3D, or just for people like Sweeney.

    Though I'm wondering what happens when "AGP texturing" is required...

    I was thinking that there's an addressing scheme for texture results, e.g. a block of addresses are reserved for 16 quads of texture results returned by the TU. A portion of L2 acts as a stream-through buffer for these texture results. But that's mostly a sideline issue for this topic.

    I suppose the chip works in one of two modes: real and virtual addressing. Normally the TUs fetch texels from flat addresses and return results in a real-addressed block (which is the L2 stream-through buffer).

    Slides 12 and 33 here:

    http://s08.idav.ucdavis.edu/forsyth-larrabee-graphics-architecture.pdf

    indicate that there are TU TLBs, which allow it to work independently of the core, fetching page table entries if need be and managing its own page load requests. The owning thread is oblivious to all this stuff (though the core may receive mirrored TLB entries if they're changed?) and just ends up stalling when the texture results don't appear.

    Alternatively the programmer can elect to have hard faults activated, which seems to mean that TU-TLB is a mirror of core-TLB. I guess as soon as a TU-TLB entry miss occurs or a page is listed as not in physical RAM, the TU abandons the TEX. The TU fires the address back to the core which sleeps the thread and then services either the missing TLB entry or the page load, or both.

    When the page is ready to be used by the TU the core wakens the thread which re-submits the TEX.

    Presumably textures are striped into pages by mip-level, and each texture-mip doesn't cross page boundaries. This means there's only one page fault per TEX instruction per mip level - so in theory you could get two or more page misses for a single TEX, but you wouldn't be generating a miss per pixel in the qquad.
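    That striping assumption is easy to sanity-check with a toy Python model (page size, addresses, and layout invented for illustration): if no mip crosses a page boundary, one TEX touches at most one page per sampled mip level, regardless of how many pixels in the qquad sample it:

```python
# Rough sketch of the striping assumption: each mip level lives in its
# own page(s) and no mip crosses a page boundary, so a single TEX can
# fault at most once per sampled mip level, not once per pixel.
# The page size and addresses below are invented for illustration.

PAGE = 4096  # bytes

def pages_touched(mip_base_addrs, texel_offsets):
    """Distinct pages hit by one TEX: one set of texels per sampled mip."""
    pages = set()
    for base in mip_base_addrs:
        for off in texel_offsets:
            pages.add((base + off) // PAGE)
    return pages

# Trilinear TEX samples two mips; four pixels of a quad hit nearby texels.
mips = [0x10000, 0x20000]    # page-aligned mip base addresses
texels = [0, 4, 256, 260]    # all within one page of each mip
print(len(pages_touched(mips, texels)))  # 2: at most one fault per mip
```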

    So overall I don't think this would use the gather mechanic I described before - it seems like it's just a paging mechanic. Though I suppose you could use gather with page-base addresses, hmm...

    One of the interesting things here is that a TU-TLB is logically a mirror of 8 cores' TLBs, when hard-faulting is active. And each core's TLB is logically 4-way threaded too (though I'm not sure if Larrabee actually threads core-TLB). Whether hard- or soft-faulting there's 32 cores and 8 TUs all cooperating in page table maintenance.

    Jawed
     
  18. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    ALU:TMU frequency ratio

    G80= 2.35x
    G92= 2.49x
    GT200b = 2.27x
    (hypothetical G300) = 2.29x
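    Those ratios fall straight out of shader clock divided by core clock. A quick Python check (the clock pairs below are approximate figures chosen to reproduce the quoted ratios, so treat them as illustrative; the G300 pair is the rumoured 700/1600 MHz from earlier in this thread):

```python
# ALU:TMU ratio = shader (hot) clock / core clock.
# Clock pairs are approximate/illustrative; real SKUs vary.
clocks = {
    "G80":    (1350, 575),
    "G92":    (1670, 670),
    "GT200b": (1470, 648),
    "G300?":  (1600, 700),  # rumoured figures from this thread
}

for chip, (shader, core) in clocks.items():
    print(f"{chip}: {shader / core:.2f}x")
# G80 -> 2.35x, G300? -> 2.29x
```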
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    To be honest I don't think of the 32 registers as anything other than pipeline registers. A shader's state registers, r0, r1 etc. map to addresses. How the compiler or programmer maps between state-register-addresses and pipeline-registers is completely flexible, e.g. if shader execution is emulated as clause-by-clause loops over the fibres, then the body of each loop has a fixed configuration of instructions<->registers, so shader state registers must all be mapped consistently into these registers. If the shader is formed as some chaotic intermingling of fibres :shock: then, erm, well leave that to the compiler.

    It should be the operand collector's problem - i.e. it translates warp+thread register-address into dynamic-warp lane and all that gritty stuff.

    Yep, having warps as contexts in their own right works very nicely - though any kind of render target RMW, intra-shader, obviously causes all sorts of fun as you're trying to construct a scoreboarder of your own using a task-warp as an atomic-singleton. Certainly not saying it won't work, but clearly this ends up looking a lot like Larrabee. Here I'm talking about long-winded RMW clauses within a shader, e.g. fetch pixel on instruction 1 with 10 or 20 dependent instructions before returning a result to the render target.

    Though I have to admit the way D3D-11 render target RMW is described seems to imply that the developer can't make this atomic. If atomic is desired then an extra pass is required using D3D-CS and, optionally, atomic - depends on whether CS is consuming an append buffer with pixel address+value pairs or whether it's trying to post-process a render target, I guess.

    So render-target RMW seems to be un-ordered chaos with fingers-crossed - prolly won't see much use?

    Obviously CUDA's different and we're now just waiting to see if persistent warps are a possibility.

    Yeah, this kind of thing seems to be the usual case of hiding latency by using a combination of fibres/qquads with ALUs in the shadow of the batch of requests + SMT. If the thread runs out of ALUs to run it means that the task-allocation routine has made the screen space tile too small. I suppose this is inevitable sometimes no matter what - after all GPUs can't always hide worst case texturing latency (e.g. perlin noise based on 3D texture lookups).

    Jawed
     
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The page table is just data in memory, granted it has system-level significance.
    The various cores and tex units might cache parts of it at any given time in their TLBs. There's no need in x86 for a TLB fabric, as they are caches that can stay coherent like their neighbor L1 and L2s.
    Since it is system-critical, changing the page table would require additional work that would be serializing. As AMD's Barcelona chip showed, there are a number of actions related to TLBs where there are assumed to be atomic operations on page table structures. The failure to handle them atomically is basically game-over.

    It might be fun to see how much work Intel has done in verifying Larrabee can handle TLB updates across so many cores, given what AMD experienced with just four. TLB-related errata concerning non-atomic updates are documented and still present in Intel and AMD designs, but are typically prevented by microcode patches.

    It would be interesting to see just how much the texture units can modify the page tables. Maybe they can update some of the bookkeeping bits per entry. If they write to the L2 buffer, it might have to be labelled dirty, or maybe it has to be already initialized to a fixed status by the core.
    That the texture unit gives up on a fault makes sense as modifying the actual page table is an OS-level operation.
    I would assume Larrabee's software sets up as much as it can ahead of time and tries to keep it as unchanged as possible, given the overhead.

    I wouldn't expect the core to receive mirrored TLB entries, if the texture unit somehow modifies them.
    An alteration would invalidate all cached copies of the page table entry. The CPU, if it were to require that entry, would miss and have to fill the TLB before trying to complete the memory access.
    It should be on-chip, as the texture unit would have a copy. It might be that the entries are mostly not modified by the texture units to avoid TLB thrashing.

    My take was that texture units have more specialized TLB behavior that allows them to behave in a rather non-x86 manner. The texture unit can, at programmer discretion, give up when a full core would be required to service a miss or fault.
    This might make good performance sense, as fiddling with the TLB can inject unpredictable latencies.
    I would think that the texture unit with hard faults enabled would still defer to the core if it encounters a fault that invokes an OS routine.

    I didn't get the impression that there was any mirroring of TLBs in the pdf. What leads to this conclusion? TLB entries can be shared, but an update to one TLB's entry will typically only lead to the invalidation of old copies cached elsewhere. Broadcasting an update wouldn't normally be done. The other cores would just have to service a miss, if they happen to need the address.

    TLBs are typically a shared resource local to a core. Given how intimately they are tied to the memory pipeline and how many there would be if they were per-thread, I'd bet this hasn't changed.
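    The invalidate-rather-than-broadcast behavior is easy to sketch in Python (a purely illustrative model; no real x86/Larrabee structures are represented): updating a page table entry shoots down stale TLB copies, and other cores simply refill on their next miss:

```python
# Sketch of invalidate-on-update: changing a page table entry doesn't
# broadcast the new value to other TLBs; it just invalidates their
# cached copies, and they refill from the page table on the next miss.

class TLB:
    def __init__(self):
        self.entries = {}  # virtual page -> physical frame

    def lookup(self, vpage, page_table):
        if vpage not in self.entries:      # TLB miss: walk the page table
            self.entries[vpage] = page_table[vpage]
        return self.entries[vpage]

def update_entry(page_table, tlbs, vpage, new_frame):
    """Atomically update the entry, then shoot down stale copies."""
    page_table[vpage] = new_frame
    for tlb in tlbs:
        tlb.entries.pop(vpage, None)  # invalidate; don't push the update

page_table = {0: 100}
core_a, core_b = TLB(), TLB()
assert core_a.lookup(0, page_table) == 100  # both cores cache the entry
assert core_b.lookup(0, page_table) == 100
update_entry(page_table, [core_a, core_b], 0, 200)
assert core_b.lookup(0, page_table) == 200  # refilled on the next miss
```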
     
