Nvidia GT300 core: Speculation

10x DP increase. Color me sceptical :/
It wouldn't be that hard. Add real DP functionality that drops the performance hit vs. SP from 1:12 to 1:2, which is already a 6x increase, with the rest being made up by general performance improvement.

Not saying it'll happen, just that it's not outlandish.
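Back-of-the-envelope, with the ~1.7x general SP throughput gain being my own filler number rather than anything claimed here:

```latex
\frac{\mathrm{DP_{new}}}{\mathrm{DP_{old}}}
  \approx \underbrace{\frac{1/2}{1/12}}_{\text{ratio improvement}}
  \times \underbrace{1.7}_{\text{general speed-up}}
  = 6 \times 1.7 \approx 10
```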
 
Nobody said it was impossible. Hey, they could drop SP and go DP-exclusive too. It's also a question of making money out of it.
 
Technically, why not? Practically, Intel knows.

Technically yes; if you don't care one bit about central processing performance that is. Instead of IPS (instructions per second) you'd simply have SPI (seconds per instruction).

Nobody said it was impossible. Hey, they could drop SP and go DP-exclusive too. It's also a question of making money out of it.

Then they'd definitely have a separate chip for professional markets. You could never justify the die size increase for a desktop GPU, considering that DP is of no use to the mainstream consumer.
 
What makes you think they are doing this? Moving your I/Os to another chip doesn't necessarily solve that problem at all, and introduces quite a few more. Not that it's impossible, but I'm very curious to hear your reasoning.
See this post:

http://forum.beyond3d.com/showthread.php?p=1294818#post1294818

on a patent application which describes a remote hub for physical DDR interfacing + the option to put ROPs there.

I don't know what the mm2/Gbps and mW/Gbps figures are for GDDR5, but for them to have a win, they'd need to use an interconnect between the "GPU" and "GPU memory controller" that is way better (in both those metrics) than GDDR5.
For what it's worth I agree GDDR5 looks like a tough nut to crack - whereas GDDR3 looks pretty easy (see Xenos). Obviously these high speed interfaces aren't easy, with per lane training - and especially as GDDR5 is using these kinds of tricks.

If they have 200GB/s of memory bandwidth (very reasonable), what interconnect will move 200GB/s between two chips, how much area and power will it use? That power and area is all extra, since with a normal architecture it'd just be GDDR5<-->DRAM. Now you have GPU<-->magic interconnect<-->GDDR5<-->DRAM.
I agree, the magic interconnect will add to board-level power - but it might reduce GPU-chip power.

The command structure for moving data over the magic interconnect is theoretically simpler than the command structure of the DDR interface. There's no bank tracking with associated re-ordering and packets of data can be in nice big bursts with a low addressing overhead. DRAM IO drowns in addressing overhead, in comparison, with its dedicated command bus.

Moreover, it seems like that would do rather catastrophic things to latency for atomics, which goes against NV's goal of focusing on GPGPU.
One thing I don't understand is why NVidia's current hardware is incapable of doing floating point atomics, whereas the ROPs are capable of floating point "blending". My theory is that it achieves the latter by a combination of units (e.g. colour + stencil) that, combined, produce a result that looks good enough for floating point blending, but not precise enough to be called a floating point atomic.

In theory D3D11 tightens up the spec on floating-point blending, maybe to the point that it's good enough for floating-point atomics. Then again ...

Separately, atomics already have catastrophic latency on NVidia. You're using them wrong if you suffer at their hands. You need to issue a salvo of atomics to get nice performance.
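For what it's worth, a minimal CUDA sketch of the salvo idea - each thread fires off many independent atomicAdds and never consumes a return value, so the latencies overlap (the kernel and buffer names are just illustrative):

```cuda
// Illustrative only: a "salvo" of global atomics. Each thread issues one
// atomicAdd per element it visits and ignores the return value, so many
// atomics from the same warp are in flight at once and the per-atomic
// latency is hidden rather than serialised against dependent code.
__global__ void histogram_salvo(const unsigned int *data, unsigned int *hist, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (; i < n; i += stride)
        atomicAdd(&hist[data[i] & 255], 1u);   // fire and forget, 256-bin histogram
}
```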

So I'm curious:
1. What interconnect would you imagine they use?
See paragraphs 19 and 20 of the patent application. I recoil in horror at the use of PCI-Express, for what it's worth. Clearly that's not enough for the CrossFireX Sideport on RV770 (which appears to be PCI Express) so I don't see how it's viable here.

2. What exactly would they move to the 2nd "GPU memory controller"?
Everything that deals with DRAM and seemingly ROPs.

3. Can you elaborate why you think they might go this route?
It adds flexibility; it defines an architecture that is based on re-usable, interchangeable and varying counts of these hubs; it reduces the engineering (costs + length of refresh cycles) involved per GPU chip, as a pile of junk (differing DDR interfaces) is someone else's problem; it focuses the GPU chip on computing and latency hiding; it scales up and down across SKUs from $75 to $700 (and $10000 in Tesla guise) easily; it also means that the Tesla line is not burdened by the crud in the ROPs that doesn't contribute to GPGPU (non-atomics stuff: z/stencil testing, render target compression, MSAA sample manipulation and maybe some other bits I haven't thought of...)

As you can tell I really like this concept: it enables NVidia to architect a graphics processor for consumers that is modular, enabling a low-fat Tesla configuration to break off, enabling faster refresh cycles, controlling power and bandwidth and potentially opening-up Tesla to more advanced clustering configurations than the current fairly poor situation of small-scale "SLI" rigs or having to dedicate an x86 chip to each GPU in a rack.

At the same time the whole thing hinges on this magic-interconnect. It has an advantage over HT and QPI in that NVidia owns the landscape: NVidia builds the GPU, the substrate it's mounted on (with hubs in tow?) and the board the GPU and memory is built on. NVidia can emplace constraints on the magic-interconnect that aren't viable in more general purpose schemes. It's similar to the way in which GPUs have access to ~200GB/s while CPUs are pissing about with 1/5th that.

Jawed
 
1) then the tesla chip won't be able to do any atomic operations.
The Tesla-specific hub could have simplified stuff purely for atomics. Or maybe NVidia wants atomics running in the ALUs (there's shared memory atomics to be done, not just global atomics)?

2) and why on earth would a tesla chip have video stuff?
Good point. But video stuff is in the region of 5mm² if judged by what ATI does, prolly half that in 40nm - and may use ALUs for some aspects of processing? It could go in the hub, but what if the architecture is about having multiple hubs per card for the 256-bit and 512-bit cards?

EDIT: 1 more snag here. Then both your compute chip and your NVIO3 chip will need at least a 512-bit memory interface, imposing a rather large lower bound on the die size (by pad limits)
The sweetspot for the hubs might be 128-bit DDR interfaces. I'm thinking there wouldn't necessarily be a single hub per GPU.

Jawed
 
Yeah, I forgot about depth/stencil issues. The latency between post-ROP fine Z/stencil results getting to the coarse+fine early Z/stencil can be important. In the case of short shaders (as you described), beyond the issues you described, the latency between the ROP and the raster unit might easily be high enough that the early data isn't yet valid at the start of the draw call. Some depth/stencil state cases might require waiting for the early-update data to sync, which could create a large enough pipeline bubble to warrant turning off the early logic?
Thinking about this some more I suspect it's purely the continual latency over the lifetime of the draw call from late-Z units (full precision Z) back to early-Z (low precision Z). That latency always means early-Z is significantly out of date as well as, presumably, slowing down the late-Z unit (as it has to feed early-Z).

What's puzzling me is how was early-Z viable under these kinds of conditions back in the early days, 8500 onwards? By definition those shaders were teeny, surely they're the worst case for shortness?

Hmm, maybe this is all about relative memory performance, compared with fillrate and compute (i.e. GPU clocks and memory clocks were more similar back then). Back then Z-rejection performance per unit of fillrate would have been a big deal because memory efficiency was so low (poor compression, little caching, slow memory command turn-around). Yet against that the compute:memory of that era was much more favourable than it is now :???:

Jawed
 
Just noticed that Intel has memory hubs of their own in the Nehalem EX architecture:

http://www.anandtech.com/weblog/showpost.aspx?i=604

Scalable Memory Buffers they're called. Seems IBM's been doing this for a while, too.

The SMI sections appear to occupy one entire long side of the die - I presume that's what talks to the SMBs. I haven't worked out yet what the effective DDR bus width per chip is, or how this compares with the size of the DDR interfaces on Nehalem.

It appears each processor interfaces with multiple SMBs if required.

Jawed
 
The Tesla-specific hub could have simplified stuff purely for atomics. Or maybe NVidia wants atomics running in the ALUs (there's shared memory atomics to be done, not just global atomics)?

BTW, there is a current difference in functionality between global and shared memory atomics: only global atomics support 64-bit operations in CUDA 2. Seems like shared memory atomics would naturally be simple ALU operations, flagged to avoid bank broadcast and serialized when the hardware detects the same shared memory address (all of which would clearly have bank conflicts). Likely in the address-conflict case, with both the hot clock and the dual-issue half warp, this leads to a lot of idle lanes on GT200. Who knows what they have changed for GT300...
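A tiny CUDA sketch of the two extremes I mean (purely illustrative - the bin layout and counts are made up):

```cuda
// Illustrative only. Shared-memory atomics (compute 1.2+): when every lane
// of a warp targets the same shared word the hardware has to serialise the
// updates (lots of idle lanes); spreading lanes across different words/banks
// avoids most of that serialisation.
__global__ void shared_atomic_conflicts(unsigned int *out)
{
    __shared__ unsigned int bins[32];
    if (threadIdx.x < 32) bins[threadIdx.x] = 0;
    __syncthreads();

    atomicAdd(&bins[0], 1u);                   // worst case: full address conflict
    atomicAdd(&bins[threadIdx.x & 31], 1u);    // better: lanes fan out over 32 words

    __syncthreads();
    if (threadIdx.x < 32)
        out[blockIdx.x * 32 + threadIdx.x] = bins[threadIdx.x];
}
```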

Of the subset of atomics in DX11 (which I think is only: add, min, max, or, xor, cas, exch, and the new InterlockedCompareStore), only one actually requires any complicated floating point hardware (that's add), assuming that InterlockedAdd() supports floating point in DX11 (which I'm not 100% sure of). Min and max could be done with simple integer math with an easy fixup for negative floats. Seems to me that an atomic only ALU unit would be tiny (with the exception of the float add).
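To make the "integer math with an easy fixup for negative floats" point concrete, here's one way it could be done today in CUDA with nothing but the existing integer atomicMax (the helper names are mine):

```cuda
// Illustrative only: float max built from the existing integer atomicMax.
// Map float bits to an unsigned key whose unsigned ordering matches the
// float ordering (negatives: flip all bits; non-negatives: set the sign
// bit), do the atomic on the key, decode afterwards.
__device__ unsigned int floatToOrderedUint(float f)
{
    unsigned int b = (unsigned int)__float_as_int(f);
    return (b & 0x80000000u) ? ~b : (b | 0x80000000u);
}

__device__ float orderedUintToFloat(unsigned int u)
{
    unsigned int b = (u & 0x80000000u) ? (u & 0x7fffffffu) : ~u;
    return __int_as_float((int)b);
}

__global__ void float_atomic_max(const float *in, unsigned int *keyMax, int n)
{
    // keyMax should be pre-initialised with the encoding of -infinity.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicMax(keyMax, floatToOrderedUint(in[i]));  // plain integer atomic
    // The host (or a later kernel) recovers the float via orderedUintToFloat().
}
```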

BTW, an ATI DX11 doc says,

(Atomics) Can optionally return original value. (Atomics have a) Potential cost in performance: (1.) especially if original value is required, (2.) more latency hiding required.

Hints directly that atomic operations will be of higher latency, and that performance will be lower when the return value is required, which suggests that global atomic operations aren't done in the ALUs (because if they were, you'd have the return value for free) :oops:
 
See paragraphs 19 and 20 of the patent application. I recoil in horror at the use of PCI-Express, for what it's worth. Clearly that's not enough for the CrossFireX Sideport on RV770 (which appears to be PCI Express) so I don't see how it's viable here.

The PCI-E interface being on the hub would be quite important. With PCI-E on the hub, wouldn't you get a unified way to access memory regardless of source (just that certain parts of the address space, like the PCI-E mapping, have lower bandwidth and higher latency)? Example usages: a unified way to grab command buffer data regardless of source (I'd bet that some day, GT400?, you will be able to issue draw calls GPU-side on NVidia hardware); parallel cpu2gpu and gpu2cpu copies interleaved with computation (which is possible starting with GT200); lower latency gpu2cpu result fetch for things like using the GPU for more latency-sensitive operations (perhaps something like real-time audio processing); the ability to skip the cpu2gpu copy and fetch geometry data directly from the CPU side (think about tessellation: the huge amount of data expansion means the latency and bandwidth of the early source data might not be of high importance)...
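For instance, the copy/compute interleaving already looks roughly like this in CUDA (the kernel, buffer names and chunk sizes are just illustrative, and how much genuinely overlaps depends on the chip):

```cuda
#include <cuda_runtime.h>

// Illustrative only: overlapping host<->device copies with kernel work
// using pinned host memory and two streams.
__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void pipelined(float *h_in, float *h_out, int n)   // h_in/h_out: pinned (cudaHostAlloc)
{
    const int CHUNKS = 8, chunk = n / CHUNKS;      // assume n divides evenly
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < CHUNKS; ++c) {
        cudaStream_t st = s[c & 1];
        size_t off = (size_t)c * chunk;
        cudaMemcpyAsync(d + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process<<<(chunk + 255) / 256, 256, 0, st>>>(d + off, chunk);
        cudaMemcpyAsync(h_out + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d);
}
```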
 
BTW, there is a current difference in functionality between global and shared memory atomics: only global atomics support 64-bit operations in CUDA 2. Seems like shared memory atomics would naturally be simple ALU operations, flagged to avoid bank broadcast and serialized when the hardware detects the same shared memory address (all of which would clearly have bank conflicts). Likely in the address-conflict case, with both the hot clock and the dual-issue half warp, this leads to a lot of idle lanes on GT200. Who knows what they have changed for GT300...
Have you tested the performance of shared memory atomics?

I'm unclear how these can be anything other than purely serialised. The store unit in each multiprocessor acts as the gatekeeper for atomics, rather than a ROP in the global atomic case. So the question then is how big is the store unit's scoreboard? By bank?

Of the subset of atomics in DX11 (which I think is only: add, min, max, or, xor, cas, exch, and the new InterlockedCompareStore), only one actually requires any complicated floating point hardware (that's add), assuming that InterlockedAdd() supports floating point in DX11 (which I'm not 100% sure of). Min and max could be done with simple integer math with an easy fixup for negative floats. Seems to me that an atomic only ALU unit would be tiny (with the exception of the float add).
It would be a shame if these atomics are not floating point.

BTW, an ATI DX11 doc says,

(Atomics) Can optionally return original value. (Atomics have a) Potential cost in performance: (1.) especially if original value is required, (2.) more latency hiding required.

Hints directly that atomic operations will be of higher latency, and that performance will be lower when the return value is required, which suggests that global atomic operations aren't done in the ALUs (because if they were, you'd have the return value for free) :oops:
Sounds entirely reasonable. The fast path is just to put the update into the atomic queue. It's fire-and-forget and only suffers serialisation latency. The slow path is forcing the thread to wait to find out the value, which requires the atomic queue to be processed up to the point containing the owning thread's update.

It appears on NVidia you always take the slow path, regardless of whether the owning thread wants a return value. Maybe NVidia will fix that in GT300?
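In CUDA terms the two cases look like this (whether the hardware actually treats them differently is exactly the open question):

```cuda
// Illustrative only. Fire-and-forget: the return value of atomicAdd is
// discarded, so in principle the thread could carry on as soon as the
// update is queued. Return-value case: the thread genuinely has to wait
// for the old value before it can use it (classic append/compaction).
__global__ void fire_and_forget(unsigned int *counter)
{
    atomicAdd(counter, 1u);                       // no dependent use
}

__global__ void append(unsigned int *counter, unsigned int *out)
{
    unsigned int slot = atomicAdd(counter, 1u);   // must wait for the old value
    out[slot] = blockIdx.x * blockDim.x + threadIdx.x;
}
```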

Jawed
 
The PCI-E interface being on the hub would be quite important. With PCI-E on the hub, wouldn't you get a unified way to access memory regardless of source (just that certain parts of the address space, like the PCI-E mapping, have lower bandwidth and higher latency)? Example usages: a unified way to grab command buffer data regardless of source (I'd bet that some day, GT400?, you will be able to issue draw calls GPU-side on NVidia hardware); parallel cpu2gpu and gpu2cpu copies interleaved with computation (which is possible starting with GT200); lower latency gpu2cpu result fetch for things like using the GPU for more latency-sensitive operations (perhaps something like real-time audio processing); the ability to skip the cpu2gpu copy and fetch geometry data directly from the CPU side (think about tessellation: the huge amount of data expansion means the latency and bandwidth of the early source data might not be of high importance)...
All this connectivity and parallelism already works in current GPUs as far as I can tell either in graphics or compute (that is, there may be subsets of these capabilities in one or the other programming model). PCI Express, itself, is relatively high latency though.

It's also point-to-point. So if you want to build a graphics card with multiple hubs you'd need to interconnect them.

Sadly the PCI Express x16 slot doesn't allow arbitrary independence of the lanes in the slot (cheaper, simpler) - so if a future version has independence of all lanes then any graphics card that makes use of that configuration can't work in an older-version PCI Express slot (or it would have to fall back to an on-card network). With independence you could dedicate 4 lanes to each of four hubs on the ultra-enthusiast SKU, so that the hubs work in parallel and data moves don't trouble the GPU. As it is you'd require some kind of network, i.e. a tree of router chips, or at least a tree of routing integrated within the hubs.

Jawed
 
Have you tested the performance of shared memory atomics?

I haven't profiled shared memory atomics yet, nor global memory atomics without the return value. I've only profiled global with return.

Haven't really thought about the best way to profile shared memory atomics (seems lots of possible pitfalls with things that get optimized out by the compiler in combination with effects from non-atomic instructions). If you have any ideas here, I'll try them...
 
See this post:

http://forum.beyond3d.com/showthread.php?p=1294818#post1294818

on a patent application which describes a remote hub for physical DDR interfacing + the option to put ROPs there.


For what it's worth I agree GDDR5 looks like a tough nut to crack - whereas GDDR3 looks pretty easy (see Xenos). Obviously these high speed interfaces aren't easy, with per lane training - and especially as GDDR5 is using these kinds of tricks.


I agree, the magic interconnect will add to board-level power - but it might reduce GPU-chip power.

The command structure for moving data over the magic interconnect is theoretically simpler than the command structure of the DDR interface. There's no bank tracking with associated re-ordering and packets of data can be in nice big bursts with a low addressing overhead. DRAM IO drowns in addressing overhead, in comparison, with its dedicated command bus.

Why would it be simpler? You still need a low latency way to send memory controller commands to the GDDR5 controller. You haven't simplified the problem at all.

Also, this secondary chip will be pretty damn big because it will be pin-limited.

Just because something is in a patent doesn't mean it will see the light of day.

See paragraphs 19 and 20 of the patent application. I recoil in horror at the use of PCI-Express, for what it's worth. Clearly that's not enough for the CrossFireX Sideport on RV770 (which appears to be PCI Express) so I don't see how it's viable here.


Everything that deals with DRAM and seemingly ROPs.

It adds flexibility; it defines an architecture that is based on re-usable, interchangeable and varying counts of these hubs; it reduces the engineering (costs + length of refresh cycles) involved per GPU chip, as a pile of junk (differing DDR interfaces) is someone else's problem; it focuses the GPU chip on computing and latency hiding; it scales up and down across SKUs from $75 to $700 (and $10000 in Tesla guise) easily; it also means that the Tesla line is not burdened by the crud in the ROPs that doesn't contribute to GPGPU (non-atomics stuff: z/stencil testing, render target compression, MSAA sample manipulation and maybe some other bits I haven't thought of...)

You don't get it:
1. It goes against the trend of greater integration
2. GPUs are not just about lots of compute, they are first and foremost about lots of memory bandwidth. Once you have memory bandwidth, then compute matters.
3. Separating the memory controller reduces effective memory bandwidth, a lot.
4. Being independent of the graphics memory isn't a big plus as you need to tune GPU architecture to suit memory architecture (remember load/store units are shared by multiple SMs...and GDDRx will change the ideal command/data ratio).

As you can tell I really like this concept: it enables NVidia to architect a graphics processor for consumers that is modular, enabling a low-fat Tesla configuration to break off, enabling faster refresh cycles, controlling power and bandwidth and potentially opening-up Tesla to more advanced clustering configurations than the current fairly poor situation of small-scale "SLI" rigs or having to dedicate an x86 chip to each GPU in a rack.

At the same time the whole thing hinges on this magic-interconnect. It has an advantage over HT and QPI in that NVidia owns the landscape: NVidia builds the GPU, the substrate it's mounted on (with hubs in tow?) and the board the GPU and memory is built on. NVidia can emplace constraints on the magic-interconnect that aren't viable in more general purpose schemes. It's similar to the way in which GPUs have access to ~200GB/s while CPUs are pissing about with 1/5th that.

Jawed

The reasons why GPUs have 200GB/s of bandwidth and CPUs don't are way different and quite simple:
1. CPUs need high capacity (>100GB), GPUs use <2GB
2. CPUs use socketed DIMMs
3. CPUs don't have as many I/O pins (GT200 has as many I/O pins as high-end RISC chips)
4. CPUs need low latency
5. CPUs need to work in multi-socket systems coherently

Coming up with something superior to HT or CSI will not be simple. Yes, there are some compromises they made to operate in more commodity oriented environments, so you could pick up some performance there...but they are just designed for other purposes.

If you want to think about what kind of I/O you might use to connect a GPU and GPU memory controller, I'd look into Rambus' work. They do the most aggressive high speed interfaces out there.

DK
 
I haven't profiled shared memory atomics yet, nor global memory atomics without the return value. I've only profiled global with return.
Oh, you haven't done global atomics without a return value. Thought you had :oops:

Haven't really thought about the best way to profile shared memory atomics (seems lots of possible pitfalls with things that get optimized out by the compiler in combination with effects from non-atomic instructions). If you have any ideas here, I'll try them...
For shared memory atomics maybe try a domain with blocks of 1024 threads - get each one to generate a random integer (generate a load of them to hide latency) and compute a block-level max? Try different modulos of the thread ID within the warp for the shared memory atomic address: 1, 2, 4, 8, 16?
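Something like this, maybe (a rough sketch - the LCG constants and bin count are arbitrary, the loop is just there so the compiler can't optimise the atomics away):

```cuda
// Rough sketch of the benchmark idea above. Each thread generates a stream
// of pseudo-random integers and folds them into a shared-memory max with
// atomicMax (compute 1.2+); 'mod' controls how many lanes of a warp collide
// on the same address (try 1, 2, 4, 8, 16).
__global__ void shared_atomic_max_bench(unsigned int *out, int iters, int mod)
{
    __shared__ unsigned int smax[32];
    if (threadIdx.x < 32) smax[threadIdx.x] = 0;
    __syncthreads();

    unsigned int x = threadIdx.x * 1664525u + 1013904223u;   // per-thread LCG seed
    for (int i = 0; i < iters; ++i) {
        x = x * 1664525u + 1013904223u;
        atomicMax(&smax[threadIdx.x % mod], x);
    }

    __syncthreads();
    if (threadIdx.x < 32)
        out[blockIdx.x * 32 + threadIdx.x] = smax[threadIdx.x];  // defeat dead-code elimination
}
```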

Jawed
 
If you want to think about what kind of I/O you might use to connect a GPU and GPU memory controller, I'd look into Rambus' work. They do the most aggressive high speed interfaces out there.
I guess you can start with Nehalem-EX's SMB. That's real, it's working and it appears to be doing exactly what this patent proposes.

Jawed
 