Can you use LRB as a CPU or better, can you use LRB in a system without a central processing unit?
Technically, why not? Practically, Intel knows.
10x DP increase. Color me sceptical :/

It wouldn't be that hard. Add real DP functionality that drops the performance hit vs. SP from 1:12 to 1:2, which is already a 6x increase, with the rest being made up by general performance improvement.
Nobody said it was impossible. Hey, they could drop SP and go DP exclusive too. It's a question of making money out of it too.
What makes you think they are doing this? Moving your I/Os to another chip doesn't necessarily solve that problem at all, and introduces quite a few more. Not that it's impossible, but I'm very curious to hear your reasoning.

See this post:
http://forum.beyond3d.com/showthread.php?p=1294818#post1294818
on a patent application which describes a remote hub for physical DDR interfacing + the option to put ROPs there.
I don't know what the mm²/Gbps and mW/Gbps are for GDDR5, but for them to have a win, they'd need to use an interconnect between the "GPU" and "GPU memory controller" that is way better (in both those metrics) than GDDR5.

For what it's worth I agree GDDR5 looks like a tough nut to crack - whereas GDDR3 looks pretty easy (see Xenos). Obviously these high speed interfaces aren't easy, with per lane training - and especially as GDDR5 is using these kinds of tricks.
If they have 200GB/s of memory bandwidth (very reasonable), what interconnect will move 200GB/s between two chips, and how much area and power will it use? That power and area is all extra, since with a normal architecture it'd just be GPU<-->GDDR5<-->DRAM. Now you have GPU<-->magic interconnect<-->GDDR5<-->DRAM.

I agree, the magic interconnect will add to board-level power - but it might reduce GPU-chip power.
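As a rough sense of scale for the power side of that question, here's a back-of-envelope sketch (host-side C++ only, compiled like the other CUDA snippets in this thread). The 5-15 pJ/bit range is purely an assumption for illustration, not a measured figure for any real link.

```cuda
// Back-of-envelope only: what an extra chip-to-chip hop might cost in link
// power.  The pJ/bit numbers are placeholder assumptions, not measurements.
#include <cstdio>

int main() {
    const double bandwidth_GBps = 200.0;                 // figure quoted above
    const double bits_per_sec   = bandwidth_GBps * 8e9;  // 1 GB/s = 8e9 bit/s
    const double pJ_per_bit_lo  = 5.0;                   // assumed optimistic link energy
    const double pJ_per_bit_hi  = 15.0;                  // assumed pessimistic link energy

    // Power = bit rate * energy per bit (1 pJ/bit at 1 bit/s = 1e-12 W).
    printf("Extra link power for %.0f GB/s: %.0f W to %.0f W\n",
           bandwidth_GBps,
           bits_per_sec * pJ_per_bit_lo * 1e-12,
           bits_per_sec * pJ_per_bit_hi * 1e-12);
    return 0;
}
```

Even with the optimistic assumption that's several watts, plus a matching strip of PHY area on both chips, which is the "all extra" cost being pointed at.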
Moreover, it seems like that would do rather catastrophic things to latency for atomics, which goes against NV's goal of focusing on GPGPU.

One thing I don't understand is why NVidia's current hardware is incapable of doing floating point atomics, whereas the ROPs are capable of floating point "blending". My theory is that the latter is achieved by a combination of units (e.g. colour + stencil) that together produce a result that looks good enough for floating point blending, but isn't precise enough to be called a floating point atomic.
So I'm curious:
1. What interconnect would you imagine they use?

See paragraphs 19 and 20 of the patent application. I recoil in horror at the use of PCI-Express, for what it's worth. Clearly that's not enough for the CrossFireX Sideport on RV770 (which appears to be PCI Express), so I don't see how it's viable here.
2. What exactly would they move to the 2nd "GPU memory controller"?

Everything that deals with DRAM and seemingly ROPs.
3. Can you elaborate why you think they might go this route?

It adds flexibility; it defines an architecture that is based on re-usable, interchangeable and varying counts of these hubs; it reduces the engineering (costs + length of refresh cycles) involved per GPU chip, as a pile of junk (differing DDR interfaces) is someone else's problem; it focuses the GPU chip on computing and latency hiding; it scales up and down across SKUs from $75 to $700 (and $10000 in Tesla guise) easily; it also means that the Tesla line is not burdened by the crud in the ROPs that doesn't contribute to GPGPU (non-atomics stuff: z/stencil testing, render target compression, MSAA sample manipulation and maybe some other bits I haven't thought of...)
1) then the tesla chip won't be able to do any atomic operations.

The Tesla-specific hub could have simplified stuff purely for atomics. Or maybe NVidia wants atomics running in the ALUs (there's shared memory atomics to be done, not just global atomics)?
2) and why on earth would a tesla chip have video stuff?

Good point. But video stuff is in the region of 5mm² if judged by what ATI does, prolly half that in 40nm - and may use ALUs for some aspects of processing? It could go in the hub, but what if the architecture is about having multiple hubs per card for the 256-bit and 512-bit cards?
EDIT: 1 more snag here. Then both your compute and NVIO3 chips will need at least a 512-bit memory interface, imposing a rather large lower bound on the die size (by pad limits).

The sweetspot for the hubs might be 128-bit DDR interfaces. I'm thinking there wouldn't necessarily be a single hub per GPU.
Yeah I forgot about depth/stencil issues. The latency between post ROP fine Z/stencil results getting to the coarse+fine early Z/stencil can be important. In the case of short shaders (as you described), beyond the issues you described, the latency between ROP to raster unit might easily be high enough so that the early data isn't yet valid at the start of the draw call. Some depth/stencil state cases might require waiting for early update data sync, which could place a large enough pipeline bubble to warrant turning off the early logic?

Thinking about this some more I suspect it's purely the continual latency over the lifetime of the draw call from late-Z units (full precision Z) back to early-Z (low precision Z). That latency always means early-Z is significantly out of date as well as, presumably, slowing down the late-Z unit (as it has to feed early-Z).
BTW, there is a current difference in functionality between global and shared memory atomics: only global atomics support 64-bit operations in CUDA 2. Seems like shared memory atomics would naturally be simple ALU operations, flagged to avoid bank broadcast and serialized when the hardware detects the same shared memory address value (all of which would clearly have bank conflicts). Likely in the address conflict case, with both the hot clock and the dual-issue half warp, this leads to a lot of idle lanes in the GT200. Who knows what they have changed for GT300...

Have you tested the performance of shared memory atomics?
Of the subset of atomics in DX11 (which I think is only: add, min, max, or, xor, cas, exch, and the new InterlockedCompareStore), only one actually requires any complicated floating point hardware (that's add), assuming that InterlockedAdd() supports floating point in DX11 (which I'm not 100% sure of). Min and max could be done with simple integer math with an easy fixup for negative floats. Seems to me that an atomic only ALU unit would be tiny (with the exception of the float add).

It would be a shame if these atomics are not floating point.
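To make the min/max fixup above concrete, here's a CUDA sketch of how an atomic float min can be built from the plain unsigned-int atomicMin the hardware already has, by remapping IEEE-754 bit patterns so that unsigned integer ordering matches float ordering (NaNs ignored). The helper names are made up for illustration - this is just the well-known trick, not anything NVidia or DX11 is confirmed to do.

```cuda
// Map a float to an unsigned int such that integer ordering == float ordering.
// Non-negative floats: set the sign bit.  Negative floats: flip all bits.
__device__ unsigned int floatToOrderedUint(float f) {
    unsigned int u = (unsigned int)__float_as_int(f);
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

// Inverse of the mapping above, to decode the stored minimum afterwards.
__device__ float orderedUintToFloat(unsigned int u) {
    u = (u & 0x80000000u) ? (u & 0x7fffffffu) : ~u;
    return __int_as_float((int)u);
}

// "Atomic float min" expressed with the integer atomicMin
// (global memory atomics: compute capability 1.1+, shared memory: 1.2+).
// *addr must hold the ordered-uint encoding, e.g. initialised host-side
// with floatToOrderedUint(FLT_MAX) before the kernel runs.
__device__ void atomicFloatMin(unsigned int* addr, float value) {
    atomicMin(addr, floatToOrderedUint(value));
}
```

Atomic float max works the same way with atomicMax, which is why only the float add genuinely needs floating point hardware in an atomic-only unit.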
BTW, an ATI DX11 doc says:

(Atomics) Can optionally return original value. (Atomics have a) Potential cost in performance: (1.) especially if original value is required, (2.) more latency hiding required.

Hints directly that atomic operations will be of higher latency, and that when a return value is required, performance will be lower, which hints that global atomic operations aren't done in the ALUs (because if they were, you'd have the return value for free).

Sounds entirely reasonable. The fast path is just to put the update into the atomic queue. It's fire and forget and only suffers serialisation latency. The slow path is forcing the thread to wait to find out the value, which requires the atomic queue to be processed up to the point containing the owning thread's update.
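A minimal CUDA illustration of those two paths (the kernel names are invented). In the source both cases use the same atomicAdd intrinsic; whether the hardware actually takes a cheaper path when the returned value is discarded is exactly the speculation above.

```cuda
// Fire-and-forget style: the old value is never read, so the thread only has
// to get the update into the atomic unit's queue (histogram-style update).
__global__ void histogramNoReturn(unsigned int* bins, const unsigned int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i] & 255u], 1u);        // returned old value is discarded
}

// Return-value style: the old value is the whole point, so the thread must
// stall until the pre-update value comes back from the atomic unit.
__global__ void appendCompacted(unsigned int* counter, unsigned int* out,
                                const unsigned int* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] != 0u) {
        unsigned int slot = atomicAdd(counter, 1u);  // must wait for the return value
        out[slot] = in[i];
    }
}
```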
The PCI-E interface being on the HUB would be quite important. With PCI-E on the hub, wouldn't you get a unified way to access memory regardless of source (just that certain parts of the address space, like the PCI-E mapping, have lower latency and bandwidth)? Example usages: a unified way to grab command buffer data regardless of source (I'd bet some day, GT400?, you will be able to issue draw calls GPU side on NVidia hardware), parallel cpu2gpu and gpu2cpu copies interleaved with computation (which is possible starting with GT200), lower latency gpu2cpu result fetch for things like using the GPU for more latency sensitive operations (perhaps something like real-time audio processing), and the ability to skip the cpu2gpu copy and fetch geometry data directly from the cpu side (think about tessellation: the huge amount of data expansion means latency and bandwidth of the early source data might not be of high importance)...

All this connectivity and parallelism already works in current GPUs as far as I can tell, either in graphics or compute (that is, there may be subsets of these capabilities in one or the other programming model). PCI Express, itself, is relatively high latency though.
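For the copy/compute overlap mentioned above, here's a sketch of the standard CUDA stream pattern. The runtime calls are the real API; the kernel and the chunking scheme are invented, error checking is omitted, and how much actually overlaps in practice depends on how many copy engines the chip has.

```cuda
#include <cuda_runtime.h>

__global__ void process(float* data, int n) {          // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// hostIn/hostOut must be pinned (cudaMallocHost) or the async copies fall back
// to synchronous behaviour; n is assumed to divide evenly into 'chunks'.
void runChunked(const float* hostIn, float* hostOut, int n, int chunks) {
    float* dev;
    cudaMalloc(&dev, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    int chunk = n / chunks;
    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = s[c & 1];                     // ping-pong between two streams
        int off = c * chunk;
        // Upload chunk c while the other stream is still computing/downloading.
        cudaMemcpyAsync(dev + off, hostIn + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process<<<(chunk + 255) / 256, 256, 0, st>>>(dev + off, chunk);
        cudaMemcpyAsync(hostOut + off, dev + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(dev);
}
```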
The command structure for moving data over the magic interconnect is theoretically simpler than the command structure of the DDR interface. There's no bank tracking with associated re-ordering and packets of data can be in nice big bursts with a low addressing overhead. DRAM IO drowns in addressing overhead, in comparison, with its dedicated command bus.
As you can tell I really like this concept: it enables NVidia to architect a graphics processor for consumers that is modular, enabling a low-fat Tesla configuration to break off, enabling faster refresh cycles, controlling power and bandwidth and potentially opening up Tesla to more advanced clustering configurations than the current fairly poor situation of small-scale "SLI" rigs or having to dedicate an x86 chip to each GPU in a rack.
At the same time the whole thing hinges on this magic-interconnect. It has an advantage over HT and QPI in that NVidia owns the landscape: NVidia builds the GPU, the substrate it's mounted on (with hubs in tow?) and the board the GPU and memory is built on. NVidia can emplace constraints on the magic-interconnect that aren't viable in more general purpose schemes. It's similar to the way in which GPUs have access to ~200GB/s while CPUs are pissing about with 1/5th that.
Jawed
I haven't profiled shared memory atomics yet, nor global memory atomics without the return value. I've only profiled global with return.

Oh, you haven't done global atomics without a return value. Thought you had.
Haven't really thought about the best way to profile shared memory atomics (seems like lots of possible pitfalls with things that get optimized out by the compiler, in combination with effects from non-atomic instructions). If you have any ideas here, I'll try them...

For shared memory atomics maybe try a domain with blocks of 1024 threads - get each one to generate a random integer (generate a load of them to hide latency) and get a block-level max? Try with different modulos of the warp threadID for the shared memory atomic address: 1, 2, 4, 8, 16?
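A sketch of that micro-benchmark in CUDA, with the contention knob as suggested (kernel and constant names are invented; note pre-Fermi parts cap blocks at 512 threads, so 1024-thread blocks would need splitting). A tiny LCG stands in for the random numbers so the compiler can't fold the atomics away; time the kernel from the host with CUDA events while varying the addresses parameter over 1, 2, 4, 8, 16.

```cuda
#define ITERATIONS 256

// Per-block atomicMax into shared memory; 'addresses' controls contention
// (1 = every thread hammers the same word, 16 = spread over 16 words).
// Requires compute capability 1.2+ for shared memory atomics.
__global__ void sharedAtomicMaxBench(unsigned int* blockMax, unsigned int seed, int addresses) {
    __shared__ unsigned int smax[16];
    if (threadIdx.x < 16) smax[threadIdx.x] = 0u;
    __syncthreads();

    unsigned int x = seed ^ (blockIdx.x * blockDim.x + threadIdx.x);
    int slot = threadIdx.x % addresses;                 // contention knob

    for (int i = 0; i < ITERATIONS; ++i) {
        x = x * 1664525u + 1013904223u;                 // tiny LCG "random" number
        atomicMax(&smax[slot], x);
    }
    __syncthreads();

    // Write one result per block so the work is observable and not eliminated.
    if (threadIdx.x == 0) {
        unsigned int m = 0u;
        for (int j = 0; j < addresses; ++j) m = max(m, smax[j]);
        blockMax[blockIdx.x] = m;
    }
}
```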
If you want to think about what kind of I/O you might use to connect a GPU and GPU memory controller, I'd look into Rambus' work. They do the most aggressive high speed interfaces out there.

I guess you can start with Nehalem-EX's SMB. That's real, it's working and it appears to be doing exactly what this patent proposes.
Don't you think this approach creates too much latency to be useful for GPUs?

GPUs are designed to hide latency - you'd be better off asking about the extra latency Nehalem-EX experiences as a result of SMB.