Well, so it has an extra chip between itself and the memory, but isn't that the same situation as a northbridge with a memory controller?
It won't be meaningful then. I mean, what's the point of adding another level to the thread & memory hierarchy that has no useful benefit whatsoever? If you wanted inter-block communication, why would you limit it to 4 blocks only? This super-block would allow more threads to communicate within itself, but what happens to inter-super-block communication? They may make shared memory more useful, more "featureful", but if all that this proposal achieves is limited inter-block communication (short of full any-block-to-any-block communication), then why not just make the max block size larger?
assuming that InterlockedAdd() supports floating point in DX11 (which I'm not 100% sure of)
It would be a shame if these atomics are not floating point.
I've realised that the longest sides on Nehalem-EX can't be more than around 30mm - similar to the longest sides on Larrabee. That would mean that SMI on Nehalem-EX is providing what seems to be a 512-bit DDR interface in ~same perimeter as RV770's 256-bit GDDR5.

I haven't worked out yet what the effective DDR bus width per chip is, or how this compares with the size of the DDR interfaces on Nehalem.
One program is a peculiar idea. If anything I'd say that only one type of kernel could run on a given multiprocessor at any one time.

Getting back to multiple-program-multiple-data/MPMD (not MIMD): so GT200 and prior compiled VS(+GS)+PS into one "program"? This "program" might mean that the shared memory block size would be the union of the shared memory used by the VS(/GS)/PS shaders, so that the hardware could dynamically switch between the VS and PS entry points of the program depending on geometry/load-balancing needs. With GT300 we expect to be able to run more than one "program" simultaneously.
One of the remaining problems I have with the G80 architecture is the scheduling of work within a cluster. It's my suspicion that scheduling is common to the cluster (i.e. there is one scoreboard of some type that governs the creation of warps, the progress of instruction pages, and issue to the TMUs), which is how TMUs can be shared across multiprocessors.

I went back and read through Theo's term-mangled comments again, because I didn't get what he meant before when saying "cluster organization is no longer static. The Scratch Cache is much more granular and allows for larger interactivity between the cores inside the cluster." It seems he used the same term "scratch cache" when talking about the G80 architecture, in contrast to ATI previously not having shared memory. So "scratch cache" literally means "shared memory".
Shared memory already provides fine-granularity accesses, unless he is saying that bank conflicts would be lower (which wouldn't make any sense at all). Also, thread-to-thread interactivity, or warp-to-warp interactivity within the same block, is already fully granular (through shared memory). So he must be attempting to describe something else.
The bigger you make it the slower it gets. But if you want some kind of on-die, variably-sized inter-kernel queuing, then maybe sacrificing some shared memory is part of the trick - but that would only be able to feed the per-cluster scheduler if it was flat as viewed by the cluster scheduler. That kind of flatness shouldn't necessitate making it literally flat, though.

What if shared memory were now shared across four 8-wide SIMD multiprocessors in a cluster (moving up from 3 multiprocessors per cluster in GT200, to fit the 512 "core" rumour)? So you'd have 32KB*4, or 128KB, of banked shared memory per cluster.
I suppose matrix multiplication has the same problem.

Well, we don't have any floating point atomics on CPUs either. A floating point InterlockedAdd wouldn't make much sense anyway, since floating point adds are order dependent, which rather breaks its usage model.
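The order dependence is easy to demonstrate on any IEEE-754 machine. The values below are just an illustrative pick, but any sum mixing large and small magnitudes behaves this way - which is exactly why an atomic FP add, whose commit order across threads is nondeterministic, gives run-to-run varying results:

```python
# Floating point addition is not associative: the result depends on the
# order the operands are combined, i.e. the order threads would commit
# their atomic adds. Illustrative values in IEEE-754 doubles:
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # 1e16 - 1e16 = 0.0, then + 1.0 -> 1.0
right = a + (b + c)  # -1e16 + 1.0 rounds back to -1e16, so the sum -> 0.0

print(left)   # 1.0
print(right)  # 0.0
```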
I guess you can start with Nehalem-EX's SMB. That's real, it's working and it appears to be doing exactly what this patent proposes.
That's how the patent application paints it - so really it's a question of whether GDDR5 surmounts those eventualities.

The hub chips are useful if Nvidia foresees that it will be heavily pad-limited, or that it will continually be on the losing end of memory technology transitions.
But GPU chips are required to support multiple configurations, even if they're static once sold.

Given that GPU boards are not expandable, and do not support multi-drop buses, a large part of the argument for the hub chip is irrelevant in this context.
Any kind of reduction kernel also inherently suffers the same order-dependent-precision problem - and reductions in Brook+ for example support floating point.
It's kinda hard to see in this die photo - does anyone have a higher-rez version?
But there's no doubt a question on the actual bandwidth...
Jawed
I can inform you that current A1 samples are clocked at 700/1600/1100 MHz. That works out to an impressive 2,457 GFLOPS and 281 GB/s of memory bandwidth.
Nice, isn't it?
Source: Hardware-Infos
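Those clocks do line up with the 512 SP rumour, for what it's worth. A quick sanity check - assuming 512 SPs, a 512-bit GDDR5 bus, and NVIDIA's usual counting of MAD+MUL as 3 flops per SP per hot clock (all rumoured figures, not confirmed specs):

```python
# Sanity-checking the rumoured numbers. 512 SPs, a 512-bit bus and the
# 3-flops-per-clock counting are assumptions from the rumour mill.
sps = 512                   # rumoured shader processor count
shader_clock_mhz = 1600     # rumoured hot clock
flops_per_sp = 3            # MAD (2 flops) + MUL (1 flop) per clock

gflops = sps * shader_clock_mhz * flops_per_sp / 1000
print(gflops)               # 2457.6 -> the quoted "2,457 GFLOPS"

bus_width_bits = 512        # rumoured memory bus width
mem_clock_mhz = 1100        # rumoured GDDR5 command clock
bits_per_pin_per_clock = 4  # GDDR5 transfers 4 bits/pin per command clock

gb_per_s = bus_width_bits * mem_clock_mhz * bits_per_pin_per_clock / 8 / 1000
print(gb_per_s)             # 281.6 -> the quoted "281 GB/s"
```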
If correct, then there are two good reasons:
1. Volume. Takes time.
2. Hard launch. Takes coordination and planning.
Ahhh... I hate you, I just spent like 1.5 hours on the website in your sig...