22 nm Larrabee

Absolutely not. GCN merely catches up with Fermi.
There are elements of this change that point down a different track than Fermi.
The CU's scheduling changes make it look like a possible base for a shared coprocessor.
Ironically, if it were to become something like a FlexFP unit, it could very well go back to being VLIW or at least LIW-ish, given what macro-ops and dispatch groups actually are.
In AMD's case, it's easier, because the coprocessor model isolates the rename and scheduling portions of the FP unit from the OoO integer pipes. If the unit were instead a CU, its in-order nature would be none of the core's business.

And do you think PCIe based memory coherence for discrete GPUs is going to work well? Or are we expected to buy an APU plus a discrete GPU? Developers don't like programming two devices, so they'll downright hate programming three. There's really no other choice but to make the CPU fully homogeneous. And it's well within reach, so I'm sure Intel is looking into it right now.
There may be a third way, though it isn't fleshed out as of yet. AMD has already speculated on perhaps putting instructions into the ISA that would integrate disparate devices into the same instruction stream.
The CU design is already built to be shared amongst various controllers, either the graphics pipes or compute pipes. Adding a third client in the form of a CPU core would not be impossible.

If the CU keeps its control flow capability, an instruction with the proper escape sequence could make a thread migrate to the CU, where it would function as an offload engine until an escape sequence sends it back. If not, it behaves like the upcoming FlexFP unit, reliant on the integer pipe for control flow.
Potentially, this could be regarded as a compiler hint or noop in a chip without a CU.
This drops it down to 2 devices, or perhaps 2.1 devices.
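
Purely as an illustration of what such an escape hint could look like to software -- everything here is made up (CU_OFFLOAD_BEGIN/END are imaginary placeholders, not real intrinsics), and on a chip without a CU they would just be the noop/compiler hint mentioned above:

[code]
#include <stddef.h>

/* Imaginary escape hints -- placeholders only, NOT real intrinsics.  On a chip
   without a CU they are meant to decode as noops, which is all this stub does. */
#define CU_OFFLOAD_BEGIN() __asm__ volatile("nop" ::: "memory")
#define CU_OFFLOAD_END()   __asm__ volatile("nop" ::: "memory")

/* The bracketed region is what a hypothetical CU would run as an offload
   engine; everything before and after stays on the host core's pipelines. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    CU_OFFLOAD_BEGIN();                 /* escape: thread may migrate to the CU */
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];         /* wide, data-parallel work */
    CU_OFFLOAD_END();                   /* escape back to the host pipeline */
}
[/code]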
 
Socket or AIB

Please tell me this thing is going into the socket and not into an AIB, even with coherency extensions.

I'd imagine they would do both.
PCIe (3.0?) first, since those can be deployed into existing machines and sit side by side with Teslas.

Personally... I'd like to see them integrate MIC with normal cores and use a common socket.
i.e. buy a 4-socket motherboard and put in any mix of Xeon+MIC or normal 10-core Xeons.

Is there a quad+ socket LGA 2011?
 
The first examples seem to be add-in boards.
Perhaps the 22nm design is different, as the 45nm version's x86 cores probably would not play well in a mixed system, or with system code less than a decade old.

Then there's the interest of the Xeon line, which may have been part of the reason why Intel quietly dropped early promises of socketed Larrabees. Statements from some in the Xeon group were pretty cool to the project.
 
If you say so, you already know what it will look like? :rolleyes:
I know Fermi doesn't stand a chance against Knights Corner, and NVIDIA can't afford to lose the HPC market.
Tiny in comparison to the amount of data in the regs, yes. Try swapping out hundreds of kilobytes of registers on a CPU. Wait, a CPU doesn't have that many registers. I guess performance would tank on a CPU if you completely trashed your caches that frequently. Hmm.
GPUs have very few registers per strand. And no, performance does not "tank" on a CPU, because you don't trash an 8-way set associative L1 cache that easily. And when data does get evicted, it's only because more important data is taking its place.
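
To put a rough number on "that easily": a minimal sketch, assuming a typical 32 KB, 8-way, 64 B-line L1D (64 sets). You only get conflict thrashing when more than 8 hot lines map to the same set, which takes a fairly pathological stride:

[code]
#include <stdio.h>
#include <stdlib.h>

#define LINE   64
#define SETS   64
#define WAYS    8
#define STRIDE (LINE * SETS)   /* 4096 B: lines this far apart share a set */

int main(void)
{
    enum { N = 12 };                 /* more hot lines than WAYS -> the set thrashes */
    char *buf = calloc(N, STRIDE);
    if (!buf) return 1;

    volatile char sink = 0;
    for (int iter = 0; iter < 1000000; ++iter)
        for (int i = 0; i < N; ++i)
            sink += buf[(size_t)i * STRIDE];   /* every access maps to the same set */

    printf("%d conflicting lines vs. %d ways: conflict misses every pass\n", N, WAYS);
    free(buf);
    (void)sink;
    return 0;
}
[/code]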

Anything the GPU runs efficiently, the CPU can run efficiently too. It doesn't work the other way around (yet).
I thought we are talking about stuff integrated on the same die, sharing the last level cache and the memory controller anyway?
The slides about GCN say discrete GPUs will support PCIe based memory coherence.

They could integrate GCN cores on every CPU die, but the x86 performance would suffer and you'd have to choose between using the discrete GPU and taking the communication overhead, or using the integrated GCN cores and leaving the GPU unused. Neither is desirable. And with Haswell likely capable of 500 GFLOPS or more I seriously doubt vendor specific heterogeneous computing will ever become a success.
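
For reference, the 500 GFLOPS ballpark is straightforward arithmetic under assumed, not announced, specs -- say a quad-core with two 256-bit FMA ports per core at roughly 3.5 GHz:

[code]
#include <stdio.h>

int main(void)
{
    /* All assumptions, not announced specs. */
    const double cores     = 4.0;
    const double fma_ports = 2.0;   /* 256-bit FMA ports per core (assumed) */
    const double sp_lanes  = 8.0;   /* 256 bits / 32-bit single precision   */
    const double flops_fma = 2.0;   /* one FMA counts as a mul plus an add  */
    const double clock_ghz = 3.5;

    printf("~%.0f SP GFLOPS\n",
           cores * fma_ports * sp_lanes * flops_fma * clock_ghz);   /* ~448 */
    return 0;
}
[/code]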

1024-bit instructions will make the CPU's power consumption go down considerably, so that's the nail in the coffin for GPGPU on an IGP.
What about a software stack that lets you write code that runs on any device or combination of devices?
Sounds fantastic. Who's going to write this and provide it for free? AMD? Their software teams are understaffed. And have they even started yet? It usually takes at least half a decade to create such a framework and gain any sort of market acceptance.

Solving the issues of a hardware architecture with software has never really worked out well...
And again, I really think we should agree that we disagree and stop here.
AVX2 and LRBni are very similar (and both x86 extensions), so the recent announcements clearly show that the convergence toward homogeneous computing is ongoing and this discussion has anything but ended. But don't feel obliged to take part in it.
 
And no, performance does not "tank" on a CPU, because you don't trash an 8-way set associative L1 cache that easily.
Just a side note, as this will be my last post on this topic. GPUs often have much higher associativity in their caches, since they can tolerate much higher latencies. Just as an example, the texture L1 cache of the current Radeons is 128-way set associative and the L1 cache for GCN is 64-way set associative. So much for this.
 
So CPUs can now software-render Crysis with the same quality and speed as GPUs?
Being able to execute code just as efficiently doesn't mean they can deliver the same overall performance with much less raw compute power in the chip.
 
Being able to execute code just as efficiently doesn't mean they can deliver the same overall performance with much less raw compute power in the chip.
So can a CPU render it at 1/10th of the speed with 1/10th of the power consumption? ;)
 
So can a CPU render it at 1/10th of the speed with 1/10th of the power consumption? ;)
I haven't seen any rasterizer optimized anywhere near as well on x86 as, e.g., BF3 on Cell, but unless they get memory-bottlenecked I'd imagine it could be possible once FMA gets added to the instruction set.
 
Just a side note, as this will be my last post on this topic. GPUs often have much higher associativity in their caches, since they can tolerate much higher latencies. Just as an example, the texture L1 cache of the current Radeons is 128-way set associative and the L1 cache for GCN is 64-way set associative. So much for this.

Fully associative vs. 8 way doesn't make a big difference for caches that small and general purpose workloads.

David
 
The description of the AMD L1 cache seems kind of funky to me.
4 sets x 64-way.

With 64B cache lines, that is 256 lines in total for the capacity given.
With 64-way associativity, that is 4 sets of 64 lines each.

Or:

256 lines that are split 4-ways in a manner consistent with how the GPU tiles its memory space.
That's 64 lines per set.
So 4 fully associative caches, each given a subset of the address space?

edit:
And to go back to AVX-1024 for a moment, I am not certain about the power savings.
Decode power, yes.
Gating the scheduler does not look like a good idea with Intel's unified scheduler.
Similarly, since we are using 256-bit registers, something has to send the register ID in the uop to the register file and bypass network.
Where does this logic reside in Intel's design? At least the uop sourcing comes from the scheduler.
 
The description of the AMD L1 cache seems kind of funky to me.
4 sets x 64-way.

With 64B cache lines, that is 256 lines in total for the capacity given.
With 64-way associativity, that is 4 sets of 64 lines each.

Or:

256 lines that are split 4-ways in a manner consistent with how the GPU tiles its memory space.
That's 64 lines per set.
So 4 fully associative caches, each given a subset of the address space?
The number of sets is the number of fully associative sub-caches. So their description is completely correct; it is even overdetermined. Any [strike]two[/strike]three of the [strike]three[/strike]four given numbers would fully characterize the organization of the cache: 16 kB size, 64-byte line size, 4 sets, 64-way associative. Does it really matter how they map each of the 4 sets to the address space?
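
Spelled out, the arithmetic from those published numbers is just:

[code]
#include <stdio.h>

/* Numbers from the description: 16 kB capacity, 64 B lines, 64-way, 4 sets. */
#define CACHE_BYTES (16 * 1024)
#define LINE_BYTES  64
#define WAYS        64

int main(void)
{
    unsigned lines     = CACHE_BYTES / LINE_BYTES;   /* 16384 / 64 = 256 lines  */
    unsigned sets      = lines / WAYS;               /* 256 / 64   = 4 sets     */
    unsigned set_bytes = LINE_BYTES * WAYS;          /* 64 * 64    = 4 kB / set */

    printf("%u lines, %u sets, %u bytes per fully associative set\n",
           lines, sets, set_bytes);
    return 0;
}
[/code]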

Edit:
Textures don't reside in memory as normal arrays. They are reordered and saved along a space filling curve (a hierarchical Z in ATI's case). That accounts for the locality in textures, not the cache itself.
 
It does point to a specialization in the memory subsystem not present in the CPU space.
The description seems funky to me because it is overdetermined. If this were a CPU, they'd say 64B lines 4-way.
edit: Sorry, 64B lines 64-way.

That they don't means there's something different.
Perhaps the tiling means the cache does not go only by the least significant bits of the address for placement.
The set would be picked by more (or the most) significant bits. Possibly the sub-caches would remain 64-way if the capacity were increased, indicating that the least significant bits are used within the set.
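
A toy sketch of the two placement schemes being contrasted; the bit positions in the "tiled" variant are pure guesswork, just to show the idea of picking the set from higher-order bits instead of the ones right above the line offset:

[code]
#include <stdio.h>
#include <stdint.h>

#define LINE_BITS 6u     /* 64 B lines */
#define SETS      4u

/* CPU-style placement: set index from the bits just above the line offset. */
static unsigned set_low_bits(uint64_t addr)
{
    return (unsigned)((addr >> LINE_BITS) & (SETS - 1));
}

/* Speculated placement: set index from higher-order bits -- here bits [13:12],
   chosen arbitrarily to stand in for whatever the tiling really uses. */
static unsigned set_high_bits(uint64_t addr)
{
    return (unsigned)((addr >> 12) & (SETS - 1));
}

int main(void)
{
    uint64_t a = 0x00, b = 0x40;   /* two adjacent 64 B lines */
    printf("low-bit scheme:  line A -> set %u, line B -> set %u\n",
           set_low_bits(a), set_low_bits(b));    /* different sets */
    printf("high-bit scheme: line A -> set %u, line B -> set %u\n",
           set_high_bits(a), set_high_bits(b));  /* same set, same sub-cache */
    return 0;
}
[/code]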
 
Fully associative vs. 8 way doesn't make a big difference for caches that small and general purpose workloads.
When you say general purpose, I think of a normal processor cache... not one designed for a throughput of 16 independent 32-bit words per cycle (well, at least in Cypress; maybe more for their next gen. Strictly speaking it needs to retrieve 8 64-bit words with 32-bit alignment per cycle, but that's just nitpicking).
 
I haven't seen any rasterizer optimized anywhere near as well on x86 as, e.g., BF3 on Cell, but unless they get memory-bottlenecked I'd imagine it could be possible once FMA gets added to the instruction set.

The only reason a lot of PS3 games do rasterization on the SPEs is that RSX's 8 vertex pipelines are too slow. I bet no one would touch the SPEs for rasterization if the PS3 had a G80 GPU with unified shaders.
 
I bet no one would touch the SPEs for rasterization if the PS3 had a G80 GPU with unified shaders.
How strong would that G80 be given RSX transistor budget (and bandwidth)?
[edit]
RSX was "over 300M" transistors; G84 is ~289M. How big is the performance difference between them?
 
I'm sure there are highly optimized software renderers out there. CPUs are just slow at some tasks, end of story. No amount of fantasizing will change that.

I'm no engineer but I find the distinction to be really academic and useless at this point. When Intel adds a bunch of vector processing to x86 it's called a CPU. What do you call it when nVidia adds ARM cores to its GPUs? Sounds like the same friggin thing to me.
 