Haswell vs Kaveri

[image: J30oz6Q.png]


So, apparently the i-cache size scales with the associativity, although in an "odd" manner, which means the bank structure is unchanged.
 
[image: J30oz6Q.png]


So, apparently the i-cache size scales with the associativity, although in an "odd" manner, which means the bank structure is unchanged.
I believe the bigger problem with the BD family L1i cache is associativity (or rather the lack thereof...), not size, so I wonder why they didn't change the structure and go with a 64kB cache with four times the associativity instead. Maybe just increasing the size was simpler, but still that looks like a more expensive solution to me.
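A quick way to see the "bank structure is unchanged" point, and why a 64kB/4-way alternative would not have been a drop-in change: a back-of-envelope sketch, assuming 64-byte lines and the public BD/SR cache figures (the last row is the hypothetical option above).

[CODE]
/* Back-of-envelope check of the "bank structure unchanged" point,
 * assuming 64-byte lines: number of sets = size / (ways * line size).
 * BD/SR figures are the public ones; the last row is the hypothetical
 * 64kB alternative mentioned above. */
#include <stdio.h>

static void sets(const char *name, unsigned kb, unsigned ways)
{
    printf("%-24s %3u KB, %u-way -> %4u sets\n",
           name, kb, ways, kb * 1024 / (ways * 64));
}

int main(void)
{
    sets("Bulldozer/Piledriver L1i", 64, 2);  /* 512 sets */
    sets("Steamroller L1i",          96, 3);  /* still 512 sets: same index/banking */
    sets("hypothetical 64kB 4-way",  64, 4);  /* 256 sets: a different structure */
    return 0;
}
[/CODE]

Growing the cache by adding a way keeps the set count (and thus the indexing and banking) identical, which is presumably why it was the cheaper engineering change even if it costs more area.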
 
The idea that a DRAM-free board hanging off of PCIe can be cheap presupposes that a highly non-standard and standard-violating board with a dubious business case and non-standard GPU can be cheap.

If someone is so cost-conscious that even inexpensive DRAM is too much, they are probably also limited to the most stripped-down and non-expandable motherboards they can find.
A graphics unit without access to local memory hasn't been practical since early in the last decade, and I doubt even the vaunted latency-hiding capabilities of a GPU can hide the impact of having no local framebuffer. The ROPs would probably be one of the first elements to falter, with the required batch sizes and local caching becoming too large to be practical.
The following is more speculative, but pure PCIe accesses may also subject the GPU to more stringent ordering constraints than its aggressive memory pipeline can tolerate, negating the GPU's ability to utilize it well.

I assume at some point (soon hopefully) we'll see HSA-like functionality in discrete GPUs, in that the GPU can freely read/write to the main memory and the CPU can do the same to the graphics memory. Bandwidth and latency would still be crap because you're going over PCI-E (although if implemented under PCI-E 4.0, at 64GB/s total bandwidth it's nothing to sneeze at), but at least it would do away with having to copy data back and forth.

Does that sound feasible?
 
I assume at some point (soon hopefully) we'll see HSA-like functionality in discrete GPUs, in that the GPU can freely read/write to the main memory and the CPU can do the same to the graphics memory. Bandwidth and latency would still be crap because you're going over PCI-E (although if implemented under PCI-E 4.0, at 64GB/s total bandwidth it's nothing to sneeze at), but at least it would do away with having to copy data back and forth.

Does that sound feasible?
Thanks to 32-bit OSes, only a portion of the graphics card's memory is exposed to the CPU. For AMD parts, this is around 256MB of the total video memory. This means that the CPU cannot directly access arbitrary regions of video memory.
 
Thanks to 32-bit OSes
With XB1 & PS4, 32 bit will become less relevant for big games in a year or less... even in the PC space. At least for all the titles that matter, I believe (EA, UBI, TAKE2).

only a portion of the graphics card's memory is exposed to the CPU

I bet you can remap it easily, or use a GART for that. It would lead to the hassle of addressing memory in 256MB chunks, maybe, but it's possible, no?
 
With XB1 & PS4, 32 bit will become less relevant for big games in a year or less... even in the PC space. At least for all the titles that matter, I believe (EA, UBI, TAKE2).
We have to support our customers and that means people upgrading machines with older OSes.
pMax said:
I bet you can remap it easily, or use a GART for that. It would lead to the hassle of addressing memory in 256MB chunks, maybe, but it's possible, no?
No, not possible. What would you do if multiple applications were trying to access different 256MB windows? (This is supposed to be seamless, right?) Adding in new APIs to change the mappings isn't seamless (or simple or bulletproof), so you may as well stick with existing technology.

Once everyone is convinced that 64-bit is the de facto standard then we can improve the situation.

Also, keep in mind that discrete GPU memory is tagged as uncached since the GPU can't/doesn't probe the CPU's caches for data. This means that CPU reads from discrete GPU memory will be very slow. CPU writes are fine as long as you fill write-combine buffers.
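A minimal sketch of what filling the write-combine buffers looks like in practice, assuming `dst` is a write-combined GPU mapping handed back by the driver (the function and names here are illustrative only):

[CODE]
/* Minimal sketch: writing into a write-combined (uncached) GPU mapping.
 * Assumes `dst` is a WC mapping provided by the driver and that both
 * pointers are 16-byte aligned; names are illustrative only. */
#include <emmintrin.h>   /* SSE2 non-temporal stores */
#include <stddef.h>

void upload_to_wc(void *dst, const void *src, size_t bytes)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    /* Sequential, full-line writes let the WC buffers fill and flush
     * as whole 64-byte lines instead of partial ones. Plain sequential
     * stores work too, as long as each line is written completely. */
    for (size_t i = 0; i < bytes / sizeof(__m128i); ++i)
        _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));

    _mm_sfence();  /* make the combined writes visible before kicking the GPU */

    /* The thing to avoid is reading `dst` back: every read from an
     * uncached mapping stalls for a full bus round trip, which is the
     * "very slow CPU reads" case described above. */
}
[/CODE]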
 
We have to support our customers and that means people upgrading machines with older OSes.
[..] Once everyone is convinced that 64-bit is the de facto standard then we can improve the situation.
...of course, you have to support legacy 32 bit customers.
But you have already rewritten the driver for 64 bit support, so you can easily add the feature for the 64 bit family; 64 bit code and 32 bit code can't mix anyway.
Let the devs deal with the 32-bit porting; they will then choose what to do. Most of today's game customers should have had a 64 bit OS for years now, at least for the top-selling games that would love to use such features, I think.

No, not possible. What would you do if multiple applications were trying to access different 256MB windows?
Ouch, yeah, you are right, I tend to forget... anyway, it could just be a "singleton" resource in the 32-bit desktop space (or maybe two with half the space each), with full access in the 64-bit space. But probably not worth the effort, I suppose.

CPU writes are fine as long as you fill write-combine buffers.
...you say write-combining is ok... but you have to do a read-modify-write, no?
Aaah, I see. You mean the final write is delayed in the CPU until it needs to be flushed. Interesting, thanks.
 
I assume at some point (soon hopefully) we'll see HSA-like functionality in discrete GPUs, in that the GPU can freely read/write to the main memory and the CPU can do the same to the graphics memory. Bandwidth and latency would still be crap because you're going over PCI-E (although if implemented under PCI-E 4.0, at 64GB/s total bandwidth it's nothing to sneeze at), but at least it would do away with having to copy data back and forth.

Does that sound feasible?

A GCN GPU's cache subsystem primarily operates on 64-byte transactions, and a GDDR5 channel controller would mostly be working in terms of two or more 32B bursts.
The memory subsystem spends several hundred cycles trying to coalesce as many accesses as it can, so the latency being hidden for that memory traffic is several hundred cycles, which translates into as many nanoseconds. There are physical protocols and heuristics handled by the hardware in an autonomous fashion, and their actions are solely the concern of the GPU's low-level hardware.

PCIe transaction latencies for GPGPU seem to be all over the place, but link latency seems to start in the microsecond range, thanks to the traversal of hardware and software protocol layers and system management of the device. Different GPGPU tests and driver/software stacks report very different latency numbers, with the number of microseconds varying by orders of magnitude.

Going from those figures, there may be an order of magnitude or more of latency difference, on a bus whose best utilization comes with transfer sizes in the hundreds or thousands of bytes.
Small transactions would lose a significant fraction to packet overhead, and I wouldn't trust this arrangement not to be throttled by the uncore and IO subsystem of the CPU, coupled with the driver stack and OS management.

The necessary expansion in the size of the GPU, and the even larger batches, seem counterproductive to me. The GPU texturing and CU path may have good latency hiding, but they have their limits with the current wavefront count and storage size, and the caches are sized for a certain amount of reuse that will not happen when ideal transaction sizes exceed them.
ROP cache tiling doesn't seem to have the level of scalability needed, and the command front end and queueing latency, even with local memory, are atrocious in terms of compute.
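To put rough back-of-envelope numbers on the latency-hiding point (the bandwidth and latency figures below are illustrative assumptions, not measurements): the data a client must keep in flight to cover a given latency is simply bandwidth times latency.

[CODE]
/* Back-of-envelope only: bytes in flight needed to keep a link busy,
 * in_flight = bandwidth * latency (Little's Law). Figures are assumptions. */
#include <stdio.h>

int main(void)
{
    double gddr5_bw = 100e9, gddr5_lat = 400e-9;  /* local memory: ~100 GB/s class, ~400 ns hidden */
    double pcie_bw  = 16e9,  pcie_lat  = 1e-6;    /* PCIe 3.0 x16: ~16 GB/s, latency starting ~1 us */

    printf("local: %.0f KB in flight\n", gddr5_bw * gddr5_lat / 1024.0);  /* ~39 KB */
    printf("PCIe : %.0f KB in flight\n", pcie_bw  * pcie_lat  / 1024.0);  /* ~16 KB per us of latency */

    /* If the real round trip is several microseconds, the PCIe figure scales
     * up accordingly, far past what 64-byte transactions, the current
     * wavefront counts, and the cache sizes were provisioned for. */
    return 0;
}
[/CODE]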
 
We have to support our customers and that means people upgrading machines with older OSes.

The Steam survey shows 75% of users on 64bit versions of Windows. 30% of the 32bit users are still on XP, which won't be supported anyway.

I wouldn't bother supporting 32bit any more, just like I wouldn't bother with DX9 (and wouldn't have for years now).
 
I wouldn't bother supporting 32bit any more, just like I wouldn't bother with DX9 (and wouldn't have for years now).
Both Windows 7 and Windows 8 have 32 bit versions. Most users who have computers with a 32 bit version of either OS do not know about it. It would be a customer support nightmare to explain to a paying customer that their brand new computer with Windows 8 doesn't support our game. I was personally hoping that Microsoft would have released only a 64 bit version of Windows 8, but unfortunately that didn't happen, so we will have computers sporting the latest GPUs with DirectX 11.2 and a 32 bit OS. And new computers are sold with 32 bit Windows 8 every day.

Unfortunately, games designed for a 32 bit OS cannot be designed to access more than 2 GB of memory (since that's the upper limit per process in a 32 bit OS). Thus this is the memory limit for most PC games. Not many game developers are willing to design a game with reduced game logic (smaller levels, etc.) just for 32 bit computers. That would add quite a bit to development costs.
 
I believe the bigger problem with the BD family L1i cache is associativity (or rather the lack thereof...), not size, so I wonder why they didn't change the structure and go with a 64kB cache with four times the associativity instead. Maybe just increasing the size was simpler, but still that looks like a more expensive solution to me.

Between you and AMD I expect AMD to have done more real code analysis on capacity vs associativity misses.. well, I hope they have at least. Note that others seem to be trending in this direction too, Apple increased icache (and dcache) size with Cyclone, and ARM is doing the same sort of extension from 32KB/2-way in Cortex-A15 to 48KB/3-way in Cortex-A57. Then there's nVidia's Denver with 128KB icache, although that could be more influenced by VLIW or overhead from translation or whatever Denver is doing.

Maybe it's justified by code getting more bloated and library/middleware-ridden, and more JITed code being executed.

Intel could do it too but they've stuck with alias-free caches, so if they want to go above 32KB they'd have to increase beyond 8-way set associativity which is already high.
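For reference, the alias-free limit being alluded to is just that the cache size can be at most the page size times the number of ways for a virtually indexed, physically tagged L1; a small check, assuming the usual 4 KB pages:

[CODE]
/* Alias-free VIPT constraint: each way can be at most one page, so
 * max alias-free cache size = page_size * ways. Assumes 4 KB pages. */
#include <stdio.h>

int main(void)
{
    unsigned page_kb = 4;
    printf(" 8-way -> %2u KB max alias-free (where Haswell's L1 sits)\n", page_kb * 8);
    printf("12-way -> %2u KB max alias-free\n", page_kb * 12);
    printf("16-way -> %2u KB max alias-free\n", page_kb * 16);
    return 0;
}
[/CODE]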
 
One thing to ponder is what 28nm SHP portends for Excavator.
It is a bulk process, but still one tweaked somewhat for the benefit of AMD and high-speed CPUs over the standard processes GPUs thrive in.
Perhaps this is the last remnant of benefit from AMD's initial 32nm work, or a concession it or GF made in the last few years of process drama between the two.

Kaveri's clock rates regressed at least in part because of the loss of SOI and a reduction in the upper clock range in exchange for density.
Where does this take a design that is still in the Bulldozer lineage at 20nm or below, given the constriction in process choices and a heavy focus on SOC parameters?
The elements that allowed Kaveri to be a performance wash or disappointment may not persist, and 20nm planar is not looking to be that great even for SOC clients, much less a speed racer design.

The rumored TDP reduction in Carrizo may be a reflection of this.
If this is the case, then Bulldozer is very much an architecture for a different universe than the one we inhabit. Part of its design targets took into account maintaining high speeds on an inferior CPU logic process, but AMD's flaky mainline core strategy and better success with Jaguar point to BD's whole philosophy being incompatible with a process that goes beyond being inferior for high-speed logic and simply not being meant for it.

Another question, going by the AMD slides that indicate that bigger metals allow for higher clocks, coupled with the loss of clock, does this mean a 28nm SHP has compromised on that as well?
And what of the (apparently mostly unused) resonant clock mesh, which was primarily meant for the higher clock speeds and which has a taste for heavier metalization?
 
Both Windows 7 and Windows 8 have 32 bit versions. Most users who have computers with a 32 bit version of either OS do not know about it. It would be a customer support nightmare to explain to a paying customer that their brand new computer with Windows 8 doesn't support our game. I was personally hoping that Microsoft would have released only a 64 bit version of Windows 8, but unfortunately that didn't happen, so we will have computers sporting the latest GPUs with DirectX 11.2 and a 32 bit OS. And new computers are sold with 32 bit Windows 8 every day.

Unfortunately, games designed for a 32 bit OS cannot be designed to access more than 2 GB of memory (since that's the upper limit per process in a 32 bit OS). Thus this is the memory limit for most PC games. Not many game developers are willing to design a game with reduced game logic (smaller levels, etc.) just for 32 bit computers. That would add quite a bit to development costs.

This year we should start seeing more 64bit, DX11 only games regardless. The vast majority of modern systems are on 64bit OSes, gaming PCs even more so. Also consoles have 8GB RAM etc.

Soo... Haswell vs Kaveri! :D It seems Kaveri can match Haswell on the graphics front, although in the lower power envelopes Intel should dominate as usual. Even with a better architecture AMD can't win there. That ball is in GloFo or TSMC's court.
 
...of course, you have to support legacy 32 bit customers.
But you have already rewritten the driver for 64 bit support, so you can easily add the feature for the 64 bit family; 64 bit code and 32 bit code can't mix anyway.
Let the devs deal with the 32-bit porting; they will then choose what to do. Most of today's game customers should have had a 64 bit OS for years now, at least for the top-selling games that would love to use such features, I think.
It's not as simple as that, since it requires changing the VBIOS behavior at POST, when it doesn't know what flavor of OS you will load later.
pMax said:
...you say write-combining is ok... but you have to do a read-modify-write, no?
Aaah, I see. You mean the final write is delayed in the CPU until it needs to be flushed. Interesting, thanks.
Yes, write-combine buffers were added to the North Bridge in the AGP days IIRC; now they are integrated into the CPU. There used to be varying numbers of WC buffers per North Bridge, which caused confusion for low-level programmers.
 
One thing to ponder is what 28nm SHP portends for Excavator.
It is a bulk process, but still one tweaked somewhat for the benefit of AMD and high-speed CPUs over the standard processes GPUs thrive in.
Perhaps this is the last remnant of benefit from AMD's initial 32nm work, or a concession it or GF made in the last few years of process drama between the two.

Kaveri's clock rates regressed at least in part because of the loss of SOI and a reduction in the upper clock range in exchange for density.
Where does this take a design that is still in the Bulldozer lineage at 20nm or below, given the constriction in process choices and a heavy focus on SOC parameters?
The elements that allowed Kaveri to be a performance wash or disappointment may not persist, and 20nm planar is not looking to be that great even for SOC clients, much less a speed racer design.

The rumored TDP reduction in Carrizo may be a reflection of this.
If this is the case, then Bulldozer is very much an architecture for a different universe than the one we inhabit. Part of its design targets took into account maintaining high speeds on an inferior CPU logic process, but AMD's flaky mainline core strategy and better success with Jaguar point to BD's whole philosophy being incompatible with a process that goes beyond being inferior for high-speed logic and simply not being meant for it.

Another question, going by the AMD slides that indicate that bigger metals allow for higher clocks, coupled with the loss of clock, does this mean a 28nm SHP has compromised on that as well?
And what of the (apparently mostly unused) resonant clock mesh, which was primarily meant for the higher clock speeds and which has a taste for heavier metalization?

The Register had interesting information about this, from Joe Macri:

[URL=http://www.theregister.co.uk/2014/01/14/amd_unveils_kaveri_hsa_enabled_apu/?page=3]The Register[/URL] said:
There were reasons to go with 28nm rather than 22nm, Macri told us, that were discovered during the design process. That process was run by what he identified as a "cross-functional team" composed of "CPU guys, graphics guys, mixed-signal folks, our process team, the backend, layout team." […] The problem, he said, was that "our IDsat was unpleasant" at 22nm […] "So what we saw was the frequency just fall off the cliff," he said. "This is why it's so important to get to FinFET."

This leads me to think AMD might skip 20nm altogether and go straight for 14nm FinFET. However, I don't know how long that would take, and whether AMD would need another 28nm design in the meantime.
 
Between you and AMD I expect AMD to have done more real code analysis on capacity vs associativity misses.. well, I hope they have at least. Note that others seem to be trending in this direction too, Apple increased icache (and dcache) size with Cyclone, and ARM is doing the same sort of extension from 32KB/2-way in Cortex-A15 to 48KB/3-way in Cortex-A57. Then there's nVidia's Denver with 128KB icache, although that could be more influenced by VLIW or overhead from translation or whatever Denver is doing.

Maybe it's justified by code getting more bloated and library/middleware-ridden, and more JITed code being executed.

Intel could do it too but they've stuck with alias-free caches, so if they want to go above 32KB they'd have to increase beyond 8-way set associativity which is already high.
Well yes, I hope AMD did some analysis too. Though given how the BD family looks, you have to wonder what kind of analysis they really did...
It is quite true that CPUs seem to be increasing cache sizes lately. I don't know, though, what the L1i associativity of Swift/Cyclone/Denver is, so for comparison that's only half the story. Compared to BD and SR respectively, A15/A57 are still half the size with the same associativity. And none of these CPUs have to feed two cores with one L1i, so considering that, SR still doesn't really have a large cache (and its associativity is still terrible).
Thankfully AMD didn't do something crazy like 128kB 2-way associative :). A 3-way L1i is the highest associativity of any CPU AMD has ever done, so there's progress at least ;-).
 
It's not as simple as that, since it requires changing the VBIOS behavior at POST, when it doesn't know what flavor of OS you will load later.

I'm talking about this feature (mapping the entire graphics card memory into system memory) only being available on AMD FM2+ boards with AMD HSA-enabled processors and an AMD Hawaii/Sea Islands+ graphics card.

Yes, a BIOS setting to handle assigning 64-bit BARs to the graphics card is needed, as is a special 64-bit graphics card driver.
The default behavior for an AMD graphics card in an FM2+ board should be to assign 64-bit BARs.
While I don't think it would come to that, I don't think selling 64-bit-OS-only systems would be impractical. 64-bit OSes completely dominate the Steam Hardware Survey. Anyone buying the latest hardware now would be putting a 64-bit OS on it.
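For reference, whether a BAR is 32-bit or 64-bit shows up in the low bits of the BAR register itself; a minimal sketch of the decoding (how the register is read from PCI config space is left out, and the example values below are made up):

[CODE]
/* Decode the low bits of a PCI memory BAR register (config space offsets
 * 0x10, 0x14, ...) to see whether the device exposes a 32-bit or 64-bit BAR.
 * Reading the register is assumed to happen elsewhere. */
#include <stdint.h>
#include <stdio.h>

static void decode_bar(uint32_t bar)
{
    if (bar & 0x1) {                     /* bit 0: 1 = I/O space BAR */
        printf("I/O BAR\n");
        return;
    }
    uint32_t type = (bar >> 1) & 0x3;    /* bits 2:1: 00 = 32-bit, 10 = 64-bit */
    int prefetch  = (bar >> 3) & 0x1;    /* bit 3: prefetchable */
    printf("%s memory BAR, %sprefetchable\n",
           type == 0x2 ? "64-bit" : "32-bit",
           prefetch ? "" : "non-");
}

int main(void)
{
    decode_bar(0xE000000C);  /* made-up value: decodes as 64-bit, prefetchable */
    decode_bar(0xFEB00000);  /* made-up value: decodes as 32-bit, non-prefetchable */
    return 0;
}
[/CODE]

Mapping the whole of VRAM above 4 GB needs the firmware to request and place large 64-bit BARs, which is the BIOS-side change being described.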

AMD's graphics cards are very popular.
If they can say "our graphics cards work great on all platforms, but plug one into an HSA-enabled system and you get X% better performance and these extra features", then they can sell a heap of FM2+ and HSA processor systems and push out Nvidia.

Neither Intel nor Nvidia can compete with this.
Intel doesn't have the graphics cards and Nvidia doesn't have the cpu or motherboard chipsets.

As I see it, Intel, AMD and Nvidia are all heading down separate paths now.
I don't see AMD implementing AVX-512, for example, or Intel and Nvidia implementing HSA.
AMD is going to need a solid niche if it's going to survive.

Note that for 32-bit games running on 64-bit Windows, I believe it's up to the WOW64 emulation layer to handle emulating the legacy method.
 
Between you and AMD I expect AMD to have done more real code analysis on capacity vs associativity misses.. well, I hope they have at least. Note that others seem to be trending in this direction too, Apple increased icache (and dcache) size with Cyclone, and ARM is doing the same sort of extension from 32KB/2-way in Cortex-A15 to 48KB/3-way in Cortex-A57.
It's also a more straightforward way of enhancing a cache without having to rearchitect it. Adding or removing capacity by adding or removing ways is what Intel does in reverse when it disables part of the L3 cache in its processors in lockstep with reductions in associativity.

The cache pipeline and the control logic don't need to change significantly. There's either another cache array performing the exact same checks as the others, or there isn't.

In AMD's case, they probably felt some kind of capacity and associativity crunch relative to when there were independent 64 KB Icaches, and at the same time they weren't in a position to replumb the instruction cache and fetch pipeline to change the logic there. I'm still a little hazy on what the fetch logic has to do in a cycle with Steamroller, but it sounds like there are some decently tight timings that would require more engineering effort to get right.

Losing SOI may have allowed them the chance to use a denser cell, so they may have found themselves with extra room, but not much extra engineering capability or time to market for a new design and a weaksauce fab process. Slathering on another way of the same crusty cache architecture uses up room, but not as much cost has to be further sunk into Bulldozer's line.

ARM likely has similar cost-benefit analysis in terms of how much effort it really wants to put into its cores.
 
It is quite true that CPUs seem to be increasing cache sizes lately. I don't know, though, what the L1i associativity of Swift/Cyclone/Denver is, so for comparison that's only half the story. Compared to BD and SR respectively, A15/A57 are still half the size with the same associativity. And none of these CPUs have to feed two cores with one L1i, so considering that, SR still doesn't really have a large cache (and its associativity is still terrible).
Looks like only Intel is out of the "trend", but their main architecture line has sported a rather fast and low-latency L2 as a standard feature for the last 5 years. Curiously, Intel has been more generous with i-cache sizes in its Itaniums for several generations now, but that's mainly because of the overhead of the weird instruction bundle format.
 