Keller himself stated precisely that as the main reason K12 would be the more performant of the two designs: they could get rid of the x86 decoding.
Since they scrapped SkyBridge, they've gone silent about K12, yep.
It shouldn't be too hard to turn that into an ARM chip by replacing the instruction decoder I guess.
That remaining 24% of the power would still be an increment above what you'd pay if the L3 were inclusive and served as the snoop filter itself, but perhaps AMD saw performance or power gains in freeing up the L3 capacity.
TLBs would be easy. ARM v8 supports 4, 16, or 64 KB page sizes; which one is implementation-specific, so AMD could just roll with 4 KB pages.

The data TLBs and load/store subsystem have reason to change. The TLBs need to, although I suppose K12 might be able to live with the stricter requirements of the x86 memory queues.
AFAICT the address generation of x64 is a superset of ARM v8. In ARM you have the following address generation rules: base, base+index, base+scaled index, and PC-relative modes.

The AGUs or their equivalents would behave differently; the flags generated in the pipelines would have substantial similarities without fully matching.
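To make that concrete, here's roughly how a plain strided access lowers on the two ISAs; the assembly in the comments is typical compiler output, not something I've verified for any particular core:

[CODE=cpp]
#include <cstdint>

// Illustrative only -- typical compiler output, not verified for any core.
// x86-64:  mov rax, [rdi + rsi*8]        ; base + index*scale in one mode
// ARMv8:   ldr x0, [x0, x1, lsl #3]      ; base + shifted index register
int64_t load_elem(const int64_t* base, int64_t idx) {
    return base[idx];
}

// x64 can also fold a displacement into the same mode,
// e.g. mov rax, [rdi + rsi*8 + 16]; ARMv8's register-offset form has no
// immediate displacement, so the +2 gets folded into the index or base first.
int64_t load_elem_disp(const int64_t* base, int64_t idx) {
    return base[idx + 2];                 // +2 elements = +16 bytes
}
[/CODE]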
It's a trade-off AMD made differently than Intel, given the choice to make the L2 large enough that a backing L3 could get away with lower associativity.

Since each L2 is 8-way set-associative, an inclusive L3 would have to be at least 32-way set-associative to avoid false evictions.
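Back-of-the-envelope version of that, with the CCX layout assumed to be four cores with private 8-way L2s:

[CODE=cpp]
#include <cstdio>

int main() {
    // Worst case for an inclusive L3 in a Zen-like CCX (numbers assumed):
    const int cores_per_ccx = 4;   // four cores share one L3
    const int l2_ways       = 8;   // each private L2 is 8-way

    // If all four L2s hold lines that index into the same L3 set, an
    // inclusive L3 must keep every one of them resident, otherwise evicting
    // an L3 line forces a back-invalidation of a line a core is still using.
    std::printf("minimum inclusive-L3 associativity: %d ways\n",
                cores_per_ccx * l2_ways);
    return 0;
}
[/CODE]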
The line is "76% less power than accessing L2 tags directly", which I interpret as a description of the whole shadow-tag check process versus broadcasting to all 4 L2s and checking them.

Also, the way I read it, AMD has optimized the checking of shadow tags for power, so checking the 32-way shadow tags only takes 24% of the power of a normal 32-way parallel tag comparator. They do this by first doing a 32-way comparison of a small number of the tag bits (say... four), and then checking the subsequent positive candidates for hits.
This may or may not be true, based on the status of the L2 line, and perhaps based on the line's history of sharing. AMD's L3 apparently retains the old behavior from prior generations where it's mostly a victim cache until it decides to retain lines it determines are being frequently shared between cores.

This two-stage tag comparison is obviously slower than a full 32-way comparator, but if you hit, you get to read out from an L2 with its shorter access time (so the total time ends up being at least as fast as an inclusive L3).
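A toy sketch of that two-stage check as I read it (not AMD's actual circuit; the 4-bit partial compare and the 32 ways are just the assumptions from above):

[CODE=cpp]
#include <array>
#include <cstdint>
#include <vector>

// Toy model of the two-stage shadow-tag check described above: stage 1
// compares only a few tag bits across all ways, stage 2 does the full
// comparison only for the ways that survived stage 1.  The power argument
// is that the wide comparators fire for far fewer than 32 ways.
struct ShadowTagSet {
    static constexpr int      kWays        = 32;   // 4 cores x 8-way L2 (assumed)
    static constexpr uint32_t kPartialMask = 0xF;  // assume 4 low tag bits in stage 1

    std::array<uint32_t, kWays> tag{};    // full shadow tags
    std::array<bool, kWays>     valid{};

    int lookup(uint32_t probe_tag) const {
        std::vector<int> candidates;
        for (int w = 0; w < kWays; ++w)             // stage 1: narrow compare
            if (valid[w] && (tag[w] & kPartialMask) == (probe_tag & kPartialMask))
                candidates.push_back(w);

        for (int w : candidates)                    // stage 2: full compare
            if (tag[w] == probe_tag)
                return w;                           // hit: snoop that L2
        return -1;                                  // miss: no L2 holds the line
    }
};
[/CODE]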
Their behaviors are somewhat different. For example, Zen supports PTE coalescing like ARM v8 does, but they don't settle on the same size.
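Quick numbers on what the granule and coalescing choices do to TLB reach; the entry count is made up, and the coalescing factors (4 adjacent PTEs per entry for Zen, 16 for ARM v8's contiguous bit at the 4 KB granule) are just the commonly cited values:

[CODE=cpp]
#include <cstdio>

int main() {
    const long entries = 64;   // hypothetical L1 data-TLB entry count

    // Plain reach (KB mapped) for the three ARMv8 translation granules.
    std::printf("4 KB granule         : %5ld KB reach\n", entries * 4);
    std::printf("16 KB granule        : %5ld KB reach\n", entries * 16);
    std::printf("64 KB granule        : %5ld KB reach\n", entries * 64);

    // Coalescing lets one entry map several contiguous 4 KB pages.
    // Assumed factors: 4 adjacent PTEs per entry (Zen-style) vs 16 PTEs
    // under ARM v8's contiguous-bit hint -- one reading of "they don't
    // settle on the same size".
    std::printf("4 KB, coalesce x4    : %5ld KB reach\n", entries * 4 * 4);
    std::printf("4 KB, contiguous x16 : %5ld KB reach\n", entries * 4 * 16);
    return 0;
}
[/CODE]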
That would be part of what is left on the table. The depth and number of buffers/queues would be affected by how expensive they are.

AMD could just use the strong memory ordering model of x64 in their ARM v8 implementation; wrt. correctness it is a superset. It takes a lot of hardware to make a strong memory ordering model fast, but once you do, it is as fast as a weak memory ordering model (because hardware knows more at runtime about fencing hazards than a compiler does at compile time).
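You can see the "correctness superset" point in how the same acquire/release C++ lowers on each ISA; the assembly notes are the usual mappings, not verified output:

[CODE=cpp]
#include <atomic>

std::atomic<int> flag{0};
int payload = 0;

// Producer: publish payload, then release the flag.
void publish() {
    payload = 42;
    flag.store(1, std::memory_order_release);
    // x86-64 (typical): plain mov  -- TSO already forbids the reordering
    // ARMv8  (typical): stlr       -- a store-release instruction is needed
}

// Consumer: acquire the flag, then read payload.
int consume() {
    while (flag.load(std::memory_order_acquire) == 0) { /* spin */ }
    // x86-64 (typical): plain mov
    // ARMv8  (typical): ldar
    return payload;   // guaranteed to see 42 once the flag is observed
}
[/CODE]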
Ok, so there is a bright side! Windows 8 won't work with Ryzen!!
That doesn't sound like a good compromise... The number of memory instructions needing synchronization in a common code base is several orders of magnitude smaller compared to basic loads/stores. Faster and more energy-efficient loads & stores are more important than faster release/acquire atomics/barriers. The x64 memory model is too strict; the ARMv8 memory model is very well balanced.
More outstanding accesses might be possible if the constraints on checking them are relaxed. On the other hand, a strongly ordered pipeline makes cases with synchronization cheaper and more consistent than a broader load/store system that needs heavyweight fencing.
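A trivial illustration of the ratio argument: a producer that does thousands of plain stores and exactly one store that actually needs ordering (buffer size and names are made up):

[CODE=cpp]
#include <atomic>
#include <cstddef>

constexpr std::size_t kN = 4096;        // hypothetical buffer size
int buffer[kN];
std::atomic<bool> ready{false};

// Thousands of ordinary stores, one store that needs ordering.  On a weakly
// ordered ISA only the release store pays for ordering; on a TSO machine
// every plain store is implicitly ordered, which is where the "lots of
// hardware to make it fast" argument above comes in.
void produce() {
    for (std::size_t i = 0; i < kN; ++i)
        buffer[i] = static_cast<int>(i);            // plain stores: the common case
    ready.store(true, std::memory_order_release);   // the rare synchronizing store
}
[/CODE]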
Btw, wouldn't it make sense to try to make square wafers? The amount of area lost with circular ones is huge.
This may depend on what workloads are part of the set. Zen and K12 are ranging from consumer to HPC and high(er)-end servers.
The memory pipeline is one area where I considered the possibility that the "engine" could be bigger. The decoder could be wider, up to the uop issue width. The actual growth in processor width may become gated by the renamer rather than the decoder.

I am not a hardware engineer, but my educated guess is that a looser memory model (reordering loads and stores) would affect many places in the OoO machinery. Keller saying that K12 has better performance than Zen would point in the direction that AMD planned to customize the chip to take advantage of the looser ARMv8 memory model.
It's a result of how the wafer is created. The wafer must be one large silicon crystal, which is created by starting from a seed in a vat of molten silicon and slowly growing it outwards.
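For the "area lost" part, a quick brute-force count of whole dies on a round wafer, with a made-up 10 mm die, puts a rough figure on how much of the circle ends up as edge waste:

[CODE=cpp]
#include <cstdio>

int main() {
    // Hypothetical numbers just to put a figure on the edge loss.
    const double wafer_d = 300.0;   // mm, standard wafer diameter
    const double die_w   = 10.0;    // mm
    const double die_h   = 10.0;    // mm
    const double pi      = 3.14159265358979;

    const double r = wafer_d / 2.0;
    int whole_dies = 0;

    // Walk a grid of die positions and keep only dies whose four corners
    // all land inside the circle; everything else is edge waste.
    for (double x = -r; x + die_w <= r; x += die_w) {
        for (double y = -r; y + die_h <= r; y += die_h) {
            bool inside = true;
            const double xs[2] = {x, x + die_w};
            const double ys[2] = {y, y + die_h};
            for (double cx : xs)
                for (double cy : ys)
                    if (cx * cx + cy * cy > r * r) inside = false;
            if (inside) ++whole_dies;
        }
    }

    const double wafer_area = pi * r * r;
    const double used_area  = whole_dies * die_w * die_h;
    std::printf("whole dies: %d, unusable edge area: %.1f%%\n",
                whole_dies, 100.0 * (1.0 - used_area / wafer_area));
    return 0;
}
[/CODE]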