It is interesting that it has not been given any particular name, like SSEx or AVX. Or are they just introducing new instructions now, to be implemented whenever they see fit?
Yup, it's a balanced trade-off -- poor associativity in exchange for larger size, low access latency, and sheer multi-banked R/W throughput (K10's L1D implementation is still the fastest solution in this regard). It has been a main philosophy of AMD's architectures since the very first K7.
That being said, it's pretty clear to me that it would be a huge undertaking for AMD to verify SMT. Definitely the kind of thing that would have made sense for Barcelona, but it's tantamount to doing an entirely new uarch.
By what metric do you mean fastest?
Itanium's L1 latency is 1 cycle.
Opteron's is 3.
My comment was intended more about the throughput, not so much about the access latency. And yes - I know there are lower-latency cache designs in a few named architectures (Willamette's and Northwood's L1D was 2 cycles, but a tiny size). Again - I'm more about the trade-off balance. AMD was keen enough to keep the L1D size, access time, bank organization, etc. pretty much intact throughout the entire K7~K10 run of architecture refreshes and manufacturing technology updates, and only now with K10 did they have to double the throughput from 2*64-bit R/W up to 2*128-bit to accommodate the widened FP/SIMD paths.
Looking at the larger picture, it seems AMD kept developing their last three uarch generations around the very same general cache design (quite unlike Intel, though) -- [relatively] large L1 arrays, low-associativity but high-throughput and latency-tolerant, compensated by a slow but highly (16-way) associative, exclusive L2 array. In fact, only the very first Slot-A K7 had an inclusive L2 relation, being an off-chip implementation of course. It's all a kind of heritage from the old days of scarce silicon die real estate, where an inclusive relation with such a large L1 (2*64KB) would have been a waste with a 256K L2, or even the larger sizes of the time.
It is still arguable why AMD initially made the decision to implement an architecture (K7) with such large L1s -- some say it was to compensate for the rather poor FP register file organization (I'm not sure), but it is a fact that they milked this "trade-off" model for too long.
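To put rough numbers on that trade-off, here's a minimal sketch in C of the geometry such a design implies. It assumes the commonly cited figures (64KB 2-way L1D with 64-byte lines; a 16-way exclusive L2, using 512KB as an example size) -- treat the numbers as illustrative, not as a statement of AMD's exact implementation:

```c
/* Back-of-the-envelope cache geometry sketch. Figures are the commonly
   quoted K7/K10 ones and are illustrative only. */
#include <stdio.h>

static unsigned log2u(unsigned x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

static void describe(const char *name, unsigned size, unsigned ways, unsigned line) {
    unsigned sets = size / (ways * line);       /* sets = size / (ways * line) */
    unsigned offset_bits = log2u(line);
    unsigned index_bits = log2u(sets);
    printf("%s: %u sets, %u index bits, %u offset bits, tag starts at bit %u\n",
           name, sets, index_bits, offset_bits, index_bits + offset_bits);
}

int main(void) {
    /* Low associativity + large array: many sets, a quick index lookup,
       and only 2 tag compares per access -- hence the low latency. */
    describe("K7/K10-style L1D (64KB, 2-way, 64B lines)", 64 * 1024, 2, 64);
    /* High associativity, slower, exclusive L2 as described above. */
    describe("K8-style L2 (512KB, 16-way, 64B lines)", 512 * 1024, 16, 64);
    return 0;
}
```

The point being: with only 2 ways the large L1 gets many sets, so the index resolves quickly and only two tags need comparing per access, which is where the low latency comes from despite the size.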
INQ has a story about AMD having taken up AVX.
Apparently 'SSE5' is no more; the SSE5 instructions that aren't part of AVX are now '+ XOP'.
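For anyone wanting to play with this once hardware ships: a hedged sketch of how software could tell AVX and XOP apart at runtime, using GCC/Clang's <cpuid.h>. The bit positions are the publicly documented ones (AVX: CPUID.1:ECX bit 28, XOP: CPUID.8000_0001h:ECX bit 11):

```c
/* Runtime detection sketch for AVX vs. AMD's XOP extension. */
#include <stdio.h>
#include <cpuid.h>

int main(void) {
    unsigned eax, ebx, ecx, edx;

    /* Standard leaf 1: AVX is ECX bit 28. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        printf("AVX: %s\n", (ecx & (1u << 28)) ? "yes" : "no");

    /* Extended leaf 8000_0001h: XOP is ECX bit 11. */
    if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
        printf("XOP: %s\n", (ecx & (1u << 11)) ? "yes" : "no");

    return 0;
}
```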
Anyway, 3dilettante, I think you misinterpreted what I was saying.
I think we are mostly suggesting the same thing.
I'm not arguing that it's a de-emphasised FPU; I'm saying that by sharing one between two INT cores, it will be getting more ops coming to it = higher utilisation than having one FPU per INT core.
The current case of having one FPU per INT core is presumably leaving that FPU idle a lot (inefficient).
The FPU is only going to get bigger over time so it needs to be running at high utilisation (efficient).
The 2nd INT thread is not about delivering massive INT throughput, but it will help since there are more INT ALUs.
(though I see now they won't be full speed, they'll each be slower per-clock due to being only 2 wide)
I don't know anything about frontend/decoder stuff, but I'd have thought 2 * 2-wide should be easier to get high utilisation out of (and/or simpler) than Intel's single 4-wide with SMT.
That should then allow them to engineer it better with the same resources, or to re-allocate resources to other areas of the chip.
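As a toy illustration of the utilisation argument (purely a sketch -- it assumes each INT core independently wants the FPU in a given cycle with probability p, and ignores stalls and queueing):

```c
/* Toy utilisation model for one FPU shared by two INT cores vs. one FPU
   per core. The probabilities are made up for illustration. */
#include <stdio.h>

int main(void) {
    for (double p = 0.1; p <= 0.51; p += 0.1) {
        double private_util = p;                              /* one FPU per INT core */
        double shared_util = 1.0 - (1.0 - p) * (1.0 - p);     /* P(at least one core issues) */
        printf("p=%.1f  private FPU busy: %3.0f%%   shared FPU busy: %3.0f%%\n",
               p, 100 * private_util, 100 * shared_util);
    }
    return 0;
}
```

Even this naive model shows the shared unit spends a noticeably larger fraction of cycles busy, which is the whole point of sharing it.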
The new core expands what has been the single-threaded nature of the AMD cores "in a different fashion than Hyperthreading," said Conway.
A similar approach was used by Digital Equipment Corporation in their VAX family of computers. Initially a 32-bit TTL processor in conjunction with supporting microcode implemented the programmer-visible architecture. Later VAX versions used different microarchitectures, yet the programmer-visible architecture didn't change.
Many RISC and VLIW processors are designed to execute every instruction (as long as it is in the cache) in a single cycle, which is very similar to the way CPUs with microcode execute one microinstruction per cycle. VLIW processors have instructions that behave like very wide horizontal microcode, although typically without microcode's fine-grained control over the hardware. RISC instructions are sometimes similar to narrow vertical microcode.
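To make the horizontal-vs-vertical distinction concrete, here's a small illustrative sketch in C; the field names and widths are invented for the example, not taken from any real machine:

```c
/* Illustrative contrast between "horizontal" and "vertical" microcode
   styles. All fields are hypothetical. */
#include <stdio.h>
#include <stdint.h>

/* Horizontal: one wide word where each field directly drives a control
   signal, so several units can be steered at once (VLIW-like). */
struct horizontal_uword {
    uint32_t alu_op      : 4;  /* directly selects the ALU function */
    uint32_t reg_src_a   : 5;  /* register file read port A         */
    uint32_t reg_src_b   : 5;  /* register file read port B         */
    uint32_t reg_dst     : 5;  /* write port                        */
    uint32_t mem_rd      : 1;  /* independent memory-pipe controls  */
    uint32_t mem_wr      : 1;
    uint32_t branch_cond : 3;  /* sequencer condition select        */
    uint32_t next_addr   : 8;  /* explicit microcode sequencing     */
};

/* Vertical: a short encoded opcode that must be decoded into the control
   signals above -- denser, but one operation per word (RISC-like). */
struct vertical_uword {
    uint16_t opcode  : 6;
    uint16_t operand : 10;
};

int main(void) {
    printf("horizontal word: %zu bytes, vertical word: %zu bytes\n",
           sizeof(struct horizontal_uword), sizeof(struct vertical_uword));
    return 0;
}
```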
Should this Bulldozer core be considered a good design for a desktop processor that has to score well in game benchmarks? Two 2-unit integer groups sharing one float unit... games tend to use more floating-point operations, don't they?