> How is it possible that Bobcat is so small and yet so powerful compared to the Atom core, which still contains only the necessary stuff?

Better architecture?
The other elephant in the room is the 32nm gate-first HKMG SOI process, which, for all of AMD's and GF's bluster, has not been shown to have overcome its known problems, against the public weight of pretty much everybody else going gate-last, ahead of AMD, faster than AMD, and to great effect in terms of yields and variability.
> Better architecture?

AFAIK Atom was based on old Pentium CPUs, while Bobcat was newly engineered from the ground up.
> I am a few days late, but this surprised me a little bit. Apart from Intel, who is going gate-last and ahead of GloFo, let alone faster and with better yields?

TSMC will be a close finish, and that is with the consideration that it did an about-face and switched to gate-last after seeing what happened with gate-first.
> I believe IBM went gate-first and has been making 45nm CPUs with HK/MG for a while.

It's been producing chips with margins and service revenue that make otherwise horrible yields acceptable.
> TSMC is going gate-last, but doesn't appear to have any significant advantage over GloFo at this point, whether in terms of time to market, yields or performance.

It might be a close finish even after TSMC changed its mind halfway through.
> An unknown about the implementation of AVX on BD could make it worse: if the FP registers are physically 128-bit, the register count for 256-bit values is effectively halved, whereas SB natively supports the full width.

How, exactly, can you support AVX and NOT have ymm registers?
With all the other changes being very carefully chosen then there must surely be a good reason for such a small L1D.The L1 is small, and the shared L2, reminiscent of Conroe or Penryn, is slower on a per-cycle basis.
> How, exactly, can you support AVX and NOT have ymm registers?
IIRC, the ISA says there are 16 128-bit registers (the xmm series) and 16 256-bit registers (the ymm series) aliased to the xmm series. So if the ISA says there are 16 128-bit registers and 16 256-bit registers, you could in reality have, say, 64 or 80 actual 128-bit physical registers; all the architectural values would fit into those, and you would still have some "freedom of renaming to avoid antidependencies".
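As a minimal sketch of that renaming idea (this is not AMD's actual design; the physical register count, the table layout, and all the names below are made up for illustration), each architectural ymm could map onto a pair of 128-bit physical entries, with the xmm alias being the low half:

```c
/* Illustrative rename sketch, assuming a hypothetical pool of 64 physical
 * 128-bit registers. A ymm value occupies two entries (low/high halves);
 * an xmm value, being the low half of its ymm alias, occupies one. */
#include <stdio.h>

#define ARCH_REGS 16   /* ymm0..ymm15; xmm0..xmm15 alias their low halves */
#define PHYS_REGS 64   /* assumed physical 128-bit register count */

typedef struct {
    int lo;            /* physical register holding bits 0..127   */
    int hi;            /* physical register holding bits 128..255 */
} RenameEntry;

static RenameEntry rat[ARCH_REGS];          /* register alias table */
static int free_list[PHYS_REGS], free_top;

static int alloc_phys(void) { return free_list[--free_top]; }

/* A write to ymmN gets two fresh physical registers, which is what breaks
 * write-after-write and write-after-read antidependencies. */
static void rename_ymm_write(int n) {
    rat[n].lo = alloc_phys();
    rat[n].hi = alloc_phys();
}

/* A write to xmmN only re-allocates the low half. (Real VEX-encoded writes
 * also zero the upper half; that detail is glossed over here.) */
static void rename_xmm_write(int n) {
    rat[n].lo = alloc_phys();
}

int main(void) {
    for (int i = 0; i < PHYS_REGS; i++) free_list[free_top++] = i;
    for (int i = 0; i < ARCH_REGS; i++) rename_ymm_write(i);
    rename_xmm_write(0);   /* a new xmm0 value: only the low half moves */
    printf("ymm0 -> P%d (lo), P%d (hi)\n", rat[0].lo, rat[0].hi);
    return 0;
}
```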
> With all the other changes being very carefully chosen, there must surely be a good reason for such a small L1D.

AMD is targeting high clocks, and a larger L1 might have been too much to fit into the shorter cycle time, at least not without increasing the L1 latency even further.
> I'm presuming that they are very confident that the new prefetchers will make sure the L1D is nearly always populated with the next piece of needed data, or at least that sufficient other instructions will be ready in the L1I to execute while waiting for the L1D to be populated. Alternatively, perhaps their modelling found that most of the time 16KB or less of the L1D is actually ever re-used, so having a bigger one is just a waste because nearly all the data gets evicted or invalidated anyway.

The smaller L1 may be an acceptable sacrifice for a multithreaded server load.
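That 16KB-reuse hypothesis is the kind of thing a crude microbenchmark can probe. Here is a rough sketch (generic POSIX timing; the buffer sizes are picked arbitrarily and nothing here is specific to BD) that walks buffers of different sizes and watches where the time per pass falls off a cliff:

```c
/* Rough working-set probe: repeatedly touch one byte per 64-byte cache line
 * in buffers of several sizes. If the re-used set really fits in 16KB, the
 * smallest buffer should run at L1 speed and the larger ones fall to L2/L3. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double walk(volatile char *buf, size_t size, long passes) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long p = 0; p < passes; p++)
        for (size_t i = 0; i < size; i += 64)   /* one touch per cache line */
            (void)buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    size_t sizes[] = { 16 * 1024, 64 * 1024, 1024 * 1024 };
    for (int i = 0; i < 3; i++) {
        char *buf = malloc(sizes[i]);
        memset(buf, 0, sizes[i]);                   /* fault the pages in   */
        long passes = 2000000000L / (long)sizes[i]; /* ~constant total work */
        printf("%8zu bytes: %.3f s\n", sizes[i], walk(buf, sizes[i], passes));
        free(buf);
    }
    return 0;
}
```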
> High per-cycle latency for the L2 should be fine as long as the actual clock rate is suitably high. (Is cache latency measured in core clocks or cache clocks?)

The latency should be measured in core clocks.
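To make the core-clocks point concrete, here is a trivial worked example with made-up numbers (neither clock nor cycle count refers to a real part): a higher per-cycle latency at a higher clock can be the same wall-clock latency.

```c
/* Cycles vs. wall-clock: the same absolute latency looks "worse" when
 * counted in cycles on a faster-clocked chip. All numbers are invented. */
#include <stdio.h>

int main(void) {
    double ghz_a = 3.0, cyc_a = 15;  /* hypothetical moderate-clock design */
    double ghz_b = 4.0, cyc_b = 20;  /* hypothetical high-clock design     */
    printf("A: %2.0f cycles @ %.1f GHz = %.2f ns\n", cyc_a, ghz_a, cyc_a / ghz_a);
    printf("B: %2.0f cycles @ %.1f GHz = %.2f ns\n", cyc_b, ghz_b, cyc_b / ghz_b);
    return 0;   /* both come out to 5.00 ns */
}
```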
> I still hold out a probably forlorn hope for BD's L3 to be a big eDRAM or similar high-density tech.

AMD is free to whip out a surprise for the L3. AMD's cache subsystem has normally been sub-par compared to Intel's, so it would be nice to see a change.
For much of BD's market, the additional throughput is worth the single-threaded IPC loss, but that goes to show that the desktop and laptop markets are not the primary target.
> If you're only running a single-threaded workload on a BD module, resources seem more than ample.

Sandy Bridge is significantly better-provisioned for the single-threaded case than BD in terms of load/store queue depth, integer rename, and ROB capacity. Its L1 is larger, and its L2 is faster. BD has more capacity at the L2 level, but the latency numbers are ho-hum and don't include the L3, which unfortunately cannot be massively faster because the L2's long latency sits in the way.
> The data cache is smaller but is 4-way associative vs 2-way for K8. The cache structure of the D$s and L2 is now inclusive, so a cache miss is simpler to serve; in particular, back-to-back misses don't see the latency penalty associated with swapping cache lines on earlier AMD CPUs.

Despite the simpler arrangement, the L2 latency is still slower on a cycle basis than other shared-cache implementations. The associativity of the L1 is twice as high, but the cache is a quarter of the size, so we might see this as break-even at best.
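For what it's worth, a toy contrast of the two policies (schematic only, not a model of the real pipelines): an exclusive hierarchy couples the L1 fill with moving the evicted victim down into the L2, so back-to-back misses queue behind the swap, while an inclusive one can simply overwrite the victim because the L2 already holds a copy.

```c
/* Toy sketch of exclusive vs. inclusive miss handling; "CacheSlot" and the
 * single-slot model are invented for illustration. */
#include <stdio.h>

typedef struct { int line; } CacheSlot;

/* Exclusive (older AMD style): the fill and the victim writeback are one
 * coupled swap, occupying the L1<->L2 path in both directions. */
static void miss_exclusive(CacheSlot *l1, CacheSlot *l2_victim_slot, int needed) {
    int victim = l1->line;
    l1->line = needed;             /* fill L1 from L2 ...           */
    l2_victim_slot->line = victim; /* ... and push the victim to L2 */
}

/* Inclusive (BD style, per the post above): L2 already has the victim,
 * so the L1 copy is just dropped and the next miss can start sooner. */
static void miss_inclusive(CacheSlot *l1, int needed) {
    l1->line = needed;
}

int main(void) {
    CacheSlot l1 = { 1 }, l2slot = { 0 };
    miss_exclusive(&l1, &l2slot, 3);
    printf("exclusive: L1=%d, L2 slot=%d (victim moved down)\n", l1.line, l2slot.line);
    miss_inclusive(&l1, 4);
    printf("inclusive: L1=%d (victim simply dropped)\n", l1.line);
    return 0;
}
```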
> With speculative loads and 128 instructions in flight, there should be no problem covering the latency of the L2. In a single-threaded situation, all of the L2's resources will be dedicated to a single core.

The L2's capacity is the big non-debatable advantage BD has over SB that I see.
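Back-of-the-envelope on the coverage claim: the 128-in-flight figure is from the post above, while the issue width and L2 latency below are assumed values, not published numbers.

```c
/* Can 128 instructions in flight hide the L2 latency? A window of 128
 * instructions at a sustained 4-per-cycle issue rate represents 32 cycles
 * of work, comfortably more than an assumed ~20-cycle L2 load-to-use. */
#include <stdio.h>

int main(void) {
    int window = 128;  /* instructions in flight (from the post)    */
    int width  = 4;    /* assumed sustained issue width, per cycle  */
    int l2_lat = 20;   /* assumed L2 load-to-use latency, in cycles */
    int cover  = window / width;
    printf("window covers %d cycles vs. L2 latency of %d -> %s\n",
           cover, l2_lat, cover >= l2_lat ? "covered" : "exposed stall");
    return 0;
}
```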
> Within the processor core, BD has an assumed small per-clock advantage in its FP rename resources when not using AVX. It has potentially higher 128-bit throughput, though the extent to which this can be exploited is capped by the load/store capability of a single core and a smaller number of issue ports for FP operations.

For single-threaded work, shouldn't FP loads and stores from a single thread be able to use both load/store paths? L2-to-L1 data only has to be duplicated so that everything in either L1 is in the other, to enable arbitrary loads, and stores from the FP unit can use both paths to keep the two L1s coherent.