Where did the idea about speculative execution come from?
It probably came from a number of places, mostly from people reviewing patent applications and presentations on similar ideas for future architectures, such as AMD's patents and papers by Andy Glew. It's one of the perils of relying on patents and theoretical papers to predict what a design will turn out to be.
Patents are not always used, and architects are free to think up all sorts of ideas that may never reach a physical chip for any number of reasons, or may be deferred to later designs.
Among other things, Glew had posited eager execution and a facility for extremely fast thread forking to the paired cores.
What do they mean by "Only needed arrays are clocked"?
Only part of a cache is active in any given cycle, and arrays can be clock-gated when not in use.
The cache waits for the tag check to report a hit before selecting and clocking only the data array that holds the needed line. This adds latency to the access, whereas in most chips all of the arrays within the associative set begin the access before it is known which line will actually be used.
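A behavioral sketch of the difference (my own toy model, not AMD's circuit; the 4-way geometry and the unit-energy accounting are illustrative assumptions):

```python
# Toy model of "only needed arrays are clocked" in a 4-way set-associative
# cache: each array read costs one unit of energy.

WAYS = 4

def parallel_access(tags, data, tag):
    """Conventional design: all data arrays start the access while the
    tags are still being checked, then a late mux selects the hit way."""
    energy = WAYS + WAYS              # clock every tag array and every data array
    for way in range(WAYS):
        if tags[way] == tag:
            return data[way], energy
    return None, energy

def serial_access(tags, data, tag):
    """Wait for the tag hit first, then clock only the one data array
    holding the line: lower power, but an extra step of latency."""
    energy = WAYS                     # clock every tag array...
    for way in range(WAYS):
        if tags[way] == tag:
            energy += 1               # ...but only the hitting data array
            return data[way], energy
    return None, energy

tags = [0x1A, 0x2B, 0x3C, 0x4D]
data = ["lineA", "lineB", "lineC", "lineD"]
print(parallel_access(tags, data, 0x3C))  # ('lineC', 8): all arrays clocked
print(serial_access(tags, data, 0x3C))    # ('lineC', 5): one data array clocked
```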
The second line in that slide seems to indicate there is slightly more indirection involved in branch prediction. My interpretation is that there is a branch predictor-predictor that attempts to predict the branch type, and by extension only the type of predictor that is needed gets clocked.
Given that branch history tables rival some small L1 caches in size, and that there are now loop, jump, branch, indirect, and other predictors, this would save power at the price of likely worse latency.
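To make that interpretation concrete, here is a toy sketch (purely my reading of the slide, not a documented AMD structure; the table, the predictor stand-ins, and the default-to-conditional policy are all made up):

```python
# "Predictor-predictor" sketch: a small table guesses the branch *type* at
# a PC, so only that type's predictor arrays get clocked this cycle.

type_table = {}    # PC -> last-seen branch type

predictors = {     # stand-ins for the real structures (BHT, indirect
    "cond":     lambda pc: "taken",       # target array, loop detector),
    "indirect": lambda pc: 0x401000,      # each clock-gated when not chosen
    "loop":     lambda pc: "exit in 3",
}

def predict(pc):
    kind = type_table.get(pc, "cond")     # guess the type; default conditional
    return kind, predictors[kind](pc)     # clock only that predictor

def update(pc, actual_kind):
    type_table[pc] = actual_kind          # a wrong type guess retrains the
                                          # table and costs extra latency

print(predict(0x400))                 # ('cond', 'taken')
update(0x400, "indirect")
print(predict(0x400))                 # ('indirect', 4198400)
```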
The Bulldozer slides indicate that there has been a change in the integer pipeline as well. The PRF-based register renaming means its OoO scheme is likely closer to Bobcat's than to K8's (which used reservation stations). This may make it closer in some ways to the Pentium 4 or to K8's FPU.
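A minimal sketch of what PRF-based renaming means in practice (a generic P4/Bobcat-style scheme, not BD's actual tables; the register names and file size are made up for illustration):

```python
# PRF-based renaming: the scheduler holds tags, the physical file holds values.

NUM_PHYS = 8
prf = [0] * NUM_PHYS                     # values live only in the physical file
rename_map = {"rax": 0, "rbx": 1}        # architectural reg -> physical reg
free_list = list(range(2, NUM_PHYS))     # unallocated physical registers

def rename(dest, srcs):
    """Scheduler entries carry physical-register *tags*; operand values are
    read from the PRF at execute time, rather than being copied into the
    scheduler the way K8's value-capturing reservation stations did."""
    src_tags = [rename_map[s] for s in srcs]   # look up current mappings
    new_tag = free_list.pop(0)                 # allocate a fresh physical reg
    rename_map[dest] = new_tag                 # dest now lives there
    return new_tag, src_tags

print(rename("rax", ["rax", "rbx"]))   # (2, [0, 1]): 'rax' now maps to p2
```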
The data cache is way-predicted. I'm not sure if K8 did this. P4 did, I believe.
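For reference, way prediction generically works something like this (the slides don't detail AMD's scheme; the "last way that hit" policy and the retraining are assumptions):

```python
# Generic way-prediction sketch: probe one predicted way first, fall back
# to a full associative lookup when its tag doesn't match.

WAYS = 4
way_pred = {}                        # set index -> predicted way

def access(sets, index, tag):
    tags, data = sets[index]
    way = way_pred.get(index, 0)
    if tags[way] == tag:             # fast path: one tag + one data array
        return data[way], "predicted-way hit"
    for w in range(WAYS):            # slow path: full associative lookup
        if tags[w] == tag:
            way_pred[index] = w      # retrain on the correct way
            return data[w], "way mispredict (extra latency)"
    return None, "miss"

sets = {0: ([0x1, 0x2, 0x3, 0x4], ["A", "B", "C", "D"])}
print(access(sets, 0, 0x3))   # way mispredict: retrains set 0 to way 2
print(access(sets, 0, 0x3))   # predicted-way hit: only one way is clocked
```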
Per core, Bulldozer is slightly inferior in the number of load queue entries and proportionately worse in store entries (40 vs. 48 loads, 24 vs. 32 stores).
In a multithreaded situation, however, the picture changes significantly: most of those resources in the Intel design are cut in half with threading, while BD brings a whole other core.
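To put rough numbers on it (assuming the Intel queues are statically partitioned under SMT): two threads sharing one Intel core would each see about 24 load and 16 store entries, while two threads on a BD module would each keep the full 40 and 24.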
BD's branch predictor is capable of running ahead of the main pipeline, which in stall situations can keep the predictor's latency from adding to whatever stall happened in the main pipe.
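The generic mechanism is a decoupled front end with a prediction queue; a sketch of the technique (not BD's actual structure; the queue depth and the stand-in prediction logic are assumptions):

```python
# Decoupled, run-ahead predictor: predictions are buffered so the predictor
# keeps working while the fetch stage is stalled.

from collections import deque

fetch_target_queue = deque(maxlen=8)     # buffers predictions ahead of fetch

def predictor_cycle(pc):
    """Runs whenever there is queue room, even while fetch is stalled, so
    the predictor's latency hides behind the stall instead of adding to it."""
    if len(fetch_target_queue) < fetch_target_queue.maxlen:
        next_pc = pc + 16                # stand-in for a real prediction
        fetch_target_queue.append(next_pc)
        return next_pc
    return pc                            # queue full: nothing to run ahead on

def fetch_cycle(stalled):
    """Fetch drains queued targets when it can; a stall (say, an I$ miss)
    leaves the predictor free to keep filling the queue."""
    if not stalled and fetch_target_queue:
        return fetch_target_queue.popleft()
    return None

pc = 0x1000
for _ in range(4):                       # predictor runs ahead during a stall
    pc = predictor_cycle(pc)
print(hex(fetch_cycle(False)))           # 0x1010: targets already waiting
```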
There are also more aggressive prefetchers, including a correlative prefetcher for data accesses, which sounds vaguely like applying a correlating branch predictor's methods to memory accesses.
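If that reading is right, it would resemble a Markov-style correlation prefetcher; a minimal sketch (my analogy, not a described AMD mechanism; the table structure and two-successor prefetch degree are arbitrary):

```python
# Correlation prefetcher: remember which miss address tended to follow which,
# much as a correlating branch predictor remembers outcome patterns.

from collections import defaultdict

successors = defaultdict(list)   # miss address -> addresses seen to follow it
last_miss = None

def on_miss(addr):
    global last_miss
    if last_miss is not None and addr not in successors[last_miss]:
        successors[last_miss].append(addr)   # learn "addr follows last_miss"
    prefetches = successors[addr][:2]        # prefetch up to two likely followers
    last_miss = addr
    return prefetches

for a in [0x100, 0x480, 0x100, 0x480]:       # a pointer-chasing style pattern
    print(hex(a), [hex(p) for p in on_miss(a)])
# On the second visit to 0x100 the prefetcher issues 0x480 ahead of the
# demand miss, something a plain stride prefetcher could never catch.
```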
Lightweight profiling is mentioned, and I'm curious about the details. Software could use it to load-balance or performance-tune more effectively.