I'll include this link since it has the most complete slide set I've encountered:
http://www.anandtech.com/show/10907...nvme-neural-net-prediction-25-mhz-boost-steps
The overall concept of neural net branch prediction isn't new, since perceptron predictors have already been used in prior AMD CPUs.
The hashed version seems to be different, but in what way is not clear. There are known weaknesses: Agner's optimization document notes that nested loops don't predict particularly well, and research shows that perceptron predictors struggle with patterns that are predictable but not linearly separable (the classic example being an outcome that is the XOR of two history bits; a simple alternating taken/not-taken pattern is still linearly separable).
A hashed perceptron might do more to detect such cases and route them to different predictors, or somehow decompose them so that the perceptron can digest the components more readily.
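For anyone who hasn't seen the base technique, here's a minimal C sketch of a perceptron predictor with a simple PC/history hash picking the weight-table row. This is essentially the textbook Jimenez/Lin scheme, not AMD's actual design; a real hashed perceptron apparently hashes several history segments into separate tables, and every size, hash, and threshold below is purely illustrative.

[code]
#include <stdint.h>
#include <stdbool.h>

#define HIST_LEN   16      /* global history bits per prediction (illustrative) */
#define TABLE_SIZE 1024    /* rows of perceptron weights (illustrative)         */
#define THETA      44      /* training threshold, ~1.93*HIST_LEN+14 per the
                              original perceptron-predictor paper               */

static int8_t   weights[TABLE_SIZE][HIST_LEN + 1];  /* [0] is the bias weight  */
static uint32_t ghist;                               /* global history register */

/* Saturating add keeps weights inside the signed 8-bit range. */
static int8_t sat_add(int8_t w, int d)
{
    int v = w + d;
    return (int8_t)(v > 127 ? 127 : v < -127 ? -127 : v);
}

/* Hash PC with history so aliasing branches land in different rows --
   the "hashed" part; AMD's actual hash is unknown. */
static uint32_t row_of(uint32_t pc)
{
    return (pc ^ (pc >> 10) ^ ghist) % TABLE_SIZE;
}

/* Dot product of stored weights with recent outcomes (+1 taken, -1 not). */
static int output(uint32_t row)
{
    int y = weights[row][0];
    for (int i = 0; i < HIST_LEN; i++)
        y += ((ghist >> i) & 1) ? weights[row][i + 1] : -weights[row][i + 1];
    return y;
}

bool predict(uint32_t pc)
{
    return output(row_of(pc)) >= 0;   /* sign of the sum is the prediction */
}

/* Train on the resolved outcome; called before the history shifts. */
void train(uint32_t pc, bool taken)
{
    uint32_t row = row_of(pc);
    int y = output(row);
    int t = taken ? 1 : -1;
    /* Adjust weights only on a mispredict or when confidence |y| is low. */
    if ((y >= 0) != taken || (y < 0 ? -y : y) <= THETA) {
        weights[row][0] = sat_add(weights[row][0], t);
        for (int i = 0; i < HIST_LEN; i++)
            weights[row][i + 1] =
                sat_add(weights[row][i + 1], ((ghist >> i) & 1) ? t : -t);
    }
    ghist = (ghist << 1) | (taken ? 1u : 0u);
}
[/code]

The prediction is a single weighted sum over history bits, which is exactly why the linearly inseparable patterns mentioned above defeat it.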
What "prediction for instruction routing inside the core" means is unclear, but it might have to do with predictors for routing ops to certain schedulers or timing their issue. At some level, it might help if the individual schedulers cannot readily communicate or immediately forward results between themselves: something like a predictor for which architected registers tend to hold up dependent ops could route them to one set of lanes over the other. Other cases involve possible issue-port conflicts, where instructions issued based on age might wind up contending for a specific port more than they need to, although the simple operations are well distributed and fast on the integer side pictured. Some complex ops, and perhaps inter-thread contention, might make this useful.
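To make the register-steering idea concrete, here's a toy sketch: a per-architected-register table remembering which scheduler the most recent producer went to, so dependents can follow it and avoid a cross-scheduler forwarding hop. This is entirely my invention for illustration, not anything documented about Zen.

[code]
#include <stdint.h>

#define NUM_ARCH_REGS  16
#define NUM_SCHEDULERS 2   /* pretend the int side is split into two halves */

static uint8_t last_sched[NUM_ARCH_REGS];  /* where each reg's producer went */

/* Steer an op: follow its source operand's producer if known (src_reg >= 0),
   otherwise round-robin for load balancing; record where the result lands. */
int steer(int dst_reg, int src_reg, int *rr)
{
    int s = (src_reg >= 0) ? last_sched[src_reg]
                           : (*rr = (*rr + 1) % NUM_SCHEDULERS);
    last_sched[dst_reg] = (uint8_t)s;
    return s;
}
[/code]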
It might also plug into the power management scheme, if things like the schedulers, uop cache, or issue queue can keep some limited history of what gaps might open up for a given run of operations, or how sensitive they are to clock-gating or duty cycling.
With that kind of history, the fine-grained clock management could get finer still, down to the individual-unit level.
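As a hypothetical example of the kind of history I mean, a per-unit idle-gap tracker could look something like this; the counters and thresholds are made up, and nothing here is documented.

[code]
#include <stdint.h>
#include <stdbool.h>

/* Tracks consecutive idle cycles for one unit and only recommends gating
   once idle runs have recently proven long enough to pay for a wake-up. */
typedef struct {
    uint8_t confidence;    /* 2-bit saturating counter (0..3) */
    uint8_t idle_cycles;   /* length of the current idle run  */
} gate_state;

bool should_gate(gate_state *g, bool has_work)
{
    if (has_work) {
        /* Work arrived: if the last idle run was short, gating during it
           would have cost a wake-up, so lose confidence. */
        if (g->idle_cycles < 4 && g->confidence > 0)
            g->confidence--;
        g->idle_cycles = 0;
        return false;
    }
    if (g->idle_cycles < 255)
        g->idle_cycles++;
    if (g->idle_cycles == 4 && g->confidence < 3)
        g->confidence++;   /* another usefully long gap observed */
    return g->confidence >= 2 && g->idle_cycles >= 2;
}
[/code]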
The prefetch item is curious. There was talk of a pattern-based prefetcher in Bulldozer, which has gone basically unmentioned ever since. Perhaps this is a new version.
Why that whole Load/Store and L1D block would be involved in prefetching is uncertain, particularly the queues.
One exception, possibly, is stack accesses. Perhaps some of the relative addressing there can be handled more speculatively, and the stack engine can prefetch and pre-calculate simple strides without overly polluting the queues (maybe taking some load off the AGUs?).
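Roughly the mechanism I'm picturing, as a hedged sketch: confirm a repeating stack-relative stride a couple of times, then run a prefetch a few lines ahead of the stream instead of tying up queue entries per access. The struct, thresholds, and the prefetch_line stand-in are all invented.

[code]
#include <stdint.h>
#include <stdio.h>

typedef struct {
    int64_t last_addr;      /* previous stack access address       */
    int64_t stride;         /* candidate stride between accesses   */
    int     confirmations;  /* how many times that stride repeated */
} stack_stream;

/* Stand-in for the hardware prefetch action. */
static void prefetch_line(int64_t addr)
{
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

void observe_stack_access(stack_stream *s, int64_t addr)
{
    int64_t d = addr - s->last_addr;
    if (d != 0 && d == s->stride) {
        /* Same stride again: once it's confirmed twice, fetch ahead. */
        if (++s->confirmations >= 2)
            prefetch_line(addr + 4 * s->stride);
    } else {
        s->stride = d;           /* retrain on the new candidate stride */
        s->confirmations = 0;
    }
    s->last_addr = addr;
}
[/code]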