Definitely, this also suggests some fairly significant architectural changes. Maybe they added a decode cache?
A post-decode cache would have less of an impact here than it does on a processor with a long decode pipeline like the Core series, and it has a habit of adding variability to the penalty whenever the correct target misses the cache and has to come from the L1.
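To put rough shape on that argument, here is a toy model of how a post-decode cache changes the mispredict penalty. It is only a sketch: the function name, the stage counts, and the one-cycle cache read are made-up assumptions for illustration, not measurements of any real core.

```python
def penalty_range(resolve_cycles, fetch_cycles, decode_cycles, cache_read_cycles=1):
    """Toy model of branch mispredict penalty with a post-decode cache.

    All parameters are hypothetical stage counts, not real measurements.
    Returns (best, worst): best when the correct target hits the
    post-decode cache, worst when it misses and is refetched from the
    L1 and decoded again.
    """
    best = resolve_cycles + cache_read_cycles              # decode is skipped
    worst = resolve_cycles + fetch_cycles + decode_cycles  # full front-end refill
    return best, worst


# Hypothetical long-decode design versus a hypothetical short-decode design.
long_decode = penalty_range(resolve_cycles=10, fetch_cycles=2, decode_cycles=6)
short_decode = penalty_range(resolve_cycles=10, fetch_cycles=2, decode_cycles=2)

for name, (best, worst) in (("long decode", long_decode), ("short decode", short_decode)):
    print(f"{name}: {best}-{worst} cycles, spread {worst - best}")
```

In this model the cache can only save what the decode stages would have cost, so the shorter the decode pipeline, the smaller both the benefit and the spread between the hit and miss cases.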
Oddly enough, there are two scenarios where variability has been measured.
Agner Fog's optimization testing has Haswell ranging from 15 to 20 cycles, a testament to the complexity of its front end and uop cache.
However, his tests also showed AMD's Jaguar ranging from 9 to 19 cycles, depending on what the subsequent instructions were.
Going from 14-19 cycles to a flat 9, assuming the same test code was used, raises the question of why A8 was that high and variable to begin with, and then why A9 is so consistent and relatively low.
There are definite physical changes with the introduction of FinFETs that might allow a design to rebalance the pipeline and maybe cut out stages that existed only as drive stages or to allow lower voltage scaling on a leaky planar process. Still, the degree of reduction hints at a more significant change, or that A8 was being very conservative for some reason.