If anything, you would want to profile while the player is gaming, as this would also enable you to tune for extreme in-game situations.
Modern architectures do provide more means of gathering data at runtime, although one of the extreme in-game situations introduced by game-time profiling is going to be "the profiling itself is excessively impacting performance".
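To make that concrete, here is a minimal sketch of the kind of low-overhead timing scope that in-game profiling usually starts from; the names (ZoneStats, ScopedZone) are made up for illustration and not taken from any particular engine or profiler.

```cpp
// Minimal sketch of a low-overhead in-game timing scope. The idea is to
// sample timestamps cheaply and aggregate them, so the profiling itself
// stays out of the frame budget as much as possible.
#include <atomic>
#include <chrono>
#include <cstdint>

struct ZoneStats {
    std::atomic<uint64_t> total_ns{0};
    std::atomic<uint64_t> hits{0};
};

class ScopedZone {
public:
    explicit ScopedZone(ZoneStats& stats)
        : stats_(stats), start_(std::chrono::steady_clock::now()) {}
    ~ScopedZone() {
        auto end = std::chrono::steady_clock::now();
        uint64_t ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start_).count();
        // Relaxed atomics: we only need rough aggregates, not ordering guarantees.
        stats_.total_ns.fetch_add(ns, std::memory_order_relaxed);
        stats_.hits.fetch_add(1, std::memory_order_relaxed);
    }
private:
    ZoneStats& stats_;
    std::chrono::steady_clock::time_point start_;
};

// Usage: static ZoneStats g_animStats; { ScopedZone z(g_animStats); UpdateAnimation(); }
```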
In the CPU realm, Intel and AMD have various tools.
One example is AMD's lightweight profiling instructions, which all signs point to being discarded in future cores, even though some developers seemed to like them.
But ultimately, you don't want to probe for hardware capabilities at all. You want to build an engine layout which scales well regardless of which platform it runs on, one that is flexible enough to adapt to a wide range of hardware configurations without any explicit knowledge of them.
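As a trivial sketch of that "no explicit notion of the hardware" idea, sizing a worker pool from whatever the platform reports, rather than using per-platform constants, is about as far as you can go without real probing; the helper name is hypothetical.

```cpp
// Minimal sketch: spawn as many workers as the platform claims to support,
// with no hard-coded per-platform assumptions. Callers own the threads and
// are expected to join them on shutdown.
#include <functional>
#include <thread>
#include <vector>

std::vector<std::thread> StartWorkers(const std::function<void(unsigned)>& workerMain) {
    // hardware_concurrency() may legally return 0, so fall back to a single worker.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;
    std::vector<std::thread> workers;
    workers.reserve(n);
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back(workerMain, i);
    return workers;
}
```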
This is a nice theoretical ideal, but the potential glass jaws can have implementations dropping to fallbacks, emulation, bus transfers, or (heaven forbid) IO when they don't simply crash. This means that different platforms, and even more advanced versions of the same platform, do not provide a smooth function for determining how an engine should scale. Given the complex problem space, the desire to create an engine that can optimize for any contingency runs into the desire to ship something, and it assumes that being able to handle anything doesn't end up making everything suboptimal.
The latter has been a perennial problem ever since computer scientists started working on dynamically optimizing software runtimes, which is not a new concept, and which has some imperfect exemplars in those thick black-box API drivers whose deprecation is being celebrated.
Being generally scalable in such a situation is still possible, as CPU vendors try to do when they tout super high scaling numbers in the absence of absolute measurements, since it's often much easier to get fantastic scaling figures once you set the baseline low enough. When one platform or another can fall down by an order of magnitude, a good baseline for scaling may not be the same as a good baseline for acceptability.
And still, in order to design such a system properly, you need to know the edge cases. You must know what worst case scenarios you need to avoid at all costs or at least where you need to stay flexible enough to give the driver/hardware that option.
This unfortunately runs against the desire to not probe for capabilities, since those are a major source of edge cases, and knowing what edge cases there may be in a dynamic context and with evolving platforms is a tall order.
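For reference, this is roughly what such a capability probe looks like in practice, and why each one adds another code path to validate; the function names are hypothetical, and __builtin_cpu_supports is a GCC/Clang builtin (MSVC would need __cpuid instead).

```cpp
// Minimal sketch of the kind of capability probe the text argues against
// relying on: every detected feature forks another code path, i.e. another
// edge case to keep correct and fast.
#include <cstddef>

// Hypothetical per-path implementations; declarations only, to keep the sketch short.
void ProcessParticlesAvx2(float* data, std::size_t count);
void ProcessParticlesScalar(float* data, std::size_t count);

void ProcessParticles(float* data, std::size_t count) {
#if defined(__GNUC__) || defined(__clang__)
    if (__builtin_cpu_supports("avx2")) {
        ProcessParticlesAvx2(data, count);   // wide-SIMD path
        return;
    }
#endif
    ProcessParticlesScalar(data, count);     // portable fallback path
}
```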
Prefetch is actually a good example. Explicit, manual prefetch is only going to work well for a specific architecture. But restructuring your code to hide memory latencies, by reducing the number of instructions your memory accesses depend on and by performing memory accesses early and in batches, is going to benefit almost every platform, regardless of out-of-order execution, speculative execution and whatever else.
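A minimal sketch of that restructuring, assuming a hypothetical particle array accessed through an index list: the batched variant resolves a block of addresses before touching the data, so the memory system gets a window of independent accesses to overlap, with no architecture-specific prefetch hints anywhere.

```cpp
#include <cstddef>
#include <vector>

struct Particle { float x, y, z, energy; };

// Straightforward version: each iteration computes an index and immediately
// dereferences it, interleaving address math with the dependent data load.
float SumEnergyDependent(const std::vector<Particle>& particles,
                         const std::vector<std::size_t>& indices) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < indices.size(); ++i)
        sum += particles[indices[i]].energy;
    return sum;
}

// Batched version: compute a block of addresses first, then do the data
// loads, so the loads are issued early and close together.
float SumEnergyBatched(const std::vector<Particle>& particles,
                       const std::vector<std::size_t>& indices) {
    constexpr std::size_t kBatch = 8;
    float sum = 0.0f;
    std::size_t i = 0;
    for (; i + kBatch <= indices.size(); i += kBatch) {
        const Particle* batch[kBatch];
        for (std::size_t j = 0; j < kBatch; ++j)   // address math first
            batch[j] = &particles[indices[i + j]];
        for (std::size_t j = 0; j < kBatch; ++j)   // then the actual loads
            sum += batch[j]->energy;
    }
    for (; i < indices.size(); ++i)                // remaining tail elements
        sum += particles[indices[i]].energy;
    return sum;
}
```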
Aggressively optimizing memory accesses statically has a cost, particularly once you exhaust the number of safe accesses that can be determined at coding or compile time.
That's not to say that such optimizations are not great things to do, just that in many realms designers have moved past the point where the comparatively low-hanging fruit has been plucked, although I may be speaking out of turn, since there seem to be a lot of things GPU and game development treat as new that are not.
On top of that, the more heroic the reorganization, the larger the compiler's optimization space and the human programmer's headspace need to be, oftentimes for an optimization that might not break even with its own overhead.
It would be nice if there were a sufficiently light and expressive way to inform the code, on a dynamic basis, of what performance state the system is in, so that the software could readily mold itself to fit. This is a very non-trivial problem, unfortunately, and progress at the hardware and platform level is uneven.
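A crude sketch of what that could look like at the software level alone, using frame-time feedback in place of any real hardware signal; the class name, thresholds and frame budget are invented for illustration.

```cpp
// Minimal sketch of frame-time-driven workload scaling: the engine asks for a
// scale factor each frame and applies it to optional work (particle counts,
// LOD bias, update rates, ...). No hardware performance state is consulted.
#include <chrono>

class WorkloadGovernor {
public:
    // Returns a scale factor in [0.25, 1.0] based on how the last frame went.
    float Update(std::chrono::nanoseconds frameTime) {
        const double budgetMs = 16.6;          // assumed 60 fps target
        double ms = frameTime.count() / 1e6;
        // Simple smoothed feedback: back off quickly when over budget,
        // creep back up slowly when comfortably under it.
        if (ms > budgetMs)            scale_ -= 0.05f;
        else if (ms < budgetMs * 0.8) scale_ += 0.01f;
        if (scale_ < 0.25f) scale_ = 0.25f;
        if (scale_ > 1.0f)  scale_ = 1.0f;
        return scale_;
    }
private:
    float scale_ = 1.0f;
};
```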
You don't want to optimize for it in the sense of enforcing it. But you don't want to prevent these "optimizations" from occurring naturally either. You want to know how flexible you need to remain to cover all the extremes found in current and upcoming hardware generations.
This is traveling into the territory of asking the developer and the software to know the unknowable. How is a developer to know what decisions might break an optimization that does not yet exist?