The single most common reason to use asm/intrinsics is vector instructions. You just can't live without them, and no compiler can automatically generate good vectorized code (especially for in-order CPUs that have lots of stall cases).
Other common uses for asm/intrinsics are cache hints (preloading data), cache control (clearing/flushing pages, non-cache-polluting stores, etc.) and complex CPU instructions (bitscan, fsel, etc.). We do not write raw asm/intrinsics directly in our code. We have macros for these, so that we can define a different set of intrinsics to emit for each platform.
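A minimal sketch of what such a macro layer could look like. The macro names (PLAT_PREFETCH, PLAT_BITSCAN_FWD, PLAT_FSEL) and the helper functions are made up for illustration and are not our actual macros. The MSVC branch assumes an x86/x64 target, and the fsel macro here is only a plain C++ fallback that mirrors the usual "a >= 0 ? b : c" semantics; a PowerPC build would map it to the real instruction instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(_MSC_VER)
  // Assumption: MSVC targeting x86/x64; other MSVC targets need their own branch.
  #include <intrin.h>
  #define PLAT_PREFETCH(ptr) \
      _mm_prefetch(reinterpret_cast<const char*>(ptr), _MM_HINT_T0)
  inline std::uint32_t plat_bitscan_forward(std::uint32_t mask) {
      assert(mask != 0);  // result is undefined for a zero mask
      unsigned long index = 0;
      _BitScanForward(&index, mask);
      return static_cast<std::uint32_t>(index);
  }
#elif defined(__GNUC__) || defined(__clang__)
  #define PLAT_PREFETCH(ptr) __builtin_prefetch(ptr)
  inline std::uint32_t plat_bitscan_forward(std::uint32_t mask) {
      assert(mask != 0);  // __builtin_ctz(0) is undefined
      return static_cast<std::uint32_t>(__builtin_ctz(mask));
  }
#else
  // Unknown compiler: prefetch becomes a no-op, bitscan falls back to a loop.
  #define PLAT_PREFETCH(ptr) ((void)(ptr))
  inline std::uint32_t plat_bitscan_forward(std::uint32_t mask) {
      assert(mask != 0);
      std::uint32_t index = 0;
      while ((mask & 1u) == 0u) { mask >>= 1; ++index; }
      return index;
  }
#endif

// Index of the lowest set bit.
#define PLAT_BITSCAN_FWD(mask) plat_bitscan_forward(mask)

// Branch-free style select: yields b when a >= 0, otherwise c.
// Plain C++ fallback; a platform header would replace this with the intrinsic.
#define PLAT_FSEL(a, b, c) (((a) >= 0.0f) ? (b) : (c))

// Game code only ever sees the PLAT_* names, never the raw intrinsics.
inline float sum_with_prefetch(const float* data, std::size_t count) {
    constexpr std::size_t kAhead = 16;  // assumed prefetch distance, tune per platform
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        if (i + kAhead < count) {
            PLAT_PREFETCH(data + i + kAhead);  // cache hint only, never changes results
        }
        sum += data[i];
    }
    return sum;
}
```

The point of the indirection is that porting to a new platform means touching one header, not every call site.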
Sometimes you have to use assembler to order your instructions exactly when you hit a pipeline stall. For example, when you move data from float pipeline registers to vector pipeline registers, the data has to go through the L1 cache, and that takes time. You want to do all your float work first, put some other instructions in between (to "waste" some cycles), and only then load the data into vector registers. The compiler, however, is happy to "optimize" and reorder your code so that the float and vector work gets "nicely" interleaved. With assembler you have direct control over the order of execution. On PC OOO (out-of-order) architectures this problem is much smaller, since the CPU can automatically move instructions to fill stall cycles.
You do not do this kind of optimization right away when you write the code. Console profilers have good tools for detecting pipeline stalls. Sometimes restructuring the C++ code is enough, but sometimes getting your hands dirty with assembler is the only choice.
The other common case where ordering matters is writing to memory that is not cached (write-combined pages or streaming non-polluting vector writes, for example when you are writing to your dynamic vertex buffers). You need to write whole cache lines (and with certain alignment requirements), or you get severe stalls. Sometimes C++ just doesn't want to cooperate.
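To make the "write whole cache lines" point concrete, here is a rough sketch using x86 SSE streaming stores. The function name stream_copy_lines, the 64-byte line size and the alignment rules are assumptions for illustration; console hardware has its own line sizes and instructions. On targets without SSE2 the sketch falls back to memcpy so that it still compiles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
  #include <emmintrin.h>
  #define HAS_SSE_STREAMING 1
#else
  #define HAS_SSE_STREAMING 0
#endif

constexpr std::size_t kCacheLineBytes = 64;  // assumption: 64-byte cache lines
constexpr std::size_t kFloatsPerLine = kCacheLineBytes / sizeof(float);

// Copies lineCount full cache lines from src to dst using non-temporal stores.
// dst must be cache-line aligned and src 16-byte aligned (hypothetical helper).
inline void stream_copy_lines(float* dst, const float* src, std::size_t lineCount) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % kCacheLineBytes == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
#if HAS_SSE_STREAMING
    for (std::size_t line = 0; line < lineCount; ++line) {
        // Load the whole line first...
        const __m128 a = _mm_load_ps(src + 0);
        const __m128 b = _mm_load_ps(src + 4);
        const __m128 c = _mm_load_ps(src + 8);
        const __m128 d = _mm_load_ps(src + 12);
        // ...then issue the four stores back to back, in address order,
        // so the write-combine buffer can be flushed as one full line.
        _mm_stream_ps(dst + 0, a);
        _mm_stream_ps(dst + 4, b);
        _mm_stream_ps(dst + 8, c);
        _mm_stream_ps(dst + 12, d);
        src += kFloatsPerLine;
        dst += kFloatsPerLine;
    }
    _mm_sfence();  // make the streamed data visible before anyone consumes it
#else
    // No SSE2: plain copy, same result but without the cache-bypassing behavior.
    std::memcpy(dst, src, lineCount * kCacheLineBytes);
#endif
}
```

Even with intrinsics the compiler is still free to reschedule these loads and stores, which is exactly the situation where you end up checking the generated code and sometimes dropping down to assembler.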
Having said that, I still consider assembler to be one of the most important skills for a game programmer, because C++ programmers who don't understand processors and caches are a liability.
That's very true. We only have a handful of (senior) programmers who are comfortable with assembler/intrinsics, processor caches and various pipeline stall situations. And that's a burden, since we are the only ones who do performance profiling and optimize all our bottlenecks.