Nvidia should have done it since G80 (the interpolation is done in the SFU; the whitepaper says "Plane equation unit generates plane equation fp32 coefficients to represent all triangle attributes"). Intel has always done interpolation in the shader (since Gen4, i965, with the help of a PLN instruction), but they only switched to barycentrics with Gen6 (Sandy Bridge). DX9 hardware (including last-gen consoles) already supported centroid interpolation. You could actually become interpolation (fixed-function hardware) bound if you had too many interpolants (or used VPOS).
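To make the plane-equation approach concrete, here is a minimal sketch (with made-up vertex data) of what triangle setup computes: per-attribute coefficients (a, b, c) such that attr(x, y) = a*x + b*y + c matches the attribute at each vertex, after which interpolation is one multiply-add per pixel.

```python
# Sketch of fixed-function plane-equation interpolation. Triangle setup
# solves for (a, b, c) so that attr(x, y) = a*x + b*y + c holds at all
# three vertices. Vertex data below is made up for illustration.

def plane_coefficients(v0, v1, v2):
    """Each v is (x, y, attr). Solve the 3x3 system via Cramer's rule."""
    (x0, y0, a0), (x1, y1, a1), (x2, y2, a2) = v0, v1, v2
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    a = ((a1 - a0) * (y2 - y0) - (a2 - a0) * (y1 - y0)) / det
    b = ((a2 - a0) * (x1 - x0) - (a1 - a0) * (x2 - x0)) / det
    c = a0 - a * x0 - b * y0
    return a, b, c

# Evaluating the plane at a pixel is then a MAD-style step per attribute.
a, b, c = plane_coefficients((0, 0, 1.0), (4, 0, 0.0), (0, 4, 0.0))
attr = a * 2 + b * 1 + c  # attribute at pixel (2, 1)
```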
If all modern PC/mobile GPUs do the interpolation in the pixel shader using barycentrics (like GCN does), it would mean that a cross-vendor SV_Barycentric pixel shader input semantic would be possible. This would be awesome, as it would allow analytical AA among other goodies on PC (assuming DX12 and/or Vulkan expose it).
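For reference, the GCN-style shader-side scheme amounts to the rasterizer handing the pixel shader per-pixel barycentrics (i, j), with the shader doing two multiply-adds per attribute (v_interp_p1/v_interp_p2 on GCN). A toy sketch, with made-up values and perspective correction omitted, including the kind of edge-distance trick that explicit barycentrics enable for analytical AA:

```python
# Shader-side barycentric interpolation as on GCN: given per-pixel
# barycentrics (i, j), each attribute is two MADs:
#   attr = p0 + i * (p1 - p0) + j * (p2 - p0)
# Values are made up; perspective correction is omitted for brevity.

def interp(i, j, p0, p1, p2):
    return p0 + i * (p1 - p0) + j * (p2 - p0)

# With the barycentrics available as ordinary shader inputs, a pixel
# shader can e.g. measure how close it is to a triangle edge, which is
# the basis of analytical AA tricks:
def edge_distance_factor(i, j):
    k = 1.0 - i - j          # third barycentric coordinate
    return min(i, j, k)      # 0.0 on an edge, maximal at the centroid

color = interp(0.25, 0.25, 1.0, 0.0, 0.0)  # -> 0.5
```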
Mobile could be different, though. From the freedreno sources, it seems Adreno (since the 300 series) also does some kind of barycentric interpolation, but from a quick glance the barycentric coordinates may not be available as ordinary registers (the driver issues OPC_BARY_F instructions). Not sure, though; it was a very quick glance, and no idea about the others...
I don't have any idea. Intel hasn't published throughput numbers for the extended math functions ever since these stopped being a true external shared unit (the dreaded MathBox), so not since Sandy Bridge. The last numbers are thus for Ironlake: the docs say a throughput of 3 rounds per element for quotient and 4 for remainder. Note this is for a _scalar_ element, so dead slow (and at least i965 only had one MathBox for the whole GPU, IIRC a very frequent bottleneck). For comparison, sqrt was also quoted at 3 rounds per element, with things like sin/cos and pow much slower still. (The docs actually say one round has a 22-clock-cycle latency; I don't know if I should believe what I think this really means, as it would be really awful...) In any case it's probably best not to extrapolate anything from these values to more modern chips; it wouldn't be surprising imho if some operations now have more or less throughput relative to others.

Not surprising that Intel is leading the pack. Do you know how fast their integer divider is (2 bits per cycle = 16 cycles total?)? That would be 3x+ faster than emulation. Still, I wouldn't use integer divides unless there's a very good reason.
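To illustrate where the "2 bits per cycle = 16 cycles" guess comes from: a radix-4 divider retires 2 quotient bits per step, so a 32-bit unsigned divide takes 16 steps. A toy software model of that idea (purely illustrative; this is not Intel's actual circuit, and real hardware picks the quotient digit from a lookup table rather than a division):

```python
# Illustrative model of a radix-4 divider: 2 quotient bits per
# iteration, hence 16 iterations for a 32-bit unsigned divide.

def div_radix4(n, d, bits=32):
    assert d != 0
    q, r, steps = 0, 0, 0
    for shift in range(bits - 2, -2, -2):    # consume 2 bits per step
        r = (r << 2) | ((n >> shift) & 0b11)  # shift in next dividend bits
        digit = r // d        # quotient digit 0..3 (hardware: table lookup)
        r -= digit * d
        q = (q << 2) | digit
        steps += 1
    return q, r, steps

q, r, steps = div_radix4(1000003, 17)  # steps == 16
```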