I doubt you'll find a paper specifically written on how to build a benchmark. You have to know what it is you want to measure, and then you have to know how to do that work. The "benchmark" is simply how much work you just did divided by how much time it took. The trick is to make sure that your workload is narrowly focused enough that it isn't constrained, limited, or affected by other factors you don't wish to measure.
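In code, that core idea is just a timing loop. Here's a minimal, API-agnostic sketch; the callable `work` is a stand-in for whatever operation you're measuring (one fill pass, one batch of draw calls, whatever), and the iteration count is yours to pick:

```cpp
#include <chrono>

// Times `iterations` calls of `work` and returns units of work per second.
// This is the whole "benchmark" formula: work done / time taken.
template <typename F>
double measure_rate(long iterations, F&& work) {
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i)
        work();
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    return iterations / seconds;
}
```

You'd call it like `measure_rate(1000, [] { /* one fill pass */ });` and then scale the result by how much work one pass represents (pixels, texels, triangles).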
So, for pixel fillrate, you would probably draw one huge polygon that covers the entire view area, and see how fast you can "fill" it multiple times. For texturing rate, you'd do the same, but now apply a texture. For vertex rate, you draw one metric shit-ton of triangles.
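To make the fillrate case concrete, here's a rough sketch, not a production benchmark; it assumes GLFW and an old-style compatibility GL context, and the window size, pass count, and legacy glBegin/glEnd path are all choices I'm making for brevity:

```cpp
#include <GLFW/glfw3.h>
#include <chrono>
#include <cstdio>

int main() {
    const int W = 1024, H = 768, PASSES = 1000;
    if (!glfwInit()) return 1;
    GLFWwindow* win = glfwCreateWindow(W, H, "fillrate", nullptr, nullptr);
    if (!win) { glfwTerminate(); return 1; }
    glfwMakeContextCurrent(win);
    glfwSwapInterval(0);  // don't let vsync cap the result

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < PASSES; ++i) {
        // One huge quad covering the entire view area: pure pixel fill.
        glBegin(GL_QUADS);
        glVertex2f(-1.f, -1.f); glVertex2f( 1.f, -1.f);
        glVertex2f( 1.f,  1.f); glVertex2f(-1.f,  1.f);
        glEnd();
    }
    glFinish();  // wait for the GPU to actually finish the work
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    std::printf("~%.1f Mpixels/s filled\n",
                (double)W * H * PASSES / sec / 1e6);
    glfwTerminate();
    return 0;
}
```

The glFinish() matters: without it you're timing how fast you can queue commands, not how fast the card fills pixels. The texturing-rate variant would bind a texture and emit texcoords on the same quad.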
Now here's the hard part that I alluded to earlier: if you use a low-res texture, say 128x128, to fill your entire screen, you'll end up scaling that texture with some sort of filter (bilinear? trilinear? anisotropic?), which, on the most modern hardware, means you're spending ALU power to do it. So part of your texturing rate may actually be affected by your ALU rate, which gives you an "impure" result.
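For context, that filter is just sampler state. In OpenGL terms, picking between bilinear, trilinear, and anisotropic looks roughly like this; a sketch assuming a context is current and a 2D texture is bound, with the anisotropy enum coming from the EXT_texture_filter_anisotropic extension:

```cpp
#include <GLFW/glfw3.h>  // pulls in the base GL headers

#ifndef GL_TEXTURE_MAX_ANISOTROPY_EXT
#define GL_TEXTURE_MAX_ANISOTROPY_EXT 0x84FE  // EXT_texture_filter_anisotropic
#endif

// Assumes a GL context is current and a 2D texture is bound.
void set_filter_mode(bool trilinear, float max_aniso) {
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER,
                    trilinear ? GL_LINEAR_MIPMAP_LINEAR  // trilinear
                              : GL_LINEAR);              // bilinear
    if (max_aniso > 1.0f)  // check the extension before relying on this
        glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAX_ANISOTROPY_EXT,
                        max_aniso);
}
```

Each step up that ladder does more filtering work per texel fetched, which is exactly why the same "texturing" test can give you different numbers depending on sampler state.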
Same thing with lots and lots of little triangles: when you start getting into triangles that are below ~20 pixels in size, you start running into other limitations in the raster ops.
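One way to watch that happen is to make triangle size a parameter of the vertex-rate test and sweep it downward; here's a sketch of just the draw loop, reusing the same assumed legacy-GL setup as the fillrate example:

```cpp
#include <GLFW/glfw3.h>

// Assumes a current GL context, as in the fillrate sketch above.
// Draws a grid of triangles whose edges are roughly `px` pixels long,
// so you can sweep `px` downward and watch triangles/sec change.
void draw_triangle_grid(int view_w, int view_h, float px) {
    float sx = 2.0f * px / view_w;  // triangle width in clip space
    float sy = 2.0f * px / view_h;  // triangle height in clip space
    glBegin(GL_TRIANGLES);
    for (float y = -1.0f; y < 1.0f; y += sy) {
        for (float x = -1.0f; x < 1.0f; x += sx) {
            glVertex2f(x,      y);
            glVertex2f(x + sx, y);
            glVertex2f(x,      y + sy);
        }
    }
    glEnd();
}
```

Time it with `px` around 100, then again around 10, and you'll typically see triangles/sec stop scaling the way the card's raw vertex rate says it should.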
Back when video cards were all fixed-function chunks glued together, you could have a very specific triangle rate, a very specific texturing rate, and a very specific pixel fill rate. But now that we're going more and more programmable, these all sort of melt into each other. The absolute peak triangle rate of a modern card really means nothing, because there is never a real-world case where you're doing absolutely nothing other than generating triangles. Those triangles will need textures, will need shaders, and will need to be rasterized. All that texturing, shading, and vertex-processing power is a shared pool of resources, so in general, doing more of one thing means doing less of another.