We captured the frames by intercepting the DirectX 9 command
stream being sent to a conventional graphics card while the game
was played at a normal speed, along with the contents of textures
and surfaces at the start of the frame. We tested them through a
functional model to ensure the algorithms were correct and that
the right images were produced. Next, we estimated the cost of
each section of code in the functional model, being aggressively
pessimistic, and built a rough profile of each frame. We wrote
assembly code for the highest-cost sections, ran it through cycle-
accurate simulators, fed the clock cycle results back into the
functional model, and re-ran the traces. This iterative cycle of
refinement was repeated until 90% of the clock cycles executed
during a frame had been run through the simulators, giving the
overall profiles a high degree of confidence. Texture unit
throughput, cache performance and memory bandwidth
limitations were all included in the various simulations.