I think it's easiest to consider an example. For the sake of simplicity, let's assume vsync is off, and the driver never buffers more render commands than necessary, i.e. no more than one frame that isn't being processed currently.
Let's say computing a single frame takes 10ms on the CPU, and rendering it takes 50ms on a single card.
Single card system:
010 - CPU finished first frame, GPU starts rendering
020 - driver has buffered one frame (#2) in the command buffer, stalls
060 - GPU finishes first frame, displays it, starts second. Driver returns from stall. CPU captures user input for #3
# we're now in normal operation #
070 - CPU finished frame #3, driver stalls again
110 - GPU finishes frame #2, displays it, starts rendering third one. Driver returns from stall. CPU captures user input for #4
120 - CPU finishes #4, driver stalls
160 - GPU finishes frame #3, displays it, starts rendering #4. CPU captures user input and starts computing #5
170 - CPU finishes #5, driver stalls
210 - GPU finishes frame #4, displays it, starts #5
etc.
Now let's see. At 110, the CPU captures all the keypresses from the preceding 50ms for frame #4. During that same interval, 60 to 110, frame #1 was the one on screen. At 210, frame #4 starts to become visible (though it takes a full refresh cycle until it is completely visible). So there is a delay of at least two frame times (100ms in this case) between keypress and visible response.
AFR system:
010 - CPU finished first frame. GPU1 starts rendering
020 - CPU finished second frame. GPU2 starts rendering
030 - CPU finished third frame. Driver stalls
060 - GPU1 finished #1, starts #3. CPU captures input, starts #4
070 - GPU2 finished #2. Knowing the rendering time of both frames, the driver (hopefully) attempts to balance frame distribution and delays displaying #2 for 15ms. CPU finished #4, driver stalls.
085 - display swaps to #2. GPU2 starts rendering #4. CPU starts #5
095 - CPU finished #5, driver stalls
110 - GPU1 finishes #3, starts #5. CPU starts #6
120 - CPU finishes #6, driver stalls
135 - GPU2 finishes #4, starts #6. CPU starts #7
145 - CPU finishes #7, driver stalls
160 - GPU1 finishes #5, starts #7. CPU starts #8
etc.
At 085, the CPU captures all the keypresses from the preceding 25ms for frame #5. During that same interval, 60 to 85, frame #1 was the one on screen. At 160, frame #5 starts to become visible (though it takes a full refresh cycle until it is completely visible). So there is a delay of at least three frame times (75ms in this case) between keypress and visible response.
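Both timelines can be reproduced with a small simulation. This is only a sketch of the simplified model used above, not of any real driver: the function name `simulate` and its parameters are made up for illustration. It assumes the CPU samples input for a frame as soon as the previous frame leaves the command buffer, that frames alternate strictly between GPUs, and that the driver paces buffer swaps at (render time / number of GPUs).

```python
def simulate(gpus, frames, cpu_ms=10, gpu_ms=50):
    """Return (frame number, input captured at, shown at) in ms per frame.

    Toy model of the timelines above; all rules here are assumptions.
    """
    pace = gpu_ms / gpus           # assumed ideal spacing between swaps
    gpu_free = [0.0] * gpus        # time at which each GPU can take a new frame
    prev_render_start = 0.0        # when the previous frame left the buffer
    prev_shown = None
    timeline = []
    for k in range(frames):
        # The CPU may start a frame once the previous one is being rendered.
        input_at = prev_render_start
        cpu_done = input_at + cpu_ms
        g = k % gpus               # alternate frame rendering
        render_start = max(cpu_done, gpu_free[g])
        render_done = render_start + gpu_ms
        # Pacing: hold the swap until one ideal interval after the last one.
        if prev_shown is None:
            shown = render_done
        else:
            shown = max(render_done, prev_shown + pace)
        gpu_free[g] = shown        # the GPU waits for its swap before moving on
        prev_render_start = render_start
        prev_shown = shown
        timeline.append((k + 1, input_at, shown))
    return timeline


for gpus in (1, 2):
    print(f"{gpus} GPU(s)")
    for frame, input_at, shown in simulate(gpus, 6):
        print(f"  #{frame}: input at {input_at:.0f}, shown at {shown:.0f}, "
              f"latency {shown - input_at:.0f} ms")
```

With the default 10ms/50ms values, frame #4 on one card goes from input at 110 to display at 210, and frame #5 under AFR goes from input at 85 to display at 160, matching the figures above.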
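Overall, the latency measured in frames increases by one (from two frames to three), but the absolute latency actually goes down (from 100ms to 75ms in this example), because each frame is on screen for only half as long.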
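The same trade-off can be written as a couple of lines of arithmetic. This is a rough sketch: the helper `steady_state_latency` is hypothetical, and the "number of GPUs plus one" rule is simply read off the one-card and two-card examples above. Treating it as valid for more than two cards is an extrapolation, and it assumes a GPU-bound game with perfectly paced frames.

```python
def steady_state_latency(gpus, gpu_ms=50):
    """Return (delay in frames, delay in ms) under the assumptions above."""
    frame_interval = gpu_ms / gpus   # time between displayed frames
    frames_of_delay = gpus + 1       # 2 for one card, 3 for two cards (AFR)
    return frames_of_delay, frames_of_delay * frame_interval


for gpus in (1, 2):
    frames, ms = steady_state_latency(gpus)
    print(f"{gpus} GPU(s): {frames} frames, {ms:.0f} ms")
```

For the 50ms render time used here, this gives 2 frames and 100ms for one card, and 3 frames and 75ms for two.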