The reason is that DDR is already designed to work at full bandwidth with a given burst length (if you ignore the overhead of loading a new row, changing pages, etc.), and GDDR3 supports only one burst mode: 4. Burst 4 means that four 32-bit elements per memory chip (16 bytes) are read or written every 2 cycles. GDDR2 supported burst modes 4 and 8, but I doubt any GPU used 8. For 64-bit independent buses (2 memory chips per bus) and burst 4, the minimum access is 32 bytes, so with burst 8 the minimum access is 64 bytes (which is what I'm currently using in the simulator, just to get even worse bandwidth usage).
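The arithmetic behind those numbers can be sketched in a few lines. This is only an illustration: the function name `min_access_bytes` is made up for this example, and it assumes the minimum access is simply one full burst across the whole width of an independent bus.

```python
def min_access_bytes(bus_width_bits: int, burst_length: int) -> int:
    """Smallest transfer on one independent bus.

    Assumption: one access is exactly one full burst, i.e. burst_length
    transfers, each as wide as the bus.
    """
    if bus_width_bits <= 0 or bus_width_bits % 8 != 0:
        raise ValueError("bus width must be a positive multiple of 8 bits")
    if burst_length <= 0:
        raise ValueError("burst length must be positive")
    return (bus_width_bits // 8) * burst_length


print(min_access_bytes(64, 4))  # 64-bit bus, burst 4 -> 32
print(min_access_bytes(64, 8))  # 64-bit bus, burst 8 -> 64
print(min_access_bytes(32, 4))  # 32-bit bus, burst 4 -> 16
```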
R520 implements 32-bit independent buses (1 memory chip per bus), so the minimum access is 16 bytes, and given the GDDR3 restriction it can't get any longer. Of course the GPU stages can request or send more than 16 bytes to the memory controllers, but those accesses are split into multiple 16-byte accesses to the memory chips.
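As a rough sketch of that splitting, the snippet below breaks a larger request into 16-byte chip accesses. It is not how R520 or any real memory controller does it: `split_request` is a hypothetical helper, and it assumes accesses are aligned to the access size, so an unaligned request touches one extra 16-byte block.

```python
def split_request(address: int, size: int, access_bytes: int = 16) -> list[int]:
    """Return the start addresses of the chip accesses covering a request.

    Assumption: every chip access is access_bytes long and aligned to
    access_bytes, so the request is rounded out to block boundaries.
    """
    if size <= 0 or access_bytes <= 0:
        raise ValueError("size and access_bytes must be positive")
    first = (address // access_bytes) * access_bytes
    last = ((address + size - 1) // access_bytes) * access_bytes
    return list(range(first, last + access_bytes, access_bytes))


print(split_request(0, 64))  # aligned 64-byte request -> [0, 16, 32, 48]
print(split_request(8, 16))  # unaligned 16-byte request -> [0, 16]
```

With 16-byte accesses, a 64-byte request from a GPU stage turns into four accesses, while an unaligned 16-byte request costs two.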