This GPU is more like a little RISC core with wide SIMD. There are 32 scalar registers and 32 vector registers; there's no huge static RF.
Thus, being stack-based, any non-trivial graphics/compute task (hello, mandelbrot) will happily start spilling registers, each spilled vector a full cache line wide, onto the tiny stack. As such, cache pressure on a 4-thread core working on 16-wide SP vectors (512 bits) is probably a key factor.
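To put rough numbers on that, here's a back-of-envelope in C. The assumptions (32-bit scalar registers, 512-bit vector registers, 64-byte cache lines) are mine, not the project's:

  /* Back-of-envelope sketch: per-thread architectural register state and how
   * many cache lines it would dirty if spilled to the stack.
   * Assumptions (mine, not the project's): 32-bit scalars, 512-bit vectors,
   * 64-byte cache lines. */
  #include <stdio.h>

  int main(void) {
      const int scalar_regs      = 32;
      const int vector_regs      = 32;
      const int scalar_bytes     = 4;        /* assumed 32-bit scalars         */
      const int vector_bytes     = 512 / 8;  /* 16-wide SP vector = 64 bytes   */
      const int cache_line_bytes = 64;

      int state_per_thread = scalar_regs * scalar_bytes
                           + vector_regs * vector_bytes;   /* 2176 bytes */

      printf("architectural state per thread: %d bytes\n", state_per_thread);
      printf("cache lines dirtied if it all spills: %d\n",
             (state_per_thread + cache_line_bytes - 1) / cache_line_bytes);
      return 0;
  }

So one thread's full register state is already ~2 kB, and every vector spill eats a whole line; multiply by 4 threads and a small L1D fills up fast.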
The paper shows 40% (or 71%, depending on how you read it) worse cycle times when going from 256 lines down to 64 lines (I'll pick the associativity that better matches my argument). Those are 64-byte lines, so we're talking 16 kB down to 4 kB caches. I assume those are the per-core L1s; there was an L2/MEMCTRL on a ring joining the CUs, err, cores, at least in previous versions of the project.
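Spelling out the sizes and the naive per-thread share (the line counts are from the paper, the even 4-way split is my simplification):

  /* Cache sizes implied by the paper's line counts, plus the naive
   * fair-share-per-thread view on a 4-thread core (even split assumed). */
  #include <stdio.h>

  int main(void) {
      const int line_bytes = 64;
      const int threads    = 4;
      const int configs[]  = { 64, 256 };    /* line counts from the paper */

      for (int i = 0; i < 2; i++) {
          int lines = configs[i];
          int bytes = lines * line_bytes;
          printf("%3d lines -> %2d kB L1D, ~%2d lines (%4d bytes) per thread\n",
                 lines, bytes / 1024, lines / threads, bytes / threads);
      }
      return 0;
  }

Sixteen lines per thread in the small configuration is a handful of spilled vectors plus whatever the actual working set needs, so the hit isn't surprising.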
IIRC texturing is quite cache-friendly, with a hello-goodbye access pattern as a triangle gets textured (even more so with bilinear, or better, filtering). I'd venture that TMUs with separate texture caches would have a marked effect on performance for graphics tasks more complicated than a Phong-shaded torus, thanks to cleaner L1Ds left free to be spilled into, and to the extra HW parallelism.
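For anyone who hasn't stared at a TMU: a minimal bilinear fetch sketch (generic, not this project's hardware; the texture layout, clamp-to-edge addressing, and non-negative u/v are assumptions), just to show why the pattern is so cache-friendly: each sample is a 2x2 texel footprint, and neighbouring fragments mostly re-hit the same texels.

  /* Generic bilinear texel fetch sketch, not this GPU's TMU.
   * Assumptions: row-major RGB float texture, clamp-to-edge, u/v >= 0. */
  typedef struct { float r, g, b; } rgb;

  static rgb texel(const rgb *tex, int w, int h, int x, int y) {
      if (x < 0) x = 0; if (x >= w) x = w - 1;   /* clamp-to-edge */
      if (y < 0) y = 0; if (y >= h) y = h - 1;
      return tex[y * w + x];
  }

  static rgb lerp(rgb a, rgb b, float t) {
      rgb o = { a.r + (b.r - a.r) * t,
                a.g + (b.g - a.g) * t,
                a.b + (b.b - a.b) * t };
      return o;
  }

  /* u,v in texel space; the four fetches are the cache-friendly part:
   * adjacent fragments of a triangle land on overlapping 2x2 footprints. */
  rgb sample_bilinear(const rgb *tex, int w, int h, float u, float v) {
      int   x0 = (int)u, y0 = (int)v;
      float fx = u - x0, fy = v - y0;
      rgb t00 = texel(tex, w, h, x0,     y0);
      rgb t10 = texel(tex, w, h, x0 + 1, y0);
      rgb t01 = texel(tex, w, h, x0,     y0 + 1);
      rgb t11 = texel(tex, w, h, x0 + 1, y0 + 1);
      return lerp(lerp(t00, t10, fx), lerp(t01, t11, fx), fy);
  }

A tiny dedicated texture cache catches almost all of that reuse, which is exactly what keeps those fetches from thrashing the L1D.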
Edit: there was another open-source HDL FPGA project that had a simple TMU (I don't remember if it did more than bilinear RGB). It was a video synthesizer for video DJs that was even sold as a product: a standard FPGA demo board in a little enclosure.