I'm not Faf, but...
Does VU1 have write access to SPRAM (presumably not)?
Not directly... but data from VIF1 can be DMA'd to SPRAM.
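To make that concrete, here's a minimal sketch of a normal-mode transfer into SPRAM through the DMAC's toSPR channel. The register addresses and the STR bit follow the commonly circulated EE DMAC map -- treat them, and the helper itself, as assumptions for illustration rather than verified values:

    /* Sketch: copy a qword-aligned buffer from main RAM into SPRAM via
     * the EE DMAC toSPR channel (channel 8), normal mode. Register
     * addresses are from commonly circulated EE docs -- assumptions. */
    #include <stdint.h>

    #define D8_CHCR (*(volatile uint32_t *)0x1000D000) /* channel control */
    #define D8_MADR (*(volatile uint32_t *)0x1000D010) /* source address  */
    #define D8_QWC  (*(volatile uint32_t *)0x1000D020) /* qword count     */
    #define D8_SADR (*(volatile uint32_t *)0x1000D080) /* SPRAM offset    */

    void dma_to_spram(const void *src, uint32_t spr_off, uint32_t qwords)
    {
        D8_MADR = (uint32_t)(uintptr_t)src; /* 16-byte-aligned source    */
        D8_SADR = spr_off;                  /* destination inside SPRAM  */
        D8_QWC  = qwords;                   /* length in 16-byte qwords  */
        D8_CHCR = 0x100;                    /* set STR: kick the burst   */
        while (D8_CHCR & 0x100)             /* STR clears on completion  */
            ;
    }

The point being that the DMAC, not VU1 itself, is what lands the data in SPRAM.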
What is the machine's physical address length (24-bit?) and memory word size (probably 64-bit, like the caches)?
Physical address length is 32 bits. Word length is also 32 bits. As for the caches, the line size is 64 bytes.
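With those figures the address split falls out directly. Here's a toy decomposition against the 64-byte line; the 64-set geometry (i.e. an 8KB two-way cache) is my assumption for illustration -- only the line size comes from above:

    /* Toy split of a 32-bit physical address for a 64-byte-line cache.
     * Only the line size is from this thread; the 64-set geometry
     * (an 8KB two-way cache) is an assumed example. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t paddr  = 0x001ABCD4;        /* arbitrary physical address */
        uint32_t offset = paddr & 63;        /* byte within the 64B line   */
        uint32_t set    = (paddr >> 6) & 63; /* assumed 64 sets            */
        uint32_t tag    = paddr >> 12;       /* remaining upper bits       */
        printf("tag=%05x set=%u offset=%u\n", tag, set, offset);
        return 0;
    }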
Having a 3.2GB/sec memory subsystem on a 2GB/sec bus seems to be a strange design decision,
Having a little extra memory bandwidth just helps ensure data can be queued up as quickly as possible when bus arbitration causes additional latency.
as all kinds of other data (e.g. IPU --> GIF, SPRAM --> VU1) also have to use that pipe, especially when taking into consideration the near absence of caches in the EE.
Near absence? There's like 7 caches on-chip (80KB). Since many of the execution resources are rather discrete, it makes sense to have cache pools distributed so each device can operate on data without fighting over the bus.
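For what it's worth, the tally that presumably gets you to 7 pools and 80KB looks like this -- the individual sizes are the commonly published EE figures, so take the breakdown as reconstruction rather than something stated here:

    /* Sanity check on the "7 caches on-chip (80KB)" claim. Pool sizes
     * are the commonly published EE figures; the sum is the point. */
    #include <stdio.h>

    int main(void)
    {
        struct { const char *pool; int kb; } ee[] = {
            { "CPU I-cache",   16 },
            { "CPU D-cache",    8 },
            { "SPRAM",         16 },
            { "VU0 micro mem",  4 },
            { "VU0 data mem",   4 },
            { "VU1 micro mem", 16 },
            { "VU1 data mem",  16 },
        };
        int i, total = 0;
        for (i = 0; i < 7; i++)
            total += ee[i].kb;
        printf("%d pools, %dKB\n", 7, total); /* prints: 7 pools, 80KB */
        return 0;
    }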
So what is the actual latency?
Max row access is 45ns, a transfer of 2 bytes per beat (1.25ns) or a qword (16 bytes) every 10ns, and it takes 4 bus cycles (8 CPU) to fill a cacheline (4 qwords)...
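Plugging those numbers in, a cold cache-line fill pencils out to roughly 85ns in the fully serialized worst case. The serialization is a pessimistic assumption on my part -- row activates and in-flight bursts can overlap -- and the per-channel rate below doubles to the quoted 3.2GB/sec across the two RDRAM channels:

    /* Back-of-envelope on the quoted RDRAM numbers: 45ns worst-case
     * row access, one qword (16 bytes) per 10ns, 4 qwords per 64-byte
     * line. Fully serialized worst case assumed. */
    #include <stdio.h>

    int main(void)
    {
        double row_ns   = 45.0;  /* max row access time     */
        double qword_ns = 10.0;  /* 16 bytes every 10ns     */
        int    qwords   = 4;     /* 64-byte line = 4 qwords */

        double fill_ns   = row_ns + qwords * qword_ns;  /* ~85ns        */
        double burst_gbs = 16.0 / qword_ns;             /* GB/s/channel */

        printf("cold line fill ~%.0fns, burst %.1fGB/s per channel\n",
               fill_ns, burst_gbs);
        return 0;
    }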
Interesting. Though dual-processor anything changes the latency.
Not necessarily, if you're talking latency to the CPU. When processing data on several processors with a high locality of reference, good cache snooping can kill a lot of latency. Of course, on point-to-point topologies it doesn't really work.
It also most likely has optimizations similar to what a GeForce video card or any other non-expandable memory bus would have -- like the PS2: since it is non-expandable, I bet you can run the timings much tighter than on a real PC.
It's a wee bit more complex than that. It doesn't have much to do with expandability (well, yeah, in the case of Rambus it does, but not so much with DDR RAM).
The thing is, there is no north-bridge controller: the chip doesn't have to reach off-die to fetch memory addresses.
Also, controller optimizations that work for a GeForce don't necessarily translate well to a CPU. It works for a GeForce because that's the only device the controller has to serve, so you can tweak it for wide alignments for blitting large tracts of predicated data around. The XGPU, however, has to provide support for not only the GPU portion of itself but also the MCPX and the CPU, each with its own demands. This can make itself visible with regard to address granularity, for one (think cache pollution).
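To put a number on the granularity point: when a controller tuned for wide blits makes every CPU touch cost a full line, sparse access patterns get ugly fast. The 64-byte line matches the figure earlier in the thread; the touch size and access count are illustrative assumptions:

    /* Illustration of granularity/cache pollution: touching 4 bytes
     * per 64-byte line still drags the whole line over the bus.
     * Sizes are illustrative, not measured. */
    #include <stdio.h>

    int main(void)
    {
        int  line    = 64;       /* line / minimum burst, bytes  */
        int  touched = 4;        /* bytes the CPU actually wants */
        long lines   = 1L << 20; /* one million strided touches  */

        long moved  = lines * line;    /* bytes crossing the bus  */
        long useful = lines * touched; /* bytes actually consumed */

        printf("moved %ldMB for %ldMB useful (%.1f%% efficient)\n",
               moved >> 20, useful >> 20, 100.0 * useful / moved);
        return 0;
    }

Every one of those mostly-wasted fills is also a line evicted from somewhere else, which is the cache-pollution half of the problem.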