Here's a dumb question: what would be the advantages of a 64-bit wide GDDR5?
I suppose that, mathematically, it would let a single GDDR5 device transfer the same amount of data over a bus that is physically half as fast per pin.
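Just to sanity-check that arithmetic (the per-pin rates below are round made-up numbers, not real GDDR5 speed bins): peak bandwidth is just bus width times per-pin data rate, so doubling the width while halving the rate is a wash.

```python
# Toy numbers only: peak bandwidth = bus width x per-pin data rate.
# Doubling width while halving the per-pin rate gives the same result.

def peak_bandwidth_gbytes_per_s(bus_width_bits, data_rate_gbps_per_pin):
    return bus_width_bits * data_rate_gbps_per_pin / 8  # GB/s

print(peak_bandwidth_gbytes_per_s(32, 8))  # 32-bit device at 8 Gb/s/pin -> 32 GB/s
print(peak_bandwidth_gbytes_per_s(64, 4))  # 64-bit device at 4 Gb/s/pin -> 32 GB/s
```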
However, the challenges that scale with wire count, like routing, signal integrity, and timing skew across more wires, could be part of why that path wasn't chosen.
Keeping the link speed the same in order to get a bandwidth gain would run into those same scaling challenges, without the extra timing margin that a slower bus would provide.
Some other implementation-related issues are cases where it's nice to have separate channels and separate devices, with twice the aggregate power and current budget.
Bank activation restrictions were part of why HBM2 included a pseudo-channel mode, and tFAW is a wall-clock penalty for a whole channel imposed by physical limits of the device. Having two devices, each with its own capacity to absorb activation cost, would be preferable to having one device that stalls. The same could go for refreshes: a capacity-equivalent system could have one device spending command bandwidth on refresh for twice as long, versus two devices that can refresh independently. Commands can be queued and delayed up to a point, but a single device's queuing capacity is finite and might not match the combined capacity of two separate devices.
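To put a rough number on the tFAW point, here's a toy sketch (the tFAW value, window, and four-activates-per-window limit are illustrative, not taken from any GDDR5 datasheet): with the same total bus width, two independent 32-bit channels get two separate activation windows, so they can absorb roughly twice the activate rate before stalling.

```python
# Illustrative sketch (not datasheet values): tFAW caps ACT commands per
# channel, so two independent channels can absorb roughly twice the
# activation rate of one channel with the same total width.

def max_activates(num_channels, window_ns, tfaw_ns, acts_per_tfaw=4):
    """Rough sustained ceiling on ACT commands issuable in window_ns.

    Each channel may issue at most acts_per_tfaw activates per rolling
    tfaw_ns window, and channels are independent of each other.
    """
    per_channel = acts_per_tfaw * (window_ns / tfaw_ns)
    return num_channels * per_channel

WINDOW = 1000   # ns, arbitrary observation window
TFAW = 30       # ns, hypothetical four-activate window

one_wide = max_activates(1, WINDOW, TFAW)    # one 64-bit device/channel
two_narrow = max_activates(2, WINDOW, TFAW)  # two 32-bit devices/channels

print(f"one channel : {one_wide:.0f} activates per {WINDOW} ns")
print(f"two channels: {two_narrow:.0f} activates per {WINDOW} ns")
```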
There are various factors that could change the math, so a lot depends on the hypothetical implementations of the 32-bit and 64-bit devices, and I'm not sure how big an obstacle any of these would be overall.
Global power consumption might be better with fewer devices, but an individual package has a comparatively low ceiling on the power that can be delivered and dissipated in one spot. The various issues above and their mitigation measures could push a single wide device into a hard limit on power delivery, or into needing a heatsink, where two devices would not.
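And a toy example of the power-density point, with completely made-up wattages: even if one wide device saves a bit of total power through shared overhead, nearly all of that power now has to be delivered to and dissipated from a single package.

```python
# Made-up wattages, purely illustrative: a single 64-bit device might save
# some total power versus two 32-bit devices, but the per-package load
# nearly doubles, which is what runs into delivery/cooling limits.

p_32bit_device = 2.0                        # W per 32-bit device, hypothetical
p_64bit_device = 2 * p_32bit_device * 0.9   # assume 10% savings from shared overhead

print(f"two 32-bit devices: {2 * p_32bit_device:.1f} W total, {p_32bit_device:.1f} W per package")
print(f"one 64-bit device : {p_64bit_device:.1f} W total, {p_64bit_device:.1f} W per package")
```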