Been thinking about it. A large number of stream processors hides latency through the sheer volume of things in flight. The fewer stream processors you have, the less you can have in flight, the less latency you can hide, and so the stronger the desire for lower latencies becomes.
Not exactly. They mostly hide latency by operating on large amounts of patterned/sequential data, so it's only the kick-off of a major operation that's "slow".
Let's consider a (very) simplistic hypothetical situation. Suppose we have a chip with a very high-bandwidth memory that correctly pre-fetches everything for our task, but has an initial latency of 100 cycles. Suppose it takes a single processor on the chip 10 cycles to execute our program over one unit of data.
Suppose we have two parts: one with a single built-in processor, and one with ten built-in processors.
Now, suppose we're going to run the program over 100 units of data.
For the smaller part, we have 100 cycles of initial latency plus 10 cycles for each of the 100 units of data, for a total of 100 + (10*100) = 1100 cycles. For the larger part, we have the same 100 cycles of initial latency, but execution takes only 1/10th as long since we have ten processors, so the total is 100 + (10*100)/10 = 200 cycles. In this scenario, a 10x larger part is only 5.5x faster; the initial latency constitutes a larger fraction of the execution time, so utilization is actually lower than for the small part. So, for tasks of the same size, wider parts can actually be more susceptible to latency issues.
Now, suppose we make the large part execute the program over 1000 units of data. Then we have 100 cycles of initial latency plus (10*1000)/10 cycles of execution time, for a total of 1100 cycles. This is the same as the time it takes the 1/10th-size part to execute over the 1/10th-size data set. We see that if we expand the task proportionally with the width of the part, utilization stays the same.
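To make the arithmetic easy to poke at, here's a throwaway Python sketch of the same toy model (the total_cycles/utilization helpers and all the numbers are just this hypothetical, not anything measured on real hardware):

```python
# Toy model: one fixed memory latency up front, then the work divides
# evenly across the processors. Numbers are the made-up ones above.
def total_cycles(units, processors, cycles_per_unit=10, initial_latency=100):
    return initial_latency + (units * cycles_per_unit) // processors

def utilization(total, initial_latency=100):
    # Fraction of the run spent executing rather than waiting on memory.
    return (total - initial_latency) / total

small_100  = total_cycles(100, processors=1)    # 100 + 1000 = 1100 cycles
large_100  = total_cycles(100, processors=10)   # 100 + 100  = 200 cycles
large_1000 = total_cycles(1000, processors=10)  # 100 + 1000 = 1100 cycles

print(small_100 / large_100)    # 5.5  -- the 10x wider part is only 5.5x faster
print(utilization(small_100))   # ~0.91
print(utilization(large_100))   # 0.50 -- the wide part eats the latency worse
print(utilization(large_1000))  # ~0.91 -- scale the task with the width and it recovers
```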
I think you might have been imagining that wider parts operate on bigger data sets, so the initial latency represents a smaller fraction of the overall memory bus usage time. The problem is that this thinking is in terms of cycles rather than units of data, and wider parts usually have wider memory buses to keep their processors just as well fed as those in smaller parts. So even though more data is being transferred for a larger task on a larger part, the fraction of memory bus time taken up by the initial latency can be the same, since the transfer itself runs at a higher bitrate. And internally, whereas the smaller part only stalls 1 processor during that initial access, the larger part stalls 10.
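The bus-side argument can be sketched the same way; assume one fixed setup latency followed by streaming at the full bus width (again, purely illustrative numbers and a hypothetical latency_fraction helper):

```python
# Fraction of total bus time eaten by the initial latency, assuming the
# transfer itself streams at the full bus width.
def latency_fraction(data_units, units_per_cycle, initial_latency=100):
    transfer_cycles = data_units / units_per_cycle
    return initial_latency / (initial_latency + transfer_cycles)

# Small part: narrow bus, small task.
print(latency_fraction(100, units_per_cycle=1))    # 0.5
# Large part: 10x the data, but also a 10x wider bus -> same fraction.
print(latency_fraction(1000, units_per_cycle=10))  # 0.5
```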
Hopefully this all makes some semblance of sense.