TL;DR: Designing GPUs is hard.
AP, clearly one of us is confused. I'd like to think that it isn't me, or at least that my previous DRAM controller designs work as intended.
AP said:
The concern I have with your post is that you're failing to take into account the pipelining of requests to a DRAM. Just because the latency of a request to the DRAM is k cycles doesn't mean you can't issue other requests to it in parallel. I think your post confuses throughput issues with latency issues in several places.
I am accounting for pipelining. The read-to-read command time is 6.4 ns for the selected DRAM. That's a fact. You will not go faster than that. That includes pipelining. The actual time it takes for the read data to return is about 20 ns for that DRAM, give or take. That's a lot longer than 6.4 ns.
The command bus looks a little empty because every read/write command takes 1 clock to issue but occupies 4 clocks of the data bus (well, 8 clocks in DDR anyway). It's also not configured the same way as mine, and will have different read-to-write and write-to-read turnaround times (the configuration I picked lets me be symmetric).
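To put numbers on that, here's a quick back-of-the-envelope check in Python, using only the figures quoted above:

```python
# Sanity check of the timing figures above (times in ns).
CLOCK_NS = 1.6              # DRAM clock period
READ_TO_READ_CLOCKS = 4     # minimum spacing between read commands
BURST_DATA_CLOCKS = 4       # data-bus clocks occupied per command

print(READ_TO_READ_CLOCKS * CLOCK_NS)   # 6.4 ns: the command-rate floor,
                                         # pipelining or not
print(BURST_DATA_CLOCKS * CLOCK_NS)      # 6.4 ns of data per command, so
                                         # back-to-back reads keep the data
                                         # bus 100% busy...
print(1 / READ_TO_READ_CLOCKS)           # ...while only 1 command clock in
                                         # 4 is used: hence the "empty" bus
```

In steady streaming, the data bus is saturated even though the command bus sits idle three clocks out of four.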
AP said:
If changing from read mode to write mode is so costly, a smart memory controller would defer the writes in a buffer (as writes aren't on the critical path) and then burst all the pending reads together. Seems like you picked an unrealistic worst-case situation.
Ah, the magical smart memory controller. More on this later.
If you delay writes, you need to store the write address/data somewhere, in a queue or in a cache for example. If you use a queue, that's dedicated hardware (= area) and it only solves a small part of the problem. Moreover, at some point you will need to write that data, which can leave pending reads waiting for a long time. If you use a cache, you end up pinning down lines in your cache for potentially too long, which results in less cache available for other tasks.
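To make the cost concrete, here is a minimal sketch of such a deferred-write queue (purely illustrative; the dram_read/dram_write stubs and the 36-entry capacity are my own assumptions):

```python
from collections import deque

DRAM = {}                                    # stand-in for the real DRAM
def dram_read(addr): return DRAM.get(addr, 0)
def dram_write(addr, data): DRAM[addr] = data

class DeferredWriteController:
    """Toy model of AP's 'smart' controller: writes are buffered so reads
    can keep streaming, then drained in one burst. Note that every queue
    entry is dedicated storage for an address + data word (= area)."""

    def __init__(self, capacity=36):         # ~one burst's worth (assumed)
        self.pending = deque()
        self.capacity = capacity

    def write(self, addr, data):
        if len(self.pending) >= self.capacity:
            self.drain()                      # queue full: reads must wait
        self.pending.append((addr, data))

    def read(self, addr):
        # Reads must now snoop the queue for not-yet-written data,
        # which means extra comparators (= more area) in real hardware.
        for a, d in reversed(self.pending):
            if a == addr:
                return d
        return dram_read(addr)

    def drain(self):
        # One bus turnaround, then all pending writes go back to back.
        # While this runs, incoming reads pile up behind it.
        while self.pending:
            dram_write(*self.pending.popleft())
```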
You're also assuming that writes don't happen all that often.
Some rhetorical questions for you: Where do the outputs of a vertex or geometry shader sit? In an on-chip queue? In a cache that can spill to memory? How much output are we talking about to sustain high utilization of the chip? Same question for the pixel data that you write out. Where does it go? How many pixels are you emitting and how much data are we talking about?
For reference, G80 (a ~1.5 year old GPU) can transform ~500 million triangles/sec and write out ~12 billion pixels/sec, simultaneously, at 500 MHz.
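Putting rough numbers on those rhetorical questions (the bytes-per-pixel and bytes-per-vertex figures below are my own illustrative assumptions, not G80 specifics):

```python
# Rough write-traffic estimate from the G80 figures above.
pixels_per_sec   = 12e9    # from the post
tris_per_sec     = 500e6   # from the post
bytes_per_pixel  = 4       # assumed: 32-bit color, no Z or blending traffic
bytes_per_vertex = 32      # assumed: position plus a few attributes

print(pixels_per_sec * bytes_per_pixel / 1e9)   # ~48 GB/s of pixel writes
print(tris_per_sec * bytes_per_vertex / 1e9)    # ~16 GB/s of vertex output,
                                                 # assuming ~1 vertex/triangle
```

Even with conservative assumptions, write traffic is far from negligible.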
AP said:
Hang on, much of this latency can be overlapped with other transfers. The "Read to Write" diagram on page 33 clearly shows only 2 "dead" clock cycles between a read and a write. This is only a couple of nanoseconds of dead time, not the 28ns you calculated in your last post.
If you look at the command bus, there are 14 dead cycles between the read and the write. That's 22.4 ns. Notice the ellipsis in the timing diagram.
The data bus is a red herring: we know it will be underutilized even when the command bus is saturated, because of the overhead of issuing commands in a non-bursty fashion.
AP said:
So, I'm confused. Are you talking about uncontended latency of 35ns? If so, if you look back in my post, I used an estimate of 50ns for uncontended latency, so my estimate seems conservative based on what you're saying. If you're talking about throughput, then you're not considering the small dead time between reads and writes (or that consecutive read or write bursts can be done at full DRAM bandwidth).
I'm not sure what you mean by “uncontended”. If you mean that only a single thread (or a very few threads) are issuing requests to the DRAM, then that's far from being the interesting case. It's like saying “my CPU runs really quickly when virtually nothing is running on it”.
The interesting case is when you try to maximize your bandwidth, so that you may maximize your shader computations in real-world scenarios.
AP said:
Or just a better memory scheduler (which is part of every high-performance DRAM memory controller built today).
Let's start by laying down the groundwork. What bandwidth do you want and/or can you afford? Let's pick some numbers: a 512-bit bus to GDDR with a 1.6 ns cycle time, same as above. That works out to a peak bandwidth of 80 GB/sec. Let's say we want to be 90% efficient, for a sustained bandwidth of 72 GB/sec. Let's also assume that no DRAM refresh needs to happen, that no external clients need to access the memory for trivialities like “displaying a picture on your monitor”, and that page switches are free.
What do we need to do to hit our sustained rate? There is an overhead of 22.4 ns for changing from reading to writing or vice versa. Bursted reads or writes take 6.4 ns per command. That means the overhead is equivalent to 3.5 bursted commands, which we need to round up to 4 since the command bus is SDR. We also have:
efficiency = burst_size / (burst_size + overhead)
Filling in numbers, that's:
90% = burst_size / (burst_size + 4)
<=> (burst_size + 4) * 0.9 = burst_size
<=> 0.1 * burst_size = 3.6
<=> burst_size = 36 commands
Thus, to hit 90% efficiency on your DRAMs, you need to issue at least 36 commands in one direction before you can switch. That means that if you are doing a read and then need to do a write, you may need to wait over 230 ns. And then you will need to occupy the DRAMs with 230 ns worth of writes before you can resume reads.
If you do anything else, like try to insert “emergency” reads or writes into the stream, you will break the burst and incur the 28.8 ns penalty, lowering your efficiency.
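For the skeptical reader, a quick script confirms the arithmetic, using only the figures above:

```python
# Check the burst-size math above (times in ns).
cmd_ns = 6.4                 # one bursted read/write command
overhead_cmds = 4            # 22.4 ns / 6.4 ns = 3.5, rounded up (SDR bus)

def efficiency(burst):
    return burst / (burst + overhead_cmds)

# Smallest burst that reaches 90% efficiency:
burst = next(b for b in range(1, 1000) if efficiency(b) >= 0.90)
print(burst)                 # 36 commands
print(burst * cmd_ns)        # 230.4 ns of same-direction traffic
```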
Ok, now to tackle the question of the magic memory controller/scheduler.
What does this scheduler do? It needs to queue up requests for (given the above) ~230 ns, sort those requests by request type, and then issue them.
Moreover, it also needs to sort them by page, across banks, etc., so that you don't incur the other penalties associated with DRAM access.
So now the question is: How long until a request is serviced? What if you get a single odd request to a lone page? How long does the controller hold on to that request before it gives up trying to find adjacent requests to keep the DRAMs busy? We're already at ~230 ns of wait time for a single page, after all.
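To see why those questions are hard, here's a toy sketch of such a scheduler loop (the age threshold is an invented policy knob, standing in for exactly the "how long do you hold a lone request" question):

```python
from collections import defaultdict

class ToyScheduler:
    """Toy batching scheduler: group requests by direction and page so
    the DRAM sees long same-direction, same-page runs."""

    MAX_AGE_NS = 230.0       # invented: roughly one burst of waiting

    def __init__(self):
        self.queues = {"read": defaultdict(list),
                       "write": defaultdict(list)}

    def enqueue(self, direction, page, addr, now_ns):
        self.queues[direction][page].append((addr, now_ns))

    def pick_batch(self, direction, now_ns):
        pages = self.queues[direction]
        if not pages:
            return []
        # Prefer the page with the most queued requests (fewest switches)...
        best = max(pages, key=lambda p: len(pages[p]))
        # ...but a lone, old request must eventually go out anyway,
        # breaking the burst and eating the turnaround penalty.
        for page, reqs in pages.items():
            if now_ns - reqs[0][1] > self.MAX_AGE_NS:
                best = page
                break
        return pages.pop(best)
```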
AP said:
Yet, what does this have to do with the difference between Larrabee and a GPU? It seems like all of the above issues are equally true for both systems. I'm really confused about what your point is with all this.
My original point was that Larrabee's 128 threads was likely enough to saturate its available DRAM bandwidth. Do you still disagree with that assertion?
My point is: You need to be careful about how you design the hardware and how you use it. You can't just stick a bunch of CPUs together and pretend like it'll be fast.
I disagree with your assertion, unless Larrabee's clock rate is less than ~400 MHz, in which case 128 threads should be plenty.
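One way to see why: Little's law. Assuming 64-byte misses and the ~230 ns of queueing delay derived above (both assumptions on my part), the number of requests that must be in flight to sustain 72 GB/sec is:

```python
# Little's law: requests in flight = bandwidth * latency / request size.
bandwidth_Bps = 72e9      # sustained target from above
latency_s     = 230e-9    # queueing delay derived above (assumed typical)
request_bytes = 64        # assumed: one cache line per miss

print(bandwidth_Bps * latency_s / request_bytes)   # ~259 requests in flight
```

At one outstanding miss per thread, 128 threads fall well short under these assumptions.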
AP said:
Furthermore, since my original comment, we've now concluded that Larrabee likely has vector scatter/gather support. If that is the case, Larrabee will have no problem generating enough outstanding misses to swamp any reasonable GDDR3/4 memory system.
Great. That makes things worse, not better.
Edit: Misc typos fixed.