Shared memory bandwidth on Fermi


codedivine
17-Jul-2010, 03:32
I am trying to calculate the theoretical peak read bandwidth of shared memory on a GTX 480. According to the CUDA programming guide, shared memory on Fermi has 32 banks and each bank can read 32 bits per 2 cycles, but I am not sure whether those are SP cycles (i.e. hot clock, 1.4 GHz on the GTX 480) or base clock cycles (0.7 GHz on the GTX 480). I am assuming that shared memory runs on the base clock and hence that the cycles referred to are base clock cycles.

My theoretical estimate is as follows: (Number of ports)*(1/Number of cycles taken to read value)*(Bytes/port)*(Number of SMs)*(Base clock) = 32*(1/2)*4*15*0.7 = 672 GB/s.

Is this correct?

Jawed
17-Jul-2010, 09:16
Cycles are warp scheduler cycles I reckon:



> A warp scheduler can issue an instruction to only half of the CUDA cores. To execute an instruction for all threads of a warp, a warp scheduler must therefore issue the instruction over:
>
> 2 clock cycles for an integer or floating-point arithmetic instruction,
> 2 clock cycles for a double-precision floating-point arithmetic instruction,
> 8 clock cycles for a single-precision floating-point transcendental instruction.
Here 2 cycles are needed to issue an entire warp, which fits a 16-wide SIMD and a 32-wide warp.

So bandwidth is 4 bytes per bank * 32 banks * 0.5 accesses per cycle * 1400MHz * 15 SMs = 1.344TB/s.

Jawed
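
The two estimates in this thread differ only in which clock the "2 cycles" refers to. A minimal sketch of the arithmetic, using only the figures quoted in the posts above (the function and variable names are made up for illustration, and which clock actually applies is exactly the open question here):

```python
def shared_mem_peak_gbs(banks, bytes_per_bank, cycles_per_access, clock_ghz, num_sms):
    """Peak shared-memory read bandwidth in GB/s (decimal units)."""
    # Bytes one SM can read per clock cycle, with every bank busy.
    bytes_per_cycle_per_sm = banks * bytes_per_bank / cycles_per_access
    return bytes_per_cycle_per_sm * clock_ghz * num_sms


# GTX 480 figures as quoted in the thread: 32 banks, 4 bytes each,
# one access per 2 cycles, 15 SMs.
base_clock_estimate = shared_mem_peak_gbs(32, 4, 2, 0.7, 15)  # cycles = base clock
hot_clock_estimate = shared_mem_peak_gbs(32, 4, 2, 1.4, 15)   # cycles = hot clock

print(f"base clock (0.7 GHz): {base_clock_estimate:.0f} GB/s")
print(f"hot clock  (1.4 GHz): {hot_clock_estimate:.0f} GB/s")
```

The hot-clock assumption simply doubles the base-clock figure, since nothing else in the product changes.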

EduardoS
17-Jul-2010, 23:54
"Only" that? 32 banks, that's all? Didn't the previous generation have more banks?

pcchen
18-Jul-2010, 03:33
> "Only" that? 32 banks, that's all? Didn't the previous generation have more banks?

No, GT200 and earlier GPUs have 16 banks.
More is not necessarily better here. GT200 and earlier GPUs have only 16 banks because that is all a MP needs to run at full rate without bank conflicts (there are 8 SPs per MP and shared memory has a 2-cycle latency). On GF100 a MP needs 32 banks to run at full rate.

Jawed
18-Jul-2010, 12:07
HD5870 is 4 bytes per bank * 32 banks * 1 access per cycle * 850MHz * 20 cores = 2.176TB/s.

Byte/FLOP is arguably more useful: GTX480 is 1 byte per FLOP while HD5870 is 0.8.
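
As a rough check of those ratios, here is a small sketch. The bandwidth terms are the ones quoted in this thread; the ALU counts (480 for GTX480, 1600 for HD5870) and the 2 FLOPs per ALU per clock (one MAD) are assumptions not stated in the posts, and the helper name is hypothetical.

```python
def peak_gflops(alus, clock_ghz, flops_per_alu_per_clock=2):
    """Peak single-precision GFLOP/s, assuming one MAD (2 FLOPs) per ALU per clock."""
    return alus * flops_per_alu_per_clock * clock_ghz


# Shared memory / LDS bandwidth in GB/s, using the products quoted above.
gtx480_bw = 4 * 32 * 0.5 * 1.4 * 15
hd5870_bw = 4 * 32 * 1.0 * 0.85 * 20

# Assumed ALU counts: 15 SMs * 32 cores, and 20 cores * 16 lanes * 5 ALUs.
gtx480_flops = peak_gflops(15 * 32, 1.4)
hd5870_flops = peak_gflops(20 * 16 * 5, 0.85)

gtx480_ratio = gtx480_bw / gtx480_flops
hd5870_ratio = hd5870_bw / hd5870_flops

print(f"GTX480: {gtx480_ratio:.2f} bytes per FLOP")
print(f"HD5870: {hd5870_ratio:.2f} bytes per FLOP")
```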

On HD5870 reads from LDS eat into available FLOPs. The way the compiler currently works, at minimum it costs 3 scalar ops to fetch 2 floats from LDS. That's a loss of 3 FLOPs per float read from LDS. Reading a single float from LDS costs two scalar ops, i.e. 4 FLOPs per float.

In GTX480 reads from shared memory also eat into FLOPs, since a read from shared memory causes one MAD SIMD to idle for 2 clocks. That's a loss of 2 FLOPs per float.

On NVidia, writes cost the same as reads. On ATI, writes cost 2 FLOPs per float unless one work item wants to write 2 floats to two consecutive banks, when it's 1 FLOP per float.
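
The per-float read costs above follow if every lost issue slot is counted as one MAD, i.e. 2 FLOPs. That accounting is an assumption made for this sketch rather than something stated in the post, and the function names are hypothetical:

```python
FLOPS_PER_OP = 2  # assumption: each lost issue slot could have held one MAD


def hd5870_read_cost(scalar_ops, floats_read):
    """FLOPs lost per float when an LDS read occupies scalar ALU slots."""
    return scalar_ops * FLOPS_PER_OP / floats_read


def gtx480_read_cost(simd_width=16, idle_clocks=2, warp_size=32):
    """FLOPs lost per float when one MAD SIMD idles while a warp reads."""
    return simd_width * idle_clocks * FLOPS_PER_OP / warp_size


print(hd5870_read_cost(scalar_ops=3, floats_read=2))  # two floats per fetch
print(hd5870_read_cost(scalar_ops=2, floats_read=1))  # single float
print(gtx480_read_cost())
```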


Obviously net byte/FLOP depends on the quantity of:

shared memory operations
barriers - all the hardware threads in a workgroup have to wait for all others.

Worst case loss for a workgroup of 1024 work items is:

GTX480 : 30 hardware threads * 32 work items * 2 FLOP * 2 SIMDs = 3840 FLOPs
HD5870 : 16 hardware threads * 64 work items * 10 FLOP * 1 SIMD = 10240 FLOPs - overhead is actually worse in theory as the latency of Control Flow instructions cannot be hidden - perhaps another 6000-12000 FLOPs. A barrier for local memory introduces a dedicated ALU clause.

Barrier count in ATI has no effect if workgroup size matches hardware thread size (i.e. 64 work items per work group on Cypress)
In theory barrier count has no effect in NVidia if workgroup size is 32, but I'm unsure.

bank collisions
stalls. ATI's LDS operations use a queue and it's possible for the queue to fill up. This appears to be separate from stalls caused by bank conflicts, i.e. bank conflicts cause stalls which can affect the queue, but they aren't the only mechanism. I think compilation is the other factor here, bit of a puzzler. Basically a kernel can submit a long list of LDS reads, say, but not pop the results off the output queue.
GTX460 has better per-SM shared memory rates due to 4-issue (3 MAD SIMDs + load/store SIMD), but register file bandwidth constraints are a new variable.

EduardoS
19-Jul-2010, 01:46
I'm a bit too tired to check the math but my point was only about nVidia chips.

AMD chips improved in the last generation but are still not good enough: there is just too little LDS bandwidth per float, at least twice as bad as nVidia, and with an additional issue: if a single thread reads four consecutive values there will be a stall. The compiler helps sometimes but not always, and this is pretty common because of the vectorization we do to improve packing.

Back to nVidia: I thought it was 64 banks. OK, 16 then. 16 banks on 8-way is still better than 32 on 16-way; the latter has a higher probability of bank conflicts on random access.

I'm not saying that, for example, AMD should have 64 banks, though maybe that is the easiest way. But in my opinion there should be a multidimensional way to access the LDS/register file: for example, a thread should be able to read either a small strip of a line (like 1 byte) or a large strip (like 64 bytes) without the performance loss or bank conflicts we see today.
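
The random-access point about bank counts can be explored with a small Monte Carlo sketch. It assumes a simplified model that is not taken from this thread: each thread in an access group reads a uniformly random 32-bit word, the bank is the word index modulo the bank count, distinct words in the same bank are serialized, and threads hitting the same word are served together. The group sizes (16 threads over 16 banks, 32 threads over 32 banks) and the 4096-word array are assumptions too.

```python
import random


def mean_conflict_degree(num_banks, num_threads, num_words=4096,
                         trials=20000, seed=0):
    """Average number of serialized passes per access under random addressing."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        words_per_bank = {}
        for _ in range(num_threads):
            word = rng.randrange(num_words)
            # Consecutive 32-bit words map to consecutive banks.
            words_per_bank.setdefault(word % num_banks, set()).add(word)
        # The busiest bank decides how many passes the access needs.
        total += max(len(words) for words in words_per_bank.values())
    return total / trials


gt200_like = mean_conflict_degree(num_banks=16, num_threads=16)
fermi_like = mean_conflict_degree(num_banks=32, num_threads=32)
print(f"16 threads over 16 banks: {gt200_like:.2f} passes on average")
print(f"32 threads over 32 banks: {fermi_like:.2f} passes on average")
```

Under this model the wider configuration needs more passes on average, because the busiest of 32 randomly loaded banks tends to be fuller than the busiest of 16. It also serves twice as many threads per access, so the sketch says nothing by itself about net throughput.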

Jawed
19-Jul-2010, 12:30
> I'm a bit too tired to check the math but my point was only about nVidia chips.
Sadly I often slip up, so I'm hoping someone will correct me if I'm wrong.

> AMD chips improved in the last generation but are still not good enough: there is just too little LDS bandwidth per float, at least twice as bad as nVidia
Overall I think it's pretty reasonable.

> and with an additional issue: if a single thread reads four consecutive values there will be a stall. The compiler helps sometimes but not always, and this is pretty common because of the vectorization we do to improve packing.
Yeah the compiler is an awful mess here. Supposedly 10.7 is a big step forward :???:

> Back to nVidia: I thought it was 64 banks. OK, 16 then. 16 banks on 8-way is still better than 32 on 16-way; the latter has a higher probability of bank conflicts on random access.
Physics.

> I'm not saying that, for example, AMD should have 64 banks, though maybe that is the easiest way. But in my opinion there should be a multidimensional way to access the LDS/register file: for example, a thread should be able to read either a small strip of a line (like 1 byte) or a large strip (like 64 bytes) without the performance loss or bank conflicts we see today.
More physics.

In the end, how is this different from respecting cache lines in CPU programming? Or respecting the sizes/associativities of a cache hierarchy?

EduardoS
20-Jul-2010, 23:32
> In the end, how is this different from respecting cache lines in CPU programming? Or respecting the sizes/associativities of a cache hierarchy?
What do you mean by "respecting"? A CPU core never accesses 16 different memory positions in a cycle...

Jawed
23-Jul-2010, 09:58
> What do you mean by "respecting"? A CPU core never accesses 16 different memory positions in a cycle...
Technically a CPU does access multiple memory locations - that's what a cache line represents: an access to any one of them is an access to all of them.

Anyway the programmer has to be careful to maximise utility of cache lines, for maximum performance.

Obviously the GPU is much more difficult, but there's no escaping the fact all these systems have coarse granularities - they're the trade-off for throughput and compute density.

3dilettante
23-Jul-2010, 14:08
With the exception of Larrabee (or some other design with a port width equal to a cache line I cannot think of), this is not the case.
It is physically impossible for a CPU with a 128-bit port to load a 64-byte cache line in a cycle.

Jawed
24-Jul-2010, 08:59
That's merely populating. I'm including lines that are already somewhere in the cache hierarchy.

3dilettante
26-Jul-2010, 13:43
For all released non-Larrabee x86, the widest data path I know of is the L1/L2 path for Intel, which is 256 bits.
Sandy Bridge will have one 256 bit port and one 128 bit, which is closer to a whole line in aggregate. Maybe it will be widened in a successor architecture.