AMD: R9xx Speculation

I'm not sure I understand you. Evergreen doesn't have a memory limitation, but a 256-bit memory bus is bandwidth limited? Can you clarify what you mean?

I'm afraid the "higher bus width is better" notion is at fault here. A GPU with a 256-bit bus would only be restricted, to put it simply, if the GDDR5 memory used today had reached a peak frequency that cannot be surpassed and no higher-clocked RAM were available. That is clearly not the case with Cypress, and besides, it's always more important how an architecture handles its raw bandwidth than how much raw bandwidth it has on paper.
 
I mean, maybe RV870's limitation isn't memory bandwidth, but you do agree with me that a 256-bit bus becomes a limit at some point, don't you? A 384-bit bus would push that limit further out. I'm asking whether AMD is ready to replace the good old 256-bit memory bus because of that; it might give RV9xx more performance, even if RV870 isn't limited by its 256-bit bus.

Do you understand what I mean? My English sucks.
 
Hypothetically, is there any reason why they wouldn't use a 320-bit bus, aside from the 1.25 GB / 2.5 GB maximum RAM capacity it implies? I'm not suggesting that they are going to want to use that, as I believe they ought to be able to get another generation out of the current RAM technology with faster GDDR5 chips.
 
We understand what you mean, but the 256-bit bus isn't bottlenecking it at the moment. What good would it do AMD if they created something the size of GF100 with a 384-bit bus and had all of the issues and delays GF100 had?

It's ultimately about how much bandwidth the processors can consume; it doesn't matter whether you feed 50 or 100 GB/s of data to a chip that can only consume 40 GB/s. Presumably it's more economical to increase memory speed than to widen the memory bus.
 
I know all that... I was talking about RV9xx, not RV870.
Since memory bandwidth requirements vary a lot from architecture to architecture, I'm asking whether RV9xx has any chance of coming with a bus wider than 256-bit.
 
I don't see a tremendous increase in off-die memory bandwidth need beyond the natural evolution.
 
Having more than a few bytes/flop would be lovely actually, but it'd need a corresponding increase in things elsewhere to make it worth it for graphics.
 
What I meant is, we'd see a much improved cache architecture on R9xx before... oh well... the die size would probably support something bigger than 256-bit anyway. Heck, some people are gunning for 512-bit on the high-end part and just under 6 TFLOPS, a.k.a. the right R600.
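
Just to put rough numbers on that 512-bit / ~6 TFLOPS scenario (the 5 Gbps GDDR5 data rate is purely my assumption):

Code:
# Hypothetical high-end part: 512-bit bus with 5 Gbps GDDR5 (assumed).
bus_bits = 512
gbps_per_pin = 5.0                                # assumed effective GDDR5 data rate
mem_bw_gbs = bus_bits / 8 * gbps_per_pin          # 320 GB/s

peak_tflops = 6.0                                 # "just under 6 TFLOPS"
bytes_per_flop = mem_bw_gbs / (peak_tflops * 1000)

print(f"{mem_bw_gbs:.0f} GB/s -> {bytes_per_flop:.3f} bytes/FLOP")   # ~0.053

That's about the same bytes/FLOP ratio Cypress already has today.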
 
To even get to 1 byte/flop with RV870, you'd need to increase the bandwidth 20x. Things only get worse in the future.
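
For anyone who wants to sanity-check that, here's the back-of-the-envelope version using the commonly quoted HD 5870 numbers (treat the exact figures as my assumptions):

Code:
# Rough check of the ~20x claim, using published HD 5870 (Cypress) figures.
ALU_COUNT = 1600          # stream processors
FLOPS_PER_CLOCK = 2       # counting a MAD as 2 FLOPs
CORE_CLOCK_GHZ = 0.85     # 850 MHz engine clock
MEM_BW_GBS = 153.6        # 256-bit bus at 4.8 Gbps GDDR5

peak_gflops = ALU_COUNT * FLOPS_PER_CLOCK * CORE_CLOCK_GHZ   # 2720 GFLOPS
bytes_per_flop = MEM_BW_GBS / peak_gflops                    # ~0.056 bytes/FLOP
scale_for_1_bpf = peak_gflops / MEM_BW_GBS                   # ~17.7x

print(f"{peak_gflops:.0f} GFLOPS, {bytes_per_flop:.3f} bytes/FLOP, "
      f"~{scale_for_1_bpf:.1f}x more bandwidth needed for 1 byte/FLOP")

So "about 20x" is the right ballpark.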
 
The L2-to-L1 cache bandwidth is already 435 GB/s on Cypress, and the aggregate L1 texture cache bandwidth is 1 TB/s (and these should be at the 850 MHz clock).

Those theoretical FLOPS are spread in parallel across the 20 SIMDs and 1600 SPs, so 1 byte/flop could be reached with just 20 (L1 caches) × 138 GB/s. :?:
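
Just to show where a per-L1 number like that comes from (assuming a 2720 GFLOPS peak for Cypress):

Code:
# Bandwidth each of the 20 L1 caches would need to supply for 1 byte/FLOP
# chip-wide, assuming a 2720 GFLOPS peak for Cypress.
peak_gflops = 1600 * 2 * 0.85        # 2720 GFLOPS
simd_count = 20

per_l1_gbs = peak_gflops / simd_count             # ~136 GB/s per L1
aggregate_tbs = per_l1_gbs * simd_count / 1000    # ~2.72 TB/s in total

print(f"~{per_l1_gbs:.0f} GB/s per L1, ~{aggregate_tbs:.2f} TB/s aggregate")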
 
There's also LDS which can provide up to 2 TB/s on Cypress.
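
For reference, the ~2 TB/s figure falls out of a simple calculation if you assume each of the 20 LDS blocks can service 32 four-byte read lanes per clock at 850 MHz (which matches the lane counts mentioned just below):

Code:
# Aggregate LDS read bandwidth on Cypress, assuming 32 x 4-byte reads
# per SIMD per clock at 850 MHz (parameters are my assumptions).
simds = 20
read_lanes_per_clock = 32
bytes_per_lane = 4
clock_ghz = 0.85

lds_read_tbs = simds * read_lanes_per_clock * bytes_per_lane * clock_ghz / 1000
print(f"aggregate LDS read bandwidth ~{lds_read_tbs:.2f} TB/s")   # ~2.18 TB/s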
 
32 lanes per clock reading from LDS.

Only 16 lanes writing though.

Jawed
 
How does one get a single core to perform 32 reads per clock if it can issue 'only' one load per instruction (and one instruction per clock... OK, four clocks, but you get the gist)?
 
Simple: one instruction can issue two loads! It sends one load to the A queue and one to the B queue.

Let's say I have 4 addresses I want to fetch from: r0.xyzw. I'll fetch the results into r1.xyzw:

Code:
10 x: LDS_READ2_RET QAB, r0.x, r0.y 
   [other stuff]
11 x: LDS_READ2_RET QAB, r0.z, r0.w 
   y: MOV R1.x, QA.pop 
   z: MOV R1.y, QB.pop 
   [other stuff]
12 x: MOV R1.z, QA.pop
   y: MOV R1.w, QB.pop 
   [other stuff]
As it happens, I'm at the mercy of the compiler, so it may or may not work out that neatly...

Of course LDS operations consume instruction slots, which lowers available FLOPs, which "softens the ALU:byte problem"...

This basically means one needs to be very careful in minimising the count of LDS writes and reads per FLOP.
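
If it helps, the enqueue/pop pattern above behaves roughly like this toy Python model of the two return queues (purely illustrative, and it ignores real issue latencies):

Code:
from collections import deque

# Toy model of the two LDS return queues (QA/QB); not real hardware timing.
lds = {addr: addr * 10 for addr in range(16)}   # pretend LDS contents
qa, qb = deque(), deque()

def lds_read2_ret(addr_a, addr_b):
    """One 'instruction' enqueues two load results, one per queue."""
    qa.append(lds[addr_a])
    qb.append(lds[addr_b])

r0 = [0, 1, 2, 3]        # the four addresses (r0.xyzw)
r1 = [None] * 4          # destination register (r1.xyzw)

lds_read2_ret(r0[0], r0[1])                  # "cycle 10"
lds_read2_ret(r0[2], r0[3])                  # "cycle 11" ...
r1[0], r1[1] = qa.popleft(), qb.popleft()    # ... pop the first pair
r1[2], r1[3] = qa.popleft(), qb.popleft()    # "cycle 12": pop the second pair

print(r1)    # [0, 10, 20, 30]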

Jawed
 
The latency is actually higher than what you are showing, but I assume you're just giving an example.
 
Actual compiled ISA:

Code:
         20  x: LDS_READ2_RET  QAB,  R1.w,  PV19.z      
         21  y: MOV         T0.y,  QB.pop      VEC_120 
             w: MOV         T0.w,  QA.pop

The earlier snippet I posted has 1 cycle latency between enqueue and pop, same as this snippet.

So I'm not sure what you're saying about latency :???:

Jawed
 
I believe it's 4 clocks of latency between issuing the read request and when the data is available. Is it possible you've issued some prior LDS_READ2s in your code?
 