AMD: R9xx Speculation

I'm not sure I understand you. Evergreen doesn't have a memory limitation, but a 256-bit memory bus is bandwidth limited? Can you clarify what you mean?

I'm afraid the "higher bus width is better" notion is at fault here. A GPU with a 256-bit bus would only be restricted, to put it simply, if the GDDR5 memory used today had reached a peak frequency that cannot be surpassed and no higher-clocked RAM were available. That is clearly not the case with Cypress, and besides, it's always more important how an architecture handles its raw bandwidth than how much raw bandwidth it has on paper.
 
I mean, maybe RV870's limitation isn't memory bandwidth, but you do agree with me that a 256-bit bus becomes a limit at some point, don't you? A 384-bit bus would push that limit further out. I'm asking whether AMD is ready to replace the good old 256-bit memory bus because of that; it might give RV9xx more performance, even if RV870 isn't limited by its 256-bit bus.

Do you understand what I mean? My English sucks.
 
Hypothetically, is there any reason why they wouldn't use a 320-bit bus, aside from the 1.25 GB / 2.5 GB maximum RAM capacity it implies? I'm not suggesting that they are going to want to use that, as I believe they ought to be able to get another generation out of the current RAM technology with faster GDDR5 chips.
 
We understand what you mean, but the 256-bit bus isn't bottlenecking it at the moment. What good would it do AMD if they created something the size of GF100 with a 384-bit bus and had all of the issues and delays GF100 had?

It's ultimately about how much bandwidth the processors can consume; it doesn't matter whether you feed 50 or 100 GB/s of data to a chip that can only consume 40 GB/s. Presumably it's more economical to increase memory speed than to widen the memory bus.
 
I know all that... I was talking about RV9xx, not RV870.
Since memory bandwidth requirements vary a lot from architecture to architecture, I'm asking whether RV9xx has any chance of coming with a bus wider than 256-bit.
 
I don't see a tremendous increase in off-die memory bandwidth need beyond the natural evolution.
 
Having more than a few bytes/flop would be lovely actually, but it'd need a corresponding increase in things elsewhere to make it worth it for graphics.
 
What I meant is, we'd see a much improved cache architecture on R9xx before... oh well... the die size would probably support something bigger than 256-bit anyway. Heck, some people are gunning for 512-bit on the high-end part and just under 6 TFLOPS, a.k.a. the right R600.
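
Just to put rough numbers on that 512-bit / ~6 TFLOPS scenario (the 5 Gbps GDDR5 data rate is purely my assumption):

Code:
# Hypothetical high-end part: 512-bit bus with 5 Gbps GDDR5 (assumed).
bus_bits = 512
gbps_per_pin = 5.0                                # assumed effective GDDR5 data rate
mem_bw_gbs = bus_bits / 8 * gbps_per_pin          # 320 GB/s

peak_tflops = 6.0                                 # "just under 6 TFLOPS"
bytes_per_flop = mem_bw_gbs / (peak_tflops * 1000)

print(f"{mem_bw_gbs:.0f} GB/s -> {bytes_per_flop:.3f} bytes/FLOP")   # ~0.053

That's about the same bytes/FLOP ratio Cypress already has today.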
 
To even get to 1 byte/flop with RV870, you'd need to increase the bandwidth 20x. Things only get worse in the future.
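
For anyone who wants to sanity-check that, here's the back-of-the-envelope version using the commonly quoted HD 5870 numbers (treat the exact figures as my assumptions):

Code:
# Rough check of the ~20x claim, using published HD 5870 (Cypress) figures.
ALU_COUNT = 1600          # stream processors
FLOPS_PER_CLOCK = 2       # counting a MAD as 2 FLOPs
CORE_CLOCK_GHZ = 0.85     # 850 MHz engine clock
MEM_BW_GBS = 153.6        # 256-bit bus at 4.8 Gbps GDDR5

peak_gflops = ALU_COUNT * FLOPS_PER_CLOCK * CORE_CLOCK_GHZ   # 2720 GFLOPS
bytes_per_flop = MEM_BW_GBS / peak_gflops                    # ~0.056 bytes/FLOP
scale_for_1_bpf = peak_gflops / MEM_BW_GBS                   # ~17.7x

print(f"{peak_gflops:.0f} GFLOPS, {bytes_per_flop:.3f} bytes/FLOP, "
      f"~{scale_for_1_bpf:.1f}x more bandwidth needed for 1 byte/FLOP")

So "about 20x" is the right ballpark.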
 
The L2-to-L1 cache bandwidth is already 435 GB/s on Cypress, and the aggregate L1 texture cache bandwidth is 1 TB/s (and these should be at the 850 MHz clock).

Those theoretical FLOPS are spread in parallel across the 20 SIMDs and 1600 SPs, so 1 byte/flop could be reached with just 20 (L1 caches) × 138 GB/s. :?:
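
Just to show where a per-L1 number like that comes from (assuming a 2720 GFLOPS peak for Cypress):

Code:
# Bandwidth each of the 20 L1 caches would need to supply for 1 byte/FLOP
# chip-wide, assuming a 2720 GFLOPS peak for Cypress.
peak_gflops = 1600 * 2 * 0.85        # 2720 GFLOPS
simd_count = 20

per_l1_gbs = peak_gflops / simd_count             # ~136 GB/s per L1
aggregate_tbs = per_l1_gbs * simd_count / 1000    # ~2.72 TB/s in total

print(f"~{per_l1_gbs:.0f} GB/s per L1, ~{aggregate_tbs:.2f} TB/s aggregate")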
 
There's also LDS which can provide up to 2 TB/s on Cypress.
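
For reference, the ~2 TB/s figure falls out of a simple calculation if you assume each of the 20 LDS blocks can service 32 four-byte read lanes per clock at 850 MHz (which matches the lane counts mentioned just below):

Code:
# Aggregate LDS read bandwidth on Cypress, assuming 32 x 4-byte reads
# per SIMD per clock at 850 MHz (parameters are my assumptions).
simds = 20
read_lanes_per_clock = 32
bytes_per_lane = 4
clock_ghz = 0.85

lds_read_tbs = simds * read_lanes_per_clock * bytes_per_lane * clock_ghz / 1000
print(f"aggregate LDS read bandwidth ~{lds_read_tbs:.2f} TB/s")   # ~2.18 TB/s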
 
32 lanes per clock reading from LDS.

Only 16 lanes writing though.

Jawed
 
How does one get a single core to perform 32 reads per clock if it can issue 'only' one load per instruction (and one instruction per clock... OK, four clocks, but you get the gist)?
 
Simple: one instruction can issue two loads! It sends one load to the A queue and one to the B queue.

Let's say I have 4 addresses I want to fetch from: r0.xyzw. I'll fetch the results into r1.xyzw:

Code:
10 x: LDS_READ2_RET QAB, r0.x, r0.y 
   [other stuff]
11 x: LDS_READ2_RET QAB, r0.z, r0.w 
   y: MOV R1.x, QA.pop 
   z: MOV R1.y, QB.pop 
   [other stuff]
12 x: MOV R1.z, QA.pop
   y: MOV R1.w, QB.pop 
   [other stuff]
As it happens, I'm at the mercy of the compiler, so it may or may not work out that neatly...

Of course LDS operations consume instruction slots, which lowers available FLOPs, which "softens the ALU:byte problem"...

This basically means one needs to be very careful in minimising the count of LDS writes and reads per FLOP.
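
If it helps, the enqueue/pop pattern above behaves roughly like this toy Python model of the two return queues (purely illustrative, and it ignores real issue latencies):

Code:
from collections import deque

# Toy model of the two LDS return queues (QA/QB); not real hardware timing.
lds = {addr: addr * 10 for addr in range(16)}   # pretend LDS contents
qa, qb = deque(), deque()

def lds_read2_ret(addr_a, addr_b):
    """One 'instruction' enqueues two load results, one per queue."""
    qa.append(lds[addr_a])
    qb.append(lds[addr_b])

r0 = [0, 1, 2, 3]        # the four addresses (r0.xyzw)
r1 = [None] * 4          # destination register (r1.xyzw)

lds_read2_ret(r0[0], r0[1])                  # "cycle 10"
lds_read2_ret(r0[2], r0[3])                  # "cycle 11" ...
r1[0], r1[1] = qa.popleft(), qb.popleft()    # ... pop the first pair
r1[2], r1[3] = qa.popleft(), qb.popleft()    # "cycle 12": pop the second pair

print(r1)    # [0, 10, 20, 30]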

Jawed
 
The latency is actually higher than what you are showing, but I assume you're just giving an example.
 
Actual compiled ISA:

Code:
         20  x: LDS_READ2_RET  QAB,  R1.w,  PV19.z      
         21  y: MOV         T0.y,  QB.pop      VEC_120 
             w: MOV         T0.w,  QA.pop

The earlier snippet I posted has 1 cycle latency between enqueue and pop, same as this snippet.

So I'm not sure what you're saying about latency :???:

Jawed
 
I believe it's 4 clocks of latency between issuing the read request and when the data is available. Is it possible you've issued some prior LDS_READ2s in your code?
 