AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to upcoming ATI-RV870 lineup GPUs

  • Within 1 or 2 weeks

    Votes: 1 0.6%
  • Within a month

    Votes: 5 3.2%
  • Within couple months

    Votes: 28 18.1%
  • Very late this year

    Votes: 52 33.5%
  • Not until next year

    Votes: 69 44.5%

  • Total voters
    155
  • Poll closed .
Yes, this is like R600->RV670 - the ALU/TU/RBE counts were unchanged and clocks got bumped by 4%, the bus got chopped in half and the GDDR3 clock was raised by 36%.

Except if the bus got chopped in half and memory clock was raised, Evergreen would have ~45mm² to fill with stuff. That's a hell of a lot of stuff, since I estimate that RV740's clusters are around 52mm².

Or, that's 45mm² of D3D11-specific additions :oops:

Or, that's 45mm² of D3D11 stuff + architectural re-jigging.

It's conceivable that the architecture needs a shake-up to handle the memory-intensive nature of D3D11.

Jawed, as usual that was quite a post! Just don't go on vacation any time soon, because things wouldn't be as good! So I'll bite, since it doesn't look as if others want to get that detailed...

It seems to me that D3D11 is making buffers (resources) of indeterminate count and size a more finely-grained component of rendering. Previously rendering consisted of consuming some fixed-size buffers whilst writing to other fixed-size buffers.

Was going through Bill Bilodeau's presentation again and realized I had missed some important info from slide 20: the "3 times faster and 1/100th the size" claim compares ATI's DX9 (actually DX10, with vertex texture fetch) tessellation, which needs at least one pass to export per-edge tessellation factors and a second pass to tessellate, against running the high-poly mesh through the standard DX9 pipeline. The slide also notes that "DX11 Tessellator algorithms can usually be done in one pass".

Not sure what that "usually" is referring to, other than that with DX11 they perhaps really do sometimes spill to main memory if the data flow between pipeline stages can't be kept on-chip.

Guess the other point here is that with DX10 tessellation, the spill to main memory was only the edge factors; all the amplified (i.e. expensive) stuff was always on-chip (GS excluded). Point being that it doesn't look too hard to still keep the majority of the data on-chip.

Now D3D11 gives the developer access to their own arbitrarily sized buffers to be used pretty much whenever they feel like it (PS or CS seem the most likely places, and arguably CS is a distinct rendering pipeline all of its own) - though it seems there is still a hard limit on the number of these buffers bound to the rendering pipeline at any one time.

It seems very enlightening to realize that the DX11 PS stage has a combined RTV+UAV limit of 8, and CS also has a UAV limit of 8. And that even though a UAV (unordered access view) can access a texture, the access is still "unordered". Seems rather natural to assume that, on at least one DX11 arch, render targets and UAVs are the same interface!?
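For what it's worth, the API makes that sharing explicit: PS render targets and UAVs are bound through a single call and have to fit in the same 8 output slots. A minimal sketch (the particular counts and slot numbers are just an example I made up):

    #include <d3d11.h>

    // PS output slots 0..7 are shared between RTVs and UAVs: NumRTVs + NumUAVs <= 8.
    void BindPixelShaderOutputs(ID3D11DeviceContext* ctx,
                                ID3D11RenderTargetView* rtv,      // goes in slot 0
                                ID3D11DepthStencilView* dsv,
                                ID3D11UnorderedAccessView* uav)   // goes in slot 1
    {
        UINT keepCounter = (UINT)-1;  // leave any append/consume counter untouched
        ctx->OMSetRenderTargetsAndUnorderedAccessViews(
            1, &rtv, dsv,     // one RTV
            1, 1, &uav,       // one UAV, starting at the slot right after the RTVs
            &keepCounter);
    }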

BTW, SRV (shader resource views) are read only.

As for Append/Consume there are 2 options, Raw or Structured, and I believe now that buffers are mapped as either Append or Consume but NOT both in one shader! So wouldn't be surprised at all if Append/Consume is direct to global memory!!!! Also I haven't seen limits posted yet, but I'd bet that there is a strict limit as to how many of these one gets in a shader.
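For illustration, creating one of those Structured append/consume buffers looks roughly like this at the API level (a sketch; the sizes and names are mine). Note that the same view serves both cases: whether a shader appends to it or consumes from it is decided by how the shader declares the buffer, which fits the "one or the other per shader" behaviour.

    #include <d3d11.h>

    // Structured buffer of 'count' 16-byte records, usable as an append/consume UAV.
    HRESULT CreateAppendConsumeBuffer(ID3D11Device* dev, UINT count,
                                      ID3D11Buffer** buf,
                                      ID3D11UnorderedAccessView** uav)
    {
        D3D11_BUFFER_DESC bd = {};
        bd.ByteWidth           = count * 16;
        bd.Usage               = D3D11_USAGE_DEFAULT;
        bd.BindFlags           = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
        bd.MiscFlags           = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
        bd.StructureByteStride = 16;
        HRESULT hr = dev->CreateBuffer(&bd, nullptr, buf);
        if (FAILED(hr)) return hr;

        D3D11_UNORDERED_ACCESS_VIEW_DESC ud = {};
        ud.Format              = DXGI_FORMAT_UNKNOWN;          // required for structured buffers
        ud.ViewDimension       = D3D11_UAV_DIMENSION_BUFFER;
        ud.Buffer.FirstElement = 0;
        ud.Buffer.NumElements  = count;
        ud.Buffer.Flags        = D3D11_BUFFER_UAV_FLAG_APPEND; // the hidden counter lives here
        return dev->CreateUnorderedAccessView(*buf, &ud, uav);
    }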

The DX11 hardware transition is looking much more mild to me now.
 
The HD4770 has no trouble and there will be much faster GDDR5 (or 6) available by the time the next gen consoles hit.
HD4770 has no trouble, but is it at the performance level that Microsoft will require for their next-gen console?

ninelven said:
So what costs more, enough e-dram to do 4x AA @ 1920x1080 with HDR or a 192bit memory bus?
The question is whether a 192-bit bus would be enough...

So what does cost more? eDRAM for 1920x1080 / 4x MSAA, plus a 128-bit bus with cheap, slow, low-power memory modules, cheap routing, no Z-compression hardware and simplified ROPs...

or an (at least) 256-bit interface with more PCB routing, more expensive and power-hungry memory modules, and a bigger GPU with Z-compression hardware, a more complex memory bus and more complex ROPs...?
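Just to put a number on the eDRAM side (my own back-of-the-envelope, assuming an FP16 colour target plus 32-bit Z/stencil and no compression): 1920 x 1080 x 4 samples x (8 + 4) bytes ≈ 99.5MB, or roughly 66MB with a 32-bit colour target. Either way that's an order of magnitude more than Xenos' 10MB; with Xenos-style tiling the on-die requirement drops, but at the cost of re-submitting geometry per tile.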

There is one more advantage - you can shrink the GPU on a new manufacturing process without reducing the memory bus width. Xenos was designed for 90nm production, is now produced at 65nm, and a 45nm version is on the way. No need to change the memory bus or use faster (more expensive) memory parts to compensate.

I'm not convinced that we will see this 180mm² GPU in a future Xbox console...
 
HD4770 has no trouble, but is it at the performance level that Microsoft will require for their next-gen console?

I'd say that even LRB's extreme edition wouldn't fit that description.

The question is whether a 192-bit bus would be enough...

Assuming we're talking about a mainstream GPU, I don't see why it should be a problem if the board comes with 900-1000MHz GDDR5. The theoretical raw bandwidth range is between 86.4 and 96.0GB/s; even the lower figure could be over half the bandwidth of a hypothetical performance GPU.
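(For reference, the arithmetic: 192 bits / 8 = 24 bytes per transfer, and GDDR5 quoted at 900-1000MHz moves 3.6-4.0Gbps per pin, so 24 x 3.6 = 86.4GB/s up to 24 x 4.0 = 96.0GB/s.)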

So what does cost more? eDRAM for 1920x1080 / 4x MSAA, plus a 128-bit bus with cheap, slow, low-power memory modules, cheap routing, no Z-compression hardware and simplified ROPs...

or an (at least) 256-bit interface with more PCB routing, more expensive and power-hungry memory modules, and a bigger GPU with Z-compression hardware, a more complex memory bus and more complex ROPs...?

There is one more advantage - you can shrink the GPU on a new manufacturing process without reducing the memory bus width. Xenos was designed for 90nm production, is now produced at 65nm, and a 45nm version is on the way. No need to change the memory bus or use faster (more expensive) memory parts to compensate.

I'm not convinced that we will see this 180mm² GPU in a future Xbox console...

Desktop GPU != Console GPU. Besides that, framebuffer requirements for even a future console have risen tremendously since the advent of the XBox360, which means that for similar future designs the amount of necessary eDRAM would be higher, further increasing the transistor count of such a hypothetical module. That goes even more for embedded RAM on a desktop GPU, where you cannot restrict yourself, as with consoles, to a fixed maximum resolution.

Besides, I'm not entirely sure that you'd want a desktop GPU (even more so a DX11 one) that would need to rebuffer geometry in order to fit N AA samples into framebuffer size X.

Finally, where have any of you seen any reliable rumours yet that console manufacturer X has decided to license GPU IP from IHV Y anyway? In other words, at the moment all options seem to be open for all possible contenders in that regard, and no, there's no guarantee of anything.
 
That's what I'm talking about. The entire idea of an RV840(?) GPU as a multi-platform product (PC/Xbox) seems completely unrealistic to me.

Console manufacturers should be interested in graphics IP only for future consoles.
 
The entire idea of an RV840(?) GPU as a multi-platform product (PC/Xbox) seems completely unrealistic to me.
I don't know where you got that idea; I wasn't talking about this particular chip, rather the architecture in general (since it is quite efficient per mm^2). Obviously, it would be a custom design.

BTW, plenty of consoles have done just fine w/o edram.
 
Please, stop talking about consoles and start discussing GT300 here. The console chip discussion should go to the politics topics and RV8X0 discussion is over at GT300.
 
One thing I didn't really tackle is what the hardware is apparently doing now. e.g. the degree of caching/coalescing that occurs within the MCs.

Was going through Bill Bilodeau's presentation again and realized I had missed some important info from slide 20: the "3 times faster and 1/100th the size" claim compares ATI's DX9 (actually DX10, with vertex texture fetch) tessellation, which needs at least one pass to export per-edge tessellation factors and a second pass to tessellate, against running the high-poly mesh through the standard DX9 pipeline. The slide also notes that "DX11 Tessellator algorithms can usually be done in one pass".

Not sure what that "usually" is referring to, other than that with DX11 they perhaps really do sometimes spill to main memory if the data flow between pipeline stages can't be kept on-chip.

Guess the other point here is that with DX10 tessellation, the spill to main memory was only the edge factors; all the amplified (i.e. expensive) stuff was always on-chip (GS excluded). Point being that it doesn't look too hard to still keep the majority of the data on-chip.
This document:

http://developer.amd.com/gpu_assets/Real-Time_Tessellation_on_GPU.pdf

is considerably more detailed on the approach taken under DX9 and D3D10. It mentions how R2VB is used in DX9 to generate adaptive tessellation factors and seemingly to prepare extensive per vertex data for consumption by the evaluation shader (which is DS in D3D11 as far as I can tell). These are work-arounds for the lack of HS/DS it seems and so necessitate the multi-pass tessellation pipeline. Streamout is an alternative under D3D10.

In theory I should be able to pick my way through this document to associate the DX9/architecture-specific formats and techniques with their D3D11 counterparts, in a bid to determine dataflows.

One thing I've realised is that TS is analogous to the rasteriser, but for vertices, not pixels (fragments). It's notable that RS generates fragments and barycentrics for the interpolator fixed function stage (SPI) to then do two key things:
  1. interpolate all vertex attributes
  2. allocate the entirety of registers for an instance of a shader (i.e. a wavefront of 64 fragments), populating certain registers with the interpolated attribute data required over the shader's entire lifetime
and what's particularly interesting is that SPI is responsible for allocating registers for all shader types, not just PS.

Now TS is responsible for generating vertices and interpolating attributes (i.e. vertex coordinates and all per vertex data). This seems extremely similar to the pairing of RS+SPI in the conventional pipeline.

So now I'm wondering if, in D3D11, ATI's TS generates a stream of vertices and parameters that determine where SPI actually places those vertices and then performs all interpolations. SPI would then create the DS wavefronts and perform register allocations, filling registers with per vertex attributes. So, perhaps in the same way that RS+SPI consumes vertices and generates fragments with all their interpolated attributes, TS+SPI consumes patches and generates vertices and interpolated attributes.

I'm a bit wary about this, because the developer is meant to do their own attribute interpolation in DS, so I'm trying to understand the bounds of what interpolation is automatically performed by TS and what has to be done manually. If a vertex is moved along the normal (in DS code) that implies manual re-interpolation of a vertex that was already interpolated during TS, I think.
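To make the "manual re-interpolation" point concrete, this is roughly the per-vertex arithmetic a DS ends up doing for a plain triangle patch (a C-style sketch in my own notation, not any particular shader language):

    struct float3 { float x, y, z; };

    static float3 add(float3 a, float3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
    static float3 mul(float3 a, float s)  { return {a.x * s, a.y * s, a.z * s}; }

    // TS only hands the DS a domain location (u,v,w); blending the attributes is up to the shader.
    float3 EvaluateVertex(float3 p0, float3 p1, float3 p2,   // control-point positions
                          float3 n0, float3 n1, float3 n2,   // control-point normals
                          float u, float v, float w,         // barycentrics from the tessellator
                          float displacement)                // e.g. sampled from a height map
    {
        float3 pos = add(add(mul(p0, u), mul(p1, v)), mul(p2, w));  // "re-interpolate" position
        float3 nrm = add(add(mul(n0, u), mul(n1, v)), mul(n2, w));  // and the normal
        return add(pos, mul(nrm, displacement));                    // then push along the normal
    }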

As to multi-passing of D3D11 tessellation, I guess this relates to doing things like instancing and maybe shadow buffering (per light) or generating cube maps. Also, since TS is optional, I think animation using low-resolution "control cage" data that is the basis of all geometry effectively makes post-animation rendering merely a second or later pass, with TS turned on?

My remaining question then becomes, will ATI retain SPI? With more manual interpolation required, the fixed-function SPI looks partially redundant. I don't think TS is capable of swamping SPI, since RS currently generates 32 fragments per clock for SPI to consume, and TS is effectively setup rate limited. So performance doesn't appear to be the reason to delete SPI.

But I'm wondering if there might be other reasons that contribute to the deletion of SPI, leaving interpolation as a purely programmable operation as seen in NVidia (one that is automatically inserted by the compiler, in effect).

It seems very enlightening to realize that the DX11 PS stage has a combined RTV+UAV limit of 8, and CS also has a UAV limit of 8. And that even though a UAV (unordered access view) can access a texture, the access is still "unordered". Seems rather natural to assume that, on at least one DX11 arch, render targets and UAVs are the same interface!?
I dare say that the underlying model here is that a UAV is simply a fixed-size render target where arbitrary pixel locations can be written by any number of fragments, in any order. This is how scatter works in GPUs already, as far as I can tell.

The limit of 8 implies that the hardware that controls addressing has limited capability - after all the RBEs/ROPs are having to cooperatively work with the same set of 8 base addresses, so this implies a very simple/small/fast mechanism for generating addresses within each unit, according to the named buffer.

ATI Stream currently only allows a single base address for scatter, in effect (i.e. only one, arbitrarily sized scatter buffer can be bound at one time). EDIT: (In Brook+, not sure about CAL, now I think about it.) So if the developer wants to have multiple independent scatter buffers they have to do their own addressing math with offset base addresses.
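i.e. something along these lines (a trivial sketch, names are mine):

    // One big scatter buffer carved up by hand: the "independent buffers" are just offsets.
    struct ScatterLayout {
        unsigned base[4];   // element offset of each logical buffer within the one real buffer
    };

    // The addressing math the developer has to do themselves on the kernel side.
    unsigned ScatterAddress(const ScatterLayout& layout, unsigned logicalBuffer, unsigned index)
    {
        return layout.base[logicalBuffer] + index;   // no bounds checking, much like the hardware
    }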

BTW, SRV (shader resource views) are read only.

As for Append/Consume there are 2 options, Raw or Structured, and I believe now that buffers are mapped as either Append or Consume but NOT both in one shader! So wouldn't be surprised at all if Append/Consume is direct to global memory!!!! Also I haven't seen limits posted yet, but I'd bet that there is a strict limit as to how many of these one gets in a shader.
Yes I neatly dodged around the question of read/write to a single buffer in my earlier posting, since I didn't want to touch on the reading of a buffer "shortly" after writing it. I suspect there'll be limited use of this under D3D11 for append/consume buffers - they seem to be explicitly multi-pass centric, geared for ease of scheduling. This is the basic problem with the single extant draw state model of the current GPU pipeline as far as I can tell.

If you want to read and write concurrently then you're stuck with UAVs and CS it seems. Although now I think about it, I was under the impression that PS could read and write the render target(s). Hmm.

In my view append/consume is logically in memory. This is similar to how registers are logically in memory too.

The question then becomes when will D3D give us pre-emptive multiple concurrent logical pipelines, to support the model of arbitrary kernels producing and consuming mutual data. Which comes back to your theories about NVidia's likely introduction of persistent, task-parallel, kernels in GT300. Under this model the buffers could be meaningfully cached on-die with spill to memory as required.

My long-standing question though is relating to the quantities of data involved. e.g. D3D10 GS has a limit of 1024 scalars per input vertex of amplified data. That's there because someone decided that building big-enough on-die buffers was difficult. That limitation is clearly a nonsense in ATI's model of paired ring buffers.

The DX11 hardware transition is looking much more mild to me now.
Me too. The lack of pre-emption is, to me, a big mistake. It means everything's forced down a single logical pipeline and it means that unruly applications can kill the system with a meaty CS. The latest instance of this appears to be the people playing with ATI's PCI Express bandwidth tester, reporting that VPU recover is kicking-in or the system is rebooting - all because they're copying huge amounts of data :devilish:

The general instability of folding@home on GPUs also points to the general fragility of the single current logical pipeline right now. I can't help thinking this kind of instability is going to bite back. Not within games, but when people start running compute-applications under Windows while at the same time trying to do anything else (e.g. play a game).

Jawed
 
I'm curious that nobody commented on the MVI_2932.AVI file that was linked earlier. (or is that on the G300 thread?)

The speaker clearly states "...you need those 800 shader processors enabled by the compute shader capabilities of DX11" :oops:

Could just be talking about the SM4.1 variant of compute shaders & running the demo on RV770 I guess?

Is it the standard Froblins demo or is it supposed to be a DX11ised version?
 
I'm curious that nobody commented on the MVI_2932.AVI file that was linked earlier. (or is that on the G300 thread?)

The speaker clearly states "...you need those 800 shader processors enabled by the compute shader capabilities of DX11" :oops:

Could just be talking about the SM4.1 variant of compute shaders & running the demo on RV770 I guess?

Is it the standard Froblins demo or is it supposed to be a DX11ised version?

They are talking about SM5. The demo has been updated with some CS A.I. which wasn't available in the DX10 and DX10.1 demos.

And it's a very good possibility that RV840 is a beefed up DX11 RV790.
 
HD4770 has no trouble, but is it at the performance level that Microsoft will require for their next-gen console?

Probably not, but I was just pointing out that a 128-bit bus does not preclude the use of MSAA even in today's most advanced games (barring Crysis).

With faster memory in the future, a console GPU may still feasibly be based on a 128-bit bus without eDRAM, especially if they are not targeting bleeding-edge performance.
 
OK, this is pretty weird, the article neliz linked earlier no longer has the picture (which can still be seen in neliz's post) and the article actually says "We can't provide you a close up right now but judging from the looks of it, the card is about 8.5" long, dual slot and requires a 6-pin power." :LOL::LOL:

And it seems to have been replaced by:

http://vr-zone.com/articles/amd-dx11-rv8xx-card-exposed/7154.html?doc=7154

notice the slightly different URL?

Meanwhile, this is what Device Manager says:

http://resources.vr-zone.com//uploads/7154/System.JPG

Jawed
 
I dare say that the underlying model here is that a UAV is simply a fixed-size render target where arbitrary pixel locations can be written by any number of fragments, in any order. This is how scatter works in GPUs already, as far as I can tell.

The limit of 8 implies that the hardware that controls addressing has limited capability - after all the RBEs/ROPs are having to cooperatively work with the same set of 8 base addresses, so this implies a very simple/small/fast mechanism for generating addresses within each unit, according to the named buffer.

...

If you want to read and write concurrently then you're stuck with UAVs and CS it seems. Although now I think about it, I was under the impression that PS could read and write the render target(s). Hmm.

UAV read/write and RT read/write in PS will be interesting on DX11. Unlike CS, where blocks can be made to directly correspond to vector memory accesses, this isn't likely to happen in PS stages other than when a triangle fully covers a SIMD-sized coarse tile. And outside Larrabee, that means collecting pixel quads differently for coarse fully-filled SIMD-sized tiles and for non-fully-filled SIMD-sized tiles (a collection of various 2x2 pixel quads). Small tessellated triangles surely don't help here.

On Larrabee (and perhaps ATI and NVidia, depending on actual hardware design), there could be a bunch of cases (I'm thinking some very strange deferred shading and particle stuff) where it is actually better to roll your own "rasterizer" in a CS pass to ensure good pixel grouping for vector cacheline-aligned and coalesced accesses, and to avoid the overhead of the triangle raster for non-triangle stuff (like lights).

If PS UAV/RT read/write is going to be a fast path on ATI (and NVidia), we are bound to be looking at something quite a bit different from, say, how CUDA accesses global memory (because that isn't going to be fast for PS), and rather something like a read/write version of an ROP.
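As a trivial sketch of the grouping issue (my own example, assuming an 8x8 thread group mapped to an 8x8 screen tile): each row of the group then touches 8 consecutive texels of a row-major target, which is exactly the coalesced pattern you don't get from a pile of partially covered 2x2 quads.

    // Hypothetical CS-style addressing: group (gx,gy) owns an 8x8 screen tile.
    unsigned TexelAddress(unsigned gx, unsigned gy,
                          unsigned tx, unsigned ty,    // thread position within the 8x8 group
                          unsigned pitchInTexels)
    {
        unsigned x = gx * 8 + tx;
        unsigned y = gy * 8 + ty;
        return y * pitchInTexels + x;   // 8 threads of a row -> 8 consecutive texels
    }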

So if the developer wants to have multiple independent scatter buffers they have to do their own addressing math with offset base addresses.

Yeah, but that (i.e. doing your own addressing math) is the way it should be, because anything else isn't going to scale. A small number of UAVs isn't a limitation in my eyes.
 
OK, this is pretty weird, the article neliz linked earlier no longer has the picture (which can still be seen in neliz's post) and the article actually says "We can't provide you a close up right now but judging from the looks of it, the card is about 8.5" long, dual slot and requires a 6-pin power." :LOL::LOL:

And it seems to have been replaced by:

http://vr-zone.com/articles/amd-dx11-rv8xx-card-exposed/7154.html?doc=7154

notice the slightly different URL?

Meanwhile, this is what Device Manager says:

http://resources.vr-zone.com//uploads/7154/System.JPG

Jawed

768MB Ram then?



32 bit :(

They would have major balls and made a lot of marketing people happy if that machine ran AMD64.

Nice find, Jawed. Looks like they talked to some marketing chaps and were advised to change their story, though I think the RV840 reference was already in there.
 
UAV read/write and RT read/write in PS will be interesting on DX11. Unlike CS, where blocks can be made to directly correspond to vector memory accesses, this isn't likely to happen in PS stages other than when a triangle fully covers a SIMD-sized coarse tile. And outside Larrabee, that means collecting pixel quads differently for coarse fully-filled SIMD-sized tiles and for non-fully-filled SIMD-sized tiles (a collection of various 2x2 pixel quads). Small tessellated triangles surely don't help here.

On Larrabee (and perhaps ATI and NVidia, depending on actual hardware design), there could be a bunch of cases (I'm thinking some very strange deferred shading and particle stuff) where it is actually better to roll your own "rasterizer" in a CS pass to ensure good pixel grouping for vector cacheline-aligned and coalesced accesses, and to avoid the overhead of the triangle raster for non-triangle stuff (like lights).

If PS UAV/RT read/write is going to be a fast path on ATI (and NVidia), we are bound to be looking at something quite a bit different from, say, how CUDA accesses global memory (because that isn't going to be fast for PS), and rather something like a read/write version of an ROP.
I think only RTs can support R/W and this "works" because the fragment can only access its pixel. Don't know where to look to find this out for sure, though.

UAVs are bindable only as read or write though. Jack Hoxley's refined his tessellating terrain renderer:

http://www.gamedev.net/community/forums/mod/journal/journal.asp?jn=316777&reply_id=3459758

using a CS to analyse the height field in order to fine-tune the LOD algorithm according to the local complexity of the heights. It's nice stuff: the CS writes the UAV, then it's rebound as input for the HS.
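In API terms that round trip is simple enough (a sketch with made-up names, assuming the buffer was created with both UAV and SRV bind flags):

    #include <d3d11.h>

    // Pass 1: the CS analyses the height field and writes per-patch LOD data through a UAV.
    // Pass 2: the same resource is rebound as an SRV for the Hull Shader to read.
    void AnalyseThenTessellate(ID3D11DeviceContext* ctx,
                               ID3D11ComputeShader* analysisCS,
                               ID3D11UnorderedAccessView* lodUAV,
                               ID3D11ShaderResourceView* lodSRV,
                               UINT groupsX, UINT groupsY)
    {
        ctx->CSSetShader(analysisCS, nullptr, 0);
        ctx->CSSetUnorderedAccessViews(0, 1, &lodUAV, nullptr);
        ctx->Dispatch(groupsX, groupsY, 1);

        // The resource can't be bound for write and read at the same time, so drop the UAV first.
        ID3D11UnorderedAccessView* nullUAV = nullptr;
        ctx->CSSetUnorderedAccessViews(0, 1, &nullUAV, nullptr);

        ctx->HSSetShaderResources(0, 1, &lodSRV);
        // ...then issue the patch draw with HS/DS bound.
    }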

CS as a "rasteriser" for post-processing of a render target is pretty much the simplest use case as far as I can tell. Are you thinking of a volumetric space-filled (Z) walk through a particle system?

Any kind of irregular rasterisation, e.g. resolution matched shadow maps, should be fun.

http://graphics.cs.ucdavis.edu/~lefohn/work/dissertation/lefohnPhdDefense.pdf

Yeah, but that (ie own addressing math) is the way it should be, because anything else isn't going to scale. A small number of UAVs isn't a limitation in my eyes.
Hmm, well, I guess this is like being told to write a program with only one malloc(). Then again, multiple buffers will start to fragment memory if they have varying lifetimes. Virtualised memory (everything paged) is a partial solution, as contiguousness is no longer a prerequisite for being able to create the buffer. But I'm well out of my depth on this whole subject, gladly so.

Since buffers theoretically have independent lifetimes, one UAV might only be bound for writing for the duration of one pass, but another UAV might be bound for multiple passes (e.g. the latter is a tree you're building).

So multiple UAV support seems part and parcel of insulating the developer from the horrors of memory management on the GPU (and structured UAVs from the computational overhead of indexing anything that isn't a float/float4). Well except the application, as a whole, still needs to control mappings and lifetimes of buffers and decide whether to use buffers that reside in CPU-side memory or copy back/forth CPU<->GPU. Stuff I've never tangled with...

Jawed
 
768MB Ram then?

No, addressing doesn't work like this. There's usually a 256MB window of address space per GPU, regardless of video memory size. Other address space is allocated to other peripherals (the likes of the system clock, PCI devices, DMA stuff...); a bunch is reserved anyway by the OS, which is why 3.25GB of memory is the most common number under Windows.
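(Rough arithmetic, and the exact split is system-specific: a 32-bit client has 4096MB of address space; the 256MB GPU aperture plus the other MMIO/reserved ranges typically eat somewhere around three quarters of a GB of it, and 4096MB - ~768MB ≈ 3.25GB left visible as RAM.)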
 
No, addressing doesn't work like this. There's usually a 256MB window of address space per GPU, regardless of video memory size. Other address space is allocated to other peripherals (the likes of the system clock, PCI devices, DMA stuff...); a bunch is reserved anyway by the OS, which is why 3.25GB of memory is the most common number under Windows.

http://support.microsoft.com/default.aspx/kb/929605
It doesn't really specify it, but IMO the way it's worded suggests the video memory can take a bigger chunk than just 256MB.
 