AMD: R8xx Speculation

elsence · Oct 4, 2009

MfA said:
It's a pity setup rate wasn't doubled (I still think it would be easy given that the rasterizers deal with different tile sets)

Why would it be easy?

MfA · Oct 4, 2009

We had a sizeable discussion about it before, I'm not going to trawl this mega thread for it ... DIY

elsence · Oct 4, 2009

MfA said:
We had a sizeable discussion about it before, I'm not going to trawl this mega thread for it ... DIY

lol,

I don't want to trawl this mega thread either.

I just asked if there was a short answer.

trinibwoy · Oct 4, 2009

Jawed said:
Setup does a coarse rasterisation, identifying all the tiles that a triangle at least partially covers, then giving the rasteriser(s) a list of tiles and triangle data in order to rasterise. I suspect the rasteriser has a tile-centric view of rasterisation, not a triangle-centric view. That's because threads of 16 quads of fragments need to be despatched, and those need to be strictly tile-aligned (because the render target is tiled). Though I also expect it to handle triangles in strict order.

Is that based on some documentation you saw? Why would setup need to determine tile coverage? Shouldn't it be: vertices -> setup -> triangle -> raster -> tiles -> shaders?

One of the key questions that's still unanswered is can a thread of fragments refer to more than one triangle (e.g. 5 adjacent small triangles from a strip)?

Nvidia had a patent on something like that so people are thinking about it.

I think it might be a matter of practicality in instancing a block of hardware rather than re-jigging things for 32-rasterisation. I don't think the number 32 is problematic (since other ATI GPUs have 4-, 8- and 12-rasterisers) merely that scaling isn't free of latency/pipelining issues across the entire width of the unit.

Jawed

Oh where'd you get those tile sizes from?

MfA said:
It's a pity setup rate wasn't doubled (I still think it would be easy given that the rasterizers deal with different tile sets) but meh ... being a bitch about getting implementation details just because they aren't relevant to performance I see as counter-productive, I'd still rather hear them than not.

I want details. You want details. I don't see the difference.....

It's not like it was a paper launch where misunderstandings could fester for months ...

Heh, I'm sure you'll be able to sleep again soon once the DP throughput is confirmed.

MfA · Oct 4, 2009

trinibwoy said:
I want details. You want details. I don't see the difference.

If it's just an increase in scan conversion throughput it's stupid.

We know it's just an increase in scan conversion throughput now ... regardless of whether that's the truth, you are saying it's stupid that the double rasterizers were in the diagram, AFAICS.

trinibwoy · Oct 5, 2009

MfA said:
We know it's just an increase in scan conversion throughput now ... regardless of whether that's the truth, you are saying it's stupid that the double rasterizers were in the diagram, AFAICS.

Nope, not exactly. I'm saying that if it's the same kind of increase that we've always been getting then it's weird that they suddenly chose to market it like this, given how much people have been hyping the setup bottleneck. If (as people are speculating) the implementation is tangibly different in some meaningful way then I'm just as interested as you to know more.

swaaye · Oct 5, 2009

Re: Bandwidth

Look what 5870 does with only +33% bandwidth compared to 4870. So I'm guessing that a new card with slightly more bandwidth than 4770 can deliver some nice gains.

I'm kinda curious what's going to replace 4350, 4550 and 4650.

Davros · Oct 5, 2009

At a guess I'd say the 5350, 5550 and 5650.

MfA · Oct 5, 2009

trinibwoy said:
If (as people are speculating) the implementation is tangibly different in some meaningful way then I'm just as interested as you to know more.

It reduces fanout inside the rasterizer and allows you to get the rasterizers closer to the shader cores. It was also probably less work.

neliz · Oct 5, 2009

Davros said:
At a guess I'd say the 5350, 5550 and 5650.

no 3, but a 4!
no 6, but a 7!

Jawed · Oct 5, 2009

trinibwoy said:
Is that based on some documentation you saw? Why would setup need to determine tile coverage? Shouldn't it be: vertices -> setup -> triangle -> raster -> tiles -> shaders?

No I haven't seen any documentation on this for R800. I was thinking in terms of each rasteriser working solely on the tiles it's given - the alternative is that both rasterisers get all triangles and decide which tiles they own by first doing the coarse rasterisation themselves. Since you queried it, I now think the latter approach is more likely.
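A minimal sketch of the two options, purely illustrative and not based on any R800 documentation. It assumes 8x8-pixel tiles, a bounding-box coarse rasterisation (conservative, so it can list tiles the triangle doesn't actually touch) and a checkerboard split of tiles between two rasterisers. The names `TILE`, `covered_tiles`, `owner` and `split_by_owner` are made up for the example.

```python
from math import floor

TILE = 8  # assumed tile edge in pixels; nothing in the thread confirms a size


def covered_tiles(tri, tile=TILE):
    """Coarse rasterisation: every tile touched by the triangle's bounding box."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    x0, x1 = floor(min(xs) / tile), floor(max(xs) / tile)
    y0, y1 = floor(min(ys) / tile), floor(max(ys) / tile)
    return [(tx, ty) for ty in range(y0, y1 + 1) for tx in range(x0, x1 + 1)]


def owner(tile_xy, n_rasterisers=2):
    """Hypothetical checkerboard ownership of tiles by rasteriser index."""
    tx, ty = tile_xy
    return (tx + ty) % n_rasterisers


def split_by_owner(tri, n_rasterisers=2):
    """Per-rasteriser tile lists for one triangle.

    Whether setup builds these lists, or each rasteriser filters the full
    coarse result by owner() itself, the lists come out the same.
    """
    lists = [[] for _ in range(n_rasterisers)]
    for t in covered_tiles(tri):
        lists[owner(t, n_rasterisers)].append(t)
    return lists
```

Under these assumptions a triangle spanning a 3x3 block of tiles splits 5/4 between the two rasterisers, while a triangle that fits inside one tile lands entirely on one of them. So in this model the second rasteriser only helps a single small triangle if something else (another triangle) is available to keep it busy.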

Oh where'd you get those tile sizes from?

Those aren't tile sizes, they're rasterisation rates. The rasterisation rate equals the colour fillrate on ATI. So my X1950Pro has 12 RBEs and therefore needs a 12-rasteriser. The tiles it works with might be 8x8 pixels or larger. Thread size is 48.

The mapping of tiles and threads is something I don't understand. e.g. does this mean that tiles are 12x12 pixels on X1950Pro (144 pixels is 3x48) and restricted to such multiples? Or does this point to more flexibility with the way quads of fragments are assigned to threads?
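The arithmetic behind that question can be checked quickly. This is only a sketch: `threads_per_tile` is a made-up helper, and the tile edges tried are guesses, not known hardware values.

```python
def threads_per_tile(tile_edge, thread_size):
    """Return (whole threads, leftover pixels) for a square tile."""
    pixels = tile_edge * tile_edge
    return divmod(pixels, thread_size)


# Thread size 48 (X1950Pro, as stated above) against a few guessed tile edges.
for edge in (8, 12, 16):
    whole, leftover = threads_per_tile(edge, 48)
    print(f"{edge}x{edge} tile: {whole} thread(s), {leftover} pixel(s) left over")
```

With a thread size of 48, a 12x12 tile divides into exactly 3 threads, whereas 8x8 and 16x16 tiles both leave 16 pixels (4 quads) over. For comparison, a 64-fragment thread (16 quads) fills an 8x8 tile exactly.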

If distinct (discontiguous) quads of fragments from multiple tiles can share a thread, then it's only a small step from that capability to being able to run multiple triangles' fragments together in one thread, even if they share an edge.

If that's the case, then there would be a speed-up from two rasterisers on small triangles, contrary to my view in the last post.

But I don't know how loose or strict quad assignment to threads is.

My original understanding of ATI's architecture is that tile sizes are small and thread sizes are no bigger - and this goes right back to R300. That's in stark contrast with NVidia's approach, which on older GPUs has threads of as many as thousands of fragments (NV40) - under which regime the need to have multiple triangles' fragments share a thread is considerably more important.

A key feature of R300 onwards until R800 is that a fragment is fully described before pixel shading starts. Every vertex attribute is interpolated per pixel location, i.e. every fragment is independent of all other fragments and independent of the triangle that generated it. Though I think Primitive ID is carried forwards to be used in RBE, in order that primitives are written in their original order (as defined by setup) into the render target.

With that kind of independence, there's no reason why multiple triangles' fragments can't share a thread.

Except that Primitive ID may be something held by the scheduler (rather than per quad of fragments) - apart from anything else it's a key upon which to prioritise scheduling. But it also enables correct ordering for RBE. In which case there's no support for multiple triangles' fragments per thread.
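To put a rough number on what is at stake between those two cases, here is a toy count of threads needed with and without sharing. It assumes threads of 16 quads and ignores tile alignment entirely, so it is an upper bound on the benefit, not a model of the hardware. The function names are hypothetical.

```python
from math import ceil

QUADS_PER_THREAD = 16  # assumed: a 64-fragment thread made of 2x2 quads


def threads_without_sharing(quads_per_triangle):
    """One triangle per thread: each triangle rounds up to whole threads."""
    return sum(ceil(q / QUADS_PER_THREAD) for q in quads_per_triangle)


def threads_with_sharing(quads_per_triangle):
    """Quads from different triangles may be packed into the same thread."""
    return ceil(sum(quads_per_triangle) / QUADS_PER_THREAD)
```

For five small triangles of 2 quads each (say, from a strip), `threads_without_sharing([2] * 5)` gives 5 mostly empty threads, while `threads_with_sharing([2] * 5)` gives 1.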

The preponderance of small triangles in a tessellating pipeline might have made the designers bias towards Primitive ID-keyed quads of fragments. Alternatively they might have left the design alone, on the basis that the architecture is already happy with small triangles. Maybe the change was made in R600 when the tessellator first appeared?

R800's deletion of SPI means fragments are no longer independent of the triangle that generated them once pixel shading is under way. That theoretically means that each thread, at minimum, has triangle data associated with it (attributes + barycentrics). It may be that each quad of fragments is keyed for triangle, in order to perform interpolation, which brings us back to having multiple triangles' fragments sharing a thread.
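As a sketch of what "attributes + barycentrics" means in practice: the shader needs the triangle's per-vertex attribute values plus the fragment's barycentric weights, and interpolation is then a weighted sum. The code below is screen-space only (no perspective correction), and `barycentrics` and `interpolate` are illustrative names, not anything from ATI's ISA.

```python
def barycentrics(p, a, b, c):
    """Barycentric weights of 2D point p with respect to triangle (a, b, c)."""
    def edge(u, v, w):
        # Twice the signed area of triangle (u, v, w).
        return (v[0] - u[0]) * (w[1] - u[1]) - (v[1] - u[1]) * (w[0] - u[0])

    area = edge(a, b, c)
    return (edge(b, c, p) / area, edge(c, a, p) / area, edge(a, b, p) / area)


def interpolate(weights, vertex_values):
    """Weighted sum of one attribute's three per-vertex values."""
    return sum(w * x for w, x in zip(weights, vertex_values))


tri = ((0, 0), (4, 0), (0, 4))
w = barycentrics((1, 1), *tri)
value = interpolate(w, (0.0, 1.0, 2.0))
```

The point for this discussion is that `interpolate` needs `vertex_values` for the right triangle, so each quad (or each thread) has to know which triangle it came from.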

The problem with that is that attribute interpolation (thread-wide) has variable throughput if a variable number of triangles shares the thread (similar to waterfalling fetches when constants are dynamically indexed). That's a compromise that isn't untenable, but it makes me wonder if it's worth handling. Attribute/barycentrics fetch for interpolation may be handled by the constant hardware, in which case it's not a special case and easily handled.
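A toy cost model for that variable throughput, assuming (and this is only an assumption) that the thread has to make one attribute-fetch pass per distinct triangle it contains, the same way a waterfalled constant fetch repeats per distinct index. `interpolation_passes` is a hypothetical name.

```python
def interpolation_passes(quad_prim_ids):
    """Passes needed for one thread, given the Primitive ID keyed to each quad."""
    return len(set(quad_prim_ids))


one_triangle = [7] * 16                 # all 16 quads from the same triangle
three_triangles = [1, 1, 2, 2] + [3] * 12
```

Here `interpolation_passes(one_triangle)` is 1 and `interpolation_passes(three_triangles)` is 3, so the worst case for a 16-quad thread in this model is 16 passes.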

So, I dunno...

Jawed

fellix · Oct 5, 2009

"Radeon 100" (!) next year?

mboeller · Oct 5, 2009

fellix said:
"Radeon 100" (!) next year?

codenames could be: "COZUMEL" ; "IBIZA" and "KAUAI"

source: forum posting from Gipsel @ 3DCenter.org

He also found something about "ASIC_ALU_REORDER" directly related to "ASIC_R9XX", so he thinks that R9xx could be MIMD or an out-of-order (OOO) GPU.

neliz · Oct 5, 2009

launching six months after GF100 eh?

fellix · Oct 5, 2009

*runs for cover*

Ailuros · Oct 5, 2009

neliz said:
launching six months after GF100 eh?

I don't think so, but that's way too far into the future to debate right now.

DegustatoR · Oct 5, 2009

Ailuros said:
I don't think so, but that's way too far into the future to debate right now.

For all we know GF200 may launch 6 months after GF100 too.

rpg.314 · Oct 5, 2009

fellix said:
"Radeon 100" (!) next year?

Time for a new thread

neliz · Oct 5, 2009

Ailuros said:
I don't think so, but that's way too far into the future to debate right now.

The interwebs are crazy these days

rpg.314 · Oct 5, 2009

Have the uber analysts of b3d been able to figure out what is so "most-special-since-r600" about evergreen?
