AMD/ATI Evergreen: Architecture Discussion

Okay, let me be a bit more clear. Are the SC blocks fed tiles and edges to test against, or just raw (duplicated) triangles?
Mint, take a look at the patent diagram that Jawed singled out on the previous page.

The SCs are determining the tiles; they are both being fed triangles from the Primitive Assembler, and those triangles are duplicated from the PA. But, at any point in time, will the SCs be working on different triangles? Yes, because they are there to ensure that there is sufficient workload to be fed into both shader engines and to keep them well balanced and busy (and I've yet to see much reason to suggest that Fermi isn't operating on a similar system / structure in terms of tiling the pixel load over the 4 shader clusters).
 
The diagrams for Cypress make me wonder if there is a pathological case where a screen can be filled with tiny triangles that fall on every other screen tile. If the rasterizers serve alternating tiles, and each rasterizer feeds a separate dispatch processor that controls one bank of SIMDs, it could be possible to cut shader throughput in half.
Alternating is fine. If you have a pathological case where it is stalled spitting out triangles in a single tile, then most architectures these days will have a problem, because the ROPs/memory accesses are tiled, so they would be stalled on a single ROP output.

Of course, there is buffering and stuff all through the pipeline to minimise this.
 
Mint, take a look at the patent diagram that Jawed singled out on the previous page.

The SCs are determining the tiles; they are both being fed triangles from the Primitive Assembler, and those triangles are duplicated from the PA. But, at any point in time, will the SCs be working on different triangles? Yes, because they are there to ensure that there is sufficient workload to be fed into both shader engines and to keep them well balanced and busy
I still say it's one rasterizer that's split up in the layout. The only difference is buffering at the triangle stage instead of the tile or quad stage.

Look at it this way: if the primitive assembler were doubled in speed, could the rasterizer output two triangles per clock? No, because you have to duplicate the triangles. It needs big changes to handle two triangles per clock.

(and I've yet to see much reason to suggest that Fermi isn't operating on a similar system / structure in terms of tiling the pixel load over the 4 shader clusters).
It is true that NVidia has fixed the tiles to the ROPs since G80 (ATI started this with RV770, right?), but it is not clear whether the pixels being worked on in the shader are from the same tile, although that's probably the simplest way to maintain ordering for alpha blending.

Fermi does something very different with its four rasterizers than Cypress does with its two. It probably figures out which tile(s) a triangle belongs to in the PolyMorph engines (i.e. preliminary scan conversion) and then adds it to the appropriate GPC(s). It then needs to make sure that each GPC gets the cached data for that tri before flushing it out.

Either that, or it just scan converts each triangle four times (once in each GPC) at a rate of four per clock, hence 16 geometry units.
 
Alternating is fine. If you have a pathological case where it is stalled spitting out triangles in a single tile, then most architectures these days will have a problem, because the ROPs/memory accesses are tiled, so they would be stalled on a single ROP output.
I think he's thinking about a case that's shader limited, so halved ROP throughput is a non-issue. Imagine 20 or more vec5 instructions per pixel. If pixels from odd tiles are allowed on each 10-SIMD unit, then you could reach 16 pix/clk, despite being restricted to one ROP for output. If the tiles are assigned to only one 10-SIMD unit, then you'd only hit 8 pix/clk, which is half the single ROP's ability.

Of course, this is a pathological case, and I expect that a little buffering will keep the engines over 95% used if setup is not the limiting factor.
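
To put rough numbers on that, a back-of-envelope sketch in C; the 16-lane SIMD width and one vec5 bundle per lane per clock are my assumptions:

```c
/* Back-of-envelope for the shader-limited case above (assumptions:
 * 16-lane SIMDs, one vec5 instruction bundle per lane per clock). */
#include <stdio.h>

static double pix_per_clk(int simds, int lanes_per_simd, int vec5_per_pixel)
{
    return (double)simds * lanes_per_simd / vec5_per_pixel;
}

int main(void)
{
    /* 20 vec5 instructions per pixel */
    printf("both halves (20 SIMDs): %4.1f pix/clk\n", pix_per_clk(20, 16, 20));
    printf("one half    (10 SIMDs): %4.1f pix/clk\n", pix_per_clk(10, 16, 20));
    return 0;
}
```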
 
I still say it's one rasterizer that's split up in the layout. The only difference is buffering at the triangle stage instead of the tile or quad stage.
And Mfa's earlier reply gets repeated there. Semantics.

We've claimed that we have two raster units, because we do. The point of those raster units is to ensure that both engines are kept occupied and that we scale in performance for doubling the engine in relation to the previous gen. That means that at any point in time they will be operating on different triangles. Except....

Look at it this way: if the primitive assembler were doubled in speed, could the rasterizer output two triangles per clock? No, because you have to duplicate the triangles. It needs big changes to handle two triangles per clock.
The PA can determine some level of coverage and can decide to send only to one of the SCs, though it's programmable to just always send to both (or not). :)
 
I think he's thinking about a case that's shader limited, so halved ROP throughput is a non-issue. Imagine 20 or more vec5 instructions per pixel. If pixels from odd tiles are allowed on each 10-SIMD unit, then you could reach 16 pix/clk, despite being restricted to one ROP for output. If the tiles are assigned to only one 10-SIMD unit, then you'd only hit 8 pix/clk, which is half the single ROP's ability.
The tile granularity at the SIMD level is going to be the same size as the tile granularity per memory channel (though more tiles are allocated per SIMD), so you are talking about a single quad of ROP output in that instance.
 
Fermi does something very different with its four rasterizers than Cypress does with its two. It probably figures out which tile(s) a triangle belongs to in the PolyMorph engines (i.e. preliminary scan conversion) and then adds it to the appropriate GPC(s). It then needs to make sure that each GPC gets the cached data for that tri before flushing it out.

Either that, or it just scan converts each triangle four times (once in each GPC) at a rate of four per clock, hence 16 geometry units.
I don't know what method Cypress is using, but with 2 rasterizers (which I'm defining as the part that determines pixel coverage) you only need to determine if a primitive is small enough to fit entirely in one tile. If the primitive fits in one tile you send it to only one rasterizer. If you don't agree with my definition of a rasterizer then I agree with others that it's just semantics.
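
In code, that check is just a bounding-box comparison - a minimal sketch, assuming power-of-two screen tiles (the 16x16 tile size is made up, the real size isn't public):

```c
/* Returns 1 if the primitive's bounding box lies entirely inside one
 * screen tile (assumed 16x16 pixels here; TILE_SHIFT is an assumption). */
#define TILE_SHIFT 4

static int fits_one_tile(int min_x, int min_y, int max_x, int max_y)
{
    return (min_x >> TILE_SHIFT) == (max_x >> TILE_SHIFT) &&
           (min_y >> TILE_SHIFT) == (max_y >> TILE_SHIFT);
}
```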
 
In ATI, hierarchical rasterisation consists of a first step in the setup engine that determines screen-tile presence for each triangle, i.e. coarse rasterisation. Intra screen-tile rasterisation consists of some more hierarchical steps down to quad level.

Assuming screen-tiles are two-colour checkerboarded (each colour determining one of the two 10-way SIMD blocks), triangles that straddle screen-tiles will always be sent to both 10-way halves.

Because triangles don't always straddle multiple screen tiles, the two independent rasterisers will not always run for the same triangle concurrently. Instead there needs to be a pre-rasterisation queue attached to each of the two rasterisers, consisting of tile IDs and basic triangle data.
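
As a sketch of that coarse step, assuming two-colour checkerboarding (the tile size, the names and the bounding-box walk are all assumptions, not documented behaviour):

```c
/* Coarse rasterisation sketch: walk the screen tiles touched by a
 * triangle's bounding box and flag which of the two rasterisers must
 * receive it. Checkerboard ownership: (tx + ty) & 1. The 16x16 tile
 * size and all names are assumptions. */
#include <stdio.h>

#define TILE_SHIFT 4

static void route_triangle(int min_x, int min_y, int max_x, int max_y,
                           int *rast0, int *rast1)
{
    *rast0 = *rast1 = 0;
    for (int ty = min_y >> TILE_SHIFT; ty <= max_y >> TILE_SHIFT; ty++)
        for (int tx = min_x >> TILE_SHIFT; tx <= max_x >> TILE_SHIFT; tx++)
            *((tx + ty) & 1 ? rast1 : rast0) = 1;
}

int main(void)
{
    int r0, r1;
    route_triangle(3, 3, 10, 10, &r0, &r1);  /* fits in one tile */
    printf("small tri: rast0=%d rast1=%d\n", r0, r1);
    route_triangle(3, 3, 40, 40, &r0, &r1);  /* straddles tiles  */
    printf("large tri: rast0=%d rast1=%d\n", r0, r1);
    return 0;
}
```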

The final variable in all this is that ATI has the option to vary the size of a screen-tile, although I've never seen any evidence of this in practice. In theory this provides adaptivity to the size of triangles, e.g. if tessellation is active.

So, in ATI setup needs to run fast enough, with small-enough screen-tiles, so that both rasterisers are continually busy. Screen-tiles can't be made too small, because the texturing and RBE systems rely upon multiple quads of 2D data coherency.

No way is this semantics.

Jawed
 
What do the 20 SIMD engines (MPMD) in the RV870 architecture mean? Can 2 OpenCL kernels be executed simultaneously?
I suspect the only time two kernels will be running is when kernel A is finishing its final work groups and there are execution slots available for kernel B to start its first work groups.

It's not clear if the behaviour depends on A and B being "the same code" or if B can be different code. e.g. I don't know if code A and code B could both execute on SIMD 0. I suspect not, because of register allocation. It seems that in Compute Shader mode registers are allocated from a single pool.

Whereas in Pixel Shader mode (i.e. when VS, HS, DS, GS and PS can all be loaded into the GPU) the register file is partitioned into multiple pools to take account of the competition for registers by the different types of kernels. There is no documentation I'm aware of that describes the flexibility of register allocation.

So, overall I guess that OpenCL can only execute multiple kernels as an optimisation for latency, i.e. to minimise the time that ALUs are idle between kernel launches (as kernel A finishes and B starts). I doubt it is able to support a producer kernel and a consumer kernel that both run for most of the time as a pair.
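
For what it's worth, at the API level all you can express is two back-to-back launches and then let the driver/hardware overlap them (or not). A minimal host-side sketch in C - no error checking, and the kernels are trivial placeholders:

```c
#include <CL/cl.h>

/* Two trivial placeholder kernels, A then B, on one in-order queue.
 * Whether A's tail overlaps B's head is up to the driver/hardware. */
static const char *src =
    "__kernel void a(__global float *d) { d[get_global_id(0)] += 1.0f; }\n"
    "__kernel void b(__global float *d) { d[get_global_id(0)] *= 2.0f; }\n";

int main(void)
{
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);

    cl_kernel ka = clCreateKernel(prog, "a", NULL);
    cl_kernel kb = clCreateKernel(prog, "b", NULL);
    size_t n = 1 << 18;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                n * sizeof(float), NULL, NULL);
    clSetKernelArg(ka, 0, sizeof(buf), &buf);
    clSetKernelArg(kb, 0, sizeof(buf), &buf);

    /* Enqueue A then B; the API gives no control over kernel overlap. */
    clEnqueueNDRangeKernel(q, ka, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q, kb, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(q);
    return 0;
}
```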

Jawed
 
Running 2 OpenCL or CS kernels at the same time requires sharing the LDS, and I guess you really don't want to spill 32 Kbytes of memory to GDDR (or to the L2?) and read it back that often.
 
Running 2 OpenCL or CS kernels at the same time requires sharing the LDS, and I guess you really don't want to spill 32 Kbytes of memory to GDDR (or to the L2?) and read it back that often.
Each kernel might be using less than 32KB of LDS per work group. It's the work group's allocation of local memory multiplied by the count of work-groups occupying a SIMD that determines the actual LDS usage.

On ATI 8 work groups can share a SIMD, regardless of the local memory allocation.
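
Back-of-envelope, assuming 32KB of LDS per SIMD and that 8-group cap:

```c
/* Occupancy arithmetic: how many work groups fit on one SIMD given each
 * group's LDS allocation (assumes 32KB LDS/SIMD, max 8 groups/SIMD). */
#include <stdio.h>

#define LDS_BYTES  (32 * 1024)
#define MAX_GROUPS 8

static int groups_per_simd(int lds_per_group)
{
    if (lds_per_group == 0)
        return MAX_GROUPS;
    int fit = LDS_BYTES / lds_per_group;
    return fit < MAX_GROUPS ? fit : MAX_GROUPS;
}

int main(void)
{
    printf("4KB/group:  %d groups/SIMD\n", groups_per_simd(4 * 1024));
    printf("16KB/group: %d groups/SIMD\n", groups_per_simd(16 * 1024));
    printf("32KB/group: %d groups/SIMD\n", groups_per_simd(32 * 1024));
    return 0;
}
```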

But generally this is a problem - one that hadn't occurred to me and yet another reason just to use cache with temporal hints and evicts.

Jawed
 
Each kernel might be using less than 32KB of LDS per work group. It's the work group's allocation of local memory multiplied by the count of work-groups occupying a SIMD that determines the actual LDS usage.
Only if LDS can be accessed as a register (via a base offset) - I haven't checked the new ISA document yet. Moreover, many non-trivial kernels will take the whole LDS just to perform dynamic allocations on it that might actually use only a fraction of it.
 
Will two kernels be executed concurrently even if each kernel consists of 1 group with 256 streams?
I don't know. Requires experimentation. I don't know enough about the OpenCL asynchronous execution model to be sure. Also AMD's OpenCL is still very much beta quality. You might have more luck with NVidia's Fermi based cards, if you're interested in this kind of programming.

Jawed
 
Only if LDS can be accessed as a register (via a base offset) - I haven't checked the new ISA document yet.
IL provides DWord addressing in LDS (dcl_lds_id(n) size in bytes), private to the work group - i.e. precisely the OpenCL model of local memory. n allows multiple LDS variables. Though OpenCL supports byte-sized local memory variables - not sure if ATI packs these.

In IL there is also support for structures: dcl_struct_lds_id(n) byte size of structure (multiple of 4), count of structures. I think (presume) this is also work group private, not sure.
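
For comparison, here's what the same model looks like from the OpenCL-C side - a generic reduction sketch, nothing ATI-specific; the compiler maps the __local array onto an LDS allocation like the dcl_lds_id above:

```c
/* Generic OpenCL work-group reduction; the __local array is the
 * work-group-private LDS allocation. Launch with a local size of 256. */
__kernel void reduce_sum(__global const float *in, __global float *out)
{
    __local float scratch[256];
    size_t lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}
```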

Moreover many non trivial kernels will take the whole LDS just to perform dynamic allocations on it that might actually use only a fraction of the LDS.
Do I sense recurring Cell SPE LS nightmares?...

Jawed
 
OK gentlemen, I took the liberty of splitting the interesting technical bits of discussion that had shown up in the R8xx speculation thread (if I missed any, please let me know), and spawned this here nice thread dedicated to pure geekiness. So, the old thread remains for...err...speculating about SKUs, clocks et al, and here we can nicely go on hounding Dave, OpenGL Guy and the others about their tiles. Thanks!
 