AMD: R8xx Speculation

Arty · Oct 2, 2009

AnarchX said:
A 14 SIMDs and 128-Bit RV700 with Cypress' transistor density would be around 160mm².
(Cypress: 6.45 million transistors per mm², RV740 826 million transistors, a RV770 SIMD = 10mm²@55nm, RV770 to Cypress density factor 1.75)

So they put in the left 25mm² D3D11, another 64-Bit-MC/8ROPs and additional caches?

8 ROPs? And you think it's 192-bit?

ShaidarHaran · Oct 2, 2009

Arty said:
8 ROPs? And you think it's 192-bit?

I took this to mean 8 more ROPs than RV740, 24 vs. 16.

fellix · Oct 2, 2009

Juniper would have ~400KB of additional SRAM arrays versus a hypothetical 14 SIMD "RV740" chip.

elsence · Oct 2, 2009

AnarchX said:
A 14 SIMDs and 128-Bit RV700 with Cypress' transistor density would be around 160mm².
(Cypress: 6.45 million transistors per mm², RV740 826 million transistors, a RV770 SIMD = 10mm²@55nm, RV770 to Cypress density factor 1.75)

So they put in the left 25mm² D3D11, another 64-Bit-MC/8ROPs and additional caches?

the DX10 4770 is 137mm2 at 40nm.
So with 48mm2 we get:

1. DX11 compliance
2. 6 additional SIMDs (24TUs/480SPs)
3. 64bit additional width in the memory controller
4. 8 additional ROPs

All these, with 48mm2?

I mean, if a "14 SIMDs and 128-Bit RV700 with Cypress' transistor density would be around 160mm²" as you say, this means:
The 4770 has 8 SIMDs.
14 SIMDs - 8 SIMDs = 6 SIMDs, taking 160mm2 - 137mm2 = 23mm2.
So each SIMD = 3.83mm2.
So all the SIMD units in the 4770 would be about 30mm2?
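The area arithmetic being argued over here can be reproduced with a short script. This is only a sketch built from the figures quoted in the posts; the 185mm2 Juniper die size is an assumption inferred from 137mm2 + 48mm2, and every variable name is made up for illustration.

```python
# Figures quoted in the thread (areas in mm^2, transistor counts in millions).
CYPRESS_DENSITY = 6.45        # million transistors per mm^2
RV740_TRANSISTORS = 826.0     # million transistors
RV770_SIMD_AREA_55NM = 10.0   # one RV770 SIMD at 55nm
DENSITY_FACTOR = 1.75         # RV770 -> Cypress density factor
RV740_DIE_AREA = 137.0        # the 4770 as shipped
EXTRA_AREA = 48.0             # elsence's figure for the area added over the 4770

# AnarchX's route: rescale RV740 to Cypress density, then add 6 shrunk SIMDs.
rv740_rescaled = RV740_TRANSISTORS / CYPRESS_DENSITY
simd_area_40nm = RV770_SIMD_AREA_55NM / DENSITY_FACTOR
estimate_14_simd = rv740_rescaled + (14 - 8) * simd_area_40nm

# Assumption (not stated outright in the thread): Juniper is 137 + 48 mm^2.
juniper_die_area = RV740_DIE_AREA + EXTRA_AREA
leftover = juniper_die_area - estimate_14_simd

# elsence's route: attribute the whole 160 - 137 difference to the 6 extra SIMDs.
simd_area_elsence = (160.0 - RV740_DIE_AREA) / 6

print(f"RV740 at Cypress density: {rv740_rescaled:.1f} mm^2")
print(f"One SIMD at 40nm:         {simd_area_40nm:.2f} mm^2")
print(f"14 SIMD, 128-bit chip:    {estimate_14_simd:.1f} mm^2")
print(f"Left for everything else: {leftover:.1f} mm^2")
print(f"elsence's per-SIMD area:  {simd_area_elsence:.2f} mm^2")
```

Under these assumptions the two per-SIMD figures differ (about 5.7mm2 versus 3.83mm2) because the first route starts from an RV740 rescaled to roughly 128mm2 rather than from the shipped 137mm2.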

AlexV · Oct 2, 2009

elsence said:
the DX10 4770 is 137mm2 at 40nm.
So with 48mm2 we get:

1. DX11 compliance
2. 6 additional SIMDs (24TUs/480SPs)
3. 64bit additional width in the memory controller
4. 8 additional ROPs

All these, with 48mm2?

Not quite.

Kaotik · Oct 2, 2009

elsence said:
the DX10 4770 is 137mm2 at 40nm.
So with 48mm2 we get:

1. DX11 compliance
2. 6 additional SIMDs (24TUs/480SPs)
3. 64bit additional width in the memory controller
4. 8 additional ROPs

All these, with 48mm2?

You're still assuming 192bit and 24 ROPs while every single leak and whatnot has indicated it's 128bit (the initial Juniper shots with the heavier cooler had 4 memory chips on the backside, so unless you suggest there were 2 or 6 memory chips on the frontside, there's no way it was 192bit).

elsence · Oct 2, 2009

AlexV said:
Not quite.

What do you mean?
I was just trying to point out that the data doesn't add up.

elsence · Oct 2, 2009

Kaotik said:
You're still assuming 192bit and 24 ROPs while every single leak and whatnot has indicated it's 128bit (the initial Juniper shots with the heavier cooler had 4 memory chips on the backside, so unless you suggest there were 2 or 6 memory chips on the frontside, there's no way it was 192bit).

No, I was just trying to say that the data doesn't add up.
I already posted a week ago that a 24 ROPs and 192bit design doesn't make sense to me.

Kaotik · Oct 2, 2009

elsence said:
No, I was just trying to say that the data doesn't add up.
I already posted a week ago that a 24 ROPs and 192bit design doesn't make sense to me.

Then why did you post those as "what we get for 48mm^2", while we're just getting DX11 and 4 SIMDs (+TUs)?

ninelven · Oct 2, 2009

Albuquerque said:
^^ dunno, maybe it's the resolution then? I play at 1680x1050, although I do use AA wherever feasible. I did a considerable amount of benchmarking when I built the rig, and GPU clocks always won out in my config.

That makes sense. I wasn't trying to say that core speed wasn't important for the 4850 (I own one as well), just that bandwidth is much more important for it than the 4870. If Juniper is 1120/56 and clocked @ 850MHz, that is almost 2x the ALU and TEX resources compared to the 4850 with very little bandwidth increase. As 1920x1080 monitors have moved into the mainstream, I'd say it is a legitimate concern. Sure, it may still perform well, but that doesn't mean it is not unbalanced and not being held back by a lack of bandwidth.

Benchmarks and overclocking results @ 1920x1080 will be very interesting to see.
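The "almost 2x the ALU and TEX resources with very little bandwidth increase" claim can be checked with rough numbers. A minimal sketch: the Juniper figures (1120 SPs, 56 TUs, 850MHz, 128bit, 1150MHz GDDR5) are the rumoured ones from this thread, while the 4850 figures (800 SPs, 40 TUs, 625MHz core, 256bit GDDR3 at 993MHz) are assumed from its public specs and do not appear in the posts.

```python
def bandwidth_gb_s(bus_bits, mem_clock_mhz, transfers_per_clock):
    """Peak memory bandwidth in GB/s (1 GB = 1000 MB)."""
    return bus_bits / 8 * mem_clock_mhz * transfers_per_clock / 1000

# Rumoured Juniper configuration discussed in the thread.
juniper = {"sps": 1120, "tus": 56, "core_mhz": 850,
           "bw": bandwidth_gb_s(128, 1150, 4)}   # GDDR5: 4 transfers per clock

# Assumed HD 4850 specs (not from the thread).
hd4850 = {"sps": 800, "tus": 40, "core_mhz": 625,
          "bw": bandwidth_gb_s(256, 993, 2)}     # GDDR3: 2 transfers per clock

alu_ratio = (juniper["sps"] * juniper["core_mhz"]) / (hd4850["sps"] * hd4850["core_mhz"])
tex_ratio = (juniper["tus"] * juniper["core_mhz"]) / (hd4850["tus"] * hd4850["core_mhz"])
bw_ratio = juniper["bw"] / hd4850["bw"]

print(f"ALU throughput ratio: {alu_ratio:.2f}x")
print(f"TEX throughput ratio: {tex_ratio:.2f}x")
print(f"Bandwidth ratio:      {bw_ratio:.2f}x")
```

With these inputs the shader and texture rates come out at about 1.9x while bandwidth grows by only about 16%, which is the imbalance described above.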

elsence · Oct 2, 2009

Kaotik said:
Then why did you post those as "what we get for 48mm^2", while we're just getting DX11 and 4 SIMDs (+TUs)

AnarchX said that:

"So they put in the left 25mm² D3D11, another 64-Bit-MC/8ROPs and additional caches?"

So I was trying to point out to him that, as I see it, all of these specs cannot be implemented in only 48mm2.

That's why I asked: "All these, with 48mm2?"

Also see my edit, where I explain what I mean.

elsence · Oct 2, 2009

The 5750 in the test is definitely a 10 SIMD design (it is easy to figure out if you check the 3dMark Vantage tests and correlate with 4770 & 4850 scores in 3dMark Vantage tests from other sources).

It is very strange for a 50 part to be cut down to such a degree (4 SIMDs).
It doesn't make financial sense.

I know that the article and Anandtech say 14 SIMDs for the 5770, so it probably is, but this design doesn't make sense to me.

If there are going to be 24 ROPs, the design will need a 192bit memory controller, otherwise it will be very bandwidth limited.

If the design has 16 ROPs, 14 SIMDs will be a waste of transistor space with the games that are going to launch in Q4 2009 and in Q1 2010, imo.

For example, relative to an 8 SIMD design:

The TU/ROP and SP/ROP ratio will be +75%
But the design probably will be able to extract only +50% perf (or less in some games)

For me, with the Q4 2009 and Q1 2010 games, a 16 ROPs/12 SIMD design (regarding ratio) is enough.
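The +75% figure follows directly from the SIMD counts. A small sketch, assuming 80 SPs and 4 TUs per SIMD (derived from the "6 additional SIMDs (24TUs/480SPs)" figure quoted earlier in the thread) and 16 ROPs in every case; the function name is hypothetical.

```python
SPS_PER_SIMD = 80   # 480 SPs / 6 SIMDs
TUS_PER_SIMD = 4    # 24 TUs / 6 SIMDs
ROPS = 16           # held constant across the compared designs

def per_rop(simds, rops=ROPS):
    """Return (SPs per ROP, TUs per ROP) for a design with `simds` SIMDs."""
    return simds * SPS_PER_SIMD / rops, simds * TUS_PER_SIMD / rops

base_sp, base_tu = per_rop(8)
for simds in (8, 12, 14):
    sp, tu = per_rop(simds)
    change = (sp / base_sp - 1) * 100
    print(f"{simds:2d} SIMDs: {sp:.0f} SP/ROP, {tu:.1f} TU/ROP ({change:+.0f}% vs 8 SIMDs)")
```

On these numbers a 12 SIMD part raises both ratios by 50%, matching the performance gain the post expects a 14 SIMD part to actually deliver.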

Groo The Wanderer · Oct 2, 2009

elsence said:
It is very strange for a 50 part to be cut down to such a degree (4 SIMDs).
It doesn't make financial sense.

Almost like someone is going through Kontortions to get page hits. If I had to make a guess, I would say that the specs that the review posted are completely bogus, and the person is flat out lying. Then again, what do I know?

-Charlie

Jawed · Oct 4, 2009

elsence said:
The problem with the 5870 is that the performance improvement relative to a 4890 is not consistent (it has much higher variation in perf. than 2X specs would normally give; I know about the bandwidth...).

I don't know why that is, but I guess it is either the geometry setup engine (the Geometry/Vertex assembler has the same performance as the 4890's) or something about the Geometry shader performance?

I could only find 3DMark Vantage tests, if you check:

http://www.pcper.com/article.php?aid=783&type=expert&pid=12

GPU Cloth: 5870 is only 1.2X faster than 4890. (vertex/geometry shading test)
GPU Particles: 5870 is only 1.2X faster than 4890. (vertex/geometry shading test)

Perlin Noise: 5870 is 2.5X faster than 4890. (Math-heavy Pixel Shader test)
Parallax Occlusion Mapping: 5870 is 2.1X faster than 4890. (Complex Pixel Shader test)

That's why Vantage is such a waste of time. Tests whose meaning is obscured.

Jawed

Jawed · Oct 4, 2009

trinibwoy said:
You can twist my words however you like but the fact is I'm just asking wth they're referring to in those diagrams. If it's just an increase in scan conversion throughput it's stupid.

But if it isn't then it could be useful. Which is why it's interesting.

Cause that's what everyone interpreted those diagrams to mean and that's what everyone keeps talking about - the need for faster triangle setup. What do you think they mean, and why hasn't it been further fleshed out by AMD or anyone else?

Initially I thought it might indicate that setup rate had doubled - and I even posted a theory that it might have something to do with a revised architecture for multi-chip rendering - i.e. the architecture is designed to scale across multiple shared setup units. But having discovered there is one setup unit (which spits out a list of tiles that need to be rasterised, for each input triangle) the question remains, how are the dual rasterisers working?

Also, how are the cores given work? Are 10 cores dedicated to each rasteriser?

Sure if you assume the author of the test is retarded

Actually, the problem is entirely with people who read that and assume it has anything to do with rasterisation rate. It's a triangle throughput test, not a rasteriser configuration/throughput analyser.

Jawed

elsence · Oct 4, 2009

Groo The Wanderer said:
Almost like someone is going through Kontortions to get page hits. If I had to make a guess, I would say that the specs that the review posted are completely bogus, and the person is flat out lying. Then again, what do I know?

-Charlie

Yes, the specs don't seem right.
Even the memory clock is strange.

The 1150MHz indicates 5Gbps ICs; at this early stage of GDDR5 they probably can't find 4.5Gbps certified ICs (in the volume they need for the <$150 market) and be able to overclock all of them to 1150MHz.

5Gbps ICs for a <$150 part seems kinda strange (same ICs as the 5870).
But not impossible, we will see...
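The memory-clock reasoning can be made concrete. A minimal sketch, assuming the usual GDDR5 convention that the per-pin data rate is four times the quoted memory clock; the helper names are made up for illustration.

```python
def gddr5_data_rate_gbps(mem_clock_mhz):
    """Per-pin data rate in Gbps for a quoted GDDR5 memory clock (4 bits per clock)."""
    return mem_clock_mhz * 4 / 1000

def gddr5_rated_clock_mhz(rated_gbps):
    """Memory clock in MHz at which an IC of the given speed grade runs at spec."""
    return rated_gbps * 1000 / 4

needed = gddr5_data_rate_gbps(1150)       # data rate implied by the leaked 1150MHz
clock_45 = gddr5_rated_clock_mhz(4.5)     # spec clock of a 4.5Gbps part
clock_50 = gddr5_rated_clock_mhz(5.0)     # spec clock of a 5Gbps part
overclock_pct = (1150 / clock_45 - 1) * 100

print(f"1150MHz needs {needed:.1f}Gbps per pin")
print(f"4.5Gbps ICs are rated for {clock_45:.0f}MHz, 5Gbps ICs for {clock_50:.0f}MHz")
print(f"Running 4.5Gbps ICs at 1150MHz is a {overclock_pct:.1f}% overclock")
```

So 1150MHz sits just above the 4.5Gbps grade, which is why the post reads it as implying either 5Gbps ICs or 4.5Gbps ICs that all tolerate a small overclock.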

elsence · Oct 4, 2009

Jawed said:
That's why Vantage is such a waste of time. Tests whose meaning is obscured.

Jawed

Yes, it is difficult to draw conclusions based on the 3DMark Vantage tests, but it was all I could find.

Half a month ago (on the day of the launch) I posted here my assumption that there are performance problems related to the geometry setup engine (or GS), but nobody replied.

Heck, at that time I even proposed to a reviewer from another site to do testing with synthetic benchmarks (not Vantage) in order to see what's going on, but again I didn't get a reply.

I don't have the resources to buy and test a 5870, but many guys here can. I guess we will see some tests in the future that will show more clearly the performance behavior of the 5870 and what the potential problems are.

trinibwoy · Oct 4, 2009

Jawed said:
But if it isn't then it could be useful. Which is why it's interesting.

Agreed, that's why we've been trying to find out whether it is.

But having discovered there is one setup unit (which spits out a list of tiles that need to be rasterised, for each input triangle) the question remains, how are the dual rasterisers working?

What about the theory proposed earlier about buffering triangles in the event triangle setup outpaces scan conversion? Not a likely scenario?

Actually, the problem is entirely with people who read that and assume it has anything to do with rasterisation rate. It's a triangle throughput test, not a rasteriser configuration/throughput analyser.

It may seem so now that we (think we) know setup rate hasn't increased. But in the absence of any info previously we turned to those tests for a clue.

Jawed · Oct 4, 2009

trinibwoy said:
What about the theory proposed earlier about buffering triangles in the event triangle setup outpaces scan conversion? Not a likely scenario?

Setup does a coarse rasterisation, identifying all the tiles that a triangle at least partially covers, then giving the rasteriser(s) a list of tiles and triangle data in order to rasterise. I suspect the rasteriser has a tile-centric view of rasterisation, not a triangle-centric view. That's because threads of 16 quads of fragments need to be despatched, and those need to be strictly tile-aligned (because the render target is tiled). Though I also expect it to handle triangles in strict order.

One of the key questions that's still unanswered is can a thread of fragments refer to more than one triangle (e.g. 5 adjacent small triangles from a strip)? While I've long thought that the answer's no on ATI, there'd be little reason for the hierarchical-Z unit to be able to resolve quads of pixels (as I think it does), as this wouldn't save any pixel shading effort - but it would save texturing effort and RBE bandwidth if the quads are rejected. On the other hand, when two adjacent triangles share an edge, that generates two fragments per pixel along that edge - something that doesn't map to a straightforward 2D translation from pixel locations into strands in a thread. That seems problematic to me (because tile ID and pixel position within a tile would no longer be enough for RBE to know which pixel a fragment is destined for).

Can ATI's 16-fragments per clock rasteriser rasterise 4 triangles in one clock, if each only occupies a quad of pixels? Seems unlikely to me, as the rasteriser prolly only works on 1 triangle's line equations per rasterisation-cycle.

One of the features of ATI's rasteriser is an optimisation for thin triangles. Rasterisation orientates itself to the horizontal or vertical depending on the alignment of the triangle, because the rasteriser wants to work on a minimum of 2 columns or 2 rows (since pixels need to be quad-aligned). This implies that rasterisation within a tile doesn't blindly run over all the pixels in the tile, merely that it processes the entire portion of a triangle that fits within the current tile before moving on. This might only amount to, say, 27 fragments in a tile of 64 pixels, for example, so would be 2 rasterisation clocks (as long as the triangle only occupies a maximum of either 4 rows or 4 columns).

So two rasterisers would have higher throughput than a single rasteriser at the same rasterisation rate, if there are any triangles that don't fully occupy an entire rasteriser's capability on each clock. That would be quite common, generally, as a triangle (or portion of a triangle in the current tile) that's small might result in only, say, 9 fragments produced by a 32-rasteriser (wasting 23 rasterisation ops), whereas two 16-rasterisers would only waste 7 rasterisation ops on this one triangle.
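The wasted-slot arithmetic in the last two paragraphs can be written out as a toy model. This is a sketch only: it assumes a rasteriser emits up to `width` fragments per clock for a single triangle and that any unused slots in a clock are lost, which is the premise argued above rather than a documented hardware behaviour.

```python
import math

def raster_clocks(fragments, width):
    """Clocks needed to emit `fragments` fragments at `width` fragments per clock."""
    return math.ceil(fragments / width)

def wasted_ops(fragments, width):
    """Fragment slots left unused while rasterising one triangle (or tile portion)."""
    return raster_clocks(fragments, width) * width - fragments

# 27 fragments in a 64-pixel tile on a 16-wide rasteriser.
print(raster_clocks(27, 16))   # 2 clocks

# A small triangle producing 9 fragments.
print(wasted_ops(9, 32))       # 23 slots wasted on one 32-wide rasteriser
print(wasted_ops(9, 16))       # 7 slots wasted on one of two 16-wide rasterisers
```

In this model the second 16-wide rasteriser stays free for another triangle during the same clock, which is where the throughput advantage over a single 32-wide unit would come from.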

But I suspect adjacent small triangles can't go into a single thread of 64 fragments. A lot of the time small triangles will be adjacent (and their fragments would want to share a thread), and so no effective speed-up will be seen. Triangles that are larger (e.g. >32 and <64 pixels) are going to generate a speed-up, since the chances of adjacent triangles of this size falling within the same thread fall off. But triangles of such a size don't match the expectation: "tessellation generates huge numbers of small triangles!!!"

You could say that a thread size of 64 and a limit of 1 triangle's fragments per thread (if true) are the key limitations here. So I'm dubious that dual rasterisers were made to increase performance, per se. I think it might be a matter of practicality in instancing a block of hardware rather than re-jigging things for 32-rasterisation. I don't think the number 32 is problematic (since other ATI GPUs have 4-, 8- and 12-rasterisers) merely that scaling isn't free of latency/pipelining issues across the entire width of the unit.

Jawed

MfA · Oct 4, 2009

trinibwoy said:
You can twist my words however you like but the fact is I'm just asking wth they're referring to in those diagrams. If it's just an increase in scan conversion throughput it's stupid.

Why is it stupid if it's true? Even if it's just an artifact of implementation which is indistinguishable from having one 32x rasterizer in practice, it's still nice to get implementation details. It's a pity setup rate wasn't doubled (I still think it would be easy given that the rasterizers deal with different tile sets) but meh ... being a bitch about getting implementation details just because they aren't relevant to performance I see as counter-productive, I'd still rather hear them than not.

It's not like it was a paper launch where misunderstandings could fester for months ...
