I can't find what you're referring to.
Tom Forsyth had a presentation at SIGGRAPH 2008.
All I can find is figure 13 in Seiler et al. where "Pre-Vertex" is ~1-2% of workload.
Wouldn't allocating triangles to a bin require the rasterization portion of the workload as well?
Forsyth's slides apparently included this in the front-end estimate.
Seiler refers to the processing of a PrimSet as occurring on a single core, which effectively makes it a serialisation point. In general Larrabee will be working on multiple PrimSets in parallel (i.e. it can overlap processing of draw calls, so long as per-tile primitive ordering is respected).
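Roughly how I picture that ordering constraint, as a pure sketch (the structures and the sequence-number trick are mine, not from the paper): any front-end core bins whichever PrimSet it was handed, and each tile's back end only consumes chunks in PrimSet order.

    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    struct BinChunk {
        uint64_t primSetSeq;          // draw-call / PrimSet order this chunk belongs to
        std::vector<uint32_t> tris;   // triangles from that PrimSet landing in this tile
    };

    struct TileQueue {
        std::map<uint64_t, BinChunk> pending;   // keyed by PrimSet sequence number
        uint64_t nextSeq = 0;

        // Front-end cores (working on different PrimSets concurrently) push chunks.
        // Simplifying assumption: a PrimSet that touches nothing in this tile still
        // pushes an empty chunk so the sequence can advance.
        void push(BinChunk c) { pending.emplace(c.primSetSeq, std::move(c)); }

        // The back end pulls strictly in PrimSet order, so per-tile primitive
        // ordering holds even though the PrimSets were binned in parallel.
        bool pop(BinChunk& out) {
            auto it = pending.find(nextSeq);
            if (it == pending.end()) return false;   // an earlier PrimSet isn't binned yet
            out = std::move(it->second);
            pending.erase(it);
            ++nextSeq;
            return true;
        }
    };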
Since this work is relatively undemanding, you might argue that it simply isn't worth spreading across chips, but it doesn't sound intrinsically single-chip.
The actual costs I see are the creation of a bin and then having any core pick up a bin for processing; both would be more expensive to do across a chip boundary.
Forsyth's slides also indicated that a bin contains tris, shaded verts, and rasterized fragments.
I'm not sure whether the fragments would be a concern for the distribution phase that might pass over the interconnect.
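For what it's worth, this is the sort of bin layout I have in mind; the fields and sizes are my guesses rather than anything from Forsyth's slides, but it at least shows which parts would have to travel to a remote tile owner and which wouldn't:

    #include <cstdint>
    #include <vector>

    struct ShadedVertex { float pos[4]; float attrs[8]; };   // attribute count is a guess
    struct BinTriangle  { uint32_t v0, v1, v2; };            // indices into verts below

    struct TileBin {
        std::vector<ShadedVertex> verts;   // front-end output: would cross the interconnect
        std::vector<BinTriangle>  tris;    // front-end output: would cross the interconnect
        // Rasterised fragments are generated by the back end on the tile's own core,
        // so I'd expect them to stay local rather than travel with the bin.
    };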
Since the memory subsystem should maintain a coherent image of memory across the chips, there is no algorithmic reason why it would be single-chip.
However, the costs of this work have only been evaluated as sufficiently low for a single-chip scenario.
Other multi-chip rendering methods often opt for duplicating setup work. This is the case for GPUs, and also for a number of distributed rendering schemes for CPUs, though in those cases there is often a trip over a network interconnect that raises costs even further.
Tessellation is my biggest question mark with Larrabee, specifically whether a just-in-time approach is used for the creation of bins: create bins with untessellated patches, and then when rasterisation/shading/back-end work starts on a tile, the bin is tessellated. This inevitably causes leakage across tile boundaries though (new vertices can't be constrained to a tile), which makes it seem unworkable. Dunno.
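To make the leakage concrete, here's roughly what "tessellate when the back end starts" would look like; the tile size, patch layout and toy tessellator are all made up:

    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    struct Tri   { float x[3], y[3]; };
    struct Patch { float cp[16][2]; };        // control points, layout assumed

    constexpr int kTileSize = 128;            // assumed tile dimensions

    // Stand-in tessellator: a real one would emit many triangles per patch.
    std::vector<Tri> tessellate(const Patch&) {
        return { Tri{{0.f, 130.f, 0.f}, {0.f, 0.f, 130.f}} };
    }

    // If the bin holds patches and tessellation happens when the tile's back end
    // starts, the generated triangles can land in other tiles entirely.
    std::map<std::pair<int, int>, std::vector<Tri>> tessellateAndRebin(const Patch& p) {
        std::map<std::pair<int, int>, std::vector<Tri>> perTile;
        for (const Tri& t : tessellate(p)) {
            int tx = static_cast<int>(t.x[0]) / kTileSize;   // crude: bin by first vertex
            int ty = static_cast<int>(t.y[0]) / kTileSize;
            perTile[{tx, ty}].push_back(t);                  // may not be the local tile
        }
        return perTile;
    }

Any triangle that lands outside the owning tile has to be shipped to a neighbour's bin or rediscovered there, which is exactly the binning work the scheme was trying to defer.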
The flexibility of the software pipeline is the reason Forsyth's estimate for front-end work spans such a wide range.
It's 10% if attribute, vertex, and tessellation work is deferred to the back end. It's variable because those three can be done in either the front or the back end.
Bin size would be most amenable to sending to another chip if this work is deferred, but back-end burden and bin spread would be worse.
If done in the front-end, bin size becomes much larger and more costly to send to a remote pool of cores, though the bins themselves would be much better-behaved.
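Some back-of-the-envelope numbers for that trade-off, with every per-vertex size assumed by me:

    #include <cstdio>

    int main() {
        // Deferred split: the bin carries little more than vertex indices plus a
        // reference back to the source data.
        const int deferredBytesPerTri = 3 * 4 + 8;             // 3 indices + small header

        // Front-end split: fully shaded vertices (position + attributes) travel in the bin.
        const int shadedVertexBytes   = 4 * 4 + 8 * 4;         // float4 position + 8 float attrs
        const int expandedBytesPerTri = 3 * shadedVertexBytes + 3 * 4;

        std::printf("deferred: %d B/tri, expanded: %d B/tri (%.1fx larger)\n",
                    deferredBytesPerTri, expandedBytesPerTri,
                    static_cast<double>(expandedBytesPerTri) / deferredBytesPerTri);
        return 0;
    }

Even with those rough figures the deferred bin is several times smaller per triangle, which is why it looks friendlier to push over a chip link, at the price of redoing attribute/vertex work in the back end.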
If the front-end is duplicated, we roughly double the computation required for PrimSet dispersal and front-end work, but with minimal increase in synchronization or bandwidth burden on the interface. The developer would be much freer to decide where to put work between the front and back ends.
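A sketch of what a duplicated front-end could look like, with the tile-to-chip mapping simply assumed to be a checkerboard:

    #include <cstddef>
    #include <vector>

    struct Tri { int minTileX, minTileY, maxTileX, maxTileY; };   // tile-space bounds

    // Assumed checkerboard ownership of tiles across chips.
    bool chipOwnsTile(int chipId, int numChips, int tx, int ty) {
        return ((tx + ty) % numChips) == chipId;
    }

    // Every chip runs this over the full PrimSet, but only keeps triangles for its
    // own tiles, so the front end never has to push bin data across the chip link.
    std::size_t binOnChip(int chipId, int numChips, const std::vector<Tri>& tris) {
        std::size_t binned = 0;
        for (const Tri& t : tris)
            for (int ty = t.minTileY; ty <= t.maxTileY; ++ty)
                for (int tx = t.minTileX; tx <= t.maxTileX; ++tx)
                    if (chipOwnsTile(chipId, numChips, tx, ty))
                        ++binned;   // stand-in for "append t to this chip's bin (tx, ty)"
        return binned;
    }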
The PrimSet distribution by one core is actually well-suited to the likely ring-bus configuration Larrabee will use. If it is anything like the polarity-shifting method used by Beckton, only a small subset of cores will be available for the control core to send updates to in a given cycle. This is fine, as the control core can only serially send out updates for a handful of cores anyway.
Given the speed and bandwidth of the on-chip bus, the costs for this are probably safe to accept.
I'd be curious to see how this works for dispersing updates to an ever-increasing number of cores and then over a chip-to-chip link, which is both more constrained and higher latency than the ring-bus.
It's also the case that if a bin is set up and ready for back-end processing, a scheme that is not aware of multi-chip NUMA is going to generate much more traffic over the interconnect, something that will not happen if the setup scheme has duplicate front-ends that specifically minimize inter-chip rendering traffic.
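By NUMA-aware I mean something like this pickup policy (again just a sketch):

    #include <cstddef>
    #include <optional>
    #include <vector>

    struct BinRef { int homeChip; int tileX, tileY; };

    // A core prefers bins whose memory is homed on its own chip and only reaches
    // across the link when it would otherwise sit idle.
    std::optional<BinRef> pickBin(int myChip, std::vector<BinRef>& ready) {
        for (std::size_t i = 0; i < ready.size(); ++i) {
            if (ready[i].homeChip == myChip) {
                BinRef b = ready[i];
                ready.erase(ready.begin() + i);
                return b;
            }
        }
        if (!ready.empty()) {                 // last resort: steal a remote bin
            BinRef b = ready.front();
            ready.erase(ready.begin());
            return b;
        }
        return std::nullopt;
    }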
I'm sceptical Intel plans to brute-force tessellation...
Tessellation is about creating more triangles. At some level, amplifying the number of triangles and then turning them into a bandwidth and latency cost is a liability for any scheme that apportions work heedless of chip location.
It would be functional so long as Intel keeps inter-chip coherence, but Larrabee's bandwidth savings would be undermined if the chip link is saturated, even if absolute GB/s consumption is lower.
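Some made-up numbers to show the scale; every constant here is an assumption, but the shape of the result is the point:

    #include <cstdio>

    int main() {
        const double trisPerFrame      = 1.0e6;   // pre-tessellation triangle count (assumed)
        const double amplification     = 16.0;    // average tessellation factor (assumed)
        const double bytesPerBinnedTri = 64.0;    // bin payload per triangle (assumed)
        const double framesPerSecond   = 60.0;
        const double remoteFraction    = 0.5;     // share of bins homed on the other chip

        const double gbPerSec = trisPerFrame * amplification * bytesPerBinnedTri *
                                framesPerSecond * remoteFraction / 1.0e9;
        std::printf("cross-chip bin traffic: ~%.1f GB/s\n", gbPerSec);
        return 0;
    }

Tens of GB/s of bin traffic over a chip-to-chip link is a very different proposition from the same traffic staying on the on-die ring.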
I guess in theory Intel could massively overspecify the inter-chip connections, but that sounds expensive.