AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 lineup of GPUs?

  • Within 1 or 2 weeks: 1 vote (0.6%)
  • Within a month: 5 votes (3.2%)
  • Within couple months: 28 votes (18.1%)
  • Very late this year: 52 votes (33.5%)
  • Not until next year: 69 votes (44.5%)

  • Total voters: 155
  • Poll closed.
psolord I watched your other evergreen video with the Hitler as Jensen. Was hoping to see mentions of him selling NVDA stock or lousy yields. Oh well.

I wasn't aware of those when I was molesting the subtitles!

I heard about the yields just today (from CJ) and about the stock selling just yesterday! :oops:
 
IIRC, they also blew past the power envelope in one of the 3DMarks as well...

And really, how hard do you think it is to get a sticker?

In 3DMark06 at 1600x1200 with 4xAA/16xAF, which xbitlabs still considers the best program for testing power consumption, at least the HD4870X2 stays well under its TDP, at 264W.

"Sticker" was a figure of speech; they wouldn't get validated by the PCI-SIG.
 
Yes, CCC temperature screenshots were just leaked on my home forum. They show the GPU being downclocked to a 157MHz core and the memory to 300MHz in 2D. Wasn't expecting these shots though. Was waiting for someone to leak the Supersampling option in CCC. :p

Wow, 75°C at full load and 30% fan speed, that's quite an improvement from RV770/790.
 
I just thought that, since memory capacity is doubling every 2 years while resolution is changing at a much slower tempo
(in the future we will see... but I don't think bezeliousfinity lol will change anything for the vast majority of the market),
don't you think it makes sense for game engines to use the extra memory capacity (that their customer base has) with each cycle, per resolution target?
Oh definitely, I simply hadn't noticed that 512MB really is too little in a significant number of games at resolutions such as 1680.
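To put that reasoning in rough numbers, here is a small back-of-the-envelope sketch; the starting capacity, resolution and growth rates are illustrative assumptions, not data.

```python
# Rough sketch of the argument above: if VRAM doubles roughly every two
# years while the typical gaming resolution grows much more slowly, the
# memory available per rendered pixel keeps climbing, so engines can
# spend more of it per resolution target each cycle.
# All starting values and growth rates are illustrative assumptions.

vram_mb = 512                     # assumed typical card at cycle 0
megapixels = 1680 * 1050 / 1e6    # assumed typical resolution at cycle 0

for cycle in range(4):            # four ~2-year product cycles
    print(f"cycle {cycle}: {vram_mb} MB, {vram_mb / megapixels:.0f} MB per megapixel")
    vram_mb *= 2                  # capacity doubles each cycle
    megapixels *= 1.2             # resolution grows far more slowly (assumed +20%/cycle)
```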

I think that AMD made a very good decision with the 256-bit controller (if you think about what the positives are and what the potential negatives of a 512-bit controller would be for ATI, it's definitely a good business decision...)
I suspect this GPU is way larger than would have been preferred (compare with RV770 and RV670) as 40nm is expensive. Maybe AMD will have a way to squeeze a 512-bit MC (or equivalent-bandwidth alternative) into this size some time.
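For reference, the bandwidth trade-off behind the 256-bit vs 512-bit question boils down to peak bandwidth = (bus width in bytes) x per-pin data rate. A quick sketch, with assumed (not product-spec) GDDR5 data rates:

```python
# Back-of-the-envelope GDDR5 bandwidth, to frame the 256-bit vs 512-bit
# trade-off. Per-pin data rates here are illustrative, not product specs.

def peak_bandwidth_gb_s(bus_width_bits, gbps_per_pin):
    """Peak memory bandwidth in GB/s: (bus width in bytes) * per-pin data rate."""
    return bus_width_bits / 8 * gbps_per_pin

for bus in (256, 512):
    for rate in (4.0, 5.0, 6.0):          # Gbps per pin (assumed)
        print(f"{bus}-bit @ {rate} Gbps -> {peak_bandwidth_gb_s(bus, rate):.0f} GB/s")
```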

Maybe the theory goes that under D3D11 developers will be able to create renderers that don't soak up so much bandwidth, while improving visuals and being faster. But that's not going to happen any time soon, and it doesn't help the existing games that are still strangled (even if it can be argued they were programmed badly and/or are unable to scale).

And performance scaling is still required to get people to upgrade.

I perceive the AMD slides regarding GDDR5 differently.
What do you perceive? I'm getting other feedback that beyond 5Gbps is troublesome.

Larrabee, from what I can understand, is not going to be competitive from a performance standpoint with the future high-end GPUs that ATI & NV can make, but it doesn't have to be.
It probably won't be, on 45nm, but on 32nm?... While 28/32nm GPUs will be strangled by GDDR5 that might be 20-30% faster than now.

For Intel to control the GPU market it has to stay in the market for at least two engine mini-cycles (2+2 years), and it must use a pricing model, business tactics and a marketing strategy that entice partners and consumers towards its solutions...

Reason aside, I am not optimistic that Intel will achieve this; that's why my original thought when I heard about Larrabee's GPU was that Intel would implement a custom GPU socket solution lol and develop their strategy in that direction...
Seems unlikely. Between putting a GPU on the CPU die and a discrete board, a socket-only solution seems like unnecessary expense.

If you read the JPR report from 1-1.5 months back, JP was saying that nearly half of the new PCs in 2012 will be sold with multi-AIB GPU solutions (with scaling done by Lucid Tech chip solutions, according to him) (lol).

I hope his prediction turns out to be wrong.

I don't see how AMD/NV will like this direction...

I think that ATI & NV have the technical capability to make homogeneous multi-core designs like the SGX543MP (I am not talking about the tile-based rendering method...)
(that's how I see the progression of ATI/NV's future shared-memory GPU designs),
so they will not need Lucid for performance scaling (why would NV/ATI lose all the money that customers are going to pay for Lucid-based solutions, when this money could go directly to NV/ATI?).
These fixed-function GPUs are trying to unwind from the huge fixed-function trench they've dug into. Compute shader/OpenCL goes some way and should see significant usage in games, and in theory alleviate some of these harsh bandwidth issues. But it seems to me that basic forward rendering is too simplistic a pipeline model now, with the party thrown by the bandwidth gods coming to an end (just like the MHz party came to an end for CPUs years ago).
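As a rough illustration of why a compute-shader-style tiled pass can ease that bandwidth pressure, here is a crude framebuffer-traffic model; the resolution, light count and buffer formats are all assumptions for illustration, not measurements of any real engine.

```python
# Crude framebuffer-traffic model comparing classic multi-pass forward
# lighting with a tiled compute-shader pass that keeps each screen tile
# in on-chip memory. All sizes and counts are illustrative assumptions.

width, height = 1920, 1080
pixels = width * height
lights = 32
bytes_color = 4          # RGBA8
bytes_depth = 4

# Forward: one additive blend pass per light -> read+write colour, read depth.
forward_bytes = pixels * lights * (2 * bytes_color + bytes_depth)

# Tiled compute: read a small G-buffer once, accumulate lighting for all
# lights in on-chip shared memory, then write the result once.
gbuffer_bytes_per_pixel = 16     # assumed normal + depth + albedo
tiled_bytes = pixels * (gbuffer_bytes_per_pixel + bytes_color)

print(f"forward : {forward_bytes / 2**20:.0f} MiB of framebuffer traffic per frame")
print(f"tiled CS: {tiled_bytes / 2**20:.0f} MiB of framebuffer traffic per frame")
```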

I was really hoping that the major compute shader features of D3D11 would result in some serious re-configuration of the GPU. Maybe NVidia will take this problem more seriously with GT300 (though, to be honest, based on Dally's comments I'm doubtful).

Jawed
 
Hmm, are the texture units still similar in filtering capability?
I always thought that Nvidia could afford having near-perfect AF filtering with their chips (since G80), since they comparatively had more texture units than AMD's R6xx and newer chips. After all, angle-independent AF increases the number of samples (on average) you need to do filtering with.
This is part of the confusion with the texel rates that Vantage reports, I believe, something that's "never been fixed" because the numbers are right, I think. ATI can fetch way more fp16 or int8 texels than NVidia per clock per unit, because the hardware is designed to be full speed fetching fp32 texels.
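To make the "per clock per unit" point concrete, a toy texel-rate calculation; the unit counts, clocks and per-format fetch rates below are hypothetical, not vendor figures:

```python
# Toy texel-fetch-rate calculation, only to show how strongly synthetic
# texel numbers depend on per-format fetch rates. All parameters below
# are hypothetical, not measured or vendor-confirmed figures.

def gtexels_per_s(units, clock_mhz, fetches_per_clock):
    """Texel fetch rate in GTexels/s for one format."""
    return units * clock_mhz * 1e6 * fetches_per_clock / 1e9

# Hypothetical GPU A: fetches fp16 texels at full rate per unit per clock.
print("GPU A, fp16:", gtexels_per_s(units=40, clock_mhz=750, fetches_per_clock=1.0))
# Hypothetical GPU B: fetches fp16 texels at half rate per unit per clock.
print("GPU B, fp16:", gtexels_per_s(units=80, clock_mhz=600, fetches_per_clock=0.5))
```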

Jawed
 
It is only one of the wastages of present-day GPUs that will have to be corrected. Ultimately, they'll have to go over to software load balancing like LRB to become more area- and power-efficient.
Ultimately, yes, but in the meantime they (or at least AMD) are looking like they want to drag out the maximal-FF approach for as long as possible. Perhaps with improved multi-chip to give it a shot in the arm, in AMD's case. I wouldn't mind, but I suspect this misses a trick when it comes to CS/OpenCL.

Jawed
 
Not saying anything (yet, at least), but aren't we reading a wee bit too much into what is basically a marketing-oriented architecture diagram? It's not like there aren't other things present that aren't shown there (or in such diagrams in general)... they could still be using dedicated interpolators and simply not have shown them there.
It's not just the diagram that makes me think these things - i.e. information I've received plus Marco's comment.

Still, I generally don't let the possibility of being wrong stop me from following through with ideas, and there's only so much to say before just waiting and seeing what transpires.

Jawed
 
[Image: hemlock.jpg]

(hosting it myself, original at tweakers.net forum)

Is it Hemlock? It's not an HD5870 for sure, nor the SIX model; the only other option would be the HD5850, but the length seems to match the old leaked Hemlock pic.
 
Well, that's exactly my problem. Doubling the RBEs per MC has clearly (as can already be seen in RV740) brought a significant jump in efficiency, but at the same time the GDDR5 gravy train appears to have hit the buffers. So unless something radical happens and GDDR5 goes way above 6Gbps, the future is looking awful for this architecture - the entire forward-rendering concept needs a rethink.
Larrabee's bandwidth savings from keeping a screen tile on-chip until it is completed are something most architectures will probably move towards.
I don't know if there are any good ways to do so without extending tiling throughout the GPU architecture.

The least elegant way to defer tiling would be to hope the foundry masters EDRAM and then slather on a big ROP tile cache that holds the framebuffer up to some ridiculously high resolution.
A much larger cache in general would reduce bandwidth needs, as has been found in other computing realms.
It's not great, but it is an available hack that requires little more effort than allocating die space to it.
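For scale, a quick estimate of how much EDRAM such a ROP tile cache would need to hold a whole framebuffer on-chip; the colour/depth formats and AA levels are assumed for illustration:

```python
# How big would an on-chip framebuffer have to be? Rough estimate with
# assumed formats: RGBA8 colour plus 32-bit depth/stencil, per AA sample.

def framebuffer_mib(width, height, aa_samples, bytes_per_sample=4 + 4):
    return width * height * aa_samples * bytes_per_sample / 2**20

for res in ((1680, 1050), (1920, 1200), (2560, 1600)):
    for aa in (1, 4):
        print(f"{res[0]}x{res[1]} {aa}xAA: {framebuffer_mib(*res, aa):.0f} MiB")
```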

It's just a question of the latency of RBE-Z updates for hierarchical-Z - if those latencies are long enough, does hierarchical-Z work? That latency could easily be thousands of cycles. Tens of thousands.
There is still a H-Z block per rasterizer. If each rasterizer is allowed to update its local copy as the arrows in the diagram indicate, maybe the design assumes that with an even distribution to each rasterizer each local H-Z will start to approach a similar representation in high-overdraw situations.
This would lead to an incremental decrease in effectiveness for short-run situations, and then there is the chance of a long run of pair-wise overlapping triangles pathologically alternating between rasterizers.
The cap would be the RBE z-update latency.
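A toy model of that cap: a rasterizer's local H-Z can keep rejecting against a stale coarse Z as long as that Z only ever moves when a confirmed RBE update arrives; staleness just makes the early reject less effective. This is a sketch of the idea, not the actual hardware:

```python
# Toy model of a per-rasterizer hierarchical-Z block whose coarse max-Z
# is only updated after a (long-latency) RBE depth update arrives.
# Stale values are safe: they only make the early reject less effective.

class HierZTile:
    def __init__(self):
        self.max_z = 1.0               # farthest possible depth (cleared)

    def early_reject(self, tri_min_z):
        # Reject a triangle for this tile if even its nearest point is
        # behind everything already confirmed in the tile.
        return tri_min_z > self.max_z

    def rbe_update(self, confirmed_max_z):
        # Arrives thousands of cycles later; only ever moves max_z nearer.
        self.max_z = min(self.max_z, confirmed_max_z)

tile = HierZTile()
print(tile.early_reject(0.4))   # False: can't reject yet, must shade
tile.rbe_update(0.3)            # RBE eventually confirms nearer geometry
print(tile.early_reject(0.4))   # True: now rejected without shading
```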

One way of looking at this multiple-rasteriser architecture is to imagine what happens if AMD is going to build a multi-chip solution where the multiple-rasterisers scale-up and work by communicating amongst themselves (i.e. non-AFR, instead something like super-tiled). If this is the basis of the design, then off-chip inter-rasteriser latencies are obviously far higher than on-chip - let's say 500 cycles for the sake of argument. Where does that lead? Dumb round-robin rasterisation? Still doesn't answer the question of how to apportion the vertex streams across multiple GPUs, or what to do with GS stream out from multiple GPUs (let alone append/consume buffers).
I don't think these problems have been fully solved for any multi-chip GPU solution.
Not even Intel has shown a path, as Larrabee's binning scheme has been defined only for a single-chip solution.
The rasterizer portion of the scheme may need to have a local run on the full stream on each chip, with a quick reject of primitives that do not fall within a chip's screenspace.

On-chip solutions have much more leeway.
Cypress could send two triangles at the same time to be rasterized. Each rasterizer gets a copy of this pair, and an initial coarse rasterization stage can allow each rasterizer to decide if it will pick or punt each triangle.
Such a process would be much more expensive if crossing chips.
It's a matter of a duplicate data path and an additional coarse check if on-chip.
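A sketch of that pick-or-punt idea for two on-chip rasterizers: both see every triangle, run a cheap coarse test of the bounding box against the screen tiles they own, and keep only what touches their own tiles. The checkerboard tile ownership and tile size are assumptions for illustration:

```python
# Toy pick/punt model for two rasterizers that each receive every
# triangle and keep only those touching screen tiles they own.
# Checkerboard ownership of 64x64 tiles is an assumed split, for illustration.

TILE = 64

def owned(tile_x, tile_y, rasterizer_id):
    # Assumed checkerboard split of screen tiles between two rasterizers.
    return (tile_x + tile_y) % 2 == rasterizer_id

def picks(bbox, rasterizer_id):
    """Coarse test: does the triangle's bounding box touch any owned tile?"""
    x0, y0, x1, y1 = bbox
    for ty in range(y0 // TILE, y1 // TILE + 1):
        for tx in range(x0 // TILE, x1 // TILE + 1):
            if owned(tx, ty, rasterizer_id):
                return True   # pick
    return False              # punt

tri_bbox = (70, 70, 120, 120)                 # fits inside a single 64x64 tile
print([picks(tri_bbox, r) for r in (0, 1)])   # only one rasterizer picks it
```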
 
The cap would be the RBE z-update latency.
Why would you wait for back-end Z to update hierarchical Z? Simply flag any shader which messes with it as producing a transparent surface ... depending on how frequent they are you might still want to loop the results back, but I imagine for most shaders you can determine Z before the pixel shader is done with them.
 
Why would you wait for back-end Z to update hierarchical Z? Simply flag any shader which messes with it as producing a transparent surface ... depending on how frequent they are you might still want to loop the results back, but I imagine for most shaders you can determine Z before the pixel shader is done with them.

The worst-case would be the updates from the RBE, which is what I was saying when I stated that it was the cap.
Broadcasts from earlier in the process to both H-Z blocks could be done, but for any corner cases where they fail the RBE update would be the longest time expected.
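A toy version of that decision rule: if the shader writes depth (or discards/alpha-tests), the final Z is only known after shading and H-Z has to wait for the RBE (the worst-case cap above); otherwise Z is known at rasterization and can update H-Z early. Field names here are assumptions for illustration:

```python
# Toy version of the rule discussed above: decide whether a draw's depth
# can update hierarchical-Z early (at rasterization) or only after the
# RBE has confirmed it. Field names are assumptions for illustration.

def hz_update_point(shader_writes_depth, shader_discards, alpha_tested):
    if shader_writes_depth or shader_discards or alpha_tested:
        # Final Z/coverage is only known after shading: wait for the RBE
        # (the worst case, i.e. the latency "cap" mentioned earlier).
        return "late (RBE feedback)"
    # Depth is fully determined by the triangle itself: safe to update
    # H-Z as soon as the rasterizer has interpolated it.
    return "early (rasterizer)"

print(hz_update_point(False, False, False))  # early
print(hz_update_point(True, False, False))   # late
```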
 