AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 lineup of GPUs?

  • Within 1 or 2 weeks: 1 vote (0.6%)
  • Within a month: 5 votes (3.2%)
  • Within a couple of months: 28 votes (18.1%)
  • Very late this year: 52 votes (33.5%)
  • Not until next year: 69 votes (44.5%)

  • Total voters: 155
  • Poll closed.
neliz, any more bench leaks?

Here you go:

fullscreencapture921200.jpg
 
Fuad is as neutral as Charlie is. If you can't see Fuad's slant in his last five articles, then no convincing by me can make you see it now. Let's agree to disagree on this topic; others have well and truly seen Fuad's colors, even if they had also mistakenly thought he was "neutral". :rolleyes:

As for the last article, outside of what I already stated, I don't see how his views were anything else. Also, if you read his previous articles, he is stating what I said: the new low- to mid-range cards are going to be more than the previous high-end cards, but they won't perform as well overall.

As for the one about the GT300 being faster, I saw nothing in there that was outright down on ATi cards, other than him stating one is faster than the other. Not only that, his article was just passing along info, nothing expounded upon or anything of the sort.

His DX11 article, which stated that ATi is pushing DX11, seems pretty neutral.

And that's when I stopped reading your posts.

Fudo/Charlie are fanbois to a ridiculous degree; if you do anything but read them for humor value, then I pity you.


Easy enough; we just have to wait and see which of the two actually turns out to be true :LOL:
 
Crysis is juicy and the 'tards on the net can shut up now. But f***. 3*2.76GFLOPs. Wow.



(They are retards for using the base config, that's all. ;) )
 
Finally a straight 4870 / 5870 comparison:

hd5870vs4870b.jpg


Unfortunately no information about resolution, settings, and absolute frame rates.
 
Do you know him, or why are you insulting him like this? Fudo isn't an idiot, and I'm sure he's much nicer than you.

Nice logic there: first you imply that one must not judge someone one does not know personally, and then you actually judge me even though you don't know me personally.

So before you talk about someone else like this, make sure to look in a mirror!

That's actually good advice, and I have to admit I don't try too hard to be nice.
 
The anisotropic filtering still lacks samples on RV870 :(

ATI still gets away with this stuff after years...

I'm sorry?

AF lacking samples?
If you're talking about AA, you are grossly mistaken and will be pleasantly surprised.

---------------------------


Why translate when it's in English?

http://channel.hexus.net/content/item.php?item=20332

Further helpings of sauce reveal that AMD AIB partners in EMEA, have (unbeknownst to each of their competitors) been thrilled by the size of their own portion of the pre-order pie. HEXUS was served up such sizzling quotes as "we can't believe the [high level of] demand", "I've never seen this in years" and "we could sell everything we can get our hands on".

And our inspection of the steamy kitchen even revealed one AMD chef, blinking in incredulity at their cookie of good fortune, who exclaimed "it's ********* amazing!"

Not wanting to waste a good metaphor, it looks like the Chez Radeon is fully booked for weeks to come. However, there still remains the small matter of serving up the dishes...
 
It's a tile-level rasterisation only. So very cheap in terms of rasterisation, but requires that all vertex shading that affects the position attribute is computed.

Slide 26 says front-end is ~10% of the entire compute effort.
I must have mixed some of Seiler's description and Forsyth's together mentally. I had thought the full amount of the rasterization remained with the minimal front-end solution.
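
To make the quoted idea concrete, here is a rough C++ sketch of a minimal front end, assuming a 64-pixel tile and "flimsy" bins that hold only triangle IDs; the names, sizes and layout are my own illustration, not Intel's actual rasterizer:

Code:
#include <algorithm>
#include <cstdint>
#include <vector>

struct Vec4 { float x, y, z, w; };

constexpr int kTileSize = 64;                 // assumed tile edge in pixels
constexpr int kScreenW  = 1920;
constexpr int kScreenH  = 1080;
constexpr int kTilesX   = (kScreenW + kTileSize - 1) / kTileSize;
constexpr int kTilesY   = (kScreenH + kTileSize - 1) / kTileSize;

// One bin per screen tile; a "flimsy" bin stores nothing but triangle IDs.
std::vector<uint32_t> bins[kTilesY][kTilesX];

// Only the position attribute has to be shaded to bin a triangle; every other
// attribute can wait until the back end. (Placeholder declaration.)
Vec4 shade_position_only(const Vec4& object_space_position);

// Tile-level "rasterisation": just walk the triangle's bounding box in tile
// units and append its ID to every bin it may touch. Exact pixel coverage is
// resolved later, per tile, in the back end.
void bin_triangle(uint32_t tri_id, const Vec4 screen_pos[3]) {
    float min_x = std::min({screen_pos[0].x, screen_pos[1].x, screen_pos[2].x});
    float max_x = std::max({screen_pos[0].x, screen_pos[1].x, screen_pos[2].x});
    float min_y = std::min({screen_pos[0].y, screen_pos[1].y, screen_pos[2].y});
    float max_y = std::max({screen_pos[0].y, screen_pos[1].y, screen_pos[2].y});

    int tx0 = std::clamp(static_cast<int>(min_x) / kTileSize, 0, kTilesX - 1);
    int tx1 = std::clamp(static_cast<int>(max_x) / kTileSize, 0, kTilesX - 1);
    int ty0 = std::clamp(static_cast<int>(min_y) / kTileSize, 0, kTilesY - 1);
    int ty1 = std::clamp(static_cast<int>(max_y) / kTileSize, 0, kTilesY - 1);

    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx)
            bins[ty][tx].push_back(tri_id);
}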

The amount of data in a bin varies, you've described a heavy-weight bin. A flimsy bin with nothing more than triangle IDs would be cheap in a multi-chip solution.
That would be what I was commenting about when I mentioned the lower range.
There would be additional costs in the case of tessellation spawning new triangles that might worsen bin spread, I would think.

It would also dictate a specific style of front-end.
Programmers could not opt to have a heavy front-end in the multi-chip case, and they may find themselves constrained by this.
Running the front end on both chips would possibly allow this choice to remain.
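
The flimsy versus heavy-weight distinction is mostly about bytes per binned triangle crossing the interconnect. A hypothetical sketch of the two record layouts (the fields are my assumptions, not the paper's actual bin format):

Code:
#include <cstdint>
#include <cstdio>

// Flimsy: the bin only names the triangle; the back end re-fetches whatever
// else it needs, so very few bytes per entry have to move between chips.
struct FlimsyBinRecord {
    uint32_t triangle_id;                   // 4 bytes per binned triangle
};

// Heavy-weight: post-transform positions plus some interpolants travel with
// the triangle, so the back end touches no remote vertex data, at the cost of
// far more bytes written (and possibly shipped) per bin entry.
struct HeavyBinRecord {
    uint32_t triangle_id;
    float    clip_pos[3][4];                // 3 vertices * xyzw
    float    interpolants[3][8];            // e.g. texcoords + colour per vertex
};

int main() {
    std::printf("flimsy: %zu bytes, heavy: %zu bytes per binned triangle\n",
                sizeof(FlimsyBinRecord), sizeof(HeavyBinRecord));
}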

This would make the back-end more compute-heavy, which would hide some of the latency associated with NUMA.
Yes and no.
It depends on just how naive the solution is.
Worst-case, a lot of the necessary data resources do not have local copies on the second chip, and then the entire thing is throttled by the interconnect.
The way back-end cores eagerly snatch up the next available tile was not described as taking into account any locality.
This is probably fine in the single-chip case since it's all the same memory controllers and ring bus.
It can incur additional costs if this causes self-assignment to hop chips.
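
A toy contrast between that locality-blind grab and a NUMA-aware variant, assuming two chips and a contiguous slice of tiles per chip (my assumption, not anything described for Larrabee):

Code:
#include <atomic>

constexpr int kNumTiles = 510;              // e.g. 30 x 17 screen tiles
constexpr int kNumChips = 2;

// Naive: every back-end core on both chips bumps one shared counter, so about
// half of the self-assignments land on bins living in the other chip's DRAM.
std::atomic<int> g_next_tile{0};

int grab_tile_naive() {
    int t = g_next_tile.fetch_add(1, std::memory_order_relaxed);
    return (t < kNumTiles) ? t : -1;        // -1: no work left
}

// NUMA-aware: each chip owns a slice of the tiles and drains it first, only
// stealing remote tiles once its own slice is empty.
std::atomic<int> g_next_local[kNumChips] = {{0}, {kNumTiles / 2}};
const int g_slice_end[kNumChips] = {kNumTiles / 2, kNumTiles};

int grab_tile_numa(int chip) {
    int t = g_next_local[chip].fetch_add(1, std::memory_order_relaxed);
    if (t < g_slice_end[chip]) return t;    // local tile: local bins, local DRAM
    int other = chip ^ 1;                   // local slice drained: steal remotely
    t = g_next_local[other].fetch_add(1, std::memory_order_relaxed);
    return (t < g_slice_end[other]) ? t : -1;
}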

Consumption of vertex data is basically a streaming problem, i.e. quite latency tolerant, if you have some decent buffers.
Would these buffers be in local memory per-chip or in the L2 caches?

An additional concern is that this only turns into a streaming problem once the remote cores are aware that additional vertex work is available. (edit for clarity: so we must factor in synchronization costs that do not exist otherwise)
The most latency-tolerant methods would lean most heavily on bandwidth and buffers to work their magic, and we have an unknown ceiling in interconnect bandwidth. It's probably safe to assume interconnect bandwidth << DRAM bandwidth.
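
As a hedged illustration of leaning on buffers for latency tolerance, a minimal double-buffering sketch; fetch_remote_chunk and consume_chunk are stand-ins I've invented for whatever the platform actually provides:

Code:
#include <cstddef>
#include <future>
#include <vector>

using Chunk = std::vector<float>;

// Stand-ins for platform mechanisms (invented for this sketch): one pulls a
// chunk of vertex data from the other chip's DRAM pool, the other does the
// local shading work on it.
Chunk fetch_remote_chunk(std::size_t index) { return Chunk(16384, float(index)); }
void  consume_chunk(const Chunk&)           { /* local compute happens here */ }

void stream_vertices(std::size_t num_chunks) {
    if (num_chunks == 0) return;
    // Prime the pipeline: start fetching chunk 0 before consuming anything.
    std::future<Chunk> in_flight =
        std::async(std::launch::async, fetch_remote_chunk, std::size_t{0});

    for (std::size_t i = 0; i < num_chunks; ++i) {
        Chunk current = in_flight.get();     // stalls only if the link cannot
                                             // keep up (link BW << DRAM BW)
        if (i + 1 < num_chunks)              // overlap the next transfer with
            in_flight = std::async(std::launch::async,   // this chunk's compute
                                   fetch_remote_chunk, i + 1);
        consume_chunk(current);
    }
}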

Vertex data, due to the connectivity of triangles, strips, etc., never neatly fits precisely into cache lines, so the best approach is just to read big-ish chunks rather than individual vertices/triangles.
I figured alignment and cache-line issues as falling within the noise.
I think reading from a bin in memory would be done in chunks anyway.

So two chips (conventional or Larrabee) consuming from a common stream are going to be slightly more wasteful in this regard - this is similar to the wastages that occur with different vertex orders in PTVC.
I have my doubts about how similar they can be.
Any low-level event that leads to waste in the multi-chip case is something that, even if rare, is in my figuring likely to be 2-10x as expensive to handle as a waste problem that stays local.
This is why I am reluctant to assume that something that is 10% of the load in a single-chip scenario stays at 10% with multiple chips.
Assigning a PrimSet to a core can take possibly tens of cycles with one chip.
Assigning triangles to bins can take place at the full bandwidth of the chip DRAM bus.
A back-end thread detecting that a bin is ready would take tens of cycles, and reading bin contents in the back end can use full bandwidth.

About half the time with multi-chip situations, assuming completely naive assignment, these assumptions will not be true.


But Larrabee can run multiple render states in parallel. So most trivially you can have the two chips working independently. Whether two successive render states are working with the same vertex inputs (e.g. shadow buffer passes, one per light?) or whether they're independent vertex inputs, the wastage is down purely to NUMA effects.
This would come down to the complexity and amount of independent nodes in the dependency graph.
On the scale of granularity where per-frame barriers are coarsest and per-pixel or per-fragment is finest, coordinating at a render state might be medium-to-coarse synchronization. It would be additional synchronization, and this would be a performance penalty even with one chip.

Some of this may be unavoidable regardless of scheme used, as those buffers eventually have to be used and so much of that data will need to cross the interconnect.
It might be that this can safely be done by demand streaming to each chip, or perhaps a local copy of each buffer will exist in each memory pool.

Bin spread should fall if flimsy bins are used, since the tiles can be larger (which reduces bin spread).

The bin sets are stored in main memory, though.
The Seiler paper posited the number of color channels and format precision as the factors in deciding tile size.
A bin's contents could be streamed from a chip's DRAM pool based on demand, so why would this impact the tile size?
Granted, if for some reason the bin were on the other chip's memory pool due to a non-NUMA-aware setup, costs in latency, bandwidth or memory buffering would be higher.
Actually, if the scheme is that naive, it wouldn't know to add additional buffering and the chips would just stall a lot.

I don't understand how you get double.
I'm talking about running the exact same front-end setup process on both chips simultaneously, so each operation would be performed twice. There would be two bin structures, one per chip, and each chip would get portions of the screen space so that each chip can avoid writing triangle updates for non-local bins, and cores would not try to work with data from the other chip's DRAM pool unless absolutely necessary.
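
A bare-bones sketch of that duplicated-front-end arrangement, splitting the screen into tile rows per chip; the structures and the row split are assumptions of mine, purely to illustrate the idea:

Code:
#include <cstdint>
#include <vector>

constexpr int kTilesX = 30, kTilesY = 17;

// A triangle after the (duplicated) transform step, reduced to its tile extent.
struct ScreenTri {
    uint32_t id;
    int tile_min_x, tile_max_x, tile_min_y, tile_max_y;
};

struct ChipLocalBins {
    int first_row, last_row;                        // this chip's slice of the screen
    std::vector<uint32_t> bins[kTilesY][kTilesX];   // only rows in the slice are used
};

// The same loop runs on both chips over the same triangle list: the transform
// work is duplicated, but each chip writes only its own (local) bins, so no
// front-end bin traffic has to cross the interconnect.
void front_end_on_chip(ChipLocalBins& local, const std::vector<ScreenTri>& tris) {
    for (const ScreenTri& t : tris) {
        for (int ty = t.tile_min_y; ty <= t.tile_max_y; ++ty) {
            if (ty < local.first_row || ty > local.last_row)
                continue;                           // the other chip bins this row itself
            for (int tx = t.tile_min_x; tx <= t.tile_max_x; ++tx)
                local.bins[ty][tx].push_back(t.id);
        }
    }
}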

I don't understand what you mean by PrimSet distribution. Each PrimSet can run independently on any core. The data each produces is a stream of bins. They consume vertex streams and, if they already exist, render target tiles.

Some scheduler, somewhere, must then assign bin sets to cores.
This is what I was talking about.
The control core that assigns PrimSets to other cores can do so in a manner that fits well with a possible implementation of the ring-bus.
The other cores don't magically become aware of the assignment without some kind of ring-bus transaction.
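
As a toy model of just that last point (a mutex-guarded mailbox standing in for a ring-bus message; nothing here is the real protocol), the assignment has to arrive as an explicit transaction before a worker core can act on it:

Code:
#include <cstdint>
#include <deque>
#include <mutex>
#include <optional>

struct Mailbox {
    std::mutex m;
    std::deque<uint32_t> primset_ids;        // assignments waiting for this core
};

// Control-core side: each push is one "transaction" on the interconnect, and
// it is not free even on a single chip.
void assign_primset(Mailbox& core_mailbox, uint32_t primset_id) {
    std::lock_guard<std::mutex> lock(core_mailbox.m);
    core_mailbox.primset_ids.push_back(primset_id);
}

// Worker-core side: noticing the assignment costs another transaction (a poll
// or an interrupt); across chips, both legs get more expensive.
std::optional<uint32_t> next_primset(Mailbox& my_mailbox) {
    std::lock_guard<std::mutex> lock(my_mailbox.m);
    if (my_mailbox.primset_ids.empty()) return std::nullopt;
    uint32_t id = my_mailbox.primset_ids.front();
    my_mailbox.primset_ids.pop_front();
    return id;
}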

Overall, though, I would expect that a memory-bandwidth:link-bandwidth ratio of X would serve Larrabee better than traditional GPUs. You have a huge amount of programmer freedom with Larrabee to account for the vicissitudes of NUMA.

I've been discussing Intel's own software rasterizer, though. I'm not clear on just how much can be changed by a programmer who isn't messing with Intel's driver and all that.
Sure, if a developer rolled their own solution they could do what they want.

I think that any multi-chip solution from Intel would include software that was modified so that as much work gets done locally as possible, even if at the cost of duplicated computation.
 
http://channel.hexus.net/content/item.php?item=20337&page=1

Our intel is that there will be between 50,000 and 100,000 HD 5800 series cards produced in the initial 40 nm TSMC run, with a ratio of around 4:1 in favour of the 5870 over the 5850. We also gather that the channel - as opposed to OEMs - is likely to only get around ten percent of this.

AMD deserves to be commended for the way it has wrested the graphics initiative away from NVIDIA in the past year and deserves all the rewards this should bring. But it seems to be staking a hell of a lot on one OEM by giving Dell so much of its initial allocation. Only time will reveal whether this gamble has paid off but AMD's graphics channel, other OEM partners and its consumers are unlikely to celebrate if it has once more sold its soul to Dell.
 
I'm sorry?

AF lacking samples?
If you're talking about AA, you are grossly mistaken and will be pleasantly surprised.

He means anisotropic filtering. It looks like there are no improvements to the filtering quality, and all these pictures of the new AF flower (D3D AF Tester) are a bad joke.
It's the same game with every new release: with synthetic tools and colored mipmaps we see amazing full-trilinear quality, but in real apps (games) it's still crap.
 
No, all tessellation is done on the GPU!
The CPU does very little work.

If you read the technical document you can find how it is done.
It is done with vertex morphing.
On my GTX280 I'm getting 400 million tessellated triangles a second.
It would be pretty much impossible to do this with the CPU, especially as the mesh is not a triangle strip.
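
For what it's worth, here is a minimal sketch of the general vertex-morphing idea in plain C++ (not the demo's actual shader, which I haven't seen): each vertex of a pre-refined mesh carries both its fine-LOD position and the position it collapses to at the coarser LOD, and a per-draw morph factor blends between them, so the CPU never rebuilds geometry.

Code:
#include <cstdio>
#include <initializer_list>

struct Vec3 { float x, y, z; };

// In a real implementation this runs in a vertex shader: both positions are
// vertex attributes and 'morph' is a per-draw constant.
Vec3 morph_vertex(const Vec3& fine_pos, const Vec3& coarse_pos, float morph) {
    return { coarse_pos.x + (fine_pos.x - coarse_pos.x) * morph,
             coarse_pos.y + (fine_pos.y - coarse_pos.y) * morph,
             coarse_pos.z + (fine_pos.z - coarse_pos.z) * morph };
}

int main() {
    Vec3 fine   {1.0f, 2.0f, 0.5f};          // fully tessellated position
    Vec3 coarse {1.0f, 1.0f, 0.0f};          // where it sits on the coarse mesh
    for (float m : {0.0f, 0.5f, 1.0f}) {     // morph factor ramps with distance/LOD
        Vec3 p = morph_vertex(fine, coarse, m);
        std::printf("morph %.1f -> (%.2f, %.2f, %.2f)\n", m, p.x, p.y, p.z);
    }
}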

So that means that tessellation is not a DX11-specific feature?
 
He means anisotropic filtering. It looks like there are no improvements to the filtering quality, and all these pictures of the new AF flower (D3D AF Tester) are a bad joke.
It's the same game with every new release: with synthetic tools and colored mipmaps we see amazing full-trilinear quality, but in real apps (games) it's still crap.

That's a bit of a simplistic assessment, to say the least.
 