ELSA hints GT206 and GT212

trinibwoy · Feb 4, 2009

Arun, what's your basis for 1 SFU per SM? I don't get the benefit there.

When I first heard of GT215 I assumed it was needed to fill a large gap between GT216 and GT214. I don't see how confusing the competition comes into play. It's not like the competition knows what GT214 is gonna look like, and if they already do then obviously confusing them is the least of your worries!!

Lukfi · Feb 4, 2009

On the contrary, if the competition does know, it's necessary to confuse them by making them think you've changed the design

trinibwoy · Feb 4, 2009

Lukfi said:
On the contrary, if the competition does know, it's necessary to confuse them by making them think you've changed the design

Well I meant that if they have the means to know in the first place, via those same means they will see through any diversionary tactics too

Arun · Feb 4, 2009

trinibwoy said:
Arun, what's your basis for 1 SFU per SM? I don't get the benefit there.

Die size?

Remember the SFU is really the 'SFU/Interpolator/MUL' unit. Since in graphics only half the MUL can be used anyway for scheduling reasons, this would also reduce the waste on that front to zero, which is very efficient.

When I first heard of GT215 I assumed it was needed to fill a large gap between GT216 and GT214. I don't see how confusing the competition comes into play. It's not like the competition knows what GT214 is gonna look like, and if they already do then obviously confusing them is the least of your worries!!

If they don't have an 'educated' hunch at this point, there's a problem. Look at RV670: NV knew it was 12 TMUs. And it was for some time; then AMD changed it, and NV never knew about it before it was much too late.

trinibwoy · Feb 4, 2009

Well it's nice to know what the competition is up to but how does that change anything? You should always strive to extract the most performance out of a given transistor budget. Your ability to do so isn't impacted in any way by what the competition is doing. That's exactly what ATi did with RV770 and it worked out great for them.

Arun · Feb 4, 2009

trinibwoy said:
Well it's nice to know what the competition is up to but how does that change anything? You should always strive to extract the most performance out of a given transistor budget. Your ability to do so isn't impacted in any way by what the competition is doing. That's exactly what ATi did with RV770 and it worked out great for them.

No, but it can help you know which chips to prioritize if they're on roughly the same schedule and, very very importantly, it helps you tremendously to manage your inventory situation. NVIDIA's G80 inventory surplus in late 2007 was because they thought they could keep selling 8800 GT at $299 and wait longer to launch the 8800 GTS 512MB. We all know how that turned out, but better competitive info would have helped them save a lot of money.

It also helps plan SKUs, since those can be changed quite a bit late into the design cycle, but not so late either that competitive info isn't useful. And when you get your info really early and if your team is really dynamic, you can even change your chip a tiny bit, but that isn't really the point generally.

Jawed · Feb 4, 2009

DegustatoR said:
As for the competition between GT300 and an RV870-based AFR top-end from AMD, i'm not so sure that you have to have the same bandwidth on the one-chip top-end to be able to counter an AFR system which is quite ineffective in its memory usage.

Agreed.

A smarter way is to have less costly solution with the same performance -- and 384-bit GDDR5 might do the trick here.

I think that's a step too far though, as GTX280 is outclassed by HD4850X2 - they're both using roughly the same grade of GDDR3, 512-bit versus 2x256-bit, 141.7 against 127.1GB/s.
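
As a rough cross-check of those figures (a sketch, not something from the original post): peak bandwidth is bus width in bytes times the per-pin data rate, so the quoted numbers can be inverted to see what memory speed they imply. The function name is illustrative, and the 2x256-bit setup is simply summed into one 512-bit aggregate, which is the optimistic view of AFR that DegustatoR questions above.

```python
def implied_data_rate_gtps(bandwidth_gb_s: float, bus_width_bits: int) -> float:
    """Per-pin data rate (GT/s) implied by a peak bandwidth and a total bus width."""
    bytes_per_transfer = bus_width_bits / 8
    return bandwidth_gb_s / bytes_per_transfer


# Bandwidth and bus-width figures are the ones quoted in the post.
gtx280 = implied_data_rate_gtps(141.7, 512)        # one 512-bit bus
hd4850x2 = implied_data_rate_gtps(127.1, 2 * 256)  # two 256-bit buses, summed

print(f"GTX280:   {gtx280:.2f} GT/s per pin")
print(f"HD4850X2: {hd4850x2:.2f} GT/s per pin")
print(f"HD4850X2 aggregate bandwidth relative to GTX280: {127.1 / 141.7:.0%}")
```

Both work out to roughly 2 GT/s per pin, which fits the "roughly the same grade of GDDR3" remark.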

Jawed

DegustatoR · Feb 4, 2009

Jawed said:
I think that's a step too far though, as GTX280 is outclassed by HD4850X2 - they're both using roughly the same grade of GDDR3, 512-bit versus 2x256-bit, 141.7 against 127.1GB/s.

I'm not sure that GT200(b) vs RV770 situation is a typical one. GT200 is too slow for its die size and not because of bandwidth shortage but more because of inefficient design (especially for MSAA 8x) and low transistor density.
While RV870 may be more or less RV770+DX11+more units (not much left to do there actually since it's already 10.1 and has a tessellator), GT300 supposedly has a massively updated architecture (if it doesn't then i'll be wondering where all that R&D money went between G80 and GT300). So it's probably useless to try to guess GT300's performance from GT2xx numbers.

trinibwoy · Feb 4, 2009

Don't be so sure that AMD's DX11 architecture would resemble RV770 that closely. Who knows how they're going to rejig their shader core to better handle general computing workloads. They're definitely suffering in that regard at the moment in anything that isn't very coherent and multi-component to nicely fill those 5-way ALUs (it does quite well in FFTs, for example).

ninelven · Feb 4, 2009

Lukfi said:
I am pretty sure CarstenS was talking about the SP:TMU ratio, which is 2:1 on G8x/G9x, 3:1 on G200 and he expects it to be 4:1 on GT21x. Since you were quoting his words, I supposed (and probably everyone else did as well) you're talking about the same thing. I still don't see how the 6:1 number is relevant here...

He is saying he expects 32 SPs per TPC for GT21x (32/8 = 4:1). I simply said 32 or 40 does seem likely, but that I will be disappointed if GT3xx does not have 48 SPs per TPC (48/8 = 6:1). Clear enough for you?
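
Putting the ratios from this exchange side by side, as a quick sketch. The 8 TMUs per TPC is the divisor used in the arithmetic above; 16 and 24 SPs per TPC for G8x/G9x and GT200 are the shipping configurations, while the GT21x and GT3xx rows are only the guesses being discussed, and the labels are illustrative.

```python
TMUS_PER_TPC = 8  # divisor used in the post (32/8, 48/8)

sps_per_tpc = {
    "G8x/G9x": 16,                  # shipping: 2 SMs x 8 SPs
    "GT200": 24,                    # shipping: 3 SMs x 8 SPs
    "GT21x (CarstenS guess)": 32,   # speculation from the thread
    "GT21x (alternative)": 40,      # speculation from the thread
    "GT3xx (ninelven hope)": 48,    # speculation from the thread
}

# SP:TMU ratio per texture processing cluster
ratios = {chip: sps / TMUS_PER_TPC for chip, sps in sps_per_tpc.items()}

for chip, ratio in ratios.items():
    print(f"{chip:24s} {ratio:g}:1")
```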

Jawed · Feb 4, 2009

DegustatoR said:
I'm not sure that GT200(b) vs RV770 situation is a typical one. GT200 is too slow for its die size and not because of bandwidth shortage but more because of inefficient design (especially for MSAA 8x) and low transistor density.

Yeah, the ROPs are in dire need of an overhaul - been saying it for years now.

Jawed

trinibwoy · Feb 4, 2009

ninelven said:
He is saying he expects 32 SPs per TPC for GT21x (32/8 = 4:1). I simply said 32 or 40 does seem likely, but that I will be disappointed if GT3xx does not have 48 SPs per TPC (48/8 = 6:1). Clear enough for you?

Disappointed from a technical or practical sense? Are games sufficiently shader bound to warrant that kind of increase in the ratio?

btw - looks like Hardware-info has put up a nice little table with KonKort's GT218 info. So basically a 9400GT with twice the shaders.

ninelven · Feb 4, 2009

trinibwoy said:
Disappointed from a technical or practical sense? Are games sufficiently shader bound to warrant that kind of increase in the ratio?

Both really.... For the high-end I would say yes it is warranted... 2560x1600 and 1920x1200 are quite demanding. Even then, it would be a less severe ratio than what AMD is currently using.

Jawed · Feb 4, 2009

trinibwoy said:
Don't be so sure that AMD's DX11 architecture would resemble RV770 that closely. Who knows how they're going to rejig their shader core to better handle general computing workloads. They're definitely suffering in that regard at the moment in anything that isn't very coherent and multi-component to nicely fill those 5-way ALUs (it does quite well in FFTs, for example).

I seriously doubt AMD will be changing the 5-lane configuration any time soon.

Comparing performance per mm2:

http://forum.beyond3d.com/showpost.php?p=1260895&postcount=67
http://forum.beyond3d.com/showpost.php?p=1260970&postcount=69

HD4870 as a percentage of GTX285, both on 55nm:

float MAD serial - 68%
float4 MAD parallel - 327%
float SQRT serial - 265%
Float 5-inst. Issue - 287%
int MAD serial - 164%
int4 MAD parallel - 335%

Then there's double-precision.

I'm certainly intrigued to find out if LDS/GDS are enough for "high performance" in D3D11 CS and OpenCL shared memory. I suspect more work's needed, but since we know so damn little about these things...

And dynamic branching performance is still an open question when comparing the two architectures, as there's so little data

---

Rather than increasing the MAD:MI ratio as Arun keeps suggesting, I think (somewhat idly, of course) NVidia would be better off just deleting MI entirely and doing those calculations (transcendentals and attribute interpolation) in software. It would reduce the register file bandwidth problems they have and remove a whole load of instruction dependency complexity from both the compiler and the scoreboards. Use the area saved to add SIMDs...

Jawed

trinibwoy · Feb 4, 2009

ninelven said:
Both really.... For the high-end I would say yes it is warranted... 2560x1600 and 1920x1200 are quite demanding. Even then, it would be a less severe ratio than what AMD is currently using.

I've always considered an increase in resolution to be a linear increase in both shading and texturing workload. But maybe you're right.
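
The linear-scaling assumption can be made concrete with pixel counts. This is only a sketch of that assumption (per-pixel ALU and texture work held constant), and the per-pixel cost numbers are invented placeholders rather than measurements from any game.

```python
def pixels(width: int, height: int) -> int:
    """Number of pixels in a frame of the given resolution."""
    return width * height


low_res = pixels(1920, 1200)
high_res = pixels(2560, 1600)
scale = high_res / low_res

# Hypothetical per-pixel costs, purely illustrative.
alu_ops_per_pixel = 300
tex_fetches_per_pixel = 12

# If both kinds of work scale with pixel count, their ratio does not move.
alu_tex_ratio_low = (alu_ops_per_pixel * low_res) / (tex_fetches_per_pixel * low_res)
alu_tex_ratio_high = (alu_ops_per_pixel * high_res) / (tex_fetches_per_pixel * high_res)

print(f"{low_res} -> {high_res} pixels (x{scale:.2f})")
print("ALU:TEX demand unchanged:", alu_tex_ratio_low == alu_tex_ratio_high)
```

Under that assumption the shader:texture demand is identical at both resolutions, so resolution alone would not argue for a higher SP:TMU ratio; something else about high-end workloads would have to shift the balance.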

Jawed said:
I seriously doubt AMD will be changing the 5-lane configuration any time soon.

Comparing performance per mm2:

http://forum.beyond3d.com/showpost.php?p=1260895&postcount=67
http://forum.beyond3d.com/showpost.php?p=1260970&postcount=69

HD4870 as a percentage of GTX285, both on 55nm:

float MAD serial - 68%
float4 MAD parallel - 327%
float SQRT serial - 265%
Float 5-inst. Issue - 287%
int MAD serial - 164%
int4 MAD parallel - 335%

Well a wide SIMD architecture is always going to look good in pure throughput tests like those. But what about more realistic workloads like these? And doesn't more general code have a lot more scalar dependencies by nature since it's not working against vectorized data as much as a typical 3D process would?

I'm certainly intrigued to find out if LDS/GDS are enough for "high performance" in D3D11 CS and OpenCL shared memory. I suspect more work's needed, but since we know so damn little about these things...

Well F@H performance seems to say that they're not enough....

And dynamic branching performance is still an open question when comparing the two architectures, as there's so little data

True, it's just that right now a branch costs AMD at least 5x what it costs Nvidia in terms of idle resources. That's gotta catch up to them at some point.

Lukfi · Feb 4, 2009

ninelven said:
He is saying he expects 32 SPs per TPC for GT21x (32/8 = 4:1). I simply said 32 or 40 does seem likely, but that I will be disappointed if GT3xx does not have 48 SPs per TPC (48/8 = 6:1). Clear enough for you?

It is now. Sorry for misunderstanding you the first time :embarassed:

=>Jawed & trinibwoy: Shouldn't ATI and nVidia focus on graphics in the first place, GPGPU in the second? Or did I miss the GPU transforming itself from "a graphics processor that can do some general computing by the way" into "a general purpose processor that can do graphics by the way"?

Arun · Feb 4, 2009

Jawed said:
Rather than increasing the MAD:MI ratio as Arun keeps suggesting, I think (somewhat idly, of course) NVidia would be better off just deleting MI entirely and doing those calculations (transcendentals and attribute interpolation) in software.

Have you seen the Larrabee graphs indicating how much of the processing power would go to interpolation (and I think that included transcendentals?) - it's terrifying. Something like 25% of the entire workload... So I'm not really convinced that's an option!

We discussed that combo SFU/Interpolator patent a lot way back in the day, and I really don't think it's easy to do that with sufficient quality and low enough cost without a dedicated unit.

However...

It would reduce the register file bandwidth problems they have and remove a whole load of instruction dependency complexity from both the compiler and the scoreboards. Use the area saved to add SIMDs...

Remember the register bandwidth problem isn't related to interpolation or SFU. In that case, it's just fine; that's what it was designed for! The problem is for the MUL which requires *two* register reads, instead of 0 (!!) for interpolations and theoretically 1 for the SFU.

An argument could easily be made for the removal of the MUL from the SFU/Interpolation unit, offloading that to the main ALU. Whether that is actually an ideal use of resources given the overhead of a programmable processor, I'm not sure. It depends on how expensive tricks to somehow still expose that unit are, and I have no idea there.

Jawed · Feb 4, 2009

trinibwoy said:
Well a wide SIMD architecture is always going to look good in pure throughput tests like those. But what about more realistic workloads like these?

Which of those is an ALU-specific test? I know 3DMark06 Perlin Noise is ALU-bound (just about).

And doesn't more general code have a lot more scalar dependencies by nature since it's not working against vectorized data as much as a typical 3D process would?

HD4870 can't get any slower than the serial MAD test I linked, i.e. 68% performance per mm2 or 37% of the absolute performance of GTX285.

As to the "nature" of more general code, the issue is really about the memory system. Some general code is so compute bound it barely uses any kind of memory resources, either video RAM or on-die shared RAM - just registers, basically. That code will be quite happy in naive scalar form.

But any time bandwidth/latency are part of performance you have to forget about programming a scalar machine in purely scalar terms. You're now programming a vector memory architecture. Gathers should be maximally coherent, you don't want to induce waterfalls in register/constant fetches and the memory system needs nice aligned operations to maximise memory controller and cache performance.

The SIMDness of the GPU, the 32-wide batches, is simply not enough to save you. By vectorising your use of data you're naturally making it work well on a vector GPU. It's why texturing is in quads, because the cost of not doing so is terrible.

Well F@H performance seems to say that they're not enough....

Eh? Until AMD re-writes the core to use LDS/GDS, F@H tells us precisely nothing.

True, it's just that right now a branch costs AMD at least 5x what it costs Nvidia in terms of idle resources. That's gotta catch up to them at some point.

I think large batch sizes are a far more pressing problem. Oh, by the way, I've realised that the batch size of RV770 is really double what I've been thinking. Because a pair of batches runs together in the ALUs in AAAABBBBAAAABBBB etc., any incoherency in either batch kills the other batch, too.
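
To put a rough number on why batch size matters, here is a toy model. It assumes every thread takes a branch independently with the same probability, which real pixel or compute workloads do not do (neighbouring threads are usually correlated), so treat it as a pessimistic bound rather than a prediction. The batch sizes are 32 for the NVidia case mentioned above, 64 for an RV770 batch, and 128 for the paired-batch case described in this post.

```python
def p_coherent(batch_size: int, p_taken: float) -> float:
    """Probability that every thread in a batch goes the same way at a branch,
    assuming threads decide independently (a strong simplification)."""
    return p_taken ** batch_size + (1.0 - p_taken) ** batch_size


for batch in (32, 64, 128):
    p = p_coherent(batch, 0.99)  # 99% of threads take the branch
    print(f"batch {batch:3d}: coherent {p:.1%}, both paths executed {1 - p:.1%}")
```

Even with 99% of threads agreeing, the chance that a whole batch stays coherent falls from about 72% at 32 threads to about 28% at 128 in this model, which is the sense in which doubling the effective batch size hurts.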

---

If GTX285's ALUs are ~25% of the die, that's about 118mm2. Meanwhile HD4870's ALUs are ~30% of the die, about 77mm2.

So a purely ALU-based comparison of performance per mm2 for HD4870 against GTX285:

float MAD serial - 57%
float4 MAD parallel - 273%
float SQRT serial - 221%
Float 5-inst. Issue - 239%
int MAD serial - 137%
int4 MAD parallel - 279%

Worst case, AMD's ALUs are 76% bigger than NVidia's when running serial scalar code. Most of the time they're effectively 50% of the size in terms of performance per mm2.
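
The arithmetic behind that conversion, as a sketch: rescaling from whole-die area to ALU-only area just multiplies each whole-die figure by the ratio of the two ALU area fractions. The ~25% and ~30% fractions and the 118/77 mm2 areas are the estimates above; the variable names are illustrative.

```python
# HD4870 as a percentage of GTX285, performance per mm2 of whole die
# (the figures quoted earlier in the thread).
whole_die_pct = {
    "float MAD serial": 68,
    "float4 MAD parallel": 327,
    "float SQRT serial": 265,
    "Float 5-inst. Issue": 287,
    "int MAD serial": 164,
    "int4 MAD parallel": 335,
}

GTX285_ALU_FRACTION, GTX285_ALU_MM2 = 0.25, 118  # estimates from the post
HD4870_ALU_FRACTION, HD4870_ALU_MM2 = 0.30, 77   # estimates from the post

# perf per ALU-mm2 = perf per die-mm2 * (GTX285 ALU fraction / HD4870 ALU fraction)
scale = GTX285_ALU_FRACTION / HD4870_ALU_FRACTION
alu_only_pct = {test: pct * scale for test, pct in whole_die_pct.items()}

# Die sizes implied by the area estimates, and the absolute serial MAD figure.
gtx285_die_mm2 = GTX285_ALU_MM2 / GTX285_ALU_FRACTION
hd4870_die_mm2 = HD4870_ALU_MM2 / HD4870_ALU_FRACTION
absolute_serial_pct = whole_die_pct["float MAD serial"] * hd4870_die_mm2 / gtx285_die_mm2

# "How much bigger" AMD's ALUs effectively are in the worst case.
worst_case_penalty = 100 / alu_only_pct["float MAD serial"] - 1

for test, pct in alu_only_pct.items():
    print(f"{test:20s} {pct:6.1f}%")
print(f"absolute serial MAD: {absolute_serial_pct:.0f}%, worst case: +{worst_case_penalty:.0%}")
```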

Jawed

Jawed · Feb 5, 2009

Lukfi said:
=>Jawed & trinibwoy: Shouldn't ATI and nVidia focus on graphics in the first place, GPGPU in the second? Or did I miss the GPU transforming itself from "a graphics processor that can do some general computing by the way" into "a general purpose processor that can do graphics by the way"?

GPUs as we know them are in a losing race with things like Larrabee. Working out where the losing line falls is the fun bit

2012?

Meanwhile I'm hoping that the irregular data-structures and read-modify-write pixel shader functionality of D3D11 will force GPUs to rapidly junk a load of fixed function hardware: let's get rid of colour operations in the ROPs, pretty please. I admit, I dunno how much space that'd save (or how many extra FLOPs the GPU would gain using that space), but the infrastructure requirements, i.e. caching and data-paths that reach further into the GPU, should speed up the generalisation of GPUs.

Jawed

Jawed · Feb 5, 2009

Arun said:
Have you seen the Larrabee graphs indicating how much of the processing power would go to interpolation (and I think that included transcendentals?) - it's terrifying. Something like 25% of the entire workload...

Are you referring to the "Pixel Setup" data point in figures 13 and 14 in the Siggraph paper? That's about 10%.

And, I'm still looking, high and low, for any sign of a transcendental ALU in Larrabee. I'm assuming the Pixel Setup cost is running on a simulation of Larrabee's vector ALU without any dedicated interpolation/transcendental functionality.

So I'm not really convinced that's an option! We discussed that combo SFU/Interpolator patent a lot way back in the day, and I really don't think it's easy to do that with sufficient quality and low enough cost without a dedicated unit.

I should go digging through Nick's description of his software renderer to see what he said about this stuff running on CPUs - but not tonight...

However...
Remember the register bandwidth problem isn't related to interpolation or SFU. In that case, it's just fine; that's what it was designed for! The problem is for the MUL which requires *two* register reads, instead of 0 (!!) for interpolations and theoretically 1 for the SFU.

Well, forgetting the register file for a second, all ALU operands have to come through the operand collectors, whether they're from the register file, shared memory, the constant cache, video memory or attribute parameter buffer.

Regardless, the operand collector is still bigger simply to deal with the increased bandwidth of a MAD+MI configuration.

An argument could easily be made for the removal of the MUL from the SFU/Interpolation unit, offloading that to the main ALU. Whether that is actually an ideal use of resources given the overhead of a programmable processor, I'm not sure. It depends on how expensive tricks to somehow still expose that unit are, and I have no idea there.

Really the argument comes down to how often is graphics bottlenecked on interpolation or transcendental operations. Currently NVidia has a 4:1 MAD:SF ratio - you're proposing an 8:1 ratio. ATI's ratio is lower since interpolation has dedicated ALUs, while transcendentals are 1/4 rate.
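
One way to frame that question, as a back-of-the-envelope sketch: if MAD and SF instructions issue to separate units and overlap perfectly (an idealised assumption that ignores interpolation, the extra MUL and real scheduling limits), the SF unit only becomes the bottleneck once SF instructions exceed a fixed share of the instruction mix. The 8 MAD lanes per SM with 2 or 1 SF lanes correspond to the 4:1 and 8:1 ratios above; the 85/15 instruction mix is a made-up example.

```python
def sf_bottleneck_threshold(mad_lanes: int, sf_lanes: int) -> float:
    """Share of SF instructions above which the SF unit limits throughput,
    assuming MAD and SF work overlaps perfectly."""
    return sf_lanes / (mad_lanes + sf_lanes)


def cycles(mad_ops: int, sf_ops: int, mad_lanes: int, sf_lanes: int) -> float:
    """Idealised cycle count: whichever unit takes longer sets the pace."""
    return max(mad_ops / mad_lanes, sf_ops / sf_lanes)


current = sf_bottleneck_threshold(8, 2)   # 4:1 MAD:SF
proposed = sf_bottleneck_threshold(8, 1)  # 8:1 MAD:SF

print(f"4:1 ratio: SF-bound above {current:.1%} SF instructions")
print(f"8:1 ratio: SF-bound above {proposed:.1%} SF instructions")

# Hypothetical mix of 85 MAD and 15 SF operations.
print("4:1 cycles:", cycles(85, 15, 8, 2))
print("8:1 cycles:", cycles(85, 15, 8, 1))
```

In this model a shader with 15% SF instructions is still MAD-bound at 4:1 but becomes SF-bound at 8:1, so the proposal only costs performance for workloads above roughly 11% SF instructions.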

The way I see it both are legacies of GPU history: accelerated interpolation was a key part of getting texturing to work when most rendering cycles were texturing bottlenecked, and fast transcendentals were needed to get vertex shading at decent speeds (especially given how few vertex pipes there were). I wonder if there's any analysis of this stuff out there?

http://www.crhc.illinois.edu/TechReports/2008reports/08-2208.visarch.pdf

Will read properly tomorrow.

Jawed
