Nvidia GT300 core: Speculation

Discussion in 'Architecture and Products' started by Shtal, Jul 20, 2008.

  1. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    I wasn't trying to argue that it's bad. I personally think it's a great thing (assuming the textures are actually analysed for quality beforehand). I've always enabled AI on my ATi cards. Performance increase with little or no visible differences, great.
    I wasn't even trying to argue that ATi is the only one who does this (although the ATi fanboys here of course assumed that for trolling's sake)...
    I was just pointing out that there are documented cases of this happening. One of the first occurrences was with 3DMark2000, I believe... we noticed that some video cards scored better on the fillrate tests than their theoretical specs allowed. Those video cards introduced texture compression, and further investigation showed that the driver forced texture compression on during the fillrate tests.

    Bottom line is just that if you design a benchmark where you assume a certain pixelformat, and the driver does something else, your bandwidth calculations will be inflated, and won't represent the actual hardware capabilities.
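
    A back-of-the-envelope sketch of that inflation (all numbers hypothetical, plain host-side C++): if the test is bandwidth-limited and the driver quietly swaps the assumed 32-bit texels for a roughly 4-bit-per-texel compressed format, converting the measured texel rate back to bandwidth with the original assumption reports far more bandwidth than the bus can actually deliver.

    Code:
    // Hypothetical numbers only: a bandwidth-limited fillrate test that assumes
    // uncompressed 32-bit texels while the driver silently samples a DXT1 copy.
    #include <cstdio>

    int main() {
        const double bus_bandwidth = 8.0e9; // bytes/s the bus can really deliver (assumed)
        const double assumed_bytes = 4.0;   // benchmark assumes 32-bit (4-byte) texels
        const double actual_bytes  = 0.5;   // DXT1 is 4 bits (0.5 bytes) per texel

        // Texel rate the card sustains when purely bandwidth-limited (inflated by compression).
        double measured_texel_rate = bus_bandwidth / actual_bytes;

        // The benchmark converts that rate back to "bandwidth" using its own assumption.
        double reported_bandwidth = measured_texel_rate * assumed_bytes;

        printf("real bus bandwidth : %.1f GB/s\n", bus_bandwidth / 1e9);
        printf("reported bandwidth : %.1f GB/s (%.0fx inflated)\n",
               reported_bandwidth / 1e9, reported_bandwidth / bus_bandwidth);
        return 0;
    }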
     
  2. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    16,987
    Likes Received:
    6,236
    The problem with that is, if it does it in all applications, then by not benchmarking its default behavior you are in fact not representing the real-world hardware capabilities.

    Granted, you wouldn't be representing the theoretical capabilities of the cards.

    The problem arises when you start significantly altering the quality of what's presented in a commonly accepted "bad" way.

    For example, altering an image by compressing textures such that the compression is noticeable is generally agreed to be bad.

    On the other hand, altering textures/edges through anti-aliasing (and transparency anti-aliasing) is generally considered good, even though it noticeably alters what is presented.

    I think the problem comes in when people misunderstand whether a benchmark is supposed to be gauging the "Real World" performance of a card in certain tests or whether it's supposed to be gauging the "Theoretical" performance of a card in certain tests.

    And then add in having to determine whether any optimizations are overall good or overall bad...

    If someone can make an optimization that is unnoticeable to the naked eye, then awesome. If someone can make an optimization that actually increases IQ (a touchy one, since it's subject to personal taste), then awesome. If someone makes an optimization that noticeably degrades quality, though... uh, yeah...

    Regards,
    SB
     
  3. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    That is exactly what started this discussion.
    Someone referred to benchmark results, claiming that ATi had nearly the same level of performance as nVidia with only about half the physical texturing hardware.
    Then someone else remarked that because of AI, you aren't measuring the physical hardware.
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I was under the impression that ATI does selective-texture angle-dependent optimisations. e.g. normal maps will have lower quality aniso than "uppermost" albedo textures.

    I hadn't heard of what you were talking about so was curious to hear more.

    Jawed
     
  5. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    AI does a number of things, including optimizations for texture filtering and shader replacement.
    [MOD: TONE IT DOWN]
    There's nothing secret about AI, since it can now be disabled by the user. Basically it's just the same stuff that was considered 'cheating' before the user had any choice. I believe it is documented somewhere, either in the control panel help, or on ATi's site.

    Your question wasn't focusing on the technique though, but rather on applications. So this explanation makes no sense whatsoever.
    Also, if you've never heard of it, you must not have been around very long. As I said, it's very old, going back to around the 3DMark2000 days.
     
    #505 Scali, Apr 10, 2009
    Last edited by a moderator: Apr 14, 2009
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    You said they've got a two-year lead, and that's why their GPU is bigger. Bizarre.

    Interestingly, as far as I can tell folding@home cannot use GPUs for all projects because precision is a problem.

    A recent change in the ATI client was made to improve the precision, which slows it down. Very vague - I'm not trying to imply that double-precision calculations are being done, merely that precision is an issue.

    Additionally:

    http://www.brightsideofnews.com/new...ome-meets-the-power-of-graphics.aspx?pageid=1

    Jawed
     
  7. I.S.T.

    Veteran

    Joined:
    Feb 21, 2004
    Messages:
    3,174
    Likes Received:
    389
    OK, this is false. He has slammed the R600 more than once in the last few pages. Why would an ATI/AMD fanboy do that? The R600 wasn't GeForce FX-level bad, and that was a chip so bad no one can deny it.
     
  8. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    No I didn't.
    You could at least have the decency and respect to properly read my posts and not misrepresent them.
    What I said was:
    "In nVidia's case, the chip is larger because of the implementation they chose... We are now about to find out if this implementation is going to pay off or not, in GPGPU tasks."
    Now unless you want to go down the dead-end line of argument that nVidia's implementation is equal to ATi's, what I said makes perfect sense.

    There will always be isolated cases. Doesn't mean the majority of GPGPU software needs to be double-precision... or even that the majority of calculations in software like folding@home need to be double-precision.
    If you need double-precision in some places, but only spend a few % of the total processing time in those places, it still isn't going to be a significant advantage for Larrabee.
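
    To put rough, made-up numbers on that, an Amdahl-style estimate: even if the DP part ran 8x faster on other hardware, spending only 5% of the time there limits the overall gain to about 5%.

    Code:
    // Amdahl-style sketch with made-up numbers: only a small slice of the
    // runtime is double-precision bound, so a big DP advantage barely shows.
    #include <cstdio>

    int main() {
        const double dp_fraction = 0.05; // assumed: 5% of runtime needs double precision
        const double dp_speedup  = 8.0;  // assumed: rival hardware runs that part 8x faster

        double overall = 1.0 / ((1.0 - dp_fraction) + dp_fraction / dp_speedup);
        printf("overall speedup: %.2fx\n", overall); // ~1.05x
        return 0;
    }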

    So my point still stands.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Knowing the applications I could have gone and searched. Back in the 3DMark2000 days I was playing Quake3 and Counter Strike. In fact, I still do :shock: :lol:

    I think about a year ago I ran it for the first time because I wanted to see what it looked like and the web videos looked like crap.

    Jawed
     
  10. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    Yes, and how about we get back on topic? If one wants to discuss optimizations present in current drivers, a new thread would be far more appropriate, wouldn't it?
     
  11. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Or reverse this in the context of OpenCL, and label NVidia's shared memory as a software-managed L1 cache. Both serialize bank conflicts on scatter/gather to this "cache". The difference to me is that LRB has an L2 backing, perhaps less non-cached memory bandwidth (guessing here on that) and less latency hiding, while NV has perhaps more memory bandwidth and better latency hiding, and is thus better in the non-cached, bandwidth-limited cases (IMO the more important ones). This situation is like SPU programming on the PS3: good SPU (software-managed cache) practices map well to non-SPU code (i.e. processors with a cache). Often, writing code as if you had a software-managed cache is ideal for a cached CPU (a hint as to why LRB has cache-line management).

    In fact, if you were really crazy you could do software rendering into OpenCL shared memory with 1/8 micro tiles (the DX11 CS/OpenCL 32KB shared memory is only 1/8 the size of Larrabee's L2) :wink:
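
    For illustration only (the kernel and tile size are made up, not from any shipping code), a minimal CUDA sketch of the "shared memory as software-managed cache" idea: stage a tile once and let the whole block reuse it, the same way an SPU programmer would DMA a tile into local store.

    Code:
    // Illustrative CUDA kernel: shared memory used as a software-managed cache.
    // Each block stages TILE elements (plus a 1-element halo) from global memory
    // once, then every thread reuses the staged data for its neighbourhood reads.
    #define TILE 256  // launch with blockDim.x == TILE

    __global__ void box_filter_1d(const float* in, float* out, int n)
    {
        __shared__ float tile[TILE + 2];               // staged tile + halo

        int gid = blockIdx.x * TILE + threadIdx.x;     // element this thread owns

        // Cooperative "DMA" into the software-managed cache.
        tile[threadIdx.x + 1] = (gid < n) ? in[gid] : 0.0f;
        if (threadIdx.x == 0)                          // left halo
            tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
        if (threadIdx.x == TILE - 1)                   // right halo
            tile[TILE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
        __syncthreads();

        // Compute entirely out of the tile: three shared reads, no extra global traffic.
        if (gid < n)
            out[gid] = (tile[threadIdx.x] + tile[threadIdx.x + 1] + tile[threadIdx.x + 2]) / 3.0f;
    }

    The same structure ports almost line for line to an OpenCL __local buffer, which is the point: the "cache" is whatever the software decides to stage.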
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I'm sorry, I did miss that you'd answered :oops:

    I agree - I was merely fleshing things out.

    Jawed
     
  13. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,435
    Likes Received:
    181
    Location:
    Chania
    Don't know if it has been mentioned before but Jon Olick's presentation at Siggraph08 had quite a few interesting points about next generation parallelism in games (after page 91):

    http://s08.idav.ucdavis.edu/olick-current-and-next-generation-parallelism-in-games.pdf

    Interesting dilemma: "duplicate the GPU into each core" or add a triangle sorting stage?

    Page 203/4 has an interesting performance prediction for next generation platforms.
     
  14. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,561
    Likes Received:
    601
    Location:
    New York
    Based on what metric? As far as GPGPU goes it's been a remarkable success compared to the non-existent competition.

    I'm not really sure why you think it's just bloat. What alternative architecture do you propose would have put them in a similar or better position than they are in today?

    I'm baffled as to how you can draw these conclusions with nothing to compare against. Where is AMD's scheduler-light architecture excelling, exactly?

    Yeah, that's unfortunate because that point renders these discussions moot. The fact that not even AMD has been able to produce something that highlights their architecture's strengths is pretty telling to me.

    I was referring to clause demarcation. That's done at compile time as well, no?

    Well I didn't mention VLIW. I thought we were talking about scheduling. And that doesn't answer the question. Where are the apps that prove out the viability of AMD's approach as a general compute solution? Is CUDA's success solely a function of Nvidia's dollar investment and marketing push or is there something to the technology too?
     
  15. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,561
    Likes Received:
    601
    Location:
    New York
    Yeah, I don't see a big difference here either. Larrabee doesn't get a free ride just because it has an L1. Only one cache line can be read per clock, so in order to avoid starving the ALUs, software is gonna have to manage data carefully to maximize aligned reads from "shared memory", whereas you get this for free with CUDA. Sure, Larrabee's L1/L2 will be bigger, but that's no guarantee at all that they'll be faster.
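
    As a concrete (and purely illustrative, period-appropriate) example of the kind of layout work that implies, here is the classic padded shared-memory transpose tile in CUDA: the +1 column keeps column-wise reads spread across the 16 banks instead of serializing on one, roughly the analogue of the alignment juggling Larrabee software would have to do against its cache lines.

    Code:
    // Illustrative only: 16x16 tile transpose for a square matrix whose side is a
    // multiple of TILE_DIM; launch with a 16x16 thread block. The +1 padding column
    // keeps the column-wise shared-memory reads free of bank conflicts.
    #define TILE_DIM 16

    __global__ void transpose_tile(const float* in, float* out, int width)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 column avoids bank conflicts

        int x = blockIdx.x * TILE_DIM + threadIdx.x;
        int y = blockIdx.y * TILE_DIM + threadIdx.y;

        // Coalesced row-wise load from global memory into the staged tile.
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        // Column-wise read out of shared memory; the padding spreads it across banks.
        int tx = blockIdx.y * TILE_DIM + threadIdx.x;
        int ty = blockIdx.x * TILE_DIM + threadIdx.y;
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
    }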
     
  16. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,435
    Likes Received:
    181
    Location:
    Chania
    Bigger compared to what?
     
  17. compres

    Regular

    Joined:
    Jun 16, 2003
    Messages:
    553
    Likes Received:
    3
    Location:
    Germany
    From what I have seen, double precision as a requirement is more the normal case than the isolated case. That's why x86/RISC SMP/MPI systems are still being used in spite of GPUs' higher throughput. Naturally there are other issues like the maturity of high-performance libraries and compilers, IEEE compliance, etc.

    DP is IMO a very strong advantage for ATI; the problem is the immaturity of CAL.
     
  18. bowman

    Newcomer

    Joined:
    Apr 24, 2008
    Messages:
    141
    Likes Received:
    0
    The little read-only caches in GT200 and RV770.
     
  19. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,435
    Likes Received:
    181
    Location:
    Chania
    I'm trying to bounce the debate back to GT3x0 ;)
     
  20. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,561
    Likes Received:
    601
    Location:
    New York
    Hehe, nice try.

    Well, LRB is gonna have a 32KB L1 and 256KB L2 per core, right? GT200 has 16KB of shared memory per multiprocessor. It's really unlikely that GT300 is gonna expand on that in a big way if it sticks to the current CUDA model.
     