32 GByte/sec with 128-bit DDR?

Design 2 will be better than Design 1 IF they are bandwidth limited. If they are fillrate limited, then in a case such as Doom 3 and stencil shadows, the extra bandwidth when not texturing will remain unused anyway.
 
pascal said:
Tons of untextured polygons? Are you talking about the multiple passes per pixel needed by some old hardware?
I'm talking about future games that make heavy use of the stencil buffer

edited: AFAIK the major problem Doom3 has is texture bandwidth. JC had to use compressed textures and disable anisotropic filtering on the GF3.
Splitting frame buffer and texture bandwidth, versus a unified scheme with an almost equal aggregate bandwidth, gives you bad performance in almost every possible situation, that's a fact.
Or do you think your split-bus architecture will be at full use?
If you don't like my stencil buffer examples, what about heavy texturing with low fillrate (limited by the 'clamped' texture bandwidth)? Your framebuffer bandwidth will be wasted while your texture bandwidth is saturated.
2 words: load balancing :)
With a crossbar architecture and multiple controllers you get fine-grained access with high aggregate bandwidth and dynamic load balancing.
Do you need more framebuffer bandwidth or more texture bandwidth? You have it on a GF4-like architecture, at least until the total bandwidth is saturated.

All this crossbar talk will not improve the real memory latency and page-hit rate of the memory controllers and caches both architectures have :rolleyes:
Sorry, but IMO you're plain wrong. Those memory architectures do help latency and page hits.

Also, many of the problems designers have today come from the high frequencies used by memory, right? And it will get worse.
Umh... don't talk about design problems, you just doubled the bus width :)

edited: suppose we have an improved GF4 core (8-level multitexturing in a single pass) at a lower frequency (200 MHz core); my guess is Design 1 could be done today, very cheaply, with great performance in almost all situations.
If we have a GF4 core like that coupled with a working frequency like that, my guess is nvidia's engineers are drunk :rolleyes:

ciao,
Marco
 
Colourless said:
Design 2 will be better than Design 1 IF they are bandwidth limited. If they are fillrate limited, then in a case such as Doom 3 and stencil shadows, the extra bandwidth when not texturing will remain unused anyway.
Not if next-generation hardware allocates more resources (i.e. <virtual> pixel pipes), devoting texture iterators and such to filling more pixels at the same time. And even if the next designs are not that flexible, a 350 MHz, 8-pixel-pipe core can theoretically write more than 10 GB/s to memory (1 pixel = 24-bit zbuffer + 8-bit stencil), so Design 1 loses anyway, while a single 128-bit bus coupled with fast (800+ MHz) RAM could satisfy even 8 pipes in such a case.
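The back-of-the-envelope arithmetic behind that claim, as a sketch (the 8-pipe, 350 MHz core is the hypothetical hardware in the post above, not a shipping part):

```python
# Framebuffer write traffic for a hypothetical 8-pipe core at 350 MHz doing
# untextured stencil-shadow fills: 1 pixel = 24-bit Z + 8-bit stencil = 4 bytes.
pipes = 8
core_hz = 350e6
bytes_per_pixel = 4

gb_per_s = pipes * core_hz * bytes_per_pixel / 1e9
print(f"{gb_per_s:.1f} GB/s")  # 11.2 GB/s of pure framebuffer writes
```

More than the ~10 GB/s figure in the post, so a split design that caps framebuffer bandwidth at half the aggregate would indeed bottleneck in this case.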

ciao,
Marco
 
Splitting frame buffer and texture bandwidth, versus a unified scheme with an almost equal aggregate bandwidth, gives you bad performance in almost every possible situation

13.3 GB/s vs 9.6 GB/s, almost equal? Design 1 has about 40% more bandwidth than Design 2... I wouldn't call that almost equal.
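For concreteness, here is how the two quoted figures can be reproduced. The clocks (208 MHz DDR for the split 256-bit Design 1, 300 MHz DDR for the unified 128-bit Design 2) are my inference from the numbers in the thread, not stated specs:

```python
def ddr_bandwidth_gb(bus_bits, clock_mhz):
    """Peak DDR bandwidth: bus width in bytes * 2 transfers per clock * clock."""
    return bus_bits / 8 * 2 * clock_mhz * 1e6 / 1e9

design1 = ddr_bandwidth_gb(256, 208)  # split 128+128-bit buses, slower DDR
design2 = ddr_bandwidth_gb(128, 300)  # single 128-bit bus, faster DDR
print(f"Design 1: {design1:.1f} GB/s")                     # 13.3 GB/s
print(f"Design 2: {design2:.1f} GB/s")                     # 9.6 GB/s
print(f"Design 1 advantage: {design1 / design2 - 1:.0%}")  # 39%
```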
 
Teasy said:
13.3 GB/s vs 9.6 GB/s, almost equal? Design 1 has about 40% more bandwidth than Design 2... I wouldn't call that almost equal.
Umh, where did I write that I was referring to Design 2? In fact I didn't. With current technology one could expect a single 128-bit DDR bus to reach more than 12 GB/s. Why a hardware design should implement a 256-bit split asynchronous bus instead of a single fast 128-bit bus I don't know, but that wasn't my idea :) If one wants to go with a 256-bit DDR bus, just do it with 4 or 8 memory controllers. Increasing costs by doubling the bus, and then trying to reduce them by using slow memory, doesn't make any sense at all. Splitting buses could make sense when one bus is a LOT larger and faster than the other, as in a GPU with eDRAM plus slow external memory, but I wouldn't call that a split-bus architecture.

ciao,
Marco
 
Teasy said:
Splitting frame buffer and texture bandwidth, versus a unified scheme with an almost equal aggregate bandwidth, gives you bad performance in almost every possible situation

13.3 GB/s vs 9.6 GB/s, almost equal? Design 1 has about 40% more bandwidth than Design 2... I wouldn't call that almost equal.

The comparison is fundamentally flawed. In order to be meaningful, Design 1 should be compared to a 256-bit datapath which can be used for either purpose. That would be directly comparable in terms of GPU pinout and board traces. It is difficult to see how divvying up the available bandwidth in the proposed manner would be an overall win, but comparing it to an obviously lower-cost model is pretty pointless. And even then Design 1 could still come out inferior....

Entropy
 
nAo:
IMHO, you're wrong. Design 1 will be more expensive than Design 2 (128+128-bit buses vs a 128-bit bus) and in some circumstances even slower,

Can you prove that? Numbers please ;)

The idea is to have an architecture with sustained performance at:
- 1024x768x32, vsync on at 75Hz
- Stochastic multisampled FSAA (2 to 5 samples out of 16)
- Single-pass, up to 8-level multitexture (a bit better than the Radeon 8500 :) )
- Overdraw of 5

Probably most games from 2002 to 2005 will be happy with this spec.
By 2005 we will have the next id engine and the next Epic engine.
It is not designed to get 400 fps in Q3 :rolleyes: :rolleyes: :rolleyes:
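A rough estimate of what that spec costs in framebuffer traffic alone, under crude assumptions of my own (4 bytes per colour write, a Z read plus a Z write per sample, texture traffic and compression ignored):

```python
width, height, refresh = 1024, 768, 75   # 1024x768, vsync at 75 Hz
overdraw = 5
samples = 2            # low end of the 2-to-5 stochastic FSAA range
bytes_color = 4        # 32-bit colour write
bytes_z = 8            # 32-bit Z read + 32-bit Z write

samples_per_s = width * height * refresh * overdraw * samples
gb_per_s = samples_per_s * (bytes_color + bytes_z) / 1e9
print(f"~{gb_per_s:.1f} GB/s")  # ~7.1 GB/s before any texture fetches
```

Texture fetches for 8-level multitexturing would come on top of that, which is why the aggregate bandwidth numbers being argued over matter.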

I'm talking about future games that make heavy use of the stencil buffer
Please, could you explain? Thanks.

Splitting frame buffer and texture bandwidth, versus a unified scheme with an almost equal aggregate bandwidth, gives you bad performance in almost every possible situation, that's a fact.

I am comparing performance AND price.

If you don't like my stencil buffer examples, what about heavy texturing with low fillrate (limited by the 'clamped' texture bandwidth)? Your framebuffer bandwidth will be wasted while your texture bandwidth is saturated.
2 words: load balancing
Four words: performance with low price :rolleyes:


Sorry, but IMO you're plain wrong. Those memory architectures do help latency and page hits.
Of course they help.
Yeah: total latency = L1 latency × L1 hit rate + L2 latency × L1 miss rate, etc., etc.

I am talking about the latency and page-hit rate BEFORE that, the real ones at the memory itself. :rolleyes:
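The formula being argued about is the standard average-access-time recursion for a memory hierarchy; a minimal sketch with made-up numbers:

```python
def amat(hit_time, hit_rate, miss_penalty):
    """Average memory access time: hit time plus miss-rate-weighted penalty."""
    return hit_time + (1 - hit_rate) * miss_penalty

# Made-up numbers: a 2-cycle on-chip texture cache with a 75% hit rate in
# front of a 40-cycle DRAM access.
print(amat(2, 0.75, 40))  # 12.0 cycles on average
```

Pascal's point is that the 40-cycle term itself (the raw DRAM latency and page-hit behaviour) is unchanged by the cache in front of it.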

Umh... don't talk about design problems, you just doubled the bus width
But at a lower frequency :rolleyes:

If we have a GF4 core like that coupled with a working frequency like that, my guess is nvidia's engineers are drunk

They are drunk now. The simplest GF3 is completely bandwidth limited and we all know that. It is not capable of sustaining the MARVELOUS / FANTASTIC / UNBELIEVABLE 1.4 GTexels/s :rolleyes: :rolleyes: :rolleyes:

IMHO bandwidth is more important than a high theoretical, never-sustained fillrate.
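A sanity check on the GF3 claim; the bytes-per-texel figure is my own crude assumption, not a measured number:

```python
peak_texel_rate = 1.4e9   # GTexels/s, the headline figure quoted above
mem_bw = 7.36e9           # bytes/s, roughly a 128-bit DDR bus at 230 MHz

# Crude assumption: ~8 bytes of memory traffic per textured pixel on average
# (texture fetch plus colour/Z work, ignoring caches and compression).
bytes_per_texel = 8
sustainable = mem_bw / bytes_per_texel
print(f"{sustainable / 1e9:.2f} vs {peak_texel_rate / 1e9} GTexels/s")  # 0.92 vs 1.4
```

Under these assumptions the memory system, not the pixel pipelines, sets the sustained rate, which is the point being made.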

Entropy:
The comparison is fundamentally flawed. In order to be meaningful, Design 1 should be compared to a 256-bit datapath which can be used for either purpose.

I am comparing what could potentially be a low-priced 256-bit GPU with today's high-end 128-bit design :)

Today these GF4s have almost a thousand pins and still don't have 256-bit memory access. Usually they design a very fast, low-latency core without enough bandwidth to sustain it.

Today they are selling low core latency, not sustained high performance.

The idea is to have more balance between the GPU's core and its memory bandwidth at an overall low price, not to have the fastest Q3 benchmark :)
 
pascal said:
nAo:
IMHO, you're wrong. Design 1 will be more expensive than Design 2 (128+128-bit buses vs a 128-bit bus) and in some circumstances even slower,
Can you prove that? Numbers please ;)

Oh my god. Take a current GPU, expand its area to fit the extra bonds, add 200 pins to the package and traces on the PCB, then come back and tell me it's cheap. If it were cheap you'd see 256-bit buses now.
Unfortunately, I can't see them.

I'm talking about future games that make heavy use of the stencil buffer
Please, could you explain? Thanks.

Next-generation games will do a lot of geometry passes with untextured polygons to create correct shadow volumes.

I am comparing performance AND price.

Yeah, and I'm saying that a 128-bit DDR bus coupled with fast memory is a lot cheaper than a split 256-bit bus, and in most cases it's even faster.

Four words: performance with low price :rolleyes:

Your solution is not low-priced.

Sorry, but IMO you're plain wrong. Those memory architectures do help latency and page hits.
Of course they help.
Yeah: total latency = L1 latency × L1 hit rate + L2 latency × L1 miss rate, etc., etc.

Dunno how you came up with that formula, but it's nonsense.
Multiple controllers have access to multiple banks concurrently. Calculating memory efficiency is not an easy task; there are a lot of variables to analyze.

I am talking about the latency and page-hit rate BEFORE that, the real ones at the memory itself. :rolleyes:
It can mostly be hidden, by a good memory controller, via bank precharge. A lot of memory accesses on 3D chips can be prescheduled, so you know in advance what you'll need and when you'll need it.

Umh... don't talk about design problems, you just doubled the bus width
But at a lower frequency :rolleyes:

Current designs have shown that this is not a huge problem; meanwhile, we have yet to see a 256-bit-bus consumer card. Talk about vapour...


They are drunk now. The simplest GF3 is completely bandwidth limited and we all know that. It is not capable of sustaining the MARVELOUS / FANTASTIC / UNBELIEVABLE 1.4 GTexels/s :rolleyes: :rolleyes: :rolleyes:
Every 3D card is bandwidth limited. And rolling your eyes 10 times doesn't give your words more credibility.

ciao,
Marco
 
Dunno how you came up with that formula, but it's nonsense.
This is a simple/basic formula for calculating latency in a memory hierarchy.
Get a computer architecture book. Study before you call something nonsense. :devilish:

Every 3D card is bandwidth limited. And rolling your eyes 10 times doesn't give your words more credibility.

No, they just show my feelings about your words. :rolleyes:

I just have to finish my work now. I will post more tomorrow.
 
I am at home, and I will respond now.

Oh my god. Take a current GPU, expand its area to fit the extra bonds, add 200 pins to the package and traces on the PCB, then come back and tell me it's cheap. If it were cheap you'd see 256-bit buses now.
Unfortunately, I can't see them.
They have already added more than that to go from 200 MHz to 300 MHz DDR with a fast, unusable core :devilish:

Next-generation games will do a lot of geometry passes with untextured polygons to create correct shadow volumes.
That doesn't explain how/why Doom3's performance (is it next-generation enough for you?) doubled with compressed textures and anisotropic disabled. Simple Amdahl's Law.
Just in case you don't know it:

Speedup = 1 / ((1 - FractionEnhanced) + FractionEnhanced / SpeedupEnhanced)

Can you say nonsense again? :rolleyes: :rolleyes: :rolleyes:
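For what it's worth, a worked instance of that formula, with purely illustrative numbers (not measured Doom3 data): if 75% of frame time is texture-bandwidth bound and that fraction gets 3x faster from compression and disabling anisotropic, the whole frame gets 2x faster:

```python
def amdahl(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of the work is accelerated."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(amdahl(0.75, 3))  # 2.0, i.e. performance doubles
```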

Yeah, and I'm saying that a 128-bit DDR bus coupled with fast memory is a lot cheaper than a split 256-bit bus, and in most cases it's even faster.
Adding more pins for power, ground, shielding, driving fast memory, some expensive packaging (almost one thousand balls), etc... :eek:

Your solution is not low-priced.
Prove it :)

It can mostly be hidden, by a good memory controller, via bank precharge. A lot of memory accesses on 3D chips can be prescheduled, so you know in advance what you'll need and when you'll need it.
At what cost? :rolleyes:

Current designs have shown that this is not a huge problem; meanwhile, we have yet to see a 256-bit-bus consumer card. Talk about vapour...
What vapour???
 
Much of the discussion has been about the cost of design 1 vs. design 2. It's not easy to say how large the cost difference between the two solutions will be. In general a wider bus is more expensive, but design 1 also has lower speed memory, which is cheaper. How much cheaper the memory is depends on market conditions, which makes the cost difference between the designs difficult to forecast. I personally think the wider bus will still be more expensive, but like everyone else I don't have any numbers to back up this hunch.

Design 1 might also waste memory, as has been said, but 3dlabs uses this approach with their Wildcat card because a little wasted memory doesn't matter for a card that has a ton of it. It's obvious there is no general consensus in the industry (and in this thread), because both unified and independent architectures currently exist.

In scenes with lots of untextured polygons, like shadow volumes, the card with higher framebuffer bandwidth (design 2) will be faster for part of the scene. However, when the final geometry is rendered with textures it might be slower. So there is a tradeoff over which part takes longer: the untextured polygons or the visible scene data. In general I believe textures will still be the bottleneck, but it's hard to predict how the next-generation engines will actually perform.
 
pascal said:
One more thing about this wonderful 128-bit LMA and LMA II.
http://www.3dvelocity.com/reviews/3dblaster/ti4400_2.htm

They say it has a 128-bit data bus, right?

How many bits for the address bus?????
If it is really capable of doing four independent memory accesses it needs at least four address buses, each one with at least 27 bits (128MB), for a total of 108 bits of addressing :eek: Is that right???

Internally there might be this many address signals, but I don't see why that many would be needed externally. I believe each individual memory controller only accesses a quarter of the memory at a time.
 
The only way to get four independent data streams is to have four independent addresses. Probably they are addressing four 64-bit banks independently.

It looks like a quad symmetric crossbar UMA :eek:

That means 128 pins for data and 108 pins for addressing, which helps explain the very large pin counts we have now.

Sometimes the addresses are the same, sometimes not :)

It means up to 1.2 Giga accesses/second, each of 8 bytes.

Probably the Radeon 8500 has 128 data bits and 54 address bits (a dual-controller 128-bit architecture).

These people are improving access efficiency by using smaller access granularity.
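The pin arithmetic in the post above, as a sketch. Note that real (S)DRAM multiplexes row and column addresses over the same pins, so the practical per-controller pin count is well below a flat 27 bits; the 13-bit row-address width is a typical value of the era, assumed here:

```python
import math

mem_bytes = 128 * 2**20   # 128 MB of addressable memory
controllers = 4

flat_bits = math.ceil(math.log2(mem_bytes))   # 27 bits for a flat byte address
naive_pins = controllers * flat_bits          # 108, the figure in the post

# With multiplexed row/column addressing, each controller only needs roughly
# as many address pins as the row address is wide.
muxed_pins = controllers * 13
print(naive_pins, muxed_pins)  # 108 52
```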
 
Just curious, does anyone know exactly how data is distributed across the 4 sections? (Maybe someone has tried to analyze pathological cases, or the people with the expensive logic analyzers perhaps? :) Based on tiling? And how about vertices?
 
This is a simple/basic formula for calculating latency in a memory hierarchy. Get a computer architecture book. Study before you call something nonsense.

That's your problem Pascal: having read some formula in a book or on the internet doesn't make you an expert. Maybe you should understand where that math makes sense before using it. In this case it doesn't, because we're talking about multiple memory controllers, no L2, and we have no information on how the L1 design works. A GPU is not a CPU; on a GPU you have very different hit/miss ratios depending on whether you are talking about textures, the zbuffer, the framebuffer, etc.
The same 4k cache could have a 75% hit ratio for textures and a 25% hit ratio for the framebuffer (and you'll end up with very different cache designs, customized to each kind of data).

They have already added more than that to go from 200 MHz to 300 MHz DDR with a fast, unusable core :devilish:
How you know that all the added pins are devoted to memory access, I don't know. You're making a lot of assumptions.

That doesn't explain how/why Doom3's performance (is it next-generation enough for you?) doubled with compressed textures and anisotropic disabled. Simple Amdahl's Law.

I don't pretend to use uncertain data about a single unfinished game engine as a benchmark, and you shouldn't either.


Adding more pins for power, ground, shielding, driving fast memory, some expensive packaging (almost one thousand balls), etc... :eek:
Sometimes I believe you live on another planet :)
Pascal, tell me why 256-bit DDR bus GPUs are only in the high-end market at the moment, and why all those smart engineers at Nvidia/Ati/etc. are just dumb and started developing faster memory controllers when there exists a simpler, cheaper and (according to you) faster solution.
Your solution is not low-priced.
Prove it :)
You made an extraordinary statement; you have to prove it, not me.
I'm just rejecting that statement.

It can mostly be hidden, by a good memory controller, via bank precharge. A lot of memory accesses on 3D chips can be prescheduled, so you know in advance what you'll need and when you'll need it.
At what cost? :rolleyes:

Very low cost. Prefetching is not Martian technology. I believe all current GPUs already do it. You can do it on a CPU, where you can only try to analyze memory access patterns; you can do it much better on a GPU, where you can predict almost everything but visibility.

They say it has a 128-bit data bus, right?
How many bits for the address bus?????
If it is really capable of doing four independent memory accesses it needs at least four address buses, each one with at least 27 bits (128MB), for a total of 108 bits of addressing. Is that right???
No, it's not. This is clear evidence that you should cut&paste from books less and think with your own brain more, because you haven't understood the basics here. What would be the purpose of having N memory controllers + a crossbar if a single memory controller had access to ALL the addressable memory? Obviously this is not the case: each controller has access to 2 memory modules at a time, with the same address lines shared.

ciao,
Marco
 
3dcgi said:
So there is a tradeoff over which part takes longer: the untextured polygons or the visible scene data. In general I believe textures will still be the bottleneck, but it's hard to predict how the next-generation engines will actually perform.

I agree. That's why you would use all the physically available bandwidth to load-balance rendering. Like you said, a split bus makes sense when you have tons of bandwidth to waste (i.e., not on consumer cards).

ciao,
Marco
 
nAo:
That's your problem Pascal: having read some formula in a book or on the internet doesn't make you an expert. Maybe you should understand where that math makes sense before using it. In this case it doesn't, because we're talking about multiple memory controllers, no L2, and we have no information on how the L1 design works. A GPU is not a CPU; on a GPU you have very different hit/miss ratios depending on whether you are talking about textures, the zbuffer, the framebuffer, etc.
The same 4k cache could have a 75% hit ratio for textures and a 25% hit ratio for the framebuffer (and you'll end up with very different cache designs, customized to each kind of data).

Don't worry, I have a formal education and I think for myself.
I was not talking about an L2 cache; I was talking about the general latency formula for hierarchical memory access. I still say that the principle of locality favours independent buses, with lower latency and more page hits, and you cannot negate that.

How you know that all the added pins are devoted to memory access, I don't know. You're making a lot of assumptions.
Can you tell me what they are doing with those pins? :rolleyes:
edited: also, I didn't say that all the new pins are devoted to memory access, but some are.

I don't pretend to use uncertain data about a single unfinished game engine as a benchmark, and you shouldn't either.
But you were talking about next-generation games. Do you have better data? :rolleyes:
edited: I don't pretend anything. :devilish: You probably mean "don't try to use".
Sometimes I believe you live on another planet :)
Pascal, tell me why 256-bit DDR bus GPUs are only in the high-end market at the moment, and why all those smart engineers at Nvidia/Ati/etc. are just dumb and started developing faster memory controllers when there exists a simpler, cheaper and (according to you) faster solution.
Oh nAo, I am really tired of you. You really want to make it personal.
First, I live in sunny South America ;)
Second, they are not me :LOL:
Seriously, the way they did it was OK for the past, but we are talking about the future.
edited: Do you usually worship Nvidia's and ATI's engineers???
You made an extraordinary statement; you have to prove it, not me.
I'm just rejecting that statement.
And I am rejecting yours :LOL:

Very low cost. Prefetching is not Martian technology. I believe all current GPUs already do it. You can do it on a CPU, where you can only try to analyze memory access patterns; you can do it much better on a GPU, where you can predict almost everything but visibility.

Precharge is not the only issue. Opening a new page is. :LOL:
edited: Higher latency means bigger caches.
No, it's not. This is clear evidence that you should cut&paste from books less and think with your own brain more, because you haven't understood the basics here. What would be the purpose of having N memory controllers + a crossbar if a single memory controller had access to ALL the addressable memory? Obviously this is not the case: each controller has access to 2 memory modules at a time, with the same address lines shared.

From the article above:
The idea of NVIDIA's crossbar memory architecture is that rather than have a single "lane" along which data can travel, four "lanes" are used, each of which is able to carry 64 bits of data at a time. This way, if say four 64-bit data chunks need to be transferred, it can be done in a single transfer while the traditional approach would mean four separate transfers.

Well, to do four different reads simultaneously (from different locations) they cannot share the address bus. There must be four independent address buses. :rolleyes:

edited: About PCB cost: the big cost is the PCB manufacturing cost. IIRC the standard GF4 Ti PCB has six layers but the Quadro4 PCB has 8 layers. My guess is some high-frequency problems (just a guess).
 
MfA said:
Just curious, does anyone know exactly how data is distributed across the 4 sections? (Maybe someone has tried to analyze pathological cases, or the people with the expensive logic analyzers perhaps? :) Based on tiling? And how about vertices?
Just my guesses :) Everything but geometry is stored in a tiled fashion.
How data are distributed isn't easy to figure out. I believe they uniformly split data across each controller/section, but the frame buffer, textures and zbuffer are not kept spatially together. One could think of an arrangement where a 32-bit pixel + 32-bit zbuffer/stencil are stored in the same 64-bit word that can be read in one clock, but I think this is not the case: it would make the on-chip caches less efficient.

ciao,
Marco
 
Mfa:
Just curious, does anyone know exactly how data is distributed across the 4 sections? (Maybe someone has tried to analyze pathological cases, or the people with the expensive logic analyzers perhaps? :) Based on tiling? And how about vertices?

No info about that, only the cache info:
Quad Cache architecture uses four independent high-speed cache memory locations to store primitive, vertex, texture, and pixel information.

Each cache is connected to a unique memory controller.
What is this primitive store? Thanks.
 