32 GByte/sec with 128 Bit DDR?

pascal said:
Can you tell what they are doing with those pins? :rolleyes:
edited: also I didn't say that all the new pins are devoted to memory access, but some are.

More power, clock and ground pins for a huge chip. Some are for that and some are for memory, but you used that argument as if the additional pins were all devoted to memory-related tasks.

But you were talking about next-generation games. Do you have better data? :rolleyes:

No, I don't. But that lack of information doesn't make a single unreliable data point plotted on your graph fit your dream curve.

Oh nAo, I am really tired of you. You really want this to get personal.
Oh yeah, I'm not the one who said I should read books, so please stop playing these games with me.

Do you usually worship Nvidia's and ATI's engineers???

I don't worship engineers, but I have respect for people smarter and more experienced than me. I don't pretend to teach them how to do their work, because they know better than me.

And I am neglecting yours :LOL:

Kind of twisted logic. Try again with Occam's razor.

Precharge is not the only issue. Opening a new page is. :LOL:

That statement triggers a question: are u sure u know what precharge is?


Well, to do four different reads simultaneously (from different locations) they cannot share the address bus. There must be four independent address buses. :rolleyes:

Pascal, u've got it wrong again. I said each controller is coupled with 2 mem chips that share the same address bus. Do u believe a gf4 mounts only 2 mem chips?
In this way each controller can see 32 meg of ram in a 128 meg configuration, like on the gf4 ti. That's why it's very probable that each controller is assigned to one or more screen columns.
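Something like this toy interleave, just to illustrate the guess (the tile width and the round-robin policy here are assumptions, not anything confirmed):

```python
# Toy column interleave: stripe the screen across the four 32-bit
# controllers so neighbouring screen regions hit different chips.
CONTROLLERS = 4
TILE_W = 32   # hypothetical column width in pixels

def controller_for(x):
    return (x // TILE_W) % CONTROLLERS

print([controller_for(x) for x in (0, 31, 32, 64, 96, 128)])
# -> [0, 0, 1, 2, 3, 0]
```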

ciao,
Marco
 
If nothing else, it's refreshing to not be involved in this type of discussion myself for a change. ;)

Now back to your regularly scheduled debate...
 
Pardon me Joe, but I have to finish this OT with nAo ;)

More power, clock and ground pins for a huge chip. Some are for that and some are for memory, but you used that argument as if the additional pins were all devoted to memory-related tasks.
And almost one thousand BGA balls, right?
Don't you think that is too much for a GPU using much less power/frequency than some CPUs like the Athlons?

No, I don't. But that lack of information doesn't make a single unreliable data point plotted on your graph fit your dream curve.
AFAIK this is the only available info about (probable) game benchmarks during the 2003 to 2005 period.
Then how do you start a new design?

Oh yeah, I'm not the one who said I should read books, so please stop playing these games with me.
After some "nonsense" and other not very polite remarks.

I don't worship engineers, but I have respect for people smarter and more experienced than me. I don't pretend to teach them how to do their work, because they know better than me.
Neither do I. But that will not stop me from questioning/speculating about things.

Kind of twisted logic. Try again with Occam's razor.
Give me a break :LOL:

That statement triggers a question: are u sure u know what precharge is?
Since my first microcomputer design in 1984 :D

Pascal, u've got it wrong again. I said each controller is coupled with 2 mem chips that share the same address bus. Do u believe a gf4 mounts only 2 mem chips?
In this way each controller can see 32 meg of ram in a 128 meg configuration, like on the gf4 ti. That's why it's very probable that each controller is assigned to one or more screen columns.
I will trust you.

OK, it means that each 32MB needs an independent 25-bit address bus after the crossbar going to the memory. Four 32MB partitions means 100 pins for addressing on the GPU.

But each memory controller is capable of reading 64 bits, 128 bits or 256 bits (from the article). Then each memory controller uses 27 bits for addressing, plus some control bits, before the crossbar. Now it is becoming interesting.

We are making some progress here :D

Now see what it means:
- peak 1.2 giga-accesses per second (each of 8 bytes)
- or peak 300 mega-accesses per second (each of 32 bytes)

It looks good but uses 128+100 pins = 228 pins for 9.6GB/s.

With a dual asymmetric design it could be: 256+26+25 = 307 pins for 13.3GB/s.

Probably for many games we will use more than 1.6GB/s for the framebuffer (9.6 - 8GB/s).
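A quick back-of-the-envelope check of those numbers (a sketch, assuming 300MHz DDR means 600 MT/s on a 128-bit bus and byte-addressed 32MB partitions):

```python
import math

# Peak bandwidth: 128-bit bus, 300 MHz DDR -> 600 MT/s.
BUS_BITS = 128
TRANSFERS_PER_SEC = 600e6

bandwidth = BUS_BITS / 8 * TRANSFERS_PER_SEC
print(f"peak bandwidth: {bandwidth / 1e9:.1f} GB/s")      # 9.6 GB/s

# Peak access rates at two burst granularities.
for chunk_bytes in (8, 32):
    print(f"{chunk_bytes}-byte accesses: {bandwidth / chunk_bytes / 1e6:.0f} M/s")
# -> 1200 M/s (1.2 giga-accesses) and 300 M/s

# Address pins: a 32MB partition needs log2(32 * 2**20) = 25 bits.
addr_bits = int(math.log2(32 * 2**20))
print(f"total pins: {BUS_BITS} data + {4 * addr_bits} address = {BUS_BITS + 4 * addr_bits}")
```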

Let's have world peace :LOL:
 
pascal said:
this is the only available info about (probable) game benchmarks during the 2003 to 2005 period.
So? That doesn't give that single data point any superior meaning; in fact, it has no meaning at all for our purposes.

Then how do you start a new design?

Whoever designs these chips has tons of info that u and I don't have.

Neither do I. But that will not stop me from questioning/speculating about things.

You can speculate all u want, but if you want to stay in the realm of reality you should keep your feet on the ground.

That statement triggers a question: are u sure u know what precharge is?
Since my first microcomputer design in 1984 :D

Maybe that's too far back in time :) I was learning how to read and write at that time...
OK, it means that each 32MB needs an independent 25-bit address bus after the crossbar going to the memory. Four 32MB partitions means 100 pins for addressing on the GPU.

No :) According to this memory layout the hw needs 24 address bits per controller, because they are shared between 2 memory chips (16+16).
So there is a total of 96 address lines. This figure could be further reduced with some tricks...
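As a minimal sketch of that count (my reading: the two 16MB chips answer in lockstep, 16+16 data bits, so one shared bus only has to span 16MB):

```python
import math

# Two x16 chips per controller share one address bus and answer in
# lockstep, so the bus spans one chip's 16MB, not the controller's 32MB.
addr_bits = int(math.log2(16 * 2**20))   # 24 bits per controller
print(addr_bits, addr_bits * 4)          # 24, 96 address lines total
```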

But each memory controller is capable of reading 64 bits, 128 bits or 256 bits (from the article).

Which article? Of course a single controller can't read a word wider than 32 bits. But the crossbar could link memory requests...

It looks good but uses 128+100 pins = 228 pins for 9.6GB/s.

With a dual asymmetric design it could be: 256+26+25 = 307 pins for 13.3GB/s.

Probably for many games we will use more than 1.6GB/s for the framebuffer (9.6 - 8GB/s).

I don't know why u are using that 9.6GB/s figure when the GF4 already has 10.5GB/s, and Samsung has just announced 400MHz DDR.
Anyway, the first one is way more efficient than the second one :)
Speaking about wasting bandwidth...

ciao,
Marco
 
While we are on the 256 bit subject...


Are there any emerging technologies, as far as PCBs or chip packaging go, that would make a much wider bus more practical cost-wise?

What about four 32-bit 1200MHz DRDRAM channels? RAMBUS seems like it may just become a decent alternative for keeping the bandwidth going up.
 
Comparing it to an obviously lower-cost model is pretty pointless. And even then, Design1 could still come out inferior...

I didn't compare them, I just pointed out that they are nowhere near equal in bandwidth.
 
nAo:
So? That doesn't give that single data point any superior meaning; in fact, it has no meaning at all for our purposes.
Well, without any point of reference, IMHO you cannot criticize my HO about hardware.
Whoever designs these chips has tons of info that u and I don't have.
Then what you are saying is that we cannot think for ourselves, but only say amen to them. That is not a good position.
You can speculate all u want, but if you want to stay in the realm of reality you should keep your feet on the ground.
I keep my right to dream whenever/whatever I want. It is part of my creative process. I usually say that I have my head in the clouds and my feet on the ground.

Don't worry about that. And the fact that someone thinks differently doesn't mean he/she is wrong.

No. According to this memory layout the hw needs 24 address bits per controller, because they are shared between 2 memory chips (16+16).
So there is a total of 96 address lines. This figure could be further reduced with some tricks...
Well, you still need to differentiate between the two chips, so we are back to 25 bits.
But I was wrong, because 3 bits are used to address the individual bytes inside the 64-bit chunks, right?
Then it means only 22 bits going from the GPU to the memories per channel. The total is 88 bits externally, with a maximum 128MB addressable space.
The trick you are talking about is multiplexing.
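In code form, the refined count (a sketch; the 64-bit chunk granularity is taken from the article linked below):

```python
import math

# 32MB per channel, but the bus moves 64-bit (8-byte) chunks, so the
# low 3 byte-select bits never leave the chip.
byte_addr_bits = int(math.log2(32 * 2**20))           # 25
chunk_select_bits = int(math.log2(8))                 # 3
external_bits = byte_addr_bits - chunk_select_bits    # 22 per channel
print(external_bits, external_bits * 4)               # 22, 88 external lines
```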
Which article? Of course a single controller can't read a word wider than 32 bits. But the crossbar could link memory requests...
The article I linked above. I will link it here: http://www.3dvelocity.com/reviews/3dblaster/ti4400_2.htm
The idea is that the memory controllers read 64-bit chunks, two chunks on the rise and two on the fall (128-bit DDR).

The efficiency comes from the fact that a memory controller can request 64 bits, 128 bits or 256 bits, and multiple requests from multiple controllers are combined to maximize memory efficiency.
The idea is to maximize the number of possible accesses.

It looks like a quad variable-size crossbar UMA.

It means that the caches work with 64-bit chunks.
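A toy sketch of that combining idea (the merge policy and granularities here are my own illustration, not from the article):

```python
# Merge adjacent 64-bit chunk requests aimed at one channel into a
# single wider burst, up to 256 bits (4 chunks).
def combine(requests, max_chunks=4):
    """requests: sorted 64-bit chunk addresses for one channel."""
    bursts, run = [], [requests[0]]
    for addr in requests[1:]:
        if addr == run[-1] + 1 and len(run) < max_chunks:
            run.append(addr)        # contiguous: widen the burst
        else:
            bursts.append(run)
            run = [addr]
    bursts.append(run)
    return bursts

print(combine([10, 11, 12, 13, 20, 21, 30]))
# -> [[10, 11, 12, 13], [20, 21], [30]]: one 256-bit, one 128-bit, one 64-bit burst
```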

Anyway, the first one is way more efficient than the second one
Well, doing what? It will depend on the game's behaviour. So you cannot confirm that.

Of course some people will want to play at 640x480x16, but the vast majority today will play at 1024x768x32, sometimes with FSAA. Just segregate the traffic for this specific kind of usage.

Speaking about wasting bandwidth...
Speaking about wasting money :D

ciao
Pascal

Now back to the topic Joe ;)

[speculation mode on]
IIRC the GF4 is manufactured on a 4-metal-layer silicon process, while modern CPUs are manufactured on a 6-metal-layer process.
Probably the dataflow inside a GPU is more intense than in a CPU and very dependent on bus width (low frequency, high bandwidth), so maybe GPU designers are forced to use extra (or many more) pins than CPU designers to distribute power, ground, clock and other things, because they don't have a choice.
Maybe with a better (more metal layers) manufacturing process a 256-bit bus could easily be done. Remember that today's GF4 uses a package with almost one thousand balls, so packaging is not really the problem.

On the PCB side, bus frequency will be a bigger problem in the future than bus width.
[speculation mode off]

Happy Easter everybody :D
 
Well, without any point of reference, IMHO you cannot criticize my HO about hardware.

Don't make generalizations. I haven't said there are no points of reference; I said you cannot use that single point of reference, because it's not reliable and it has no statistical meaning.

Then what you are saying is that we cannot think for ourselves, but only say amen to them. That is not a good position.

Umh?! Pascal, please stop making these kinds of assumptions. I haven't said anything like that; in fact I consider myself quite a dreamer.

I keep my right to dream whenever/whatever I want. It is part of my creative process. I usually say that I have my head in the clouds and my feet on the ground.

Oh my god :) Obviously u've the right to dream whatever u want to :)
What I clearly said is that pure speculation without any factual basis is nonsense, and potentially a waste of time. We're discussing future bus architectures for GPUs; we're not talking about hidden variables in quantum physics or about god :)

Well, you still need to differentiate between the two chips, so we are back to 25 bits.

No, you don't. The hw can tell the two apart because the data buses are joined.

But I was wrong, because 3 bits are used to address the individual bytes inside the 64-bit chunks, right?

That is one of the tricks I had in mind...

The trick you are talking about is multiplexing.

Yes, and other stuff...

The idea is to maximize the number of possible accesses.

An idea that you avoided in your Design1.

It means that the caches work with 64-bit chunks.

How do u know this? Stop making assumptions, please :)
In fact, just as an example, according to nvidia patents their texture cache works with 256-bit chunks, for either compressed or uncompressed textures :)
Anyway, the first one is way more efficient than the second one
Well, doing what? It will depend on the game's behaviour. So you cannot confirm that.

When I talk about design inefficiency I'm talking about its ineffective use of bandwidth. A single 128-bit channel for the frame buffer and another one for textures gives you (imho, unbalanced) bandwidth, but not fine-grained access to memory. With future shaders, each pixel will be generated by compositing several fragments. Many of these fragments will be taken from textures. How do u expect to satisfy a heavy texture workload with just a single channel to memory? Do u really want to open a different page so often?

Speaking about wasting money :D

The one who is speaking about a (more) expensive and unproven solution is you,
and not me. Gf3/4 and the Radeons docet.

ciao,
Marco
 
Don't make generalizations. I haven't said there are no points of reference; I said you cannot use that single point of reference, because it's not reliable and it has no statistical meaning.
Now you want statistics for a discussion??? Hahaha :D
Umh?! Pascal, please stop making these kinds of assumptions. I haven't said anything like that; in fact I consider myself quite a dreamer.
First, STOP saying stop to me.
What I clearly said is that pure speculation without any factual basis is nonsense, and potentially a waste of time. We're discussing future bus architectures for GPUs; we're not talking about hidden variables in quantum physics or about god.
If that is difficult for you then I am sorry. What is a waste of time is continuing this fruitless discussion with you :(
You really don't want to contribute. You only say things like "nonsense".
No, you don't. The hw can tell the two apart because the data buses are joined.
Do you mean two 16MB chips combined on a 32-bit bus? Then it means two 4Mx16bx2 chips, or 22 bits for addressing, like I finally said :)
An idea that you avoided in your Design1.
??????? No, the dual-channel asymmetric design described above may go up to 833 mega-accesses per second, and the address space may be continuous; just use some crossbar to add some flexibility.

But it looks like you only want to say "nonsense".

How do u know this? Stop making assumptions, please :)
In fact, just as an example, according to nvidia patents their texture cache works with 256-bit chunks, for either compressed or uncompressed textures :)
Sorry nAo, I will NOT stop.
Then why do they need 64-bit chunks??????????
Probably some caches use 64-bit chunks.

When I talk about design inefficiency I'm talking about its ineffective use of bandwidth. A single 128-bit channel for the frame buffer and another one for textures gives you (imho, unbalanced) bandwidth, but not fine-grained access to memory. With future shaders, each pixel will be generated by compositing several fragments. Many of these fragments will be taken from textures. How do u expect to satisfy a heavy texture workload with just a single channel to memory? Do u really want to open a different page so often?
Well, it is not inefficient; it is less efficient with small words from the bandwidth-usage point of view, but it can be very efficient from the economics point of view. Also, the efficiency will depend on the cache word size. Balance can be added by having a single address space with two regions (one 64MB and the other 32MB) selected by a crossbar if needed.
But again, people will want to play at 1024x768x32 with some FSAA.
The one who is speaking about a (more) expensive and unproven solution is you,
and not me.
You can say unproven, but you can't say more expensive unless you prove it.
Gf3/4 and the Radeons docet.
?????????

Well nAo, I believe we are finished. There is no point in continuing this discussion with you.

ciao,
Pascal

Another possibility for an asymmetric memory design is 192 bits, as below.
64MB or 128MB on a 128-bit bus using a dual or quad addressing scheme, and 32MB on a 64-bit bus using a single addressing scheme. The different regions could share the same address space as two distinct and contiguous regions (first the 64MB/128MB and then the 32MB, or vice versa).
The usage could be determined by segment pointers depending on the specific need.

Using some 300MHz DDR it means:
14.4GB/s and 1.8 giga-accesses per second 8)

Using some 400MHz DDR-II:
19.2GB/s and 2.4 giga-accesses per second :eek: 8)
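Those peaks check out with a quick sketch (assuming DDR means two transfers per clock and 64-bit, i.e. 8-byte, access granularity):

```python
# Peak bandwidth and access rate for the 192-bit asymmetric design.
def peak(bus_bits, clock_mhz):
    bandwidth = bus_bits / 8 * clock_mhz * 1e6 * 2   # bytes/s, DDR
    return bandwidth / 1e9, bandwidth / 8 / 1e9      # GB/s, giga-accesses/s

print(peak(192, 300))   # (14.4, 1.8)  -> 300 MHz DDR
print(peak(192, 400))   # (19.2, 2.4)  -> 400 MHz DDR-II
```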

edited: xbitlabs says that the 400 and 500MHz DDR-II chips will be at least 512Mb, which means 64MByte chips to start.

The eDRAM is a big Xmas wish, but some new memory design using DDR is possible. We just should not be dominated or limited by the old paradigm.
 
nAo

My sincere apologies.
You were only expressing your opinion and I did not accept it.

In fact I contradicted myself when trying to add some flexibility to the asymmetrical idea.
But I learned from my persistence with the asymmetry idea, and some kind of asymmetrical UMA (I call it aUMA) came to my mind last night, but I think it is too early to talk about it. Maybe in another thread.

As a gamer it is hard to accept the limitations imposed by current technology. :(
I hope there are no hard feelings. ;)

Anyone, please, just a few questions:
- With this DX9 or OpenGL 2.0 programmability, is there a need for some program cache and memory/bandwidth?
- What is nvidia's primitive cache?

Thanks
Pascal
 
I'm not convinced that embedded DRAM technology advancement is keeping pace with "external RAM" advancement. To me, someone needs to produce a chip with about 24 MB of embedded DRAM, with an internal bandwidth of close to 20 GB/sec, by the end of 2003. Of course, on the same die, they must also deliver the logic (features) capable of competing with the latest and greatest at that time. I may be pessimistic, but I just don't see it happening.

Well, Sony has already produced limited quantities of their 32MB GS on a 180nm process (last year) that leaves it only a few sq. mm larger than the original 4MB die. Granted, it's a rather simple rasterizer, but on today's 130-150nm processes I don't see it being too unreasonable to fab a complete GPU with an equivalently sized local buffer...
 
Joe DeFuria said:
If nothing else, it's refreshing to not be involved in this type of discussion myself for a change. ;)

Now back to your regularly scheduled debate...
Debate?!! More like an argument, and the full half hour at that.
Makes having "being-hit-on-the-head lessons" seem pleasant.

Livecoma said:
While we are on the 256 bit subject...
Are there any emerging technologies, as far as PCBs or chip packaging go, that would make a much wider bus more practical cost-wise?
What about four 32-bit 1200MHz DRDRAM channels? RAMBUS seems like it may just become a decent alternative for keeping the bandwidth going up.
IIRC, the problem with RAMBUS is that although it has high bandwidth, it also has high latency. That is, it takes a long time from when you decide that you need a piece of data at 'random' location X to when you can actually get your grubby paws on the contents. This isn't that much of an issue for a CPU system, which has massive secondary (and sometimes tertiary!) caches and usually requires big chunks of data (which overcomes the latency issues), but in a rendering system it's a nuisance.
 
Lessard said:
Can't you hide latency more easily with pipelines which have hundreds of stages?
It all depends on whether you can do other work on the data while you are waiting. In 3D graphics there's a limit to how much of this can go on, so you either have the HW sit around waiting for the data to arrive, or put in a truckload of FIFO stages, which increases the size of your chip.
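A rough sketch of that trade-off (all numbers illustrative): to keep issuing one fragment per clock while a read takes N cycles, you need roughly N slots of in-flight state, and every slot is chip area.

```python
# FIFO depth needed to hide memory latency at full issue rate.
def fifo_slots(latency_cycles, issues_per_cycle=1):
    return latency_cycles * issues_per_cycle

for latency in (20, 100, 400):   # e.g. local SDR vs. a slow DRDRAM hit
    print(latency, "->", fifo_slots(latency), "in-flight slots")
```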
 