EDRAM in GPUs


Briareus
16-May-2005, 01:15
Can someone explain the advantages and disadvantages of embedded DRAM? Why is it used in several consoles but not in any of the PC-based GPUs?

Laa-Yosh
16-May-2005, 01:25
Up until now, there weren't proper methods to handle a frame buffer that couldn't fit into the EDRAM you could reasonably put into a GPU.
Since current gen consoles only had to support resolutions up to 640*480@32bit, their framebuffer was small enough to even spare some EDRAM for texture memory.

Bitboys has been working on several iterations of EDRAM-based GPUs but their attempts never reached the market. As far as I know, the main reason was that they couldn't manufacture a chip with enough EDRAM. So it was a matter of timing - how soon will we have a manufacturing process that can give us enough on-die memory for the currently used screen resolutions?

Now that ATI has developed methods to support larger resolutions, it's quite possible that they'll leverage this technology in their high-end video cards as well.

Chalnoth
16-May-2005, 02:47
The other reasons are:
1. You have to sacrifice processing power to add in the RAM (which makes it viable only if you'd otherwise have pretty bad efficiency... such as if you had decided to trade eDRAM against expensive external memory).
2. eDRAM makes it a bit more challenging to clock the parts as high, so you lose some fillrate again for implementing it.

So, in the end, high-end PC parts are about balls-to-the-wall performance, whereas economics is a much larger concern for a console. It makes sense to build a more complex chip for a console, because economies of scale and possible future die shrinks help to offset the added cost of the eDRAM, whereas faster external memory means more complex PCBs and more expensive memory chips (which already have high economies of scale). So it often makes more sense to use eDRAM in a console.

Blazkowicz
16-May-2005, 04:06
On low-end PC cards you now have TurboCache and the like; this very loosely looks like eDRAM to me :D

bloodbob
16-May-2005, 04:25
On low-end PC cards you now have TurboCache and the like; this very loosely looks like eDRAM to me :D
Ehh, seems like quite the opposite to me.

ddes
16-May-2005, 06:13
Up until now, there weren't proper methods to handle a frame buffer that couldn't fit into the EDRAM you could reasonably put into a GPU.
Since current gen consoles only had to support resolutions up to 640*480@32bit, their framebuffer was small enough to even spare some EDRAM for texture memory.

Note that the X360 maximum resolution is 1280x720. At that resolution you can fit 32-bit front and back buffers and a 24-bit Z-buffer into the EDRAM.
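A quick back-of-the-envelope check of that, in Python (my assumptions: a tightly packed 24-bit Z buffer at 3 bytes per pixel, and no extra AA sample storage counted):

# 1280x720 front + back colour buffers plus a packed 24-bit Z buffer.
width, height = 1280, 720
color_bytes = width * height * 4              # 32-bit colour per buffer
z_bytes     = width * height * 3              # 24-bit Z, tightly packed
total_mb = (2 * color_bytes + z_bytes) / 2**20
print(total_mb)                               # ~9.7 MB, so it squeezes into 10 MB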

Bitboys has been working on several iterations of EDRAM-based GPUs but their attempts never reached the market. As far as I know, the main reason was that they couldn't manufacture a chip with enough EDRAM. So it was a matter of timing - how soon will we have a manufacturing process that can give us enough on-die memory for the currently used screen resolutions?

I do remember that there was a silicon prototype with 12 MB of memory on-chip?

jpr27
16-May-2005, 07:55
Chalnoth,

Wouldn't eDRAM be faster than external RAM? I thought fetching information (in this case textures) would be much faster on-die than having to fetch from external memory? Could you clarify this for me?

Thanks

Laa-Yosh
16-May-2005, 08:13
Note that the X360 maximum resolution is 1280x720. At that resolution you can fit 32-bit front and back buffers and a 24-bit Z-buffer into the EDRAM.


That's right, but... have I said anything that contradicts this? Is the X360 a current-gen console? ;)

Jawed
16-May-2005, 08:54
[Might as well post what I wrote in another thread here, too]

It's looking pretty certain now that ATI's GPU + EDRAM architecture for R500 consists of 2 chips.

I dare say the key thing was to create an architecture for a GPU which enables this split, in order to free the GPU from otherwise being restricted, as Chalnoth says, by on-die EDRAM.

ATI's patents on this architecture go back to 1998, with the current form determined in 2000. I dare say it's been a matter of waiting until such an architecture can meet the constraints of the PC gaming business, with the support for legacy games and coding techniques creating a sizable overhead in GPU resources.

We don't know what the die area for 10MB of high performance EDRAM is - we only know that low-power EDRAM would consume about 225mm squared at 90nm. (Bad memory alert: breakfast still settling, maybe it's 150mm squared, anyway whatever it is, it's a big package).

Looking forward to a PC GPU with EDRAM, it would prolly need 32MB of RAM to cater for high-end PC resolutions. Either that or it would be forced to settle for rendering a frame in portions (I won't call them tiles, because we're talking about a half or quarter of the frame).

Jawed

DemoCoder
16-May-2005, 09:23
What are you smoking Jawed? XB360 eDRAM is on a separate chip? That's a contradiction in terms. Besides, to get 256GB/s bandwidth from an external chip, you'd need an insanely wide bus and clock rate: a 512-bit bus and 4GHz RAM, or a 1024-bit bus and 2GHz RAM, or a 2048-bit bus and 1GHz RAM. Try to imagine the chip packaging and PCB layout for a 1024-bit bus.

Either it's onchip, or it's not embedded. Unless you think they are using that Proximity Bus technique that Sun is pursuing, or some kind of optical link. It's just beyond reason.
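To spell out the arithmetic above - a rough sketch (assuming a plain single-data-rate bus; the helper name is just illustrative):

# Bandwidth = bus width in bytes * clock; each combination above lands on 256 GB/s.
def bandwidth_gb_per_s(bus_bits, clock_hz):
    return bus_bits / 8 * clock_hz / 1e9

for bits, clock in [(512, 4e9), (1024, 2e9), (2048, 1e9)]:
    print(bits, bandwidth_gb_per_s(bits, clock))   # 256.0 in every case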

Laa-Yosh
16-May-2005, 09:38
Or the 256GB/sec is some multiplied value of the real bandwidth, because of compression/optimization?

BTW I think it's a separate piece of silicon, but in the same package, much like the first PPros and their cache.

Inane_Dork
16-May-2005, 09:53
Note that the X360 maximum resolution is 1280x720. At that resolution you can fit 32-bit front and back buffers and a 24-bit Z-buffer into the EDRAM.
The X360 is not limited to that, and why would you possibly want to store the front buffer in cache anyway?

Chalnoth
16-May-2005, 10:08
Chalnoth,

Wouldn't eDRAM be faster than external RAM? I thought fetching information (in this case textures) would be much faster on-die than having to fetch from external memory? Could you clarify this for me?

Thanks
Well, yeah, but the problem is: what if external memory isn't your primary limiter of performance? Improving memory performance in this scenario won't help you much. But adding in eDRAM will end up reducing your fillrate (die size kept the same), and thus isn't going to be a good solution much of the time.

Nappe1
16-May-2005, 10:36
Up until now, there weren't proper methods to handle a frame buffer that couldn't fit into the EDRAM you could reasonably put into a GPU.
Since current gen consoles only had to support resolutions up to 640*480@32bit, their framebuffer was small enough to even spare some EDRAM for texture memory.

Note that the X360 maximum resolution is 1280x720. At that resolution you can fit 32-bit front and back buffers and a 24-bit Z-buffer into the EDRAM.

Bitboys has been working on several iterations of EDRAM-based GPUs but their attempts never reached the market. As far as I know, the main reason was that they couldn't manufacture a chip with enough EDRAM. So it was a matter of timing - how soon will we have a manufacturing process that can give us enough on-die memory for the currently used screen resolutions?

I do remember that there was a silicon prototype with 12 MB of memory on-chip?
two different revisions in fact...

http://www.kyamk.fi/~ohj8laka/pics2/panoramas/picture2.jpg
http://www.kyamk.fi/~ohj8laka/pics2/panoramas/picture1.jpg

;)

EDIT:
okay, so the deal with the Bitboys eDRAM system was not to have the whole back buffer in eDRAM at once. The scene was split into tiles and only the tile being rendered needed to fit in eDRAM. In case Matrix Anti-Aliasing was enabled, the AA was applied during the eDRAM -> back buffer transfer. (The guy who was working on the rasterizer implementation of this chip is one of the regulars here, but so far he has decided not to show this side of his talents here. So I am not going to tell you who he is; it's up to him, if he decides to.)

The images above show two different revisions of the chip codenamed AXE. It has a DX8 feature set (VS 1.0 and PS 1.1) with 4 pipelines and 2 TMUs per pipe. Planned clocks were 175MHz core / 175MHz memory. If everything had gone as planned, AXE would have been released as Avalanche 3D in Christmas 2001. The chip is capable of working in dual mode as well, so Avalanche Dual would have had around 46GB/s memory bandwidth and 8 DX8 pipelines.

After this and before moving to the handheld / PDA side, the Boys had another project called Hammer, which had some interesting things coming. It had eDRAM too, but only 4 MB, and it incorporated their own occlusion culling technology. All technology meant to be in Hammer was licensable after the project died, and someone was interested enough in at least their occlusion culling technology, because all material relating to it was removed from their website soon after it had been added there. The only thing I heard was that it was removed because the customer wanted it to vanish; so far I have no idea who the customer was.

So, is there a need for the eDRAM to fit the whole frame buffer? No, I don't think so. All new cards already work on pixel quads for several reasons, and ATI even has higher-level Super Tiling that is used for big offline multi-core rendering solutions. As long as we can divide screen space into smaller parts, what's the reason for keeping the frame buffer as one big rendering space? Every time the renderer finishes a tile, it takes a small time before the next one starts and there's no traffic on the external memory bus during that time, so you could basically use that time for moving the finished tile from eDRAM to the frame buffer.
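To put a rough number on why tiling shrinks the on-die requirement, here's an illustrative calculation (the tile size and AA level are made-up example figures, not anything Bitboys or ATI actually shipped, and the helper name is just for the sketch):

# Bytes needed for colour + Z with 4x multisampling: whole frame vs. one tile.
def buffer_bytes(w, h, samples=4, bpp_color=4, bpp_z=4):
    return w * h * samples * (bpp_color + bpp_z)

print(buffer_bytes(1600, 1200) / 2**20)   # ~58.6 MB for the full frame
print(buffer_bytes(128, 128) / 2**10)     # 512 KB for a single 128x128 tile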

Jawed
16-May-2005, 10:55
I posted this message on Saturday:

http://www.beyond3d.com/forum/viewtopic.php?p=519513#519513

In an interview, Rick Bergman, senior vice-president and general manager of ATI's PC Group, said the XBox 360 will contain an ATI-designed graphics processing unit, the 360 GPU, as well as a companion memory chip.

As well as that, if you read the relevant patents you will see that the Raster Output architecture that ATI has put together is designed around EDRAM for the back frame buffer's pixels only (i.e. excluding AA samples). AA sample data is not kept in EDRAM, because it is too voluminous.

It all adds up to a GPU architecture in which EDRAM shares a die with a blend/filter/query unit, pipelined in a loop with the GPU so that the overall ROP is unaffected by the latency of fetching AA sample data from (slow) local memory (system memory in XBox 360).

The GPU shares some of the ROP workload, generating/blending AA samples as instructed by the EDRAM blend/filter unit and fetching/writing AA samples to local memory.

The GPU and the EDRAM unit work on different fragments/AA sample data, with the GPU both feeding the EDRAM's pipeline and accepting the results back from that pipeline that need further work.

http://www.cupidity.f9.co.uk/b3d16.gif

Jawed

Guden Oden
16-May-2005, 11:00
Or the 256GB/sec is some multiplied value of the real bandwidth, because of compression/optimization?
Yeah, it IS a multiplied value, just like Nvidia's claim of 64GB/sec bandwidth for NV30 due to 4x framebuffer compression, way back when.

At first you'd think MS would have learned not to lie like that, but then you have to remember they're aiming for Joe Consumer with this number and those marketroids don't pull any punches when it comes to making the goods they're pushing look as good as possible. Lies or not.

loekf2
16-May-2005, 21:39
[Might as well post what I wrote in another thread here, too]

We don't know what the die area for 10MB of high performance EDRAM is - we only know that low-power EDRAM would consume about 225mm squared at 90nm. (Bad memory alert: breakfast still settling, maybe it's 150mm squared, anyway whatever it is, it's a big package).



See my other post. If NEC put the right info on their website, it shouldn't be more than 0.22 um2 per cell (one cell = 1 bit), which gives you less than 20 mm2 for 10 MB.
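The arithmetic, for what it's worth (cell area only - a real eDRAM macro adds sense amps, decoders and redundancy on top, so treat this as a lower bound):

# 10 MB of eDRAM at ~0.22 um^2 per bit cell.
bits = 10 * 2**20 * 8             # 10 MB in bits
cell_um2 = 0.22
print(bits * cell_um2 / 1e6)      # ~18.5 mm^2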

This might sound small, but remember it's basically one transistor per bit, so it's much smaller than SRAM.

ATI could have done two things for the R500: put the DRAM on-die or put it into the package. Even inside the package there are speed advantages (next to power consumption). From the NEC press release we've seen, Microsoft opted for on-chip DRAM, and the GPU is hence manufactured at NEC.

loekf2
16-May-2005, 21:42
I posted this message on Saturday:

http://www.beyond3d.com/forum/viewtopic.php?p=519513#519513

In an interview, Rick Bergman, senior vice-president and general manager of ATI's PC Group, said the XBox 360 will contain an ATI-designed graphics processing unit, the 360 GPU, as well as a companion memory chip.



Hmm... companion chip? Why is it then called eDRAM in the first place?

Back to plan B... two dies in one package? I don't think 10 MB is so large that it needs to be put onto a separate die.

Dave Baumann
16-May-2005, 21:48
DC, it's not that widely known yet, but yes, the eDRAM is a separate chip - there is the shader core (produced by TSMC) and then, sitting alongside it but on the same package, is the eDRAM chip produced by NEC. "eDRAM" is probably not the right term; although the ROPs are in here rather than in the shader core, they are probably dwarfed by the silicon of the memory itself. This is another explanation for why the memory bus width is 128-bit rather than 256-bit - the shader core has to deal with the pads for connections to the eDRAM, the host, and main memory.

nAo
16-May-2005, 21:56
This also explains why the eDRAM bandwidth is so 'low': the PS2 GS eDRAM had the same bandwidth (48 GB/s) as R500, but back in 1999, on a 0.25 um process at 150 MHz.
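For comparison, that GS figure falls straight out of its very wide on-die buses - a quick check using the commonly quoted 2560 bits of combined eDRAM bus width (read + write + texture):

# PS2 GS eDRAM bandwidth from bus width and clock.
print(2560 / 8 * 150e6 / 1e9)     # = 48.0 GB/s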

Jawed
16-May-2005, 22:12
Hmm... companion chip? Why is it then called eDRAM in the first place?

http://www.cupidity.f9.co.uk/b3d16.gif

As you can see, Custom Memory 40 has some extra stuff going on.

You can read the patent:

http://patft.uspto.gov/netacgi/nph-Parser?patentnumber=6873323

Jawed

Chalnoth
17-May-2005, 01:43
Those boxes aren't separate chips.

Dave Baumann
17-May-2005, 02:12
Well, they are in the case of Xenon.

nutball
17-May-2005, 07:40
If they're separate chips, could they be on some sort of MCM package then, or would that not give any performance benefits over a more usual two-black-things-on-the-motherboard approach?

Nappe1
17-May-2005, 10:25
Hmm... companion chip? Why is it then called eDRAM in the first place?


Someone said it's called enhanced DRAM. So, another excellent example of misuse of terms.

And as for the claimed 256GB/s transfer rate to the eDRAM, they would need a 4096-bit wide bus to achieve it at a 500MHz clock rate. If it's a DDR bus then it's half of that, but that's still way too much to put in between two cores. (Although it goes inside a single package, it's still just too much.)

tEd
17-May-2005, 12:45
Hmm... companion chip? Why is it then called eDRAM in the first place?


Someone said it's called enhanced DRAM. So, another excellent example of misuse of terms.

And as for the claimed 256GB/s transfer rate to the eDRAM, they would need a 4096-bit wide bus to achieve it at a 500MHz clock rate. If it's a DDR bus then it's half of that, but that's still way too much to put in between two cores. (Although it goes inside a single package, it's still just too much.)

It's probably called eDRAM as it includes additional logic to do Z-compare and blending.

_xxx_
17-May-2005, 15:14
And me thought it says extreme DRAM! :wink:

DegustatoR
17-May-2005, 16:06
Hmm. And what about the bus width between the shader core and the memory chip then? And what frequency is the memory chip running at?

Jawed
17-May-2005, 16:34
The leak for XB360 says 4 gigapixels per second, and the presumption is that's 8 bytes per pixel, i.e. 32GB/s.
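If that presumption holds (my guess being that the 8 bytes are 4 bytes of colour plus 4 bytes of Z per pixel), the numbers work out like this:

# 4 Gpixels/s at 8 bytes per pixel; at 500 MHz that's 8 pixels (2 quads) per
# clock, i.e. roughly a 512-bit-equivalent path.
pixels_per_sec  = 4e9
bytes_per_pixel = 8
print(pixels_per_sec * bytes_per_pixel / 1e9)   # 32.0 GB/s
print(pixels_per_sec / 500e6)                   # 8 pixels per clock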

Who knows, eh :?:

Jawed

DemoCoder
17-May-2005, 17:04
Is it called eDRAM because it's DRAM that is "embedded" on a helper chip? I still think it's a little disingenuous labelling, whether intentional or not, because everyone's first thought is that the eDRAM is on-chip on the R500, just like VRAM on the PS2, and that it would have an amazingly large bus size.

Moreover, using "effective" bandwidth numbers in the specs further leads to confusion, since the only way one could get such high numbers (if you didn't read the fine print "effective") would be from some amazingly large 2048-bit bus to eDRAM inside the core.

DegustatoR
17-May-2005, 17:23
Moreover, using "effective" bandwidth numbers in the specs further leads to confusion, since the only way one could get such high numbers (if you didn't read the fine print "effective") would be from some amazingly large 2048-bit bus to eDRAM inside the core.
What about real bandwidth between R500 and memory chip then? Got any numbers?

DemoCoder
17-May-2005, 17:49
You know as much as I do. I've heard numbers from people anywhere from 32GB/s to 64GB/s.

Xmas
17-May-2005, 18:55
The way MS is calculating is simple: two quads with 32bit color read and write (blending) and Z-stencil test and write with 4xMSAA at 500MHz.
2 quads * 4 pixels * (4 + 4 bytes [read] + 4 + 4 bytes [write]) * 4 samples * 500M = 256GB/s
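Or spelled out, on my reading of the figures (2 quads of 4 pixels, colour and Z each read and written at 4 bytes per sample, 4 samples per pixel, 500MHz):

# Microsoft's apparent arithmetic behind the 256GB/s figure.
quads, pixels_per_quad, samples, clock = 2, 4, 4, 500e6
bytes_per_sample = 4 + 4 + 4 + 4   # colour read + Z read + colour write + Z write
print(quads * pixels_per_quad * samples * bytes_per_sample * clock / 1e9)   # 256.0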

This however is a bit misleading on several levels. First, do they mean bandwidth between the two chips, or bandwidth from ROPs to memory like in "traditional" architectures? Presumably the former, but that isn't comparable to external memory bandwidth figures.

Assuming they mean bandwidth between the two chips...
color and Z data are never read across that connection, because only the ROPs need that data. For the same reason, stencil data is neither read nor written, because a fragment has no associated "stencil data" - that only exists in the framebuffer.
Furthermore, color data is identical for all samples in a pixel and Z data can be encoded as a gradient per quad. The only thing that's needed additionally for AA is a coverage mask.

So disabling blending, Z-test, Z-writes, stencil test, stencil writes, or AA practically "saves" no bandwidth - the connection can be considered a multitude of dedicated channels.

Overall, what is required per quad are 4* 32bit color, compressed Z (3 * 24bit at most) and 4* 4bit of coverage mask, meaning less or equal to 216bit per quad. And since there may be two quads per clock with color, that's at most 432bit required for the connection between the two chips, which equals 27GB/s.

Assuming they mean bandwidth from ROPs to the memory array...
then 64bit (32bit color + 31bit Z/stencil + 1bit flag) need to be read and written per pixel. That means 512bit for reads and writes each are required, equalling 32GB/s for reads and 32GB/s for writes.
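Putting numbers on both readings at 500MHz, following the bit counts above:

# Inter-chip case: at most 216 bits per quad, two quads per clock.
bits_per_quad = 4*32 + 3*24 + 4*4              # colour + compressed Z + coverage
print(2 * bits_per_quad * 500e6 / 8 / 1e9)     # = 27.0 GB/s

# ROPs-to-memory case: 64 bits read and 64 bits written per pixel, 8 pixels/clock.
print(8 * 64 * 500e6 / 8 / 1e9)                # = 32.0 GB/s each way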

(That's the best case with no additional AA samples involved. If that happens, additional bandwidth to sample memory is required)

If you want to compare "effective bandwidth" figures, take an X850 XT PE, which has 6:1 color compression and 24:1 Z compression when 6xAA is enabled; that means a 9.6:1 compression rate overall for the framebuffer. That's 362.5GB/s "effective" if you could use up all bandwidth for the framebuffer, and 256GB/s if 70% is used for framebuffer access.


P.S. don't take that last paragraph too seriously. ;)
R480 can't even output that many compressible quads with 6xAA...

Jawed
17-May-2005, 19:43
Xmas - in this diagram:

http://www.cupidity.f9.co.uk/b3d16.gif

I believe that the 256GB/s "effective bandwidth" is what happens between the Data Path 48 and the Memory Array 46.

It is these two components that are involved when a pixel is blended, or a z-test is performed or when AA samples are filtered.

The bus represented by 30/32 appears to be capable of an effective 32GB/s (2 quads read or write per clock). Whether it's bi-directional (actual 64GB/s) is unclear. But this data is in compressed form, so the actual bandwidth here is unknown...

My understanding of the patent is that AA samples are re-calculated by the Sample Memory Controller, 24. So when a new fragment representing a triangle edge is overlaid on a pixel that represents an existing triangle edge (e.g. two triangles that share an edge), it is the Sample Memory Controller that evaluates the existing pixel (fetched by the Data Path 48) and determines how to recalculate the AA sample set, updating it in Sample Memory 25.

So, not all of the AA workload is carried out within the Data Path 48.

Jawed

Xmas
17-May-2005, 20:22
That's the way I understand that patent, too.

Jawed
17-May-2005, 20:27
Jaws, on another thread, has just pointed out that the read bandwidth 32 is only half the write bandwidth 30. :oops:

Jawed

Per B
17-May-2005, 20:52
This eDRAM sounds somewhat similar to Mitsubishi's 3D-RAM which was used by E&S and Sun ages ago. That memory had some logic on-board to handle blending and Z-compare.

Per, Sweden

RoOoBo
18-May-2005, 08:37
I wonder how (if?) they implement early Z test with this kind of configuration. If Z and stencil tests are performed inside the enhanced memory chip, early Z would require a feedback bus to send back the per-quad masks with the results of the tests. Hierarchical Z also requires feedback from the Z test. If I'm correct in my assumptions and my reading of ATI patents and the old HOT3D presentation, the per-block representative Z value for the HZ is calculated when Z cache lines are evicted and compressed.

Not implementing early Z and/or hierarchical Z would be a very bad idea in terms of utilization of the shading power. Every fragment that can be removed before shading counts a lot.

Special video memory (VRAM) with blending or other acceleration features is very old history; however, I never bothered much to study it, as the trend lately has been to use common DRAM. The most extreme implementation of this approach would be something like the Pixel-Planes architecture, where each pixel had all the hardware to perform every task from rasterization (but only after setup; geometry is also performed on the CPU or another processor) to blend and color write.

mboeller
18-May-2005, 14:21
This eDRAM sounds somewhat similar to Mitsubishi's 3D-RAM which was used by E&S and Sun ages ago. That memory had some logic on-board to handle blending and Z-compare.

Per, Sweden


That's how I understand it too.

So in the end the Xenon GPU is just an upgraded PS2 GS with 10MB, plus a shader core attached on the side. :D

Well, not really, but IMHO it combines the advantages of the eDRAM-enhanced blending-monster GS with a sea of ALUs for very high performance shading (textures, pixel shaders, vertex shading).

Xmas
18-May-2005, 15:24
I wonder how (if?) they implement early Z test with this kind of configuration. If Z and stencil test is performed inside the enhanced memory chip early Z would require a feedback bus to send back the per quad masks with the result of the tests. Hierarchical Z also requires feedback from the Z test. If I'm correct in my assumptions and readings of ATI patents and the old HOT3D presentation the per block representative Z value for the HZ is calculated when Z cache lines are evicted and compressed.
HierZ doesn't need feedback from the Z test. You simply store the farthest Z value per tile. If the incoming tile is beyond that, it is discarded. If it is half-filled and in front, it is accepted. If it is filled and in front, its farthest Z value is written to the tile.
If any other tests are enabled (alpha, stencil, kill), writes to the tile are disabled. If the Z comparison changes, hierZ is disabled.
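A minimal toy sketch of that conservative scheme (not ATI's actual hardware - the function and its parameters are made up for illustration; a 'block' is one rasterized group of fragment depths, smaller values being closer):

# Keep only the farthest Z per tile; reject blocks entirely behind it, and
# tighten the stored value only when an incoming block fully covers the tile.
def hier_z_test(tile_far_z, block_depths, fully_covers_tile):
    if min(block_depths) >= tile_far_z:        # whole block behind the tile
        return "rejected", tile_far_z
    if fully_covers_tile and max(block_depths) < tile_far_z:
        tile_far_z = max(block_depths)         # safe to update the tile's far Z
    return "accepted", tile_far_z              # partial cover: far Z unchanged

print(hier_z_test(1.0, [0.2, 0.3], fully_covers_tile=False))  # ('accepted', 1.0)
print(hier_z_test(1.0, [0.2, 0.3], fully_covers_tile=True))   # ('accepted', 0.3)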

Basic
18-May-2005, 17:19
If you render a high-poly mesh, all tiles with internal poly borders will have a Z-far equal to the background. That's because when you render the first tri, part of the tile will be far away. And when you render a tri next to it (filling the tile with close Z-values), hier-Z has "forgotten" the exact Z-values in the first part. So you can't update the low res Z.

So further rendering behind the high poly mesh won't benefit as much as it could from hier-Z.

With feedback from the high res Z, that could be solved.
But is there anything that says that the custom memory chip can't calculate max of a tile, so it can do the feedback?

Xmas
18-May-2005, 18:17
Sure that could help, though it adds complexity. But it isn't required for hierZ to work.