...while using 128bit memory bus?
The HD4770 has no trouble and there will be much faster GDDR5 (or 6) available by the time the next gen consoles hit.
...while using 128bit memory bus?
Yes, this is like R600->RV670 - the ALU/TU/RBE counts were unchanged and clocks got bumped by 4%, the bus got chopped in half and the GDDR3 clock was raised by 36%.
Except that if the bus got chopped in half and the memory clock raised, Evergreen would have ~45mm² to fill with stuff. That's a hell of a lot of stuff, since I estimate that RV740's clusters are around 52mm².
Or, that's 45mm² of D3D11-specific additions.
Or, that's 45mm² of D3D11 stuff + architectural re-jigging.
It's conceivable that the architecture needs a shake-up to handle the memory-intensive nature of D3D11.
It seems to me that D3D11 is making buffers (resources) of indeterminate count and size a more finely-grained component of rendering. Previously, rendering consisted of consuming some fixed-size buffers whilst writing to other fixed-size buffers.
Now D3D11 gives developers access to their own arbitrarily-sized buffers, to be used pretty much whenever they feel like it (PS or CS seem the most likely places, and arguably CS is a distinct rendering pipeline all of its own) - though it seems there is still a hard limit on the number of these buffers bound to the rendering pipeline at any one time.
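As a toy illustration of that hard limit: arbitrarily-sized buffers, but a fixed number of binding slots. This is a plain-Python model, not a real API - the class names are invented, and only the slot count of 8 comes from the D3D11 limit discussed in this thread.

```python
# Toy model of the binding constraint: buffers can be any size the
# developer likes, but only a fixed number of UAV slots (8 in D3D11)
# can be bound to the pipeline at any one time.
MAX_UAV_SLOTS = 8

class Pipeline:
    def __init__(self):
        self.uav_slots = [None] * MAX_UAV_SLOTS

    def bind_uav(self, slot, buffer):
        if not (0 <= slot < MAX_UAV_SLOTS):
            raise ValueError("only %d UAV slots available" % MAX_UAV_SLOTS)
        self.uav_slots[slot] = buffer

p = Pipeline()
p.bind_uav(0, bytearray(1 << 20))   # 1 MiB buffer: size is up to the developer
p.bind_uav(7, bytearray(256))       # tiny buffer: also fine
try:
    p.bind_uav(8, bytearray(16))    # a ninth buffer: rejected by the slot limit
except ValueError as e:
    print("rejected:", e)
```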
HD4770 has no trouble, but is it at the performance level Microsoft requires for their next-gen console?
The question is whether a 192-bit bus would be enough...
ninelven said: So what costs more, enough e-dram to do 4x AA @ 1920x1080 with HDR, or a 192-bit memory bus?
So what does cost more? EDRAM for 1920x1080 / 4x MSAA, a 128-bit bus with cheap, slow, low-power memory modules, cheap routing, no Z-compression hardware, simplified ROPs...
...or an (at least) 256-bit interface with more PCB routing, more expensive and power-hungry memory modules, a bigger GPU with Z-compression hardware, a more complex memory bus, and more complex ROPs?
There is one more advantage: you can shrink the GPU on a new manufacturing process without reducing memory bus width. Xenos was designed for 90nm production, is now produced at 65nm, and a 45nm version is on the way. No need to change the memory bus or use faster (more expensive) memory parts to compensate.
I'm not convinced that we will see this 180mm² GPU in a future Xbox console...
Desktop GPU != Console GPU.
That's what I'm talking about. The entire idea of the RV840(?) GPU as a multi-platform product (PC/Xbox) seems completely unrealistic to me.
I don't know where you got that idea; I wasn't talking about this particular chip, rather the architecture in general (since it is quite efficient per mm²). Obviously, it would be a custom design.
This document:
Was going through Bill Bilodeau's presentation again, and realized I had missed some important info from slide 20 saying that the "3 times faster and 1/100th the size" claim compares ATI DX9 (actually DX10, with vertex texture fetch) tessellation (requiring at least one pass to export per-edge tessellation factors, and a second to tessellate) against running the high-poly mesh through the standard DX9 pipeline; also that "DX11 Tessellator algorithms can usually be done in one pass".
Not sure what that "usually" refers to, other than perhaps that with DX11 they sometimes really do spill to main memory if they cannot fit the data flow between pipeline stages on-chip.
Guess the other point here is that with DX10 tessellation, the spill to main memory was just the edge factors; all the amplified (i.e. expensive) stuff was always on-chip (GS excluded). The point being that it doesn't look too hard to still keep the majority of the data on-chip.
It seems very enlightening to realize that the DX11 PS stage has an RTV+UAV limit of 8, and CS also has a UAV limit of 8. And even though a UAV (unordered access view) can access a texture, this is still "unordered". It seems rather natural to assume that, at least on one DX11 arch, render targets and UAVs are the same interface!?
BTW, SRVs (shader resource views) are read-only.
Yes, I neatly dodged the question of read/write to a single buffer in my earlier post, since I didn't want to touch on reading a buffer "shortly" after writing it. I suspect there'll be limited use of this under D3D11 for append/consume buffers - they seem to be explicitly multi-pass centric, geared for ease of scheduling. This is the basic problem with the single extant draw-state model of the current GPU pipeline, as far as I can tell.
As for Append/Consume, there are two options, Raw or Structured, and I believe now that buffers are mapped as either Append or Consume but NOT both in one shader! So I wouldn't be surprised at all if Append/Consume goes direct to global memory! Also, I haven't seen limits posted yet, but I'd bet there is a strict limit on how many of these one gets in a shader.
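A plain-Python sketch of how append/consume semantics could behave: an atomic counter indexing into backing storage, with appends in one pass and consumes in a later pass, matching the multi-pass usage discussed above. All names here are invented, and the hardware details are speculation.

```python
import threading

class AppendConsumeBuffer:
    """Toy model: a counter plus flat storage. In a shader the counter
    bump would be a hardware atomic; a lock stands in for it here."""
    def __init__(self, capacity):
        self.data = [None] * capacity
        self.counter = 0
        self.lock = threading.Lock()

    def append(self, value):           # "Append" view: write-only push
        with self.lock:
            idx = self.counter
            self.counter += 1
        self.data[idx] = value

    def consume(self):                 # "Consume" view: read-only pop
        with self.lock:
            self.counter -= 1
            idx = self.counter
        return self.data[idx]

buf = AppendConsumeBuffer(64)
for v in [3, 1, 4]:                    # pass 1: producer appends
    buf.append(v)
drained = [buf.consume() for _ in range(3)]   # pass 2: consumer drains
# On real hardware the pop order would not be guaranteed - hence "unordered".
print(sorted(drained))
```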
The DX11 hardware transition is looking much more mild to me now.
Me too. The lack of pre-emption is, to me, a big mistake. It means everything's forced down a single logical pipeline, and it means unruly applications can kill the system with a meaty CS. The latest instance of this appears to be people playing with ATI's PCI Express bandwidth tester, reporting that VPU recover kicks in or the system reboots - all because they're copying huge amounts of data.
I'm curious that nobody has commented on the MVI_2932.AVI file that was linked earlier (or is that in the G300 thread?).
The speaker clearly states "...you need those 800 shader processors enabled by the compute shader capabilities of DX11"
Could just be talking about the SM4.1 variant of compute shaders & running the demo on RV770 I guess?
Is it the standard Froblins demo or is it supposed to be a DX11ised version?
OK, this is pretty weird: the article neliz linked earlier no longer has the picture (which can still be seen in neliz's post), and the article actually says "We can't provide you a close up right now but judging from the looks of it, the card is about 8.5" long, dual slot and requires a 6-pin power."
http://vr-zone.com/articles/amd-dx11-r800-card-exposed/7154.html?doc=7154
Cooling setup imho is reminiscent of RV740.
I dare say that the underlying model here is that a UAV is simply a fixed-size render target where arbitrary pixel locations can be written by any number of fragments, in any order. This is how scatter works in GPUs already, as far as I can tell.
The limit of 8 implies that the hardware controlling addressing has limited capability - after all, the RBEs/ROPs have to work cooperatively with the same set of 8 base addresses, which implies a very simple/small/fast mechanism for generating addresses within each unit, according to the named buffer.
...
If you want to read and write concurrently then you're stuck with UAVs and CS it seems. Although now I think about it, I was under the impression that PS could read and write the render target(s). Hmm.
So if the developer wants to have multiple independent scatter buffers they have to do their own addressing math with offset base addresses.
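The sort of address generation described above might amount to no more than a base-table lookup plus a linear offset. The base table, pitch and element size below are invented numbers for illustration only:

```python
# Hypothetical sketch of simple per-unit address generation: each write
# names one of 8 buffers, and the unit adds a row-major linear offset to
# that buffer's base address. All constants here are made up.
BASES = [0x1000 * i for i in range(8)]   # 8 base addresses, one per UAV slot
PITCH = 256                              # bytes per row (assumed)

def uav_address(slot, x, y, bytes_per_elem=4):
    return BASES[slot] + y * PITCH + x * bytes_per_elem

print(hex(uav_address(2, 3, 1)))   # base 0x2000 + 1*256 + 3*4 = 0x210c
```

A developer packing several logical scatter buffers into one UAV would simply fold their own base offsets into `x`/`y` before this step.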
And it seems to have been replaced by:
http://vr-zone.com/articles/amd-dx11-rv8xx-card-exposed/7154.html?doc=7154
notice the slightly different URL?
Meanwhile, this is what Device Manager says:
http://resources.vr-zone.com//uploads/7154/System.JPG
Jawed
768MB Ram then?
I think only RTs can support R/W, and this "works" because the fragment can only access its own pixel. Don't know where to look to find this out for sure, though.
UAV read/write and RT read/write in PS will be interesting on DX11. Unlike CS, where blocks can be made to directly correspond to vector memory accesses, this isn't likely to happen in PS stages other than when a triangle fully covers a SIMD-sized coarse tile. And outside Larrabee, that means collecting pixel quads differently for coarse fully-filled SIMD-sized tiles and non-fully-filled SIMD-sized tiles (collections of various 2x2 pixel quads). Small tessellated triangles surely don't help here.
On Larrabee (and perhaps ATI and NVidia, depending on actual hardware design), there could be a bunch of cases (I'm thinking some very strange deferred shading and particle stuff) where it is actually better to roll your own "rasterizer" in a CS pass, to ensure good pixel grouping for vector cacheline-aligned and coalesced accesses, and to avoid the overhead of the triangle raster for non-triangle stuff (like lights).
If PS UAV/RT read/write is going to be a fast path on ATI (and NVidia), we are bound to be looking at something quite a bit different from, say, how CUDA accesses global memory (because that isn't going to be fast for PS) - rather something like a read/write version of a ROP.
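To make the pixel-grouping point concrete, here is a toy CS-style "rasterizer" loop in pure Python. The 4x4 tile stands in for a 16-wide SIMD, and everything here is invented for illustration - the point is only that grouping pixels tile-by-tile yourself guarantees each SIMD batch is fully populated with spatially coherent pixels.

```python
# Pack screen pixels into SIMD-sized tiles ourselves, as a CS "rasterizer"
# might, instead of relying on the hardware's 2x2-quad collection.
TILE = 4   # 4x4 tile = 16 pixels = one 16-wide SIMD batch (assumed)

def tile_pixels(width, height):
    """Yield pixels grouped tile-by-tile, so each group of TILE*TILE
    pixels lands in consecutive SIMD lanes / cachelines."""
    for ty in range(0, height, TILE):
        for tx in range(0, width, TILE):
            yield [(tx + dx, ty + dy)
                   for dy in range(TILE) for dx in range(TILE)]

groups = list(tile_pixels(8, 8))
print(len(groups), len(groups[0]))   # an 8x8 region -> 4 tiles of 16 pixels
```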
Yeah, but that (i.e. own addressing math) is the way it should be, because anything else isn't going to scale. A small number of UAVs isn't a limitation in my eyes.
Hmm, well, I guess this is like being told to write a program with only one malloc(). Then again, multiple buffers will start to fragment memory if they have varying lifetimes. Virtualised memory (everything paged) is a partial solution, as contiguousness no longer becomes a prerequisite for creating the buffer. But I'm well out of my depth on this whole subject, gladly so.
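The "one malloc()" model could be handled with simple sub-allocation: carving logical scatter buffers out of a single UAV by handing out base offsets and doing the addressing math yourself. A minimal sketch (all names invented; this deliberately ignores the fragmentation issue raised above):

```python
# Sub-allocate logical buffers from one big backing store ("one UAV"),
# using a bump allocator; writes add the logical buffer's base offset.
class Arena:
    def __init__(self, size):
        self.storage = bytearray(size)
        self.next_free = 0

    def suballoc(self, size):
        base = self.next_free
        if base + size > len(self.storage):
            raise MemoryError("arena exhausted")
        self.next_free += size
        return base                      # a logical buffer is just an offset

    def write(self, base, offset, data):
        self.storage[base + offset : base + offset + len(data)] = data

arena = Arena(1024)                      # one big UAV
particles = arena.suballoc(256)          # logical scatter buffer #1
histogram = arena.suballoc(128)          # logical scatter buffer #2
arena.write(histogram, 4, b"\x2a")       # scatter into buffer #2
print(arena.storage[260])
```

With varying buffer lifetimes a bump allocator like this fragments, which is exactly the problem virtualised (paged) memory would sidestep.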
768MB Ram then?
No, addressing doesn't work like that. There's usually a 256MB window of address space per GPU, regardless of video memory size. The rest of the address space is allocated to other peripherals (the likes of the system clock, PCI devices, DMA...); a bunch is reserved by the OS anyway, which is why 3.25GB is the most common usable-memory figure under 32-bit Windows.