Wii U hardware discussion and investigation *rename

Please correct me if I am wrong. With deferred shading, don't you only run pixel shaders once per pixel*? How many texture reads does the average game perform for each pixel?

I'm not an expert but Sebbi for example could help you to calculate actual texture data traffic.

The basic logic should be, why does the Xbox need almost twice the bandwidth on top of the EDRAM, and the PS3 almost four times the total bandwidth?
 
You might want some memory B/W for CPU, audio, I/O, video scanout (which at 1080P isn't peanuts when you only have 12ish GB/s to work with) etc... :p

Not sure about the bandwidth that might be required to feed the CPU and audio. Hopefully the larger caches on the CPU reduce the amount of main memory traffic compared to the Xbox 360. Video scanout at 1080p / 60 Hz should take ~0.5 GB/s. Considering the limited speed of the mass storage device and of the disc drive, I expect the bandwidth required for I/O to be quite negligible.

According to the calculation above (assuming I'm not missing something obvious), even with only half the bandwidth available for texturing (6.4 GB/s), a 60 fps game at 720p can do ~31 texture reads per pixel shaded (in a deferred renderer without accounting for transparency). It does not sound that terrible... Not that good either.
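For what it's worth, here's a minimal sketch of the arithmetic behind those two figures. The 4 bytes per scanned-out pixel, the 4 bytes per texture fetch, and the 2^30-byte "GB" are my assumptions, not something stated above; with decimal GB the result comes out closer to 29 reads per pixel.

```python
# Back-of-the-envelope check of the scanout and texturing numbers above.
# Assumptions (mine): 32-bit pixels for scanout, 4 bytes fetched per
# texture read, and "GB" taken as 2^30 bytes.

def scanout_bw(width, height, fps, bytes_per_pixel=4):
    """Bandwidth needed just to scan the front buffer out to the display."""
    return width * height * bytes_per_pixel * fps  # bytes/s

def texture_reads_per_pixel(texture_bw, width, height, fps, bytes_per_read=4):
    """Texture fetches per shaded pixel that a given bandwidth allows."""
    pixels_per_second = width * height * fps
    return texture_bw / (pixels_per_second * bytes_per_read)

GiB = 2**30
print(scanout_bw(1920, 1080, 60) / 1e9)                    # ~0.50 GB/s for 1080p60 scanout
print(texture_reads_per_pixel(6.4 * GiB, 1280, 720, 60))   # ~31 reads per pixel at 720p60
```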

I'm not an expert but Sebbi for example could help you to calculate actual texture data traffic.

This would be very interesting. ;)

The basic logic should be, why does the Xbox need almost twice the bandwidth on top of the EDRAM, and the PS3 almost four times the total bandwidth?

Well, it is quite hard to compare to the PS3, which doesn't have EDRAM, since we don't know how much of the bandwidth is used for framebuffer read/writes. It's also hard to say whether Xbox 360 games are frequently limited by main memory bandwidth.

I am not trying to defend the Wii U design though. I wonder how long it will take for tablets to reach its performance. Probably a couple of years on the GPU side; on the CPU side they might already be there.
 
Not sure about the bandwidth that might be required to feed the CPU and audio. Hopefully the larger caches on the CPU reduce the amount of main memory traffic compared to the Xbox 360. Video scanout at 1080p / 60 Hz should take ~0.5 GB/s. Considering the limited speed of the mass storage device and of the disc drive, I expect the bandwidth required for I/O to be quite negligible.

According to the calculation above (assuming I'm not missing something obvious), even with only half the bandwidth available for texturing (6.4 GB/s), a 60 fps game at 720p can do ~31 texture reads per pixel shaded (in a deferred renderer without accounting for transparency). It does not sound that terrible... Not that good either.

Have you accounted for geometry BW?

That would be about 32 Bytes/vert uncompressed, or about ... I don't know what that would be with all these new fancy-pants compression schemes. XNA has a 24 Bytes/vert scheme though so I guess that would be a starting point.

Anyway, factor in reading a couple of million vertices a frame and that's going to make a fair dent in your main memory BW.
 
Have you accounted for geometry BW?

That would be about 32 Bytes/vert uncompressed, or about ... I don't know what that would be with all these new fancy-pants compression schemes. XNA has a 24 Bytes/vert scheme though so I guess that would be a starting point.

Anyway, factor in reading a couple of million vertices a frame and that's going to make a fair dent in your main memory BW.

I didn't realize games were already pushing so many vertices per frame... Isn't it extremely wasteful to rasterize with 2 million vertices when rendering less than 1 million pixels (at 720p)? Even fragment shaders run inefficiently when triangles are too small, don't they? Maybe 2M is the total number of vertices in a scene, but I hope LOD and culling cut that down by a significant margin.

By the way, 2M vertices would be 2.8 GB/s worth of bandwidth, considering 24 bytes/vertex, at 60 fps.
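A quick sketch of that arithmetic, under the (optimistic) assumption that each vertex is fetched from main memory exactly once per frame, with no caching, instancing or compression:

```python
# Rough vertex-streaming bandwidth behind the ~2.8 GB/s figure above.
# Assumption (mine): every vertex is read from main memory once per frame.

def vertex_bw(verts_per_frame, bytes_per_vert, fps):
    return verts_per_frame * bytes_per_vert * fps  # bytes/s

print(vertex_bw(2_000_000, 24, 60) / 1e9)  # 2.88 GB/s with the 24 B/vert XNA-style layout
print(vertex_bw(2_000_000, 32, 60) / 1e9)  # 3.84 GB/s uncompressed at 32 B/vert
print(vertex_bw(3_000_000, 24, 30) / 1e9)  # 2.16 GB/s for 3M verts at 30 fps
```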
 
I'm still not 100% convinced that everything is as black and white as it seems. The assumption that the GPU must be very close to Xenos is based entirely on the logic that developers already have full understanding of how to take advantage of the GPU due to its R700 architecture. However, using that logic, we'd have to assume that the GPU is basically off-the-shelf. I really don't know. I think the GPU is definitely weak, but I'm really not the type of person who likes to ignore variables for simplification purposes.
 
I didn't realize games were already pushing so many vertices per frame... Isn't it extremely wasteful to rasterize with 2 million vertices when rendering less than 1 million pixels (at 720p)? Even fragment shaders run inefficiently when triangles are too small, don't they? Maybe 2M is the total number of vertices in a scene, but I hope LOD and culling cut that down by a significant margin.

By the way, 2M vertices would be 2.8 GB/s worth of bandwidth, considering 24 bytes/vertex, at 60 fps.

The numbers I recall seeing for a couple of AAA games were around 1 ~ 3 million polys per frame at 30 fps, and I think that's submitted to the GPU. Instancing would presumably save some geometry BW, and culling would mean that not all are actually rasterised.

That's about all I know, hopefully a dev can chip in on the matter. Suffice to say, I don't think geometry requirements on main memory BW will be insignificant.

Regarding the Wii U memory bandwidth, a look at some Ivy Bridge memory benches shows read and write bandwidth being about 75% of theoretical max. How meaningful those benchmarks are I don't really know, but assuming they do represent the BW typically available, and assuming that the Wii U memory controller is as efficient (unlikely?), then the Wii U could in practice have rather less than 12.8 GB/s actually available.

(IVB memory benches: http://hexus.net/tech/reviews/cpu/38421-intel-core-i5-3570k-22nm-ivy-bridge/?page=4)
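To put numbers on that "rather less than 12.8 GB/s", a small sketch of what various controller efficiencies would leave usable. The 75% figure is from the Ivy Bridge benches linked above; whether the Wii U controller does as well (or worse) is pure speculation.

```python
# Usable bandwidth at assumed controller efficiencies, starting from the
# 12.8 GB/s theoretical peak discussed in this thread.

PEAK_BW = 12.8e9  # bytes/s, theoretical

for efficiency in (0.50, 0.65, 0.75, 0.80):
    print(f"{efficiency:.0%}: {PEAK_BW * efficiency / 1e9:.1f} GB/s usable")
# 50%: 6.4 GB/s, 65%: 8.3 GB/s, 75%: 9.6 GB/s, 80%: 10.2 GB/s
```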
 
Not sure about the bandwidth that might be required to feed the CPU and audio.
CPU would vary. Instruction caching is quite effective in modern CPU designs (cue joke about the ancient 1990s-era nature of wuucpu... :rolleyes:), but data consumption can range from zero to the sky's the limit, depending on workload. Audio won't be much, granted, but it would add a lot of scattered accesses causing additional DRAM page miss penalties.

Video scanout at 1080p / 60 Hz should take ~0.5 GB/s
Don't forget the framebuffer resolve to main memory... If running at 60fps (yeah, right! :LOL:) we've now blown roughly a gig per second of precious, precious wuu RAM bandwidth just to put an image up on the TV.

That's assuming you can't keep the front buffer in eDRAM the whole time of course - and that you have room in eDRAM to keep it there - in which case scanout would be deducted from the eDRAM bandwidth budget instead of main RAM... or that wuugpu doesn't use some kind of YUV compression trick for the front buffer like gamecube/wii did. It's hard to estimate these kinds of things when you don't have any friggin' clue how the hardware actually works, or what its specific capabilities are! Nintendo secrecy is damn frustrating, wouldn't you say? :D

Considering the limited speed of the mass storage device and of the disc drive I expect the bandwidth required for I/O to be quite negligible.
Yeah, except for the tablet of course... 800*640 or whatever the hell weirdo rez it uses, times however many FPS it updates at, times bits per pixel (probably 24, packed to save space/BW...unless the software has use for Z and/or destination alpha in which case it would be more but I dunno if that'd ever be the case; most games will probably avoid rendering heavy stuff for the tablet as much as possible.)
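Rolling those guesses into one hedged tally (the 1080p resolve, the tablet resolution/rate, and the pixel formats are all assumptions; a 720p resolve would roughly halve the resolve part):

```python
# Hedged tally of display-related main-memory traffic discussed above.
# Resolutions, refresh rates and pixel formats are assumptions.

def stream_bw(width, height, fps, bytes_per_pixel):
    return width * height * bytes_per_pixel * fps  # bytes/s

resolve_1080p60 = stream_bw(1920, 1080, 60, 4)  # colour resolve from eDRAM to main RAM
scanout_1080p60 = stream_bw(1920, 1080, 60, 4)  # front buffer back out to the TV
tablet_guess    = stream_bw(800, 640, 30, 3)    # 24bpp packed; res and rate are guesses

print((resolve_1080p60 + scanout_1080p60) / 1e9)  # ~1.0 GB/s, the "gig per second" above
print(tablet_guess / 1e9)                         # ~0.05 GB/s more for the tablet guess
```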

According to the calculation above (assuming I'm not missing something obvious), even with only half the bandwidth available for texturing (6.4 GB/s), a 60 fps game at 720p can do ~31 texture reads per pixel shaded
Well we haven't accounted for memory efficiency, which for DRAM in PCs is quite low. Even straight linear access benchmarks typically only crack about 50% of theoretical performance, if that much, and in a unified memory architecture you'll have tons of accesses going on all the time for every device in the system. Your 50% estimate up there probably isn't even worth the paper it's written on.

Hard to say how well just a GPU and nothing else would utilize its framebuffer; I ran an OpenCL memory test program on my old-ish Radeon 6970s and efficiency was LOW as hell. But grud knows how effective OpenCL really is running on my boards or how well optimized that program was so it might not mean much.
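Not the program used above, but for anyone who wants to poke at this themselves, a minimal PyOpenCL sketch of how one might time a device-to-device copy and compare it against a board's theoretical bandwidth. The library choice, buffer size and overall approach are mine; treat it as a starting point rather than a proper benchmark.

```python
# Crude effective-bandwidth probe: time one large device-to-device copy.
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

nbytes = 256 * 1024 * 1024  # 256 MB per buffer (arbitrary)
src = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=nbytes)
dst = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=nbytes)

# A copy touches every byte once as a read and once as a write.
evt = cl.enqueue_copy(queue, dst, src, byte_count=nbytes)
evt.wait()

seconds = (evt.profile.end - evt.profile.start) * 1e-9
print(2 * nbytes / seconds / 1e9, "GB/s effective (read + write)")
```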
 
I didn't realize games were already pushing so many vertices per frame... Isn't it extremely wasteful to rasterize with 2 million vertices when rendering less than 1 million pixels (at 720p)?

Geometry is rendered multiple times - each shadow casting light requires its own render, or, for example, large cascaded shadow maps may require submitting each object twice - once for the buffer, once for the final render.
Hidden surface removal and culling techniques can of course be used every time.

Also, some games utilize vertex colors for masking texture repetition, which means additional data for the environment geometry. On the other hand, those vertices are usually not skinned (not sure how animated geometry is submitted to the GPU though; shouldn't the GPU do the skinning?)
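To make the "rendered multiple times" point concrete, a toy multiplier; the counts below are made up purely for illustration:

```python
# Toy illustration of how shadow passes multiply the geometry actually
# submitted per frame. All counts are invented for illustration.

visible_verts   = 1_000_000   # geometry surviving culling for the main view
shadow_cascades = 3           # each cascade re-submits (a subset of) the casters
caster_fraction = 0.5         # rough share of geometry re-drawn per cascade

submitted = visible_verts * (1 + shadow_cascades * caster_fraction)
print(submitted)  # 2.5M vertices submitted per frame, from 1M "visible" ones
```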
 
Audio won't be much, granted, but would add a lot of scattered accesses causing additional DRAM page miss penalties.

Scattered accesses why? Data access for audio code tends to be very linear...

Well we haven't accounted for memory efficiency, which for DRAM in PCs is quite low. Even straight linear access benchmarks typically only crack about 50% of theoretical performance, if that much, and in a unified memory architecture you'll have tons of accesses going on all the time for every device in the system. Your 50% estimate up there probably isn't even worth the paper it's written on.

These days memory benchmarks on PCs can do a lot better than 50%. Lower end platforms are another story. I don't think the DRAM technology has much to do with this.

I see function linked one showing 75% for AIDA64 - Sandra gets around 80% (http://www.tomshardware.com/reviews/ivy-bridge-benchmark-core-i7-3770k,3181-10.html)

This of course doesn't necessarily mean that the left over cycles are wasted on the memory bus and can't be used by other devices.
 
Scattered accesses why? Data access for audio code tend to be very linear..
I'd expect patterns to be regular over time perhaps (on short timescales, naturally), but why would you expect all PCM data for individual sound effects playing at any one time to be packed together nice and tight? They're going to be crossing page boundaries by necessity all over the place, and audio playback buffers are typically only a few hundred ms in size. The sound playback thread (or DSP, as it would seem in this case) would have to read in new data regularly from semi-random locations.

These days memory benchmarks on PCs can do a lot better than 50%.
Yah, on a sandy or ivy bridge CPU perhaps, linearly and with heavy prefetching. Hardly a realistic simulation of a UMA system.

This of course doesn't necessarily mean that the left over cycles are wasted on the memory bus and can't be used by other devices.
Also doesn't mean they CAN be.
 
I'd expect patterns to be regular over time perhaps (on short timescales, naturally), but why would you expect all PCM data for individual sound effects playing at any one time to be packed together nice and tight? They're going to be crossing page boundaries by necessity all over the place, and audio playback buffers are typically only a few hundred ms in size. The sound playback thread (or DSP, as it would seem in this case) would have to read in new data regularly from semi-random locations.

A sound effect lasting even a fraction of a second will consist of thousands of samples. I think being afforded thousands of contiguous data accesses qualifies as very linear.

You're talking about incurring full latency when a new sound effect starts. How many sound effects start at peak... a few dozen a second? Are you going to argue on the merit of a few dozen cache misses a second? Because if so, I'm not sure you're really thinking this argument through.
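For scale, the per-voice streaming rate being argued about here; the 48 kHz / 16-bit mono source format is my assumption:

```python
# Per-voice audio streaming rate. 48 kHz, 16-bit mono source data is an
# assumption; the actual formats aren't established in this thread.

SAMPLE_RATE = 48_000
BYTES_PER_SAMPLE = 2

bytes_per_voice = SAMPLE_RATE * BYTES_PER_SAMPLE
print(bytes_per_voice)             # 96,000 B/s per mono voice
print(64 * bytes_per_voice / 1e6)  # ~6 MB/s even with 64 voices playing at once
```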

Yah, on a sandy or ivy bridge CPU perhaps, linearly and with heavy prefetching. Hardly a realistic simulation of a UMA system.

You said that a PC couldn't achieve 50% in an ideal bandwidth benchmark. Did you have something else in mind for a PC?

Also doesn't mean they CAN be.

We do know DDR3 can handle a lot of accesses in flight though..
 
You're talking about incurring full latency when a new sound effect starts.
No, I'm not.

How many sound effect start at peak.. a few dozen a second? Are you going to argue on the merit of a few dozen cache misses a second?
The hardware wouldn't read in a number of entire samples instantly, or even one sample for that matter. Modern audio implementations generally have one or more output buffers located somewhere in RAM, and this buffer is typically quite short, to avoid audio and video becoming (noticeably) desynched. The audio software mixes short bits of samples and places the finished data in the buffer to be played back. This means parts of each sample being played are read in in staccato fashion multiple or even many times per second, and this will be on top of all the other memory traffic going on, of course. Audio load on its own would be completely negligible of course, but everything piles up.
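A sketch of the access pattern being described: each mix tick, every active voice contributes one short contiguous slice of source data. Buffer length, sample rate and voice count are assumed for illustration.

```python
# "Staccato" mixing pattern: per tick, one short sequential read per voice.
# Mix tick length, sample rate and voice count are assumptions.

SAMPLE_RATE   = 48_000
MIX_TICK_MS   = 5        # short output buffer to keep audio latency down
ACTIVE_VOICES = 32

frames_per_tick = SAMPLE_RATE * MIX_TICK_MS // 1000  # 240 frames per tick
bytes_per_slice = frames_per_tick * 2                # 16-bit mono source data

ticks_per_sec = 1000 // MIX_TICK_MS
print(frames_per_tick, bytes_per_slice)   # 240 frames, 480 B per voice per tick
print(ticks_per_sec * ACTIVE_VOICES)      # 6,400 separate (but sequential) reads/s
```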

You said that a PC couldn't achieve 50% in an ideal bandwidth benchmark.
No I didn't. You must be hard of reading or something, this is the second time you've grossly misinterpreted my posts.

We do know DDR3 can handle a lot of accesses in flight though..
Do we? What's your source for that? AFAIK DDR3 handles one outstanding read or write request at a time, in the order each command is received.
 
No, 360, like any other system using standard commodity DRAMs cannot both read and write to main memory at the same time. Don't get confused by the CPU-to-GPU bus having separate read and write lines, that's just a convenience to avoid having bus turnaround penalties slapping you and increasing what is already quite high main memory latency (from the CPU's standpoint, that is.)

OK thanks, that's likely where I'm getting my wires crossed.
 
No, I'm not.

You're talking about random accesses. Same thing.

The hardware wouldn't read in a number of entire samples instantly, or even one sample for that matter. Modern audio implementations generally have one or more output buffers located somewhere in RAM, and this buffer is typically quite short, to avoid audio and video becoming (noticeably) desynched. The audio software mixes short bits of samples and places the finished data in the buffer to be played back. This means parts of each sample being played are read in in staccato fashion multiple or even many times per second, and this will be on top of all the other memory traffic going on, of course. Audio load on its own would be completely negligible of course, but everything piles up.

Yeah, "quite short" is at least tens of ms and thousands of output samples. And it doesn't matter how small the buffer is, a sample you play isn't going to be read several times.

The discussion here was about random access. If the audio thread or DSP can prefetch the samples, including what you get just from locality of reference in the cache, it doesn't constitute a bunch of random accesses. This would even apply across dumps of the circular buffer to the DAC hardware.

No I didn't. You must be hard of reading or something, this is the second time you've grossly misinterpreted my posts.

You sure as hell did say it. Don't give me this hard of reading crap without even bothering to check your own posts. THIS is what you said.

"Well we haven't accounted for memory efficiency, which for DRAM in PCs is quite low. Even straight linear access benchmarks typically only cracks about 50% of theoretical performance, if that much"

Please, explain to me how I was supposed to interpret this if not you saying that memory benchmarks on PCs can achieve 50% theoretical bandwidth utilization at best. Please.

Do we? What's your source for that? AFAIK DDR3 handles one outstanding read or write request at a time, in the order each command is received.

DDR3 allows 8 open banks for concurrent accesses. See page 21 here for instance (or look at a datasheet):

http://www.slideshare.net/itzjishnu/ddr3
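A toy model of why those banks matter: the row activate/precharge for one bank can overlap with data bursts from the others. The timings are ballpark DDR3-1600 figures and the model ignores real constraints like tFAW, so take the ratio as illustrative only.

```python
# Toy model: same-bank row misses vs. bank-interleaved accesses.
# Timings are ballpark DDR3-1600 numbers in nanoseconds, illustrative only.

tRP, tRCD, CL, tBURST = 13.75, 13.75, 13.75, 5.0  # ns; burst of 8 at 1600 MT/s

accesses = 1000  # reads, each to a new row

# Worst case: every read hits the same bank, so nothing overlaps.
serial_ns = accesses * (tRP + tRCD + CL + tBURST)

# Best case: reads spread across banks keep the data bus busy, so
# throughput is limited by burst time alone (after one pipeline fill).
interleaved_ns = (tRP + tRCD + CL) + accesses * tBURST

print(serial_ns / interleaved_ns)  # roughly 9x in this (very) simplified model
```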
 
Dude, you're just annoying the shit out of me so I'm not gonna bother to reply to all of your drivel other than to say "typically" does not mean "cannot do more than". And DDR3 can still only SERVICE one request at a time. End of fucking story.
 
Dude, you're just annoying the shit out of me so I'm not gonna bother to reply to all of your drivel other than to say "typically" does not mean "cannot do more than". And DDR3 can still only SERVICE one request at a time. End of fucking story.

Cool it Grall. You're right about DDR3.
We need someone who is actively developing to give us an estimate of what overall memory access patterns look like for "typical" games.
 
Cool it Grall. You're right about DDR3.
We need someone who is actively developing to give us an estimate of what overall memory access patterns look like for "typical" games.

Look, I don't know what kind of semantics one is using for "service" but if the device has a bunch of banks all firing up accesses to different addresses simultaneously who would not consider that as qualifying?

Pipelining on memory has been in development ever since the original EDO, hit major strides with SDRAM, and has only improved since then. You can start new accesses before the ones you've previously started have finished. If that's not multiple outstanding requests then I don't know what is.

Think about Intel's boasting about Atom's L2 cache controller being able to handle more outstanding misses in flight than Cortex-A9's (which can of course itself handle multiple). How much could you really leverage this if your LPDDR2 is purely serial? How could you really hope to hide access latency this way?

Still don't believe me? There are obviously people here who know more about the memory technology than I do, why not ask them?
 
Y'all seem to know more about this than me, but Wikipedia says this:

Wiki wiki wild wild pedia said:
Pipelining means that the chip can accept a new command before it has finished processing the previous one. In a pipelined write, the write command can be immediately followed by another command, without waiting for the data to be written to the memory array. In a pipelined read, the requested data appears after a fixed number of clock cycles after the read command (latency), clock cycles during which additional commands can be sent. (This delay is called the latency and is an important performance parameter to consider when purchasing SDRAM for a computer.)

This sounds like a dram chip can have several reads or writes in flight at once ("concurrent access"?), much like a CPU pipeline can have several instructions in flight at once along the pipeline ("concurrent execution"?). Is this right, or am I being a techno-pug and derping up the wrong tree?

Edit: and I guess if your CPU (running its memory bench, for example) was unable to fully saturate all stages in the pipeline, then something else using the same bus might be able to slip in access requests that could effectively increase the total amount of data being readed or writted over a period of time ... ?
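Here's that same idea as a toy cycle count, matching the Wikipedia description: reads to an already-open row can be issued while earlier ones are still waiting out their CAS latency. Cycle numbers are illustrative (roughly DDR3-1600 CL11).

```python
# Toy timeline of pipelined vs. unpipelined reads to an open row.
# CL and burst length are ballpark DDR3 figures, for illustration only.

CL = 11           # CAS latency, memory-clock cycles
BURST_CYCLES = 4  # a burst of 8 transfers occupies 4 memory-clock cycles

reads = 16

unpipelined = reads * (CL + BURST_CYCLES)  # wait out CL for every single read
pipelined   = CL + reads * BURST_CYCLES    # pay CL once, then stream bursts

print(unpipelined, pipelined)  # 240 vs 75 cycles for 16 back-to-back reads
```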
 
Y'all seem to know more about this than me, but Wikipedia says this:



This sounds like a dram chip can have several reads or writes in flight at once ("concurrent access"?), much like a CPU pipeline can have several instructions in flight at once along the pipeline ("concurrent execution"?). Is this right, or am I being a techno-pug and derping up the wrong tree?

Edit: and I guess if your CPU (running its memory bench, for example) was unable to fully saturate all stages in the pipeline, then something else using the same bus might be able to slip in access requests that could effectively increase the total amount of data being readed or writted over a period of time ... ?

Well that certainly would seem to align with what I'm claiming, but I'll welcome whoever can show this is wrong :p
 