Nintendo 3DS hardware thread

Linear transform is n^2, matrix multiply is n^3, operations like vertex shading are very data-linear and easy to prefetch for, and therefore to hide the latency of (the PSP does have prefetch)... and you think that memory latency is making it impossible to realize the arithmetic capabilities of the VFPU?

Not impossible at all; however, on a practical level the memory architecture of the system made it extremely hard to get any useful gains out of the chip..

For example, one of the guys here spent a great deal of time trying to get our animation blending running on the VFPU to free up some room on the GE, but in the end we just couldn't get it fast enough that it wasn't constantly outperformed by the GE at the same task, and it was all down to memory latency...

I guess if you were expecting to do something like physics on it for example (where your only alternative would be to do it on the Allegrex) then I can see how you'd be able to make good use of it, however in our case it just wasn't a win at all...
 
The 3DS appears to have a similar relation to the Wii as the NGP to the PS3 or the PSP to the PS2. The Conduit guys said they got the Wii version running on 3DS looking similar to the Wii version, but it took quite a bit of tweaking. With Epic already having stated that UE3 games can run on both HD consoles and NGP, for all intents and purposes the NGP seems close enough to be able to share the brunt of development costs, namely art assets.
On the other hand, Konami and Capcom said they could use PS3 (quality) assets on the 3DS, and several studios that are not named Epic already brought their HD engines over or are looking into it.
 
Linear transform is n^2, matrix multiply is n^3, operations like vertex shading are very data-linear and easy to prefetch for, and therefore to hide the latency of (the PSP does have prefetch)... and you think that memory latency is making it impossible to realize the arithmetic capabilities of the VFPU?

Those kinds of matrix operations are effectively always bandwidth limited.
(However, latency is a part of the effective bandwidth calculation, and prefetching can't hide latency completely. Maybe that is getting too technical.)

But the bottleneck is still there - the colossal ALU resources of GPUs, and for that matter SIMD units on CPUs, are only applicable to problems that fit very narrowly confined memory access patterns. On most problems I encounter, effective throughput is pretty much always bandwidth limited.

Concerning latency, if we bastardize Amdahl's law a bit, we can state it as "the part of your code that you cannot optimise will be performance limiting" - which, oftentimes in general code, is conditional memory access. And there low latency helps enormously. Something I would like to see generally when architecture is discussed is more attention being paid to data flow/data paths and communication, rather than the almost exclusive focus on ALU resources which is common. Sony's PSP2 info is a great example - we know the number and kind of CPU cores, and the number and kind of GPU cores. But cache hierarchy, sizes, latencies and bandwidth? Even the main memory properties aren't spelled out, never mind the less obvious such as CPU <-> GPU communication, internal GPU caching, or how coherency/interlocking issues are dealt with, or...
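
To put illustrative numbers on that bastardized version (a minimal C sketch; the 20% fraction is made up purely for the example):

/* Amdahl's law: if a fraction f of the runtime cannot be sped up
 * (e.g. dependent, conditional memory accesses), the overall speedup
 * is capped at 1/f no matter how fast the rest becomes. */
double amdahl_speedup(double unoptimisable_fraction, double speedup_of_rest)
{
    return 1.0 / (unoptimisable_fraction +
                  (1.0 - unoptimisable_fraction) / speedup_of_rest);
}
/* amdahl_speedup(0.20, 1000.0) ~= 4.98 - with 20% latency-bound code,
 * you never get past ~5x even if everything else becomes 1000x faster. */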
 
On the other hand, Konami and Capcom said they could use PS3 (quality) assets on the 3DS, and several studios that are not named Epic already brought their HD engines over or are looking into it.

Who knows. The display itself is far from HD though. From the impressions I'm reading on GAF, if you're looking at the game in 2D, you're just getting half the resolution (e.g. 400x240). I guess that perhaps there is a mode where you can use 800x240 and not support 3D, but from the looks of it, for all games that support both 2D and 3D you'll get 400x240 in 2D mode. This also fits the impressions from someone on GAF who brought his PSP with Ridge Racer and found that the 3DS version actually looked a little lower res - the PSP is 480x272.

Also of note from people visiting events seems to be that games have 64MB available to them - when a 3DS crashed it showed 64MB of memory available, which seems to confirm earlier suggestions that the system has 64MB for games and probably 32MB for the OS.

Of course, in this case it's just a port of a PSP game, and probably little effort has been made to make good use of the 3DS capabilities, and I am certain the 3DS can do better than that, but I do question how much better with this combination of memory and resolution constraints. Certainly, saying you can port 'HD' stuff almost seems moot. Perhaps you can get a similar level of shading quality, but you'll definitely need to downscale your graphics and textures both for the smaller screen and for the smaller available memory.

(I'm a big 3D fan, so I'm more interested than usual)

EDIT: to further support games on the 3DS sticking to 400x240 in 2D, there are comments like the one from the Dead or Alive developers saying that their game will run at 30fps in 3D or 60fps in 2D.
 
The 3DS is obviously nowhere near as powerful as a PSP2, but neither comes anywhere close to current generation home consoles, so there's always some level of downgrading involved either way. Having the same engines running on consoles and handhelds is definitely a plus, though, even if, in my opinion, it makes absolutely no sense to just port games over as-is most of the time, no matter how easy it is. The ability to reuse assets on handhelds is nice, but that's possible on both systems to some degree. It's rather obvious that no game shown on PSP2 actually uses the exact same assets as used on PS3. They're downgraded, just like the Street Fighter and Resident Evil assets are downgraded on 3DS.
 
Nonetheless, the 3DS seems to be a lot closer to the PSP2 than the DS was to the PSP1. At least graphically (we're yet to see what CPU is in the 3DS). It doesn't look like the 3DS is 2-3 generations behind the PSP2, for example. It's still far away, though.
 
Those kinds of matrix operations are effectively always bandwidth limited.
(However, latency is a part of the effective bandwidth calculation, and prefetching can't hide latency completely. Maybe that is getting too technical.)

But the bottleneck is still there - the colossal ALU resources of GPUs, and for that matter SIMD units on CPUs, are only applicable to problems that fit very narrowly confined memory access patterns. On most problems I encounter, effective throughput is pretty much always bandwidth limited.

Matrix multiplication and linear transformations against fixed matrices that stay resident in registers scale with FLOPs, not bandwidth. So asymptotically they're FLOP limited. That's why stuff like Linpack is what everyone turns to when trying to exhaust the theoretical FLOP performance of an architecture. For small vertexes you might hit memory bandwidth first, but I really doubt that is happening on PSP or for that matter most platforms where you'd do this sort of thing. PSP has a 32-bit 166MHz DDR bus to main memory and while the VFPU is crunching ideally it'd be about the only thing accessing it. That's 32 bits per cycle - you would need 3 cycles to get, say, 3 32-bit coordinates and then you spend around 9 FMUL + 6 FADD + 3 FADD for a typical transform.. or about one dot product per cycle. Sounds like it matches pretty well to the VFPU..
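
To make the arithmetic concrete, something like this is what I have in mind - a minimal sketch in plain C rather than VFPU code, with made-up struct and function names, just to show that the matrix stays resident and the only streaming traffic is the vertices themselves:

#include <stddef.h>

typedef struct { float x, y, z; } vec3;

/* Transform n vertices by one fixed 3x3 matrix + translation.
 * Per vertex: 12 bytes read, 12 bytes written,
 * 9 FMUL + 6 FADD (rotation/scale) + 3 FADD (translation) = 18 flops.
 * The matrix m and translation t stay "resident" for the whole loop,
 * so the streaming memory traffic is just the vertex data. */
static void transform(vec3 *dst, const vec3 *src, size_t n,
                      const float m[3][3], vec3 t)
{
    for (size_t i = 0; i < n; i++) {
        vec3 v = src[i];
        dst[i].x = m[0][0]*v.x + m[0][1]*v.y + m[0][2]*v.z + t.x;
        dst[i].y = m[1][0]*v.x + m[1][1]*v.y + m[1][2]*v.z + t.y;
        dst[i].z = m[2][0]*v.x + m[2][1]*v.y + m[2][2]*v.z + t.z;
    }
}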

How is latency part of the effective bandwidth calculation? How can prefetching not ever hide latency completely? Do you mean the latency from L1 cache, which is usually non-blocking?

Concerning latency, if we bastardize Amdahl's law a bit, we can state it as "the part of your code that you cannot optimise will be performance limiting" - which, oftentimes in general code, is conditional memory access. And there low latency helps enormously. Something I would like to see generally when architecture is discussed is more attention being paid to data flow/data paths and communication, rather than the almost exclusive focus on ALU resources which is common. Sony's PSP2 info is a great example - we know the number and kind of CPU cores, and the number and kind of GPU cores. But cache hierarchy, sizes, latencies and bandwidth? Even the main memory properties aren't spelled out, never mind the less obvious such as CPU <-> GPU communication, internal GPU caching, or how coherency/interlocking issues are dealt with, or...

The problem with applying Amdahl's law in this situation is that it assumes that everything can otherwise be parallelized. People always figure there's "a bottleneck", but if you can't run two things at the same time it always helps to make one of those things faster. Even if you can't make the other thing any faster.

However I agree, memory latencies aren't talked about a lot and do have a great impact on a lot of operations. PSP has surprisingly bad memory latency to main memory (and only somewhat better to eDRAM from the CPU), OMAP3530 has surprisingly bad latency to main memory and much worse than advertised (by ARM, for Cortex-A8) latency to L2 cache, Cortex-A9 caches likely have worse bandwidth in addition to potentially worse latency than A8 ones, the list goes on.. Unfortunately both bandwidth and low latency consume more power. Waking up DRAMs that were sleeping takes time.

For PSP2 I think we just have to wait longer, the thing isn't even coming out for nearly a year. We'll probably see some convention slides from Sony and third parties alike that go into more detail, like we have for previous Sony platforms. What we do know is the maximum amount of CPU L1 cache (it'll probably be 32KB I/D), some ideas of L2 cache if they use the standard ARM offerings (they likely will and I imagine it'll be 1MB or 2MB), and the associativity of the caches.

Nonetheless, the 3DS seems to be a lot closer to the PSP2 than the DS was to the PSP1. At least graphically (we're yet to see what CPU is in the 3DS). It doesn't look like the 3DS is 2-3 generations behind the PSP2, for example. It's still far away, though.

I wonder. How do you determine this stuff? DS was sort of N64 level, PSP was below PS2 level.. sounds about like one generation off to me. Let's say 3DS is about Gamecube level and PSP2 is to PS3 kind of like what PSP was to PS2.

The take-away from this might be that subjective image quality evaluation doesn't scale like performance does, so DS to 3DS looks like a huge leap because DS was sitting too far below our expectation thresholds.
 
Nonetheless, the 3DS seems to be a lot closer to the PSP2 than the DS was to the PSP1. At least graphically (we're yet to see what CPU is in the 3DS). It doesn't look like the 3DS is 2-3 generations behind the PSP2, for example. It's still far away, though.

In terms of raw power, there may be a gap similar to how the DS was to the PSP. The difference is that Nintendo cleverly chose the PICA200 for its GPU, which features fixed shaders that allow developers to use many of the modern effects seen in console games without the 3DS needing a higher-performance CPU. Its hardware architecture doesn't appear to be too far removed from the NGP and other mobile phones either, so porting down doesn't seem too difficult. In comparison, the way the DS rendered polygons was very different from normal, and its cap of 2048 polygons per scene made it a problem to make any 3D game presentable in this day and age.
 
In terms of raw power, there may be a gap similar to how the DS was to the PSP. The difference is that Nintendo cleverly chose the PICA200 for its GPU, which features fixed shaders that allow developers to use many of the modern effects seen in console games without the 3DS needing a higher-performance CPU. Its hardware architecture doesn't appear to be too far removed from the NGP and other mobile phones either, so porting down doesn't seem too difficult. In comparison, the way the DS rendered polygons was very different from normal, and its cap of 2048 polygons per scene made it a problem to make any 3D game presentable in this day and age.

I don't think I follow you..

Having fixed function pixel shaders doesn't lessen CPU workload. Pixel shading never involves the CPU. Hardware T&L removes a potential need for CPU involvement, but most 3D game consoles have had hardware T&L. Having programmable T&L (vertex shading) means you won't have to use the CPU to achieve the same results, but a lot of 3D consoles already had vector coprocessors for this.

Fixed function pixel shading is definitely a limiting factor for ports and certainly makes it far removed in some sense. If you want portability you have to target 3DS style shaders and make equivalent shader programs for platforms like NGP. How big of a deal this is depends on just what the fixed function shaders are and how much they fit what developers want to do. More complex shader techniques like deferred shading certainly won't work on 3DS (although I doubt you'd want to do such a thing on NGP either).

I agree that the DS was probably the most far out there of any remotely 3D arch released since PS1.. it had to have been frustrating not being able to (very easily anyway) scale geometry for a lower frame rate. You did always get a nice 60Hz, at least.
 
The RE4 Progenitor Virus trailer running on GC hardware looks better than anything coming out for 3DS. The Wii has more memory and a faster CPU than GC. From what I've seen, 3DS has lower real-world polygon pushing power, which makes the graphics kinda ugly despite fancy lighting.

You're going to judge 3DS vs GC based on a trailer for a demo that never turned into a game?..

We don't know that Wii has more memory and we don't know how the CPUs compare. But I think it's pretty likely that Wii is more powerful in raw terms; however, it seems that the 3DS's better effects mean it can produce better looking graphics. Not many seem to share your view about polygon counts being everything.
 
Matrix multiplication and linear transformations against fixed matrices that stay resident in registers scale with FLOPs, not bandwidth.
Well, yes, but read what you wrote again: "fixed matrices that stay resident in registers". As I implied, I'm not doing graphics programming, so that case just doesn't happen. Basically, we tend to get limited by the number of operands we can read and results we can write per unit time rather than the FLOPs, partly of course because FLOPs tend to be very high, and partly because our data-sets aren't small.
So asymptotically they're FLOP limited. That's why stuff like Linpack is what everyone turns to when trying to exhaust the theoretical FLOP performance of an architecture.
Well, yes, but as you say - theoretical.
Actually, the way I've used Linpack is to gradually let the data set grow from very small to very large, in order to get my own first-order real data on a memory subsystem.
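
Something along these lines - a rough sketch, a naive triad rather than actual Linpack, with arbitrary buffer sizes and a crude timer, just to show the idea of sweeping the working-set size:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sweep the working set from "fits in L1" to "spills to main memory" and
 * report the effective FLOP rate; the knees in the curve expose the cache
 * hierarchy and the memory subsystem behind it. */
int main(void)
{
    for (size_t n = 1 << 10; n <= 1 << 24; n <<= 1) {
        float *a = malloc(n * sizeof *a);
        float *b = malloc(n * sizeof *b);
        for (size_t i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        const int reps = (int)((1u << 26) / n);   /* keep total work roughly constant */
        clock_t t0 = clock();
        for (int r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                a[i] = a[i] * 1.0001f + b[i];      /* 2 flops per element */
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        volatile float sink = a[n - 1];            /* keep the loop from being optimised away */
        (void)sink;
        printf("%10zu floats: %.1f MFLOP/s\n", n, 2.0 * n * reps / secs / 1e6);
        free(a);
        free(b);
    }
    return 0;
}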

How is latency part of the effective bandwidth calculation? How can prefetching not ever hide latency completely? Do you mean the latency from L1 cache, which is usually non-blocking?

Prefetching hides latency insofar as you are not up against the actual bandwidth limits of the subsystem (and assuming it works optimally). But it doesn't get faster than continuous data reading, and memory systems are not capable of continuous single-cycle bursting.
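
To illustrate what I mean, a minimal sketch (plain C with GCC's __builtin_prefetch; the prefetch distance is an arbitrary placeholder): the prefetch only overlaps future miss latency with the work you are doing now, so once demand traffic already saturates the bus there is nothing left to overlap it with.

#include <stddef.h>

/* Software prefetch: request data a fixed distance ahead of use so the
 * miss latency overlaps with the summation work. If the loop is already
 * bandwidth bound, the prefetches simply queue behind demand traffic. */
float sum_with_prefetch(const float *data, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&data[i + 128]);   /* ~8 cache lines ahead; a tuning knob */
        sum += data[i];
    }
    return sum;
}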

The problem with applying Amdahl's law in this situation is that it assumes that everything can otherwise be parallelized. People always figure there's "a bottleneck", but if you can't run two things at the same time it always helps to make one of those things faster. Even if you can't make the other thing any faster.
I went too far in my bastardization. :) Basically, what I wanted to say was that low latency main memory, even though it doesn't do a lot for pure data streaming, is pretty damn handy to have for pretty much any general purpose computation, and in a system like the 3DS, where the CPU is clocked so low relative to the main memory response time, you get a situation similar to running everything out of cache. I'm involved in scientific computing, and for the codes we want to run that are performance critical, managing memory access patterns is where it's at. Not having to deal with a complex hierarchy of memory speeds and sizes, and a bunch of threads that love to step on each other's toes in the cache and fight for the same resources, makes a pool of fast memory seem wonderful in its simplicity. And not having uncached reads kill performance seems very liberating in terms of algorithm choices.

For PSP2 I think we just have to wait longer, the thing isn't even coming out for nearly a year. We'll probably see some convention slides from Sony and third parties alike that go into more detail, like we have for previous Sony platforms. What we do know is the maximum amount of CPU L1 cache (it'll probably be 32KB I/D), some ideas of L2 cache if they use the standard ARM offerings (they likely will and I imagine it'll be 1MB or 2MB), and the associativity of the caches.
For the people who will work with the systems a lot of the data will presumably be available. Will it ever make it out to the general public, past a marketing department that wants to manage information flow? That's not a given. The number of "cores" or whatever is the new "MHz", that is, the new figure of merit that you try to impress potential customers with. Beyond anything that marketing believes helps sales, someone like me is likely to have no other source than unconfirmed rumors or speculation. Which is aggravating if you're actually curious about something, and I can't help feeling that it generally sells the accomplishments of engineers short to have information about their work confined to a small circle. Nintendo never made even rudimentary Wii specs public, and I doubt we'll see much solid data on the 3DS out of them. A pity.
 
Exophase said:
Sounds like it matches pretty well to the VFPU.
Under a hypothetical scenario where CPU is 100% alone on the bus.

How is latency part of the effective bandwidth calculation? How can prefetching not ever hide latency completely?
Access latency IIRC was about double that of PS2 in CPU cycles (on lower MHz to boot), making it non-trivial to hide. But the real problem is how often you get ahold of the bus in the first place - in real-world scenarios, that is.
As an example, at equivalent clock, VU0 in macro mode handily outperforms VFPU, even though it's got a considerably less flexible ISA, their 'FMACs' are roughly equivalent and master CPU has half the DCache.
Micro mode can go up to double that (if you spend the time to really hide memory accesses) - with the same Flops rating, that's the lost potential of VFPU right there.
 
Under a hypothetical scenario where CPU is 100% alone on the bus.

This isn't that strange of a situation. The VFPU isn't running in parallel to the CPU, it's a coprocessor extension, so it wouldn't be fighting with the CPU. And in a typical scenario the GPU should be staying local to its eDRAM. The Media Engine might take cycles but it has eDRAM too, and I really don't know what the code running on it is like.

Access latency IIRC was about double that of PS2 in CPU cycles (on lower MHz to boot), making it non-trivial to hide. But the real problem is how often you get ahold of the bus in the first place - in real-world scenarios, that is.

333MHz is < ~300MHz..? I agree it's non-trivial to hide, but I'm pretty sure there is enough prefetch capability to do it. Even if it can't in this particular case, you have to understand the comment that was made - that prefetching can NEVER completely hide latency. I don't see how this is the case.

If anything I would say the reverse comment is what holds; that bandwidth is part of latency. In some real world cases this is actually relevant, like sending data over a particularly slow serial line - the first bit might come a lot more quickly than the time between the first and last bit of whatever can be considered large enough to take off the shift register. But in most interesting cases these days the latency is much larger than the per-request transfer time implied by the bandwidth, yet it doesn't actually impact the bandwidth so long as you can keep issuing non-blocking requests ahead of the latency.
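
To put toy numbers on the serial-line case (the baud rate and wire delay below are made up purely for illustration):

#include <stdio.h>

/* Time until the LAST byte of a message arrives = wire latency + size/bandwidth.
 * For small messages the wire latency dominates; for large ones the bandwidth
 * term does, which is the sense in which bandwidth is "part of latency". */
int main(void)
{
    const double wire_latency_s = 0.001;   /* assumed propagation/setup delay */
    const double bytes_per_s    = 960.0;   /* 9600 baud, 10 bits per byte (8N1) */
    for (int size = 1; size <= 4096; size *= 16)
        printf("%5d bytes: last byte after %.3f s\n",
               size, wire_latency_s + size / bytes_per_s);
    return 0;
}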

As an example, at equivalent clock, VU0 in macro mode handily outperforms VFPU, even though it's got a considerably less flexible ISA, their 'FMACs' are roughly equivalent and master CPU has half the DCache.
Micro mode can go up to double that (if you spend the time to really hide memory accesses) - with the same Flops rating, that's the lost potential of VFPU right there.

Look, you're completely missing the point. This isn't about whether or not memory latency has a negative effect on the VFPU's typical realizable throughput. This is about a statement being made that the VFPU's ALU capabilities are rendered pointless due to memory latency always being a bottleneck, which is simply not the case. I'm quite certain that, even if only for burst periods, you will be utilizing the VFPU's FLOP rate, and therefore if the FMACs were, say, at half the capability, you would be getting work done more slowly.
 
Look, you're completely missing the point. This isn't about whether or not memory latency has a negative effect on the VFPU's typical realizable throughput. This is about a statement being made that the VFPU's ALU capabilities are rendered pointless due to memory latency always being a bottleneck, which is simply not the case.

I looked back at the thread, and I couldn't see any such statement.
Only people giving personal testimony that they had experienced real issues.
Of course there will be cases where the VFPU's ALU capabilities can be well utilized.
 
I looked back at the thread, and I couldn't see any such statement.
Only people giving personal testimony that they had experienced real issues.
Of course there will be cases where the VFPU's ALU capabilities can be well utilized.

"But you loose all the benefits of that power when you realise you're getting raped on load/store latencies & can't get data in/out of the VFPU fast enough to process it.."

Does this help you? Or do you have a different interpretation of what "loose all benefits" means?
 
Does this help you? Or do you have a different interpretation of what "loose all benefits" means?
"Loose all benefits" means to "free all benefits" which would seem to be the opposite of what you are claiming :rolleyes:

I also hope that the person you quoted actually meant to write "rapped", as in "rapped on the knuckles".
 
"But you loose all the benefits of that power when you realise you're getting raped on load/store latencies & can't get data in/out of the VFPU fast enough to process it.."

Does this help you? Or do you have a different interpretation of what "loose all benefits" means?
Exophase, from my position of an observer, I think this argument went into some diametrical-opposition mode some time ago. The way I read it, archangelmorph tried something which everybody would have been tempted to try on the unit, namely morphing, and saw no benefits from the attempt, which he attributed to mem access issues. From failing such a poster-child use-case, he became of the mindset that latencies and/or bandwidth issues severely hamper the unit. Is he absolutely correct in his wording, i.e. 'lose all benefits'? I'd say no, as surely there would be use cases where all that ALU power could come in handy. But at the same time, reading beyond the verbatim meaning of his first post, you'd see he means certain use cases, namely vertex streaming for the sake of transformations (aka 'vertex shading'). Now, that shading could be lighter or heavier in terms of ALU - it all depends on the sort of transformation, but I'd say morphing should be a fairly ALU-intensive case (subject to all bone matrices being pre-loaded in the VFPU, of course), and it's easy to imagine much lighter shading scenarios where that ALU power could indeed be wasted.

You say the bus should be able to feed the VFPU's flops just right bw-wise, to which Fafalada responded that you don't get the bus to yourself. And indeed, the GE is sitting on the same bus, and it could be doing host-mem reads over it in full force (i.e. for non-local tex buffers and/or vertex/index buffers), so I'm quite skeptical about these 'VFPU being alone on the bus' scenarios you refer to. Now, I've never done my own mem access tests on the PSP, but from what I've heard from people working on the platform, the memory subsystem is reportedly really bad, so I could imagine situations where a unit would hypothetically be fed 'just right' by the bus turning out to be non-viable in practice. Remember we are talking about multiple units streaming data, so it's not a matter of isolated burst accesses on either side.

Otherwise yes, I'm a fan of the VFPU ISA myself - it was one of the first ISAs to successfully tackle long-standing SIMD issues like swizzling and transposing.
 
Otherwise yes, I'm a fan of the VFPU ISA myself - it was one of the first ISAs to successfully tackle long-standing SIMD issues like swizzling and transposing.

AltiVec (vec_merge and full vector permute operations for instance) is another well featured early SIMD ISA.
 
I'm noticing more than one PSP port in the 3DS lineup. We probably shouldn't forget that to do a PSP port in 3D, the 3DS still needs to be about twice as powerful. I also have no doubt that the 3DS has better pixel shading capabilities than the PSP, and it clearly has more RAM available.

As others say, the 3DS is quite a step up from DS in terms of 3D rendering, no question about that! But discussing the 3DS in terms of HD is almost pointless - the 3D screen has a 400x240 resolution. That barely holds up to the previous generation's SD resolutions. Combined with the limited memory, I'm going to go out on a limb here and state that the 3DS will only ever really be able to match Wii graphics (and I'm willing to bet many games will be released that look worse).

Which may well be plenty good enough, mind you, as for most of the target audience, that is a) a big enough step up from the DS, and b) pretty much 'console graphics' on a handheld, as most of them also have a Wii. I'm not being sarcastic here, by the way.

There is also nothing in currently available videos of the actual games to be released that suggests otherwise.
 
I'm noticing more than one PSP port in the 3DS lineup. We probably shouldn't forget that to do a PSP port in 3D, the 3DS still needs to be about twice as powerful. I also have no doubt that the 3DS has better pixel shading capabilities than the PSP, and it clearly has more RAM available.

As others say, the 3DS is quite a step up from DS in terms of 3D rendering, no question about that! But discussing the 3DS in terms of HD is almost pointless - the 3D screen has a 400x240 resolution. That barely holds up to the previous generation's SD resolutions. Combined with the limited memory, I'm going to go out on a limb here and state that the 3DS will only ever really be able to match Wii graphics (and I'm willing to bet many games will be released that look worse).

Which may well be plenty good enough, mind you, as for most of the target audience, that is a) a big enough step up from the DS, and b) pretty much 'console graphics' on a handheld, as most of them also have a Wii. I'm not being sarcastic here, by the way.

There is also nothing in currently available videos of the actual games to be released that suggests otherwise.

Worse than what? The worst looking Wii game? :D

IMO the best looking launch games on 3DS have already at least matched the best on Wii.
 