How are things on the homebrew front for DS?

Guden Oden said:
Then there's the SuperFX thingy as well. Not sure how well that would run; maybe Nintendo would have to source-port FX games to use the 3D hardware of the DS instead of rasterizing in software through the FX. The DS doesn't have THAT much CPU muscle, after all.
That should not be that big of a difficulty - Nintendo did quite successfully port the SuperFX2 equipped Yoshi's Island onto the GBA.
 
akira888 said:
That should not be that big of a difficulty - Nintendo did quite successfully port the SuperFX2 equipped Yoshi's Island onto the GBA.

1. A port is quite different from an emulation. The GBA offers roughly the total power of the SNES core system (better or worse depending on what is being done), and things were cut out of the Yoshi's Island port. I'm quite surprised any SNES emulation is possible on the GBA at all. That said, I believe the GBA's polygon-pushing ability is about on par with a SNES with the FX2 chip (which implies it's a lot more powerful), yet its sound and 2D special effects fall slightly short of the SNES.
 
The ARM CPU of the GBA is much, much faster than the 16-bit-ish modified MOS 6502 variant in the SNES, and not just because of the clock speed difference. I seem to remember reading that the SuperFX was essentially just a faster-clocked 6502 too, or at least something very similar, though it was probably 16-bit rather than 8-bit like the original. Thus it's not that surprising that the GBA might have roughly the same CPU oomph as the SNES and FX2 chips combined. The GBA will probably perform a bit better, actually, as the FX had to work across the cartridge slot interface, and that connection was probably not ideal from a performance standpoint.

Unfortunately, sound on the GBA was much, much worse than on the SNES, and fully emulating the SNES's sound CPU/DSP combo is probably rather costly in terms of processing power, at least for a portable device... Perhaps shortcuts can be taken, like cutting down on special effects such as reverb, or lowering the sample rate.
 
Guden Oden said:
The ARM CPU of the GBA is much, much faster than the 16-bit-ish modified MOS 6502 variant in the SNES, and not just because of the clock speed difference. I seem to remember reading that the SuperFX was essentially just a faster-clocked 6502 too, or at least something very similar, though it was probably 16-bit rather than 8-bit like the original. Thus it's not that surprising that the GBA might have roughly the same CPU oomph as the SNES and FX2 chips combined. The GBA will probably perform a bit better, actually, as the FX had to work across the cartridge slot interface, and that connection was probably not ideal from a performance standpoint.

Unfortunately, sound on the GBA was much, much worse than on the SNES, and fully emulating the SNES's sound CPU/DSP combo is probably rather costly in terms of processing power, at least for a portable device... Perhaps shortcuts can be taken, like cutting down on special effects such as reverb, or lowering the sample rate.

The Donkey Kong Country ports had downgraded graphics and sound, though maybe the graphics were only downgraded to make some attempt at decent sound.

I thought the FX chip was like a math coprocessor, similar to the math coprocessor on a 486. I remember the listed polygon specs for the FX2 chip being similar to the 32X's, which are also similar to the GBA's (and the GBA does seem capable of about 32X-quality 3D).
 
Fox5 said:
The Donkey Kong Country ports had downgraded graphics and sound, though maybe the graphics were only downgraded to make some attempt at decent sound.
Not really sure what technical reason the GBA would have for a graphics downgrade, considering it has a much faster CPU and ALL the special-effects hardware of the SNES and then some (it has TWO Mode 7-style playfields, not just one).

I thought the FX chip was like a math coprocessor, similar to the math coprocessor on a 486.
From what I read, it's more like a separate CPU. It has terrible 3D performance, by the way; did you see DOOM for the SNES? Awful! And that was on the much higher-clocked FX2 chip, too.
 
Guden Oden said:
What is the DS's LCD rez? Almost all SNES games ran at 256*240, so this might possibly be a slight issue...
256x192. Equal for both screens.
 
AndY1 said:
If I remember correctly, Sinclair's ZX Spectrum 48k had the same resolution.
When you multiply it out, you'll notice that 256x192 bytes is exactly 48KiB. Coincidence? :D
It's kinda neat because you can just bit-shift your y-coordinate by 8/9/10 (for 8bpp/16bpp/32bpp respectively), add the framebuffer base pointer and voilà, you have the memory location of the line.
That's probably a lesson learned from the GBA. The ARM7TDMI (main CPU in the GBA, "secondary" CPU in the DS) has very strong multiplication support; it's basically free if one of the operands is <=255. So when they designed the GBA, they probably figured they didn't need to rely on shifts if they could have proper multiplication at the same performance.

The GBA ended up with a 240x160 framebuffer, which isn't totally shift-friendly. Multiplying by 240 can be done with two shifts and a subtraction or, obviously, with a single multiplication.
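To make the arithmetic concrete, here is a small C sketch (the function names are invented for illustration, and a flat byte-addressed framebuffer is assumed). With 256-pixel lines, the byte offset of line y is just y shifted left by 8, 9 or 10 for 8, 16 or 32 bpp; with 240-pixel lines, 240*y = 256*y - 16*y, i.e. two shifts and a subtraction. And 256x192 bytes = 49152 = 48*1024, hence the 48KiB.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of line y in a 256-pixel-wide framebuffer.
 * log2_bytes_per_pixel is 0, 1 or 2 (8, 16 or 32 bpp), so the total
 * shift is 8, 9 or 10 bits (256 = 1 << 8). */
static uint32_t line_offset_256(uint32_t y, unsigned log2_bytes_per_pixel)
{
    assert(log2_bytes_per_pixel <= 2);
    return y << (8 + log2_bytes_per_pixel);
}

/* 240*y with two shifts and a subtraction: 240 = 256 - 16. */
static uint32_t mul240_shifts(uint32_t y)
{
    return (y << 8) - (y << 4);
}

/* The same product as a single multiplication. */
static uint32_t mul240_mul(uint32_t y)
{
    return y * 240u;
}
```

For example, the last line of a 16bpp 256x192 screen starts at `line_offset_256(191, 1)`, which is 191*512 = 97792 bytes into the framebuffer.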

However... the same ARM7TDMI doesn't just have strong multiplication. It also has crazily strong bit-shifting, so that kind of trade-off argument doesn't really hold. You'll seldom need the bit-shifts as separate, stand-alone instructions; you can just fold them into whatever other operations you need to do anyway (here: adding the framebuffer base address, and adding x to width*y).

*cough*
ARM assembly code snippets for writing a 16-bit color to a framebuffer at an arbitrary x/y coordinate.
Code:
@ common assumptions about register contents
@ r0: color to write
@ r1: x, r2: y
@ r3: pointer to start of first line in framebuffer

@ line width:=240 pixels=480 bytes
@ using shifts
rsb r2,r2,r2,lsl #4     @ r2:=y*16-y = 15*y
add r1,r1,r2,lsl #4     @ r1:=x+16*15*y = x+240*y
add r1,r3,r1,lsl #1     @ r1:=fb+2*r1 = fb+2*x+480*y
strh r0,[r1]            @ STRH has no shifted-register offset form on these cores, hence the add above

@ line width:=240 pixels=480 bytes
@ using multiply-accumulate
mov r4,#240             @ ARM7TDMI can't MUL/MLA by an immediate
mla r1,r2,r4,r1         @ r1:=x+240*y
add r1,r3,r1,lsl #1     @ r1:=fb+2*r1 = fb+2*x+480*y
strh r0,[r1]            @ store the halfword at the computed address
Even though the thing really has a strong multiplier, the MLA snippet takes two cycles longer than the shift snippet and clobbers another register, which can be a PITA.
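The "two cycles longer" can be checked against the ARM7TDMI's instruction timings, as I understand them from ARM's technical reference manual: an ALU instruction with an immediate shift costs one S (sequential) cycle, while MLA costs 1S + (m+1) I (internal) cycles, where m runs from 1 to 4 depending on how many of the top bytes of the multiplier operand Rs are all zeros or all ones. That early termination is also why multiplying by a value of 255 or less is so cheap. A rough C model, counting only the instructions in which the two snippets differ (rsb+add versus mov+mla; the address/store tail is identical), ignoring wait states, and with made-up function names:

```c
#include <assert.h>
#include <stdint.h>

/* ARM7TDMI MUL/MLA early termination: m depends on the multiplier operand Rs
 * and is smaller when the upper bits of Rs are all 0s or all 1s. */
static unsigned mul_m(uint32_t rs)
{
    if ((rs >> 8) == 0 || (rs >> 8) == 0x00FFFFFFu)   return 1;
    if ((rs >> 16) == 0 || (rs >> 16) == 0x0000FFFFu) return 2;
    if ((rs >> 24) == 0 || (rs >> 24) == 0x000000FFu) return 3;
    return 4;
}

/* Costs with every S and I cycle counted as one cycle (no wait states). */
static unsigned dp_cycles(void)         { return 1; }                  /* ALU op with immediate shift: 1S */
static unsigned mla_cycles(uint32_t rs) { return 1 + mul_m(rs) + 1; }  /* MLA: 1S + (m+1)I */

/* Only the instructions that differ between the two snippets. */
static unsigned shift_version_cycles(void) { return dp_cycles() + dp_cycles(); }       /* rsb + add */
static unsigned mla_version_cycles(void)   { return dp_cycles() + mla_cycles(240u); }  /* mov + mla */
```

With Rs = 240, m is 1, so MLA costs 3 cycles and the mov+mla pair costs 4 against 2 for rsb+add.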

And now for a line width of 256 pixels=512 bytes.
Code:
add r1,r1,r2,lsl #8     @ r1:=x+y*256
add r1,r3,r1,lsl #1     @ r1:=fb+2*r1 = fb+2*x+512*y
strh r0,[r1]            @ done
It really doesn't get any more optimal than this, at least not until some CPU maker decides that store address calculations should take three operands with separate bit-shifts.
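For anyone who doesn't read ARM, here is a rough C rendering of the three snippets. It's a sketch: the function names are made up, `fb` is assumed to point at 16-bit pixels (so C's array indexing supplies the *2 byte scaling that the assembly does explicitly), and the comments map each line back to the ARM instruction it stands for.

```c
#include <assert.h>
#include <stdint.h>

/* fb[i] is the halfword at byte address fb + 2*i. */

/* 240-pixel lines, shift version */
static void put_pixel_240_shifts(uint16_t *fb, uint32_t x, uint32_t y, uint16_t color)
{
    uint32_t r2 = (y << 4) - y;    /* rsb r2,r2,r2,lsl #4 : 15*y      */
    uint32_t r1 = x + (r2 << 4);   /* add r1,r1,r2,lsl #4 : x + 240*y */
    fb[r1] = color;                /* 16-bit store at fb + 2*r1       */
}

/* 240-pixel lines, multiply-accumulate version */
static void put_pixel_240_mla(uint16_t *fb, uint32_t x, uint32_t y, uint16_t color)
{
    uint32_t r4 = 240;             /* mov r4,#240                     */
    uint32_t r1 = y * r4 + x;      /* mla r1,r2,r4,r1 : x + 240*y     */
    fb[r1] = color;
}

/* 256-pixel lines: a single shifted add does the whole index */
static void put_pixel_256(uint16_t *fb, uint32_t x, uint32_t y, uint16_t color)
{
    uint32_t r1 = x + (y << 8);    /* add r1,r1,r2,lsl #8 : x + 256*y */
    fb[r1] = color;
}
```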

Would someone please hand me a cookie? :)
 
Impressive! Well, at least it would have been if I'd been able to interpret ARM assembly opcodes! :LOL:

Perhaps if you did a comparison with some other CPU architecture(s), x86 or MIPS (since the PSP is MIPS-based), we (even those of us who, like me, lack much in the way of coding skillzz) would get a better understanding of what makes ARM's implementation so powerful.
 
Off-topic ftw :D
Guden Oden said:
Impressive! Well, at least it would have been if I'd been able to interpret ARM assembly opcodes!

Perhaps if you did a comparison with some other CPU architecture(s), x86 or MIPS (since the PSP is MIPS-based), we (even those of us who, like me, lack much in the way of coding skillzz) would get a better understanding of what makes ARM's implementation so powerful.
First off, it is very fast per clock, but that's an opportunity ARM had because they didn't aim for very high clocks anyway. I don't know how far it would go, but to give at least some perspective, the particular ARM core we're talking about runs at 33MHz in the DS. You won't see a single-cycle integer multiplication instruction in an architecture that needs to hit GHz clocks. Calling it "so powerful" is okay, but keep the overall design targets in mind when you do :)

The opcodes are fairly straightforward if you've ever seen assembly listings.
ARM opcodes follow the usual "RISC" conventions. Each opcode has a destination register and source registers. For an addition you have this basic form:
add result_register,source_register1,source_register2

Registers are named r0,r1 ... to r15. So this is an actual ARM instruction:
add r1,r2,r3
This will add the integers currently in r2 and r3 and place the result in r1. It's standard procedure for most "RISC" architectures, including MIPS, I believe.


Now for the bit-shifting: they apparently found that they had so many free bits in the instruction encoding (each ARM instruction is 32 bits wide) that they started spending them on operand modifiers. Almost every ARM instruction allows an extended form of the second operand instead of just the contents of a register. You can use:
a) The contents of a register shifted left or right, or rotated, by a 5-bit immediate value (that is, a value embedded in a field of the instruction itself).
b) The same, but by an amount taken from another register (aka "register-controlled shift").
c) The contents of a register rotated right by one bit through the carry flag: the old carry goes into the top bit, and the bit "rotated out" can be placed into the carry flag...
d) An eight-bit immediate value rotated right by an even number of bits (e.g. you can express 0xF000000F as 0xFF rotated right by four bits, so it's a valid "eight-bit" immediate value).
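Item d) is what decides which constants fit directly into an instruction. A small C check, assuming the standard ARM rule of an 8-bit value rotated right by an even amount (0 to 30 bits); the helper names are invented for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n taken modulo 32). */
static uint32_t rotl32(uint32_t v, unsigned n)
{
    n &= 31;
    return n ? (v << n) | (v >> (32 - n)) : v;
}

/* A data-processing immediate is imm8 rotated RIGHT by 0, 2, ..., 30 bits.
 * So value is encodable if rotating it LEFT by one of those amounts
 * leaves something that fits in 8 bits. */
static bool arm_imm_encodable(uint32_t value)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        if (rotl32(value, rot) <= 0xFFu)
            return true;
    }
    return false;
}
```

For example, 240 and 0xF000000F pass, while 0x101 (its set bits are too far apart for an 8-bit window) and 0x1FE (it would need an odd rotation) don't.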

In "add r1,r1,r2,lsl #8", an instruction from the listing I gave in my last post, you have an example of this extended second operand. The instruction adds the contents of r2, shifted left by eight bits, to r1 and places the result back in r1. "lsl" stands for logical shift left.

This can be done in almost any instruction. The exceptions are, curiously, multiply; understandably, multiply-accumulate (it has three source registers, so it needs more bits to encode them); loads and stores, which have their own address-generation rules; and some system stuff (coprocessor interfacing, software interrupts and the like).

It goes so far that they don't actually have a free-standing bit-shift instruction. If you only want to shift a register, you use a "dummy" move instruction with a shifted operand (all archs have a move in some form or another, but it usually only copies one register's contents to another).
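A rough C model of the shifted second operand, limited to the immediate-shift forms from item a) with amounts 1 to 31 (the special encodings for 0 and 32, register-controlled shifts, RRX and the carry-out are left out). `shifted_operand` and `arm_add` are hypothetical names for illustration: `arm_add(r1, r2, SHIFT_LSL, 8)` models `add r1,r1,r2,lsl #8`, and a "dummy move" such as `mov r1,r2,lsl #3` is just `shifted_operand(r2, SHIFT_LSL, 3)`.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { SHIFT_LSL, SHIFT_LSR, SHIFT_ASR, SHIFT_ROR } shift_type;

/* Second operand "rm, <shift> #amount" for amount 1..31 (carry-out ignored). */
static uint32_t shifted_operand(uint32_t rm, shift_type type, unsigned amount)
{
    assert(amount >= 1 && amount <= 31);
    switch (type) {
    case SHIFT_LSL:
        return rm << amount;
    case SHIFT_LSR:
        return rm >> amount;
    case SHIFT_ASR: /* copy the sign bit into the vacated top bits */
        return (rm & 0x80000000u) ? (rm >> amount) | ~(0xFFFFFFFFu >> amount)
                                  : rm >> amount;
    case SHIFT_ROR:
        return (rm >> amount) | (rm << (32 - amount));
    }
    return 0; /* not reached for valid shift types */
}

/* add rd, rn, rm, <shift> #amount  ->  rd = rn + shifted rm */
static uint32_t arm_add(uint32_t rn, uint32_t rm, shift_type type, unsigned amount)
{
    return rn + shifted_operand(rm, type, amount);
}
```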

The competitive "problem" with the multiply instruction, compared to the extreme shift capabilities, is that it doesn't allow an immediate operand. I.e., as seen above, if you want to multiply a number by 240, you can't use this:
mul r1,r2,#240

The number 240 must be moved to a temporary register first.
mov r1,#240
mul r1,r2,r1

=====================

The other instructions in the snippet were rsb and strh.
RSB is reverse subtract, i.e. it subtracts the first source operand from the second, which is useful because only the second source operand may have the modifiers outlined above. The "normal" subtraction is simply called SUB and, of course, works the other way round.
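Put in C terms (hypothetical helper names, 32-bit wraparound as on the CPU), the only difference is the operand order. That matters because only the second operand can be shifted or be an immediate, which is why `rsb r2,r2,r2,lsl #4` yields 15*y in one instruction, and why `rsb r0,r0,#0` is a common way to negate a register:

```c
#include <assert.h>
#include <stdint.h>

/* sub rd, rn, op2  ->  rd = rn - op2 */
static uint32_t arm_sub(uint32_t rn, uint32_t op2) { return rn - op2; }

/* rsb rd, rn, op2  ->  rd = op2 - rn  (op2 is the shiftable/immediate operand) */
static uint32_t arm_rsb(uint32_t rn, uint32_t op2) { return op2 - rn; }
```

With y = 7, `arm_rsb(7, 7u << 4)` gives 112 - 7 = 105 = 15*7. Getting the same result from SUB would need 16*y sitting in a register first, which costs an extra instruction.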
STRH stores a halfword to memory at an address computed as a base register plus an offset, which can be another register or an 8-bit immediate. (Unlike the word and byte stores, the halfword store on these cores can't shift that offset register.) An ARMish "halfword" is 16 bits, because an ARMish word is 32 bits, in contrast to x86 circles, where a word is historically 16 bits and a 32-bit quantity is called a double-word.

If it wasn't obvious, the "at" sign starts comments in (GNU-style) ARM assembly source code, so whatever follows one is my attempt at explaining what's going on and where.

=====================

TBH I don't know much about MIPS. However, I know a bit about x86 :)

The x86 instruction set is very ... relaxing, because they didn't have to care much about how to pack a certain amount of information into a fixed-length instruction encoding. Instructions have variable length, so no worries at all. For one, you can have wide immediates everywhere, and you can totally do this:
imul eax,eax,240
(integer-multiply eax by 240, write result back to eax)
There's no multiply-accumulate instruction among x86's general-purpose integer instructions.

x86 has free-standing bit-shift instructions with immediate or register-controlled shift values.
SHL eax,4
This shifts the value in eax by four bits to the left.

x86 doesn't usually allow separate source and destination registers, i.e. one of the source operands is usually overwritten with the result; this is also true for the shift operation. Multiplication is just about the only case where this rule can be broken.

x86 does allow inline shifts in address generation, as a scaled index (scale factor 1, 2, 4 or 8), but nowhere else.
mov [edi+4*ecx],eax
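The scale is limited to those four values because it's a two-bit field in the SIB byte. A small C sketch of how such an effective address is formed, with a made-up function name and 32-bit wraparound assumed:

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of an operand like [base + index*scale + disp].
 * Only scales of 1, 2, 4 and 8 can be encoded. */
static uint32_t x86_effective_address(uint32_t base, uint32_t index,
                                      unsigned scale, int32_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale + (uint32_t)disp;
}
```

So `[edi+4*ecx]` with edi = 0x1000 and ecx = 3 addresses 0x100C.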

The above exercise would look something like this in x86:
Code:
; common assumptions about register contents:
; ax: color to write
; ebx: x, ecx: y
; edi: pointer to start of first line in framebuffer
; line width:=240 pixels=480 bytes
; using shifts

shl ecx,4                ; ecx:=16*y
mov edx,ecx              ; copy 16*y
shl ecx,4                ; ecx:=256*y
sub ecx,edx              ; ecx:=256*y-16*y = 240*y
add ecx,ebx              ; ecx:=240*y+x
mov word [edi+2*ecx],ax  ; store ax at fb+2*(240*y+x)
Code:
; using multiply
imul ecx,240             ; ecx:=240*y
add ecx,ebx              ; ecx:=240*y+x
mov word [edi+2*ecx],ax  ; store ax at fb+2*(240*y+x)

Done :)
 
Uh, well.... Thank you... Uhm. ...For that, erm, insightful explanation. :LOL:

Sorry, I didn't follow you everywhere, but at least you tried, heh! Thanks for making the effort to educate me. :p Perhaps if I'd had more than 14-15 hours of sleep in total over the last three nights, I'd have been better able to interpret your explanation. Meh, unprovoked shoulder injuries are teh suck. :(
 
I was wondering, has a DS Lite flashrom cart been announced (or even marketed already)? I wouldn't want a GBA cart that sticks out of my nice, pretty new DS (once I friggin get my paws on one, that is :LOL:)...

Also, does anyone know some good DS homebrew websites? Dcemu.co.uk is loaded to the brim with ads and looks kind of ugly/unprofessional. I dunno, is that the best there is? ;)
 
I believe the makers of the M3 are coming out with a new small adapter... but I don't know. I sold my DS with the SuperCard and don't follow it as much as I used to.
 
Is it a guide on how a person can reflash their DS to allow homebrew using only a Nintendo WiFi dongle? I tried to read the guide, but I didn't think it was really clear about what it was accomplishing; it seemed to be just a guide on how to do it, whatever "it" happened to be... :)
 
It's a guide on how to turn your Nintendo USB dongle into a Download Play station. Check the bottom part (my contribution): it's much easier using a plain Knoppix Live CD and a few files you need to download (the RAR archive and the actual .nds ROM demo).

With it you can run demos on your DS via Download Play, and using the -w parameter you can run WiFiMe.
 
I just got a SuperCard mini SD today.

It's as big as a GBA cartridge and has a miniSD memory card slot in the front, so it doesn't need to be pulled in and out to change memory cards:
http://eng.supercard.cn/

I ordered it at www.supercardstore.com (UK site) and got it delivered in 2 days for 44€ :)

I've never received items from UK online stores in 2 days before; it usually takes a week.

Moonshell works nicely; I have to get DSOrganize too.

Any other suggestions? I heard that I can play my LucasArts games via an emulator?
I'd love to replay Day of the Tentacle and Monkey Island.

If the mods feel that this post is too much of an advertisement, I will delete it.
 
With so many flashcart products available, which one should I pick? Is there a flashcart thingy available yet that sits ONLY in the NDS cart slot rather than the GBA slot?

I'm not interested in pirating DS games; I just wanna play some MAME games or some SCUMM adventures or such. Maybe even try a bit of Java, Linux or originally developed software; these consoles seem to attract an increasing following of homebrew apps...
 