Do 360/PS3 programmers still use assembler?

erp, i'm one of those c++ programmers you lament, but i've gotta give you some thanks for wdc and stunt racer 64. both are impressive even now (and often still beautiful)

edit: and this blog post, as it's loosely applicable to the thread, rings true to me like few things i've read from other programmers
 
edit: and this blog post, as it's loosely applicable to the thread, rings true to me like few things i've read from other programmers
"When I was a lad we had 40 minute link times, and we were 'appy".

That was on a transputer development system (whose acronym described the link times perfectly) hosted in, "state of the art", 286-based PCs.


Back to ASM-related matters...
After profiling some of my code, I recently decided to try using SSE2 intrinsics to speed up some ARGB:8888 colour synthesis/blending, replacing some "pseudo SIMD" I'd previously done (packing pairs of channels into two 32-bit variables). The challenge of working around the gaps in the instruction set and getting the result faster than the original code was not what I was expecting.
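
For readers who haven't seen the "pseudo SIMD" trick, here is a minimal sketch of one way it can look in C. It is not the poster's actual code: the function name, the 0..256 weight convention and the linear blend formula are assumptions made purely for illustration. Each 32-bit variable holds two 8-bit channels in 16-bit lanes, so one multiply handles two channels at once.

```c
#include <stdint.h>

/* Hypothetical helper: blend src into dst with weight w in 0..256
 * (0 = all dst, 256 = all src), two channels per 32-bit multiply. */
static uint32_t blend_argb_swar(uint32_t dst, uint32_t src, uint32_t w)
{
    const uint32_t mask = 0x00FF00FFu;

    /* Spread the four channels over two words: R,B in one, A,G in the other. */
    uint32_t dst_rb = dst & mask;
    uint32_t dst_ag = (dst >> 8) & mask;
    uint32_t src_rb = src & mask;
    uint32_t src_ag = (src >> 8) & mask;

    /* Each lane holds at most 255 * 256 = 65280, so nothing spills into
     * the neighbouring 16-bit lane. */
    uint32_t rb = dst_rb * (256u - w) + src_rb * w;
    uint32_t ag = dst_ag * (256u - w) + src_ag * w;

    /* Divide by 256: shift R,B down; A,G are already in the high bytes. */
    return ((rb >> 8) & mask) | (ag & 0xFF00FF00u);
}
```

The attraction is that it needs nothing beyond ordinary integer registers, which is also why it sets a fairly high bar for an SSE2 replacement to beat.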
 
The challenge of working around the gaps in the instruction set and getting the result faster than the original code was not what I was expecting.
Vector calculation can be a bit challenging to get right, especially the first time. Remember that vector pipelines are long, and it takes considerable time to move data from vector registers to float/int registers. So do not mix and match vector and float/int math; keep the data in vector registers as long as possible. Remember too that it's not bad practice to occasionally use the vector pipeline for single scalar calculations. It's not a waste of resources even if only one component does real work (vector ops are usually as fast as their scalar counterparts, and moving data to a float/int register takes considerable time). Eventually you will learn to populate the unused components of vectors better, but in many cases you just have to waste some computational power (0.75 cycles of "unneeded work" is far better than a stall worth tens of cycles).

Aggressively unroll / inline your vector calculations so that the compiler can reorder the instructions / registers for good pipeline usage (a long vector pipeline needs lots of ILP).
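
As a rough sketch of the "stay in vector registers" advice, applied to the ARGB blend mentioned earlier in the thread: the code below loads four pixels, does all the arithmetic in SSE2 registers, and only touches memory again at the store. The function names, the 0..256 weight and the blend formula are assumptions for illustration, and the `__SSE2__` check assumes a gcc/clang style compiler. SSE2 has no 8-bit multiply, so the channels are widened to 16 bits first.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference for one pixel; also used for the leftover tail. */
static uint32_t blend_pixel(uint32_t d, uint32_t s, uint32_t w)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t dc = (d >> shift) & 0xFFu;
        uint32_t sc = (s >> shift) & 0xFFu;
        out |= ((dc * (256u - w) + sc * w) >> 8) << shift;
    }
    return out;
}

/* Blend n pixels of src into dst, weight w in 0..256 (hypothetical API). */
static void blend_argb(uint32_t *dst, const uint32_t *src, size_t n, uint32_t w)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    const __m128i ws = _mm_set1_epi16((short)w);
    const __m128i wd = _mm_set1_epi16((short)(256u - w));
    for (; i + 4 <= n; i += 4) {
        __m128i d = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        /* Widen 8-bit channels to 16-bit lanes (two pixels per register). */
        __m128i d_lo = _mm_unpacklo_epi8(d, zero);
        __m128i d_hi = _mm_unpackhi_epi8(d, zero);
        __m128i s_lo = _mm_unpacklo_epi8(s, zero);
        __m128i s_hi = _mm_unpackhi_epi8(s, zero);
        /* Per lane: d * (256 - w) + s * w, at most 65280, fits in 16 bits. */
        __m128i lo = _mm_add_epi16(_mm_mullo_epi16(d_lo, wd),
                                   _mm_mullo_epi16(s_lo, ws));
        __m128i hi = _mm_add_epi16(_mm_mullo_epi16(d_hi, wd),
                                   _mm_mullo_epi16(s_hi, ws));
        /* Divide by 256 and narrow back to 8-bit channels. */
        lo = _mm_srli_epi16(lo, 8);
        hi = _mm_srli_epi16(hi, 8);
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(lo, hi));
    }
#endif
    for (; i < n; ++i)
        dst[i] = blend_pixel(dst[i], src[i], w);
}
```

The `lo` and `hi` halves are independent of each other, which gives the compiler some room to interleave them. Unrolling the loop to eight or more pixels per iteration would expose more of that kind of independent work.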
 
Apart from the problem of there being an instruction to do operation X with (hypothetically) 8-bit data but no equivalent for, say, 16-bit, and having to jump through hoops to get the functionality at the data size I needed, the big killer was that with gcc, swapping from my "pseudo-SIMD" code to SSE silently pushed some functions over the inline limit, and suddenly call overheads completely cancelled out the gains. :( It took a bit of digging through the assembler to track that down.
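
One possible guard against that kind of silent de-inlining (not necessarily what was done here) is gcc's `always_inline` function attribute on the small hot helpers. The macro name and the helper functions below are hypothetical. gcc's `-Winline` option can also warn when a function declared inline is not inlined, which may save some of the digging through assembler output.

```c
#include <stdint.h>

/* Hypothetical convenience macro; always_inline is a gcc/clang attribute. */
#if defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline
#endif

/* Hypothetical hot helper: scale two 8-bit channels held in 16-bit lanes
 * by w / 256, with w in 0..256. */
FORCE_INLINE uint32_t scale_pair(uint32_t pair, uint32_t w)
{
    return (((pair & 0x00FF00FFu) * w) >> 8) & 0x00FF00FFu;
}

/* Caller: the helper is inlined here regardless of the size heuristics. */
static uint32_t scale_argb(uint32_t px, uint32_t w)
{
    uint32_t rb = scale_pair(px, w);            /* R and B */
    uint32_t ag = scale_pair(px >> 8, w) << 8;  /* A and G */
    return rb | ag;
}
```

Forcing inlining everywhere can bloat code and hurt the instruction cache, so it is probably best kept to the few helpers that profiling shows to matter.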
 
AVX doesn't have a full integer operation set for all sizes, but AVX2 (or XOP) will include a full set of integer operations for 256-bit registers as well. Most SSE versions should have a full set of (basic) integer operations for all the supported sizes (though exotic operations can of course be more limited).
 
WRT assembler in education, I was taught the basics during a general computing degree course 4 years ago, having to write a simple calculator on a board which consisted of a hex keypad and a 4 digit LCD type readout. There was no mention of intrinsics, inlining, caches or how programming assembly in the real world works. It was a basic primer that could've been useful if there had been a follow-up module with more relevant knowledge.
 
Apart from messing about with 6502 assembly in high school and some initial "theoretical" course work (stack vs register machines etc.), my first hands-on course in assembler was with a DEC PDP-10..... 36-bit words and magnetic core memory FTW.
 