OK so my looooong summary translation for Nishikawa's new article...
http://www.watch.impress.co.jp/game/...70131/3dlp.htm
Quote:
- "Capcom MT Framework" is the name of the engine and MT means "Multi-Thread", "Meta Tools" and "Multi-Target".
...
- After Onimusha 3 they began sharing engines for efficiency. MT Framework was originally meant only for Dead Rising and Lost Planet, and they did not intend it to become as generic as it is now. When developing these games they evaluated a certain famous middleware, but it was a bit different from what they wanted and not fast enough, so they began developing a framework that was easy for them to use.
...
- MT Framework is multiplatform. In addition to 360 and PC, they are implementing it for PS3 too. Apparently PC support was developed mainly for development efficiency and prototyping, not really for actual product release. In Oct. 2005 the prototypes for Lost Planet and Dead Rising were already up and running. Around this time, in response to requests from other people in Capcom for a PS3 version, they started the PS3 implementation by reusing the basic design of the framework. The PC/360 version was developed by 3 people and later 5, and 4 more people were added for the PS3 version.
PC version
http://www.watch.impress.co.jp/game/...131/3dlp04.htm
360 version for the same scene
http://www.watch.impress.co.jp/game/...131/3dlp05.htm
...
- In MT Framework it's possible to watch processing load in realtime with Capcom's in-house performance monitor. The history of thread usage is also viewable. Graphics settings can be changed in realtime. But they don't license it to other companies for now because their corporate structure is not yet ready, and because a good game requires not only a bare engine but also per-title customization.
...
- Some say game programs have a typically sequential processing structure and are unsuitable for multithreading. However, according to a Capcom developer that's not true. If you look at it from the "task" point of view, dependency can be minimized and you can even say game programs are easy to parallelize. In MT Framework, workload has 3 types: module, task, loop (module is the biggest unit and loop is the smallest). A module is executed in its own dedicated thread, while a loop or a task is passed as a function pointer to 5 always-running general job threads.
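To make the "function pointer passed to always-running job threads" idea concrete, here is a minimal Python sketch. It is only an illustration: the names `JobPool`, `submit`, `wait_all` and `scale_chunk` are hypothetical and not from the article, and only the count of 5 job threads comes from the summary above.

```python
import queue
import threading


class JobPool:
    """Hypothetical stand-in for the general job threads: each worker
    loops forever, pulling (function, args) pairs off a shared queue."""

    def __init__(self, num_threads=5):  # 5 job threads, as in the summary
        self._jobs = queue.Queue()
        self._threads = [
            threading.Thread(target=self._worker, daemon=True)
            for _ in range(num_threads)
        ]
        for t in self._threads:
            t.start()

    def _worker(self):
        while True:
            func, args = self._jobs.get()
            try:
                func(*args)
            finally:
                self._jobs.task_done()

    def submit(self, func, *args):
        self._jobs.put((func, args))

    def wait_all(self):
        # Blocks until every submitted job has finished.
        self._jobs.join()


def scale_chunk(data, start, end, factor):
    # A "loop" style job: one independent slice of a larger array.
    for i in range(start, end):
        data[i] *= factor
```

A "loop" workload would be split into slices and each slice submitted as its own job, for example `pool.submit(scale_chunk, data, 0, 20, 2)`, followed by `pool.wait_all()` as the synchronization point.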
...
- http://www.watch.impress.co.jp/game/...131/3dlp08.htm
Modules are relatively independent parts of the main game program and are often related to hardware functions. Rendering, sound, collision, motion, physics simulation, and AI are modules and are processed in parallel. The more dependency can be minimized, the higher the parallelism. However it's not easy to separate modules, since a game program is often built from modules that depend heavily on each other. Also, module-based parallelism hurts cache efficiency when using SMT because modules are totally different code.
...
- http://www.watch.impress.co.jp/game/...131/3dlp09.htm
A loop is independent and carries a high load because it processes a relatively large amount of data. The good point is it works well with SMT. The bad point is that there are not many independent loops that take up a significant share of a game program. Also, synchronization overhead is large for a loop that processes little data.
...
- http://www.watch.impress.co.jp/game/...131/3dlp10.htm
A task in a game program is a unit that does update and drawing in each frame. Examples are player, enemy, bullet, camera, lighting etc. Each task has an "update process" and a "drawing process". Independent tasks are executed in parallel. The good points are that dependency can be limited and SMT efficiency is higher when executing the same kind of tasks. The bad point is that nothing can run in parallel when there are no independent tasks.
...
- http://www.watch.impress.co.jp/game/...131/3dlp11.htm
The update process in a task can be further categorized into "parallel update" and "synchronized update". The former is update work that can be parallelized and the latter is work that cannot.
In the flow of older game programs (left) the CPU does update (the green part), makes drawing commands (red) and issues them with double buffering to the GPU (light blue), and the GPU does drawing (blue). In a parallelized game program (right) parallel update (green) is done on multiple CPU cores and synchronized update (yellow) on one core, then drawing commands are created in parallel again (red).
http://www.watch.impress.co.jp/game/...131/3dlp12.htm
...
- http://www.watch.impress.co.jp/game/...131/3dlp13.htm
In parallel update, independent tasks are registered in the same line and dependent ones are in different lines. Tasks in the same line can't refer to or update other tasks. Tasks can refer to tasks in other lines.
http://www.watch.impress.co.jp/game/...131/3dlp14.htm
In synchronized update, tasks are updated sequentially in a single thread and can freely update and reference other tasks. To finish it in the shortest time, the cache and memory bandwidth are fully assigned to the single thread and unnecessary synchronization objects are turned off. Actual calculation is postponed to the next parallel update as much as possible.
In the Xbox 360 version of MT Framework the number of threads is fixed to 5 including the main thread. In an update process, tasks in the same line are pushed into the job queue and executed by 5 threads in parallel. When drawing, all tasks are pushed into the queue regardless of lines.
Job queue in symmetric multicore (Xbox 360)
http://www.watch.impress.co.jp/game/...131/3dlp15.htm
Lines in a tool in MT Framework
http://www.watch.impress.co.jp/game/...131/3dlp16.htm
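The line scheme above can be sketched roughly as follows. This is one reading of the summary, not Capcom's code: it assumes lines are processed one after another with a barrier between them, and the function name `run_parallel_update` and the task layout are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel_update(lines, num_threads=5):
    """Run each line's tasks in parallel; the lines themselves run in order.

    `lines` is a list of lists of zero-argument callables. Tasks that share
    a line are assumed independent of each other, while a task may read
    results produced by earlier lines (an assumption for illustration).
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for line in lines:
            futures = [pool.submit(task) for task in line]
            for f in futures:
                f.result()  # barrier: finish this line before the next


# Hypothetical example: four enemies update in one line, then a camera
# task in a later line reads what they wrote.
state = {}


def make_enemy_update(i):
    def update():
        state[i] = i * 10
    return update


def camera_update():
    state["cam"] = sum(state[i] for i in range(4))


run_parallel_update([[make_enemy_update(i) for i in range(4)],
                     [camera_update]])
```

For the drawing pass the summary says all tasks go into the queue regardless of line, which in this sketch would correspond to flattening `lines` into a single list.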
...
- http://www.watch.impress.co.jp/game/...131/3dlp17.htm
http://www.watch.impress.co.jp/game/...131/3dlp18.htm
For parallel drawing commands, drawing order is important.
http://www.watch.impress.co.jp/game/...131/3dlp19.htm
First it generates intermediate drawing commands with priority tags and then sorts them to create native drawing commands. Intermediate commands are generated in parallel and stored in a per-thread buffer.
http://www.watch.impress.co.jp/game/...131/3dlp20.htm
http://www.watch.impress.co.jp/game/...131/3dlp21.htm
Then a parallel merge sort is applied to each command buffer. Finally the multiple command buffers are combined by merge sort into a single command buffer and converted to native commands.
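The sort-then-merge flow can be sketched like this. It is a simplified illustration: the `(priority, command)` tuple format and the function name are hypothetical, Python's built-in `sorted` stands in for the per-buffer parallel merge sort, and "conversion to native commands" is reduced to dropping the priority tag.

```python
import heapq


def build_native_commands(per_thread_buffers):
    """per_thread_buffers: one list per job thread of (priority, command)
    tuples, in whatever order the tasks happened to emit them."""
    # Step 1: sort each thread's buffer on its own (independent per buffer,
    # so this step could run in parallel).
    sorted_buffers = [sorted(buf, key=lambda c: c[0])
                      for buf in per_thread_buffers]
    # Step 2: merge the sorted buffers into one stream ordered by priority.
    merged = heapq.merge(*sorted_buffers, key=lambda c: c[0])
    # Step 3: "convert" to native commands; here that is just the payload.
    return [command for _priority, command in merged]


buffers = [
    [(30, "hud"), (10, "sky")],
    [(20, "enemy")],
    [(25, "particles"), (15, "terrain")],
]
commands = build_native_commands(buffers)
```

The point of the priority tag is that the final order no longer depends on which thread happened to generate a command first.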
...
- Though some say the performance of the Xbox 360 CPU is not very good, according to Capcom a single core of the Xbox 360 CPU performs at 2/3 of a Pentium 4 at the same clock speed. When SMT is fully exploited, about 4 times the performance can be observed. In PC terms it's comparable to 4 SMT threads on a dual-core Pentium 4 Extreme Edition 840 (3.2GHz).
...
- It's often pointed out that the memory controller is the weak point of the 360 architecture. According to Capcom, though the latency is not small, SMT is there to hide it. This is very effective on 360, which gets about 1.5 times the performance per core when SMT is used, unlike Pentium 4, which is only 1.2-1.3 times faster per core with Hyperthreading.
Dead Rising - 6 threads are 2.6 times as fast as a single thread
http://www.watch.impress.co.jp/game/...131/3dlp22.htm
http://www.watch.impress.co.jp/game/...131/3dlp23.htm
http://www.watch.impress.co.jp/game/...131/3dlp24.htm
Lost Planet - 2.15 times faster
http://www.watch.impress.co.jp/game/...131/3dlp25.htm
http://www.watch.impress.co.jp/game/...131/3dlp26.htm
http://www.watch.impress.co.jp/game/...131/3dlp27.htm
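As a back-of-the-envelope check of the CPU comparison, the figures quoted above can simply be multiplied together. This is a naive model (cores x per-core speed x SMT gain) and an assumption on my part, not a calculation from the article; the 3-core count of the Xbox 360 CPU is the only number not taken from the summary.

```python
def relative_throughput(cores, per_core, smt_gain):
    """Throughput in units of one Pentium 4 core without Hyperthreading."""
    return cores * per_core * smt_gain


# Xbox 360: 3 cores, each about 2/3 of a P4 core, about 1.5x with SMT.
xbox360 = relative_throughput(cores=3, per_core=2 / 3, smt_gain=1.5)

# Pentium 4 Extreme Edition 840: 2 cores, 1.2-1.3x with Hyperthreading.
p4_ee_840_low = relative_throughput(cores=2, per_core=1.0, smt_gain=1.2)
p4_ee_840_high = relative_throughput(cores=2, per_core=1.0, smt_gain=1.3)

# Ideal speedup over a single Xbox 360 thread under the same model.
ideal_speedup_360 = 3 * 1.5
```

Under this model the 360 lands at about 3.0 P4-core equivalents against 2.4-2.6 for the dual-core part, and the ideal speedup of 4.5 sits a bit above the "about 4 times" figure, with the measured 2.6x and 2.15x well below it.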
...
- A thread that ends quickly has to wait for the others before the next sync update (= pipeline bubble). When there are many jobs the bubble is minimized, since a thread that has finished a job fetches another from the queue. When there are few jobs they are spread across different cores rather than SMT threads. Long jobs are placed at the beginning of the queue to reduce stalls.
A scene with low load, the color bars show the load
http://www.watch.impress.co.jp/game/...131/3dlp29.htm
A scene with high load
http://www.watch.impress.co.jp/game/...131/3dlp30.htm
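Why putting long jobs at the front of the queue reduces stalls can be seen with a tiny simulation. This is a generic greedy scheduling sketch with made-up job costs, not a model of the actual MT Framework queue.

```python
import heapq


def makespan(job_costs, num_threads):
    """Greedy simulation: each job, in queue order, goes to whichever
    thread becomes free first. Returns when the last thread finishes."""
    finish_times = [0.0] * num_threads
    heapq.heapify(finish_times)
    for cost in job_costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


jobs = [1, 1, 1, 1, 1, 1, 6]  # hypothetical costs, long job queued last
naive = makespan(jobs, num_threads=3)
long_first = makespan(sorted(jobs, reverse=True), num_threads=3)
```

With the long job queued last, two threads sit idle while the third works through it (finish time 8). With the long job first, the short jobs fill the other threads in the meantime (finish time 6).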
...
- http://www.watch.impress.co.jp/game/...131/3dlp28.htm
In the PS3 version of the engine, which is currently under development, the parallelism implementation is fairly different due to the asymmetric cores. In the job queue, job threads are implemented as software threads. In a job function, part of the processing is offloaded to a co-processor. Until that work is done, execution switches to another software thread. The co-processor fetches data by DMA, processes it, then stores the result by DMA.
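The "switch to another software thread while the co-processor works" idea can be illustrated with Python generators. This only models the switching; there is no real co-processor or DMA here, the tick counts are invented, and the names `job` and `run_jobs` are hypothetical.

```python
from collections import deque


def job(name, coproc_ticks, log):
    """A job as a generator: it yields the number of ticks the (simulated)
    co-processor needs, and is resumed once that work is done."""
    log.append(f"{name} start")
    yield coproc_ticks  # switch away while the co-processor works
    log.append(f"{name} finish")


def run_jobs(jobs):
    """Cooperative scheduler: instead of blocking on the co-processor,
    run whichever software thread is ready."""
    ready = deque(jobs)
    waiting = []  # [ticks_left, job] pairs
    while ready or waiting:
        if ready:
            current = ready.popleft()
            try:
                ticks = next(current)
                waiting.append([ticks, current])
            except StopIteration:
                pass  # this job has completed
        # One tick of simulated time passes for outstanding offloaded work.
        for entry in waiting:
            entry[0] -= 1
        ready.extend(j for t, j in waiting if t <= 0)
        waiting = [e for e in waiting if e[0] > 0]


log = []
run_jobs([job("A", 2, log), job("B", 1, log)])
```

Job B gets started while A's offloaded work is still in flight, which is the latency-hiding effect the summary describes.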
...
- In Lost Planet, each character is 10-20K polys. A VS robot is 30-40K polys. A background is about 500K. With shadows and other hidden rendering cost, it's about 3 million polys per frame. In Dead Rising, it's about 4 million polys. In Lost Planet special effects have more load but in Dead Rising they focused on polygon budgets for zombies.
...
- Rendering resolution is 720p. AA is basically 4xMSAA and runs at 30fps, but when load is higher it's dynamically adjusted down to 2xMSAA or no AA. According to Capcom, on Xbox 360 2xMSAA is +10% load and 4xMSAA is +20% load.
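How the engine actually decides when to drop the AA level is not described, but the idea can be sketched with the quoted costs. In this hypothetical version the AA overhead is treated as a flat multiplier on the whole frame time and the highest level that still fits a 30fps budget is chosen; both the rule and the name `pick_msaa` are assumptions.

```python
FRAME_BUDGET_MS = 1000.0 / 30.0          # 30 fps target from the summary
MSAA_COST = {4: 1.20, 2: 1.10, 0: 1.00}  # +20% / +10% / no AA


def pick_msaa(base_frame_ms):
    """Return the highest MSAA level whose estimated cost fits the budget.

    `base_frame_ms` is the frame time without AA (hypothetical input).
    """
    for samples in (4, 2, 0):
        if base_frame_ms * MSAA_COST[samples] <= FRAME_BUDGET_MS:
            return samples
    return 0  # over budget even without AA: nothing left to drop
```

For example, a 27 ms frame still fits with 4xMSAA (32.4 ms), a 29 ms frame only fits with 2xMSAA (31.9 ms), and a 31 ms frame has to run without AA.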
...
- 160MB textures are in memory at one time, in which 60-80MB are backgrounds. Loading is seamless for Lost Planet.
...
- Havok for destruction physics
http://www.watch.impress.co.jp/game/...131/3dlp45.htm
Dead Rising and Lost Planet both use Havok for basic physics such as ragdoll. Currently Havok physics runs single-threaded, and implementing a multithreaded engine is a future task for them.
...
- Character-local physics such as accessories and cloth simulation, character-to-character collisions, and inverse kinematics are done by an original multithreaded physics engine. IK for zombies in Dead Rising was made possible by this multithreaded engine.
Without IK
http://www.watch.impress.co.jp/game/...131/3dlp46.htm
With IK
http://www.watch.impress.co.jp/game/...131/3dlp47.htm
Local physics by the original engine
http://www.watch.impress.co.jp/game/...131/3dlp48.htm
Collision detection by the original engine
http://www.watch.impress.co.jp/game/...131/3dlp49.htm
...
- http://www.watch.impress.co.jp/game/...131/3dlp50.htm
The "2.5D motion blur" technique is based on NVIDIA OpenGL Shader Tricks by Simon Green at GDC2003.
http://www.watch.impress.co.jp/game/...131/3dlp51.htm
http://www.watch.impress.co.jp/game/...131/3dlp52.htm
Renders a scene into a texture -> renders a velocity map into a texture by Vertex Shader -> generates the final frame by sampling the scene texture based on the velocity map.
http://www.watch.impress.co.jp/game/...131/3dlp53.htm
This method has 3 problems: glitches when stretching a model that has spots where vertices share the same coordinates but have different normal vectors, image-based rendering artifacts at model edges with different velocities, and doubled vertex load from rendering both the original scene and the velocity map. For the 1st problem, they added dummy polygons at such vertices, as in the stencil shadow volume method.
http://www.watch.impress.co.jp/game/...131/3dlp55.htm
For the second problem, the depth information for models is generated when generating a velocity map to reject artifacts.
http://www.watch.impress.co.jp/game/...131/3dlp56.htm
Scene texture
http://www.watch.impress.co.jp/game/...131/3dlp57.htm
Velocity map
http://www.watch.impress.co.jp/game/...131/3dlp59.htm
Artifacts
http://www.watch.impress.co.jp/game/...131/3dlp61.htm
Depth information for a scene texture
http://www.watch.impress.co.jp/game/...131/3dlp58.htm
Depth information for a velocity map
http://www.watch.impress.co.jp/game/...131/3dlp60.htm
Depth information rejects artifacts
http://www.watch.impress.co.jp/game/...131/3dlp62.htm
For the 3rd problem, they use a LOD-like method and apply the 2.5D motion blur that stretches models only to the foreground. Since static models and far models hardly move in the world space, their velocities are drawn based on the information in Z-buffer (= camera blur).
http://www.watch.impress.co.jp/game/...131/3dlp63.htm
http://www.watch.impress.co.jp/game/...131/3dlp64.htm
http://www.watch.impress.co.jp/game/...131/3dlp65.htm
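The basic flow (scene texture -> velocity map -> final frame sampled along the velocity) can be shown on a single scanline. This is a deliberately reduced sketch: one dimension, nearest-texel sampling, no depth rejection, and the function name and sample count are my own choices, not from the article.

```python
def motion_blur_scanline(scene, velocity, num_samples=4):
    """scene: list of pixel intensities; velocity: per-pixel motion in
    pixels. Each output pixel averages scene samples along its velocity."""
    width = len(scene)
    out = []
    for x in range(width):
        total = 0.0
        for i in range(num_samples):
            t = i / (num_samples - 1) if num_samples > 1 else 0.0
            sx = int(round(x - velocity[x] * t))  # step back along motion
            sx = min(max(sx, 0), width - 1)       # clamp to texture edge
            total += scene[sx]
        out.append(total / num_samples)
    return out


scene = [0, 0, 8, 0, 0]
still = motion_blur_scanline(scene, [0, 0, 0, 0, 0])
streak = motion_blur_scanline(scene, [0, 0, 0, 1, 0], num_samples=2)
```

In `streak`, pixel 3 carries a velocity even though the bright texel sits at pixel 2. That is roughly what stretching the model in the velocity pass achieves: the velocity map covers the area where the streak should appear, so those pixels pull in color from where the object was drawn.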
...
- http://www.watch.impress.co.jp/game/...131/3dlp68.htm
This 2.5D motion blur suits the unified shader architecture of Xbox 360, because more shader units go to vertex shading when stretching vertices and more to pixel shading when generating the velocity map and the blur. According to Capcom, the current 2.5D motion blur takes about 5ms, of which vertex stretching takes 1ms. They feel the vertex performance of the Xbox 360 GPU can match that of NVIDIA GeForce 8800. The bad points are that a still picture does not look very good and MSAA can't be applied to the blur. MT Framework can output supersampled images for media PR to hide these defects.
Artifacts
http://www.watch.impress.co.jp/game/...131/3dlp66.htm
Supersampled output
http://www.watch.impress.co.jp/game/...131/3dlp67.htm
MSAA and edge artifacts
http://www.watch.impress.co.jp/game/...131/3dlp69.htm
http://www.watch.impress.co.jp/game/...131/3dlp70.htm
Quick workaround by stretching edges
http://www.watch.impress.co.jp/game/...131/3dlp71.htm
http://www.watch.impress.co.jp/game/...131/3dlp72.htm
Full-scene 2.5D motion blur pipeline
http://www.watch.impress.co.jp/game/...131/3dlp73.htm
2.5D motion blur in the product
http://www.watch.impress.co.jp/game/...131/3dlp74.htm
Theoretical PR shot with supersampling
http://www.watch.impress.co.jp/game/...131/3dlp75.htm
No blur
http://www.watch.impress.co.jp/game/...131/3dlp76.htm
...
- http://www.watch.impress.co.jp/game/...131/3dlp77.htm
http://www.watch.impress.co.jp/game/...131/3dlp78.htm
MSAA reduction buffer is based on a presentation by Masaki Kawase at CEDEC 2002. The original reduction buffer method works like this: scene color and Z are reduced to 1/4 size or smaller -> effects are drawn into the reduction buffers with transparency in alpha -> the reduction buffers are enlarged and composited with the scene by alpha. Though the picture quality is lower, at 1/4 size it cuts the fill-rate cost to 1/4. But this method has several problems in HD. Because the depth test is done against a low-res Z-buffer, Z relations between particles and objects are not tested correctly and object edges tend to have artifacts. Also, an unnatural blur appears in the composited parts because the translucent part of the mask is low-res.
...
- http://www.watch.impress.co.jp/game/...131/3dlp79.htm
To remedy these problems in the reduction buffer method, MT Framework uses the 10MB EDRAM in the Xbox 360 GPU. In MT Framework, a scene frame is sent to the EDRAM as the content of the sub-pixel buffer without reducing the resolution, unlike the original method that reduces it to 1/4. A pixel shader converts the pixel arrangement between the scene frame and the MSAA sub-pixel buffer. Then it draws translucent particle effects in the EDRAM without adding MSAA. Because of the MSAA algorithm, 1 pixel of a translucent particle is drawn into 4 sub-pixels in the 4xMSAA buffer. When all particles are drawn, the content of the sub-pixel buffer is loaded back into the original scene frame arrangement. The rest is the same as the original method. This technique can achieve 4 times the fill rate for these effects.
24.02 fps without MSAA reduction buffer
http://www.watch.impress.co.jp/game/...131/3dlp80.htm
33.13 fps with MSAA reduction buffer
http://www.watch.impress.co.jp/game/...131/3dlp81.htm
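For a sense of scale, the two frame rates above can be converted to frame times, and the "1 pixel covers 4 sub-pixels" point can be checked with simple arithmetic. The 1280x720 target and the idea that the frame is reinterpreted as a quarter-size 4xMSAA surface are my reading of the summary, not confirmed details.

```python
def frame_ms(fps):
    return 1000.0 / fps


without_buffer = frame_ms(24.02)  # about 41.6 ms per frame
with_buffer = frame_ms(33.13)     # about 30.2 ms per frame
saved_ms = without_buffer - with_buffer

# Sample-count check (assumed layout): a 1280x720 frame viewed as a 4xMSAA
# surface needs only a quarter as many pixels, since each pixel holds
# 4 sub-samples.
full_pixels = 1280 * 720
msaa_pixels = full_pixels // 4    # same count as a 640x360 surface
```

So in the captured scene the technique saves roughly 11.4 ms per frame, and each particle pixel written into the MSAA view touches four of the original frame's pixels.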
...
- http://www.watch.impress.co.jp/game/...131/3dlp83.htm
The good points of MSAA reduction buffer: it has no compositing problems except for resolution, it has no Z aliasing or blurring, and it doesn't require an 8-bit output alpha channel, which means it is applicable to FP10. The bad points are that it becomes blocky because bilinear filtering is unavailable, and that it depends on the Xbox 360 GPU hardware design.
To reduce block artifacts, only nearby effects are in the reduction buffer. Also LOD and motion blur are applied to hide them.
http://www.watch.impress.co.jp/game/...131/3dlp82.htm
http://www.watch.impress.co.jp/game/...131/3dlp84.htm
...
- Shadows are basically LSPSM (Light Space Perspective Shadow Maps). But depending on the distance from the view frustum, Near, Middle, and Far each have their own shadow map. In Lost Planet each is a 1024x1024 texel shadow map, so it renders 12MB per frame for shadows alone. Generating a shadow map means Z-buffer rendering, and it helps that the Xbox 360 GPU has 2x fillrate when output is Z only.
Self shadow and mutual shadow
http://www.watch.impress.co.jp/game/...131/3dlp86.htm
Near shadow map
http://www.watch.impress.co.jp/game/...131/3dlp87.htm
Middle shadow map
http://www.watch.impress.co.jp/game/...131/3dlp88.htm
Far shadow map
http://www.watch.impress.co.jp/game/...131/3dlp89.htm
In LSPSM (green; the black is the view frustum and the red is the light source) jaggies appear nearby and flicker appears far away. In Lost Planet, they extended LSPSM in a cascade style (blue) like 3DMark06 and the Doom 4 engine.
http://www.watch.impress.co.jp/game/...131/3dlp90.htm
In actual shadow drawing it only switches shadow maps by the view coordinate and the Z value of a target pixel. They couldn't add an interpolation step at the boundaries between shadow maps because the processing load gets too high. In Parallel-Split Shadow Maps by Fan Zhang the distances are split automatically, but Lost Planet does it manually.
Glitch in LSPSM
http://www.watch.impress.co.jp/game/...131/3dlp91.htm
Cascade-extended LSPSM
http://www.watch.impress.co.jp/game/...131/3dlp92.htm
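Two pieces of this can be checked or sketched quickly: the 12MB figure, and the "just switch maps by depth" selection. The 4 bytes per texel, the split distances and the function name below are assumptions for illustration, and this version looks at view-space Z only.

```python
BYTES_PER_TEXEL = 4      # assumed 32-bit depth format
CASCADE_SIZE = 1024      # 1024x1024 texels per map, from the summary
shadow_bytes = 3 * CASCADE_SIZE * CASCADE_SIZE * BYTES_PER_TEXEL
shadow_mb = shadow_bytes / (1024 * 1024)  # 12.0 under these assumptions

# Hand-tuned split distances in view space (hypothetical values).
SPLITS = [("near", 10.0), ("middle", 40.0), ("far", 200.0)]


def pick_shadow_map(view_z):
    """Pick the cascade for a pixel from its view-space depth alone.

    There is no blending across cascade boundaries, matching the summary's
    note that such a step was too expensive.
    """
    for name, max_z in SPLITS:
        if view_z <= max_z:
            return name
    return None  # beyond the far cascade: no shadow map covers this pixel
```

Because the splits are fixed by hand instead of computed per frame, the selection is just a couple of comparisons per pixel.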
...
- Soft shadows are done by 9-sample PCF (Percentage Closer Filtering) in a pixel shader. Splinter Cell 3: Chaos Theory does 16 samples. According to Capcom, dynamic branching was slow for them, so they apply 9 samples to all shadows.
4-point sampling bilinear PCF
http://www.watch.impress.co.jp/game/...131/3dlp93.htm
9-point sampling bilinear PCF
http://www.watch.impress.co.jp/game/...131/3dlp94.htm
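A 9-sample PCF lookup can be sketched as a 3x3 grid of depth comparisons averaged together. This is a textbook-style illustration, not the shipped shader: the bilinear weighting mentioned in the captions is left out, and the bias value and function name are assumptions.

```python
def pcf_3x3(shadow_map, x, y, receiver_depth, bias=0.001):
    """Return the lit fraction (0.0 to 1.0) from a 3x3 neighbourhood of
    depth comparisons centred on texel (x, y). Edges are clamped."""
    height, width = len(shadow_map), len(shadow_map[0])
    lit = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            sx = min(max(x + dx, 0), width - 1)
            sy = min(max(y + dy, 0), height - 1)
            # Lit if nothing nearer to the light was recorded at this texel.
            if receiver_depth - bias <= shadow_map[sy][sx]:
                lit += 1
    return lit / 9.0


# Hypothetical 3x3 shadow map: an occluder (depth 0.2) covers the left column.
shadow_map = [
    [0.2, 1.0, 1.0],
    [0.2, 1.0, 1.0],
    [0.2, 1.0, 1.0],
]
lit_fraction = pcf_3x3(shadow_map, 1, 1, receiver_depth=0.5)
```

Averaging comparison results (not depths) is what gives the soft edge: here 6 of the 9 samples are lit, so the pixel is two-thirds lit.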
...
- Without normal maps
http://www.watch.impress.co.jp/game/...131/3dlp95.htm
With normal maps
http://www.watch.impress.co.jp/game/...131/3dlp96.htm
Bloom and glare HDR expressions
http://www.watch.impress.co.jp/game/...131/3dlp97.htm
FP16-64 HDR (on PC)
http://www.watch.impress.co.jp/game/...131/3dlp98.htm
FP10-32 HDR on Xbox 360
http://www.watch.impress.co.jp/game/...131/3dlp99.htm
The water surface is done by scrolling multi-layered normal maps, the water bottom by sampling scene-rendering textures, and reflection on the surface by precomputed environment maps. Wave simulation and dynamic reflection are omitted but it has Fresnel reflection. Human skin uses specular maps and diffuse maps adjusted by artists. Fur has 8 layers. Particle edges are softened by comparing their depth info with the Z-buffer. Soft particles run at a usable speed thanks to the EDRAM reduction buffer technique.
Without soft particle
http://www.watch.impress.co.jp/game/...31/3dlp102.htm
With soft particle
http://www.watch.impress.co.jp/game/...31/3dlp103.htm
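The soft particle depth comparison usually boils down to fading alpha by the gap between the particle and the geometry behind it. A minimal sketch of that common formulation follows; the linear fade, the `fade_distance` parameter and the function name are assumptions, since the summary only says depth is compared with the Z-buffer.

```python
def soft_particle_alpha(alpha, scene_depth, particle_depth, fade_distance=0.5):
    """Fade a particle pixel out as it approaches the opaque geometry
    behind it.

    Depths are view-space distances; `fade_distance` is a hypothetical
    tuning value, not a figure from the article.
    """
    gap = scene_depth - particle_depth
    fade = min(max(gap / fade_distance, 0.0), 1.0)  # saturate to [0, 1]
    return alpha * fade
```

A particle pixel well in front of the wall keeps its alpha, one that is 0.25 units away at the default setting is half faded, and one at or behind the wall contributes nothing, which removes the hard intersection line.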
...
- http://www.watch.impress.co.jp/game/...31/3dlp104.htm
http://www.watch.impress.co.jp/game/...31/3dlp105.htm
http://www.watch.impress.co.jp/game/...31/3dlp106.htm
http://www.watch.impress.co.jp/game/...31/3dlp107.htm
MT Framework has texture compression methods for normal maps and HDR textures based on DXTC. It provides the high-quality version (DXT5) and the low-quality version (DXT1).
http://www.watch.impress.co.jp/game/...31/3dlp108.htm
http://www.watch.impress.co.jp/game/...31/3dlp109.htm
For normal maps, DXT5 encoding puts X in alpha and Y in G while R is 1.0. DXT1 puts X in R and Y in G while alpha = 1.0. Both can be decoded by the same shader code (X=R*A, Y=G, Z=SQRT(1.0-X^2-Y^2) ).
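The shared decode works because whichever of R and A is not carrying X is fixed at 1.0, so R*A recovers X in both layouts. A small sketch follows. It ignores the actual block compression and the usual 0..1 to -1..1 remapping, and the dictionary texel format is just for illustration.

```python
import math


def encode_dxt5(x, y):
    # X in alpha, Y in green, red fixed at 1.0 (blue unused here).
    return {"r": 1.0, "g": y, "b": 0.0, "a": x}


def encode_dxt1(x, y):
    # X in red, Y in green; alpha reads back as 1.0.
    return {"r": x, "g": y, "b": 0.0, "a": 1.0}


def decode_normal(texel):
    """Same code path for both layouts: X = R * A, Y = G, and Z is
    rebuilt from the unit-length constraint."""
    x = texel["r"] * texel["a"]
    y = texel["g"]
    z = math.sqrt(max(0.0, 1.0 - x * x - y * y))
    return (x, y, z)
```

Putting X in the DXT5 alpha channel is the usual trick for quality, since alpha is compressed separately from the color channels in that format.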
http://www.watch.impress.co.jp/game/...31/3dlp110.htm
http://www.watch.impress.co.jp/game/...31/3dlp111.htm
http://www.watch.impress.co.jp/game/...31/3dlp112.htm
http://www.watch.impress.co.jp/game/...31/3dlp113.htm
To encode an HDR texture, it puts the reciprocal of the maximum value in the whole texture {R,G,B,1.0} in Alpha, which serves as a scaling coefficient, and divides RGB by that maximum value. Decoding is RGB / A. In DXT1 dynamic range is lost but it can be processed by the same shader.
http://www.watch.impress.co.jp/game/...31/3dlp112.htm
http://www.watch.impress.co.jp/game/...31/3dlp113.htm
For a light map and other kinds of HDR textures, the maximum value in a texture is stored in a Range constant, which is used for shading when rendering. In DXT5 the RGB values of the HDR texture are normalized and the absolute value is stored in Alpha. In DXT1 the gradation quality gets lower because the absolute value is lost, but the maximum value for the dynamic range stays as a constant.
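The RGB / A scheme for HDR textures can be sketched as follows. This takes the maximum per texel over {R,G,B,1.0}, which is my reading of the description and may not match the original exactly; block compression and quantization are not modelled, and the function names are hypothetical.

```python
def encode_hdr(r, g, b):
    """Scale an HDR colour into [0, 1] and keep the scale in alpha."""
    peak = max(r, g, b, 1.0)  # colours already in range are left alone
    return (r / peak, g / peak, b / peak, 1.0 / peak)


def decode_hdr(r, g, b, a):
    # Decoding is RGB / A, as in the description above.
    return (r / a, g / a, b / a)


encoded = encode_hdr(4.0, 2.0, 1.0)
decoded = decode_hdr(*encoded)
```

With a DXT1 texture alpha reads back as 1.0, so the same `decode_hdr` call still runs but returns the normalized color unchanged, which is the lost dynamic range mentioned above.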
Last edited by one; 01-Feb-2007 at 02:02.