Optimizing Performance On The Xbox 360

After my Performance Tips and Tricks series I got a lot of requests for a post about performance on the Xbox 360 platform. I will try to focus on why the Xbox 360 is slower than a PC and what we can do to get around it. But first we need to underline the importance of good programming.

Algorithms and data structure complexity

I’ve been on the topic of complexity before. I can’t stress enough how important it is to keep the complexity of our algorithms low. If you are sorting a large dataset, make sure you use an algorithm that scales well, and if you are sorting a small dataset (<10 items), a simpler algorithm such as insertion sort is usually faster.
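For the small-dataset case, a hand-written insertion sort is one common option. The following is only a minimal sketch; the SmallSort class name is made up for this example and is not part of .NET or XNA.

```csharp
public static class SmallSort
{
    // Sorts the array in place, in ascending order.
    // Cheap for a handful of items, but O(n^2) in general,
    // so keep it away from large datasets.
    public static void InsertionSort(int[] items)
    {
        for (int i = 1; i < items.Length; i++)
        {
            int value = items[i];
            int j = i - 1;

            // Shift larger elements one slot to the right.
            while (j >= 0 && items[j] > value)
            {
                items[j + 1] = items[j];
                j--;
            }

            items[j + 1] = value;
        }
    }
}
```

Whether this actually beats the built-in sort for your data is something to measure, not assume.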

Note: If you use .NET’s List<T>.Sort(), the framework chooses the sorting algorithm for you. Recent desktop versions, for example, switch to insertion sort for small partitions.

If you are using pixel perfect collision detection, perhaps you should implement a physics system that uses a polygon collision system instead. You could also implement a broad phase system to filter out collisions that you don’t need to process. Implementing a fast and efficient algorithm is 100 times better than any micro-optimization or performance trick out there. But this post is not about using efficient algorithms, it is about the case where you already have implemented the algorithms and still need to squeeze the last drops of performance out of the platform.
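To make the broad phase idea a bit more concrete, here is a rough sketch that uses axis-aligned bounding boxes to filter out pairs that cannot collide. The Box and BroadPhase names are hypothetical, and the brute-force pair loop is only for illustration; a real broad phase would typically use a grid or a sort along one axis to avoid testing every pair.

```csharp
using System.Collections.Generic;

public struct Box
{
    public float MinX, MinY, MaxX, MaxY;

    public Box(float minX, float minY, float maxX, float maxY)
    {
        MinX = minX; MinY = minY; MaxX = maxX; MaxY = maxY;
    }

    // True when the two rectangles overlap on both axes.
    public bool Overlaps(ref Box other)
    {
        return MinX <= other.MaxX && MaxX >= other.MinX
            && MinY <= other.MaxY && MaxY >= other.MinY;
    }
}

public static class BroadPhase
{
    // Fills 'pairs' with index pairs (i, j), i < j, whose boxes overlap.
    // Only these pairs need the expensive narrow phase test.
    public static void FindCandidates(Box[] boxes, List<int> pairs)
    {
        pairs.Clear();
        for (int i = 0; i < boxes.Length; i++)
        {
            for (int j = i + 1; j < boxes.Length; j++)
            {
                if (boxes[i].Overlaps(ref boxes[j]))
                {
                    pairs.Add(i);
                    pairs.Add(j);
                }
            }
        }
    }
}
```

The caller passes in a reusable list so that no new list is allocated every frame.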

Why is it slow?

The Xbox 360 uses a custom triple-core 64-bit PowerPC CPU running at 3.2 GHz. It was designed by IBM with high floating point performance in mind, using multiple FPUs and SIMD vector processing units in each core. It is quite a powerful platform, but an equivalent PC still outperforms it. How can that be?

I’m afraid that the hardware is not the problem here, it is the software on the Xbox 360. Let us take an example; the following piece of code runs in 12500 ticks on Xbox 360 and 10 ticks on PC.

That is a huge difference and even if we switch to integers instead of floats (to get around the FPU) we still get around 7060 ticks on Xbox 360 and 10 ticks on PC. The low performance when using floats is due to missing AltiVec support in .Net Compact Framework and some missing optimizations in the floating point code. The .Net Compact Framework was not designed to run on Xbox 360. In fact, it was designed with portability in mind and did not have any floating point code until the Xbox was released. Thankfully the .Net CF team implemented some FPU code and that improved the floating point performance a lot (by a factor of 10!).

Then there is the garbage collector. The CF has a simplified garbage collector that works without generations. Instead, it examines all the objects every time 1 MB has been allocated. There are also a few other differences, such as code pitching, where the GC discards code that was jitted earlier in order to free up memory. And last but not least, there are the optimizations performed by the CF JIT compiler. They are severely limited, especially method inlining. The CF JIT will only inline a method that meets all of the following criteria:

16 bytes or less of IL
No branching (if statements)
No local variables
No exception handlers
No 32-bit floating point arguments or return value
If it has more than one argument, they must be accessed in order from lowest to highest.

That means it basically only inlines property getters/setters and methods that simply call other methods.
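As a rough illustration of the criteria above, compare the two members in this sketch. The Particle class is made up for the example, and whether the JIT really inlines a given method is something you would have to verify on the device.

```csharp
public class Particle
{
    private int life;

    public Particle(int life)
    {
        this.life = life;
    }

    // Likely eligible under the listed rules: tiny IL body, no branches,
    // no locals, no exception handlers, no float argument or return value.
    public int Life
    {
        get { return life; }
    }

    // Not eligible under the listed rules: it branches, and it takes
    // and returns a 32-bit float.
    public float Fade(float amount)
    {
        if (life <= 0)
            return 0f;

        return amount / life;
    }
}
```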

The missing FPU code, the limited optimizations and the simple garbage collector are all there because the CF was never designed to run on anything but embedded devices such as mobile phones. Let’s take a look at what options we have…

Optimizing the code

It is very important that you start by profiling your code. You can use the XNA Remote Performance Monitor to profile memory allocations (garbage collector problems) and you can use the Stopwatch class to time different areas of your code to see which area you should focus on. A normal PC profiler also comes in handy, as the areas that take up the most time on PC are almost always the same areas on the Xbox 360.
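Timing a section with the Stopwatch class could look something like the sketch below. The SectionTimer helper is a hypothetical name, not something XNA provides, and passing a delegate is just a convenience for the example; in a real game loop you might start and stop the Stopwatch directly around the code instead.

```csharp
using System;
using System.Diagnostics;

public static class SectionTimer
{
    // Runs 'work' the requested number of times and returns
    // the elapsed time in Stopwatch ticks.
    public static long Measure(Action work, int iterations)
    {
        Stopwatch watch = Stopwatch.StartNew();

        for (int i = 0; i < iterations; i++)
            work();

        watch.Stop();
        return watch.ElapsedTicks;
    }
}
```

Running the work many times and comparing totals tends to give more stable numbers than timing a single call.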

When you have found the different areas to focus on, we can move on to actually optimizing them.

Virtual methods
Virtual methods and properties are one of the places to optimize. While virtuals are great for OOP, they are not so great for the CF JIT compiler. Virtual methods are around 40% slower than static or instance calls and, just to make it all worse, they are never inlined by the compiler.

To get around it you need to avoid virtuals where you can. If you need to use virtuals, make sure you mark your overridden methods and classes as sealed. If you use the sealed keyword, the compiler can sometimes resolve the destination of a virtual call and avoid the expensive virtual call.
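A minimal sketch of the sealed approach, with made-up Entity and Bullet classes, might look like this:

```csharp
public abstract class Entity
{
    public abstract int Update(int elapsed);
}

// Sealing the class tells the compiler that no further overrides can exist.
public sealed class Bullet : Entity
{
    private int age;

    // Accumulates elapsed time and returns the total age.
    public override int Update(int elapsed)
    {
        age += elapsed;
        return age;
    }
}
```

The compiler only has a chance to resolve the call when the variable is typed as Bullet. A call made through an Entity reference still has to go through virtual dispatch.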

Use fields instead of properties
While properties are a neat way to keep your code clean, they are not that great performance-wise. Under the hood, properties are ordinary methods, and while simple get/set properties get inlined by the compiler, there is no guarantee that all of them will be. The solution is to use fields instead, but not at the cost of a poorly designed application. So use it wisely!
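The difference is easy to see side by side. In this sketch (the Transform class is hypothetical), X is a plain field and Y is a property wrapping a private field:

```csharp
public class Transform
{
    // Public field: read and written directly, no method call involved.
    public float X;

    private float y;

    // Property: compiles to get_Y/set_Y methods. Going by the inlining
    // criteria listed earlier, a float-typed property like this one
    // would not qualify for inlining on the CF.
    public float Y
    {
        get { return y; }
        set { y = value; }
    }
}
```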

The ref keyword
If you are working with structs, you should try to use the ref keyword to pass them by reference instead of by value. If you are working with the XNA library, the Matrix, Vector2 and Vector3 structs have math methods that use ref and out to pass structs by reference instead of by value.
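The sketch below shows both styles on a small stand-in struct. Vec3 is a made-up type so the example stays self-contained; the second overload mirrors the shape of XNA’s Vector3.Add(ref Vector3, ref Vector3, out Vector3).

```csharp
public struct Vec3
{
    public float X, Y, Z;

    public Vec3(float x, float y, float z)
    {
        X = x; Y = y; Z = z;
    }

    // By value: both structs are copied in and the result is copied out.
    public static Vec3 Add(Vec3 a, Vec3 b)
    {
        return new Vec3(a.X + b.X, a.Y + b.Y, a.Z + b.Z);
    }

    // By reference: only addresses are passed and the result
    // is written straight into the caller's variable.
    public static void Add(ref Vec3 a, ref Vec3 b, out Vec3 result)
    {
        result.X = a.X + b.X;
        result.Y = a.Y + b.Y;
        result.Z = a.Z + b.Z;
    }
}
```

A call then looks like Vec3.Add(ref a, ref b, out sum). It is more verbose than a + b, so it is probably best saved for the hot paths your profiler points at.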

Inlining
Manually inlining code when using the Compact Framework can give you an extra performance boost. The gain is typically around 5-15% depending on your code. The tricky part of inlining is to determine what code to inline. I recommend you follow some simple rules:

Small methods
Methods that are only called from one place
Methods that do not alter the state of the object (only give input and get output methods)
Methods that do not already get inlined by the compiler

It is important to run a profiler while you inline. You might get even worse performance by inlining the wrong methods. It is important to measure it!
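As an illustration of manual inlining, the sketch below takes a small helper that branches (so, by the criteria listed earlier, the CF JIT would not inline it) and pastes its body into the calling loop. The Clamping class and its method names are made up for the example.

```csharp
public static class Clamping
{
    // Small helper with branches.
    public static int Clamp(int value, int min, int max)
    {
        if (value < min) return min;
        if (value > max) return max;
        return value;
    }

    // Before: one helper call per element.
    public static int SumClamped(int[] values, int min, int max)
    {
        int total = 0;
        for (int i = 0; i < values.Length; i++)
            total += Clamp(values[i], min, max);
        return total;
    }

    // After: the helper body is inlined into the loop by hand.
    public static int SumClampedInlined(int[] values, int min, int max)
    {
        int total = 0;
        for (int i = 0; i < values.Length; i++)
        {
            int value = values[i];
            if (value < min) value = min;
            else if (value > max) value = max;
            total += value;
        }
        return total;
    }
}
```

Both versions should return the same result, so keeping the original around makes it easy to check the inlined copy and to compare timings.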

Pooling
In the CLR, creating an object is a really efficient operation while destroying the object is not. This is due to the fact that an object you no longer reference becomes work for the garbage collector. A way to get around this is to pool objects. You basically create a pool (you could use a stack) and then pre-allocate a bunch of empty objects before use. When you need an object, you fetch one from the pool, and when you are done with it, you put it back in the pool. This way you keep the object reference alive and it will not get collected by the garbage collector.
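A stack-based pool along those lines could be sketched as follows. The Pool<T> class and its Fetch/Return method names are assumptions made for this example, not an XNA API.

```csharp
using System.Collections.Generic;

public class Pool<T> where T : class, new()
{
    private readonly Stack<T> free;

    // Pre-allocates 'capacity' empty objects up front.
    public Pool(int capacity)
    {
        free = new Stack<T>(capacity);
        for (int i = 0; i < capacity; i++)
            free.Push(new T());
    }

    // Number of objects currently waiting in the pool.
    public int Available
    {
        get { return free.Count; }
    }

    // Hands out a pooled object; allocates only if the pool has run dry.
    public T Fetch()
    {
        return free.Count > 0 ? free.Pop() : new T();
    }

    // Puts an object back. The caller should reset its state first.
    public void Return(T item)
    {
        free.Push(item);
    }
}
```

Sizing the pool generously up front matters here, since every fallback allocation in Fetch counts toward the next collection.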

Fixed point math
I’ve been on this subject before. If you have some heavy floating point operations that you need to run on a platform with low floating point performance, then you should consider implementing fixed point math. A trick is to identify operations that don’t need floating point accuracy (like my code above in the explanation) and convert them to use integers instead.
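To give an idea of what fixed point math looks like in practice, here is a minimal sketch of a 16.16 format stored in an int. The Fixed struct and the choice of 16 fraction bits are assumptions for the example; a real implementation would also need division, overflow handling and so on.

```csharp
public struct Fixed
{
    private const int FractionBits = 16;
    private const int One = 1 << FractionBits;

    // Raw 16.16 value: upper 16 bits integer part, lower 16 bits fraction.
    public int Raw;

    public static Fixed FromInt(int value)
    {
        Fixed f;
        f.Raw = value << FractionBits;
        return f;
    }

    public static Fixed FromFloat(float value)
    {
        Fixed f;
        f.Raw = (int)(value * One);
        return f;
    }

    public float ToFloat()
    {
        return (float)Raw / One;
    }

    public static Fixed operator +(Fixed a, Fixed b)
    {
        Fixed f;
        f.Raw = a.Raw + b.Raw;
        return f;
    }

    public static Fixed operator *(Fixed a, Fixed b)
    {
        // Widen to 64 bits so the intermediate product does not overflow,
        // then shift back down to 16.16.
        Fixed f;
        f.Raw = (int)(((long)a.Raw * b.Raw) >> FractionBits);
        return f;
    }
}
```

The idea is to convert to and from float only at the edges (loading data, handing values to the renderer) and keep the inner loops on integers.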

Update: Part 2 is here.
