Making my game 5x faster

Tags: Graphics & Game Dev

Date: Friday, 12 July 2024

"Mistah Optimization"

In this blog post, I will be covering how I optimized my game and made it around 3-5 times faster compared to before.

Now, the very first thing I worked on was optimizing the level system. The old system utilized hash tables in order to store tile data. An example of how the hash tables were used is given below, in which I loop through a given range of tiles and append whatever tiles I find into a list for testing collisions later.

     
        ...
        uprof_start(&timer);
        for (i32 x = x_div; x < x_div_two; x++) {
            lvl -> x_pos_str = ui32_to_str(x, lvl -> x_pos_str, 12);
            for (i32 y = y_div; y < y_div_two; y++) {
                lvl -> y_pos_str = ui32_to_str(y, lvl -> y_pos_str, 12); 

                strncat(lvl -> pos_str, lvl -> x_pos_str, strlen(lvl -> x_pos_str));
                strncat(lvl -> pos_str, ";", 1); 
                strncat(lvl -> pos_str, lvl -> y_pos_str, strlen(lvl -> y_pos_str));

                lvl -> pos_str[strlen(lvl -> x_pos_str) + 1 + strlen(lvl -> y_pos_str)] = '';
                uht_item_phys_objs_found_t item = uht_phys_objs_betget(&lvl -> tilemap, lvl -> pos_str);
                if (item.found) {
                    lvl -> neighbors[lvl -> current_tile] = item.data;
                    lvl -> current_tile++;
                } 
                memset(lvl -> pos_str, 0, strlen(lvl -> pos_str));
            }
        }
        uprof_end(&timer);
        // engine.h, supaShot-old-table / sys
        ...

In the original code, I test for collisions for a total of 24 tiles using the for loop. The very first thing that I decided to do was to get rid of the string-based hash tables and go for a more key-val pair type of approach where any hash table can old key-value pairs easily with any type of key. The very first challenge I faced with this approach was hashing any sort of data despite its type. I did some research and found a framework called Gunslinger which used SipHash for hashing all sorts of data. It was a bit too much for me to understand so I asked some help from ChatGPT, namely around the topics of:

Getting raw bytes in C
Cases where the byte representation of some data could be the same

With this, I experimented around and adapted the FNV-1a hash function to hashing bytes:

     
        unsigned char * bytePtr = ptr;
        u64 hash = 0xcbf29ce484222325; // FNV_offset_basis
        for (size_t i = 0; i < size; i++) {
            hash = hash ^ bytePtr[i] * 0x100000001b3; // error right here, I found this later
        }
        return hash;

This code right here is good enough for hashing bytes BUT WATCH OUT, there is an error in the code which I found out later. The actual hash function is supposed to be:

     
        unsigned char * bytePtr = ptr;
        u64 hash = 0xcbf29ce484222325; // FNV_offset_basis
        for (size_t i = 0; i < size; i++) {
            hash ^= (u64)bytePtr[i];
            hash *= 0x100000001b3; // FNV_prime
        }
        return hash;

This error resulted in a few more collisions taking place in the hash table, but anyways, I just made a few changes to my generic hash tables. Namely, I changed the key-type to be user-defined. Now, note that when you use strings, you can easily compare them using strcmp but how do you compare raw bytes? Well, here comes memcmp. This function compares the raw memory of whatever data you have passed into it and returns a value based on that. It didn't take me long to refactor the code and now the programmer can simply do something like CHT_DEF(uv2_t, u32) which will generate the necessary code for a hash table that stores data in uv2_t-u32 pairs where uv2_t is a 2D vector and u32 is a 32-bit unsigned integer (duh).

After having finished the refactoring, I copy-pasted the hash table into the game and changed some of the code accordingly so that it can use the new table. I used a key-val pair of a 2D vector and a physics object (custom physics type). You can see the same code, which I gave as an example above, is much simpler now:

     
        ...
        uprof_start(&timer);
        for (i32 x = x_div; x < x_div_two; x++) {
            for (i32 y = y_div; y < y_div_two; y++) {
                uv2_t id = { x, y };
                uht_item_phys_objs_found_t item = uht_phys_objs_betget(&lvl -> tilemap, &id);
                if (item.found) {
                    lvl -> neighbors[lvl -> current_tile] = item.data;
                    lvl -> current_tile++;
                } 
            }
        }
        uprof_end(&timer);
        ...

How fast is it now? The old system:

            
                Physics Lookups Time (ms): 0.0568
                Render Lookups Time (ms): 1.7714

On the other hand, the new one:

        
            Physics Lookups Time (ms): 0.00143
            Render Lookups Time (ms): 0.151

You can see that the render look up time (time to look up tiles for rendering) decreased significantly, by around 10 times and the physics system had almost a 35 times speedup.

At this point, I also fixed the hash function mistake I made earlier and also changed the key-val pair to be a 2D vector and a u8. A person asked me how performant my hash tables were in comparison to other hash table implementations... and oh boy... This is where I found out the flaws of my implementation. I looked at other implementations as well as for some other languages and lets see how long it takes to insert 1 million integers into other hash tables:

NOTE: I DID NOT THOROUGHLY TEST THESE HASH TABLES. THESE TIMES ARE EXECUTION TIMES FROM A SINGLE RUN OF ONE OF THE EXECUTABLES TAKEN RANDOMLY.

1 million string-integer pair benchmark. C/C++ Benchmarks were compiled with -O3

Language	Hash Table	Insertion Performance (ms)	Lookup Performance (ms)	Initial Table Size
Python	Dictionary	296	300	N/A
C++	Unordered Map	282	195	1,000,000
C	Hashdict.c	274	122	1,000,000 * 1.3
C	CHT (mine)	887	143	2,097,152

As you can see, I benchmarked mine, and the time was insanely bad. I tried to optimize it but I was too tired for the day and decided to quit. Instead, I started stress testing my game with large levels like one that had 250,000 tiles and it was still a lot faster than the old one. The old system took around 1 second to process this file while the new one did it within 200 ms. However, I made a minor mistake right here that cost me around 4 hours of development time. While loading in the map, I forgot that I had more than 25 enemies but my enemies list could only store 25 enemies so 1/20 times, the program would crash and it took me 3 hours and learning VS Code debugging to fix that. VS Code is actually pretty good. With this bug fixed, I implemented destroying tiles.

This blog isn't just about optimization though, I also want to talk about some changes to the game and and its systems. The first one being, a change in the way collisions are handled as well as some bugfixes. I found one bug where the screenshake was permanent, this was the result of not normalizing the screenshake value to 0 so that was a pretty quick fix. The collision handling change was, that now I'd first map the size of the entity as well so that we get a number of tiles that is according to the size of the entity on which the collisions will be tested on. You can see below, the player keeps phasing, and this is what I fixed.

    
            i32 xpos = (i32) center_pos.x / lvl -> tile_size.x;
            i32 ypos = (i32) center_pos.y / lvl -> tile_size.y;

            i32 x_div = xpos - (2 + (i32) round(size.x / lvl -> tile_size.x));
            i32 x_div_two = xpos + (2 + (i32) round(size.x / lvl -> tile_size.x)); 

            i32 y_div = ypos - (2 + (i32) size.y / lvl -> tile_size.y);
            i32 y_div_two = ypos + (2 + (i32) size.y / lvl -> tile_size.y);

Above is the changed code, it gets the range of tiles according to the size of the player relative to the tile size and this fixed the bug I showed. I also removed some stuff from the game for the sake of optimizing and changed some settings. This included changing the resolution, pixel art size, removing the background squares, added a setting to turn off sparks, removed the red particle upon killing an enemy. Some of these changes you might have seen earlier as well. The purpose of removing this stuff was also for testing on an older computer and seeing what the performance was like. I found that SDL2 really does not like drawing large textures. In my case, I have a smaller render texture which is half of the window's resolution and I scale it up and draw it. On my laptop, the performance was fine but on an older machine, the performance downgraded massively.

I went with the temporary fix of just reducing the resolution and removing some stuff from the game, but realistically speaking I think that I will most likely add a config file/settings option into the game where players can decide parameters such as resolution, render texture resolution, background squares and particles.

A day later, I came back motivated to optimize the hash table again. This time, I was able to find the problem. I took a look again at my code and I realized, I could just pre-allocate the memory instead of allocating memory whenever I'm adding an element in the chained table. The only time I create another pair would be when a collision takes place. This pre-allocation didn't take long to implement as it simply meant switching up some parts of the code here and there. In a nutshell, it meant:

    
        tsuht_mk_table(&table); // -> PRE-ALLOCATES ALL THE SLOTS
        tsuht_add_to_table("ssss", 123); // -> FILLS THE PRE-ALLOCATED SLOTS 
        // ALLOCATION ONLY TAKES PLACE IN CASE OF COLLISION
        // PREVIOUSLY: tsuht_add_to_table MEANT ALLOCATING ANOTHER PAIR AND ADDING IT
        // new_impl_is_slow.c -> new_tables.c

With this simple but effective optimization, my speed really boosted. I'll update the table and you can see the results for yourself.

Language	Hash Table	Insertion Performance (ms)	Lookup Performance (ms)	Initial Table Size
Python	Dictionary	296	300	N/A
C++	Unordered Map	282	195	1,000,000
C	Hashdict.c	274	122	1,000,000 * 1.3
C	CHT (mine)	887	143	2,097,152
C	CHT NEW (mine)	295	314	2,097,152

Kind of weird that the newer approach has poorer lookups and I still don't know why. Anyways, this approach of pre-allocation gave me the idea of open-addressed hash tables and I rewrote the entire thing which is just code for "I told ChatGPT to rewrite it". Now, cutting this long story short, the open addressed hash tables are not only faster but also consume less memory than the chained hash table I started with. I'll update the table with the newer open addressed table.

Language	Hash Table	Insertion Performance (ms)	Lookup Performance (ms)	Initial Table Size
Python	Dictionary	296	300	N/A
C++	Unordered Map	282	195	1,000,000
C	Hashdict.c	274	122	1,000,000 * 1.3
C	CHT Old Slow	887	143	2,097,152
C	CHT Chained Faster	295	314	2,097,152
C	CHT Open Addressed	254	180	2,097,152

I ended up going with the open-addressed hash table in my game for some time where I noticed that the memory usage was 20 MB less than that of the chained and it was slightly faster. The performance was good enough and I decided to add back the removed stuff such as the background squares and some of the particles. At this point, gs_drava and John (mrFrenik) from the Gunslinger Discord Server recommended that I just go with a simple array instead of using a hash table. They were right. Stress testing the hash table approach on a world completely filled with 1 million tiles made the memory usage of the game jump from 37 MB to 72 MB. Now, if I went with a simple array, I would be saving much more memory. Since I use a u8 to represent my tiles, that is only 1 byte and I would need 1 megabyte at max to represent a 1000x1000 level easily. Yes, in cases where the world isn't completely filled, there will be some wasted memory but that is inevitable since a hash table also has to pre-allocate memory anyways which also wastes memory. Taking this into account, I switched to a simple array and the RAM usage for the 1 million tile level dropped to 40 megabytes. Truly an optimization moment.

Lessons Learned

Keep it simple
Just use arrays where possible

What's next?

Did I waste time working on the hash tables? Absolutely not. The problem with the hash table would arise sooner or later and I'm pretty happy with what I've learned and ended up with. Next TODO:

Fix and improve hash tables, open addressed table has a lot of collisions
Work on actual game mechanics and objectives

1devm0

⚡

Making my game 5x faster

Lessons Learned

What's next?

Thank you for reading!