Lemmings Forums

Lix => Lix Main => Topic started by: Simon on July 31, 2023, 03:28:05 AM

Title: Bit pushing in the Lix struct
Post by: Simon on July 31, 2023, 03:28:05 AM
Here's some loose rambling from pushing source code around. I don't have concrete plans to rework anything here, but I've re-examined the existing memory layout this weekend. Thus, why not write it down, maybe it's of interest.

A lix (the struct Lixxie that describes a single lix) is 56 bytes long, assuming we're on a 64-bit machine with 64-bit pointers. The layout is roughly like this:

Bytes 0-32: The lix's job, an emplaced subclass of Job. All subclasses fit into 32 bytes.
Bytes 32-36: Position as two 2-byte integers
Bytes 36-38: 2-byte bitset for all the bools such as facing direction, ability to climb, ...
Bytes 38-40: Queued fling speed as two 1-byte integers
Byte 40: Style (color, player affiliation)
Byte 41: Encounters (water, exits, ...) during the current physics update
Byte 42: Exploder timer in multiplayer
Bytes 43-48: unused (5 bytes)
Bytes 48-56: Pointer to the entire physical world. Kitchen sink object, only filled & used during updating.

The first 32 bytes of that are the emplaced Job; in detail, the Job looks like this:

Bytes 0-8: vtable pointer for the Job class object
Bytes 8-16: monitor pointer for the Job class, an unused (?) multithreading D object per class object
Byte 16: Activity (determines vertical frame in the spritesheet)
Byte 17: Horizontal frame in the spritesheet
Bytes 18-32: Subclass payload: Job subclasses define it, e.g., leftover builder bricks, miner anti-shock

Things in the Job subclass are properly aligned. If the first field wants to be 4-byte-aligned (e.g., the common 32-bit int, which has size 4 and alignment 4), it goes at byte 20 in the Job, with bytes 18 and 19 unused padding. In D as in C++, a single-inheriting subclass starts its data after the base class's data, and the base class's data comes after the hidden vtable pointer (and in D the monitor pointer).

+----------------------+
| 0-8: vpointer of Job |
+----------------------+
| 8-16: monitor of Job |
+--+--+----------------+
|Ac|fr|      18-32:    |
+--+--+     subclass   |
|           payload    |
+----------+-----+-----+
| 32-36:pos|flags|fling|
+--+--+--+-+-----+-----+
|st|en|tm|  5 unused   |
+--+--+--+-------------+
| 48-56: world pointer |
+----------------------+





Cache lines in most computers are 64 bytes. If you fit into a cache line, you're quick to load.

The lix is 56 bytes long, that fits. Still, I should try to pad it with 8 more unused bytes. The padded lix would then always be 64 bytes long, and it's possible that it'll be even more cache-friendly (even though it's longer!) than the 56-byte lix.

The alternative experiment would be to pack everything into 32 bytes. The bytes will then be as snugly packed as a pile of kittens. We'd have to pass the pointer to the outside world in the methods, or abuse the memory layout and cram it over the (hopefully unused) monitor pointer. I think I could reduce the Job subclass payload from 12 to 8 bytes, and cram the Job-independent fields from 16 bytes into 12. It would be an elaborate experiment.

The elephants in the room are really the pointers. These 64-bit pointers (8 bytes) are fat walruses. You can't herd too many of those in your bathtub at the same time.

-- Simon
Title: Re: Bit pushing in the Lix struct
Post by: Simon on July 31, 2023, 09:32:04 PM
struct LixxieImpl {

becomes

align (64) struct LixxieImpl {

This makes the formerly 56-byte-long 8-byte-aligned struct 64-byte-long and 64-byte-aligned. No need to add unused fields.

I've measured the improvement during replay verification of the entire Lix proof collection (https://github.com/SimonN/lemforum-replays). In vanilla Lix 0.10.12 (the most recent release), mass replay verification took 18.3 seconds on sil (my main Desktop). Now it takes around 18.1 seconds, and occasionally I see runtimes in the 17.xx that I've never seen before putting the align (64). Improving the mass replay verification by 0.2 seconds is a 1 % improvement.

The one-liner change of making the struct longer from 56 to 64 bytes and, what's really the heart of the idea, aligning it to 64-byte-long cache lines improves the mass replay verification by 1 %. This goes into the next release, Lix 0.10.13. We'll see in the long term how it runs on 32-bit machines -- if anybody still uses one.

-- Simon