Hephaestian Sea

Master Boot Record

Storage devices/"disks" (hard drives, SSDs, flash drives, etc.) can be thought of as contiguous arrays of bits numbered from 0 up to the disk size. The device itself does not implement files or directories, or even record how much of its space is used up and empty. The disk is only responsible for saving provided data and reading it back later.

Two layers of software outline the disk to imbue its content with meaning:

Partitioning helps the firmware of the computer load the operating system, and separates parts of the disk used by incompatible software. It is so essential, that anyone who has ever (re-)installed an operating system has used a disk partitioning interface. Despite this, very few software developers know anything about how they work.

Nobody really has to build disk partitioning software anymore as the dominant standard (the GPT i.e. GUID Partition Table) has not changed since the 1990s. Still, it is a shame to rely on a massive, ancient, C-language binary to simply read or tweak a disk's partition table. We can and should do better by understanding the underlying technology.

Sector 1

First, we'll need to get hands-on access to a partitioned disk. On Linux, you can interact with the drive's "bit array" directly, using block devices typically mounted at /dev:

from pathlib import Path

with Path("/dev/nvme0n1").open("rb") as f:
    # Read the first 512 bytes off the first NVME disk
    print(f.read(512).hex())

# Example output:
# eb639000000000000000000000000000000000000000000000000000000000000000000000000000
# 00000000000000000000000000000000000000000000000000000000000000000000000000000000
# 0000000000000000000000800058000000000000fffa9090f6c2807405f6c2707402b280ea797c00
# 0031c08ed88ed0bc0020fba0647c3cff740288c252be807de81701be057cb441bbaa55cd135a5272
# 3d81fb55aa753783e101743231c0894404408844ff894402c7041000668b1e5c7c66895c08668b1e
# 607c66895c0cc744060070b442cd137205bb0070eb76b408cd13730d5a84d20f83d800be8b7de982
# 00660fb6c68864ff40668944040fb6d1c1e20288e888f4408944080fb6c2c0e80266890466a1607c
# 6609c0754e66a15c7c6631d266f73488d131d266f774043b44087d37fec188c530c0c1e80208c188
# d05a88c6bb00708ec331dbb80102cd13721e8cc3601eb900018edb31f6bf00808ec6fcf3a51f61ff
# 265a7cbe867deb03be957de83400be9a7de82e00cd18ebfe47525542200047656f6d004861726420
# 4469736b005265616400204572726f720d0a00bb0100b40ecd10ac3c0075f4c30000000000000000
# 00000000000000000200eeffffff01000000ffff7f0c000000000000000000000000000000000000
# 00000000000000000000000000000000000000000000000000000000000055aa

512 bytes is an important constant which, for historical reasons, is known as a disk "sector". All of our operations from here on out will be in multiples of 512.

The first sector of a disk is special as it was, in the past, the part which the computer's firmware (sometimes referred to as the "BIOS") would load and execute after the boot menu. This is an outdated scheme, but almost everything on the low level of computer software is backwards-compatible with or influenced by software from the 70s and 80s so it is still relevant. Of particular note are the bytes 55aa at the end of the sector, which make up the IBM PC boot signature. Despite IBM PCs being long forgotten, the signature is still required in modern partitioning schemes, so modern drives all have it.

Bootloader

Most of the sector, technically, makes up the code of the bootloader, so it can be disassembled:

# ndisasm -b 16 sector1.bin
00000000  EB63              jmp short 0x65
<snip>
00000065  FA                cli ; clear interrupts
00000066  90                nop
00000067  90                nop
00000068  F6C280            test dl,0x80 ; dl = boot drive number
0000006B  7405              jz 0x72
0000006D  F6C270            test dl,0x70
00000070  7402              jz 0x74
00000072  B280              mov dl,0x80
00000074  EA797C0000        jmp 0x0:0x7c79 ; initialize the code segment
00000079  31C0              xor ax,ax ; jump target from ^
<snip>
000001FE  55                push bp ; boot signature
000001FF  AA                stosb

Note that this is 16-bit x86 assembly.

On modern systems which use UEFI (instead of BIOS boot) this will be a "dummy" bootloader that displays an error message or offers to reboot to a different device.

In any case, whether there is a valid bootloader at all does not impact the partitions in any way. Sector 1 in UEFI systems is typically laid out the same as in BIOS systems.

Partition Table

The partitioning data is after the bootloader, starting at 446 bytes into the sector:

00000200eeffffff01000000ffff7f0c ; partition 1
00000000000000000000000000000000 ; partition 2
00000000000000000000000000000000 ; partition 3
00000000000000000000000000000000 ; partition 4
55aa ; boot signature

This is the "Master Boot Record" table, which is also a deprecated standard, but ubiquitous due to it being semi-required in the GPT standard.

As you see above, it is divided into 4 entries and ends right before the boot signature. The overall layout of the disk is as follows:

Start (decimal) Length
0000 0 446B Bootloader
01BE 446 16B Partition 1
01CE 462 16B Partition 2
01DE 478 16B Partition 3
01EE 494 16B Partition 4
01FE 510 2B Boot signature (0x55AA)

Let's read the table in Python:

from pathlib import Path

with Path("/dev/nvme0n1").open("rb") as f:
    data = f.read(512)

assert data[0x1FE:] == b"\x55\xAA"

entries = [
    data[base:base + 16]
    for base in [
        0x1BE,
        0x1CE,
        0x1DE,
        0x1EE,
    ]
]

for x in entries:
    print(x.hex())
# 00000200eeffffff01000000ffff7f0c
# 00000000000000000000000000000000
# 00000000000000000000000000000000
# 00000000000000000000000000000000

Entries

Each entry has the following structure:

Start (decimal) Length
00 0 1B Status (0x80 = active)1
01 1 3B First sector (CHS)
04 4 1B Type
05 5 3B Last sector (CHS)
08 8 4B First sector (LBA)
0C 12 4B Length in sectors

As an example, here is how to interpret the first partition from our disk:

00       ; status = inactive
000200   ; first sector = 2 (see below for instructions on reading CHS)
ee       ; type = GPT
ffffff   ; last sector = (1023, 255, 63) (overflow)
01000000 ; first sector = 0x00000001 = 1
ffff7f0c ; length = 0x0c7fffff = 209715199 sectors

Note that:


Code to parse MBR entries:

@dataclass(kw_only=True)
class MbrEntry:
    status: int
    first_chs: bytes
    type: int
    last_chs: bytes
    first_lba: int
    length: int

def parse_mbr_entry(data: bytes) -> MbrEntry:
    return MbrEntry(
        status=data[0:1],
        first_chs=data[1:4],
        type=data[4],
        last_chs=data[5:8],
        first_lba=data[8:12],
        length=data[12:],
    )

Testing with our example data, we get one non-zero partition:

MbrEntry(
  status=b'\x00',
  first_chs=b'\x00\x02\x00',
  type=238,
  last_chs=b'\xff\xff\xff',
  first_lba=1,
  length=209715199
)

and three identical zero partitions:

MbrEntry(
  status=b'\x00',
  first_chs=b'\x00\x00\x00',
  type=0,
  last_chs=b'\x00\x00\x00',
  first_lba=0,
  length=0
)
...

Protective MBR

All the partitions are marked inactive (status = 0x00) because is this is a modern disk, so the MBR table is not intended for use. The real partition information is elsewhere, in GPT format, and the MBR does not have enough information for the operating system to boot.

Still, the MBR has to be included and valid and, typically, include a "protective" partition. This is the non-zero partition that we see in the table. Its purpose is to inform legacy software (which does not support GPT) that the disk has data on it and should not be overridden. The protective partition spans the entire disk, as we will shortly see.

The meaning of some fields is not clear by name alone, so we will discuss them separately.

Disk Addressing

There are two somewhat redundant specifications for the placement and size of each partition. One is the pair of fields "First sector (CHS)" and "Last sector (CHS)" and the other is the pair "First sector (LBA)" and "Length in sectors".

These reference two different ways of writing down locations on a storage device:

  1. the legacy Cylinder Head Sector (CHS) format,
  2. and the modern Logical Block Addressing (LBA) format.

CHS

The CHS format got its name from hard disk drive devices which have physical reading "heads", inserted into a stack of magnetic plates like a comb.

You can see them in a photo closeup:

The head stack moves together, and can be set closer to or further from the center of the platters, thus changing which part of the platters the heads are placed on top of. The platters spin when the disk is in use, thus making the heads trace a circular "track" through the platters. The combination of tracks traced by all heads in a given head stack position is called a "cylinder".

So, CHS is three numbers:

  1. C = cylinder, the position of the head stack starting at the center and reaching out towards the edge of the platters,
  2. H = head, the number of the head from which to read,
  3. S = sector, the rotational position of the platter stack.

When reading from a disk, you can read from any head without any physical movement. Reading a cylinder different from the last one requires moving the head stack. And reading a sector different from the last one requires waiting for the platters to spin ("seek") into position.

Old hard drives had simple controllers that reported the exact physical topology of the disk: the number of cylinders, heads, and sectors. Smart file systems could take this information into account to layout data in a way that can be accessed with less seeking (rotation of the platters or heads) and thus faster.2

Eventually, hard drive controllers became more complicated, and started lying about their physical topology to improve the performance of older systems, which were either not compatible with new larger disks or used tricks that harmed performance on new hardware. And so the ability and need to use physical typologies and CHS fell away.

On modern systems, you cannot make assumptions of how many cylinders or heads are on a disk, and cannot map CHS addresses to linear addresses or vice versa. CHS addressing is fully unusable. Thus, you can simply ignore the CHS values from MBR entries. Use the LBA start + length fields instead.

LBA

Logical block addressing, in contrast with CHS, is completely straightforward. The LBA of a sector is simply its linear number on the drive. So, say you wanted to read byte 0x123456 from the drive. You would compute the LBA by dividing by sector size (512): 0x123456 // 512 = 0x91a. Then, ask the drive to read this sector to a buffer, and then read the data at on offset within the buffer using the modulo: 0x123456 % 512 = 0x56.

In pseudo-code:

def read_byte(x: int) -> int:
    lba = x // 512
    data = read_sector(lba)
    return data[x % 512]

In the other direction, given LBA 0x1234 and offset 0x99, you can calculate the byte index using 0x1234 * 512 + 0x99 = 0x246899.

The word "logical" refers to the fact that the underlying hardware could use larger "physical" sectors like the (relatively rare) 4Kn/Advanced Format drives, which actually operate in multiples of 4096 bytes. These devices run in 512 sector emulation (512e) mode by default, which means their controller pretends to have 512-byte sectors anyway, meaning the "logical" and "physical" size is different. It is not clear to me whether these disks could theoretically be partitioned with 4096 sectors, but in any case this is uncommon enough where anything using non-standard sectors would be unusable by a lot of software.

Partition Types

There is no central authority registering partition type numbers so they tend to be very ambiguous. A list is available on wikipedia.

There are a few types that are common enough to be notable:

Type Number
0x00 Empty
0x07 NTFS
0x0C FAT32
0x82 Linux swap
0x83 Linux
0xEE GPT

Our protective partition uses 0xEE (GPT) as expected of a modern GPT-partitioned drive.

MBR from Scratch

Since we've already covered the code for reading an MBR, what remains is to cover the code for writing one.

If you had an empty drive and you wanted to add an MBR partition table to it, here is what you would do:

# use "r+" mode instead of "w" to avoid
# trying to truncate the drive
#
# do not use "a" mode since we want to start at sector 0
with Path("/dev/nvme0n1").open("rb+") as f:
    f.write(b"\x00" * 446) # empty bootloader
    
    # entry 1
    f.write(b"\x00") # status = inactive
    f.write(0.as_bytes(1)) # start head = 0
    f.write(2.as_bytes(1)) # start sector = 2, start cylinder (high 2 bits) = 0
    f.write(0.as_bytes(1)) # start cylinder (low bits) = 0
    f.write(0xEE) # type = GPT
    f.write(0xFFFFFF) # end = overflow
    f.write(2.as_bytes(4, "little")) # start = sector 2
    f.write(ceil(disk_size / 512).as_bytes(4, "little")) # end = end of disk

    f.write(b"\x00" * 16) # entry 2
    f.write(b"\x00" * 16) # entry 3
    f.write(b"\x00" * 16) # entry 4
    f.write(b"\x55aa") # boot signature

This writes only the GPT protective partition (but no actual GPT data), so it's kind of useless as-is but can be easily modified for other partition types.

I will cover reading and writing GPT in a future post, so this code is really just a stepping stone for that.

Addendum: CHS in MBR

CHS is useless but, for completeness sake, I will describe the format in which it is used in MBR. It involves a bit of bit fiddling but is not all that obtuse for an ancient API. The address is 3 bytes, laid out as follows:

0: HHHH HHHH
1: CCss ssss
2: CCCC CCCC

Each group of bits makes up a number, with higher bits before lower bits as usual: H for head, C for cylinder, s for sector.

As you can see, byte 1 has a mix of cylinder and sector bits. Each field also has different total lengths:

You can parse CHS addresses as follows:

def parse_chs(x: bytes) -> tuple[int, int, int]:
    # note that the binary order is HSC not CHS
    #
    # and, of course, the sector number is dirty
    # (we need to take out the cylinder bits)
    return (
        ((x[1] & 0b1100_0000) << 2) | x[2],
        x[0],
        x[1] & 0b0011_1111
    )

Notes:

Sources

Footnotes

  1. In practice the meaning varies, though modern systems will only use 0x80 and 0x00. Only bit 7 is defined and, if set, means active/bootable.

  2. See, for example, the layout used in the BSD FFS.

#programming #python