Master Boot Record
Storage devices/"disks" (hard drives, SSDs, flash drives, etc.) can be thought of as contiguous arrays of bits numbered from 0 up to the disk size. The device itself does not implement files or directories, or even record how much of its space is used up and empty. The disk is only responsible for saving provided data and reading it back later.
Two layers of software outline the disk to imbue its content with meaning:
- the first layer is known as partitioning, implemented in the operating system kernel,
- and the second is the file system, implemented by an operating system module or even a userspace program.
Partitioning helps the firmware of the computer load the operating system, and separates parts of the disk used by incompatible software. It is so essential, that anyone who has ever (re-)installed an operating system has used a disk partitioning interface. Despite this, very few software developers know anything about how they work.
Nobody really has to build disk partitioning software anymore as the dominant standard (the GPT i.e. GUID Partition Table) has not changed since the 1990s. Still, it is a shame to rely on a massive, ancient, C-language binary to simply read or tweak a disk's partition table. We can and should do better by understanding the underlying technology.
Sector 1
First, we'll need to get hands-on access to a partitioned disk. On Linux, you can interact with the drive's "bit array" directly, using block devices typically mounted at /dev
:
from pathlib import Path
with Path("/dev/nvme0n1").open("rb") as f:
# Read the first 512 bytes off the first NVME disk
print(f.read(512).hex())
# Example output:
# eb639000000000000000000000000000000000000000000000000000000000000000000000000000
# 00000000000000000000000000000000000000000000000000000000000000000000000000000000
# 0000000000000000000000800058000000000000fffa9090f6c2807405f6c2707402b280ea797c00
# 0031c08ed88ed0bc0020fba0647c3cff740288c252be807de81701be057cb441bbaa55cd135a5272
# 3d81fb55aa753783e101743231c0894404408844ff894402c7041000668b1e5c7c66895c08668b1e
# 607c66895c0cc744060070b442cd137205bb0070eb76b408cd13730d5a84d20f83d800be8b7de982
# 00660fb6c68864ff40668944040fb6d1c1e20288e888f4408944080fb6c2c0e80266890466a1607c
# 6609c0754e66a15c7c6631d266f73488d131d266f774043b44087d37fec188c530c0c1e80208c188
# d05a88c6bb00708ec331dbb80102cd13721e8cc3601eb900018edb31f6bf00808ec6fcf3a51f61ff
# 265a7cbe867deb03be957de83400be9a7de82e00cd18ebfe47525542200047656f6d004861726420
# 4469736b005265616400204572726f720d0a00bb0100b40ecd10ac3c0075f4c30000000000000000
# 00000000000000000200eeffffff01000000ffff7f0c000000000000000000000000000000000000
# 00000000000000000000000000000000000000000000000000000000000055aa
512 bytes is an important constant which, for historical reasons, is known as a disk "sector". All of our operations from here on out will be in multiples of 512.
The first sector of a disk is special as it was, in the past, the part which the computer's firmware (sometimes referred to as the "BIOS") would load and execute after the boot menu. This is an outdated scheme, but almost everything on the low level of computer software is backwards-compatible with or influenced by software from the 70s and 80s so it is still relevant. Of particular note are the bytes 55aa
at the end of the sector, which make up the IBM PC boot signature. Despite IBM PCs being long forgotten, the signature is still required in modern partitioning schemes, so modern drives all have it.
Bootloader
Most of the sector, technically, makes up the code of the bootloader, so it can be disassembled:
# ndisasm -b 16 sector1.bin
00000000 EB63 jmp short 0x65
<snip>
00000065 FA cli ; clear interrupts
00000066 90 nop
00000067 90 nop
00000068 F6C280 test dl,0x80 ; dl = boot drive number
0000006B 7405 jz 0x72
0000006D F6C270 test dl,0x70
00000070 7402 jz 0x74
00000072 B280 mov dl,0x80
00000074 EA797C0000 jmp 0x0:0x7c79 ; initialize the code segment
00000079 31C0 xor ax,ax ; jump target from ^
<snip>
000001FE 55 push bp ; boot signature
000001FF AA stosb
Note that this is 16-bit x86 assembly.
On modern systems which use UEFI (instead of BIOS boot) this will be a "dummy" bootloader that displays an error message or offers to reboot to a different device.
In any case, whether there is a valid bootloader at all does not impact the partitions in any way. Sector 1 in UEFI systems is typically laid out the same as in BIOS systems.
Partition Table
The partitioning data is after the bootloader, starting at 446 bytes into the sector:
00000200eeffffff01000000ffff7f0c ; partition 1
00000000000000000000000000000000 ; partition 2
00000000000000000000000000000000 ; partition 3
00000000000000000000000000000000 ; partition 4
55aa ; boot signature
This is the "Master Boot Record" table, which is also a deprecated standard, but ubiquitous due to it being semi-required in the GPT standard.
As you see above, it is divided into 4 entries and ends right before the boot signature. The overall layout of the disk is as follows:
Start | (decimal) | Length | |
---|---|---|---|
0000 |
0 | 446B | Bootloader |
01BE |
446 | 16B | Partition 1 |
01CE |
462 | 16B | Partition 2 |
01DE |
478 | 16B | Partition 3 |
01EE |
494 | 16B | Partition 4 |
01FE |
510 | 2B | Boot signature (0x55AA ) |
Let's read the table in Python:
from pathlib import Path
with Path("/dev/nvme0n1").open("rb") as f:
data = f.read(512)
assert data[0x1FE:] == b"\x55\xAA"
entries = [
data[base:base + 16]
for base in [
0x1BE,
0x1CE,
0x1DE,
0x1EE,
]
]
for x in entries:
print(x.hex())
# 00000200eeffffff01000000ffff7f0c
# 00000000000000000000000000000000
# 00000000000000000000000000000000
# 00000000000000000000000000000000
Entries
Each entry has the following structure:
Start | (decimal) | Length | |
---|---|---|---|
00 |
0 | 1B | Status (0x80 = active)1 |
01 |
1 | 3B | First sector (CHS) |
04 |
4 | 1B | Type |
05 |
5 | 3B | Last sector (CHS) |
08 |
8 | 4B | First sector (LBA) |
0C |
12 | 4B | Length in sectors |
As an example, here is how to interpret the first partition from our disk:
00 ; status = inactive
000200 ; first sector = 2 (see below for instructions on reading CHS)
ee ; type = GPT
ffffff ; last sector = (1023, 255, 63) (overflow)
01000000 ; first sector = 0x00000001 = 1
ffff7f0c ; length = 0x0c7fffff = 209715199 sectors
Note that:
- all integer values are little endian;
- if the partition is not marked as active it should be ignored.
Code to parse MBR entries:
@dataclass(kw_only=True)
class MbrEntry:
status: int
first_chs: bytes
type: int
last_chs: bytes
first_lba: int
length: int
def parse_mbr_entry(data: bytes) -> MbrEntry:
return MbrEntry(
status=data[0:1],
first_chs=data[1:4],
type=data[4],
last_chs=data[5:8],
first_lba=data[8:12],
length=data[12:],
)
Testing with our example data, we get one non-zero partition:
MbrEntry(
status=b'\x00',
first_chs=b'\x00\x02\x00',
type=238,
last_chs=b'\xff\xff\xff',
first_lba=1,
length=209715199
)
and three identical zero partitions:
MbrEntry(
status=b'\x00',
first_chs=b'\x00\x00\x00',
type=0,
last_chs=b'\x00\x00\x00',
first_lba=0,
length=0
)
...
Protective MBR
All the partitions are marked inactive (status = 0x00
) because is this is a modern disk, so the MBR table is not intended for use. The real partition information is elsewhere, in GPT format, and the MBR does not have enough information for the operating system to boot.
Still, the MBR has to be included and valid and, typically, include a "protective" partition. This is the non-zero partition that we see in the table. Its purpose is to inform legacy software (which does not support GPT) that the disk has data on it and should not be overridden. The protective partition spans the entire disk, as we will shortly see.
The meaning of some fields is not clear by name alone, so we will discuss them separately.
Disk Addressing
There are two somewhat redundant specifications for the placement and size of each partition. One is the pair of fields "First sector (CHS)" and "Last sector (CHS)" and the other is the pair "First sector (LBA)" and "Length in sectors".
These reference two different ways of writing down locations on a storage device:
- the legacy Cylinder Head Sector (CHS) format,
- and the modern Logical Block Addressing (LBA) format.
CHS
The CHS format got its name from hard disk drive devices which have physical reading "heads", inserted into a stack of magnetic plates like a comb.
You can see them in a photo closeup:
The head stack moves together, and can be set closer to or further from the center of the platters, thus changing which part of the platters the heads are placed on top of. The platters spin when the disk is in use, thus making the heads trace a circular "track" through the platters. The combination of tracks traced by all heads in a given head stack position is called a "cylinder".
So, CHS is three numbers:
- C = cylinder, the position of the head stack starting at the center and reaching out towards the edge of the platters,
- H = head, the number of the head from which to read,
- S = sector, the rotational position of the platter stack.
When reading from a disk, you can read from any head without any physical movement. Reading a cylinder different from the last one requires moving the head stack. And reading a sector different from the last one requires waiting for the platters to spin ("seek") into position.
Old hard drives had simple controllers that reported the exact physical topology of the disk: the number of cylinders, heads, and sectors. Smart file systems could take this information into account to layout data in a way that can be accessed with less seeking (rotation of the platters or heads) and thus faster.2
Eventually, hard drive controllers became more complicated, and started lying about their physical topology to improve the performance of older systems, which were either not compatible with new larger disks or used tricks that harmed performance on new hardware. And so the ability and need to use physical typologies and CHS fell away.
On modern systems, you cannot make assumptions of how many cylinders or heads are on a disk, and cannot map CHS addresses to linear addresses or vice versa. CHS addressing is fully unusable. Thus, you can simply ignore the CHS values from MBR entries. Use the LBA start + length fields instead.
LBA
Logical block addressing, in contrast with CHS, is completely straightforward. The LBA of a sector is simply its linear number on the drive. So, say you wanted to read byte 0x123456
from the drive. You would compute the LBA by dividing by sector size (512): 0x123456 // 512 = 0x91a
. Then, ask the drive to read this sector to a buffer, and then read the data at on offset within the buffer using the modulo: 0x123456 % 512 = 0x56
.
In pseudo-code:
def read_byte(x: int) -> int:
lba = x // 512
data = read_sector(lba)
return data[x % 512]
In the other direction, given LBA 0x1234
and offset 0x99
, you can calculate the byte index using 0x1234 * 512 + 0x99 = 0x246899
.
The word "logical" refers to the fact that the underlying hardware could use larger "physical" sectors like the (relatively rare) 4Kn/Advanced Format drives, which actually operate in multiples of 4096 bytes. These devices run in 512 sector emulation (512e) mode by default, which means their controller pretends to have 512-byte sectors anyway, meaning the "logical" and "physical" size is different. It is not clear to me whether these disks could theoretically be partitioned with 4096 sectors, but in any case this is uncommon enough where anything using non-standard sectors would be unusable by a lot of software.
Partition Types
There is no central authority registering partition type numbers so they tend to be very ambiguous. A list is available on wikipedia.
There are a few types that are common enough to be notable:
Type Number | |
---|---|
0x00 |
Empty |
0x07 |
NTFS |
0x0C |
FAT32 |
0x82 |
Linux swap |
0x83 |
Linux |
0xEE |
GPT |
Our protective partition uses 0xEE
(GPT) as expected of a modern GPT-partitioned drive.
MBR from Scratch
Since we've already covered the code for reading an MBR, what remains is to cover the code for writing one.
If you had an empty drive and you wanted to add an MBR partition table to it, here is what you would do:
# use "r+" mode instead of "w" to avoid
# trying to truncate the drive
#
# do not use "a" mode since we want to start at sector 0
with Path("/dev/nvme0n1").open("rb+") as f:
f.write(b"\x00" * 446) # empty bootloader
# entry 1
f.write(b"\x00") # status = inactive
f.write(0.as_bytes(1)) # start head = 0
f.write(2.as_bytes(1)) # start sector = 2, start cylinder (high 2 bits) = 0
f.write(0.as_bytes(1)) # start cylinder (low bits) = 0
f.write(0xEE) # type = GPT
f.write(0xFFFFFF) # end = overflow
f.write(2.as_bytes(4, "little")) # start = sector 2
f.write(ceil(disk_size / 512).as_bytes(4, "little")) # end = end of disk
f.write(b"\x00" * 16) # entry 2
f.write(b"\x00" * 16) # entry 3
f.write(b"\x00" * 16) # entry 4
f.write(b"\x55aa") # boot signature
This writes only the GPT protective partition (but no actual GPT data), so it's kind of useless as-is but can be easily modified for other partition types.
I will cover reading and writing GPT in a future post, so this code is really just a stepping stone for that.
Addendum: CHS in MBR
CHS is useless but, for completeness sake, I will describe the format in which it is used in MBR. It involves a bit of bit fiddling but is not all that obtuse for an ancient API. The address is 3 bytes, laid out as follows:
0: HHHH HHHH
1: CCss ssss
2: CCCC CCCC
Each group of bits makes up a number, with higher bits before lower bits as usual: H
for head, C
for cylinder, s
for sector.
As you can see, byte 1 has a mix of cylinder and sector bits. Each field also has different total lengths:
- 8 bits for
H
, - 10 bits for
C
, - and 6 bits for
s
.
You can parse CHS addresses as follows:
def parse_chs(x: bytes) -> tuple[int, int, int]:
# note that the binary order is HSC not CHS
#
# and, of course, the sector number is dirty
# (we need to take out the cylinder bits)
return (
((x[1] & 0b1100_0000) << 2) | x[2],
x[0],
x[1] & 0b0011_1111
)
Notes:
- As discussed before, there is no way to turn CHS into LBA.
- For partitions using LBA, the CHS end value is set to the "overflow" value
0xffffff
which parses to(C, H, S) = (1023, 255, 63)
Sources
Footnotes
In practice the meaning varies, though modern systems will only use
0x80
and0x00
. Only bit 7 is defined and, if set, means active/bootable.↩See, for example, the layout used in the BSD FFS.↩