Understanding the BitTorrent Protocol: Part 1

Traditional client-server file distribution has a fundamental scalability problem. A server with 100 Mbps of upload capacity can push about 12.5 MB/s in total, so a 1 GB file takes roughly 80 seconds to deliver to a single client, and every additional simultaneous download divides that bandwidth further. Want to serve more users? You need more servers, more bandwidth, more money. The cost scales linearly with demand, which is why large-scale file distribution has traditionally been expensive.

BitTorrent flips this model on its head. Instead of download capacity being limited by server bandwidth, it scales with the number of downloaders themselves. Each peer uploads while downloading, transforming passive consumers into active distributors. A 1000-peer swarm with average 1 Mbps upload per peer provides 1000 Mbps of aggregate bandwidth without any central infrastructure. The more popular a file becomes, the faster it distributes. This is the opposite of traditional systems where popularity creates bottlenecks.

The protocol solves three critical problems that make this possible:

  1. Bandwidth amplification: Uploaders contribute capacity proportional to their download benefit, creating a self-sustaining ecosystem where everyone who downloads also helps others download.
  2. Verifiable integrity: Content is divided into cryptographically-hashed pieces, which prevents corruption or malicious injection. You can’t fake the data because every piece is verified against its SHA-1 hash.
  3. Decentralized coordination: There’s no single point of failure. Multiple trackers or DHT can coordinate peer discovery, so the system continues functioning even when parts of it go offline.

This architecture makes BitTorrent optimal for large-scale content distribution where centralized hosting costs would be prohibitive. But to understand how it all works, we need to start with the foundations: how torrents are described, how peers discover each other, and how the data is encoded.

Contents

  1. Bencode: BitTorrent’s Wire Format
  2. Metainfo File Structure
  3. Tracker Protocol

Bencode: BitTorrent’s Wire Format

Before peers can exchange files, they need a way to describe what they’re sharing. What’s the file called? How big is it? What pieces does it contain? And crucially, how can we ensure that every client interprets this metadata identically?

This is where Bencode comes in. It’s a binary-safe serialization format that was specifically designed for BitTorrent. You might wonder why Bram Cohen (BitTorrent’s creator) didn’t just use a general-purpose format such as XML or, in later years, JSON. The answer lies in a critical requirement: deterministic encoding. When you hash something cryptographically, you need the exact same byte sequence every time, or you’ll get different hashes. JSON and XML don’t guarantee this. Two programs can represent the same data structure in JSON with different whitespace, key ordering, or number formatting, and you’d get different hashes.

Bencode solves this by having strict rules for how data must be encoded. There’s only one correct way to encode any given data structure. This makes it perfect for generating infohashes (which we’ll discuss later) that uniquely identify torrents. If two people create torrents from the same file with identical settings, they’ll get identical infohashes, and peers using either torrent can share with each other.

Beyond deterministic encoding, Bencode is also simple to implement. The entire specification fits on a page. It’s binary-safe (can handle any byte values), and it’s reasonably compact. Let’s look at how it works.

Type System

Bencode defines four data types: two scalars (byte strings and integers) and two containers (lists and dictionaries):

1. Byte Strings

Format: <length>:<contents>
Example: 4:spam → "spam"
Example: 0: → empty string

Length is ASCII decimal, followed by colon delimiter, followed by raw bytes. No encoding assumption. Strings are arbitrary byte sequences. The metainfo specification recommends UTF-8 for human-readable fields, but implementations must treat strings as uint8_t[].

2. Integers

Format: i<number>e
Example: i42e → 42
Example: i-42e → -42
Example: i0e → 0
Invalid: i-0e (negative zero forbidden)
Invalid: i03e (leading zeros forbidden except for i0e)

Integers are ASCII decimal wrapped in i and e delimiters. Arbitrary precision (no 32-bit or 64-bit limit in specification, though implementations typically use int64). Negative zero and leading zeros are explicitly invalid to ensure canonical encoding.
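The canonical-form rules above are easy to get wrong in a parser, so here is a minimal sketch of a strict integer decoder. The function name decode_int and its (value, next_pos) return convention are illustrative choices, not part of any specification.

```python
def decode_int(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode the bencoded integer at data[pos]; return (value, next_pos)."""
    if data[pos:pos + 1] != b"i":
        raise ValueError("integer must start with 'i'")
    end = data.index(b"e", pos)              # raises ValueError if 'e' is missing
    digits = data[pos + 1:end]
    body = digits[1:] if digits.startswith(b"-") else digits
    if not body.isdigit():                   # also rejects empty bodies: "ie", "i-e"
        raise ValueError("integer must contain only ASCII digits")
    if body != b"0" and body.startswith(b"0"):
        raise ValueError("leading zeros are not allowed")
    if digits == b"-0":
        raise ValueError("negative zero is not allowed")
    # Python integers are arbitrary precision, so no range check is needed here.
    return int(digits.decode("ascii")), end + 1
```

Rejecting non-canonical forms at decode time means two different byte sequences can never decode to the same integer, which is the property the hashing scheme relies on.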

3. Lists

Format: l<contents>e
Example: l4:spam4:eggse → ["spam", "eggs"]
Example: le → []
Example: li42ei-5ee → [42, -5]
Nested: ll4:spam4:eggsee → [["spam", "eggs"]]

Lists begin with l, end with e, contain zero or more bencoded values. Heterogeneous types allowed. Order is preserved.

4. Dictionaries

Format: d<key1><value1>...<keyN><valueN>e
Example: d3:cow3:moo4:spam4:eggse → {"cow": "moo", "spam": "eggs"}
Example: de → {}

Dictionaries begin with d, end with e, contain key-value pairs.

Critical constraint: keys must be byte strings and must appear in sorted order (lexicographic byte comparison, not UTF-8 aware). This ensures deterministic encoding. Values can be any bencoded type.

Example with nested structures:

d4:spaml1:a1:bee → {"spam": ["a", "b"]}
d9:publisher3:bob17:publisher-webpage15:www.example.com18:publisher.location4:homee
→ {"publisher": "bob", "publisher-webpage": "www.example.com", "publisher.location": "home"}
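Putting the four types together, a complete encoder fits in a few lines. The sketch below is one possible Python implementation; the function name bencode and the decision to accept str (encoded as UTF-8) alongside bytes are conveniences assumed here, not requirements of the format.

```python
def bencode(value) -> bytes:
    """Encode ints, bytes/str, lists/tuples and dicts into canonical Bencode."""
    if isinstance(value, bool):
        raise TypeError("Bencode has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")        # convenience: treat str as UTF-8 bytes
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, (list, tuple)):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        items = []
        for key, item in value.items():
            raw_key = key.encode("utf-8") if isinstance(key, str) else key
            if not isinstance(raw_key, bytes):
                raise TypeError("dictionary keys must be byte strings")
            items.append((raw_key, item))
        items.sort(key=lambda pair: pair[0])  # raw byte order, not locale order
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


print(bencode({"spam": ["a", "b"]}))         # b'd4:spaml1:a1:bee'
```

Sorting on the encoded key bytes, rather than on Python strings, is what keeps the output deterministic regardless of insertion order.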

Metainfo File Structure

Now we get to the heart of BitTorrent: the .torrent file. This is the small file you download from a website or receive from a friend when you want to start downloading something via BitTorrent. Despite often being just a few kilobytes, it contains everything needed to coordinate the download of a multi-gigabyte file.

Think of a .torrent file as a blueprint. It describes what you’re downloading (the file names and sizes), how to verify it’s correct (cryptographic hashes), and where to find other people who have it (tracker URLs). It’s all encoded in Bencode, which means the entire file is just one big bencoded dictionary.

The genius of this design is that the .torrent file is tiny and can be distributed easily (via email, web downloads, or even embedded in a magnet link), but it provides all the information needed to bootstrap participation in a potentially massive swarm. Let’s break down what’s inside.
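Since the whole file is a single bencoded dictionary, reading it takes little more than a small recursive decoder. The following is a minimal sketch, assuming the name bdecode and a (value, next_pos) return convention. It is deliberately lenient: it does not enforce sorted keys or the canonical integer rules.

```python
def bdecode(data: bytes, pos: int = 0):
    """Return (value, next_pos) for the bencoded value starting at data[pos]."""
    lead = data[pos:pos + 1]
    if lead == b"i":                          # integer: i<number>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":                          # list: l<items>e
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":                          # dictionary: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            value, pos = bdecode(data, pos)
            result[key] = value
        return result, pos + 1
    if lead.isdigit():                        # byte string: <length>:<contents>
        colon = data.index(b":", pos)
        start = colon + 1
        length = int(data[pos:colon])
        return data[start:start + length], start + length
    raise ValueError(f"unexpected byte at offset {pos}")


raw = b"d8:announce40:http://tracker.example.com:8080/announce4:infod6:lengthi10e4:name1:aee"
metainfo, _ = bdecode(raw)
print(metainfo[b"announce"])
```

Keys and strings come back as bytes, matching the rule that Bencode strings carry no encoding assumption; decode them as UTF-8 only for display.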

Top-Level Keys

announce: Tracker URL. HTTP(S) or UDP. Examples:

http://tracker.example.com:8080/announce
udp://tracker.example.com:6969/announce

announce-list: Multi-tracker extension (BEP 12). Organized into tiers:

[
  ["http://tracker1.example.com/announce"],
  ["http://backup1.example.com/announce", "http://backup2.example.com/announce"],
  ["http://backup3.example.com/announce"]
]

Client tries trackers sequentially by tier. Within a tier, shuffle order and try all before moving to next tier. On success, move successful tracker to front of tier.

creation date: Unix timestamp (seconds since epoch). Example: 1609459200 = 2021-01-01 00:00:00 UTC.

comment: Free-form text. UTF-8 recommended.

created by: Client name and version. Example: "Transmission/3.00".

info: Core torrent metadata. This dictionary’s bencoded form is hashed to produce the infohash.

Info Dictionary

Contains file layout and piece hashes. Two modes: single-file and multi-file.

Common Fields

piece length: Bytes per piece. Almost always a power of two, though the base specification does not strictly require it. Typical values: 256 KiB, 512 KiB, 1 MiB, 2 MiB, 4 MiB. Smaller pieces increase metadata size and handshake overhead. Larger pieces reduce granularity and increase verification latency.

pieces: Concatenated SHA-1 hashes. Each piece has a 20-byte SHA-1 hash. Total length = ceil(total_bytes / piece_length) * 20.

Example: 10 MiB file, 1 MiB pieces = 10 pieces = 200 bytes in pieces field.
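The same arithmetic as a small helper, in case you want to sanity-check a torrent's pieces field. The function name is made up for this example.

```python
def pieces_field_size(total_bytes: int, piece_length: int) -> tuple[int, int]:
    """Return (number_of_pieces, byte_length_of_pieces_field)."""
    num_pieces = -(-total_bytes // piece_length)   # ceiling division on integers
    return num_pieces, num_pieces * 20             # one 20-byte SHA-1 per piece


print(pieces_field_size(10 * 1024 * 1024, 1024 * 1024))   # (10, 200)
```

A parser can use this as a consistency check: if the length of pieces is not a multiple of 20, or does not match the total size, the torrent is malformed.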

Binary data, not hex-encoded. To verify piece N:

expected_hash = pieces[N*20 : (N+1)*20]
actual_hash = SHA1(piece_data)
if expected_hash != actual_hash:
  reject_piece()
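A runnable version of that check, using Python's standard hashlib module. The function name verify_piece is an illustrative choice; pieces is the raw byte string from the info dictionary.

```python
import hashlib


def verify_piece(pieces: bytes, index: int, piece_data: bytes) -> bool:
    """Compare the SHA-1 of piece_data with the 20-byte entry for piece `index`."""
    expected = pieces[index * 20:(index + 1) * 20]
    if len(expected) != 20:
        raise IndexError(f"no hash stored for piece {index}")
    return hashlib.sha1(piece_data).digest() == expected


# Example with two tiny "pieces".
chunks = [b"first piece", b"second piece"]
pieces = b"".join(hashlib.sha1(chunk).digest() for chunk in chunks)
print(verify_piece(pieces, 1, b"second piece"))   # True
print(verify_piece(pieces, 1, b"tampered data"))  # False
```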

name: Display name. For single-file mode, suggested filename. For multi-file mode, suggested directory name.

private: If 1, client must not use DHT, PEX, or LSD. Only announce to tracker(s). Trackers use this to enforce ratio systems by controlling peer discovery.

Single-File Mode

info: {
  "name": "file.iso",
  "length": 1048576000,  // bytes
  "piece length": 262144,  // 256 KiB
  "pieces": <binary blob>,
  "private": 1  // optional
}

length: File size in bytes.

Multi-File Mode

When a torrent contains multiple files (like a complete album with individual song files, or a software package with multiple components), the structure is slightly different. Instead of a single length field, there’s a files list that describes each file individually.

info: {
  "name": "directory_name",
  "piece length": 262144,
  "pieces": <binary blob>,
  "files": [
    {
      "length": 524288,
      "path": ["subdirectory", "file1.txt"]
    },
    {
      "length": 1048576,
      "path": ["file2.bin"]
    }
  ]
}

files: Each file has:

  • length (integer): File size in bytes
  • path (list of strings): Path components. ["dir", "subdir", "file.txt"] → dir/subdir/file.txt

Here’s an important detail: files are laid out sequentially in the piece address space. Imagine all the files concatenated into one giant stream of bytes. The first file starts at byte 0, the second file immediately follows, etc. Pieces are then carved out of this byte stream at regular intervals (every piece length bytes). This means pieces can span file boundaries. A single piece might contain the end of one file and the beginning of the next.

Why this design? It simplifies the piece verification logic. Every piece is just a continuous range of bytes, regardless of file boundaries. The client doesn’t need special logic for pieces that cross files - it just reads bytes from the appropriate offsets in the appropriate files.

Example with 3 files (10 KiB, 20 KiB, 15 KiB) and piece length 16 KiB:

Piece 0: bytes 0-16383 (all 10 KiB of file1, first 6 KiB of file2)
Piece 1: bytes 16384-32767 (remaining 14 KiB of file2, first 2 KiB of file3)
Piece 2: bytes 32768-46079 (remaining 13 KiB of file3; the last piece is simply shorter and is hashed without padding)
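To see how a client turns a piece index into file reads, here is a small sketch that maps one piece onto (file index, offset within file, byte count) spans. The function name and the tuple layout are assumptions made for illustration.

```python
def piece_file_spans(file_lengths, piece_length, index):
    """Return [(file_index, offset_in_file, num_bytes), ...] covering one piece."""
    total = sum(file_lengths)
    start = index * piece_length
    end = min(start + piece_length, total)    # the last piece may be short
    spans = []
    file_start = 0
    for file_index, length in enumerate(file_lengths):
        file_end = file_start + length
        lo, hi = max(start, file_start), min(end, file_end)
        if lo < hi:                           # this file overlaps the piece
            spans.append((file_index, lo - file_start, hi - lo))
        file_start = file_end
    return spans


KIB = 1024
layout = [10 * KIB, 20 * KIB, 15 * KIB]       # the three files from the example
for piece in range(3):
    print(piece, piece_file_spans(layout, 16 * KIB, piece))
```

Each tuple tells the client which file to open, where to seek, and how many bytes to read. Concatenating the reads in order yields exactly the bytes that get hashed for that piece.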

Infohash Calculation

This is where everything comes together. The infohash is BitTorrent’s way of giving every torrent a unique identifier. It’s a 20-byte SHA-1 hash of the info dictionary, and it serves as the torrent’s fingerprint throughout the system.

Why hash the info dictionary specifically? Because it contains everything that defines what’s being shared: the file names, sizes, piece length, and piece hashes. The announce URL and other metadata in the outer dictionary don’t matter for the content itself. Two people could create .torrent files for the same content using different trackers, and as long as their info dictionaries are identical, they’ll get the same infohash. Their swarms can merge.

Here’s how it’s calculated:

info_dict_bytes = bencode(info_dict)
infohash = SHA1(info_dict_bytes)

The result is 20 bytes, typically displayed as a 40-character hex string like:

2c6b6858d61da9543d4231a71db4b1c9264b0685

This 160-bit identifier is used everywhere in BitTorrent:

  • Tracker announces: URL-encoded in the info_hash parameter, so the tracker knows which swarm you’re joining
  • DHT lookups: Used as the lookup key in the distributed hash table to find peers without a tracker
  • Peer handshakes: Both peers exchange infohashes to verify they’re downloading the same torrent

Here’s the critical implementation detail: the infohash must be computed from the literal bencoded bytes of the info dictionary as they appear in the .torrent file. Parsing and re-encoding is risky: if the file was written by a non-canonical encoder (unsorted keys, for example), or your parser drops or normalizes fields it doesn’t recognize, the re-encoded bytes differ and so does the hash. This is why clients typically extract the raw byte range from the .torrent file (find where the info dictionary starts and ends in the file) rather than parsing and re-encoding. It guarantees an exact match.
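One way to implement the raw-byte-range approach is to walk the top-level dictionary without building any objects, remembering only where each value starts and ends. The sketch below assumes the helper names value_end and infohash; it does minimal validation and is meant to show the idea rather than serve as a hardened parser.

```python
import hashlib


def value_end(data: bytes, pos: int) -> int:
    """Return the offset just past the bencoded value starting at data[pos]."""
    lead = data[pos:pos + 1]
    if lead == b"i":
        return data.index(b"e", pos) + 1
    if lead in (b"l", b"d"):
        pos += 1
        while data[pos:pos + 1] != b"e":
            pos = value_end(data, pos)        # skip each nested value in turn
        return pos + 1
    if lead.isdigit():
        colon = data.index(b":", pos)
        return colon + 1 + int(data[pos:colon])
    raise ValueError(f"unexpected byte at offset {pos}")


def infohash(torrent: bytes) -> bytes:
    """SHA-1 over the exact bytes of the top-level 'info' value."""
    if torrent[:1] != b"d":
        raise ValueError("metainfo must be a bencoded dictionary")
    pos = 1
    while torrent[pos:pos + 1] != b"e":
        key_end = value_end(torrent, pos)     # keys are byte strings
        colon = torrent.index(b":", pos)
        key = torrent[colon + 1:key_end]
        val_end = value_end(torrent, key_end)
        if key == b"info":
            return hashlib.sha1(torrent[key_end:val_end]).digest()
        pos = val_end
    raise KeyError("no info dictionary found")
```

Calling infohash(open(path, "rb").read()).hex() on a .torrent file would give the 40-character form shown above, with path being whatever file you have at hand.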

Complete .torrent Example

d
  8:announce40:http://tracker.example.com:8080/announce
  13:announce-listll40:http://tracker.example.com:8080/announceel39:http://backup.example.com:8080/announceee
  13:creation datei1609459200e
  4:infod
    5:filesld
      6:lengthi524288e
      4:pathl4:file8:test.txte
    ed
      6:lengthi1048576e
      4:pathl9:other.bine
    ee
    4:name12:test_torrent
    12:piece lengthi262144e
    6:pieces120:<binary sha1 hashes>
  e
e

Formatted for readability. Actual file is raw bytes without whitespace.


Tracker Protocol

Here’s a fundamental question: if BitTorrent is peer-to-peer, how do peers find each other in the first place? You have a .torrent file that describes what you want to download, but it doesn’t contain IP addresses of peers. Those change constantly as people join and leave the swarm. This is where trackers come in.

A tracker is a simple HTTP (or UDP) server that acts as a central meeting point. It doesn’t store or transfer any file data. Instead, it just maintains a list of peers currently participating in each torrent. When you start downloading, your client announces itself to the tracker (“I’m interested in this torrent, here’s my IP and port”), and the tracker responds with a list of other peers. Now you can connect to them directly and start exchanging pieces.

The elegance is in the simplicity. Trackers are sessionless (each announce is a self-contained request, and the only state the tracker keeps is the peer list itself), lightweight (they only track metadata, not file data), and optional (DHT can replace them). Two protocols exist: HTTP/HTTPS (original, widely supported) and UDP (more efficient, slightly more complex). Let’s examine both.

HTTP Tracker Protocol

Standard HTTP GET request. No POST, no authentication in base protocol.

Announce Request

URL: Tracker URL from .torrent announce field + query parameters.

Required Parameters:

  • info_hash: 20-byte infohash, URL-encoded. Example: %2c%6b%68%58%d6%1d%a9%54...
  • peer_id: 20-byte client identifier, URL-encoded. Typically -XX1234- prefix where XX is client code (AZ=Azureus, UT=μTorrent, TR=Transmission, etc.) followed by random bytes. Example: -TR3000-abcdefghijkl
  • port: TCP listen port (1-65535). NAT/firewall users may report 0.
  • uploaded: Total bytes uploaded this session (integer, not URL-encoded)
  • downloaded: Total bytes downloaded this session
  • left: Bytes remaining to download (0 for seeders)

Optional Parameters:

  • event: Lifecycle marker:
    • started: First announce after torrent starts
    • completed: Announce when download completes (transitioned from leecher to seeder)
    • stopped: Final announce before closing torrent (tracker removes peer from swarm)
    • Omit for regular interval announces
  • compact: If 1, request compact peer list format (6 bytes per IPv4 peer instead of dictionary). All modern clients use this.
  • no_peer_id: If 1, request tracker omit peer_id from response (save bandwidth). Ignored when compact=1, since the compact format carries no peer IDs.
  • numwant: Desired number of peers (default 50). Tracker may return fewer.
  • ip: Override reported IP address. Used behind proxies. Tracker may ignore.
  • key: Random value for tracker to verify client identity across IP changes (NAT rebinding).
  • trackerid: Opaque value from previous announce response. Tracker uses for session continuity.

Example URL:

http://tracker.example.com:8080/announce?
info_hash=%2c%6b%68%58%d6%1d%a9%54%3d%42%31%a7%1d%b4%b1%c9%26%4b%06%85&
peer_id=-TR3000-abcdefghijkl&
port=51413&
uploaded=0&
downloaded=0&
left=1048576000&
compact=1&
event=started&
numwant=50

URL Encoding Rules: info_hash and peer_id are raw bytes and must be percent-encoded. Every byte outside the unreserved set (0-9, a-z, A-Z, '.', '-', '_', '~') must be written as %XX. Encoding unreserved bytes as well is redundant but harmless, so encoding all 20 bytes is a safe default.
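As a sketch of how a client might assemble the announce URL, the helper below percent-encodes the two binary parameters with urllib.parse.quote from the standard library. The function name, its defaults, and the choice to always send compact and numwant are assumptions for this example.

```python
from urllib.parse import quote


def build_announce_url(announce, info_hash, peer_id, port,
                       uploaded=0, downloaded=0, left=0,
                       event=None, compact=True, numwant=50):
    """Append announce query parameters to a tracker URL."""
    if len(info_hash) != 20 or len(peer_id) != 20:
        raise ValueError("info_hash and peer_id must be exactly 20 bytes")
    params = [
        ("info_hash", quote(info_hash, safe="")),  # raw bytes, percent-encoded
        ("peer_id", quote(peer_id, safe="")),
        ("port", str(port)),
        ("uploaded", str(uploaded)),
        ("downloaded", str(downloaded)),
        ("left", str(left)),
        ("compact", "1" if compact else "0"),
        ("numwant", str(numwant)),
    ]
    if event is not None:                          # omit for regular announces
        params.append(("event", event))
    separator = "&" if "?" in announce else "?"    # tracker URL may carry a query
    return announce + separator + "&".join(f"{k}={v}" for k, v in params)


url = build_announce_url(
    "http://tracker.example.com:8080/announce",
    bytes.fromhex("2c6b6858d61da9543d4231a71db4b1c9264b0685"),
    b"-TR3000-abcdefghijkl",
    port=51413, left=1048576000, event="started",
)
print(url)
```

quote leaves unreserved bytes as literal characters and emits uppercase hex for the rest. Trackers treat percent-encoding case-insensitively, so this is equivalent to the lowercase form in the example URL.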

Announce Response

Bencoded dictionary.

Failure Response:

d14:failure reason20:Tracker is shut downe

Key: failure reason (string). Client must display message and stop announcing.

Success Response:

d
  8:completei145e
  10:incompletei67e
  8:intervali1800e
  12:min intervali300e
  5:peers<binary data>
e

Fields:

  • interval: Seconds until next announce (typically 1800 = 30 minutes). Client must not announce more frequently except for event announces.
  • min interval (optional): Minimum seconds between announces. Client should respect this.
  • complete: Number of seeders (peers with all pieces)
  • incomplete: Number of leechers (peers without all pieces)
  • peers: Peer list. Two formats:

Compact Format (Binary): 6 bytes per peer: 4-byte IPv4 + 2-byte port (big-endian).

peers = b'\xc0\xa8\x01\x64\x1a\xe1\xc0\xa8\x01\x65\x1a\xe2'
Decodes to:
  192.168.1.100:6881
  192.168.1.101:6882

Parse:

import socket
import struct

for i in range(0, len(peers), 6):
  ip = socket.inet_ntoa(peers[i:i+4])             # 4-byte IPv4 address
  port = struct.unpack('>H', peers[i+4:i+6])[0]   # 2-byte big-endian port

Dictionary Format (Legacy):

peers: [
  {
    "peer id": "...",  // 20 bytes (omitted if no_peer_id=1)
    "ip": "192.168.1.100",
    "port": 6881
  },
  ...
]

Rarely used. 50+ bytes per peer vs 6 bytes compact.

IPv6 Support (BEP 7): Additional field: peers6 (18 bytes per peer: 16-byte IPv6 + 2-byte port).

peers6 = b'\x20\x01\x0d\xb8...\x1a\xe1'  # IPv6 address + port
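Parsing peers6 follows the same pattern as the IPv4 case with an 18-byte stride. A minimal sketch using the standard ipaddress module, with a function name chosen for this example:

```python
import ipaddress
import struct


def parse_peers6(blob: bytes):
    """Split a compact peers6 value into (ipv6_address, port) tuples."""
    if len(blob) % 18 != 0:
        raise ValueError("peers6 length must be a multiple of 18")
    peers = []
    for i in range(0, len(blob), 18):
        address = ipaddress.IPv6Address(blob[i:i + 16])         # 16 raw bytes
        (port,) = struct.unpack(">H", blob[i + 16:i + 18])      # big-endian port
        peers.append((str(address), port))
    return peers


sample = bytes.fromhex("20010db8000000000000000000000001") + b"\x1a\xe1"
print(parse_peers6(sample))   # [('2001:db8::1', 6881)]
```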

Scrape Request

Query tracker for torrent statistics without announcing.

URL: Replace /announce with /scrape in tracker URL. Add info_hash parameter (can repeat for multiple torrents).

Example:

http://tracker.example.com:8080/scrape?info_hash=%2c%6b%68%58...&info_hash=%ab%cd%ef...
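Deriving the scrape URL can be sketched as follows. The rule used here is a widely used convention rather than something stated in this article: the last path segment must start with "announce", and if it does not, the tracker is assumed not to support scrape. The function names are illustrative.

```python
from urllib.parse import quote


def scrape_url_from_announce(announce_url: str):
    """Return the scrape URL, or None if the announce URL cannot be converted."""
    slash = announce_url.rfind("/")
    tail = announce_url[slash + 1:]
    if slash == -1 or not tail.startswith("announce"):
        return None
    return announce_url[:slash + 1] + "scrape" + tail[len("announce"):]


def build_scrape_url(announce_url: str, info_hashes):
    """Add one percent-encoded info_hash parameter per torrent."""
    base = scrape_url_from_announce(announce_url)
    if base is None:
        return None
    query = "&".join("info_hash=" + quote(h, safe="") for h in info_hashes)
    separator = "&" if "?" in base else "?"
    return base + separator + query


print(scrape_url_from_announce("http://tracker.example.com:8080/announce"))
```

Returning None instead of raising lets the caller treat "scrape unsupported" as a normal outcome, in line with handling a 404 gracefully.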

Response:

d5:filesd
  20:<infohash1>d
    8:completei145e
    10:downloadedi5432e
    10:incompletei67e
  e
  20:<infohash2>d
    8:completei89e
    10:downloadedi2100e
    10:incompletei34e
  e
ee

Fields per infohash:

  • complete: Seeders
  • incomplete: Leechers
  • downloaded: Total completed downloads (tracker-tracked, may be inaccurate)

Not all trackers support scrape. Client must handle 404 gracefully.

Error Handling

  • 404 Not Found: Tracker doesn’t support scrape or torrent unknown
  • 5xx Server Error: Temporary failure, retry with exponential backoff
  • DNS Failure: Try next tracker in announce-list
  • Timeout: Default 60 seconds, then try next tracker
  • retry in Extension (BEP 31): Tracker response includes retry in field:
    • Integer: Retry after N minutes
    • "never": Permanent failure, remove tracker

Example:

d14:failure reason10:Overloaded8:retry ini5ee

UDP Tracker Protocol (BEP 15)

HTTP trackers work well, but they have overhead. Every announce requires a full TCP handshake (SYN, SYN-ACK, ACK), then HTTP headers, then the response, then TCP teardown (FIN, FIN-ACK). For a tracker handling thousands of announces per second, this adds up.

UDP trackers eliminate most of this overhead. There’s no TCP connection setup or teardown. Once the client holds a connection ID (obtained with one extra round trip, described below), an announce is a single UDP packet and the tracker’s reply is a single UDP packet containing the peer list. This is much more efficient, especially for high-traffic trackers.

The tradeoff is reliability. UDP doesn’t guarantee delivery or ordering. If a packet gets lost, you won’t know unless you implement your own timeout and retry logic. BitTorrent’s UDP tracker protocol handles this by having clients implement exponential backoff and retries. The protocol is also slightly more complex because it needs a way to prevent IP spoofing attacks (which is where the connection ID comes in).
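For a concrete picture of the backoff, BEP 15 suggests retransmitting after 15 * 2^n seconds, with n starting at 0 and capped at 8. The helper below just tabulates that schedule; its name is made up for this example, and real clients may choose different limits.

```python
def udp_retry_timeouts(max_attempts: int = 9):
    """Timeout in seconds before each retransmission: 15 * 2**n for n = 0..8."""
    return [15 * 2 ** n for n in range(max_attempts)]


print(udp_retry_timeouts())   # [15, 30, 60, 120, 240, 480, 960, 1920, 3840]
```

In practice most clients give up on a tracker and move to the next one long before the final 3840-second timeout.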

The protocol works in three steps: connect (get a connection ID), announce (send your info and get peer list), and optionally scrape (get statistics). Let’s walk through each.

Connection Protocol

You might wonder: if UDP is connectionless, why is there a “connection” step? This is a security measure against IP spoofing attacks. Without it, an attacker could send fake announce requests claiming to be from someone else’s IP address. The tracker would respond to that IP, potentially overwhelming it with peer lists (a DDoS attack).

The connection ID solves this problem. Before you can announce, you must first request a connection ID from the tracker. The tracker responds with this ID, which you must include in your subsequent announce request. Since UDP is connectionless, the tracker can’t verify your IP directly, but by requiring this two-step process, it ensures that whoever announces can actually receive UDP packets at that IP address. An attacker spoofing an IP wouldn’t receive the connection ID response, so they couldn’t complete the announce.

The connection ID is valid for about 1 minute, which means you can reuse it for multiple announces within that window, reducing overhead.

Connect Request (16 bytes):

Offset  Size  Name            Value
0       8     protocol_id     0x41727101980 (magic constant)
8       4     action          0 (connect)
12      4     transaction_id  random uint32

Send via UDP to tracker host:port from announce URL.

Connect Response (16 bytes):

Offset  Size  Name            Value
0       4     action          0 (connect)
4       4     transaction_id  must match request
8       8     connection_id   use in subsequent requests

Connection ID valid for 1 minute. Client must reconnect if expired.
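The two 16-byte packets map directly onto struct format strings. This is a minimal sketch with illustrative function names; sending and receiving over a socket is left out so the packing logic stays visible.

```python
import os
import struct

PROTOCOL_ID = 0x41727101980   # magic constant from the connect request table
ACTION_CONNECT = 0


def build_connect_request(transaction_id=None):
    """Return (packet, transaction_id) for a UDP tracker connect request."""
    if transaction_id is None:
        transaction_id = int.from_bytes(os.urandom(4), "big")
    # > = big-endian, Q = uint64, I = uint32  ->  8 + 4 + 4 = 16 bytes
    packet = struct.pack(">QII", PROTOCOL_ID, ACTION_CONNECT, transaction_id)
    return packet, transaction_id


def parse_connect_response(packet: bytes, transaction_id: int) -> int:
    """Validate a connect response and return the connection_id."""
    if len(packet) < 16:
        raise ValueError("connect response must be at least 16 bytes")
    action, received_tid, connection_id = struct.unpack(">IIQ", packet[:16])
    if received_tid != transaction_id:
        raise ValueError("transaction_id mismatch")
    if action != ACTION_CONNECT:
        raise ValueError(f"unexpected action {action}")
    return connection_id
```

Checking the transaction ID is what ties a response to the request that caused it, which matters because UDP replies can arrive late, duplicated, or from an unrelated sender.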

Announce Request

Announce Request (98 bytes):

Offset  Size  Name            Value
0       8     connection_id   from connect response
8       4     action          1 (announce)
12      4     transaction_id  random uint32
16      20    info_hash       raw 20 bytes (not URL-encoded)
36      20    peer_id         raw 20 bytes
56      8     downloaded      int64 bytes
64      8     left            int64 bytes
72      8     uploaded        int64 bytes
80      4     event           0=none, 1=completed, 2=started, 3=stopped
84      4     ip              0 (default) or override address
88      4     key             random uint32 for identity verification
92      4     num_want        desired peers (-1 = default)
96      2     port            TCP listen port

All integers big-endian.
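The 98-byte layout can be packed in one struct call. The sketch below assumes a function name and default values chosen for illustration; the field order follows the table above.

```python
import struct

ACTION_ANNOUNCE = 1
EVENT_NONE, EVENT_COMPLETED, EVENT_STARTED, EVENT_STOPPED = 0, 1, 2, 3


def build_announce_request(connection_id, transaction_id, info_hash, peer_id,
                           downloaded, left, uploaded, port,
                           event=EVENT_NONE, ip=0, key=0, num_want=-1):
    """Pack a UDP tracker announce request (98 bytes, big-endian)."""
    if len(info_hash) != 20 or len(peer_id) != 20:
        raise ValueError("info_hash and peer_id must be exactly 20 bytes")
    return struct.pack(
        ">QII20s20sQQQIIIiH",
        connection_id,      # offset 0,  8 bytes
        ACTION_ANNOUNCE,    # offset 8,  4 bytes
        transaction_id,     # offset 12, 4 bytes
        info_hash,          # offset 16, 20 raw bytes
        peer_id,            # offset 36, 20 raw bytes
        downloaded,         # offset 56, 8 bytes
        left,               # offset 64, 8 bytes
        uploaded,           # offset 72, 8 bytes
        event,              # offset 80, 4 bytes
        ip,                 # offset 84, 4 bytes (0 = use sender address)
        key,                # offset 88, 4 bytes
        num_want,           # offset 92, 4 bytes, signed so -1 means default
        port,               # offset 96, 2 bytes
    )
```

The explicit ">" prefix matters: without it, struct would use native alignment and could insert padding, which would break the fixed offsets.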

Announce Response (20 + 6N bytes):

Offset  Size  Name            Value
0       4     action          1 (announce)
4       4     transaction_id  must match request
8       4     interval        seconds until next announce
12      4     leechers        incomplete count
16      4     seeders         complete count
20      6N    peers           compact format (4-byte IP + 2-byte port per peer)

Parse peers same as HTTP compact format.
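Decoding the response is the mirror image: a fixed 20-byte header followed by 6-byte peer entries. A minimal sketch, with the function name and the returned dictionary shape assumed for this example:

```python
import socket
import struct


def parse_announce_response(packet: bytes, transaction_id: int) -> dict:
    """Decode a UDP announce response into its header fields and peer list."""
    if len(packet) < 20:
        raise ValueError("announce response must be at least 20 bytes")
    action, received_tid, interval, leechers, seeders = struct.unpack(
        ">IIIII", packet[:20])
    if received_tid != transaction_id:
        raise ValueError("transaction_id mismatch")
    if action != 1:
        raise ValueError(f"unexpected action {action}")
    body = packet[20:]
    usable = len(body) - len(body) % 6        # ignore a trailing partial entry
    peers = []
    for i in range(0, usable, 6):
        ip = socket.inet_ntoa(body[i:i + 4])
        (port,) = struct.unpack(">H", body[i + 4:i + 6])
        peers.append((ip, port))
    return {"interval": interval, "leechers": leechers,
            "seeders": seeders, "peers": peers}
```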

Error Response

Error Packet (8+ bytes):

Offset  Size  Name            Value
0       4     action          3 (error)
4       4     transaction_id  must match request
8       N     error_string    human-readable error message
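Since any request can be answered with an error packet, it is convenient to check for action 3 before dispatching to a specific parser. A short sketch, with the function name and the lenient UTF-8 decoding being assumptions made here:

```python
import struct

ACTION_ERROR = 3


def parse_error_response(packet: bytes):
    """Return (transaction_id, message) for an error packet, else None."""
    if len(packet) < 8:
        raise ValueError("packet too short to contain a header")
    action, transaction_id = struct.unpack(">II", packet[:8])
    if action != ACTION_ERROR:
        return None
    # The message has no length prefix: it is simply the rest of the packet.
    message = packet[8:].decode("utf-8", errors="replace")
    return transaction_id, message


packet = struct.pack(">II", ACTION_ERROR, 42) + b"Connection ID expired"
print(parse_error_response(packet))   # (42, 'Connection ID expired')
```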

Scrape Request (UDP)

Scrape Request (16 + 20N bytes):

Offset  Size  Name            Value
0       8     connection_id   from connect response
8       4     action          2 (scrape)
12      4     transaction_id  random uint32
16      20N   info_hashes     raw 20-byte infohashes (multiple allowed)

Scrape Response (8 + 12N bytes):

Offset  Size  Name            Value
0       4     action          2 (scrape)
4       4     transaction_id  must match request
8       12N   stats           12 bytes per infohash:
                              - 4 bytes: seeders
                              - 4 bytes: completed
                              - 4 bytes: leechers
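Both scrape packets can be handled with the same struct approach. The following sketch assumes the function names and the dictionary keys in the result; the order of the statistics matches the order of infohashes in the request.

```python
import struct

ACTION_SCRAPE = 2


def build_scrape_request(connection_id, transaction_id, info_hashes):
    """Pack a UDP scrape request for one or more 20-byte infohashes."""
    if any(len(h) != 20 for h in info_hashes):
        raise ValueError("every infohash must be exactly 20 bytes")
    header = struct.pack(">QII", connection_id, ACTION_SCRAPE, transaction_id)
    return header + b"".join(info_hashes)


def parse_scrape_response(packet: bytes, transaction_id: int):
    """Return one {'seeders', 'completed', 'leechers'} dict per infohash."""
    if len(packet) < 8:
        raise ValueError("scrape response must be at least 8 bytes")
    action, received_tid = struct.unpack(">II", packet[:8])
    if received_tid != transaction_id:
        raise ValueError("transaction_id mismatch")
    if action != ACTION_SCRAPE:
        raise ValueError(f"unexpected action {action}")
    stats = []
    for i in range(8, len(packet) - 11, 12):   # only complete 12-byte entries
        seeders, completed, leechers = struct.unpack(">III", packet[i:i + 12])
        stats.append({"seeders": seeders, "completed": completed,
                      "leechers": leechers})
    return stats
```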

Tracker Selection Strategy

With multi-tracker support (announce-list), clients must implement tracker failover:

  1. Tier Processing: Try all trackers in tier 0. If all fail, move to tier 1, etc.
  2. Intra-Tier Shuffling: Randomize tracker order within tier to distribute load.
  3. Successful Tracker Promotion: Move successful tracker to front of tier for next announce.
  4. Private Torrent Restrictions: If private=1, disable DHT/PEX and disconnect all peers on tracker switch.

Example announce-list:

[
  ["udp://tracker1.example.com:6969/announce"],
  ["http://tracker2.example.com:8080/announce", "http://tracker3.example.com:8080/announce"],
  ["http://backup.example.com:8080/announce"]
]

Execution:

  1. Try UDP tracker1
  2. If timeout, shuffle and try HTTP tracker2 and tracker3
  3. If tracker2 succeeds, move to front: ["http://tracker2.example.com:8080/announce", "http://tracker3.example.com:8080/announce"]
  4. If tier 1 fails, try backup tracker in tier 2
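The failover steps can be sketched in a few lines. Here try_announce is a hypothetical callback that performs one announce and raises on failure; the function names and the in-place reordering of tiers are assumptions made for this example.

```python
import random


def shuffle_tiers(announce_list, rng=random):
    """Copy the announce-list and shuffle each tier once, when it is first read."""
    tiers = [list(tier) for tier in announce_list]
    for tier in tiers:
        rng.shuffle(tier)
    return tiers


def announce_with_failover(tiers, try_announce):
    """Try trackers tier by tier; promote the first one that responds."""
    for tier in tiers:
        for position, url in enumerate(tier):
            try:
                response = try_announce(url)   # hypothetical: raises on failure
            except Exception:
                continue                       # try the next tracker
            tier.insert(0, tier.pop(position)) # move the winner to the front
            return url, response
    raise RuntimeError("all trackers failed")
```

Because the winning tracker is moved to the front of its tier, the next announce contacts it first and only falls back to the others if it stops responding.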