Understanding the BitTorrent Protocol: Part 1
Traditional client-server file distribution has a fundamental scalability problem. A server with 100 Mbps of upload capacity can push about 12.5 MB/s in total, so a single 1 GB download takes roughly 80 seconds at full speed, and every additional simultaneous downloader divides that rate further. Want to serve more users? You need more servers, more bandwidth, more money. The cost scales linearly with demand, which is why large-scale file distribution has traditionally been expensive.
BitTorrent flips this model on its head. Instead of download capacity being limited by server bandwidth, it scales with the number of downloaders themselves. Each peer uploads while downloading, transforming passive consumers into active distributors. A 1000-peer swarm with average 1 Mbps upload per peer provides 1000 Mbps of aggregate bandwidth without any central infrastructure. The more popular a file becomes, the faster it distributes. This is the opposite of traditional systems where popularity creates bottlenecks.
The protocol solves three critical problems that make this possible:
- Bandwidth amplification: Uploaders contribute capacity proportional to their download benefit, creating a self-sustaining ecosystem where everyone who downloads also helps others download.
- Verifiable integrity: Content is divided into cryptographically-hashed pieces, which prevents corruption or malicious injection. You can’t fake the data because every piece is verified against its SHA-1 hash.
- Decentralized coordination: There’s no single point of failure. Multiple trackers or DHT can coordinate peer discovery, so the system continues functioning even when parts of it go offline.
This architecture makes BitTorrent optimal for large-scale content distribution where centralized hosting costs would be prohibitive. But to understand how it all works, we need to start with the foundations: how torrents are described, how peers discover each other, and how the data is encoded.
Bencode: BitTorrent’s Wire Format
Before peers can exchange files, they need a way to describe what they’re sharing. What’s the file called? How big is it? What pieces does it contain? And crucially, how can we ensure that every client interprets this metadata identically?
This is where Bencode comes in. It’s a binary-safe serialization format that was specifically designed for BitTorrent. You might wonder why Bram Cohen (BitTorrent’s creator) didn’t just use JSON or XML, which were already available. The answer lies in a critical requirement: deterministic encoding. When you hash something cryptographically, you need the exact same byte sequence every time, or you’ll get different hashes. JSON and XML don’t guarantee this. Two programs can represent the same data structure in JSON with different whitespace, key ordering, or number formatting, and you’d get different hashes.
Bencode solves this by having strict rules for how data must be encoded. There’s only one correct way to encode any given data structure. This makes it perfect for generating info hashes (which we’ll discuss later) that uniquely identify torrents. If two people create torrents from the same file with identical settings, they’ll get identical infohashes, and peers using either torrent can share with each other.
Beyond deterministic encoding, Bencode is also simple to implement. The entire specification fits on a page. It’s binary-safe (can handle any byte values), and it’s reasonably compact. Let’s look at how it works.
Type System
Bencode defines four types: two scalars (byte strings and integers) and two containers (lists and dictionaries).
1. Byte Strings
Format: <length>:<contents>
Example: 4:spam → "spam"
Example: 0: → empty string
Length is ASCII decimal, followed by colon delimiter, followed by raw bytes. No
encoding assumption. Strings are arbitrary byte sequences. The metainfo
specification recommends UTF-8 for human-readable fields, but implementations
must treat strings as uint8_t[].
2. Integers
Format: i<number>e
Example: i42e → 42
Example: i-42e → -42
Example: i0e → 0
Invalid: i-0e (negative zero forbidden)
Invalid: i03e (leading zeros forbidden except for i0e)
Integers are ASCII decimal wrapped in i and e delimiters. Arbitrary
precision (no 32-bit or 64-bit limit in specification, though implementations
typically use int64). Negative zero and leading zeros are explicitly invalid to
ensure canonical encoding.
3. Lists
Format: l<contents>e
Example: l4:spam4:eggse → ["spam", "eggs"]
Example: le → []
Example: li42ei-5ee → [42, -5]
Nested: ll4:spam4:eggsee → [["spam", "eggs"]]
Lists begin with l, end with e, contain zero or more bencoded values.
Heterogeneous types allowed. Order is preserved.
4. Dictionaries
Format: d<key1><value1>...<keyN><valueN>e
Example: d3:cow3:moo4:spam4:eggse → {"cow": "moo", "spam": "eggs"}
Example: de → {}
Dictionaries begin with d, end with e, contain key-value pairs.
Critical constraint: keys must be byte strings and must appear in sorted order (lexicographic byte comparison, not UTF-8 aware). This ensures deterministic encoding. Values can be any bencoded type.
Example with nested structures:
d4:spaml1:a1:bee → {"spam": ["a", "b"]}
d9:publisher3:bob17:publisher-webpage15:www.example.com18:publisher.location4:homee
→ {"publisher": "bob", "publisher-webpage": "www.example.com", "publisher.location": "home"}
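To see how little code these rules need, here is a minimal encoder sketch in Python. The function name bencode and the decision to accept Python str values (stored as UTF-8) are assumptions made for illustration; they are not part of the specification.

```python
def bencode(value) -> bytes:
    """Encode ints, byte strings, lists and dicts as canonical Bencode."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; Bencode has no boolean type.
        raise TypeError("cannot bencode a bool")
    if isinstance(value, int):
        # Python never prints leading zeros or "-0", so the output is canonical.
        return b"i%de" % value
    if isinstance(value, str):
        # Assumption: text is stored as UTF-8 bytes.
        value = value.encode("utf-8")
    if isinstance(value, (bytes, bytearray)):
        return b"%d:%s" % (len(value), bytes(value))
    if isinstance(value, (list, tuple)):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        items = []
        for key, item in value.items():
            if isinstance(key, str):
                key = key.encode("utf-8")
            if not isinstance(key, bytes):
                raise TypeError("dictionary keys must be byte strings")
            items.append((key, item))
        # Sort by the raw byte value of each key, as the format requires.
        items.sort(key=lambda pair: pair[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


print(bencode({"spam": ["a", "b"]}))  # b'd4:spaml1:a1:bee'
```

Because the keys are sorted on every call, two dictionaries built in a different insertion order still produce the same bytes, which is the property the infohash depends on.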
Metainfo File Structure
Now we get to the heart of BitTorrent: the .torrent file. This is the small
file you download from a website or receive from a friend when you want to
start downloading something via BitTorrent. Despite often being just a few
kilobytes, it contains everything needed to coordinate the download of a
multi-gigabyte file.
Think of a .torrent file as a blueprint. It describes what you’re downloading
(the file names and sizes), how to verify it’s correct (cryptographic hashes),
and where to find other people who have it (tracker URLs). It’s all encoded in
Bencode, which means the entire file is just one big bencoded dictionary.
The genius of this design is that the .torrent file is tiny and can be distributed easily (via email, web downloads, or even embedded in a magnet link), but it provides all the information needed to bootstrap participation in a potentially massive swarm. Let’s break down what’s inside.
Top-Level Keys
announce: Tracker URL. HTTP(S) or UDP. Examples:
http://tracker.example.com:8080/announce
udp://tracker.example.com:6969/announce
announce-list: Multi-tracker extension (BEP 12). Organized into tiers:
[
["http://tracker1.example.com/announce"],
["http://backup1.example.com/announce", "http://backup2.example.com/announce"],
["http://backup3.example.com/announce"]
]
Client tries trackers sequentially by tier. Within a tier, shuffle order and try all before moving to next tier. On success, move successful tracker to front of tier.
creation date: Unix timestamp (seconds since epoch). Example:
1609459200 = 2021-01-01 00:00:00 UTC.
comment: Free-form text. UTF-8 recommended.
created by: Client name and version. Example: "Transmission/3.00".
info: Core torrent metadata. This dictionary’s bencoded form is hashed to
produce the infohash.
Info Dictionary
Contains file layout and piece hashes. Two modes: single-file and multi-file.
Common Fields
piece length: Bytes per piece. Almost always a power of two. Typical values:
256 KiB, 512 KiB, 1 MiB, 2 MiB, 4 MiB. Smaller pieces increase metadata size
and protocol overhead (more hashes, larger bitfields, more have messages).
Larger pieces reduce granularity and increase verification latency.
pieces: Concatenated SHA-1 hashes. Each piece has a 20-byte SHA-1 hash.
Total length = ceil(total_bytes / piece_length) * 20.
Example: 10 MB file, 1 MiB pieces = 10 pieces = 200 bytes in pieces field.
Binary data, not hex-encoded. To verify piece N:
import hashlib

expected_hash = pieces[N * 20 : (N + 1) * 20]
actual_hash = hashlib.sha1(piece_data).digest()
if expected_hash != actual_hash:
    reject_piece()  # discard the data and request the piece again
name: Display name. For single-file mode, suggested filename. For
multi-file mode, suggested directory name.
private: If 1, client must not use DHT, PEX, or LSD. Only announce to
tracker(s). Trackers use this to enforce ratio systems by controlling peer
discovery.
Single-File Mode
info: {
"name": "file.iso",
"length": 1048576000, // bytes
"piece length": 262144, // 256 KiB
"pieces": <binary blob>,
"private": 1 // optional
}
length: File size in bytes.
Multi-File Mode
When a torrent contains multiple files (like a complete album with individual
song files, or a software package with multiple components), the structure is
slightly different. Instead of a single length field, there’s a files list
that describes each file individually.
info: {
"name": "directory_name",
"piece length": 262144,
"pieces": <binary blob>,
"files": [
{
"length": 524288,
"path": ["subdirectory", "file1.txt"]
},
{
"length": 1048576,
"path": ["file2.bin"]
}
]
}
files: Each file has:
- length (integer): File size in bytes
- path (list of strings): Path components. ["dir", "subdir", "file.txt"] → dir/subdir/file.txt
Here’s an important detail: files are laid out sequentially in the piece
address space. Imagine all the files concatenated into one giant stream of
bytes. The first file starts at byte 0, the second file immediately follows,
etc. Pieces are then carved out of this byte stream at regular intervals (every
piece length bytes). This means pieces can span file boundaries. A single
piece might contain the end of one file and the beginning of the next.
Why this design? It simplifies the piece verification logic. Every piece is just a continuous range of bytes, regardless of file boundaries. The client doesn’t need special logic for pieces that cross files - it just reads bytes from the appropriate offsets in the appropriate files.
Example with 3 files (10 KiB, 20 KiB, 15 KiB) and piece length 16 KiB:
Piece 0: bytes 0-16383 (all 10 KiB of file1, first 6 KiB of file2)
Piece 1: bytes 16384-32767 (bytes 6-20 KiB of file2, first 2 KiB of file3)
Piece 2: bytes 32768-46079 (last 13 KiB of file3; the final piece is simply shorter, not zero-padded)
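The mapping from a piece index to file ranges can be sketched in a few lines. The function name piece_spans and its return shape (file index, offset within that file, byte count) are choices made for this illustration, not something the protocol defines.

```python
from typing import List, Tuple


def piece_spans(file_lengths: List[int], piece_length: int,
                index: int) -> List[Tuple[int, int, int]]:
    """Return (file_index, offset_in_file, byte_count) for each part of a piece."""
    total = sum(file_lengths)
    start = index * piece_length
    if index < 0 or start >= total:
        raise IndexError("piece index out of range")
    # The last piece may be shorter than piece_length.
    end = min(start + piece_length, total)

    spans = []
    file_start = 0
    for file_index, length in enumerate(file_lengths):
        file_end = file_start + length
        # Overlap between the piece [start, end) and this file.
        lo = max(start, file_start)
        hi = min(end, file_end)
        if lo < hi:
            spans.append((file_index, lo - file_start, hi - lo))
        file_start = file_end
    return spans


KIB = 1024
files = [10 * KIB, 20 * KIB, 15 * KIB]
for piece in range(3):
    print(piece, piece_spans(files, 16 * KIB, piece))
# 0 [(0, 0, 10240), (1, 0, 6144)]
# 1 [(1, 6144, 14336), (2, 0, 2048)]
# 2 [(2, 2048, 13312)]
```

To verify a piece, a client would read each span from the corresponding file, concatenate the bytes in order, and hash the result.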
Infohash Calculation
This is where everything comes together. The infohash is BitTorrent’s way of giving every torrent a unique identifier. It’s a 20-byte SHA-1 hash of the info dictionary, and it serves as the torrent’s fingerprint throughout the system.
Why hash the info dictionary specifically? Because it contains everything that defines what’s being shared: the file names, sizes, piece length, and piece hashes. The announce URL and other metadata in the outer dictionary don’t matter for the content itself. Two people could create .torrent files for the same content using different trackers, and as long as their info dictionaries are identical, they’ll get the same infohash. Their swarms can merge.
Here’s how it’s calculated:
info_dict_bytes = bencode(info_dict)
infohash = SHA1(info_dict_bytes)
The result is 20 bytes, typically displayed as a 40-character hex string like:
2c6b6858d61da9543d4231a71db4b1c9264b0685
This 160-bit identifier is used everywhere in BitTorrent:
- Tracker announces: URL-encoded in the info_hash parameter, so the tracker knows which swarm you’re joining
- DHT lookups: Used as the lookup key in the distributed hash table to find peers without a tracker
- Peer handshakes: Both peers exchange infohashes to verify they’re downloading the same torrent
Here’s the critical implementation detail: the infohash must be computed from the literal bencoded bytes of the info dictionary as they appear in the .torrent file. Canonical Bencode should round-trip exactly, but files in the wild are not always canonical (unsorted keys or integers with leading zeros, for example), and a decoder may silently normalize or drop something it doesn’t recognize. If you parse and re-encode such a file, you can end up with different bytes and therefore a different hash. This is why clients typically extract the raw byte range from the .torrent file (find where the info dictionary starts and ends in the file) rather than parsing and re-encoding. It guarantees an exact match.
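One way to follow this advice is to walk the bencoded bytes just far enough to find where the info value starts and ends, then hash that slice untouched. The sketch below does that. The names skip_value and infohash_from_torrent are illustrative, and the code assumes a reasonably well-formed file rather than defending against every malformed input.

```python
import hashlib


def skip_value(data: bytes, pos: int) -> int:
    """Return the offset just past the bencoded value that starts at pos."""
    lead = data[pos:pos + 1]
    if lead == b"i":
        return data.index(b"e", pos) + 1
    if lead in (b"l", b"d"):
        pos += 1
        while data[pos:pos + 1] != b"e":
            pos = skip_value(data, pos)
        return pos + 1
    if lead.isdigit():
        colon = data.index(b":", pos)
        return colon + 1 + int(data[pos:colon])
    raise ValueError(f"invalid bencode at offset {pos}")


def infohash_from_torrent(data: bytes) -> bytes:
    """SHA-1 of the exact bytes of the top-level 'info' value."""
    if data[:1] != b"d":
        raise ValueError("metainfo must be a bencoded dictionary")
    pos = 1
    while data[pos:pos + 1] != b"e":
        if not data[pos:pos + 1].isdigit():
            raise ValueError("dictionary key is not a byte string")
        key_end = skip_value(data, pos)
        colon = data.index(b":", pos)
        key = data[colon + 1:key_end]
        value_end = skip_value(data, key_end)
        if key == b"info":
            # Hash the original slice; nothing is decoded or re-encoded.
            return hashlib.sha1(data[key_end:value_end]).digest()
        pos = value_end
    raise KeyError("no info dictionary found")
```

Typical use would be infohash_from_torrent(open(path, "rb").read()).hex(), where path is whatever .torrent file you have at hand.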
Complete .torrent Example
d
8:announce40:http://tracker.example.com:8080/announce
13:announce-listll40:http://tracker.example.com:8080/announceel39:http://backup.example.com:8080/announceee
13:creation datei1609459200e
4:infod
5:filesld
6:lengthi524288e
4:pathl4:file8:test.txte
ed
6:lengthi1048576e
4:pathl9:other.bine
ee
4:name12:test_torrent
12:piece lengthi262144e
6:pieces120:<binary sha1 hashes>
e
e
Formatted for readability. Actual file is raw bytes without whitespace.
Tracker Protocol
Here’s a fundamental question: if BitTorrent is peer-to-peer, how do peers find each other in the first place? You have a .torrent file that describes what you want to download, but it doesn’t contain IP addresses of peers. Those change constantly as people join and leave the swarm. This is where trackers come in.
A tracker is a simple HTTP (or UDP) server that acts as a central meeting point. It doesn’t store or transfer any file data. Instead, it just maintains a list of peers currently participating in each torrent. When you start downloading, your client announces itself to the tracker (“I’m interested in this torrent, here’s my IP and port”), and the tracker responds with a list of other peers. Now you can connect to them directly and start exchanging pieces.
The elegance is in the simplicity. Trackers are simple (each announce is a self-contained request, and the only state the tracker keeps is a short-lived peer list per torrent), lightweight (they only track metadata, not file data), and optional (DHT can replace them). Two protocols exist: HTTP/HTTPS (original, widely supported) and UDP (more efficient, slightly more complex). Let’s examine both.
HTTP Tracker Protocol
Standard HTTP GET request. No POST, no authentication in base protocol.
Announce Request
URL: Tracker URL from .torrent announce field + query parameters.
Required Parameters:
- info_hash: 20-byte infohash, URL-encoded. Example: %2c%6b%68%58%d6%1d%a9%54...
- peer_id: 20-byte client identifier, URL-encoded. Typically a -XX1234- prefix where XX is client code (AZ=Azureus, UT=μTorrent, TR=Transmission, etc.) followed by random bytes. Example: -TR3000-abcdefghijkl
- port: TCP listen port (1-65535). NAT/firewall users may report 0.
- uploaded: Total bytes uploaded this session (integer, not URL-encoded)
- downloaded: Total bytes downloaded this session
- left: Bytes remaining to download (0 for seeders)
Optional Parameters:
- event: Lifecycle marker:
  - started: First announce after torrent starts
  - completed: Announce when download completes (transitioned from leecher to seeder)
  - stopped: Final announce before closing torrent (tracker removes peer from swarm)
  - Omit for regular interval announces
- compact: If 1, request compact peer list format (6 bytes per IPv4 peer instead of dictionary). All modern clients use this.
- no_peer_id: If 1, request tracker omit peer_id from response (save bandwidth). Used with compact=1.
- numwant: Desired number of peers (default 50). Tracker may return fewer.
- ip: Override reported IP address. Used behind proxies. Tracker may ignore.
- key: Random value for tracker to verify client identity across IP changes (NAT rebinding).
- trackerid: Opaque value from previous announce response. Tracker uses for session continuity.
Example URL:
http://tracker.example.com:8080/announce?
info_hash=%2c%6b%68%58%d6%1d%a9%54%3d%42%31%a7%1d%b4%b1%c9%26%4b%06%85&
peer_id=-TR3000-abcdefghijkl&
port=51413&
uploaded=0&
downloaded=0&
left=1048576000&
compact=1&
event=started&
numwant=50
URL Encoding Rules: info_hash and peer_id are raw 20-byte values and must be
percent-encoded. Every byte outside the unreserved set (A-Z, a-z, 0-9, '-',
'_', '.', '~') must be written as %XX. Encoding unreserved bytes as well, as
the example above does, is also valid, so when in doubt encode everything.
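As a rough sketch of how a client might assemble this request, the function below builds the announce URL with Python's standard urllib.parse. The name build_announce_url and its default values are assumptions for illustration. Note that urllib.parse.quote leaves unreserved characters unescaped and uses uppercase hex digits, so the output looks different from the example above but decodes to the same bytes.

```python
import urllib.parse


def build_announce_url(tracker_url: str, info_hash: bytes, peer_id: bytes,
                       port: int, uploaded: int = 0, downloaded: int = 0,
                       left: int = 0, event=None, numwant: int = 50) -> str:
    """Build an HTTP tracker announce URL with percent-encoded binary fields."""
    if len(info_hash) != 20 or len(peer_id) != 20:
        raise ValueError("info_hash and peer_id must be exactly 20 bytes")
    params = [
        ("info_hash", info_hash),   # raw bytes, percent-encoded below
        ("peer_id", peer_id),
        ("port", port),
        ("uploaded", uploaded),
        ("downloaded", downloaded),
        ("left", left),
        ("compact", 1),
    ]
    if event is not None:
        params.append(("event", event))  # omitted for regular announces
    params.append(("numwant", numwant))
    # quote (not the default quote_plus) with safe="" escapes every byte
    # outside the unreserved set as %XX.
    query = urllib.parse.urlencode(params, quote_via=urllib.parse.quote, safe="")
    separator = "&" if "?" in tracker_url else "?"
    return tracker_url + separator + query


url = build_announce_url(
    "http://tracker.example.com:8080/announce",
    bytes.fromhex("2c6b6858d61da9543d4231a71db4b1c9264b0685"),
    b"-TR3000-abcdefghijkl",
    port=51413,
    left=1048576000,
    event="started",
)
print(url)
```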
Announce Response
Bencoded dictionary.
Failure Response:
d14:failure reason20:Tracker is shut downe
Key: failure reason (string). Client must display message and stop
announcing.
Success Response:
d
8:completei145e
10:incompletei67e
8:intervali1800e
12:min intervali300e
5:peers12:<binary data>
e
Fields:
- interval: Seconds until next announce (typically 1800 = 30 minutes). Client must not announce more frequently except for event announces.
- min interval (optional): Minimum seconds between announces. Client should respect this.
- complete: Number of seeders (peers with all pieces)
- incomplete: Number of leechers (peers without all pieces)
- peers: Peer list. Two formats:
Compact Format (Binary): 6 bytes per peer: 4-byte IPv4 + 2-byte port (big-endian).
peers = b'\xc0\xa8\x01\x64\x1a\xe1\xc0\xa8\x01\x65\x1a\xe2'
Decodes to:
192.168.1.100:6881
192.168.1.101:6882
Parse:
import socket
import struct

for i in range(0, len(peers), 6):
    ip = socket.inet_ntoa(peers[i:i+4])  # 4-byte IPv4 address
    port = struct.unpack('>H', peers[i+4:i+6])[0]  # big-endian uint16
Dictionary Format (Legacy):
peers: [
{
"peer id": "...", // 20 bytes (omitted if no_peer_id=1)
"ip": "192.168.1.100",
"port": 6881
},
...
]
Rarely used. 50+ bytes per peer vs 6 bytes compact.
IPv6 Support (BEP 7):
Additional field: peers6 (18 bytes per peer: 16-byte IPv6 + 2-byte port).
peers6 = b'\x20\x01\x0d\xb8...\x1a\xe1' // IPv6 address + port
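Parsing peers6 follows the same pattern as the IPv4 case, just with 18-byte entries. A minimal sketch, assuming the field has already been pulled out of the decoded response as a bytes object (parse_compact_peers6 is an illustrative name):

```python
import ipaddress
import struct


def parse_compact_peers6(blob: bytes):
    """Split a peers6 value into (address, port) tuples."""
    if len(blob) % 18 != 0:
        raise ValueError("peers6 length must be a multiple of 18")
    peers = []
    for offset in range(0, len(blob), 18):
        # 16 bytes of IPv6 address followed by a big-endian 16-bit port.
        address = ipaddress.IPv6Address(blob[offset:offset + 16])
        (port,) = struct.unpack(">H", blob[offset + 16:offset + 18])
        peers.append((str(address), port))
    return peers


sample = bytes.fromhex("20010db8" + "00" * 11 + "01") + b"\x1a\xe1"
print(parse_compact_peers6(sample))  # [('2001:db8::1', 6881)]
```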
Scrape Request
Query tracker for torrent statistics without announcing.
URL: Replace /announce with /scrape in tracker URL. Add info_hash
parameter (can repeat for multiple torrents).
Example:
http://tracker.example.com:8080/scrape?info_hash=%2c%6b%68%58...&info_hash=%ab%cd%ef...
Response:
d5:filesd
20:<infohash1>d
8:completei145e
10:downloadedi5432e
10:incompletei67e
e
20:<infohash2>d
8:completei89e
10:downloadedi2100e
10:incompletei34e
e
ee
Fields per infohash:
- complete: Seeders
- incomplete: Leechers
- downloaded: Total completed downloads (tracker-tracked, may be inaccurate)
Not all trackers support scrape. Client must handle 404 gracefully.
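Deriving the scrape URL can be sketched as below. It relies on a commonly documented convention that this article does not spell out: the substitution only applies when the last path segment of the announce URL begins with announce, and otherwise the tracker is assumed not to support scrape. The function name scrape_url is illustrative.

```python
from urllib.parse import urlsplit, urlunsplit


def scrape_url(announce_url: str):
    """Return the scrape URL for an announce URL, or None if not derivable."""
    parts = urlsplit(announce_url)
    head, slash, last = parts.path.rpartition("/")
    if not slash or not last.startswith("announce"):
        return None  # assumed convention: scrape is unsupported here
    # Swap only the leading "announce"; keep any suffix such as ".php".
    new_path = head + "/scrape" + last[len("announce"):]
    return urlunsplit(parts._replace(path=new_path))


print(scrape_url("http://tracker.example.com:8080/announce"))
# http://tracker.example.com:8080/scrape
```

Any existing query string (a passkey, for instance) is preserved, so the info_hash parameters can be appended to whatever the function returns.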
Error Handling
- 404 Not Found: Tracker doesn’t support scrape or torrent unknown
- 5xx Server Error: Temporary failure, retry with exponential backoff
- DNS Failure: Try next tracker in announce-list
- Timeout: Default 60 seconds, then try next tracker
retry in Extension (BEP 31): Tracker response includes retry in field:
- Integer: Retry after N minutes
- "never": Permanent failure, remove tracker
Example:
d14:failure reason10:Overloaded8:retry ini5ee
UDP Tracker Protocol (BEP 15)
HTTP trackers work well, but they have overhead. Every announce requires a full TCP handshake (SYN, SYN-ACK, ACK), then HTTP headers, then the response, then TCP teardown (FIN, FIN-ACK). For a tracker handling thousands of announces per second, this adds up.
UDP trackers eliminate this overhead. There’s no connection setup or teardown. Instead, you send a single UDP packet with your announce, and the tracker responds with a single UDP packet containing the peer list. This is much more efficient, especially for high-traffic trackers.
The tradeoff is reliability. UDP doesn’t guarantee delivery or ordering. If a packet gets lost, you won’t know unless you implement your own timeout and retry logic. BitTorrent’s UDP tracker protocol handles this by having clients implement exponential backoff and retries. The protocol is also slightly more complex because it needs a way to prevent IP spoofing attacks (which is where the connection ID comes in).
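A sketch of that retry loop follows. The schedule of 15 * 2^n seconds, with n running from 0 to 8, is the one BEP 15 describes; the function names and the socket-like sock parameter are assumptions for illustration.

```python
import socket


def retry_timeouts(max_retries: int = 8):
    """Timeouts in seconds for attempts n = 0..max_retries: 15 * 2**n."""
    return [15 * 2 ** n for n in range(max_retries + 1)]


def send_with_retries(sock, address, packet: bytes) -> bytes:
    """Send one datagram and wait for a reply, retransmitting on timeout."""
    for timeout in retry_timeouts():
        sock.settimeout(timeout)
        sock.sendto(packet, address)
        try:
            reply, _sender = sock.recvfrom(65536)
            return reply
        except socket.timeout:
            continue  # no reply in time: retransmit with a longer timeout
    raise TimeoutError("tracker did not respond")


print(retry_timeouts())  # [15, 30, 60, 120, 240, 480, 960, 1920, 3840]
```

In practice sock would be a socket.socket(socket.AF_INET, socket.SOCK_DGRAM), and the caller would still need to check that the reply's transaction_id matches the request before trusting it.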
The protocol works in three steps: connect (get a connection ID), announce (send your info and get peer list), and optionally scrape (get statistics). Let’s walk through each.
Connection Protocol
You might wonder: if UDP is connectionless, why is there a “connection” step? This is a security measure against IP spoofing attacks. Without it, an attacker could send fake announce requests claiming to be from someone else’s IP address. The tracker would respond to that IP, potentially overwhelming it with peer lists (a DDoS attack).
The connection ID solves this problem. Before you can announce, you must first request a connection ID from the tracker. The tracker responds with this ID, which you must include in your subsequent announce request. Since UDP is connectionless, the tracker can’t verify your IP directly, but by requiring this two-step process, it ensures that whoever announces can actually receive UDP packets at that IP address. An attacker spoofing an IP wouldn’t receive the connection ID response, so they couldn’t complete the announce.
The connection ID is valid for about 1 minute, which means you can reuse it for multiple announces within that window, reducing overhead.
Connect Request (16 bytes):
Offset  Size  Name            Value
0       8     protocol_id     0x41727101980 (magic constant)
8       4     action          0 (connect)
12      4     transaction_id  random uint32
Send via UDP to tracker host:port from announce URL.
Connect Response (16 bytes):
Offset  Size  Name            Value
0       4     action          0 (connect)
4       4     transaction_id  must match request
8       8     connection_id   use in subsequent requests
Connection ID valid for 1 minute. Client must reconnect if expired.
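The two packets above map directly onto Python's struct module. This is a minimal sketch; the function names are illustrative, and sending the packet over a socket is left out.

```python
import os
import struct

PROTOCOL_ID = 0x41727101980  # magic constant from the connect request table
ACTION_CONNECT = 0


def build_connect_request():
    """Return (packet, transaction_id) for a 16-byte connect request."""
    transaction_id = int.from_bytes(os.urandom(4), "big")
    # ">QII" = big-endian uint64, uint32, uint32 (8 + 4 + 4 = 16 bytes).
    packet = struct.pack(">QII", PROTOCOL_ID, ACTION_CONNECT, transaction_id)
    return packet, transaction_id


def parse_connect_response(data: bytes, transaction_id: int) -> int:
    """Validate a connect response and return the connection_id."""
    if len(data) < 16:
        raise ValueError("connect response must be at least 16 bytes")
    action, received_id, connection_id = struct.unpack(">IIQ", data[:16])
    if received_id != transaction_id:
        raise ValueError("transaction_id mismatch")
    if action != ACTION_CONNECT:
        raise ValueError(f"unexpected action {action}")
    return connection_id
```

Checking the transaction_id matters because UDP replies can arrive late or come from an unrelated sender; a reply that does not echo the random value you sent should be ignored.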
Announce Request
Announce Request (98 bytes):
Offset  Size  Name            Value
0       8     connection_id   from connect response
8       4     action          1 (announce)
12      4     transaction_id  random uint32
16      20    info_hash       raw 20 bytes (not URL-encoded)
36      20    peer_id         raw 20 bytes
56      8     downloaded      int64 bytes
64      8     left            int64 bytes
72      8     uploaded        int64 bytes
80      4     event           0=none, 1=completed, 2=started, 3=stopped
84      4     ip              0 (default) or override address
88      4     key             random uint32 for identity verification
92      4     num_want        desired peers (-1 = default)
96      2     port            TCP listen port
All integers big-endian.
Announce Response (20 + 6N bytes):
Offset  Size  Name            Value
0       4     action          1 (announce)
4       4     transaction_id  must match request
8       4     interval        seconds until next announce
12      4     leechers        incomplete count
16      4     seeders         complete count
20      6N    peers           compact format (4-byte IP + 2-byte port per peer)
Parse peers same as HTTP compact format.
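Here is how the announce layout might translate into code. The sketch packs the 98-byte request with a single struct format string and unpacks the response, including the error action. Function names and defaults are illustrative assumptions.

```python
import socket
import struct

ACTION_ANNOUNCE = 1
ACTION_ERROR = 3
EVENTS = {"none": 0, "completed": 1, "started": 2, "stopped": 3}


def build_announce_request(connection_id, transaction_id, info_hash, peer_id,
                           downloaded, left, uploaded, port,
                           event="none", key=0, num_want=-1) -> bytes:
    """Pack the 98-byte UDP announce request (all integers big-endian)."""
    if len(info_hash) != 20 or len(peer_id) != 20:
        raise ValueError("info_hash and peer_id must be exactly 20 bytes")
    return struct.pack(
        ">QII20s20sqqqIIIiH",
        connection_id, ACTION_ANNOUNCE, transaction_id,
        info_hash, peer_id,
        downloaded, left, uploaded,
        EVENTS[event],
        0,  # ip: 0 lets the tracker use the packet's source address
        key, num_want, port,
    )


def parse_announce_response(data: bytes, transaction_id: int):
    """Return (interval, leechers, seeders, peers) or raise on an error packet."""
    if len(data) < 8:
        raise ValueError("response too short")
    action, received_id = struct.unpack_from(">II", data, 0)
    if received_id != transaction_id:
        raise ValueError("transaction_id mismatch")
    if action == ACTION_ERROR:
        raise RuntimeError(data[8:].decode("utf-8", errors="replace"))
    if action != ACTION_ANNOUNCE or len(data) < 20:
        raise ValueError("not an announce response")
    interval, leechers, seeders = struct.unpack_from(">III", data, 8)
    peers = []
    # Each complete 6-byte entry is a 4-byte IPv4 address plus a 2-byte port.
    for offset in range(20, len(data) - 5, 6):
        ip = socket.inet_ntoa(data[offset:offset + 4])
        (peer_port,) = struct.unpack_from(">H", data, offset + 4)
        peers.append((ip, peer_port))
    return interval, leechers, seeders, peers
```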
Error Response
Error Packet (8+ bytes):
Offset  Size  Name            Value
0       4     action          3 (error)
4       4     transaction_id  must match request
8       N     error_string    human-readable error message
Scrape Request (UDP)
Scrape Request (16 + 20N bytes):
Offset  Size  Name            Value
0       8     connection_id   from connect response
8       4     action          2 (scrape)
12      4     transaction_id  random uint32
16      20N   info_hashes     raw 20-byte infohashes (multiple allowed)
Scrape Response (8 + 12N bytes):
Offset  Size  Name            Value
0       4     action          2 (scrape)
4       4     transaction_id  must match request
8       12N   stats           12 bytes per infohash:
                              - 4 bytes: seeders
                              - 4 bytes: completed
                              - 4 bytes: leechers
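The scrape packets can be handled the same way. In this sketch the caller passes the list of infohashes it asked about, so the statistics can be paired back with them in request order; the function names and the dictionary return shape are illustrative choices.

```python
import struct

ACTION_SCRAPE = 2


def build_scrape_request(connection_id: int, transaction_id: int,
                         info_hashes) -> bytes:
    """Pack a UDP scrape request: 16-byte header plus 20 bytes per infohash."""
    if any(len(h) != 20 for h in info_hashes):
        raise ValueError("every infohash must be exactly 20 bytes")
    header = struct.pack(">QII", connection_id, ACTION_SCRAPE, transaction_id)
    return header + b"".join(info_hashes)


def parse_scrape_response(data: bytes, transaction_id: int, info_hashes):
    """Return {infohash: (seeders, completed, leechers)} in request order."""
    expected = 8 + 12 * len(info_hashes)
    if len(data) < expected:
        raise ValueError("scrape response too short")
    action, received_id = struct.unpack_from(">II", data, 0)
    if received_id != transaction_id:
        raise ValueError("transaction_id mismatch")
    if action != ACTION_SCRAPE:
        raise ValueError(f"unexpected action {action}")
    stats = {}
    for i, info_hash in enumerate(info_hashes):
        # Three big-endian uint32 values per requested infohash.
        stats[info_hash] = struct.unpack_from(">III", data, 8 + 12 * i)
    return stats
```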
Tracker Selection Strategy
With multi-tracker support (announce-list), clients must implement tracker failover:
- Tier Processing: Try all trackers in tier 0. If all fail, move to tier 1, etc.
- Intra-Tier Shuffling: Randomize tracker order within tier to distribute load.
- Successful Tracker Promotion: Move successful tracker to front of tier for next announce.
- Private Torrent Restrictions: If private=1, disable DHT/PEX and disconnect all peers on tracker switch.
Example announce-list:
[
["udp://tracker1.example.com:6969/announce"],
["http://tracker2.example.com:8080/announce", "http://tracker3.example.com:8080/announce"],
["http://backup.example.com:8080/announce"]
]
Execution:
- Try UDP tracker1
- If timeout, shuffle and try HTTP tracker2 and tracker3
- If tracker2 succeeds, move to front:
["http://tracker2.example.com:8080/announce", "http://tracker3.example.com:8080/announce"]
- If tier 1 fails, try backup tracker in tier 2
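The failover rules above fit in a small class. This is a sketch under stated assumptions: TrackerTiers is a made-up name, try_announce is a hypothetical callback that performs the actual HTTP or UDP announce and raises OSError (or a subclass such as TimeoutError) on failure, and each tier is shuffled once when the list is loaded.

```python
import random


class TrackerTiers:
    """Tiered tracker list with shuffle, failover and promotion."""

    def __init__(self, announce_list, rng=None):
        rng = rng or random.Random()
        # Copy the tiers so the caller's list is not modified.
        self.tiers = [list(tier) for tier in announce_list]
        for tier in self.tiers:
            rng.shuffle(tier)  # spread load across trackers in a tier

    def announce(self, try_announce):
        """Try each tier in order; return (url, response) on first success."""
        for tier in self.tiers:
            for url in list(tier):  # iterate over a copy while reordering
                try:
                    response = try_announce(url)
                except OSError:
                    continue  # this tracker failed: try the next one
                # Promote the working tracker to the front of its tier.
                tier.remove(url)
                tier.insert(0, url)
                return url, response
        raise RuntimeError("all trackers failed")
```

On the next announce the promoted tracker is tried first within its tier, which matches the behaviour described in the execution steps above.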