Google File System
By adam
5.1 High Availability
Chunk Replication
- replication factor of 3 by default
- the master clones existing replicas when chunkservers go offline or report corrupted replicas (sketch below)
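A minimal Go sketch of that master-side decision, under my own assumptions (chunkInfo, needsReplication, and replicationGoal are illustrative names, not from the paper): a chunk whose live replica count drops below the goal, but which still has a surviving replica to copy from, gets queued for cloning.

```go
package main

import "fmt"

const replicationGoal = 3 // default replication factor noted above

// chunkInfo is hypothetical master bookkeeping: which chunkservers are
// believed to hold a replica of this chunk.
type chunkInfo struct {
	handle   uint64
	replicas []string
}

// needsReplication reports whether the chunk has fallen below the goal
// but still has at least one live replica to clone from.
func needsReplication(c chunkInfo, live map[string]bool) bool {
	n := 0
	for _, addr := range c.replicas {
		if live[addr] {
			n++
		}
	}
	return n > 0 && n < replicationGoal
}

func main() {
	live := map[string]bool{"cs1": true, "cs2": true} // cs3 went offline
	c := chunkInfo{handle: 42, replicas: []string{"cs1", "cs2", "cs3"}}
	if needsReplication(c, live) {
		fmt.Printf("chunk %d: schedule a clone from a surviving replica\n", c.handle)
	}
}
```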
Master Replication
- the operation log and its checkpoints are replicated on multiple machines
- “shadow” masters provide read-only access
- file metadata like directory contents could be slightly stale
- reads a replica of the growing operation log and applies the same sequence of changes
- polls chunkservers to locate chunk replicas
- depends on the primary master for decisions to create and delete replicas
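A rough Go sketch of what those bullets describe, with hypothetical types (shadowMaster, logRecord) rather than the real protocol: the shadow replays a replica of the operation log, polls chunkservers for replica locations, and answers read-only lookups; it never decides to create or delete replicas.

```go
package main

import "fmt"

// logRecord is a stand-in for one entry of the replicated operation log.
type logRecord struct {
	path string
	meta string
}

type shadowMaster struct {
	fileMeta  map[string]string   // built by replaying the log replica (may lag)
	locations map[uint64][]string // chunk handle -> chunkserver addresses
}

// applyLog replays log records in order; reads served from this state can
// therefore be slightly stale relative to the primary master.
func (s *shadowMaster) applyLog(records []logRecord) {
	for _, r := range records {
		s.fileMeta[r.path] = r.meta
	}
}

// pollChunkserver records which chunks a chunkserver reports holding.
func (s *shadowMaster) pollChunkserver(addr string, handles []uint64) {
	for _, h := range handles {
		s.locations[h] = append(s.locations[h], addr)
	}
}

// Lookup is read-only; replica creation/deletion stays with the primary.
func (s *shadowMaster) Lookup(path string) (string, bool) {
	m, ok := s.fileMeta[path]
	return m, ok
}

func main() {
	s := &shadowMaster{fileMeta: map[string]string{}, locations: map[uint64][]string{}}
	s.applyLog([]logRecord{{path: "/logs/2003-10-01", meta: "created"}})
	s.pollChunkserver("cs1", []uint64{42})
	fmt.Println(s.Lookup("/logs/2003-10-01"))
}
```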
5.2 Data Integrity
- impractical to detect corruption by comparing data across replicas
- each chunk is broken into 64 KB blocks, each with its own 32-bit checksum (see the sketch after this list)
- checksums are kept in memory and stored persistently with logging, separate from user data
- in reads:
- the chunkserver verifies the checksums of the blocks covering the read range before returning any data
- on a mismatch, the chunkserver returns an error and the client reads from another replica
- the master clones the chunk from a good replica and then instructs the chunkserver to delete the corrupted one
- in appends:
- incrementally update the checksum of the last partial checksum block and compute fresh checksums for any new blocks filled by the append
- even if the last partial block is already corrupted, the new checksum value will not match the stored data, so the corruption is detected the next time the block is read
- in writes:
- if a write overwrites an existing range of the chunk, the first and last blocks of the range must be read and verified first
- the new checksums are then computed for the partially overwritten blocks; without that verification they could hide corruption in the portions not being overwritten
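A sketch in Go of the read and append paths above (the overwrite case is omitted). The paper does not name the checksum algorithm, so CRC32 from hash/crc32 stands in for the 32-bit checksum; the chunk type, block bookkeeping, and the sizes in main are illustrative only.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

const blockSize = 64 * 1024 // 64 KB checksum blocks

// chunk holds data plus one 32-bit checksum per 64 KB block. In GFS the
// checksums live in chunkserver memory and are logged separately from
// user data; a slice is enough for illustration.
type chunk struct {
	data      []byte
	checksums []uint32
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// verifyRead checks every checksum block overlapping [off, off+n) before
// any data would be returned to the requestor.
func (c *chunk) verifyRead(off, n int) error {
	for b := off / blockSize; b <= (off+n-1)/blockSize; b++ {
		start := b * blockSize
		end := minInt(start+blockSize, len(c.data))
		if crc32.ChecksumIEEE(c.data[start:end]) != c.checksums[b] {
			return fmt.Errorf("checksum mismatch in block %d", b)
		}
	}
	return nil
}

// appendData extends the chunk. The checksum of the last partial block is
// extended incrementally over just the new bytes (the old bytes are not
// re-read, so any pre-existing corruption there is still caught on the
// next read); completely new blocks get fresh checksums.
func (c *chunk) appendData(p []byte) {
	oldLen := len(c.data)
	c.data = append(c.data, p...)

	off := oldLen
	if rem := oldLen % blockSize; rem != 0 {
		blk := oldLen / blockSize
		fill := minInt(blockSize-rem, len(p))
		c.checksums[blk] = crc32.Update(c.checksums[blk], crc32.IEEETable, p[:fill])
		off += fill
	}
	for ; off < len(c.data); off += blockSize {
		end := minInt(off+blockSize, len(c.data))
		c.checksums = append(c.checksums, crc32.ChecksumIEEE(c.data[off:end]))
	}
}

func main() {
	c := &chunk{}
	c.appendData(make([]byte, 100*1024)) // fills block 0, starts block 1
	c.appendData(make([]byte, 10*1024))  // extends block 1's checksum incrementally
	if err := c.verifyRead(0, 100*1024); err != nil {
		fmt.Println("read rejected:", err)
	} else {
		fmt.Println("read ok")
	}
}
```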
6 Measurements
- one master, two master replicas, 16 chunkservers, and 16 clients
6.1.1 Reads
- each client reads randomly chosen 4 MB regions from a 320 GB file set
- the chunkservers have only 32 GB of memory in aggregate, so at most a ~10% hit rate in the Linux buffer cache is expected
- theoretical aggregate limit of 125 MB/s when the 1 Gbps link between the switches saturates
- or 12.5 MB/s per client when its 100 Mbps network interface saturates (worked out in the sketch after this list)
- observed ~10 MB/s (80% of the per-client limit) with one client reading
- aggregate read rate of 94 MB/s (about 75% of the 125 MB/s link limit) for 16 readers
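The read-limit arithmetic from the bullets above, spelled out; the 1 Gbps and 100 Mbps figures come from the notes, the rest is just unit conversion.

```go
package main

import "fmt"

func main() {
	const interSwitchMBps = 125.0 // 1 Gbps link between the switches ≈ 125 MB/s
	const clientNICMBps = 12.5    // 100 Mbps client NIC ≈ 12.5 MB/s
	const clients = 16

	// Aggregate reads saturate whichever is smaller: the shared link or
	// the sum of the client NICs (16 * 12.5 = 200 MB/s), so 125 MB/s wins.
	aggregateLimit := interSwitchMBps
	if sum := clients * clientNICMBps; sum < aggregateLimit {
		aggregateLimit = sum
	}
	fmt.Printf("per-client read limit: %.1f MB/s\n", clientNICMBps)
	fmt.Printf("aggregate read limit: %.1f MB/s\n", aggregateLimit)
	fmt.Printf("observed 94 MB/s is %.0f%% of that limit\n", 94/aggregateLimit*100)
}
```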
6.1.2 Writes
- each client writes 1 GB of data to a new file in a series of 1 MB writes
- aggregate limit of about 67 MB/s, because each byte must be written to 3 of the 16 chunkservers, each with a 12.5 MB/s input connection (see the calculation below)
- observed write rate for one client is 6.3 MB/s, mainly because of delays in propagating data from one replica to another
- aggregate of 35 MB/s for 16 clients, since multiple clients are more likely to write concurrently to the same chunkservers
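And the write-side limit from the bullet above, under the same assumed topology: each byte goes to 3 of the 16 chunkservers, and each chunkserver can only take in 12.5 MB/s.

```go
package main

import "fmt"

func main() {
	const chunkservers = 16.0
	const replicas = 3.0   // each byte is written to 3 chunkservers
	const inputMBps = 12.5 // 100 Mbps input connection per chunkserver

	aggregateLimit := chunkservers * inputMBps / replicas
	fmt.Printf("aggregate write limit: %.0f MB/s\n", aggregateLimit) // ~67 MB/s
}
```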
6.1.3 Appends
- limited by the network bandwidth of the chunkservers that store the last chunk of the file, independent of the number of clients
6.2.4 Master Load
- the master handled around 200 to 500 operations per second
- an earlier master version bottlenecked on linearly scanning large directories for particular files; it was changed to do efficient binary searches through the namespace (sketch below)
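One way such a search could work, assuming (as the paper describes) that the namespace is a lookup table of full pathnames kept in sorted order, so every file under a directory occupies a contiguous range; the example paths and the filesUnder helper are mine.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sorted full pathnames: all files under one directory are contiguous.
var namespace = []string{
	"/home/a/file1",
	"/home/a/file2",
	"/home/b/checkpoint",
	"/logs/2003-10-01",
	"/logs/2003-10-02",
}

// filesUnder locates the range of paths under dir with two binary
// searches instead of a linear scan over a huge directory.
func filesUnder(dir string) []string {
	prefix := strings.TrimSuffix(dir, "/") + "/"
	lo := sort.SearchStrings(namespace, prefix)
	hi := sort.Search(len(namespace), func(i int) bool {
		return namespace[i] > prefix && !strings.HasPrefix(namespace[i], prefix)
	})
	return namespace[lo:hi]
}

func main() {
	fmt.Println(filesUnder("/logs")) // [/logs/2003-10-01 /logs/2003-10-02]
}
```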
6.2.5 Recovery Time
- experiment: kill a chunkserver holding 15,000 chunks containing 600 GB of data
- 91 concurrent clonings: the master caps how many chunks may be cloned (re-replicated) at once, as a fraction of the number of chunkservers, so that recovery traffic does not overwhelm client requests; 91 was that cap in this experiment
6.3.2 Chunkserver Workload
Bimodal distribution of operation sizes
- small reads <64KB from seek-intensive clients
- large reads >512KB from long sequential reads
- small writes for clients that checkpoint or sync more often
- large writes for clients that do more buffering
6.3.3 Appends vs Writes
- 100x more appends than writes
- mutations are predominantly appends rather than overwrites of existing data
6.3.4 Master Workload
- FindLocation - request for chunk locations (for reads)
- FindLeaseHolder - lease holder information (for mutations)
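As a sketch only, the two requests could look roughly like the interface below; the names follow the paper's table, but the parameters and return values are my guesses, not the actual GFS RPC signatures.

```go
package gfs

// MasterService sketches the two dominant master requests noted above.
type MasterService interface {
	// FindLocation asks where the replicas of a chunk live (read path).
	FindLocation(chunkHandle uint64) (replicaAddrs []string, err error)

	// FindLeaseHolder asks which replica currently holds the primary
	// lease, so a client knows where to direct a mutation.
	FindLeaseHolder(chunkHandle uint64) (primary string, secondaries []string, err error)
}
```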
7 Experiences
RW lock
a kernel reader-writer lock on the address space blocked the main network thread from mapping new data into memory while disk threads were paging in previously mapped data; the fix was to read into a separate copy (replacing mmap with pread, at the cost of an extra copy) so the network thread is no longer blocked
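The general lesson carries over outside of mmap/pread; here is a small Go analogy (metadataStore and slowDiskRead are made up) of "read from a copy so the lock is not held across slow I/O".

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// metadataStore protects a small in-memory index with an RWMutex.
type metadataStore struct {
	mu    sync.RWMutex
	index map[string]int64 // e.g. file -> offset on disk
}

// badRead holds the read lock across the slow disk read, so a pending
// writer (and every reader queued behind it) stalls for the duration.
func (s *metadataStore) badRead(name string) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	off := s.index[name]
	slowDiskRead(off) // lock held during I/O
}

// goodRead copies the small piece of state it needs, releases the lock,
// and only then does the slow I/O -- the "read from a copy" fix.
func (s *metadataStore) goodRead(name string) {
	s.mu.RLock()
	off := s.index[name]
	s.mu.RUnlock()
	slowDiskRead(off)
}

// slowDiskRead stands in for paging data off disk.
func slowDiskRead(off int64) { time.Sleep(50 * time.Millisecond) }

func main() {
	s := &metadataStore{index: map[string]int64{"chunk-1": 0}}
	s.goodRead("chunk-1")
	fmt.Println("done without holding the lock across I/O")
}
```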
Summary
design space
- component failures are common rather than exceptional
- optimize for huge files
- files are mostly append-only and read sequentially
fault tolerance
- constant monitoring
- checksum to detect data corruption
- replicating crucial data
- chunk replication
- fast and automatic recovery
- online repair compensates for lost replicas as soon as possible
high throughput
- separate file system control from data transfer
- file system control handled by master
- data transfer handled by chunkservers
- master involvement is minimized
- large chunk size
- chunk leases - delegate mutation authority to primary replicas