Scaling large data in the infrastructure layer

Case: Web crawling

Problem: Download the whole web. Process the whole web.

Scenario: Fat servers

  • Reliable, expensive, high-end servers
  • => assume reliability => low fault tolerance in software
  • Local disks, no RAID
  • Partitioned by domain name part of URL
  • Hierarchical network (to compensate for lack of switch bandwidth)
  • Consequences: failures are more expensive
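
As a rough illustration of "partitioned by domain name part of URL", the sketch below maps each URL to a server index by hashing its host name, so every page of one domain lands on the same machine. The function name `partition_for`, the parameter `num_servers`, and the choice of SHA-1 as a stable hash are assumptions made for this example; the notes do not say how the partitioning was actually implemented.

```python
import hashlib
from urllib.parse import urlsplit


def partition_for(url: str, num_servers: int) -> int:
    """Return the index of the server responsible for this URL's domain.

    Hypothetical helper: only the host name takes part in the hash, so the
    scheme, path, and query string do not affect the assignment.
    """
    # urlsplit lowercases the host name; fall back to "" if there is none.
    host = urlsplit(url).hostname or ""
    # Use a stable hash (Python's built-in hash() is salted per process).
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


if __name__ == "__main__":
    for u in ("http://example.com/a", "https://example.com/b?x=1", "http://example.org/"):
        print(u, "->", partition_for(u, 8))
```

With a plain modulo like this, changing `num_servers` reassigns most domains, which fits the scenario's assumption of a small, stable set of reliable machines.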

Scenario: Skinny servers

  • Commodity, consumer grade, cheap servers
  • => assume frequent failures => fault tolerant software
  • Abundance of CPU and to some extent RAM
  • Communication-heavy, chatty
  • Consequences: more moving parts => higher management overhead (can compensate with automation such as Puppet or Chef)
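
One way to read "assume frequent failures => fault tolerant software" is that a task is never tied to a single machine: if one worker fails, the same task is handed to the next. The sketch below shows that idea in its simplest form. The names `run_with_failover`, `workers`, and the two toy workers are hypothetical and only stand in for real crawl nodes; a real system would also need timeouts, backoff, and idempotent tasks.

```python
from typing import Callable, Sequence


def run_with_failover(task: str, workers: Sequence[Callable[[str], str]]) -> str:
    """Try the task on each worker in turn and return the first success.

    Hypothetical helper: a worker signals failure by raising an exception.
    """
    errors = []
    for worker in workers:
        try:
            return worker(task)
        except Exception as exc:  # any failure means "try the next machine"
            errors.append(exc)
    raise RuntimeError(f"all {len(workers)} workers failed for {task!r}: {errors}")


def broken_worker(task: str) -> str:
    # Stands in for a commodity server that has died.
    raise ConnectionError("worker unreachable")


def healthy_worker(task: str) -> str:
    # Stands in for a server that completes the task.
    return f"fetched:{task}"


if __name__ == "__main__":
    print(run_with_failover("http://example.com/", [broken_worker, healthy_worker]))
```

The cost of this design is the one the notes point out: more machines and more retry paths mean more moving parts to manage.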