Advanced Topics

Cloud Computing

Service Models

  • IaaS (Infrastructure as a Service): Virtual machines, storage, networking (AWS EC2, Google Compute Engine)
  • PaaS (Platform as a Service): Managed runtime environments (Google App Engine, Heroku)
  • SaaS (Software as a Service): Fully managed applications (Gmail, Office 365)

Key Concepts

  • Virtualization: Hypervisors enable multiple VMs on one physical machine
  • Containers: Lightweight isolation that shares the host kernel, so startup is faster than for VMs (e.g., Docker, with Kubernetes for orchestration)
  • Elasticity: Automatic scaling based on demand
  • Multi-tenancy: Multiple customers share the same infrastructure
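
As a minimal sketch of elasticity, the function below derives a desired replica count from observed load with a proportional rule, similar in spirit to the rule documented for the Kubernetes Horizontal Pod Autoscaler. The function name, the percentage-based load metric, and the min/max bounds are assumptions made for illustration, not part of any particular cloud API.

```python
import math

def desired_replicas(current, load_percent, target_percent,
                     min_replicas=1, max_replicas=100):
    """Scale the replica count in proportion to load / target (hypothetical helper)."""
    raw = math.ceil(current * load_percent / target_percent)
    # Clamp so the service never scales to zero or beyond its budget.
    return max(min_replicas, min(max_replicas, raw))

scale_up = desired_replicas(4, load_percent=90, target_percent=60)
scale_down = desired_replicas(4, load_percent=30, target_percent=60)
capped = desired_replicas(4, load_percent=300, target_percent=60, max_replicas=10)
```

A real autoscaler would also smooth the metric over time and add a cooldown between scaling actions to avoid oscillation.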

MapReduce

Programming model for processing large datasets on clusters.

Paradigm

Map:    (key1, value1) → [(key2, value2)]
Reduce: (key2, [value2]) → [(key3, value3)]
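
The two signatures above can be illustrated with word counting. The following is a single-process sketch, not a cluster implementation: `run_mapreduce` is a hypothetical driver that runs the map phase, groups intermediate pairs by key (the shuffle), and then runs the reduce phase.

```python
from collections import defaultdict
from itertools import chain

def map_fn(doc_id, text):
    # Map: (key1, value1) -> [(key2, value2)]
    return [(word.lower(), 1) for word in text.split()]

def reduce_fn(word, counts):
    # Reduce: (key2, [value2]) -> [(key3, value3)]
    return [(word, sum(counts))]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    intermediate = chain.from_iterable(map_fn(k, v) for k, v in inputs)
    # Shuffle phase: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reduce_fn call per distinct key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog and the fox")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

On a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the user still supplies only the two functions.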

Key Features

  • Automatic parallelization across cluster
  • Fault tolerance via data replication and task re-execution
  • Locality optimization: move computation to data

Ecosystem

  • Hadoop: Open-source MapReduce implementation
  • Spark: In-memory processing built on the RDD abstraction; typically faster than Hadoop MapReduce for iterative algorithms
  • HDFS: Distributed filesystem for large-scale data storage

Fault Tolerance

Types of Failures

  • Fail-stop: Processor ceases operation, detectable
  • Byzantine: Arbitrary behavior, possibly malicious (e.g., sending conflicting messages to different nodes)
  • Transient: Temporary (e.g., bit flip), may self-correct

Techniques

  • Checkpoint/Restart: Periodically save state, restart from last checkpoint on failure
  • Replication: Keep multiple copies of data/tasks (e.g., 3-way replication in HDFS)
  • Consensus Protocols (Paxos, Raft): Ensure that nodes agree on a value even when a minority of them crash
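
Checkpoint/restart can be sketched in a few lines. The example below is an illustration under stated assumptions: the state is a small JSON-serializable dictionary, the function names and the `fail_at` parameter are hypothetical, and the failure is simulated with an exception rather than a real crash.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    # Write to a temporary file, then rename, so a crash in the middle
    # of a write never leaves a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}  # fresh start

def run(path, n_steps, checkpoint_every=10, fail_at=None):
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    for step in range(state["step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated fail-stop failure")
        state["total"] += step       # the actual "work"
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]

ckpt_path = os.path.join(tempfile.mkdtemp(), "state.json")
try:
    run(ckpt_path, 100, fail_at=57)   # crashes; last checkpoint is at step 50
except RuntimeError:
    pass
resumed_total = run(ckpt_path, 100)   # redoes steps 50..56, then finishes
```

The work done between the last checkpoint and the failure is lost and recomputed, which is the trade-off behind choosing the checkpoint interval.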

Distributed Machine Learning

Data Parallelism

Each worker holds a full copy of the model and trains on its own data shard; the workers' gradients are then aggregated (typically averaged) before each parameter update.
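
A minimal single-process sketch of this scheme follows, assuming a one-parameter linear model y ≈ w·x with squared-error loss and equally sized shards. The function names are illustrative, and the "workers" are simulated by a loop rather than by separate processes.

```python
def gradient(w, shard):
    # Gradient of the mean squared error for y ≈ w * x over one shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr):
    # Every worker starts from the same w and sees only its own shard.
    local_grads = [gradient(w, shard) for shard in shards]
    # Aggregation: average the workers' gradients (an all-reduce mean).
    avg_grad = sum(local_grads) / len(local_grads)
    return w - lr * avg_grad

data = [(x, 3.0 * x) for x in range(1, 9)]   # true slope is 3
shards = [data[i::4] for i in range(4)]      # 4 workers, 2 samples each

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)
```

Because the shards are the same size, the averaged gradient equals the gradient over the full dataset, so the result matches single-worker training on all the data.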

Model Parallelism

Different parts of the model reside on different workers. Used for very large models.
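
As a rough illustration, the sketch below splits a two-layer network so that each simulated worker stores only its own layer's weights and passes activations to the next. The `Stage` class, the worker names, and the weight values are made up for the example; real systems place stages on separate devices and transfer activations between them.

```python
def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def relu(vec):
    return [max(0.0, v) for v in vec]

class Stage:
    """One worker holding only its own slice of the model (hypothetical)."""
    def __init__(self, name, weights):
        self.name = name
        self.weights = weights

    def forward(self, activations):
        return relu(matvec(self.weights, activations))

W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0]]  # layer 1: 2 inputs -> 3 units
W2 = [[1.0, 1.0, 1.0]]                       # layer 2: 3 units -> 1 output
stages = [Stage("worker0", W1), Stage("worker1", W2)]

def pipeline_forward(stages, x):
    for stage in stages:
        # In a real deployment this hand-off is a device or network transfer.
        x = stage.forward(x)
    return x

y = pipeline_forward(stages, [2.0, 1.0])
```

No single worker ever needs memory for both weight matrices, which is the point of the technique when the model does not fit on one device.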

Parameter Server Architecture

  • Parameter servers: Store shared model parameters
  • Workers: Compute gradients on local data
  • Communication: Push gradients, pull updated parameters
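
The push/pull pattern can be sketched with two small classes. This is a synchronous, single-process simulation under simplifying assumptions: the class and method names are illustrative, the model is a line y ≈ w·x + b, and workers take turns instead of communicating over a network.

```python
class ParameterServer:
    """Stores the shared parameters and applies pushed gradients."""
    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr

    def pull(self):
        return list(self.params)  # workers receive a copy

    def push(self, grads):
        self.params = [p - self.lr * g for p, g in zip(self.params, grads)]

class Worker:
    """Computes gradients on its local shard only."""
    def __init__(self, server, shard):
        self.server = server
        self.shard = shard

    def step(self):
        w, b = self.server.pull()                      # pull current parameters
        n = len(self.shard)
        grad_w = sum(2 * (w * x + b - y) * x for x, y in self.shard) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in self.shard) / n
        self.server.push([grad_w, grad_b])             # push local gradients

# Noise-free data from y = 2x + 1, with x in [-1, 1].
data = [(k / 4, 2 * (k / 4) + 1) for k in range(-4, 5)]
server = ParameterServer([0.0, 0.0], lr=0.1)
workers = [Worker(server, data[i::3]) for i in range(3)]

for _ in range(300):
    for worker in workers:
        worker.step()

w_fit, b_fit = server.pull()
```

Production parameter servers usually shard the parameters across several server nodes and often let workers push asynchronously, trading gradient staleness for throughput.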

Frameworks

  • Horovod (MPI-based, ring all-reduce)
  • PyTorch Distributed (torch.distributed)
  • TensorFlow distribution strategies (tf.distribute.Strategy)
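
Ring all-reduce, mentioned above for Horovod, can be simulated in plain Python to show the communication pattern. This sketch does not use Horovod's API; the function name is hypothetical and it assumes the vector length is divisible by the number of workers. Each worker sends one chunk to its right-hand neighbor per step: n-1 steps to reduce, then n-1 steps to share the results.

```python
def ring_all_reduce(worker_vectors):
    n = len(worker_vectors)
    length = len(worker_vectors[0])
    assert length % n == 0, "sketch assumes the vector splits into n equal chunks"
    size = length // n
    # bufs[i][c] is worker i's current copy of chunk c.
    bufs = [[list(vec[c * size:(c + 1) * size]) for c in range(n)]
            for vec in worker_vectors]

    # Phase 1 (scatter-reduce): after n-1 steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        # Collect all messages first so that the sends are simultaneous.
        msgs = [((i - step) % n, list(bufs[i][(i - step) % n])) for i in range(n)]
        for i, (c, chunk) in enumerate(msgs):
            dst = (i + 1) % n
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], chunk)]

    # Phase 2 (all-gather): circulate the finished chunks around the ring.
    for step in range(n - 1):
        msgs = [((i + 1 - step) % n, list(bufs[i][(i + 1 - step) % n]))
                for i in range(n)]
        for i, (c, chunk) in enumerate(msgs):
            bufs[(i + 1) % n][c] = chunk

    # Flatten each worker's chunks back into one vector.
    return [[x for chunk in worker for x in chunk] for worker in bufs]

grads = [[w * 10 + k for k in range(8)] for w in range(4)]  # 4 workers
reduced = ring_all_reduce(grads)
```

Each worker sends and receives about 2(n-1)/n times the vector size in total, nearly independent of the number of workers, which is why the pattern scales well for gradient aggregation.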

High Performance Computing (HPC)

TOP500

Ranks the 500 most powerful supercomputers in the world by Rmax, the maximum performance achieved on the High-Performance Linpack (HPL) benchmark.

Trends

  • Heterogeneous computing (CPU + GPU + FPGA)
  • Exascale computing (10^18 flop/s)
  • Convergence of HPC and AI workloads
  • Energy efficiency as a primary design constraint

Key Units

Prefix   Flop/s   Bytes
Mega     10^6     2^20
Giga     10^9     2^30
Tera     10^12    2^40
Peta     10^15    2^50
Exa      10^18    2^60
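
The decimal (flop/s) and binary (bytes) readings of each prefix drift apart as the prefix grows. The short sketch below computes that ratio and an idealized runtime from the exponents listed above; the helper names are illustrative, and the runtime ignores every real-world inefficiency.

```python
# prefix -> (decimal exponent used for flop/s, binary exponent used for bytes)
PREFIXES = {
    "Mega": (6, 20),
    "Giga": (9, 30),
    "Tera": (12, 40),
    "Peta": (15, 50),
    "Exa": (18, 60),
}

def binary_to_decimal_ratio(prefix):
    dec_exp, bin_exp = PREFIXES[prefix]
    return 2 ** bin_exp / 10 ** dec_exp

def seconds_to_run(total_flop, rate_prefix, rate_value=1.0):
    """Ideal runtime for total_flop operations at rate_value <prefix>flop/s."""
    dec_exp, _ = PREFIXES[rate_prefix]
    return total_flop / (rate_value * 10 ** dec_exp)

ratios = {name: binary_to_decimal_ratio(name) for name in PREFIXES}
petascale_seconds = seconds_to_run(10 ** 18, "Peta")
```

The gap is under 5% at Mega but about 15% at Exa, so it matters which convention a quoted capacity uses. A workload of 10^18 floating-point operations takes 1000 seconds at 1 Pflop/s and 1 second at 1 Eflop/s in this idealized model.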