Hadoop Installation on AWS EC2

After working with Cloudera’s local VM image, I decided to set up Hadoop on AWS EC2. I found two existing blog posts on the topic, but neither was complete enough on its own: each needed supplementary guidance before I had a working setup.

Here’s the workflow using Apache Whirr to automate cluster deployment:

Seven-Step Installation Workflow

  1. Obtain the latest Whirr binary: download it from the Apache Whirr project page
  2. Configure Whirr settings: put your AWS credentials (access key and secret key) and the target region in the Whirr properties file
  3. Install Java: make sure a JDK version supported by your Whirr release is available on your local machine
  4. Generate an SSH key pair: Whirr uses SSH to provision and communicate with cluster nodes
  5. Launch the cluster: run Whirr and wait for the node access instructions to appear in the output
  6. Connect to cluster nodes via SSH: use the connection strings Whirr provides
  7. Verify Hadoop functionality: run a sample MapReduce job to confirm the setup is working
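
To make steps 2, 4, and 5 more concrete, here is a minimal Python sketch that renders a Whirr properties file and prints the commands to run by hand. It is an illustration under assumptions, not the exact setup from this post: the property names and the `launch-cluster` command follow the Whirr quick-start documentation as I recall it, so check them against your Whirr release. The cluster name, region, worker count, key file name, and `hadoop.properties` file name are all hypothetical placeholders.

```python
import shlex
from pathlib import Path


def render_whirr_properties(cluster_name="myhadoopcluster", workers=1,
                            region="us-east-1", key_name="id_rsa_whirr"):
    """Return the text of a Whirr properties file for a small Hadoop cluster.

    All default values are placeholders. The AWS keys are read by Whirr from
    environment variables, so they never appear in the file itself.
    """
    # Whirr's own ${sys:...} substitution resolves the home directory.
    key_file = "${sys:user.home}/.ssh/" + key_name
    lines = [
        "whirr.cluster-name=" + cluster_name,
        # One master node plus `workers` worker nodes.
        "whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,"
        + str(workers) + " hadoop-datanode+hadoop-tasktracker",
        "whirr.provider=aws-ec2",
        "whirr.identity=${env:AWS_ACCESS_KEY_ID}",
        "whirr.credential=${env:AWS_SECRET_ACCESS_KEY}",
        "whirr.location-id=" + region,
        "whirr.private-key-file=" + key_file,
        "whirr.public-key-file=" + key_file + ".pub",
    ]
    return "\n".join(lines) + "\n"


def keygen_command(key_name="id_rsa_whirr"):
    """Build an ssh-keygen command for a passphrase-less RSA key pair."""
    key_path = Path.home() / ".ssh" / key_name
    return ["ssh-keygen", "-t", "rsa", "-P", "", "-f", str(key_path)]


def launch_command(config_file="hadoop.properties"):
    """Build the Whirr launch command, run from the unpacked Whirr directory."""
    return ["bin/whirr", "launch-cluster", "--config", config_file]


if __name__ == "__main__":
    # Nothing is executed or written here; the output is meant for review.
    print(render_whirr_properties(workers=2))
    print(shlex.join(keygen_command()))
    print(shlex.join(launch_command()))
```

The sketch only prints text, so nothing is provisioned until you save the properties file and run the commands yourself with `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` exported in your shell.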

Key Tools & Resources

  • Apache Whirr: automates cluster provisioning on cloud platforms
  • Cloudera’s blog: provided a foundation but had gaps that required workarounds
  • Additional troubleshooting documentation: filled in the missing pieces

While existing documentation provided a starting point, manual adjustments were necessary for a successful EC2 deployment. Documenting those adjustments is the whole point of this post — hopefully it saves someone else the debugging time.