Hadoop Installation in AWS EC2
After working with Cloudera’s local VM image, I decided to set up Hadoop on AWS EC2. I found two existing blog posts on the topic, but neither was complete on its own, and I needed additional guidance to get a working setup.
Here’s the workflow using Apache Whirr to automate cluster deployment:
Seven-Step Installation Workflow
- Obtain the latest Whirr binary — Download from the Apache Whirr project page
- Configure Whirr settings — Set up the config file with your AWS credentials (access key and secret key) and the region to deploy to
- Install Java — Ensure the correct JDK version is available on your local machine
- Generate an SSH key pair — Whirr uses SSH to provision and communicate with cluster nodes, so it needs both the private and the public key
- Launch the cluster — Run Whirr's launch-cluster command with your config file and wait for the node access instructions to appear in the output
- Connect to cluster nodes via SSH — Use the connection strings Whirr provides
- Verify Hadoop functionality — Run a sample MapReduce job to confirm the setup is working
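To make the configuration, key generation, and launch steps above more concrete, here is a minimal Python sketch that renders a Whirr properties file and prints the two commands you would then run. It is an illustration under assumptions, not the exact setup from the original walkthrough: the cluster name, worker count, region, key path `~/.ssh/id_rsa_whirr`, file name `hadoop.properties`, and both helper functions are hypothetical. The property keys follow the Whirr quick-start conventions as I understand them, and `whirr.location-id` in particular is worth checking against the Whirr version you download.

```python
import shlex
from pathlib import Path

# Hypothetical defaults for illustration; adjust to your own setup.
KEY_PATH = Path.home() / ".ssh" / "id_rsa_whirr"
CONFIG_NAME = "hadoop.properties"


def render_whirr_properties(cluster_name, workers, region):
    """Return the text of a Whirr properties file for a small Hadoop cluster."""
    settings = {
        "whirr.cluster-name": cluster_name,
        # One master node plus `workers` worker nodes.
        "whirr.instance-templates": (
            "1 hadoop-namenode+hadoop-jobtracker,"
            f"{workers} hadoop-datanode+hadoop-tasktracker"
        ),
        "whirr.provider": "aws-ec2",
        # Read the credentials from environment variables instead of
        # writing secrets into the file.
        "whirr.identity": "${env:AWS_ACCESS_KEY_ID}",
        "whirr.credential": "${env:AWS_SECRET_ACCESS_KEY}",
        # Assumed key for the region; verify against your Whirr version.
        "whirr.location-id": region,
        "whirr.private-key-file": str(KEY_PATH),
        "whirr.public-key-file": f"{KEY_PATH}.pub",
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


def planned_commands(whirr_home, config_name=CONFIG_NAME):
    """Return the key-generation and launch commands as argument lists."""
    # Whirr expects a key pair without a passphrase, hence the empty -P value.
    keygen = ["ssh-keygen", "-t", "rsa", "-P", "", "-f", str(KEY_PATH)]
    launch = [f"{whirr_home}/bin/whirr", "launch-cluster", "--config", config_name]
    return [keygen, launch]


if __name__ == "__main__":
    print(render_whirr_properties("myhadoopcluster", workers=2, region="us-east-1"))
    for command in planned_commands("whirr"):
        print(shlex.join(command))
```

The script only prints the file contents and commands rather than executing anything, so you can review them before any EC2 instances are started.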
Key Tools & Resources
- Apache Whirr — Automates cluster provisioning on cloud platforms
- Cloudera’s blog — Provided a foundation but had gaps that required workarounds
- Additional troubleshooting documentation filled in the missing pieces
While existing documentation provided a starting point, manual adjustments were necessary for a successful EC2 deployment. Documenting those adjustments is the whole point of this post — hopefully it saves someone else the debugging time.