Sunday, 27 October 2013

Running Hadoop locally without installation

If we want to take Hadoop for a test-drive without installing the whole distribution, we can do it quite easily.

First of all, let's create a maven project with the following dependencies:

<dependency>
<groupId>org.apache.mahout.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>0.20.1</version>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
<version>1.0.13</version>
</dependency>
<dependency>
<groupId>commons-httpclient</groupId>
<artifactId>commons-httpclient</artifactId>
<version>3.1</version>
</dependency>
There is a known issue with running newer version on Windows, so the older one is chosen.
Cygwin is also required to be installed when running on Windows.

We will create a job to count the words in files (it's a well-known example taken from the official tutorial).

Our mapper would look like:

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable ONE = new IntWritable(1);
private Text word = new Text();
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, ONE);
}
}
}
The mapper splits the lines from file into words and pass on each word as the key with the value of one.

Here comes the reducer:

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
The reducer receives all values for given key and counts them.

All that is left is a main class that will run the job:

public class App {
public static void main(String[] args) throws Exception {
Job job = new Job();
job.setJarByClass(App.class);
FileInputFormat.addInputPath(job, new Path("src/main/resources"));
FileOutputFormat.setOutputPath(job, new Path("results/output-" + System.currentTimeMillis()));
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
}
view raw App.java hosted with ❤ by GitHub
We are setting job's mapper, reducer and classer for key and value.
Input and output paths are set as well.
You can run it directly and check the output file with the result.
The whole project can be found on github.