First of all, let's create a Maven project with the following dependencies:
<dependency>
    <groupId>org.apache.mahout.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.20.1</version>
</dependency>
<dependency>
    <groupId>ch.qos.logback</groupId>
    <artifactId>logback-classic</artifactId>
    <version>1.0.13</version>
</dependency>
<dependency>
    <groupId>commons-httpclient</groupId>
    <artifactId>commons-httpclient</artifactId>
    <version>3.1</version>
</dependency>
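These snippets belong inside the <dependencies> element of the pom.xml. For orientation, a minimal skeleton might look like this (the project coordinates are placeholders, pick your own):

<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <!-- placeholder coordinates -->
    <groupId>com.example</groupId>
    <artifactId>hadoop-wordcount</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <!-- the three dependencies listed above go here -->
    </dependencies>
</project>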
When running on Windows, Cygwin also needs to be installed, since Hadoop shells out to Unix utilities.
We will create a job that counts the words in a set of files (the well-known example from the official tutorial).
Our mapper looks like this:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    // Reused across calls to avoid allocating a new Text object per token.
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // The key is the byte offset of the line; the value is the line itself.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE); // emit (token, 1) for every word
        }
    }
}
Here comes the reducer:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // The framework has already grouped all counts emitted for this word; sum them up.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
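To make the data flow concrete: for an input line "hello world hello", the mapper emits one pair per token, the framework groups the pairs by key during the shuffle, and the reducer collapses each group into a single count. Roughly:

mapper output:   (hello, 1), (world, 1), (hello, 1)
after shuffle:   (hello, [1, 1]), (world, [1])
reducer output:  (hello, 2), (world, 1)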
All that is left is a main class that will run the job:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class App {

    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(App.class);

        // Read every file under src/main/resources; write to a fresh, timestamped
        // directory, since Hadoop refuses to reuse an existing output path.
        FileInputFormat.addInputPath(job, new Path("src/main/resources"));
        FileOutputFormat.setOutputPath(job, new Path("results/output-" + System.currentTimeMillis()));

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
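One optional tweak from the official tutorial: because summing counts is associative and commutative, the same reducer class can double as a combiner, pre-aggregating counts on the map side before they cross the shuffle:

// Optional: reuse the reducer as a map-side combiner.
job.setCombinerClass(WordCountReducer.class);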
The input path points at src/main/resources, so any text files placed there are counted, and each run writes to a new timestamped directory under results/.
You can run the class directly as a plain Java application, no cluster needed, and check the output file with the result.
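The result lands in a part-r-00000 file inside the output directory, one tab-separated word and count per line, sorted by key. For example (words and counts are illustrative):

hadoop	3
hello	2
world	1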
The whole project can be found on GitHub.