Last Updated: February 25, 2016 · sagarnikam123

Running WordCount on Hadoop using R script

A wordcount program using R on Apache Hadoop


Prerequisite



(Tip: this tutorial was prepared on an Ubuntu/Linux system)


STEP 1. check points


  • make sure Hadoop is running: type jps in a terminal
  • it should list the running processes: DataNode, NameNode, JobTracker, TaskTracker, SecondaryNameNode
  • R must be on your PATH; run
Rscript --version
# R scripting front-end version 3.0.0 (2013-04-03)
  • export HADOOP_HOME (path from this tutorial; adjust to your install)
export HADOOP_HOME=/home/trendwise/apache/hadoop-1.0.4/

STEP 2. writing mapper & reducer in R


mapper.R

#! /usr/bin/env Rscript

# mapper.R - Wordcount program in R
# script for Mapper (R-Hadoop integration)

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

## **** could work with a single readLines call, or in blocks
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    ## **** can be done as cat(paste(words, "\t1\n", sep=""), sep="")
    for (w in words)
        cat(w, "\t1\n", sep="")
}
close(con)
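As the comment inside the loop hints, the per-word for loop can be replaced by a single vectorized call: paste builds every `<word>\t1` record at once, and one cat emits them all. A minimal sketch (the sample words are illustrative):

```r
# Vectorized emit: build all "<word>\t1" records in one paste() call
words <- c("foo", "foo", "quux")
cat(paste(words, "\t1\n", sep = ""), sep = "")
```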

reducer.R

#! /usr/bin/env Rscript

# reducer.R - Wordcount program in R
# script for Reducer (R-Hadoop integration)

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}

env <- new.env(hash = TRUE)

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)

for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")
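The environment above acts as a hash table keyed by word, which is what a streaming reducer needs when it reads sorted input line by line. If all lines fit in memory, the same aggregation can be done vectorized with tapply; a sketch under that assumption (the sample lines are illustrative, not from the scripts):

```r
# Sum counts per word with tapply instead of an environment
lines <- c("bar\t1", "foo\t1", "foo\t1")
parts <- strsplit(lines, "\t")
words <- vapply(parts, `[`, "", 1)
counts <- as.integer(vapply(parts, `[`, "", 2))
totals <- tapply(counts, words, sum)  # named vector: one total per word
```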

STEP 3. checking mapper file


echo "foo foo quux labs foo bar quux" | Rscript mapper.R 
  • on a test file
cat '/home/trendwise/Desktop/Learn/RHadoop/inputFile' | Rscript mapper.R

STEP 4. checking / running on the command line with separate mapper and reducer (run using R)


echo "foo foo quux labs foo bar quux" | Rscript mapper.R  | sort -k1,1 | Rscript reducer.R
  • on a test file
cat inputFile | Rscript mapper.R | sort | Rscript reducer.R
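Since the sample input is fixed, the expected counts can be verified without Hadoop by simulating the mapper | sort | reducer pipeline in plain R (a sketch using the same splitting regex as mapper.R):

```r
# Simulate mapper | sort | reducer for the sample line
input <- "foo foo quux labs foo bar quux"
words <- unlist(strsplit(input, "[[:space:]]+"))
mapped <- sort(paste(words, "1", sep = "\t"))  # mapper output, shuffled/sorted
totals <- table(sub("\t1$", "", mapped))       # reducer: sum per word
print(totals)
# bar=1, foo=3, labs=1, quux=2
```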

Run using Hadoop-R


STEP 5. copy any text file to HDFS


cd $HADOOP_HOME
bin/hadoop dfs -copyFromLocal '/home/trendwise/apache/hadoop-1.0.4/README.txt'  /readme
bin/hadoop dfs -ls /

STEP 6. running MapReduce scripts on Hadoop


bin/hadoop jar /home/trendwise/apache/hadoop-1.0.4/contrib/streaming/hadoop-streaming-1.0.4.jar \
-file  /home/trendwise/Desktop/Learn/RHadoop/mapper.R  -mapper /home/trendwise/Desktop/Learn/RHadoop/mapper.R \
-file /home/trendwise/Desktop/Learn/RHadoop/reducer.R  -reducer /home/trendwise/Desktop/Learn/RHadoop/reducer.R \
-input /readme -output /RCount

STEP 7. view WordCount output


bin/hadoop fs -cat /RCount/part-00000

STEP 8. copy output to local filesystem


bin/hadoop dfs -get /RCount/part-00000 /home/trendwise/Desktop/Learn/RHadoop/wcOutput.txt

STEP 9. view wcOutput.txt in any editor

