R programming: May 2017

Connect R with Google Analytics

Analysts use Google Analytics (GA) for many purposes, such as finding out who visits their websites and at what time of day, or which keywords are entered most often in a page's search box. It is very helpful for analysts and other professionals if they can import GA data directly into R for further analysis.
Method 1

install.packages("RGoogleAnalytics")
library(RGoogleAnalytics)

## This step need not be run in every session, because the token is saved in the working directory of R on your computer

token <- Auth(client.id="client Id",client.secret="Client Secret")
save(token,file="token_file")

## In future sessions it can be loaded as follows
load("./token_file")
ValidateToken(token)
query.list <- Init(start.date = "2017-05-30",
                   end.date = "2017-05-31",
                   dimensions = "ga:date,ga:hour",
                   metrics = "ga:sessions,ga:pageviews",
                   max.results = 100000,
                   sort = "-ga:date",
                   table.id = "ga:TABLE_ID_NUMBER")
## The table ID is in the URL of your Google Analytics page. It is everything after the "p" in the URL. For example:
## https://www.google.com/analytics/web/?hl=en#management/Setting/a48963421w80588688pTABLE_ID_NUMBER

ga.query <- QueryBuilder(query.list)
ga.data <- GetReportData(ga.query, token, split_daywise = TRUE, delay = 5)

The data are saved in the data frame ga.data.

R connectivity with Oracle

R can connect to different databases such as Oracle, Teradata and Netezza.
Here I explain connectivity with Oracle.

How to connect R with Oracle

##Step1: Install RJDBC package in R

install.packages('RJDBC')

library(RJDBC)

##Step 2: Download the Oracle JDBC driver.
##Go to http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html.

Download the ojdbc6.jar file and place it in a permanent directory.

##Step 3: Create a Driver Object in R.

jdbcDriver = JDBC("oracle.jdbc.OracleDriver", classPath="/directory/ojdbc6.jar")

##Step 4: Create a connection to the Oracle database.
##Use @//host:port/service_name to connect by service name, or @host:port:sid to connect by SID.
jdbcConnection = dbConnect(jdbcDriver, "jdbc:oracle:thin:@//database.hostname.com:port/service_name", "username", "password")

##Step 5: Run Oracle SQL queries.
##dbReadTable: read a table into a data frame

df1 = dbReadTable(jdbcConnection, 'PC_ITEM')

##dbGetQuery: read the result of a SQL statement into a data frame

df2 = dbGetQuery(jdbcConnection, 'select * from tabl where to_number(colname)<10')

##dbWriteTable: write a data frame to the schema. It is typically very slow with large tables.

dbWriteTable(jdbcConnection, 'TableName', dataframe)

Functions used in R

R has a huge list of functions. Below are the functions most commonly used in day-to-day work in R.
We will use the following objects to test the functions.
m<-matrix(1:12,6,2)
a<-array(1:8,c(2,2,2))
d<-data.frame("Amy",1001,c(78,45,89,78,67))

##1. dim function

It is used to check the dimensions of an object such as a matrix, array or data frame. For a plain vector, dim() returns NULL because a vector has no dimensions; use length() instead.

dim(m)
[1] 6 2

dim(a)
[1] 2 2 2

dim(d)
[1] 5 3

##2. head(obj,n) function

It is used to print the first n rows of an object such as a matrix, array or data frame. By default n is 6, so head(m) shows the first six rows of the matrix.
head(m, 2)
     [,1] [,2]
[1,]    1    7
[2,]    2    8

##3. tail(obj,n) function

It is used to print the last n rows of an object such as a matrix, array or data frame. By default n is 6, so tail(m) shows the last six rows of the matrix.
tail(m, 2)
     [,1] [,2]
[5,]    5   11
[6,]    6   12

##4. str(object)

It is used to check the structure of any object. For m it reports that m is an integer matrix with 6 rows and 2 columns, and it also displays the first few values stored in it.
str(m)
 int [1:6, 1:2] 1 2 3 4 5 6 7 8 9 10 ...

##5. sort(object, decreasing=FALSE/TRUE)

The sort function sorts the data of an object in ascending (the default) or descending order.
v<-c(9,1, 3, -4,0,-9)
sort(v)
[1] -9 -4 0 1 3 9

##6. order(object, decreasing=FALSE)

The order function returns the index positions that would arrange the object in ascending or descending order.
order(c(4,2,7,1,3,9,10,16,13))
[1] 4 2 5 1 3 6 7 9 8
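To see how order() relates to sort(), here is a minimal sketch. The vector name scores is only an illustration and is not part of the examples above.

```r
scores <- c(4, 2, 7, 1, 3)

idx <- order(scores)              # positions that would sort the vector: 4 2 5 1 3
sorted_by_index <- scores[idx]    # indexing with those positions gives the sorted values

# Same idea in descending order
desc <- scores[order(scores, decreasing = TRUE)]
```

order() is most useful when one vector decides the ordering of something else, for example the rows of a data frame.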

##7. split(x,f)

The split function divides the data into groups as defined by f.
library(ISwR) ## the energy dataset comes from the ISwR package
data(energy)
head(energy)
  expend stature
1   9.21   obese
2   7.53    lean
3   7.48    lean
4   8.08    lean
5   8.09    lean
6  10.15   obese
split(energy$expend, energy$stature)
$lean
[1] 7.53 7.48 ...
$obese
[1] 9.21 10.15 ...

##8. unique(object)

The unique function returns the distinct values inside an object.
unique(c(1,1,1,2,2,3,3,3,4,4,4))
[1] 1 2 3 4

##9. paste(vector1, vector2, sep= , collapse= )

paste concatenates the two vectors element by element: the first element of vector1 is joined to the first element of vector2 with the value of sep placed between them, and so on. If collapse is given, the resulting strings are then joined together with the value of collapse placed between them, so the output is a one-element character vector.
part1<-c("M","na","i", "Te")
part2<-c("y","me","s","st")
paste(part1,part2,sep="" ,collapse=" ")
[1] "My name is Test"
paste(part1,part2,sep="." ,collapse="-")
[1] "M.y-na.me-i.s-Te.st"
part1<-c(1,3,5,7)
part2<-c(2,4,6,8)
paste(part1,part2,sep="" ,collapse="")
[1] "12345678"

Longitudinal Data Analysis

What is longitudinal data
It is a collection of a few observations over time from many sources, such as blood pressure measurements taken during a marathon (1 hour) for many people. It differs from time series data in duration and source: time series data is a long sequence of observations from a single source.

Case Study
install.packages("nlme")
library(nlme)
## We will do the analysis on the Orthodont data. It is a study of 27 children (16 boys and 11 girls). The response is the distance from the centre of the pituitary gland to the pterygomaxillary fissure. There are four measurements per child, at ages 8, 10, 12 and 14.
head(Orthodont,4)
  distance age Subject  Sex
1       26   8     M01 Male
2       25  10     M01 Male
3       29  12     M01 Male
4       31  14     M01 Male

## Questions to answer:
1) Whether distances over time are larger for boys than for girls.
2) Whether the rate of change of distance over time is similar for boys and girls.

## Step 1: Default plot of the grouped data
plot(Orthodont)
## Step 2: Create scatter plot
plot(distance~age, data=Orthodont,
     ylab="distance",
     xlab="age")
## Step 3: Create scatter plot with smoother
with(Orthodont, scatter.smooth(age, distance, col="blue",
     ylab="distance", xlab="age", lpars=list(col="red",lwd=3)))

## Step 4: Fit a separate linear model for each subject
fm1<-lmList(distance ~ age | Subject, Orthodont)
## Step 5: Plot the confidence intervals of the per-subject coefficients
plot(intervals(fm1))

## Step 6: Create box plot
library(lattice)
bwplot(distance~as.factor(age)|Sex, data=Orthodont,
       ylab="Distance",
       xlab="Age in years (8, 10, 12, 14)")

Analysis:
1) The trajectory of distance is approximately a linear function of age.
2) The trajectories vary between children.
3) The distance measurement increases with age.
4) The distance trajectories for boys are higher on average than those for girls.
5) There is a population trend as well as subject specific variation in the data.
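The plots only suggest answers to the two questions. One possible next step, which is my own sketch and not part of the analysis above, is a linear mixed-effects model with lme() from nlme. The model formula (age, Sex, their interaction and a random intercept per Subject) is an assumption chosen for illustration.

```r
library(nlme)

# Assumed model: population trend in age, a Sex effect, a different slope per Sex,
# and a random intercept for every child
fit <- lme(distance ~ age * Sex, data = Orthodont, random = ~ 1 | Subject)

fe <- fixef(fit)   # population-level (fixed effect) estimates
fe
```

The SexFemale term relates to question 1 (difference in level between girls and boys) and the age:SexFemale term relates to question 2 (difference in rate of change). summary(fit) gives the standard errors needed to judge them.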

R List

WHAT IS A LIST
A list is an object that can store objects of any type and dimension.
A list can store a vector, matrix, array, data frame or another list inside it, with no limit on size. Accessing a list with single square brackets, i.e. [ ], returns a list. Accessing it with [[ ]] returns the element itself, in the form in which it was inserted (for example a vector or a matrix).
METHOD OF CREATION
## A list can be created with the list() function
l1<-list(name="Amy",num=1001,marks=c(70,75,80,68,79),mat=matrix(1:10,2,5))
l2<-list(name="Sam",num=1001,marks=c(56,78,76,69,89),mat=matrix(1:10,2,5))
l3<-list(name="Dan",num=1001,marks=c(69,86,75,87,65),mat=matrix(1:10,2,5))
## We have created three lists, each with four elements: name, num, marks and mat. name is a character string, num is a number, marks is a vector and mat is a matrix. A list is like a structure that stores different data types in one place. Now we can make a composite list consisting of the above three lists.
l<-list(Amy=l1,Sam=l2,Dan=l3)
ACCESSING LIST
l1[1] ## returns the first element of list l1 as a list.
$name
[1] "Amy"
l1[2] ## returns the second element of list l1 as a list.
$num
[1] 1001
l1[3] ## returns the third element of list l1 as a list.
$marks
[1] 70 75 80 68 79
## To access the first element of marks (marks is a vector), first use [[ ]]. l1[[3]] simplifies the output to a vector; adding another set of square brackets with index 1 then gives the first element of that vector.
l1[[3]]   ## returns the third element of the list as a vector.
[1] 70 75 80 68 79
l1[[3]][1] ## returns the first element of the third element of the list.
[1] 70
l1[[4]] ## returns the fourth element of the list as a matrix.
       [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
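Since the elements of l1 are named, they can also be accessed by name. A minimal sketch of the difference between [ ], [[ ]] and $, using a shortened copy of l1:

```r
l1 <- list(name = "Amy", num = 1001, marks = c(70, 75, 80, 68, 79))

sub_list  <- l1["marks"]     # single bracket: still a list, of length 1
marks_vec <- l1[["marks"]]   # double bracket: the numeric vector itself
same_vec  <- l1$marks        # $ is shorthand for [[ ]] with a name

class(sub_list)    # "list"
class(marks_vec)   # "numeric"
```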

R timeseries data

WHAT IS TIME SERIES DATA
Any data that is indexed by time is time series data. For example, the sales data for the 12 months of a year is time series data.
Time Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Data  45  67  89  34  12  56  78  89  91  92  68  72
TIME SERIES DATA CREATION
R has a special data structure for storing time series data. It is created by the ts() function.
ts(data, start, end, frequency)
data      : data passed in the form of a vector
start     : start of the time series
end       : end of the time series (optional)
frequency : number of observations per unit of time (here, per year)
          : 1 for yearly data, i.e. 1 observation per year
          : 2 for half-yearly data, i.e. 2 observations per year
          : 4 for quarterly data, i.e. 4 observations per year
          : 12 for monthly data, i.e. 12 observations per year
          : 24 for fortnightly (15-day) data, i.e. 24 observations per year
          : 52 for weekly data, i.e. 52 observations per year
          : 365 for daily data
EXAMPLES
1) ts(1:10,start=2000, frequency=1) ## yearly data
Time Series:
Start = 2000
End = 2009
Frequency = 1
[1] 1 2 3 4 5 6 7 8 9 10
2) ts(1:12,start=2000, frequency=4) ## quarterly data
Qtr1 Qtr2 Qtr3 Qtr4
2000    1    2    3    4
2001    5    6    7    8
2002    9   10   11   12
3) ts(1:12,start=2000, frequency=12 ) ## monthly data
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2000   1   2   3   4   5   6   7   8   9 10 11 12
4) ts(1:24,start=2000, frequency=24) ## fortnight data
Time Series:
Start = c(2000, 1)
End = c(2000, 24)
Frequency = 24
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
5) ts(1:52,start=2000, frequency=52) ## weekly data
Time Series:
Start = c(2000, 1)
End = c(2000, 52)
Frequency = 52
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
[38] 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
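As a further sketch, the monthly sales figures from the table at the top of this post can be stored with ts() and then sliced by date with window(). The year 2017 is my assumption; the table does not say which year the sales belong to.

```r
# Monthly sales from the table above; the start year 2017 is assumed
sales <- ts(c(45, 67, 89, 34, 12, 56, 78, 89, 91, 92, 68, 72),
            start = c(2017, 1), frequency = 12)

q2 <- window(sales, start = c(2017, 4), end = c(2017, 6))  # April to June
freq <- frequency(sales)                                   # observations per year
last_period <- end(sales)                                  # year and month of the last observation
```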

R Data Frame

WHAT IS A DATA FRAME
A data frame is a 2-dimensional data structure in which each column can hold a different type of data: numeric, integer, character, logical or complex. Functions such as read.table() and read.csv() return a data frame when a file is loaded into R. A data frame gives a table-like structure, similar to a table in a relational database, but R does not enforce any key constraints.
DATA FRAME CREATION METHOD
## Method 1: By function data.frame
child<-c("Joe","Amy","John")
age<-c(8,9,10)
class<-c(4,5,6)
childdata<-data.frame(child,age,class,stringsAsFactors=FALSE)
childdata
  child age class
1   Joe   8     4
2   Amy   9     5
3  John  10     6

## Method 2: By loading a data file.
## Download ozone.csv from the following link and save it in C:/R with the name ozone.csv. Load this data into R using the following code.
airquality<-read.table("C:/R/ozone.csv",header=TRUE, sep=",")
## We get a data frame named airquality.
ACCESSING DATA FRAME
A data frame is a 2-dimensional structure like a matrix. The only difference is that a matrix stores a single type of data, whereas each column of a data frame can store a different type.
airquality[1,1] ## returns the first element of the first row of the data frame.
airquality[1, ]   ## returns the first row of the data frame.
airquality[ ,1]   ## returns the first column of the data frame.
airquality[1:2,1:4] ## returns the first two rows and first four columns of the data.
COLUMN NAMES OF DATA FRAME
While loading data into R, set header=TRUE if the first row of the file contains the column names. Each column of a data frame can be accessed by placing $ before the column name.
airquality$Ozone ## returns the ozone column of data
airquality$Solar.R ## returns the solar.R column of data
airquality$Temp     ## returns the Temp column of data
QUESTIONS-1
1) Extract the first two rows of the data frame
airquality[1:2,]
2) How many observations are in this data frame?
nrow(airquality)
3) What is the value of Ozone in the 47th row?
airquality$Ozone[47]
4) Extract the rows where the Ozone value is above 31 and the Temp value is above 90.
airquality[which(airquality$Temp>90 & airquality$Ozone>31),1:6] ## & compares element by element; which() drops rows where the comparison is NA
5) Take the mean of Solar.R using the function mean.
mean(airquality$Solar.R, na.rm=TRUE) ## na.rm=TRUE ignores missing values, if any
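R ships with a built-in dataset that is also called airquality and has the columns Ozone, Solar.R, Wind, Temp, Month and Day. Assuming ozone.csv holds the same data (an assumption, since the file itself is not shown here), the questions above can be tried without downloading anything:

```r
data(airquality)   # built-in dataset, used here as a stand-in for ozone.csv

n_obs <- nrow(airquality)                      # number of observations

# Element-wise & builds one TRUE/FALSE per row; which() discards rows with NA
hot <- airquality[which(airquality$Temp > 90 & airquality$Ozone > 31), ]

mean_solar <- mean(airquality$Solar.R, na.rm = TRUE)   # skip missing readings
```

Without na.rm = TRUE the mean of Solar.R in the built-in dataset comes out as NA, because some readings are missing.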
QUESTIONS-2
Download another file, crime.csv, from the link.
This file has robbery and murder data for the 50 states of the U.S. for the year 2005.
crime<-read.table("C:/R/crime.csv",header=TRUE, sep=",")
1) Extract those rows where population > 5000000
crime[crime$Population>5000000,]
2) Extract the names of states where murder > 6
crime[crime$Murder>6,1]
3) Extract the names of states where the number of murders is between 3 and 6.
crime[crime$Murder>3 & crime$Murder<6,1]
4) Extract the average murder rate of all states where population > 5,000,000.
mean(crime$Murder[crime$Population>5000000])
5) The name of the state with the maximum number of murders.
crime[crime$Murder==max(crime$Murder),1]
6) The name of the state with the maximum number of robberies.
crime[crime$Robbery==max(crime$Robbery),1]

R Array

An array is an object that stores data in any number of dimensions; this post uses 3-dimensional arrays. With 2 dimensions an array is equivalent to a matrix, and with 1 dimension it behaves like a vector. In a 3-dimensional array the first dimension is the number of rows, the second is the number of columns and the third is the number of matrices. So we can think of such an array as a collection of matrices with the same number of rows and columns.

ARRAY CREATION METHOD

## Method-1: It can be created by the function array(). It needs two inputs: the data, and the dimensions of the array passed in the form of a vector. dimnames is optional and is used for passing dimension names.

a<-array(1:40, c(2,2,10), dimnames=list(c("A","B"),c("Science","Maths"),as.character(2001:2010)))

a<-array(c(1,2,3,4,5,6,7,8), c(2,2,2))
## Method-2: It can be created by passing a matrix.
m<-matrix(1:8,4,2)
a<-array(m,c(2,2,2))
## Method-3: It can be created by changing the dimensions of a vector.
v<-c(1,2,3,4,5,6,7,8,9,10,11,12)
dim(v)<-c(2,2,3)
## Method-4: It can be created by making a blank array (filled with NA).
v<-array(NA,c(2,2,2))

ARRAY ATTRIBUTES

a<-array(1:16,c(2,4,2))
dim(a)                                   ## returns the dimensions of the array
dim(a)<-c(2,2,4)                   ## update the dimensions of the array (the product must stay 16)
rownames(a)<-c("Amy","Ben") ## update the row names of the array
colnames(a)<-c("A","B")           ## update the column names of the array
dimnames(a)<-list(c("A","B"),c("1","2"),as.character(2001:2004)) ## update the names of all dimensions, i.e. row names, column names, matrix names

ARRAY OPERATIONS

a<-array(1:8,c(2,2,2))
b<-array(9:16,c(2,2,2))
a+b   ## element-wise addition of two arrays
a-b   ## element-wise subtraction of two arrays
a*b   ## element-wise multiplication of two arrays
a/b   ## element-wise division of two arrays
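Elements and slices of an array are accessed with one index per dimension. A minimal sketch on the same array a:

```r
a <- array(1:8, c(2, 2, 2))

first_slice <- a[, , 1]          # the first 2 x 2 matrix
one_value   <- a[2, 1, 2]        # row 2, column 1 of the second matrix
slice_sums  <- apply(a, 3, sum)  # sum of each matrix along the third dimension
```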

R Matrix

What are Matrices?

Matrices are another type of R object, which arranges data in a 2-dimensional layout. They are like mathematical matrices with a defined number of rows and columns. A matrix can store numeric, character, logical, integer or complex data, but all elements must be of the same type.

MATRIX CREATION METHOD

## Method 1. Use the function matrix(). This function needs three inputs. The first is the data, which can be passed in the form of a vector. The second is the number of rows and the third is the number of columns.
m<-matrix(c(1,2,3,4),nrow=2, ncol=2)
m<-matrix(1:4 ,nrow=2, ncol=2)
m<-matrix(seq(1,4,by=1),nrow=2, ncol=2)
v<-c(1,2,3,4)
m<-matrix(v ,nrow=2, ncol=2)

     [,1] [,2]
[1,]    1    3
[2,]    2    4
Each of the above creates a matrix of 2 rows and 2 columns.
## Method 2. Change the dimensions of a vector. A vector is a one-dimensional set of data. If we set its dimensions, it takes the form of a matrix.
v<-c(1,2,3,4,5,6)
dim(v)<-c(2,3)
## Method 3. Use the cbind() or rbind() function.
v1<-c(1,2,3,4)
v2<-c(5,6,7,8)
Now these two vectors can be bound column-wise or row-wise to form a matrix.
cbind(v1,v2)
     v1 v2
[1,]  1  5
[2,]  2  6
[3,]  3  7
[4,]  4  8
rbind(v1,v2)
   [,1] [,2] [,3] [,4]
v1    1    2    3    4
v2    5    6    7    8
## Method 4. Create a blank matrix and update it when needed.
m<-matrix(NA,nrow=2,ncol=2)
m
     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA
Now we can update the matrix
m[1,1]=1; m[1,2]=2
m[2,1]=3; m[2,2]=4

MATRIX ATTRIBUTES

## dim function gives the dimensions of a matrix
m<-matrix(1:16,4,4)
dim(m)                                                  ## return dimensions of matrix
[1] 4 4
colnames(m)<-c("Q1","Q2","Q3","Q4") ## update column names of matrix m
colnames(m)                                         ## return column names of matrix m
rownames(m)<-c(1,2,3,4)                     ## update row names of matrix m
rownames(m)                                        ## return row names of matrix m
## dim(m)<-c(8,2) would reshape m to 8 rows and 2 columns and drop the row and column names

ACCESSING AND MODIFYING MATRIX

## An element of a matrix can be accessed or modified by its row number and column number
m[1,1] ## return the first element of first row of matrix m
m[1,] ## return all elements of first row of matrix
m[,1] ## return all elements of first column of matrix
m[2,3:4] ## return third and fourth column of second row
m[2:3,1] ##return the second and third element of first column.
m[,] ## return all the elements of the matrix.

MATRIX OPERATIONS

## Arithmetic on matrices

m1<-matrix(1:4,2,2)

m2<-matrix(5:8,2,2)

m1+m2 ## addition of matrices

m2-m1 ## subtraction of matrices

m1*m2 ## element-wise product of matrices

m2/m1 ## element-wise division of matrices

m1%*%m2 ## matrix multiplication

t(m) ## transpose of matrix

diag(m) ## diagonal of matrix

eigen(m) ## eigenvalues and eigenvectors of matrix

det(m) ## determinant of matrix

sum(diag(m)) ## trace of matrix (base R has no tr() function)

SOLVING EQUATIONS BY MATRIX

x+2y=7

3x+y=11

##create a matrix of coefficients of x,y

a=matrix(c(1,3,2,1),2,2)

b=matrix(c(7,11),2,1)

solve(a,b)

When b is not passed, solve(a) returns the inverse of a.
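A quick way to check the answer is to multiply the coefficient matrix by the solution and compare with b. A minimal sketch for the two equations above:

```r
a <- matrix(c(1, 3, 2, 1), 2, 2)   # each row holds the coefficients of one equation
b <- matrix(c(7, 11), 2, 1)

sol   <- solve(a, b)   # x = 3, y = 2
check <- a %*% sol     # should reproduce b
inv   <- solve(a)      # inverse of a
```

Substituting back by hand: 3 + 2*2 = 7 and 3*3 + 2 = 11, so both equations are satisfied.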

QUESTIONS

## Q1 Give the general expression to create a matrix in R.
The general expression to create a matrix in R is - matrix(data, nrow, ncol, byrow, dimnames)

## Q2 How do you access the element in the 2nd column and 4th row of a matrix named M?
The expression M[4,2] gives the element at 4th row and 2nd column.
## Q3 The sales percentage of two branches for 4 weeks is as follows (weeks start on Monday and end on Sunday).
1)40,45,34,67,56,87,45,23,45,27,37,87,98,45,25,35,54,56,76,84,65,35,56,45,67,67,77,87
2)34,37,39,41,45,49,51,46,45,49,52,55,58,60,67,55,54,58,65,69,70,74,75,65,64,68,69,74
3.1 Find the average, max and min sales of both stores.
3.2 Which day was the best and which the worst for each store?
3.3 Find the weekly average, weekly min and weekly max sales of both stores.
3.4 Give the average sales of both stores in the form of a matrix.
3.5 Which store performed better on each day? The answer should be a matrix of values 1 or 2.

K-means-Clustering

Clustering is the task of creating groups within data such that the members of a group are more similar to each other than to members of other groups. Example: suppose we have invited papers for a conference and thousands of research papers arrive in my mailbox in two days. Opening each one and filing it under the right track would take a long time. Instead I can download all of them and run a clustering algorithm, which gives me 5 or 7 clusters of similar papers. The process can be explained by the image below.

There are various algorithms for clustering. Simple and widely used ones are:
1) K-Means Clustering
2) Hierarchical Clustering
3) Agglomerative Clustering

In this blog we will understand K-Means clustering. It is an unsupervised learning method, i.e. the data carries no labels telling the algorithm how to form the clusters. We only specify the number of clusters or groups to be created.
Algorithm of clustering
1) Define the number of clusters. Let's say n=5.
2) Assign one data point to each cluster as its initial centroid. Any data point can be chosen as the centre of a cluster. If we have 50 data points, 5 of them are assigned as centres.
3) Start a loop over the remaining 45 points.
4) Allocate these 45 points to clusters by distance: find the distance of each point from the five cluster centres and assign it to the nearest cluster.
5) Once allocation is done, iteration 1 is complete.
6) Recalculate the centroid of each cluster by taking the mean of all its data points, then repeat from step 3 (now reassigning every point) until the centroids stop changing.
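The steps above can be sketched in a few lines of base R. This is only an illustration under my own assumptions: the toy data pts, the choice of 2 clusters, the cap of 20 iterations and the stopping tolerance are not from the algorithm description, and every point (including the starting centres) is reassigned in each pass.

```r
set.seed(42)
# Hypothetical toy data: 50 two-dimensional points in two loose groups
pts <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 5), ncol = 2))
k <- 2

# Step 2: pick k data points as the starting centroids
centers <- pts[sample(nrow(pts), k), , drop = FALSE]

for (iter in 1:20) {
  # Step 4: squared distance from every point to every centroid (one column per centroid)
  d <- sapply(1:k, function(j) rowSums(sweep(pts, 2, centers[j, ])^2))
  cluster <- max.col(-d, ties.method = "first")   # index of the nearest centroid

  # Step 6: recompute each centroid as the mean of the points assigned to it
  new_centers <- t(sapply(1:k, function(j)
    colMeans(pts[cluster == j, , drop = FALSE])))

  if (all(abs(new_centers - centers) < 1e-8)) break   # centroids stopped moving
  centers <- new_centers
}

table(cluster)   # number of points in each cluster
```

In practice the built-in kmeans() function does the same job with better handling of edge cases such as empty clusters.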
Let us implement this in R on text data. Lines starting with ## are comments and are ignored by R if copied.
##Step1: Read the data from a file and put it in the data frame mydata
install.packages("tm")
install.packages("wordcloud")
library(tm)
library(wordcloud)

mydata<-read.table(file="new.txt", sep=",",stringsAsFactors=FALSE)

##update the column name as term

colnames(mydata)=c("term")

dim(mydata)

## generate the corpus of the text data

mycorpus = Corpus(VectorSource(mydata$term))
## inspect the corpus to check the data upload
inspect(mycorpus[1:10])
## Data cleaning of the corpus; each step works on the result of the previous one
cleancorpus=tm_map(mycorpus,content_transformer(tolower))
cleancorpus=tm_map(cleancorpus,removePunctuation)
cleancorpus=tm_map(cleancorpus,removeNumbers)
cleancorpus=tm_map(cleancorpus,removeWords,stopwords("english"))
cleancorpus=tm_map(cleancorpus,stripWhitespace)
## Create the term-document matrix, weight it by tf-idf and transpose it so that rows are documents
dtm=TermDocumentMatrix(cleancorpus,control=list(wordLengths=c(1,Inf)))
dtm_tfidf=weightTfIdf(dtm)
m1=as.matrix(dtm_tfidf)
m=t(m1)
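The post stops after building the matrix m. A possible final step, sketched here under my own assumptions, is to pass a documents-by-terms matrix to the built-in kmeans() function. The matrix toy below is a hypothetical stand-in for m, and the choice of 2 clusters and nstart = 10 is arbitrary.

```r
set.seed(1)
# Hypothetical stand-in for m: 6 documents (rows) x 4 terms (columns) of tf-idf-like weights
toy <- rbind(c(1.0, 1.0, 0.0, 0.0),
             c(1.0, 0.9, 0.0, 0.0),
             c(0.8, 1.0, 0.0, 0.0),
             c(0.0, 0.0, 1.0, 1.0),
             c(0.0, 0.0, 0.9, 1.0),
             c(0.0, 0.0, 1.0, 0.8))

fit <- kmeans(toy, centers = 2, nstart = 10)   # nstart tries several random starts

fit$cluster   # cluster label of every document
fit$size      # number of documents in each cluster
```

With the real data the call would be kmeans(m, centers = 5), with the number of centres chosen as in step 1 of the algorithm.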

Text-Analytics-3

An important part of text analysis is finding patterns. Let us say we have 100,000 search queries and we want to find those search terms that differ only by 1 or 2 spaces.
Example -
1. Ledbulb and led bulb are same
2. T-shirt, T shirt, Tshirt are same
3. PVC and P V C are same

Here I present R code that helps to find such terms. It can be improved to also find terms that differ by s/es/-/ed, so this is a base version and improvements will come in due course.
Step 1 : Load the file of query terms in R and clean the data

## Load the file, which has two columns: search term, count of search term
data<-read.table(file="term.csv", sep=",",stringsAsFactors=FALSE,header=TRUE)
## Create another column and assign the column names
data[,3]=data[,1]
colnames(data)<-c("terms","count","collapse")
## Take out a sample, as it is better to test code on sample data
output=data.frame(data[1:100,],stringsAsFactors=FALSE)
## Remove any leading or trailing spaces from column 1 and put the result in another data frame
out=data.frame(trimws(output[1:100,1],which="both"),stringsAsFactors=FALSE)
out[,2:3]=output[,2:3]
v<-vector(length=1)
Step 2 : Once cleaning is done, remove all the spaces from each search term. The loop below does this. It could also be done with an apply function, but I am presenting a simple approach that non-R programmers can also understand.
## Remove all the spaces from all the terms
for (i in 1:nrow(out))
{
v[1]<-out[i,1]
vec<-unlist(strsplit(v[1], " "))
v[1]<-paste(vec, sep="",collapse="")
out[i,3]=v[1]
}
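For readers who are comfortable with vectorised R, the same space removal can be sketched without a loop using gsub(). The vector terms is a hypothetical sample, not the data from term.csv.

```r
terms <- c("t shirt", "tshirt", "pen drive", "pendrive", "plywood")  # hypothetical sample

collapsed <- gsub(" ", "", terms, fixed = TRUE)   # remove every space in one call
groups    <- split(terms, collapsed)              # terms that collapse to the same string
variants  <- groups[lengths(groups) > 1]          # keep groups with more than one spelling
```

Here variants holds the pairs "pen drive"/"pendrive" and "t shirt"/"tshirt", while "plywood" is dropped because it has no other spelling in the sample.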
Step 3 : Now we have two versions of the data, one with spaces and one without. Next we search for identical terms in the data without spaces and pair them up with the data with spaces.
## find the terms that are identical once spaces are removed
for (i in 1:nrow(out))
{
v[1]<-out[i,3]
test<-out[ endsWith(out[,3],v[1]) & startsWith(out[,3],v[1]) ,1]
test<-unique(test)
if (length(test)>1 )
{
   for (j in 1:length(test))
    {
          ##if (out[i,1] != test[j])
              out[i,j+3]=test[j]
    }
}
}

Step 4 : Write the data to a file.
## The nchar test selects only those rows of the dataset that have similar search terms.
flag=nchar(out[,4], keepNA=FALSE)>2
write.table(out[flag,1:4], "output.csv")

Output will look like this.

"mens tshirts"	"mens t shirts"
"plywood"	"ply wood"
"tshirts"	"t shirts"
"power bank"	"powerbank"
"ro water purifier"	"r o water purifier"
"tshirts"	"t shirts"
"t shirt"	"tshirt"
"pen drive"	"pendrive"
"pharma ceutical products"	"pharmaceutical products"
"tshirt printing machine"	"t shirt printing machine"
"solar street light"	"solar streetlight"
"plywood"	"ply wood"
"ro water purifier"	"r o water purifier"
"mens tshirts"	"mens t shirts"
"power bank"	"powerbank"
"bathroom accessories"	"bath room accessories"
"tshirts"	"t shirts"
"bi cycle"	"bicycle"
"solar street light"	"solar streetlight"
"bi cycle"	"bicycle"
"pen drive"	"pendrive"
"t shirt"	"tshirt"
"pharma ceutical products"	"pharmaceutical products"
"bathroom accessories"	"bath room accessories"
"tshirt printing machine"	"t shirt printing machine"

Any questions/update/code improvements are welcome.

Text Analytics-2

In the previous blog I explained how to generate the bar graph and word cloud of the most frequent words in any text. Now we will do bigram and trigram analysis of text data. Let us first understand what an n-gram is. It is a contiguous sequence of n words from a text. So a bigram is a sequence of 2 words, a trigram is a sequence of 3 words, and so on.
Example: "R is used in text analytics"
1-gram : R, is, used, in, text, analytics
2-gram : R is, is used, used in, in text, text analytics
3-gram : R is used, is used in, used in text, in text analytics
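The example above can be reproduced with a few lines of base R. The helper make_ngrams is a name I made up for this sketch; it is not a function from tm or RWeka.

```r
# Hypothetical helper: split a sentence into words and join every run of n neighbours
make_ngrams <- function(text, n) {
  words <- strsplit(text, " +")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams  <- make_ngrams("R is used in text analytics", 2)
trigrams <- make_ngrams("R is used in text analytics", 3)
```

A sentence of 6 words gives 5 bigrams and 4 trigrams, matching the lists above.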
Google has digitized about 5 million books, but it is impossible for anyone to read all of them. So 3-gram data was generated from these books and prepared as a dataset for analysis. It can tell how many times a phrase such as "Pursuit of happiness" was used in 1801, 1802, ..., 2008. This produced a table of 2 billion lines, or 2 billion n-grams, which tells a lot about history, cultural changes etc. The approach is known as culturomics (like genomics) and has been discussed in detail in a TED talk. Google has ngrams.googlelabs.com, where you can type any word and generate its chart. The graph below displays the variation in the use of the word "love" between 1800 and 2000.
N-gram analysis can help us understand the theme of a document, gauge the sentiment of a text and predict the next word in a sequence. We will look at these one by one.
First of all we will load the data, generate the corpus and clean it.

install.packages("tm")
install.packages("wordcloud")
install.packages("RWeka")
install.packages("SnowballC")

Bi gram Analysis

Text Analytics-1

Text analytics deals with the analysis of text in any form. The analysis can be as simple as counting the most frequent words in a text, or as involved as gauging the leaning of voters in an upcoming election from social media text. I will write in detail about this in another blog. Here I find the most frequent words in a text or file using R and create a word cloud of them.

Here is a step-by-step script to perform this task.

install.packages("tm")

install.packages("wordcloud")
install.packages("ggplot2")

library(tm)

library(wordcloud)
library(ggplot2)

## Load the data in R with read.table

mydata<-read.table(file="new.txt", sep=",",stringsAsFactors=FALSE)

colnames(mydata)=c("term")

dim(mydata)

## The new.txt file has 100000 lines. This data has now been loaded into R as mydata, which is a data frame. It has only one column, which we rename as term. The colnames function helps in renaming the columns of a data structure in R.

corpus = Corpus(VectorSource(mydata$term))

## Corpus function will generate the corpus of text.

inspect(corpus[1:10])

## inspect function helps in viewing corpus data

## once corpus is created we perform data cleaning action on this data.

## convert all data into lowercase

corpus_clean = tm::tm_map(corpus, tm::content_transformer(tolower))

## remove all the punctuation
corpus_clean = tm::tm_map(corpus_clean, removePunctuation)

## remove all the numbers

corpus_clean = tm::tm_map(corpus_clean, removeNumbers)

## Generate the term-document matrix, keeping only words of 3 or more characters

dtm<-tm::TermDocumentMatrix(corpus_clean, control=list(wordLengths=c(3,Inf)))

## generate the dataframe consisting of words and their frequency
freq=sort(rowSums(as.matrix(dtm)), decreasing=TRUE)

word_freq=data.frame(word=names(freq), freq=freq)

## n is the number of most frequent words we want to show in the graph

n=25

word_freq_head=head(word_freq,n)

##plots

p=ggplot(data=word_freq_head, aes(x=word, y=freq, fill=word))

p+geom_bar(stat="identity") + labs(title = "Most Frequent Words")

R Vector

WHAT IS A VECTOR

A vector is the simplest way of storing one-dimensional data in R. It can be character, numeric, integer, logical or complex. c(1,2,3,4,5) is a vector of numbers. c("Amy","Joe","Sam") is a vector of characters.

VECTOR CREATION METHODS

v=c(1,2,3,4,5)     ## Method 1. Use function c.
## This gives us a vector of the numbers 1 to 5.
v<-seq(1,10,by=2) ## Method 2. Use the seq function
## seq generates a sequence from a start value, an end value and the difference between terms
v<-rep(2,10)           ## Method 3. Use the repeat (rep) function
## rep repeats the value n times.
v<-1:20                   ## Method 4. Use : between the start value and the end value

FUNCTIONS ON VECTORS

## Function to find length of vector
length(v)
## Function to set the name of each element of vector
v<-c(1,2,3,4,5,6,7)
names(v)<-c("Sun","Mon","Tue","Wed","Thu","Fri","Sat")
## Function to calculate the sum of vectors
sum(v)
## Function to calculate the product of vectors
prod(v)
## Function to sort the vector
sort(v)
## Function to return the indexes of the vector in the order of its sorted values
order(v)

ACCESSING AND MODIFYING VECTORS

## How to access elements of vector
v[1] ## first element of vector
v[length(v)] ## last element of vector
v[length(v)-1] ## last but one element of vector
v[v>3] ## all elements which are > 3
v[v<5] ## all elements which are < 5
v[4]<-7 ## It will modify the fourth element of vector
v[1:3] ## It will return first three elements of vector
v[c(1,3,5)] ## It will return first,third and fifth elements of vector

VECTOR OPERATIONS

## We can perform mathematical operations on two vectors as we do on numbers.
v1<-c(1,2,3,4)
v2<-c(5,6,7,8)
v1+v2            ## Addition of two vectors
[1] 6 8 10 12
v2-v1             ## Subtraction of two vectors
[1] 4 4 4 4
v2*v1            ## Multiplication of two vectors
[1] 5 12 21 32
v2/v1              ## Division of two vectors
[1] 5.000000 3.000000 2.333333 2.000000
## We can perform mathematical operations on vectors and numbers.
v1+10           ## Addition of vector and number
[1] 11 12 13 14
v1*2              ## Multiplication of vector and number
[1] 2 4 6 8
## We can perform operations between different sized vectors
v1<-c(1,2,3,4)
v2<-c(1,2)
v1+v2              ## Addition of a small and big vector
[1] 2 4 4 6
## The shorter vector is repeated until its length matches the longer vector. This is called recycling of vectors

NESTED METHOD OF VECTOR CREATION

## vectors can be created by combining two or more methods
v<-c(rep(1,3),rep(2,3),rep(3,3),rep(4,3)) ## rep inside a c function
[1] 1 1 1 2 2 2 3 3 3 4 4 4
v<-rep(c(1,2,3),4)                                     ## c inside a rep function
[1] 1 2 3 1 2 3 1 2 3 1 2 3

MERGING TWO VECTORS

v1<-c(1,2,3,4,5)
v2<-c(6,7,8,9,10)
## These two vectors can be merged by passing into one vector.
v<-c(v1,v2)          ## create a new vector after merging v1 and v2
v2=v1                  ## update v2 with the values of v1

Questions
## Q1)What is recycling of elements in a vector? Give an example
## A) When two vectors of different lengths are involved in an operation, the elements of the shorter vector are reused to complete the operation. This is called element recycling. Example: if v1 <- c(4,1,0,6) and v2 <- c(2,4), then v1*v2 gives (8,4,0,24). The elements 2 and 4 are repeated.
## Q2)How will you check if an element 2 is present in a vector?
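It can be done using the grep(), match() or is.element() function.
1) grep() function returns the locations of all matching values. Note that it matches text patterns, so it would also match values such as 12 or 20.
v<-c(2,4,6,8,2)
grep(2,v)
[1] 1 5
2) match() function returns the first location of element 2
match(2,v)
[1] 1
3) is.element() function returns TRUE if element 2 is present
is.element(2,v)
[1] TRUE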
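Two more base R options are the %in% operator and which(). The sketch below uses a slightly longer vector than the one above (I added 12 at the end) to show where grep() and an exact comparison differ.

```r
v <- c(2, 4, 6, 8, 2, 12)

present   <- 2 %in% v        # TRUE if 2 occurs anywhere in v
positions <- which(v == 2)   # exact matches only: positions 1 and 5
pattern   <- grep(2, v)      # pattern match on the text "2": also finds 12 at position 6
```

For numeric vectors, %in% and which() are usually the safer choice because they compare values rather than text.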

## Q3) What will be the class of the resulting vector if you concatenate a number and a character?
character
## Q4) What will be the result of multiplying two vectors in R having different lengths?
The multiplication is performed and the output is displayed with a warning message like "longer object length is not a multiple of shorter object length". Suppose there is a vector a <- c(1, 2, 3) and a vector b <- c(2, 3); then a*b gives 2 6 6 with the warning message. The multiplication is performed element by element, but since the lengths differ, b is recycled and its first element is multiplied with the last element of the longer vector a.
##Q5) what will be the result of multiplying a vector with matrix?
##Q6) What is the output of rep(1,3):rep(3,3)
##Q7) How to find the index of maximum element of vector
