Text analytics deals with the analysis of text in any form. The analysis can be as simple as counting the most frequent words in a piece of text, or as involved as gauging public opinion about an upcoming election from social media posts. I will write about that in detail in another blog post. Here I am going to find the most frequent words in a text file using R and create a wordcloud from them.
Below is a step-by-step script to perform this task.
install.package("tm")
install.package("wordcloud")
install.pakage("ggplot2")
install.pakage("ggplot2")
library(tm)
library(wordcloud)
library(ggplot)
library(ggplot)
## Load the data in R with read.table
mydata<-read.table(file="new.txt", sep=",",stringsAsFactors=FALSE)
colnames(mydata)=c("term")
dim(mydata)
## new.txt has 100000 lines. The data is now loaded into R as mydata, a dataframe with a single column, which we rename to term. The colnames function renames the columns of a data structure in R.
corpus = Corpus(VectorSource(mydata$term))
## The Corpus function builds a text corpus from the vector of terms.
inspect(corpus[1:10])
## The inspect function lets us view the contents of the corpus.
## Once the corpus is created, we perform data-cleaning steps on it.
## Convert all text to lowercase. Wrapping tolower in content_transformer keeps the result a valid corpus in recent versions of tm.
corpus_clean = tm::tm_map(corpus, content_transformer(tolower))
## remove all the punctuation
corpus_clean = tm::tm_map(corpus_clean, removePunctuation)
## remove all the numbers
corpus_clean = tm::tm_map(corpus_clean, removeNumbers)
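The script above removes punctuation and numbers. For a cleaner wordcloud it is often worth also removing English stopwords and extra whitespace, otherwise filler words such as "the" and "and" dominate the counts. This is an optional step, not part of the original script; removeWords, stopwords and stripWhitespace are all standard tm functions:
## remove common English stopwords and collapse extra whitespace
corpus_clean = tm::tm_map(corpus_clean, removeWords, stopwords("english"))
corpus_clean = tm::tm_map(corpus_clean, stripWhitespace)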
## Generate the term-document matrix, keeping only words of at least 3 characters
dtm<-tm::TermDocumentMatrix(corpus_clean, control=list(wordLengths=c(3,Inf)))
## Generate a dataframe of words and their frequencies. Rows of the term-document matrix are terms, so rowSums gives each word's total count.
freq=sort(rowSums(as.matrix(dtm)), decreasing=TRUE)
word_freq=data.frame(word=names(freq), freq=freq)
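As a quick sanity check, the top of this dataframe can be printed with base R's head function:
head(word_freq, 10)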
## n is the number of most frequent words we want to show in the graph
n=25
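Finally, the wordcloud and the frequency graph themselves. This is a minimal sketch using the wordcloud and ggplot2 packages loaded earlier; the palette (brewer.pal from RColorBrewer, which wordcloud loads) and the plot styling are my choices, not part of the original script:
## draw the wordcloud of the n most frequent words
wordcloud(words=word_freq$word, freq=word_freq$freq,
          max.words=n, random.order=FALSE,
          colors=brewer.pal(8, "Dark2"))
## bar chart of the n most frequent words
ggplot(head(word_freq, n), aes(x=reorder(word, freq), y=freq)) +
  geom_bar(stat="identity") +
  coord_flip() +
  labs(x="word", y="frequency")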