1) K-Means Clustering
2) Hierarchical Clustering
3) Agglomerative Clustering
K-Means clustering algorithm
1) Define the number of clusters. Let's say n=5.
2) Assign one data point to each cluster as its initial centroid. Any data point can be chosen as a cluster center. If we have 50 data points, 5 of them are used as the initial centroids.
3) Start a loop over the remaining 45 points.
4) Allocate these 45 points to clusters based on their distance from the cluster centers: compute the distance of each point from the five cluster centers and assign it to the nearest one.
5) Once allocation is done, iteration 1 is complete.
6) Recalculate each centroid as the mean of all data points in its cluster, then start again from step 3, this time reassigning all 50 points to the nearest new centroid. Repeat until the assignments stop changing.
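The steps above can be sketched in a few lines of base R. This is only an illustrative sketch on random 2-D toy data, not the text example that follows: the names pts, centers and cluster, the fixed seed and the cap of 100 iterations are all assumptions made for the example. For simplicity it assigns all 50 points in every pass; the 5 seed points trivially land in their own clusters on the first pass.

```r
## Minimal k-means sketch on toy data (illustrative only)
set.seed(42)                                   ## reproducible random start
pts <- matrix(rnorm(100), ncol = 2)            ## 50 data points, 2 features each
k <- 5                                         ## step 1: number of clusters

## step 2: use k distinct data points as the initial centroids
centers <- pts[sample(nrow(pts), k), , drop = FALSE]
cluster <- rep(0L, nrow(pts))                  ## no assignments yet

for (iter in 1:100) {                          ## iteration cap is an assumption
  ## steps 3-4: squared Euclidean distance from every point to every centroid
  d <- sapply(1:k, function(j) rowSums(sweep(pts, 2, centers[j, ])^2))
  new_cluster <- apply(d, 1, which.min)        ## index of the nearest centroid
  ## step 5: the pass is complete; stop if no point changed cluster
  if (all(new_cluster == cluster)) break
  cluster <- new_cluster
  ## step 6: move each centroid to the mean of the points assigned to it
  for (j in 1:k) {
    if (any(cluster == j)) {
      centers[j, ] <- colMeans(pts[cluster == j, , drop = FALSE])
    }
  }
}

table(cluster)                                 ## number of points per cluster
```

In practice the built-in kmeans() function from the stats package does the same job and handles details such as multiple random starts.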
Let's implement this in R on text data. Lines starting with ## are comments; R ignores them, so the code can be copied and run as is.
##Step1: Read the data from a file and put in dataframe mydata
install.packages("tm")
install.packages("wordcloud")
library(tm)
library(wordcloud)
mydata<-read.table(file="new.txt", sep=",",stringsAsFactors=FALSE)
##update the column name as term
colnames(mydata)=c("term")
dim(mydata)
## generate the corpus of the text data
mycorpus = Corpus(VectorSource(mydata$term))
## inspect the corpus for data upload
inspect(mycorpus[1:10])
## Data cleaning of the corpus
## each step works on the output of the previous one
cleancorpus=tm_map(mycorpus,content_transformer(tolower))
cleancorpus=tm_map(cleancorpus,removePunctuation)
cleancorpus=tm_map(cleancorpus,removeNumbers)
cleancorpus=tm_map(cleancorpus,removeWords,stopwords("english"))
cleancorpus=tm_map(cleancorpus,stripWhitespace)
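## Create a term-document matrix of the data (terms as rows, documents as columns)
## wordLengths=c(1,Inf) keeps words of any length; older tm versions used minWordLength=1
dtm=TermDocumentMatrix(cleancorpus,control=list(wordLengths=c(1,Inf)))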
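## weight the term counts by TF-IDF
dtm_tfidf=weightTfIdf(dtm)
## convert to a plain matrix and transpose it so that each row is a document
m1=as.matrix(dtm_tfidf)
m=t(m1)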
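To see what the weighting and the transpose do, here is a small base-R sketch on a made-up count matrix. It uses the common TF-IDF form (term count divided by document length, multiplied by log2 of the number of documents over the number of documents containing the term). That weightTfIdf with its default settings follows this form is an assumption here, and the names counts, tf, idf, tfidf and doc_rows are hypothetical.

```r
## Toy term-document count matrix: rows are terms, columns are documents
counts <- matrix(c(2, 0, 0,
                   1, 1, 0,
                   1, 3, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("apple", "banana", "cherry"),
                                 c("doc1", "doc2", "doc3")))

## term frequency: each count divided by the length of its document
tf <- sweep(counts, 2, colSums(counts), "/")
## inverse document frequency: terms found in fewer documents weigh more
idf <- log2(ncol(counts) / rowSums(counts > 0))
## multiply every row (term) by its idf value
tfidf <- tf * idf
## transpose so that each row is a document, the layout kmeans() expects
doc_rows <- t(tfidf)
doc_rows
```

A term such as "cherry" that occurs in every document gets a weight of zero, so it no longer influences the distances between documents.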