K-Means Clustering

Clustering is the task of creating groups within data such that the items in each group are more similar to each other than to items in other groups. For example, let us assume we have invited papers for a conference and thousands of research papers appear in the mailbox within two days. Opening each one and saving it to a different track would take a long time, so instead we can download all of them and run a clustering algorithm. The result is 5 or 7 clusters of similar types of papers.


There are various algorithms for clustering. Among the most efficient and simplest to use are:
1) K-Means Clustering
2) Hierarchical Clustering
3) Agglomerative Clustering


In this blog we will understand K-Means Clustering. It is an unsupervised learning method, i.e. we do not provide any information through the data about how the clusters should be formed. We do, however, specify the number of clusters or groups to be created.
Algorithm of K-Means clustering
1) Define the number of clusters. Let's say k = 5.
2) Assign one data point to each cluster as its centroid. Any data point can be chosen as the initial centre of a cluster. If we have 50 data points, then 5 of them are assigned as centroids.
3) Start a loop over the remaining 45 points.
4) Allocate these 45 points to clusters, using distance from the cluster centre as the criterion: find the distance of each point from the five cluster centres and assign the point to the nearest cluster.
5) Once allocation is done, iteration 1 is complete.
6) Recalculate each centroid as the mean of all data points in its cluster, and start again from step 3. Repeat until the cluster assignments stop changing (see the sketch just after this list).
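The loop above can be written directly in R. Below is a minimal sketch, assuming the data are rows of a numeric matrix x and using Euclidean distance; the function simple_kmeans and its variable names are only illustrative (in practice R's built-in kmeans() does the same job):
## minimal K-means loop on a numeric matrix x with k clusters
simple_kmeans=function(x,k,iterations=10){
  ## step 2: pick k data points as the initial centroids
  centroids=x[sample(nrow(x),k),,drop=FALSE]
  for(iter in 1:iterations){
    ## steps 3-5: assign every point to its nearest centroid
    assignment=apply(x,1,function(p) which.min(colSums((t(centroids)-p)^2)))
    ## step 6: recompute each centroid as the mean of its points
    for(j in 1:k){
      members=x[assignment==j,,drop=FALSE]
      if(nrow(members)>0) centroids[j,]=colMeans(members)
    }
  }
  list(cluster=assignment,centers=centroids)
}
## example: 50 random 2-D points grouped into 5 clusters
set.seed(1)
result=simple_kmeans(matrix(rnorm(100),ncol=2),k=5)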
Let's implement this in R on text data. Text written with ## is a comment and will not be executed if copied into the code.
##Step1: Read the data from a file and put in dataframe mydata
install.package("tm")
install.package("wordcloud")
library(tm)
library(wordcloud)
mydata<-read.table(file="new.txt", sep=",",stringsAsFactors=FALSE)
##update the column name as term
colnames(mydata)=c("term")
dim(mydata)
## generate the corpus of the text data

mycorpus=Corpus(VectorSource(mydata$term))
## inspect the corpus for data upload
inspect(mycorpus[1:10])
## Data cleaning of the corpus
cleancorpus=tm_map(mycorpus,content_transformer(tolower))
cleancorpus=tm_map(cleancorpus,removePunctuation)
cleancorpus=tm_map(cleancorpus,removeNumbers)
cleancorpus=tm_map(cleancorpus,removeWords,stopwords("english"))
cleancorpus=tm_map(cleancorpus,stripWhitespace)
## Create a TermDocumentMatrix of the data and weight it by tf-idf
dtm=TermDocumentMatrix(cleancorpus,control=list(minWordLength=1))
dtm_tfidf=weightTfIdf(dtm)
m1=as.matrix(dtm_tfidf)
## transpose so that rows are documents and columns are terms
m=t(m1)
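The code above stops at the weighted matrix m, whose rows are documents and columns are terms. A minimal sketch of the actual clustering step follows, assuming 5 clusters and base R's kmeans() function; the names k and fit are illustrative, not part of the original snippet:
##Step2: cluster the documents, assuming k = 5 clusters
set.seed(123)
k=5
fit=kmeans(m,centers=k,nstart=25)
## number of documents in each cluster
table(fit$cluster)
## top 10 terms of each cluster, by mean tf-idf weight
for(i in 1:k){
  center=colMeans(m[fit$cluster==i,,drop=FALSE])
  print(names(sort(center,decreasing=TRUE))[1:10])
}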