Text-Analytics-3

An important part of text analysis is finding patterns. Let's say we have 100,000 search queries and want to find the search terms that differ from each other only by one or two spaces.
Example -  
1. Ledbulb and led bulb are the same
2. T-shirt, T shirt, Tshirt are the same
3. PVC and P V C are the same

Here I present some R code that helps to find such terms. The code can be extended to find terms that differ by s/es/-/ed. So this is a base version, and improvements will come in due course.
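As a rough sketch of one such improvement, hyphens and letter case can be folded away in the same manner as spaces. The helper name normalize_term is hypothetical and is not part of the code in this post; suffixes such as s/es/ed would need a stemming step and are left out here.

```r
# Hypothetical helper (an assumption, not part of the original code):
# builds a comparison key by lower-casing a term and dropping spaces and hyphens.
normalize_term <- function(x) {
  gsub("[ -]", "", tolower(x))
}

terms <- c("T-shirt", "T shirt", "Tshirt", "PVC", "P V C")
keys <- normalize_term(terms)

# Terms that share a key are treated as the same search term
print(data.frame(terms, keys, stringsAsFactors = FALSE))
```

With this key, all three T-shirt spellings fall together, which plain space removal alone would not achieve for the hyphenated form.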
Step 1 : Load the file of query terms into R and clean the data.

## Load the file; it has two columns: search term and count of search term
data<-read.table(file="term.csv", sep=",",stringsAsFactors=FALSE,header=TRUE)

## Create a third column as a copy of the terms and assign the column names
data[,3]=data[,1]
colnames(data)<-c("terms","count","collapse")
## Take out a sample, as it is better to test code on sample data
output=data.frame(data[1:100,],stringsAsFactors=FALSE)
## Remove any leading or trailing spaces from column 1 and put the result in another data frame
out=data.frame(trimws(output[1:100,1],which="both"),stringsAsFactors=FALSE)
out[,2:3]=output[,2:3]
## Scratch vector that holds the term currently being processed
v<-vector(length=1)

Step 2 : Once cleaning is done, remove all the spaces from each search term. The loop below does this. It could also be done with an apply function, but I am presenting a simple approach that non-R programmers can follow as well.
## Remove all the spaces from all the terms
for (i in 1:nrow(out))
{
  v[1]<-out[i,1]
  vec<-unlist(strsplit(v[1], " "))
  v[1]<-paste(vec, sep="",collapse="")
  out[i,3]=v[1]
}
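For readers comfortable with vectorized R, the same space removal can be sketched in one line with gsub. The small data frame here is illustrative stand-in data, not the real query file.

```r
# Illustrative stand-in for the cleaned data frame from Step 1 (sample values only)
out <- data.frame(terms = c("led bulb", "ledbulb", "p v c"),
                  count = c(10, 4, 2),
                  collapse = c("led bulb", "ledbulb", "p v c"),
                  stringsAsFactors = FALSE)

# gsub works on the whole column at once;
# fixed = TRUE treats " " as a literal string rather than a regular expression
out$collapse <- gsub(" ", "", out$terms, fixed = TRUE)

print(out)
```

The loop and the gsub call should give the same third column; the vectorized form is simply shorter and usually faster on large inputs.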
Step 3 : We now have two versions of each term, one with spaces and one without. Next, search for identical entries in the space-free version and pair them up with the original terms.
## Find the terms that are identical once spaces are removed
for (i in 1:nrow(out))
 {
 v[1]<-out[i,3]
 ## Exact match on the space-free form (startsWith & endsWith would also match repeats such as "tata" for "ta")
 test<-out[ out[,3]==v[1] ,1]
 test<-unique(test)
 if (length(test)>1 )
  {
   for (j in 1:length(test))
    {      
          ##if (out[i,1] != test[j])
              out[i,j+3]=test[j]
    }
  }
}
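The pairing step can also be sketched without an explicit loop by grouping the original terms on their space-free key. This is an alternative formulation under the same idea, not the code used to produce the output in this post, and the sample terms are illustrative.

```r
# Illustrative sample terms (not the real query file)
terms <- c("power bank", "powerbank", "pen drive", "plywood", "ply wood", "bicycle")
collapse <- gsub(" ", "", terms, fixed = TRUE)

# Group the original terms by their space-free key
groups <- split(terms, collapse)

# Keep only the keys that have more than one distinct spelling
variants <- Filter(function(g) length(unique(g)) > 1, groups)

print(variants)
```

Each element of variants is named by the space-free key and holds the spellings that collapse to it, so terms with no variant (such as "bicycle" and "pen drive" here) drop out.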


Step 4 : Write the data to a file.
## nchar(NA, keepNA=FALSE) returns 2, so this flag keeps only the rows where a similar
## search term was found (it also drops rows whose match is 2 characters or shorter).
flag=nchar(out[,4], keepNA=FALSE)>2
write.table(out[flag,1:4], "output.csv")
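To see why the nchar test works as a filter, here is a small sketch on a made-up vector. It also shows !is.na as a possible alternative; that alternative is my assumption about the intent (keep every row with a match), not something the code above does.

```r
# Made-up column: a matched term, a row with no match (NA), and a short matched term
x <- c("t shirt", NA, "tv")

# With keepNA = FALSE, nchar reports 2 for NA, so "> 2" filters the NA rows out
lengths_seen <- nchar(x, keepNA = FALSE)
flag_nchar <- lengths_seen > 2

# Alternative: test for missing values directly, which keeps short terms like "tv"
flag_na <- !is.na(x)

print(data.frame(x, lengths_seen, flag_nchar, flag_na))
```

The two flags agree except for matched terms of one or two characters, which only the !is.na version keeps.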

Output will look like this.



"mens tshirts" "mens t shirts"
"plywood" "ply wood"
"tshirts" "t shirts"
"power bank" "powerbank"
"ro water purifier" "r o water purifier"
"tshirts" "t shirts"
"t shirt" "tshirt"
"pen drive" "pendrive"
"pharma ceutical products" "pharmaceutical products"
"tshirt printing machine" "t shirt printing machine"
"solar street light" "solar streetlight"
"plywood" "ply wood"
"ro water purifier" "r o water purifier"
"mens tshirts" "mens t shirts"
"power bank" "powerbank"
"bathroom accessories" "bath room accessories"
"tshirts" "t shirts"
"bi cycle" "bicycle"
"solar street light" "solar streetlight"
"bi cycle" "bicycle"
"pen drive" "pendrive"
"t shirt" "tshirt"
"pharma ceutical products" "pharmaceutical products"
"bathroom accessories" "bath room accessories"
"tshirt printing machine" "t shirt printing machine"


 


Any questions, updates, or code improvements are welcome.
