R programming: Text Analytics-2

In the previous blog I explained how to generate a bar graph and a word cloud of the most frequent words in any text. Now we will do bigram and trigram analysis of text data. Let us first understand what an n-gram is. It is a contiguous sequence of n words from a text. So a bigram is a sequence of 2 words, a trigram is a sequence of 3 words, and so on.
Example: "R is used in text analytics"
1-gram : R, is, used, in, text, analytics
2-gram : R is, is used, used in, in text, text analytics
3-gram : R is used, is used in, used in text, in text analytics
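The example above can be reproduced with a few lines of base R. This is only a minimal sketch: the helper name make_ngrams is made up for illustration, and it assumes that words are separated by whitespace.

```r
# Build all contiguous n-grams of a single string (hypothetical helper).
# Assumes words are separated by whitespace.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # A text with fewer than n words has no n-grams.
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

sentence <- "R is used in text analytics"
bigrams  <- make_ngrams(sentence, 2)
trigrams <- make_ngrams(sentence, 3)
print(bigrams)
print(trigrams)
```

A sentence with k words gives k - n + 1 n-grams, so the six-word sentence here gives five bigrams and four trigrams.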
Google has digitized millions of books (the n-gram dataset covers about 5 million of them), but it is impossible for anyone to read all of them. So they generated n-gram data from these books and prepared a dataset for analysis. With it they can tell how many times a phrase such as "Pursuit of happiness" was used in 1801, 1802, ..., 2008. This way they generated a table of 2 billion lines, or 2 billion n-grams, which tells a lot about history, cultural changes, etc. This is known as culturomics (like genomics). It has been discussed in detail in a TED talk. Google has ngrams.googlelabs.com, where you can type any word and generate its chart. The graph below displays the variation in the use of the word "love" between 1800 and 2000.
N-gram analysis can help us understand the intent of a document. It can also help explain the sentiment of a text and predict the next word in a sequence. We will look at these one by one.
First of all, we will load the data, generate a corpus and clean it.
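To make the next-word idea concrete, here is a small base R sketch. It is not the method used later in this series, just an illustration: the training text is invented, and predict_next is a hypothetical helper that returns the word most often seen after a given word.

```r
# Toy training text (invented for illustration).
text  <- "r is used in text analytics and r is used in statistics and r is fun"
words <- strsplit(text, "\\s+")[[1]]

# Pair every word with the word that follows it, i.e. list all bigrams.
bigrams <- data.frame(
  first  = head(words, -1),
  second = tail(words, -1),
  stringsAsFactors = FALSE
)

# Predict the next word as the most frequent follower of `word`.
# Returns NA if the word never appears as the first word of a bigram.
predict_next <- function(word, bigrams) {
  followers <- bigrams$second[bigrams$first == word]
  if (length(followers) == 0) return(NA_character_)
  counts <- table(followers)
  names(counts)[which.max(counts)]
}

predict_next("is", bigrams)   # "used" (seen twice, "fun" only once)
predict_next("r", bigrams)    # "is"
```

With real data the same counting would be done on a much larger corpus, and trigrams would let the prediction depend on the previous two words instead of one.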

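install.packages("tm")         # text mining: corpus creation and cleaning
install.packages("wordcloud")  # word cloud plots
install.packages("RWeka")      # n-gram tokenizer (needs Java)
install.packages("SnowballC")  # word stemming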
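Once the packages are installed, loading the data, building a corpus and cleaning it might look like the sketch below. The two sample sentences are made up; in practice the text could be read from a file with readLines(). The cleaning steps shown (lower case, punctuation, numbers, English stop words, extra white space) are a common choice with tm, not necessarily the exact steps every analysis needs.

```r
library(tm)

# Made-up sample text; replace with your own data, e.g. readLines("your_file.txt").
text <- c("R is used in Text Analytics!",
          "Text analytics with R: 2 simple examples.")

# Generate the corpus: one document per element of the character vector.
corpus <- VCorpus(VectorSource(text))

# Clean the corpus step by step.
corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse spaces

# Pull the cleaned text back out as a plain character vector.
cleaned <- vapply(corpus, function(doc) trimws(content(doc)), character(1))
print(cleaned)
```

tolower is wrapped in content_transformer() because it is a base R function rather than a tm transformation. The order of the steps matters: stop words are removed after lower-casing so that "Is" and "is" are treated alike.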
