Archive for April, 2016

Data Analytics – Text Mining the AOL 2006 Web Search Logs

By Kato Mivule

  • In this example, we use R to implement text mining of the 2006 AOL web search logs.
  • The 2006 AOL data set is multivariate – containing more than one variable.
  • In this example, we are interested in the Query variable – containing the actual queries that were issued by users.

The 2006 AOL Web Search Logs WordCloud Sample

The AOL Dataset

  • The AOL dataset published by AOL in 2006, is one of the few public datasets available that contain real web search logs.
  • While data curators at AOL claimed to have “anonymized” the dataset by removing personal identifiable information (PII), researchers were able to re-identify some individuals in the published dataset.
  • The AOL dataset contains the following five variables (attributes):
    • AnonID: stores the identification number of the user.
    • Query: stores the actual query that the user executed.
    • QueryTime: The date and time the query was executed.
    • ItemRank: AOL’s ranking of the query.
    • ClickURL: The actual URL link that the user clicked on after querying.

Big Data Analytics

  • The big data problem then becomes apparent; we are looking at about 1.5 gigabytes of data.
  • There are about 20 Million web queries collected from about 650, 000 individuals over a 3 month period.
  • However, I have only labeled a sample of about 500 Megabytes of data for machine learning purposes. We could use this sample as a start.

In this case we use R Studio for the implementation.

You can download R Studio here… https://www.rstudio.com/

You can download the 2006 AOL web search logs here… http://www.gregsadetsky.com/aol-data/

Step 1 – Select the Query attribute for analysis.


  • We are interested in doing text-mining analytics on the actual queries that were issued by the users.
  • In this example, we only use about 55,000 rows of data (queries) from the Query attribute – about 1.8 megabytes of data for our text analytics.


Step 2 – Load the text-mining libraries in R

  • In this case, we use the “tm” library for text-mining.


Step 3 – Read/input the text file containing the query data into R Studio.

filePath1 <- "~/Desktop/AOL_QUERY_PRIVACY1/AOL_Data_Q1.txt"


  • In this example, we use about 50,000 rows of text data, about 1.8 megabytes.

Step 4 – Load the data into a corpus

  • Text is transformed in a corpus, which is a set of documents. In this case, each row of data in the query variable (attribute) becomes a document.
docs1 <- Corpus(VectorSource(text1))


Step 5 – Do a Text Transformation

  • Replace special characters,“/”, “@”, “|” , etc, with space.
  • The tm_map() function is used to replace special characters in the text.
toSpace <- content_transformer(function (x , pattern ) 
gsub(pattern, " ", x))
docs1 <- tm_map(docs1, toSpace, "/")
docs1 <- tm_map(docs1, toSpace, "@")
docs1 <- tm_map(docs1, toSpace, "\\|")


Step 6 – Transform and clean the text of any unnecessary characters.

# Transform text to lower case
docs1 <- tm_map(docs1, content_transformer(tolower))
# Transform text by removing numbers
docs1 <- tm_map(docs1, removeNumbers)
# Transform text by removing english common stopwords
docs1 <- tm_map(docs1, removeWords, stopwords("english"))

Transform text by removing specific stopwords. These words tend to be very common in the AOL Web search logs and offer no meaning at this point, but are a distraction in the analytics by giving us unneeded stats – “www”, “www.”, “http”, “https”, “.com”, “.org”, “aaa”.

docs1 <- tm_map(docs1, removeWords, c(“www”, “www.”, “http”, “https”, 
“.com”, “.org”, “aaa”))
# Transform text by removing punctuations
docs1 <- tm_map(docs1, removePunctuation)
# Transform text by removing whitespaces
docs1 <- tm_map(docs1, stripWhitespace)


Step 6 – Transform by Text Stemming – reducing English words to their root.

  • In this step, English words are transformed and reduced to their root word, for example, the word “running” is stemmed to “run”.
docs1 <- tm_map(docs1, stemDocument)


Step 7 – Construct a document matrix.

  • A document matrix is a table made up of each unique word in the text and how many times it appears – basically the table of word frequency.
  • The column names in the table are each unique word and row names represent the documents in which those words appear.
  • The TermDocumentMatrix() function from the “tm” library is used to generate the document matrix.
  • The document matrix is what we use for the frequency analytics of words in the text to finally derive conclusions.
  • The matrix() function transforms the TermDocumentMatrix into a matrix.
  • The sort () function transforms text by sorting it in descending or ascending order.
  • The frame () function creates is a list of variables of the same number of rows with unique row names, given class “data.frame“, in this case we create variables of words and their frequencies.
  • The head () function returns the top 10 most frequent words in the data.frame, d1.
dtm1 <- TermDocumentMatrix(docs1)
m1 <- as.matrix(dtm1)
v1 <- sort(rowSums(m1),decreasing=TRUE)
d1 <- data.frame(word = names(v1),freq=v1)
head(d1, 10)


Step 8 – Do a frequency analysis

  • At this point we can take a look at the most frequent terms, or in this case queries issued.
  • The findFreqTerms() function is used to find the frequent terms or words used in the term-document matrix.
  • In this case, we want to get words at appear at least 200 times.
findFreqTerms(dtm1, lowfreq = 200)

Results from the head () function


Words that appear at least 200 times.


Step 9 – Plot the word frequencies

  • The barplot () function is used to plot the 10 most frequent words from the 2006 AOL sample we used.
barplot(d1[1:10,]$freq, las = 2, names.arg = d1[1:10,]$word, col =“grey”, 
main = "Most frequent words", ylab = "Word frequencies")



Step 10 – Generate a word cloud of words that appear at least 200 times.

  • The wordcloud () function from the wordcloud library is used to plot the word cloud for the queries under analysis.
  • The words parameter in the wordcloud represents the words to be plotted.
  • The freq parameter represents the word frequencies.
  • The freq parameter sets a limit on words with a certain frequency to be plotted.
  • The words parameter sets the maximum number of words to be plotted.
  • The order parameter plots words in random order. If set to false, words will be will be plotted in decreasing frequency.
  • The per parameter vertically proportions words with a 90-degree rotation.
  • The colors parameter sets the color of words from the least to most frequent.
wordcloud(words = d1$word, freq = d1$freq, min.freq = 50, max.words=200, 
random.order=FALSE, rot.per=0.35,colors=brewer.pal(8, "Dark2"))


The word cloud.



From this data sample, we can conclude that terms such as “Google”, “Yahoo”, “Free”, “Myspace”,”craigslist”, were popular search terms then on the AOL search engine in 2006.


  1. Michael Arrington, “AOL Proudly Releases Massive Amounts of Private Data”, TechCrunch, Accessed Feb 17 2016, Available Online at: http://techcrunch.com/2006/08/06/aol-proudly-releases-massive-amounts-of-user-search-data/
  1. Michael Barbaro and Tom Zeller, “A Face Is Exposed for AOL Searcher No. 4417749”, New York Times, Accessed Feb 17 2016, Available Online at: http://www.nytimes.com/2006/08/09/technology/09aol.html
  1. Greg Sadetsky, “AOL Dataset”, Accessed Feb 17 2016, Available Online at: http://www.gregsadetsky.com/aol-data/
  1. com – Text-mining and Word Clouds – http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know






Read Full Post »


Data Analytics Challenges

By Kato Mivule

Part 1, Part 2, Part 3, Part 4, Part 5

There are a number of challenges any data analytics project will have to give serious consideration. Below, we identify and classify five main challenges that any data analytics specialist will always encounter: (i) the problem definition challenge, (ii) the data preprocessing challenge, (iii) the big data challenge, (iv) the unstructured data challenge, and (v) the evaluation of results challenge [1] [2] [3] [4].

The problem definition challenge

  • Many data analytics problems are not well defined by stakeholders and therefore require a collaboration between domain and technical specialists to determine the data to be used and outcomes, and the algorithms needed to accomplish the tasks.

The data preprocessing challenge

  • Most of the datasets presented for data analytics comes incomplete and not in the required format to properly apply data analytics algorithms. During the data preprocessing stage, data has to be placed in a format suitable for the data mining and machine learning algorithms.
    • Missing values: during the preprocessing phase, missing values in the data have to be replaced with average values or the most frequent values.
    • Noisy data: incorrect and invalid values have to be removed and replaced as well.
    • Irrelevant values: values that offer no insight to the problem being solved are removed at this stage.
    • Outliers: these are values that are too high or too low that they affect the overall outcome of the data analytics results. For example, a dataset containing the salary of both the CEO and Janitor might not reflect well on the average salary of workers in that organization. Yet simply removing the outliers would be a loss of valuable data, and therefore a challenge to data analytics.

The big data challenge

  • The exponential growth of data on a daily basis presents an ever growing challenge for data analytics in terms of computation resources needed to analyze such datasets.
  • The big data problem includes both the large datasets and the high dimensionality of such data sets, i.e., the large number of attributes or variables in a given dataset.

The unstructured data challenge

  • Traditional data analytics always worked with structured data in well defined data structures such as, text, numeric, and date. However, of recent, due to the exponential growth and use of the internet, data is directly stored to data warehouses in unstructured formats that include multimedia formats such as images, video, GIS data, etc.
  • Therefore unstructured data is a challenge for data analytics in that it has to be preprocessed to the right format before the analytics process.

The evaluation of results challenge

  • Another challenge related to the problem definition challenge in data analytics is the evaluation of results. In both cases the user or stakeholders do not state precisely what they want to analyze and even when results are presented, it becomes a challenge to both the technical experts in interpreting and visualizing the results in a meaningful way to the stakeholders.
    • Result interpretation: Results that may be correctly interpreted by the data analytics specialists might sound meaningless to clients.
    • Result visualization: large datasets always present a visualization challenge, therefore the data analytics specialists are faced with presenting results in a succinct, understandable, and meaningful way to the users.
  • The data analytics specialist has to find the balance between presenting too much and too little information.


[1] Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”, Prentice Hall, 2003, Page 9-10.

[2] David Boulton and Martyn Hammersley, “Analysis of unstructured data”, Chapter 10, Data Collection and Analysis, Editors: Roger Sapsford, Victor Jupp, SAGE, 2006, ISBN: 9780761943631, Page 243.

[3] Nathalie Japkowicz, Jerzy Stefanowski, “Big Data Analysis: New Algorithms for a New Society”, Volume 16 of Studies in Big Data, Springer, 2015, ISBN: 9783319269894, Pages 4-10.

[4] Jiawei Han, Jian Pei, Micheline Kamber, “Data Mining, Southeast Asia Edition”, The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, 2006,ISBN: 9780080475585, Page 47.

Read Full Post »

%d bloggers like this: