Data Analytics Simplified – A Tutorial – Part 5 – The Data Analytics Process

By Kato Mivule

Part 1, Part 2, Part 3, Part 4


The Data Analytics Knowledge Discovery Process

The data analytics process involves using algorithms to extract, mine, and discover meaningful information patterns and knowledge in data [1].

However, the data must first undergo a series of transformations before meaningful information patterns can be extracted.

There are five main phases in the data analytics process [1] [2] [3] [4] [5]:

Problem Formulation

  • The first step of the data analytics process is to articulate the problem and questions that need to be solved and answered by the data analytics process.
  • The question or problem to be solved has to be domain specific.
  • This helps with the correct data selection process. For example, a local grocery store might want to use traffic-pattern data to predict when customers with cars are most likely to stop by a certain store.

Data Selection

  • The second step of the process is to select an appropriate dataset based on a specific domain.

Data Preprocessing

  • The third step is to transform the selected dataset into the format most appropriate for the analytics algorithms.
  • This process is also called data cleaning: missing values are removed or replaced with averages, outliers may be removed or replaced with appropriate values, and inconsistent data types are corrected at this stage.
  • Data analytics work involves considerable time spent preprocessing data to ensure correct analysis (a brief sketch follows below).
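As a minimal illustration of this cleaning step, the sketch below uses Python and pandas; the file name (store_traffic.csv) and column name (visit_date) are hypothetical, chosen to match the grocery-store example above.

```python
import pandas as pd

# Load a hypothetical dataset; the file name is illustrative only.
df = pd.read_csv("store_traffic.csv")

# Replace missing numeric values with the column mean.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Treat values more than 3 standard deviations from the mean as outliers
# and clip them back to that boundary.
for col in numeric_cols:
    mean, std = df[col].mean(), df[col].std()
    df[col] = df[col].clip(lower=mean - 3 * std, upper=mean + 3 * std)

# Correct an inconsistent data type, e.g. a date column that was read in as text.
df["visit_date"] = pd.to_datetime(df["visit_date"])
```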

Data Transformation

  • Data from different sources is then converted into a common format for analysis.
  • This can include reducing the data to appropriate sample sizes, adding labels for classification, and changing file types to make the data suitable for analytics tools, as sketched below.
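A minimal sketch of such a transformation, again in Python with pandas; the dataset and column names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("store_traffic_clean.csv")  # hypothetical preprocessed dataset

# Reduce the data to an appropriate sample size (here, up to 10,000 random rows).
sample = df.sample(n=min(10_000, len(df)), random_state=42)

# Add a label for classification, derived from an existing (hypothetical) column.
sample["busy"] = (sample["cars_per_hour"] > 100).map({True: "busy", False: "quiet"})

# Change the file type to one the analytics tool expects.
sample.to_csv("store_traffic_labeled.csv", index=False)
```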

Data Mining

  • In this phase, appropriate data mining and machine learning algorithms are chosen.
  • The analyst may choose to employ supervised or unsupervised learning algorithms (a brief sketch follows this list).
  • In some cases, depending on the problem formulation, both supervised and unsupervised learning algorithms will be chosen.
  • However, parsimony is important – keep it simple. You do not need to implement every algorithm to extract meaningful knowledge from the data and arrive at a correct model.
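As an illustration of this choice (not a prescription), the sketch below fits one supervised and one unsupervised scikit-learn algorithm to small synthetic datasets:

```python
from sklearn.datasets import make_classification, make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Labeled data -> a supervised algorithm (learning by example).
X_labeled, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = DecisionTreeClassifier(max_depth=5).fit(X_labeled, y)

# Unlabeled data -> an unsupervised algorithm (the data forms its own groups).
X_unlabeled, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10).fit(X_unlabeled)
```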

Evaluation

  • In this phase, the results produced by the data mining algorithms are evaluated.
  • The extracted knowledge is then presented to the stakeholders in a clear manner.
  • Visualization of results is done at this stage.
  • A report analyzing and interpreting the results is also written at this stage to convey their meaning.
  • Again, parsimony is important: a concise and understandable presentation of results is preferred (a minimal sketch follows below).
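A minimal evaluation sketch, assuming a supervised model from the previous phase and scikit-learn's built-in metrics:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
y_pred = model.predict(X_test)

# A concise, understandable summary of the results for stakeholders.
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```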

References

[1] Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”, Prentice Hall, 2003, Page 9-10.
[2] Fayyad, Usama, Gregory Piatetsky-Shapiro, and Padhraic Smyth. “The KDD process for extracting useful knowledge from volumes of data.” Communications of the ACM 39, no. 11 (1996): 27-34.
[3] Dhar, Vasant. “Data science and prediction.” Communications of the ACM 56, no. 12 (2013): 64-73.
[4] Panov, Panče, Larisa Soldatova, and Sašo Džeroski. “OntoDM-KDD: ontology for representing the knowledge discovery process.” In Discovery Science, pp. 126-140. Springer Berlin Heidelberg, 2013.
[5] Sacha, Dominik, Andreas Stoffel, Florian Stoffel, Bum Chul Kwon, Geoffrey Ellis, and Daniel A. Keim. “Knowledge generation model for visual analytics.” Visualization and Computer Graphics, IEEE Transactions on 20, no. 12 (2014): 1604-1613.


Data Analytics Simplified – A Tutorial – Part 3 

By Kato Mivule

Keywords: Supervised Learning, Unsupervised Learning

Part 1, Part 2

Data analytics can be broken down into two main categories, as we saw in the previous post: predictive and descriptive analytics. Furthermore, data analytics tasks can be categorized as follows:

Supervised learning tasks: These involve algorithms that group data into classes and make predictions based on previous examples, thus learning by example [1]. In other words, data is grouped into predetermined classes based on a previous history of categorizing the data. The data in supervised learning is always labeled (with classes) and is basically divided into:

  • Training data – used to set up examples of how to group the data, and thus to create a model for future grouping of the data.
  • Testing data – used to make predictions based on the example data (the split is sketched below).
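A minimal sketch of this split, using scikit-learn and its bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # labeled data: features X, classes y

# Training data: examples the model learns the grouping from.
# Testing data: held-out rows used to check the learned predictions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(len(X_train), "training rows;", len(X_test), "testing rows")
```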

Unsupervised learning tasks: These involve algorithms that do not need predetermined classes to categorize the data [2]. In such cases, the data itself determines the groups or clusters into which similar data values collect.

  • Predictive data analytics methods are largely supervised learning methods.
  • Descriptive data analytics methods are mainly unsupervised learning methods.

 


Data Analytics – Supervised and Unsupervised Learning Tasks

Briefly, predictive data analytics includes the following three main methods; more details shall follow later:

  • Classification analysis

Classification involves the grouping or prediction of data into predefined categorical classes or targets. The classes into which the data is grouped are chosen before analyzing the data, based on the characteristics of the data [3].

  • Regression analysis

Regression involves the prediction or grouping of data items into predefined numerical classes or targets based on a known mathematical function, such as a linear or logistic regression function [4] (see the sketch after this list).

  • Time series analysis

Time series analysis involves examining the values of a data attribute that have been recorded over a period of time, in chronological order. Predictions are made based on the history of the values observed over time [5] [6].
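As a small sketch of a predictive (supervised) method, the regression example below fits a linear function to toy data and predicts the next numerical value; the numbers are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: a numerical target that depends linearly on one attribute.
X = np.arange(10).reshape(-1, 1)   # e.g. hour of day
y = 3.0 * X.ravel() + 2.0          # e.g. cars observed per hour (noise-free)

model = LinearRegression().fit(X, y)
print(model.predict([[10]]))       # predicted value for the next hour
```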

Descriptive data analytics includes the following unsupervised methods; more details shall follow later:

  • Clustering

Clustering involves grouping data into classes without any predefined or predetermined classes and targets. The classes into which the data is grouped are self-determined by the data, such that similar items collect around the same clusters [7] (see the sketch after this list).

  • Summarization

Summarization is the generalization of data into related groups using descriptive statistics such as the mean, mode, and median [8].

  • Association Rules

Association rules use a set of IF-THEN rules or functions to place data items with similar relationships in the same groups [9].

  • Sequence Discovery

Similar to time series analysis, sequence discovery, or sequential pattern analysis, involves finding related patterns in sequential data, that is, data in chronological order, based on statistical properties [10].
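As a small sketch of the descriptive (unsupervised) side, the example below clusters unlabeled synthetic data and then summarizes each self-determined group with a descriptive statistic:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Clustering: no predefined classes; similar items gather in the same cluster.
X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# Summarization: describe each group with simple statistics (here, the mean).
for k in range(3):
    print(f"cluster {k}: mean = {X[labels == k].mean(axis=0)}")
```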

References

[1] Olivier Pietquin, “A Framework for Unsupervised Learning of Dialogue Strategies”, Presses univ. de Louvain, 2004, ISBN 9782930344638, Page 179.
[2] Nick Pentreath, “Machine Learning with Spark”, Packt Publishing Ltd, 2015, ISBN 9781783288526, Page 197.
[3] Ashish Gupta, “Learning Apache Mahout Classification Community experience distilled”, Packt Publishing Ltd, 2015, ISBN 9781783554966, Page 14.
[4] Robin Nunkesser, “Algorithms for Regression and Classification: Robust Regression and Genetic Association Studies”, BoD – Books on Demand, 2009, ISBN 9783837096040, Page 4.
[5] Gebhard Kirchgässner, Jürgen Wolters, Uwe Hassler, “Introduction to Modern Time Series Analysis”, Springer Texts in Business and Economics, Edition 2, illustrated, Springer Science & Business Media, 2012, ISBN 9783642334368, Page 1.
[6] George E. P. Box, Gwilym M. Jenkins, Gregory C. Reinsel, Greta M. Ljung, “Time Series Analysis: Forecasting and Control”, Wiley Series in Probability and Statistics, Edition 5, John Wiley & Sons, 2015, ISBN 9781118674925, Page 1.
[7] Brett Lantz, “Machine Learning with R”, Edition 2, Packt Publishing Ltd, 2015, ISBN 9781784394523, Page 286.
[8] Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”, Prentice Hall, 2003, Page 8.
[9] Norbert M. Seel, “Encyclopedia of the Sciences of Learning”, Springer Science & Business Media, 2011, ISBN 9781441914279, Page 2910.
[10] Wikipedia, “Sequential Pattern Mining”, Wikipedia, Accessed, February 15, 2016, https://en.wikipedia.org/wiki/Sequential_pattern_mining


Data Analytics Simplified – A Tutorial – Part 2

By Kato Mivule

Part 1

Keywords: Data analytics, Database querying


Types of Data Analytics

While data analytics might involve querying a database, the difference between data analytics and standard database querying for information can be described as follows [1]:

  • Query: In data analytics, the question might not be well formulated, whereas a database query is always well formulated.
  • Data: In data analytics, the data is usually well organized for better analytics results, that is, cleaned and preprocessed (for example, missing values are removed). For database queries, however, the data is not necessarily cleaned before querying.
  • Results: While basic descriptive statistics can be derived by querying a database, data analytics results are usually statistical analyses and information patterns of the data.

Data Analytics Algorithms and Models

Algorithms

  • Data analytics involves applying algorithms to derive information patterns from data [1].
  • An algorithm is a step-by-step process for accomplishing a task. In data analytics, algorithms are used in an effort to fit a model (for example, a classification model) to the data being analyzed [2].

Data Models

  • A data model is a conceptual design that assumes how the data will be categorized or classified [3].
  • In other words, a data model in this case is a presupposed representation of what is expected of the data being analyzed [4].

Therefore, data analytics involves the following tasks [1] (a model-selection sketch follows this list):

  • Using various computational algorithms to extract meaningful information patterns from data.
  • Creating models for extracting meaningful, previously unknown patterns of information.
  • Using data analytics algorithms in an attempt to fit a model to the data being examined.
  • Using computational algorithms that assess the data and determine the model that best fits the characteristics of the data being observed.
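A minimal sketch of the last two tasks, fitting candidate models and keeping whichever best fits the data's characteristics (synthetic data, scikit-learn assumed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)

# Assess the data and keep whichever model best fits its characteristics.
candidates = {
    "decision tree": DecisionTreeClassifier(max_depth=3),
    "logistic regression": LogisticRegression(max_iter=1000),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> best-fitting model:", best)
```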

Additionally, data analytics algorithms are made up of three components [1]:

  • Models: The aim of the algorithm is to fit the model to the data being analyzed.
  • Conditions: A set of conditions is used to select and fit a model on the data.
  • Data Exploration: Data analytics algorithms involve exploration of the data being analyzed.

Furthermore, data analytics can be divided into two major categories [1]:

  • Predictive analytics – involves making future predictions using the data being analyzed.
  • Descriptive analytics – involves learning new, previously unknown patterns in the data being analyzed, without making any future predictions (the sketch below contrasts the two).
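The toy sketch below contrasts the two categories on the same invented data: descriptive analytics characterizes what is already there, while predictive analytics estimates an unseen future value:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"hour": range(8),
                   "cars": [12, 15, 20, 28, 35, 41, 50, 58]})

# Descriptive analytics: learn about the data we already have.
print(df["cars"].describe())

# Predictive analytics: use the same data to predict the next, unseen value.
model = LinearRegression().fit(df[["hour"]], df["cars"])
print(model.predict(pd.DataFrame({"hour": [8]})))  # forecast for hour 8
```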

References

[1] Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”, Prentice Hall, 2003, Page 3.
[2] Wikipedia, “Algorithm”, Accessed February 5, 2016, https://en.wikipedia.org/wiki/Algorithm
[3] Paulraj Ponniah, “Data Modeling Fundamentals: A Practical Guide for IT Professionals”, John Wiley & Sons, 2007, ISBN 9780470141014, Page 360.
[4] Wikipedia, “Data Model”, Accessed February 5, 2016 https://en.wikipedia.org/wiki/Data_model


Data Analytics Simplified – A Tutorial – Part 1

By Kato Mivule  

Keywords: Data Science, Data Mining, and Data Analytics 


Data Analytics

 

Introduction

  • Is it data science, data mining, or data analytics? All mean essentially the same thing. There is a big craze around the term “data science”, but there is not much difference between data mining, data analytics, and data science.
  • I prefer the term “data analytics”; in my opinion it carries less pomp and suggests someone given to the serious science and art of meticulously discovering and extracting meaningful knowledge and wisdom from data and information.
  • I use the term “meaningful” because each company, or entity for that matter, seeks to extract information patterns or knowledge from a particular domain that are meaningful and useful to that particular organization. For example, law enforcement and local grocery stores might derive different patterns of information from car traffic data.
  • The goal of this tutorial is to provide a simplified and demystified understanding of data analytics, a.k.a. data science, and to provide examples that anyone can implement on their own.
  • The quantity of data keeps increasing, to the point that most of it goes unprocessed, without anyone actually finding out what such data means.
  • Today folks are talking about “big data”, referring to extremely large quantities of data, and “dark data”, referring to data that remains unprocessed, with no knowledge or information extracted from it.
  • Individuals who can extract information, knowledge, and wisdom from such data are in great demand, and will remain so for the foreseeable future. We hope this set of tutorials provides a quick and simplified way to learn and apply data analytics skills to extract knowledge and wisdom from a diverse set of data domains.

Textbook

  • Although I will consider other resources, for this tutorial, I shall use as the textbook, “Data Mining: Introductory and Advanced Topics”, by Margaret H. Dunham, Prentice Hall, 2003, ISBN: 0-13-088892-3.

What is Data?

  • Merriam-Webster – data are facts or information used generally for calculations, analysis, and planning [1].
  • Wikipedia – data are a collection of unorganized symbols, such as, alphabets, numbers, and images [2].

What is Information?

  • Merriam-Webster – Information is knowledge, facts, or details collected about something [3].
  • Wikipedia – Information is an organized collection of data – symbols, alphabets, numbers, and images, to inform [4].

What is Knowledge?

  • Merriam-Webster – Knowledge is the totality of what is recognized in terms of fact, information, and ideas learned by humanity [5].
  • Wikipedia – Knowledge is an understanding of someone or something based on facts, information, descriptions, or skills, acquired through learning [6].

What is Wisdom?

  • Merriam-Webster – Wisdom is knowledge that is achieved through various experiences in life and philosophic learning [7].
  • Wikipedia – Wisdom is the skill and ability to reason and then take action based on facts, information, knowledge, experience, and understanding [8].

Data Analytics

  • Data analytics is the process of using computational systems and statistics to find meaningful, unknown information patterns in data [9] [10].

References

[1] Merriam-Webster, “Data”, Merriam-Webster.com, accessed January 29, 2016, http://www.merriam-webster.com/dictionary/data
[2] Wikipedia, “Data”, accessed January 25, 2016, https://en.wikipedia.org/wiki/Data
[3] Merriam-Webster, “Information”, Merriam-Webster.com, accessed January 29, 2016, http://www.merriam-webster.com/dictionary/information
[4] Wikipedia, “Information”, accessed January 28, 2016, https://en.wikipedia.org/wiki/Information
[5] Merriam-Webster, “Knowledge”, Merriam-Webster.com, accessed January 29, 2016, http://www.merriam-webster.com/dictionary/knowledge
[6] Wikipedia, “Knowledge”, accessed January 28, 2016, https://en.wikipedia.org/wiki/Knowledge
[7] Merriam-Webster, “Wisdom”, Merriam-Webster.com, accessed January 29, 2016, http://www.merriam-webster.com/dictionary/wisdom
[8] Wikipedia, “Wisdom”, accessed January 28, 2016, https://en.wikipedia.org/wiki/Wisdom
[9] Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”, Prentice Hall, 2003, Page 3.
[10] Thomas A. Runkler, “Data Analytics: Models and Algorithms for Intelligent Data Analysis”, Springer Science & Business Media, 2012, Page 2, ISBN: 9783834825896


Kato Mivule | arXiv.org | PDF

Abstract
While a number of data privacy techniques have been proposed in recent years, few frameworks have been suggested for implementing the data privacy process. Most of the proposed approaches are tailored towards implementing a specific data privacy algorithm but not the overall data privacy engineering and design process. Therefore, as a contribution, this study proposes SIED (Specification, Implementation, Evaluation, and Dissemination), a conceptual framework that takes a holistic approach to the data privacy engineering procedure by looking at the specification, implementation, evaluation, and, finally, dissemination of the privatized data sets.

http://arxiv.org/abs/1309.6576


Kato Mivule and Claude Turner

Abstract
By law, organizations are obligated to safeguard the privacy of individuals when handling data sets containing personally identifiable information (PII). Nevertheless, during the process of data privatization, the utility, or usefulness, of the privatized data diminishes. Yet achieving the optimal balance between data privacy and utility needs has been documented as an NP-hard challenge. In this study, we investigate data privacy and utility preservation using KNN machine learning classification as a gauge.
http://arxiv.org/abs/1309.3964

http://arxiv.org/pdf/1309.3964v1


By Kato Mivule

Differential privacy, proposed by Cynthia Dwork (2006), is the latest state-of-the-art in data privacy.

Differential privacy enforces confidentiality by:

  • Returning perturbed aggregated query results from databases.

  • Such that users cannot discern if any data item has been altered or not.

  • An attacker cannot derive information about any data item in the database.

According to Dwork (2006):

  • Two databases D1 and D2 are considered similar if they differ in only one element or row,

  • That is, |D1 Δ D2| = 1.

  • Therefore, a privacy-granting procedure qn satisfies ε-differential privacy if the results of the same query run on database D1 and again on database D2 are probabilistically similar and satisfy the following condition:

P[qn(D1) ∈ R] / P[qn(D2) ∈ R] ≤ exp(ε)

  • Where D1 and D2 are the two databases.
  • P is the probability of the perturbed query results from D1 and D2 respectively.

  • qn() is the privacy-granting procedure (perturbation).

  • qn(D1) is the privacy-granting procedure applied to query results from database D1.

  • qn(D2) is the privacy-granting procedure applied to query results from database D2.

  • R is the set of perturbed query results from the databases D1 and D2 respectively.

  • exp(ε) is the exponential of the epsilon value.

The probability of the perturbed query results qn(D1) divided by the probability of the perturbed query results qn(D2) should be less than or equal to exp(ε).

That is to say, if we run the same query on database D1 and then run it again on database D2, the query results should be probabilistically similar.

If the condition can be satisfied in the existence or nonexistence of the most influential observation for a specific query, then this condition will also be satisfied for any other observation.

The effect of the most influential observation for a given query is Δf, assessed as follows:

Δf = Max||f(D1) – f(D2)|| over all possible observed values of D1 and D2

According to Dwork (2006), the results to a query are given as

  • f(x) + Laplace(0, b) noise addition

  • Where b = Δf/ε

  • x represents a particular observed value of the database

  • f(x) represents the true result to the query

  • Then the result satisfies ε-differential privacy.

  • The Δf must consider all possible observed values of D1 and D2 (a code sketch of this mechanism follows below).
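A minimal sketch of this Laplace mechanism in Python with NumPy; the query result, sensitivity, and ε values below are illustrative only:

```python
import numpy as np

def laplace_mechanism(true_result, sensitivity, epsilon, rng=None):
    """Return f(x) + Laplace(0, b) noise, where b = Δf / ε (Dwork, 2006)."""
    rng = np.random.default_rng() if rng is None else rng
    b = sensitivity / epsilon
    return true_result + rng.laplace(loc=0.0, scale=b)

# Illustrative numbers: a query whose true answer is 3.2,
# sensitivity Δf = 2.0, and privacy parameter ε = 0.01.
print(laplace_mechanism(3.2, sensitivity=2.0, epsilon=0.01))
```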

Differential Privacy Illustration

Example:

What is the GPA of students at Geek Nerd State University?

We compute the maximum possible difference between two databases that differ in exactly one record for a specific query.

Δf = Max||f(D1) – f(D2)||

Let Min GPA = 2.0 for smallest possible GPA

Let Max GPA = 4.0 for largest possible GPA

Δf = | Max GPA – Min GPA|

Δf = 2.0

The parameter b of the Laplace noise is set to Δf/ε = 2.0/0.01 = 200,

giving a Laplace(0, 200) noise distribution.

Variance of the noise distribution = 2 × 200^2 = 80,000.

A small ε value of 0.01 is chosen; smaller ε values yield greater privacy from the procedure.

However, utility risks degenerating with much smaller values of ε.

For example, ε = 0.0001 gives b = 20,000, i.e., a Laplace(0, 20000) noise distribution.

The unperturbed results of the query + noise from Laplace(0, 200) = perturbed query results satisfying ε-differential privacy.

SQL: SELECT GPA FROM Student + Laplace noise(0, 200) = ε-differentially private query results. (A numeric sketch of this example follows below.)
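A numeric sketch of the GPA example, following the post's values Δf = 2.0, ε = 0.01, and b = 200; the GPA values themselves are invented, and the noise is applied to an aggregate (the mean) standing in for the published query result:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical GPAs standing in for SELECT GPA FROM Student.
gpas = np.array([2.8, 3.1, 3.9, 2.4, 3.5])
true_mean = gpas.mean()

delta_f = 4.0 - 2.0       # Max GPA - Min GPA = 2.0
epsilon = 0.01
b = delta_f / epsilon     # = 200, as computed above

# Perturbed (privatized) aggregate query result.
private_mean = true_mean + rng.laplace(0.0, b)
print(f"true mean = {true_mean:.2f}, private answer = {private_mean:.2f}")
```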

Pros and Cons

  • Grants across-the-board privacy.

  • Easy to implement with SQL for aggregated data publication.

  • Utility is a challenge, as statistical properties change with much smaller values of ε.

  • The noise takes into account outliers and the most influential observation.

  • For example, the income of Warren Buffett versus the income of a janitor in Omaha, Nebraska.

  • The balance between privacy and utility is still an NP-hard challenge.

References

[1] C. Dwork, “Differential privacy,” in ICALP, vol. 2, 2006, pp. 1-12. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.83.7534

[2] K. Muralidhar and R. Sarathy, “Does differential privacy protect Terry Gross’ privacy?” in Privacy in Statistical Databases, ser. Lecture Notes in Computer Science, J. Domingo-Ferrer and E. Magkos, Eds. Berlin, Heidelberg: Springer Berlin / Heidelberg, 2011, vol. 6344, ch. 18, pp. 200-209. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-15838-4_18
