Archive for February, 2016

Data Analytics Simplified – A Tutorial – Part 4 – Statistics

By Kato Mivule

Part 1, Part 2, Part 3

Basic Statistics


The Normal Distribution – Credit: Wikipedia

  • Data analytics is based largely on statistics and statistical modeling. However, a basic knowledge of descriptive and inferential statistics should suffice for beginners.

Descriptive statistics

  • Descriptive statistics are values used to summarize the characteristics of the data being studied. Such descriptions include values such as the mean, maximum, minimum, mode, and median of the values in a variable [1]. For example, for the variable Y with values {2, 4, 3, 5, 7}, the maximum value is 7, while the minimum value is 2.

Inferential statistics

  • Inferential statistics are used to draw conclusions about what the data being studied means. In other words, inferential statistics attempt to deduce the meaning of the data and its descriptive statistics [2]. For this tutorial, we shall look at covariance and correlation as part of the inferential statistics. While some data scientists spend time dissecting inferential statistics in efforts to design new machine learning algorithms, in this tutorial we are concerned with using data analytics as a tool to extract knowledge, wisdom, and meaning from the data. So, we shall focus on the application side rather than the design side of data science. In other words, we shall look at how to apply data science algorithms to solve data analytics problems and leave the design of such algorithms for the advanced stage. Therefore, basic statistics such as the normal distribution, mean, variance, standard deviation, covariance, and correlation will be given consideration.

Descriptive statistics

  • Mean

The mean μ is the average of a set of numeric values. The summation (total) of the values, symbolized by ∑, is taken and then divided by n, the number of values, as shown in the following formula [3]:

μ = (∑ X) / n
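The mean of the example variable Y = {2, 4, 3, 5, 7} can be computed directly in Python (a minimal sketch):

```python
# Mean: the sum of the values divided by the number of values.
values = [2, 4, 3, 5, 7]
mean = sum(values) / len(values)
print(mean)  # 4.2
```
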
  • Mode

The mode is the value or item that appears most frequently in an attribute or dataset [4].

  • Median

The median is the value that separates a data sample into two halves – the lower half from the upper half [4].
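Both measures are available in Python's standard library (a small sketch with illustrative data):

```python
import statistics

data = [2, 4, 4, 3, 5, 7]
print(statistics.mode(data))    # 4 appears most often
print(statistics.median(data))  # middle of the sorted values: 4.0
```
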

  • Minimum

The minimum is the smallest value in a variable or dataset [5].

  • Maximum

The maximum is the largest value in a dataset or variable [5].

  • Range

The range is a measure of dispersion – the spread of values between the minimum and maximum in a variable [6].
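The minimum, maximum, and range of the example variable can be read off in one line (a minimal sketch):

```python
y = [2, 4, 3, 5, 7]
minimum, maximum = min(y), max(y)
value_range = maximum - minimum  # spread between extremes
print(minimum, maximum, value_range)  # 2 7 5
```
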

  • Frequency Analysis

A frequency analysis is a count of the number of times items or values appear in a data sample. The frequency metric can be described using the following formula [7]:

freq(c) = ∑ count(n), for n ∈ items(c)
Where items(c) is the set of items in a dataset c.
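In Python, a frequency count can be sketched with the standard-library Counter (the items below are illustrative):

```python
from collections import Counter

items = ["a", "b", "a", "c", "a", "b"]
freq = Counter(items)        # counts each distinct item
print(freq["a"], freq["b"])  # 3 2
```
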

  • Normal Distribution

The normal distribution N(μ, σ2), also known as the Gaussian distribution, is a bell-shaped continuous probability distribution that shows how values are spread around the mean. The normal distribution can be computed as follows [3]:

f(x) = (1 / √(2πσ2)) e^(−(x − μ)2 / (2σ2))
The parameter μ represents the mean, the point of the peak in the bell curve. The parameter σ2 symbolizes the variance, the width of the distribution. The notation N(μ, σ2) represents a normal distribution with mean μ and variance σ2, so X ~ N(μ, σ2) indicates that the variable X is normally distributed with mean μ and variance σ2. The distribution with μ = 0 and σ2 = 1 is known as the standard normal.
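The density can be evaluated with a few lines of Python (a sketch; the function name is illustrative):

```python
import math

def normal_pdf(x, mu=0.0, sigma2=1.0):
    # Density of N(mu, sigma2) evaluated at x.
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# The standard normal (mu = 0, sigma2 = 1) peaks at x = 0.
print(round(normal_pdf(0.0), 4))  # 0.3989
```
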

  • Variance

The variance σ2 is a calculation of how the data values spread around the mean. Variance can be calculated as follows [3]:

σ2 = ∑ (X − µ)2 / N

Where σ2 is the variance, µ is the mean, X denotes the data values in the variable X, N is the total number of values, and ∑ (X − µ)2 is the sum of the squared deviations of the values in X from the mean µ.

  • Standard Deviation

The standard deviation σ calculates how data values deviate from the mean, and can be computed by simply taking the square root of the variance σ2 as follows [3]:

σ = √σ2
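Both the variance and the standard deviation can be sketched in a few lines of Python, using the example values from earlier:

```python
import math

x = [2, 4, 3, 5, 7]
mu = sum(x) / len(x)                               # mean
variance = sum((v - mu) ** 2 for v in x) / len(x)  # population variance
std_dev = math.sqrt(variance)                      # square root of variance
print(round(variance, 2), round(std_dev, 2))  # 2.96 1.72
```
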

Inference statistics

  • Covariance

Covariance Cov(X, Y) is a calculation of how the deviations between two variables X and Y are associated [3]. If the covariance is positive, the values of X and Y increase together; if it is negative, one diminishes while the other increases. If the covariance is zero, the two variables show no linear relationship. Covariance can be calculated as follows [3]:

Cov(X, Y) = ∑ (X − µx)(Y − µy) / N
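A short Python sketch of the covariance calculation (the paired values below are illustrative):

```python
x = [2, 4, 3, 5, 7]
y = [1, 3, 2, 5, 6]
mx, my = sum(x) / len(x), sum(y) / len(y)  # means of x and y
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
print(round(cov, 2))  # 3.12 -> positive: x and y increase together
```
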

  • Correlation measures the strength of the linear relationship between two variables x and y [3]. The correlation is independent of the units in which x and y are measured, and its values range between −1 and 1. A correlation of −1 indicates a perfectly negative linear relationship, a correlation of 0 indicates that no linear relationship exists, and the linear relationship grows stronger as the correlation approaches 1. Correlation can be computed using the following formula [3]:

r = Cov(X, Y) / (σx σy)
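The correlation coefficient follows directly from the covariance and standard deviations (a sketch with illustrative data):

```python
import math

x = [2, 4, 3, 5, 7]
y = [1, 3, 2, 5, 6]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)  # std dev of x
sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)  # std dev of y
r = cov / (sx * sy)  # close to 1: strong positive linear relationship
print(round(r, 2))
```
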

[1] Wikipedia, “Descriptive statistics”, Accessed 02/26/2016, Available Online: https://en.wikipedia.org/wiki/Descriptive_statistics
[2] Wikipedia, “Statistical inference”, Accessed 02/26/2016, Available Online: https://en.wikipedia.org/wiki/Statistical_inference
[3] Kato Mivule, “Utilizing Noise Addition for Data Privacy, an Overview”, In Proceedings of the International Conference on Information and Knowledge Engineering (IKE 2012), 2012, Pages 65–71.
[4] Larry Hatcher, “Step-by-Step Basic Statistics Using SAS: Student Guide”, SAS Institute, 2003, ISBN 9781590471487, Page 183
[5] Dennis Cox and Michael Cox, “The Mathematics of Banking and Finance”, John Wiley & Sons, 2006, ISBN 9780470028889, Page 37
[6] Jacinta M. Gau, “Statistics for Criminology and Criminal Justice”, SAGE Publications, 2015, ISBN 9781483378442, Page 98.
[7] Philip Resnik, “Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language”, Journal of Artificial Intelligence Research, Vol. 11, 1999, Pages 95–130.


Data Analytics Simplified – A Tutorial – Part 3 

By Kato Mivule

Keywords: Supervised Learning, Unsupervised Learning

Part 1, Part 2

Data analytics can be broken down into two main categories, as we saw in the previous post: predictive and descriptive data analytics. Furthermore, data analytics tasks can be categorized as follows:

Supervised learning tasks: These involve algorithms that group data into classes and make predictions based on previous examples – thus learning by example [1]. In other words, data is grouped into predetermined classes based on a previous history of categorizing the data. The data in supervised learning is always labeled with classes and is basically divided into:

  • Training data – which is used to provide examples of how to group the data, and thus to create a model for future grouping of the data.
  • Testing data – which is used to make predictions, based on the example data.
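The training/testing split can be sketched with a tiny 1-nearest-neighbour classifier (the data and labels below are purely illustrative):

```python
# Training data: labeled examples (value, class).
train = [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]

def predict(x):
    # Assign the label of the closest training example.
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

# Testing data: unseen values classified from the examples above.
print(predict(1.5))  # small
print(predict(8.5))  # large
```
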

Unsupervised learning tasks: These involve algorithms that do not need predetermined classes to categorize the data [2]. In such cases, the data itself determines the groups or clusters into which similar data values collect.

  • Predictive data analytics methods are largely supervised learning methods.
  • Descriptive data analytics methods are mainly unsupervised learning methods.



Data Analytics – Supervised and Unsupervised Learning Tasks

Briefly, predictive data analytics includes the following three main data analytics methods; more details shall follow later:

  • Classification analysis

Classification involves the grouping or prediction of data into predefined categorical classes or targets. The classes into which the data is grouped are chosen, based on the characteristics of the data, before the analysis [3].

  • Regression analysis

Regression involves the prediction or grouping of data items into predefined numerical classes or targets based on a known mathematical function, such as a linear or logistic regression function [4].
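As an illustration, a simple least-squares linear regression can be fitted in a few lines of Python (the data points are made up for the example):

```python
# Least-squares fit of y = slope * x + intercept.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx
print(slope, intercept)  # 2.0 0.0 -> the fitted model predicts y = 2x
```
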

  • Time series analysis

Time series analysis involves examining the values of a data attribute recorded over a period of time, in chronological order. Predictions are made based on the history of the values observed over time [5] [6].
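A minimal illustration: forecasting the next value of a series as the mean of its most recent observations (a naive moving-average sketch; the series is illustrative):

```python
# Forecast the next value as the mean of the last `window` observations.
series = [10, 12, 11, 13, 15, 14]
window = 3
forecast = sum(series[-window:]) / window
print(forecast)  # 14.0
```
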

Descriptive data analytics includes the following unsupervised data analytics methods, more details shall follow later:

  • Clustering

Clustering involves grouping data into classes without any predefined or predetermined classes or targets. The classes into which the data is grouped are determined by the data itself, such that similar items collect in the same clusters [7].
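A toy 1-D k-means sketch illustrates this: the values below self-organize around two centroids without any predefined labels (data and initialization are illustrative):

```python
data = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centroids = [data[0], data[-1]]  # naive initialization: first and last value

for _ in range(10):  # a few iterations suffice for this tiny example
    clusters = [[], []]
    for v in data:
        # Assign each value to its nearest centroid.
        idx = min((0, 1), key=lambda i: abs(v - centroids[i]))
        clusters[idx].append(v)
    # Recompute each centroid as the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # [1.5, 8.5]
```
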

  • Summarization

Summarization is the generalization of data into related groups described with summary statistics such as the mean, mode, and median [8].
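For example, records can be summarized by computing the mean of each group (the records below are hypothetical):

```python
from collections import defaultdict

# Hypothetical records of (group, value), summarized by group mean.
records = [("a", 2), ("a", 4), ("b", 10), ("b", 12)]
groups = defaultdict(list)
for g, v in records:
    groups[g].append(v)
summary = {g: sum(vs) / len(vs) for g, vs in groups.items()}
print(summary)  # {'a': 3.0, 'b': 11.0}
```
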

  • Association Rules

Association rules involve using a set of IF-THEN rules or functions to place data items with similar relationships in the same groups [9].
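A toy example: scoring the rule IF bread THEN butter by its confidence over a handful of hypothetical transactions:

```python
# Rule: IF bread THEN butter, scored by confidence = P(butter | bread).
transactions = [{"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "milk"}]
with_bread = [t for t in transactions if "bread" in t]
confidence = sum("butter" in t for t in with_bread) / len(with_bread)
print(round(confidence, 2))  # 0.67
```
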

  • Sequence Discovery

Similar to time series analysis, sequence discovery, or sequential pattern analysis, involves finding related patterns in sequential data – that is, data in chronological order – based on statistical properties [10].


[1] Olivier Pietquin, “A Framework for Unsupervised Learning of Dialogue Strategies”, Presses univ. de Louvain, 2004, ISBN 9782930344638, Page 179.
[2] Nick Pentreath, “Machine Learning with Spark”, Packt Publishing Ltd, 2015, ISBN 9781783288526, Page 197.
[3] Ashish Gupta, “Learning Apache Mahout Classification Community experience distilled”, Packt Publishing Ltd, 2015, ISBN 9781783554966, Page 14.
[4] Robin Nunkesser, “Algorithms for Regression and Classification: Robust Regression and Genetic Association Studies”, BoD – Books on Demand, 2009, ISBN 9783837096040, Page 4.
[5] Gebhard Kirchgässner, Jürgen Wolters, Uwe Hassler, “Introduction to Modern Time Series Analysis”, Springer Texts in Business and Economics, Edition 2, illustrated, Springer Science & Business Media, 2012, ISBN 9783642334368, Page 1.
[6] George E. P. Box, Gwilym M. Jenkins, Gregory C. Reinsel, Greta M. Ljung, “Time Series Analysis: Forecasting and Control”, Wiley Series in Probability and Statistics, Edition 5, John Wiley & Sons, 2015, ISBN 9781118674925, Page 1.
[7] Brett Lantz, “Machine Learning with R”, Edition 2, Packt Publishing Ltd, 2015, ISBN 9781784394523, Page 286.
[8] Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”, Prentice Hall, 2003, Page 8.
[9] Norbert M. Seel, “Encyclopedia of the Sciences of Learning”, Springer Science & Business Media, 2011, ISBN 9781441914279, Page 2910.
[10] Wikipedia, “Sequential Pattern Mining”, Accessed February 15, 2016, https://en.wikipedia.org/wiki/Sequential_pattern_mining


Data Analytics Simplified – A Tutorial – Part 2

By Kato Mivule

Part 1

Keywords: Data analytics, Database querying


Types of Data Analytics

While data analytics might involve querying a database, the difference between data analytics and standard database querying for information can be described as follows [1]:

  • Query: The search query might not be well formulated in data analytics but is always well formulated for database queries.
  • Data: The data for analytics is usually well organized – cleaned and preprocessed, for example by removing missing values – for better analytics results. For database queries, however, the data is not necessarily cleaned before querying.
  • Results: While basic descriptive statistics can be derived from querying a database, data analytics results are usually statistical analyses of the information patterns in the data.

Data Analytics Algorithms and Models


  • Data analytics involves applying algorithms to derive information patterns from data [1].
  • An algorithm is a step-by-step process for accomplishing a certain task. In data analytics, algorithms are used in an effort to fit a model (classification) to the data being analyzed [2].

Data Models

  • A data model is a conceptual design that assumes how the data will be categorized or classified [3].
  • In other words, a data model in this case is a presupposed representation of what is expected of the data being analyzed [4].

Therefore, data analytics involves the following tasks [1]:

  • Using various computation algorithms to extract meaningful information patterns in data.
  • Creating models for extracting meaningful unknown patterns of information.
  • Using data analytics algorithms in an attempt to fit a model to the data being examined.
  • Using computation algorithms that assess the data and determine the model that best fits the characteristics of the data being observed.

Additionally, data analytics algorithms are made up of three components [1]:

  • Models: The aim of the algorithm is to fit the model to the data being analyzed.
  • Conditions: A set of conditions is used to select and fit a model on the data.
  • Data Exploration: Data analytics algorithms involve exploration of the data being analyzed.

Furthermore, data analytics can be divided into two major categories [1]:

  • Predictive analytics – involves making future predictions using the data being analyzed.
  • Descriptive analytics – involves learning new unknown patterns in the data being analyzed without making any future predictions.


[1] Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”, Prentice Hall, 2003, Page 3.
[2] Wikipedia, “Algorithm”, Accessed February 5, 2016, https://en.wikipedia.org/wiki/Algorithm
[3] Paulraj Ponniah, “Data Modeling Fundamentals: A Practical Guide for IT Professionals”, John Wiley & Sons, 2007, ISBN:9780470141014, Page 360.
[4] Wikipedia, “Data Model”, Accessed February 5, 2016 https://en.wikipedia.org/wiki/Data_model

