**Data Analytics Simplified – A Tutorial – Part 4 – Statistics**

**By Kato Mivule**

**Basic Statistics**

- Data analytics is based largely on statistics and statistical modeling. However, a basic knowledge of descriptive and inference statistics should suffice for beginners.

**Descriptive statistics**

- Descriptive statistics are values used to summarize the characteristics of the data being studied. Such descriptions could include values such as the mean, maximum, minimum, mode, and median of values in a variable [1]. For example, we could say that for the variable *Y* with values {2, 4, 3, 5, 7}, the maximum value is 7, while the minimum value is 2.
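These descriptive statistics can be computed directly in Python. A minimal sketch using the standard library's `statistics` module and the example variable *Y* above:

```python
import statistics

# The example variable Y from the text
Y = [2, 4, 3, 5, 7]

print("Mean:   ", statistics.mean(Y))    # 4.2
print("Median: ", statistics.median(Y))  # 4
print("Minimum:", min(Y))                # 2
print("Maximum:", max(Y))                # 7
```
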

**Inference statistics**

- Inference statistics are used to infer and draw conclusions about what the data being studied mean. In other words, inference statistics attempt to deduce the meaning of the data and its descriptive statistics [2]. For this tutorial, we shall look at covariance and correlation as part of the inference statistics. While some data scientists spend time dissecting inference statistics in an effort to design new machine learning algorithms, in this tutorial we are concerned with using data analytics as a tool to extract knowledge, wisdom, and meaning from the data. So, we shall focus on the application side rather than the design side of data science. In other words, we shall look at how to apply data science algorithms to solve data analytics problems, and leave the design of such algorithms for the advanced stage. Therefore, basic statistics such as the normal distribution, mean, variance, standard deviation, covariance, and correlation will be given consideration.

**Descriptive statistics**

**Mean**

*The Mean μ* is the average of numeric values: the total of those values divided by how many there are. The summation (total) of the values, symbolized by *∑*, is taken and then divided by *n*, the number of values, as shown in the following formula [3]:

*μ = (1/n) ∑ Xᵢ*
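A quick sketch of this formula in Python, using the example values from earlier:

```python
# Mean: sum the values, then divide by the number of values
X = [2, 4, 3, 5, 7]
mu = sum(X) / len(X)
print(mu)  # 4.2
```
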

**Mode**

The mode is the value or item that appears most frequently in an attribute or dataset [4].

**Median**

The median represents the value that separates a data sample into two halves – the first half from the second half [4].

**Minimum**

The minimum is the smallest value in a variable or dataset [5].

**Maximum**

The maximum is the largest value in a dataset or variable [5].

**Range**

The range is a measure of dispersion: the spread of values between the minimum and maximum in a variable [6].
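The mode, median, and range can be sketched in Python as well. A small illustrative sample with a repeated value is assumed here, so that the mode is well defined:

```python
import statistics

# Illustrative sample; the value 4 appears twice, so it is the mode
X = [2, 4, 4, 5, 7]

print("Mode:  ", statistics.mode(X))    # 4
print("Median:", statistics.median(X))  # 4
print("Range: ", max(X) - min(X))       # 7 - 2 = 5
```
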

**Frequency Analysis**

A frequency analysis is a count of the number of times each item or value appears in a data sample. The frequency metric can be described using the following formula [7]:

*freq(x) = |{ i : cᵢ = x }|, for each x ∈ items(c)*

Where *items(c)* is the set of distinct items in a dataset *c*.
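A frequency count like this can be sketched with Python's `collections.Counter`; the sample values below are illustrative:

```python
from collections import Counter

# Count how many times each value appears in the sample
c = [2, 4, 4, 5, 7, 4, 2]
freq = Counter(c)

print(freq[4])            # 3: the value 4 appears three times
print(sorted(freq))       # [2, 4, 5, 7]: the set items(c)
```
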

**Normal Distribution**

*The Normal Distribution N(μ, σ²)* (Gaussian distribution) is a bell-shaped continuous probability distribution used as an estimate of how values are spread around the mean value. The normal distribution can be computed as follows [3]:

*f(x) = (1 / √(2πσ²)) e^(−(x − μ)² / (2σ²))*

The parameter *μ* represents the mean, the point of the peak in the bell curve. The parameter *σ²* symbolizes the variance, which is the width of the distribution. The notation *N(μ, σ²)* represents a normal distribution with mean *μ* and variance *σ²*. Therefore, *X ~ N(μ, σ²)* describes a variable *X* that follows the normal distribution *N(μ, σ²)*. The distribution with *μ = 0* and *σ² = 1* is known as the standard normal.

**Variance**

The Variance *σ²* is a calculation of how the data distribute themselves around the mean value. Variance can be calculated as follows [3]:

*σ² = ∑(X − µ)² / N*

Where the symbol *σ²* is the variance, *µ* is the mean, *X* stands for the data values in the variable X, *N* is the total number of values, and *∑(X − µ)²* is the summation of the squared differences between each data value in *X* and the mean *µ*.

**Standard Deviation**

The Standard Deviation *σ* calculates how data values deviate from the mean, and can be computed by simply taking the square root of the variance *σ²* as follows [3]:

*σ = √(∑(X − µ)² / N)*
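Both the variance and the standard deviation formulas can be sketched together in Python, using the same example values as before:

```python
import math

# Population variance and standard deviation, computed from the formulas above
X = [2, 4, 3, 5, 7]
mu = sum(X) / len(X)                                # mean, 4.2
variance = sum((x - mu) ** 2 for x in X) / len(X)   # sigma squared
std_dev = math.sqrt(variance)                       # sigma

print(variance)  # ≈ 2.96
print(std_dev)   # ≈ 1.7205
```
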

**Inference statistics**

**Covariance**

Covariance *Cov(X, Y)* is a calculation of how associated the deviations between the data points *X* and *Y* are [3]. If the covariance is positive, the data values increase together; if it is negative, then for the two variables *X* and *Y*, one diminishes while the other increases. If the covariance is zero, there is no linear association between the two variables. Covariance can be calculated as follows [3]:

*Cov(X, Y) = ∑(X − µ_X)(Y − µ_Y) / N*
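A minimal Python sketch of the covariance formula, with two illustrative paired samples:

```python
# Population covariance of two paired samples (illustrative values)
X = [2, 4, 3, 5, 7]
Y = [1, 3, 2, 5, 8]

mu_x = sum(X) / len(X)
mu_y = sum(Y) / len(Y)
cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(X, Y)) / len(X)

print(cov)  # positive: X and Y tend to increase together
```
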

**Correlation**

- Correlation measures the strength of the linear relationship between two variables *x* and *y* [3]. The correlation is independent of the units in which the data points *x* and *y* are measured. Correlation values range between −1 and 1. If the correlation is negative, the linear relationship between the two variables is negative: one decreases as the other increases. If the correlation is 0, then the linear relationship between the two variables is non-existent. As the correlation approaches 1, the positive linear relationship exists and grows stronger. Correlation can be computed using the following formula [3]:

*r = Cov(X, Y) / (σ_X σ_Y)*
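Putting the pieces together, a Python sketch of the correlation formula – covariance divided by the product of the two standard deviations – using the same illustrative samples:

```python
import math

# Pearson correlation of two paired samples (illustrative values)
X = [2, 4, 3, 5, 7]
Y = [1, 3, 2, 5, 8]

mu_x = sum(X) / len(X)
mu_y = sum(Y) / len(Y)
cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(X, Y)) / len(X)
std_x = math.sqrt(sum((x - mu_x) ** 2 for x in X) / len(X))
std_y = math.sqrt(sum((y - mu_y) ** 2 for y in Y) / len(Y))

r = cov / (std_x * std_y)
print(r)  # close to 1: a strong positive linear relationship
```
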

**References**