
## Data Analytics Simplified – A Tutorial – Part 4 – Statistics

By Kato Mivule

Basic Statistics

The Normal Distribution – Credit: Wikipedia

• Data analytics is based largely on statistics and statistical modeling. However, a basic knowledge of descriptive and inference statistics should suffice for beginners.

Descriptive statistics

• Descriptive statistics are values used to summarize the characteristics of the data being studied. Such descriptions could include values such as the mean, maximum, minimum, mode, and median of the values in a variable [1]. For example, for the variable Y with values {2, 4, 3, 5, 7}, the maximum value is 7 and the minimum value is 2.
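The summary values for the example variable Y can be checked with Python's standard `statistics` module. This is a minimal sketch: the variable name `y` and the choice of Python are assumptions made for illustration and are not part of the original tutorial.

```python
import statistics

# Example variable Y from the text above
y = [2, 4, 3, 5, 7]

summary = {
    "minimum": min(y),
    "maximum": max(y),
    "mean": statistics.mean(y),      # sum of the values divided by their count
    "median": statistics.median(y),  # middle value of the sorted data
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

For this sample the minimum is 2, the maximum is 7, the mean is 4.2, and the median is 4.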

Inference statistics

• Inference statistics are used to draw conclusions about what the data being studied means. In other words, inference statistics is an attempt to deduce the meaning of the data and its descriptive statistics [2]. In this tutorial, we shall look at covariance and correlation as part of inference statistics. While some data scientists dissect inference statistics in order to design new machine learning algorithms, this tutorial is concerned with using data analytics as a tool to extract knowledge, wisdom, and meaning from data. We shall therefore focus on applying data science algorithms to data analytics problems and leave the design of such algorithms for the advanced stage. Accordingly, basic statistics such as the normal distribution, mean, variance, standard deviation, covariance, and correlation will be considered.

Descriptive statistics

• Mean

The mean μ is the average of a set of numeric values. The summation (total) of the values, symbolized by ∑, is taken and then divided by n, the number of values, as shown in the following formula [3]:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$

• Mode

The mode is the value or item that appears most frequently in an attribute or dataset [4].

• Median

The median is the value that separates a sorted data sample into two halves, the lower half from the upper half [4].
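The mode and median can be sketched with the standard `statistics` module. The sample `scores` below is hypothetical and chosen only to show two edge cases: an even number of values, where the median is the average of the two middle values, and a tie, where more than one value is a mode.

```python
import statistics

# Hypothetical sample with an even number of values
scores = [3, 5, 3, 8, 5, 3]

mode_value = statistics.mode(scores)      # most frequent value
median_value = statistics.median(scores)  # average of the two middle values

# multimode returns every value that shares the highest count
tied_modes = statistics.multimode([1, 1, 2, 2, 9])

print(mode_value, median_value, tied_modes)
```

Sorted, the sample is 3, 3, 3, 5, 5, 8, so the median falls halfway between 3 and 5.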

• Minimum

The minimum is the smallest value in a variable or dataset [5].

• Maximum

The maximum is the largest value in a dataset or variable [5].

• Range

The range is a measure of the dispersion or spread of values in a variable, computed as the difference between the maximum and the minimum [6].
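As a quick illustration, the minimum, maximum, and range can be computed with Python built-ins. The helper name `value_range` is an assumption introduced for this sketch.

```python
def value_range(values):
    """Return the spread between the largest and smallest value."""
    return max(values) - min(values)

# Example variable Y used earlier in the tutorial
y = [2, 4, 3, 5, 7]

print(min(y), max(y), value_range(y))
```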

• Frequency Analysis

A frequency analysis is a count of the number of times each item or value appears in a data sample. The frequency metric can be described using the following formula [7]:

Where items(c) is the set of items in a dataset c.
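A frequency count can be sketched with `collections.Counter` from the Python standard library. The item labels are hypothetical, and dividing each count by the sample size to get a relative frequency is an assumption about how the counts might be normalized, not necessarily the metric defined in [7].

```python
from collections import Counter

# Hypothetical data sample; the item labels are illustrative only
items = ["red", "blue", "red", "green", "red", "blue"]

counts = Counter(items)  # absolute frequency of each item
total = len(items)

# Relative frequency: share of the sample taken by each item (assumed normalization)
relative = {item: count / total for item, count in counts.items()}

for item, count in counts.most_common():
    print(f"{item}: count={count}, relative={relative[item]:.2f}")
```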

• Normal Distribution

The Normal Distribution N(μ, σ²), also called the Gaussian distribution, is a bell-shaped continuous probability distribution that describes how values are spread around the mean. Its probability density function is as follows [3]:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The parameter μ represents the mean, the location of the peak of the bell curve. The parameter σ² is the variance, which controls the width of the distribution. The notation N(μ, σ²) represents a normal distribution with mean μ and variance σ², so X ~ N(μ, σ²) states that the variable X is normally distributed with those parameters. The distribution with μ = 0 and σ² = 1 is known as the standard normal.
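The bell curve can be evaluated directly from the standard density of a normal distribution. The sketch below assumes that standard formula; the function name `normal_pdf` is hypothetical, and `statistics.NormalDist` (Python 3.8 and later) is used only as a cross-check.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x; sigma is the standard deviation."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Height of the standard normal curve at its peak (x equal to the mean)
peak = normal_pdf(0.0)

# Cross-check against the standard library implementation
reference = NormalDist(mu=0.0, sigma=1.0).pdf(0.0)

print(peak, reference)
```

The peak of the standard normal is roughly 0.399. A larger σ lowers and widens the curve, while changing μ shifts it left or right.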

• Variance

The variance σ² measures how widely the data values are spread around the mean. Variance can be calculated as follows [3]:

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

Where σ² is the variance, µ is the mean, X represents the data values in the variable X, and N is the total number of values. The term ∑ (X - µ)² is the sum of the squared differences between each data value and the mean µ.
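The variance can be worked through step by step for the example variable Y = {2, 4, 3, 5, 7}. This sketch assumes the population form, dividing by N rather than N - 1, and uses `statistics.pvariance` only as a cross-check.

```python
import statistics

y = [2, 4, 3, 5, 7]
n = len(y)

mu = sum(y) / n                                           # mean, 4.2 for this sample
squared_deviations = [(value - mu) ** 2 for value in y]   # (X - mu)^2 for each value
variance = sum(squared_deviations) / n                    # average squared deviation

print(variance)                 # about 2.96
print(statistics.pvariance(y))  # population variance from the standard library
```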

• Standard Deviation

The standard deviation σ measures how far data values typically deviate from the mean, and can be computed by taking the square root of the variance σ² as follows [3]:

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$
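Because the standard deviation is in the same units as the data, it can be used to ask how many values lie within one deviation of the mean. The sketch below assumes the population form and reuses the example variable Y; the name `within_one_sigma` is hypothetical.

```python
import math
import statistics

y = [2, 4, 3, 5, 7]

mu = statistics.mean(y)
sigma = math.sqrt(statistics.pvariance(y))  # square root of the population variance

# Values that fall within one standard deviation of the mean
within_one_sigma = [value for value in y if abs(value - mu) <= sigma]

print(sigma)             # about 1.72
print(within_one_sigma)
```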

Inference statistics

• Covariance

Covariance Cov(X, Y) measures how the deviations of two variables X and Y from their respective means are associated [3]. If the covariance is positive, the values of X and Y tend to increase together; if it is negative, one tends to decrease while the other increases. If the covariance is zero, there is no linear relationship between the two variables, although this alone does not prove that they are independent. Covariance can be calculated as follows [3]:

$$\mathrm{Cov}(X, Y) = \frac{\sum (X - \mu_X)(Y - \mu_Y)}{N}$$

Where µ_X and µ_Y are the means of X and Y, and N is the number of paired values.
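Covariance can be computed directly as the average product of paired deviations from the means. The sketch below assumes the population form (dividing by N); the function name `covariance` and both data sets are hypothetical. The second data set shows that a covariance of zero only rules out a linear association: y is fully determined by x there, yet the covariance is zero.

```python
def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Values that tend to rise together give a positive covariance
x = [1, 2, 3, 4, 5]
y = [2, 4, 3, 5, 7]
print(covariance(x, y))  # about 2.2

# y depends entirely on x (y = x^2), but not linearly, so the covariance is zero
x_sym = [-2, -1, 0, 1, 2]
y_sq = [value ** 2 for value in x_sym]
print(covariance(x_sym, y_sq))
```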

• Correlation

Correlation measures the strength and direction of the linear relationship between two variables X and Y [3]. The correlation is independent of the units in which X and Y are measured. Correlation values lie between -1 and 1. If the correlation is negative, the linear relationship between the two variables is negative, and it is strongest at -1. If the correlation is 0, there is no linear relationship between the two variables. As the correlation approaches 1, the positive linear relationship gets stronger. Correlation can be computed using the following formula [3]:

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

Where σ_X and σ_Y are the standard deviations of X and Y.
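A correlation coefficient can be sketched by dividing the covariance by the product of the two standard deviations. This assumes the Pearson form of the coefficient; the function names and data are hypothetical. Rescaling one variable, for example changing its units, leaves the result unchanged.

```python
import math

def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    # covariance(v, v) is the variance of v, so the square root of the
    # product of the two variances is sigma_x * sigma_y
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

x = [1, 2, 3, 4, 5]
y = [2, 4, 3, 5, 7]

r = correlation(x, y)                                   # about 0.90, strong positive
r_rescaled = correlation(x, [100 * v for v in y])       # same value after a unit change
r_negative = correlation(x, [-v for v in x])            # perfect negative relationship

print(r, r_rescaled, r_negative)
```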

References
