Archive for March, 2016

Data Analytics Simplified – A Tutorial – Part 5 – The Data Analytics Process

By Kato Mivule

Part 1, Part 2, Part 3, Part 4

The Data Analytics Knowledge Discovery Process

The data analytics process involves using algorithms to extract, mine, and discover meaningful information patterns and knowledge in data [1].

However, the data would have to undergo a series of transformational processes before meaningful information patterns can be extracted.

There are six main phases in the data analytics process [1] [2] [3] [4] [5]:

Problem formulation

  • The first step of the data analytics process is to articulate the problem to be solved and the questions to be answered.
  • The question or problem to be solved has to be domain specific.
  • This helps with selecting the correct data. For example, local grocery stores might want to use traffic pattern data to predict when customers with cars are most likely to stop by a certain store.

Data Selection

  • The second step of the process is to select an appropriate dataset based on a specific domain.

Data Preprocessing

  • The third step is to transform the selected dataset into the format that is most appropriate for analytics algorithms.
  • This process is also called data cleaning, in which missing values are removed or replaced with averages. Data outliers may be removed or replaced with appropriate values. Inconsistent data types are also corrected at this stage.
  • Data analytics work involves spending a considerable amount of time preprocessing data to ensure correct analysis.
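
The cleaning steps above can be sketched in a few lines of Python. This is only an illustration, not a prescribed method: the `daily_visits` values and the helper names `to_float` and `clean_column` are hypothetical, the interquartile-range rule is one of several possible outlier tests, and replacing an outlier with the median is just one choice of "appropriate value".

```python
from statistics import mean, median, quantiles

def to_float(value):
    """Return value as a float, or None if it is missing or unparseable."""
    if value is None:
        return None
    try:
        return float(str(value).strip())
    except ValueError:
        return None

def clean_column(raw_values, iqr_factor=1.5):
    # Step 1: correct data types (strings such as "41" become 41.0).
    parsed = [to_float(v) for v in raw_values]
    present = [v for v in parsed if v is not None]

    # Step 2: flag outliers with the interquartile-range rule (an assumed choice)
    # and replace them with the median of the observed values.
    q1, _, q3 = quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - iqr_factor * spread, q3 + iqr_factor * spread
    typical = median(present)
    capped = [typical if v is not None and not (low <= v <= high) else v
              for v in parsed]

    # Step 3: replace missing values with the average of the remaining values.
    fill = mean(v for v in capped if v is not None)
    return [fill if v is None else v for v in capped]

# Hypothetical column with mixed types, two missing entries, and one outlier.
daily_visits = ["41", 38, None, "45", 40, 900, "", 43]
print(clean_column(daily_visits))
```

Outliers are handled before the missing values are filled so that an extreme value such as 900 does not distort the average used as the fill value.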

Data Transformation

  • Data from different sources is then converted into a common format for analysis.
  • This could include reducing the data into appropriate sample sizes, adding labels for classification and changing file types to make the data suitable for analytics tools.
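
As a rough sketch of these transformation steps, the snippet below merges a CSV source and a JSON source into one common record format, adds a class label, draws a sample, and writes the result as CSV. The two sources, the field names, and the `busy_threshold` value are all made up for illustration.

```python
import csv
import io
import json
import random

# Two hypothetical sources describing the same kind of record in different formats.
csv_source = "store,hour,cars\nA,9,12\nA,17,40\nB,9,7\n"
json_source = '[{"store": "B", "hour": 17, "cars": 33}, {"store": "C", "hour": 12, "cars": 21}]'

def load_common(csv_text, json_text):
    """Convert both sources into one list of dicts with identical keys and types."""
    rows = list(csv.DictReader(io.StringIO(csv_text))) + json.loads(json_text)
    return [{"store": str(r["store"]), "hour": int(r["hour"]), "cars": int(r["cars"])}
            for r in rows]

def add_label(row, busy_threshold=30):
    """Attach a class label so the row can be used for classification."""
    return {**row, "label": "busy" if row["cars"] >= busy_threshold else "quiet"}

records = [add_label(r) for r in load_common(csv_source, json_source)]

# Reduce the data to a reproducible random sample.
sample = random.Random(0).sample(records, k=3)

# Write the sample in a file type the analytics tool accepts (CSV is assumed here).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["store", "hour", "cars", "label"])
writer.writeheader()
writer.writerows(sample)
print(out.getvalue())
```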

Data Mining

  • In this phase, appropriate data mining and machine learning algorithms are chosen.
  • The analyst can then choose to employ supervised or unsupervised learning algorithms.
  • In some cases, depending on the problem formulation, both supervised and unsupervised learning algorithms may be chosen.
  • However, parsimony is important – keep it simple. You don’t need to implement all algorithms to extract meaningful knowledge from the data and come up with a correct model.
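
To make the supervised versus unsupervised distinction concrete, here is a minimal sketch using two deliberately simple algorithms: a one-nearest-neighbor classifier, which needs labels, and a two-cluster k-means, which does not. The data points, the labels, and the function names are hypothetical, and a real project would more likely rely on an established library than on hand-written versions like these.

```python
from math import dist
from statistics import mean

# Hypothetical observations: (hour of day, cars passing the store).
points = [(8, 10), (9, 12), (10, 9), (17, 41), (18, 38), (19, 44)]
labels = ["quiet", "quiet", "quiet", "busy", "busy", "busy"]

def nearest_neighbor(train_points, train_labels, query):
    """Supervised: predict the label of the closest labeled example."""
    closest = min(range(len(train_points)),
                  key=lambda i: dist(train_points[i], query))
    return train_labels[closest]

def two_means(data, rounds=10):
    """Unsupervised: split data into two clusters without using any labels."""
    centers = [data[0], data[-1]]  # simple deterministic starting centers
    for _ in range(rounds):
        groups = [[], []]
        for p in data:
            nearer = 0 if dist(p, centers[0]) <= dist(p, centers[1]) else 1
            groups[nearer].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [tuple(mean(c) for c in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups

print(nearest_neighbor(points, labels, (16, 35)))
print(two_means(points))
```

In the spirit of parsimony, both functions fit in a few lines; on this toy data the clusters found without labels happen to match the labeled classes.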

Evaluation

  • In this phase, the results produced by the data mining algorithms are evaluated.
  • The extracted knowledge is then presented to the stakeholders in a clear manner.
  • The results are visualized at this stage.
  • A report that analyzes and interprets the results to convey their meaning is also prepared at this stage.
  • Again, parsimony is important. A concise and understandable visualization of results is preferred.
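
One possible way to evaluate and summarize classification results is sketched below. The `actual` and `predicted` labels are invented for illustration, and accuracy, precision, and recall are assumed here as the metrics of interest; the right metrics depend on the problem formulated in the first phase.

```python
from collections import Counter

# Hypothetical true labels and model predictions for eight held-out examples.
actual    = ["busy", "busy", "quiet", "quiet", "busy", "quiet", "quiet", "busy"]
predicted = ["busy", "quiet", "quiet", "quiet", "busy", "busy", "quiet", "busy"]

def evaluate(actual, predicted, positive="busy"):
    """Return accuracy, precision, and recall, treating `positive` as the target class."""
    counts = Counter(zip(actual, predicted))
    tp = counts[(positive, positive)]
    fp = sum(n for (a, p), n in counts.items() if a != positive and p == positive)
    fn = sum(n for (a, p), n in counts.items() if a == positive and p != positive)
    correct = sum(n for (a, p), n in counts.items() if a == p)
    return {
        "accuracy": correct / len(actual),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

report = evaluate(actual, predicted)
for name, value in report.items():
    # One short text bar per metric keeps the summary easy to scan.
    print(f"{name:<10}{value:5.2f}  {'#' * round(value * 20)}")
```

A three-line summary like this is intentionally small: stakeholders can read it at a glance, and more detail can be added only if they ask for it.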

References

[1] Dunham, Margaret H. “Data Mining: Introductory and Advanced Topics.” Prentice Hall, 2003, pp. 9-10.
[2] Fayyad, Usama, Gregory Piatetsky-Shapiro, and Padhraic Smyth. “The KDD process for extracting useful knowledge from volumes of data.” Communications of the ACM 39, no. 11 (1996): 27-34.
[3] Dhar, Vasant. “Data science and prediction.” Communications of the ACM 56, no. 12 (2013): 64-73.
[4] Panov, Panče, Larisa Soldatova, and Sašo Džeroski. “OntoDM-KDD: ontology for representing the knowledge discovery process.” In Discovery Science, pp. 126-140. Springer Berlin Heidelberg, 2013.
[5] Sacha, Dominik, Andreas Stoffel, Florian Stoffel, Bum Chul Kwon, Geoffrey Ellis, and Daniel A. Keim. “Knowledge generation model for visual analytics.” IEEE Transactions on Visualization and Computer Graphics 20, no. 12 (2014): 1604-1613.