Data Mining Map by Saeed Sayad

Data Science can be broadly divided into two approaches: explaining the past and predicting the future by means of data analysis. It is a multidisciplinary field that combines statistics, machine learning, artificial intelligence and database technology.

Businesses have accumulated data over the years, and with the help of data science we are able to extract valuable knowledge from this data. Let's understand how each field in the above diagram contributes to Data Science.

Statistics is used in Data Science for collecting, classifying, summarising, organising, analysing and interpreting data. Artificial Intelligence contributes by simulating intelligent behaviour from the underlying data. Machine Learning contributes algorithms that improve automatically through experience. Database Technology is necessary for collecting, storing and managing data so that users can retrieve, add, update or remove it.

Now, what do we do in Data Science?

In layman's terms, we analyse data to explain the past or to predict the future.

Explaining the Past

To explain the past we do Data Exploration, which is all about describing the data by means of statistical and visualization techniques. Data Exploration brings important aspects of the data into focus for further analysis.

  • Univariate Analysis - Exploring variables one by one. Variables can be of two types: categorical or numerical. Each type of variable has its own recommended methods for analysis and graphical plotting.

    • Categorical Variables - A categorical or discrete variable has two or more categories (values). There are two types of categorical variables:

      • Nominal Variable - No intrinsic ordering for its categories (e.g. Gender)
      • Ordinal Variable - A clear ordering exists (e.g. Temperature [low, medium, high])

      A frequency table, which counts how often each category occurs, is the best way to analyse such variables.

      Pie charts and bar charts are commonly used for visual analysis.
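As a rough illustration of a frequency table for a categorical variable, here is a minimal Python sketch using only the standard library. The function name `frequency_table` and the sample data are made up for this example.

```python
from collections import Counter

def frequency_table(values):
    """Return (category, count, share) rows, most frequent category first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(category, count, count / total)
            for category, count in counts.most_common()]

# Hypothetical observations of an ordinal variable
temperature = ["low", "high", "medium", "low", "low", "high"]

for category, count, share in frequency_table(temperature):
    # One row per category: name, absolute count, relative frequency
    print(f"{category:<8}{count:>3}{share:>8.1%}")
```

The same counts are what a bar chart or pie chart of the variable would display.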

    • Numerical Variables - A numerical variable takes any value within a finite or infinite interval, e.g. height, weight, temperature, blood glucose. There are two types of numerical variables: interval and ratio.

      • Interval Variables - Values whose differences are interpretable but which have no true zero (e.g. temperature in centigrade). Data on an interval scale can be added or subtracted but cannot be meaningfully multiplied or divided.
      • Ratio Variables - Values with a true zero that can be added, subtracted, multiplied or divided (e.g. weight).

      How do we analyse numerical variables? The table below describes the various methods.

      | Statistics | Visualization | Equation | Description |
      |---|---|---|---|
      | Count | Histogram | N | Number of observations |
      | Minimum | Box Plot | Min | Smallest value among the observations |
      | Maximum | Box Plot | Max | Largest value among the observations |
      | Mean | Box Plot | | Sum of values / count |
      | Median | Box Plot | | Middle value, with an equal number of values below and above it |
      | Mode | Histogram | | Most frequent value in the observation set |
      | Quantile | Box Plot | Q~k~ | Cut points that divide the observations into groups with equal numbers of values |
      | Range | Box Plot | Max - Min | Difference between maximum and minimum |
      | Variance | Histogram | | Measure of data dispersion |
      | Standard Deviation | Histogram | | Square root of variance |
      | Coefficient of Deviation | Histogram | | Measure of data dispersion divided by mean |
      | Skewness | Histogram | | Measure of symmetry or asymmetry of data |
      | Kurtosis | Histogram | | Measure of whether the data are peaked or flat relative to a normal distribution |
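To make the statistics in the table above concrete, here is a small sketch using Python's standard `statistics` module. It rests on a few assumptions that the table does not spell out: population (not sample) formulas for variance, skewness and kurtosis; the coefficient of deviation taken as standard deviation divided by mean; and kurtosis reported as excess kurtosis, so a normal distribution scores 0. The function name `describe` and the sample values are hypothetical.

```python
import statistics as st

def describe(values):
    """Summary statistics for one numerical variable (population formulas).

    Assumes at least two distinct values, so the standard deviation is non-zero.
    """
    n = len(values)
    mean = st.fmean(values)
    variance = st.pvariance(values)
    std_dev = variance ** 0.5
    # Skewness and kurtosis as standardised third and fourth moments
    skewness = sum((x - mean) ** 3 for x in values) / (n * std_dev ** 3)
    kurtosis = sum((x - mean) ** 4 for x in values) / (n * std_dev ** 4) - 3
    return {
        "count": n,
        "minimum": min(values),
        "maximum": max(values),
        "mean": mean,
        "median": st.median(values),
        "mode": st.mode(values),
        "quartiles": st.quantiles(values, n=4),  # three cut points
        "range": max(values) - min(values),
        "variance": variance,
        "std_dev": std_dev,
        # Undefined when the mean is zero
        "coeff_of_deviation": std_dev / mean if mean else float("nan"),
        "skewness": skewness,
        "excess_kurtosis": kurtosis,
    }

# Hypothetical sample
sample = [2, 4, 4, 4, 5, 5, 7, 9]
for name, value in describe(sample).items():
    print(f"{name:<20}{value}")
```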

  • Bivariate Analysis - Simultaneous analysis of two variables, exploring the relationship between them. There are three types of bivariate analysis:

    • Numerical & Numerical -
      • Scatter Plot - A visual representation of two numerical variables, from which we can infer patterns.
      • Linear Correlation - Quantifies the strength of the linear relationship between two numerical variables.
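Linear correlation is usually measured with the Pearson coefficient, which ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship). The sketch below assumes that is the measure meant here. The function name `pearson_r` and the data are illustrative only.

```python
import math

def pearson_r(xs, ys):
    """Pearson linear correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-variation of x and y around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Variation of each variable around its own mean
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either variable is constant
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical paired observations
hours_studied = [1, 2, 3, 4, 5]
exam_rank = [2, 1, 4, 3, 5]
print(round(pearson_r(hours_studied, exam_rank), 3))
```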
    • Categorical & Categorical -
      • Stacked Column Chart - Compares the percentage that each category of one variable contributes to a total across the categories of the second variable.
      • Combination Chart - Uses two or more chart types (e.g. a bar chart and a line chart) to show how one variable affects the other.
      • Chi-Square Test - Used to determine the association between categorical variables, based on the differences between the expected frequencies (e) and the observed frequencies (n) in the cells of the frequency table. The test returns a probability for the computed chi-square statistic and degrees of freedom: a probability close to 0 is strong evidence that the two categorical variables are dependent, while a probability close to 1 means the data are consistent with the two variables being independent.
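The chi-square statistic sums (n - e)² / e over every cell of the frequency table, where e is the count expected if the two variables were independent. Here is a minimal sketch of that calculation, assuming the usual expected count of row total × column total / grand total. The function name `chi_square` and the example table are hypothetical.

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for a contingency table.

    `observed` is a list of rows, each a list of observed counts (n).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, n in enumerate(row):
            # Expected count (e) if the two variables were independent
            e = row_totals[i] * col_totals[j] / total
            statistic += (n - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof

# Hypothetical 2x2 table: rows are one variable, columns the other
table = [[10, 20],
         [20, 10]]
statistic, dof = chi_square(table)
print(statistic, dof)
```

Turning the statistic and degrees of freedom into a probability needs the chi-square distribution, which this sketch leaves out. If SciPy is available, `scipy.stats.chi2.sf(statistic, dof)` is one way to get it.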
    • Numerical & Categorical -
      • Line Chart with Error Bars - The error bars show the standard error within each category.
      • Combination Chart - Uses a line chart for the numerical variable and a bar chart for the categorical variable.
      • Z Test and T Test - Assess whether the averages of two groups are statistically different from each other. The smaller the probability associated with Z, the more significant the difference between the two averages. We use the T test when the number of observations is less than 30.
      • ANOVA Test - Assesses whether the averages of more than two groups are statistically different from each other. It is appropriate for comparing the averages of a numerical variable across more than two categories of a categorical variable.
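As an illustration of the Z test described above, this sketch compares the averages of two groups and returns the Z statistic with a two-sided probability. It assumes independent groups with at least 30 observations each, and uses the sample variances as estimates. The function name `two_sample_z` and the data are made up for the example.

```python
from statistics import NormalDist, fmean, variance

def two_sample_z(a, b):
    """Z statistic and two-sided probability for the difference of two means."""
    # Standard error of the difference between the two group averages
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / se
    # Probability of a difference at least this large if the true means are equal
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical measurements of one numerical variable in two categories
group_a = [50 + (i % 7) for i in range(35)]
group_b = [52 + (i % 5) for i in range(40)]

z, p = two_sample_z(group_a, group_b)
print(f"z = {z:.3f}, p = {p:.4f}")
```

For groups smaller than 30 the same statistic would be compared against the t distribution instead, which this sketch does not cover.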

Predicting the Future

To predict the future we make use of models, hence the name Predictive Modeling. Here we try to predict an outcome: if the outcome is categorical we call it classification, and if it is numerical we call it regression. Descriptive modelling, or clustering, is the assignment of observations into clusters. Association rules help us find interesting associations among observations.

Classification Algorithms

Here the output variable is categorical. Classification algorithms can be broadly divided into four main groups.

  • Frequency Table Based

    • Zero R method - The simplest classification method, which relies exclusively on the target and ignores all predictors. The ZeroR classifier simply predicts the majority class. It has no predictive power, but it is useful for establishing a baseline performance as a benchmark for other classification methods.

      Construct a frequency table for the target and select its most frequent value.
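A minimal sketch of ZeroR in Python follows. The class name and the `fit`/`predict` method names are chosen for illustration and are not taken from any particular library.

```python
from collections import Counter

class ZeroR:
    """Baseline classifier: always predicts the majority class of the target."""

    def fit(self, targets):
        # Frequency table of the target, then take its most frequent value
        self.majority_ = Counter(targets).most_common(1)[0][0]
        return self

    def predict(self, rows):
        # Predictors are ignored entirely
        return [self.majority_ for _ in rows]

# Hypothetical target column
play = ["yes", "no", "yes", "yes", "no"]
model = ZeroR().fit(play)
predictions = model.predict(play)
accuracy = sum(p == t for p, t in zip(predictions, play)) / len(play)
print(model.majority_, accuracy)
```

The accuracy printed here is the baseline that any other classifier should beat on the same data.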

    • One R method - A simple yet accurate classification algorithm that generates one rule for each predictor in the data and then selects the rule with the smallest total error as its "one rule".

      For each predictor:
          For each value of that predictor, make a rule as follows:
              Count how often each value of the target appears.
              Find the most frequent class.
              Make the rule assign that class to this value of the predictor.
          Calculate the total error of the rules of each predictor.
      Choose the predictor with the smallest total error.
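The steps above can be sketched in Python as follows. This assumes categorical predictors and a data set given as a list of dictionaries. The function name `one_r`, the column names and the tiny data set are hypothetical, and ties are broken in favour of whichever value or predictor was seen first.

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """Return (best_predictor, rule, total_error) for a list of dict rows."""
    best = None
    predictors = [name for name in rows[0] if name != target]
    for predictor in predictors:
        # Count how often each target class appears for each predictor value
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[predictor]][row[target]] += 1
        # Rule: each predictor value is assigned its most frequent class
        rule = {value: classes.most_common(1)[0][0]
                for value, classes in counts.items()}
        # Total error: rows whose class differs from the rule's prediction
        errors = sum(1 for row in rows if rule[row[predictor]] != row[target])
        if best is None or errors < best[2]:
            best = (predictor, rule, errors)
    return best

# Hypothetical training data
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
predictor, rule, errors = one_r(rows, target="play")
print(predictor, rule, errors)
```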
      
    • Naive Bayes Method - Based on Bayes' theorem with an assumption of independence between predictors. A Naive Bayes model is easy to build and is useful for large data sets. Despite its simplicity, it often performs well and is widely used because it can outperform more sophisticated classification methods.

      $$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$

      $P(c \mid x)$ - Posterior probability

      $P(x \mid c)$ - Likelihood

      $P(c)$ - Class prior probability

      $P(x)$ - Predictor prior probability

      With independent predictors $X = (x_1, \dots, x_n)$, the posterior is proportional to the product of the individual likelihoods and the class prior:

      $$P(c \mid X) \propto P(x_1 \mid c) \times P(x_2 \mid c) \times \cdots \times P(x_n \mid c) \times P(c)$$
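To show how these probabilities can be estimated from simple frequency counts, here is a small sketch for categorical predictors. It is a bare-bones illustration under stated assumptions: there is no smoothing, so a predictor value never seen with a class gives that class a probability of zero. The function names `train_naive_bayes` and `posterior` and the data set are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target):
    """Collect the counts needed for P(c) and P(x_i | c) from dict rows."""
    class_counts = Counter(row[target] for row in rows)
    # value_counts[(predictor, c)][value] = number of class-c rows with that value
    value_counts = defaultdict(Counter)
    for row in rows:
        for predictor, value in row.items():
            if predictor != target:
                value_counts[(predictor, row[target])][value] += 1
    return class_counts, value_counts

def posterior(model, x):
    """Return P(c | x) for every class c, normalised to sum to 1."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # class prior P(c)
        for predictor, value in x.items():
            score *= value_counts[(predictor, c)][value] / n_c  # likelihood P(x_i | c)
        scores[c] = score
    evidence = sum(scores.values())  # plays the role of P(x)
    if evidence == 0:
        raise ValueError("unseen combination of values; smoothing would be needed")
    return {c: s / evidence for c, s in scores.items()}

# Hypothetical training data
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
model = train_naive_bayes(rows, target="play")
print(posterior(model, {"outlook": "rainy", "windy": "no"}))
```

Dividing by the sum of the scores replaces the explicit P(x) term, since P(x) is the same for every class.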