What is Exploratory Data Analysis?

Bard College | Introduction to Data Analytics

What is Exploratory Data Analysis?

  • Explore

  • Visualize

EDA is an approach to data analysis in its own right, and a key step before modeling.

Related source: https://www.geeksforgeeks.org/data-analysis/what-is-exploratory-data-analysis/

Get to know your data

  • What is the structure of the data?

    • Data types

    • Variables / features

    • Distributions (min, max, central tendency, spread)

  • Patterns and relationships within and between variables

  • Identify errors, outliers

  • Identify what matters for further exploration and modeling
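
A first look at structure, types, and distributions might be sketched as follows. The dataset is an assumption: `airquality` ships with base R and is used here only because it needs no download; the slides do not name a dataset.

```r
# Built-in example data: 153 daily air-quality readings (an illustrative choice).
data(airquality)

str(airquality)        # data type of every variable
dim(airquality)        # number of rows (observations) and columns (variables)
head(airquality)       # first few rows
summary(airquality)    # min, quartiles, mean, max and NA count per column

# Range and spread of a single variable (Temp has no missing values).
temp_range <- range(airquality$Temp)
temp_sd    <- sd(airquality$Temp)
```

`summary()` already reports how many values are missing in each column, which gives an early hint about errors and gaps worth following up.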

Univariate analysis

Numerical variables

  • Histograms

  • Box plots

Categorical variables

  • Bar plots

Time and space

  • Time series line plots

  • Spatial mapping
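
The univariate plots listed above can be sketched in base R graphics. The datasets (`mtcars`, `AirPassengers`) are assumptions chosen because they are built in; spatial mapping is left out because it needs additional packages.

```r
data(mtcars)

# Numerical variable: histogram and box plot of fuel economy.
h <- hist(mtcars$mpg, breaks = 10,
          main = "Fuel economy", xlab = "Miles per gallon")
b <- boxplot(mtcars$mpg, horizontal = TRUE, xlab = "Miles per gallon")

# Categorical variable: count each category, then draw a bar plot.
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, xlab = "Cylinders", ylab = "Number of cars")

# Time: line plot of a built-in monthly time series.
plot(AirPassengers, ylab = "Passengers (thousands)")
```

Both `hist()` and `boxplot()` return the numbers behind the picture (bin counts, quartiles), so the same calls can feed a summary table as well as a plot.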

Bivariate analysis

  • Scatter plots

  • Correlation coefficients

  • Correlation plots

  • Cross tabulations (count(), or group_by() and summarize())

  • Bivariate mapping

  • Covariance

  • Simple linear regression
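
A minimal sketch of several of these bivariate tools, assuming the built-in `mtcars` data as a stand-in. Base R's `table()` is used for the cross tabulation so the example runs without dplyr; `count()` or `group_by()` with `summarize()` would be the dplyr route.

```r
data(mtcars)

# Scatter plot of car weight against fuel economy.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Covariance depends on the units; correlation rescales it to [-1, 1].
cov_wt_mpg <- cov(mtcars$wt, mtcars$mpg)
cor_wt_mpg <- cor(mtcars$wt, mtcars$mpg)

# Cross tabulation of two categorical variables (cylinders by gears).
cyl_by_gear <- table(cylinders = mtcars$cyl, gears = mtcars$gear)

# Simple linear regression: fuel economy as a linear function of weight.
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit)            # add the fitted line to the scatter plot
coef(fit)              # intercept and slope
```

The correlation is just the covariance divided by the product of the two standard deviations, which is why it is comparable across variable pairs while covariance is not.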

Multivariate analysis

  • Correlation matrix

  • Principal components analysis

  • Spatial interaction and dependence tests

  • Time series (ARIMA, etc.)
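
As a sketch of the first two items, a correlation matrix and a principal components analysis can be computed with base R alone. The choice of `mtcars` and of these five columns is an assumption made for illustration.

```r
data(mtcars)
vars <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Correlation matrix: pairwise Pearson correlations, rounded for reading.
cor_mat <- round(cor(vars), 2)
print(cor_mat)

# Principal components analysis on standardized variables
# (scale. = TRUE so no variable dominates because of its units).
pca <- prcomp(vars, center = TRUE, scale. = TRUE)
summary(pca)           # proportion of variance explained per component

# Share of total variance carried by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

If the first one or two components carry most of the variance, a scatter plot of those components can stand in for the full set of variables during exploration.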

Core EDA steps

  1. Understand the data
    • Data source/production: Who produced the data? Why?
    • Data structure and variables included (numerical, categorical, text)
    • Any limitations identifiable at this stage?
  2. Import and evaluate
    • Load data into R
    • Observations and variables: what rows represent and what is measured?
    • Missing values?
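
The import-and-evaluate step could look like the sketch below. The CSV content, the column names, and the object name `surveys` are all hypothetical; the text is inlined only so the example runs without a file on disk.

```r
# Hypothetical CSV content. With a real file you would instead call
# read.csv("path/to/file.csv").
csv_text <- "site,date,temp_c,species
A,2024-05-01,14.2,oak
A,2024-05-02,,oak
B,2024-05-01,16.8,maple
B,2024-05-02,17.1,"

# Treat empty fields as missing so they show up as NA.
surveys <- read.csv(text = csv_text, na.strings = c("", "NA"))

dim(surveys)           # rows (observations) and columns (variables)
str(surveys)           # type of each variable as R read it

# Missing values in each column.
na_per_column <- colSums(is.na(surveys))
na_per_column
```

Checking `str()` right after import catches columns read with the wrong type, for example dates or numbers that arrived as text.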

Core EDA steps

  3. Identify missing data
    • Why is the data missing?
    • Remove or fill in missing values?
    • How should missing values be filled in? (imputation methods such as k-nearest neighbors)
    • How might missing data affect analysis?
  4. Summarize variables
    • Distribution plots
    • Central tendency
    • Variation: identify spread (standard deviation, box plots)
    • Identify outliers and potential errors (plots and filtering)
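
To see how the remove-or-fill decision affects the summaries, here is a small sketch on the built-in `airquality` data (an assumed example). Median imputation is used as a simple stand-in; it is not the k-nearest-neighbors method named above, which needs an extra package.

```r
data(airquality)

# How much is missing, and where?
colSums(is.na(airquality))

# Option 1: remove every row that has any missing value.
complete <- na.omit(airquality)

# Option 2: simple imputation, replacing missing Ozone with the median.
imputed <- airquality
imputed$Ozone[is.na(imputed$Ozone)] <- median(airquality$Ozone, na.rm = TRUE)

# Compare how each choice changes central tendency and spread.
ozone_summary <- data.frame(
  method = c("drop NA", "median impute"),
  mean   = c(mean(complete$Ozone), mean(imputed$Ozone)),
  median = c(median(complete$Ozone), median(imputed$Ozone)),
  sd     = c(sd(complete$Ozone), sd(imputed$Ozone))
)
ozone_summary
```

Filling every gap with a single value shrinks the standard deviation, which is one concrete way missing-data handling can affect later analysis.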

Core EDA steps

  5. Wrangle/transform dataset
    • Set factor levels
    • Scaling and normalizing
      • Standardize variables
      • Transform using log scale as needed
    • Create new variables (derived variables)
    • Group or aggregate to create new data
  6. Explore relationships
    • Numerical
      • Scatter plots; side-by-side box plots or violin plots for a numerical variable across categories
      • Correlation coefficients (e.g., Pearson, Spearman)
      • Correlation matrix
    • Categorical: Bar plots and frequency tables (count(), etc.)
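
The wrangling and relationship steps above might be sketched like this in base R. The `mtcars` data and the derived variable `hp_per_wt` are illustrative assumptions; `aggregate()` and `table()` play the role that `group_by()`, `summarize()`, and `count()` play in dplyr.

```r
data(mtcars)
cars <- mtcars

# Set factor levels: turn the 0/1 transmission code into labeled categories.
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

# Standardize (mean 0, standard deviation 1) and log-transform.
cars$wt_z   <- as.numeric(scale(cars$wt))
cars$log_hp <- log(cars$hp)

# Derived variable (hypothetical): horsepower per 1000 lbs of weight.
cars$hp_per_wt <- cars$hp / cars$wt

# Group and aggregate: mean fuel economy per transmission type.
mpg_by_am <- aggregate(mpg ~ am, data = cars, FUN = mean)

# Numerical relationships: Pearson (linear) and Spearman (rank-based).
pearson  <- cor(cars$mpg, cars$hp, method = "pearson")
spearman <- cor(cars$mpg, cars$hp, method = "spearman")

# Numerical by categorical: side-by-side box plots.
boxplot(mpg ~ am, data = cars, ylab = "Miles per gallon")

# Categorical: frequency table.
am_counts <- table(cars$am)
am_counts
```

A noticeable gap between the Pearson and Spearman values suggests the relationship is monotonic but not linear, which can motivate a transformation such as the log scale.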

Core EDA steps

  7. Deal with outliers
    • Use your understanding of, or research on, the topic to understand which values make sense for the variables in your dataset
    • Explore the interquartile range (the box in a box plot) and z-scores to flag candidate outliers
    • Remove outliers? Interpolate?
  8. Communicate!
    • Provide context. Be critical! Ask why?
    • What are your key findings?
    • What evidence is needed to communicate your findings?
    • What are the limitations of your analysis?
    • What should we explore next? What questions should we be asking of the data now?
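
For the outlier step above, the interquartile-range and z-score checks can be sketched as follows. The `airquality` data, the 1.5 × IQR fence, and the cutoff of 3 standard deviations are assumptions based on common convention; subject knowledge should decide what actually counts as an implausible value.

```r
data(airquality)
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles
# (similar to the rule box plots use to draw individual points).
q <- quantile(ozone, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
iqr_flag <- ozone < q[1] - fence | ozone > q[2] + fence

# Z-score rule: flag values more than 3 standard deviations from the mean.
z <- (ozone - mean(ozone)) / sd(ozone)
z_flag <- abs(z) > 3

ozone[iqr_flag]        # candidates under the IQR rule
ozone[z_flag]          # candidates under the z-score rule
```

Flagged values are candidates for a closer look, not automatic deletions; whether to remove, keep, or interpolate them is part of what gets reported under limitations.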