Chapter 4. Statistical Analysis with R and Python
Statistics play an important role in analytics because they are used to analyze data, identify patterns, and determine relationships between variables. The results of statistical analysis can be used to make business decisions and predictions or to classify new data, and this wide variety of use cases makes it a science worth learning. Additionally, business analysts leverage algorithms to create analytical models. An algorithm is a set of steps, instructions, and procedures used to solve a problem or perform a task. In analytics, algorithms are used to perform specific computations, manipulate data, and create models. Descriptive, diagnostic, predictive, or prescriptive analytics are then applied to these models. Together, statistics and algorithms are used to preprocess data, select the data elements to be used in problem solving, and compare models to select the best one for solving a problem. This chapter explores the role of statistics, the different approaches used, and data visualization in R and Python.
Example Analytical Projects
Before we explore specifics, it is important to understand the context of the problem to be solved, since this determines which statistical approaches will be used. Not every problem will be solved with analytics, but the problems you’ll be tackling as an analyst will be complex, requiring critical thinking to break down, examine, and interpret components. Once the components are identified, the next step is to understand the problem at a level where you can explain and solve it. To solve an analytical problem, analytical methods are applied to come to a conclusion or solution. These methods may include data analysis or hypothesis testing, for example. Problems can’t be solved in isolation and often require evaluating relationships, patterns, and trends, as well as the application of domain-specific knowledge.
To best understand analytical problems, let’s look at some examples that are common across industries. The analytics life cycle discussed in Chapter 3 is used to outline how the analytics problem is solved. As a quick refresher, the life cycle steps include problem identification, data understanding and preparation, model building, model evaluation, and implementation.
Telecom Churn
Consider the following: a major telecom company has contracted with your organization, asking your team to help it understand why customers are canceling their service and switching to other providers. Analytics can assist in managing customer churn by identifying the factors that cause customers to switch service providers. Analytics can also predict if a customer is likely to churn.
You’ll see the techniques mentioned in this example in action over the course of the next few chapters, but we want to provide a high-level overview here so you can get a sense of the end-to-end process:
- Step 1: Define the problem
- A major telecom company wants to understand why it’s losing customers. Your team is tasked with understanding what is causing this churn, or turnover of customers. Churn is a large problem for telecom companies as it means a loss of revenue as well as market share.
- Step 2: Data understanding
- In the data understanding phase, the focus is on collecting and exploring the telecom churn data to grasp its structure and potential insights. This begins with gathering data from sources like customer demographics, usage patterns, and billing information. Exploratory data analysis (EDA) is then performed to identify key trends and relationships, such as how contract length or monthly charges might influence churn rates. Additionally, data quality is assessed to spot and address issues like missing values or inconsistencies, which are crucial for ensuring reliable analysis later on. We will cover EDA in Chapter 5.
- Step 3: Data preparation
- The data preparation phase involves refining the data to make it suitable for modeling. This includes cleaning the data by handling missing values and correcting inconsistencies. Feature engineering is performed to create new variables that could improve predictive power, such as segmenting customers based on usage. The data might also be transformed, normalized, or integrated from multiple sources to create a cohesive dataset. Finally, unnecessary or redundant features may be reduced to simplify the model while retaining essential information. The goal is to produce a clean, well-structured dataset ready for the modeling phase, where predictions about customer churn can be made. Data preparation techniques are covered in Chapter 5.
- Step 4: Building a model
- In the modeling phase, the prepared data is used to build predictive models that help identify patterns and predict outcomes—in this case, customer churn. Various algorithms, such as logistic regression, decision trees, or random forests, are applied to the data to create models that can accurately forecast which customers are likely to churn.
- The process includes selecting the appropriate modeling techniques, training the models on the dataset, and fine-tuning parameters to optimize performance. The models are then evaluated using metrics like accuracy, precision, recall, and area under the curve (AUC) to ensure they effectively distinguish between customers who will churn and those who will stay. The goal is to develop a robust model that provides actionable insights, allowing the business to take proactive steps to reduce churn. Algorithms are covered in Chapter 6.
- Step 5: Implementing a predictive model
- The final step is implementing the predictive model for use and decision making. This is where the value occurs. Potential outcomes may involve offering targeted promotions or incentives to customers who are at high risk of churn.
In this example, analytics were used to help telecom companies identify the drivers of customer churn and predict which customers are most likely to leave. By using this information, companies can develop effective retention strategies and improve customer satisfaction, thereby reducing the impact of customer churn on their business. Let’s explore another type of analytical project that a business analyst could be involved in: A/B testing.
A/B Testing
A/B testing is a statistical method used to compare two versions of a variable (such as a web page, product feature, or marketing campaign) to determine which one performs better in achieving a specific goal, like higher conversion rates or customer engagement. It is an analytical problem because it involves hypothesis testing, data collection, and analysis to draw meaningful conclusions about which version is more effective. For business analysts, understanding A/B testing is crucial as it informs decision making by providing evidence-based insights into what works best for the business. This knowledge is closely related to building predictive models because both processes rely on data-driven approaches to optimize outcomes, validate hypotheses, and make informed predictions about future behavior based on past data.
As part of A/B testing, the focus is on developing a hypothesis on what is expected to happen with each new version of a marketing campaign or product page. The testing of the hypothesis is accomplished using different metrics such as conversion rates, click-through rates, or other performance measures depending on the scenario. A sample is then determined by randomly selecting users or customers that represent the target audience to participate in the test.
In this example, a company has created different versions of its product and marketing campaigns, which are then presented to a sample group of customers. Data is collected on the performance of each version, and analysts assess that data using statistical methods such as hypothesis testing, confidence intervals, and Bayesian analysis. The company then interprets the results to determine which variation performed better and whether the hypothesis was supported or rejected. Typically, the best-performing version is then implemented.
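To make the hypothesis-testing step concrete, here is a minimal sketch in Python of comparing conversion rates for two page versions with a two-proportion z-test from statsmodels. The visitor and conversion counts are hypothetical, and the 0.05 significance level is simply a common convention:

# A two-proportion z-test comparing conversion rates for versions A and B.
# The visitor and conversion counts below are hypothetical.
from statsmodels.stats.proportion import proportions_ztest

conversions = [310, 355]   # conversions observed for version A and version B
visitors = [5000, 5000]    # visitors shown each version

stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z statistic: {stat:.3f}, p-value: {p_value:.4f}")

# At a 0.05 significance level, a small p-value suggests the difference in
# conversion rates is unlikely to be due to chance alone.
if p_value < 0.05:
    print("Reject the null hypothesis: the versions convert differently.")
else:
    print("Fail to reject the null hypothesis: no significant difference detected.")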
A/B testing, paired with analytics, is used by marketing and product teams to make data-driven decisions and determine which version of a given campaign or product performs the best. This approach enables organizations to improve products and marketing campaigns.
Marketing Campaigns
An analytical project for marketing campaigns involves systematically evaluating the effectiveness of different marketing strategies to optimize outcomes such as customer acquisition, engagement, and conversion rates. This type of project typically includes segmenting the target audience, analyzing past campaign performance, and identifying key factors that drive success. By leveraging data analytics, business analysts can assess the impact of various marketing tactics, refine targeting approaches, and develop more effective, personalized campaigns. Understanding this process is essential for business analysts, as it enables them to provide actionable insights that enhance the efficiency and ROI of marketing efforts, similar to the data-driven methodologies used in building predictive models. Marketing campaigns are used by organizations to promote products and services as well as engage customers. Marketing campaigns are most effective when using the right channels, sending the right messages, and targeting the right audiences. This can be challenging to do well, and analytics play a role in determining the right channels, messages, and audiences.
To optimize a marketing campaign, data is collected on customer behavior and marketing campaign performance. Additional data may be collected for prior or similar marketing campaigns, including website traffic, social media engagement, email open rates, click-through rates, and conversion rates. Then, data can be preprocessed and analyzed to determine patterns, trends, and data relationships.
What does this look like in practice? Using the insights gained from data analysis, campaigns may be optimized by targeting specific customer segments, optimizing messaging and creative content, and adjusting the marketing mix to improve campaign performance. Additionally, A/B testing can be leveraged in marketing campaign optimization. This can involve testing different messaging, creative materials, or call-to-action (CTA) variations to see which performs better. Another analytical technique that may be useful here is attribution modeling. This includes techniques used to determine which marketing channels and actions are having the largest return. Once a campaign is being used, campaign performance is monitored to identify areas that need adjustment or improvement. Analytics is used to ensure marketing campaigns are effective and, ultimately, provide more revenue for the organization.
Financial Forecasting
Financial forecasting is an important analytical project within the finance industry, involving the use of historical data and advanced analytics to predict future organizational performance. Business analysts play a pivotal role in this process by gathering, analyzing, and interpreting data to generate forecasts that inform strategic decisions. Whether it’s assessing potential risks, identifying growth opportunities, or optimizing resource allocation, financial forecasting enables organizations to plan for the future with greater accuracy and confidence. Business analysts collaborate with financial professionals to develop models, conduct scenario and sensitivity analyses, and ensure that forecasts are grounded in robust data insights, making them essential contributors to the financial forecasting process.
Financial forecasting is a crucial process in finance that involves using historical data to predict future organizational performance. This process is essential across various sectors of finance, each with its specific applications and challenges:
- Investment banking
- In investment banking, financial forecasting is vital for mergers and acquisitions (M&A). Analysts use scenario analysis to forecast the financial outcomes of potential mergers, evaluating different deal structures, market conditions, and synergies. For example, they might analyze how a merger between two companies could impact future cash flows and valuations under different economic scenarios.
- Trading
- In the trading sector, time series forecasting is frequently used to predict stock prices, interest rates, and market trends. Traders rely on these forecasts to make informed buy or sell decisions. For instance, trend analysis might help a trader predict a stock’s future performance based on past price movements and market conditions, allowing them to optimize their trading strategies.
- Retail banking
- Retail banks use financial forecasting to predict customer demand for loans, deposits, and other banking products. Sensitivity analysis can help these institutions understand how changes in interest rates or economic conditions could affect loan default rates or deposit inflows. For example, a bank might use these analyses to adjust its interest rate offerings or to plan for potential increases in loan defaults during economic downturns.
- Commercial B2B banking
- In commercial banking, forecasting is crucial for managing credit risk and liquidity. Banks use scenario analysis to assess how different economic conditions could impact the creditworthiness of their business clients. For example, they might forecast the impact of a recession on the repayment abilities of companies in their loan portfolio, allowing them to adjust their lending strategies accordingly.
Across all these sectors, responsible financial forecasting is about making informed predictions based on rigorous data analysis. Scenario analysis and sensitivity analysis are key tools that help identify potential risks and opportunities, enabling financial institutions to make strategic decisions, allocate resources effectively, and manage risk. Whether it’s forecasting market trends for trading, predicting customer behavior in retail banking, or evaluating M&A scenarios in investment banking, accurate financial forecasting is essential for success in the finance industry. Business analysts are integral to this process, ensuring that forecasts are data-driven and aligned with organizational goals.
Healthcare Diagnosis
Healthcare analytics uses medical data and clinical expertise to identify the underlying causes of a patient’s symptoms or health conditions. Analytics is used to identify patterns and relationships that can be used to make accurate diagnoses. Medical data such as electronic health records, medical images, laboratory results, and other relevant information are all used in this analysis.
There are many practical applications for analytics in healthcare. For instance, predictive models can be used to identify patients who are at risk of developing certain health conditions or complications. This can help clinicians to provide early interventions or preventive care to improve patient outcomes. Ultimately, analytics provides clinicians with decision support tools that can assist with diagnosis and treatment planning. This may involve using predictive models or decision support algorithms to suggest diagnostic tests, treatment options, or medication recommendations.
Another application is population health management. This is leveraging analytics to manage the health of populations by identifying health trends, risk factors, and areas for improvement. This can help healthcare organizations to implement targeted interventions that improve health outcomes and reduce healthcare costs. By using different analytical techniques such as classification, the healthcare industry can inform diagnosis and treatment decisions, improve the quality of care, and achieve better health outcomes for patients.
Starting with the Problem Statement
The problem statement is one of the most important parts of an analytical project as it defines the scope and objective of the project. Without a clear problem statement, the project may lack direction and focus, making it difficult to achieve the desired outcomes. The benefits of a clear problem statement include helping to determine the specific problem to be solved and what kind of analytical problem is at hand. Also, a problem statement helps stakeholders to understand what the project aims to achieve, what its limitations are, and what to expect and when.
Once a problem statement is defined, it guides decision making and assists stakeholders to make informed decisions about project design, data collection, analysis techniques, and other aspects of the project. Additionally, it enables the measurement of project success by defining clear objectives and outcomes that can be tracked and evaluated.
Let’s look at an example problem statement regarding telecom churn:
The problem we plan to address is the high rate of customer churn in our telecommunications company. We need to develop a predictive model that can accurately predict which customers are most likely to churn in the next 30 days, so that we can implement targeted retention strategies to retain those customers.
This problem statement clearly defines the problem we aim to solve (high rate of customer churn) and the objective of the project (to develop a predictive model to identify customers most likely to churn). It also provides a time frame for the prediction (next 30 days) and the reason for the prediction (to implement targeted retention strategies). With this problem statement, we can design a predictive modeling project that focuses on developing a model that accurately predicts which customers are at risk of churning and enables the implementation of targeted retention strategies to reduce churn rates.
Let’s consider another example problem statement about low conversion rates for a checkout page:
Our ecommerce website is experiencing a low conversion rate for our checkout page. We want to increase the conversion rate by optimizing the checkout page design. We will conduct an A/B test by creating two versions of the checkout page and randomly assigning visitors to each version. The problem we aim to solve is to identify which version of the checkout page leads to a higher conversion rate.
The problem is clearly identified (low conversion rate), as is the objective (to use A/B testing to determine the best version of the checkout page) that will result in a higher conversion rate (outcome).
The problem statements in both examples outline the approach that will be used in analytics. The telecom churn problem will be addressed with a predictive model, and A/B testing will use experimental design and inferential statistics to better understand the low conversion rates. From these problem statements, it is also possible to identify the data sources and elements that would be collected to start the analytical project (more problem statement examples are discussed in the next section).
Getting to the Analytical Problem
A primary challenge is understanding what kinds of analytical problems are at hand and which analytical approach to take. In the example of telecom churn or A/B testing, these are well-known analytical problems, and the approach to take is clear. However, not all problems are straightforward, and they may require multiple analytical techniques to solve. Here are the most common types of analytical problems:
- Prediction
- Problems that involve predicting future outcomes or events, such as predicting customer churn, stock prices, or demand for a product. Prediction problems are divided between classification and regression.
- Classification (prediction)
- Use cases that involve categorizing data into different classes or categories, such as classifying emails as spam or not spam, or classifying images as dogs or cats.
- Regression (prediction)
- Issues that involve predicting a continuous numerical value, such as predicting the price of a house based on its features.
- Optimization
- Situations that involve finding the best solution or combination of solutions to achieve a specific objective, such as optimizing a supply chain to minimize costs or maximizing revenue.
- Clustering
- Scenarios that involve grouping data into similar clusters or segments based on their characteristics, such as clustering customers based on their purchasing behavior or clustering images based on their content.
- Association
- Use cases that involve finding associations or relationships between different variables, such as identifying which products customers often purchase together. Association identifies relationships between variables in a dataset, while clustering groups similar data points together based on shared characteristics.
- Anomaly detection
- Problems that involve identifying unusual or anomalous behavior or data points, such as detecting fraud or faulty equipment in a manufacturing process.
These different types of analytical problems require different methods and techniques to solve them effectively, and it is important to choose the right approach for each problem based on its specific requirements and goals. In Chapter 1 we highlighted different analytical techniques: descriptive, diagnostic, discovery, predictive, and prescriptive. Each technique, or a combination of them, may be adopted to solve a problem. In addition, the techniques build on one another; for example, descriptive analytics needs to be completed before predictive analytical approaches can be applied.
Let’s break this down a bit further. Descriptive analytics projects focus on describing past events or trends, such as analyzing sales data to understand which products were popular during a specific period. Often the focus of descriptive projects is to answer the question, What has happened?
Discovery analytics projects focus on discovering new patterns or insights in data, such as analyzing social media data to understand customer sentiment about a product or brand. Discovery analytics projects focus on answering these questions: What else do I need to know? Is there something I have overlooked?
Diagnostic analytics projects focus on identifying the root cause of a problem or issue, such as analyzing customer complaints to understand the reasons for a decline in customer satisfaction. Diagnostic analytics projects focus on answering the question, What causes the outcome? In addition, it addresses the question, Why is there a problem? The answer to this question allows us to answer the what, why, and how.
Predictive analytics projects focus on predicting future events or trends, such as developing a statistical learning model to predict which customers are most likely to churn. The focus here is the question, What is expected to happen?
Prescriptive analytics projects focus on recommending the best course of action to achieve a specific objective, such as identifying the optimal pricing strategy to maximize revenue. Prescriptive analytics focuses on answering these questions: What is the right action for this problem? How can we make something happen?
Going back to the problem statement helps us understand how to design the project. Let’s consider a healthcare organization that wants to improve diagnosis accuracy. Here is the problem statement:
Our healthcare organization needs to improve the accuracy of its breast cancer diagnosis. We aim to develop a classification model that can accurately classify mammogram images as benign or malignant. The problem we aim to solve is to improve the accuracy of breast cancer diagnosis by developing a statistical learning model that can accurately classify mammogram images and reduce the rate of misdiagnosis. We can design a classification project that focuses on developing a statistical learning model that accurately classifies mammogram images and enables accurate breast cancer diagnosis. The outcome will be given to a medical doctor, who will use it to improve their diagnosis.
This problem statement identifies that a classification project will be needed, which means descriptive, diagnostic, discovery, and predictive techniques will be used. While the last example is a strong problem statement, it is important to be able to identify weak ones. Let’s look at a weaker version of the previous statement:
Our healthcare organization wants to use statistical learning to do something with mammogram images. We hope it will help in diagnosing breast cancer better. The aim is to create a statistical learning model that does something with the data and will hopefully be useful to doctors.
This statement has a number of issues:
- Lack of specificity
- The statement is vague and does not clearly define the problem or the objective. It mentions “doing something” with the mammogram images without specifying what that something is (e.g., classifying images, improving diagnosis accuracy).
- No clear goal
- It fails to establish a clear goal or outcome, making it difficult to understand what success looks like. The phrase “hopefully, it will be useful to doctors” shows uncertainty and a lack of direction.
- Ambiguous language
- The use of words like “do something” and “hopefully” reflects a lack of focus and commitment to a defined solution, which weakens the purpose and direction of the project.
- Lack of impact
- The statement doesn’t clearly articulate the impact or importance of the project, unlike the strong example, which emphasizes the need to reduce misdiagnosis and improve diagnostic accuracy.
In contrast, the strong problem statement is clear, is focused, and provides a concrete goal (improving the accuracy of breast cancer diagnosis through a specific statistical learning model), which makes it a solid foundation for a successful project.
Let’s consider another problem statement:
Our manufacturing company wants to improve the efficiency of its production process by reducing energy consumption. We aim to develop a regression model that can accurately predict energy consumption based on various production factors such as temperature, humidity, and production volume. The specific problem we aim to solve is to reduce energy consumption and costs by developing a statistical learning model that can accurately predict energy consumption.
This problem statement identifies that a regression project will be needed, which means descriptive, diagnostic, discovery, and predictive techniques will be used.
A good portion of analytical projects lead to prediction techniques, and this requires building up to creating a predictive model. Now that we’ve walked through a few example problem statements, you have a better understanding of what information they should convey and how this impacts the analytical approaches you use. If you receive a project with a weak problem statement, don’t be afraid to ask questions of the stakeholders to better understand the parameters before you begin work. This will save time for everyone involved and will lead to greater quality in your work.
Many analytical problems will focus on prediction. As outlined in this section, both classification and regression are called out as prediction approaches.
Classification
Classification is a type of supervised learning problem that is used to predict a categorical or discrete target variable based on a set of input features or variables. All classification models focus on determining the relationship between the input features (predictors) and the target variable (prediction) and leveraging that relationship for predictive capabilities. For example, a classification model could predict whether an email is spam or not based on the email’s content; the predicted value or target would be a flag of Spam or Not Spam. While classification can involve any number of categories, when the output is limited to two classes it is referred to as binary classification.
Regression
Regression is also a type of supervised learning problem that involves predicting a continuous numerical target variable based on a set of input features. A regression model could predict the price of a house based on its location, size, number of bedrooms, and other relevant features. Regression is widely used in various domains such as finance, healthcare, and engineering to model the relationships between variables and make predictions about future values of the target variable.
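As a quick illustration of the two prediction types, the following sketch fits a simple classifier and a simple regressor with scikit-learn. The tiny datasets are invented purely for illustration (a made-up count of suspicious words for spam detection, and made-up square footage and price pairs):

# A minimal sketch contrasting classification and regression with scikit-learn.
# The tiny datasets below are invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: predict a discrete label (1 = spam, 0 = not spam)
# from a single made-up feature (count of suspicious words).
X_cls = np.array([[0], [1], [2], [3], [8], [9], [10], [11]])
y_cls = np.array([0, 0, 0, 0, 1, 1, 1, 1])
classifier = LogisticRegression().fit(X_cls, y_cls)
print("Predicted class for 7 suspicious words:", classifier.predict([[7]]))

# Regression: predict a continuous value (price) from square footage.
X_reg = np.array([[1000], [1500], [2000], [2500], [3000]])
y_reg = np.array([200000, 260000, 330000, 400000, 455000])
regressor = LinearRegression().fit(X_reg, y_reg)
print("Predicted price for 2,200 sq ft:", regressor.predict([[2200]]))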
What Do We Want to Measure?
Determining what needs to be measured is a critical step in formulating a plan for an analytics project because it helps to define the project’s scope and goals. It provides clarity on the specific business problem that needs to be addressed, which in turn helps to determine the appropriate analytical approach, methods, and data required for the project. Without a clear understanding of what needs to be predicted, it is easy to get lost in the sea of available data and analytical methods, leading to misguided analyses.
It is important to determine the specific measurement that will be the outcome of a model. For example, if the goal of the project is to predict customer churn in a telecommunications company, the focus will be on analyzing customer behavior patterns and the factors that drive customer retention. If the measure is a churn rate, which is a continuous number, that further identifies the analytical problem: it would be a regression problem. We described classification and regression earlier in the chapter, but knowing what we want to predict also tells us more about our analytical problem. If we were instead predicting a label, such as whether an individual customer will churn, this would be a classification problem.
Analysis Approaches
Regression and classification problems have different goals and require different techniques for EDA in statistical learning. How you approach EDA will be influenced by the type of prediction problem you have. For predictive analytic problems, the primary goal is to identify the relationships between the input variables (known as predictors or features) and the target (known as the predicted value). Depending on the type of problem—either regression or classification—different statistical analysis will be completed.
For regression problems, the focus is on analyzing the distribution and correlation of the input variables, identifying outliers, and detecting nonlinear relationships. EDA visualization techniques such as scatter plots, histograms, and correlation matrices are commonly used in regression analysis. For classification problems, analyzing the distribution and correlation of the input variables for each class, identifying class imbalances, and detecting nonlinear relationships are common activities. EDA visualization techniques such as bar charts, box plots, and heatmaps are commonly used in classification analysis. Examples for both cases are covered in a later section.
In both cases, EDA plays a critical role in identifying data quality issues, understanding the relationships between variables, and selecting appropriate feature engineering and preprocessing techniques. Effective EDA can help ensure that the model is able to learn the underlying patterns in the data and make accurate predictions. Let’s explore this further.
EDA
EDA is the process of analyzing and visualizing data to understand its characteristics, patterns, and relationships. EDA is an important step in the statistical learning process as it helps to identify important features, detect anomalies and outliers, and understand the underlying structure of the data. The main goals of EDA in statistical learning are as follows:
- To gain a deeper understanding of the data and its characteristics, such as its distribution, range, and correlation between variables
- To identify any data quality issues, such as missing or incorrect values, outliers, or duplicates
- To explore potential relationships between the input variables and the target variable
- To determine appropriate feature engineering and data preprocessing techniques that can improve the performance of the model
Expected outcomes of EDA in statistical learning may include the identification of important variables, the detection of data quality issues, the identification of relationships between variables, and the selection of appropriate feature engineering and data preprocessing techniques. Statistical analysis techniques such as correlation analysis and hypothesis testing are also used. Effective EDA can help ensure that the statistical learning model is trained on high-quality data and is able to make accurate predictions.
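A few lines of pandas are often enough to start this exploration. The following minimal sketch, using the mtcars dataset that appears later in this chapter, covers the first EDA goals listed above: summarizing distributions, checking for missing values, and reviewing correlations with a target variable:

# A first pass at EDA with pandas, using the mtcars dataset that appears
# later in this chapter.
import statsmodels.api as sm

mtcars = sm.datasets.get_rdataset("mtcars").data

print(mtcars.shape)           # number of rows and columns
print(mtcars.describe())      # distribution summaries: mean, quartiles, min/max
print(mtcars.isnull().sum())  # data quality: missing values per column
print(mtcars.corr()["mpg"].sort_values())  # correlation of each variable with mpg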
Unsupervised Learning
Unsupervised learning is a type of statistical learning where the goal is to find patterns or structure in data without any labeled output or target variable. In other words, the algorithm is given a set of input features and is expected to identify meaningful patterns or relationships in the data on its own, without any guidance. During EDA, unsupervised learning can be explored to gain a deeper understanding of the data and to identify hidden patterns or relationships that may not be immediately obvious. Unsupervised learning can be used to identify clusters or groups of similar data points, detect anomalies or outliers, and reduce the dimensionality of the data.
Common unsupervised learning techniques include clustering, dimensionality reduction, and anomaly detection. Clustering involves grouping similar data points together based on their similarity, while dimensionality reduction techniques aim to reduce the number of features or variables in the data while preserving as much of the original information as possible. Anomaly detection aims to identify data points that are significantly different from the rest of the data and may indicate errors or outliers. For instance, the discovery of clusters in a dataset can result in different predictive models. When we apply clustering to a dataset, it can help to identify patterns or groups of data points that have similar characteristics or behavior.
If we use these clusters to create a predictive model, the resulting model may be different from one that is built without clustering. This is because the clusters may highlight specific relationships between the input features and the target variable that are not apparent in the original dataset.
For example, imagine that we have a dataset of customer transactions from an online store, and we want to build a predictive model to forecast customer purchases. If we use clustering to group customers based on their transaction history or demographics, we may identify distinct customer segments that have different purchase patterns. We can then use these customer segments as input features in our predictive model, which may result in a more accurate and targeted model. However, it’s important to note that the resulting predictive model may not necessarily be better than a model built without clustering, and the choice of which approach to use ultimately depends on the specific problem and the available data. Code examples of clustering and other unsupervised learning methods will be reviewed in Chapter 5.
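Ahead of the fuller treatment in Chapter 5, here is a minimal sketch of k-means clustering with scikit-learn on synthetic customer features; the spend and order values are invented for illustration:

# K-means clustering on synthetic customer features (annual spend and
# number of orders); all values are made up for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
spend = np.concatenate([rng.normal(200, 40, 50), rng.normal(900, 150, 50)])
orders = np.concatenate([rng.normal(3, 1, 50), rng.normal(15, 4, 50)])
X = np.column_stack([spend, orders])

# Scale the features so both contribute equally to the distance calculation
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)
print("Cluster sizes:", np.bincount(kmeans.labels_))

# The cluster labels could then be added as an input feature for a predictive model.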
Statistical Analysis for Regression
The analysis plan that will be used in EDA is determined by the prediction approach (regression or classification). All datasets will be analyzed for content quality, patterns, relationships, and other issues that could impact the accuracy of the potential model. Linear regression is the primary regression approach used in analytics, and we will explore it in upcoming chapters. Here, I’ll focus on how statistical analysis is used in preparing for a linear regression model. This allows us to understand the underlying structure, patterns, and characteristics of the data. As a result of EDA, an analyst will:
- Identify influential variables that can assist in predicting an outcome
- Discover and detect outliers and anomalies
- Test assumptions about the data to determine whether they hold
- Explore relationships between variables to help in feature selection and engineering
- Visualize the data to make analysis easier
- Clean and process data
Most of this list applies to any predictive model, but with linear regression, statistical analysis relies on certain properties being present in the data. For instance, understanding the distribution of the variables is important in analyzing data for regression models. In many regression models, especially linear regression, we assume that the residuals or errors are normally distributed. Checking the distribution of the dependent variable (and possibly the independent variables) can help determine if a normality assumption is reasonable or if data transformations are required to meet this assumption. Distribution can also be used to identify outliers, which can impact the accuracy of the regression model.
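As a minimal sketch of this kind of distribution check, the following Python snippet measures the skewness of the dependent variable (mpg in the mtcars dataset used later in this chapter) and runs a Shapiro-Wilk normality test from SciPy:

# Checking the distribution of a dependent variable (mpg in mtcars)
# before fitting a linear regression model.
import statsmodels.api as sm
from scipy import stats

mtcars = sm.datasets.get_rdataset("mtcars").data
mpg = mtcars["mpg"]

print("Skewness:", round(stats.skew(mpg), 3))

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed.
stat, p_value = stats.shapiro(mpg)
print(f"Shapiro-Wilk statistic: {stat:.3f}, p-value: {p_value:.3f}")
# A low p-value suggests a transformation (for example, a log transform)
# may be worth considering before fitting the model.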
Correlation analysis assists in identifying relationships between the predictors and the predicted value. A strong correlation between a predictor and the predicted outcome suggests that it might be a good predictor for the regression model, which is part of determining the correct predictors. Correlation can also be used to identify multicollinearity, which occurs when two or more independent variables are highly correlated with each other. Last, correlation is used to check the independence between variables that regression models assume.
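A minimal sketch of these checks in Python computes a correlation matrix against the target and then variance inflation factors (VIFs), one common way to quantify multicollinearity; the choice of wt, hp, and disp as candidate predictors is just for illustration:

# Correlation and multicollinearity checks for a regression problem.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

mtcars = sm.datasets.get_rdataset("mtcars").data

# Correlation of each variable with the target (mpg)
print(mtcars.corr()["mpg"].sort_values())

# Variance inflation factors for three candidate predictors; values well
# above 5-10 are commonly read as a sign of multicollinearity.
predictors = sm.add_constant(mtcars[["wt", "hp", "disp"]])
for i, name in enumerate(predictors.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(predictors.values, i), 2))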
In addition to analyzing distribution and correlation, EDA focuses on determining the content quality of the data. One focus is identifying and addressing missing data. Techniques like mean imputation, median imputation, or interpolation can be used to address missing data. Another focus is to determine what transformation is needed for the data. Categorical data such as a user’s favorite color cannot be used as input into a regression model as the regression model needs numerical input.
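Here is a minimal sketch of mean and median imputation with pandas; the small dataset with missing monthly charges and tenure values is made up for illustration:

# Simple mean and median imputation on a small, made-up dataset.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "monthly_charges": [70.0, 85.5, np.nan, 60.0, 95.0],
    "tenure_months": [12, np.nan, 24, 36, np.nan],
})

# Mean imputation for monthly_charges, median imputation for tenure_months
df["monthly_charges"] = df["monthly_charges"].fillna(df["monthly_charges"].mean())
df["tenure_months"] = df["tenure_months"].fillna(df["tenure_months"].median())

print(df)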
One common technique to handle categorical variables in classification problems is one-hot encoding, also known as dummy variable encoding. One-hot encoding converts each categorical variable into a binary vector with a length equal to the number of categories in the variable. Each element in the vector corresponds to a category, and its value is 1 if the data point belongs to that category, and 0 otherwise.
For example, suppose we have a categorical variable color with three categories: red, green, and blue. One-hot encoding this variable would result in three binary variables: color_red, color_green, and color_blue. If a data point has a red color, the color_red variable would be 1, and the color_green and color_blue variables would be 0. One-hot encoding is a feature engineering method used to prepare the data for the modeling phase.
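A minimal sketch of this encoding in Python uses pandas get_dummies on the color example from the text:

# One-hot encoding the color example from the text with pandas.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
# Each row now has a 1 in exactly one of color_blue, color_green, color_red.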
When dealing with a categorical variable like a PIN code in EDA, the challenge arises when the number of unique categories is very high, such as thousands of different PIN codes. This can make it difficult to extract meaningful patterns or use the variable effectively in a model. One approach to handle this is to aggregate the PIN codes into broader geographic regions, such as grouping them by city, state, or region. Alternatively, if geographic specificity is important, dimensionality reduction techniques like PCA or clustering can be applied to create meaningful groups from the PIN codes. Another option is to assess the impact of each PIN code category on the target variable and retain only the most significant ones, while the rest can be grouped into an “Other” category to reduce dimensionality.
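One simple way to implement the “Other” grouping is to keep only the categories that appear often enough and collapse the rest, as in the following sketch; the PIN codes and frequency threshold are hypothetical:

# Collapsing rare categories of a high-cardinality variable (hypothetical
# PIN codes) into an "Other" bucket.
import pandas as pd

pins = pd.Series(["110001"] * 50 + ["400001"] * 40 + ["560001"] * 5 +
                 ["600001"] * 3 + ["700001"] * 2, name="pin_code")

counts = pins.value_counts()
frequent = counts[counts >= 10].index             # keep only frequent codes
grouped = pins.where(pins.isin(frequent), "Other")

print(grouped.value_counts())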
The end goal of EDA is to identify the most important variables to be used as predictors for the regression model and prepare the data to build the model. Distribution, correlation, outliers, and converting data to ensure numerical inputs are the focus for statistical analysis for regression models.
Analysis for Classification
Classification has EDA goals that are similar to those of regression, with some differences. The same goals exist: testing assumptions about the data, exploring data relationships, cleaning and processing the data, and identifying influential variables to predict the outcome. However, because classification does not rely on the same statistical assumptions as linear regression, the focus of the analysis will differ slightly. For instance, how much attention the distribution and correlation of variables deserve depends on the classification algorithm used, but reviewing these areas can provide important insights that improve the accuracy and interpretability of the model.
Analyzing the distribution of variables can help to identify potential issues like skewness or outliers that may affect the performance of the classification model. For example, distribution can identify if the target variable is highly imbalanced (where one class is more populous than the other). A significant imbalance can lead to poor model performance, as the model might be biased toward the majority class. If class imbalance is detected, techniques such as resampling, synthetic data generation, or adjusting the classifier’s decision threshold can be applied as part of the EDA process.
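A minimal sketch of an imbalance check and a simple random oversampling step with pandas follows; the churn labels and 90/10 split are made up for illustration, and oversampling is only one of the options mentioned above:

# Checking class balance and randomly oversampling the minority class.
# The churn labels and 90/10 split are made up for illustration.
import pandas as pd

df = pd.DataFrame({"churn": ["no"] * 90 + ["yes"] * 10,
                   "monthly_charges": list(range(100))})

print(df["churn"].value_counts(normalize=True))  # reveals the 90/10 imbalance

majority = df[df["churn"] == "no"]
minority = df[df["churn"] == "yes"]

# Resample the minority class with replacement to match the majority class size
oversampled = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, oversampled])

print(balanced["churn"].value_counts())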
Skewness refers to the asymmetry in the distribution of a variable’s values, where data points are more concentrated on one side of the distribution than the other. In classification models, skewed variables can impact the model’s performance by leading to biased predictions, especially if the skewed variable is influential in determining the outcome. For instance, if a feature with significant skewness has outliers or an imbalanced distribution, the model might overly rely on these extreme values, potentially misclassifying cases that fall into the tails of the distribution. Additionally, skewed features can distort the decision boundaries of algorithms like logistic regression or support vector machines. To mitigate these effects, it’s common practice to transform skewed variables using techniques such as logarithmic or Box-Cox transformations, which can help create a more balanced distribution and improve the robustness of the classification model.
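The following sketch measures skewness with SciPy and applies a logarithmic transformation with NumPy; the right-skewed income feature is synthetic:

# Measuring skewness and reducing it with a log transformation.
# The right-skewed income feature is synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.8, size=1000)

print("Skewness before:", round(stats.skew(income), 2))
print("Skewness after log transform:", round(stats.skew(np.log1p(income)), 2))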
Correlation is used in evaluating data, and it could impact the choice of algorithm. For instance, if logistic regression is used, multicollinearity is a concern. Multicollinearity can cause unstable models or models that cannot easily be interpreted. Regression-based algorithms, such as logistic regression, linear discriminant analysis (LDA), least absolute shrinkage and selection operator (LASSO) regression, and ridge regression, can be affected because highly correlated variables influence the coefficient estimates. Support vector machines (SVMs) and k-nearest neighbors (KNN) can also be impacted negatively by multicollinearity. On the other hand, some classification algorithms are more robust to correlated variables, such as tree-based algorithms and neural networks. These algorithms will be explored more in upcoming chapters.
Strong correlations between predictors and the target variable help identify the most important predictors, making correlation an important factor in selecting predictors. Analyzing correlations between features can also help identify opportunities for feature engineering. Combining or transforming correlated features can create new, more informative features that may improve the performance of your classification model.
Role of Hypothesis Testing
Hypothesis testing is a statistical tool used to determine whether a particular hypothesis is supported by the data or not. In regression or classification problems, hypothesis testing can be used to evaluate the significance of individual predictor variables or to test the overall significance of the model.
As part of EDA, hypothesis testing can be used to determine if individual predictor variables in regression models are significant in explaining the variation in the target variable. The null hypothesis could be tested to see if the coefficient of a predictor variable is equal to zero, and if rejected, the predictor is determined to be significantly related to the target variable.
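In practice, this test is reported automatically when fitting a linear regression: the model summary includes a t-statistic and p-value for each coefficient against the null hypothesis that it is zero. A minimal sketch with statsmodels, using the mtcars data and an arbitrary choice of wt and hp as predictors, looks like this:

# Coefficient significance tests in a linear regression with statsmodels.
import statsmodels.api as sm
import statsmodels.formula.api as smf

mtcars = sm.datasets.get_rdataset("mtcars").data

# The summary reports a t-statistic and p-value for each coefficient against
# the null hypothesis that the coefficient is zero.
model = smf.ols("mpg ~ wt + hp", data=mtcars).fit()
print(model.summary())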
With classification models, hypothesis testing is used to determine whether the model performs better than a baseline model. The null hypothesis in this context typically states that there is no difference in accuracy between the model and the baseline, which could be a model predicting the most frequent class or simply predicting the average outcome. For example, the null hypothesis might state, “The accuracy of the classification model is equal to the accuracy of a baseline model that predicts the most frequent class.” Hypothesis testing can also be applied to compare the performance of different models, such as a simple model versus a more complex one. If the null hypothesis is rejected, it indicates that the classification model provides a statistically significant improvement over the baseline model or the simpler model. This process ensures that any observed improvement in model performance is not due to random chance but rather reflects a meaningful enhancement.
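One simple way to frame this comparison, sketched below, is a one-sided binomial test of whether the model’s number of correct predictions exceeds what the baseline accuracy would produce; the test-set size, baseline accuracy, and correct-prediction count are hypothetical:

# Testing whether a classifier beats a most-frequent-class baseline.
# The counts are hypothetical: 100 test cases, a baseline accuracy of 0.70
# (the majority class share), and 82 correct model predictions.
from scipy import stats

n_test = 100
baseline_accuracy = 0.70
model_correct = 82

# Null hypothesis: the model's accuracy equals the baseline accuracy.
result = stats.binomtest(model_correct, n_test, baseline_accuracy, alternative="greater")
print(f"p-value: {result.pvalue:.4f}")
# A small p-value suggests the improvement over the baseline is unlikely
# to be due to chance alone.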
Overall, hypothesis testing can play a critical role in assessing the significance of predictor variables and the model as a whole in regression and classification problems. It helps to identify the most important predictors and evaluate the overall performance of the model, which can guide further model improvement and selection.
Visualization in Analytics
Visualizing data in EDA allows analysts and data scientists to understand the underlying patterns, trends, and relationships within the data. By creating visualizations of the data, analysts can identify outliers, potential errors, or missing values, and better understand the distribution of the data. In addition, visualizations assist in identifying potential relationships between variables that are not easily identified using numerical analysis alone. Visualizations can also help in identifying any class imbalances, biases, or inconsistencies in the data that can affect the performance of any models built using the data.
There are many tools that can be used to support visualization in analytics, but our focus will be on how to leverage R and Python to complete visualizations.
Visualization in R and Python to Support EDA
Both R and Python are powerful analytical languages that contain several popular visualization libraries. Each of the libraries has similar functionality but with a different focus. The choice of library will depend on the use of the visualization. Table 4-1 outlines the different visualization libraries for Python and when each might be used, and Table 4-2 outlines the libraries for R. This is not an exhaustive list, but I’ve included the most popular libraries.
| Library | Description | Use case |
|---|---|---|
| Matplotlib | A versatile and widely used library that provides a comprehensive range of plotting functions. | General-purpose static plots with fine-grained control over every element of the figure. |
| Seaborn | A statistical data visualization library built on top of Matplotlib. It provides high-level and aesthetically pleasing plots with minimal coding. | Quick statistical plots during EDA, such as distributions, box plots, and correlation heatmaps. |
| Plotly | An interactive and web-based visualization library that supports a wide range of chart types. | Interactive charts and dashboards that can be explored and shared in a browser. |
| Bokeh | Another library for creating interactive and web-based visualizations, with a focus on providing more control over the look and feel of the plots. | Interactive web visualizations where detailed control over styling and interactivity is needed. |
| ggplot2 | A port of the popular ggplot2 library in R, which allows for creating visually appealing graphics. | See the use case for ggplot2 in Table 4-2. |
R also has many visualization libraries. ggplot2 is an example of a powerful visualization library and has been ported to Python as referenced in Table 4-1.
| Library | Description | Use case |
|---|---|---|
| Base R | This graphics package comes with R and provides simple, quick, and easy-to-use plotting functions. | Quick exploratory plots with no additional packages required. |
| ggplot2 | A powerful and widely used library based on the Grammar of Graphics. It allows for more sophisticated and customizable visualizations. | Layered, publication-quality graphics for most EDA and reporting tasks. |
| lattice | Another popular library for creating complex and customizable visualizations, particularly for multivariate data. | Trellis (panel) plots that compare a relationship across groups or conditions. |
| Plotly | An interactive and web-based visualization library, which can create interactive plots using both ggplot2 and base R graphics. | Interactive charts, including converting existing ggplot2 plots into interactive versions. |
| Shiny | Not a visualization library but rather an R package for building interactive web applications. However, it can be used in conjunction with other visualization libraries to create interactive visualizations. | Interactive web applications and dashboards that embed plots from other libraries. |
Many other visualization libraries are available in both Python and R, each with its own strengths and weaknesses, so the choice of library ultimately depends on the specific requirements and goals of the visualization project.
Regression Visualization
Exploring data through visualizations is an important step in understanding relationships and patterns within the data when building regression models. Some visualization techniques that can be used to explore data for regression problems include histograms and density plots. For example, these can be used to check for normality and to identify potential outliers. In this section, I’ll walk you through examples that show the visualization libraries being applied for various scenarios in analyzing datasets for regression.
For the examples in this section, we will use mtcars, a well-known dataset in the R programming environment, often used for teaching and demonstration purposes in statistical analysis and data visualization. It contains data on 32 different car models from the 1974 Motor Trend US magazine. The dataset includes 11 variables related to various aspects of automobile performance and design, such as miles per gallon (mpg), number of cylinders (cyl), displacement (disp), horsepower (hp), rear axle ratio (drat), weight (wt), quarter-mile time (qsec), engine type (vs), transmission type (am), number of forward gears (gear), and number of carburetors (carb). These variables allow for extensive analysis of relationships between car characteristics, such as the correlation between engine size and fuel efficiency or the impact of weight on performance metrics. The mtcars dataset is particularly useful for exploring regression models, creating visualizations, and practicing data manipulation techniques.
Scatter plots
Scatter plots show the relationship between two continuous variables. They can be used to identify patterns and relationships between the independent and dependent variables. Figure 4-1 shows an example scatter plot in R:
library(ggplot2)

ggplot(data = mtcars, aes(x = mpg, y = disp)) +
  geom_point()
Figure 4-2 shows a scatter plot in Python:
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

# Load the mtcars dataset from R's datasets package
mtcars = sm.datasets.get_rdataset('mtcars')
mtcars = mtcars.data

sns.scatterplot(x = 'mpg', y = 'disp', data = mtcars)
plt.show()
Figures 4-1 and 4-2 show a scatter plot that displays the relationship between miles per gallon (mpg) and displacement (disp) for the cars in the mtcars dataset. The plot shows a negative correlation between mpg and disp, meaning that as the displacement of a car’s engine increases, the miles per gallon tend to decrease. This indicates that cars with larger engines (higher displacement) are generally less fuel-efficient. This visualization helps in understanding how engine size (displacement) impacts fuel efficiency (mpg) in the dataset.
Box plots
Box plots show the distribution of a continuous variable and can be used to identify potential outliers. Figure 4-3 shows an example box plot in R:
ggplot(data = mtcars, aes(x = factor(cyl), y = disp)) +
  geom_boxplot()
Figure 4-4 shows an example box plot in Python:
sns.boxplot(x = 'cyl', y = 'disp', data = mtcars)
plt.show()
A box plot is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. Its purpose is to provide a visual summary of key aspects of the data distribution.
Both figures show a box plot that displays the distribution of engine displacement (disp) across different categories of cylinders (cyl) in the mtcars dataset. The box plot shows how engine displacement varies for cars with different numbers of cylinders (4, 6, and 8 cylinders). Each box represents the interquartile range (IQR) of displacement for that cylinder group, with the line inside the box indicating the median displacement. The plot shows that cars with more cylinders (e.g., 8 cylinders) tend to have higher engine displacements compared to cars with fewer cylinders (e.g., 4 cylinders). This is visible through the shift in the median and the overall higher range of values for higher cylinder counts.
Density plots
Another form of visualization is density plots. These show the distribution of a continuous variable and can be used to identify potential skewness. Figure 4-5 shows an example of density plots in R:
ggplot(data = mtcars, aes(x = mpg)) +
  geom_density()
Figure 4-6 shows an example density plot in Python:
sns.kdeplot(x = 'mpg', data = mtcars)
plt.show()
The density plots shown in Figures 4-5 and 4-6 visualize the distribution of miles per gallon (mpg) values in the mtcars dataset. They highlight the most common mpg ranges, with the peak of the curve indicating where the majority of the cars’ fuel efficiencies lie. The plots also show the spread of mpg values and any skewness, helping to understand the variability in fuel efficiency across different car models. This visualization is useful for quickly assessing the overall distribution of fuel efficiency in the dataset.
Heatmaps
Heatmaps show the relationship between two continuous variables using colors to represent the magnitude of the relationship. Figure 4-7 shows an example of heatmaps in R:
ggplot(data = mtcars, aes(x = hp, y = disp)) +
  geom_bin2d() +
  scale_fill_gradient(low = "white", high = "blue")
Figure 4-8 shows an example heatmap in Python:
sns.histplot(x = 'hp', y = 'disp', data = mtcars, cmap = 'Blues')
plt.show()
These visualization techniques can help identify patterns, trends, and outliers in the data, which can guide feature selection and engineering efforts to prepare the data for regression modeling. Both figures show the relationship between horsepower (hp) and engine displacement (disp) in the mtcars dataset. The data points are grouped into bins, with the color intensity (from white to blue) indicating the density of points within each bin. Darker blue areas represent higher concentrations of cars with similar horsepower and displacement values. This visualization is useful for identifying patterns and clusters in the data, such as regions where cars tend to have similar performance characteristics.
Classification Visualization
Visualization for classification problems serves the same purposes in EDA as it does for regression. One visualization technique that can be used to explore data for classification problems is examining the distribution of categorical variables. Let’s consider the visualization techniques that can be applied here.
For these examples, we will use the Iris dataset. The Iris dataset is a classic dataset in machine learning and statistics, often used for benchmarking algorithms and exploring data analysis techniques. It contains 150 observations of iris flowers, with each observation belonging to one of three species: Iris setosa, Iris versicolor, and Iris virginica. The dataset includes four numerical features: sepal length, sepal width, petal length, and petal width, all measured in centimeters. These features describe the physical dimensions of the iris flowers, and the dataset is commonly used for tasks such as classification, clustering, and visualization to differentiate between the three species based on these measurements.
Bar plots
Bar plots show the distribution of categorical variables and can be used to identify potential class imbalances. Figure 4-9 shows an example of bar plots in R:
library(ggplot2)
library(dplyr)

# Create a new categorical variable based on Petal.Length
iris <- iris %>%
  mutate(PetalLengthCategory = cut(Petal.Length,
                                   breaks = c(0, 2, 4, 6, Inf),
                                   labels = c("Short", "Medium", "Long", "Very Long")))

# Create the bar chart
ggplot(data = iris, aes(x = Species, fill = PetalLengthCategory)) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of Petal Length Categories by Species",
       x = "Species",
       y = "Count",
       fill = "Petal Length") +
  theme_minimal()
Figure 4-10 shows a bar chart in Python:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the iris dataset
iris = sns.load_dataset('iris')

# Create a new categorical variable based on petal_length
bins = [0, 2, 4, 6, float('inf')]
labels = ['Short', 'Medium', 'Long', 'Very Long']
iris['PetalLengthCategory'] = pd.cut(iris['petal_length'], bins=bins, labels=labels)

# Create the bar chart
plt.figure(figsize=(10, 6))
sns.countplot(data=iris, x='species', hue='PetalLengthCategory', palette='viridis')

# Add labels and title
plt.title('Distribution of Petal Length Categories by Species')
plt.xlabel('Species')
plt.ylabel('Count')
plt.legend(title='Petal Length')

# Show the plot
plt.show()
Bar charts are a fundamental tool in EDA for visualizing and comparing categorical data. Each bar represents a category with its height or length proportional to the value or frequency of that category. Bar charts excel in revealing trends and differences among categories and in providing an intuitive understanding of data distribution. They are particularly effective in EDA for identifying patterns and outliers and for making initial assessments about the relationship between discrete variables.
The bar chart reveals significant differences in petal length distribution among the three iris species. Iris setosa primarily has short petals, while Iris versicolor shows a majority of medium-length petals. Iris virginica displays a wider range, with a significant portion of its petals falling into the long and very long categories. This variation in petal length categories highlights distinct morphological differences between the species, which can be crucial for species identification and classification. The chart effectively illustrates how petal length is a distinguishing characteristic among the iris species.
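To back up the visual reading with numbers, you can tabulate the counts directly. The Python snippet below is an illustrative addition that reuses the iris DataFrame and the PetalLengthCategory column created in the preceding example and cross-tabulates species against petal-length category.

import pandas as pd

# Cross-tabulate species against the petal-length categories created earlier;
# each cell holds the number of flowers in that combination
counts = pd.crosstab(iris['species'], iris['PetalLengthCategory'])
print(counts)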
Parallel coordinates plot
A parallel coordinates plot is used to visualize multivariate data by plotting each feature on a separate axis and connecting the data points across axes. This chart helps to identify how different features contribute to the classification of each instance. Figure 4-11 shows an example of a parallel coordinates plot in R:
install.packages("GGally") # Load required libraries library(ggplot2) library(GGally) # Load the iris dataset data(iris) # Create a parallel coordinates plot ggparcoord(data = iris, columns = 1:4, groupColumn = 5, scale = "uniminmax") + theme_minimal() + labs(title = "Parallel Coordinates Plot for Iris Dataset", x = "Features", y = "Scaled Values")
Figure 4-12 shows a parallel coordinates plot in Python:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Load the iris dataset
iris = sns.load_dataset("iris")

# Create a parallel coordinates plot
plt.figure(figsize=(10, 5))
parallel_coordinates(iris, 'species', colormap=plt.get_cmap("Set1"))

# Set the title and labels
plt.title("Parallel Coordinates Plot for Iris Dataset")
plt.xlabel("Features")
plt.ylabel("Values")

# Show the plot
plt.show()
The parallel coordinates plot provides a comprehensive view of how the different features (sepal length, sepal width, petal length, and petal width) vary across the three iris species. Each line represents an individual observation, and the plot visually highlights the distinct patterns among species. For example, Iris setosa exhibits clear separation from the other two species, particularly in petal length and petal width, indicating these features are key differentiators. Iris versicolor and Iris virginica show more overlap, but differences in petal measurements still allow for distinction. This plot effectively demonstrates the multidimensional relationships between features and their role in distinguishing iris species.
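One difference worth noting: the R example rescales every feature to a common 0–1 range (scale = "uniminmax"), whereas pandas' parallel_coordinates plots raw values, so features measured on larger scales can dominate the picture. The sketch below shows one way to min-max scale the numeric columns first; this scaling step is an addition for illustration rather than part of the chapter's example.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Load the data and separate the numeric features from the class label
iris = sns.load_dataset("iris")
features = iris.drop(columns="species")

# Min-max scale each feature to the 0-1 range so the axes are comparable
scaled = (features - features.min()) / (features.max() - features.min())
scaled["species"] = iris["species"]

# Plot the scaled data
plt.figure(figsize=(10, 5))
parallel_coordinates(scaled, "species", colormap=plt.get_cmap("Set1"))
plt.title("Parallel Coordinates Plot (Min-Max Scaled Features)")
plt.show()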
Violin plots
Violin plots display the distribution of a feature across different target classes. Violin plots combine aspects of box plots and kernel density plots, providing additional insights into the data distribution. Figure 4-13 shows an example violin plot in R:
# Load required libraries
library(ggplot2)

# Load the iris dataset
data(iris)

# Create a violin plot
p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin(fill = "lightblue", draw_quantiles = c(0.25, 0.5, 0.75)) +
  labs(title = "Violin Plot for Sepal Length by Iris Species",
       x = "Species",
       y = "Sepal Length")

# Show the plot
print(p)
Figure 4-14 shows an example violin plot in Python:
import seaborn as sns
import matplotlib.pyplot as plt

# Load the iris dataset
iris = sns.load_dataset("iris")

# Create a violin plot
sns.violinplot(data=iris, x='species', y='sepal_length')

# Set the title and labels
plt.title("Violin Plot for Sepal Length by Iris Species")
plt.xlabel("Species")
plt.ylabel("Sepal Length")

# Show the plot
plt.show()
Violin plots are valuable in EDA for visualizing the distribution of numeric data across different categories. They combine elements of box plots and density plots, showing the median, interquartile range, and density of the data, thereby offering a deeper insight into the data distribution than traditional box plots. Violin plots are particularly useful in comparing multiple distributions and highlighting differences in both the central tendency and variability of data across categories. They are also adept at revealing multimodal distributions that a box plot alone would hide.
The violin plot provides a detailed visualization of the distribution of sepal length across the three iris species. It reveals not only the central tendency but also the variability and distribution shape of sepal length within each species. Iris setosa shows a relatively tight distribution with a higher concentration of lower sepal lengths, while Iris virginica has a broader distribution with higher sepal lengths. Iris versicolor’s distribution lies in between, with a moderate spread. This plot highlights the differences in sepal length among the species, making it clear that sepal length is a distinguishing feature, particularly between Iris setosa and the other two species.
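If you want to pair this visual reading with numbers, a quick grouped summary tells the same story. The Python snippet below is an illustrative addition that reports per-species summary statistics for sepal length.

import seaborn as sns

# Load the iris dataset and summarize sepal length by species;
# the counts, means, quartiles, and spread echo what the violin plot shows
iris = sns.load_dataset("iris")
print(iris.groupby("species")["sepal_length"].describe())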
Contour plots
Contour plots (or 2D density plots) help visualize the joint distribution of two continuous features. This can help you identify clusters, trends, or patterns in the data. Figure 4-15 shows an example contour plot in R:
# Load required libraries
library(ggplot2)

# Load the iris dataset
data(iris)

# Create a contour plot
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_density_2d() +
  labs(title = "Contour Plot for Sepal Length vs. Sepal Width by Iris Species",
       x = "Sepal Length",
       y = "Sepal Width")

# Show the plot
print(p)
Figure 4-16 shows an example contour plot in Python:
import seaborn as sns
import matplotlib.pyplot as plt

# Load the iris dataset
iris = sns.load_dataset("iris")

# Create a contour plot; palette controls the per-species colors
sns.kdeplot(data=iris, x="sepal_length", y="sepal_width", hue="species", palette="viridis")

# Set the title and labels
plt.title("Contour Plot for Sepal Length vs. Sepal Width by Iris Species")
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")

# Show the plot
plt.show()
The contour plot visualizes the relationship between sepal length and sepal width across the three iris species, highlighting regions of high data density. The contours indicate where the data points for each species are most concentrated, with distinct patterns emerging for each species. Iris setosa is well-separated from the other two species, clustering in a region with shorter sepal lengths and higher sepal widths. Iris versicolor and Iris virginica show some overlap, but Iris virginica generally occupies areas with longer sepal lengths. This plot effectively demonstrates the variation and clustering of sepal dimensions across species, aiding in understanding how these features differentiate the iris species.
Contour plots are essential in EDA for visualizing three-dimensional data in two dimensions. They represent data values on a grid, using contour lines (or level curves) to show areas of similar values. These plots are particularly useful for identifying patterns, trends, and gradients in the data, as well as potential outliers, and they excel at displaying the topography of a surface, making them invaluable for geographical data, heatmaps, and a variety of scientific and engineering applications. Taken together, these different visualization techniques help you assess class imbalance and feature contribution, which in turn leads to more accurate classification models.
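As a variation you might try, the following sketch (an illustrative addition, not one of the chapter's figures) fills the density regions rather than drawing only contour lines, which can make the clusters easier to see at a glance; the fill and alpha settings are arbitrary choices.

import seaborn as sns
import matplotlib.pyplot as plt

# Load the iris dataset
iris = sns.load_dataset("iris")

# Filled 2D density plot: shaded regions replace the bare contour lines;
# alpha keeps overlapping species visible
sns.kdeplot(data=iris, x="sepal_length", y="sepal_width", hue="species",
            fill=True, alpha=0.4)
plt.title("Filled Density Plot for Sepal Length vs. Sepal Width by Iris Species")
plt.show()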
Summary
This chapter introduced you to different types of analytical projects. Understanding the problem to be solved is the starting point for deciding how an analytical project should be approached; for example, it determines whether the project calls for regression or classification. You should now understand the main considerations for how EDA is approached and reviewed, and you should be able to use visualization in R and Python to support EDA. In the next chapter, we dive deeper into EDA.