An Introduction to Statistical Hypothesis Testing on the Predicting Heart Failure Dataset
In the previous article, How to Perform an Exploratory Data Analysis with Python, the heat map suggested that there is a relationship between age and death event as well as between serum creatinine and death event. Those relationships can be either correlations or associations. A correlation is a relationship between two continuous variables, such as age and serum creatinine. An association is a relationship between a continuous variable and a categorical variable, such as age and death event, or between two categorical variables, such as smoking and death event. An association between age and death event, or between serum creatinine and death event, does not necessarily mean that age or serum creatinine causes the death event. We therefore need to find the cause of these relationships.
In this article, we will introduce the tools to analyse the relationship between two attributes. We will discuss the following aspects:
- Difference between Populations, Samples, Parameters, and Statistics
- What is a Hypothesis?
- What is Statistical Hypothesis Testing?
- Parametric vs. Non-Parametric Hypothesis Tests
- Level of Significance
- Basic Decision Theory
- Limitations of Statistical Significance Testing
Difference between Populations, Samples, Parameters, and Statistics
It is of great importance to understand the difference between a population and a sample:
A population is the individual or group that represents all the members of a group or category of interest. For our dataset, we are interested in the group of patients with heart failure. So, we could define the population as follows:
A sample is drawn from a larger population. It is the individual or group from which data are collected. For example, we would like to know the average age of all patients with heart failure or of all patients with heart failure who survived. The sample is supposed to represent the population. When collecting data from a sample, we would like to use these data to make inferences about the population from which the sample was drawn.
Parameter vs. Statistic
One way to find the average age is to get a list of every patient with heart failure. If the population is restricted to a hospital, we can then calculate the mean age of all patients on the list. Because it is calculated from the whole population, this average is a parameter of the population.
However, if the population is all patients with heart failure in a geographical region or in a country, then it might not be possible to get a list of all patients affected. Another way to find the average age is to select a random group of patients with heart failure and calculate the average age of this group. Because it is calculated from a sample, the mean value that we get from this is a statistic.
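The difference between a parameter and a statistic can be sketched in a few lines of Python. The ages below are simulated for illustration only. The population size, the mean of 61 years, and the sample size of 100 are assumptions and are not taken from the Predicting Heart Failure dataset.

```python
import random
import statistics

# Hypothetical population: ages of 5,000 imaginary patients with heart failure.
# The values are simulated and are NOT taken from the dataset.
rng = random.Random(42)
population_ages = [min(95, max(40, round(rng.gauss(61, 12)))) for _ in range(5000)]

# Parameter: calculated from every member of the population.
population_mean = statistics.mean(population_ages)

# Statistic: calculated from a random sample drawn from that population.
sample_ages = rng.sample(population_ages, 100)
sample_mean = statistics.mean(sample_ages)

print(f"Population mean age (parameter): {population_mean:.2f}")
print(f"Sample mean age (statistic):     {sample_mean:.2f}")
print(f"Difference (sampling error):     {sample_mean - population_mean:+.2f}")
```

Running the sketch with a different seed gives a slightly different sample mean each time, while the population mean stays the same for a fixed population.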
Potential issues with populations and samples
A population does not need to be large. For example, we might want to know the average age of patients with heart failure at a hospital in the last month only. Then our population consists only of the patients with heart failure who survived or died at that hospital in the last month. If the hospital had only 5 patients with heart failure in the last month, then our population has only 5 cases.
Researchers define the population either explicitly or implicitly. For the Predicting Heart Failure dataset, it is not explicitly clear whether the population is restricted to a hospital or to a country.
Samples are not always good representations of the population from which they are selected. Therefore, the results may not always generalise from the sample to the population, especially if the population is not clearly defined.
What is a Hypothesis?
In the dataset Predicting Heart Failure, the hypotheses are in the following form:
The first step in the process of statistical hypothesis testing is to establish a standard or benchmark. Therefore, we need to develop a point of reference against which things may be compared. The primary hypothesis is the null hypothesis. The null hypothesis always suggests that there will be an absence of an effect. In statistical terms, the null hypothesis can be written as follows:
For example, in our dataset Predicting Heart Failure the null hypothesis suggests:
The entire hypothesis process occurs a priori. Therefore, hypotheses are based on assumed principles or deductions from the conclusions of previous research. These hypotheses are generated before a new study takes place. The counterpart to the null hypothesis is the alternative hypothesis. In statistical terms, the alternative hypothesis can be written as follows:
An alternative hypothesis for our dataset could be as follows:
All these questions can be answered with yes or no. Tests of such hypotheses allow for the possibility of an effect in two directions, positive (the mean is larger) and negative (the mean is smaller). They are known as two-tailed tests. The alternative hypothesis does not include any statement about whether the mean age will be larger or smaller for patients who survived. It only states that the mean is different.
One-tailed tests allow for the possibility of an effect in one direction. In statistical terms, the alternative hypothesis can be written as follows:
A one-tailed alternative hypothesis for our dataset could be as follows:
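To make the difference between two-tailed and one-tailed tests more concrete, the following minimal sketch converts a test statistic into p-values for each kind of alternative hypothesis. It assumes that the test statistic follows a standard normal distribution when the null hypothesis is true. The function name p_values_from_z and the value 1.8 are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def p_values_from_z(z):
    """Return p-values for a z statistic under three alternative hypotheses.

    Assumption: the statistic is standard normal when the null hypothesis is true.
    """
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    upper_tail = 1.0 - standard_normal.cdf(z)  # one-tailed: the mean is larger
    lower_tail = standard_normal.cdf(z)        # one-tailed: the mean is smaller
    two_tailed = 2.0 * min(upper_tail, lower_tail)  # two-tailed: the mean is different
    return {"two_tailed": two_tailed, "larger": upper_tail, "smaller": lower_tail}


# Hypothetical test statistic, chosen only for illustration.
result = p_values_from_z(1.8)
for name, p in result.items():
    print(f"{name:>10}: p = {p:.4f}")
```

For the same test statistic, the one-tailed p-value in the direction of the observed effect is half of the two-tailed p-value. This is why the direction of the alternative hypothesis has to be fixed before looking at the data.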
What is Statistical Hypothesis Testing?
The objective of statistical hypothesis testing is to help find the cause of an observed relationship. The relationship can have different causes:
- A variable in the dataset is the cause.
- Something outside the dataset is the cause.
- It occurred just by chance or error.
Parametric vs. Non-Parametric Hypothesis Tests
Statistical hypothesis tests are divided into two groups:
- Parametric tests
- Non-parametric tests
Parametric tests make assumptions about the parameters of the population distribution from which the sample is drawn. This is often the assumption that the population data are normally distributed. These tests are used when the data are quantitative, continuous, and normally distributed. Non-parametric tests are distribution-free. They are used when the distribution of the population is unknown or cannot be assumed to be normal.
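As an illustration of a distribution-free approach, the sketch below implements a simple permutation test for a difference in means using only the Python standard library. A permutation test is just one possible non-parametric method and is used here as an example, not as the method applied in this series. The function permutation_test and the two lists of ages are hypothetical and are not rows from the dataset.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-tailed permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign the group labels at random
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical ages, invented for illustration (not rows from the dataset).
ages_died = [65, 70, 72, 60, 75, 80, 68, 82, 58, 77]
ages_survived = [55, 60, 50, 63, 58, 65, 49, 61, 70, 53]

p_value = permutation_test(ages_died, ages_survived)
print(f"Permutation test p-value: {p_value:.4f}")
```

The test makes no assumption about the shape of the population distribution. It only asks how often a random relabelling of the patients produces a difference at least as large as the observed one.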
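All of these tests return a number between zero and one, the p-value. It is the probability of obtaining a result at least as extreme as the one observed in the sample if the null hypothesis were true, so a small value counts as evidence against the null hypothesis.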
Level of Significance
Statistical significance refers to the likelihood, or probability, that a statistic derived from a sample represents some genuine phenomenon in the population from which the sample was selected. In other words, statistical significance provides a measure to help us decide whether what we observe in our sample is also going on in the population that the sample is supposed to represent. If we select a random sample from a population, then there is always a chance that it will differ slightly from the population. The question here is:
The alpha level is the standard that is used to determine whether a result is statistically significant. It is the cut-off value for deciding whether to reject the null hypothesis. Frequently, researchers use an alpha level of 0.05. If the probability of a result occurring by chance is less than this alpha level of 0.05, we conclude that the result did not occur by chance. Therefore, the result is statistically significant. The agreed-upon probability of 0.05 represents the Type I Error rate that we are willing to accept, and it is set before conducting the statistical analysis.
Basic Decision Theory
When deciding to reject the null hypothesis, we conclude that the difference between the sample statistic and the population parameter is not due to random sampling error. This conclusion can be wrong: rejecting the null hypothesis when it is actually true is called a Type I Error. Another type of error in hypothesis testing, the Type II Error, occurs when the null hypothesis is retained even though it is false and should have been rejected. In short, a Type I Error means that a true null hypothesis is rejected, while a Type II Error means that a false null hypothesis is retained.
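Both error types can be made visible with a small simulation. The sketch below repeatedly draws samples from a normal population and applies a two-tailed z-test with a known standard deviation, which is a simplifying assumption. All numbers (a hypothesised mean age of 60, a standard deviation of 12, a sample size of 50, and a true mean of 63 in the second scenario) are hypothetical.

```python
import random
from statistics import NormalDist


def z_test_p_value(sample, mu0, sigma):
    """Two-tailed one-sample z-test, assuming a known population standard deviation."""
    standard_error = sigma / len(sample) ** 0.5
    z = (sum(sample) / len(sample) - mu0) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def rejection_rate(true_mean, mu0=60.0, sigma=12.0, n=50, alpha=0.05,
                   n_studies=2000, seed=1):
    """Fraction of simulated studies in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        if z_test_p_value(sample, mu0, sigma) < alpha:
            rejections += 1
    return rejections / n_studies


# The null hypothesis is true (true mean equals mu0): every rejection is a Type I Error.
type_1_rate = rejection_rate(true_mean=60.0)

# The null hypothesis is false (true mean is 63): every retention is a Type II Error.
type_2_rate = 1.0 - rejection_rate(true_mean=63.0)

print(f"Type I error rate:  {type_1_rate:.3f}")
print(f"Type II error rate: {type_2_rate:.3f}")
```

When the null hypothesis is true, the share of rejections should be close to the chosen alpha level. The Type II error rate, in contrast, depends on how large the true difference is and on the sample size.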
Limitations of Statistical Significance Testing
The first limitation is that a hypothesis is mostly in the form of a yes-or-no question such as:
Most research questions are designed to go beyond this simple yes-or-no answer. We would like to know how much older the patients who died are than the patients who survived. We would also like to know how confident we can be that the age difference in the sample data reflects what is happening in the population.
The second limitation is that tests of statistical significance are influenced by the sample size. The problem is that when we divide the difference between the sample statistic and the population parameter by the standard error, the sample size plays a large role. The larger the sample size, the smaller the standard error, and the larger the resulting test statistic. With a very large sample, even a very small difference can therefore turn out to be statistically significant.
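The influence of the sample size can be illustrated with a short calculation. The sketch below keeps a hypothetical difference of one year and a hypothetical standard deviation of 12 years fixed and only varies the sample size. It assumes that the standard error of the mean is the standard deviation divided by the square root of the sample size, and it uses a normal approximation for the p-value.

```python
import math
from statistics import NormalDist

# Hypothetical values, chosen only for illustration.
mean_difference = 1.0      # sample mean minus population mean, in years
standard_deviation = 12.0  # spread of age, in years

results = []
for n in (25, 100, 400, 1600, 6400):
    standard_error = standard_deviation / math.sqrt(n)
    z = mean_difference / standard_error  # difference divided by the standard error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # two-tailed, normal approximation
    results.append((n, standard_error, z, p_value))
    print(f"n={n:>5}  SE={standard_error:5.2f}  z={z:5.2f}  p={p_value:.4f}")
```

The difference of one year is the same in every row. Only the sample size changes, and with it the standard error, the test statistic, and the p-value.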
This was a general overview of statistical hypothesis testing. We also discussed the basic tools for analysing correlations and associations between attributes. In the next article, we will show how to perform a statistical hypothesis test step by step.
Baker, L. (2019) Associations and Correlations. 1st edn. Packt Publishing. Available at: https://www.perlego.com/book/984430/associations-and-correlations-unearth-the-powerful-insights-buried-in-your-data-pdf (Accessed: 6 December 2022).
Mukhiya, S. K. and Ahmed, U. (2020) Hands-On Exploratory Data Analysis with Python. 1st edn. Packt Publishing. Available at: https://www.perlego.com/book/1443328/handson-exploratory-data-analysis-with-python-perform-eda-techniques-to-understand-summarize-and-investigate-your-data-pdf (Accessed: 6 January 2023).
Urdan, T. (2016) Statistics in Plain English. 4th edn. Taylor and Francis. Available at: https://www.perlego.com/book/2192882/statistics-in-plain-english-pdf (Accessed: 25 November 2022).