
A compilation of learnings about statistics for Data Science, selected from courses that I took.

In this post, I want to share the most important learnings about Statistics for Data Science, selected and summarized from a few courses I took along my journey in learning data science so far.

As I constantly write in my posts, I am not a statistician and I don’t have a background in any STEM-related area. That said, when I am learning *stats*, some concepts are a bit harder for me to grasp, so I need to translate them into a language that my brain understands, and that is how I try to share them with you.

Enough of the small talk, let’s dive into the content.

There are a few sources of bias that can distort data, giving you results that seem to make a lot of sense, since it is actual data telling you that, but that won’t reflect reality because the sample is biased.

**Convenience bias**: when one chooses a sample that is easy to get. Let’s say you want to run an election poll, so you go outside and start asking your neighbors. That sample is highly biased, since it does not represent the population, only your neighborhood, where everyone presumably lives a similar life.

**Non-response bias**: happens when too many people fail to respond to a poll; the totals end up distorted by the missing answers.

**Selection bias**: when you hand-pick the people who will take part in your research. Imagine we are researching the average height of the population, but we select only people taller than 180 centimeters to answer. Do you think that sample is fair? I guess not.

**Volunteer bias**: the bias of internet polls, for example, since the only people who answer are those interested in doing so, so they are already biased in their answers, making the sample not statistically representative.

Knowing data distributions is very important. That is one of the first things I do when I am working on an Exploratory Data Analysis (EDA). But, to be honest with you, for a long time I was not even sure why I had to plot all those histograms and/or boxplots just to look at the distributions’ shapes.

Now I know that many of the statistical tests to be performed during the EDA are based on the normal distribution. It is a prerequisite for many tests; ergo, if your variable is not normal, you should not bother applying that test, because the result won’t be reliable.

Let’s quickly load the tips dataset, built into the Seaborn package from Python:

```python
import seaborn as sns

df = sns.load_dataset('tips')
```

Then, we can plot the distributions of the numerical variables.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# create figure
fig, ax = plt.subplots(1, 3, figsize=(18, 6))

# Plots
for idx, var in enumerate(['total_bill', 'tip', 'size']):
    g = sns.histplot(data=df, x=var, color='royalblue', ax=ax[idx])
    g.set_title(f'Histogram of {var}', size=15)
```

The distributions are not bell-shaped, thus we do not have normality here. With a simple normality test, that can be confirmed. So, what now? Is this data not good anymore?
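Such a normality check can be done with the Shapiro-Wilk test from `scipy`. Here is a minimal sketch on synthetic right-skewed data, standing in for the tips variables:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skewed = rng.exponential(scale=10, size=244)  # right-skewed, like total_bill

# Ho: the sample comes from a normal distribution
stat, p_value = stats.shapiro(skewed)
print(p_value < 0.05)  # True -> reject Ho, the data is not normal
```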

Far from that. The data is still good, but the deviation from a normal distribution tells us that we should not use traditional tests like the t-test or ANOVA for comparing means, for example. Or, in case we do want to perform one of those tests, a good approach is to sample or bootstrap from the variables and then perform the statistical test on the resulting means.

For example, let’s say we wanted to compare the average bill by size of the party. As there are six groups, we need to perform an ANOVA test. We can draw n samples from the dataset for each group and take the average of each sample. Then we will have six approximately normally distributed groups of means (thanks to the Central Limit Theorem) that can be tested against each other.
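A sketch of that idea, using synthetic skewed “bills” for three hypothetical party sizes (the group names and numbers are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Skewed "total bill" data for three hypothetical party sizes
groups = {1: rng.exponential(scale=10, size=500),
          2: rng.exponential(scale=12, size=500),
          3: rng.exponential(scale=30, size=500)}

# Bootstrap: the means of repeated samples are ~normal (Central Limit Theorem)
boot_means = {k: [rng.choice(v, size=50).mean() for _ in range(200)]
              for k, v in groups.items()}

# ANOVA on the bootstrapped means
f_stat, p_value = stats.f_oneway(*boot_means.values())
print(p_value < 0.05)  # True -> at least one group average differs
```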

Another common approach is the logarithmic transformation, or more generally the Box-Cox family of transformations (the log is the special case with λ = 0), to make the data closer to normally distributed.
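A minimal sketch of both transformations with `scipy`, on synthetic log-normal data (Box-Cox requires strictly positive values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
skewed = rng.lognormal(mean=0, sigma=1, size=500)  # heavily right-skewed

log_t = np.log(skewed)                # log transform
boxcox_t, lam = stats.boxcox(skewed)  # Box-Cox picks the best lambda itself

# The transformed data is far closer to symmetric than the original
print(abs(stats.skew(boxcox_t)) < abs(stats.skew(skewed)))  # True
```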

Recently, I wrote about hypothesis tests in the link that follows.

In general, all we are doing when we create a hypothesis test is to say this:

Supposing that my null hypothesis (the current status) is true, how likely is it that I would observe data as extreme as what I got?

For example, suppose we are testing whether two averages come from the same distribution (usually the normal). If *average-one* is 10 and we assume that as the truth, what is the probability of drawing a sample with *average-two* of 20 from the same population? The test result gives us a probability (the famous p-value), like 0.001, or a 0.1% chance of seeing a sample average that far away — again, considering that the true average is 10.

There are two types of errors, when performing a hypothesis test. The easiest way I found to think about this is using the justice concept.

First assumption: everyone is innocent, until otherwise proven. Therefore, the truth is that one is innocent.

**Type 1 error**: reject the null hypothesis when it is true. It means that you are judging someone as guilty when that person is actually not. Remember, innocent until proven otherwise.

**Type 2 error**: do not reject the null hypothesis when the alternative is true. This means that one is guilty, but we say that the person is innocent.

It is very difficult to eliminate the errors in hypothesis testing. They will always be present, so the idea is to weigh which one should be in focus. Given that the rate at which we will commit a Type 1 error is the significance level **α**, a 5% value means that we will wrongly reject a true null hypothesis 5 times out of 100.

If the Type 1 error is the riskier one, choose a lower significance level **α**, like 1%. We don’t want to convict an innocent person, so the evidence must be really strong before we reject the null hypothesis.

If the Type 2 error is the one to be reduced, choose a higher significance level, like 10%. When we increase the value of **α**, we decrease the Type 2 error rate: here, we would rather risk rejecting Ho than let a guilty person walk free for lack of evidence.

The confidence interval is a common source of mistakes. Many people think that, given a sample of 100 elements, a 95% confidence interval that goes from 1 to 5 means that we have a 95% chance of picking a value from that data and it will be between 1 and 5.

In fact, the correct reading is: if we repeatedly draw random samples of the same size as the original sample (100) from the same population and build the interval each time, about 95% of those intervals will contain the true population average.
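That reading can be checked by simulation. The sketch below draws many samples of size 100 from a known population and counts how often the 95% interval around the sample mean captures the true mean (the population numbers are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, n, trials = 50, 100, 2000

hits = 0
for _ in range(trials):
    sample = rng.normal(loc=true_mean, scale=10, size=n)
    # 95% confidence interval for the mean, using the t distribution
    low, high = stats.t.interval(0.95, df=n - 1,
                                 loc=sample.mean(),
                                 scale=stats.sem(sample))
    hits += low <= true_mean <= high

coverage = hits / trials
print(round(coverage, 2))  # close to 0.95
```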

Boxplots provide a visual result similar to a t-test for comparing averages between groups. Let’s see that in action. Let’s create two normal distributions and perform a t-test.

```python
import numpy as np
import pandas as pd

# Create a df
df = pd.DataFrame({'v1': np.random.randn(100) * 1000,
                   'v2': np.random.randn(100) * 1000})

# Pivot it for plotting and testing
df = df.melt(var_name='vars', value_name='val', value_vars=df.columns)
```

Now let’s perform a t-test using the `scipy` package.

```python
import scipy as sp
import scipy.stats

# Ho [p-value > 0.05] = no statistical evidence of different averages
# Ha [p-value <= 0.05] = evidence of statistically different averages
sp.stats.ttest_ind(df.query('vars == "v1"')['val'],
                   df.query('vars == "v2"')['val'])
```

`[OUT]: Ttest_indResult(statistic=-0.9969050118330597, pvalue=0.32002736861405306)`

As our p-value is over 5%, there is no evidence to reject Ho; that is, no evidence that the averages are different. Now, if we plot the boxplots of v1 and v2, here is the result.
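A sketch of that plot (the data is random, so the exact boxes will differ slightly from run to run):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Same random data as before, in long format
df = pd.DataFrame({'v1': np.random.randn(100) * 1000,
                   'v2': np.random.randn(100) * 1000})
df = df.melt(var_name='vars', value_name='val', value_vars=['v1', 'v2'])

# Side-by-side boxplots of the two variables
fig, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(data=df, x='vars', y='val', ax=ax)
ax.set_title('Boxplots of v1 and v2')
```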

If we look at the median values, they are really close and we really can’t tell much difference. That is the same conclusion as the T-test. Now, notice that this was done with two normally distributed random samples, but it illustrates the point we want to make.

Linear regression is everywhere. I won’t create another example in this post, but I will share some notes with you.

**Independent / Explanatory / X**: the variables used to explain the variance in the variable you are trying to predict.

**Dependent / Response / Target / Y:** the variable that is being predicted or estimated.

**Intercept:** Point where the regression line crosses the y axis. This is also understood as *“if the variable x is zero, the standard value of y is the intercept, on average”*. But that is not always a reasonable value. So, it does not always make sense to say that.

**Slope:** The inclination of the regression line. Can be read as *“for each unit increase in x, y increases by the slope”*. For the regression *y = 3 + 2x*, the slope is 2, so for each unit increase in x, y will increase by 2.
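That reading is easy to verify numerically. A tiny sketch with the hypothetical regression line *y = 3 + 2x*:

```python
import numpy as np

# Hypothetical fitted regression line: y = 3 + 2x
def predict(x):
    return 3 + 2 * x

x = np.arange(5)
y = predict(x)

# For each unit increase in x, y increases by the slope (2)
print(np.diff(y))  # [2 2 2 2]
```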

**Extrapolation**: Be careful with applying a regression model to values that are out of the space of values of your original dataset. For example, if you have a regression with values that range from 0–10 in both axes, if you try to predict a value that is around the 20’s range, the linear regression may not return a reasonable value, since you can’t tell if the linear relationship continues up to the 20’s range. See an illustration below.

**Multicollinearity:** Multicollinearity is when two or more explanatory variables are highly correlated, so they behave almost the same. Therefore, they will explain the same variance in the target variable, creating redundancy in the model, which makes it less reliable. It is important to remove multicollinear variables from your linear regression models. That can be checked with a correlation measure, the most famous being the *Pearson* and *Spearman* methods. They are easily computed in Pandas using `df.corr()`.
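A minimal sketch of that check, with two made-up collinear features:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
df = pd.DataFrame({'x1': x1,
                   'x2': x1 * 2 + rng.normal(scale=0.1, size=200),  # nearly a copy of x1
                   'x3': rng.normal(size=200)})                     # independent

corr = df.corr()  # Pearson by default; use method='spearman' for Spearman
print(corr.loc['x1', 'x2'] > 0.9)  # True: x1 and x2 are multicollinear
```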

**Residuals:** The residuals of a linear regression are very important to validate it statistically. They must be normally distributed, with constant variance. In other words, the errors should stay within a roughly constant range across the whole span of predicted values. If the model shows much higher variance at the low or high values, or even both, there is a problem with its ability to predict correctly across the range.
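A sketch of that check, fitting a simple regression with `numpy` on synthetic data (the numbers are made up, with constant-variance noise by construction):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=300)
y = 3 + 2 * x + rng.normal(scale=1.5, size=300)  # linear signal + constant-variance noise

# Fit y = intercept + slope * x by least squares
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Healthy residuals: centered at zero, roughly constant spread
print(abs(residuals.mean()) < 1e-8)       # True: least squares centers them
stat, p_value = stats.shapiro(residuals)  # normality check on the residuals
```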

Well, these are some of the best learnings I have from basic statistics courses for Data Science. I am sure there is so much more to add. Maybe in the future I can post a part two of this article, but I think it is good for now.

I encourage you to look around and study stats, since Data Science, in essence, is just a fancy name for Statistics + Computer Science + Business.

If this content is interesting to you, follow my blog for more.

Find me on LinkedIn as well and tell me you read this article.
