## How to create a sampling plan for your data project

No matter how hard you may try to forget your STAT101 course, you’ll likely default to simple random sampling (SRS) as your knee-jerk approach. It was, after all, an assumption you were told to make for every single homework problem. I don’t blame you: SRS is a great option when it’s feasible.

But what I find tragicomic is that every single time I ask a fresh batch of students how they suggest approaching a data collection challenge, I hear the word “*just*” as part of the answer. As in, **“Just select them entirely at random.”**

Let’s spend a moment in the real world, shall we?

Imagine that you’re a data scientist who has been hired to estimate the average height of pine trees in the forest pictured below and describe the distribution.

Given the all-you-can-eat buffet of tree height info you can find on the internet, it’s clear that you wouldn’t be the first intrepid tree-measurer to tackle this kind of job. Plenty of people have measured tall trees… How hard could it possibly be?

*(Note: the links in this article take you to my lighthearted explanations of any jargon terms that crop up.)*

If you measured every single tree perfectly, you wouldn’t need statistics; you’d have the facts. But do you *need* the facts? Or are you willing to settle for statistics?

Statistics gives you a way to proceed even if you don’t have all the data you wish you had. Measuring a few trees (sample) rather than the whole blessed forest (population) results in a less perfect but hopefully less expensive perspective on the information you’re interested in. Which is a relief since you don’t even know how many trees there are in this massive forest.

Let’s measure a good enough sample of trees so we don’t have to measure them all!

Thinking statistically, your boss asked you to carry out measurements to the nearest foot on a random sample of **20 trees**, so you followed the advice in our previous article and confirmed that these specs make sense for your project. They do; the stage is set!

What does STAT101 tell you to do next?

I’ve taught this example to over 100 classes of students, and whenever I ask them how we should pick the trees, I hear one or both of these (equivalent) responses from someone in the crowd:

**“Just select them [entirely/completely] at random.”**

AND/OR

**“Just take a simple random sample.”**

I don’t blame you for defaulting to simple random sampling (SRS) as your knee-jerk answer. It’s a fabulous option when it’s feasible. *That’s* not the bit I take issue with.

What I find tragicomic is that every single time, I hear the word “*just*” as part of the answer.

Whoever tells you that the way to take a simple random sample of these trees is to “just select them entirely at random” …does not know how to use the English word “just” correctly. There’s no “just” about this!

Imagine that you passionately hate the great outdoors, so you sneakily outsource the actual tree measuring to someone who can stand fresh air. You’ve hired an avid hiker with no technical background who’s eager to follow any instructions you give, so you tell this person to, er, “just” select 20 trees entirely at random?!

If I were the hiker, I’d “just” grab the first 20 trees which look convenient “just” to teach you a lesson about being careful with your instructions.

**Simple random sampling**, *simple random sample*, *entirely at random*, and *SRS* are all technical terms. They refer to a sampling procedure where **each sampling unit (tree) has the same probability as any other tree of being selected**.

It’s only a true simple random sample (SRS) if it comes from equal selection probability. Otherwise it’s just sparkling nonsense.
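To make "equal selection probability" concrete, here is a minimal Python sketch. It leans on a big assumption that the forest story does not grant us: that someone has already tagged every tree with an ID, so a complete list (a sampling frame) exists. The forest size of 5,000 and the names `tree_ids` and `simple_random_sample` are made up for illustration.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from frame; every unit is equally likely to be picked."""
    rng = random.Random(seed)
    # random.sample draws without replacement, giving each unit the same chance.
    return rng.sample(frame, n)


# Hypothetical sampling frame: assumes every tree already carries an ID tag.
# The forest size (5,000) is invented purely for this sketch.
tree_ids = list(range(1, 5001))

chosen = simple_random_sample(tree_ids, 20, seed=42)
print(sorted(chosen))  # the 20 tree IDs to hand to your hiker
```

Notice how much work the word "just" was hiding: the code is three lines, but it only runs once you have a list of every tree in the forest.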

There’s a reason SRS is the first (and sometimes only) sampling procedure we teach newcomers to statistics and that reason is that it’s… easy. Easy in terms of the calculations, that is. There are other sampling procedures, but they require adjusted calculations which are usually outside the scope of your first year stats course.

It’s only a true simple random sample (SRS) of trees if it comes from a forest with equal selection probability for every tree, otherwise it’s just sparkling nonsense when you use SRS calculations on it.

Unfortunately, if you analyzed your data the way STAT101 teaches you to do it, but you didn’t actually use a true simple random sampling procedure to get hold of the data, then your results will technically be wrong.

Always strive to give foolproof instructions, because you never know when a wild fool will appear.

If your hiker picks the more convenient trees closer to the edge of the forest, that’s most definitely not a simple random sample. It’s something called a **convenience sample**, which is a procedure you should avoid like the plague (more on that in a future article). Analyzing such data with SRS math is statistically inappropriate… what if those trees get a different amount of sunlight and are thus unrepresentative of the entire forest? Basing your inferences on them will lead you to the wrong conclusions.
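If you’d like to see the sunlight worry play out, here is a small simulation sketch in Python. Everything in it is synthetic and assumed for illustration: the forest size, the heights, and especially the rule that trees within 50 meters of the edge grow about 10 feet taller. It is not data about any real forest.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic forest of (distance_from_edge_m, height_ft) pairs.
# ASSUMPTION for illustration only: edge trees get more sun and grow ~10 ft taller.
forest = []
for _ in range(10_000):
    distance = rng.uniform(0, 1000)
    sunlight_bonus = 10 if distance < 50 else 0
    height = round(rng.gauss(60 + sunlight_bonus, 8))  # measured to the nearest foot
    forest.append((distance, height))

true_mean = statistics.mean(h for _, h in forest)

# Convenience sample: the 20 trees closest to the edge of the forest.
convenience = sorted(forest, key=lambda tree: tree[0])[:20]
convenience_mean = statistics.mean(h for _, h in convenience)

# Simple random samples of 20 trees, repeated many times to see where they center.
srs_means = [
    statistics.mean(h for _, h in rng.sample(forest, 20))
    for _ in range(1000)
]
average_srs_mean = statistics.mean(srs_means)

print(f"true mean height:        {true_mean:.1f} ft")
print(f"convenience sample mean: {convenience_mean:.1f} ft")
print(f"average of SRS means:    {average_srs_mean:.1f} ft")
```

Under these made-up assumptions, the SRS estimates bounce around the true mean while the convenience sample overshoots it. The size of the gap is an artifact of the numbers I invented; the direction of the problem is the point.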

So, what would a professional statistician’s answer look like? To find out, head over to Part 2!
