## A high-level overview

Machine learning comes from the idea that machines can learn to program themselves rather than being programmed by hand. Originally, to work through chains of logic and reach an answer, programmers had to write out the if, elif and else statements for the computer to follow (see the excellent book if/then by Jill Lepore). This was a heavy burden on programmers, especially when the logic was complex, and it led to the idea that computers could learn the logic for themselves from existing data. In doing so, the computer essentially assigns weights, values or rules to the logic flows, producing answers similar to or better than manually programmed ones. This both reduces the burden on the programmer and allows for more complex and accurate programs than any human could develop.

Since the original idea of machine learning, the field has developed considerably to encompass a wide variety of techniques and algorithms. This includes models such as K-means clustering to identify groups of data points, random forest regression to predict individual values, and support vector machines to predict which group a data point belongs to. While these models have developed over time, arguably the key development has been the democratisation of machine learning through a broad number of open source packages and the wide availability of open source datasets. This means that anyone with a computer (arguably a computer isn’t even needed anymore) can implement a machine learning solution on a dataset. The wide variety of techniques and ease of access can, however, make navigating the world of machine learning quite difficult. This article aims to provide a high-level overview to aid in navigating this broad and continually developing community.

The first place to start is the idea that machine learning can be split into two main lines of work: supervised learning and unsupervised learning. In the former, the model is given predefined targets, or outcomes, to learn against, while in the latter there is no defined target and the final result is open to interpretation. This can be split further: supervised learning encompasses regression problems, where the target variable is continuous, and classification problems, where the target variable is categorical. Under unsupervised learning we have clustering, where the aim is to find patterns and/or groups in the data without a predefined target. These groups can be defined as follows:

## Regression

Regression is the most commonly performed machine learning task. The aim of this group of models is to predict and/or model the precise value of a target variable. Examples include predicting house prices based on factors such as the number of bedrooms, location, number of toilets and size, or modelling the height of individuals based on economic, social and environmental factors. The key point is that the target variable is continuous and we aim to model its precise value from the specified inputs.

One of the first methods encountered under this umbrella is linear (or multiple linear) regression. While this method comes from traditional statistics, and indeed has been around for a long time, it can also be considered a machine learning model because its weights can be “learned” by the computer. The model learns the optimal weights to assign to the input variables so as to minimise a metric of error on the output. In the case of linear regression, this means minimising the sum of squared differences between the actual and predicted values (the least squares criterion). From the fitted weights, we can extract the strength and direction of the relationship between each input and the target variable, and use the model to predict unseen values.

What this means is that we will often have a series of numerical input values (the independent variables) that are used to model the target values (the dependent variable). In its general form, the model looks like:

y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ

where the x values are the data inputs, the β values are the learned weights, and y is the model’s output.
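As a minimal sketch of what fitting such a model looks like in practice, the snippet below uses scikit-learn’s LinearRegression on a tiny made-up house-price dataset (the feature values and prices are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: number of bedrooms, size in square metres (hypothetical data)
X = np.array([[2, 70], [3, 95], [3, 110], [4, 140], [5, 180]])
y = np.array([200_000, 260_000, 290_000, 360_000, 450_000])  # prices

model = LinearRegression()
model.fit(X, y)  # learns one weight per input column plus an intercept

print(model.coef_)       # learned weights for bedrooms and size
print(model.intercept_)  # learned intercept
print(model.predict([[4, 120]]))  # predicted price for an unseen house
```

The fitted `coef_` values correspond directly to the β weights in the formula above, which is what makes linear regression so easy to interpret.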

Beyond linear regression there are many different types of regression in machine learning including:

- Ridge Regression
- Lasso Regression
- Decision Tree Regression
- Random Forest Regression
- Neural Networks

Each of these varies in complexity, implementation, and best use case. For example, if you expect the relationship to be linear, you should probably use one of the family of linear models such as linear, ridge, or lasso regression. If you expect the relationship to be non-linear, then decision tree regression, random forest regression, or neural networks may be preferred.

The model you choose will also depend on what your objective is and what resources you have available. In most cases, it is often best to try a combination of models to see which fits best and to start with the simplest model you can, which is usually linear regression.
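One hedged way to act on that advice is to trial several of the models listed above on the same data and compare cross-validated scores; the snippet below does this with a synthetic, mostly linear dataset invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data with a mostly linear relationship (for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    # Mean R^2 across 5 cross-validation folds; higher is better
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean R^2 {scores[name]:.3f}")
```

On data like this, the simpler linear models should score at least as well as the random forest, which illustrates why starting simple is usually the right call.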

## Classification

Classification is another commonly used supervised machine learning task. The aim of this type of algorithm is to determine which group or class a data point belongs to, rather than an exact value. This means we use a categorical target variable rather than a continuous one. Examples include modelling whether a patient has diabetes, whether a picture contains a dog or a cat, or whether a user will re-subscribe to a platform in the future.

A common method encountered under this umbrella is the Support Vector Machine. With only two input features, it is easy to visualise and understand: the model works by finding a boundary between the data points that acts as a cut-off between the different groups. This boundary can then be used to predict which group a new data point belongs to, based on which side of the boundary the point sits. The boundary can take many forms, whether linear, non-linear or defined by the user, but remains easy to see, implement and understand. An example of the boundary created by the model can be seen below.

What this means is that for classification we will often have a set of inputs, whether numerical or categorical, that are used to predict a final category. Data fed into these models will typically take the form of a table of feature columns alongside a categorical label column, with the model learning a boundary that separates the labels.
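To make the boundary idea concrete, here is a minimal sketch of a linear support vector machine on a toy two-feature dataset (the points and labels below are made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Two features per point; class 0 clusters low, class 1 clusters high
X = np.array([[1.0, 1.2], [1.5, 0.8], [2.0, 1.5],
              [6.0, 6.5], [7.0, 5.8], [6.5, 7.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")  # the kernel sets the shape of the boundary
clf.fit(X, y)

# Which side of the learned boundary does each new point fall on?
print(clf.predict([[2.0, 2.0], [6.0, 6.0]]))  # → [0 1]
```

Swapping `kernel="linear"` for, say, `"rbf"` is how the same API produces the non-linear boundaries mentioned above.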

Standard algorithms that fall under this umbrella include:

- Logistic Regression
- Support Vector Machines
- Decision Tree Classifiers
- Random Forest Classifiers
- Neural Networks

Each of these methods varies in complexity and implementation, meaning that different algorithms will suit different problems. For example, where a linear decision boundary is likely to work well, a support vector machine with a linear boundary may be advisable. However, when the relationship is more complex and less linear, a random forest classifier may be preferred.

The model you choose will also depend on your objectives and the resources available. As with regression, in most cases it is better, if you can, to trial a combination of methods, both to get the best result and to understand why different algorithms perform better or worse.
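The same trial-a-combination approach sketched for regression applies here; the snippet below compares three of the listed classifiers on a synthetic dataset generated purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic labelled data (for illustration only)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "linear_svm": SVC(kernel="linear"),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy across 5 cross-validation folds for each model
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
for name, score in scores.items():
    print(f"{name}: mean accuracy {score:.3f}")
```

Comparing the scores side by side is exactly the kind of double-check that helps explain why one algorithm outperforms another on a given dataset.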

## Clustering

In contrast to the two groups of machine learning above, clustering algorithms are unsupervised: they do not have a predefined target to aim for. Instead, the aim of this group of algorithms is to identify groups in the data based on similar characteristics. This is commonly used when we want to identify a set of groups with different behaviours. Examples include identifying groups of shoppers such as single individuals, young families or couples, identifying groups of shows that users commonly watch together, or grouping together music tastes. Once these groups have been identified, they can tell us more about behaviour and lead to targeted interventions. This can include giving out coupons to nudge groups into new behaviours or reinforce existing ones, or offering suggestions to viewers in terms of movies or TV shows.

A common algorithm within this domain is K-means clustering, which defines groups/clusters by grouping data points according to their distance from one another. The Data Scientist first has to define the target number of groups to identify. The algorithm then initialises that many centroids (often at random), assigns each point to its nearest centroid, and iteratively recomputes the centroids until the groupings stabilise. Since the Data Scientist often does not know the optimal number of groups beforehand, and there are no predefined targets, the results can be open to interpretation. There are methods for estimating the optimal number of clusters, such as the elbow method or the silhouette score, but their results can vary.

This means that a series of numerical inputs is fed into the clustering algorithm with no predefined target variable; the algorithm’s output is simply a cluster label for each data point.
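As a minimal sketch, the snippet below runs K-means on synthetic two-dimensional blobs (generated for illustration), with the number of clusters chosen up front as described above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 150 points drawn around 3 centres (for illustration)
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# The Data Scientist must choose the number of clusters in advance
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)  # a cluster index (0-2) for each point

print(km.cluster_centers_)  # the centroid of each group
print(np.bincount(labels))  # how many points fell into each cluster
```

Note that the labels 0–2 are arbitrary: interpreting what each cluster actually represents is left to the analyst, which is the subjectivity discussed above.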

Common algorithms within this domain include:

- K-means clustering
- DBSCAN
- Hierarchical clustering
- OPTICS
- Mean-Shift

As with both regression and classification, each of these methods varies in complexity and implementation, meaning that some algorithms will suit certain problems more than others. Since clustering results can be highly subjective and open to interpretation, it is often best to try several clustering methods and see which results make the most sense. Ideally, the resulting groups are well balanced, clearly characterised by their means or medians, and easy for non-specialists to label and interpret.
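One common, though not definitive, way to compare candidate clusterings is the silhouette score (higher means better-separated clusters). The sketch below uses it to compare different choices of k for K-means on synthetic data invented for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data drawn around 3 centres (for illustration only)
X, _ = make_blobs(n_samples=200, centers=3, random_state=1)

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette {scores[k]:.3f}")
```

A numeric score like this is a useful sanity check, but as noted above, the final judgement on whether the groups are meaningful still rests with the analyst.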

## Conclusions

Machine learning is the process whereby, instead of Data Scientists defining the rules of the data, the machine learns these rules for itself. It can be split into supervised and unsupervised learning: in the former there are defined targets that the algorithms work towards, while in the latter there are none. Regression and classification fall under the banner of supervised machine learning, while clustering falls under unsupervised machine learning. When implementing any of these algorithms, if you have the time and resources, it is often best to try more than one algorithm while keeping things as simple as possible. This ensures that you double-check your results and use as few resources as possible.

While the idea of machine learning has been around for a long time, and even the implementation of some machine learning algorithms has been around for a while, widespread use of machine learning is only now beginning in earnest. It has been facilitated in recent years by the development of deep and wide open source Data Science ecosystems across multiple languages, which have enabled any developer with a laptop (in some cases you don’t even need that anymore) to get started implementing their own machine learning algorithms. In Python, this includes the development and integration of libraries such as pandas, matplotlib, sklearn, statsmodels, tensorflow and keras, alongside open sources of data such as Kaggle, Google Cloud Public Data Sets, Data.gov and others.

This means that Data Science practices are still developing, although a solid foundation has already been built. There are many opportunities to contribute to this continually growing ecosystem in a variety of ways, and we continue to see this every day. With that in mind, I am very much looking forward to what comes next!

If you want to look at practical examples of any of the covered topics above, then feel free to look at my article on a complete Data Science curriculum for beginners.
