Understanding how to find Frequent Itemsets and Calculate Confidence and Support
The Apriori algorithm tries to learn association rules, i.e. logical relationships, from transactions in a database. The goal is to find frequently occurring itemsets, e.g. the products that frequently occur in the database.
In everyday life, we often encounter product combinations that can be bought cheaper in a bundle. In many fast-food restaurants, for example, the main course is sold together with a side dish and a drink at a discounted price. The principle behind this is always the same: products that are often bought together are offered at a lower price if they are bought directly as a bundle. This has not only caught on in fast-food restaurants. It can now also be found in supermarkets and fashion online stores, as it makes shopping easier for the customer and leads to higher sales for the seller.
To be able to find such bundles, the Apriori algorithm is used, among other things. For this purpose, the order databases of the providers are searched and an attempt is made to establish association rules. Before we can continue with the exact procedure of the algorithm, we have to clarify a few terms that are used in this context.
The Apriori algorithm uses some concepts and rules which are not really common in our linguistic usage. Therefore, we explain the most important ones briefly here, before we can devote ourselves to the flow of the algorithm.
What is an Itemset?
A set of elements is called an itemset. A set with k elements is also called a k-itemset. The itemset must consist of at least two elements and can represent, for example, the shopping cart during purchase. For example, shopping at the supermarket would have the itemset consisting of bread, butter, milk, and eggs.
How are Support and Confidence related?
Before we can look for frequently occurring bundles, we need metrics that measure how often certain itemsets are bought together. For this purpose, we use confidence and support.
Support measures how often a product is purchased relatively:
Confidence, on the other hand, measures the support for all transactions containing elements A and B, and divides them by the support for A:
For a supermarket, these ratios would be calculated as follows:
Assume the database includes 500 purchases in total. In 300 cases, the purchase included a chocolate bar. So the support for chocolate is:
In turn, in 100 purchases in which chocolate was bought, milk was also bought. The confidence in milk and chocolate thus results from:
This means that chocolate is purchased in 60% of all transactions. Again, in one-third of the “chocolate transactions” milk is also purchased.
What is an Association Rule?
An association rule tries to find certain regularities between two or more elements that reach a given level of support and confidence.
What is a Frequent Itemset?
A frequent itemset is a set of items that occur frequently together and reach a predefined level of support and confidence. Frequent itemsets can be found using association rules.
The Apriori algorithm is one of the methods to find frequent item sets in a dataset. It works in two steps, namely “Join” and “Prune”, which are executed iteratively, i.e. several times in a row.
Join: In this step, itemsets of the set K are formed. K stands for the repetition step.
Prune: In this step, all itemsets that do not reach the predefined support threshold and are therefore considered rare are removed.
The Apriori algorithm makes use of the so-called antimonotonic property. In simple words, it means that if an itemset consisting of several elements does not reach the minimum support, then all supersets (all sets consisting of elements of the itemset) do not reach the minimum support either. In our example, it means that if the milk and cookies itemset does not reach the minimum support, then the milk, cookies, and chocolate itemset cannot exceed the minimum support.
In each new iteration, the itemset is extended by one element, and the “Join” and “Prune” steps are executed again. The algorithm terminates if no new itemset is found in an iteration, but only itemsets from the previous step remain.
For our example, let’s come back to our supermarket. There are a total of six transactions in its database.
For our association rule, we want to achieve support of 50% and a confidence of 70%. These values are completely arbitrary.
Step 1 (K=1): In the first step we search for itemsets with quantity 1, i.e. we count how often the individual products have been purchased in total. This corresponds to the join step in the first stage.
Step 2 (K=1): The Join step is followed by the Prune step, in which we remove all itemsets that do not reach the minimum support. We have chosen support of 50%. Due to the formality of calculating the support, it means that in six transactions the itemset must occur at least three times for the support to be fulfilled. All other itemsets can be dropped:
Therefore, the product/itemset “Bread” falls out, because it occurs only twice.
Step 3 (K=2): After doing the Join and Prune step, we move to the second stage. Here, the itemset size is now 2. The possible itemsets are the combination of all the remaining products from the previous stage.
Step 4 (K=2): In the Prune step, the item sets that do not reach the minimum support of three are removed again. Thus, the combinations (milk, rice) and (noodles, rice) are dropped:
Step 5 (K=3): In this step, we form the itemsets with the quantity 3, which are the quantities that can be formed from the remaining products:
We don’t have a single purchase for these itemsets, but you can still check if the itemsets would reach support. For this purpose, we make use of the Antimonoton property. Here we notice that the itemset (Milk, Chocolate, Rice) consists of the subsets (Milk, Chocolate), (Chocolate, Rice), and (Milk, Rice). However, the itemset (milk, rice) could not reach the support, so the larger itemset (milk, chocolate, rice) cannot be frequent either, even though there are no numbers for it.
With the same reasoning, the itemsets (milk, noodles, rice) and (noodles, chocolate, rice) are also dropped because the itemset (noodles, rice) does not occur often enough. The last remaining itemset can now be used to derive association rules using confidence.
Step 6: We can check the following association rules:
After running the Apriori algorithm, a total of five association rules emerge that withstand our confidence level of 70%. These include the rule “(milk, chocolate) -> (noodles)”. This means that if milk and chocolate have already been purchased, then the purchase of noodles is also very likely.
The Apriori algorithm is a good way to find association rules in large databases with many transactions. Furthermore, the join and prune steps are relatively easy to develop in programming languages such as Python. In addition, there are also modules that further simplify the import of the Apriori algorithm.
However, the algorithm becomes very computationally intensive for large databases. As has already become clear in our example, many combinations and steps have to be calculated, which take up a lot of time and resources even in our simple example. Before implementation, therefore, it must be ensured that the effort is worthwhile.
There are several applications in which association rules are searched and thus the Apriori algorithm is used. Here are some topics in which associations are searched:
- Education: In education, associations are sought, for example, when trying to find out why some students do better in subjects than others.
- Medicine: In the development of new drugs, association rules are studied to determine how a new drug interacts with physical characteristics. To do this, the sample that tested the drug is studied in more detail.
- Biology: Around the world, forest fires are becoming more frequent. Therefore, we are currently investigating which early warning factors, such as extreme drought or high temperatures, are associated with forest fires in order to be able to detect and prevent them as early as possible.
- E-commerce & Recommendation: We learned about the classic example of the Apriori algorithm in this article: shopping cart analysis. Based on the purchasing behavior of previous customers, it attempts to identify associations between products and then uses these for product recommendations.
With a large database, the Apriori algorithm can be very inefficient and take a lot of time. Therefore, there are currently some methods to help make the algorithm more efficient and faster. These include:
- Hash-based Itemset Counting: This method tries to speed up the creation of itemsets. For this purpose, a so-called hash table is used, which gives each product or transaction a unique number. This means that memory-intensive strings do not have to be included and processed, but rather the more efficient hashes.
- Transaction Reduction: Before the Apriori algorithm is executed, the database is searched for repeated transactions, which are then excluded. In the same way, rare transactions are filtered out at an early stage, since they do not play a role in the calculation of the algorithm.
- Partitioning: For this method, the number of database scans is significantly reduced. The basic idea here is that an itemset can only be frequent if it occurs in at least one of two database partitions. This means that the entire database only has to be scanned twice, which can be much more efficient.
- Sampling: Instead of searching for frequent itemsets in the entire database, this approach takes only one examination unit of the database and examines it. However, there is a risk of losing itemsets that are actually frequent. To avoid this risk, comparatively low support should be used.
- Dynamic Itemset Counting: With this method, new potential itemsets can already be found while scanning the database.
- The Apriori algorithm tries to learn association rules, i.e. logical relationships, from transactions in a database. The goal is to find frequently occurring itemsets, e.g. the products that frequently occur in the database.
- The Apriori algorithm works in two steps (“Join” and “Prune”), which are executed iteratively one after the other.
- In each step, an itemset is created, which is larger by one product than in the previous step (“Join”). Then it is examined which of these itemsets fulfills the previously defined support and thus remains.
- Although the Apriori algorithm is relatively easy to understand and implement, it can be very inefficient for large databases and require a lot of computing power. Therefore, there are already some approaches to improve the performance of the algorithm.
- In practice, the Apriori algorithm is used whenever association rules are sought. This includes, for example, medicine, biology, or e-commerce.