What the Apriori Algorithm Is and How It Works

Sergen Cansiz
4 min readFeb 18, 2020


Apriori is one of the algorithms used for frequent pattern mining. Frequent pattern mining methods calculate the probability of occurrence of each pattern and filter the patterns with a threshold probability value. For instance, suppose we visit an e-commerce website and start shopping. After we add products to our basket, we might see a notification or sidebar that says "People who bought these products also bought these." Most probably, this is the result of a frequent pattern mining algorithm. These algorithms scan customers' previous transactions and find how buying one or more products affects the purchase of other products. Basically, they generate probabilities for products we might purchase given our basket and list those that pass a threshold value. Products that are frequently bought at the same time form a pattern, called a frequent pattern, and frequent pattern mining methods are used to find the probabilities of occurrence of these patterns.
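The "people who bought these also bought" idea can be sketched with a simple co-occurrence count over past baskets. The transactions and product names below are made up purely for illustration:

```python
from itertools import combinations
from collections import Counter

# Hypothetical past baskets (illustrative data, not from any real store)
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

# Count how often each pair of products was bought together
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Products bought together with "bread", most frequent first
also_bought = {p if q == "bread" else q: c
               for (p, q), c in pair_counts.items()
               if "bread" in (p, q)}
print(also_bought)  # {'butter': 3, 'milk': 1, 'eggs': 1}
```

A real recommendation system would then rank these co-occurrence counts (or the probabilities derived from them) and show the top few products; Apriori generalizes this pair counting to item sets of any size.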

How the “Apriori” algorithm works

Apriori is an algorithm used to find frequent item sets in a given data set. Using the Apriori algorithm, we can obtain the confidence value, i.e. the probability of the next item that might come after a given item set. It is based on the idea of building combinations of items level by level and obtaining the frequency of each combination by scanning the data set; duplicate combinations such as (x1,x2) and (x2,x1) count as the same item set, so one of them has to be removed.

At the first level, the frequency of each unique item is found by scanning the data set, and a table containing these frequencies is created. At the second level, pairwise combinations of the items from the first level are created. Each subsequent table is built by combining the previous table's item sets with the single items from the first table: for instance, to create the third-level table, the second-level item sets are combined with the first table's items. Each combination's frequency is then checked to see whether the same pattern exists in the data set. The levels continue until no pattern from the current level exists in the data set.

There is also one more criterion, called the minimum support value, alongside the existence of a pattern in the data set. The support value is simply the frequency of an item set in the data set, and the minimum support value is used to decide which item sets are taken into consideration: every item set at every level must have a support value equal to or greater than the minimum support value, otherwise it is removed from that level's table. As a note, the minimum support value can't be less than the minimum frequency of the singular items. Let's understand the Apriori algorithm through a basic example.
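The level-by-level procedure above can be sketched in a few lines of Python. This is a minimal, unoptimized version for clarity (the toy transactions in the test are invented, not the article's figure):

```python
def apriori(transactions, min_support):
    """Return {itemset: support} for every item set meeting min_support.

    Levels are built exactly as described above: the first level holds
    single items, and each next level combines the surviving item sets
    with one more single item.
    """
    # Level 1: every unique item is a candidate
    items = sorted({item for t in transactions for item in t})
    candidates = {frozenset([i]) for i in items}
    frequent = {}
    while candidates:
        # Scan the data set and count each candidate's support
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        # Keep only item sets meeting the minimum support value
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Next level: extend each survivor with one more single item.
        # frozensets are unordered, so (x1,x2) and (x2,x1) collapse
        # into one candidate automatically.
        candidates = {s | {i} for s in survivors for i in items
                      if i not in s}
    return frequent
```

Calling `apriori([{"x1","x2","x3"}, {"x1","x3"}, {"x1","x2"}, {"x3","x4"}], 2)` on that made-up data keeps the singletons x1, x2, x3 and the pairs (x1,x2) and (x1,x3), while x4 and all triples fall below the minimum support and are pruned.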

Figure 1. Example of the Apriori algorithm creating level tables

As mentioned before, the minimum support value can't be less than the minimum frequency of the singular items, which is shown in table L1. According to table L1, the minimum frequency (support) is 2, which means the minimum support value should be 2 or more; in this example we set it to 2. The items in table L2 are created by taking pairwise combinations of the singular items in L1, with duplicates removed, and their support values are calculated over the data set. For example, if you check the data set, x1 and x2 appear together 2 times (in the first and second rows), while x2 and x4 never appear together, which is why the support value for (x2,x4) is 0. According to the minimum support value, (x1,x4), (x2,x3), (x2,x4), and (x3,x4) (red background) are removed from table L2. Table L3 holds the combinations of table L2's item sets with table L1's items (three-item combinations), again with duplicates removed (note: (x1,x2,x3) and (x2,x1,x3) give the same support value, which is why one of them has to be removed). Because every item set's support value in table L3 is less than the minimum support, table L3 cannot be taken into consideration, and the final table is L2.

After all levels' tables are created, confidence values can be calculated with the following formula:

x1 -> x2 : support(x1,x2)/support(x1) = (2/4)*100 = 50%

x1 -> x3 : support(x1,x3)/support(x1) = (4/4)*100 = 100%

Basically, this result shows that the probability of x2 occurring, given that x1 exists, is 50%. For x1 -> x3, the confidence value is 100%: when the data set is examined, it can clearly be seen that every item set that includes x1 also includes x3, which is why the probability of x3 occurring when x1 is in the item set is 100%. This is a basic example, but if you are working on the massive transaction data of an e-commerce website, it is impossible to calculate these results by hand. Considering that the Apriori algorithm scans the data set over and over again, the system or platform it runs on should be fast enough; Python, R, and Spark are suitable environments for running Apriori.
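The confidence rule above is just a ratio of two support counts, so once the frequent item sets and their supports are known, it takes one division. A minimal helper, with made-up support counts chosen to mirror the 50% case above:

```python
def confidence(support_counts, antecedent, consequent):
    """confidence(A -> B) = support(A and B together) / support(A)."""
    a = frozenset(antecedent)
    ab = a | frozenset(consequent)
    return support_counts[ab] / support_counts[a]

# Hypothetical support counts (illustrative, not the article's figure)
support_counts = {
    frozenset({"x1"}): 4,
    frozenset({"x1", "x2"}): 2,
}
print(confidence(support_counts, {"x1"}, {"x2"}))  # 0.5, i.e. 50%
```

In practice, a library such as mlxtend computes these support and confidence values for you, but the underlying arithmetic is exactly this ratio.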

References

  • R. Agrawal, R. Srikant (1994). Fast Algorithms for Mining Association Rules. Proceedings of the 20th VLDB Conference, Santiago, Chile.
