Process Mining for Dummies (part 2)
Event Logs are the foundation for reconstructing process models and thus for process mining. But how does an Event Log look like and what are the steps to be taken in a process mining project? Answers on these questions can be read in our first part of this blog series called: “Process Mining for Dummies“. This time I would like to introduce the process mining algorithms needed for transforming an event log into a process model. The third part of this series will cover the different types of process mining.
Process Mining Algorithms
The main component in process mining is the mining algorithm. It determines how the process models are created. A broad variety of mining algorithms does exist. The following three categories will be discussed in more detail.
- Deterministic mining algorithms
- Heuristic mining algorithms
- Genetic mining algorithms
Determinism means that an algorithm only produces defined and reproducible results. It always delivers the same result for the same input. A representative of this category is the α-Algorithm. It was one of the first algorithms to be able to deal with concurrency. It takes an event log as input and calculates the ordering relation of the events contained in the log. Heuristic mining also uses deterministic algorithms but they incorporate frequencies of events and traces for reconstructing a process model. A common problem in process mining is the fact that real processes are highly complex and their discovery leads to complex models. This complexity can be reduced by disregarding infrequent paths in the models.
Genetic mining algorithms use an evolutionary approach that mimics the process of natural evolution. They are not deterministic. Genetic mining algorithms follow four steps: initialization, selection, reproduction and termination. The idea behind these algorithms is to generate a random population of process models and to find a satisfactory solution by iteratively selecting individuals and reproducing them by crossover and mutation over different generations. The initial population of process models is generated randomly and might have little in common with the event log. But due to the high number of models in the population, as well as selection and reproduction, better fitting models are created in each generation.
The outcomes of mining differ depending on the algorithm used. We shall use the event log displayed in Table 1 to illustrate the models created by different mining algorithms and to provide a picture of how mining works.
Table 1 Sample Event Log
|Case ID||Event ID||Timestamp||Activity||Resource|
|1||1000||01/01/2013||(A) Order Goods||Peter|
|1001||10/01/2013||(B) Receive Goods||Michael|
|1002||13/01/2013||(C) Receive Invoice||Frank|
|1003||20/01/2013||(D) Pay Invoice||Tanja|
|2||1004||02/01/2013||(A) Order Goods||Peter|
|1005||03/02/2013||(B) Receive Goods||Michael|
|1006||05/02/2013||(C) Receive Invoice||Frank|
|1007||06/02/2013||(D) Pay Invoice||Tanja|
|3||1008||01/01/2013||(A) Order Goods||Louise|
|1009||04/01/2013||(C) Receive Invoice||Frank|
|1010||05/01/2013||(B) Receive Goods||Michael|
|1011||10/01/2013||(D) Pay Invoice||Tanja|
|4||1016||15/01/2013||(A) Order Goods||Peter|
|1017||20/01/2013||(C) Receive Invoice||Claire|
|1018||25/01/2013||(D) Pay Invoice||Frank|
|5||1023||01/01/2013||(A) Order Goods||Michael|
|1024||10/01/2013||(B) Receive Goods||Michael|
|1025||13/01/2013||(C) Receive Invoice||Michael|
|1026||20/01/2013||(D) Pay Invoice||Michael|
The event log contains a very limited number of events and cases. But it is nevertheless suitable to demonstrate key aspects that are relevant for process mining and the selection of mining algorithms.
Figure 1 shows a mined process model that has been reconstructed by applying the α- Algorithm to the sample event log. It has been translated into a BPMN model for better comparability. In case 3, the invoice is received before the goods. Due to the fact that both possibilities are included in the event log (goods received before the invoice in cases 1, 2, and 5, and invoice received before the ordered goods in case 3), the mining algorithm assumes that these activities can be carried out concurrently.
Figure 1: Mined Process Model Using the α-Algorithm
When we look at the model in Figure 1 a little bit closer, we can see another important fact. Case 4 is not actually reflected in the process model. No execution sequence in the model is able to reproduce the trace of case 4. It only allows for two possible execution sequences: ABCD and ACBD, but not ACD. The end-gateway after the “Order Goods” activity requires that both following branches are executed. Therefore, it is not possible to have an execution sequence without the execution of activity B. The model is a poor fit because it is not able to reflect all relevant traces in the event log. The model shown in Figure 2 is able to replay all traces. Due to the exclusive-gateways, all three traces ABCD, ACBD and ACD are possible. But now the problem is that there are many more execution sequences that are possible than are reflected in the event log. In fact the process model allows for an infinite set of sequences. Now loops are possible either starting from B to C or from C to B leading to possible sequences with infinite iterations of B and C or C and B. The sequence ABCBCD would, for example, be possible although it is not included as a trace in the event log.
If a process model is too general, it is said to be “under-fitting”. It is insufficiently precise. A major challenge in process mining is finding an adequate solution between fitness, precision, simplicity and generalizability (taken from: Buijs et al.: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery.):
Fitness: Fitness quantifies the extent to which the discovered model can accurately reproduce the cases recorded in the log.
Precision: Precision quantifies the fraction of the behavior allowed by the model which is not seen in the event log.
Simplicity: The complexity of a process model is captured by the simplicity dimension.
Generalizability: Generalization assesses the extent to which the resulting model will be able to reproduce future behavior of the process.
Figure 2: Under-fitting Process Model
Various advanced mining algorithms do exist that can be used for different purposes. Figure 3 illustrates the mined model using the heuristic Fuzzy Miner algorithm. The model does not follow the BPMN notation but instead uses a dependency graph representation. It does not contain any gateway operators but shows the dependencies between different activities. The dependency graph illustrates, for example, that A was followed three times by B and two times by C.
Figure 3: Mined Process Model Using the Fuzzy Miner Algorithm
It is important to identify which requirements need to be considered for achieving the intended objectives for each individual process mining project. The appropriateness of an algorithm should be evaluated depending on the area of application.
In the next series you can read more about the different types of Process Mining. If you can not wait any longer, you can download the scientific paper including several questions for self-study and the corresponding answers:
This article appeared as a long version: Gehrke, N., Werner, M.: Process Mining, in: WISU – Wirtschaft und Studium, Juli edition 2013, pp. 934 – 943