

Title:
A METHOD, SYSTEM AND APPARATUS FOR TRAINING AN INTERPRETABLE ARTIFICIAL INTELLIGENCE MODEL
Document Type and Number:
WIPO Patent Application WO/2023/062247
Kind Code:
A1
Abstract:
Methods, systems, computer programs and computer readable media for generating an ordered set of decision rules for a decision rules list machine learning model are provided.

Inventors:
MAINALI PRADIP (BE)
Application Number:
PCT/EP2022/078882
Publication Date:
April 20, 2023
Filing Date:
October 17, 2022
Assignee:
ONESPAN NV (BE)
ONESPAN NORTH AMERICA INC (US)
International Classes:
G06N5/025; G06N5/045; G06N5/046
Foreign References:
US20200121236A1 (2020-04-23)
Other References:
LETHAM, BENJAMIN ET AL.: "Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model", THE ANNALS OF APPLIED STATISTICS, vol. 9, no. 3, 5 November 2015 (2015-11-05), pages 1350-1371, XP055934764, ISSN: 1932-6157, Retrieved from the Internet [retrieved on 20230623], DOI: 10.1214/15-AOAS848
MAINALI, PRADIP ET AL.: "ExMo: Explainable AI Model using Inverse Frequency Decision Rules", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 20 May 2022 (2022-05-20), XP091229821
AMINA ADADI; MOHAMMED BERRADA: "Peeking inside the black-box: A survey on explainable artificial intelligence (XAI)", IEEE ACCESS, vol. 6, 2018, pages 52138-52160, XP055717374, DOI: 10.1109/ACCESS.2018.2870052
SCOTT M. LUNDBERG; SU-IN LEE: "A unified approach to interpreting model predictions", PROCEEDINGS OF THE 31ST INTERNATIONAL CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS, NIPS'17, 2017, pages 4768-4777
MARCO TULIO RIBEIRO; SAMEER SINGH; CARLOS GUESTRIN: ""Why should I trust you?": Explaining the predictions of any classifier", PROCEEDINGS OF THE 22ND ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, SAN FRANCISCO, CA, USA, AUGUST 13-17, 2016, 2016, pages 1135-1144, XP058276906, DOI: 10.1145/2939672.2939778
MUKUND SUNDARARAJAN; ANKUR TALY; QIQI YAN: "Axiomatic attribution for deep networks", PROCEEDINGS OF THE 34TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING, vol. 70, 6 August 2017 (2017-08-06), pages 3319-3328
CHRISTOPH MOLNAR: INTERPRETABLE MACHINE LEARNING, 2019, Retrieved from the Internet
CYNTHIA RUDIN: "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead", NATURE MACHINE INTELLIGENCE, vol. 1, 2019, pages 206-215
BENJAMIN LETHAM; CYNTHIA RUDIN; TYLER MCCORMICK; DAVID MADIGAN: "Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model", THE ANNALS OF APPLIED STATISTICS, vol. 9, 2015, pages 1350-1371, XP055934764, DOI: 10.1214/15-AOAS848
USAMA M. FAYYAD; KEKI B. IRANI: "Multi-interval discretization of continuous-valued attributes for classification learning", INTERNATIONAL JOINT CONFERENCES ON ARTIFICIAL INTELLIGENCE, 1993, pages 1022-1029
Attorney, Agent or Firm:
IPLODGE BV (BE)
Claims:

Claims

1. A method (100) for generating a decision rules list for a decision rules list model, the method comprising the steps of:

- generating (110), using a first training set of training examples, a pool A of candidate decision rules, wherein each training example of said first training set comprises a training input data sample and a label indicating a data class of a set of data classes that the training input data sample belongs to; and

- generating (180) a decision rules list by selecting from said pool A of candidate decision rules a set of decision rules and ordering this set of selected decision rules; wherein the step of generating the pool A of candidate decision rules comprises considering each data class of a subset of said set of data classes in turn as a target data class and repeating for each target data class being considered the steps of:

- generating (120) a group of one or more candidate decision rules;

- calculating (130) for each generated candidate decision rule of this group of one or more candidate decision rules a quality score as a function of said generated candidate decision rule, said target data class and said first training set of training examples;

- using (140) the quality scores that have been generated for each candidate decision rule of the group of one or more candidate decision rules to select which of the candidate decision rules of the group of one or more candidate decision rules to add to the pool A of candidate decision rules; and

- adding (150) the selected candidate decision rules to the pool A of candidate decision rules; wherein the value of said quality score is calculated as a function of said generated candidate decision rule, said target data class and said first training set of training examples such that:

- the quality score is positively correlated with the ratio a. of, on the one hand, the number of training data samples that belong to said target data class and for which said generated candidate decision rule is TRUE, b. to, on the other hand, the total number of training data samples that belong to said target data class; and

- the quality score is negatively correlated with the prevalence of training data samples for which said generated candidate decision rule is TRUE among the training data samples that don’t belong to said target data class.

2. The method of claim 1 wherein: the value of said quality score may be calculated as a function of a class relevance factor and a class selectivity factor, whereby:

- the class relevance factor is a function of said generated candidate decision rule, said target data class and said first training set of training examples, whereby the value of the class relevance factor is positively correlated with the ratio: o of the number of training data samples that belong to said target data class and for which said generated candidate decision rule is TRUE, o to the total number of training data samples that belong to said target data class; and

- the class selectivity factor is a function of said generated candidate decision rule, said target data class and said first training set of training examples, whereby the value of the class selectivity factor is negatively correlated with the prevalence of training data samples for which said generated candidate decision rule is TRUE among the training data samples that don’t belong to said target data class.

3. The method of claim 2 wherein the class relevance factor (CRF) is calculated as an increasing function of: the number of training data samples that belong to said target data class and for which the generated candidate decision rule is TRUE divided by the total number of training data samples that belong to said target data class.

4. The method of claim 2 or claim 3 wherein the class selectivity factor (CSF) is a decreasing function of the number of data classes that contain at least one training data sample for which the generated candidate decision rule is TRUE.

5. The method of any of claims 2 to 4 wherein the class selectivity factor (CSF) is a decreasing function of: the number of training data samples that don’t belong to said target data class and for which the generated candidate decision rule is TRUE, divided by the total number of training data samples that do not belong to said target data class.

6. The method of any of claims 2 to 5 wherein the class selectivity factor (CSF) is, for each individual data class of a number of data classes that are not the target data class, a decreasing function of: the number of training data samples that belong to that individual data class and for which the generated candidate decision rule is TRUE, divided by the total number of training data samples that belong to that individual data class.

7. The method of any of the previous claims wherein no decision rules are generated with a cardinality that exceeds a certain maximum value.
8. The method of any of the previous claims wherein no decision rules lists are generated that comprise a decision rule with a cardinality that exceeds a certain maximum value.

9. The method of any of the previous claims wherein only feature-factored decision rules are generated.

10. The method of any of the previous claims wherein no decision rules lists are generated that comprise a decision rule that is not a feature-factored decision rule.

11. The method of any of the previous claims wherein the probability that a particular generated candidate decision rule will effectively be selected to be added to the pool A of candidate decision rules is positively correlated to or is an increasing function of the value of the quality score that has been calculated for that particular generated candidate decision rule.

12. The method of any of claims 1 to 11 wherein the step of using the quality scores that have been generated for each candidate decision rule of the group of one or more candidate decision rules to select which of the candidate decision rules of the group of one or more candidate decision rules to add to the pool A of candidate decision rules comprises selecting a particular generated candidate decision rule to be added to the pool A of candidate decision rules if the value of the quality score that has been calculated for that particular generated candidate decision rule exceeds a certain threshold value.

13. The method of any of claims 1 to 11 wherein:

- the step of generating a group of one or more candidate decision rules comprises generating a group of more than one candidate decision rules; and

- the step of using the quality scores that have been generated for each candidate decision rule of the group of one or more candidate decision rules to select which of the candidate decision rules of the group of one or more candidate decision rules to add to the pool A of candidate decision rules comprises ranking the generated candidate decision rules in the group according to their quality score and selecting only the candidate decision rules with the highest quality score to be added to the pool A.

14. The method of any of claims 1 to 11 wherein:

- the step of generating a group of one or more candidate decision rules comprises generating a group of more than one candidate decision rules; and

- the step of using the quality scores that have been generated for each candidate decision rule of the group of one or more candidate decision rules to select which of the candidate decision rules of the group of one or more candidate decision rules to add to the pool A of candidate decision rules comprises: dividing the group in two subgroups such that the candidate decision rules in one of the two subgroups have a higher average quality score than the candidate decision rules in the other of the two subgroups, and selecting the candidate decision rules of the subgroup containing the candidate decision rules with the highest quality scores to be added to the pool A.

15. The method of claim 14 wherein: dividing the group in two subgroups such that the candidate decision rules in one of the two subgroups have a higher average quality score than the candidate decision rules in the other of the two subgroups comprises dividing the group in two subgroups such that all the candidate decision rules in one of the two subgroups have a quality score that is higher than the quality score of any of the candidate decision rules in the other of the two subgroups.

16. The method of any of the previous claims wherein the generated decision rules list contains only feature-factored decision rules or contains only basic feature-factored decision rules or contains only decision rules with a cardinality that is not higher than a certain maximum cardinality.

17. The method of any of claims 1 to 16 wherein the step of generating, using a first training set of training examples, a pool A of candidate decision rules comprises adding to the pool A only feature-factored decision rules or basic feature-factored decision rules or decision rules with a cardinality that is not higher than a certain maximum cardinality.
18. The method of any of claims 1 to 16 wherein the step of generating a group of one or more candidate decision rules comprises adding to the pool A only feature-factored decision rules or basic feature-factored decision rules or decision rules with a cardinality that is not higher than a certain maximum cardinality.

19. The method of any of the previous claims wherein the step of generating a decision rules list by selecting from said pool A of candidate decision rules a set of decision rules and ordering this set of selected decision rules further comprises only generating decision rules list candidates having a length that does not exceed a certain maximum length.

20. The method of any of the previous claims wherein the step of selecting from said pool A of candidate decision rules a set of decision rules and ordering this set of selected decision rules comprises solving an optimization problem for a reward function that is a function of the generated decision rules list.

21. The method of claim 20 wherein the reward function is a function of an estimate of the accuracy of a model based on or defined by the generated decision rules list.

22. The method of any of the previous claims further comprising performing a discretization of at least one feature before performing the steps of: generating, using a first training set of training examples, a pool A of candidate decision rules, and generating a decision rules list by selecting from said pool A of candidate decision rules a set of decision rules and ordering this set of selected decision rules.

23. A system (200) for generating a decision rules list for a decision rules list model comprising one or more digital data processing components (210) for processing digital data; and one or more memory components (220) for storing data or instructions (e.g., software) to be performed by the digital data processing components, the system adapted to perform a method of any of claims 1 to 22.
24. A computer program comprising a series of instructions that, when executed by a computer system, cause that computer system to perform a method of any of claims 1 to 22.

25. A computer-readable medium on which a computer program is stored that comprises a series of instructions that, when executed by a computer system, cause that computer system to perform a method of any of claims 1 to 22.

Description:
Title

A method, system and apparatus for training an interpretable Artificial Intelligence model

Field of the invention

[0001] The invention relates to a method, system and apparatus for training an interpretable Artificial Intelligence model. The invention relates more in particular to a method, system and apparatus for generating an ordered set of decision rules for a decision rules list machine learning model.

Background of the invention

[0002] Artificial intelligence (AI) and Machine Learning (ML) show promising results in many application domains. Currently, the adoption of AI is mainly for low-stake applications such as recommendation of products in e-commerce. However, in high-stake applications such as finance, medicine, justice, etc., the adoption of AI is not taking place at the same speed. One of the hurdles is the black-box nature of machine learning and deep learning methodologies. Complex deep learning models can achieve higher accuracy, but they are difficult to interpret, and it is hard to understand the rationale behind their decision making. In high-stake applications, understanding the rationale behind decision making is important to build trust in the system. For example, doctors may want to scrutinize the decision of AI if AI recommends medical interventions that may have serious consequences, such as for example surgery. Also, there is a legal obligation mentioned in the European Union General Data Protection Regulation (GDPR) known as the ‘right to explanation’; hence autonomous systems may be required to provide explanations to meet legal compliance.

[0003] Two approaches to deal with this problem are, on the one hand, Explainable AI and, on the other hand, Interpretable AI.

[0004] In the Explainable AI approach, the predictions are made using a complex black-box model, and a second, simpler model is used to help explain what the black-box model is doing locally (Amina Adadi and Mohammed Berrada. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6:52138-52160, 2018). Well-known model-agnostic approaches are SHAP (Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 4768-4777, 2017) and LIME (Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135-1144, 2016), which provide explanations for any machine learning model. Another method, designed specifically for deep neural networks, is called Integrated Gradients (IG) (Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3319-3328, 06-11 Aug 2017).

[0005] In the Interpretable AI approach, the predictions are made using a model that can be directly inspected and interpreted by human experts (Christoph Molnar. Interpretable Machine Learning. 2019. https://christophm.github.io/interpretable-ml-book/). A study carried out in (Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1:206-215, 05 2019) strongly recommends using interpretable models for high-stake applications. It used to be a common myth that interpretable models were less accurate; however, recent advances with rule list algorithms have led to the ability to build more accurate interpretable models (Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1:206-215, 05 2019).

[0006] One type of Interpretable AI model is based on decision rules. An example of an Interpretable AI model based on decision rules is known as Bayesian Rule Lists (BRL) and is explained in (Benjamin Letham, Cynthia Rudin, Tyler McCormick, and David Madigan. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9:1350-1371, 09 2015).

[0007] The aim of the invention is to provide interpretable AI models with an even higher level of accuracy, and in particular to improve the accuracy of models based on the decision rules approach.

Technical solution

[0008] A solution to the aforementioned problem is the invention described in the remainder of this description.

[0009] In what follows the following terms and concepts will be used.

[0010] This description deals with AI models that belong to a type of Machine Learning models known as Classifiers. I.e., the models either assign a given input data sample (that represents an observation) to one or more of a set of data classes (or categories), or they provide for each data class of such a set of data classes an estimate of the probability that a given input data sample belongs to that data class. In what follows, the terms data class, class or category may be used interchangeably. In some models, each input data sample is assumed to belong to exactly one category or class. In such models the estimates of the probabilities may sum to a total of exactly 1. In some models there are exactly two classes and each input data sample is assumed to belong to exactly one of these two categories or classes. These models may be referred to as binary models. An input data sample representing an observation consists of the properties of an observation that are deemed (potentially) relevant for the problem at hand. The individual properties of an observation are termed the features, which can be grouped in a feature vector. The various components of the feature vector (i.e., the feature values) of an input data sample may have different types of values. For example, some features may have numeric (integer or real) values (which may be inherently continuous), other features may have categorical values, yet other values may be Boolean values, and still other values may be ordinal values.
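The mixed feature types and the probability estimates described above can be illustrated with a short Python sketch; all feature names and values here are assumptions made for illustration, not part of the application.

```python
# Hypothetical input data sample: a feature vector mixing the value
# types described above (feature names are illustrative assumptions).
sample = {
    "amount": 249.99,       # numeric (real-valued) feature
    "country": "BE",        # categorical feature
    "is_new_device": True,  # Boolean feature
    "risk_tier": 2,         # ordinal feature
}

# For a binary model whose input samples each belong to exactly one of
# two classes, the probability estimates sum to exactly 1:
probabilities = {"fraud": 0.93, "genuine": 0.07}
assert abs(sum(probabilities.values()) - 1.0) < 1e-9
```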

[0011] The models are in the form of an ordered list of if-then-else statements whereby: the condition in the if-part of each if-then-else statement is a Boolean function of one or more features of the input data sample, the then-part and/or the else-part of each if-then-else statement contain an outcome or prediction of the model (i.e., the class or classes that an input data sample belongs to according to the model, or the estimates of the probabilities that a given input data sample belongs to each of the classes), or lead to another if-then-else statement of the ordered list of if-then-else statements making up the model.

[0012] In some particular models, the then-part of each if-then-else statement contains an outcome or prediction of the model (i.e., the class or classes that an input data sample belongs to according to the model, or the estimates of the probabilities that a given input data sample belongs to each of the classes), and the else-part leads to another if-then-else statement of the ordered list of if-then-else statements making up the model. The prediction that is in the then-part of such an if-then-else statement and that is thus associated with the Boolean condition function in the if-then-else statement, may be referred to as the prediction part of the if-then-else statement. Other models may have another appearance but may have a fully equivalent formulation that follows this format. Conversely, if a particular model has been obtained using this representation, the model’s representation may subsequently be converted into a different representation, for example, to enhance the efficiency when making predictions for input data samples. If the model is in this format then the model is fully defined by the list of Boolean condition functions in the model’s ordered list of if-then-else statements and the order in which these Boolean condition functions appear in that ordered list of if-then-else statements, along with the prediction parts associated with each of the Boolean condition functions in the model’s ordered list of if-then-else statements. We can therefore represent the model by an ordered set of Boolean condition functions. In what follows, the Boolean condition functions may also be referred to as decision rules and the ordered set of Boolean condition functions as the decision rules list. In symbols, the decision rules list may be represented as: d = [a_i]_{i=1}^m, wherein a_i represents the i-th decision rule (or Boolean condition function) and the integer m indicates the number of decision rules in the ordered set of decision rules d.
Depending on the context, the term decision rules list may be used in this description in a broad sense (i.e., it includes for each decision rule not only the Boolean condition function but also the prediction part that is associated with that decision rule), or in a narrow sense (i.e., it includes for each decision rule only the Boolean condition function but not a prediction part that may be associated with that decision rule). In what follows, we may refer to a model that is thus defined by a decision rules list as a decision rules list model (even if the actual form of the model that is used for making predictions for input data samples may use a different representation), and we will focus in the remainder of this description on such decision rules list models.
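A decision rules list model in the broad sense (each decision rule paired with its prediction part) can be sketched as follows; the feature names, conditions and class labels are assumptions for illustration only.

```python
# Minimal sketch of a decision rules list in the broad sense: ordered
# pairs of a Boolean condition function and its prediction part. The
# final entry plays the role of a default rule (its condition is
# always TRUE), so every input sample receives a prediction.
rules_list = [
    (lambda x: x["age"] > 60 and x["smoker"], "high_risk"),
    (lambda x: x["age"] > 60, "medium_risk"),
    (lambda x: True, "low_risk"),  # default rule
]

def predict(x, rules):
    """Walk the ordered if-then-else chain: the first decision rule
    whose condition evaluates to TRUE determines the prediction."""
    for condition, prediction in rules:
        if condition(x):
            return prediction

print(predict({"age": 70, "smoker": False}, rules_list))  # prints "medium_risk"
```

The order of the rules matters: the first rule would also fire for a 70-year-old smoker, so the second rule is only reached by samples the first rule rejects.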

[0013] In some models, the Boolean condition functions of the if-part of the if-then-else statements making up the model may be a function of only some of the features (whereby one condition function may be a function of a first subset of the set of features and another condition function may be a function of a second subset of the set of features). The number of features that a Boolean condition function or decision rule effectively depends on may be referred to as the cardinality of that decision rule or Boolean condition function. In some particular models, the Boolean condition functions may be expressed as a logical AND-ing of a number of Boolean condition subfunctions whereby each Boolean condition subfunction is a function of only a single feature. In what follows, such Boolean condition functions and the corresponding decision rules may be referred to as feature-factored Boolean condition functions or feature-factored decision rules, and models that are defined by a decision rules list containing only feature-factored Boolean condition functions may be referred to as feature-factored decision rules list models. In some cases such a Boolean condition subfunction of a single feature (which may be referred to as a single-featured condition subfunction) evaluates to TRUE if and only if that single feature has a specific value. A Boolean condition subfunction of a single feature that evaluates to TRUE if and only if that single feature has a specific value may be referred to as a basic condition subfunction; feature-factored decision rules of which all condition subfunctions are basic may be referred to as basic feature-factored decision rules or shortly as basic decision rules; and decision rules lists wherein all decision rules are basic decision rules may be referred to as basic decision rules lists. Basic decision rules have the advantage that they are generally very easy to interpret.
In some cases a Boolean condition subfunction of a single feature evaluates to TRUE if and only if that single feature has a value that belongs to a specific set or group of specific values. In some cases the feature may be a numeric or ordinal feature and that specific set or group of specific values may be an interval.
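The notions of a feature-factored decision rule, its subfunctions and its cardinality can be sketched as follows; the feature names and thresholds are assumptions for this illustration.

```python
# Illustrative feature-factored decision rule: a logical AND-ing of
# Boolean condition subfunctions, each depending on a single feature.
subfunctions = [
    lambda x: x["country"] == "BE",          # basic: TRUE iff the feature has a specific value
    lambda x: 100.0 <= x["amount"] < 500.0,  # interval subfunction (not basic)
]

def feature_factored_rule(x):
    # The rule is TRUE only if every single-feature subfunction is TRUE.
    return all(sub(x) for sub in subfunctions)

# The cardinality of the rule is the number of features it depends on:
cardinality = len(subfunctions)
```

Because the second subfunction tests membership of an interval rather than a single specific value, this example rule is feature-factored but not basic.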

[0014] Training decision rules list models

[0015] The machine learning algorithms that will be considered in the following paragraphs for training decision rules list models or for generating a decision rules list for a decision rules list model are assumed to be supervised machine learning algorithms. I.e., it is assumed that there is a training set of training examples whereby each training example is a pair of a training input data sample and a label that indicates the (or a) class or category that the training input data sample belongs to, and that the model is trained on the basis of that training set. If the label of a training example indicates that the training input data sample of that training example belongs to a certain class or category, then that training example and that training input sample are said to belong to that class or category.
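The training-set structure described above can be sketched as (sample, label) pairs; the feature names, values and class labels are illustrative assumptions.

```python
from collections import defaultdict

# Sketch of a supervised training set as described in [0015]: each
# training example pairs a training input data sample with a label
# naming the data class it belongs to.
training_set = [
    ({"amount": 12.0, "country": "BE"}, "genuine"),
    ({"amount": 980.0, "country": "US"}, "fraud"),
    ({"amount": 45.0, "country": "BE"}, "genuine"),
]

# Grouping the training input data samples per class, as is needed
# when counting class members for quality-score calculations:
samples_by_class = defaultdict(list)
for input_sample, label in training_set:
    samples_by_class[label].append(input_sample)
```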

[0016] Aspects and embodiments of the invention.

[0017] In a first aspect of the invention, a computer based method is provided for generating an ordered set of decision rules (or decision rules list) for a decision rules list model. In some embodiments, the method may comprise any of the methods described elsewhere in this description. In some embodiments, the method may be used with, performed by or implemented on any of the systems and/or apparatus described elsewhere in this description.

[0018] In a first set of embodiments of the method provided by the first aspect of the invention, the method comprises the steps of:

- generating, using a first training set of training examples, a pool A of candidate decision rules (wherein each training example of said first training set comprises a training input data sample and a label indicating a data class of a set of data classes that the training input data sample belongs to); and

- generating a decision rules list by selecting from said pool A of candidate decision rules a set of decision rules and ordering this set of selected decision rules; wherein the step of generating the pool A of candidate decision rules comprises considering each data class of a subset of the set of data classes in turn as a target data class and repeating for each target data class being considered the steps of:

- generating a group of one or more candidate decision rules;

- calculating for each generated candidate decision rule of this group of one or more candidate decision rules a quality score as a function of said generated candidate decision rule, said target data class and said first training set of training examples;

- using the quality scores that have been generated for each candidate decision rule of the group of one or more candidate decision rules to select which of the candidate decision rules of the group of one or more candidate decision rules to add to the pool A of candidate decision rules; and

- adding the selected candidate decision rules to the pool A of candidate decision rules; wherein the value of said quality score is calculated as a function of said generated candidate decision rule, said target data class and said first training set of training examples such that:

- the quality score is positively correlated with the ratio o of, on the one hand, the number of training data samples that belong to said target data class and for which said generated candidate decision rule is TRUE, o to, on the other hand, the total number of training data samples that belong to said target data class; and/or

- the quality score is negatively correlated with the prevalence of training data samples for which said generated candidate decision rule is TRUE among the training data samples that don’t belong to said target data class. In some embodiments, considering each data class of a subset of the set of data classes in turn as a target data class may be done by considering the data classes of the subset one after the other. In other embodiments, considering each data class of a subset of the set of data classes in turn as a target data class may be done by considering at least some of the data classes of the subset in parallel. The subset of the set of data classes comprises at least one data class of the set of data classes. In some embodiments, the subset of the set of data classes comprises all data classes of the set of data classes. In some embodiments, the subset of the set of data classes comprises all data classes except one of the set of data classes.
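The loop structure of the first set of embodiments can be condensed into a short sketch. The helper names (generate_candidates, quality_score) and the threshold-based selection are assumptions made for illustration; as described elsewhere, selection may also be ranking-based or probabilistic.

```python
# Condensed sketch of the pool-generation steps: each target data class
# is considered in turn, a group of candidate rules is generated, each
# rule is scored, and selected rules are added to pool A.
def generate_pool(training_set, target_classes, generate_candidates,
                  quality_score, threshold):
    pool_A = []
    for target_class in target_classes:          # consider each class in turn
        group = generate_candidates(target_class)
        for rule in group:
            score = quality_score(rule, target_class, training_set)
            if score > threshold:                # select via the quality score
                pool_A.append(rule)              # add selected rules to pool A
    return pool_A
```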

[0019] Decision rule quality score.

[0020] In a second set of embodiments of the method, the method may comprise any method of the first set of embodiments wherein: the value of said quality score may be calculated as a function of a class relevance factor and a class selectivity factor, whereby:

- the class relevance factor is a function of said generated candidate decision rule, said target data class and said first training set of training examples, whereby the value of the class relevance factor is positively correlated with the ratio: o of the number of training data samples that belong to said target data class and for which said generated candidate decision rule is TRUE, o to the total number of training data samples that belong to said target data class; and

- the class selectivity factor is a function of said generated candidate decision rule, said target data class and said first training set of training examples, whereby the value of the class selectivity factor is negatively correlated with the prevalence of training data samples for which said generated candidate decision rule is TRUE among the training data samples that don’t belong to said target data class.

[0021] In some embodiments of the second set of embodiments, said quality score may be calculated as the product of said class relevance factor and said class selectivity factor.

[0022] In a third set of embodiments, the method may comprise any method of the second set of embodiments wherein the class relevance factor (CRF) may be calculated as an increasing function of: the number of training data samples that belong to said target data class and for which the generated candidate decision rule is TRUE, divided by the total number of training data samples that belong to said target data class. In some embodiments this function is a monotonically increasing function. In some embodiments this function is a weakly monotonically increasing function. In some embodiments this function is a strictly monotonically increasing function. For example, in some embodiments, the class relevance factor may be calculated as: the number of training data samples that belong to said target data class and for which the generated candidate decision rule is TRUE, divided by the total number of training data samples that belong to said target data class.
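The example CRF calculation above may be sketched as follows (a minimal illustration; the function and variable names are not from this application, and a decision rule is assumed to be representable as a Boolean predicate over input data samples, with the training set given as a list of (sample, class label) pairs):

```python
def class_relevance_factor(rule, target_class, training_set):
    """CRF example: the fraction of target-class training samples for which
    the candidate decision rule evaluates to TRUE.

    `rule` is a Boolean predicate over input data samples; `training_set`
    is a list of (sample, label) pairs. Names are illustrative only.
    """
    in_class = [x for x, y in training_set if y == target_class]
    if not in_class:
        return 0.0
    return sum(1 for x in in_class if rule(x)) / len(in_class)
```

This particular CRF is trivially a strictly monotonically increasing function of the ratio, since it is the ratio itself.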

[0023] In a fourth set of embodiments, the method may comprise any method of the second or third sets of embodiments wherein the class selectivity factor (CSF) may be a decreasing function of the number of data classes that contain at least one training data sample for which the generated candidate decision rule is TRUE. In some embodiments this decreasing CSF function may be monotonically decreasing. In some embodiments this decreasing CSF function may be weakly monotonically decreasing. In some embodiments this decreasing CSF function may be strictly monotonically decreasing. For example, in some embodiments the class selectivity factor (CSF) may for example be calculated as the function CSF = (1 + NC tot) / (1 + NC true), or as the function CSF = 1 + log ((1 + NC tot) / (1 + NC true)), wherein NC tot is the total number of classes and NC true is the number of classes that contain at least one training data sample for which the generated candidate decision rule is TRUE. Alternatively, NC true may for example be defined as the number of classes for which the number of training data samples for which the generated candidate decision rule is TRUE exceeds a certain threshold (either in an absolute sense or in a relative sense, i.e., relative to the total number of training data samples in each respective class).

[0024] In a fifth set of embodiments, the method may comprise any method of the second to fourth sets of embodiments wherein the class selectivity factor (CSF) may be a decreasing function of: the number of training data samples that don’t belong to said target data class and for which the generated candidate decision rule is TRUE, divided by the total number of training data samples that do not belong to said target data class. In some embodiments this decreasing CSF function may be monotonically decreasing. In some embodiments this decreasing CSF function may be weakly monotonically decreasing. In some embodiments this decreasing CSF function may be strictly monotonically decreasing. For example, in some embodiments the class selectivity factor (CSF) may be calculated as: NS tot / (1 + NS true), or as: log ((1 + NS tot) / (1 + NS true)), wherein NS tot is the total number of training data samples that do not belong to said target data class and NS true is the number of training data samples that don’t belong to said target data class and for which the generated candidate decision rule is TRUE.
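The example CSF formulas of the fourth and fifth sets of embodiments may be sketched as follows (an illustration only; function and variable names are not from this application, and the sample-based variant assumes the same (sample, label) training-set representation as before):

```python
import math

def csf_class_count(nc_tot, nc_true):
    # Fourth-set example: CSF = (1 + NC tot) / (1 + NC true), decreasing in
    # the number of classes that contain at least one sample for which the
    # candidate decision rule is TRUE.
    return (1 + nc_tot) / (1 + nc_true)

def csf_class_count_log(nc_tot, nc_true):
    # Logarithmic variant: CSF = 1 + log((1 + NC tot) / (1 + NC true)).
    return 1 + math.log((1 + nc_tot) / (1 + nc_true))

def csf_sample_based(rule, target_class, training_set):
    # Fifth-set example: CSF = NS tot / (1 + NS true), decreasing in the
    # number of out-of-class samples for which the rule is TRUE.
    outside = [x for x, y in training_set if y != target_class]
    ns_true = sum(1 for x in outside if rule(x))
    return len(outside) / (1 + ns_true)
```

All three are strictly monotonically decreasing in NC true (respectively NS true), as the corresponding sets of embodiments allow.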

[0026] In a sixth set of embodiments, the method may comprise any method of the second to fifth sets of embodiments wherein the class selectivity factor (CSF) may, for each individual data class of a number of data classes that are not the target data class, be a decreasing (e.g., monotonically decreasing, weakly monotonically decreasing or strictly monotonically decreasing) function of: the number of training data samples that belong to that individual data class and for which the generated candidate decision rule is TRUE, divided by the total number of training data samples that belong to that individual data class. For example, in some embodiments the class selectivity factor (CSF) may be calculated as the product of a number of individual class selectivity factors, one individual class selectivity factor for each of a number of classes that are not the target data class. In some embodiments, there may be an individual class selectivity factor in the product for each class that is not the target data class. Each individual class selectivity factor may be a decreasing (e.g., monotonically decreasing, weakly monotonically decreasing or strictly monotonically decreasing) function of: the number of training data samples for which the generated candidate decision rule is TRUE and that belong to the particular individual data class that the individual class selectivity factor corresponds to, divided by the total number of training data samples that belong to that individual data class.
For example, in some embodiments an individual class selectivity factor (ICSF) for a particular individual data class may be calculated as: IS tot / (1 + IS true), or as: log ((1 + IS tot) / (1 + IS true)), wherein IS tot is the total number of training data samples that belong to that particular individual data class and IS true is the number of training data samples that belong to that particular individual data class and for which the generated candidate decision rule is TRUE.
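The product-of-ICSFs variant of the sixth set of embodiments may be sketched as follows (an illustration with hypothetical names, using the ICSF = IS tot / (1 + IS true) example formula and the same (sample, label) training-set representation as before):

```python
def csf_product_of_icsf(rule, target_class, training_set):
    """CSF as the product of one ICSF per non-target class, where each
    ICSF = IS_tot / (1 + IS_true) for that class. Illustrative sketch."""
    non_target_classes = {y for _, y in training_set if y != target_class}
    csf = 1.0
    for c in non_target_classes:
        members = [x for x, y in training_set if y == c]
        is_true = sum(1 for x in members if rule(x))  # rule TRUE within class c
        csf *= len(members) / (1 + is_true)
    return csf
```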

[0026] Generating candidate decision rules to be considered for inclusion in the pool A.

[0027] In general, the number of all possible decision rules that are theoretically possible is enormous, if not unbounded. It is therefore in general not feasible to generate and consider for inclusion in the pool A all theoretically possible decision rules. In practice, only a tiny fraction of the theoretically possible decision rules can be generated and considered. In some embodiments, the step of generating a group of one or more candidate decision rules for a particular target data class is not done in a (completely) random fashion, but is done in a more or less systematic way. A group of one or more candidate decision rules for a particular target data class is preferably generated in a way from which some benefit may be expected. For example, the purpose of decision rules list models is to provide models that are easily interpretable. It is reasonable to assume that a lower average cardinality of the decision rules in a decision rules list can in general be expected to make that decision rules list easier to interpret.

[0028] From that perspective it may be desirable to generate only decision rules with a maximum cardinality, i.e., decision rules having a cardinality that does not exceed a certain maximum value. In a seventh set of embodiments, the method may comprise any method of the previous sets of embodiments wherein no decision rules are generated with a cardinality that exceeds a certain maximum value. In an eighth set of embodiments, the method may comprise any method of the previous sets of embodiments wherein no decision rules lists are generated that comprise a decision rule with a cardinality that exceeds a certain maximum value.

[0029] It is furthermore reasonable to assume that decision rules with a simple structure are easier to interpret than decision rules with a complicated structure. From that perspective it may be advantageous to generate only decision rules with a simple structure. For example, in some embodiments only feature-factored decision rules are generated. In a ninth set of embodiments, the method may comprise any method of the previous sets of embodiments wherein only feature-factored decision rules are generated. In a tenth set of embodiments, the method may comprise any method of the previous sets of embodiments wherein no decision rules lists are generated that comprise a decision rule that is not a feature-factored decision rule. In the discussion of figure 1, an example is given of an embodiment wherein only feature-factored decision rules are generated that have a cardinality that does not exceed a certain given maximum cardinality.

[0030] Termination of the generation and selection process. In some embodiments, the steps of generating, selecting and adding candidate decision rules to the pool for a particular target data class (i.e., the steps of: generating a group of one or more candidate decision rules; calculating for each generated candidate decision rule of this group of one or more candidate decision rules a quality score as a function of said generated candidate decision rule, said target data class and said first training set of training examples; using the quality scores that have been generated for each candidate decision rule of the group of one or more candidate decision rules to select which of the candidate decision rules of the group of one or more candidate decision rules to add to the pool A of candidate decision rules; and adding the selected candidate decision rules to the pool A of candidate decision rules) are repeated until some predefined termination criterion is satisfied. The termination criterion may for example take into account the number of candidate decision rules that have been generated for that particular target data class, or the number of candidate decision rules generated for that particular target data class that have been added to the pool A. In some embodiments the termination criterion may also take into account the total number of candidate decision rules that have already been added to the pool A before. In some embodiments the termination criterion for a particular target data class may take into account the number of training data samples belonging to that particular target data class (such that, for example, the number of candidate decision rules that are generated and/or added to the pool for that particular target data class is positively correlated to the number of training data samples belonging to that particular target data class).
In some embodiments, the candidate decision rules may be generated in such a way that the quality score is on average expected to decrease as more candidate decision rules are being generated for a particular target data class (i.e., the generation algorithm doesn’t generate the decision rules at random but may be biased such that it is expected to generate the decision rules with the highest quality scores early on in the generation process) and the average quality score of the most recently generated decision rules for a particular data class may be taken into account for the termination criterion.
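A termination criterion of the kind described above may be sketched as follows (purely illustrative; the parameter names, default values and the particular combination of conditions are hypothetical, not taken from this application):

```python
def should_stop(num_generated, recent_scores,
                max_generated=1000, min_avg_score=0.1):
    """Illustrative termination criterion: stop once enough candidate rules
    have been generated for the target class, or once the average quality
    score of the most recently generated rules drops too low (assuming a
    generator biased towards emitting high-scoring rules early)."""
    if num_generated >= max_generated:
        return True
    if recent_scores and sum(recent_scores) / len(recent_scores) < min_avg_score:
        return True
    return False
```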

[0031] Using quality scores for selecting decision rules to be added to the pool A.

[0032] In some embodiments, the step of using the quality scores that have been generated for each candidate decision rule of the group of one or more candidate decision rules to select which of the candidate decision rules of the group of one or more candidate decision rules to add to the pool A of candidate decision rules, is done in such a way that the probability that a particular generated candidate decision rule will effectively be selected to be added to the pool A of candidate decision rules is positively correlated to or is an increasing function of the value of the quality score that has been calculated for that particular generated candidate decision rule. In an 11th set of embodiments, the method may comprise any method of the previous sets of embodiments wherein the probability that a particular generated candidate decision rule will effectively be selected to be added to the pool A of candidate decision rules is positively correlated to or is an increasing function of the value of the quality score that has been calculated for that particular generated candidate decision rule.

[0033] For example, in some embodiments, a particular generated candidate decision rule is selected to be added to the pool A of candidate decision rules if the value of the quality score that has been calculated for that particular generated candidate decision rule exceeds a certain threshold value (i.e., the probability that a particular generated candidate decision rule is selected is a step function of the quality score whereby the probability is 0 if the quality score is lower than the threshold value and the probability is 1 if the quality score is higher than the threshold value). This threshold may be a parameter of an implementation or embodiment of the method and may have a predefined value. In a 12th set of embodiments, the method may comprise any method of the first to 11th sets of embodiments wherein the step of using the quality scores that have been generated for each candidate decision rule of the group of one or more candidate decision rules to select which of the candidate decision rules of the group of one or more candidate decision rules to add to the pool A of candidate decision rules comprises selecting a particular generated candidate decision rule to be added to the pool A of candidate decision rules if the value of the quality score that has been calculated for that particular generated candidate decision rule exceeds a certain threshold value.

[0034] In other embodiments, the candidate decision rules may be generated in groups of more than one, and the generated candidate decision rules in each group may be ranked according to their quality score and only the candidate decision rules with the highest quality score are selected and added to the pool A. For example, the above-mentioned groups of generated candidate decision rules may have a fixed size s and the k candidate decision rules of each group with the highest quality scores may be selected and added to the pool A (with k < s, whereby k and s may be parameters of the learning algorithm).
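The two selection strategies described above (threshold-based selection of paragraph [0033] and group-based top-k selection of paragraph [0034]) may be sketched as follows (function names are illustrative; each candidate is assumed to be paired with its already-computed quality score):

```python
def select_by_threshold(scored_rules, threshold):
    # Step-function selection: keep a rule iff its quality score exceeds
    # the threshold. `scored_rules` is a list of (rule, score) pairs.
    return [rule for rule, score in scored_rules if score > threshold]

def select_top_k(scored_rules, k):
    # Group-based selection: rank the group of s generated rules by quality
    # score and keep the k highest-scoring ones (with k < s).
    ranked = sorted(scored_rules, key=lambda rs: rs[1], reverse=True)
    return [rule for rule, score in ranked[:k]]
```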

[0035] In a 13th set of embodiments, the method may comprise any method of the first to 11th sets of embodiments wherein:

- the step of generating a group of one or more candidate decision rules comprises generating a group of more than one candidate decision rule; and

- the step of using the quality scores that have been generated for each candidate decision rule of the group of one or more candidate decision rules to select which of the candidate decision rules of the group of one or more candidate decision rules to add to the pool A of candidate decision rules comprises ranking the generated candidate decision rules in the group according to their quality score and selecting only the candidate decision rules with the highest quality score to be added to the pool A.

[0036] In a 14th set of embodiments, the method may comprise any method of the first to 11th sets of embodiments wherein:

- the step of generating a group of one or more candidate decision rules comprises generating a group of more than one candidate decision rule; and

- the step of using the quality scores that have been generated for each candidate decision rule of the group of one or more candidate decision rules to select which of the candidate decision rules of the group of one or more candidate decision rules to add to the pool A of candidate decision rules comprises: dividing the group into two subgroups such that the candidate decision rules in one of the two subgroups have a higher average quality score than the candidate decision rules in the other of the two subgroups, and selecting the candidate decision rules of the subgroup containing the candidate decision rules with the highest quality scores to be added to the pool A.

[0037] In a 15th set of embodiments, the method may comprise any method of the 14th set of embodiments wherein: dividing the group into two subgroups such that the candidate decision rules in one of the two subgroups have a higher average quality score than the candidate decision rules in the other of the two subgroups comprises dividing the group into two subgroups such that all the candidate decision rules in one of the two subgroups have a quality score that is higher than the quality score of any of the candidate decision rules in the other of the two subgroups.

[0038] Using quality scores when generating candidate decision rules.

[0039] In some embodiments, the quality scores and/or the class relevance factors and/or the class selectivity factors that have been calculated for certain generated candidate decision rules may also be used for the generation of other candidate decision rules. For example, in some embodiments some candidate decision rules may be generated by combining two other earlier generated candidate decision rules of which the quality scores and/or the class relevance factors and/or the class selectivity factors are known, whereby these known quality scores and/or class relevance factors and/or class selectivity factors may be taken into account to decide which earlier generated candidate decision rules could be combined. For example, if for a given target data class two different candidate decision rules have a very high class relevance factor but only a moderate or even low class selectivity factor, then it is not unreasonable to expect that perhaps a candidate decision rule that is obtained by the logical AND-ing of these two candidate decision rules may have a class relevance factor that is still high or very high but a class selectivity factor that is substantially higher than the class selectivity factors of these two candidate decision rules.
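The AND-combination of two earlier candidate decision rules described above may be sketched as follows (illustrative; decision rules are again assumed to be representable as Boolean predicates over input data samples):

```python
def and_combine(rule_a, rule_b):
    # Combine two earlier-generated candidate decision rules by logical AND.
    # The combined rule may retain a high class relevance factor while being
    # substantially more class-selective than either rule alone.
    return lambda sample: rule_a(sample) and rule_b(sample)
```

For example, in a fraud-detection setting (the feature names here are hypothetical), `and_combine(lambda t: t["amount"] > 100, lambda t: t["country"] != "BE")` yields a rule that fires only when both conditions hold.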

[0040] Generating a decision rules list from the pool A of candidate decision rules.

[0041] After the pool A of candidate decision rules has been generated, a decision rules list may be generated by selecting from said pool A of candidate decision rules a set of decision rules and ordering this set of selected decision rules. The process of generating the decision rules list is subject to various, sometimes conflicting, requirements, constraints and aims. One requirement may be that the accuracy of the predictions generated by the decision rules list model defined by the generated decision rules list is as high as possible. Another constraint may be that the computation time for generating the decision rules list remains acceptable, especially if the model is supposed to be regularly updated when new training data (i.e., additional or more up-to-date training examples) become available. In some embodiments, the constraint that the computation time for generating the decision rules list remains acceptable may be satisfied by stopping the search for a decision rules list after a certain amount of computing time has been spent or after a certain number of search steps have been taken. Yet another requirement may be that the interpretability of the decision rules list is as high as possible. In some embodiments, the requirement for an acceptable level of interpretability will be automatically satisfied by construction, e.g., because the method only generates candidate decision rules that are deemed to be sufficiently easy to interpret (e.g., only feature-factored decision rules or even only basic feature-factored decision rules, and/or only decision rules with a cardinality that is not too high) and/or constructs decision rules lists that are not longer than a certain maximum length.

[0042] In a 16th set of embodiments, the method may comprise any method of the first to 15th set of embodiments wherein the generated decision rules list contains only feature-factored decision rules and/or contains only basic feature-factored decision rules and/or contains only decision rules with a cardinality that is not higher than a certain maximum cardinality. In a 17th set of embodiments, the method may comprise any method of the first to 16th set of embodiments wherein the step of generating, using a first training set of training examples, a pool A of candidate decision rules comprises adding to the pool A only feature-factored decision rules and/or basic feature-factored decision rules and/or decision rules with a cardinality that is not higher than a certain maximum cardinality. In an 18th set of embodiments, the method may comprise any method of the first to 16th set of embodiments wherein the step of generating a group of one or more candidate decision rules comprises adding to the pool A only feature-factored decision rules and/or basic feature-factored decision rules and/or decision rules with a cardinality that is not higher than a certain maximum cardinality. In a 19th set of embodiments, the method may comprise any method of the first to 18th set of embodiments wherein the step of generating a decision rules list by selecting from said pool A of candidate decision rules a set of decision rules and ordering this set of selected decision rules further comprises only generating decision rules list candidates having a length that does not exceed a certain maximum length.

[0043] Generating a decision rules list by selecting from said pool A of candidate decision rules a set of decision rules and ordering this set of selected decision rules

[0044] In some embodiments, the step of generating a decision rules list by selecting from said pool A of candidate decision rules a set of decision rules and ordering this set of selected decision rules may be performed by treating it as an optimization problem whereby the reward function may for example be the (estimated) accuracy of the model defined by or based on the decision rules list. In a 20th set of embodiments, the method may comprise any method of the first to 19th set of embodiments wherein the step of selecting from said pool A of candidate decision rules a set of decision rules and ordering this set of selected decision rules comprises solving an optimization problem for a reward function that is a function of the generated decision rules list.

In a 21st set of embodiments, the method may comprise any method of the 20th set of embodiments wherein the reward function is a function of an estimate of the accuracy of a model based on or defined by the generated decision rules list.

[0045] Various methods for solving such an optimization problem are known in the art. In some embodiments, a method for generating a decision rules list by selecting from said pool A of candidate decision rules a set of decision rules and ordering this set of selected decision rules may be based on Bayesian inference.

[0046] In some embodiments, an initial decision rules list candidate may be randomly constructed by randomly selecting a set of decision rules from the pool A of candidate decision rules and ordering the selected set in some random order. The process of selecting decision rules from the pool A may be completely random or may be subject to a bias. This bias may for example be a function of the cardinality of the candidate decision rules or it may be a function of the quality scores of the candidate decision rules. The length of the initial decision rules list candidate may be a fixed value or may be determined randomly but subject to a bias. Then a chain of decision rules list candidates may be constructed, starting with this initial decision rules list candidate. To extend the chain, a new decision rules list candidate may be generated from the last decision rules list candidate in the chain by modifying the last decision rules list candidate in the chain. There may be three types of modification: re-ordering the decision rules list candidate, selecting another candidate decision rule from the pool A and adding the selected candidate decision rule to the decision rules list candidate, or deleting a decision rule from the decision rules list candidate. The choice of modification may preferably be random (completely random or subject to some bias). If the new decision rules list candidate satisfies some acceptance criterion, then the chain of decision rules list candidates may be extended with this new decision rules list candidate, else the process of generating a new decision rules list candidate and verifying whether it satisfies the acceptance criterion may be repeated. The acceptance criterion may for example be based on the value of a reward function of the new decision rules list candidate.
For example, the chain of decision rules list candidates may be extended with the new decision rules list candidate if it has a better reward function value than the last decision rules list candidate in the chain or if the reward function value of the new decision rules list candidate is not worse than a certain margin below the reward function value of the last decision rules list candidate in the chain. This process for extending the chain of decision rules list candidates may be repeated until some stop criterion is met. For example, the process may be repeated until the chain has a certain length or until a certain number of decision rules list candidates have been generated and considered. When the chain is completed, one of the decision rules list candidates, for example the last decision rules list candidate of the chain or the decision rules list candidate in the chain with the highest reward function value, may be selected as the solution. In some embodiments, multiple chains may be generated and the decision rules list candidate with the highest reward function value across all the chains may be selected as the solution.
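The chain-construction process described above may be sketched as follows (a simplified illustration, not the application's definitive procedure: all parameter names and default values are hypothetical, the margin-based acceptance criterion is one of the examples given, and the best candidate seen so far is returned as the solution):

```python
import random

def search_decision_list(pool, reward, chain_length=100, init_size=3,
                         margin=0.05, rng=None):
    """Sketch of the chain search: start from a randomly ordered random
    subset of the pool, then repeatedly propose re-order / add / delete
    modifications, accepting a proposal if its reward is not more than
    `margin` below the current candidate's reward."""
    rng = rng or random.Random(0)
    current = rng.sample(pool, min(init_size, len(pool)))
    best = list(current)
    for _ in range(chain_length):
        proposal = list(current)
        move = rng.choice(["reorder", "add", "delete"])
        if move == "reorder" and len(proposal) > 1:
            rng.shuffle(proposal)                      # re-order the list
        elif move == "add":
            remaining = [r for r in pool if r not in proposal]
            if remaining:                              # add a pool rule
                proposal.insert(rng.randrange(len(proposal) + 1),
                                rng.choice(remaining))
        elif move == "delete" and len(proposal) > 1:
            proposal.pop(rng.randrange(len(proposal))) # delete a rule
        if reward(proposal) >= reward(current) - margin:
            current = proposal                         # extend the chain
            if reward(current) > reward(best):
                best = list(current)
    return best
```

In practice the reward function would be an estimate of the accuracy of the model defined by the completed decision rules list candidate, as described in paragraphs [0047] to [0049].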

[0047] For some embodiments, estimating the accuracy of a decision rules list candidate may be done in the following two-stage process.

[0048] First, each decision rule in the decision rules list candidate may be completed with a corresponding prediction part associated with that decision rule. The decision rules list candidate may be completed by determining for each decision rule in the decision rules list candidate the corresponding prediction part, wherein determining for each decision rule the corresponding prediction part means determining for each class a predicted probability that if an input data sample triggers that decision rule then the input data sample belongs to that class. In this context an input data sample is said to trigger a decision rule of the decision rules list if the Boolean condition function of that decision rule evaluates to TRUE for that input data sample but the Boolean condition functions of the earlier decision rules in that decision rules list (i.e., the decision rules corresponding to if-then-else statements that are higher up in the tree of if-then-else statements corresponding to that decision rules list) evaluate to FALSE. One way to determine the predicted probability for a given class in the prediction part of a particular decision rule, is to set that predicted probability of that decision rule for that class to the ratio:

- of: the number of training examples in a second training set of training examples that trigger that decision rule and that belong to that class,

- to: the total number of training examples in that second training set of training examples that trigger that decision rule.

In some embodiments, the second training set of training examples may be different from the first training set of training examples. In other embodiments, the second training set of training examples may be the same as the first training set of training examples.
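The completion step of paragraph [0048], including the notion of an input data sample triggering the first rule whose condition is TRUE, may be sketched as follows (illustrative names; rules are assumed to be Boolean predicates, in list order, and the second training set a list of (sample, label) pairs):

```python
def trigger_index(rules, sample):
    # Index of the first rule in the list whose Boolean condition evaluates
    # to TRUE for the sample (the rule the sample "triggers"), or None.
    for i, rule in enumerate(rules):
        if rule(sample):
            return i
    return None

def prediction_parts(rules, classes, second_training_set):
    # For each rule, estimate for each class the ratio of triggering samples
    # of that class to all samples triggering that rule.
    counts = [{c: 0 for c in classes} for _ in rules]
    totals = [0] * len(rules)
    for x, y in second_training_set:
        i = trigger_index(rules, x)
        if i is not None:
            counts[i][y] += 1
            totals[i] += 1
    return [{c: (counts[i][c] / totals[i] if totals[i] else 0.0)
             for c in classes} for i in range(len(rules))]
```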

[0049] Then, the accuracy of the predictions of the completed decision rules list candidate may be tested. For example, in some embodiments the accuracy of the completed decision rules list candidate may be tested by applying the model defined by the completed decision rules list candidate to the training input data samples of all training examples of a third training set of training examples and comparing for each training example the predicted probabilities for the training input data sample of the training example to the label of that training example. For example, one measure of the accuracy may be calculated by counting the number of training examples for which the class of the label matches the class with the highest predicted probability and dividing that counted number by the total number of training examples.

In some embodiments, the third training set of training examples may be different from the first and/or second training set of training examples. In other embodiments, the third training set of training examples may be the same as the first and/or second training set of training examples.
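The example accuracy measure of paragraph [0049] may be sketched as follows (illustrative; a completed decision rules list is assumed to be represented as a list of (condition, prediction part) pairs, where each prediction part maps a class to its predicted probability):

```python
def estimate_accuracy(completed_rules_list, test_examples):
    """Fraction of examples whose label equals the class with the highest
    predicted probability of the first rule they trigger. Examples that
    trigger no rule count as incorrect. Names are illustrative."""
    correct = 0
    for x, y in test_examples:
        for condition, part in completed_rules_list:
            if condition(x):
                if max(part, key=part.get) == y:
                    correct += 1
                break  # only the first triggered rule makes the prediction
    return correct / len(test_examples)
```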

[0050] Discretization of features

[0051] In some cases, the total number of all possible values of certain features may be very high. This may for example be the case for numeric features where the total number of all possible values may be limited only by the resolution of the digital representation of the underlying numeric values. For example, if the digital representation of a numeric feature is an unsigned 32-bit integer, then in principle there are 2^32 possible values for that feature. In such cases where there are features of which the total number of all possible values may be very high, basic feature-factored decision rules (and a fortiori basic decision rules list models that only contain such basic feature-factored decision rules) may perform poorly for example because their class relevance factor may be very low.

[0052] One solution to this problem may be to perform a discretization step wherein certain features are discretized. In the context of this description, discretizing a feature refers to the process of partitioning the set of all possible values for that feature whereby the number of subsets in the partition is (much) smaller than the total number of all possible values for that feature, and then replacing each original value of that feature by a symbolic value that indicates the subset of the partition that the original value belonged to. For a numerical value, the partitioning may correspond to a set of contiguous intervals. For example, a feature that represents an outside temperature represented by an unsigned integer may be discretized whereby the original feature values are replaced by a symbolic value that indicates whether the original temperature value was either below freezing point or above (or at) freezing point. The exact way of partitioning the set of all possible original values for a feature (i.e., how many subsets and which boundaries for the subsets) may in some cases have a significant impact on the usefulness of the resulting discretized feature. One way of discretizing features is explained in (Usama M. Fayyad and Keki B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In International Joint Conferences on Artificial Intelligence, pages 1022-1029, 1993).

[0053] In some embodiments, the method may comprise any method of the first to 21st set of embodiments further comprising the step of performing first (i.e., before the other steps are being performed) a discretization of at least one feature of which the total number of all possible original values exceeds a certain maximum.
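The interval-based discretization described above may be sketched as follows (a minimal illustration with hypothetical cut points and labels; the cited Fayyad-Irani method would choose the cut points automatically, which this sketch does not do):

```python
def discretize(value, boundaries, labels):
    """Replace a numeric feature value by a symbolic bin label.
    `boundaries` are ascending cut points; len(labels) == len(boundaries) + 1.
    A value falls in the first interval whose upper boundary it is below."""
    for boundary, label in zip(boundaries, labels):
        if value < boundary:
            return label
    return labels[-1]
```

For the temperature example, `discretize(t, [0], ["below freezing", "at or above freezing"])` maps every temperature to one of two symbolic values.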

[0054] In a second aspect of the invention, a system is provided for generating an ordered set of decision rules for a decision rules list model. In some embodiments, the system may be adapted to perform any of the methods described elsewhere in this description in connection to the first aspect of the invention.

[0055] In a third aspect of the invention, a computer program is provided that comprises a series of instructions that, when executed by a computer system, cause that computer system to perform a method for generating an ordered set of decision rules for a decision rules list model. In some embodiments, the method thus performed may comprise any of the methods described elsewhere in this description in connection to the first aspect of the invention.

[0056] In a fourth aspect of the invention, a computer-readable medium is provided on which a computer program is stored that comprises a series of instructions that, when executed by a computer system, cause that computer system to perform a method for generating an ordered set of decision rules for a decision rules list model. In some embodiments, the method thus performed may comprise any of the methods described elsewhere in this description in connection to the first aspect of the invention. In some embodiments, the computer program may comprise any of the computer programs described elsewhere in this description in connection to the third aspect of the invention.

[0057] Applications

[0058] The methods and systems described in this description may be used for training interpretable Machine Learning data models in a variety of applications. For example, such models may be used when authenticating users that are seeking remote access to computer systems and/or computer-based applications. Other models may be used to detect fraudulent electronic transactions (such as for example credit card transactions or money transfer orders submitted through internet or mobile banking applications), or may be used to take decisions on loan applications, credit worthiness, insurance applications, ... Still other models may be used in the healthcare sector, for example to propose diagnoses.

[0059] From probability to actual classification

[0060] In a number of applications, a mere estimate or prediction of the probabilities that a given input data sample belongs to the various classes is not sufficient by itself, but an actual classification into one particular class is required. In such cases an additional step may be required that converts the estimate or prediction of the probabilities that a given input data sample belongs to the various classes into an actual classification into one particular class. For example, the given input data sample may be classified into the class with the highest predicted probability.
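The additional step described above reduces, in the simplest case, to picking the class with the highest predicted probability. A minimal sketch, with hypothetical class labels:

```python
# Minimal sketch: convert a prediction of class probabilities into an
# actual classification by selecting the class with the highest
# predicted probability.
def classify(class_probabilities):
    """Return the class label with the highest predicted probability."""
    return max(class_probabilities, key=class_probabilities.get)

# Hypothetical example for a fraud-detection application:
probs = {"fraudulent": 0.72, "legitimate": 0.28}
# classify(probs) -> "fraudulent"
```

In applications where misclassification costs are asymmetric (e.g., fraud detection), a thresholding rule other than a plain argmax may be used instead; the sketch above shows only the simplest conversion.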

[0061] Inventor’s insights

[0062] A combination of a number of insights by the inventor forms a basis for the invention. One insight is that in general the average quality of the decision rules in the pool A of candidate decision rules has an influence on the overall accuracy of the resulting decision rules list, even if a lot of computing time is invested in trying to find an optimal decision rules list from the candidate decision rules in the pool A. On the one hand, the pool A of candidate decision rules should contain as many different decision rules as possible, to increase the probability that it includes good decision rules that can be included in an ordered set d of decision rules to be used in an accurate model. On the other hand, the inclusion of bad decision rules in the pool A of candidate decision rules should be avoided as much as possible, since the presence of too many low-quality decision rules makes it harder for the selection algorithm to select a good ordered set of decision rules.

[0063] Another insight is that, typically, the step of selecting from said pool A of candidate decision rules a set of decision rules and ordering this set of selected decision rules takes a lot more computing time than the step of generating that pool A of candidate decision rules using a training set of training examples; and that therefore making the step of generating the pool A of candidate decision rules more complicated may multiply the computing time required to perform this step but may (in a relative sense) only marginally affect the overall computing time for the training of the decision rules list model.

[0064] Still another insight is that a decision rule is likely to be of high quality (in the sense that it has a strong predictive power) if it applies on the one hand to a lot of samples of a particular class and it applies on the other hand to only a few samples of other classes and that it may therefore be advantageous to compute for the generated candidate decision rules a quality score that takes these two aspects into account.

[0065] Advantages

[0066] Experiments have shown that decision rules list models that have been trained using the method of the invention tend to provide more accurate predictions than decision rules list models that have been trained with existing training methods on the same training data sets using comparable computing resources. This confirms the expectation that a method for generating the pool A of decision rules, from which the decision rules for a decision rules list are selected, that yields a pool A whose decision rules have a higher average quality than the decision rules in a pool A generated by a prior art method, effectively results in more accurate models without losing interpretability and without requiring significantly more computing resources for the training phase as a whole.

[0067] More details of the various embodiments of the different aspects of the invention described above are provided in the paragraphs below.

Brief Description of the Drawings

[0068] The foregoing and other features and advantages of the invention will be apparent from the following, more particular description of embodiments of the invention, as illustrated in the accompanying drawings.

[0069] Figure 1 schematically illustrates an exemplary method according to an aspect of the invention.

[0070] Figure 2 schematically illustrates an exemplary system according to an aspect of the invention.

Detailed description

[0071] Some implementations of the present invention are discussed below. While specific implementations are discussed, it should be understood that this is done for illustration purposes only. A person skilled in the relevant art will recognize that other components and configurations may be used without departing from the spirit and scope of the invention. Various specific details are provided in order to enable a thorough understanding of the invention. However, it will be understood by a person skilled in the relevant art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention. Various modifications to the described embodiments will be apparent to persons skilled in the art, and the general principles of the embodiments described in detail below may be applied to other embodiments.

[0072] Figure 1 schematically illustrates an exemplary method according to an aspect of the invention.

[0073] In one embodiment, a method (100) according to the invention may comprise the following steps:

- generating (110), using a first training set of training examples, a pool A of candidate decision rules; and

- generating (180) a decision rules list by selecting from said pool A of candidate decision rules a set of decision rules and ordering this set of selected decision rules; wherein the step of generating (110) the pool A of candidate decision rules comprises considering each data class of a (sub)set of data classes in turn as a target data class and repeating for each target data class being considered the steps of:

- generating (120) a group of one or more candidate decision rules;

- calculating (130) for each generated candidate decision rule of this group of one or more candidate decision rules a quality score as a function of said generated candidate decision rule, said target data class and said first training set of training examples;

- using (140) the quality scores that have been generated for each candidate decision rule of the group of one or more candidate decision rules to select which of the candidate decision rules of the group of one or more candidate decision rules to add to the pool A of candidate decision rules; and

- adding (150) the selected candidate decision rules to the pool A of candidate decision rules; wherein the value of said quality score is calculated as a function of said generated candidate decision rule, said target data class and said first training set of training examples such that:

- the quality score is positively correlated with the ratio of, on the one hand, the number of training data samples that belong to said target data class and for which said generated candidate decision rule is TRUE to, on the other hand, the total number of training data samples that belong to said target data class; and/or

- the quality score is negatively correlated with the prevalence of training data samples for which said generated candidate decision rule is TRUE among the training data samples that don’t belong to said target data class.
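One quality score satisfying the two stated correlations can be sketched as below. The exact functional form (a product of the in-class coverage and one minus the out-of-class prevalence) is an illustrative assumption, not the form prescribed by the invention, and the rule, feature names and data are hypothetical.

```python
# Hedged sketch of a quality score that is positively correlated with the
# fraction of target-class samples for which the rule is TRUE, and
# negatively correlated with the prevalence of TRUE among the other
# samples. A training example is a (features, label) pair and a rule is a
# predicate over the features dict.
def quality_score(rule, target_class, training_set):
    in_class = [x for x, y in training_set if y == target_class]
    out_class = [x for x, y in training_set if y != target_class]
    # Positively correlated term: coverage within the target class.
    coverage = sum(rule(x) for x in in_class) / max(len(in_class), 1)
    # Negatively correlated term: prevalence among the other classes.
    prevalence = sum(rule(x) for x in out_class) / max(len(out_class), 1)
    return coverage * (1.0 - prevalence)

# Hypothetical fraud-detection rule and training data:
rule = lambda x: x["amount"] > 1000 and x["country"] == "XX"
data = [({"amount": 1500, "country": "XX"}, "fraud"),
        ({"amount": 200,  "country": "BE"}, "genuine"),
        ({"amount": 1200, "country": "XX"}, "fraud"),
        ({"amount": 50,   "country": "XX"}, "genuine")]
# quality_score(rule, "fraud", data) -> 1.0 (full coverage, zero prevalence)
```

A rule that fires on every sample scores zero here, since its out-of-class prevalence is maximal, which reflects the insight that a rule with strong predictive power applies to many samples of one class and few samples of the others.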

[0074] The method (100) may further comprise the steps of any of the methods described elsewhere in this description. In particular, the method (100) may further comprise the steps of any of the methods for generating an ordered set of decision rules for a decision rules list model discussed in connection to the first aspect of the invention.

[0075] A particular exemplary embodiment of method (100) is described in the following paragraphs.

[0076] The decision rules generated by the method of this embodiment are by construction all of the basic feature-factored type. The features may have been discretized (as explained elsewhere in this description). The step of generating (110), using a first training set of training examples, a pool A of candidate decision rules is based on the concept of n-grams and may be performed as follows.

[0077] The first training set of training examples is partitioned in subsets whereby each particular subset comprises all the training examples belonging to a particular one of the classes. Then, the following steps are repeated for a number P of permutations of the order of the features (P is a parameter of the method embodiment). For a given permutation, i.e., for a given order of the features, each of the subsets of training examples is converted into a serialized sequence of feature values. Converting a subset of training examples into a serialized sequence of feature values may be done by first initializing the sequence to be empty and then appending, for each training example in the subset, all the feature values of the training input data sample of that training example to the sequence in the given feature order. Then, for each of the serialized sequences of feature values, all different n-grams of feature values in that serialized sequence of feature values are determined (wherein an n-gram of feature values is a consecutive sequence of at most n feature values in the serialized sequence; n is a parameter of the method embodiment). Each n-gram of feature values defines a decision rule, namely the basic feature-factored decision rule with a cardinality of maximally n consisting of the logical AND-ing of all the basic condition subfunctions corresponding to the feature values appearing in that n-gram of feature values, i.e., the basic feature-factored decision rule defined by that n-gram of feature values is the basic feature-factored decision rule that is TRUE for each input data sample having the same combination of feature values as that n-gram of feature values, but is FALSE for any other input data sample. For example, the 3-gram of feature values x-y-z, wherein x, y and z are the values of respectively the features A, B and C, defines the basic feature-factored decision rule ((feature_A == x) AND (feature_B == y) AND (feature_C == z)).
Then a quality score is calculated for each of the decision rules defined by all the different n-grams of feature values in the serialized sequences of feature values. For example, the quality score may be defined as the product of a class relevance factor and a class selectivity factor. The class relevance factor may for example be defined as the number of occurrences of the specific n-gram of feature values corresponding to that decision rule in the serialized sequence of feature values, divided by the number of training examples in the subset corresponding to that serialized sequence of feature values. The class selectivity factor is meant to emphasize the decision rules corresponding to n-grams of feature values that occur in only a few serialized sequences of feature values, and to de-emphasize the decision rules corresponding to n-grams of feature values that occur in many serialized sequences of feature values. For example, the class selectivity factor CSF may be given by the formula CSF(decision_rule) = 1 + log( (1 + N) / (1 + df(decision_rule)) ), whereby N is the number of classes and df(decision_rule) is the number of serialized sequences of feature values that have at least one occurrence of the n-gram of feature values corresponding to that decision_rule. Then, for each serialized sequence of feature values (i.e., for each class), the decision rules determined for that serialized sequence of feature values are ranked according to their quality scores, and for each serialized sequence of feature values the top k decision rules with the highest quality scores are selected and added to the pool A (wherein k is a positive integer that is a parameter of the method embodiment). The order of the features is then permuted and the above steps are repeated for the new order of features.
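The per-class scoring step above, which resembles a TF-IDF weighting, can be sketched as follows. Function names and the test sequences are hypothetical; the CSF formula matches the one given in the paragraph.

```python
import math

# Sketch of the n-gram extraction and scoring of paragraph [0077].
def ngrams(sequence, n):
    """All different consecutive sub-sequences of at most n feature values."""
    out = set()
    for size in range(1, n + 1):
        for i in range(len(sequence) - size + 1):
            out.add(tuple(sequence[i:i + size]))
    return out

def csf(ngram, class_sequences):
    """Class selectivity factor: CSF = 1 + log((1 + N) / (1 + df))."""
    n_classes = len(class_sequences)  # N: number of classes
    # df: number of serialized sequences containing the n-gram at least once.
    df = sum(1 for seq in class_sequences
             if ngram in ngrams(seq, len(ngram)))
    return 1.0 + math.log((1 + n_classes) / (1 + df))

def ngram_quality(ngram, class_index, class_sequences, class_sizes):
    """Quality score = class relevance factor * class selectivity factor."""
    seq = class_sequences[class_index]
    # Occurrences of the n-gram in this class's serialized sequence...
    count = sum(1 for i in range(len(seq) - len(ngram) + 1)
                if tuple(seq[i:i + len(ngram)]) == ngram)
    # ...divided by the number of training examples in this class's subset.
    relevance = count / class_sizes[class_index]
    return relevance * csf(ngram, class_sequences)
```

An n-gram occurring in every class's sequence gets df = N, so CSF = 1 + log((1 + N) / (1 + N)) = 1, the minimum; an n-gram unique to one class gets df = 1 and hence the largest selectivity boost, which implements the emphasis described above.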

[0078] Finally, when this process of generating, selecting and adding decision rules to the pool A has been done for P permutations of the feature order, the pool A of candidate decision rules may be randomly shuffled and the step of generating (180) a decision rules list by selecting from said pool A of candidate decision rules a set of decision rules and ordering this set of selected decision rules may be performed as described elsewhere in this description.

[0079] In some embodiments, the method (100) may be performed by any of the systems described elsewhere in this description. The method (100) may be performed by any system discussed in connection to the second aspect of the invention. In particular, the method may be performed by any of the systems described in the discussion of Figure 2.

[0080] Figure 2 schematically illustrates an exemplary system according to an aspect of the invention.

[0081] In one embodiment, a system (200) according to the invention may comprise the following components:

- one or more digital data processing components (210) for processing digital data;

- one or more memory components (220) for storing data or instructions (e.g., software) to be performed by the digital data processing components.

[0082] The system (200) may comprise any of the systems described elsewhere in this description. In particular it may comprise any of the systems for generating an ordered set of decision rules for a decision rules list model discussed in connection to the second aspect of the invention. The system may be adapted to perform any of the methods described elsewhere in this description. In particular the system may be adapted to perform any of the methods for generating an ordered set of decision rules for a decision rules list model of the first aspect of the invention, or it may be adapted to perform any of the methods described in connection to Figure 1. The system may comprise one or more computers. Each of the one or more computers may comprise one or more of the one or more digital data processing components (210) and/or one or more of the one or more memory components (220).

[0083] The one or more data processing components (210) for processing digital data may for example comprise one or more microprocessors and/or one or more CPUs (Central Processing Units) and/or one or more ASICs (Application Specific Integrated Circuits) and/or one or more FPGAs (Field Programmable Gate Arrays). The one or more data processing components (210) may be adapted to perform any of the methods described elsewhere in this description. In particular, they may be adapted to perform any of the methods for generating an ordered set of decision rules for a decision rules list model of the first aspect of the invention, or any of the methods described in connection to Figure 1. The one or more data processing components (210) may be adapted to perform any of the computer programs discussed in connection to the third aspect of the invention. The one or more data processing components (210) may be connected to the one or more memory components (220) and may be adapted to perform one or more computer programs stored on the one or more memory components (220).

[0084] The one or more memory components (220) for storing data or instructions (e.g., software) to be performed by the digital data processing components (210) may for example comprise one or more RAM (Random Access Memory) modules and/or one or more hard disks. The one or more memory components (220) may be adapted to store a set of instructions (e.g., software) to be performed by the digital data processing components. For example, the one or more memory components (220) may be adapted to store a set of instructions that, when performed by the digital data processing components (210), cause the digital data processing components (210) to perform any of the methods described elsewhere in this description, such as any of the methods for generating an ordered set of decision rules for a decision rules list model of the first aspect of the invention, or any of the methods described in connection to Figure 1. The one or more memory components (220) may comprise any of the computer-readable media described in connection to the fourth aspect of the invention. The one or more memory components (220) may also be adapted to store data that are used or generated by the one or more data processing components (210) when performing any of the methods described elsewhere in this description, such as any of the methods for generating an ordered set of decision rules for a decision rules list model of the first aspect of the invention, or any of the methods described in connection to Figure 1. These data may for example include one or more training sets of training examples, candidate decision rules, the pool A of candidate decision rules, decision rules list candidates, ...

[0085] A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made. For example, elements of one or more implementations may be combined, deleted, modified, or supplemented to form further implementations. Accordingly, other implementations are within the scope of the appended claims. In addition, while a particular feature of the present invention may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application. While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. In particular, it is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the claimed subject matter, but one of ordinary skill in the art may recognize that many further combinations and permutations of the present invention are possible. Thus, the breadth and scope of the present invention should not be limited by any of the above described exemplary embodiments; rather the scope of at least one embodiment of the invention is defined only in accordance with the following claims and their equivalents.