Title:
METHOD AND SYSTEM FOR IN-DEPTH DEFENSE AGAINST ADAPTIVE GRAY-BOX ADVERSARIAL SAMPLES
Document Type and Number:
WIPO Patent Application WO/2023/072375
Kind Code:
A1
Abstract:
The present invention provides a method of providing security for a machine learning, ML, classifier against adaptive gray-box adversarial samples. According to embodiments, the method comprises inspecting, by a detector component, input samples submitted to the ML classifier to identify input samples that appear as noisy versions of previously submitted queries and/or as highly distorted input samples; and forwarding, by the detector component, only those input samples to the ML classifier that were not identified as noisy versions of previously submitted queries or as highly distorted input samples.

Inventors:
ANDREINA SÉBASTIEN (DE)
KARAME GHASSAN (DE)
LI WENTING (DE)
MARSON GIORGIA AZZURRA (DE)
Application Number:
PCT/EP2021/079689
Publication Date:
May 04, 2023
Filing Date:
October 26, 2021
Assignee:
NEC LABORATORIES EUROPE GMBH (DE)
International Classes:
G06N20/00; G06F21/57
Foreign References:
CN110768971A2020-02-07
Other References:
LIU CHANGRUI ET AL: "Defend Against Adversarial Samples by Using Perceptual Hash", COMPUTERS, MATERIALS & CONTINUA, vol. 62, no. 3, 1 January 2020 (2020-01-01), pages 1365 - 1386, XP055939191, ISSN: 1546-2226, Retrieved from the Internet DOI: 10.32604/cmc.2020.07421
LI GAOLEI ET AL: "DeSVig: Decentralized Swift Vigilance Against Adversarial Attacks in Industrial Artificial Intelligence Systems", IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 16, no. 5, 6 November 2019 (2019-11-06), pages 3267 - 3277, XP011773612, ISSN: 1551-3203, [retrieved on 20200213], DOI: 10.1109/TII.2019.2951766
CHANGRUI LIU ET AL.: "Defend Against Adversarial Samples by Using Perceptual Hash", COMPUTERS, MATERIALS & CONTINUA, vol. 62, no. 3, 2020, pages 1365 - 1386
VISHAL MONGA; BRIAN L. EVANS: "Perceptual Image Hashing Via Feature Points: Performance Evaluation and Tradeoffs", IEEE TRANS. IMAGE PROCESS., vol. 15, no. 11, 2006, pages 3452 - 3465, XP055357459, DOI: 10.1109/TIP.2006.881948
Attorney, Agent or Firm:
ULLRICH & NAUMANN (DE)
Claims:
Claims

1. A method of providing security for a machine learning, ML, classifier against adaptive gray-box adversarial samples, the method comprising: inspecting, by a detector component, input samples submitted to the ML classifier to identify input samples that appear as noisy versions of previously submitted queries and/or as highly distorted input samples; and forwarding, by the detector component, only those input samples to the ML classifier that were not identified as noisy versions of previously submitted queries or as highly distorted input samples.

2. The method according to claim 1, further comprising: returning, for an input sample that was identified as a noisy version of a previously submitted query, the prediction made for the previously submitted version of the query.

3. The method according to claim 1 or 2, further comprising: rejecting, by the detector component, an input sample that was identified as a highly distorted input sample.

4. The method according to any of claims 1 to 3, wherein identifying an input sample x as a noisy version of a previously submitted query comprises: deriving a fingerprint of the input sample x by computing its perceptual hash h; and comparing the fingerprint h with the respective fingerprints of all previously submitted input samples.

5. The method according to claim 4, further comprising: if no match is found for an input sample x with fingerprint h, processing input sample x with the ML model (210) of the ML classifier, returning the corresponding prediction y of the ML model (210) and storing the pair (h,y) in memory.

6. The method according to claim 4 or 5, further comprising: if a match (h*,y*) is found with h* = h, returning y* as prediction for x.

7. The method according to any of claims 1 to 6, wherein identifying an input sample x as a highly distorted input sample comprises: deriving, by the detector component, a value expressing the noise embedded in an input sample x and comparing it with a pre-defined threshold.

8. The method according to claim 7, wherein entropy is used as a metric for measuring the noise embedded in an input sample x.

9. The method according to claim 7 or 8, further comprising: if the noise is above the threshold, rejecting the input sample x as an adversarial input sample; and if the noise is below the threshold, classifying the input sample x by processing it with the ML model (210) and returning the corresponding prediction y.

10. The method according to any of claims 1 to 9, further comprising: limiting the number of admissible queries by refreshing the classifier after a threshold number of input samples submitted from all users.

11. A system comprising one or more processors which, alone or in combination, are configured to provide for execution of a method of providing security for a machine learning, ML, classifier against adaptive gray-box adversarial samples, the method comprising: inspecting input samples submitted to the ML classifier to identify input samples that appear as noisy versions of previously submitted queries and/or as highly distorted input samples; and forwarding only those input samples to the ML classifier that were not identified as noisy versions of previously submitted queries or as highly distorted input samples.

12. The system according to claim 11, wherein the ML classifier comprises a first stateful detector component that is configured to keep memory of previously submitted input samples and to detect similarities with new inputs.

13. The system according to claim 11 or 12, wherein the ML classifier comprises a second detector component that is configured to detect highly distorted input samples by utilizing a predefined noise metric, determining a noise level embedded in an input sample according to the predefined noise metric and comparing the determined noise level with a pre-defined threshold.

14. The system according to claims 12 and 13, wherein the first and the second detector component are arranged in a pipeline fashion upstream of the ML model (210) of the ML classifier.

15. A non-transitory computer readable medium having stored thereon computer executable instructions for performing a method for providing security for a machine learning, ML, classifier against adaptive gray-box adversarial samples, the method comprising: inspecting input samples submitted to the ML classifier to identify input samples that appear as noisy versions of previously submitted queries and/or as highly distorted input samples; and forwarding only those input samples to the ML classifier that were not identified as noisy versions of previously submitted queries or as highly distorted input samples.

Description:
METHOD AND SYSTEM FOR IN-DEPTH DEFENSE AGAINST ADAPTIVE GRAY-BOX ADVERSARIAL SAMPLES

The present invention relates to a method of providing security for a machine learning, ML, classifier against adaptive gray-box adversarial samples.

Current machine-learning (ML) algorithms are vulnerable to adversarial attacks, specifically evasion attacks, that can jeopardize the usefulness of AI-based systems. These attacks aim at bypassing a trained ML model by creating adversarial samples, i.e., specially crafted inputs that are misclassified by the ML model despite being indistinguishable, to a human observer, from naturally occurring inputs.

Evasion attacks represent a major threat in several critical applications of AI that rely on correct classification in hostile settings, where an attacker may have incentives to fool the classification system. Adversarial samples can indeed harm real-world AI applications such as image recognition (e.g., self-driving cars, surveillance cameras), speech recognition (e.g., voice assistants such as Siri, Alexa, and Google Home), recommendation systems (e.g., targeted suggestions by Netflix, Amazon), finance (e.g., algorithmic trading, fraud prevention) and many more.

Several defensive mechanisms have been proposed for thwarting adversarial samples, but most existing solutions turned out to be insecure against adaptive attacks (i.e., attacks that are fully aware of defensive mechanisms in place and are explicitly designed to bypass these mechanisms). Only a few defenses, such as adversarial training and certified defenses, show some robustness to adversarial samples. However, these solutions come with high costs even for simple datasets and do not scale to realistic deployments. Currently, no practical defense is known that can resist adaptive white-box attacks: as long as the attacker knows the internals of the ML model and the defensive mechanisms in place, it can fine-tune its strategy to circumvent both the model and the defense.

The adaptive white-box attack model is extremely pessimistic and, arguably, unrealistic for most real-world deployments. Notably, evasion attacks have been demonstrated also in the more realistic adaptive gray-box model, where the attacker is given only oracle access to the classification system. In this setting, the attacker can submit arbitrarily chosen inputs to the classifier and obtain the corresponding prediction, without knowing anything else (or very little) about the system.

Basically, prior art considers adaptive gray-box attacks only against “plain” ML models for classification, without any defense in place (a.k.a. vanilla classifiers). Thus, while it is well-established that adaptive gray-box attacks can evade vanilla classifiers, it is unclear whether these attacks are still successful if the classifier employs a suitable defense.

Changrui Liu et al.: “Defend Against Adversarial Samples by Using Perceptual Hash”, in Computers, Materials & Continua, CMC, vol.62, no.3, pp.1365-1386, 2020 describe a defense mechanism to detect adversarial samples based on perceptual hashing that is designed in a very weak threat model assuming a specific (and naive) attack strategy. Namely, the attacker generates adversarial samples by progressively perturbing a genuine image, and by testing all the obtained intermediate perturbed images on the target classifier. The proposed defense is based on the detection of inputs that are recognized as perturbed versions of previously submitted inputs. Perceptual hashing is used with the goal of destroying the process of perturbation generation by comparing similarities of images. However, the proposed defense can be easily bypassed by an attacker that tests its intermediate adversarial samples locally, on a different model.

It is therefore an object of the present invention to improve and further develop a method of the initially described type for providing security for a machine learning, ML, classifier against adaptive gray-box adversarial samples in such a way that a strong, in-depth defense is realized for realistic attack scenarios.

In accordance with the invention, the aforementioned object is accomplished by a method of providing security for a machine learning, ML, classifier against adaptive gray-box adversarial samples, the method comprising inspecting, by a detector component, input samples submitted to the ML classifier to identify input samples that appear as noisy versions of previously submitted queries and/or as highly distorted input samples; and forwarding, by the detector component, only those input samples to the ML classifier that were not identified as noisy versions of previously submitted queries or as highly distorted input samples.

Furthermore, the above mentioned object is accomplished by a system comprising one or more processors which, alone or in combination, are configured to provide for execution of a corresponding method of providing security for a machine learning, ML, classifier against adaptive gray-box adversarial samples, as well as by a non-transitory computer readable medium having stored thereon computer executable instructions for performing a corresponding method for providing security for a machine learning, ML, classifier against adaptive gray-box adversarial samples.

Generally, the aim of embodiments of the present invention is to enhance the security of ML classification systems for supervised learning in the presence of evasion attacks. Embodiments of the invention propose a layered defense against adaptive evasion attacks in a gray-box model, namely in a realistic adaptive gray-box model. The proposed solution comes with the advantage of being less expensive than prior art defenses designed for the adaptive white-box setting.

According to an embodiment of the invention, it may be provided that the ML classifier is configured to return, for an input sample that the detector component identified as a noisy version of a previously submitted query, the prediction made for the previously submitted version of the query. Returning a previously made prediction provides a security enhancement for the ML classifier since it prevents attackers from learning valuable information about the ML model in place.

According to an embodiment of the invention, it may be provided that input samples that the detector component identified as highly distorted input samples are rejected by the detector component. In other words, respective input samples are not forwarded to the ML model, i.e. the classifier refuses to make a prediction for such input samples and rejects them as adversarial.

According to an embodiment of the invention, the identification of an input sample x as a noisy version of a previously submitted query may be realized based on the concept of perceptual hashing. More specifically, a specific (predefined) fingerprint of the input sample x may be derived by computing its perceptual hash h. Then, the fingerprint h may be compared with the respective fingerprints of all previously submitted input samples, which may be stored in a memory of the ML classifier. If no match is found for an input sample x with fingerprint h, it may be provided that the input sample x is processed with the ML model of the ML classifier, returning the corresponding prediction y of the ML model and storing the pair (h,y) in the memory. On the other hand, if a match (h*,y*) is found with h* = h, it may be provided that the ML classifier returns y* as prediction for x.
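For illustration only, the lookup logic described above can be sketched as follows in Python; the `perceptual_hash` function and the `ml_model` object with a `predict` method are hypothetical placeholders and not part of the claimed subject matter.

```python
# Minimal sketch of the fingerprint-based lookup described above.
# `perceptual_hash` and `ml_model` are placeholders for an actual
# perceptual-hash function and a trained classifier.

class StatefulDetector:
    def __init__(self, ml_model, perceptual_hash):
        self.ml_model = ml_model
        self.perceptual_hash = perceptual_hash
        self.seen = {}  # fingerprint h -> previously returned prediction y

    def classify(self, x):
        h = self.perceptual_hash(x)      # derive fingerprint of input x
        if h in self.seen:               # match (h*, y*) with h* = h found:
            return self.seen[h]          # return the cached prediction y*
        y = self.ml_model.predict(x)     # no match: query the ML model
        self.seen[h] = y                 # store the pair (h, y) in memory
        return y
```

In this sketch, the dictionary `seen` plays the role of the memory of the ML classifier storing the pairs (h, y).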

With respect to the above mentioned prior art document by Changrui Liu et al. it should be noted that the approach described therein likewise uses perceptual hashing to detect similarities between a given input and previously submitted inputs. However, the prior art uses perceptual hashing to make adversarial samples ineffective, while embodiments of the present invention use it at an earlier stage, to prevent the attacker from learning valuable information about the target model's behavior. In other words, the inventiveness does not stem from the use of perceptual hashing to detect similarities (this has been done before in other contexts), but rather from the use of this similarity-detection method to limit the effectiveness of a gray-box attacker when building a substitute model.

According to an embodiment of the invention, the identification of an input sample x as a highly distorted input sample may be realized by the detector component by deriving a value expressing the noise embedded in an input sample x and by comparing it with a pre-defined threshold. For instance, entropy may be used as a metric for measuring the noise embedded in an input sample x. However, as will be appreciated by those skilled in the art, other metrics (e.g., based on detecting edges) may be implemented as well.

Generally, as defined in the literature, entropy is a measure of state disorder, randomness or uncertainty. In computer science and in the context of embodiments of the present invention, the entropy of a variable may be understood in the sense defined by Claude Shannon, i.e. as a measure of the amount of information of a variable's possible values. This typically represents the minimum number of bits required on average to store all the outcomes of the variable using a perfect encoding. Looking, for instance, at image classification AI, typical images can be compressed because neighboring pixels often have similar colors (e.g. the pixels of the blue sky are mostly similarly blue, although the shade may change), which means the typical actual entropy of an image is much lower than the entropy of pure randomness. When creating an adversarial sample, the adversary adds some noise over the image, which leads to a higher entropy due to the added randomness on the pixels.
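As a minimal sketch of how such an entropy metric could be computed, assuming, purely for illustration, a grayscale image represented as a 2-D array of 8-bit pixel values:

```python
import numpy as np

def image_entropy(image):
    """Shannon entropy (bits per pixel) of a grayscale image given as a
    2-D uint8 array; a hypothetical stand-in for the noise metric above."""
    hist, _ = np.histogram(image, bins=256, range=(0, 256))
    p = hist / hist.sum()          # empirical distribution of pixel values
    p = p[p > 0]                   # drop empty bins to avoid log(0)
    return float(-(p * np.log2(p)).sum())
```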

According to embodiments, if the noise is above the threshold, the input sample x may be rejected as an adversarial input sample, as already described above. On the other hand, if the noise is below the threshold, it may be provided that the input sample x is classified by processing it with the ML model and returning the corresponding prediction y.

With regard to a particularly strong security enhancement, it may be provided that both detection mechanisms described above, i.e. detecting input samples that appear as noisy versions of previously submitted queries on the one hand and detecting input samples that appear as highly distorted input samples on the other hand, are combined with each other. To this end, according to an embodiment, the ML classifier may comprise a first (stateful) detector component (configured to keep memory of previously submitted input samples and to detect similarities with new inputs) and a second detector component (configured to detect highly distorted input samples by utilizing a predefined noise metric, determining a noise level embedded in an input sample according to the predefined noise metric and comparing the determined noise level with a pre-defined threshold). Both detector components may be arranged in a pipeline fashion upstream of the ML model of the ML classifier.
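A minimal sketch of such a pipeline, assuming a hypothetical first detector exposing `fingerprint`/`lookup`/`store` operations and a noise metric `entropy_fn`, could look as follows:

```python
REJECTED = "rejected"  # sentinel returned for suspected adversarial inputs

def pipelined_classify(x, detector1, entropy_fn, threshold, ml_model):
    """Sketch of the two detectors arranged upstream of the ML model;
    all interfaces are illustrative assumptions."""
    h = detector1.fingerprint(x)
    cached = detector1.lookup(h)
    if cached is not None:          # first detector: noisy re-query detected
        return cached               # return the earlier prediction
    if entropy_fn(x) > threshold:   # second detector: highly distorted input
        return REJECTED             # refuse to classify
    y = ml_model.predict(x)         # both checks passed: classify normally
    detector1.store(h, y)
    return y
```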

According to embodiments of the invention, the ML classification system may be configured to limit the number of admissible queries (submitted by any user). For instance, this may be realized by refreshing the classifier after a threshold number of input samples submitted from all users.

There are several ways how to design and further develop the teaching of the present invention in an advantageous way. To this end it is to be referred to the dependent claims on the one hand and to the following explanation of preferred embodiments of the invention by way of example, illustrated by the figure on the other hand. In connection with the explanation of the preferred embodiments of the invention by the aid of the figure, generally preferred embodiments and further developments of the teaching will be explained. In the drawing

Fig. 1 is a schematic view illustrating an adaptive gray-box attack for crafting adversarial samples based on a substitute model,

Fig. 2 is a schematic view illustrating a scheme for the detection of similar queries via perceptual hashing according to an embodiment of the present invention, and

Fig. 3 is a schematic view illustrating a scheme for the detection of high distortion adversarial queries according to an embodiment of the present invention.

Generally, embodiments of the present invention relate to a defense mechanism that is designed to increase the robustness of ML classifiers against adversarial samples in an adaptive gray-box model. Embodiments of the invention focus on two major aspects, namely i) restricting the adversarial queries and/or ii) limiting the input distortion of adversarial samples.

In the context of the present invention, existing adversarial models are deemed unrealistic and therefore existing defenses do not tackle the correct problem. Consequently, a novel adaptive gray-box model is defined where the adversary is aware of a (varying) percentage of the training data used to train the target model, the architecture of the target model and the complete details of the defense. The adaptive gray-box adversary is allowed to query the target model during the training phase to perform data augmentation in order to increase the similarity between the surrogate model and the target model. This differs from a black-box adversary, which is usually considered as only having full knowledge of the training data, no knowledge of the defense and no possibility to query the target model for data augmentation, and from the white-box adversary, which is typically considered as having full access to the target model and its defense.

With respect to the potential attacks against an ML classifier, embodiments of the invention assume an adaptive gray-box model, where the attacker is aware of the defensive mechanisms in place, but has limited knowledge about the internals of the trained ML model. In this setting, an attacker aims at bypassing an ML classifier at inference time, i.e., after the classifier has been trained, and the attacker has no influence on the training process.

Fig. 1 schematically illustrates an ML classification system 100 in its training phase 110 and in its inference phase 120 together with an adaptive gray-box attack scenario 130, where the attacker is given adaptive gray-box access to the targeted ML classification system, denoted by ft in the sequel, in the sense that the attacker can query arbitrary inputs x to ft and obtain the corresponding prediction y made by the classifier. The attacker is not assumed to know any other details of the classification system under attack; however, it may have some information such as a portion of the labeled data used for training the target model. To capture adaptive attacks, it is assumed in the context of the present disclosure that the attacker is aware of any defensive mechanism in place to thwart adversarial samples.

Most adaptive gray-box evasion attacks build a substitute model fs as an emulation of the target model ft (substitute-based attacks). In such an attack scenario, the attacker operates as follows (wherein the encircled '1' in Fig. 1 indicates the attacker's operation including the following Steps 1.-3., while the encircled '2' in Fig. 1 indicates the attacker's operation including Steps 4. and 5.):

1. The attacker collects an initial set of labeled data {(xi, yi)} and proceeds with a first training phase to obtain an initial substitute model. Existing attacks use as labeled data, e.g., a portion of the training data for the target model.

2. The attacker collects (unlabeled) samples {x'i} and labels them with the corresponding predictions made by the target model ft, thereby obtaining a set of labeled data {(x'i, y'i)} where y'i = ft(x'i). Existing attacks generate synthetic samples x'i by adding a small amount of noise to the initial samples xi.

3. The attacker repeats Step 2, each time generating new samples {x'i} from the current samples {xi}, until the substitute model fs has gone through sufficient training epochs (e.g., upon reaching a desired accuracy on a test set).

4. The attacker generates adversarial samples against fs using a white-box strategy (this is possible because the attacker has white-box access to fs).

5. The attacker finally submits the adversarial samples x', generated to bypass the substitute model fs, to attack the target model ft.
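Purely as an illustration of Steps 1.-5. above, the substitute-training loop can be summarized by the following sketch; `target_query` (the oracle access to ft) and `build_model` (the training of fs) are hypothetical callables, and no concrete attack parameters are implied.

```python
import numpy as np

def train_substitute(initial_data, target_query, build_model, rounds=5, noise=0.1):
    """Illustrative outline of the substitute-based attack (not a working attack).
    `target_query` labels inputs via the target model ft; `build_model`
    trains a substitute fs on labeled data; both are assumed interfaces."""
    xs, ys = initial_data                      # Step 1: initial labeled set
    fs = build_model(xs, ys)
    for _ in range(rounds):                    # Steps 2-3: data augmentation
        xs_new = xs + np.random.uniform(-noise, noise, xs.shape)  # synthetic x'
        ys_new = target_query(xs_new)          # labels y' = ft(x') from the oracle
        xs = np.concatenate([xs, xs_new])
        ys = np.concatenate([ys, ys_new])
        fs = build_model(xs, ys)               # retrain the substitute
    return fs                                  # Steps 4-5: craft white-box
                                               # adversarial samples against fs
```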

Embodiments of the present invention combine two defensive layers, each targeting a different capability of an adaptive gray-box adversary, so that the resulting system is hard to bypass as long as the attacker is restricted to the adaptive gray-box scenario. The defense is applicable to image-recognition applications; however, as will be appreciated by those skilled in the art, it could be easily adapted to also cover other ML applications.

Extensive empirical testing shows the following factors have a crucial impact on the success rate of an adaptive attacker in the adaptive gray-box setting:

a. Knowledge of training data, i.e., the fraction of labeled samples known to the adversary that have been used for training the target model ft.

b. Permitted queries, i.e., which and how many samples the adversary submits to the target model ft to learn its behavior and transfer it to the substitute model fs.

c. Adversarial distortion, i.e., the maximum tolerated amount of adversarial noise added to a genuine sample for turning it into a successful adversarial sample.

Ideally, a defensive mechanism against adaptive gray-box attacks should therefore:

A. Reduce as much as possible the amount of training data known to the attacker.

B. Restrict the number and/or type of queries the adversary can pose to the target model.

C. Limit input distortion so that the attacker is restricted to small (thus harder to craft) perturbations.

Reducing the amount of training data known to the attacker (A) is possible by simply keeping the data, as well as the trained parameters of the model, secret. As will be easily appreciated by those skilled in the art, standard methods can be used to apply this measure.

Embodiments of the present invention focus on restricting the adversarial queries (B) and limiting the input distortion of adversarial samples (C).

Fig. 2 schematically illustrates the implementation of a first defensive layer into an ML classification system 200 according to an embodiment of the present invention. This first defensive layer leverages the observation that, as long as the attacker is not aware of (a significant portion of) the training data that were used to train the target model 210, denoted ft, creating a faithful substitute model requires labeling synthetic samples. This, in turn, requires querying the target model ft on slightly distorted versions of the same input. Namely, given a sample x, the attacker creates a synthetic sample x' by adding noise, i.e., x' = x + r for some suitable (small) noise r. In what follows, samples x and x' as above are referred to as 'twin samples'.

Embodiments of the present invention prevent this strategy of generating synthetic data by inspecting every submitted query and identifying inputs that appear as noisy versions of previously submitted queries, and by reacting differently when candidate twin samples are found. According to a specific embodiment, it may be provided that the ML model 210 is augmented with a stateful apparatus that keeps memory of previously seen inputs and detects similarities with new inputs.

An idealized implementation of this mechanism could keep all seen queries as part of its state, and would then compare every new input with all previously submitted ones. However, this is rather inefficient and impractical for real-world applications, such as ML-as-a-service, that need to support hundreds or even thousands of queries for each user.

To overcome this issue, embodiments of the present invention propose a practical instantiation that avoids storing the entire input but keeps in memory only a predefined fingerprint of it. According to a specific embodiment, the instantiation may rely on perceptual image hashing, as described, e.g., in Vishal Monga, Brian L. Evans: Perceptual Image Hashing Via Feature Points: Performance Evaluation and Tradeoffs. IEEE Trans. Image Process. 15(11): 3452-3465 (2006), the entire contents of which is hereby incorporated by reference herein. Perceptual image hashing as described herein refers to a special type of hash function that is invariant under perceptually insignificant distortion, while being sensitive to high distortion. In other words, twin images/samples are likely to have colliding perceptual hash values, H(x') = H(x), while independent images/samples are extremely unlikely to collide.
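As an illustration of this collision behavior, the following sketch uses the off-the-shelf `imagehash` Python library purely as a stand-in for the perceptual hash H; the cited Monga/Evans construction is not implemented here, and the file names are hypothetical.

```python
from PIL import Image
import imagehash

# Illustration only: `imagehash.phash` serves as a generic perceptual hash H.
x = Image.open("sample.png")             # hypothetical genuine input x
x_twin = Image.open("sample_noisy.png")  # slightly perturbed twin x' = x + r

h, h_twin = imagehash.phash(x), imagehash.phash(x_twin)
# Twin samples should yield identical or near-identical hashes, while
# independent images are extremely unlikely to collide.
print(h == h_twin, h - h_twin)           # `-` gives the Hamming distance
```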

Whenever a new input x is submitted to the ML classification system 200, as shown at S20, its perceptual hash h = H(x) is computed, as shown at S21, and compared with the hashes of previously seen samples, as shown at S22. If a match is found, the input is not processed by the ML model 210 and the previous prediction is returned, as shown at S23. Otherwise, the input is forwarded to the ML model 210, as shown at S24, and classified normally, as shown at S25. Furthermore, the perceptual hash h of the respective input, along with the model prediction y, is stored in memory, as shown at S26.

Intuitively, reacting to identified twin samples by returning previously made predictions, as described above, prevents attackers from learning valuable information about the target model 210, which results in a security enhancement. At the same time, the classification functionality is guaranteed to honest users that happen to submit 'by chance' the same image as other users, thereby achieving functionality preservation.

The defensive layer described in connection with Fig. 2 renders data augmentation - undertaken with the purpose of training a faithful substitute model - ineffective. As long as the attacker has no prior knowledge about the training data (which is the case for most realistic applications of ML), this layer makes it harder to train a faithful substitute model and, therefore, to generate successful adversarial samples.

Moreover, using perceptual hashing to fingerprint previously seen inputs enables a much faster search and greatly reduces storage requirements - compared to a naive implementation that keeps in its state all submitted inputs and compares every new input with those stored in memory. This construct allows the classification system 200 to process heavy input loads. Therefore, the present invention offers a strong security enhancement without hampering scalability.

Fig. 3, in which like reference numbers denote like components/procedural steps as in Fig. 2, schematically illustrates the implementation of a second defensive layer into an ML classification system 200 according to an embodiment of the present invention. This second defensive layer aims at detecting highly distorted, adversarial inputs. Such an apparatus enhances security by restricting the attacker to generating low-distortion adversarial samples (which are harder to detect but also much harder to generate, especially in an adaptive gray-box setting).

More specifically, according to the embodiment illustrated in Fig. 3 an additional detection layer is embedded into the ML classification system 200, so that whenever a high-distortion input is detected, the classifier refuses to make a prediction (i.e., the input is rejected as adversarial). As will be appreciated by those skilled in the art, this mechanism does not degrade the functionality of the system, because honest users do not submit distorted inputs.

In more detail, according to the embodiment of Fig. 3, whenever a new input x is submitted to the ML classification system 200, as shown at S20, the entropy of the input is inspected, as shown at S27, and it is analyzed whether the input's entropy Ex is above a configurable threshold p, as shown at S28. If the entropy Ex is determined to be above the threshold p, it is assumed that the respective input is distorted and suspected to be an adversarial input. As a consequence, instead of being processed by the ML model 210, the respective input is simply rejected, as shown at S29. Otherwise, the input is forwarded to the ML model 210, as shown at S30, and classified normally, as shown at S31.

It should be noted that high distortion can be detected using different metrics, and those metrics may be more or less suitable for specific image-recognition applications. For instance, instead of inspecting the entropy of the input, as described in connection with the embodiment of Fig. 3, highly distorted images may likewise be identified by detecting edges or similar techniques. In any case, the introduction of a defensive layer adapted to detect highly distorted images limits the attacker to submitting low-distortion inputs, which are harder to craft, especially in an adaptive gray-box setting.

It is important to note that despite the apparent similarities, detecting high-distortion adversarial samples as described above in connection with Fig. 3 is an orthogonal problem to identifying (synthetic) twin samples as described above in connection with Fig. 2. Indeed, although generated starting from some genuine image x, an adversarial sample x' = x + p (where p is the adversarial perturbation) cannot be recognized by comparing x' and x, because the initial sample x is not known to the detector. In contrast, in the case of twin queries, the detector always has access to both x and x', which makes it easier to detect similarities.

Since, as mentioned above, the individual defensive mechanisms described in connection with Fig. 2 and Fig. 3, respectively, offer orthogonal security, embodiments of the present invention provide an ML classification system in which both defense mechanisms are combined with each other. For instance, both defense mechanisms may be implemented in the sense of a pipeline, wherein an incoming sample submitted to the ML model is first analyzed by the defense mechanism of Fig. 2 (i.e. to detect similar queries, e.g. via perceptual hashing), and wherein the sample, in case it has not been determined to be a twin sample of a previously submitted sample, is next analyzed by the defense mechanism of Fig. 3 (i.e. to detect whether the sample is a high-distortion adversarial sample). By combining both defense mechanisms, a strong in-depth defense can be realized.

According to a further embodiment of the present invention it may be provided to limit the overall number of queries that any user (and hence an adversary) can submit to the system 200. One way to implement this security measure without degrading functionality could be to periodically refresh the classifier once a prespecified query budget is reached by the users of the system collectively (to ensure protection against Sybil attacks). Refreshing the classifier has the effect of introducing unpredictability into the system, thereby making it more challenging for an attacker to predict the system's behavior in order to mount a successful attack. For instance, refreshing the classifier could be implemented by rotating the used ML models 210 and/or defensive layers, by adding randomization, etc.
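A minimal sketch of such a query-budget refresh, with a hypothetical `retrain_fn` hook that returns a fresh model and/or defensive configuration, could look as follows:

```python
class RefreshingClassifier:
    """Sketch of the query-budget refresh described above; `retrain_fn`
    is an assumed hook returning a fresh model/defensive configuration."""

    def __init__(self, model, retrain_fn, budget=100_000):
        self.model, self.retrain_fn, self.budget = model, retrain_fn, budget
        self.queries = 0                      # counted over all users (anti-Sybil)

    def predict(self, x):
        self.queries += 1
        if self.queries >= self.budget:       # collective budget reached: refresh
            self.model = self.retrain_fn()    # e.g. rotate model / defensive layers
            self.queries = 0
        return self.model.predict(x)
```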

Many modifications and other embodiments of the invention set forth herein will come to mind to the one skilled in the art to which the invention pertains having the benefit of the teachings presented in the foregoing description and the associated drawings. Therefore, it is to be understood that the invention is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.