


Title:
MODELLING CAUSATION IN MACHINE LEARNING
Document Type and Number:
WIPO Patent Application WO/2024/063912
Kind Code:
A1
Abstract:
A method comprising: sampling a first causal graph from a first graph distribution modelling causation between variables in a feature vector, and sampling a second causal graph from a second graph distribution modelling presence of possible confounders, a confounder being an unobserved cause of both of two variables. The method further comprises: identifying a parent variable which is a cause of a selected variable according to the first causal graph, and which together with the selected variable forms a confounded pair having a respective confounder being a cause of both according to the second causal graph. A machine learning model encodes the parent to give a first embedding, and encodes information on the confounded pair to give a second embedding. The embeddings are combined and then decoded to give a reconstructed value. This mechanism may be used in training the model or in treatment effect estimation.

Inventors:
MA CHAO (US)
ZHANG CHENG (US)
ASHMAN MATTHEW (US)
DEFANTE MARIFE (US)
FASSIO KAREN (US)
JENNINGS JOEL (US)
HILMKIL AGRIN (US)
Application Number:
PCT/US2023/031103
Publication Date:
March 28, 2024
Filing Date:
August 25, 2023
Assignee:
MICROSOFT TECHNOLOGY LICENSING LLC (US)
International Classes:
G06N3/0455; G06N5/025; G06N7/01; G06N3/084; G06N5/01; G16H20/00; G16H50/30
Other References:
Wang, Haotian, et al.: "Estimating Individualized Causal Effect with Confounded Instruments", Proceedings of the 2020 5th International Conference on Big Data and Computing, ACM, New York, NY, USA, 14 August 2022, pages 1857-1867, XP058851519, ISBN: 978-1-4503-7547-4, DOI: 10.1145/3534678.3539335
Zečević, Matej, et al.: "Relating Graph Neural Networks to Structural Causal Models", arXiv.org, Cornell University Library, Ithaca, NY, 22 October 2021, XP091071398
Löwe, Sindy, et al.: "Amortized Causal Discovery: Learning to Infer Causal Graphs from Time-Series Data", arXiv.org, Cornell University Library, Ithaca, NY, 21 February 2022, XP091146009
Geffner et al.: "Deep End-to-End Causal Inference", Microsoft Research, retrieved from the Internet (https://arxiv.org/pdf/2202.02195.pdf)
Attorney, Agent or Firm:
CHATTERJEE, Aaron C. et al. (US)
Claims:
CLAIMS

1. A computer-implemented method comprising: a) selecting a selected variable from among variables of a feature vector; b) sampling a first causal graph from a first graph distribution and sampling a second causal graph from a second graph distribution, the first causal graph modelling causation between variables in the feature vector, and the second causal graph modelling presence of possible confounders, a confounder being an unobserved cause of both of two variables in the feature vector; c) from among the variables of the feature vector, identifying a parent variable which is a cause of the selected variable according to the first causal graph, and which together with the selected variable forms a confounded pair having a respective confounder being a cause of both according to the second causal graph; d) inputting an input value of the parent variable into a first encoder, resulting in a respective embedding of the parent variable; e) inputting the input value of the parent variable into an inference network, resulting in a latent value modelling the respective confounder of the confounded pair, and inputting the latent value into a second encoder, resulting in an embedding of the confounder of the confounded pair; f) combining the embedding of the parent variable with the embedding of the confounder of the confounded pair, resulting in a combined embedding; and g) inputting the combined embedding into a decoder, resulting in a reconstructed value of the selected variable.

2. The method of claim 1, wherein: c) comprises: identifying each of the variables of the feature vector that is a respective parent variable of the selected variable according to the first causal graph, and identifying each respective confounded pair that comprises the selected variable and a respective one of the identified parent variables according to the second causal graph; d) comprises: inputting a respective input value of each of the parent variables into a respective first encoder for the parent, resulting in a respective embedding of each respective parent variable; e) comprises: for each identified confounded pair, inputting the input value of the respective parent variable into the inference network, thereby resulting in a respective latent value of each of the respective confounders, and inputting each of the latent values into a respective second encoder for the respective latent value, resulting in a respective embedding of each of the respective confounders; f) comprises combining the embeddings of all the identified parent variables and confounders together, thereby resulting in said combined embedding; and g) comprises inputting the combined embedding into a respective decoder associated with the selected variable, resulting in a reconstructed value of the selected variable.
3. The method of claim 2, comprising: i) for a given training data point comprising a given combination of input values of the variables of the feature vector, repeating a)-g) over multiple selections, wherein in each selection e) comprises, for each identified confounded pair, inputting the input value of the respective parent variable and the input value of the selected variable into the inference network to thereby result in the respective latent value, and wherein each of the multiple selections selects a different one of the variables of the feature vector as the selected variable, thereby resulting in a respective reconstructed value, the multiple selections together thereby resulting in a respective reconstructed version of the training data point comprising the reconstructed values for the training data point; ii) evaluating a measure of difference between the training data point and the reconstructed version; and iii) training parameters of the inference network, first and second encoders, decoders, and first and second graph distributions, based on the evaluated measure.

4. The method of claim 3, comprising repeating i)-iii) over multiple input data points, each comprising a different combination of input values of the variables of the feature vector.

5. The method of claim 4, comprising, after the training over the multiple training data points, observing the second graph distribution to estimate whether a confounder exists between a target pair of the variables of the feature vector.

6. The method of claim 5, comprising, after the training over the multiple training data points, observing the latent value of the respective confounder between the target pair of variables, resulting from the inference network, as an estimated value of the respective confounder.

7. The method of claim 5, wherein the variables of the feature vector model properties of a real-world entity or an environment thereof, and the method comprises, in dependence on said estimating of whether a confounder exists between the target pair of variables, determining whether or not to apply a treatment to one of said target pair of variables to affect the other of said target pair of variables.

8. The method of claim 7, wherein one of: the real-world entity comprises a living being, and the treatment comprises a medical treatment to the living being or the environment thereof; or the real-world entity comprises a mechanical, electrical or electronic device or system, or an environment thereof, and the treatment comprises an act of maintaining, debugging, upgrading or controlling the device or system, or controlling the environment thereof; or the real-world entity comprises a network or software, and the treatment comprises an act of controlling the network or software.

9. The method of claim 2, comprising: I) setting the input value of an intervened-on variable of the feature vector, other than the selected variable, to a specified value; and II) estimating an effect of the intervened-on variable on the selected variable based on the reconstructed value of the selected variable.

10. The method of claim 9, wherein: I) comprises setting the input value of a plurality of intervened-on variables of the feature vector, other than the selected variable, to respective specified values; and II) comprises estimating the effect of the plurality of intervened-on variables based on the reconstructed value of the selected variable.
11. The method of claim 9, wherein: I) comprises repeating a)-g) over multiple graph-sampling events, each using the same selected variable and setting the intervened-on variable to the same specified value, but performing the sampling of the first and second causal graphs afresh each time, thereby resulting in a respective reconstructed value of the selected variable from each graph-sampling event; and II) comprises determining an average treatment effect of the intervened-on variable on the selected variable averaged over the multiple graph-sampling events.

12. The method of claim 11, wherein each graph-sampling event sets a plurality of intervened-on values to specified values, the same values each time; and II) comprises determining the average treatment effect of the plurality of intervened-on variables on the selected variable.

13. The method of claim 9, wherein the intervened-on variable models a treatment on a real-world entity or an environment thereof and the target variable models an effect of the treatment applied to the real-world entity, and the method further comprises actioning the treatment on the real-world entity in dependence on the estimated treatment effect.

14. The method of claim 13, wherein one of: the real-world entity comprises a living being, and the treatment comprises a medical treatment to the living being or an environment thereof; or the real-world entity comprises a mechanical, electrical or electronic device or system, or an environment thereof, and the treatment comprises an act of maintaining, debugging, upgrading or controlling the device or system, or controlling the environment thereof; or the real-world entity comprises a network or software, and the treatment comprises an act of controlling the network or software.

15. The method of claim 2, wherein one of:
- the inference network comprises a respective constituent inference network for each pair of variables in the feature vector, and e) comprises: for each identified confounded pair, inputting the input value of at least the respective parent variable into the respective inference network for the respective confounded pair, thereby resulting in the respective latent value of the respective confounder; or
- the inference network comprises a common inference network operable to encode all of the variables of the feature vector together into the respective embedding for each identified confounded pair.

16. The method of claim 2, wherein each of the first and second encoders and each of the decoders comprises a neural network.

17. The method of claim 1, wherein the inference network comprises a neural network.

18. The method of claim 9, wherein the method is performed on a server system of a first party, the server system comprising one or more server units at one or more sites; and the method further comprises, by the server system of the first party: providing an application programming interface, API, enabling a second party to contact the server system via a network; receiving a request from the second party over the network via the API; in response to the request, determining the estimated treatment effect on the target variable; and returning the estimated treatment effect to the second party over the network via the API.
19. A computer program embodied on non-transitory computer-readable storage, wherein the computer program comprises a machine learning model comprising a plurality of first encoders, a plurality of second encoders, a decoder and an inference network; and wherein the computer program further comprises instructions configured so as when run on one or more processors to perform a method comprising: a) selecting a selected variable from among variables of a feature vector; b) sampling a first causal graph from a first graph distribution and sampling a second causal graph from a second graph distribution, the first causal graph modelling causation between variables in the feature vector, and the second causal graph modelling presence of possible confounders, a confounder being an unobserved cause of both of two variables in the feature vector; c) from among the variables of the feature vector, identifying each respective parent variable being a cause of the selected variable according to the first causal graph, and from among the variables of the feature vector, identifying each confounded pair of variables that comprises the selected variable and a respective one of the identified parent variables according to the second causal graph; d) inputting an input value of each identified parent variable into a respective one of the first encoders for the respective parent variable, resulting in a respective embedding of each respective parent variable; e) for each identified confounded pair, inputting the input value of at least the respective parent variable into the inference network, resulting in a respective latent value of each respective confounder, and inputting each of the latent values into a respective one of the second encoders for the respective latent value, resulting in a respective embedding of each respective confounder; f) combining the embedding of the parent variable with the embedding of the confounder of the confounded pair, resulting in a combined embedding; and g) inputting the combined embedding into the decoder, resulting in a reconstructed value of the selected variable.
20. A system comprising: processing apparatus comprising one or more processors; and memory comprising one or more memory units, wherein the memory stores:
- a machine learning model comprising a plurality of first encoders, a plurality of second encoders, a plurality of decoders, and an inference network; and
- code arranged to run on the processing apparatus and being configured so as when run to perform a method comprising: a) selecting a selected variable from among variables of a feature vector; b) sampling a first causal graph from a first graph distribution and sampling a second causal graph from a second graph distribution, the first causal graph modelling causation between variables in the feature vector, and the second causal graph modelling presence of possible confounders, a confounder being an unobserved cause of both of two variables in the feature vector; c) from among the variables of the feature vector, identifying each respective parent variable being a cause of the selected variable according to the first causal graph, and from among the variables of the feature vector, identifying each confounded pair of variables that comprises the selected variable and a respective one of the identified parent variables according to the second causal graph; d) inputting an input value of each identified parent variable into a respective one of the first encoders for the respective parent variable, thereby resulting in a respective embedding of each respective parent variable; e) for each identified confounded pair, inputting the input value of at least the respective parent variable, and optionally an input value of the selected variable and/or one or more other variables of the feature vector, into the inference network, thereby resulting in a respective latent value of each respective confounder, and inputting each of the latent values into a respective one of the second encoders for the respective latent value, thereby resulting in a respective embedding of each respective confounder; f) combining the embeddings of the identified parent variables and the embeddings of the confounders of the identified confounded pairs, resulting in a combined embedding; and g) inputting the combined embedding into a respective one of the decoders associated with the selected variable, resulting in a reconstructed value of the selected variable.
Description:
MODELLING CAUSATION IN MACHINE LEARNING

BACKGROUND

Neural networks are used in the field of machine learning and artificial intelligence (AI). A neural network comprises a plurality of nodes which are interconnected by links, sometimes referred to as edges. The input edges of one or more nodes form the input of the network as a whole, and the output edges of one or more other nodes form the output of the network as a whole, whilst the output edges of various nodes within the network form the input edges to other nodes. Each node represents a function of its input edge(s) weighted by a respective weight, the result being output(s) on its output edge(s). The weights can be gradually tuned based on a set of training data so as to tend towards a state where the network will output a desired value for a given input. Typically the nodes are arranged into layers with at least an input and an output layer. A "deep" neural network comprises one or more intermediate or "hidden" layers in between the input layer and the output layer. The neural network can take input data and propagate the input data through the layers of the network to generate output data. Certain nodes within the network perform operations on the data, and the result of those operations is passed to other nodes, and so on.

Each node is configured to generate an output by carrying out a function on the values input to that node. The inputs to one or more nodes form the input of the neural network, the outputs of some nodes form the inputs to other nodes, and the outputs of one or more nodes form the output of the network. At some or all of the nodes of the network, the input to that node is weighted by a respective weight. A weight may define the connectivity between a node in a given layer and the nodes in the next layer of the neural network. A weight can take the form of a scalar or a probabilistic distribution. When the weights are defined by a distribution, as in a Bayesian model, the neural network can be fully probabilistic and captures the concept of uncertainty. The values of the connections between nodes may also be modelled as distributions. The distributions may be represented in the form of a set of samples or a set of parameters parameterizing the distribution (e.g. the mean μ and standard deviation σ or variance σ²).

The network learns by operating on data input at the input layer, and adjusting the weights applied by some or all of the nodes based on the input data. There are different learning approaches, but in general there is a forward propagation through the network from input layer to output layer, a calculation of an overall error, and a backward propagation of the error through the network from output layer to input layer. In the next cycle, each node takes into account the back-propagated error and produces a revised set of weights. In this way, the network can be trained to perform its desired operation. Training may employ a supervised approach based on a set of labelled training data. Other approaches are also possible, such as a reinforcement approach wherein each data point is not initially labelled: the learning algorithm begins by guessing the corresponding output for each point, and is then told whether it was correct, gradually tuning the weights with each such piece of feedback. Another example is an unsupervised approach where input data points are not labelled at all and the learning algorithm is instead left to infer its own structure in the experience data.
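As a concrete illustration of the training cycle just described (forward propagation, error calculation, backward propagation, weight update), the following is a minimal sketch in PyTorch; the architecture, data and learning rate are arbitrary placeholders, not anything specified by the present disclosure.

```python
import torch
import torch.nn as nn

net = nn.Sequential(             # input layer -> hidden layer -> output layer
    nn.Linear(3, 8), nn.ReLU(),  # weights define connectivity between layers
    nn.Linear(8, 1),
)
opt = torch.optim.SGD(net.parameters(), lr=0.01)

x = torch.randn(32, 3)           # a batch of training inputs
y = x.sum(dim=1, keepdim=True)   # toy labels for a supervised approach

for _ in range(100):
    pred = net(x)                       # forward propagation
    loss = ((pred - y) ** 2).mean()     # overall error
    opt.zero_grad()
    loss.backward()                     # backward propagation of the error
    opt.step()                          # produce a revised set of weights
```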
Other forms of machine learning model are also known, other than just neural networks, for example clustering algorithms, random decision forests, and support vector machines, or any other form of parametric or non-parametric function.

Some machine learning models can be designed to perform causal discovery using observational data or both observational and interventional data. That is, for a set of variables (e.g. [x1, x2, x3]), the model when trained can estimate a likely causal graph describing the causal relationships between these variables. E.g. in the case of three variables a simple causal graph could be x1 → x2 → x3, meaning that x1 causes x2 and x2 causes x3 (put another way, x3 is an effect of x2 and x2 is an effect of x1). Or as another example, x1 → x2 ← x3 means that x1 and x3 are both causes of x2 (x2 is an effect of x1 and x3). However, in the past such models were only used for causal discovery alone, not for treatment effect estimation for decision making. Another type of model aims to perform treatment effect estimation, which commonly assumes that the causal graph is already given by the user. Such methods did not previously work with unknown causal graphs.

"Deep End-to-End Causal Inference" [DECI] (Geffner et al, Microsoft Research, https://arxiv.org/pdf/2202.02195.pdf) appreciated that in many applications, users would benefit from the ability to perform treatment effect estimation for decision making with observational data only, without needing to know the causal graph. Accordingly, Geffner et al provided an integrated machine learning model that both models the causal relationships between variables and performs treatment effect estimation, by averaging over multiple possible causal graphs sampled from a distribution. This advantageously allows the method to exploit a model that has been trained for causal discovery in order to also estimate treatment effects. The method thus enables "end-to-end" causal inference.
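By way of illustration only, a causal graph such as x1 → x2 → x3 can be held as a binary adjacency matrix from which the parents (direct causes) of any variable can be read off; the representation and helper below are a sketch, not the patent's notation.

```python
import numpy as np

# x1 -> x2 -> x3, with A[k, l] = 1 meaning xk causes xl
A = np.array([
    [0, 1, 0],   # x1 causes x2
    [0, 0, 1],   # x2 causes x3
    [0, 0, 0],
])

def parents(A, i):
    """Indices of the direct causes (parents) of variable i."""
    return np.flatnonzero(A[:, i])

print(parents(A, 1))   # [0]: x1 is the parent of x2
print(parents(A, 2))   # [1]: x2 is the parent of x3
```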
SUMMARY

It is recognized herein that there is still further scope to improve on the DECI model developed by Geffner et al, or the like. Particularly, the possible presence of hidden confounders has the potential to bias existing models. Consider two variables x1 and x2 and a model which is attempting to discover whether there is a causal edge between them, or to predict a treatment effect based on the modelled causation. A hidden confounder is a third variable which is not observed and which is a cause of both x1 and x2. This could lead to a false conclusion that x1 is the cause of x2 (or vice versa) when in fact there is no causal link (or a weaker causal link) and instead both x1 and x2 are an effect of a common causal variable u12 (a confounder) which is not present in the input data (i.e. not one of the variables of the input feature vector). For instance in the healthcare field x1 could measure the presence or absence of a certain condition (e.g. disease) in a subject, whilst x2 could be a lifestyle factor such as an aspect of the subject's diet (e.g. salt or fat intake, etc.) or whether they are a smoker, and the confounder u12 could be a measure of a socioeconomic circumstance of the patient (e.g. annual income). Ignoring the socioeconomic circumstance may give the false impression that the lifestyle factor x2 causes the condition x1, whereas in fact the ground truth is that both x1 and x2 are effects of the socioeconomic circumstance.

To address such possibilities, the present disclosure provides a machine learning model which models the possibility of hidden confounders as latent variables.

According to one aspect disclosed herein, there is provided a computer-implemented method comprising: a) selecting a selected variable from among variables of a feature vector; and b) sampling a first causal graph from a first graph distribution and sampling a second causal graph from a second graph distribution. The first causal graph models causation between variables in the feature vector, and the second causal graph models presence of possible confounders, a confounder being an unobserved cause of both of two variables in the feature vector. The method further comprises: c) from among the variables of the feature vector, identifying a parent variable which is a cause of the selected variable according to the first causal graph, and which together with the selected variable forms a confounded pair having a respective confounder being a cause of both according to the second causal graph. The method then comprises: d) inputting an input value of the parent variable into a first encoder, resulting in a respective embedding of the parent variable; and e) inputting the input value of at least the parent variable (and optionally an input value of the selected variable and/or one or more others of the variables of the feature vector) into an inference network, resulting in a latent value modelling the respective confounder of the confounded pair. The latent value is input into a second encoder, resulting in an embedding of the confounder of the confounded pair. The method then further comprises: f) combining the embedding of the parent variable with the embedding of the confounder of the confounded pair, resulting in a combined embedding; and g) inputting the combined embedding into a decoder, resulting in a reconstructed value of the selected variable.

The model may advantageously be used to discover the likely presence of hidden confounders, and optionally estimate their values. Alternatively or additionally the model may be used to perform treatment effect estimation which takes into account the possibility of hidden confounders in the causal graph.
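The following is a minimal sketch of steps d)-g) for a single parent and its confounded pair, using hypothetical linear modules standing in for the first encoder, inference network, second encoder and decoder (all names and shapes are assumptions, not part of the disclosure).

```python
import torch
import torch.nn as nn

emb = 16
first_encoder  = nn.Linear(1, emb)   # encodes the parent's value         (step d)
inference_net  = nn.Linear(1, 1)     # infers the confounder's latent     (step e)
second_encoder = nn.Linear(1, emb)   # encodes the confounder latent      (step e)
decoder        = nn.Linear(emb, 1)   # reconstructs the selected variable (step g)

x_parent = torch.tensor([[0.7]])     # input value of the identified parent

e_parent = first_encoder(x_parent)   # d) embedding of the parent variable
u_latent = inference_net(x_parent)   # e) latent value modelling the confounder
e_conf   = second_encoder(u_latent)  # e) embedding of the confounder
e_comb   = e_parent + e_conf         # f) combine (here: sum) the embeddings
x_recon  = decoder(e_comb)           # g) reconstructed value of the selected variable
```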
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Nor is the claimed subject matter limited to implementations that solve any or all of the disadvantages noted herein.

BRIEF DESCRIPTION OF THE DRAWINGS

To assist understanding of embodiments of the present disclosure and to show how such embodiments may be put into effect, reference is made, by way of example only, to the accompanying drawings in which:
Figure 1 is a schematic block diagram of a system in accordance with embodiments disclosed herein;
Figure 2 is a schematic computation diagram illustrating an example of a DECI type machine learning model that performs end-to-end causal inference but does not account for latent confounders;
Figure 3 schematically illustrates an example of a causal graph;
Figure 4 is a schematic sketch of an example of a probabilistic distribution;
Figure 5 schematically illustrates another example of a causal graph including a hidden cause;
Figure 6 is a schematic computation diagram illustrating a further machine learning model including an inference network for conditional treatment effect estimation;
Figure 7 is a schematic flowchart of a method of training a model;
Figure 8 is a schematic flowchart of a method of making treatment effect estimations using a trained model, by selecting a target variable with index i and an intervened-on variable (i.e. the treatment);
Figure 9 schematically illustrates a causal graph (or part thereof) including a possible confounder between a pair of variables;
Figure 10 is a schematic computation diagram illustrating a machine learning model that models the possible presence of hidden confounders in accordance with embodiments disclosed herein; and
Figure 11 is a schematic flow chart showing the core functionality of the model of Figure 10.

DETAILED DESCRIPTION OF EMBODIMENTS

EXAMPLE SYSTEM OVERVIEW

Figure 1 illustrates an example system according to embodiments of the present disclosure. The system comprises a server system 102 of a first party, a network 112, and a client computer 114 of a second party. The server system 102 and client computer 114 are both operatively coupled to the network 112 so as to be able to communicate with one another via the network 112. The network 112 may take any suitable form and may comprise one or more constituent networks, e.g. a wide area network such as the Internet or a mobile cellular network, a local wired network such as an Ethernet network, or a local wireless network such as a Wi-Fi network, etc.

The server system 102 comprises processing apparatus comprising one or more processing units, and memory comprising one or more memory units. The, or each, processing unit may take any suitable form, e.g. a general purpose processor such as a CPU (central processing unit); or an accelerator processor or application specific processor such as a dedicated AI accelerator processor or a repurposed GPU (graphics processing unit), DSP (digital signal processor), or cryptoprocessor, etc. The, or each, memory unit may also take any suitable form, e.g. an EEPROM, SRAM, DRAM or solid state drive (SSD); a magnetic memory such as a magnetic disk or tape; or an optical medium such as an optical disk drive, quartz glass storage or magneto-optical memory; etc. In the case of multiple processing units and/or memory units, these may be implemented in the same physical server unit, or different server units in the same rack or different racks, or in different racks in the same data centre or different data centres at different geographical sites. In the case of multiple server units, these may be networked together using any suitable networking technology such as a server fabric, an Ethernet network, or the Internet, etc.
Distributed computing techniques are, in themselves, known in the art.

The memory of the server system 102 is arranged to store a machine learning (ML) model 104, a machine learning algorithm 106, training data 108, and an application programming interface (API) 110. The ML model 104, ML algorithm 106 and API 110 are arranged to run on the processing apparatus of the server system 102. The ML algorithm 106 is arranged so as, when run, to train the ML model 104 based on the training data 108. Once the model 104 is trained, the ML algorithm 106 may then estimate treatment effects based on the trained model. In some cases, after the ML model 104 has been trained based on an initial portion of training data 108 and been made available for use in treatment effect estimation, training may also continue in an ongoing manner based on further training data 108, e.g. which may be obtained after the initial training.

The API 110, when run, allows the client computer 114 to submit a request for treatment effect estimation to the ML algorithm 106. The ML model 104 is a function of a plurality of variables. The request may specify a target variable to be examined, and may supply input values of one or more other variables (including intervened-on values and/or conditioned values). In response, the ML algorithm 106 may control the ML model 104 to generate samples of the target variable given the intervened-on and/or conditioned values of the one or more other variables. The API 110 returns the result of the requested causal query (the estimated treatment effect) to the client computer 114 via the network 112. In embodiments, the API may also allow the client computer to submit some or all of the training data 108 for use in the training.
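Purely by way of illustration, a causal query submitted via the API 110 might carry a target variable, intervened-on values and conditioned values along the following lines; the field names and transport format here are assumptions, not part of the disclosure.

```python
import json

# Hypothetical request body for a causal query submitted via the API 110.
request = {
    "target_variable": "blood_pressure",          # the target to be examined
    "interventions": {"medication_dose": 10.0},   # intervened-on value(s)
    "conditions": {"age": 52},                    # conditioned value(s), optional
}
payload = json.dumps(request)        # sent from the client computer 114

# The server system 102 would run the query against the trained model 104
# and return an estimate, e.g.:
response = {"estimated_treatment_effect": 0.42}
```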
CAUSAL MACHINE LEARNING MODEL

Figure 2 schematically illustrates an example implementation of a machine learning model 104 of a type disclosed in "Deep End-to-End Causal Inference" (DECI) by Geffner et al. The ML model 104 comprises a respective encoder g^e_i and a respective decoder g^d_i for each of a plurality of variables xi, where i is an index running from 1 to D, where D > 1. The set of variables x1…xD may be referred to as the input feature vector. Each variable represents a different property of a subject being modelled. The subject may for example be a real-life entity, such as a human or other living being; or an object such as a mechanical, electrical or electronic device or system, e.g. industrial machinery, a vehicle, a communication network, or a computing device etc.; or a piece of software such as a game, operating system software, communications software, networking software, or control software for controlling a vehicle or an industrial processor or machine. Each property could be an inherent property of the being or object, or an environment of the being or object which may affect the being or object or be affected by the being or object.

For instance in a medical example, the variables represent different properties of a person or other living being (e.g. animal). One or more of the variables may represent a symptom experienced by the living being, e.g. whether the subject is exhibiting a certain condition such as a cough, sore throat, difficulty breathing, etc. (and perhaps a measure of degree of the condition), or a measured bodily quantity such as blood pressure, heart rate, vitamin D level, etc. One or more of the variables may represent environmental factors to which the subject is exposed, or behavioural factors of the subject, such as whether the subject lives in an area of high pollution (and perhaps a measure of the pollution level), or whether the subject is a smoker (and perhaps how many per day), etc. And/or, one or more of the variables may represent inherent properties of the subject such as a genetic factor.

In the example of a device, system or software, one or more of the variables may represent an output state of the device, system or software. One or more of the variables may represent an external factor to which the device, system or software is subjected, e.g. humidity, vibration, cosmic radiation, and/or a state of one or more input signals. For instance the variables may comprise a control signal, and/or an input image or other sensor data captured from the environment. Alternatively or additionally, one or more of the variables may represent an internal state of the device, system or software, e.g. an error signal, resource usage, etc. One, more or all of the variables may be observed or observable. In some cases, one or more of the variables may be unobserved or unobservable.

Each respective encoder g^e_i is arranged to receive an input value of its respective variable xi, and to generate a respective embedding ei (i.e. a latent representation) based on the respective input value. As will be familiar to a person skilled in the art, an embedding, or latent representation or value, is in itself a known concept. It represents the information in the respective input variable in an abstracted form, typically a compressed form, which is learned by the respective encoder during training. The embedding may be a scalar value, or may be a vector of dimension `embedding_dim` which is greater than 1. Note that a "value" as referred to herein could be a vector value or a scalar value. For example if one of the variables is an image (e.g. a scan of the subject), then the "value" of this vector variable is the array of pixel values for the image.

The different variables xi may have a certain causal relationship between them, which may be expressed as a causal graph. A causal graph may be described as comprising a plurality of nodes and edges (note that these are not the same thing as the nodes and edges mentioned earlier in the context of a neural network). Each node represents a respective one of the variables xi in question. The edges are directional and represent causation. I.e. an edge from x_{i=k} to x_{i=l} represents that xk causes xl (xl is an effect of xk). A simple example involving three variables is shown in Figure 3. In this example x2 causes x1, and x1 causes x3. For example x3 may represent having a respiratory virus, x1 may represent a lung condition and x2 may represent a genetic predisposition. Of course other possible graphs are possible, including those with more variables and other causal relationships.

A given graph G may be expressed as a matrix, with two binary elements for each possible pair of variables x_{i=k}, x_{i=l}: one binary element to represent whether an edge exists between the two variables, and one to represent the direction of the edge if it exists (e.g. 0 = no edge from k to l, 1 = edge from k to l, or vice versa). For example for three variables this could be written as:

G = ( exists1,2  dir1,2
      exists1,3  dir1,3
      exists2,3  dir2,3 )

Other equivalent representations are possible.
E.g. an alternative representation of the existence and direction of edges in a graph would be a single D×D matrix in which one element per ordered pair (k, l) indicates the presence of an edge directed from xk to xl.

For any given situation, the actual causal graph may not be known. A distribution qφ of possible graphs may be expressed in a similar format to G, but with each element comprising a parameter φ (phi) representing a probability instead of a binary value:

qφ = ( φ_exists1,2  φ_dir1,2
       φ_exists1,3  φ_dir1,3
       φ_exists2,3  φ_dir2,3 )

(or an equivalent representation). In other words the parameter φ_exists1,2 represents the probability that an edge between x1 and x2 exists; φ_dir1,2 represents the probability that the direction of the possible edge between x1 and x2 is directed from x1 to x2 (or vice versa); φ_exists1,3 represents the probability that an edge between x1 and x3 exists; etc.

Returning to Figure 2, the ML model 104 further comprises a selector, a combiner 202 and a demultiplexer 204. It will be appreciated that these are schematic representations of functional blocks implemented in software. The selector is operable to sample a causal graph G from the distribution. This means selecting a particular graph G (with binary elements) whereby the existence and direction of the edges are determined pseudorandomly according to the corresponding probabilities in the distribution qφ. Preferably the possible graphs are constrained to being directed acyclic graphs (DAGs), for the sake of practicality and simplicity of modelling. The selector also receives a value of the index i for a selected target variable xi. For the currently selected variable xi (node of index i in the graph), the selector selects the respective embeddings ePa(i) generated by the respective encoders g^e_Pa(i) of the parents Pa(i) of the node i (variable xi) in the currently sampled graph G, and inputs these into the combiner 202.

The combiner 202 combines the selected embeddings ePa(i) into a combined embedding ec. In embodiments the combination is a sum. In general a sum could be a positive or negative sum (a subtraction would be a sum with negative weights). The combined (summed) representation thus has the same dimension as a single embedding, e.g. `embedding_dim`. In alternative implementations however it is not excluded that another form of combination could be used, such as a concatenation. The demultiplexer 204 also receives the index i of the currently selected variable xi, and supplies the combined embedding ec into the input of the decoder g^d_i associated with the currently selected variable xi. This generates a value of a respective noiseless reconstructed version xi' of the respective variable xi based on the combined embedding ec.

The encoders g^e_{i=1…D} and decoders g^d_{i=1…D} are constituent machine learning models comprised by the overall ML model 104. They are each parameterized by respective sets of parameters θ which are tuned during learning. In embodiments, each of the encoders and decoders is a respective neural network (in which case their parameters may be referred to as weights). However it is not excluded that some or all of them could instead be implemented with another form of constituent machine learning model.
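A minimal sketch of how the selector might sample a graph G from qφ, assuming one existence probability and one direction probability per pair, with a naive resampling scheme standing in for the DAG constraint (the disclosure does not specify the constraint mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)
pairs = [(0, 1), (0, 2), (1, 2)]          # the pairs (k, l) for three variables
phi_exists = {p: 0.5 for p in pairs}      # probability that an edge exists
phi_dir    = {p: 0.5 for p in pairs}      # probability that the edge runs k -> l

def sample_graph():
    A = np.zeros((3, 3), dtype=int)
    for (k, l) in pairs:
        if rng.random() < phi_exists[(k, l)]:   # does the edge exist?
            if rng.random() < phi_dir[(k, l)]:  # which way does it point?
                A[k, l] = 1
            else:
                A[l, k] = 1
    return A

def is_dag(A):
    # Acyclic iff trace(A^p) == 0 for all powers p = 1..D.
    M, D = A.copy(), A.shape[0]
    for _ in range(D):
        if np.trace(M):
            return False
        M = M @ A
    return True

G = sample_graph()
while not is_dag(G):      # rejection keeps only DAG samples
    G = sample_graph()
```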
TRAINING

Figure 7 schematically represents a method of training the DECI type model 104 based on the training data 108. The training data 108 comprises a plurality of training data points. Each data point [x1 … xD] comprises a set of input values, one respective value for each of the variables xi. The method processes each of the training data points, at least in an initial portion of the training data 108.

At step S10 the method takes a new data point from memory, and at step S20 the index i is set to a first value (e.g. i=1) to specify a first target variable xi from among the set of variables in the data point. At step S30, this causes the selector to select the target variable xi with the currently set value of the index i as the variable to be processed. At step S40 the selector samples a random graph G from the distribution qφ. At step S50, the selector selects the parents Pa(i) of the target variable xi (node i in the graph) and supplies the respective embeddings ePa(i) from the encoders g^e_Pa(i) of the selected parents into the combiner 202. At step S60, the combiner 202 combines (e.g. sums) the embeddings of the selected parents Pa(i) into the combined embedding ec, and the demultiplexer 204 selects to supply the combined embedding ec into the decoder g^d_i of the target variable xi. The respective decoder g^d_i is thus caused to generate a noiseless reconstruction xi' of the selected target variable xi. This could equally be expressed as:

xi' = g^d_i( Σ_{j∈Pa(i)} g^e_j(xj) )

where Σ_{j∈Pa(i)} is another way of writing the operations of the selector and combiner 202. Preferably, G is a DAG (directed and acyclic) so that it is valid to generate the value of any node in this way. The difference xi − xi' may be referred to as the residual.

The process is repeated for each variable xi in the set of variables, i.e. for i=1…D, thus creating a full reconstructed set of values [x1' … xD'] for the data point in question. This is represented schematically by steps S70 and S80 in Figure 7, looping back to S30. However note that the "loop" may be a somewhat schematic representation. In practice some or all of the different variables xi for a given data point could be processed in parallel.

At step S90 the ML algorithm 106 applies a training function which updates the parameters (e.g. weights) θ of the encoders g^e_{i=1…D} and decoders g^d_{i=1…D}. Simultaneously it also updates the parameters φ of the distribution qφ of possible graphs. The training function attempts to update the parameters θ, φ in such a way as to reduce a measure of overall difference between the set of input values of the input variables [x1 … xD] and the reconstructed version of those variables [x1' … xD']. For example in embodiments the training function is an evidence lower bound (ELBO) function. Training techniques of this kind are, in themselves, known in the art, e.g. based on stochastic back propagation and gradient descent. During the training phase, xi is an observed data value that is attempted to be reconstructed. The residual xi − xi' between the reconstruction and the observation may be used to compute the ELBO objective.

In embodiments, the model 104 may be a probabilistic model that assigns a joint probability (likelihood) p(X1,…,XD) to all the variables X1,…,XD, where Xi represents the noiseless reconstruction xi' combined with a random noise term zi, e.g. additively (Xi = xi' + zi), and zi may be sampled from a random distribution, zi ~ p(zi).
In such embodiments, in order to train the model 104, one first collects a real-world data point with D-dimensional features (x1, x2,…,xD), and then maximizes the likelihood of it being generated by the model. That is to say, the training operates to try to maximize p(X1 = x1, X2 = x2,…,XD = xD). This is done by adjusting the parameters (e.g. weights) θ of the model 104. The ELBO mentioned previously is an example of a numerical approximation to the quantity p(X1 = x1, X2 = x2,…,XD = xD). The ELBO is typically much easier to compute than p(X1 = x1, X2 = x2,…,XD = xD).

The above-described process is repeated for each of the data points in the training data 108 (or at least an initial portion of the training data). This is represented schematically in Figure 7 by step S100 and the loop back to S10. Though again, this form of illustration may be somewhat schematized and in some implementations, batches of data points could be reconstructed in parallel, and the model parameters updated based on the batches.

At the beginning of the overall training method the graph distribution qφ may start with some predetermined set of probabilities, e.g. all elements 0.5 for the existence and direction of each possible edge, or using some prior domain knowledge to inform which edges are possible or impossible, or more or less likely. Then, as the model is trained with more and more data points, the probabilities (i.e. the parameters φ of qφ) are gradually learned in parallel with the parameters θ (e.g. weights) of the encoders g^e_{i=1…D} and decoders g^d_{i=1…D}.
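Putting the pieces together, the following sketch mimics the loop of Figure 7 for a single fixed sampled graph, adjusting encoder/decoder parameters θ to shrink the residuals xi − xi'. It stands in a squared-error loss for the Gaussian likelihood term and omits the graph-distribution parameters φ and the per-step resampling of G, so it is illustrative only; all module shapes are assumptions.

```python
import torch
import torch.nn as nn

D, emb = 3, 16
encoders = nn.ModuleList([nn.Linear(1, emb) for _ in range(D)])  # g^e_i, i=1...D
decoders = nn.ModuleList([nn.Linear(emb, 1) for _ in range(D)])  # g^d_i, i=1...D
opt = torch.optim.Adam(
    list(encoders.parameters()) + list(decoders.parameters()), lr=1e-3)

G = torch.tensor([[0, 1, 0],   # a sampled graph: x1 -> x2 -> x3 (held fixed
                  [0, 0, 1],   # here; the full method resamples G from q_phi)
                  [0, 0, 0]])

def reconstruct(x):            # x: one data point, shape (D,)
    recon = []
    for i in range(D):         # steps S30-S80: loop over target variables
        pa = torch.nonzero(G[:, i]).flatten().tolist()
        if not pa:             # a root node has no parents to reconstruct from
            recon.append(torch.zeros(1))
            continue
        e_c = sum(encoders[j](x[j:j + 1]) for j in pa)  # select + combine (sum)
        recon.append(decoders[i](e_c))                  # noiseless reconstruction x_i'
    return torch.cat(recon)

for x in torch.randn(100, D):            # loop over training data points (S10, S100)
    residual = x - reconstruct(x)        # x_i - x_i'
    loss = (residual ** 2).mean()        # squared error as a stand-in likelihood term
    opt.zero_grad(); loss.backward(); opt.step()   # step S90: update theta
```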
TREATMENT EFFECT ESTIMATION

Once the DECI model 104 has been trained sufficiently based on all the data points in the training data 108 (or at least an initial portion of the initial training data), then at step S110 the trained ML model 104 is made available to be used for treatment effect estimation (i.e. answering causal queries). In embodiments, this may comprise making the model 104 available via the API 110 to estimate treatment effects / answer causal queries requested by the client computer 114.

A method of using the ML model 104 to estimate a treatment effect is shown schematically in Figure 8. At step T10 the index i is set to that of a target variable xi whose treatment effect is to be estimated. In addition, the input values of one or more "intervened-on" variables are also set to their known values. The intervened-on variables are variables whose values are set to some specified value, to represent that the property that they represent has been controlled (the treatment, i.e. an intervention on the modelled property). An "intervened-on" variable could also be referred to as a treated variable or controlled variable. For example, in the medical application, the intervened-on variable(s) may represent one or more interventions performed on the subject, and the target variable may represent a possible symptom or condition of the subject (e.g. the presence of a certain disease). Or in the case where the subject is a device or software, the intervened-on variable(s) may represent one or more states that are set to defined values, and the target variable may represent a condition or state of the device or software that is being diagnosed.

At step T20, the selector samples a graph G pseudorandomly from the distribution qφ according to the probabilities φ. In other words the chance of there existing an edge in a given direction between each pair of nodes i=k, i=l in the sampled graph G is determined by the pair of parameters φ_exists(k,l), φ_dir(k,l). Although optionally, some of the learned probabilities could be overridden to predetermined values based on prior knowledge (e.g. in some scenarios a given causality could be ruled out, with the probability set to zero, or may be known to be unlikely, either based on a priori or empirical knowledge).

At step T30 the selector selects the parents Pa(i) of the target variable xi in the sampled graph G, and supplies the respective embeddings ePa(i) from the encoders g^e_Pa(i) of the selected parents into the combiner 202 to be combined (e.g. summed) into the combined embedding ec. The demultiplexer 204 selects to pass the combined embedding ec into the decoder g^d_i of the target variable xi, thus causing it to generate a noiseless reconstruction xi'. Thus the selector selects the parents of node i in the graph G, so it depends on both i and G. The index is set to a particular target node i that one is trying to estimate (as well as selecting a graph G from qφ), and the model 104 outputs a noiseless reconstruction of xi, denoted by xi'.

During training, such a reconstruction was formed of each node given the actual data for its parent nodes. Now at test time, the model 104 is used in a slightly different way for treatment effect estimation. Here, the target variable xi is treated as unknown, and the goal is to generate simulated values of the target variable, given other interventional variables. In some embodiments the simulated value could just be taken as the noiseless reconstructed value xi'. However for some purposes, the noiseless reconstruction xi' of the target variable xi may not be considered preferable, and instead the full interventional distribution may be taken into account. This can be characterized by the following sources of uncertainty: 1) different realizations of the graph G that could happen to have been selected during sampling (even an unlikely one), which can be modelled by simulating different graphs from qφ; and 2) the randomness of the residual noise random variables zi. The latter can be taken into account by the following simulation equation:

Xi = xi' + zi, with zi ~ p(zi)

In other words, during the treatment effect estimation phase, Xi is a simulated value that is obtained through random sampling of zi, wherein Xi = xi' + zi (except for x_{i=T}, the intervened-on variable). More generally in other possible implementations, the noise zi could be combined with the noiseless reconstruction xi' in other ways in order to obtain Xi, not necessarily additively. In the following description, reference may be made to Xi as the simulated value of target variable xi. However generally the simulated value could simply be the noiseless reconstructed value xi', or Xi (which takes into account noise, e.g. Xi = xi' + zi), or a value which is simulated based on the reconstructed value xi' in any other way. By whatever means the simulated value is determined, according to the present disclosure, an average is taken over multiple sampled graphs. In embodiments, an average is taken over multiple sampled graphs and residual noise variables zi.
In other words, for a given target variable xi to be estimated, multiple values of variable Xi are simulated based on different respective sampled graphs G, each time sampled randomly from both qφ and p(zi). This is represented schematically in Figure 8 by the loop from step T50 back to T20, to represent that multiple simulated values of xi are determined (for a given target variable xi), each determined based on a different randomly sampled graph G (each time sampled from the distribution qφ) and optionally noises zi. Again the illustration as a loop may be somewhat schematized, and in embodiments some or all of the different simulated values for a given target variable xi, based on the different sampled graphs and sampled residual noises z, may in fact be computed in parallel.

In embodiments the average could be taken as a simple mean, median or mode of the different simulated values xi' or Xi of the variable xi (as simulated with the different sampled graphs and optionally residual noises). In embodiments, such averaging is based on estimating an expectation of the probabilistic distribution of the simulated values.

Figure 4 schematically illustrates the idea of a probabilistic distribution p(xi). The horizontal axis represents the value of the target variable being simulated, xi, and the vertical axis represents the probability that the variable takes that value. If the distribution is modelled as having a predetermined form, it may be described by one or more parameters, e.g. a mean μ and standard deviation σ or variance σ² in the case of a Gaussian. Other more complex distributions are also possible, which may be parameterized by more than two parameters, e.g. a spline function. As a convenient shorthand the probabilistic distribution may be referred to as a function of the target variable xi, i.e. p(xi), in the sense that it models the distribution of the target variable xi, but in fact it will be appreciated that the distribution is determined based on the corresponding simulated values xi' or Xi.

Based on the trained model 104, it is possible using statistical methods to estimate an expectation E[p(xY)|do(xT=val1)] of the probabilistic distribution p of a particular target variable x_{i=Y} given the intervened value val1 of another, treatment/intervention variable x_{i=T}. In embodiments, the target variable xY may model an outcome of a treatment modelled by xT. Note that "treatment" as used most broadly herein does not necessarily limit to a medical treatment or a treatment of a living being, though those are certainly possible use cases. In other examples, the treatment may comprise applying a signal, repair, debugging action or upgrade to an electronic, electrical or mechanical device or system, or software, where the effect may be some state of the device, system or software which is to be improved by the treatment. In embodiments, the actual real-world treatment may be applied in dependence on the estimation (e.g. expectation) of the effect of the modelled treatment, for example on condition that the treatment estimation (e.g. expectation) is above or below a specified threshold or within a specified range. One suitable measure of expectation may be referred to as the average treatment effect (ATE).
This may be expressed as:

ATE = E[p(xY)|do(xT=val1)] − E[p(xY)|do(xT=val2)]

or

ATE = E[p(xY)|do(xT=val1)] − E[p(xY)]

which could also be expressed E[p(xY)|do(xT=val1)] − E[p(xY)|do(xT=mean(xT))], where val1 is some treatment, val2 is some other treatment, and "do" represents applying the treatment. In other words, the ATE is the difference between: a) the expectation of the distribution of the effect xY given the value val1 of a treatment xT, and b) the expectation of the distribution of the effect xY given the value val2 of a treatment xT; or the difference between a) the expectation of the distribution of the effect xY given the value val1 of a treatment xT, and b) the expectation of the distribution of the effect xY without applying a value of the treatment xT. Note that in variants the equality xT=val1 or xT=val2 could be replaced with another form of expression such as a "greater than", "less than" or range type expression.

An estimation of E[p(xY)|do(xT=treatment)] may be determined based on the modelled distributions of noise, together with the learned graph and arrow functions. This may be done using the noise distribution at node xY, and using noise samples from other upstream variables, plus knowledge of the graph and its behaviour under the intervention do(xT=treatment), in order to thus make the ATE estimation calculation.

The expectation E[p(xY)|do(xT=val)] of a target variable xY given a treatment xT=val may be computed as:

E[p(xY)|do(xT=val)] = E[ g^d_Y( Σ_{j∈Pa(Y)} g^e_j(Xj) ) + zY ]

where zY is a noise term, and E is an average over everything that is random in the equation. Consider for example a simple graph: x1 → x2 ← x3, where x1 is the treatment xT and x2 is the target variable xY. In the computation of E, x1 is set to its known value. E may then be written:

E[ g^d_Y( g^e_1(x1 = val) + g^e_3(x3) ) + zY ]

To compute E, one example method is simply to take the mean of the different sampled values xY' or XY of the target variable xY, based on the different sampled graphs, optionally also including some random noise (e.g. additive noise) in each sample. Another option is to fit the sampled values of xY to a predetermined form of distribution, such as shown in Figure 4, e.g. a Gaussian, normal or spline function (again optionally also including random noise). The average may then be determined as the average of the fitted distribution.

Note that during the determination of the multiple instances of xi' it is possible, especially for some distributions, that the same graph G ends up being sampled multiple times, or even every time. However the graph is still being freshly sampled each time, and even if the sampled graphs turn out to be the same, they may still be described as different instances of the sampled graph, in the sense that they are individually sampled anew each time. In the case where the multiple graphs happen to be the same, then the averaging only averages the noise z.

In some cases, if the treated (i.e. controlled) variable has one or more parents, the graph G may be mutilated: G → G_do(xT=val). E.g. consider the graph: x4 → x1 → x2 ← x3, where again x1 is the treatment xT and x2 is the target variable xY. In the computation of E, x1 is set to its known value. Therefore x4 has no effect on the outcome of x2. So in the determination of the expectation, the graph is mutilated to remove the node x4 and the edge from x4 → x1. In other words, fixing the value of the known (controlled) variable x1 means that any effect of the edge from the parent of the known variable x1 is removed.
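The Monte Carlo averaging described above can be sketched as follows, with a hand-written linear structural equation standing in for the trained model's simulation of XY under do(xT=val); the equations and coefficients are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(val):
    # Stand-in structural equations for the graph x1 -> x2 <- x3, with
    # x1 = x_T (intervened-on, so fixed rather than sampled) and x2 = x_Y.
    x3 = rng.normal()
    return 2.0 * val + 0.5 * x3 + rng.normal(scale=0.1)   # x_Y' + z_Y

def expectation(val, n=10_000):
    # E[p(x_Y) | do(x_T = val)], estimated by Monte Carlo averaging
    return np.mean([simulate(val) for _ in range(n)])

ate = expectation(1.0) - expectation(0.0)   # E[...|do(x_T=1)] - E[...|do(x_T=0)]
print(round(ate, 2))                        # ~2.0 for these invented equations
```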
Another type of average which may be determined according to embodiments disclosed herein may be referred to as the conditional ATE (CATE). This is the difference between: a) the expectation of the target variable xY given the treatment xT=val1, conditional on an observation γ of at least one other of the variables xC in the set, and b) the expectation of the target variable xY without the treatment (either with a different treatment val2 or no treatment), but still conditional on the same observation of xC. That is, E[p(xY)|do(xT=val1), xC=γ] − E[p(xY)|do(xT=val2), xC=γ], or: E[p(xY)|do(xT=val1), xC=γ] − E[p(xY)|xC=γ], which could also be expressed E[p(xY)|do(xT=val1), xC=γ] − E[p(xY)|do(xT=mean(xT)), xC=γ]. Note that in variants the equality xT=val1, xT=val2, or xC=γ could be replaced with another form of expression such as a “greater than”, “less than” or range type expression. Figure 5 illustrates by way of example why estimating the conditional treatment effect is not necessarily straightforward. In this example x3 is the target xY whose treatment effect will be estimated, and x4 is the treatment xT, where the treatment is a cause of the target effect. In addition, there is another, unobserved cause x1 of the target effect, and another observable effect x2 of the unobserved cause x1. The variable x2 is to be the observed condition xC. For instance, in a medical example the target effect x3 could be some condition or symptom of the subject (e.g. a respiratory problem), the treatment x4 could be a possible medical intervention (e.g. taking some drug or vitamin), the unobserved cause x1 may be a genetic factor, and the other observable effect x2 may be some observable physical quality of the subject’s body (e.g. body mass index). In general the unobserved cause could be unobservable, or merely unobserved. The observable condition x2 contains information about the unobserved cause x1, which in turn can reveal information about the desired effect x3 (=xY). For example, if it is known that an athlete is in the Olympics and that they are a footballer, this reduces the probability that they are also a rower. In fact, for any two variables that are effects of a common cause and whose intersection is not 100%, knowing something about one gives information about the other. E.g. if it is known that a subject has lung cancer and that they were exposed to a carcinogenic chemical other than tobacco, this reduces the probability that they were a smoker. However the causal direction is from x1 → x2, and the model 104 of Figure 2 is only configured to learn effects of causes in the direction from cause to effect – it is not configured to learn inferences of cause from effect. I.e. it is not configured to “go against the arrows” in the figure (the directional causal edges). To address this, as illustrated in Figure 6, in embodiments the ML model 104 may be adapted to include at least one inference network h disposed between at least one observable condition xC (x2 in the example) and at least one unobservable potential cause (x1 in the example) of the condition xC. As illustrated in Figure 6, in some such embodiments, the inference network h (or individual such networks) may be disposed between the unobserved cause and multiple potential effects (up to all the other variables). This will allow the model to learn which variable(s) may be an effect of the unobserved cause, if that relationship is not known as prior knowledge.
The inference network(s) h may be trained at the training stage simultaneously with the encoders g e and decoders g d and the parameters of the graph distribution qφ, or alternatively after the rest of the model (see below). In embodiments the inference network h may comprise a neural network, in which case training the inference network comprises tuning the weights of the inference network. Alternatively the use of other forms of machine learning is not excluded for the inference network. The inclusion of the inference network makes it possible to estimate a conditional expectation such as E[xY|do(xT=val1), xC=γ]. To estimate the conditional expectation, initially during the estimation of the target variable xY, the conditional variable xC is not fixed. In other words, the method proceeds as described above with respect to ATE, to obtain multiple different samples of xY based on multiple respective sampled graphs. At the same time, respective simulated samples of the conditional variable xC are also obtained in the same way based on the respective sampled graphs. This gives a two-dimensional set of samples (xY, xC), which could be described as a plot or “scatter plot”. Then a predetermined form of function is fitted to the 2D set of samples (xY, xC), such as a straight line, a curve, or a probabilistic distribution. After that, xC is set to its observed value in the fitted function, and a corresponding value of xY is read out from the fitted function. This is taken as the conditional expectation of xY given xC. At least two alternative variants for computing CATE may be employed, depending on implementation. Variant I: estimate CATE using the same approach as used to estimate ATE, but performing a re-weighting of the terms inside the expectations such that the condition (that gives CATE its name) is satisfied. This type of approach is known as an importance sampling technique. The weights of the different samples used to compute the expectation are provided by an inference network, which is trained together with the rest of the ML model 104. Variant II: after the model 104 of Figure 2 has been trained and a specific CATE query is received (e.g. via the API 110), the inference network h is trained to estimate the effect variable from the conditioning variable. To train this network, data simulated from the trained model of Figure 2 is used, while applying the treatment specified in the query. Then the conditional average treatment effect is estimated by inputting the relevant value of the conditioning variable into the inference network h, which returns a distribution over effects from which the expected effect can be computed. The main difference between the two approaches is that the first solely uses components learnt from the observed data during model training, while the second requires learning a new network after the model 104 has been trained. The reasoning for proposing both methods is that there are specific settings where one or the other is more computationally efficient. Note that the disclosed methods are not limited to the controlled (i.e. treated) variable xT being a direct parent of the target variable xY (i.e. the variable whose treatment effect is being estimated). The treated variable or the effect variable can be any arbitrary variable in the set xi=1…D, e.g. the treated variable could be a grandparent of the target variable xY. In embodiments, the simulation of the target variable takes into account a potential effect of all causal variables across the sampled graph.
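Purely as an illustrative sketch of the fit-and-read-out procedure described above (a straight-line fit is chosen here for simplicity, though the text also allows curves or probabilistic distributions; the hypothetical `simulate_pair` returns one (xY, xC) sample per freshly sampled graph):

```python
import numpy as np

def conditional_expectation(simulate_pair, observed_xc: float,
                            num_samples: int = 1000) -> float:
    """Estimate E[xY | do(xT=val1), xC=gamma]: simulate (xY, xC) pairs without
    fixing xC, fit a function to the 2D scatter, then read out xY at xC=gamma."""
    pairs = np.array([simulate_pair() for _ in range(num_samples)])  # shape (N, 2)
    x_y, x_c = pairs[:, 0], pairs[:, 1]
    slope, intercept = np.polyfit(x_c, x_y, deg=1)  # predetermined form: a line
    return float(slope * observed_xc + intercept)
```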
An example implementation of this propagation of effects across the whole graph is as follows. This may be used in conjunction with any of the ways of averaging discussed above, or others. The method of estimating the target variable xY (e.g. the treatment effect) may comprise an inner and an outer loop. In the inner loop, the method loops through i=1…D, so as to generate the simulated value Xi (or just xi’) of each of the input variables xi (even those that are not the target variable), except skipping over the one or more controlled (treated or known) variables xT, which are set to their known, fixed values (any edges into those nodes having been “mutilated” away, i.e. removed from the sampled graph). Then, proceeding to the next round (iteration) of the outer loop, the simulated values Xi of the non-controlled variables are fed back to the respective inputs of the model 104. In other words (other than for the one or more controlled variables xT), the simulated values Xi from the previous round or cycle (iteration) of the outer loop become the input values xi of the current iteration of the outer loop, to generate an updated set of values for the simulated variables xi. This may be repeated one or more further times, and the simulated values will start to converge (i.e. the difference between the input layer and output layer of the model 104 will get smaller each time). If noise is included, the noise is frozen throughout a given inner loop, then re-sampled each outer loop. The total number of iterations of the outer loop may be predetermined, or the outer loop may be iterated until some convergence criterion is met. In embodiments the outer loop is iterated at least D-1 times, which guarantees convergence without needing to evaluate a convergence criterion. This method advantageously allows causal effects to propagate throughout the graph. For example, if x1 causes x2 and x2 causes x3, and an intervention is performed on x1, then the outer loop will be run at least two times to propagate the effect through to x3. However the method of using an inner and outer loop is not essential. An alternative would be to perform a topological sort of the nodes and propagate effects through in a hierarchical fashion, starting from the greatest grandparents or “source nodes” (those which are only causes and not effects). Or as another alternative, though less preferred, it is not necessary to propagate effects through multiple generations, and instead only the effects of the immediate parents of the target variable may be taken into account. EXTENSION TO INCLUDE LATENT CONFOUNDERS Figure 9 schematically illustrates the issue of the presence of possible “hidden confounders”, which may also be referred to just as confounders. A confounder is a variable that is both A) unobserved (i.e. hidden, so not part of the input feature vector); and B) a cause of at least two variables that are among the variables of the input feature vector [x1…xD]. These two variables may be referred to herein as the “confounded” variables or a “confounded pair” of the respective confounder. In the example of Figure 9 the confounded variables are labelled x1 and x2, and the confounder between them is labelled u12. The confounder could be unobservable or merely unobserved. Of course the causal graph may be more complex than shown in Figure 9, and confounders may exist between more than one pair of variables of the input feature vector.
An issue recognized herein is that existing machine learning models do not take into account the presence of possible confounders, which may lead to an erroneous bias in the training of such models or in the predictions made by such models. Consider for example a scenario where x1 is a variable measuring the presence or absence of a certain condition (e.g. disease) in a subject (e.g. patient), and x2 is a variable measuring a property of the subject which may or may not affect the possible presence of the condition. For instance x2 could measure a lifestyle factor such as an aspect of the subject’s diet (e.g. salt or fat intake, etc.), whether they are a smoker, a dosage of a certain medication they are taking, or an environmental factor such as living in an area of high pollution or near an electricity pylon. Say then that the model is being used to try to learn whether there is a causal link between x1 and x2, or to predict what would be the effect on x1 of treating x2. However there may be an unobserved common cause of the two variables, i.e. a confounder. For instance, the confounder u12 could represent a factor in the socioeconomic circumstances of the patient (e.g. annual income, education or family circumstances). Ignoring the confounder (e.g. socioeconomic circumstance) may give the false impression that the lifestyle factor x2 causes the condition x1, whereas in fact the ground truth is that both x1 and x2 are effects of the socioeconomic circumstance, in which case x2 may not actually be a cause of x1, or may only be a weaker cause than it would otherwise appear. The presently disclosed extension – which in embodiments may be applied as an extension to the previous work by Geffner et al. summarised in the Background section – introduces the modelling of possible hidden confounders into a causal machine learning model by introducing a second causal graph sampled from a second graph distribution. The extended model works in a similar manner to the model 104 described in relation to Figure 2, and like elements function in the same manner except where stated otherwise. Figure 10 schematically illustrates the extension to the model on the encoder side. The decoder side works the same as described in relation to Figure 2. As illustrated in Figure 10, as in the base model 104 of Figure 2, the extended model 104’ comprises a respective first encoder g e i for each variable xi (i.e. each feature) of the input feature vector [x1…xD], each arranged to encode the value of its respective variable xi into a respective embedding ei. The extended model 104’ additionally comprises an inference network H, and a respective second encoder g e ij for each of a plurality of pairs of input variables xi, xj in the input feature vector – preferably one for each possible pair (i.e. every combination of two variables from the input vector). For each of the plurality of pairs of variables xi, xj, the inference network H is arranged to encode the respective values of one or more of the input variables x1…xD (e.g. in training, at least the two respective values of the pair together) into a respective latent value uij for the respective pair. This latent value represents a modelled value of the (possible) confounder that potentially exists between the respective pair of input variables xi, xj. The inference network H may also be a function of a pseudorandom noise term ε. These noises are included as part of the input. That is, u = H(x, ε), as shown in Figure 10.
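Purely as an illustrative toy (not the disclosed implementation: a single linear layer stands in for what would in practice be a neural network, and all names here are ours), a reparameterized Gaussian form of H, of the kind used in VAEs and mentioned later in this disclosure, could look like:

```python
import numpy as np

def make_gaussian_inference_network(rng: np.random.Generator, d: int,
                                    latent_dim: int = 1):
    """Toy H(x, eps): map the observations x to a mean and log-variance,
    then reparameterize with pseudorandom noise, u = mu(x) + sigma(x) * eps."""
    w_mu = rng.normal(size=(latent_dim, d))
    w_logvar = rng.normal(size=(latent_dim, d))

    def h(x: np.ndarray, eps: np.ndarray) -> np.ndarray:
        mu = w_mu @ x
        sigma = np.exp(0.5 * (w_logvar @ x))
        return mu + sigma * eps  # latent confounder value u

    return h
```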
In embodiments, the inference network H is implemented as one common inference network that encodes all the variables x1…xD together into the latent confounder value uij for a given pair. Alternatively it is not excluded that the inference network H could be structured as a respective constituent inference network Hij for each of the plurality of pairs of input variables xi, xj. Other internal structures of H are also possible. The extended model 104’ further comprises a respective second encoder g e ij for each respective latent value uij (corresponding to each respective pair of input variables xi, xj). Each respective second encoder g e ij is arranged to encode the respective latent value uij of the respective confounder into a respective embedding eij. The extended model 104’ comprises a selector analogous to the selector described in relation to the base model of Figure 2, but with additional functionality. As in Figure 2, a causal graph Gφ is sampled from the graph distribution qφ, and a certain variable xi is selected; the selector then selects the parent variables Pa(i) that are parents of the selected variable xi according to the sampled graph Gφ, and inputs these into the combiner 202 (e.g. adder or concatenator). The graph distribution used for this is the same as the graph distribution qφ described previously with respect to Figure 2. I.e. it comprises a matrix whose elements represent the probabilities of directed causal edges existing between the different possible pairs of variables in the input vector [x1…xD]. In the context of the extended model 104’, this may also be described as the first graph distribution, or the directed graph distribution; and the graph sampled therefrom may be referred to as the first causal graph or the directed graph. In other words: qφ = q1 = qdirected, and Gφ = G1 = Gdirected. However, in addition, the extended model 104’ also samples a second causal graph G2 from a second graph distribution q2. These may also be referred to herein as the bidirected graph distribution and the bidirected graph: q2 = qbidirected, and G2 = Gbidirected. The second graph distribution q2 is a graph distribution representing the probabilities that confounders exist between pairs of variables. It may be expressed as a matrix in which the different elements represent the probabilities that there exists a confounder between different pairs of variables xi, xj, preferably having an entry for each possible pair of variables in the input feature vector [x1…xD] (i.e. every combination of two variables from the input vector). So for a simple three-variable input vector (D=3), the second graph distribution would have three entries: ψ_exists(1,2), ψ_exists(1,3), ψ_exists(2,3), where ψ_exists(i,j) represents the probability that a confounder exists between the variables with indices i and j. More generally the second graph distribution q2 may be expressed as a symmetric matrix having an entry ψ_exists(i,j) for each pair i < j. Other equivalent representations may also be possible. To sample an actual graph G2 from the second graph distribution q2, the presence or absence of each edge in the graph G2 is determined pseudorandomly according to the probability specified in the corresponding element of the distribution q2. E.g. if ψ_exists(1,2) = 0.7, then the element (1,2) in G2 – representing whether or not a confounder u12 exists between variables x1 and x2 – has a 70% chance of being 1 (meaning confounder present) and a 30% chance of being 0 (meaning no confounder present).
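As an illustrative sketch of this sampling step (the matrix representation of q2 is as assumed above):

```python
import numpy as np

def sample_bidirected_graph(psi: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """psi[i, j] (i < j) is the probability psi_exists(i, j) that a confounder
    u_ij exists. Returns a symmetric 0/1 matrix G2: G2[i, j] == 1 means
    a confounder is present between xi and xj in this sample."""
    d = psi.shape[0]
    g2 = np.zeros((d, d), dtype=int)
    for i in range(d):
        for j in range(i + 1, d):
            g2[i, j] = g2[j, i] = int(rng.random() < psi[i, j])
    return g2

# e.g. with psi[0, 1] = 0.7, the confounder u12 appears in roughly 70% of samples
```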
Again, other equivalent mathematical expressions of the same functionality are also possible. The value of the confounder, if it exists in the sampled graph G2, is taken from the output of the inference network H. In embodiments H may be configured to model a probabilistic distribution, which may also be described as a third distribution q3, from which the values of uij are sampled. Based on the sampled graphs G1 and G2, in addition to the embeddings ei of the parents Pa(i) of the selected variable xi determined from the first causal graph G1 (=Gφ), as sampled from the first graph distribution q1 (=qφ), the selector also selects the embeddings eij of any confounded pairs that include the selected variable xi – i.e. of any confounders uij that affect the selected variable xi – as determined from the second causal graph G2 sampled from the second graph distribution q2. These embeddings eij of the selected confounders uij are input into the combiner 202 (e.g. adder or concatenator) along with the embeddings ei of the parents Pa(i) of the selected variable xi. The combiner 202 then combines (i.e. sums or concatenates, depending on implementation) all of the embeddings ei of the selected parents and the embeddings eij of the selected confounders together into a combined embedding eC. As shown in Figure 2, the combined embedding eC is then input into the decoder g d i associated with the selected variable xi, which decodes the combined embedding to produce a respective reconstructed value xi’ of the input variable xi. Optionally the pseudorandom noise zi may be added (or included in some other, non-additive way) to create the noisy reconstructed value Xi. Either may be used as the simulated value of xi. This side of the model 104’ (the decoder side) works the same as in the base model 104, as described previously with respect to Figure 2. It will be appreciated that the inference network H, the second encoders g e ij, and the selector may be implemented as modules of software in a similar manner to the other components described in relation to the model of Figure 2. As with the first encoders g e i and the decoders g d i, each of the inference network H and the second encoders g e ij may be implemented as a respective neural network; alternatively other types of constituent ML models, such as random forests or clustering algorithms, or any other form of parametric or non-parametric function, are not excluded. The core functionality of the extended model 104’ is represented in the flow chart of Figure 11. By core functionality here is meant the mechanism of the model for sampling a causal graph that includes latent confounders, identifying parents of a selected variable according to the sampled graph, including any confounders, and then combining and decoding the embeddings of the identified parents, including the embeddings of any confounders. This mechanism may be used both in training and in subsequent treatment effect estimation. At step U10 the method selects one of the variables xi from among the features of the input feature vector [x1…xD]. When used in training, this variable xi is one of the variables to be reconstructed in order to determine a loss. When used in treatment effect estimation, the selected variable xi will be the target variable, and the intervened-on variable (i.e. the variable to be treated) will also be selected separately (though may be null).
At step U20 the method samples a first causal graph G1 from the first graph distribution q1 and samples a second causal graph G2 from the second graph distribution q2. At step U30 the method determines which of the other variables [x1 … xi-1, xi+1 … xD] in the feature vector are parents Pa(i) of the selected variable xi in the sampled first graph G1. Note this may involve mutilating the graph as described previously. The method also determines which of these parents Pa(i) share a confounder uij with the selected variable xi in the sampled second graph G2. In general this could be none, one, some or all of the parents Pa(i) identified from the first graph G1. At step U40 the method generates the reconstructed value xi’ of the selected variable xi. This is done by inputting the input values of each of the selected parents Pa(i) (as determined from the first graph G1) into the respective first encoders g e i in order to generate their respective embeddings ei; and inputting at least the input value of the parent in each selected confounded pair xi, xj (as determined from the second graph G2) into the inference network H to generate the respective latent value uij, and inputting this into the respective second encoder g e ij for the respective confounder in order to generate the respective embedding eij for each respective pair of confounded variables. In some circumstances, the values input into the inference network H to generate the latent value uij for a confounded pair i,j (and in turn the respective embedding eij) may also comprise one or more additional input values in addition to that of the parent of the respective pair. E.g. in training, these one or more additional input values preferably comprise the observed input value of the selected variable as well. And in training or inference (e.g. treatment effect estimation), the one or more additional input values may comprise input values of one or more other of the variables of the feature vector [x1…xD] (other than the selected variable xi or the respective parent xj of the pair i,j), as one or more of these variables may potentially also comprise information about the possible confounder. In general, H can generate uij for the selected variable xi based on what is known about any or all of the other variables; the more that is known, the better H will predict uij. One (of many) ways to do this is to take the average over the unobserved value, e.g. H(xj) = Σxi H(xi, xj)·p(xi). So H can generate uij from only a specified value of the parent xj. H can even generate uij given nothing as input, meaning that H is just generating uij randomly from a distribution learned at training. When estimating average treatment effects this is acceptable, since one will marginalize out all the variables x over the entire population anyway. Continuing step U40, all the generated embeddings ei, eij for the selected parents and confounders are input into the combiner 202 to be combined (e.g. summed or concatenated) into the combined embedding eC, which is then decoded by the respective decoder g d i for the selected variable xi to produce the respective reconstructed value xi’. Optionally a noise term zi may be added to produce the simulated value Xi. Alternatively the noiseless reconstructed value xi’ could be used as the simulated value. It will be appreciated that Figure 11 is somewhat schematized, and all the steps of Figure 11 do not necessarily have to be performed linearly.
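By way of a highly simplified sketch of steps U20–U40 (all networks are stood in by plain callables; the names and adjacency conventions here are ours, not the disclosed implementation: g1[j, i] == 1 means xj → xi, and g2 is the symmetric confounder matrix):

```python
import numpy as np

def reconstruct_variable(i, x, g1, g2, first_encoders, second_encoders,
                         inference_net, decoders, noise=None):
    """One pass for the selected variable xi: encode parents, encode any
    confounders of confounded pairs, combine (here: sum), and decode."""
    parents = [j for j in range(len(x)) if g1[j, i]]
    embeddings = [first_encoders[j](x[j]) for j in parents]       # e_j
    for j in parents:
        if g2[i, j]:                                              # confounded pair
            u_ij = inference_net(x)                               # latent value u_ij
            embeddings.append(second_encoders[(i, j)](u_ij))      # e_ij
    combined = np.sum(embeddings, axis=0)                         # combiner 202 (adder)
    x_rec = decoders[i](combined)                                 # reconstructed xi'
    return x_rec + noise if noise is not None else x_rec          # optional noisy Xi
```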
For example, the method could identify the parents Pa(i) from the first graph and begin generating the parents’ embeddings ei before it has finished identifying all of the confounded pairs xi, xj. Or the method could begin generating embeddings for some parents while still selecting other parents from the graph G1, etc. Training of the model 104’ may be performed in an analogous manner to that already described in relation to Figures 2 and 7, by replacing steps S30-S70 in Figure 7 with steps U10-U40 from Figure 11. In other words, training begins with a given training data point comprising a set of respective values of at least some of the variables (features) of the feature vector x = [x1 … xD]. Preferably this will be values of all the variables of the feature vector if available, but the function H can take any number of observations as input, so if one or more values are unobserved then they are simply not passed through H. The method then cycles through the indices of each variable i = 1…D observed in the feature vector, and for each performs the method of Figure 11 with the current variable as the selected variable, in order to produce a respective simulated value xi’ or Xi. Over the cycle, the method will thus produce a simulated value xi’ or Xi for each variable in the feature vector (or at least those that are observed), which together thus result in a simulated feature vector x’ = [x1’ … xD’] or X = [X1 … XD]. The method then determines a measure of the discrepancy between the input feature vector x and the simulated feature vector x’ or X for the given training data point, and based on this tunes the parameters (e.g. weights) of all the constituent models (e.g. neural networks) of the ML model 104’, including the first encoders g e i, second encoders g e ij, inference network H, and decoders g d i, as well as the first and second graph distributions q1 and q2. E.g. this may be done using an ELBO function, or any other suitable learning function. Algorithms for comparing an input feature vector with a simulated version in order to tune the parameters of an ML model are, in themselves, known in the art. The method may be repeated over multiple training data points in order to refine the training. Once the model 104’ has been trained to at least some extent, it may be used to estimate whether a likely confounder exists between a given target pair of variables xi, xj (i.e. a pair of variables of interest). This may be done by accessing the trained second graph distribution q2 from memory and reading out the value of the probability ψ_exists(i,j). If the probability is high (e.g. above a threshold) it may be determined that an unobserved confounder is likely to exist between the two variables xi and xj; conversely, if the probability is low (e.g. below a threshold) it may be determined that such a confounder is not likely to exist between the variables in question. In dependence on this, a decision may be made as to whether or not to action a treatment on a first, intervened-on one of the two variables xj as a means of affecting a second, targeted one of the two variables xi.
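As an illustrative sketch of this read-out (the threshold value of 0.5 and the matrix representation of the trained q2 are assumptions made for illustration):

```python
import numpy as np

def confounder_likely(psi: np.ndarray, i: int, j: int,
                      threshold: float = 0.5) -> bool:
    """Read psi_exists(i, j) out of the trained second graph distribution q2
    and compare it against a specified threshold."""
    return psi[i, j] > threshold

def should_treat(psi: np.ndarray, treated: int, target: int) -> bool:
    """Example decision rule: only action a treatment of xj aimed at xi when
    no hidden confounder is likely between them."""
    return not confounder_likely(psi, target, treated)
```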
Consider a scenario where xj is apparently a cause of xi according to the first graph distribution q1, because the relevant elements – e.g. φ_exists(i,j), φ_dir(i,j) – of the first graph distribution q1 have been read out from memory and indicate a high probability (e.g. above a threshold probability) of a directed edge existing from xj to xi. Thus from the first graph distribution q1 alone, it may appear that treating xj is an effective way of affecting xi. However, if ψ_exists(i,j) in the second graph distribution q2 also indicates a high probability (e.g. above a threshold probability) of a hidden confounder existing between xj and xi, then it may be determined that in fact xj should not be treated as a means of affecting the target variable xi. The trained model 104’ may also be used to estimate the value of such a confounder. This may be done by reading out the value of uij from the output of the inference network after a suitable amount of training. In embodiments the inference network H may model a probabilistic distribution, which could also be described as q3, from which the value uij is sampled. The decision as to whether to action the treatment may also be dependent on the value of the estimated confounder. E.g. if a confounder is determined to be likely present but to have only a weak effect, it may be decided still to action the treatment of xj as a means of affecting the target variable xi. In alternative or additional applications, the trained model 104’ may be used to perform treatment effect estimation in an analogous manner to that already described in relation to Figures 2 and 8, by replacing steps T10-T40 in Figure 8 with steps U10-U40 from Figure 11. In other words, the target variable is selected as the selected variable xi, and the input values of one or more other, intervened-on (i.e. controlled) variables xj are set to their intended values (i.e. the proposed values for treatment). As discussed earlier, the inference network H can take any number of observations, so if any variables are not intervened-on or observed, their values are simply not passed through H. The method of Figure 11 is then used to generate a simulated value xi’ or Xi of the target variable xi. In embodiments this may be repeated over multiple different sampled instances of the first and second causal graphs G1, G2; i.e. repeating the steps of Figure 11 but re-sampling the first causal graph G1 from the first graph distribution q1 and re-sampling the second causal graph G2 from the second graph distribution q2 afresh each time (while keeping the same selection of the target variable and the same fixed values of the intervened-on variable or variables xj). The estimated effect of the treatment may then be determined by averaging over these multiple rounds of graph sampling. E.g. this may comprise determining the average treatment effect (ATE), or applying any other averaging technique described previously with respect to Figures 2 and 8. A decision about whether or not to action the proposed treatment of the (to-be) intervened-on variable or variables xj may be made in dependence on the estimated treatment effect. If the treatment is determined to be effective and to have a positive effect on the target variable xi (e.g. having above a threshold effect), then the treatment may be actioned; otherwise the treatment may not be actioned. EXAMPLE USE CASES The variables of the feature vector x = [x1…xD] may represent properties of a tangible real-world entity being modelled, or an environment in which the entity resides. The entity may comprise a living being; or a physical object such as an electronic, electrical or mechanical device or system. Alternatively or additionally the entity may comprise a piece of software running, or to be run, on a computer system.
In one example use case, the modelled real-world entity may comprise a living being, e.g. a human or animal. The disclosed model 104 or 104’ may be used to determine whether to action a treatment to the being in order to affect a condition of the being (e.g. to cure or alleviate a disease by which the being is afflicted). In this case the target variable xi represents a measure of the condition, e.g. a symptom experienced by the living being as a result of the condition, and one or more intervened-on variables xj represent one or more properties of the being or its environment which are susceptible to being controlled as a treatment (e.g. dosage of a drug, or a change in habit or environment, etc.). E.g. this may be used in a medical setting to treat a patient. A decision as to whether to perform the proposed treatment on the intervened-on variable(s) as a means to treat the target condition may be made in dependence on whether a confounder is estimated to exist between the proposed treatment and the target condition, and/or an estimated strength of the confounder. And/or, the decision may be made in dependence on the estimated treatment effect, e.g. ATE. In another example use case, the real-world entity may comprise a mechanical, electrical or electronic system or device; and the disclosed model 104 or 104’ may be used to determine whether to action a treatment to affect a state of the system or device, such as to repair, maintain or debug the system or device. In this case the target variable xi may represent a measure of the state of the system or device, e.g. sensor data measuring wear, operating temperature, operating voltage, output or throughput, etc. The one or more intervened-on variables xj may represent one or more properties of the device or its environment susceptible to being treated, e.g. vibration, humidity, ambient temperature, applied current or voltage level, etc. E.g. this could be used in an industrial setting to service industrial equipment such as factory equipment. A decision as to whether to perform the proposed treatment on the intervened-on variable(s) as a means to treat the target state may be made in dependence on whether a confounder is estimated to exist between the proposed treatment and the target state, and/or an estimated strength of the confounder. And/or, the decision may be made in dependence on the estimated treatment effect, e.g. ATE. In another example use case, the real-world entity being modelled may comprise software that is run, or to be run, on one or more processors in one or more computer devices at one or more locations; and the disclosed model 104 or 104’ may be used to determine whether to action a treatment to try to optimize the running of the software. In this case the target variable xi may comprise a measure of a current state of the software, e.g. memory or processing resource usage, or latency, etc. The one or more intervened-on variables xj may represent any property capable of affecting the running of the software, e.g. input data, or rebalancing of the load across a different combination of processors or devices. For example this could be used in a data centre or the cloud for load balancing of software run across different server devices.
A decision as to whether to perform the proposed treatment on the intervened-on variable(s) as a means to optimize the running of the software may be made in dependence on whether a confounder is estimated to exist between the proposed treatment and the target state, and/or an estimated strength of the confounder. And/or, the decision may be made in dependence on the estimated treatment effect, e.g. ATE. In another example use case the real-world entity being modelled may comprise a network, e.g. a mobile cellular network, a private intranet, or part of the internet (e.g. an overlay network, such as a VoIP network, overlaid on the underlying network). The disclosed model 104 or 104’ may be used to determine whether to action a treatment to try to optimize the operation of the network. In this case the target variable xi may represent any state of the network that it may be wished to improve, e.g. a property of network traffic such as end-to-end delay, jitter, packet loss, error rate, etc. The one or more intervened-on variables xj may represent any property capable of affecting the target variable, e.g. balancing of traffic across the network, routing or timing of traffic, encoding scheme used, etc. A decision as to whether to perform the proposed treatment on the intervened-on variable(s) as a means to optimize the network performance may be made in dependence on whether a confounder is estimated to exist between the proposed treatment and the target state, and/or an estimated strength of the confounder. And/or, the decision may be made in dependence on the estimated treatment effect, e.g. ATE. In another example use case, the real-world entity may comprise an autonomous or semi-autonomous self-locomotive device such as a self-driving vehicle or robot. In this case the target variable xi may represent sensor data providing information on the device’s environment, e.g. image data or other sensor data (such as distance or motion sensor data) indicating the presence of obstacles or the location of another object to interact with. The one or more intervened-on variables xj may comprise a control signal that can be applied to control the device, e.g. to steer the vehicle or apply brakes or lights, or to move a robot arm in a certain way. A decision as to whether to perform a certain control operation (a type of “treatment”) represented by the intervened-on variable, as a means to achieve a certain effect on the environment (e.g. avoid an obstacle or interact with another object), may be made in dependence on whether a confounder is estimated to exist between the proposed variable to be controlled and the target outcome, and/or an estimated strength of the confounder. And/or, the decision may be made in dependence on the estimated treatment effect, e.g. ATE. EXAMPLE IMPLEMENTATION Further details of some example implementations of the various concepts discussed above are now described, by way of example only. As discussed earlier, in practice it is not always possible to directly perform interventional studies (i.e. randomised A/B testing) to estimate treatment effects, and it may be required to use observational studies instead. That is, the causal graph and the associated causal effects are to be inferred based on observational data, in embodiments solely based on observational data. Therefore, this would appear to require an assumption that all the data provided by users already contains all the information that is needed (i.e.
there should not be any unmeasured confounders). However, this assumption is often unrealistic, as the modelled scenario may also be impacted by certain variables that cannot be directly measured. Therefore, it would be desirable to provide an effective and theoretically principled strategy for handling latent confounding when performing joint causal discovery and inference. However, the existence of unmeasured confounding poses questions for causal discovery, since there might exist multiple contradicting causal structures that are compatible with the observations. Traditional methods for causal discovery mainly focus on estimating the causal relationships between observed variables and do not recover the causal relationships that involve unmeasured confounders. There also exist several related methods for causal discovery with latent confounders, but each comes with certain limitations on assumptions. Following the previous work by Geffner et al. of developing deep end-to-end causal inference (DECI), this framework can be extended in order to handle latent confounders. The outcome of this is to implement latent-variable variants/baselines of DECI that explicitly take latent confounders into account. These variants/baselines should preferably: 1) be able to perform joint graph learning and functional learning; 2) have computational costs comparable to DECI; and 3) subsume DECI, in the sense that they should perform similarly to DECI when there are no latent confounders. To recap causal discovery without latent confounders using DECI: in DECI, we perform causal discovery under the no-latent-confounder assumption, modelling the causal graph G on all observed vertices X jointly with the observations x 1, ..., x N as: pθ(x 1, ..., x N, G) = p(G) Πn pθ(x n | G) (1). Here, p(G) is a prior over DAGs, implemented as a soft constraint penalizing departures from acyclicity (Eq. (2)), where G is the adjacency matrix representation and h(G) = tr(e^(G∘G)) − D (3) is the DAG penalty. We aim to fit θ, the parameters of our non-linear ANM, using observational data by maximizing (a lower bound of) log pθ(x 1, ..., x N). Given θ, the question of causal discovery can simply be answered by the posterior pθ(G|x 1, ..., x N). In the original DECI, we maximize the ELBO of log pθ(x 1, ..., x N) and use a mean-field approximation qϕ(G) to perform approximate inference. Causal discovery under ADMGs: in the present disclosure, we will extend DECI’s framework to consider unmeasured confounders. Unfortunately, if some variables in a DAG are unobserved, the resulting set of (conditional) independences no longer corresponds to a DAG (on the observed variables). To deal with unobserved variables, a common extension is the so-called acyclic directed mixed graph (ADMG). In an ADMG, a bidirected edge x ↔ y indicates the existence of an unmeasured confounder u xy, such that x ← u xy → y. Note that this is different from an undirected edge x − y used in a (CP)DAG, which indicates either x ← y or x → y. Finally, the term acyclic in ADMG means that there are no cycles containing only directed edges. Therefore, causal discovery under latent confounders can be formulated as the same equation as in the original DECI, except that the prior p(G) will be based on the constraint that G is an ADMG (instead of a DAG). We will discuss how to parameterize the ADMG constraint and how to parameterize pθ(x 1, ..., x N, G) later.
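For illustration only, the DAG penalty of Eq. (3) may be computed as follows (a standard trace-of-matrix-exponential acyclicity penalty; the function name is ours):

```python
import numpy as np
from scipy.linalg import expm

def dag_penalty(g: np.ndarray) -> float:
    """h(G) = tr(exp(G o G)) - D, where o is the elementwise (Hadamard)
    product and D the number of nodes; h(G) == 0 iff G is acyclic."""
    d = g.shape[0]
    return float(np.trace(expm(g * g)) - d)

# e.g. the 2-cycle x1 -> x2 -> x1 is penalized:
g = np.array([[0.0, 1.0], [1.0, 0.0]])
assert dag_penalty(g) > 0.0
```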
Note that in principle, an ADMG causal discovery algorithm is able to express search results using (a subset of) the following structures: x → y (x is the cause of y); x ↔ y (indicating the existence of an unmeasured confounder u xy, such that x ← u xy → y); x − y (either x ← y or x → y); and x ⇒ y (either x ↔ y or x → y). In the following, we mainly consider representing our posteriors using the first two, as they have better granularity than the last two. D-DECI (deconfounded DECI). Magnification-based representation: we perform causal discovery at the highest granularity, i.e. explicitly model the latent confounders (denoted by U), and learn a DAG G’ on X ∪ U. In other words, G’ (on X ∪ U) is a magnification of G (on X); and the marginalized graph G over the observed variables X will then be an ADMG. To relate G’ to the previous notation of G1 and G2: G1 is the adjacency matrix on X, and G2 is the adjacency matrix on U. G’ is essentially a big matrix on both X and U, obtained by concatenating and unpacking G1 and G2. In other words, G’ = M(G1, G2), where M(.) is a pre-known mathematical function. G’ may also be called the magnified matrix. In short, it is obtained by concatenating and aggregating G1 and G2. For example, for a graph x1 → x2 ↔ x3, with rows/columns ordered x1, x2, x3 (and the additional latent confounder appended last in G’), the corresponding matrices will be: G1 = [[0,1,0],[0,0,0],[0,0,0]], G2 = [[0,0,0],[0,0,1],[0,1,0]], and G’ = [[0,1,0,0],[0,0,0,0],[0,0,0,0],[0,1,1,0]], where the 4th row of G’ corresponds to the 4th variable (the additional latent confounder) implied in the bidirected edge x2 ↔ x3. That is, the 4th variable points to both x2 and x3, hence the two 1s in the second and third cells of the 4th row. Under this idea, we model the causal graph G’ jointly with the observations x 1, ..., x N, as well as the latent confounders u 1, ..., u N: pθ(x 1, ..., x N, u 1, ..., u N, G’) = p(G’) Πn pθ(x n, u n | G’) (5). We will define constraints on G’ and the parameterization of pθ(x n, u n|G’) later. We call Eq. (5) the magnification-based parameterization, since the original ADMG formulation does not explicitly assume any functional forms. To optimize θ and perform approximate inference on G’, we define: 1) A variational distribution qϕ(G’) to approximate the intractable posterior pθ(G’|x 1, ..., x N). Similar to the original DECI, this can be parametrized as a product of independent Bernoulli distributions for each potential directed edge in G’. We parametrize edge existence and edge orientation separately, using the ENCO parametrization. 2) An inference network qλ(u|x) to approximate the intractable posterior pθ,G’(u|x), e.g. a Gaussian inference network as used in VAEs. The inference network qλ(u|x) has also been referred to as H earlier. We use these to build the ELBO of Eq. (6). By committing to optimize Eq. (6), we implicitly make the following assumptions: 1) we already know the exact number of unmeasured variables, i.e. the set of latent vertices U is given and fixed; and 2) all unmeasured variables in U possess only out-going edges, i.e. all variables in U serve as confounders. In practice this can be implemented by model.set_graph_constraint() in the DECI code base. Prior over ADMGs: the ADMG constraint on the graph G’ can be imposed in two ways. Either directly apply the DAG constraint on G’, i.e. h(G’) = 0 with h as in Eq. (3) applied to G’ (Eq. (7)); this ensures that G’ can be converted to an ADMG (by marginalizing U and keeping only X). Or first marginalize G’ to the graph G on X, and then apply ADMG constraints on G. This can be done via one of the following three ADMG constraints: o Ancestral ADMG constraint (Eq. (8)), where G directed,i,j = 1 if and only if x i ← x j, and G bidirected,i,j = 1 if and only if x i ↔ x j.
o Arid ADMG constraint: h(G) = tr(e^(G directed)) − D + greenery(G directed, G bidirected) = 0 (9). o Bow-free constraint: Σi,j G directed,i,j · G bidirected,i,j = 0 (10), i.e. no pair of variables is connected by both a directed and a bidirected edge. By using any of those constraints, p(G’) can be implemented as a prior penalizing violation of the chosen constraint (Eq. (11)). Parameterization of pθ(x n, u n|G’): similar to DECI, we can parameterize pθ(x n, u n|G’) using an auto-regressive ANM. That is, z = g G’(x, u; θ) = [x, u] − f G’([x, u]; θ), where the z are independent noise variables following the distribution Πi p zi. Then we can write the observational likelihood as Πi p zi(g G’(x, u; θ)i) (12), where we omitted the Jacobian-determinant term because it is always equal to one for DAGs G’. Structural identifiability under latent confounders: preferably, we would like to find a set of suitable assumptions such that D-DECI satisfies the following structural identifiability definition. Definition 1 (D-DECI Structural Identifiability): for a distribution pθ(x; G’) := ∫u pθ(x, u|G’) du, the graph G’ is said to be structurally identifiable from pθ(x; G’) if there exists no other distribution pθ’(x; G’’) such that G’ ≠ G’’ and pθ(x; G’) = pθ’(x; G’’). Generally, such structural identifiability does not hold. However, under the constraint of bow-freeness, we will prove identifiability, as discussed in the next section. D-DECI identifiability. Notation and assumptions: let x ∈ ℝ^D denote the observed data, u ∈ ℝ^P denote the unobserved data, v = [x; u]^T, and p(x, u; G0) the ground-truth data-generating distribution, where G0 is a binary adjacency matrix representing the true causal DAG. D-DECI uses the additive noise model (ANM) defining the structural assignment x j = f j(pa(j); θ) + n j, where pa(j) denotes the parents of x j. The function f corresponds to the decoders g d described earlier. We also use the notation pa x(j) and pa u(j) to denote the parents of x j in the observed and unobserved sets, respectively. The n j are mutually independent noise variables. The noise term n has also been referred to as z earlier. Assumption 1 (Form of f j): we assume that the functions f j are non-linear and decompose as: f j(pa x(j), pa u(j); θ) = f j,x(pa x(j); θ) + f j,u(pa u(j); θ) (13). Assumption 2 (Unobserved variables are latent confounders): all unobserved variables are latent confounders of observed variables, i.e. have the causal structure x j ← u k → x i. Justification for this is given by the proposition that any latent variable can be reparameterised, by absorbing exogenous noise, to make it a latent confounder. A distribution p(x) satisfies minimality w.r.t. G if it is Markov w.r.t. G, but not w.r.t. any proper subgraph of G. Since we cannot access the true data-generating distribution p(x, u; G0), only the marginal distribution over observed variables p(x; G0), we require a slightly different definition of minimality. We define this below. Definition 2 (Minimality): p(x; G) is minimal if it does not satisfy the local Markov condition with respect to the ADMG of any subgraph of G. This form of minimality implies the bow-free constraint. To see this, note that the causal structure x i ← u k → x j together with x i ← x j, and x i ← u k → x j alone, imply the same conditional independencies. Since the latter is a subgraph of the former, only the latter satisfies the minimality condition. Assumption 3 (Minimality): for a distribution p(x; G) generated by D-DECI, we assume the minimality condition holds.
Comments on the assumption of latent variables: the treatment effect is invariant to such reparameterisations when intervening on observed variables only. Proof of structural identifiability: here we prove structural identifiability of G given the assumptions of the previous section. Lemma 1: let M denote a set of variables not containing v i, and let s(v i) denote a linear or non-linear function of v i. Then the residual of s(v i) regressed on M cannot be independent of n i; that is, [s(v i) − g i(M) ⊥ n i] cannot hold (14). Proof: assume that [s(v i) − g i(M) ⊥ n i] holds. Then M must contain at least one descendant of v i, as g i(M) must have dependence on the noise n i in order to cancel the effect of n i in s(v i). We can express g i(M) as a function of the exogenous noise terms of the variables in M. Since g i operates on variables defined by non-linear transformations of the exogenous noise terms, we cannot decompose g i(M) as a i(the noise terms other than n i) + b i(n i). M contains a descendant of v i, so the noise terms entering g i(M) include at least one noise term n k that satisfies v i ⊥ n k (i.e. is not in v i). Thus, terms containing n i cannot be fully removed from s(v i) − g i(M), and so [s(v i) − g i(M) ⊥ n i] does not hold, a contradiction. Lemma 2: if and only if (15) is satisfied, there is a latent confounder, and no direct cause, between x i and x j. Proof: define g i and g j as g i(M) = f i,x(pa x(i)) + g’i(M) and g j(N) = f j,x(pa x(j)) + g’j(N) respectively. Then (15) becomes equivalent to (16), which is in turn equivalent to (17). And since n i ⊥ n j, we have (18). The first case implies the existence of an unobserved mediator between x j and x i, the second implies the existence of an unobserved mediator between x i and x j, and the third implies the existence of an unobserved confounder. Given Assumptions 2 and 3, this indicates the presence of a latent confounder and no direct cause between x i and x j. Lemma 3: if (19) is satisfied, there is no causal relationship between x i and x j and no latent confounder. Lemma 4: if (20) and (21) are satisfied, x j is a direct cause of x i and there is no latent confounder. Thus, for any pair of variables we can determine whether there exists a latent confounder and, if not, whether there is a direct cause and its orientation. MLE recovers ground truth: we can use the exact same proof as in the DECI paper to show that MLE recovers the ground truth; we can just replace G with the ADMG corresponding to G’ projected onto x. D-DECI recovers ground truth: in the infinite-data limit, the posterior approximation q(u n|x n) should preferably be very accurate (i.e. E[KL(q(u n|x n) || p(u n|x n, G’))] ≈ 0) in order to recover the ground-truth causal structure. FINAL REMARKS It will be appreciated that the above embodiments have been described by way of example only.
More generally, according to one aspect disclosed herein there is provided (Statement 1) a computer-implemented method comprising: a) selecting a selected variable from among variables of a feature vector; b) sampling a first causal graph from a first graph distribution and sampling a second causal graph from a second graph distribution, the first causal graph modelling causation between variables in the feature vector, and the second causal graph modelling presence of possible confounders, a confounder being an unobserved cause of both of two variables in the feature vector; c) from among the variables of the feature vector, identifying a parent variable which is a cause of the selected variable according to the first causal graph, and which together with the selected variable forms a confounded pair having a respective confounder being a cause of both according to the second causal graph; d) inputting an input value of the parent variable into a first encoder, resulting in a respective embedding of the parent variable; e) inputting at least the input value of the parent variable (and optionally also an input value of the selected variable and/or an input value of one or more others of the variables of the feature vector) into an inference network, resulting in a latent value modelling the respective confounder of the confounded pair, and inputting the latent value into a second encoder, resulting in an embedding of the confounder of the confounded pair; f) combining the embedding of the parent variable with the embedding of the confounder of the confounded pair, resulting in a combined embedding; and g) inputting the combined embedding into a decoder, resulting in a reconstructed value of the selected variable. In embodiments there may optionally be provided a method in accordance with any of the following statements. Statement 2. The method of Statement 1, wherein: c) comprises: identifying each of the variables of the feature vector that is a respective parent variable of the selected variable according to the first causal graph, and identifying each respective confounded pair that comprises the selected variable and a respective one of the identified parent variables according to the second causal graph; d) comprises: inputting a respective input value of each of the parent variables into a respective first encoder for the parent, resulting in a respective embedding of each respective parent variable; e) comprises: for each identified confounded pair, inputting the input value of the respective parent variable and the input value of the selected variable into the inference network, thereby resulting in a respective latent value of each of the respective confounders, and inputting each of the latent values into a respective second encoder for the respective latent value, resulting in a respective embedding of each of the respective confounders; f) comprises combining the embeddings of all the identified parent variables and confounders together, thereby resulting in said combined embedding; and g) comprises inputting the combined embedding into a respective decoder associated with the selected variable, resulting in a reconstructed value of the selected variable. Statement 3.
The method of Statement 2, comprising: i) for a given training data point comprising a given combination of input values of the variables of the feature vector, repeating a)-g) over multiple selections, each selection selecting a different one of the variables of the feature vector as the selected variable, thereby resulting in a respective reconstructed value, the multiple selections together thereby resulting in a respective reconstructed version of the training data point comprising the reconstructed values for the training data point; ii) evaluating a measure of difference between the training data point and the reconstructed version; and iii) training parameters of the inference network, first and second encoders, decoders, and first and second graph distributions, based on the evaluated measure. Statement 4. The method of Statement 3, comprising repeating i)-iii) over multiple input data points, each comprising a different combination of input values of the variables of the feature vector. Statement 5. The method of Statement 4, comprising, after the training over the multiple training data points, observing the second graph distribution to estimate whether a confounder exists between a target pair of the variables of the feature vector. Statement 6. The method of Statement 5, comprising, after the training over the multiple training data points, observing the latent value of the respective confounder between the target pair of variables, resulting from the inference network, as an estimated value of the respective confounder. Statement 7. The method of Statement 5 or 6, wherein the variables of the feature vector model properties of a real-world entity or an environment thereof, and the method comprises, in dependence on said estimating of whether a confounder exists between the target pair of variables, determining whether or not to apply a treatment to one of said target pair of variables to affect the other of said target pair of variables. Statement 8. The method of Statement 7, wherein one of: the real-world entity comprises a living being, and the treatment comprises a medical treatment to the living being or the environment thereof; or the real-world entity comprises a mechanical, electrical or electronic device or system, or an environment thereof, and the treatment comprises an act of maintaining, debugging, upgrading or controlling the device or system, or controlling the environment thereof; or the real-world entity comprises a network or software, and the treatment comprises an act of controlling the network or software. Statement 9. The method of any of Statements 2 to 8, comprising: I) setting the input value of an intervened-on variable of the feature vector, other than the selected variable, to a specified value; and II) estimating an effect of the intervened-on variable on the selected variable based on the reconstructed value of the selected variable. Statement 10. The method of Statement 9, wherein: I) comprises setting the input value of a plurality of intervened-on variables of the feature vector, other than the selected variable, to respective specified values; and II) comprises estimating the effect of the plurality of intervened-on variables based on the reconstructed value of the selected variable. Statement 11.
Statement 11. The method of Statement 9 or 10, wherein: I) comprises repeating a)-g) over multiple graph-sampling events, each using the same selected variable and setting the intervened-on variable to the same specified value, but performing the sampling of the first and second causal graphs afresh each time, thereby resulting in a respective reconstructed value of the selected variable from each graph-sampling event; and II) comprises determining an average treatment effect of the intervened-on variable on the selected variable, averaged over the multiple graph-sampling events.

Statement 12. The method of Statement 11, wherein each graph-sampling event sets a plurality of intervened-on variables to specified values, the same values each time; and II) comprises determining the average treatment effect of the plurality of intervened-on variables on the selected variable.

Statement 13. The method of Statement 9, wherein the intervened-on variable models a treatment on a real-world entity or an environment thereof, and the selected variable models an effect of the treatment applied to the real-world entity, and the method further comprises actioning the treatment on the real-world entity in dependence on the estimated treatment effect.

Statement 14. The method of Statement 13, wherein one of: the real-world entity comprises a living being, and the treatment comprises a medical treatment to the living being or an environment thereof; or the real-world entity comprises a mechanical, electrical or electronic device or system, or an environment thereof, and the treatment comprises an act of maintaining, debugging, upgrading or controlling the device or system, or controlling the environment thereof; or the real-world entity comprises a network or software, and the treatment comprises an act of controlling the network or software.

Statement 15. The method of any of Statements 2 to 14, wherein the inference network comprises a respective constituent inference network for each pair of variables in the feature vector, and e) comprises: for each identified confounded pair, inputting the input value of the respective parent variable and the input value of the selected variable into the respective constituent inference network for the respective confounded pair, thereby resulting in the respective latent value of the respective confounder. Alternatively, the inference network may comprise a common inference network operable to encode all the variables of the feature vector together into the respective latent value for each pair.

Statement 16. The method of any of Statements 2 to 15, wherein each of the first and second encoders and each of the decoders comprises a neural network.

Statement 17. The method of any of Statements 1 to 16, wherein the inference network comprises a neural network.

Statement 18. The method of Statement 9 or any subsequent Statement when dependent thereon, wherein the method is performed on a server system of a first party, the server system comprising one or more server units at one or more sites; and the method further comprises, by the server system of the first party: providing an application programming interface, API, enabling a second party to contact the server system via a network; receiving a request from the second party over the network via the API; in response to the request, determining the estimated treatment effect on the selected variable; and returning the estimated treatment effect to the second party over the network via the API.
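As a final hypothetical sketch, the repeated graph-sampling of Statements 9 to 12 might be realized as follows: the reconstructed value of the selected variable is averaged over fresh samplings of both causal graphs under a fixed intervention. The function name, the dict-based interventions argument and the sample count are illustrative assumptions; an average treatment effect would typically be obtained by comparing two such averages, e.g. treatment versus control values.

```python
# Hypothetical sketch of Statements 9-12; names are illustrative only.
def average_outcome_under_intervention(x, interventions, y_idx, model,
                                       n_samples=100):
    """Average reconstructed value of variable y_idx when the variables in
    `interventions` (a dict of index -> value) are held at specified values,
    averaged over n_samples fresh samplings of both causal graphs."""
    x = x.clone()
    for t_idx, t_value in interventions.items():
        x[t_idx] = t_value                       # step I: set intervened-on values
    total = 0.0
    for _ in range(n_samples):                   # graphs sampled afresh each event
        y_hat = reconstruct_selected(x, y_idx, model.graph_dist,
                                     model.conf_dist, model.f_enc,
                                     model.infer_net, model.u_enc,
                                     model.dec, model.emb_dim)
        total += float(y_hat)                    # step II: reconstructed outcome
    return total / n_samples                     # average over sampling events

# An average treatment effect could then be estimated as, e.g.:
# ate = (average_outcome_under_intervention(x, {t: 1.0}, y, model)
#        - average_outcome_under_intervention(x, {t: 0.0}, y, model))
```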
According to another aspect disclosed herein, there is provided a computer program embodied on computer-readable storage, wherein the computer program comprises a machine learning model comprising a plurality of first encoders, a plurality of second encoders, a decoder and an inference network; and wherein the computer program further comprises instructions configured so as when run on one or more processors to perform the method of any preceding Statement.

According to another aspect disclosed herein, there is provided a system comprising: processing apparatus comprising one or more processors; and memory comprising one or more memory units, wherein the memory stores: a machine learning model comprising the first encoders, second encoders, decoders, and inference network of any preceding Statement; and code arranged to run on the processing apparatus and being configured so as when run to perform the method of any preceding Statement.

In some examples, the model 104/104' and associated computer-executable instructions are provided using any computer-readable media that are accessible by the computing equipment 102. Computer-readable media include, for example, computer storage media such as memory, and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or the like. Computer storage media include, but are not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), persistent memory, phase-change memory, flash memory or other memory technology, Compact Disk Read-Only Memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, shingled disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing apparatus. In contrast, communication media may embody computer-readable instructions, data structures, program modules, or the like in a modulated data signal, such as a carrier wave, or other transport mechanism. As defined herein, computer storage media do not include communication media. Therefore, a computer storage medium should not be interpreted to be a propagating signal per se; propagated signals per se are not examples of computer storage media. Although the computer storage medium may be described as being within the computing equipment 102, it will be appreciated by a person skilled in the art that, in some examples, the storage is distributed or located remotely and accessed via a network or other communication link (e.g. using a communication interface).

Other variants or use cases of the disclosed techniques may become apparent to the person skilled in the art once given the disclosure herein. The scope of the disclosure is not limited by the described embodiments but only by the accompanying claims.