

Title:
ENSEMBLE LEARNING OF DIFFRACTIVE NEURAL NETWORKS
Document Type and Number:
WIPO Patent Application WO/2022/056422
Kind Code:
A1
Abstract:
A diffractive neural network device for classification and/or processing of optical images, optical signals, or optical data includes a plurality of diffractive neural network devices (i.e., an ensemble) configured to each receive the optical input. The diffractive neural network devices include a plurality of optically transmissive and/or reflective substrate layers arranged in an optical path and have a plurality of physical features formed thereon or within having different transmission and/or reflection coefficients as a function of the lateral coordinates across each substrate layer that are established by training computer models of each diffractive neural network. For each diffractive neural network of the ensemble, an object space input filter and/or a Fourier space input filter is disposed along its optical path. One or more optical detectors are configured to capture the optical outputs/signal(s) resulting from the diffractive neural network devices of the ensemble.

Inventors:
OZCAN AYDOGAN (US)
RAHMAN MD SADMAN SAKIB (US)
LI JINGXI (US)
RIVENSON YAIR (US)
MENGU DENIZ (US)
Application Number:
PCT/US2021/050134
Publication Date:
March 17, 2022
Filing Date:
September 13, 2021
Assignee:
UNIV CALIFORNIA (US)
International Classes:
G02B5/32; G02F1/225; G06N3/04; G06N3/08
Domestic Patent References:
WO2019200289A12019-10-17
WO2020113240A22020-06-04
WO2020101863A22020-05-22
Foreign References:
US20200104716A12020-04-02
JP2019133628A2019-08-08
US20190370652A12019-12-05
RU2189078C22002-09-10
CN111582435A2020-08-25
CN110929864A2020-03-27
Attorney, Agent or Firm:
DAVIDSON, Michael S. (US)
Claims:
What is claimed is:

1. A method of forming a diffractive neural network for classification and/or processing of at least one optical image, optical signal, or optical data comprising: training a plurality of diffractive neural network models to perform classification and/or processing of the at least one optical image, optical signal or optical data, with each diffractive neural network model comprising a multi-layer transmissive and/or reflective network having a plurality of diffractive physical features located in different positions in each of the layers of the transmissive and/or reflective diffractive neural network model, wherein each trained diffractive neural network model is associated with an object space input filter and/or a Fourier space input filter, and the training comprises feeding a plurality of different training images, optical signals, or optical data to the plurality of diffractive neural network models and computing at least one optical output or signal of optical transmission and/or reflection through/from the diffractive neural network models and iteratively adjusting the phase and/or amplitude of transmission/reflection coefficients for each layer of the multi-layer transmissive and/or reflective diffractive neural network models to arrive at optimized transmission and/or reflection coefficients; and iteratively pruning the number of individual diffractive neural network models to reduce the total number of diffractive neural network models that are used to perform the classification and/or processing of input data.

2. The method of claim 1, wherein the diffractive neural network models are trained individually or as groups of two or more network models.

3. The method of claim 1, further comprising manufacturing or having manufactured physical embodiment(s) of the multi-layer transmissive and/or reflective diffractive neural network models selected through the pruning operation, wherein each diffractive neural network comprises a plurality of substrate layers having physical features that match the optimized transmission and/or reflection coefficients generated during the training.

4. The method of claim 1, wherein the training images, optical signals, or optical data are encoded in the phase and/or amplitude channel of the input.


5. The method of claim 1, wherein the object space input filters comprise different shaped filters that represent different transmission or reflection functions.

6. The method of claim 1, wherein each of the object space input filters is located at or near the object plane or its virtual and/or digital replica.

7. The method of claim 1, wherein the Fourier space input filters comprise different shaped filters that represent different transmission or reflection functions.

8. The method of claim 1, wherein each of the Fourier space input filters is located at or near the Fourier plane of the object or its virtual and/or digital replica.

9. The method of claim 7, wherein the Fourier space input filters comprise transmissive or reflective filters interposed between a set of lenses.

10. The method of claim 1, wherein the object space input filters and/or the Fourier space input filters comprise learnable filters that are learned during training of the diffractive neural network models.

11. The method of claim 1, wherein iterative pruning comprises assigning individual weights to each class score of the individual diffractive neural network models and defining an ensemble class score as a weighted sum of the individual class scores and at each iteration of the ensemble pruning operation, optimizing the weights through a gradient descent and error backpropagation-based method or an optimization tool to minimize the softmax-cross-entropy (SCE) loss or another fidelity loss function defined between the predicted ensemble class scores and the labeled ground truth, followed by choosing the set of weights giving the highest data classification and/or processing accuracy.

12. The method of claim 11, further comprising randomly removing one or more diffractive neural network models with a certain period within the pruning iterations.

13. The method of claim 11, further comprising ranking the individual diffractive neural network models based on their weights and removing a certain number of diffractive neural network models based on their weight ranking during the pruning iterations.

14. The method of claim 1, further comprising one or more digital displays or screens interposed along one or more optical paths or at the input of the individual diffractive neural network models.

15. The method of claim 1, further comprising one or more spatial light modulators (SLMs) interposed along one or more optical paths or at the input of the individual diffractive neural network models.

16. The method of claim 1, further comprising one or more spatial light modulators (SLMs) that are used as the object space input filters and/or the Fourier space input filters.

17. A diffractive neural network device for classification and/or processing of at least one optical image, optical signal, or optical data comprising: a plurality of diffractive neural network devices configured to each receive an optical input containing the optical image, optical signal, or optical data, wherein each of the plurality of diffractive neural network devices comprises: a plurality of optically transmissive and/or reflective substrate layers arranged in an optical path, each of the plurality of optically transmissive and/or reflective substrate layers comprising a plurality of physical features formed on or within the plurality of optically transmissive and/or reflective substrate layers and having different transmission and/or reflection coefficients as a function of the lateral coordinates across each substrate layer, wherein the plurality of optically transmissive and/or reflective substrate layers and the plurality of physical features thereon collectively define a trained mapping function between the optical input to the plurality of optically transmissive and/or reflective substrate layers and one or more optical outputs or optical signal(s) created by optical diffraction through and/or optical reflection from the plurality of optically transmissive and/or reflective substrate layers; an object space input filter and/or a Fourier space input filter disposed along an optical path for each of the diffractive neural network devices; and one or more optical detectors configured to capture the one or more optical outputs or optical signal(s) resulting from the plurality of optically transmissive and/or reflective substrate layers for the plurality of diffractive neural network devices.

18. The device of claim 17, wherein the input images, optical signals, or optical data are encoded in the phase and/or amplitude channel of the input.

19. The device of claim 17, wherein the object space input filters comprise different shaped filters that represent different transmission or reflection functions.

20. The device of claim 17, wherein each of the object space input filters is located at or near the object plane or its virtual and/or digital replica.

21. The device of claim 17, wherein the Fourier space input filters comprise different shaped filters that represent different transmission or reflection functions.

22. The device of claim 17, wherein each of the Fourier space input filters is located at or near the Fourier plane of the object or its virtual and/or digital replica.

23. The device of claim 17, wherein the Fourier space input filters comprise transmissive or reflective filters interposed between a set of lenses.

24. The device of claim 17, wherein the object space input filters and/or the Fourier space input filters comprise learnable transmissive or reflective filters.

25. The device of claim 17, further comprising one or more digital displays or screens interposed along one or more optical paths or at the input of the individual optical network devices.

26. The device of claim 17, further comprising one or more spatial light modulators (SLMs) interposed along one or more optical paths or at the input of the individual optical network devices.

27. The device of claim 17, further comprising one or more spatial light modulators (SLMs) that are used as the object space input filters and/or the Fourier space input filters.


Description:
ENSEMBLE LEARNING OF DIFFRACTIVE NEURAL NETWORKS

Related Application

[0001] This Application claims priority to U.S. Provisional Patent Application No. 63/078,087 filed on September 14, 2020, which is hereby incorporated by reference. Priority is claimed pursuant to 35 U.S.C. § 119 and any other applicable statute.

Technical Field

[0002] The technical field relates to an optical deep learning physical architecture, platform, or device that can perform, at the speed of light, various complex functions. In particular, the technical field pertains to improving the inference performance of diffractive optical networks using feature engineering and ensemble learning.

Background

[0003] Recent years have witnessed the emergence of deep learning, which has facilitated powerful solutions to an array of intricate problems in artificial intelligence, including e.g., image classification, object detection, natural language processing, speech processing, bioinformatics, optical microscopy, holography, sensing and many more. Deep learning has become particularly popular because of the recent advances in the development of advanced computing hardware and the availability of large amounts of data for training of deep neural networks. Algorithms such as stochastic gradient descent and error backpropagation enable deep neural networks to learn the mapping between an input and the target output distribution by processing a large number of examples. Motivated by this major success enabled by deep learning, there has also been a revival of interest in optical computing, which has some important and appealing features such as e.g., (1) parallelism provided by optics/photonics systems, (2) potentially improved power efficiency through e.g., passive and/or low-loss optical interactions, and (3) minimal latency.

[0004] As a recent example of an entirely passive optical computing system, Diffractive Deep Neural Networks (D²NN) have been demonstrated to perform all-optical inference and image classification through the modulation of input optical waves by successive diffractive surfaces that are trained through deep learning methods, e.g., stochastic gradient-descent and error-backpropagation. See e.g., Lin et al., All-optical machine learning using diffractive deep neural networks, Science, Vol. 361, Issue 6406, pp. 1004-08 (2018). Earlier generations of these diffractive neural networks achieved >98% blind testing accuracies in classification of handwritten digits (MNIST) that are encoded in the amplitude or phase channels of the input optical fields, and were experimentally demonstrated using terahertz wavelengths along with 3D-printing of the resulting diffractive layers/surfaces that form a physical network. In a D²NN that is fabricated with linear materials, where nonlinear optical processes including surface nonlinearities are negligible, the only form of nonlinearity in the forward optical model occurs at the opto-electronic detector plane. Without the use of any non-linear activation function, the D²NN framework still exhibits a depth advantage, as its statistical inference and generalization capabilities improve with additional diffractive layers, which was demonstrated both empirically and theoretically. The same diffractive processing framework of D²NNs has also been utilized to design deterministic optical components for e.g., ultra-short pulse shaping, spectral filtering and wavelength division multiplexing.

[0005] To further improve the inference capabilities of optical computing hardware, the coupling of diffractive optical systems with jointly-trained electronic neural networks that form opto-electronic hybrid systems has also been reported, where the front-end is optical/diffractive and the back-end is all-electronic. Despite all this progress, there is still significant room for further improvements in diffractive processing of optical information.

Summary

[0006] Here, major advances in the optical inference and generalization capabilities of the D²NN framework are shown by feature engineering and ensemble learning over multiple independently trained diffractive neural networks, where parallel processing of optical information is exploited. To create this advancement, first the base D²NN models were diversified by manipulating their training inputs by means of spatial feature engineering (i.e., different masks). In this approach, the input fields are filtered either in the object space or in the Fourier space by introducing an assortment of curated passive filters, before the diffractive networks (see FIGS. 1A-1C). Following the individual training of 1252 uniquely different D²NNs with various features, an iterative pruning strategy was used to obtain ensembles of D²NNs that work in parallel to improve the final classification accuracy by combining the decisions of the individual diffractive classifiers. Based on this feature learning and iterative pruning strategy, blind testing accuracies of 61.14±0.23% and 62.13±0.05% were numerically achieved on the classification of CIFAR-10 test images with ensemble sizes of N=14 and N=30 D²NNs, respectively. Stated differently, fourteen (14) D²NNs (or thirty (30) D²NNs) selected through this pruning approach work in parallel to collectively reach 61.14±0.23% (62.13±0.05%) optical inference accuracy for the CIFAR-10 test dataset, which provides an improvement of >16% over the average classification accuracy of the individual D²NNs within each ensemble, demonstrating a "wisdom of the crowd" effect. The significant improvement in inference and generalization performance provided by feature engineering and ensemble learning of D²NNs marks a major step forward to open up new avenues for optics-based computation, machine learning and machine vision related systems, benefiting from the parallelism of optical systems.

[0007] In one embodiment, a method of forming a diffractive neural network for classification and/or processing of at least one optical image, optical signal, or optical data is disclosed. The method includes: training a plurality of diffractive neural network models to perform classification and/or processing of the at least one optical image, optical signal or optical data, with each diffractive neural network model comprising a multi-layer transmissive and/or reflective network having a plurality of diffractive physical features located in different positions in each of the layers of the transmissive and/or reflective diffractive neural network model, wherein each trained diffractive neural network model is associated with an object space input filter and/or a Fourier space input filter, and the training comprises feeding a plurality of different training images, optical signals, or optical data to the plurality of diffractive neural network models and computing at least one optical output or signal of optical transmission and/or reflection through/from the diffractive neural network models and iteratively adjusting the phase and/or amplitude of transmission/reflection coefficients for each layer of the multi-layer transmissive and/or reflective diffractive neural network models to arrive at optimized transmission and/or reflection coefficients. The number of individual diffractive neural network models is iteratively pruned to reduce the total number of diffractive neural network models that are used to perform the classification and/or processing of input data.

[0008] In another embodiment, a diffractive neural network device for classification and/or processing of at least one optical image, optical signal, or optical data is disclosed. The device includes a plurality of diffractive neural network devices configured to each receive an optical input containing the optical image, optical signal, or optical data, wherein each of the plurality of diffractive neural network devices includes: a plurality of optically transmissive and/or reflective substrate layers arranged in an optical path, each of the plurality of optically transmissive and/or reflective substrate layers comprising a plurality of physical features formed on or within the plurality of optically transmissive and/or reflective substrate layers and having different transmission and/or reflection coefficients as a function of the lateral coordinates across each substrate layer, wherein the plurality of optically transmissive and/or reflective substrate layers and the plurality of physical features thereon collectively define a trained mapping function between the optical input to the plurality of optically transmissive and/or reflective substrate layers and one or more optical outputs or optical signal(s) created by optical diffraction through and/or optical reflection from the plurality of optically transmissive and/or reflective substrate layers. The device further includes an object space input filter and/or a Fourier space input filter disposed along an optical path for each of the diffractive neural network devices. One or more optical detectors are included with the device and configured to capture the one or more optical outputs or optical signal(s) resulting from the plurality of optically transmissive and/or reflective substrate layers for the plurality of diffractive neural network devices.

Brief Description of the Drawings

[0009] FIGS. 1A-1C schematically illustrate the ensemble diffractive neural network system. FIG. 1A illustrates an example of a D²NN using a feature engineered input, where an input mask with a passive transmission window opened at a certain position is employed against the object plane. An object from the CIFAR-10 image dataset is shown as an example and is encoded either in the amplitude channel or in the phase channel of the input plane of the diffractive network. FIG. 1B is the same as in FIG. 1A, but uses a passive input mask placed on the Fourier plane of a 4-f system; here a bandpass filter is shown as an example. FIG. 1C shows an ensemble D²NN system, formed by N different feature engineered D²NNs, where each diffractive network of the ensemble takes the form of the D²NN of FIG. 1A or FIG. 1B. The final ensemble class score is computed through a weighted summation of the differential detector signals obtained from the individual diffractive networks. Through feature engineering and ensemble learning, blind inference accuracies of 62.13±0.05%, 61.14±0.23% and 60.35±0.39% were achieved on the CIFAR-10 test image dataset using N=30, N=14 and N=12 D²NNs, respectively. The standard deviations are calculated through three (3) repeats using the same hyperparameters.

[0010] FIGS. 2A-2F show the inference accuracy of D²NN ensembles as a function of Nmax and N. FIG. 2A: Variation of the blind testing accuracy as a function of the maximum allowed ensemble size (Nmax) during the pruning; FIG. 2B: Variation of the blind testing accuracy as a function of the selected ensemble size (N); FIG. 2C: Relationship between Nmax and N. FIG. 2D illustrates the average of the accuracy data of FIG. 2A. FIG. 2E illustrates the average of the accuracy data of FIG. 2B. FIG. 2F illustrates the average of the selected ensemble size data of FIG. 2C. The symbols in the legend denote different pruning parameters used in the ensemble selection process. See also FIG. 4 and Table 1.

[0011] FIGS. 3A-3D show an ensemble of N=14 D²NNs that achieves a blind classification accuracy of 61.21% on the CIFAR-10 test dataset. FIG. 3A: Input filters/masks used before each one of the D²NNs that form the ensemble. For D²NNs 1, 5-9: the input filters are on the object plane. For the remaining D²NNs 2-4, 10-14: the input filters are on the Fourier plane. The input filters corresponding to the networks with phase encoded inputs are enclosed within a border/frame (5-14), while the inputs of the diffractive networks 1-4 are amplitude encoded. The dynamic range of the input phase encoding for filter 5 is 0-π. The dynamic range for filters 6, 7, 9 is 0-3π/2. The dynamic range for filter 11 is 0-π/2. The dynamic range for filters 8, 10, 12-14 is 0-2π. FIG. 3B: Class specific weights for each D²NN of the ensemble. If one ignores these class specific weights and replaces them with all ones, the blind inference accuracy slightly decreases to 61.08%, from 61.21%. FIG. 3C: True positive rates of the individual diffractive networks, compared against their ensemble for different classes. FIG. 3D: Test accuracy of the individual networks compared against their ensemble. The dashed lines show the classification performance improvement (~16.6%) achieved by the diffractive ensemble over the mean performance of the individual D²NNs. Three repeats with the same hyperparameters resulted in a blind classification accuracy of 61.14±0.23%, where 61.21% represents the median, which is detailed in this figure.

[0012] FIG. 4 shows a flow chart of the ensemble pruning process. The meaning of the symbols is as follows: i: iteration number; S: the set of ensembles resulting after each iteration; S_i: the ensemble on iteration i; n_i: the number of networks in the ensemble on iteration i; w_k: the weight vector for network k; T: the interval between random eliminations of D²NNs; S_d,i: the set of networks to eliminate from the ensemble on iteration i; n_d,i: the number of networks to eliminate from the ensemble on iteration i; r_i: the fraction of networks to retain on iteration i; m: the ratio of the number of randomly eliminated networks to the number of networks eliminated based on ranking; p: the fraction of the networks in the ensemble on which random elimination is applied. At the end of the pruning process, S includes a series of D²NN ensembles (S_i) of gradually decreasing size.

[0013] FIGS. 5A-5D illustrate an ensemble of N=12 D²NNs that achieves a blind classification accuracy of 60.29% on the CIFAR-10 test dataset. FIG. 5A: Input filters/masks used before each one of the D²NNs that form the ensemble. For D²NNs 1, 2, 4-7: the input filters are on the object plane. For the remaining D²NNs 3, 8-12: the input filters are on the Fourier plane. The input filters corresponding to the networks with phase encoded input are enclosed within a border/frame (4-12), while the inputs of the diffractive networks 1-3 are amplitude encoded. The dynamic range of the input phase encoding for filter 4 is 0-π. The dynamic range for filters 6, 7 is 0-3π/2. The dynamic range for filter 9 is 0-π/2. The dynamic range for filters 5, 8, 10-12 is 0-2π. FIG. 5B: Class specific weights for each D²NN of the ensemble. FIG. 5C: True positive rates of the individual networks, compared against their ensemble for different classes. FIG. 5D: Test accuracy of the individual networks compared against their ensemble. The dashed lines show the classification performance improvement (~14.7%) achieved by the diffractive ensemble over the mean performance of the individual D²NNs. Three repeats with the same hyperparameters resulted in a blind classification accuracy of 60.35±0.39%, where 60.29% represents the median.

[0014] FIGS. 6A-6D illustrate a D²NN ensemble consisting of only phase-encoded-input networks (N=14) that achieves a blind classification accuracy of 60.65% on the CIFAR-10 test dataset. FIG. 6A: Input filters/masks used before each one of the D²NNs that form the ensemble. For D²NNs 1-7: the input filters are on the object plane. For the remaining D²NNs 8-14: the input filters are on the Fourier plane. The dynamic range of the input phase encoding for filters 1, 2, and 6 is 0-π. The dynamic range for filters 3 and 5 is 0-3π/2. The dynamic range for filters 9, 10, and 11 is 0-π/2. The dynamic range for filters 4, 7, 8, and 12-14 is 0-2π. FIG. 6B: Class specific weights for each D²NN of the ensemble. FIG. 6C: True positive rates of the individual networks, compared against their ensemble for different classes. FIG. 6D: Test accuracy of the individual networks compared against their ensemble. The dashed lines show the classification performance improvement (~15.4%) achieved by the diffractive ensemble over the mean performance of the individual D²NNs. Three repeats with the same hyperparameters resulted in a blind classification accuracy of 60.74±0.17%, where 60.65% represents the median.

[0015] FIGS. 7A-7D illustrate a D²NN ensemble consisting of only phase-encoded-input networks (N=12) that achieves a blind classification accuracy of 60.43% on the CIFAR-10 test dataset. FIG. 7A: Input filters/masks used before each one of the constituent D²NNs that form the ensemble. For D²NNs 1-5: the input filters are on the object plane. For the remaining D²NNs 6-12: the input filters are on the Fourier plane.

[0016] The dynamic range of the input phase encoding for filters 1, 2, and 5 is 0-π. The dynamic range for filter 4 is 0-3π/2. The dynamic range for filters 9 and 10 is 0-π/2. The dynamic range for filters 3, 6-8, 11, and 12 is 0-2π. FIG. 7B: Class specific weights for each D²NN of the ensemble. FIG. 7C: True positive rates of the individual networks, compared against their ensemble for different classes. FIG. 7D: Test accuracy of the individual networks compared against their ensemble. The dashed lines show the classification performance improvement (~15.1%) achieved by the diffractive ensemble over the mean performance of the individual D²NNs. Three repeats with the same hyperparameters resulted in a blind classification accuracy of 60.41±0.10%, where 60.43% represents the median.

[0017] FIG. 8 illustrates a flowchart of operations used in designing an ensemble-based diffractive neural network for classification of at least one optical image, signal, or data (or other task) according to one embodiment. The ensemble diffractive neural network is also illustrated along with an exemplary task: classifying an image (knife).

[0018] FIG. 9A illustrates a single substrate layer of the diffractive neural network showing a plurality of physical features formed thereon that collectively define a pattern of locations along the length and width of each layer that have varied transmission coefficients (or varied reflection coefficients for reflection-based embodiments).

[0019] FIG. 9B shows a cross-sectional view of the single substrate layer of FIG. 9A. The physical features across the lateral surface of the substrate layer may be accomplished by varying the thickness (t) of the substrate or material making up the substrate layer at different locations along the layer.

[0020] FIG. 10A illustrates an embodiment of a single diffractive neural network of an ensemble that uses a differential configuration for the optical detectors. Each data class is represented by a pair of detectors (or other groupings) at the output plane, where the normalized difference between these optical detector pairs represents the class scores.

[0021] FIG. 10B illustrates circuitry that is used to perform a differential operation on groups of detectors of the embodiment of FIG. 1C.

[0022] FIG. 11 illustrates an ensemble-based diffractive neural network used for object classification that uses a plurality of different diffractive neural networks that collectively work together to perform the task of object classification.

[0023] FIG. 12 illustrates a table of different filters used either on the object plane or the Fourier plane used to train the diffractive neural network models used in the ensemble-based diffractive neural network. Examples of input filters or masks are also provided.

Detailed Description of Illustrated Embodiments

[0024] Ensemble learning refers to improving the inference capability of a system by training multiple models instead of a single model, and combining the predictions of the constituent models (known as base models, base learners or inducers). It is also possible to learn how to combine the decisions of the base learners, which is known as meta-learning (learning from learners). Ensemble learning is beneficial for several reasons; if the size of the training data is small, the base learners are prone to overfitting and as a result suffer from poor generalization to unseen data. Combining multiple base learners helps to ameliorate this problem. Also, by combining different models, the hypothesis space can be extended and the probability of getting stuck in a local minimum is reduced. An important aspect to consider when generating ensembles is the diversity of the learned base models. The learned models should be diverse enough to ensure that different models learn from different attributes of the data, such that through their "collective wisdom," the ensemble of these models can eliminate the implicit variance of the constituent models and substantially improve the collective inference performance. One approach to enrich the diversity of the base models is to manipulate the training data used to train different classifiers, making them learn different features of the input space in each trained model. On top of the training of these unique and independent classifiers, pruning methods that aim at finding small-sized ensembles while also achieving a competitive inference performance are also very important.

[0025] Based on these considerations, FIGS. 1A and 1B depict the two types of diffractive neural networks 20 (also referred to herein as Diffractive Deep Neural Networks (D²NNs)) (base learners) that were selected as the constituents of a diffractive neural network ensemble 22 (FIGS. 1C, 8, and 11). The difference between these two types of D²NNs 20 lies in the placement of the input filter 24 (in one embodiment, passive) in the form of a mask which is used to filter out different spatial features of the object field of an object 48, to variegate the information fed to the base D²NN classifiers 20. In the structure of the diffractive neural network 20 of FIG. 1A, the input filter 24 is placed on or near the object plane, whereas the structure of FIG. 1B uses an input filter 24 on the Fourier plane of a 4-f system placed before the D²NN 20. Further heterogeneity is introduced by diversifying the input filter 24 profiles for both types depicted in FIGS. 1A and 1B (see FIG. 12). For example, input filters 24 with transmissive windows of different shapes (rectangular, Gaussian, Hamming, Hanning) and different locations are used at the object plane. The input filters 24 used at the Fourier plane also vary in terms of their pass/stop bands. To further improve the diversity of the base model D²NNs 20, the input object information is encoded into either the phase channel with four (4) different dynamic ranges, or the amplitude channel of the illumination field. Using all of these different hyperparameter choices and their combinations, 1252 unique D²NN classifiers 20 were trained to form the initial pool of the D²NN classifier networks 20. Three hundred forty (340) of these D²NN classifier networks 20 had the input object information encoded in the amplitude channel, while 912 of them had phase encoded inputs. 276 of the amplitude encoded D²NNs 20 had an input filter 24 located on the object plane and 64 had an input filter 24 located on the Fourier plane. 656 of the phase-encoded-input D²NN classifier networks 20 had an input filter 24 on the object plane and 256 had an input filter 24 on the Fourier plane. For these 1252 unique D²NN classifier networks 20, each diffractive neural network 20 subsequently acts on the filtered version of the input image 50, and therefore the trained diffractive layers of each base D²NN 20 directly act on the space domain information (not the frequency or Fourier domain). While an input image 50 containing an object 48 is disclosed herein, it should be appreciated that the optical input 50 may include an optical signal, or optical data.
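As a purely illustrative aid (not part of the original disclosure), the following minimal Python sketch mimics the two filtering paths described above. The Fourier-space path is idealized here as an FFT, mask, inverse-FFT chain, whereas the disclosed system realizes it with a physical 4-f lens pair; all array names, sizes, and mask shapes are assumptions.

import numpy as np

def object_plane_filter(field, mask):
    # Multiply the complex input field by a passive object-plane mask.
    return field * mask

def fourier_plane_filter(field, mask):
    # Idealized 4-f filtering: FFT -> apply mask on the Fourier plane -> inverse FFT.
    spectrum = np.fft.fftshift(np.fft.fft2(field))
    return np.fft.ifft2(np.fft.ifftshift(spectrum * mask))

# Example: a phase-encoded 32x32 input with dynamic range 0..2*pi, filtered by a
# centered square window (object space) or a low-pass disk (Fourier space).
rng = np.random.default_rng(0)
img = rng.random((32, 32))                      # stand-in for a grayscale image
field = np.exp(1j * 2 * np.pi * img)            # phase encoding of the input

yy, xx = np.mgrid[-16:16, -16:16]
square_window = ((np.abs(xx) < 8) & (np.abs(yy) < 8)).astype(float)
lowpass_disk = (xx ** 2 + yy ** 2 < 8 ** 2).astype(float)

filtered_object_space = object_plane_filter(field, square_window)
filtered_fourier_space = fourier_plane_filter(field, lowpass_disk)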

[0026] In the D²NNs 20, as explained herein, the trained diffractive layers of the electronic models are then manufactured or fabricated into physical substrate layers 28 that are arranged in a spaced-apart arrangement as seen, for example, in FIGS. 1A-1C, 8, and 10A. As explained herein, the substrate layers 28 contain physical features 30 (or have physical properties) that reflect the computer-derived designs that are generated during the training of the electronic versions of the D²NNs 20 in the diffractive neural network ensemble 22. The different physical features 30 form "artificial neurons" in each of the plurality of substrate layers 28. The features 30 and/or the physical regions formed thereby act as artificial "neurons" that connect to other "neurons" of other substrate layers 28 of the diffractive neural network 20 through optical diffraction (or reflection in the case of a reflection-based embodiment) and alter the phase and/or amplitude of the light wave. The particular number and density of the features 30 or artificial neurons that are formed in each substrate layer 28 may vary depending on the type of application. In some embodiments, the total number of artificial neurons may only need to be in the hundreds or thousands while in other embodiments, hundreds of thousands or millions of neurons or more may be used. Likewise, the number of substrate layers 28 that are used in a particular diffractive neural network 20 may vary, although it typically ranges from at least two substrate layers 28 to less than ten substrate layers 28.

[0027] With respect to training of the electronic models of the diffractive neural networks 20, the preparation of this initial set of 1252 unique D²NNs 20 was followed by iterative pruning, with the aim of obtaining ensembles of significantly reduced size, i.e., with a much smaller number of D²NNs 20 (base models) in the ensemble. Ensemble pruning was performed by assigning weights (w) to each class score of the individual D²NN classifiers 20 and defining the ensemble class score as a weighted sum of the individual class scores (FIG. 1C). At each iteration of the ensemble pruning, the weights were optimized through the gradient descent and error backpropagation method to minimize the softmax-cross-entropy (SCE) loss between the predicted ensemble class scores and their one-hot labeled ground truth (of course, other fidelity loss functions may also be used), followed by choosing the set of weights giving the highest accuracy. Then, the 'significance' of the individual D²NNs 20 in a given state of the ensemble 22 was quantified and ranked by the absolute summation (i.e., L1 norm) of their weights, based on which a certain fraction of the D²NNs 20 were then eliminated from the ensemble 22 due to their relatively minor contributions. In addition to this greedy search, a periodic random elimination of the individual D²NNs 20 from the ensemble 22 was also used in the pruning process, so that the solution space could be expanded.
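The following short Python sketch (illustrative only; variable names and shapes are assumptions, not code from the disclosure) shows the two operations at the core of this pruning step: the class-specific weighted sum that forms the ensemble class score, and the L1-norm ranking used to decide which D²NNs to eliminate.

import numpy as np

def ensemble_scores(member_scores, weights):
    # member_scores: (K, C) class scores z_c^k of the K networks; weights: (K, C).
    # Returns the (C,) ensemble class score as a class-specific weighted sum.
    return np.sum(weights * member_scores, axis=0)

def rank_by_significance(weights):
    # Rank networks by the L1 norm of their optimized weight vectors; the
    # lowest-ranked fraction is the candidate set for elimination.
    l1_norms = np.abs(weights).sum(axis=1)      # one scalar per network
    return np.argsort(l1_norms)                 # ascending: front entries pruned first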

[0028] Based on this pruning process, the iterative search algorithm resulted in a sequence of D²NN ensembles 22 with gradually decreasing sizes. To select the final ensemble 22 with a desired size (i.e., the number of unique diffractive neural networks 20), a maximum limit on the ensemble size (referred to as the 'maximum allowed ensemble size', i.e., Nmax) was set, and a search was performed for the D²NN ensemble 22 that achieves the best performance in terms of inference accuracy on the validation dataset (i.e., the test dataset was never used during the pruning phase). As this procedure was followed for different values of the pruning hyperparameters, D²NN ensembles 22 with different sizes and blind testing accuracies were created; the search was repeated three (3) times for each set of hyperparameters, which helped quantify the mean and standard deviation of the inference accuracy for the resulting D²NN ensembles 22. Based on these analyses, FIGS. 2A, 2D reveal that as the maximum allowed ensemble size (Nmax) gets larger, the blind testing accuracies increase. FIGS. 2B, 2E show a similar trend, reporting the blind testing accuracies as a function of N, i.e., the number of D²NNs 20 in the selected ensemble 22. FIGS. 2C and 2F further report the relationship between N and Nmax during the pruning process, which indicates that on average these two quantities vary linearly (with a slope of ~1).
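A minimal sketch of this final selection rule, assuming the pruning iterations have already produced a list of candidate ensembles and a validation-accuracy function (both hypothetical names):

def select_final_ensemble(candidate_ensembles, validation_accuracy, n_max):
    # candidate_ensembles: lists of network indices of gradually decreasing size,
    # as produced by the pruning iterations; validation_accuracy: callable that
    # scores an ensemble on the validation set (the test set is never consulted).
    eligible = [e for e in candidate_ensembles if len(e) <= n_max]
    return max(eligible, key=validation_accuracy)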

[0029] While the results reported in FIGS. 2A, 2D, 2B, 2E demonstrate the significant gains achieved through the ensemble learning of diffractive neural networks 20, they also highlight a diminishing return on the blind inference accuracy of the ensemble 22 with an increasing number of selected D²NNs 20. For example, with ensemble sizes of N=14 and N=30 D²NNs 20, the ensemble 22 achieved blind image classification accuracies of 61.14±0.23% and 62.13±0.05%, respectively, on the CIFAR-10 test dataset. Increasing the ensemble size to e.g., N=77 D²NNs 20 resulted in a classification accuracy of 62.56% on the same test dataset. Because of this diminishing return achieved by larger ensemble sizes, Nmax=14 was used to better explore a sweet spot. Table 1 below reports the blind testing accuracies achieved for different pruning hyperparameters for a maximum allowed ensemble size of 14.

[0030] In particular, Table 1 shows a comparison of blind testing accuracy results achieved under different pruning hyperparameters, with a maximum allowed ensemble size of Nmax=14 (see FIG. 4). For the classification accuracies that are reported, the average and the standard deviation values result from three (3) independent repeats of the pruning process using the same hyperparameters. Table 2 describes the schemes used for r, denoted by (i), (ii) and (iii). One solid box highlights the ensemble 22 achieving the best average blind testing accuracy (N=14), and the other solid box highlights the ensemble achieving the best average blind testing accuracy per network (N=12).

Table 1 and Table 2

[0031] These results summarized in Table 1 reveal that, although non-intuitive, the periodic random elimination of diffractive models during the pruning process results in better classification accuracies, compared to pruning with no random model elimination; see the columns in Table 1 with T=∞, where T refers to the interval between periodic random eliminations of D²NN 20 models. In Table 1, the best average blind testing accuracy (61.14±0.23%) that was achieved for Nmax=14 is highlighted. For three (3) individual repeats of the pruning process using the same hyperparameters, the classification accuracies achieved by the resulting fourteen (14) D²NNs 20 were 60.88%, 61.33% and 61.21%. FIGS. 3A-3D further present a detailed analysis of the latter N=14 ensemble that achieved a blind testing accuracy of 61.21%, which is the median among the three (3) repeats. Six of the selected base D²NN classifiers 20 have input filters 24 on the object plane, while the remaining eight have input filters 24 on the Fourier plane (FIG. 3A). FIG. 3B also shows the magnitudes of the class specific weights, optimized for the base classifiers of this N=14 ensemble. Even if these optimized weights are ignored and all set equal to 1, the same diffractive ensemble of fourteen (14) D²NNs 20 achieves a similar inference accuracy of 61.08%, a small reduction from 61.21%.

[0032] In addition to these, FIG. 3C also shows the true positive rates for each class, corresponding to the individual members of the N=14 D²NNs 20 as well as the ensemble 22. The improvements in the true positive rates of the ensemble 22 over the mean performance of the individual classifiers 20 for different data classes lie between 13.47% (for class 0) and 19.98% (for class 6). FIG. 3D further presents a comparison of the classification accuracies of the individual diffractive classifiers 20 compared against their ensemble 22. Through these comparative analyses reported in FIGS. 3C and 3D, it is evident that the performance of the ensemble 22 is significantly better than any individual diffractive network 20 of the ensemble 22, demonstrating the "wisdom of the crowd" achieved through the pruning process.

[0033] In Table 1, another metric is reported, i.e., 'the accuracy per network', which is the average accuracy divided by the number of networks in the ensemble, to reveal the performance efficiency of ensembles that achieve at least 60% average blind testing accuracy on the CIFAR-10 test dataset. The best performance achieved in Table 1 based on this metric is marked with a solid-line box: N=12 unique D²NNs selected by the pruning process with Nmax=14 achieved a blind testing accuracy of 60.35±0.39%, where the accuracy values for the individual three (3) repeats were 60.77%, 60.00% and 60.29%. Details of the latter ensemble with a blind testing accuracy of 60.29% (which is the median among the three repeats) can be found in FIGS. 5A-5D, revealing the selected input filters 24 and the class specific weights of the resulting 12 D²NNs 20 of this ensemble 22.

[0034] The results reveal that encoding the input object 48 information in the amplitude channel of some of the base D²NNs 20 and in the phase channel of the other D²NNs 20 helps to diversify the ensemble 22. Table 3 further confirms this by reporting the blind testing accuracies achieved when the initial ensemble 22 consists of only the 912 D²NNs 20 whose input is encoded in the phase channel.

Table 3

[0035] Table 3 shows a comparison of blind testing accuracy results achieved under different pruning hyperparameters, with only phase encoded input D²NNs 20 and a maximum allowed ensemble size of Nmax=14 (see FIG. 4). For the classification accuracies that are reported, the average and the standard deviation values result from three (3) independent repeats of the pruning process using the same hyperparameters. The upper rectangular box (T=20) highlights the D²NN ensemble 22 achieving the best average blind testing accuracy (N=14), and the lower rectangular box highlights the D²NN ensemble 22 achieving the best average blind testing accuracy per network (N=12).

[0036] A direct comparison of Table 1 and Table 3 reveals that including both types of input encoding (phase and amplitude) within the ensemble 22 helps improve the inference accuracy. Using only phase encoding for the input of the D²NNs 20, the best average blind testing accuracy achieved using Nmax=14 was 60.74±0.17% with an ensemble size of N=14. The detailed description of the median of these D²NN ensembles 22, with a classification test accuracy of 60.65%, is provided in FIGS. 6A-6D. FIGS. 7A-7D also show the details of another phase-only input encoding ensemble with N=12 D²NNs 20, achieving a blind testing accuracy of 60.43%.

[0037] Finally, it is noteworthy that the top ten D²NNs 20 (in terms of their individual blind testing accuracies) within the initial pool of 1252 networks were not selected in any of the D²NN ensembles 22 of FIG. 3 and FIGS. 5A-5D, 6A-6D, 7A-7D. This corroborates the conjecture that the individual performance of a base model might not be indicative of its performance within an ensemble 22. In fact, several of the base D²NNs 20 selected in the ensembles of FIG. 3 and FIGS. 5A-5D, 6A-6D, 7A-7D had blind testing accuracies <40%, whereas the blind testing accuracies of the best models (not chosen in any of the ensembles) were >50%.

[0038] Although forming a D²NN ensemble 22 of separately trained D²NNs 20 yields a major improvement in the classification and generalization performance of diffractive networks, further improvements might be possible to reduce the performance gap with respect to the state-of-the-art electronic neural networks. The classification accuracies of widely known all-electronic classifiers on the grayscale CIFAR-10 test image dataset can be summarized as: Support Vector Machine (SVM) 37.13%, LeNet 66.43%, AlexNet 72.64%, ResNet 87.54%. While the blind testing accuracy of an ensemble 22 of N=30 unique diffractive optical networks 20 (62.13±0.05%) comes close to the performance of LeNet, which was the first demonstration of a convolutional neural network (CNN), there is still a large performance gap with respect to the state-of-the-art CNNs, and this suggests that there might be more room for improvements, especially through a wider span of input feature engineering within larger pools of D²NNs 20, forming a much richer and more diverse initial condition for iterative pruning.

[0039] The presented improvement in the classification performance of D²NNs 20 obtained with feature engineering and ensemble learning does not come without cost. Due to the multiple optical paths that are part of this framework, the number of diffractive layers 28 and the opto-electronic detectors 40 to be fabricated and used increases in proportion to the number of networks (N) 20 used in the final ensemble 22, which results in an increased complexity for the optical network set-up. The required training time also increases significantly because of the need for a large number of individual networks 20 in the initial pool, which in this case was 1252 individual D²NNs 20. However, this training process is a one-time effort, and the inference time or latency remains the same by virtue of the parallel processing capability of the diffractive optical ensemble system 22; stated differently, the information processing occurs through diffraction of light within each D²NN 20 of the ensemble 22, and because all of the individual diffractive networks 20 of an ensemble 22 are passive devices that work in parallel, a slowdown in the speed of inference is not expected. Also, the detection circuitry complexity of the diffractive optics-based solutions is still minimal compared to its electronic counterparts, and the hardware complexity of D²NN ensembles 22 can be reduced even further by using an additive sum of the individual class scores instead of the weighted sum, at the cost of a very small sacrifice in the inference accuracy. For example, for the ensemble 22 of D²NNs 20 depicted in FIG. 3, if a simple additive sum of the individual class scores is used instead of the optimized class-specific weights, the blind classification accuracy reduces only slightly from 61.21% to 61.08%. This suggests that a further reduction in the hardware complexity is attainable with a very small sacrifice in the inference accuracy by discarding the specific weights of the class scores. However, these weights still play a very significant role in the pruning process, as they help in the selection of the diffractive models to be retained in each iteration during the ensemble pruning by measuring/quantifying the significance of the individual networks in an ensemble 22.

[0040] Some of the drawbacks associated with the relatively increased size and complexity of the optical hardware should also become less restrictive, since advances in integrated photonics and fabrication technologies have led to continuous miniaturization of opto-electronic devices. In addition to the issues of hardware complexity and size, to maintain a desired signal-to-noise ratio at the output optical detectors 40, the optical input (illumination) power of the system needs to be increased in proportion to the ensemble size. However, due to the availability of various high-power laser sources, this higher demand for increased illumination power of the system will not be a significant obstacle for its operation.

[0041] In summary, the statistical inference and generalization performance of D²NNs 20 was significantly improved using feature engineering with input filters 24 and ensemble learning. A total of 1252 unique D²NNs 20 were independently trained that were diversely engineered with various passive input filters 24. Using a pruning algorithm, these 1252 D²NNs 20 were then searched to select an ensemble 22 that collectively improves the image classification accuracy of the optical network. The results revealed that ensembles 22 of N=14 and N=30 D²NNs 20 achieve blind testing accuracies of 61.14±0.23% and 62.13±0.05%, respectively, on the classification of CIFAR-10 test images, which constitute the highest inference accuracies achieved to date by any diffractive neural network design applied on this dataset. The versatility of the D²NN framework stems from its applicability to different parts of the electromagnetic spectrum and the availability of miscellaneous fabrication techniques such as 3D printing and lithography. Together with further advances in the miniaturization and fabrication of optical systems, the presented results and the underlying platform may be utilized in a variety of applications, e.g., ultra-fast object classification, diffraction-based optical computing hardware, and computational imaging tasks.

[0042] Materials and Methods

[0043] Implementation of D²NNs. As the basic building block of the diffractive ensemble, all the individual D²NN base classifiers 20 presented herein consist of five (5) successive diffractive layers, which are then fabricated into corresponding substrate layers 28, which modulate the phase of the incident optical field and are connected to each other by free space propagation in air. The propagation model used was formulated based on the Rayleigh-Sommerfeld diffraction equation, assuming that each diffractive feature 30 (or "neuron") on the diffractive layers (or corresponding substrate layers 28) serves as a source of modulated secondary waves, which jointly form the propagated wave field. The presented results and analyses are broadly applicable to any part of the electromagnetic spectrum as long as the diffractive features 30 and the physical dimensions are accordingly scaled with respect to the wavelength of light. Using a coherent illumination wavelength of λ for all the diffractive network designs, the size of each neuron and the axial distance between two successive diffractive layers or substrate layers 28 were set to be ~0.5λ and 40λ, respectively, which guarantees an adequate diffraction cone for each neuron to optically communicate with all the neurons of the consecutive layer 28, and enables the diffractive optical network to be "fully-connected". Each optical detector 40 (e.g., photodetector) at the output plane of a D²NN 20 is assumed to be a square, with a width of 6.4λ. Since a differential detection scheme using pairs of optical detectors 40 for each class was employed, the optical detectors 40 were divided into two groups: positive detectors 40p and negative detectors 40n, which were collectively used to compute the differential class scores for network k, i.e., z_c^k, through the following equation:

[0044] z_c^k = (z_{c,+}^k - z_{c,-}^k) / (z_{c,+}^k + z_{c,-}^k)    (1)

[0045] where z_{c,+}^k and z_{c,-}^k denote the optical signal of the positive and the negative detectors 40p, 40n for class c, respectively. An empirical factor of K = 0.1, also termed the "temperature" coefficient in the machine learning literature, was a non-trainable hyperparameter that was utilized to achieve more efficient convergence during the training phase by dividing Eq. (1) by K. In addition, the input object optical image 50 to the diffractive neural networks 20 was encoded either in the amplitude or in the phase channel of the input illumination, which is assumed to be a uniform plane wave generated by a coherent source. The phase encoding of the input objects took values from either of the following four intervals: 0-0.5π, 0-π, 0-1.5π or 0-2π.
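For illustration, the following Python sketch simulates one phase-only, five-layer diffractive network and the differential readout of Eq. (1). Free-space propagation is implemented with the angular spectrum method as a numerical stand-in for the Rayleigh-Sommerfeld formulation used in the disclosure; the ~0.5λ feature size, 40λ layer spacing, five layers, and K = 0.1 follow the text, while the grid size, random phase layers, and detector readouts are assumptions.

import numpy as np

wavelength = 1.0                     # work in units of the illumination wavelength
pixel = 0.5 * wavelength             # ~0.5*lambda neuron size (per the text)
n = 128                              # neurons per side on each layer (illustrative)
dz = 40.0 * wavelength               # axial spacing between successive layers

fx = np.fft.fftfreq(n, d=pixel)
FX, FY = np.meshgrid(fx, fx)
kz = 2 * np.pi * np.sqrt(np.maximum((1 / wavelength) ** 2 - FX ** 2 - FY ** 2, 0.0))

def propagate(field, distance):
    # Free-space propagation via the angular spectrum method.
    return np.fft.ifft2(np.fft.fft2(field) * np.exp(1j * kz * distance))

def d2nn_forward(field, phase_layers):
    # Phase-only modulation at each diffractive layer, separated by 40*lambda gaps;
    # returns the intensity on the detector plane.
    for phi in phase_layers:
        field = propagate(field, dz) * np.exp(1j * phi)
    return np.abs(propagate(field, dz)) ** 2

def differential_class_scores(pos, neg, K=0.1):
    # Eq. (1): pos/neg are the per-class optical signals of the positive and
    # negative detectors; the non-trainable temperature K = 0.1 divides the result.
    return ((pos - neg) / (pos + neg)) / K

# Toy usage: a random 5-layer phase network and random detector readouts for 10 classes.
rng = np.random.default_rng(1)
layers = [rng.uniform(0, 2 * np.pi, (n, n)) for _ in range(5)]
intensity = d2nn_forward(np.ones((n, n), dtype=complex), layers)
pos, neg = rng.random(10) + 0.5, rng.random(10) + 0.5   # stand-ins for detector sums
z_k = differential_class_scores(pos, neg)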

[0046] Feature engineering of diffractive networks. Two types of feature-engineered diffractive networks 20 were used: one employed an input filter 24 placed on/against the object plane that filters the spatial signals directly, while the other one used an input filter 24 placed on the Fourier plane of a 4-f system to filter certain spatial frequency components of the object. Unless the filters 24 are specifically mentioned to be trainable, these input filter designs were pre-defined, keeping the transmittance of their pixels constant during the training of the diffractive networks 20 (see e.g., FIG. 12). For all the feature engineered D²NN classifiers 20, each diffractive network 20 subsequently acts on the filtered input image, directly processing the input information in the spatial domain, not the frequency or Fourier domain. The initial network pool contained 1252 individually-trained D²NNs 20. For each type of input filter 24 (e.g., mask) design, FIG. 12 gives a brief description, the number of trained base D²NN classifiers 20, and some examples.

[0047] The object plane filters 24 are designed to be of the same size as the input image 50, containing transmissive patterns, the amplitude distribution of which takes one of the following forms: 1) 2D Gaussian functions defined with variable shapes and center positions; 2) multiple superposed 2D Gaussian functions defined with variable center positions; 3) 2D Hamming/Hanning functions defined with variable center positions; 4) square windows with different sizes at variable center positions; 5) multiple square windows at variable center positions; 6) patch-shaped windows rotated at variable angles; 7) circular windows at variable center positions; 8) sinusoidal gratings with variable periods and orientations; 9) Fresnel zone plates with variable x-y spatial positions; and 10) superpositions of Gaussian functions and square windows at variable spatial x-y positions.
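By way of example, the following sketch generates three of the listed object-plane filter families (a 2D Gaussian, a square window, and a sinusoidal grating); the sizes, centers, and periods are illustrative choices, not values from the disclosure.

import numpy as np

def gaussian_mask(n, cx, cy, sigma):
    # 2D Gaussian transmission window centered at (cx, cy).
    yy, xx = np.mgrid[0:n, 0:n]
    return np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * sigma ** 2))

def square_window(n, cx, cy, half_width):
    # Binary square transmission window centered at (cx, cy).
    yy, xx = np.mgrid[0:n, 0:n]
    return ((np.abs(xx - cx) <= half_width) & (np.abs(yy - cy) <= half_width)).astype(float)

def sinusoidal_grating(n, period, angle):
    # Amplitude grating with a given period and orientation, scaled to [0, 1].
    yy, xx = np.mgrid[0:n, 0:n]
    u = xx * np.cos(angle) + yy * np.sin(angle)
    return 0.5 * (1 + np.sin(2 * np.pi * u / period))

masks = [gaussian_mask(32, 16, 10, 5), square_window(32, 8, 24, 6),
         sinusoidal_grating(32, 8.0, np.pi / 4)]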

[0048] For the second type of D²NNs 20 with a Fourier plane input filter 24, using the same Rayleigh-Sommerfeld diffraction equation mentioned above, a 4-f system with two lenses 42 was numerically implemented; the first lens 42 transforms the object information from the spatial domain to the frequency domain and the second lens 42 does the opposite. On the Fourier plane that is 2f away from the object plane, a single amplitude-only input filter 24, designed in one of the following forms, is employed: 1) various combinations of circular/annular passbands, which are defined through specifying a series of equally spaced concentric ring-like areas, such that it can serve as a low/high pass, single-band pass or multiband pass filter, or 2) a single trainable layer enabling the system to learn an input spatial frequency filter on its own. On the output image plane of the 4-f system that is 4f away from the object plane, a square aperture is placed with the same size as the object or 1.5 times the size of the object, before feeding the resulting complex-valued field into the diffractive network. In the numerical implementation, the lens has a focal length f of ~145.6λ and a diameter of 104λ.
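A minimal sketch of the first filter form, building binary Fourier-plane masks from concentric ring passbands (the ring radii here are illustrative, not the disclosed values):

import numpy as np

def annular_passband(n, bands):
    # bands: list of (r_in, r_out) radii in frequency pixels; returns a binary
    # amplitude mask that passes the listed rings and blocks everything else.
    yy, xx = np.mgrid[-n // 2:n // 2, -n // 2:n // 2]
    r = np.hypot(xx, yy)
    mask = np.zeros((n, n))
    for r_in, r_out in bands:
        mask[(r >= r_in) & (r < r_out)] = 1.0
    return mask

lowpass = annular_passband(32, [(0, 6)])          # single low-pass disk
multiband = annular_passband(32, [(2, 5), (9, 12)])  # two concentric passbands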

[0049] For each type of input filter 24 design, the number of trained base D²NNs 20 and some input filter 24 examples can be found in the table of FIG. 12.

[0050] Ensemble pruning. The method used for ensemble pruning involved iterative elimination of D²NN members 20 from the initial pool of 1252 unique networks 20 based on a quantitative metric indicative of an individual network's "significance" in the collective inference process. However, since a member's individual performance supremacy might not always translate into an improvement of the ensemble 22, during the iterative process some members were randomly eliminated. Ensemble pruning with intermittent random elimination of members 20 was found to result in better-performing ensembles compared to pruning without random elimination, as detailed in Table 1.

[0051] The pruning method (see FIG. 4) was initiated with an ensemble consisting of all the n₀ = 1252 individually trained D²NN models. An ensemble class score z_c was defined as

z_c = Σ_{k=1}^{n₀} w_{c,k} z_{c,k}

where z_{c,k} is the score predicted for class c by the member/network k (Eq. 1), and w_{c,k} is the corresponding class-specific weight. The weight vectors w_k = [w_{1,k}, w_{2,k}, ..., w_{C,k}], k = 1, 2, ..., n₀, were optimized by minimizing the softmax cross-entropy loss of the class scores predicted by the ensemble of D²NNs; C = 10 denotes the total number of classes in the dataset. To reduce overfitting of the weights to the training data examples, an L2 loss term was also included in the pruning loss function:

[0052] Pruning loss = -E[ Σ_{c=1}^{C} g_c log( exp(z_c) / Σ_{c'=1}^{C} exp(z_{c'}) ) ] + α Σ_{k=1}^{n₀} ||w_k||₂²

[0053] where α is set to be 0.001, E[.] denotes the expectation over the image batch, and g_c represents the c-th entry of the ground truth label vector g. During the optimization of the ensemble, in each iteration of the back propagation, all the image samples in the validation set were fed into the ensemble model (i.e., the batch size equals 5K); using training images for weight optimization during the ensemble pruning resulted in overfitting and therefore was not implemented. The class-specific weights were optimized using the gradient-descent algorithm (Adam) for 10,000 steps. After optimizing the weights, the individual members/networks 20 were ranked based on a quantitative metric. An intuitive choice for this metric could have been the individual prediction accuracy of each network 20. However, a better metric for measuring the significance of individual networks 20 in an ensemble 22 was found to be the L1 norm of the individual weight vectors optimized for validation accuracy. The superiority of the weight L1 norm as a metric was substantiated by the fact that it consistently resulted in ensembles achieving much better blind testing accuracies. After ranking the network members 20 based on their weight vectors, a certain fraction of them was eliminated from the bottom (i.e., the lowest-ranked ones), and the procedure was repeated with the reduced ensemble until only one member was left in the ensemble. As mentioned earlier, after every T iterations of the pruning, the member/network elimination was done randomly instead of by ranking-based elimination. However, to avoid eliminating the network members 20 with the largest weights, the random elimination was selected from within a fraction p (2/3 in this case) of the networks counted from the bottom. Once the pruning process was complete (see FIG. 4), a maximum allowed ensemble size (N_max) was set, and the ensemble 22 with the best performance on the validation dataset and satisfying the size limit was chosen. The test image dataset was never used during the pruning process.
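A compact numpy sketch of the pruning loop of paragraphs [0051]-[0053] is given below. It is illustrative only: the patent optimizes the weights with Adam in TensorFlow for 10,000 steps on the full validation set, whereas this sketch uses plain gradient descent, and the function names, step counts, and elimination fractions are assumptions.

```python
import numpy as np

def optimize_weights(scores, labels, steps=500, lr=0.05, alpha=0.001):
    """Gradient-descent stand-in for the Adam-based optimization of the
    class-specific weights: softmax cross-entropy of the weighted
    ensemble scores plus the L2 penalty alpha * sum_k ||w_k||^2."""
    n_img, n_mem, n_cls = scores.shape
    w = np.full((n_mem, n_cls), 1.0 / n_mem)
    for _ in range(steps):
        z = np.einsum("ikc,kc->ic", scores, w)            # ensemble scores z_c
        z -= z.max(axis=1, keepdims=True)                 # numerical stability
        p = np.exp(z); p /= p.sum(axis=1, keepdims=True)  # softmax
        grad_z = (p - labels) / n_img                     # d(cross-entropy)/dz
        w -= lr * (np.einsum("ikc,ic->kc", scores, grad_z) + 2 * alpha * w)
    return w

def prune_ensemble(scores, labels, t_random=3, frac_drop=0.2, p=2/3, seed=0):
    """Iteratively eliminate members ranked by the L1 norm of their
    weight vectors, with every t_random-th iteration replaced by a
    random elimination restricted to the bottom fraction p.

    scores: (n_images, n_members, C) validation-set class scores z_{c,k};
    labels: (n_images, C) one-hot ground truth."""
    rng = np.random.default_rng(seed)
    alive = list(range(scores.shape[1]))
    snapshots, it = [], 0
    while len(alive) > 1:
        w = optimize_weights(scores[:, alive], labels)
        order = np.argsort(np.abs(w).sum(axis=1))         # ascending L1 norm
        it += 1
        if it % t_random == 0:                            # intermittent random elimination
            drop = {int(rng.choice(order[: max(1, int(p * len(alive)))]))}
        else:
            drop = set(order[: max(1, int(frac_drop * len(alive)))].tolist())
        alive = [m for i, m in enumerate(alive) if i not in drop]
        snapshots.append(list(alive))                     # candidate ensembles
    return snapshots  # the best candidate under N_max is then picked on validation
```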

[0054] Training details. All the D²NNs 20 and their weighted ensembles 22 were numerically implemented and trained using Python (v3.6.5) and TensorFlow (v1.15.0, Google Inc.). An Adam optimizer with default parameters (predefined in TensorFlow) was used to calculate the back-propagated gradients during the training of the individual optical models and the ensemble weights. The learning rate, starting from an initial value of 0.001, was set to decay at a rate of 0.7 every 8 epochs. Since the images in the original CIFAR-10 dataset contain three color channels (red, green and blue) and monochromatic illumination is used in the diffractive neural network models, the built-in rgb_to_grayscale function in TensorFlow was applied to convert these color images to grayscale. Also, to enhance the generalization capability of the trained D²NNs 20, the images were randomly flipped (left-to-right) with a probability of 0.5 during training. For the training of the individual D²NNs 20, a batch size of 8 was used, each model was trained for 50 epochs using the training image set, and the best model was selected based on the classification performance on the validation image set. The D²NN loss function for a given network k used softmax cross-entropy between the differential class scores z_{c,k} and their one-hot labeled ground truth vector:

[0055] Loss_k = -E[ Σ_{c=1}^{C} g_c log( exp(z_{c,k}) / Σ_{c'=1}^{C} exp(z_{c',k}) ) ]

[0056] where E[.] denotes the expectation over the training images in the current batch, C = 10 denotes the total number of classes in the dataset, and g_c represents the c-th entry of the ground truth label vector g.
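For reference, a minimal numpy version of this per-network loss is sketched below; the function name is illustrative, and z holds the differential class scores for one batch.

```python
import numpy as np

def d2nn_loss(z, g):
    """Softmax cross-entropy between differential class scores z of
    shape (batch, C) and one-hot labels g of shape (batch, C),
    averaged over the images in the current batch."""
    z = z - z.max(axis=1, keepdims=True)                           # numerical stability
    log_softmax = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -(g * log_softmax).sum(axis=1).mean()
```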

[0057] For all the training tasks detailed above, a desktop computer was used with a GTX 1080 Ti graphics processing unit (GPU, Nvidia Inc.), an Intel® Core™ i7-8700 central processing unit (CPU, Intel Inc.) and 64 GB of RAM, running the Windows 10 operating system (Microsoft Inc.). The typical training time of one D²NN model is ~4.5 hours. The time required for the iterative ensemble pruning process depends on the pruning hyperparameters, varying between 0.75 and 7.5 hours.

[0058] Once the diffractive neural network 20 models are pruned and finalized, a physical embodiment of the diffractive neural network ensemble 22 may then be manufactured that can be used, for example, for classification of optical image(s), signal(s), or data. While classification is illustrated as one illustrative task or function, it should be appreciated that the task and/or function of the resulting diffractive neural network may include processing of the input optical image, signal, or data. FIG. 8 illustrates a flowchart of the operations or processes according to one embodiment to create and use the physical embodiment of the diffractive neural network ensemble 22. As seen in operation 200 of FIG. 8, a specific task/function is first identified that the diffractive neural network ensemble 22 will perform. This may include, for example, classification and/or processing of an optical image, signal or data. Once the task or function has been established, a computing device 100 having one or more processors 102 executes software 104 thereon to digitally train software models or mathematical representations of different D²NNs 20 or diffractive neural network computer models that form a large initial pool of diffractive neural network 20 models. Each diffractive neural network 20 model of the initial pool includes a filter 24 and multiple diffractive or reflective substrate layers 28; these models are trained to the desired task or function to then generate a design for a physical embodiment of the diffractive neural network ensemble 22. The filters 24 may include different shapes that represent different transmission or reflection functions. The filters 24 may include object space input filters and/or Fourier space input filters. In some embodiments, the filters 24 are "learnable" filters (transmissive or reflective) that have aspects or characteristics that are learned during the training of the individual diffractive neural network models. In one aspect, spatial light modulators (SLMs) may be used as filters 24 or in addition to separate filters 24. This training operation and generation of the larger initial pool of diffractive neural networks 20 is illustrated as operation 210 in FIG. 8.

[0059] Note that computer-based models of the diffractive neural networks can ultimately be fabricated into a physical diffractive neural network 20 as part of the ensemble 22. The computer-based model includes a design for the physical layout for the different physical features 30 that form the “artificial neurons” in each of the plurality of physical substrates 28 which are present in the diffractive neural networks 20 that define the ensemble 22. The designs of the diffractive neural network models are used to make a physical embodiment of the diffractive neural network ensemble 22 that reflects the computer-derived design. The physical features 30 and the physical regions formed thereby act as artificial “neurons” that connect to other “neurons” of other substrate layers 28 of the diffractive neural networks 20 through optical diffraction (or reflection in the case of a reflection-based embodiment) and alter the phase and/or amplitude of the light wave.

[0060] The particular number and density of the physical features 30 or artificial neurons formed in each substrate layer 28 may vary depending on the type of application. In some embodiments, the total number of artificial neurons may only need to be in the hundreds or thousands, while in other embodiments hundreds of thousands, millions, or more neurons may be used. Likewise, the number of substrate layers 28 used in a particular diffractive neural network 20 may vary, although it typically ranges from at least two substrate layers 28 to fewer than ten substrate layers 28. Training of the diffractive neural network 20 models may be done by training the optical networks 20 individually or in groups of two or more.

[0061] In one embodiment, the physical features 30 or artificial neurons created within the substrate layers 28 are formed as varied thicknesses (t) of material at different lateral locations along the substrate layer 28, as seen in FIG. 9B. In one embodiment, the different thicknesses (t) modulate the phase of the light passing through the substrate layer 28. This type of physical feature 30 may be used, for instance, in the transmission mode embodiment of FIGS. 1A-1C. The different thicknesses of material in the substrate layer 28 form a plurality of discrete "peaks" and "valleys" that control the transmission coefficient of the neurons formed in the substrate layer 28. The different thicknesses of the substrate layer 28 may be formed using additive manufacturing techniques (e.g., 3D printing) or lithographic methods utilized in semiconductor processing. This includes well-known wet and dry etching processes that can form very small lithographic features on a substrate layer 28. Lithographic methods may be used to form very small and dense physical features 30 on the substrate layer 28, which may be used with shorter wavelengths of light. In some embodiments, the physical features 30 are fixed in a permanent state (i.e., the surface profile is established and remains the same once complete).
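Under the common thin-element approximation, the mapping from a feature's thickness to its complex transmission coefficient can be sketched as below; the refractive index and extinction values are placeholders, and the patent's exact material model may differ.

```python
import numpy as np

def thickness_to_transmission(t, wavelength, n=1.7, kappa=0.0):
    """Complex transmission coefficient of a substrate layer whose
    neurons are thickness variations t(x, y), in the thin-element
    approximation with surrounding air/vacuum:
      phase delay: phi = 2*pi*(n - 1)*t / wavelength
      amplitude:   exp(-2*pi*kappa*t / wavelength)  (kappa = 0 -> lossless)"""
    phi = 2 * np.pi * (n - 1.0) * t / wavelength
    amp = np.exp(-2 * np.pi * kappa * t / wavelength)
    return amp * np.exp(1j * phi)
```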

[0062] In another embodiment, the substrate layer(s) 28 may have a substantially uniform thickness, with different regions of the substrate layer 28 having different optical properties. For example, the refractive index of the substrate layers 28 may be altered by doping the substrate layers 28 with a dopant (e.g., ions or the like) to form the regions of neurons in the substrate layers 28 with controlled transmission properties. In still other embodiments, optical nonlinearity can be incorporated into the optical network design using various optical nonlinear materials (crystals, polymers, semiconductor materials, doped glasses, organic materials, graphene, quantum dots, carbon nanotubes, and the like) that are incorporated into the substrate layer 28. A masking layer or coating that partially transmits or partially blocks light at different lateral locations on the substrate layer 28 may also be used to form the neurons on the substrate layers 28.

[0063] Alternatively, the transmission function of a neuron (i.e., physical feature 30) can also be engineered by using metamaterials, metasurfaces, or plasmonic structures. Combinations of all these techniques may also be used. In other embodiments, non-passive components may be incorporated into the substrate layers 28, such as spatial light modulators (SLMs). SLMs are devices that impose spatially varying modulation of the phase, amplitude, or polarization of light. SLMs may include optically addressed SLMs and electrically addressed SLMs. Electrically addressed SLMs include liquid crystal-based technologies that are switched by using thin-film transistors (for transmission applications) or silicon backplanes (for reflective applications). Another example of an electrically addressed SLM includes magneto-optic devices that use pixelated crystals of aluminum garnet switched by an array of magnetic coils using the magneto-optical effect. Additional electronic SLMs include devices that use nanofabricated deformable or moveable mirrors that are electrostatically controlled to selectively deflect light. In this regard, the physical features 30 of the substrate layers 28 may be reconfigurable and can change as a function of time.

[0064] As seen in FIG. 8, in some embodiments, optional SLMs 44 may be interposed along the optical path prior to the filter 24 of each D²NN 20 in the ensemble 22. Beam splitters or the like may be used to create the multiple optical paths for the different physical diffractive neural networks 20 within the ensemble 22. The SLMs 44 are used to encode the optical input with phase or amplitude encoding, which imparts yet another degree of diversity to the design. Referring still to FIG. 8, the initial pool of computer-generated D²NNs 20 is then subjected to an iterative pruning algorithm/operation to generate a final ensemble 22 of diffractive neural networks 20 that will be used to accomplish the specific task/function. This iterative pruning operation is seen in operation 220 of FIG. 8. Next, operation 230 reflects that the design is used to manufacture, or have manufactured, the physical embodiment of the diffractive neural networks 20 that form the ensemble 22 in accordance with the design. The design, which in some embodiments may be embodied in a software format (e.g., SolidWorks, AutoCAD, Inventor, or other computer-aided design (CAD) or lithographic software program), may then be manufactured into a physical embodiment that includes the SLM(s) 44, filters 24, the plurality of substrate layers 28, optical detectors 40, and lenses 42. The input filter 24 may be located at or near the object plane/Fourier plane or its virtual and/or digital replica (e.g., a projected image of the object at the object plane). As explained herein, a differential detection scheme may be employed with the detectors 40 divided into two groups: positive detectors 40p and negative detectors 40n, which are used to calculate differential class scores.

[0065] FIGS. 10A and 10B illustrate a physical embodiment of a single diffractive neural network 20 of an ensemble 22 that uses optical detectors 40 performing a differential operation using sub-groups of individual pairs of optical detectors 40p, 40n within the overall set or group of optical detectors 40. Here, the optical detectors 40 are coupled to circuitry 41 that is used to perform a differential operation on groups of detectors 40. In particular, in one implementation, a group of optical detectors 40 is formed by a pair of optical detectors 40, with one of the optical detectors 40 being classified as a virtually "positive" optical detector 40p and the other optical detector 40 being classified as a virtually "negative" detector 40n. A positive optical detector 40p is an optical detector 40 whose output (e.g., output signal or data) is added to another optical signal or data with a positive scaling factor or coefficient. A negative optical detector 40n is an optical detector 40 whose output (e.g., output signal or data) is added to another optical signal or data with a negative scaling factor or coefficient.

[0066] For example, FIG. 10B illustrates a differential amplifier circuit 41 that is used to generate an output 43 that is the signal difference between the inputs from the negative optical detector 40n and the positive optical detector 40p within a particular group. Each group of detectors 40 includes its own circuitry or hardware 41 (or shares common circuitry or hardware 41 with time multiplexing of inputs) that is used to calculate the signal difference between the negative optical detector(s) 40n and positive optical detector(s) 40p making up the group (e.g., the pair illustrated in the dashed box of FIG. 10A). As explained herein, in one embodiment the final task is based on identifying the particular optical detector 40 group where the normalized signal difference is maximized. That is to say, the particular group of detectors 40 that has the largest normalized signal difference is identified. This group of detectors 40 is associated with a particular classification, which is then output as the desired task of the ensemble 22.
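The detector-side logic of paragraphs [0065]-[0066] reduces to a few lines, as sketched below; normalizing the difference by the pair sum is an assumed form, since the precise definition follows Eq. (1) earlier in this document.

```python
import numpy as np

def classify_by_detector_pairs(i_pos, i_neg, eps=1e-12):
    """Return the index of the class pair with the largest normalized
    signal difference; i_pos and i_neg are the measured signals of the
    positive (40p) and negative (40n) detectors, each of shape (C,)."""
    diff = np.asarray(i_pos) - np.asarray(i_neg)            # differential amplifier outputs 43
    norm = diff / (np.asarray(i_pos) + np.asarray(i_neg) + eps)
    return int(np.argmax(norm))                             # winning class pair
```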

[0067] With reference to FIG. 11, in this example, an image of an object 48 (e.g., a knife), or an optical signal from or representative of the object 48, serves as the input 50 to the diffractive neural networks 20 of an ensemble 22 (only one such D²NN 20 is shown here). The task to be performed is classification of the object 48 contained in the image input 50 to the diffractive neural network 20. The optical output 38 of the diffractive neural network 20 is captured by optical detectors 40, which include the pairs of positive 40p and negative 40n optical detectors. The hardware/differential amplifier circuit 41 is used to generate an output 43 that corresponds to the classification of the object 48 in the image, which in this case is a knife. Note that the grouping of optical detectors 40 may involve a pair of detectors 40 (e.g., one positive detector 40p and one negative detector 40n). Groupings, however, may encompass additional variations, such as two or more positive detectors 40p and two or more negative detectors 40n.

[0068] The physical substrate layers 28, once manufactured, may be mounted or disposed in a holder 46 as seen in FIG. 10A. The holder 46 may include a number of slots formed therein to hold the substrate layers 28 in the required sequence and with the required spacing between adjacent substrate layers 28. The plurality of substrate layers 28 may be positioned within and/or surrounded by vacuum, air, a gas, a liquid, or a solid material.

[0069] Once the physical embodiment of the diffractive neural networks 20 of the ensemble 22 has been made, the diffractive neural network ensemble 22 is then used to perform the specific task or function, as illustrated in operation 240 of FIG. 8. The optical input 50 is input to the diffractive neural networks 20 that make up the ensemble 22, and the output optical signal(s) 38 is/are captured or imaged with the optical detectors 40. It should be appreciated that a variety of different optical detectors 40 could be employed. These include a CMOS image sensor or image chip, a CCD, photodetectors (e.g., a photodiode such as an avalanche photodiode detector (APD)), photomultiplier (PMT) devices, focal plane arrays, single-pixel detectors, and the like. There may be a single optical detector 40 or an array of optical detectors 40. As seen in the specific example of FIG. 8, the optical input 50 is an image of an object 48 (in this example, a knife). The physical diffractive neural networks 20 of the ensemble 22 each output a result to the respective optical detectors 40, which are then used to output a final output signal 43 or value, which is a classification output in this example (i.e., the input image is a knife). Additional circuitry, such as the circuitry 41 shown in FIG. 10B, may be coupled to the respective optical detectors 40 to receive the different signals and combine them to produce the final output based on the weighted signals from each optical detector 40.
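At inference time, the ensemble's final decision combines each member's detector-derived scores with the class-specific weights retained from pruning; a minimal sketch follows, with the function name and array shapes assumed for illustration.

```python
import numpy as np

def ensemble_predict(member_scores, weights):
    """member_scores: (n_members, C) differential class scores z_{c,k}
    read out from the detectors 40 of each D2NN in the ensemble;
    weights: (n_members, C) class-specific weights w_{c,k} from pruning.
    Returns the class index maximizing z_c = sum_k w_{c,k} * z_{c,k}."""
    z = (np.asarray(weights) * np.asarray(member_scores)).sum(axis=0)
    return int(np.argmax(z))
```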

[0070] While embodiments of the present invention have been shown and described, various modifications may be made without departing from the scope of the present invention. While the invention has largely been described as being used for image classification, it should be appreciated that the ensemble 22 of diffractive neural networks 20 can be used in other machine learning and/or machine vision tasks. Likewise, the training of the diffractive neural network 20 models described herein may involve transmissive and/or reflective layers. In this regard, training involves iteratively adjusting the phase and/or amplitude of the transmission/reflection coefficients for each substrate layer 28 of the multilayer diffractive neural network 20 models. In addition, in optional embodiments, one or more digital displays or screens may be interposed along one or more optical paths or at the input 50 of the individual diffractive neural networks 20. These displays or screens may be used to provide the optical input 50 to the D²NNs 20 of the ensemble 22.

[0071] The invention, therefore, should not be limited, except by the following claims and their equivalents.