Title:
SYSTEM AND METHOD FOR TRAINING AN EYE STATE PREDICTOR
Document Type and Number:
WIPO Patent Application WO/2024/132135
Kind Code:
A1
Abstract:
A method for training an eye state predictor (ESP), which is implemented as a neural network, includes feeding a first eye-related observation (ERO, EROi, Pi, Pr) as input to the eye state predictor (ESP) to determine a predicted 3D eye state (3DES, {EC, EG, EP}) of at least one eye of the subject for a time (tK) as output of the eye state predictor (ESP), the first eye-related observation (ERO, EROi, Pi, Pr) referring to the at least one eye of the subject for the time (tK). The predicted 3D eye state ({EC, EG, EP}) is fed as input to a differentiable predictor (DP) to determine a prediction (II) for the at least one eye of the subject for the time (tK) as output of the differentiable predictor (DP). Based on the prediction (II) and at least one of the first eye-related observation (ERO, EROi, Pi, Pr) and a second eye-related observation, a training loss (Δ) is determined. The second eye-related observation (ERO2) refers to the at least one eye of the subject for the time (tK). The training loss (Δ) is used to train the eye state predictor (ESP).

Inventors:
DIERKES, Kai (Berlin, DE)
PETERSCH, Bernhard (Berlin, DE)
DREWS, Michael (Berlin, DE)
Application Number:
PCT/EP2022/087323
Publication Date:
June 27, 2024
Filing Date:
December 21, 2022
Assignee:
PUPIL LABS GMBH (Berlin, DE)
International Classes:
G06F3/01; G06V10/82
Attorney, Agent or Firm:
ZIMMERMANN & PARTNER PATENTANWÄLTE MBB (München, DE)
Claims:
Claims

1. A method (1000, 2000, 3000, 4000, 5000) for training an eye state predictor (ESP) implemented as a neural network (NN), in particular as a convolutional neural network (CNN), the method comprising:
feeding (1200, 2200, 3200, 4200, 5200) a first eye-related observation (ERO, EROi, Pi, Pr) as input to the eye state predictor (ESP) to determine a predicted 3D eye state (3DES, {Ec, EG, Ep}) of at least one eye of a subject for an observation situation (tk) as output of the eye state predictor (ESP), the first eye-related observation (ERO, EROi, Pi, Pr) referring to the at least one eye of the subject in and/or during the observation situation (tk);
feeding (1300, 2300, 3300, 4300, 5300) the predicted 3D eye state ({Ec, EG, Ep}) as input to a differentiable predictor (DP, DPI) to determine a prediction (II, III) for the at least one eye of the subject in and/or during the observation situation (tk) as output of the differentiable predictor (DP);
determining (1400, 2400, 3400, 3401, 4400, 4500), based on the prediction (II, III) and at least one of the first eye-related observation (ERO, EROi, Pi, Pr) and a second eye-related observation, a training loss (A, Al, A2, A’), the second eye-related observation (ERO2) referring to the at least one eye of the subject in and/or during the observation situation (tk); and
using (1500, 2500, 3500, 4500, 5500) the training loss (A, Al, A2, A’) to train the eye state predictor (ESP).

2. The method (1000, 2000, 3000, 4000, 5000) of claim 1, wherein the observation situation (tk) refers to a corresponding observation time (tk) and/or is represented by at least one of the observation time (tk), and an observation ID typically depending on the observation time (tk), wherein at least one of the first eye-related observation (ERO, Pi, Pr) and the second eye-related observation (ERO2) comprises at least one of an eye image (Pi, Pr) of an eye of the subject recorded by an eye camera during the observation situation and/or for the time (tk) as a primary eye-related observation (pERO), and a corresponding secondary eye-related observation (sERO) derived from the eye image (Pi, Pr), the corresponding secondary eye-related observation (sERO) typically comprising at least one visual feature extracted from the eye image (Pi, Pr) and/or being implemented as visual feature representation of the eye image (Pi, Pr) comprising the at least one visual feature, in particular as a semantic segmentation (Si, Sr) of the eye image (Pi, Pr).

3. The method (1000, 2000, 3000, 4000, 5000) of claim 1 or 2, wherein at least one of the first eye-related observation (ERO, Pi, Pr) and the second eye-related observation (ERO2) comprises a left eye image (Pi) of a left eye and a right eye image (Pr) of a right eye of the subject recorded by a respective eye camera during the observation situation and/or for the observation time (tk) as a respective primary eye-related observation (pERO), and/or wherein at least one of the first eye-related observation (ERO, Pi, Pr) and the second eye-related observation (ERO2) comprises corresponding secondary eye-related observations (sERO) derived from the left and right eye images (Pi, Pr), in particular a semantic segmentation (Si) of the left eye image (Pi) and a semantic segmentation (Sr) of the right eye image (Pr).

4. The method (1000, 2000, 3000, 4000, 5000) of claim 2 or 3, wherein the semantic segmentation (Si, Sr) of the respective eye image (Pi, Pr) comprises at least one label (Li, Lr) selected from a list consisting of a pupil label, an iris label, a sclera label, an eyelid label, a skin label, and an eye lash label.

5. The method (1000, 2000, 3000, 4000, 5000) of any of the preceding claims, further comprising: determining (1100, 2100, 3100), typically using a scene camera (160), a scene observation (Ps) referring to a field of view of the at least one eye of the subject during the observation situation and/or for the observation time (tk) as an eye-related observation (ERO).

6. The method (1000, 2000, 3000, 4000, 5000) of claim 5, wherein the scene observation (Ps) comprises at least one of a scene image (Ps) typically recorded by the scene camera as a primary eye-related observation (pERO), and at least one corresponding secondary eye-related observation derived from the scene image (Ps).

7. The method (1000, 2000, 3000, 4000, 5000) of any of the claims 2 to 6, wherein the at least one corresponding secondary eye-related observation (sERO) comprises at least one of a semantic segmentation (Si, Sr) of the scene image (Ps), a 3D gaze point typically measured in scene camera coordinates, and a 2D gaze point typically measured in scene camera image coordinates.

8. The method (1000, 2000, 3000, 4000, 5000) of any of the preceding claims, wherein at least one of the first eye-related observation (ERO, Pi, Pr) and the second eye-related observation comprises at least one of a corneo-retinal standing potential of the at least one eye of the subject during the observation situation and/or for the observation time (tk), a velocity of a head of the subject during the observation situation and/or for the observation time (tk), an acceleration of the head of the subject during the observation situation and/or for the observation time (tk), an orientation of the head of the subject during the observation situation and/or for the observation time (tk), respective 2D gaze point during the observation situation and/or for the observation time (tk) and a 3D gaze point during the observation situation and/or for the observation time (tk) as a respective primary observation.

9. The method (1000, 2000, 3000, 4000, 5000) of any of the preceding claims, wherein the second eye-related observation (ERO2) is different from the first eye-related observation (EROi), wherein the first eye-related observation (EROi) and the second eye-related observation (ERO2) are used to determine the training loss (A, Al, A2, A’), wherein the first eye-related observation (EROi) comprises or even is the primary eye-related observation (pERO), and/or wherein the second eye-related observation comprises or even is the secondary eye-related observation (sERO).

10. The method (1000, 2000, 3000, 4000, 5000) of any of the preceding claims, wherein the respective eye-related observation is determined using a head-wearable device comprising the eye camera, the head-wearable device typically comprising at least one of a left eye camera, a right eye camera (150), a scene camera (160), and an inertial measurement unit (170).

11. The method (1000, 2000, 3000, 4000, 5000) of any of the preceding claims, further comprising at least one of:
determining (1100, 2100, 3100), using an eye camera (150), at least the first eye-related observation (ERO, Pi, Pr) during the observation situation and/or for the observation time (tk), typically a plurality of eye-related observations (ERO, Pi, Pr) for respective observation times (tk);
storing the first eye-related observation (ERO, Pi, Pr), typically the plurality of eye-related observations (ERO, Pi, Pr), in a database;
using the database to determine the first eye-related observation;
using the database to determine the second eye-related observation;
feeding (1200, 2200, 3200) several eye-related observations (Pi, Pr, Ss, Si, Sr) as input to the eye state predictor (ESP) to determine a respective predicted 3D eye state ({Ec, EG, Ep}) of the at least one eye.

12. The method (1000, 2000, 3000, 4000, 5000) of any of the preceding claims, wherein a plurality of respective eye-related observations (ERO, Pi, Pr) are determined, wherein the respective eye-related observations (ERO, Pi, Pr) are determined for different subjects and/or for several respective observation times (tk) and/or wherein the method (1000, 2000, 3000, 4000, 5000) is performed iteratively.

13. The method (1000, 2000, 3000, 4000, 5000) of any of the preceding claims, wherein the predicted 3D eye state ({Ec, EG, Ep}) comprises at least one of, typically three of or even all of a predicted 3D center (Ec) of an eyeball of the at least one eye, a predicted 3D gaze direction (EG) of the at least one eye, a predicted 3D state of an eyelid of the at least one eye and a predicted 3D state (Ep) of a pupil of the at least one eye.

14. The method (1000, 2000, 3000, 4000, 5000) of claim 13, wherein the 3D state of the pupil comprises at least one of a predicted 3D pupil size of the at least one eye, a predicted 3D pupil aperture of an iris of the at least one eye, a predicted 3D pupil radius of an iris of the at least one eye, and a predicted 3D pupil diameter of the iris of the at least one eye.

15. The method (1000, 2000, 3000, 4000, 5000) of claim 13 or 14, wherein the 3D state of the eyelid of the at least one eye comprises at least one of an eyelid shape, an eyelid position, and a percentage of an eyelid closure.

16. The method (1000, 2000, 3000, 4000, 5000) of any of the preceding claims, wherein the predicted 3D eye state ({Ec, EG, Ep}) is a monocular state (3DESm), a pair of corresponding left and right monocular 3D eye states or a binocular state (3DESb), wherein the predicted 3D eye state ({Ec, EG, Ep}) comprises and/or is one of a 6 dimensional vector, a 10 dimensional vector, an 11 dimensional vector and a 12 dimensional vector, and/or wherein the binocular state (3DESb) is represented by a data set with a lower dimensionality than the sum of the dimensionalities of the corresponding left and right monocular 3D eye states.

17. The method (2000) of any of the preceding claims, wherein the differentiable predictor (DP) is an identity operator (I), and wherein the predicted 3D eye state ({Ec, EG, Ep}) and a corresponding 3D eye state ({Ec, EG, Ep}’) during the observation situation and/or for the observation time (tk) are used to determine the training loss (A, Al, A2, A’).

18. The method (1000, 2000, 3000, 4000, 5000) of claim 17, wherein the corresponding 3D eye state ({Ec, EG, Ep}’) is determined using the first eye-related observation (EROi, Pi, Pr), but differently than the predicted 3D eye state ({Ec, EG, Ep}), or using the second eye-related observation (ERO2).

19. The method (1000, 2000, 3000, 4000, 5000) of claim 18, wherein the corresponding 3D eye state ({Ec, EG, Ep}’) is determined using a 3D geometric eye model, in particular a 3D geometric eye model taking into account corneal refraction.

20. The method (1000, 3000, 4000, 5000) of any of the claims 1 to 16, wherein the differentiable predictor (DP) is different to an identity operator (I).

21. The method (1000, 2000, 3000, 4000, 5000) of any of the claims 1 to 16 and 20, wherein the differentiable predictor (DP) is configured to output a synthetic eye image (SIi, SIr) as prediction in response to receiving the input, and the training loss (A) is determined based on the synthetic eye image (SIi, SIr) and a corresponding visual feature representation extracted from the eye image (Pi, Pr), in particular a corresponding semantic segmentation (Si, Sr) of the eye image (Pi, Pr).

22. The method (1000, 2000, 3000, 4000, 5000) of any of the claims 1 to 16, 20, and 21, wherein the differentiable predictor (DP) is implemented as a generative neural network or a differentiable renderer such as an approximate ray tracing algorithm which is differentiable.

23. The method (1000, 2000, 3000, 4000, 5000) of any of the preceding claims, wherein at least one loss function (LF, LF1, LF2, LF’) is used to determine the training loss (A, Al, A2, A’),
wherein at least two differentiable predictors (DP, DPI) are used to determine a respective prediction (II, III) for the at least one eye of the subject during the observation situation and/or for the observation time (tk), wherein the training loss (A, Al, A2, A’) is determined based on each of the respective predictions (II, III),
wherein the predicted 3D eye state ({Ec, EG, Ep}) is fed as input to a first differentiable predictor (DP) to determine a first prediction (II) for the at least one eye of the subject during the observation situation and/or for the observation time (tk) as output of the first differentiable predictor (DP),
wherein the predicted 3D eye state ({Ec, EG, Ep}) is fed as input to a second differentiable predictor (DPI) to determine the second prediction (III) for the at least one eye of the subject during the observation situation and/or for the observation time (tk) as output of the second differentiable predictor (DPI),
wherein the first prediction (II) and the first eye-related observation (EROi) or the second eye-related observation (ERO2) are fed as input to a first loss function (LF) to determine a first training loss (A) as output of the first loss function (LF),
wherein the second prediction (III) and the first eye-related observation (EROi), the second eye-related observation (ERO2) or a third eye-related observation (ERO3) are fed as input to a second loss function (LF1) to determine a second training loss (Al) as output of the second loss function (LF1),
wherein the second training loss (Al) is used to train the eye state predictor (ESP), wherein the first training loss (A) and the second training loss (Al) are used to train the eye state predictor (ESP), and/or wherein the training loss is determined as a function of the first training loss (A) and the second training loss (Al).

24. The method (1000, 2000, 3000, 4000, 5000) of any preceding claim, wherein the training loss (A, Al, A2, A') depends on at least one situation-specific parameter (PAR) and/or wherein the training loss (A, Al, A2, A') is used to amend and/or learn at least one situation-specific parameter (PAR).

25. The method (1000, 2000, 3000, 4000, 5000) of claim 24, wherein both the eye state predictor (ESP) and the differentiable predictor (DP) receive the at least one situation-specific parameter (PAR) as part of the input.

26. The method (1000, 2000, 3000, 4000, 5000) of claim 24 or 25, wherein the at least one situation-specific parameter (PAR) comprises at least one subject-specific parameter.

27. The method (1000, 2000, 3000, 4000, 5000) of claim 26, wherein the at least one subject-specific parameter is selected from a list consisting of an interpupillary distance (IPD), an angle between an optical and a visual axis of the respective eye, a rotation operator capturing a transformation between the optical axis and the visual axis, a geometric parameter referring to a shape and/or size of a cornea of the respective eye such as spherical, non-spherical, a thickness, an astigmatism, a 3D topography, and a radius, a refractive index of at least one component of the respective eye, an iris radius of the respective eye, a pupil shape, and a geometric parameter referring to a shape and/or size of an eyeball of the respective eye.

28. The method (1000, 2000, 3000, 4000, 5000) of any of the claims 24 to 27, wherein the at least one situation-specific parameter (PAR) comprises a hardware-specific parameter.

29. The method (1000, 2000, 3000, 4000, 5000) of claim 28, wherein the at least one hardware-specific parameter is selected from a list consisting of respective camera intrinsics, relative camera extrinsics and a pose of the inertial measurement unit relative to at least one of the cameras.

30. The method (1000, 2000, 3000, 4000, 5000) of any of the preceding claims, further comprising at least one of:
using the predicted 3D eye state ({Ec, EG, Ep}) to determine a third training loss (A’);
using, based on the predicted 3D eye state ({Ec, EG, Ep}) and the at least one situation-specific parameter (PAR), a third loss function (LF’) to determine the third training loss (A’); and
using (1500, 2500, 3500) the third training loss (A’) to train the eye state predictor (ESP),
and/or wherein the training loss (A, Al, A2, A’) is determined as a function (g) of the first training loss (A) and the third training loss (A’), for example as a function of the first training loss (A), the second training loss (Al) and the third training loss (A’).

31. A method (8000) for subject-specific parameter calibration, the method comprising:
providing (8100) a trained eye state predictor (tESP), the trained eye state predictor (tESP) being trained according to any of the preceding claims; and
performing the following steps, typically several times:
o feeding (8200) a respective first eye-related observation (ERO, EROi, Pi, Pr) as input to the trained eye state predictor (tESP) to determine a predicted 3D eye state (3DES, {Ec, EG, Ep}) of at least one eye of a new subject for an observation situation (tk) as output of the trained eye state predictor (tESP), the first eye-related observation (ERO, EROi, Pi, Pr) referring to the at least one eye of the new subject in and/or during the observation situation (tk);
o feeding (8300) the predicted 3D eye state ({Ec, EG, Ep}) and a current value of at least one subject-specific parameter for the new subject as input to a differentiable predictor (DP, DPI) to determine a prediction (II, III) for the at least one eye of the new subject in and/or during the observation situation (tk) as output of the differentiable predictor (DP);
o determining (8400), based on the prediction (II, III) and at least one of the first eye-related observation (ERO, EROi, Pi, Pr) and a respective second eye-related observation, a training loss (A, Al, A2, A’), the respective second eye-related observation (ERO2) referring to the at least one eye of the new subject in and/or during the observation situation (tk); and
o using (8600) the training loss (A, Al, A2, A’) to update the current value of the at least one subject-specific parameter.

32. A method (9000) for predicting a 3D eye state (3DES) for a subject in real time, the method comprising: determining an eye-related observation (ERO*) referring to at least one eye of the subject; and feeding the eye-related observation (ERO*) as input to a trained eye state predictor (tESP) implemented as a neural network (NN) to determine (5300) a predicted 3D eye state (3DES*, {Ec, EG, Ep}) as output of the trained eye state predictor (tESP).

33. The method (9000) of claim 32, wherein the predicted 3D eye state (3DES*, {Ec, EG, Ep}) comprises a predicted 3D center of rotation of an eyeball of the at least one eye, and a 3D gaze direction of the eyeball of the at least one eye.

34. The method (9000) of claim 33, wherein the predicted 3D eye state (3DES*, {Ec, EG, Ep}) comprises a predicted 3D state of a pupil of the at least one eye.

35. The method (9000) of claim 34, wherein the predicted 3D state of the pupil comprises at least one of a predicted 3D pupil size of the at least one eye, a predicted 3D pupil aperture of an iris of the at least one eye, a predicted 3D pupil radius of an iris of the at least one eye, and a predicted 3D pupil diameter of the iris of the at least one eye.

36. The method (9000) of any of the claims 32 to 35, wherein the trained eye state predictor (tESP) is trained according to the method of any of the claims 1 to 30, and/or comprising training an eye state predictor (ESP) according to the method of any of the claims 1 to 30 to obtain the trained eye state predictor (tESP).

37. The method (9000) of any of the claims 32 to 36, wherein an eye camera (150) of a head-wearable device (100) worn by the subject is used for taking the at least one eye image (Pi, Pr), the head-wearable device (100) typically being implemented as a spectacles device, and/or wherein the method is at least in part controlled and/or performed by a computing and control unit of the head-wearable device.

38. The method (9000) of any of the claims 32 to 37, wherein the training loss (A, Al, A2, A') depends on at least one subject-specific parameter (PAR), and wherein the subject-specific parameter (PAR) is determined in advance for the new subject, in particular as a respective physiological value, as a respective measured value, using the method according to claim 31 to determine a respective predicted value, or as any function of one or more of said values.

39. A system (500) comprising:

- a head-wearable device (100) comprising at least one eye camera (150) configured to generate eye images (Pi, Pr) of at least a portion of an eye of a subject wearing the head-wearable device; and

- a computing system (200, 300) connectable with the at least one eye camera (150) for receiving the eye images, and configured to:
o generate, based on an eye image (Pi, Pr) referring to an observation situation (tk) and received from the at least one eye camera (150), a first eye-related observation (EROi, Pi, Pr) referring to the eye of the subject in and/or during the observation situation (tk);
o run an eye state predictor (ESP) implemented as a neural network (NN);
o feed (1200, 2200, 3200) the first eye-related observation (EROi, Pi, Pr) as input to the eye state predictor (ESP) to determine a predicted 3D eye state (3DES, {Ec, EG, Ep}) of the eye of the subject in and/or during the observation situation (tk);
o feed (1300, 2300, 3300) the predicted 3D eye state ({Ec, EG, Ep}) as input to a differentiable predictor (DP, DPI) to determine a prediction (II, III) for the eye of the subject in and/or during the observation situation (tk) as output of the differentiable predictor (DP);
o determine (1400, 2400, 3400, 3401), based on the prediction (II, III) and at least one of the first eye-related observation (EROi, Pi, Pr) and a second eye-related observation, a training loss (A, Al, A2, A'), the second eye-related observation (ERO2) referring to the eye of the subject in and/or during the observation situation (tk), the computing system (200, 300) typically being configured to determine the second eye-related observation (ERO2) differently than the first eye-related observation (EROi); and
o use (1500, 2500, 3500) the training loss (A, Al, A2, A') to train the eye state predictor (ESP).

40. The system (500) of claim 39, wherein the head-wearable device (100) comprises a respective eye camera (150) for each eye of the subject, wherein the first eye-related observation (EROi, Pi, Pr) is generated based on respective eye images (Pi, Pr) of each eye during the observation situation and/or for an observation time (tk) at which the eye images (Pi, Pr) are generated, wherein the head-wearable device (100) comprises a scene camera (160) configured to generate scene images referring to a field of view of the subject wearing the head-wearable device, and/or wherein the computing system (200, 300) is configured to:
o host or access a database for the eye-related observations (ERO, Pi, Pr); and/or
o perform the method (1000-5000, 8000) according to any one of the claims 1 to 31.

41. A computer program product or a computer-readable storage medium comprising instructions which, when executed by one or more processors of a computing system, cause the computing system to carry out any of the steps of the method (1000-5000, 8000) according to any one of the claims 1 to 31.

Description:
SYSTEM AND METHOD FOR TRAINING AN EYE STATE PREDICTOR

TECHNICAL FIELD

[001] Embodiments of the present invention relate to systems and methods for training an eye state predictor implemented as a neural network, in particular as a convolutional neural network, and a method for predicting a 3D eye state for a subject in real time, in particular for a subject wearing a head-wearable device.

BACKGROUND

[002] Over the last decades, a wide variety of camera-based eye trackers have been proposed. In general, methods for determining gaze direction can be categorized into those which rely on the explicit extraction of features from eye images, and those which do not rely on such explicit feature extraction but instead accept the entire image of an eye (or of both eyes) as input to some kind of pre-trained algorithm, such as a machine learning algorithm, for example a trained neural network. The latter methods are often called appearance-based, in contrast to the former explicit feature-based methods. Feature-based methods can be further separated into regression-based and model-based approaches. Regression-based approaches typically employ polynomial mapping functions which, after a person-specific calibration, are used to predict gaze direction (typically 2D gaze coordinates on a screen) based on suitable eye image features (typically 2D pupil centers). Model-based approaches fit a mathematical 3D eye model to extracted eye image features (typically a series of pupil contours) and, in addition to the 3D gaze direction, allow for the estimation of other pertinent parameters characterizing the three-dimensional (3D) eye state, e.g. eyeball center position and pupil radius. In all these methods, explicit extraction of eye features such as the pupil center, IR glint positions (reflections actively generated by IR LEDs), or pupil contours is usually performed by classic computer vision and image processing algorithms or by machine-learning-based methods.

[003] Many known head mounted eye trackers suffer from the disadvantage that environmental stray light being reflected on the test user's eye may negatively influence the eye tracking functionality. In feature-based eye tracking approaches, cameras monitoring the test user's eyes may not be able to distinguish between features of the eyes explicitly used for keeping track of the eye movement and features, such as spurious reflections, arising from environmental lighting conditions. Reliable eye tracking is therefore often compromised by environmental conditions and undesired stray light disturbing the tracking mechanism. Thus, known head mounted eye tracker devices often suffer from a limited robustness when dealing with large variations of environments and appearance of users.

[004] Head-mounted eye trackers which employ appearance-based (learning-based) gaze estimation methods have been shown to deal much better with such large variations of environmental light conditions and appearance of users, giving very good accuracy even in an uncalibrated setting. Such methods, however, require large amounts of labelled training data which include images of the eyes and corresponding gaze coordinates, e.g. within images of a front-facing scene camera, and are also challenging to calibrate. Due to the lack of corresponding ground truth data, they are usually also not able to (reliably) yield interesting 3D eye state parameters. This is because neural networks in the context of eye tracking are usually only trained with direct supervision, either by using gaze point or gaze direction ground truth data, which is available from dedicated data collection sessions, to perform end-to-end gaze prediction on input 2D eye images, or with 2D image features which are directly visible or annotatable on 2D eye images, to perform eye feature extraction.

[005] While well-calibrated feature-based methods can still yield an even higher accuracy than appearance-based learning methods, in many use cases such as head-mounted eye trackers a highly controlled and calibrated setup is hard or even impossible to achieve. Accordingly, both gaze estimation paradigms have advantages and disadvantages.

[006] Further, model-based approaches for gaze estimation using head-mounted eye trackers that are based on fitting of a mathematical 3D eye model to a series of pupil contours extracted from 2D eye images usually assume that the center of rotation of the eyeball is fixed in the eye camera coordinate system, which may only be approximately true due to so-called head-set slippage, i.e. unavoidable movements of the headset with respect to the head of the user during usage, and, thus, may require real-time eye model updates as explained in WO 2020/244752 A1.

[007] Accordingly, there is a need to further improve the detection of gaze direction and other eye state parameters.

SUMMARY

[008] According to an embodiment of a method for training an eye state predictor, which is implemented as a neural network, in particular as a convolutional neural network, the method includes feeding a first eye-related observation as input to the eye state predictor to determine, for an observation situation, a predicted 3D eye state of at least one eye of a subject as output of the eye state predictor, the first eye-related observation referring to the at least one eye of the subject in and/or during the observation situation. The predicted 3D eye state is fed as input to a differentiable predictor to determine, for the observation situation, a prediction for the at least one eye of the subject as output of the differentiable predictor. Based on the prediction and at least one of the first eye-related observation and a second eye-related observation, a training loss is determined. The second eye-related observation also refers to the at least one eye of the subject in and/or during the observation situation. The training loss is used to train the eye state predictor.
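For illustration only, the following minimal PyTorch sketch shows how such a training step could be wired up. The module architectures, the 12-dimensional eye-state layout, the flattened prediction target and the mean-squared-error loss are assumptions made for this sketch and are not prescribed by the application.

```python
# Illustrative training step: ESP -> 3D eye state -> DP -> prediction -> loss -> backprop.
import torch
import torch.nn as nn

class EyeStatePredictor(nn.Module):          # "ESP": eye images -> predicted 3D eye state
    def __init__(self, state_dim: int = 12):
        super().__init__()
        self.backbone = nn.Sequential(       # stand-in for a CNN backbone
            nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, state_dim),
        )
    def forward(self, eye_images):           # (B, 2, H, W): left + right eye image
        return self.backbone(eye_images)     # (B, state_dim) predicted 3D eye state

class DifferentiablePredictor(nn.Module):    # "DP": 3D eye state -> prediction
    def __init__(self, state_dim: int = 12, out_dim: int = 64 * 64):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                  nn.Linear(128, out_dim))
    def forward(self, eye_state):
        return self.head(eye_state)          # e.g. a flattened synthetic eye image

esp, dp = EyeStatePredictor(), DifferentiablePredictor()
optimizer = torch.optim.Adam(esp.parameters(), lr=1e-4)

def training_step(first_observation, second_observation):
    """first_observation: eye images (B, 2, H, W); second_observation: target derived from them."""
    eye_state = esp(first_observation)       # predicted 3D eye state for the observation situation
    prediction = dp(eye_state)               # output of the differentiable predictor
    loss = nn.functional.mse_loss(prediction, second_observation)  # training loss
    optimizer.zero_grad()
    loss.backward()                          # backpropagation through DP into the ESP
    optimizer.step()                         # here only the ESP is updated
    return loss.item()

# Example call with random tensors standing in for recorded observations:
loss_value = training_step(torch.rand(4, 2, 64, 64), torch.rand(4, 64 * 64))
```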

[009] Accordingly, the eye state predictor, which is in the following also referred to as eye state predictor model, is (based on the training loss(es) that may also be referred to as prediction loss(es)) trained to output a corresponding 3D eye state upon receiving eye-related observation(s).

[0010] At least the first eye-related observation typically includes an image referring to the at least one eye of the subject, in particular a respective eye image recorded by an eye camera and showing at least a portion of the left eye of the subject (left eye image) or the right eye of the subject (right eye image). More typically, the first eye-related observation includes a pair of corresponding left and right eye images or a concatenated eye image, i.e. an eye image formed by concatenating the left eye image and the right eye image. The left eye image and the right eye image may in particular be provided by respective eye cameras of a head-wearable device worn by the subject.

[0011] The observation period of the observation situation may correspond to typical video frame rates and/or be comparatively short, for example less than 0.2 s, more typically less than 0.02 s. In these embodiments, the left eye image and the right eye image may correspond to an image of a respective video stream provided by the eye cameras.

[0012] While an eye-related observation typically includes and/or is determined based on at most one left eye image and at most one corresponding right eye image, it is also possible that an eye-related observation includes and/or is determined based on a typically short sequence of left eye images and/or right eye images, for example a respective sequence with a length of at most 5 or 10.

[0013] Further, the observation situation may refer to and/or be represented by at least one of the (corresponding) observation time (observation date), and an observation identity (ID) typically depending on the observation time.

[0014] According to embodiments, a method for training the eye state predictor includes feeding a first eye-related observation as input to the eye state predictor to determine a predicted 3D eye state of at least one eye of the subject for an observation time as output of the eye state predictor, the first eye-related observation referring to the at least one eye of the subject for the observation time. The predicted 3D eye state is fed as input to a differentiable predictor to determine a prediction for the at least one eye of the subject for the observation time as output of the differentiable predictor. Based on the prediction and at least one of the first eye-related observation and a second eye-related observation referring to the at least one eye of the subject for the observation time, in particular based on the prediction and one of the first eye-related observation and the second eye-related observation, a training loss is determined. The training loss is used to train the eye state predictor.

[0015] In the following, the methods for training an eye state predictor are also referred to as training methods.

[0016] The methods for training eye state predictors as explained herein allow to combine the advantages of feature-based, model-based and appearance-based learning methods to achieve both accurate and robust 3D eye state prediction.

[0017] In particular, the eye state prediction model may be trained to receive a 2D (camera) image of an eye of the subject (or a pair of corresponding images of a left eye and a right eye of the subject), as in above mentioned appearance-based methods, but instead of merely yielding a gaze point or direction, to output, for a given use case, a typically complete 3D eye state including or consisting of a (data) set or vector of eye state values, for example 3D eyeball center coordinates, 3D gaze direction and 3D pupil size.

[0018] The term “3D eye state” as used herein intends to describe a set of quantities or values, in particular a respective vector, describing, typically characterizing, an (actual) three-dimensional state of at least one eye of the subject at a given time, i.e. of a left eye, in the following also referred to as first eye, of a right eye, in the following also referred to as second eye, of both eyes, and/or of a cyclopean eye of the subject.

[0019] A set (only) consisting of two-dimensional (2D) values that are directly visible in eye images, like for example a 2D pixel location of an eye bounding box within a remotely taken image of the face of a subject or a 2D pupil image (ellipse) property is not to be understood as a “3D eye state”.

[0020] The 3D eye state typically refers to and/or includes one or more, typically two or three 3D observables (physical quantities that can be measured) of the at least one eye of the subject.

[0021] The (predicted) 3D eye state can include any (measured/measurement based) value that characterizes the physiologically constant or transient parameters describing a respective eye in 3D, in particular the 3D center of rotation of the eyeball, the 3D gaze direction, e.g. a vector characterizing the optical or visual axis/line-of-sight, and/or the size of the pupil aperture (“pupil size”) in 3D.

[0022] The (predicted) 3D eye state typically includes at least one of, typically at least two of, for example three of or even all of a predicted 3D center (of rotation) of an eyeball of the at least one eye, a 3D gaze direction of the at least one eye, a 3D state of an eyelid of the at least one eye and a 3D state of a pupil of the at least one eye.

[0023] Typically, the (predicted) 3D state of the pupil includes at least one of a predicted 3D pupil size of the at least one eye, a predicted 3D pupil aperture of an iris of the at least one eye, a predicted 3D pupil radius of an iris of the at least one eye, and a predicted 3D pupil diameter of the iris of the at least one eye.

[0024] The (predicted) 3D state of the eyelid of the at least one eye may include at least one of an eyelid shape, an eyelid position, and a percentage of an eyelid closure.

[0025] Further, the predicted 3D eye state may be a monocular state, a pair of corresponding left and right monocular states or a binocular state.

[0026] In embodiments referring to 3D eye states consisting of a predicted 3D center of the eyeball of the at least one eye, a 3D gaze direction of the at least one eye, and a 3D state of a pupil of the at least one eye, the predicted 3D eye state is typically one of a 6-dimensional vector (characterizing a monocular 3D eye state), a 10-dimensional vector (characterizing a binocular 3D eye state) and a 12-dimensional vector (characterizing a binocular 3D eye state). Note that the two monocular 3D eye states of a subject are related to each other for physiological reasons. Therefore, a 10-dimensional or 11-dimensional data set (vector) is typically sufficient to characterize the binocular 3D eye state. In other words, a binocular 3D eye state may, depending on the applicable constraint(s), be represented by a lower-dimensional data set (e.g. vector) compared to a combined data set (e.g. vector) representing two corresponding left and right (monocular) 3D eye states.
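One possible encoding of such a state vector is sketched below for illustration; the specific split into 3D eyeball centre (3 values), gaze direction as two angles (2 values) and pupil radius (1 value) is an assumption, the application leaves the exact parameterization open.

```python
# Illustrative layout of a monocular 3D eye state as a 6-dimensional vector.
from dataclasses import dataclass
import numpy as np

@dataclass
class MonocularEyeState:
    eyeball_center: np.ndarray   # (3,)  E_C, e.g. in eye-camera coordinates [mm]
    gaze_phi: float              # azimuth of the gaze direction E_G [rad]
    gaze_theta: float            # polar angle of the gaze direction E_G [rad]
    pupil_radius: float          # E_P, 3D pupil radius [mm]

    def to_vector(self) -> np.ndarray:
        return np.concatenate([self.eyeball_center,
                               [self.gaze_phi, self.gaze_theta, self.pupil_radius]])

    def gaze_direction(self) -> np.ndarray:
        """Unit gaze vector reconstructed from the two angles."""
        sin_t, cos_t = np.sin(self.gaze_theta), np.cos(self.gaze_theta)
        return np.array([sin_t * np.cos(self.gaze_phi),
                         sin_t * np.sin(self.gaze_phi),
                         cos_t])

# A binocular state can simply concatenate two monocular vectors (12 values);
# exploiting physiological constraints (e.g. a common pupil radius) would reduce
# the dimensionality to 11 or fewer, as discussed above.
left = MonocularEyeState(np.array([32.0, 12.0, -20.0]), 0.05, 1.50, 2.0)
right = MonocularEyeState(np.array([-32.0, 12.0, -20.0]), -0.05, 1.50, 2.0)
binocular_vector = np.concatenate([left.to_vector(), right.to_vector()])  # shape (12,)
```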

[0027] During training, training parameters of the eye state predictor, i.e. model parameters to be trained / parameters the eye state predictor model learns by optimizing the loss function, in particular the weights of the connections between the artificial neurons of the neural network (NN), are changed, typically iteratively.

[0028] The eye state predictor may be trained by machine learning (ML) and using machine learning algorithms, respectively. In particular, a respective ML optimization algorithm such as backpropagation may be used to train the eye state predictor. Accordingly, a (at least one) loss function referring to, typically mathematically characterizing and/or providing a measure for, a difference or a discrepancy between the prediction(s) of the eye state predictor (model) and the eye-related observation(s) is used to determine the training loss and to train the eye state predictor, respectively. Note that, for backpropagation to be employable, all steps in the computational graph, i.e. all calculations leading from the input of the algorithm to the final loss function, have to be differentiable.

[0029] The eye state predictor may be trained using a directly supervised training scheme or an indirectly supervised training scheme.

[0030] The eye-related observations usable for training the eye state predictor may include further data referring to the at least one eye of the subject.

[0031] In particular, the eye-related observations for training may be determined using a head-wearable device worn by the subject and providing the eye camera, typically a left eye camera, a right eye camera, and optionally a scene camera and/or additional sensors such as an inertial measurement unit as components. Data provided by these components (also as postprocessed data / derived data) during the observation situation and/or for the observation time may also be part of the respective eye-related observation.

[0032] Accordingly, further data (e.g. sensor readouts of the head-wearable device) may be used for training the eye state predictor. In this way, the accuracy and/or robustness of the 3D eye state prediction of the trained eye state predictor may be further improved.

[0033] The first eye-related observation and the second eye-related observation, respectively, used for training the eye state predictor may include a respective eye image of the eye(s) of the subject recorded by the respective eye camera (of the head-wearable device) during the observation situation and/or for the observation time as a primary eye-related observation, and/or a corresponding secondary eye-related observation(s) which is derived from the eye image(s).

[0034] The secondary eye-related observation typically includes a visual feature determined for and/or extracted from the respective eye image, more typically several respective visual features such as edges, boundaries between and/or positions of anatomical features identified in the eye images, and/or is implemented as visual feature representation of the eye image including the visual feature(s). The visual feature(s) of the respective eye image may be determined using feature detection algorithms such as feature point detection, edge detection, and/or semantic segmentation.

[0035] The secondary eye-related observation may in particular be a semantic segmentation of the respective eye image which is in the following also referred to as semantically segmented eye image.

[0036] Semantic segmentation (also known as image segmentation) may be described as clustering and/or partitioning a digital image into image segments, and providing, for the digital image, a semantically segmented image consisting of the segments that may completely cover the digital image. During semantic segmentation, objects and/or (segment) boundaries may be detected in the digital image. Further, image segmentation typically includes assigning labels to the pixels in the digital image such that pixels with the same label share one or more characteristics. For example, pixels identified as belonging to a pupil in a digital eye image may be assigned a pupil label.
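A minimal sketch of such a per-pixel label map is given below for illustration; the integer encoding of the labels and the toy image geometry are assumptions, only the label set itself follows the list discussed in this document.

```python
# Sketch of a semantically segmented eye image as an integer label map.
import numpy as np

LABELS = {"skin": 0, "sclera": 1, "iris": 2, "pupil": 3, "eyelid": 4, "eyelash": 5}

def one_hot(segmentation: np.ndarray, num_classes: int = len(LABELS)) -> np.ndarray:
    """Convert an (H, W) integer label map into an (H, W, C) one-hot representation,
    a convenient target format for a segmentation-based training loss."""
    return np.eye(num_classes, dtype=np.float32)[segmentation]

seg = np.full((64, 64), LABELS["skin"], dtype=np.int64)
seg[20:44, 20:44] = LABELS["iris"]
seg[28:36, 28:36] = LABELS["pupil"]       # pixels identified as pupil get the pupil label
target = one_hot(seg)                     # shape (64, 64, 6)
```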

[0037] In particular, the first eye-related observation and/or the second eye-related observation may include a left eye image of a left eye and a right eye image of a right eye of the subject recorded by the respective eye camera during the observation situation and/or for the observation time as a respective primary eye-related observation.

[0038] Further, the first eye-related observation and/or the second eye-related observation may include a secondary eye-related observation derived from the left eye image and a secondary eye-related observation derived from the right eye image, in particular a semantic segmentation of the left eye image (semantically segmented left eye image) and a semantic segmentation of the right eye image (semantically segmented right eye image).

[0039] The semantic segmentation of the respective eye image may include at least one label which is typically selected from a list consisting of a pupil label, an iris label, a sclera label, an eyelid label, a skin label, and an eye lash label. Using such a label for training the eye state predictor may also improve the training results.

[0040] The training method may include determining, typically using a scene camera of the head-wearable device, a scene observation referring to a field of view of (the at least one eye of) the subject during the observation situation and/or for the observation time as an eye-related observation, for example as the second eye-related observation during the observation situation and/or for the observation time.

[0041] The scene observation may include a scene image (recorded by the scene camera) as a primary eye-related observation, and/or a (at least one) secondary eye-related observation derived from the scene image, in particular a semantic segmentation of the scene image, a 3D gaze point typically measured in scene camera coordinates, and a 2D gaze point typically measured in the scene camera image coordinates.

[0042] Furthermore, the first eye-related observation and/or the second eye-related observation may include at least one of a corneo-retinal standing potential of the at least one eye of the subject during the observation situation and/or for the observation time, a velocity of a head of the subject during the observation situation and/or for the observation time, an acceleration of the head of the subject during the observation situation and/or for the observation time, an orientation of the head of the subject during the observation situation and/or for the observation time, respective 2D gaze point during the observation situation and/or for the observation time and a 3D gaze point during the observation situation and/or for the observation time as a respective primary observation.

[0043] The method for training the eye state predictor is typically performed iteratively.

[0044] Accordingly, a plurality of respective eye-related observations may be (determined using an eye camera and) used for the training (once or several times).

[0045] Further, the respective eye-related observations for different subjects may be (determined using an eye camera and) used for the training.

[0046] The observation situation may be represented by a subject ID and the observation time, or an observation (identity) ID depending on the subject ID and the observation time. Further, the observation ID may depend on an ID of the hardware used for taking the images (hardware ID), in particular a device ID of the head-wearable device.

[0047] The determined eye-related observation may be buffered and/or stored in a (training) database (for the eye-related observations), in particular prior to the actual training of the eye state predictor.

[0048] During the actual training (cycles) of the eye state predictor, the database may be used for determining the first eye-related observation and the second eye-related observation, respectively, typically several (first) eye-related observations to be fed as input to the eye state predictor.

[0049] For example, an eye-related observation may, typically randomly, be retrieved from the database and fed to the eye state predictor, resulting in a corresponding prediction. The training loss may then be determined based on the corresponding prediction and one or even both of the retrieved eye-related observation and a further eye-related observation retrieved from the database and corresponding to the same observation situation (including the observation time).
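For illustration, a minimal sketch of drawing matched observation pairs (same observation situation) from a pre-recorded database follows; the in-memory record layout and field names are assumptions.

```python
# Sketch of randomly sampling matched observation pairs from a training database.
import torch
from torch.utils.data import Dataset, DataLoader

class ObservationDatabase(Dataset):
    def __init__(self, records):
        # each record holds observations for one observation situation t_k
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        # first ERO: raw eye images; second ERO: segmentation for the same t_k
        return rec["eye_images"], rec["segmentation"]

records = [{"eye_images": torch.rand(2, 64, 64),
            "segmentation": torch.randint(0, 6, (64, 64)),
            "time": k} for k in range(100)]
loader = DataLoader(ObservationDatabase(records), batch_size=8, shuffle=True)
```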

[0050] In embodiments in which the differentiable predictor is an identity operator (identity function), the predicted 3D eye state and a corresponding 3D eye state (for the observation situation and/or the observation time) may be used to determine the training loss. Alternatively, an operator, which is in the following also referred to as scaling operator, mapping the input to a typically linearly scaled version of the input as output may be used to determine the training loss.

[0051] The corresponding 3D eye state may be determined using the first eye-related observation but differently compared to the 3D eye state. Alternatively, the corresponding 3D eye state may be determined using the second eye-related observation, or using both the first eye-related observation and the second eye-related observation (or respective parts thereof). Even further, the corresponding 3D eye state may be determined using a sequence of eye-related observations. In particular, a 3D geometric eye model may be used to determine the corresponding 3D eye state such that the corresponding 3D eye state fits to the first eye-related observation (and/or the second eye-related observation or the sequence of eye-related observations in the above alternatives).

[0052] Typically, a 3D geometric eye model taking into account corneal refraction is used to determine the corresponding 3D eye state.
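A minimal sketch of this directly supervised variant follows: the differentiable predictor is the identity, and the loss compares the predicted 3D eye state with a corresponding 3D eye state fitted beforehand with a geometric eye model (here just a stored tensor). The mean-squared-error loss and the 6-dimensional state are assumptions for the sketch.

```python
# Directly supervised variant: identity operator as differentiable predictor.
import torch

def identity_dp(eye_state: torch.Tensor) -> torch.Tensor:
    return eye_state                      # identity operator I

def direct_loss(predicted_state: torch.Tensor,
                model_fitted_state: torch.Tensor) -> torch.Tensor:
    prediction = identity_dp(predicted_state)
    return torch.nn.functional.mse_loss(prediction, model_fitted_state)

predicted = torch.rand(4, 6, requires_grad=True)   # stands in for the ESP output
fitted = torch.rand(4, 6)                          # e.g. from a refraction-aware eye-model fit
loss = direct_loss(predicted, fitted)
loss.backward()                                    # gradients flow back towards the ESP
```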

[0053] In other embodiments, the differentiable predictor is different from the identity operator and has non-vanishing (non-constant) derivatives with respect to the parameters of the eye state predictor.

[0054] In particular, the differentiable predictor may be configured to output, in response to receiving the predicted 3D eye state as input from the eye state predictor, a synthetic eye image or a pair of left and right synthetic eye images.

[0055] In this embodiment, the training loss is typically determined based on the synthetic eye image(s) and a corresponding semantic segmentation of the eye image(s). Note that the synthetic eye image(s) may be considered as secondary eye-related observations, and/or may be stored in the database.

[0056] Likewise, a left eye image of a left eye and a right eye image of a right eye of the subject fed as input to the eye state predictor may be considered as first eye-related observations which may also be stored in the database.

[0057] The (non-identity) differentiable predictor may in particular be based on or even be implemented as a trained neural network, in particular a generative neural network, and/or as a differentiable renderer such as an approximate ray tracer, i.e. a differentiable ray tracer.

[0058] In one embodiment, the differentiable predictor is implemented as a neural network which is trained using a 3D eye model such as a LeGrand 3D eye model or a Navarro 3D eye model (see e.g. WO 2020/244752 A1 and references [1], [2] cited therein), and any non-differentiable ray tracer that is configured to generate artificial images given the 3D eye model, a 3D eye state and the camera properties, i.e. image resolution and camera intrinsics of the used eye camera(s).

[0059] In another embodiment, the differentiable predictor is implemented as a ray-tracing algorithm which is designed to generate eye images which are consistent with a 3D eye model such as a LeGrand 3D eye model or a Navarro 3D eye model (see e.g. WO 2020/244752 A1 and references [1], [2] cited therein), or any non-differentiable ray tracer that is configured to generate artificial images given the 3D eye model, in an appropriate way to achieve differentiability of the generated images with respect to the internal parameters of the model.
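For the neural-network variant, one conceivable pre-training scheme is sketched below: a non-differentiable renderer generates training pairs offline, and a small network learns to imitate it so that it can later serve as a differentiable predictor. The function `render_eye_image`, the network architecture and all hyperparameters are placeholders, not taken from the application.

```python
# Sketch: pre-train a neural-network differentiable predictor to imitate a
# non-differentiable ray tracer based on a geometric 3D eye model.
import torch
import torch.nn as nn

def render_eye_image(eye_state: torch.Tensor) -> torch.Tensor:
    """Placeholder for the non-differentiable renderer (no gradients needed here)."""
    with torch.no_grad():
        return torch.rand(eye_state.shape[0], 64 * 64)   # fake rendered images

dp = nn.Sequential(nn.Linear(6, 256), nn.ReLU(), nn.Linear(256, 64 * 64), nn.Sigmoid())
opt = torch.optim.Adam(dp.parameters(), lr=1e-3)

for step in range(1000):                                  # offline pre-training loop
    states = torch.rand(32, 6)                            # sampled 3D eye states
    images = render_eye_image(states)                     # "ground truth" from the ray tracer
    loss = nn.functional.mse_loss(dp(states), images)     # DP learns to imitate the renderer
    opt.zero_grad(); loss.backward(); opt.step()

# Afterwards dp is differentiable with respect to its input eye state and can be
# used in the training loop of the eye state predictor.
```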

[0060] In these embodiments, the training loss may be a region-based loss, in particular a Jaccard loss.
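One common differentiable ("soft") formulation of such a Jaccard (intersection-over-union) loss is sketched below for illustration; the application does not prescribe this particular form.

```python
# Soft Jaccard loss between predicted region probabilities and a one-hot segmentation.
import torch

def soft_jaccard_loss(pred: torch.Tensor, target: torch.Tensor,
                      eps: float = 1e-6) -> torch.Tensor:
    """pred, target: (B, C, H, W) tensors with values in [0, 1]."""
    dims = (2, 3)
    intersection = (pred * target).sum(dims)
    union = pred.sum(dims) + target.sum(dims) - intersection
    jaccard = (intersection + eps) / (union + eps)        # per-class IoU
    return 1.0 - jaccard.mean()                           # loss: 1 - mean IoU

loss = soft_jaccard_loss(torch.rand(2, 6, 64, 64), torch.rand(2, 6, 64, 64))
```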

[0061] As used herein, the term “camera intrinsics” shall describe that the optical properties of the camera, in particular the imaging properties (imaging characteristics) of the camera are known and/or can be modelled using a respective camera model including the known intrinsic parameters (known intrinsics) approximating the eye camera producing the eye images. Typically, a pinhole camera model is used for modelling the eye camera. The known intrinsic parameters may include a focal length of the camera, an image sensor format of the camera, a principal point of the camera, a shift of a central image pixel of the camera, a shear parameter of the camera, and/or one or more distortion parameters of the camera.
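A minimal pinhole-camera projection using the intrinsic parameters named above is sketched here; the numeric values are placeholders, not calibration data from the application, and lens distortion is omitted for brevity.

```python
# Minimal pinhole-camera projection of a 3D point to pixel coordinates.
import numpy as np

def project_pinhole(point_cam: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float, skew: float = 0.0) -> np.ndarray:
    """Project a 3D point given in camera coordinates (z > 0) to pixel coordinates."""
    x, y, z = point_cam
    u = fx * (x / z) + skew * (y / z) + cx
    v = fy * (y / z) + cy
    return np.array([u, v])

# e.g. a 3D pupil centre 35 mm in front of a 192x192 px eye camera (made-up intrinsics):
pixel = project_pinhole(np.array([2.0, -1.0, 35.0]),
                        fx=140.0, fy=140.0, cx=96.0, cy=96.0)
```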

[0062] In embodiments in which only one differentiable predictor (e.g. a first differentiable predictor) is used, only one loss function (e.g. a first loss function) may be used to determine the (total) training loss.

[0063] In particular, the prediction output by the (first) differentiable predictor and either the first eye-related observation or the second eye-related observation may be fed to the (first) loss function outputting the (first) training loss.

[0064] In other embodiments, at least two differentiable predictors, e.g. two, three or even more differentiable predictors, may be used to determine a respective prediction for the at least one eye of the subject during the observation situation and/or for the observation time, wherein the (total) training loss is determined based on each of the respective predictions. Accordingly, training efficiency, prediction accuracy and/or prediction robustness may be further improved.

[0065] In particular, the predicted 3D eye state may be fed as input to a first differentiable predictor to determine a first prediction for the at least one eye of the subject during the observation situation and/or for the observation time as output of the first differentiable predictor (DP).

[0066] Likewise, the predicted 3D eye state may be fed as input to a second differentiable predictor to determine the second prediction for the respective eye of the subject during the observation situation and/or for the observation time as output of the second differentiable predictor.

[0067] Further, the first prediction and either the first eye-related observation or the second eye-related observation may be fed as input to a first loss function to determine a first training loss as output of the first loss function.

[0068] Likewise, the second prediction and one of the first eye-related observation, the second eye-related observation and a third eye-related observation referring to the observation situation may be fed as input to a second loss function to determine a second training loss as output of the second loss function.

[0069] The first and second training losses may be used to train the eye state predictor and to change parameters of the eye state predictor, respectively.

[0070] In particular, the eye state predictor may be trained using a (total) training loss determined as a function of the first training loss and the second training loss, for example as a typically weighted sum.
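As a trivial illustration, such a weighted combination could look as follows; the weights are hyperparameters assumed for the sketch.

```python
# Sketch of a weighted total training loss combining two partial training losses.
import torch

def total_loss(loss_1: torch.Tensor, loss_2: torch.Tensor,
               w1: float = 1.0, w2: float = 0.5) -> torch.Tensor:
    return w1 * loss_1 + w2 * loss_2
```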

[0071] The (total) training loss may depend on at least one situation-specific parameter.

[0072] In particular, both the eye state predictor and the differentiable predictor (or even two or more differentiable predictors) may receive the at least one situation-specific parameter as part of their respective input.

[0073] The eye state predictor and the differentiable predictor may receive a common situation-specific parameter, a common set of situation-specific parameters (referring to and/or characterizing the observation situation or a sequence of observation situations at different observation times under otherwise unchanged conditions, in particular the same subject and the same hardware), or respective situation-specific parameter(s) from the common set of situation-specific parameters.

[0074] The at least one situation-specific parameter typically includes a (at least one) subject-specific parameter.

[0075] The (at least one) subject-specific parameter may in particular be selected from a list consisting of an interpupillary distance (IPD), an angle or both angles between an optical and a visual axis of the respective eye, a rotation operator capturing a transformation between the optical axis and the visual axis, for example a rotation matrix, a geometric parameter referring to a shape and/or a size of a cornea of the respective eye such as spherical, non-spherical, a thickness, an astigmatism, a 3D topography, and a radius, e.g. a spherical radius or an angle dependent radius of the cornea, a refractive index of at least one component of the respective eye, an iris radius of the respective eye, a pupil shape, and a geometric parameter referring to a shape and/or size of an eyeball of the respective eye.

[0076] During the training of the eye state predictor, one or more of the situation-specific parameters, in particular respective subject-specific parameter(s) may be determined and/or adapted.

[0077] For example, the eye state predictor may receive a respective physiological value for the respective subject-specific parameter(s), for example for a cornea radius of the respective eye, an iris radius of the respective eye, an IPD or an angle(s) between an optical and a visual axis of the respective eyes as an initial input. During the training with eye-related observations of a specific subject, the respective subject-specific parameter(s) may be changed (as part of the optimizing).

[0078] In particular, the training loss may be used to amend (or update) the situation-specific parameter(s).
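A minimal sketch of treating such subject-specific parameters as trainable quantities, initialised with physiological values and updated by the training loss, is given below; the parameter names, initial values and per-subject table layout are assumptions.

```python
# Sketch: learnable per-subject parameters updated jointly with the eye state predictor.
import torch
import torch.nn as nn

class SubjectParameters(nn.Module):
    def __init__(self, num_subjects: int):
        super().__init__()
        # one row per subject ID: [cornea radius mm, iris radius mm, IPD mm, kappa deg]
        init = torch.tensor([[7.8, 6.0, 63.0, 5.0]]).repeat(num_subjects, 1)
        self.values = nn.Parameter(init)     # initialised with physiological values

    def forward(self, subject_ids: torch.Tensor) -> torch.Tensor:
        return self.values[subject_ids]      # (B, 4) parameters fed to ESP and DP

subject_params = SubjectParameters(num_subjects=10)
# In practice the eye state predictor's parameters would be added to the same optimizer.
optimizer = torch.optim.Adam(subject_params.parameters(), lr=1e-3)
# In each training step the returned parameters would be concatenated to the inputs of
# both the eye state predictor and the differentiable predictor, so that loss.backward()
# also produces gradients for the per-subject values.
```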

[0079] After training, the adapted (learned) situation-specific parameter(s) may be output, stored (e.g. for later use as input for predicting 3D eye states of the subject), further processed and/or used as measured value(s) for the subject (which may otherwise only be measured with high effort).

[0080] In embodiments in which the eye state predictor is trained with eye-related observations of different subjects, a respective subject-ID of the eye-related observations facilitates the learning of the subject-specific parameter(s).

[0081] In some embodiments only a subset of the situation-specific parameters is learned and/or adapted during the training.

[0082] The learned subject-specific parameter(s) may later be used as inputs for predicting a 3D eye state of a subject in real time.

[0083] Furthermore, a trained eye state predictor may even be used to determine at least one subject-specific parameter for the new subject.
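For illustration, a minimal calibration sketch in the spirit of the subject-specific parameter calibration described above follows: the trained eye state predictor is kept fixed and only the parameters of the new subject are updated from the training loss. All module arguments, the concatenation of state and parameters, and the mean-squared-error loss are assumptions.

```python
# Sketch: post-hoc calibration of subject-specific parameters with a frozen trained ESP.
import torch

def calibrate(trained_esp, dp, observations, targets, init_params, steps=200):
    params = init_params.clone().requires_grad_(True)    # e.g. IPD, cornea radius, ...
    opt = torch.optim.Adam([params], lr=1e-2)
    for p in trained_esp.parameters():
        p.requires_grad_(False)                           # keep the trained predictor fixed
    for _ in range(steps):
        state = trained_esp(observations)                 # predicted 3D eye states
        pred = dp(torch.cat([state, params.expand(state.shape[0], -1)], dim=1))
        loss = torch.nn.functional.mse_loss(pred, targets)
        opt.zero_grad(); loss.backward(); opt.step()      # updates only `params`
    return params.detach()
```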

[0084] Further, the at least one situation-specific parameter may include a (at least one) hardware-specific parameter (which may also be learned and/or adapted during the training).

[0085] The (at least one) hardware-specific parameter may in particular be selected from a list consisting of respective camera intrinsics, relative camera extrinsics, and a pose of an inertial measurement unit relative to at least one of the (eye and optional scene) cameras.

[0086] Typically, the pose(s) of the camera(s) is/are fixed with respect to each other and/or to a coordinate system of the head-wearable device when worn by the subject.

[0087] However, cameras with restricted movability (e.g. only along a given axis, like in a VR headset to adjust for individual interpupillary distance) may also be used. In devices with some camera position adjustment capabilities, like VR headsets, it may even be possible to implement an alternative position "sensor" which determines, e.g. measures, the mutual distances/poses of the cameras.

[0088] Camera extrinsics such as known pose information, and intrinsics such as focal length, center pixel shift, shear and distortion parameters, may be measured during the production of the head-wearable device and then linked to a (hardware) ID of the head-wearable device, and thus serve as training input (additional "learning aids"). Accordingly, both the training of the eye state predictor and the accuracy of the predicted 3D eye state output by the trained eye state predictor may be further improved.

[0089] According to embodiments, the predicted 3D eye state is used to determine a third training loss that may be used to train the eye state predictor.

[0090] In particular, the predicted 3D eye state and the at least one situation-specific parameter may be fed to a third loss function to determine the third training loss.

[0091] The third training loss may be determined based on the predicted 3D eye state and the subject-specific parameter(s).

[0092] Further, the (total) training loss (the training loss to be used for training the eye state predictor) may be determined as a function of the first training loss and the third training loss, or as a function of the first training loss, the second training loss and the third training loss.

[0093] According to an embodiment of a method for predicting a 3D eye state for a subject in real time, i.e. with a maximum delay of at most about 0.25 s or 0.1 s, the method, which is in the following also referred to as predicting method, includes determining an eye-related observation referring to at least one eye of the subject, and feeding the eye-related observation as input to a trained eye state predictor (model) implemented as a neural network to determine, typically in real time and/or on the fly, a predicted 3D eye state as output of the trained eye state predictor.

[0094] The predicted 3D eye state may in particular include a predicted 3D center of rotation of an eyeball of the at least one eye, and a 3D gaze direction of the eyeball of the at least one eye.

[0095] Typically, the predicted 3D eye state further includes a predicted 3D state of a pupil of the at least one eye.

[0096] The predicted 3D state of the pupil may include at least one of: a predicted 3D pupil size of the at least one eye, a predicted 3D pupil aperture of an iris of the at least one eye, a predicted 3D pupil radius of an iris of the at least one eye, and a predicted 3D pupil diameter of the iris of the at least one eye.

[0097] Typically, the trained eye state predictor is trained according to the training methods as explained herein.

[0098] In embodiments in which the training loss depends on one or more subject-specific parameter(s), the subject-specific parameter(s) may be determined in advance for new subjects, in particular as a respective physiological value, as a respective measured value, using a method for subject-specific calibration (as explained herein) to determine a respective predicted value, or as any function of one or more of these values.

[0099] According to an embodiment of a method for subject-specific parameter calibration, the method includes providing a trained eye state predictor as explained herein, feeding a (respective new) first eye-related observation and a respective current value of at least one subject-specific parameter for a new subject as input to the trained eye state predictor to determine a predicted 3D eye state of at least one eye of the new subject for a (new) observation situation as output of the trained eye state predictor, the first eye-related observation referring to the at least one eye of the new subject in and/or during the observation situation, feeding the predicted 3D eye state and a current value of at least one subject-specific parameter for the new subject as input to a differentiable predictor to determine a prediction for the at least one eye of the new subject in and/or during the observation situation as output of the differentiable predictor, determining, based on the prediction and at least one of the first eye-related observation and a second eye-related observation, a training loss, the second eye-related observation referring to the at least one eye of the new subject in and/or during the observation situation; and using the training loss to update the respective current value of the at least one subject-specific parameter.

[00100] Typically after a plurality of updating cycles, the respective current value of the at least one subject-specific parameter is output as a respective predicted value of the at least one subject-specific parameter (of the respective subject).

[00101] Furthermore, for the methods explained herein, an eye camera of a head-wearable device worn by the subject is typically used for taking the at least one eye image (at a given time).

[00102] More typically, two respective eye cameras of the head-wearable device are used for taking corresponding (left and right) eye images of the subject (at given time(s)).

[00103] The head-wearable device is typically implemented as a spectacles device.

[00104] However, the head-wearable device may also be implemented as an augmented reality (AR-) and/or virtual reality (VR-) device (AR/VR headset), in particular goggles, an AR head-wearable display, and a VR head-wearable display. For the sake of clarity, head-wearable devices are mainly described with regard to head-wearable spectacles devices in the following.

[00105] The 3D eye state(s) is/are typically determined with respect to a coordinate system that is fixed to the eye camera(s) and/or the head-wearable device.

[00106] For example, a Cartesian coordinate system defined by the image plane(s) of the eye camera(s) may be used.

[00107] Points and directions may also be specified within and/or converted into a device coordinate system, a head coordinate system, a world coordinate system or any other suitable 3D coordinate system.

[00108] The (head-wearable) spectacles device typically includes a spectacles body, which is configured such that it can be worn on a head of a subject, for example in the way usual glasses are worn. Hence, the spectacles device when worn by a subject may in particular be supported at least partially by a nose area of the subject's face. This state of usage of the head-wearable (spectacles) device being arranged at the subject's face will be further defined as the "intended use" of the spectacles device, wherein direction and position references, for example horizontal and vertical, parallel and perpendicular, left and right, front and back, up and down, etc., refer to this intended use. As a consequence, lateral positions such as left and right, an upper and lower position, and a front/forward and back/backward position are to be understood from the subject's usual view. Equally, this applies to a horizontal and vertical orientation, wherein the subject's head during the intended use is in a normal, hence upright, non-tilted, non-declined and non-nodded position.

[00109] The spectacles body (main body) typically includes a left ocular opening and a right ocular opening, which mainly come with the functionality of allowing the subject to look through these ocular openings. Said ocular openings can be embodied as, but are not limited to, sunscreens, optical lenses or non-optical, transparent glasses, or as a non-material, optical pathway allowing rays of light to pass through.

[00110] The spectacles body may, at least partially or completely, form the ocular openings by delimiting these from the surrounding. In this case, the spectacles body functions as a frame for the ocular openings. Said frame is not necessarily required to form a complete and closed surrounding of the ocular openings. Furthermore, it is possible that the ocular openings themselves have a frame-like configuration, for example by providing a supporting structure with the help of transparent glass. In the latter case, the spectacles device has a form similar to frameless glasses, wherein only a nose support/bridge portion and ear-holders are attached to the glass screens, which therefore serve simultaneously as an integrated frame and as ocular openings.

[00111] In addition, a middle plane of the spectacles body may be identified. In particular, said middle plane describes a structural center plane of the spectacles body, wherein respective structural components or portions, which are comparable or similar to each other, are placed on each side of the middle plane in a similar manner. When the spectacles device is in intended use and worn correctly, the middle plane coincides with a median plane of the subject.

[00112] Further, the spectacles body typically includes a nose bridge portion, a left lateral portion and a right lateral portion, wherein the middle plane intersects the nose bridge portion, and the respective ocular opening is located between the nose bridge portion and the respective lateral portion.

[00113] For orientation purposes, a plane being perpendicular to the middle plane shall be defined, which in particular is oriented vertically, wherein said perpendicular plane is not necessarily firmly located in a defined forward or backward position of the spectacles device.

[00114] The head-wearable device has an eye camera having a sensor arranged in or defining an image plane for taking images of a first eye of the subject, i.e. of a left or a right eye of the subject. In other words, the eye camera, which is in the following also referred to as camera and first eye camera, may be a left eye camera or a right (near-) eye camera. The eye camera is typically of known camera intrinsics.

[00115] In addition, the head-wearable device may have a further eye camera of known camera intrinsics for taking images of a second eye of the subject, i.e. of a right eye or a left eye of the subject. In the following, the further eye camera is also referred to as further camera and second eye camera.

[00116] In other words, the head-wearable device may, in a binocular setup, have a left and a right (eye) camera, wherein the left camera serves for taking a (left) image or a stream of images of at least a portion of the left eye of the subject, and wherein the right camera takes a (right) image or a stream of images of at least a portion of a right eye of the subject.

[00117] Typically, the first and second eye cameras have the same or similar camera intrinsics (are of the same type, but may be individually calibrated).

[00118] However, the methods explained herein are also applicable in a monocular setup with one (near) eye camera only.

[00119] The eye camera(s) can be arranged at the spectacles body in inner eye camera placement zones and/or in outer eye camera placement zones, in particular wherein said zones are determined such that an appropriate picture of at least a portion of the respective eye can be taken for the purpose of determining one or more eye-state-related parameters; in particular, the cameras are arranged in a nose bridge portion and/or in a lateral edge portion of the spectacles frame such that an optical field of a respective eye is not obstructed by the respective camera. The optical field is defined as being obstructed if the camera forms an explicitly visible area/portion within the optical field, for example if the camera points out from the boundaries of the visible field into said field, or by protruding from the boundaries into the field. For example, the cameras can be integrated into a frame of the spectacles body and are thereby non-obstructive. In the context of the present invention, a limitation of the visible field caused by the spectacles device itself, in particular by the spectacles body or frame, is not considered as an obstruction of the optical field.

[00120] Furthermore, the head-wearable device may have illumination means for illuminating the left and/or right eye of the subject, in particular if the light conditions within an environment of the spectacles device are not optimal.

[00121] Further, the head-wearable device may be provided with a scene camera for taking images of the field of view (FOV) of the subject wearing the head-wearable device, for example an integrated scene camera typically arranged in the middle plane.

[00122] The predicting method may at least in part be controlled and/or performed by a computing and control unit of the head-wearable device.

[00123] Alternatively or in addition, a companion device functionally connected with the computing and control unit (or only a control unit) of the head-wearable device of the system, e.g. via a wired or wireless network (TCP/IP) connection and/or a USB connection, in particular a mobile companion device such as a smart phone, a tablet, or a laptop connected with the head-wearable device, or a desktop computer, may supervise, control and/or perform the predicting methods as explained herein.

[00124] Typically, the computing system (hardware) performing the training methods and/or the predicting methods as explained herein includes one or more processors, in particular one or more CPUs, GPUs and/or DSPs, and a neural network software module including instructions which, when executed by at least one of the one or more processors, implement (an instance of) the neural network.

[00125] According to an embodiment of a system, in particular a training system for eye state predictors, the system includes a head-wearable device including at least one eye camera configured to generate eye images of at least a portion of an (at least one) eye of the subject wearing the head-wearable device, and a computing system which is connectable with the at least one eye camera for receiving the eye images, and is configured to generate, based on an eye image referring to an observation situation and received from the at least one eye camera, a first eye-related observation referring to the (at least one) eye of the subject in and/or during the observation situation; to run an eye state predictor implemented as a neural network; to feed the first eye-related observation as input to the eye state predictor to determine a predicted 3D eye state of the (at least one) eye of the subject in and/or during the observation situation; to feed the predicted 3D eye state as input to a differentiable predictor to determine a prediction for the (at least one) eye of the subject in and/or during the observation situation as output of the differentiable predictor; to determine, based on the prediction and at least one of the first eye-related observation and a second eye-related observation also referring to the eye of the subject in and/or during the observation situation, a training loss; and to use the training loss to train the eye state predictor.

[00126] Typically, the computing system is configured to determine the second eye-related observation differently compared to the first eye-related observation.

[00127] The head-wearable device typically includes a respective eye camera for each eye of the subject.

[00128] Further, the first eye-related observation may be generated based on respective (left and right) eye images of the left and right eye of the subject during the observation situation and/or for an observation time at which the eye images are generated.

[00129] Likewise, the second eye-related observation may, even if determined differently, also be generated based on respective (left and right) eye images.

[00130] Furthermore, the head-wearable device may include a scene camera configured to generate scene images referring to a field of view of the subject wearing the head-wearable device.

[00131] Further, the computing system may be configured to host or access a database for the eye-related observations.

[00132] Accordingly, the typically iteratively performed training of the eye state predictor (model) may be facilitated.

[00133] The computing system may in particular be configured to perform the training methods as explained herein.

[00134] The (actual) training cycles of the eye state predictor are typically performed after generating a plurality of eye-related observations, and/or by sufficiently powerful computing hardware, for example a respective (remote) server and/or cloud-based architecture.

[00135] The computing system may also be configured to perform the predicting methods as explained herein.

[00136] The predicting methods are, compared to the training methods, less compute intensive. Thus, the predicting methods may even be controlled and/or performed by a controller (typically providing a computing and control unit) of the head-wearable device, which is typically configured to run a trained eye state predictor (instance) implemented as a neural network and/or may even be integrated into the head-wearable device, by a connected companion device such as a smartphone, or by the (computing and) control unit of the head-wearable device and the companion device.

[00137] The term "neural network" (NN) as used in this specification intends to describe an artificial neural network (ANN) or connectionist system including a plurality of connected units or nodes called artificial neurons. The output signal of an artificial neuron is calculated by a (non-linear) activation function of the sum of its input signal(s). The connections between the artificial neurons typically have respective weights (gain factors for the transferred output signal(s)) that are adjusted during one or more learning phases. Other parameters of the NN that may or may not be modified during learning may include parameters of the activation function of the artificial neurons such as a threshold. Often, the artificial neurons are organized in layers, which are also called modules. The most basic NN architecture, which is known as a "Multi-Layer Perceptron", is a sequence of so-called fully connected layers. A layer consists of multiple distinct units (neurons), each computing a linear combination of the input followed by a non-linear activation function. Different layers (of neurons) may perform different kinds of transformations on their respective inputs. Neural networks may be implemented in software, firmware, hardware, or any combination thereof. In the learning phase(s), a machine learning method, in particular a supervised, unsupervised or semi-supervised (deep) learning method, may be used. For example, a deep learning technique, in particular a gradient descent technique such as backpropagation, may be used for training of (feed-forward) NNs having a layered architecture. Modern computer hardware, e.g. GPUs, makes backpropagation efficient for many-layered neural networks. A convolutional neural network (CNN) is a feed-forward artificial neural network that includes an input (neural network) layer, an output (neural network) layer, and one or more hidden (neural network) layers arranged between the input layer and the output layer. The speciality of CNNs is the usage of convolutional layers performing the mathematical operation of a convolution of the input with a kernel. The hidden layers of a CNN may include convolutional layers as well as optional pooling layers (for downsampling the output of a previous layer before inputting it to the next layer), fully connected layers and normalization layers. At least one of the hidden layers of a CNN is a convolutional neural network layer, in the following also referred to as convolutional layer. Typical convolution kernel sizes are for example 3x3, 5x5 or 7x7. The usage of convolutional layer(s) can help to compute recurring features in the input more efficiently than fully connected layers. Accordingly, the memory footprint may be reduced and performance improved. Due to the shared-weights architecture and translation invariance characteristics, CNNs are also known as shift invariant or space invariant artificial neural networks (SIANNs). In the following, the term "model of a neural network" intends to describe a set of data required to define a neural network operable in software and/or hardware. The model typically includes data referring to the architecture of the NN, in particular the network structure including the arrangement of neural network layers and the sequence of information processing in the NN, as well as data representing or consisting of parameters of the NN, in particular the connection weights within fully connected layers and kernel weights within convolutional layers. In a training phase, the network learns to map the input(s) (eye-related observation(s)) to the corresponding, sufficiently precise output result(s) (3D eye state(s)).
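
As a concrete, hedged illustration of such a model (the architecture, layer sizes and the 12-dimensional output are assumptions for this sketch, not the architecture disclosed herein), a small CNN mapping a pair of grayscale eye images to a binocular 3D eye state vector could look as follows:

```python
import torch
import torch.nn as nn

class EyeStatePredictorCNN(nn.Module):
    """Toy CNN sketch: left and right eye images in, 12-dim eye state vector out."""
    def __init__(self, state_dim: int = 12):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, state_dim)

    def forward(self, eye_images: torch.Tensor) -> torch.Tensor:
        # eye_images: (batch, 2, H, W), left and right eye image stacked as channels
        return self.head(self.features(eye_images).flatten(1))

# Example: a batch of four observation situations with 64x64 eye images.
esp = EyeStatePredictorCNN()
predicted_states = esp(torch.randn(4, 2, 64, 64))   # shape (4, 12)
```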

[00138] Other embodiments include corresponding computer systems, (non-volatile) computer-readable storage media or devices, and/or computer programs recorded on one or more computer-readable storage media or computer storage devices, each configured to perform the processes of the methods described herein.

[00139] A system of and/or including one or more computers can be configured to perform particular operations or processes by virtue of software, firmware, hardware, or any combination thereof installed on the one or more computers that in operation may cause the system to perform the processes. One or more computer programs can be configured to perform particular operations or processes by virtue of including instructions that, when executed by one or more processors of the system, cause the system to perform the processes.

[00140] Those skilled in the art will recognize additional features and advantages upon reading the following detailed description, and upon viewing the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[00141] The components in the figures are not necessarily to scale, instead emphasis being placed upon illustrating the principles of the invention. Moreover, in the figures, like reference numerals designate corresponding parts. In the drawings:

[00142] Fig. 1A and Fig. 1B illustrate respective flow charts of a method for training an eye state predictor according to embodiments;

[00143] Fig. 1C illustrates a flow chart of a method for predicting a 3D eye state for a subject in real time according to embodiments;

[00144] Fig. 1D illustrates 3D eye states of a subject according to embodiments;

[00145] Fig. 2A, 2B illustrate respective flow charts of a method for training an eye state predictor according to embodiments;

[00146] Fig. 2C illustrates a flow chart of a method for training an eye state predictor according to embodiments;

[00147] Fig. 3A, 3B illustrate respective flow charts of a method for training an eye state predictor according to embodiments;

[00148] Fig. 4A, 4B, 4C illustrate respective flow charts of a method for training an eye state predictor according to embodiments;

[00149] Fig. 5A illustrates a flow chart of a method for subject-specific parameter calibration according to embodiments;

[00150] Fig. 5B illustrates a flow chart of a method for training an eye state predictor according to embodiments; and

[00151] Fig. 5C illustrates a perspective view of a system including a head-wearable device according to embodiments.

DETAILED DESCRIPTION

[00152] In the following Detailed Description, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. In this regard, directional terminology, such as “top,” “bottom,” “front,” “back,” “leading,” “trailing,” etc., is used with reference to the orientation of the Figure(s) being described. Because components of embodiments can be positioned in a number of different orientations, the directional terminology is used for purposes of illustration and is in no way limiting. It is to be understood that other embodiments may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims.

[00153] Reference will now be made in detail to various embodiments, one or more examples of which are illustrated in the figures. Each example is provided by way of explanation, and is not meant as a limitation of the invention. For example, features illustrated or described as part of one embodiment can be used on or in conjunction with other embodiments to yield yet a further embodiment. It is intended that the present invention includes such modifications and variations. The examples are described using specific language which should not be construed as limiting the scope of the appended claims. The drawings are not scaled and are for illustrative purposes only. For clarity, the same elements or manufacturing steps have been designated by the same references in the different drawings if not stated otherwise.

[00154] With reference to Figs. 1A and 1B, a method 1000 for training an eye state predictor ESP implemented as a neural network is explained.

[0155] In a block 1200 of method 1000, a first eye-related observation EROi is fed as input to the eye state predictor ESP to be trained, which outputs a predicted 3D eye state 3DES corresponding to the first eye-related observation EROi for a given observation situation represented by the observation time tk, where k is an integer index. The time tk and the index k, respectively, may e.g. be selected randomly if training method 1000 is performed repeatedly and/or iteratively (as indicated by the dashed-dotted arrow in FIG. 1B) and the eye-related observations EROs are retrieved from a database filled with the eye-related observations EROs in advance, in a block 1100.

[0156] Alternatively or in addition, the first eye-related observation EROi (tk) (and optionally a second eye-related observation ERO2 (tk)) may be determined using at least one eye camera or any other eye-related data providing component of a head-wearable device worn by the subject while performing training method 1000, such as a scene camera of the head-wearable device.

[0157] At least one of the eye-related observations EROs for the respective observation situation typically includes a left eye image Pi of a left eye and a right eye image Pr of a right eye of the subject, typically recorded by a respective eye camera during the observation situation and/or for the observation time tk, as a respective primary eye-related observation pERO, and/or a corresponding secondary eye-related observation sERO derived from the left and right eye images Pi, Pr, such as a semantic segmentation Si of the left eye image Pi and a semantic segmentation Sr of the right eye image Pr.

[0158] Further, the eye-related observations EROs for the observation situation may include a scene image Ps typically recorded by a scene camera as a primary eye-related observation pERO and/or at least one corresponding secondary eye-related observation derived from the scene image Ps, in particular derived 2D gaze points or even derived 3D gaze points. Further, a semantic segmentation Ss of the scene image Ps may be included as a secondary eye-related observation.

[0159] In the exemplary embodiment, the predicted 3D eye state 3DES is, as indicated by the brackets "{...}", a set (e.g. a vector) consisting of a predicted 3D center Ec of an eyeball of at least one eye of a human subject, a predicted 3D gaze direction EG of the at least one eye, and a predicted 3D state Ep of a pupil of the at least one eye.

[00160] In other embodiments, the predicted 3D eye state 3DES may in addition (or instead of e.g. the predicted 3D state Ep of the pupil) include a predicted 3D state of an eyelid of the at least one eye.

[0161] As illustrated in Fig. 1D, the predicted 3D eye state 3DES may be one of a monocular state 3DESmi of a left eye of the subject (with a predicted 3D center Eci of an eyeball of the left eye of the subject, a predicted 3D gaze direction EGi of the left eye of the subject, and a predicted 3D pupil state Epi of the left eye of the subject), a monocular state 3DESmr of a right eye of the subject (with a predicted 3D center Ecr of the eyeball of the right eye of the subject, a predicted 3D gaze direction EGr of the right eye of the subject, and a predicted 3D pupil state Epr of the right eye of the subject), a monocular state 3DESmc of a (virtual) cyclopean eye of the subject (with a predicted 3D center Ecc of an eyeball of a cyclopean eye of the subject, a predicted 3D gaze direction EGc of the cyclopean eye of the subject, and a predicted 3D pupil state Epc of the cyclopean eye of the subject), and a binocular state 3DESb (with corresponding monocular 3D states of the left and right eyes of the subject).

[0162] While the (left, right and cyclopean) monocular states 3DESmi, 3DESmr, 3DESmc may be represented by a six-dimensional vector, the binocular state 3DESb may be represented by a ten-dimensional or twelve-dimensional vector.
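
One possible vector encoding is sketched below; the concrete parametrisation (3D eyeball center, two gaze angles, one pupil radius per eye) is an assumption for illustration and not prescribed by this description:

```python
import torch

# Hedged sketch: a monocular state as a six-dimensional vector and a binocular
# state 3DESb as the twelve-dimensional concatenation of left and right states.
def monocular_state(center_xyz, gaze_phi, gaze_theta, pupil_radius):
    return torch.tensor([*center_xyz, gaze_phi, gaze_theta, pupil_radius])

left = monocular_state((-31.5, 0.0, 20.0), 0.05, -0.02, 2.0)    # assumed values (mm, rad)
right = monocular_state((31.5, 0.0, 20.0), -0.04, -0.02, 2.1)
binocular_state = torch.cat([left, right])                      # twelve-dimensional
```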

[0163] As further illustrated in FIG. 1A and FIG. 1B, the predicted 3D eye state 3DES ({Ec, EG, Ep}) may be fed as input to a differentiable predictor DP outputting a prediction II referring to the at least one eye of the subject and corresponding to the observation situation, in a subsequent block 1300 of FIG. 1B. The differentiable predictor DP may have one or more non-vanishing derivatives with respect to its input observables as variables.

[0164] In a subsequent block 1400, the prediction II and the first eye-related observation EROi (tk) (dashed arrow in FIG. 1A) and/or a second eye-related observation ERO2 (tk) referring to the at least one eye of the subject in and/or during the same observation situation tk, typically one of EROi (tk) and ERO2 (tk), are fed to a loss function LF outputting a corresponding training loss Δ which is used to train the eye state predictor ESP, in a block 1500.
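
A compact sketch of one such training cycle is given below; esp, dp and lf stand for the eye state predictor, the differentiable predictor DP and the loss function LF and are assumed to be user-supplied callables (PyTorch-style, not code from this application). Because DP is differentiable, the gradient of Δ can flow through the prediction II back into the weights of the ESP:

```python
import torch

def training_step(esp, dp, lf, optimizer, ero_1, ero_2):
    """One training cycle over blocks 1200-1500 (sketch with assumed interfaces)."""
    optimizer.zero_grad()
    eye_state = esp(ero_1)          # block 1200: predicted 3D eye state 3DES
    prediction = dp(eye_state)      # block 1300: prediction II of the differentiable predictor
    loss = lf(prediction, ero_2)    # block 1400: training loss Δ
    loss.backward()                 # block 1500: use Δ to update the ESP weights
    optimizer.step()
    return loss.item()

# Possible usage with the CNN sketch above:
# optimizer = torch.optim.Adam(esp.parameters(), lr=1e-4)
# training_step(esp, dp, lf, optimizer, ero_1, ero_2)
```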

[0165] For example, the first eye-related observation EROi may be a primary eye-related observation, and the loss function LF may receive the prediction II and a secondary eye-related observation which is derived from the primary eye-related observation, or a different primary eye-related observation, as the second eye-related observation ERO2 (tk).

[0166] Alternatively, N eye-related observations may be fed to the eye state predictor ESP outputting M predictions that are compared to L other eye-related observations to calculate the loss Δ (with positive integer numbers N, M, L each of which may be larger than 1). Typically, 1<N<100, M<=N, and/or L<N.

[0167] In one example, 20>N>1, M=1, and L=M (or e.g. L<=N). Accordingly, one prediction II may be determined for a short sequence of N eye-related observations, for example as a mean prediction II, and compared with a representative (other) eye-related observation referring to the same time interval (L=M=1), or with a mean eye-related observation determined using L>1, for example L=N, (other) eye-related observations.

[00168] After a plurality of training cycles, the resulting trained eye state predictor (model) tESP may be used for predicting 3D eye states for the same or a different subject in real time.

[00169] As illustrated in Fig. 1C for an exemplary predicting method 9000, a (newly) just determined eye-related observation ERO* may be input to the trained eye state predictor tESP outputting a corresponding predicted 3D eye state 3DES*.
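
A minimal sketch of this real-time use is shown below; the frame-grabbing helper and the tensor layout are assumptions, and gradients are disabled because no training loss is computed at prediction time:

```python
import torch

@torch.no_grad()
def predict_eye_state(trained_esp, grab_eye_images):
    """Method-9000-style sketch: one freshly determined ERO* in, 3DES* out."""
    trained_esp.eval()
    ero_star = grab_eye_images()     # newly determined eye-related observation ERO*
    return trained_esp(ero_star)     # predicted 3D eye state 3DES*
```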

[0170] Method 9000 may in particular be performed for a subject wearing a head-wearable device of the same type as used for training method 1000.

[00171] Further, method 9000 may be performed several times and/or substantially continuously while the subject is wearing the head-wearable device and e.g. looking at a real 3D-scene, a 3D representation of the real scene or a virtual 3D-scene (presented on a screen).

[0172] Furthermore, the predicted 3D eye state(s) 3DES* may be further processed, in particular used to determine intentions of the subject and/or to interact with the subject, for example to present additional information (acoustically and/or visually) and to change the 3D representation of the real scene or the virtual 3D-scene.

[00173] FIG. 2A and FIG. 2B illustrate a training method 2000. Training method 2000 is similar to training method 1000 explained above with respect to FIGs. 1A, 1B and includes blocks 2100 to 2500 each of which is typically similar to a corresponding block of blocks 1100 to 1500 of method 1000. However, training method 2000 is more specific.

[00174] In the exemplary embodiment, the differentiable predictor of block 2300 is an identity operator I.

[00175] Accordingly, the predicted 3D eye state 3DES ({Ec, EG, Ep}) determined in block 2200 is forwarded as prediction II in block 2300, and used as input of the loss function LF in block 2400.

[00176] As illustrated by the dashed arrows in FIGs. 2A, 2B, using an identity operator in block 2300 may also be considered as bypassing block 2300 and using the predicted 3D eye state 3DES ({Ec, EG, Ep}), which is determined in block 2200 as output of the eye state predictor ESP upon receiving a first eye-related observation EROi (tk), as prediction II and input of the loss function LF in block 2400, respectively.

[0177] According to an embodiment which is illustrated in FIG. 2C, a method 2000' for training an eye state predictor ESP includes feeding, in a block 2100, a first eye-related observation EROi as input to the eye state predictor ESP to determine a predicted 3D eye state 3DES, {Ec, EG, Ep} of at least one eye of a subject for an observation situation, which may refer to and/or be represented by an observation time tk, as output of the eye state predictor ESP, wherein the first eye-related observation refers to the at least one eye of the subject in and/or during the observation situation; determining, in a block 2400', based on the first eye-related observation EROi and a second eye-related observation also referring to the at least one eye of the subject for (in and/or during) the observation situation, a training loss Δ; and using the training loss Δ to train the eye state predictor ESP, in a block 2500.

[00178] In the embodiments of FIG. 2A to FIG. 2C, the respective loss function LF typically receives as second input (second eye-related observation) a corresponding 3D eye state {Ec, EG, Ep}’ (with the same observables) determined differently compared to the 3D eye state ({Ec, EG, Ep}) but also referring to the at least one eye of the subject for (in and/or during) the observation situation.
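
For this direct-supervision case the loss compares two eye state vectors directly; a minimal sketch, assuming a mean-squared-error loss function (an assumed choice, not one specified here):

```python
import torch.nn.functional as F

# Sketch: identity differentiable predictor, so LF compares the predicted
# 3D eye state with a differently determined reference state {Ec, EG, Ep}'.
def direct_supervision_loss(predicted_state, reference_state):
    return F.mse_loss(predicted_state, reference_state)
```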

[0179] As further shown in FIG. 2B, the (direct supervision training) method 2000 may be performed until the training loss Δ (or a moving average of the training loss Δ) is below a predetermined threshold Δth.

[0180] With respect to FIGs. 3A, 3B, a training method 3000 is explained. Method 3000 is also similar to training method 1000 explained above with respect to FIGs. 1A, 1B and includes blocks 3100 to 3500 each of which is typically similar to a corresponding block 1100 to 1500 of method 1000. However, training method 3000 is more specific.

[0181] In the exemplary embodiments of FIGs. 3A, 3B, two different differentiable predictors DP, DP1 are used, in respective blocks 3300, 3301, to determine a respective prediction II, III referring to the at least one eye of the subject during the observation situation tk.

[0182] Further, the first prediction II and a second eye-related observation ERO2 referring to the same observation situation tk may be fed as input to a first loss function LF to determine a first training loss Δ as output of the first loss function LF, in block 3400.

[0183] In an alternative, the first eye-related observation and the first prediction II may be fed as input to the first loss function LF.

[0184] Likewise, the second prediction III and a third eye-related observation referring to the same observation situation tk may be fed as input to a second loss function LF1 to determine a second training loss Δ1 as output of the second loss function LF1, in block 3401.

[0185] For example, a differentiable renderer may be used in block 3300 to determine a synthetic left eye image SIi and a synthetic right eye image SIr (as first prediction II) which are, in block 3400, compared with semantic segmentations Si, Sr of a left eye image Pi and a right eye image Pr, respectively, which are part of or even form the first eye-related observation, for determining the first loss Δ.

[0186] Further, a 2D gaze prediction Gi, Gr for the left eye and the right eye, respectively, may be determined as the second prediction III in block 3301, and compared with respective 2D gaze values or labels Li, Lr which have been determined independently.
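
A hedged sketch of such a second differentiable predictor is given below: it projects a monocular 3D eye state to a 2D gaze point with a pinhole model, where the fixed gaze depth and the intrinsics fx, fy, cx, cy are hypothetical values, not parameters from this disclosure:

```python
import torch

def gaze_to_2d(eye_center, gaze_dir, depth=1000.0,
               fx=800.0, fy=800.0, cx=320.0, cy=240.0):
    """Project the gaze ray, evaluated at an assumed fixed depth, to pixel coordinates."""
    point_3d = eye_center + depth * gaze_dir            # gazed 3D point in camera coordinates
    u = fx * point_3d[..., 0] / point_3d[..., 2] + cx
    v = fy * point_3d[..., 1] / point_3d[..., 2] + cy
    return torch.stack([u, v], dim=-1)                  # differentiable w.r.t. the eye state
```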

[0187] The first training loss Δ and the second training loss Δ1 may be used to train the eye state predictor ESP.

[0188] Typically, a total training loss Δ2 is determined as a function f(Δ, Δ1) of the first training loss Δ and the second training loss Δ1, in a block 3450, for example as a weighted average, and the training may be performed based on the total training loss Δ2 in block 3500.

[0189] With respect to FIGs. 4A, 4B, 4C, a training method 4000 is explained. Method 4000 is also similar to training method 1000 explained above with respect to FIGs. 1A, 1B and includes blocks 4100, 4200, 4300, 4400, 4500 each of which is typically similar to a corresponding block 1100 to 1500 of method 1000. However, training of the eye state predictor ESP and determining the training loss(es) Δ, Δ', respectively, according to training method 4000 depend on at least one situation-specific parameter PAR.

[0190] In particular, one or both of determining the predicted 3D eye state 3DES in block 4200 and determining the prediction II as output of the differentiable predictor DP in block 4300 may depend on one or more situation-specific parameters PAR. The situation-specific parameter(s) PAR used in blocks 4200, 4300 may (at least in part) be the same, but may also be completely different depending on the training setup.

[0191] For example, the differentiable predictor DP may receive as input the predicted 3D eye state 3DES, for example the binocular state 3DESb with a predicted 3D center Eci of the eyeball of the left eye, a predicted 3D gaze direction EGi of the left eye, a predicted 3D pupil state Epi of the left eye, a predicted 3D center Ecr of the eyeball of the right eye, a predicted 3D gaze direction EGr of the right eye, and a predicted 3D pupil state Epr of the right eye of the subject as shown in FIG. 4C, and hardware-specific parameters such as respective camera intrinsics, relative camera extrinsics and a pose of an inertial measurement unit for the subject's head relative to at least one of the cameras. Accordingly, determining the prediction II, which may be determined as synthetic eye images, in block 4300 may be facilitated.

[0192] Further, a further loss function LF' (also referred to as third loss function) may be used to determine a further (or third) loss Δ' depending on the predicted binocular state 3DESb and the subject-specific parameter(s) such as the interpupillary distance IPD and the angles between an optical and a visual axis.

[0193] For example, loss Δ' may, optionally depending on the subject-specific parameter(s), be a measure for a symmetry loss, a deviation from an expected distance between, or a relationship between, the predicted left eye 3D state and the predicted right eye 3D state of the predicted binocular state 3DESb.

[0194] For example, loss Δ' may depend on or even correspond to an IPD loss, i.e. a deviation of an IPD in accordance with the predicted binocular state 3DESb from a physiological (average) IPD or a measured IPD of the subject.
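
A minimal sketch of such an IPD loss, assuming a squared deviation (the concrete form of Δ' is not prescribed here) and torch tensors for the predicted eyeball centers:

```python
import torch

def ipd_loss(eye_center_left, eye_center_right, expected_ipd_mm=63.0):
    """Squared deviation of the predicted IPD from an expected (physiological or measured) IPD."""
    predicted_ipd = torch.linalg.norm(eye_center_left - eye_center_right, dim=-1)
    return ((predicted_ipd - expected_ipd_mm) ** 2).mean()
```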

[0195] Likewise, the loss Δ' may depend on or even correspond to a loss referring to a deviation from a (an expected) relation between 2D or 3D gaze angles of the left and right eyes (gaze-angle-symmetry loss), to a loss referring to a difference between the pupils of the left and right eyes such as a pupil size difference, a deviation from a (an expected) 2D or 3D pupil orientation relation of the pupils of the left and right eyes (pupil-symmetry loss), or to any (other) symmetry loss of the predicted binocular state 3DESb.

[0196] Further, training of the eye state predictor ESP may be based on a total training loss Δ2 determined as a function g of both partial losses Δ, Δ', as also shown in FIG. 4C.

[0197] Furthermore, in particular one or more subject-specific parameters PAR such as a cornea radius, an iris radius, etc. may be changed during the training.

[0198] For example, and as indicated by the dotted arrow in FIG. 4A, the subject-specific parameter(s) PAR may even be learned by optimizing the loss function LF.
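
One way to realize this, sketched under the assumption of a PyTorch-style training setup, is to register the subject-specific parameters as learnable variables so that the same optimizer that trains the ESP also updates them:

```python
import torch
import torch.nn as nn

class LearnableSubjectParameters(nn.Module):
    """Sketch: subject-specific parameters as trainable variables (assumed initial values)."""
    def __init__(self):
        super().__init__()
        self.ipd_mm = nn.Parameter(torch.tensor(63.0))
        self.cornea_radius_mm = nn.Parameter(torch.tensor(7.8))

subject_params = LearnableSubjectParameters()
# optimizer = torch.optim.Adam(list(esp.parameters()) + list(subject_params.parameters()))
```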

[0199] With respect to FIG. 5A, a flow chart of a method 8000 for subject-specific parameter calibration is explained.

[00200] In a first block 8100, a trained eye state predictor as explained herein is provided. For example, the trained eye state predictor may be obtained as explained above with regard to FIG. 4A - FIG. 4C.

[00201] In a subsequent block 8200, a respective first eye-related observation is fed as input to the trained eye state predictor to determine a predicted 3D eye state of at least one eye of a new subject for a new observation situation as output of the eye state predictor. The first eye-related observation refers to the at least one eye of the new subject in and/or during the new observation situation.

[0202] In a subsequent block 8300, the predicted 3D eye state and a current value of at least one subject-specific parameter of the new subject are fed as input to a differentiable predictor, for example the differentiable predictor DP of FIG. 4A, to determine a prediction for the at least one eye of the new subject in and/or during the new observation situation as output of the differentiable predictor.

[00203] In a subsequent block 8400, based on the prediction and at least one of the respective first eye-related observation and a respective second eye-related observation referring to the at least one eye of the new subject in and/or during the new observation situation, a training loss is determined.

[0204] In a subsequent block 8600, the training loss is used to update the current value of the at least one subject-specific parameter.

[0205] Updating the current value of the at least one subject-specific parameter may be done in accordance with an optimization technique.

[00206] Thereafter, method 8000 may return to block 8200.

[0207] After a plurality of cycles, the current updated value of the at least one subject-specific parameter may be output as a respective predicted value for the at least one subject-specific parameter and/or stored in a database.
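
A hedged sketch of the whole calibration loop (blocks 8200 to 8600) is given below; the trained predictor is frozen and only the current value of the subject-specific parameter is optimized. The callables dp and lf as well as the gradient-based update are assumptions about one possible realization:

```python
import torch

def calibrate_subject_parameter(trained_esp, dp, lf, observations, references,
                                initial_value=63.0, lr=0.1, cycles=50):
    for p in trained_esp.parameters():      # freeze the trained eye state predictor
        p.requires_grad_(False)
    param = torch.tensor(initial_value, requires_grad=True)
    optimizer = torch.optim.SGD([param], lr=lr)
    for _ in range(cycles):
        for ero_1, ero_2 in zip(observations, references):
            optimizer.zero_grad()
            eye_state = trained_esp(ero_1)      # block 8200: predicted 3D eye state
            prediction = dp(eye_state, param)   # block 8300: prediction of the DP
            loss = lf(prediction, ero_2)        # block 8400: training loss
            loss.backward()                     # block 8600: update the current value
            optimizer.step()
    return param.detach()                       # predicted subject-specific value
```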

[0208] With respect to FIG. 5B, a training method 5000 is explained. Method 5000 is also similar to training method 1000 explained above with respect to FIGs. 1A, 1B and includes blocks 5100, 5200, 5300, 5400, 5500 each of which is typically similar to a corresponding block 1100 to 1500 of method 1000. However, training method 5000 is more specific.

[0209] In a first block 5100, a left eye image Pi and a right eye image Pr of a subject are determined for an observation situation, e.g. taken using respective eye cameras or retrieved from a database.

[0210] An eye state predictor is used to determine a predicted 3D eye state 3DES, {Ec, EG, Ep} for the left eye image Pi and the right eye image Pr, in a subsequent block 5200, in particular as a predicted binocular 3D eye state.

[0211] In a subsequent block 5300, a synthetic left eye image SIi and a synthetic right eye image SIr are determined for the predicted 3D eye state 3DES (as a prediction II resulting from the predicted 3D eye state 3DES).

[00212] Further, a semantic segmentation Si of the left eye image Pi and a semantic segmentation Sr of the right eye image Pr are determined in another block 5350 subsequent to block 5100.

[0213] Based on a comparison of the synthetic eye images SIi, SIr with the semantic segmentations Si, Sr, a training loss Δ is determined in a block 5400.

[0214] In a subsequent block 5500, the training loss Δ may be used to train the eye state predictor (model), such as an NN, in particular a CNN, using machine learning algorithms, in particular a respective optimization algorithm.

[0215] Thereafter, method 5000 may return to block 5100 as indicated by the dashed-dotted arrow.

[0216] For example, determining the training loss Δ may be based on a first comparison of the synthetic left eye image SIi with the semantic segmentation Si of the left eye image Pi, and/or on a second comparison of the synthetic right eye image SIr with the semantic segmentation Sr of the right eye image Pr, typically on the first comparison and the second comparison.
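
A minimal sketch of this comparison in block 5400, assuming a per-pixel mean-squared error as the (not further specified) image dissimilarity measure:

```python
import torch.nn.functional as F

def segmentation_loss(synthetic_left, synthetic_right, seg_left, seg_right):
    """Compare synthetic eye images SIi, SIr with semantic segmentations Si, Sr (sketch)."""
    return F.mse_loss(synthetic_left, seg_left) + F.mse_loss(synthetic_right, seg_right)
```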

[0217] Alternatively or, more typically, in addition to determining the training loss Δ in block 5400, a further training loss Δ' may, in a block 5450, be determined based on a comparison of the synthetic left eye image SIi with the synthetic right eye image SIr, and used to train the eye state predictor in block 5500.

[00218] Fig. 5C illustrates a system 500 for performing the methods explained herein, in particular the training methods 1000-5000.

[00219] In the exemplary embodiment, system 500 includes a head-wearable device 100 implemented as a spectacles device. Accordingly, a frame of spectacles device 100 has a front portion 114 surrounding a left ocular opening and a right ocular opening. A bridge portion of the front portion 114 is arranged between the ocular openings. Further, a left temple 113 and a right temple 123 are attached to front portion 114.

[00220] An exemplary camera module 140 is accommodated in the bridge portion and arranged on the wearer-side of the bridge portion. A passage opening for a scene camera 160 of module 140 and the field of view (FOV) of scene camera 160, respectively, is formed in the bridge portion.

[0221] The scene camera 160 is typically centrally arranged, i.e. at least close to a central vertical plane between the left and right ocular openings and/or close to (expected) eye midpoint(s) of a human subject wearing the head-wearable device 100 (user). The latter also facilitates a compact design. Furthermore, the influence of parallax error on gaze prediction may be reduced significantly in this way.

[0222] Furthermore, the scene camera 160 may define a Cartesian coordinate system x, y, z and have an optical axis which is at least substantially arranged in the central vertical plane (to reduce parallax error), arranged in the central x, z-plane and/or pointing in the x-direction in the exemplary embodiment.

[0223] Leg portions 134, 135 of module 140 may at least substantially complement the frame below the bridge portion so that the ocular openings are at least substantially surrounded by material of the frame and module 140.

[0224] A right eye camera 150 for taking right eye images of the user, which may be considered eye-related observational data and may form at least a part of respective (primary) eye-related observations, is arranged in the right leg portion 135.

[0225] Likewise, a left eye camera 150 for taking left eye images of the user may be arranged in the left leg portion 134 (for providing eye-related observational data).

[0226] In the exemplary embodiment, the head-wearable device 100 is additionally provided with an inertial measurement unit 170 for measuring movements and/or orientations of the head-wearable device 100 and of a human subject wearing the head-wearable device 100, respectively.

[0227] Both the scene camera 160 and the inertial measurement unit 170 may provide respective (primary) eye-related observational data for the user.

[0228] As indicated by the dashed-dotted arrows in Fig. 5C, a computing system 200, 300 of system 500 is connectable with the scene camera 160 for receiving scene images, connectable with the eye camera(s) 150 for receiving eye images, and connectable with the inertial measurement unit 170 for receiving data referring to movements and/or orientations of the head-wearable device 100.

[00229] The computing system 200, 300 is configured to perform the methods explained herein.

[0230] For this purpose, the computing system 200, 300 typically includes one or more processors and a (respective) non-transitory computer-readable storage medium comprising instructions which, when executed by the one or more processors, cause system 500 to carry out the methods as explained herein.

[00231] In the exemplary embodiment, the computing system 200, 300 comprises several interconnectable parts, namely a first computing unit 200 and a second computing unit 300.

[00232] While the first computing unit 200 is typically a local unit or system and/or configured to perform the predicting methods as explained herein, the second computing unit 300 may be a remote unit or system, e.g. even cloud-based, and/or is typically configured to perform the actual training steps (cycles) of the training methods as explained herein (after receiving eye-related observational data via the first computing unit 200).

[0233] The second computing unit 300 may be configured to determine eye-related observations from the received eye-related observational data, and even to host and/or manage a database for the eye-related observations.

[00234] The first computing unit 200 may be implemented as a controller and may even be arranged within a housing of module 140, at least in part.

[0235] However, the first computing unit 200 may also at least in part be provided by a companion device connectable (with the second computing unit 300 and) to the controller of the head-wearable device 100, for example via a USB connection (e.g. one of the temples may provide a respective plug or socket), in particular a mobile companion device such as a smartphone, tablet or laptop.

[0236] Further, the second computing unit 300 may be configured to upload the trained eye state predictor tESP to the first computing unit 200.

[00237] According to an embodiment of a method for training an eye state predictor model, which may be implemented as a neural network, in particular as a convolutional neural network, the method includes feeding a first eye-related observation as input to the eye state predictor to determine a predicted 3D eye state of at least one eye of the subject for an observation situation (that may be represented by an observation ID or an observation time) as output of the eye state predictor model, the first eye-related observation referring to the at least one eye of the subject during the observation situation and/or including first eye-related observational data referring to the at least one eye of the subject during the observation situation. The predicted 3D eye state is fed as input to a differentiable predictor to determine a prediction for the at least one eye of the subject during the observation situation as output of the differentiable predictor. Based on the prediction and at least one of the first eye-related observation and a second eye-related observation referring to said observation situation and/or comprising second eye-related observational data referring to the at least one eye of the subject during said observation situation and being different from the first eye-related observational data, in particular based on the prediction and one of the first eye-related observation and the second eye-related observation, a (current) training loss is determined. The training loss is used to train the eye state predictor model and/or to change training parameters of the eye state predictor. In particular, the eye state predictor model may be trained by machine learning and using machine learning algorithms, respectively, in particular a respective optimization algorithm.

[00238] Although various exemplary embodiments of the invention have been disclosed, it will be apparent to those skilled in the art that various changes and modifications can be made which will achieve some of the advantages of the invention without departing from the spirit and scope of the invention. It will be obvious to those reasonably skilled in the art that other components performing the same functions may be suitably substituted. It should be mentioned that features explained with reference to a specific figure may be combined with features of other figures, even in those cases in which this has not explicitly been mentioned. Such modifications to the inventive concept are intended to be covered by the appended claims.

[00239] While processes may be depicted in the figures in a particular order, this should not be understood as requiring, if not stated otherwise, that such operations have to be performed in the particular order shown or in sequential order to achieve the desirable results. In certain circumstances, multitasking and/or parallel processing may be advantageous.

[0240] Spatially relative terms such as "under", "below", "lower", "over", "upper" and the like are used for ease of description to explain the positioning of one element relative to a second element. These terms are intended to encompass different orientations of the device in addition to orientations different from those depicted in the figures. Further, terms such as "first", "second", and the like, are also used to describe various elements, regions, sections, etc. and are also not intended to be limiting. Like terms refer to like elements throughout the description.

[00241] As used herein, the terms “having”, “containing”, “including”, “comprising” and the like are open ended terms that indicate the presence of stated elements or features, but do not preclude additional elements or features. The articles “a”, “an” and “the” are intended to include the plural as well as the singular, unless the context clearly indicates otherwise.

[0242] With the above range of variations and applications in mind, it should be understood that the present invention is not limited by the foregoing description, nor is it limited by the accompanying drawings. Instead, the present invention is limited only by the following claims and their legal equivalents.

Reference numbers

100 head-wearable device

114 front portion of frame

113, 123 temple

140 camera module

150 (right) eye camera

160 scene camera

170 inertial measurement unit

200, 300 computing system / controller / computing unit / companion device

500 system

1000-1430 method / method steps

3DES, {Ec, EG, Ep} 3D eye state

DP, DP1 differentiable predictor

ERO, EROi, Pi, Pr eye-related observation

ESP eye state predictor (model)

pERO primary eye-related observation

sERO secondary eye-related observation

Pi, Pr eye images

Si, Sr semantic segmentation of Pi, Pr

Δ, Δ1, Δ2, Δ' training loss

Ps scene image

Ss semantic segmentation of Ps

II, III prediction

1000-5500 methods, method steps