

Title:
LEARNABLE VISUAL MARKERS AND METHOD OF THEIR PRODUCTION
Document Type and Number:
WIPO Patent Application WO/2017/209660
Kind Code:
A1
Abstract:
This technical solution relates to methods for producing a family of visual markers capable of encoding information in the robotics, virtual reality and augmented reality domains. A synthesizing neural network that converts a sequence of bits into images of visual markers, a rendering neural network that converts input images of visual markers into images comprising visual markers, and a recognizing neural network that converts images containing visual markers into a bit sequence are created. The synthesizing, rendering and recognizing neural networks are trained jointly by minimizing a loss function reflecting the probability of correct recognition of random bit sequences. A localizing neural network that translates images comprising a marker into marker position parameters may be created instead of, or in addition to, the recognizing neural network. The technical result is an increase in the accuracy of recognition and localization of visual markers.

Inventors:
LEMPITSKY VICTOR SERGEEVICH (RU)
Application Number:
PCT/RU2017/050048
Publication Date:
December 07, 2017
Filing Date:
June 05, 2017
Assignee:
AUTONOMOUS NON-PROFIT ORGANIZATION FOR HIGHER EDUCATION «SKOLKOVO INST OF SCIENCE AND TECHNOLOGY» (RU)
International Classes:
G06N3/02; G05B13/02; G06T1/00
Foreign References:
US 20150106370 A1, 2015-04-16
US 20050108282 A1, 2005-05-19
US 20090003646 A1, 2009-01-01
US 20060110029 A1, 2006-05-25
US 20150161522 A1, 2015-06-11
US 20160098633 A1, 2016-04-07
US 20130219327 A1, 2013-08-22
Attorney, Agent or Firm:
KOTLOV, Dmitry Vladimirovich et al. (RU)
Claims:
CLAIMS

1. A method for creating a family of visual markers capable of encoding information, the method comprising:

• formation of a synthesizing neural network that translates a sequence of bits into images of visual markers;

• formation of a rendering neural network that converts input images of visual markers into images comprising visual markers by applying various geometric and photometric transformations;

• formation of a recognizing neural network that converts images containing visual markers into bit sequences;

• joint training of synthesizing, rendering and recognizing neural networks by minimizing the loss function, which reflects the probability of correct recognition of random bit sequences;

• synthesis of visual markers by means of passing certain bit sequences through the trained synthesizing neural network;

• obtaining a set of visual marker images from video data sources;

• extraction of the encoded bit sequences, carried out by passing the obtained set of visual marker images through the recognizing neural network.

2. The method of claim 1, wherein a rendering neural network converts input images of visual markers to images comprising visual markers placed on top of background images.

3. The method of claim 1, wherein a synthesizing neural network consists of a single linear layer, followed by an element-wise sigmoid function.

4. The method of claim 1, wherein the synthesizing and/or recognizing neural network is of a convolutional type.

5. The method of claim 1, wherein, during learning, a term that measures the aesthetic acceptability of markers is added to the optimization objective.

6. The method of claim 1, wherein, during learning, a term that measures the correspondence of markers to the visual style specified in the form of a sample image is added to the optimization objective.

7. The method of claim 1, wherein the minimization of the loss function is carried out using a stochastic gradient descent algorithm.

8. The method of claim 1, wherein, during learning, bit sequences are selected uniformly from the Boolean cube.

9. The method of claim 1, wherein the synthesizing, rendering and recognizing neural networks are feed-forward neural networks.

10. A method for creating a family of visual markers capable of encoding information, the method comprising:

• creation of variables corresponding to the pixel values of the visual markers being created;

• formation of a rendering neural network that converts pixel values of visual markers into images comprising visual markers by applying various geometric and photometric transformations;

• formation of a recognizing neural network that converts images comprising visual markers into bit sequences;

• joint training of the marker variables and the rendering and recognizing neural networks by minimizing the loss function, which reflects the probability of correct recognition of the marker class;

• synthesis of visual markers by creating raster images with values of pixels, determined as a result of training;

• obtaining a set of visual marker images from video data sources;

• extraction of marker class numbers from the obtained set of images by applying the recognizing network.

11. The method of claim 10, wherein a rendering neural network converts input images of visual markers to images comprising visual markers placed on top of background images.

12. The method of claim 10, wherein, during learning, a term that measures the aesthetic acceptability of markers is added to the optimization objective.

13. The method of claim 10, wherein, during learning, a term that measures the correspondence of markers to the visual style specified in the form of a sample image is added to the optimization objective.

14. The method of claim 10, wherein the minimization of the loss function is carried out using a stochastic gradient descent algorithm.

15. The method of claim 10, wherein the rendering and recognizing neural networks are feed-forward networks.

16. A method for creating a family of visual markers capable of encoding information, the method comprising:

• creation of variables corresponding to the pixel values of the visual markers being created;

• formation of a rendering neural network that converts input images of visual markers into images comprising visual markers by applying various geometric and photometric transformations;

• formation of a localizing neural network that transforms images comprising a marker to the marker position parameters;

• joint training of the marker variables and the rendering and localizing neural networks by minimizing the loss function, which reflects the probability of correctly estimating the position of a marker in an image;

• synthesis of visual markers by creating raster images with values of pixels, determined as a result of training;

• obtaining a set of visual marker images from video data sources;

• extraction of visual marker positions from the obtained set of visual marker images using the localizing network.

17. The method of claim 16, wherein a rendering neural network converts input images of visual markers to images comprising visual markers placed on top of background images.

18. The method of claim 16, wherein, during learning, a term that measures the aesthetic acceptability of markers is added to the optimization objective.

19. The method of claim 16, wherein, during learning, a term that measures the correspondence of markers to the visual style specified in the form of a sample image is added to the optimization objective.

20. The method of claim 16, wherein the minimization of the loss function is carried out using a stochastic gradient descent algorithm.

21. The method of claim 16, wherein the rendering and recognizing neural networks are feed-forward neural networks.

Description:
LEARNABLE VISUAL MARKERS AND METHOD OF THEIR PRODUCTION

FIELD

[1] This technical solution generally relates to the field of technical computing, and in particular to visual markers and methods for their production, which can be used in such application areas as robotics and virtual and augmented reality.

BACKGROUND

[2] Currently, visual markers (also known as visual fiducials or visual codes) are used to augment environments and to assist computer vision algorithms in scenarios that are resource-limited and/or require very accurate and robust operation. Existing visual markers can be exemplified by simple (linear) bar codes and their two-dimensional (matrix) counterparts such as QR codes or Aztec codes, which are used to embed visual information into various objects and scenes. Visual markers such as AprilTags are widespread in robotics (Fig. 6), together with similar methods, and represent a popular way to simplify the identification of locations, objects and agents for different types of robots. ARCodes and similar markers are used in the field of augmented reality in order to provide camera position estimates with a high degree of accuracy, low latency, and on budget devices. In general, such markers can embed visual information into the environment more compactly and independently of any language; they can also be recognized and used by autonomous as well as human-controlled devices.

[3] All visual markers known in the prior art are developed heuristically, and their appearance is motivated by the ease of their recognition by computer (machine) vision algorithms. The design and tuning of recognition algorithms, whose purpose is to provide reliable localization and interpretation of visual markers, is then carried out for each newly created family of markers. The creation of visual markers and of the corresponding recognition means can thus be divided into two stages, but this division is not optimal (since a certain type of "hand-crafted" marker may not be optimal from the point of view of a recognizer in the mathematical sense). In addition, the aspect of aesthetics is lost when creating visual marker families, which leads to the appearance of "intrusive" markers that in many cases do not correspond to the style of the environment in which they are placed or of the objects on which they are placed. Such markers make the structure of this environment or these objects "friendly to computers" (easy to recognize) but "unfriendly to humans".

SUMMARY

[4] This technical solution is aimed at eliminating the drawbacks inherent in solutions known in the prior art.

[5] The technical object of this technical solution is the creation of families of visual markers which do not suffer from the problems of the prior art.

[6] The technical result is an increase in the accuracy of visual marker recognition achieved by taking into account such factors as perspective distortion, confusion with the background, low resolution and image blur during the training of the neural network used to create markers. All such effects are simulated during neural network training in the form of piecewise-differentiable transformations.

[7] An additional technical result is the creation of visual markers with increased similarity to the visual style of an interior or the design of a certain product.

[8] The said technical result is achieved by implementing a method for the production of a family of visual markers capable of encoding information, in which a synthesizing neural network is formed which translates a sequence of bits into images of visual markers; a rendering neural network is formed which converts input images of visual markers into images comprising visual markers through a range of geometric and photometric transformations; a recognizing neural network is formed which translates images comprising visual markers into bit sequences; the synthesizing, rendering and recognizing neural networks are trained jointly by minimizing a loss function reflecting the probability of correct recognition of random bit sequences; a set of visual marker images is then obtained from a video data source, and the encoded bit sequences are extracted from the obtained set of visual marker images using the recognizing neural network.

[9] In another embodiment of the technical solution, a rendering neural network converts input images of visual markers to images comprising visual markers placed on top of background images.

[10] In another embodiment of the technical solution, a synthesizing neural network consists of a single linear layer, followed by an element-wise sigmoid function.

[11] In another embodiment of the technical solution, the synthesizing and/or recognizing neural network has a convolutional form (is a convolutional neural network).

[12] In another embodiment of the technical solution, the learning objective (loss function) is augmented with a term that measures the aesthetic acceptability of markers during the optimization process.

[13] In another embodiment of the technical solution, the learning objective (loss function) is augmented with a term that measures the correspondence of markers to the visual style specified in the form of a sample image.

[14] In another embodiment of the technical solution, the minimization of the loss function is carried out using a stochastic gradient descent algorithm.

[15] In another embodiment of the technical solution, during learning, bit sequences are selected with uniform probability from the set of vertices of the Boolean cube.

[16] In another embodiment of the technical solution, the synthesizing, rendering and recognizing neural networks are feed-forward networks.

[17] Additionally, the said technical result can be achieved with a method for the production of a family of visual markers in which variables corresponding to the pixel values of the visual markers are created; a rendering neural network is formed which converts the pixel values of the visual markers into images comprising such visual markers by applying various geometric and photometric transformations; a recognizing neural network is formed which translates images comprising visual markers into bit sequences; the marker variables and the rendering and recognizing neural networks are trained jointly, minimizing a loss function reflecting the probability of correct recognition of the marker class; a set of visual marker images is then obtained from a video data source, and the marker class numbers are extracted from the obtained set of images using the recognizing neural network.

[18] In another embodiment of the technical solution, the rendering neural network converts input images of visual markers to images comprising visual markers placed at the center of a background image.

[19] In another embodiment of the technical solution, the learning objective is augmented with a term that measures the aesthetic acceptability of markers.

[20] In another embodiment of the technical solution, the learning objective is augmented with a term that measures the correspondence of markers to the visual style specified in the form of a sample image.

[21 ] In another embodiment of the technical solution, minimization of the loss function is carried out using a stochastic gradient descent algorithm.

[22] In another embodiment of the technical solution, the rendering and recognizing neural networks are feed-forward networks.

[23] Additionally, the said technical result is achieved with a method in which variables corresponding to the pixel values of the visual markers are created; a rendering neural network is formed which converts input images of visual markers into images comprising such visual markers by applying various geometric and photometric transformations; a localizing neural network is formed which translates images comprising markers into marker position parameters; the marker variables and the rendering and localizing neural networks are trained jointly, minimizing a loss function reflecting the probability of correctly estimating the position of a marker in an image; a set of visual marker images is then obtained from a video data source, and the positions of visual markers are extracted from the obtained set of images using the localizing neural network.

[24] In another embodiment of the technical solution, a rendering neural network converts input images of visual markers to images comprising visual markers placed on top of a background image.

[25] In another embodiment of the technical solution, the learning objective (loss function) is augmented with a term that measures the aesthetic acceptability of markers.

[26] In another embodiment of the technical solution, the learning objective (loss function) is augmented with a term that measures the correspondence of markers to the visual style specified in the form of a sample image.

[27] In another embodiment of the technical solution, minimization of the loss function is carried out using a stochastic gradient descent algorithm.

[28] In another embodiment of the technical solution, the localizing, rendering and recognizing neural networks are feed-forward networks.

BRIEF DESCRIPTION OF THE DRAWINGS

[29] Features and advantages of this technical solution will become obvious from the following detailed description and from the accompanying drawings wherein:

[30] Fig. 1 shows an example of a method used for the creation and recognition of a visual marker family;

[31] Fig. 2 shows the rendering neural network. The input marker M, located on the left, is passed through several transformations (all piecewise differentiable with respect to the input); the outputs T(M, φ) corresponding to several random transformation parameters φ are shown on the right. The use of piecewise differentiable transformations within T makes it possible to backpropagate the training error through the rendering network;

[32] Fig. 3 shows examples of visual markers obtained by implementing this technical solution. The legend of the figure indicates the length of the bit string, the information capacity of the resulting encoding (in bits) and the accuracy level achieved during training. Six markers are shown in each case: (1) a marker corresponding to the all-zeros bit sequence, (2) a marker corresponding to the all-ones bit sequence, (3) and (4) markers corresponding to two random bit sequences that differ by one bit, (5) and (6) two markers corresponding to two random bit sequences. A characteristic grid pattern emerges under many conditions.

[33] Fig. 4 shows examples of stylized 64-bit marker families. The texture prototype is shown in the first column, while remaining columns represent markers for the following sequences: all zeros, all ones, 32 consecutive zeroes followed by 32 consecutive ones, and two random bit sequences which differ by a single bit at their ends;

[34] Fig. 5 shows screenshots of markers recovered from the video stream in real time, together with the number of correctly recognized bits and the total number of bits;

[35] Fig. 6 shows AprilTag visual markers;

[36] Fig. 7 shows the architecture of the rendering neural network: the network receives a batch of synthesized patterns of dimension (b × k × k × 3) and a batch of background images of dimension (b × s × s × 3). The network consists of an overlay (rendering) layer, an affine transformation layer, a color conversion layer and a blurring layer. The output dimension is s × s × 3;

[37] Fig. 8 shows the localizing neural network, in which the input image passes through three layers; the network predicts 4 point maps corresponding to the positions of the four corners of a visual marker;

[38] Fig. 9 shows a family of visual markers created via joint optimization of the rendering, localizing and recognizing neural networks. These markers appear identical to a human observer, but the recognizing neural network achieves a recognition accuracy of 99%;

[39] Fig. 10 shows the architecture of a system intended for the production of a family of visual markers, encoding information and suitable for localization;

[40] Fig. 11 shows an example of the estimation of marker position with the help of a trained localizing neural network (for the markers from the family shown in Fig. 9). The position of each marker is determined by the coordinates of its four corners. The predictions of the localizing neural network corresponding to the corners are shown as white dots.

DETAILED DESCRIPTION

[41] The notions and definitions necessary for a detailed description of the implemented invention are given below.

[42] The technical solution can be implemented as a distributed computer system.

[43] In this technical solution, a system means a computer system: a PC (personal computer), CNC (computer numerical control) system, PLC (programmable logic controller), computerized control system, or any other device that can perform a defined, clearly determined sequence of operations (actions, instructions).

[44] A command processing device is an electronic unit or integrated circuit (microprocessor) which executes machine instructions (programs).

[45] A command processing device reads and executes machine instructions (programs) received from one or more data storage devices. Data storage devices include, but are not limited to, hard drives (HDD), flash memory, ROM (read-only memory), solid-state drives (SSD), and optical drives (CD, DVD, etc.).

[46] A program is a sequence of instructions intended for execution by a computer control device or a command processing device.

[47] An artificial neural network (ANN) is a mathematical model, as well as its software or hardware implementation, built on the principle of a composite function which transforms input information by applying a sequence of simple operations (also called layers) that depend on the learnable parameters of the neural network. The ANNs considered below can belong to any of the standard types (for example, a multi-layer perceptron, a convolutional neural network, a recurrent neural network).

[48] Training of an artificial neural network is a process of adjusting the parameters of the layers within the network, as a result of which the network's predictions on training data improve. The quality of the ANN's predictions on training data is measured by a so-called loss function; thus, the training procedure corresponds to the mathematical minimization of the loss function.

[49] Backpropagation is a method for efficiently calculating the gradient of a loss function over the parameters of neural network layers via recurrence relations and known analytical formulas for the partial derivatives of individual layers within the neural network. By backpropagation we also mean the neural network training algorithm based on this method of gradient calculation.

[50] The learning rate of gradient-based neural network training methods is a parameter that controls the step size at each iteration.

[51] A visual marker is a physical object representing a printed image placed on one of the surfaces of a physical scene and designed to facilitate efficient processing of digital images containing this printed image using computer vision algorithms. The result of marker image processing can be either the retrieval of an information message (bit sequence) encoded by the corresponding marker, or the estimation of the position of a camera relative to the position of the marker at the moment when the digital image was taken. Examples of markers of the first type are QR codes. ArUco markers and AprilTags serve as examples of markers of the second type.

[52] A recognizing neural network is a neural network that receives an input image containing a visual marker and outputs the information message encoded in the marker.

[53] A localizing neural network is a neural network that receives an input image and outputs numerical information about the position of a visual marker in the image (for example, in the form of the positions of the marker's corners). As a rule, this information is sufficient to determine the position of a camera relative to the marker (provided some information about the camera's intrinsic parameters is also available).

[54] A synthesizing neural network is a neural network that receives numerical information as input, for example a bit sequence, and converts it into a color or grayscale image (a visual marker).

[55] A rendering neural network is a neural network which receives an image and converts it into another image in such a way that the output image resembles a digital photo of the printed input image.

[56] A convolutional ANN is one of the types of artificial neural networks widely used in the field of pattern recognition, including computer vision. A characteristic feature of convolutional neural networks is the representation of data in the form of a set of images (maps) and the application of local convolution operations which modify and combine maps with each other.

[57] Let us consider in detail the method for creating a trained visual marker shown in Fig. 1. The main goal of the procedure is to create a synthesizing neural network S(b; θ_S) with learnable parameters θ_S which can encode a bit sequence b = {b_1, ..., b_n} comprising n bits. Let us define a visual marker (pattern) M(b) as an image of size (k, k, 3) corresponding to the bit sequence b. We assume that b_i ∈ {−1, +1} in order to simplify subsequent derivations.

[58] A recognizing neural network R(I; θ_R) with learnable parameters θ_R is created and applied in order to recognize visual markers created by the synthesizing neural network. This neural network receives an image I comprising a visual marker and outputs an estimated sequence r = {r_1, ..., r_n}. The recognizing neural network interacts with the synthesizing neural network so as to satisfy the condition sign(r_i) = b_i, i.e. the sign of each number produced by the recognizing neural network corresponds to the bit encoded by the synthesizing neural network. In particular, it is possible to measure the success of the recognition procedure using a simple loss function based on the sigmoid curve:

[59] L(b; r) = -\frac{1}{n} \sum_{i=1}^{n} \sigma(b_i r_i) = -\frac{1}{n} \sum_{i=1}^{n} \frac{1}{1 + \exp(-b_i r_i)}   (1)

[60] where σ denotes the sigmoid function and the loss value is distributed between two extremes: −1 (perfect recognition) and 0.
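For illustration, loss (1) is straightforward to express in an automatic differentiation framework. The following is a minimal sketch assuming PyTorch; the tensor names `bits` and `r` stand for b and r and are illustrative:

```python
# A minimal sketch of the recognition loss (1), assuming PyTorch.
# `bits` holds the encoded sequence b with values in {-1, +1};
# `r` holds the raw recognizer outputs. The value lies in [-1, 0],
# approaching -1 when sign(r_i) matches b_i with high confidence.
import torch

def recognition_loss(bits: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    # -(1/n) * sum_i sigmoid(b_i * r_i)
    return -torch.sigmoid(bits * r).mean()
```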

[61] In real-life conditions, marker recognition algorithms do not receive marker images directly. Instead, visual markers are embedded in the environment (for example, by printing them and placing them on environmental objects, or by displaying them by electronic means), after which their images are captured using a camera controlled by a human or a robot.

[62] Therefore, the training of the recognizing and synthesizing neural networks needs to involve transformations applied to the visual marker created by the synthesizing neural network. These transformations correspond to applying a special feed-forward neural network (the rendering neural network) T(M, φ), whose parameters φ correspond to the variability of the background, the variability of lighting conditions, perspective distortion, blurring, color and white balance changes within the camera pipeline, etc. They are sampled during training from a certain distribution Φ, which shall simulate the variability of the aforesaid effects under the conditions in which the visual markers are planned to be used.

[63] The learning process can be carried out by minimizing the following objective when its purpose is reliable recognition of markers (and does not include precise marker localization):

[64] f(\theta_S, \theta_R) = \mathbb{E}_{b \sim U(n),\, \varphi \sim \Phi}\, L\big(b,\; R(T(S(b; \theta_S); \varphi); \theta_R)\big)   (2)

[65] The bit sequence b is selected uniformly from U(n) = {−1, +1}^n and passed through the synthesizing, rendering and recognizing neural networks. The loss function (1) is used to measure recognition success. The parameters of the synthesizing and recognizing neural networks are optimized to minimize the expectation of the loss function. Minimization of expression (2) can be performed using a stochastic gradient descent algorithm, for example ADAM [1]. Each iteration of the algorithm takes a mini-batch of bit sequences, samples parameters of the rendering neural network, passes the bit sequences through the synthesizing, rendering and recognizing neural networks, and updates the parameters of the synthesizing and recognizing neural networks so as to minimize the loss function (1) for the sampled mini-batch.
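For illustration, one iteration of this procedure might look as follows. This is a minimal sketch assuming PyTorch; `synthesizer`, `renderer` and `recognizer` are hypothetical module names standing for S, T and R, and the renderer is assumed to sample its transformation parameters φ internally:

```python
# One joint training step for objective (2): sample bits from U(n),
# pass them through S, T and R, and minimize loss (1) with e.g. ADAM [1].
import torch

def train_step(synthesizer, renderer, recognizer, optimizer,
               batch_size: int, n_bits: int) -> float:
    bits = torch.randint(0, 2, (batch_size, n_bits)).float() * 2 - 1  # U(n)
    markers = synthesizer(bits)        # S(b; theta_S): bit sequences -> markers
    observed = renderer(markers)       # T(M, phi): simulated "photographs"
    r = recognizer(observed)           # R(I; theta_R): raw bit estimates
    loss = -torch.sigmoid(bits * r).mean()  # loss (1)
    optimizer.zero_grad()
    loss.backward()                    # backpropagation through R, T and S
    optimizer.step()
    return loss.item()
```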

[66] In other embodiments, a localizing neural network is also added to the learning process (Fig. 8). It detects instances of markers in the video stream and determines their position within the frame (for example, the coordinates of their corners). The coordinates are converted into a binary map whose dimensions equal those of the input images. The binary map has a zero value at every point except the locations of the corners, where the value is equal to one. The localizing network is trained to predict these binary maps, which can subsequently be used to align markers before they are input to the recognizing neural network (Fig. 10), or within applications that require assessing the position of a camera relative to the marker. Incorporating such a localizing neural network into the training procedure forces the synthesizing neural network to create markers that differ from the background and have easily identifiable corners.
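For illustration, the binary corner maps that serve as the localizer's training targets can be built as in the following minimal sketch, assuming PyTorch; the function name and tensor layout are assumptions:

```python
# Build one (H, W) binary target map per marker corner: zeros everywhere
# except the corner location, which is set to one.
import torch

def corner_maps(corners: torch.Tensor, height: int, width: int) -> torch.Tensor:
    # `corners` holds four (x, y) pixel coordinates, shape (4, 2)
    maps = torch.zeros(4, height, width)
    for i, (x, y) in enumerate(corners.round().long()):
        maps[i, y.clamp(0, height - 1), x.clamp(0, width - 1)] = 1.0
    return maps
```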

[67] In other embodiments, a single marker or a small number of markers is created, far fewer than the number of bit sequences of any considerable length. The synthesizing network is not used in such cases; its parameters are replaced directly by optimization variables encoding the values of the marker pixels. Usually the localizing neural network is used in these cases, and the recognizing neural network is either implemented as a classifier over a number of classes equal to the number of markers, or is not applied at all (in the one-marker variant). An example of markers trained in this embodiment is shown in Fig. 9.

[68] As shown above, the architecture components, namely the synthesizing, rendering, recognizing and localizing neural networks, can be implemented, for example, as feed-forward networks or as other architectures that allow training by means of backpropagation. The recognizing network can be implemented as a convolutional neural network [2] with n outputs. The synthesizing neural network can also have a convolutional architecture (be a convolutional neural network), and so can the localizing neural network.
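For illustration, a convolutional recognizing network with n outputs can be assembled as in the following minimal sketch, assuming PyTorch; the depth and channel counts are assumptions, not the architecture of the claimed solution:

```python
# A small convolutional recognizer R(I; theta_R) producing n raw outputs,
# whose signs are interpreted as the decoded bits.
import torch.nn as nn

def make_recognizer(n_bits: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(128, n_bits),   # raw outputs r_1, ..., r_n
    )
```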

[69] The implementation of the rendering neural network T(M, φ) shown in Fig. 2 requires non-standard layers. The rendering neural network is implemented as a chain of layers, each of which introduces some "corrupting" transformation. A special layer overlays the input image (pattern) on a background image taken from a random set of images simulating the appearance of surfaces onto which trained markers may be placed in applications. A spatial transformer layer is applied to implement the geometric distortions [5]. Color or intensity changes can be implemented using differentiable element-wise transformations (linear, multiplicative, gamma transformations). All of the above layers can be applied sequentially, forming a rendering neural network which can thus simulate complex geometric and photometric transformations (Fig. 2).
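For illustration, such a chain of differentiable corruptions can be sketched as follows, assuming PyTorch; the quadratic color transform and the shapes of `theta` (a batch of 2 × 3 affine matrices) and `blur_kernel` (one depthwise kernel per channel) are assumptions:

```python
# Rendering chain T(M, phi): affine warp (spatial transformer [5]),
# overlay on a random background, element-wise color transform, blur.
# Every step is (piecewise) differentiable, so training errors can be
# backpropagated through the rendering network.
import torch
import torch.nn.functional as F

def render(markers, backgrounds, theta, c1, c2, c3, blur_kernel):
    # warp the marker into the background frame via a sampled affine grid
    grid = F.affine_grid(theta, backgrounds.size(), align_corners=False)
    warped = F.grid_sample(markers, grid, align_corners=False)
    mask = F.grid_sample(torch.ones_like(markers), grid, align_corners=False)
    composed = mask * warped + (1 - mask) * backgrounds  # background overlay
    colored = c1 * composed + c2 * composed ** 2 + c3    # color transform
    # depthwise blur; blur_kernel has shape (channels, 1, ks, ks)
    return F.conv2d(colored, blur_kernel,
                    padding=blur_kernel.size(-1) // 2,
                    groups=colored.size(1))
```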

[70] It shall be noted that optimization of the objective (2) under variable conditions leads to the appearance of markers that have a consistent and interesting visual texture (Fig. 3). Despite this visual "interestingness", it is desirable to control the appearance of the resulting markers more directly, for example by providing sample images.

[71] In some embodiments, the training objective (2) is supplemented by a term measuring the difference between the textures of the produced markers and the texture of a sample image [6]. Let us briefly describe this loss function introduced in [6]. Consider a feed-forward network C(M; γ) that computes the result of the t-th convolutional layer of a large-scale network trained for image classification, such as VGGNet [7]. The output of the network C(M; γ) for the image M contains k two-dimensional channels (maps). The network C uses parameters γ that are pre-trained on a large data set and are not part of the learning process that creates visual markers. The style of an image M is then described using the k × k Gram matrix G(M; γ), each element of which is defined as:

[72] G_{ij}(M; \gamma) = \langle C_i(M; \gamma),\, C_j(M; \gamma) \rangle   (3)

[73] where C_i and C_j are the i-th and j-th maps and the scalar product is taken over all spatial positions. The training objective can be supplemented with the following expression, which takes into account the texture of a prototype M^0:

[74] f_{style}(\theta_S) = \mathbb{E}_{b \sim U(n)} \| G(S(b; \theta_S); \gamma) - G(M^0; \gamma) \|^2   (4)

[75] Inclusion of the term (4) results in the markers S(b; θ_S) created by the synthesizing neural network having a visual appearance similar to the texture prototype M^0 [6].
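For illustration, the Gram matrix (3) and the style term (4) can be computed as in the following minimal sketch, assuming PyTorch; `features` is a hypothetical wrapper returning the activation maps C(·; γ) of a frozen pre-trained network such as VGGNet [7]:

```python
# Style loss: match the Gram matrices of the generated markers to that
# of the texture prototype M0.
import torch

def gram_matrix(maps: torch.Tensor) -> torch.Tensor:
    b, k, h, w = maps.shape
    flat = maps.reshape(b, k, h * w)
    return flat @ flat.transpose(1, 2)   # (b, k, k): G_ij = <C_i, C_j>

def style_loss(features, markers: torch.Tensor,
               prototype: torch.Tensor) -> torch.Tensor:
    g_markers = gram_matrix(features(markers))
    g_proto = gram_matrix(features(prototype))   # prototype batch of size 1
    return ((g_markers - g_proto) ** 2).sum(dim=(1, 2)).mean()
```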

[76] In some embodiments, error-correcting coding methods are used for longer bit sequences. In this case, the recognizing neural network returns a probability for each bit of the reconstructed signal, and the claimed technical solution can utilize any probabilistic error-correcting coding procedure.
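As an illustration only, the following sketch (assuming PyTorch) converts raw recognizer outputs into per-bit probabilities and decodes a toy 3× repetition code by soft decision; the claimed solution is not limited to this code:

```python
# Soft-decision decoding example. Under the sigmoid model,
# P(b_i = +1 | image) = sigmoid(r_i), so r_i acts as a log-likelihood ratio.
import torch

def bit_probabilities(r: torch.Tensor) -> torch.Tensor:
    return torch.sigmoid(r)

def decode_repetition3(r: torch.Tensor) -> torch.Tensor:
    # each payload bit is encoded three times; r has length 3 * n_payload
    llr = r.view(-1, 3).sum(dim=1)   # sum the log-likelihood ratios
    return torch.sign(llr)           # decoded bits in {-1, +1}
```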

[77] In some embodiments, the simplest synthesizing neural network is used. Such a network consists of a single linear layer (with a 3m² × n weight matrix and a bias vector, where m is the spatial size of the marker and n is the number of bits) followed by an element-wise sigmoid function. In other embodiments, the synthesizing neural network has a convolutional form, taking the binary code as input and then applying one or more fully-connected layers and one or more convolutional layers. In the latter case, the convergence of the training procedure greatly benefits from the addition of batch normalization [8] after each convolutional layer.
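For illustration, this simplest synthesizer can be written as the following minimal sketch, assuming PyTorch; the class name is hypothetical:

```python
# Single linear layer (3*m^2 x n weights plus bias) followed by an
# element-wise sigmoid, reshaped into an m x m RGB marker.
import torch
import torch.nn as nn

class LinearSynthesizer(nn.Module):
    def __init__(self, n_bits: int, m: int):
        super().__init__()
        self.m = m
        self.linear = nn.Linear(n_bits, 3 * m * m)

    def forward(self, bits: torch.Tensor) -> torch.Tensor:
        x = torch.sigmoid(self.linear(bits))   # pixel values in (0, 1)
        return x.view(-1, 3, self.m, self.m)   # (batch, 3, m, m)
```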

[78] In some embodiments, the parameters of the rendering network may be selected as follows. The spatial transformation is performed as an affine transformation whose six parameters are sampled from [1, 0, 0, 0, 1, 0] + N(0, σ) (assuming the coordinate origin at the center of the marker). An example for σ = 1 is shown in Fig. 2. Given an image x, the color transformation layer can be implemented as c₁x + c₂x² + c₃, where the parameters are chosen from the uniform distribution U[−δ, δ]. Since it has been determined that printing tends to reduce the contrast of visual markers, a contrast reduction layer is added which converts each value x to kx + (1 − k)·0.5 for a random k.
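For illustration, sampling these rendering parameters might look as follows, assuming PyTorch; treating c₁ as a perturbation of 1 while c₂ and c₃ are small offsets is an assumption about the intended parameterization:

```python
# Sample the random parameters phi of the rendering network.
import torch

def sample_affine(batch: int, sigma: float = 1.0) -> torch.Tensor:
    identity = torch.tensor([1., 0., 0., 0., 1., 0.]).repeat(batch, 1)
    return (identity + sigma * torch.randn(batch, 6)).view(batch, 2, 3)

def sample_color(batch: int, delta: float = 0.1):
    # assumed: c1 near 1, c2 and c3 drawn from U[-delta, delta]
    c1 = 1 + torch.empty(batch, 1, 1, 1).uniform_(-delta, delta)
    c2 = torch.empty(batch, 1, 1, 1).uniform_(-delta, delta)
    c3 = torch.empty(batch, 1, 1, 1).uniform_(-delta, delta)
    return c1, c2, c3

def reduce_contrast(x: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    return k * x + (1 - k) * 0.5   # pulls all values toward mid-gray
```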

[79] In some embodiments, the recognizing and localizing neural networks can be convolutional neural networks.

[80] The results of an embodiment of this technical solution are shown in Fig. 4. They demonstrate that the technical solution can successfully recover encoded signals with a small number of errors. The number of errors can be further reduced by applying an ensemble (group) of recognizing neural networks or by applying a recognizing neural network to several distorted versions of the input image (test-time data augmentation).
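For illustration, test-time data augmentation can be sketched as follows, assuming PyTorch; `distort` is a hypothetical function applying a random distortion to the input image:

```python
# Average raw recognizer outputs over several distorted copies of the
# input image before taking the sign of each bit.
import torch

def recognize_with_tta(recognizer, image, distort, n_augment: int = 8):
    outputs = torch.stack([recognizer(distort(image))
                           for _ in range(n_augment)])
    return torch.sign(outputs.mean(dim=0))   # bit estimates in {-1, +1}
```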

[81] In other embodiments, a marker can be aligned with a pre-specified square frame in order to improve accuracy (shown as part of the user interface in Fig. 5). As can be seen, the results deteriorate as misalignment with the pre-specified frame increases.

INFORMATION SOURCES USED

[82] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.

[83] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989.

[84] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.

[85] M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. Int. Conf. on Computer Vision (ICCV), pp. 2018-2025, 2011.

[86] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. Advances in Neural Information Processing Systems, pp. 2008-2016, 2015.

[87] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. Advances in Neural Information Processing Systems (NIPS), pp. 262-270, 2015.

[88] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[89] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proc. International Conference on Machine Learning (ICML), pp. 448-456, 2015.

[90] E. Olson. AprilTag: A robust and flexible visual fiducial system. Proc. IEEE International Conference on Robotics and Automation (ICRA), pp. 3400-3407, 2011.