


Title:
SEMI-OPTIMAL PATH FINDING IN A WHOLLY UNKNOWN ENVIRONMENT
Document Type and Number:
WIPO Patent Application WO/2001/078951
Kind Code:
A1
Abstract:
A mobile robot guided by a new learning algorithm is presented. The first task for the robot is searching a path like a human being, that is, finding a semi-optimal path to a given goal in a wholly unknown, unpredictable and partly dynamic large-scale environment. This demands a high technique of environment modeling through learning. Because the robot should realize its missions within the framework of the task knowledge, a new phenomenon and a Dead Corner theory appear; a genetic algorithm and a reinforcement learning algorithm are the basis of the presented theory. Powerful methods are described with the mathematical tool of MDP models. The essential sub-goal method is similar to the sub-goal method of the Options over MDP theory and has been successfully proved. The presented theory is a refined theory of the Learning Classifier System theory. Most of the theory has been proved on a visualization platform of an animation simulator.

Inventors:
LIN ZHIMIN (JP)
Application Number:
PCT/JP2000/002427
Publication Date:
October 25, 2001
Filing Date:
April 13, 2000
Assignee:
LIN ZHIMIN (JP)
International Classes:
G05D1/02; (IPC1-7): B25J9/16
Other References:
ZHANG J., ET AL: "ROBUST SUBGOAL PLANNING AND MOTION EXECUTION FOR ROBOTS IN FUZZY ENVIRONMENTS", PROCEEDINGS OF "INTELLIGENT ROBOTS AND SYSTEMS '94", IROS'94., 1994, pages 447 - 453, XP002160783
SMITHERS T: "On quantitative performance measures of robot behaviour", ROBOTICS AND AUTONOMOUS SYSTEMS,NL,ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, vol. 15, no. 1, 1 July 1995 (1995-07-01), pages 107 - 133, XP004062099, ISSN: 0921-8890
DEL R ET AL: "Reinforcement learning of goal-directed obstacle-avoiding reaction strategies in an autonomous mobile robot", ROBOTICS AND AUTONOMOUS SYSTEMS,NL,ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, vol. 15, no. 4, 1 October 1995 (1995-10-01), pages 275 - 299, XP004001929, ISSN: 0921-8890
SAMEJIMA K ET AL: "Adaptive internal state space construction method for reinforcement learning of a real-world agent", NEURAL NETWORKS,ELSEVIER SCIENCE PUBLISHERS, BARKING,GB, vol. 12, no. 7-8, October 1999 (1999-10-01), pages 1143 - 1155, XP004180502, ISSN: 0893-6080
KANG D ET AL: "PATH GENERATION FOR MOBILE ROBOT NAVIGATION USING GENETIC ALGORITHM", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INDUSTRIAL ELECTRONICS,CONTROL, AND INSTRUMENTATION (IECON),US,NEW YORK, IEEE, vol. CONF. 21, 6 November 1995 (1995-11-06), pages 167 - 172, XP000586452, ISBN: 0-7803-3027-7
THRUN S ET AL: "Lifelong robot learning", ROBOTICS AND AUTONOMOUS SYSTEMS,NL,ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, vol. 15, no. 1, 1 July 1995 (1995-07-01), pages 25 - 46, XP004062095, ISSN: 0921-8890
BARSHAN, B. ET AL: "Inertial navigation systems for mobile robots", IEEE TRANSACTIONS ON ROBOTICS AND AUTOMATION, vol. 11, no. 3, June 1995 (1995-06-01), pages 328 - 342, XP002160801
SMITH R E ET AL: "Combined biological paradigms: A neural, genetics-based autonomous systems strategy", ROBOTICS AND AUTONOMOUS SYSTEMS,NL,ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, vol. 22, no. 1, 10 November 1997 (1997-11-10), pages 65 - 74, XP004100314, ISSN: 0921-8890
Description:
DESCRIPTION
SEMI-OPTIMAL PATH FINDING IN A WHOLLY UNKNOWN ENVIRONMENT
TECHNICAL FIELD
The first task for our autonomous learning robot is searching a path like a human being in a wholly unknown environment. In other words, it is finding a semi-optimal path to a given goal in a wholly unknown environment. The task description for the robot is to follow the goal's direction and approach the goal. Thus, there is only one problem, which is collision avoidance in a wholly unknown environment.

Therefore, it demands a high technique of learning for environment modeling. Being similar to the theory of A Learning Classifier System, published by MIT Press in 1998, our theory also has two basic algorithms: a genetic learning algorithm and a reinforcement learning algorithm.

Learning algorithms are an essential topic in artificial intelligence. Utilizing the theory of the Markov Decision Process (MDP), which is a discrete-time process, is a powerful approach for analyzing human decision problems in artificial intelligence. How to avoid obstacles is a kind of decision process. Moreover, the motion of a robot can be considered a discrete-time process. Therefore, we utilize MDP theory to describe and analyze the basis of our HiMBS theory. Temporal abstraction is an essential topic in artificial intelligence for analyzing the human decision process. Options over MDP was proposed by R. S. Sutton based on MDP theory and semi-MDP theory; it is a theory of temporal abstraction within the framework of reinforcement learning. We also utilize the Option concept to describe our powerful novel environment modeling method: the sub-goal task method.

An important note is that, although our HiMBS theory is also an approach of temporal abstraction in artificial intelligence, our approach is not an Action Replay Process (ARP) or a Synchronous Value Iteration (SVI) process like the theory of Options over MDP and Q-learning. Therefore, the most important things for understanding the MDP part of our HiMBS theory are the understandings of several fundamental MDP concepts: MDP, MDP tree, State, Action, State value, Policy and Options.
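
For illustration only, the sketch below (in Python) shows one possible way to hold these fundamental concepts as plain data structures; the type names and fields are our own illustrative choices and are not taken from the HiMBS program itself.

# Illustrative sketch only: plain data holders for the MDP concepts named above.
# All names here are illustrative choices, not the HiMBS implementation.
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

@dataclass(frozen=True)
class State:
    coords: Tuple[int, int]          # a footmark's coordinates as its state element

Action = Tuple[int, int]             # one of the robot's primitive moving directions

@dataclass
class Policy:
    # mapping from a state to the probability of taking each available primitive action
    probs: Dict[State, Dict[Action, float]] = field(default_factory=dict)

@dataclass
class Option:
    # an option in the sense of Options over MDP: initiation test, policy, termination test
    initiate: Callable[[State], bool]
    policy: Policy
    terminate: Callable[[State], bool]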

BACKGROUND ART
Our algorithm of HiMBS is a new learning algorithm for a mobile robot. It is designed for making an authoritative autonomous learning mobile robot. A conventional opinion on the role of the autonomous learning robot is that it should be able to interact with a wholly unknown natural environment, which is unpredictable in both time and space and perhaps sometimes contains dynamic obstacles. This is one of the most difficult problems for most man-made machines all over the world.

Most conventional learning algorithms for mobile robots demand much human empirical knowledge to enable a robot to explore and exploit a wholly unknown environment. Mostly, the algorithm's designer not only gives the task question to a machine but also spends tremendous time implanting non-task-method answer keys into this machine; that is, these methods cannot be realized as the task description does but greatly depend on a human's other non-task-method empirical knowledge to realize the task. Strictly speaking, the environment in most of these conventional learning theories is not exactly wholly unknown. Some representative algorithms are discussed later.

Once more, the human's other empirical knowledge we mention here means that the answer keys of these kinds of knowledge do not utilize knowledge of the task to realize the task. For instance, for the general task of path finding, that is, the task of following the goal's direction from a start point and approaching this given goal, the method of pre-giving an environment map to a robot and the method of wall following are not considered task-method answer keys. They are the human's other non-task empirical knowledge answer keys to realize the task.

In our HiMBS algorithm, we utilize task knowledge, which is the task description of following the goal's direction and approaching the goal, to realize the task of finding a semi-optimal path from a start point to an arbitrary given goal.

Our Background of Some Conventional Learning Algorithm Theories and Path Finding Theories
The supervised learning algorithm is a famous conventional learning algorithm. However, it often wastes much time on designing a rarely happening architecture in an actual experiment. It cannot deal with a wholly unknown environment exactly.

A learning classifier system affords a very good algorithm framework. It consists of a reinforcement learning algorithm and a genetic algorithm. However, it is impossible to make a robot avoid arbitrary obstacles by just constructing a blank algorithm structure that shows avoidance of several simple obstacles. We are not sure that this algorithm can deal with a wholly unknown environment and make a robot show more complex behaviors.

Jong Hwan Lim and Dong Woo Cho, in Robotica 1998, propose the method of employing the concept of a Quadtree to solve the path finding task. "Robust Sub-goal Planning and Motion Execution for Robots in Fuzzy Environments", which is proposed by Jianwei Zhang and J. Raczkowsky in IROS, is also for realizing the same task as ours. Both of them also utilize the same Sub-goal method as ours to solve the environment-modeling problem.

The environment in path finding can be described as a tree structure: a particular Quadtree or a general MDP tree for the whole environment (see Fig. 1). In the general MDP tree, each node represents a possible passing point for the robot. Every group of nodes that runs from the starting point to the final point of the task goal forms a path. An optimal path can be derived from all of the found paths. Therefore, how to find the nodes of a path is the first essential topic for every path finding theory.
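
As a rough illustration of this tree view (the helper names below are our own assumptions, not part of this invention), a path can be held as a sequence of passing points from the start to the goal, and the optimal path can be chosen among all found paths, for example by total length:

# Sketch only: a path as a sequence of passing points and the optimal path picked
# among all found paths by total length.
import math
from typing import List, Tuple

Point = Tuple[float, float]

def path_length(path: List[Point]) -> float:
    # sum of straight-line segment lengths between consecutive passing points
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def optimal_path(found_paths: List[List[Point]]) -> List[Point]:
    # derive the optimal (shortest) path from all of the found paths
    return min(found_paths, key=path_length)

# usage: two candidate paths from (0, 0) to (10, 0); the direct one is shorter
paths = [[(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)], [(0.0, 0.0), (10.0, 0.0)]]
print(optimal_path(paths))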

The method of employing a Quadtree to find nodes is not a task method; that is, it is additional human empirical knowledge beyond the task knowledge. This method assumes the robot has Quadtree knowledge and can decompose its workspace into several sub-nodes. These nodes are chosen as sub-goals to be reached successively. However, this decomposition method cannot solve the obstacle avoidance problem; in other words, this method cannot model the environment. In order to solve this problem, the designer implants another piece of human empirical knowledge into the robot, which is following the boundaries of objects; that means the robot has the empirical knowledge of wall following to pass obstacles. In a word, this theory employs much human non-task empirical knowledge to enable a robot to successfully find a path in an unknown environment.

The method of the robust sub-goal planning theory also does not utilize the task method to find nodes. This method assumes that obstacles are polygons and that these polygons are pre-known. Thus, it is possible for this method to utilize the method of searching a tangent graph to find nodes. Obviously, the environment in this method is not an actually unknown environment.

In a word, although these two methods also treat the nodes of a tree as sub-goals to find a path, they utilize too much human non-task empirical knowledge to find the nodes of an environment's Markov tree.

However, our method of finding these nodes is a novel method. These nodes are found by the robot's own experience of learning and interacting with its environment, not by any pre-assumption of us human beings. The robot is guided by its task knowledge: following the goal's direction and approaching the goal.

Therefore, our invention here focuses on how our task method can find the nodes of a semi-optimal path by the robot itself. That is, how can our HiMBS robot realize the problem of environment modeling through learning? An optimal path can be derived after all semi-optimal paths are found in our HiMBS theory. Our invention here firstly considers how our robot can find a semi-optimal path.

Our Background of Options over MDP
The theory of Options over MDP is newer than Dyna-Q learning, which was also developed by R. S. Sutton. It introduces many methods on exploring and exploiting an environment and how to calculate an optimal policy, such as: 1. The reinforcement learning (MDP) framework. 2. SMDP (option-to-option) methods. 3. Sub-goals for learning options. Etc. They have proposed many mathematical methods to describe and solve the human decision process as a Markov Decision Process.

Collision avoidance is a kind of human decision process and a discrete-time process. Therefore, we utilize OMDP to describe our method.

However, as mentioned by R. S. Sutton himself, the theory of Options over MDP offers a lot, but the most interesting cases are beyond it. We consider this to mean the details of evaluating actions' credits, actual methods of how to interact with a large-scale wholly unknown environment, and one of the most difficult actual problems: the robot's position navigation system problem. They also say that many key issues remain incompletely understood, for example the source of sub-goals (and of course the method of realizing them), integration with state abstraction, etc. Although our approach is different from these ARP processes, a similar sub-goal method, which is the only essential method for solving our problem of environment modeling through learning, is proposed by us to successfully realize the task of environment modeling.

Benchmark Mobile Learning Robot Problems: Environment Problem, Learning Problem and Position Navigation Problem.

The benchmark mobile learning robot problem is how to make an authoritative autonomous learning robot. That is, whether a mobile robot is this kind of robot or not is judged according to the fuzziness of its interacting environment. A mobile robot can be called an autonomous learning robot if its interacting environment is wholly unknown, as complex and arbitrary as an actual space will be, with the possibility of containing dynamic obstacles at every corner. However, being capable of interacting with a wholly unknown environment is not the only judging criterion of an autonomous learning robot. This environment capability has been a final goal pursued throughout the long past of human history. That is, in order to enable a robot to interact with a really wholly unknown environment, we, as designers, should implant a high and certain learning capability into the robot. What kind of learning capability we should implant into a robot is a great problem.

Moreover, there is one of the most difficult actual problems for a mobile robot: setting up an accurate and robust position-localizing system is always arguably one of the essential necessities to meet the objectives of an autonomous guided mobile robot in these environments. There are two representative existing position navigation systems: the Global Positioning System (GPS) and the Inertial Navigation System (INS).

Realizing these localization systems in a small-range area may be accurate and reasonable in cost. On the contrary, in a large-range arbitrary natural environment the accuracy of these navigation systems greatly decreases, the cost becomes expensive and the correctness of the robot's interactions is destroyed.

DESCRIPTION OF THE INVENTION
1. Our robot's conditions, improvements and the structure of this description of the invention.

Perhaps there exist many other methods to realize the same task as our system: the task of environment modeling through learning. However, our methods are realized under the robot's worst conditions: the robot only has one kind of external sensor, which is only capable of detecting obstacles one step around the robot.

Here, one step refers to the length between the robot's two footmarks. The robot does not have any outer position navigation system. Furthermore, the robot should realize its task within the framework of the task knowledge, which is the least knowledge that we should give to a robot when it is put into an environment. However, the robot has to interact with a wholly unknown environment.

Therefore, what is the task description for our robot? We give an example in Fig. 1. In such an empty and lightless environment, with nothing except for the Goal, the task is that the robot should approach the goal and reach the goal by following the goal's direction. One criterion for judging that the goal has been reached is that the robot is stopped; in other words, when the robot can judge that there are two same states, we can inform the robot that it has reached its task goal. Corresponding to the theory of Options over MDP [R. S. Sutton], the task can be described in such an MDP Policy model:

Py: S × A, goal = TG → [0, 1], method = θ    ---- Task MDPP model (1)

Here θ means following the goal's direction and approaching the goal. We will describe more details of this model later. The theory of Options over MDP is a framework of temporal abstraction in reinforcement learning. It is as general as a mathematical tool for analyzing methods of exploring an environment and exploiting it by calculating policy values, action values and state values for a robot.

We have made some important improvements over these two theories: 1. We touch on some of the most important actual issues: how to enable the robot to interact with a large-scale wholly unknown environment.

We propose a particular method of the framework of the task knowledge to realize environment modeling through sub-goal learning. 2. We propose a Zhinan Zen Inertial Navigation System with only one inertial sensor, an encoder, to guide the robot and evaluate its actions' credits. 3. We propose the essential issue of the sub-goal method to enable the robot to overcome the complexity of an environment and to find a semi-optimal path quickly. 4. Our sub-goal method can absorb the great actual mobile robot problems of motor slipping (because the state in which slipping occurs will be treated as a deadlock state) and accumulative dead reckoning errors (because the robot moves according to orientation).

Our method is quite novel because we try to use the task knowledge to finish the task, and this method is as natural as a human's daily intelligent activities. We compare our HiMBS robot with a human being under the same task of finding a semi-optimal path to a given goal to show our robot's naturalness (see Table 1.0).

Table 1.0. The conditions of a natural autonomous robot.
A human:
0. Can interact with a wholly unknown environment.
1. No need of any position navigation system.
2. Only needs to know the goal's direction, an eigenvalue or some literal features of the goal.
3. Has a strong memory problem.
4. Has experience knowledge about the environment and methods, e.g. walking along a wall to pass.
5. Has knowledge of its front, back, etc.
6. Can see further than one footstep around.
Our HiMBS robot:
0. The same as a human.
1. The same as a human.
2. The same as a human; it is better to have an estimated position of the goal.
3. Has a strong memory.
4. No knowledge of the environment.
5. Does not have such knowledge.
6. Can only detect an obstacle one step around it.

The structure of this description of our invention is: 1. Brief introduction of our robot's conditions, improvements and the structure of this description of the invention. 2. Our HiMBS theory, its program procedure and an essential sub-goal theorem. 3. Experiment results and an error function. 4. Conclusions and our future work.

2. Our HiMBS theory, its program procedure and an essential sub-goal theorem.

An autonomous learning system should be able to endow a robot with human-level intelligence to interact with a natural environment. Mostly, the natural environment is wholly unknown and unpredictable in both time and space. Sometimes it also contains dynamic objects, for instance an obstacle, an active predator or a locomotive goal. Of course, it consists of arbitrary obstacles, which have uncertain size and shape. Conventionally, if an autonomous robot cannot successfully realize this mission in such a natural environment, it is not considered an authoritative autonomous learning robot, since an autonomous learning robot interacts with an environment as uncertain as a natural creature does. Therefore, its interaction is a process from sub-optimization to optimization, a process from trials to planning, a process from errors to correctness. The genetic algorithm is quite suitable for such a process. We utilize it as the control center of the robot's actions. However, the environment is far too complex. There are various different stimuli, whereas there is only one standard of credit for the genetic algorithm to run its rules (actions, etc.) process. The reinforcement learning algorithm is suitable for translating various stimuli into a single credit. It is a good evaluating tool. Thus, we use the reinforcement learning algorithm as a cooperative algorithm of the GA.

The architecture of our HiMBS system is shown in Fig. 3. Since the environment is wholly unknown to our robot, the genetic algorithm makes our robot sure to explore and exploit the environment thoroughly.

The genetic algorithm has four modules: reproduction, selection, crossover and mutation. Reproduction produces some fundamental strategies and, of course, primary actions at each footmark of the robot. For instance, the strategy of the framework of the task method knowledge is executed in this module, as well as the primary strategy of the sub-goal method. Selection acts as an auction selector to select every action and every strategy to command the actuator or other component systems to produce actions or rules. The roles of both crossover and mutation are only realized in finding other paths in order to find an optimal path, which is not the central topic in this paper. They both change the genes of a path to find other paths.

A gene of a path means the important points of a path, not all of the points of a path. The roles of the reinforcement learning algorithm are: evaluating the credits of each action and each control rule in order to let the Selection module work, and evaluating the stimuli of the robot, for example the stimulus of encountering an obstacle, the stimulus of reaching the task goal, etc.
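
For illustration only, the following sketch (in Python) shows this cooperation in miniature: the GA side produces and selects strategies by an auction on credits, and the reinforcement side translates the various stimuli into the single credit that Selection needs. All function names and the particular scoring are our own simplifications, not the actual HiMBS modules.

# Rough, illustrative sketch of the GA/RL cooperation described above.
import random
from typing import Dict, List, Tuple

def reproduction() -> List[str]:
    # Reproduction: produce the fundamental strategies / primary actions at a footmark.
    return ["task_method", "sub_goal_method"]

def reinforcement_credit(stimuli: Dict[str, float]) -> float:
    # Reinforcement part: translate various stimuli (progress toward the goal,
    # encountering an obstacle, ...) into one single credit number.
    return stimuli.get("goal_progress", 0.0) - stimuli.get("obstacle_penalty", 0.0)

def selection(credited: Dict[str, float]) -> str:
    # Selection: auction-like choice of the strategy/action with the highest credit.
    return max(credited, key=credited.get)

def mutation(path_genes: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    # Mutation/crossover role: reorder the genes (important points) of a found path
    # so that other paths can be searched for later.
    genes = list(path_genes)
    random.shuffle(genes)
    return genes

# usage: credit each candidate strategy from its observed stimuli, then select.
credits = {
    "task_method": reinforcement_credit({"goal_progress": 1.0}),
    "sub_goal_method": reinforcement_credit({"goal_progress": 0.4, "obstacle_penalty": 0.1}),
}
print(selection(credits))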

We use the sub-goal's MDP model to describe the most fundamental strategy, which is the framework of the task method knowledge. For instance, as we show in the figure, when the robot is stopped, the genetic algorithm transfers the current goal and the termination condition of the stopped process to the MDP sub-goal model; this model then produces sub-goals as the following current goal for the reinforcement learning algorithm. The reinforcement learning module uses the current goal to continue to evaluate the robot's every action. The MDP sub-goal model judges the termination condition of the sub-goal process. The procedures of our whole system's program are: In the 1st whole loop, the robot utilizes the task method knowledge to explore the environment; it can then find one quite short path, and it can find the optimal path at once if there is no obstacle, because of our task method. Our robot utilizes an inertial navigation method to build a two-dimensional coordinate plane for its environment. Moreover, in the first whole loop, after the robot arrives at its task goal, it can correct the estimated coordinates of the task goal's position.

During the 2nd whole loop, the robot will try to optimize the found path by calling the mutation module and will calculate the path's policy value according to its learning experience of the environment's model.

We have mentioned that our HiMBS theory is a refined theory of Options over MDP. One of the biggest improvements concerns the most fundamental problem of the credit model. In Q-learning theory and Dyna-Q learning theory, the credit models are only in a real number space; that is, they only consider the reward part of a credit, which is r_i, i = 0, 1, 2, 3, .... However, according to the newest reinforcement learning theory, a credit consists of not only a reward but also a punishment. Therefore, we upgrade the most primary credit model into a complex number space, which is c_i = r_i + j * p_i, i = 0, 1, 2, 3, .... Here, p_i represents the punishment part of a credit. Thus, from our credit model we can have more accurate information on the robot's interaction, and it enables us to separate two paths that have the same policy value, as in Q-learning, but have two different environments. We can then find the optimal path. Therefore, all of the value calculations of policy values (in the mobile robot case, simply speaking, path values), state values and action values are changed. For example, the calculation of state values is changed from formulation (2) to formulation (3).
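
A minimal sketch of this complex-number credit, using Python's built-in complex type (the function names are ours): the real part carries the reward and the imaginary part carries the punishment, so two paths with equal reward sums but different punishments stay separable.

# Minimal sketch of a complex credit c_i = r_i + j * p_i (names are illustrative).
from typing import List

def credit(reward: float, punishment: float) -> complex:
    return complex(reward, punishment)

def path_value(credits: List[complex]) -> complex:
    # accumulate both parts: real = total reward, imaginary = total punishment
    return sum(credits, start=0j)

path_a = [credit(1.0, 0.0), credit(1.0, 0.5)]
path_b = [credit(1.0, 0.0), credit(1.0, 0.0)]
print(path_value(path_a), path_value(path_b))
# both have the same real (reward) part; the punishment part separates the two paths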

However, in the whole loop of exploring an environment, we do not utilize the SVI process (Synchronous Value Iteration process). In other words, we do not start with an arbitrary approximation calculation to find paths and then make path planning, because exploring every point of an environment by starting with an arbitrary approximation calculation demands numerous memories and takes plenty of time.

Our method of utilizing the framework of the task method knowledge has an immediate approximation.

In addition, we utilize the cumulative calculation formulations, such as formulation (3), in the second whole loop, that is, soon after the robot has finished its task. Therefore, our calculation of state values in the first whole loop of exploring an environment is simplified as formulation (4). Here, Py(s, a) means the probability of taking action a in state s under the policy Py (task MDPP model (1)).

r(s, a) means the step-forward reward of taking action a in state s.
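
Since formulations (2)-(4) themselves are not reproduced in this text, the following sketch only follows the two textual definitions just given (the action probability under the policy and the step-forward reward) and assumes the simplified first-loop state value is their probability-weighted sum; this is an assumption of ours, not formulation (4) itself.

# Sketch only: a probability-weighted one-step value built from the textual
# definitions above (Py(s, a): probability of action a in state s; r(s, a): the
# step-forward reward of that action). Formulations (2)-(4) are not reproduced here.
from typing import Dict, Tuple

Action = Tuple[int, int]

def simplified_state_value(action_probs: Dict[Action, float],
                           step_rewards: Dict[Action, float]) -> float:
    # sum over the available actions of Py(s, a) * r(s, a)
    return sum(p * step_rewards[a] for a, p in action_probs.items())

probs = {(1, 0): 0.7, (0, 1): 0.3}
rewards = {(1, 0): 1.0, (0, 1): 0.2}
print(simplified_state_value(probs, rewards))   # 0.76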

Thus, how to evaluate the actions' credits and how to realize our three objectives are described below. To realize the first objective of environment modeling, there is a new dead-corner theory that appears in the first whole loop.

As we mentioned before, the method condition of our robot is described in an empty and lightless environment. In the task MDPP model (1), Py means a policy, S means the state space of the robot's environment, A means the action unit of the robot, [0, 1] means that the probability of a policy is between 0 and 1, and θ represents a certain method to realize the policy. A policy means the mapping from states to the probabilities of taking each available primitive action. A brief and more accurate explanation of a policy is that it is a possible sequence of action-state pairs. In path finding theory, a policy can simply be called a path.

Therefore, what does model (1) mean exactly? When a robot is put into a wholly unknown environment for the goal, it receives the only environment knowledge: the goal's direction and the estimated coordinates of the goal. (Strictly speaking, the coordinates of the goal can be obtained from the goal's direction.) The task for the robot, even in an empty and lightless environment (containing nothing except for the robot and the goal), is described as: follow the goal's direction; if the robot is not stopped, the robot keeps on approaching. The initiating conditions are written as I. When the goal is treated as an obstacle, the robot will fall into an oscillating state, which is called a deadlock state. Thus, the deadlock state conditions can be a criterion to judge the task goal. The deadlock termination condition of a global task process is represented by β. Thus, in order not to use an outer navigation system and to evaluate the actions' credits, we propose a Zhinan Zen Inertial Navigation System and Evaluation method (abbreviated as ZZINS_EF, see Fig. 3).

Our robot has eight moving freedoms. We use the ZZINS_EF to evaluate the credits of these eight possible actions according to whether the direction of an action makes the robot go farther from or nearer to the task goal.

An action has the highest credit among the eight actions if its direction is the same as the current direction of the task goal. Because of our ZZINS_EF method, our robot has an immediate efficiency of finding the shortest path at once in an empty and lightless environment and does not need any outer position navigation system. Our robot is initiated according to the direction of the current goal, not the coordinates of the robot. Therefore, our robot can absorb a quite high level of motor slipping errors.

Thus, the eight possible moving directions have their own corresponding credits.
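
For illustration, a direction-based crediting of the eight moving freedoms can be sketched as below; the cosine scoring is our own simplification and is not the actual ZZINS_EF implementation.

# Sketch of crediting the eight moving directions by how well each one aligns with
# the current goal direction (our simplification, not the actual ZZINS_EF code).
import math
from typing import Dict, Tuple

DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]

def direction_credits(robot: Tuple[float, float],
                      goal: Tuple[float, float]) -> Dict[Tuple[int, int], float]:
    # credit each action by the cosine between its direction and the goal direction,
    # so the action pointing straight at the current goal gets the highest credit
    gx, gy = goal[0] - robot[0], goal[1] - robot[1]
    gnorm = math.hypot(gx, gy) or 1.0
    credits = {}
    for dx, dy in DIRECTIONS:
        dnorm = math.hypot(dx, dy)
        credits[(dx, dy)] = (dx * gx + dy * gy) / (dnorm * gnorm)
    return credits

c = direction_credits((0.0, 0.0), (10.0, 0.0))
print(max(c, key=c.get))   # -> (1, 0), the direction of the current goal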

Therefore, what do the initiating conditions I and the deadlock termination conditions β mean exactly in this empty and lightless environment? The most simplified meaning of I is that, if the initiating conditions are not equivalent to the deadlock termination conditions at the goal, the robot can keep on going. We define the state of the robot's ith footmark as s_i, i = 0, 1, 2, 3, 4, ...; therefore, its next state is s_{i+1}. The deadlock termination condition β means that there exist at least two same footmark states at the same position around the goal in this empty and lightless environment. Mostly, a wholly unknown environment is quite complex. Plenty of particular literal characteristics of an object (obstacle) can represent one unique state element to a human; however, this is difficult for a computer. Whereas the coordinates of a position are not difficult for a computer, we consider the coordinates of a footmark as a state element of the state at that footmark.
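
A minimal sketch of this deadlock criterion, assuming exact coordinate equality as the state-element test (the function name is an illustrative choice of ours):

# Sketch of the deadlock criterion: at least two footmark states share the same
# coordinate state element.
from typing import List, Tuple

def is_deadlock(footmarks: List[Tuple[int, int]]) -> bool:
    # a repeated coordinate means the robot has revisited a footmark state
    return len(set(footmarks)) < len(footmarks)

print(is_deadlock([(0, 0), (1, 0), (2, 0)]))   # False
print(is_deadlock([(0, 0), (1, 0), (0, 0)]))   # True: an oscillating, deadlock state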

Thus, the value of the state s_i at the ith footmark is written as V(s_i), which is the same as formulation (4). It summarizes all of the possible action values at state s_i. f(s_i) = (x_i, y_i) is the particular coordinate state element of state s_i. Therefore, corresponding to the theory of Options over MDP, the Options-MDP (abbreviated as OMDP) model of our framework of the task method knowledge can be written as below. Here the task can be described by the option O(I, Py0, β). It consists of the initiating conditions I at the start point and the deadlock termination conditions β at the position of the task goal. CG means the current goal, TG means the task goal and Py0 means the MDP policy model of (1).

O(I, Py0, β)
Py0: S × A, CG = TG → [0, 1], method = θ
i = 0, 1, 2, 3, 4, ...; k = 2, 3, 4, 5, 6, ...; j = 0, 1.
---- Task OMDP model (TOMDP) (5)

We assume the state now is s_{i+k}. Because of the method condition of our OMDP model framework, there appear Dead-Corner problems. However, in order to try to utilize the task knowledge to solve the Dead Corners and to realize the given task, we find that it is still possible to use the OMDP model to finish the task. In our following paper, we will introduce our dead corner theory.

To solve these dead-corner problems, in other words to realize the dead corners' environment modeling, we propose a sub-goal theorem, which is mentioned as an essential issue by R. S. Sutton.

In order to construct a sub-goal OMDP model, we change the task OMDP model only a little. We assume that s_i is the state at the position of judging a deadlock state. When the robot is stopped by a dead corner at the point (x_i, y_i), it will set up sub-goals near its possible moving directions at a changeable distance len. Normally, a len of 80 times the length of a step is enough for passing most arbitrary dead corners and obstacles if the environment does not contain a particularly big obstacle. We give an example of a sub-goal in Fig. 4.2. The following (6) is our Sub-goal OMDP model.

O(β, Py1, I)
Py1: S × A, CG = SG → TG, method = θ
i = 0, 1, 2, 3, 4, ...; k = dynamic const ∈ [2, 3, 4, 5, 6, 7, ...]; j = 0, 1.
---- Sub-goal OMDP model (6)

We only switch the current goal between TG (the task goal) and SG, switch the positions of β and I, and limit k to a dynamically changeable constant rather than the arbitrary variable k of the task OMDP model (5). We do not change the essential part of the task OMDP model, which is the θ method.

An important note is why k is dynamically changeable. There are several reasons: one is that the environment is wholly unknown. One is that the robot has a non-zero size. One is that there exist dead corners of arbitrary size. One is that sub-goals may be set up inside the body of an obstacle, which means those sub-goals are unreachable. The final reason is that we want to enable the robot to learn an accurate dead-corner model.
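
As a rough illustration of the sub-goal placement described above (candidate sub-goals near the possible moving directions at the changeable distance len, nominally 80 step lengths), the following sketch uses a geometry that is our own simplification:

# Rough sketch of placing candidate sub-goals around the deadlock point along the
# possible moving directions at distance len = factor * step (a simplification).
import math
from typing import List, Tuple

DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]

def place_subgoals(deadlock_point: Tuple[float, float],
                   step_length: float,
                   factor: int = 80) -> List[Tuple[float, float]]:
    x, y = deadlock_point
    length = factor * step_length
    subgoals = []
    for dx, dy in DIRECTIONS:
        n = math.hypot(dx, dy)                       # normalize diagonal directions
        subgoals.append((x + dx / n * length, y + dy / n * length))
    return subgoals

print(place_subgoals((100.0, 200.0), step_length=1.0)[:3])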

We summarize a sub-goal theorem as below: In order to use the natural sub-goal method and try to use the task knowledge to solve the Environment Modeling problem, we set up a Sub-goal OMDP model (SOMDP) from the task OMDP model by changing a little of it. When a robot following I reaches a deadlock state, some sub-goals will be set up near the possible moving directions at a certain distance. We assume the state at this footmark of judging a deadlock state is s_i. The SOMDP model will begin to be set up according to the termination condition of the TOMDP's β.

That means the current goal will be changed to the sub-goal (SG) and the robot will then begin to move for k steps after s_i under the sub-goal policy Py1. When the k steps are over and the state value V^{Py1}(s_{i+k}) satisfies the termination condition I, the current goal will be switched back to the task goal (TG). That is, the option O of the SOMDP is finished. However, the option O will be set up again at the 2k steps after s_i if the state value V^{Py0}(s_{i+2k}) satisfies the initial condition β under the same task goal policy Py0. If, at the 2k-step footmark, β is incorrect, the SOMDP will be over; that means the robot has passed the dead corner and this dead corner's environment modeling has been finished. If β is correct, k will increase to k + 1. The same SOMDP is set up again and the sub-goal process will begin as the former sub-goal process did.
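
The switching logic of this theorem can be summarized, very loosely, by the following sketch; the predicate and motion routines are placeholders of ours, not the HiMBS program, and the real tests are on the state values V^{Py1} and V^{Py0} described above.

# Very loose sketch of the SOMDP switching logic of the sub-goal theorem.
def subgoal_process(move_steps, value_satisfies_I, value_satisfies_beta,
                    k: int = 2, k_max: int = 50) -> int:
    while k <= k_max:
        move_steps(k, goal="SG")            # k steps under the sub-goal policy Py1
        if value_satisfies_I():             # termination condition I met after k steps:
            move_steps(k, goal="TG")        # switch the current goal back to TG
        if not value_satisfies_beta():      # beta incorrect at the 2k-step footmark:
            return k                        # the dead corner is passed, modeling done
        k += 1                              # beta still correct: enlarge k and retry
    return k

# usage with trivial stand-ins (the dead corner is "passed" on the third enlargement)
tries = {"n": 0}
def fake_move(steps, goal): pass
def fake_I(): return True
def fake_beta():
    tries["n"] += 1
    return tries["n"] < 3
print(subgoal_process(fake_move, fake_I, fake_beta))   # -> 4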

Therefore, there occurs an interesting learning feature of TRIAL. The robot usually tries many times to make an accurate dead-corner model, which is shown by our experiment.

Thus, the first objective of our environment modeling through learning by the robot itself is realized. A model of a dead corner normally consists of three points; please see the example in Fig. 4. We treat the model points of a dead corner as genes of a path. Therefore, we activate the Mutation module of the Genetic Algorithm to set every gene of a path as the current goal for some time, in order, during the robot's approach to the task goal. With different orders of the genes treated as the current goal, different paths will be found. Additionally, our sub-goal method in path finding is also guaranteed to explore the environment thoroughly and find an optimal path. More details of these topics will be presented in our following papers.

3. Experiment results and an error function.

In order to show the correctness of most issues of our sub-goal learning theory, we present a dynamic experiment result here. In this experiment, the robot encounters a dynamic obstacle first, a theoretical dead corner second and a particular physical dead corner finally (see Fig. 6). Because the interacting environment is wholly unknown to the robot and our robot has to learn to get a correct model of its environment, the robot cannot get a correct model of this dynamic obstacle without trying. Fig. 5 is the TRIAL graphic of this dynamic experiment.

In Table 3.0, DO means the dynamic obstacle. The red coordinates (97, 440) mean that the 1st sub-goal, which is chosen first, is unreachable: it lies in the body of Obstacle 1. DC1 is the first Dead Corner. DL means the state of Dead Lock; under DL, S means the start step of the Dead Lock and E is the last step of this Dead Lock. TRIALs 1 (3) means that the first TRIAL takes 3 steps. SG1 means the 1st Sub-goal of this Dead Corner. SP means the position of the Dead Lock. MP means the last point of this Dead Corner's model.

Table 3.0. A dynamic experiment result (columns: DL (S, E), TRIALs, SG0 (X, Y), SG1 (X, Y), SP (X, Y), MP (X, Y); rows: the dynamic obstacle DO and the Dead Corners).

From this experiment, we can get a TRIAL function on Environment Modeling through learning as Formulation (5), which, in the sense of the modeling task, is an error function. Here, S means the steps of the robot from its initial position to the position where it encounters a Dead Corner, and t means time. A process is considered a TRIAL during which the current goal is a sub-goal and not the task goal.

A TRIAL consists of several steps.

S(t) = S + 2 * S_min * T(t) + P_0 * T(t) * [1 + T(t)]    ---- (5)

Here, S_min means the minimal moving steps during one TRIAL. T(t) means the number of trying times at time t, that is, the number of TRIALs. P_0 means the minimal step-size of a dead corner; it can be any integer number. In this dynamic experiment, we let S_min = 3 steps to speed up the process.

According to this error function, we can get the size of one side of a Dead Corner from the corresponding error function, which is

Length_DC = (S_min + P_0) * T(t)

We compare the steps in the first loop of exploring and exploiting the environment with the steps in the second loop of original path optimizing in Fig. 5.1 and Fig. 5.2. The results show the correctness of the learned dynamic obstacle's model and the rightness of our sub-goal OMDP theorem.
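
A small sketch of these two formulas as reconstructed above (the symbol names S_min, P_0 and T(t) follow that reconstruction and should be read with the same caution; the example values are arbitrary):

# Sketch of the TRIAL error function and the dead-corner side length.
# s0: steps from the initial position to where the Dead Corner is met;
# s_min: minimal moving steps during one TRIAL; p0: minimal step-size of a dead
# corner; trials: T(t), the number of TRIALs made so far.
def trial_steps(s0: int, s_min: int, p0: int, trials: int) -> int:
    return s0 + 2 * s_min * trials + p0 * trials * (1 + trials)

def dead_corner_side(s_min: int, p0: int, trials: int) -> int:
    return (s_min + p0) * trials

print(trial_steps(s0=88, s_min=3, p0=1, trials=5))     # total steps after 5 TRIALs
print(dead_corner_side(s_min=3, p0=1, trials=5))       # estimated side length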

4. Conclusions and our future work.

Our HiMBS algorithm is a novel learning approach to Environment Modeling in a wholly unknown environment. It enables a robot to learn the obstacles' model independently and truly autonomously. It thoroughly frees a mobile robot from the restriction of its environment.

We make an animation simulator simulate what an actual wholly unknown environment will be and successfully test our HiMBS algorithm. We try to pre-design all of the possible problems of an actual mobile robot experiment in its two-dimensional plane. For instance, the robot has a non-zero size: it has an arbitrary size and shape. An obstacle also has an arbitrary size and shape; furthermore, an obstacle is sometimes in a state of dynamic movement. Moreover, there is one of the most difficult problems: a difficult dead reckoning problem, etc. These problems have been successfully overcome by our HiMBS algorithm. Therefore, although there may still be some defects in our HiMBS, we believe the HiMBS algorithm is a powerful and successful algorithm for a mobile robot to find a semi-optimal path to a given goal in a wholly unknown environment. This animation test platform has also testified to our Zhinan Zen INS navigation theorem and evaluation theory. In a word, the industrial programming experience enables us to find the essential actual issue of the sub-goal method and the new Environment Modeling learning method of TRYING in mobile robots.

A three-dimensional simulation environment will make our algorithm's behaviors more interesting and complex. Our algorithm may fail more times in the first loop. However, it will quickly correct its errors and learn successful modes. Additionally, in order to make an authoritative autonomous learning robot, it is necessary for us to test our algorithm on other tasks; otherwise, we are not sure of making a general algorithm. How can our sub-goal method in path planning enable a robot to find other paths besides the first found path? Is our sub-goal method in path planning guaranteed to explore and exploit a wholly unknown environment thoroughly? What are the details of our path optimizing theory? Etc. These topics will be made clear in our following papers.

Finally, many essential concepts in robotics are left for us to continuously develop, such as the definition of intelligence and the least knowledge for a learning algorithm, the state concept in a computer, how to judge whether an algorithm is better or not, etc. This paper of ours is a beginning of the research approach of discussing the least knowledge for a learning algorithm and the state concept in a computer.

This invention of ours will be the start of a new artificial intelligence approach.

BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 The Markov tree structure of every path finding environment.

Fig. 2 The task description in an empty and lightless environment.

Fig. 3 The architecture of our HiMBS.

Fig. 4 Zhinan Zen Inertial Navigation System and Evaluation model (abbreviated as ZZINS_EF).
Fig. 5 The model of a dead corner.

Fig. 6 Learn to model a dynamic obstacle.
Fig. 7 A successful dynamic experiment result.
BEST MODE FOR CARRYING OUT THE INVENTION
The least hardware conditions of our robot are: one step-around-robot external sensor, or a better one, for detecting obstacles; motors and wheels or other instruments for moving; an inertial encoder sensor, with no need for a gyro sensor, accelerometer or tachometer; and a controller for running our algorithm. The essential software condition of the robot is our HiMBS executable file, etc.

INDUSTRIAL APPLICABILITY
If a one-step-around-robot external sensor, or a better one, for detecting obstacles is available, our algorithm can be utilized in industrial applications, for example a waiter robot in restaurants, a mine-searching machine, a deep-marine-object searching submarine, etc.

References
[1] R. S. Sutton. "Options over MDP". Artificial Intelligence, 1998.

[2] Marco Dorigo, Marco Colombetti. "Robot Shaping".
[3] L. P. Kaelbling, M. L. Littman and A. W. Moore. "Reinforcement Learning: A Survey." Journal of Artificial Intelligence Research, No. 4, pp. 237-285, 1996.

[4] K. Miyazaki, M. Yamamura and S. Kobayashi. "k-Certainty Exploration Method: An Action Selector on Reinforcement Learning to Identify the Environment." Journal of the Japanese Society for Artificial Intelligence, Vol. 10, No. 3, pp. 454-463, 1995.
[5] J. J. Grefenstette. "Credit Assignment in Rule Discovery Systems Based on Genetic Algorithms." Machine Learning, Vol. 3, pp. 225-245, 1988.
[6] C. Breazeal. "A Motivational System for Regulating Human-Robot Interaction." Proceedings of the National Conference on Artificial Intelligence, AAAI, Menlo Park, CA, USA, 1998, pp. 54-61.
[7] T. Poggio and D. Beymer (1996). "Learning to See." IEEE Spectrum.