Online Robot Learning by Reward and Punishment for a Mobile Robot

(1)

D. Suwimonteerabuth, P. Chongstitvatana

Chulalongkorn University, Thailand

(2)

Aim

A robot that is more flexible in learning a new task.

Method

A human trainer influences the robot's behavior.

The robot learns in real time from reward and punishment given by a human.

We study the factors that influence the learning process.

(3)

Mechanism

A finite-state machine (FSM) is used as the robot controller.

A genetic algorithm (GA) is used to evolve the FSM.

Method of experiment

A simulator is used to find appropriate parameters for the GA.

The physical robot is used online, in real time, to study the learning characteristics.

(4)

Tasks

A rectangular floor, 1.5 × 2.2 m, surrounded by walls.

The robot is able to detect walls and stay inside the designated floor.

The floor is painted in two colors: black and white.

The goal: the robot learns to stay on one color.

(5)

[State diagram: eight states, each labeled with its output: S0 [01], S1 [11], S2 [11], S3 [00], S4 [00], S5 [10], S6 [00], S7 [10]. Transitions are labeled with input 0 or 1.]

Figure 2: An example FSM, encoded as 01 11 11 00 00 10 00 10 101101 101101 110100 001010 110000 100110 011110 111110
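The encoding in Figure 2 can be unpacked with a short sketch. It assumes that the first eight 2-bit groups are the outputs of S0 to S7 (this matches the state labels in the figure) and that each of the eight 6-bit groups holds two 3-bit next-state indices, the first for input 0 and the second for input 1. The second assumption is not stated on the slide; it is consistent with the single edge labeled "0,1", since 101101 would send both inputs to S5. The names decode_fsm and run_fsm are illustrative, and the meaning of the input and output bits is not given in the source.

```python
# Encoding copied from the caption of Figure 2.
ENCODING = (
    "01 11 11 00 00 10 00 10 "
    "101101 101101 110100 001010 110000 100110 011110 111110"
)


def decode_fsm(encoding, n_states=8):
    """Split the bit string into per-state outputs and next-state pairs.

    Assumed layout (not confirmed by the source): n_states 2-bit outputs,
    then n_states 6-bit groups = (next state on input 0, next state on input 1).
    """
    groups = encoding.split()
    outputs = groups[:n_states]
    transitions = [
        (int(g[:3], 2), int(g[3:], 2))
        for g in groups[n_states:2 * n_states]
    ]
    return outputs, transitions


def run_fsm(outputs, transitions, inputs, state=0):
    """Step the machine over a sequence of input bits, starting in S0.

    Returns the outputs emitted in each visited state and the final state.
    """
    trace = []
    for bit in inputs:
        trace.append(outputs[state])
        state = transitions[state][bit]
    return trace, state


outputs, transitions = decode_fsm(ENCODING)
trace, final_state = run_fsm(outputs, transitions, [0, 1, 0])
```

Under these assumptions the whole controller is a 64-bit string, which is the kind of fixed-length genome a GA can recombine and mutate directly.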

(6)

Problems

Factors that are relevant to the quality of learning:

effect of reward/punishment

start position

maximum number of reward/punishment signals

size of population

the human trainer

Real-time experiment

each FSM runs for 30 seconds

the robot behavior is evaluated by a human trainer

experiments are performed a couple of times, 60 minutes each
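The time budget above fixes how many controllers a trainer can judge: 60 minutes at 30 seconds per FSM gives 120 evaluations, so a population of 10 gets 12 generations. The sketch below works this out and shows one possible GA step. Everything about the GA itself is an assumption, since the slides do not state the operators: fitness is taken as rewards minus punishments, the limit is read as "ignore signals beyond the first N in a run", and selection, crossover, and mutation are a plain binary tournament, one-point crossover, and bit-flip mutation. The function names are hypothetical.

```python
import random

FSM_RUN_SECONDS = 30         # from the slide: each FSM runs 30 seconds
EXPERIMENT_MINUTES = 60      # from the slide: 60 minutes per experiment
GENOME_BITS = 8 * 2 + 8 * 6  # 64 bits, matching the Figure 2 encoding


def evaluations_per_experiment():
    """Number of 30-second FSM runs that fit in one experiment."""
    return EXPERIMENT_MINUTES * 60 // FSM_RUN_SECONDS


def score(signals, limit=None):
    """Fitness of one run. signals holds +1 (reward) or -1 (punishment).

    Assumption: a limit means signals after the first `limit` are ignored.
    """
    if limit is not None:
        signals = signals[:limit]
    return sum(signals)


def next_generation(population, fitness, rng, mutation_rate=0.02):
    """One assumed GA step over bit-string genomes of length GENOME_BITS."""

    def pick():
        # Binary tournament: the fitter of two distinct random individuals.
        a, b = rng.sample(range(len(population)), 2)
        return population[a] if fitness[a] >= fitness[b] else population[b]

    children = []
    while len(children) < len(population):
        p1, p2 = pick(), pick()
        cut = rng.randrange(1, GENOME_BITS)  # one-point crossover
        child = p1[:cut] + p2[cut:]
        child = "".join(
            ("1" if bit == "0" else "0") if rng.random() < mutation_rate else bit
            for bit in child
        )
        children.append(child)
    return children


rng = random.Random(0)
population = ["".join(rng.choice("01") for _ in range(GENOME_BITS)) for _ in range(10)]
generations = evaluations_per_experiment() // len(population)  # 120 // 10 = 12
```

With so few generations available, each trainer signal carries a lot of weight, which is one reason the population size and the signal limit are worth studying as separate factors.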

(7)

Measurement

The time that it stays in the designated color.
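As a rough illustration of this measurement, the ratio plotted in the figures can be computed from a log of position samples. The sketch assumes evenly spaced samples of the form (time in seconds, on_white) and a 5-minute window, which is only a guess based on the tick spacing of the time axes; the source does not say how the ratio was recorded. The function name is hypothetical.

```python
def white_ratio_per_window(samples, window_minutes=5):
    """Fraction of samples on the white area in each time window.

    samples: iterable of (time_in_seconds, on_white) pairs, assumed evenly
    spaced so that the sample fraction approximates the time fraction.
    Returns a list of (window_end_in_minutes, ratio) pairs.
    """
    window = window_minutes * 60
    counts = {}
    for t, on_white in samples:
        idx = int(t // window)
        total, white = counts.get(idx, (0, 0))
        counts[idx] = (total + 1, white + (1 if on_white else 0))
    return [
        ((idx + 1) * window_minutes, white / total)
        for idx, (total, white) in sorted(counts.items())
    ]


# One sample per minute for 10 minutes: on white for minutes 0-2 and 5-9.
log = [(60 * m, m < 3 or m >= 5) for m in range(10)]
ratios = white_ratio_per_window(log)
```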

Results

[Plot: robot-on-white-area ratio against time (minutes).]

Figure 3: The learning behavior when giving no reward/punishment.

(8)

[Plot: robot-on-white-area ratio against time (minutes); series: 1st, 2nd, 3rd, Average.]

Figure 4: The learning behavior when the start position is in the white area.

[Plot: robot-on-white-area ratio against time (minutes); series: 1st, 2nd.]

(9)

[Plot against time (minutes); series: LIMIT = 3, LIMIT = 6, NO LIMIT.]

Figure 6: The learning behavior when limiting the number of reward/punishment signals.

[Plot against time (minutes); series: 5 individuals, 10 individuals, 20 individuals.]

Figure 7: The learning behavior when

(10)

[Plot: robot-on-white-area ratio against time (minutes); series: 1st, 2nd, 3rd, Average (Problem B).]

Figure 8: The learning behavior when changing the trainer.

(11)

Conclusions

An empirical study of learning behavior using reward and punishment.

The robot must learn the behavior that receives reward and avoid the behavior that receives punishment.

A genetic algorithm evolves the controller in the form of a finite-state machine.

The robot is flexible: its behavior can be shaped continuously and in real time.

It is still too tedious to train a robot.

(12)