
5. Reinforcement Learning-Based Adaptive PID Controller for DPS

5.3 Proposed Adaptive Fine-Tuning System for PID Gains in DPS

perpendiculars, and is the draft. The wave drift load acting on the ship is computed from the wave drift quadratic transfer function as Eq. (122).

$F_{wd}(t) = \mathrm{Re}\left[\sum_{j=1}^{N}\sum_{k=1}^{N} A_j A_k \,\mathrm{QTF}(\omega_j, \omega_k, \beta)\, \exp\!\left(-i\left[(\omega_j - \omega_k)t - (\varepsilon_j - \varepsilon_k)\right]\right)\right]$ (122)

where $N$ is the number of regular wave components describing the irregular sea state, $\mathrm{Re}$ denotes the real part of a complex number, $\mathrm{QTF}$ is the wave drift quadratic transfer function, which can be obtained from a frequency-domain diffraction analysis of the target ship, $\beta$ is the direction of the regular wave components relative to the ship's heading, and $T_j$, $A_j$, $\omega_j$, $\varepsilon_j$ are the wave period, amplitude, frequency, and phase lag of the $j$-th regular wave component. Eq. (122) shows that the QTF is applied to each pair of regular wave components to yield that pair's contribution to the wave drift load.
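For illustration, the following minimal Python sketch evaluates the pairwise QTF contributions of Eq. (122) for one time instant. The array names (`amp`, `omega`, `phase`, `qtf`) and the use of a pre-tabulated QTF matrix for a single relative wave direction are assumptions made only for this sketch, not the thesis implementation.

```python
import numpy as np

def wave_drift_load(t, amp, omega, phase, qtf):
    """Sketch of Eq. (122): sum the real part of the pairwise QTF
    contributions of N regular wave components at time t.

    amp, omega, phase : (N,) arrays of component amplitude, frequency, phase lag
    qtf               : (N, N) complex array, QTF(omega_j, omega_k) for the
                        current relative wave direction (from diffraction analysis)
    """
    # Complex amplitude of each regular wave component at time t
    a = amp * np.exp(-1j * (omega * t - phase))          # shape (N,)
    # Pairwise product a_j * conj(a_k) carries the difference-frequency terms
    pair = np.outer(a, np.conj(a))                       # shape (N, N)
    # Wave drift load is the real part of the QTF-weighted double sum
    return np.real(np.sum(qtf * pair))

# Minimal usage with random placeholder data (illustrative only)
rng = np.random.default_rng(0)
N = 50
amp = rng.uniform(0.1, 1.0, N)
omega = np.linspace(0.3, 1.8, N)
phase = rng.uniform(0.0, 2.0 * np.pi, N)
qtf = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
print(wave_drift_load(t=10.0, amp=amp, omega=omega, phase=phase, qtf=qtf))
```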

The state $s_t$ is defined as Eq. (123) (Lee and Seo, 2020b). For the normalization of $s_t$, the error terms in $s_t$ are normalized by Eq. (124), where $\mathcal{S}$ denotes all the states stored in the replay buffer and std stands for standard deviation, and the integrals of the error terms in $s_t$ are normalized by dividing them by 100 to make them small enough for stable neural network training. The integrals of the error terms are normalized differently because these terms usually increase and then gradually decrease over time, unlike the error terms, which fluctuate around their mean values with certain standard deviations since the DPS keeps the ship position around its reference position.

$s_t = \left\{\begin{array}{l} e^{(x)}_{t},\ e^{(x)}_{t-1},\ \cdots,\ e^{(x)}_{t-n},\\ e^{(y)}_{t},\ e^{(y)}_{t-1},\ \cdots,\ e^{(y)}_{t-n},\\ e^{(\psi)}_{t},\ e^{(\psi)}_{t-1},\ \cdots,\ e^{(\psi)}_{t-n},\\ \int e^{(x)}\,dt,\ \int e^{(y)}\,dt,\ \int e^{(\psi)}\,dt \end{array}\right\}$ (123)

$\bar{e} = \left(e - \mathrm{mean}(\mathcal{S})\right)/\,\mathrm{std}(\mathcal{S})$ (124)
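A minimal sketch of this normalization is given below, assuming the state is stored as a NumPy array whose leading entries are the error terms and whose trailing entries are the integrals of the errors; the function and variable names are illustrative, while the division by 100 follows the description above.

```python
import numpy as np

def normalize_state(state, buffer_states, n_error_terms):
    """Sketch of Eqs. (123)-(124): standardize the error terms with the
    replay-buffer statistics and scale down the integral-of-error terms.

    state          : (D,) current state vector s_t
    buffer_states  : (M, D) all states stored in the replay buffer
    n_error_terms  : number of leading entries that are error terms;
                     the remaining entries are integrals of the errors
    """
    s = state.copy()
    mean = buffer_states[:, :n_error_terms].mean(axis=0)
    std = buffer_states[:, :n_error_terms].std(axis=0) + 1e-8  # avoid division by zero
    # Eq. (124): error terms fluctuate around their mean, so standardize them
    s[:n_error_terms] = (s[:n_error_terms] - mean) / std
    # Integral-of-error terms grow and decay over time, so divide them by 100
    s[n_error_terms:] = s[n_error_terms:] / 100.0
    return s
```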

The actor outputs $K^{adp}_t$ and $u_t$ at every timestep. $K^{adp}_t$ consists of the adaptive PID gains for the surge, sway, and yaw directions. The details of $K^{adp}_t$ are shown in Eq. (125), where each $K^{(\cdot)}$ consists of the adaptive P, I, and D gains in the corresponding direction, as shown in Eq. (126). $K^{base}$ is a vector of the base gains determined by the ZN method, and $o_t$ is the output vector from the actor's output layer.

In this paper, the actor uses a sigmoid activation function in its output layer, shown in Eq. (127). Therefore, the ranges of each component in $o_t$ and $K^{adp}_t$ are $[0, 1]$ and $[0, K^{base}]$, respectively. With $K^{adp}_t$, the PID controller with the adaptive gains is expressed as Eq. (128), where the direction is denoted by the superscript in parentheses.

$K^{adp}_t = \left[K^{(x)},\ K^{(y)},\ K^{(\psi)}\right] = K^{base} \times o_t$ (125)

$K^{(\cdot)} = \left[K_P^{(\cdot)},\ K_I^{(\cdot)},\ K_D^{(\cdot)}\right]$ (126)

$\sigma(x) = 1/\left(1 + e^{-x}\right)$ (127)

$\tau^{(\cdot)} = K_P^{(\cdot)}\, e^{(\cdot)} + K_I^{(\cdot)}\, I_e^{(\cdot)} + K_D^{(\cdot)}\, \dot{e}^{(\cdot)}$ (128)
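The following sketch shows, for a single direction, how the sigmoid actor output scales the base gains and how the resulting PID law is evaluated; the function name, argument names, and the placeholder gain values are assumptions for illustration only.

```python
import numpy as np

def adaptive_pid_force(actor_output, base_gains, error, error_integral, error_rate):
    """Sketch of Eqs. (125)-(128) for one direction (surge, sway, or yaw).

    actor_output   : raw actor output (3 entries for P, I, D) before the sigmoid
    base_gains     : base P, I, D gains from the ZN tuning
    error, error_integral, error_rate : e, I_e, and de/dt in that direction
    """
    o = 1.0 / (1.0 + np.exp(-np.asarray(actor_output, dtype=float)))  # Eq. (127): sigmoid -> [0, 1]
    kp, ki, kd = np.asarray(base_gains, dtype=float) * o              # Eq. (125): gains in [0, K_base]
    # Eq. (128): PID control law with the adaptive gains
    return kp * error + ki * error_integral + kd * error_rate

# Illustrative call for the surge direction (all numbers are placeholders)
tau_x = adaptive_pid_force(actor_output=[0.2, -1.5, 0.7],
                           base_gains=[2.0e5, 1.0e3, 4.0e6],
                           error=1.2, error_integral=15.0, error_rate=-0.05)
print(tau_x)
```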

The update gate $u_t$ for the integral of the errors determines how much of the current error is added to the integral of the errors $I_e$, which appears in the integral terms of the PID controllers. The update equation of the integral of the errors by $u_t$ is shown in Eq. (129), where $u_t \in [0, 1]$. The concept of $u_t$ was inspired by gated recurrent units (GRU) (Chung et al., 2014).

$I_{e,t} = I_{e,t-1} + \left(u_t \times e_t\right)$ (129)

$u_t = \left[u_t^{(x)},\ u_t^{(y)},\ u_t^{(\psi)}\right]$ (130)
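A one-line sketch of the gated update in Eq. (129) is shown below; the function name is illustrative, and the gate can be a scalar or a three-element array for the surge, sway, and yaw directions.

```python
def update_error_integral(prev_integral, error, gate):
    """Sketch of Eq. (129): GRU-inspired gated update of the error integral.
    gate (u_t) is the actor's update-gate output in [0, 1]; gate = 1 accumulates
    the full current error, while gate = 0 freezes the integral."""
    return prev_integral + gate * error
```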

$u_t$ is designed to prevent a large rebound motion of the ship. When a sea state is severe, a ship experiences large drifting motions toward the drifting side, mostly due to aggressive wave drift loads. An illustration of the drifting side and an example of the wave drift load time history are shown in Fig. 86 and Fig. 87, respectively. While the ship is drifted far by the very large wave drift load, the integral of the error increases drastically as well. This drastically increased integral of the error over many drifting motions causes large rebound ship motions, as illustrated in Fig. 88.

Fig. 86 Illustration of the drifting side (direction of the environmental load: right to left)

Fig. 87 Example of the wave drift load time history (sea state: very rough, environmental direction to the ship: 40°)

Fig. 88 Illustration of the large rebound motion caused by the drastic increase of the integral of the error

As for the adaptive gains from the actor, the I gain is fixed when the ship is on the drifting side, while the P and D gains are adaptively selected by the actor. The I gain is fixed on the drifting side to ensure the convergence of the station-keeping. On the side opposite to the drifting side, the I gain is adaptively selected. The drifting side can easily be identified by the integral of the error, as sketched below.
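The sketch below illustrates one way the rule described above could be coded. The threshold test on the error integral and all names are assumptions made for this sketch; the thesis states only that the drifting side is identified from the integral of the error.

```python
def select_gains(adaptive_gains, fixed_i_gain, error_integral, drift_threshold):
    """Sketch of the drifting-side rule: keep the adaptively selected P and D
    gains, but fix the I gain while the ship is on the drifting side
    (identified here by a threshold on the error integral)."""
    kp, ki, kd = adaptive_gains
    if abs(error_integral) > drift_threshold:   # ship is on the drifting side
        ki = fixed_i_gain                       # fix the I gain to ensure convergence
    return kp, ki, kd
```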

Architectures of the actor and critic are shown in Fig. 89, where HL stands for a hidden layer, BN stands for batch normalization proposed by Ioffe and Szegedy (2015), and the numbers below the HLs refer to the sizes of the HLs. Unlike the original DDPG implementation by Lillicrap et al. (2016), which uses BN in both the actor and the critic, the proposed adaptive fine-tuning system uses BN in the critic only. This choice was made based on the station-keeping performance assessment in simulations. Given the study by Bhatt et al. (2019), it is unsurprising that applying BN does not always improve RL learning performance; the performance may increase or decrease depending on the application domain.

Fig. 89 Architectures of the actor and critic: (a) actor, (b) critic
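Since Fig. 89 is not reproduced here, the PyTorch sketch below only encodes the two points stated in the text: BN is applied in the critic but not in the actor, and the actor ends in a sigmoid so every action component lies in [0, 1]. The hidden-layer sizes, activations, and class names are placeholders, not the sizes shown in the figure.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Sketch of the actor: no batch normalization, sigmoid output layer
    (hidden sizes are placeholders)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Sigmoid(),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Sketch of the critic: batch normalization on the hidden layers,
    scalar Q-value output (hidden sizes are placeholders)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```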

The reward function needed to train the actor and critic is defined as Eq. (131). It should be noted that station-keeping in the yaw direction is designed implicitly, since good station-keeping of the surge and sway motions requires good station-keeping of the yaw motion. To further stabilize the RL learning, the position error distance in Eq. (131) is clipped to [0, 2] and the reward $r_t$ is normalized as Eq. (132), where $\mathcal{R}$ denotes all the rewards stored in the replay buffer. For the action noise, Gaussian action noise with a std of 0.1 is used instead of an Ornstein-Uhlenbeck process, since they result in similar learning performance (Plappert et al., 2018). Finally, the other hyper-parameters for the DDPG are set the same as in the original DDPG implementation (Lillicrap et al., 2016).

$r_t = -\sqrt{\left(e_t^{(x)}\right)^2 + \left(e_t^{(y)}\right)^2}$ (131)

$\bar{r}_t = \left(r_t - \mathrm{mean}(\mathcal{R})\right)/\,\mathrm{std}(\mathcal{R})$ (132)
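As a concrete illustration of the reward shaping and exploration noise described above, the sketch below assumes the reward is the negative horizontal distance between the ship and its reference position, with the distance clipped to [0, 2] before negation and the reward standardized with the replay-buffer statistics; the function names and the clipping of the noisy action back to the sigmoid range [0, 1] are assumptions for this sketch.

```python
import numpy as np

def shaped_reward(err_x, err_y, reward_buffer, clip_max=2.0):
    """Sketch of Eqs. (131)-(132): distance-based penalty, clipped and then
    standardized with the statistics of all rewards in the replay buffer."""
    dist = min(np.hypot(err_x, err_y), clip_max)   # clip the error distance to [0, clip_max]
    r = -dist                                      # Eq. (131)
    return (r - np.mean(reward_buffer)) / (np.std(reward_buffer) + 1e-8)  # Eq. (132)

def gaussian_action_noise(action, std=0.1, low=0.0, high=1.0):
    """Gaussian exploration noise with std 0.1, used instead of an
    Ornstein-Uhlenbeck process; clipping bounds match the sigmoid range."""
    noisy = np.asarray(action, dtype=float) + np.random.normal(0.0, std, size=np.shape(action))
    return np.clip(noisy, low, high)
```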
