In addition, I would like to thank my laboratory mates Ibrahim and Tim, as well as my mentors Dr.
Introduction
2022). The approach must 1) solve the task at hand for one set of parameters and 2) generalize the solution to a wide set of parameters. This ensures the convergence of the mixing network with respect to the set of individual policies learned to control the system under the variability of one parameter.
Related Work
Narita and Kroemer (2021) demonstrate the use of policy mixing for simple tasks such as opening a cap, flipping a breaker, and turning a dial. Given these approaches, it is worthwhile to combine robust policy mixing with state-of-the-art system identification as a new approach to generalized modeling.
Formalism
- Robust Markov Decision Process
- Domain Randomization
- System Identification Via Parameter Estimation
- Autotuned Search Parameter Model (SPM)
To effectively identify $\pi^*$ in simulation, a set of parameters must be defined to model the environment, and these parameters are then refined through an iterative search. After this iterative search, we can perform a level of system identification for our experiments that is not completely dependent on domain knowledge.
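As a rough illustration of parameter estimation by iterative search (a minimal random-search sketch, not the autotuned SPM itself), the following Python snippet scores candidate parameter vectors by how well the simulator reproduces an observed trajectory; simulate, observed_traj, and param_bounds are hypothetical placeholders.

import numpy as np

def identify_parameters(observed_traj, simulate, param_bounds, n_iters=1000, rng=None):
    """Search for simulation parameters xi that best reproduce an observed trajectory.

    observed_traj : (T, d) array of states recorded on the target system
    simulate      : placeholder callable, xi -> (T, d) array of simulated states
    param_bounds  : list of (low, high) bounds, one pair per parameter
    """
    rng = rng or np.random.default_rng()
    best_xi, best_err = None, np.inf
    for _ in range(n_iters):
        # Sample a candidate parameter vector uniformly within its bounds.
        xi = np.array([rng.uniform(lo, hi) for lo, hi in param_bounds])
        # Score it by the mean squared error between simulated and observed states.
        err = np.mean((simulate(xi) - observed_traj) ** 2)
        if err < best_err:
            best_xi, best_err = xi, err
    return best_xi, best_err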
Approach
Blending Network Learning
In order to create a decoupled policy that can solve individual tasks while maintaining robustness to a range of system parameters, we introduce a mixture network whose state space consists only of the N system parameters ξ. This policy then only needs to output the weights W of its sub-policies at each time step to generate the action for the environment. The sub-policies in this model are trained on a single set of constant system parameters, where a unique environment is identified through prior domain knowledge.
We then define the action of the mixing network as $a = \frac{1}{N}\sum_{i=0}^{N} w_i \, \pi_i(s_t)$, where $w_i \in \mathbb{R}$, $N$ is the number of sub-policies, and $\pi_i(s_t)$ represents the action taken by sub-policy $i$ given the state at time $t$.
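A minimal sketch of this blending step, assuming the trained sub-policies and the mixing network are exposed as plain callables (all names here are illustrative, not the thesis implementation):

import numpy as np

def blended_action(state, sys_params, subpolicies, mixing_net):
    """Compute a = (1/N) * sum_i w_i * pi_i(s_t) as defined above.

    subpolicies : list of N placeholder callables, each mapping a state to an action
    mixing_net  : placeholder callable mapping the system parameters xi to N scalar weights
    """
    weights = np.asarray(mixing_net(sys_params))           # w_i, one weight per sub-policy
    actions = np.stack([pi(state) for pi in subpolicies])  # pi_i(s_t), shape (N, action_dim)
    return (weights[:, None] * actions).mean(axis=0)       # (1/N) * sum_i w_i * pi_i(s_t)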
Concurrent System Identification
Experiments
- Assistive Gym
- Training of Sub-Policies
- Training of Blending Network
- Training of Domain Randomization
For our control task we follow the setup that performed best on the individual itch-scratching policies examined in Erickson et al. (2019), and we define the success value of a task as the amount of force applied to the target scratching position throughout the episode. The first impairment is involuntary motion, which is handled by adding normally distributed noise to the human's joint actions. The second impairment is weakness in the human's ability to move their arms, which is introduced by lowering the strength factor of the joints' PID controller.
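As a hedged sketch of how these two impairments could be applied in simulation (the attribute and function names below are placeholders, not the actual Assistive Gym API):

import numpy as np

def involuntary_motion(human_action, noise_std, rng):
    """Impairment 1: add normally distributed noise to the human's joint actions."""
    return human_action + rng.normal(0.0, noise_std, size=human_action.shape)

def apply_weakness(joint_controller, strength_factor):
    """Impairment 2: weaken the arm by lowering the joint PID controller's strength factor."""
    joint_controller.strength *= strength_factor  # placeholder attribute; factor < 1 reduces achievable torque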
This reward function considers a weighted combination of the distance from the robot arm to the target scratching position, a penalty for large actions, and the contact induced with the scratching target: $R(t) = w_{\text{distance}} \, \lVert pos_{\text{target}} - pos_{\text{robot}} \rVert + w_{\text{action}} \, \lVert a_t \rVert + w_{\text{contact}} \, C_t$, where $C_t$ is the contact induced with the scratching target. Similar to the sub-policies, we consider the same reward and use a 2-layer PPO model with 64 nodes per layer for training the mixture network. For the mixture network, however, we train for 400,000 time steps, and the state space consists only of the system parameters.
Unlike the sub-policies, our mixture network is trained on a human with all three impairments and as such must consider many more cases of how the robot should act. By training the policy on all three general impairments, we allow our mixture network to blend its sub-policies across impairment combinations that no single sub-policy was trained on.
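A minimal training sketch consistent with the stated configuration (a 2-layer, 64-node PPO policy trained for 400,000 time steps), assuming a Stable-Baselines3-style PPO implementation and a placeholder mixture_env whose observations are the system parameters and whose actions are the sub-policy weights:

from stable_baselines3 import PPO  # assumed RL library; the text only specifies a 2-layer PPO model

def train_mixture_network(mixture_env):
    """Train the mixing network with PPO on a placeholder Gym-style environment."""
    model = PPO(
        "MlpPolicy",
        mixture_env,
        policy_kwargs=dict(net_arch=[64, 64]),  # two hidden layers of 64 nodes each
        verbose=1,
    )
    model.learn(total_timesteps=400_000)  # training budget stated for the mixture network
    return model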
Discussion and Results
We still use the given system identification methods to estimate the state space of the mixing network. We consider experiments in which the human exhibits only a single impairment; the results are shown in panels b), c), and d) of Figure 2.4 and in Table 2.3, respectively. From the box plots, we can see that we outperform domain randomization over 100 individual episodes.
Furthermore, there is a difference between the system identification methods: the UKF and the system fed with the correct parameters outperform the autotuned search approach. First, policy mixing yields a significant improvement over general domain randomization in terms of both sampling efficiency and performance. Second, our design can successfully employ different types of system identification; however, these identification methods can significantly affect the overall performance of the policy, and the choice should be based on the maximum amount of domain knowledge available.
In addition, since these sub-policies are now decoupled from the main mixing network, they can be reused, and different approaches can be quickly tested and adapted, a flexibility that current domain randomization methods lack. Furthermore, it is not guaranteed that a linear combination of the sub-policy actions weighted by the mixing network is optimal, and other blending schemes may be worth exploring.
Introduction
Background on Fault Tolerant Control
The performance of model-based FTC methods depends on having accurate and comprehensive first-principles models. Data-driven FTC methods allow the development of complex control strategies, but they are sample-inefficient, requiring multiple experiments to achieve satisfactory performance. As such, deep reinforcement learning (DRL) approaches attempt to minimize this sample inefficiency and have become very popular for solving continuous control tasks in recent years (Dulac-Arnold et al.).
Related Work
Formalism
Reinforcement Learning
The transition function $T: S \times A \times S \to [0,1]$ defines the probability of reaching state $s'$ at time $t+1$ given that action $a \in A$ is chosen in state $s \in S$ at decision epoch $t$. The agent's goal is to find the policy that maximizes the expected sum of rewards. Fault-prone systems undergo changes that cause their dynamical model, represented by the transition function $T$, to change over time (Dulac-Arnold et al.).
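Written out (assuming the standard discounted-return formulation, which the text leaves implicit, with discount factor $\gamma \in [0, 1)$), this objective is $\pi^{*} = \arg\max_{\pi} \, \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right]$.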
Accordingly, we propose a fault-adaptive control scheme that avoids the use of DRL for direct control, but instead uses the DRL agent to update the PID controller when a fault degrades its performance.
Octorotor Design
UAV Model
Designing Realistic Faults
Proposed Framework
The core of the proposed approach is the combination of parameter estimation techniques with DRL. We suggest updating the PID controller when the value of the parameter(s) associated with a fault affects the control performance. We assume that the faults can be accurately estimated through widely studied estimation techniques such as the Unscented Kalman Filter and the Particle Filter (Daigle et al.).
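A minimal sketch of one step of such a loop, assuming the agent's output is interpreted as updated controller gains (the precise interface is not specified in this excerpt); the estimator, agent, pid, and plant objects are placeholders, not the thesis implementation:

def fault_adaptive_step(plant, estimator, agent, pid, reference):
    """One control step of the sketched fault-adaptive scheme."""
    measurement = plant.sense()
    # 1) Parameter estimation (e.g., a UKF or particle filter) tracks the fault-related
    #    parameter, here the motor resistance.
    resistance_hat = estimator.update(measurement)
    # 2) The DRL policy maps the estimated fault parameter to updated PID gains.
    kp, ki, kd = agent.predict(resistance_hat)
    pid.set_gains(kp, ki, kd)
    # 3) The retuned PID controller computes the low-level control command.
    command = pid.compute(reference, measurement)
    plant.apply(command)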
Experiments and Results
Experimental Design
The only information that the agent receives is the estimated motor resistance, and the reward function is defined by $R = (10 - \text{error})/10$, where the error is the Euclidean distance between the octocopter's position and the reference trajectory. We define a curriculum learning approach in which we first subject the agent to the lower bound of the fault magnitude until it converges, and then, for each episode, generate a fault of maximum magnitude with probability 0.5. Specifically, we initially train the DRL agent on a 3x fault (3 times the nominal value of the parameter), and once convergence is reached, we introduce a larger 8x fault, where at the beginning of each episode a choice is made between 3x and 8x with equal probability.
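A minimal sketch of this reward and curriculum, assuming the fault magnitude is a simple multiplier on the nominal parameter (function names are illustrative):

import numpy as np

def tracking_reward(position, reference):
    """R = (10 - error) / 10, with the error being the Euclidean distance to the reference trajectory."""
    error = np.linalg.norm(np.asarray(position) - np.asarray(reference))
    return (10.0 - error) / 10.0

def sample_fault_magnitude(converged_on_3x, rng):
    """Curriculum described above: train on the 3x fault first; once converged,
    choose 3x or 8x with equal probability at the start of each episode."""
    if not converged_on_3x:
        return 3.0
    return float(rng.choice([3.0, 8.0]))  # probability 0.5 each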
Furthermore, we evaluate the robustness of our controller by introducing variance into the estimated parameter, so that the value given to the controller is perturbed. For the dual-motor experiments, the controller receives two estimated motor resistances instead of the single value above. We then test our approach by comparing the trajectories when both motors have a 4x fault.
We then introduce variance into the fault estimator of both motors, so that the parameters fed to the controller are normally distributed with mean equal to the true value and a standard deviation of 0.5x the nominal resistance. In this exploration, it is worth noting that we are considering multiplicative faults, where changes in the motor dynamics can affect both the controller dynamics and the control allocation requirements.
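For concreteness, the perturbation described above amounts to the following sampling step (a sketch; the names are placeholders):

import numpy as np

def noisy_estimate(true_resistance, nominal_resistance, rng):
    """Perturb the estimated motor resistance: normally distributed around the true
    value with a standard deviation of 0.5x the nominal resistance."""
    return rng.normal(loc=true_resistance, scale=0.5 * nominal_resistance)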
Results for Single-Motor Faults
Results for Dual-Motor Faults
In the trajectory of a single episode, shown in Figure 3.9, we can see that both the nominal controller and the reinforcement learning scheme converge to the reference X position. However, the nominal controller performs poorly: it initially becomes unstable and then overcompensates for the error, while our approach compensates accurately and converges much faster. Furthermore, in the Y-direction plot in Figure 3.10, the reinforcement learning approach overcompensates slightly but still converges as fast as the nominal controller.
We can see that the approach is not perfect in the Y direction, but the improvement in the X direction is significantly larger than the marginal overshoot in the Y direction. Similar to Figure 3.7, we can also see in Figure 3.11 that the hybrid controller outperforms the single PD controller under large faults, where the PD controller starts to degrade in stability. In this case, however, we set a single motor to a 4x fault and vary the fault of the second motor.
Despite the variance of the second motor's fault, we see that the controller is robust to parameter estimation noise, as even the deviations in the dual-motor fault experiment yield smaller errors than the nominal controller. In addition, this shows that our approach is viable for faults of different magnitudes in each motor, as Figure 3.11 shows similar or better performance than the PD control scheme at almost every fault magnitude.
Limitations and Future Work
Furthermore, our approach has the limitation that it was tested only in simulation. Given this, there is a large amount of research still to be done on UAVs and fault-tolerant control. We first demonstrated a hierarchical reinforcement learning architecture that combines system identification and policy blending across different human-assistive environments.
We showed our approach on both single- and dual-motor faults, successfully navigating the octorotor and outperforming a PD control method. In particular, we considered applying a policy mixing approach to fault-tolerant control, in which we do not need the time delay required by system diagnostic checks. Furthermore, our hierarchical reinforcement learning architecture shows great promise for multi-impairment systems whose impairments can be combined in non-trivial ways.
Finally, we noted that the nature of the policy mix does not require linearity, and other approaches to combining the sub-policy weights may be worth exploring.
List of Figures
Example of our robot completing the itching task even when the human
Training reward averaged over 50 episodes for single impairment policies
Application of the trained policy to a real environment for 100 separate episodes
Training reward averaged over 50 episodes for our policy exploration
Cascade Control scheme for the octocopter
Octocopter motor fault configuration
Fault Adaptive Control framework
Reward Function for Training the DRL Agent on a Single Motor 3 Fault
Comparison of Y-trajectory Between PD Control and Hybrid Scheme
Robustness of Hybrid Scheme for Single Motor 3 Fault
Reward Function for Training the DRL Agent on a Dual Motor 5 and 6 Fault
Comparison of X-trajectory Between PD Control and Hybrid Scheme
Comparison of Y-trajectory Between PD Control and Hybrid Scheme
Robustness of Hybrid Scheme for Dual Motor 5 and 6 Faults