Classical visual servoing uses multiple cameras and computer vision techniques to coordinate robot movements. One important improvement of direct visual servoing is that it generalizes the steering error to objects of different shapes. Position-based visual servoing aims to minimize the difference between the current and the desired hand position.
I believe that this study can stimulate the improvement of visual servoing techniques and evaluate the capability of CNNs. Supervised learning, as the name suggests, requires supervision, in the form of labelled data, for the learning of the agent. The convolution in the name of these networks refers to the filtering operation in the network layers, which in practice is implemented as matrix multiplication [6].
Likewise, the pooling layer reduces the size of the feature map and consequently the consumption of computational resources. Padding has the pragmatic function of controlling how the filter is applied at the borders of the input and therefore the dimension of the output. Together with pooling, this limits the number of values the gradients must be propagated through and enables a higher learning speed for the network.
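For illustration, a minimal sketch of these building blocks, assuming a Keras-style API; the layer sizes here are placeholders and not the configuration used in this thesis:

```python
# Minimal sketch of convolution, padding, and pooling (illustrative sizes only).
from tensorflow.keras import layers, models

model = models.Sequential([
    # 'same' padding keeps the output spatial size equal to the input,
    # so the filter does not shrink the feature map at the borders.
    layers.Conv2D(32, kernel_size=3, padding="same", activation="relu",
                  input_shape=(224, 224, 3)),
    # Pooling halves the feature map, cutting the computation of later layers.
    layers.MaxPooling2D(pool_size=2),
])
model.summary()  # 224x224x32 after the convolution, 112x112x32 after pooling
```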
SGD can be improved by adding momentum, which carries part of the previous update vector over into the current step.
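A minimal sketch of this update rule, with hyperparameter values chosen only for illustration and variable names of my own:

```python
# SGD with momentum: a fraction of the previous step (velocity) is reused.
import numpy as np

def sgd_momentum_step(theta, velocity, grad, lr=0.01, mu=0.9):
    """One parameter update: mu scales the previous velocity vector,
    lr scales the current gradient."""
    velocity = mu * velocity - lr * grad   # accumulate part of the past step
    theta = theta + velocity               # move along the smoothed direction
    return theta, velocity
```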
Literature Review
Development of visual servoing
Computing the pose is more complex and requires the intrinsic parameters of the camera. Alternatively, some works emphasize an ego-motion technique based on feature training of the model. Notably, the authors used a UAV and computed the position of the quadcopter.
The input and output of the model operate on values with 6 degrees of freedom (DoF). The authors' work therefore used a control system that predicts velocity rather than the robot's position. The visual servoing task thus has to overcome the camera's difficulty in tracking and grasping objects.
Their model uses the image as input and the pose values of the robot arm as output. It takes the feature points in the image and learns the connection between those points and the pose in space. PointNet has a two-stage pipeline, where the first stage of the model processes the points contained in the image.
Combining the feature maps of the two networks yields the six degrees of freedom, as six float values, after a fully connected layer. With the camera embedded in the robot hand, some of the points on the object may not be visible to the system. The solution to this problem is to mix in knowledge about the location of the feature points in the environment, which was introduced to the model beforehand.
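A minimal sketch of this kind of fusion, assuming a Keras-style functional API; the two branches are small placeholders standing in for the networks described in the cited work, not their actual architectures:

```python
from tensorflow.keras import layers, Model, Input

# Placeholder branches: one over the hand-camera image, one over feature points.
img_in = Input(shape=(224, 224, 3), name="image")
pts_in = Input(shape=(64, 3), name="feature_points")

x = layers.Conv2D(32, 3, activation="relu")(img_in)
x = layers.GlobalAveragePooling2D()(x)

y = layers.Conv1D(32, 1, activation="relu")(pts_in)
y = layers.GlobalMaxPooling1D()(y)

# Concatenate the two feature maps and regress the 6-DoF pose
# (three translations, three rotations) with fully connected layers.
z = layers.Concatenate()([x, y])
z = layers.Dense(128, activation="relu")(z)
pose = layers.Dense(6, activation="linear", name="pose_6dof")(z)

model = Model(inputs=[img_in, pts_in], outputs=pose)
```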
The controller network is a combined CNN over the image input, the reference position, and the desired position. When the desired position is closer to the robot arm, a condition that skips the calculation of the position is activated. For this task, the researchers created a simulation of a robot with a built-in camera; two images were taken and passed to the input of the visual servoing system.
The researchers noted that the robot's images and pose were saved and automatically tagged with position values. The number of features was limited by reducing the dimension of the output feature map to 96 with a small kernel.
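A hedged sketch of how such a reduction might look in Keras; the exact kernel size is not stated above, so 3x3 is my assumption:

```python
from tensorflow.keras import layers

# Reduce the output feature map to 96 channels with a small (assumed 3x3) kernel,
# limiting the number of features passed on to the following layers.
reduce_features = layers.Conv2D(filters=96, kernel_size=3, padding="same",
                                activation="relu", name="feature_reduction")
# Usage: x = reduce_features(x)  # applied to an upstream feature map
```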
Methodology
Their receptive field is larger than that of my model, since it builds a "belief map" together with the feature map of the image taken from the robot hand. A third study [40] discussed a robot manipulator's ability to recognize and move clothes correctly; in the chapter on contrastive prediction, the authors note the assumption of a fully observable environment, which also held in my simulation. The authors propose a direct visual servoing mode in which learning is performed pixel by pixel and regression is applied to tuples of normalized image values.
Moreover, the size of the dataset affected the convergence speed, especially with the CNN+DVS configuration. The activation function, one of the crucial parts of the network, applies a mathematical function that passes significant values on to the final fully connected layers, which gather and combine the information from the preceding layers. The problem with this method is that it does not fit the concept of interaction matrix optimization because of the high non-linearity of the image data.
I developed the Euclidean distance routine (3), which loads the set of six float values taken from the data frame as ground-truth labels and compares them with the predictions of the last linear activation layer. The visual servoing problem is therefore one of optimizing the positioning of the robot, not of predicting the correct class of the object. The purpose of the simulation test was to evaluate the potential performance of the visual servoing predictor under laboratory conditions.
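A minimal sketch of such a routine, assuming a Keras/TensorFlow loss signature; the function name is mine and not taken from the thesis code:

```python
import tensorflow as tf

def euclidean_distance_loss(y_true, y_pred):
    """Euclidean distance between the 6-float ground-truth pose from the data
    frame and the output of the last linear layer, averaged over the batch."""
    return tf.reduce_mean(
        tf.sqrt(tf.reduce_sum(tf.square(y_true - y_pred), axis=-1)))

# Usage: model.compile(optimizer="adam", loss=euclidean_distance_loss)
```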
The actions of the robot arm ended close to the predicted point, with a negligible error of 1-2 cm. One of the essential parts of the thesis project was the generation of the dataset. I split the dataset into three parts: 80% of the dataset (5600 instances) for training, 1600 instances for validation, and a small partition kept for test evaluation.
The images were resized to the model input, reducing them from their original size to 224x224 with three RGB channels, to save memory and speed up training. The camera in the simulation scene was 20 cm from the viewpoint, and the points were drawn from a Gaussian distribution around the 6 joint positions, with a 1 cm shift for the three translations and 1 degree for the angles per iteration. In addition, the dataset was fed to the data generator module for a balanced distribution of the instances and was shuffled before training with a fixed random seed of 42.
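A sketch of the resizing and shuffling described above, assuming tf.data utilities; the actual data generator module is not reproduced here, and the batch size is a placeholder:

```python
import tensorflow as tf

SEED = 42  # fixed random seed used for shuffling before training

def preprocess(image, pose):
    # Resize to the 224x224 RGB input expected by the models to save memory
    # and speed up training; the 6-float pose label is left unchanged.
    image = tf.image.resize(image, (224, 224))
    return image, pose

def make_dataset(images, poses, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices((images, poses))
    ds = ds.shuffle(buffer_size=len(images), seed=SEED)
    ds = ds.map(preprocess).batch(batch_size)
    return ds
```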
Results
XceptionNet produced good convergence of the training loss but did not improve on the validation dataset. Depthwise separable convolutions are effective in image classification, but on this regression task with unpredictable input they did not produce significant loss reductions. Although all models showed high accuracy on the MAE and MSE metrics, I considered the root mean squared error the main indicator of model performance during the training phase, and it did not decrease for the low-performing models.
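For reference, RMSE can be tracked alongside MAE during compilation; this is a sketch with a placeholder model, not the exact training script:

```python
import tensorflow as tf

# Placeholder regression model with a 6-float output, standing in for the CNNs above.
inp = tf.keras.Input(shape=(224, 224, 3))
out = tf.keras.layers.Dense(6)(tf.keras.layers.GlobalAveragePooling2D()(inp))
model = tf.keras.Model(inp, out)

# RMSE is reported as the main performance indicator; MAE is tracked as well.
model.compile(
    optimizer="adam",
    loss="mse",
    metrics=[tf.keras.metrics.RootMeanSquaredError(name="rmse"),
             tf.keras.metrics.MeanAbsoluteError(name="mae")],
)
```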
Certainly, none of the models converged fully, as 100 epochs were not enough for a complete training process, but it was important to show intermediate, preliminary results after some training time. The training of the ResNet model stopped at an approximate loss of 45 with further fluctuations on the test dataset; its accuracy was not stable, only 3 positions out of 6 were predicted correctly, and the majority of labels in the dataset did not match the predicted values. The decrease over time also correlated with the losses in the first epochs, but later results did not change and the model did not converge on the validation dataset.
RMSprop accelerated the learning process of the model, and the loss stopped at a value of 55. The EfficientNet versions with the Adam optimizer also did not converge, and the high loss of 197 highlighted inaccurate predictions, resulting in model accuracy unacceptable for further experiments. The training process of the 50-layer ResNetV2 model showed convergence-rate issues. A second stage of 100 further epochs, evaluated on the test dataset with a held-out cut of the data unseen during training, showed the true performance of the models.
AlexNet's first attempt revealed that the test loss did not decrease and in fact grew, while the training dataset showed a loss of 60. The second attempt, applying DenseNet, indicated a potential improvement in test loss along with the training loss. The purple line shows a significant decrease in the training loss, while the validation loss shows a moderate decrease from 200 to 180.
Further improvement of AlexNet by increasing the number of training epochs significantly reduced the test loss. Figure 13 illustrates the improvement in moving the robot hand in the direction of the predicted image coordinates. It shows the test loss of the modified, pretrained AlexNet model with serial execution of the Adam and RMSprop optimizers.
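A hedged sketch of such serial optimization in Keras, reusing the model, loss function, and dataset names from the earlier sketches; the epoch counts are placeholders, not the thesis values:

```python
# Phase 1: train with Adam.
model.compile(optimizer="adam", loss=euclidean_distance_loss)
model.fit(train_ds, validation_data=val_ds, epochs=100)

# Phase 2: recompile with RMSprop; the learned weights are kept and only the
# optimizer state is reset, so training resumes from the Adam solution.
model.compile(optimizer="rmsprop", loss=euclidean_distance_loss)
model.fit(train_ds, validation_data=val_ds, epochs=100)
```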
FUTURE WORK
CONCLUSION
Bibliography/References