This thesis proposes a deep pose estimation network for a textureless model assembly task, in which six textureless parts of different shapes are assembled into a complete aircraft model. We use ROS as the development environment to integrate the proposed network with the control system of a 7-DoF manipulator, which performs the assembly with the target objects placed randomly in the workspace. The network first extracts feature maps from the input RGB image with a VGG network and then performs object detection and pose estimation through multi-task convolutional layers. Because the targets are textureless, we found that feature maps extracted by the original VGG network do not yield a satisfactory detection rate. We therefore modify the VGG network to improve feature extraction and, in turn, the detection rate for textureless objects. The network is trained with supervised multi-task learning, in which each task has its own loss function that updates the weights of the corresponding part of the network, so that the deep convolutional neural network learns to predict the projection of the target object's 3D bounding box onto the 2D image plane. Given this output, an existing PnP algorithm estimates the relative pose between the camera and the target object, allowing the robot to locate the object in 3D, grasp it accurately, and complete the model assembly.
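
To make the regression target concrete, the following minimal sketch computes the projection of a 3D bounding box onto the 2D image plane under an ideal pinhole camera with no lens distortion. The function names `box_corners` and `project`, the box dimensions, and the intrinsics (`fx`, `fy`, `cx`, `cy`) are illustrative assumptions and are not taken from the thesis. A PnP solver inverts this mapping: given the eight predicted 2D corners and the known 3D corners, it recovers the rotation and translation. OpenCV's `cv2.solvePnP` is one such solver; the thesis does not state which implementation it uses.

```python
from itertools import product


def box_corners(width, height, depth):
    """Return the 8 corners of an axis-aligned box centred on the object origin."""
    half = (width / 2.0, height / 2.0, depth / 2.0)
    return [tuple(s * h for s, h in zip(signs, half))
            for signs in product((-1.0, 1.0), repeat=3)]


def project(points, rotation, translation, fx, fy, cx, cy):
    """Project object-frame 3D points to pixels (pinhole model, no distortion).

    rotation is a 3x3 matrix given as three rows; translation is a 3-vector.
    Both describe the object pose in the camera frame (hypothetical values here).
    """
    pixels = []
    for p in points:
        # Transform into the camera frame: X_cam = R * X_obj + t
        xc, yc, zc = (sum(r * c for r, c in zip(row, p)) + t
                      for row, t in zip(rotation, translation))
        if zc <= 0:
            raise ValueError("point is behind the camera")
        # Perspective division followed by the intrinsic mapping
        pixels.append((fx * xc / zc + cx, fy * yc / zc + cy))
    return pixels


# Assumed example: a 0.2 m cube placed 1 m in front of the camera, unrotated.
identity = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
corners_3d = box_corners(0.2, 0.2, 0.2)
corners_2d = project(corners_3d, identity, (0.0, 0.0, 1.0),
                     fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

The eight `corners_2d` points play the role of the network output described above. Paired with `corners_3d` and the camera intrinsics, they are the 2D-3D correspondences a PnP algorithm needs to estimate the pose between camera and object.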