During the last week of October 2020, Sadako and IBEC reached an important milestone in the HR Recycler project as the first integration tests were organized between the two partners. The aim of these tests was to assess the joint functioning of the computer vision modules developed by Sadako and the human-robot interaction modules developed by IBEC, in a simulated setting of a worker dismantling an electronic waste object at a disassembly workbench. More specifically, the functionalities tested were how well the worker can interact with the workbench robot through a predetermined set of gestures, and how well the robot adapts its behavior as a function of the human-robot distance. These tests were the subject of a post by IBEC on this blog last month; we at Sadako thought we could add a little insight from the computer vision perspective:
Software integration:
During the tests, production-ready versions of the software were employed, using ROS as an interface both to acquire the images from the RealSense cameras and to output the inferred information later processed by IBEC. A high-level architecture diagram of the software can be seen in the image below:

Both the pose and action recognition software get images from the camera through ROS, handle the images and pass them to the detectors. The detections are then handed back to the pose and action recognition software and published to IBEC via a ROS topic.
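To make this wiring more concrete, here is a minimal sketch in Python of what such a ROS node can look like. The topic names, message type and detector call are illustrative placeholders (only the colour topic matches the default of the realsense2_camera driver), not the actual HR Recycler code:

```python
#!/usr/bin/env python
# Minimal sketch of the ROS wiring described above; topic names and the
# detector are placeholders, not the project's actual implementation.
import rospy
from sensor_msgs.msg import Image
from std_msgs.msg import String
from cv_bridge import CvBridge


def run_detector(frame):
    # Placeholder: the real system runs OpenPose / the action recognition
    # network here and returns structured detections.
    return {"n_pixels": int(frame.size)}


class DetectionNode(object):
    def __init__(self):
        self.bridge = CvBridge()
        # Publisher for the inferred information consumed downstream (placeholder topic).
        self.pub = rospy.Publisher("/sadako/detections", String, queue_size=10)
        # RealSense colour stream (default topic of the realsense2_camera driver).
        rospy.Subscriber("/camera/color/image_raw", Image, self.on_image, queue_size=1)

    def on_image(self, msg):
        # Convert the ROS image message into an OpenCV/numpy array.
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
        # Hand the frame to the detector and publish the result on the ROS topic.
        result = run_detector(frame)
        self.pub.publish(String(data=str(result)))


if __name__ == "__main__":
    rospy.init_node("detection_node")
    DetectionNode()
    rospy.spin()
```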
Action recognition software:
Detecting specific actions in a video in real time required a different type of neural network than the ones usually used at Sadako Technologies. Indeed, the standard neural networks employed in computer vision extract spatial features from images to detect a specific target object. To infer information from a video feed, temporal features need to be examined in addition to the spatial ones (i.e. how the features evolve through time). In 2018, Facebook Research published VMZ, a set of neural network architectures that have the particularity of performing spatial convolutions (examining one area of a frame) and temporal convolutions (examining how a single pixel location evolves over time) separately on the video data. This allows the detection of time-dependent features, such as the gestures in this particular case. These networks use state-of-the-art deep learning techniques (residual learning, a.k.a. ResNets), with changes and adjustments to the architecture that make them able to detect temporal features.
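As a rough illustration of this idea of factorized spatio-temporal convolutions, a (2+1)D-style block can be written as a spatial convolution followed by a temporal one. This is a PyTorch sketch of the concept, not the actual VMZ code, and the channel and kernel sizes are example values:

```python
import torch
import torch.nn as nn


class SpatioTemporalConv(nn.Module):
    """A (2+1)D convolution block in the spirit of the networks behind VMZ:
    a 2D spatial convolution over each frame followed by a 1D temporal
    convolution across frames (a sketch, not Facebook's implementation)."""

    def __init__(self, in_channels, out_channels, mid_channels=64):
        super().__init__()
        # Spatial convolution: a 1 x 3 x 3 kernel operates within a single frame.
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.relu = nn.ReLU(inplace=True)
        # Temporal convolution: a 3 x 1 x 1 kernel mixes each pixel location across time.
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):
        # x has shape (batch, channels, time, height, width).
        return self.temporal(self.relu(self.spatial(x)))


# Example: an 8-frame RGB clip at 112x112 resolution.
clip = torch.randn(1, 3, 8, 112, 112)
features = SpatioTemporalConv(3, 128)(clip)
print(features.shape)  # torch.Size([1, 128, 8, 112, 112])
```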

Pose detection software:
The other functionality tested was the ability of the robot to change its behavior (i.e. its movement speed) when the worker gets closer to the robot than a specific distance. To develop this functionality, the OpenPose software was used to retrieve the coordinates of the worker's skeleton joints in the RGB image. These joint coordinates were then matched with the depth image given by the camera to locate the worker in space relative to the camera. However, knowing where a worker is relative to the camera is not enough to measure the worker-robot distance. To output the worker-robot coordinates, a calibration procedure was developed by Sadako in which the camera's position is measured, using the RGB and depth information, with respect to a marker visible in the scene. The robot's position with respect to this marker is also measured; combining both measurements allows the code to return an accurate measurement of the human-robot distance in real time:

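As a rough sketch of the geometry involved, the distance computation boils down to back-projecting a joint pixel and its depth reading into 3D, then chaining the camera-to-marker and robot-to-marker transforms obtained from the calibration. The intrinsics, transforms and joint pixel below are made-up example values, not calibration results from the tests:

```python
import numpy as np

def deproject(u, v, depth, fx, fy, cx, cy):
    """Back-project an image pixel and its depth reading into a 3D point
    expressed in the camera frame (pinhole camera model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth, 1.0])  # homogeneous coordinates

# Calibration outputs (example values): 4x4 homogeneous transforms of the camera
# and of the robot base, both expressed with respect to the marker in the scene.
T_marker_camera = np.eye(4)
T_marker_camera[:3, 3] = [0.0, 0.0, 2.5]   # camera 2.5 m away from the marker
T_marker_robot = np.eye(4)
T_marker_robot[:3, 3] = [1.0, 0.0, 0.0]    # robot base 1 m to the side of the marker

# A skeleton joint returned by OpenPose (pixel coordinates) with its depth in metres,
# and example camera intrinsics.
u, v, depth = 320, 240, 1.8
fx = fy = 615.0
cx, cy = 320.0, 240.0

# Joint position in the camera frame, then in the marker frame, then relative to the robot.
p_camera = deproject(u, v, depth, fx, fy, cx, cy)
p_marker = T_marker_camera @ p_camera
p_robot = np.linalg.inv(T_marker_robot) @ p_marker
human_robot_distance = np.linalg.norm(p_robot[:3])
print("human-robot distance: %.2f m" % human_robot_distance)
```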
The integration tests with IBEC were successful despite being organized against the ticking clock of rising Covid-19 cases, which threatened to limit access to the facilities and the ability to bring together teams from different companies. Further tests are currently being organized between Sadako and IBEC to evaluate improved versions of both parties' software, taking into account the feedback gathered during the first integration test session.