Vision-Based Robotic Object Manipulation: Using a Human-Mimicking Hand Design with Pure Object Recognition Algorithms to Intelligently Grasp Complex Items

Journal for High Schoolers


Evan Huang, Joanne Hui, Michelle Lu, Ganesh Pimpale, Jennifer Song


Robotic dexterity and adaptivity are highly valued in industrial settings such as manufacturing plants and assembly lines because they reduce delays and the need for human involvement. These attributes are often modeled after the human hand, which is considered one of the most versatile mechanisms for object manipulation given its powerful grip and its ability to handle small objects with great precision. Although hardware with the potential to mimic the human hand does exist, there are few options for intelligent software that can autonomously handle objects in conjunction with this hardware. As a step towards producing this software, we created a pure object identification algorithm to discern the optimal means of holding a complex object. The algorithm proceeds by deconstructing complex objects into pure shapes with different parameters, which are then manipulated to determine the grasp that requires the least movement from the hand’s initial position and the least pressure applied to the hand and object. The program is also able to validate the grasp and, upon confirmation, run a test process involving the optimal grasp and pure object.



The human hand is known to be one of the most intricate and versatile mechanisms of the human body. Furthermore, most products are manufactured to be used by human-like hands. Giving such dexterity to robots allows them to interact with products designed for human use, which also makes this software useful for social and interactive robots. Its utility is especially valued in industrial scenarios, where the dexterity and adaptiveness of robots directly affect the amount of human involvement needed and the number of manufacturing delays. Hardware and software that enable robotic dexterity allow robots to fix misplaced or improperly assembled products more efficiently and with less human interaction.

Prior Work

In the field of robotic dexterity, there have been projects focusing on this topic, but the majority of the work uses a two-finger claw or suction cups. Current research provides information on the main challenges and offers algorithms that can be repurposed for this project. However, these algorithms were not created for the same hardware, making them difficult to transfer. Consequently, only research done in pure image and data processing carries over to this project.

The concept of vision-based grasping algorithms is not entirely new: such algorithms have been used in the past for claw-based mechanisms, but no intelligent equivalent has been developed for human-like hardware. Some vision-based systems use feature detection, which searches for particular places to grasp the object and often relies on object identification. Other vision-based grasping algorithms also use object identification, deriving the grasping method from the object's classification. Although this method works for common and recognizable objects, complex objects can be hard to classify, leading to non-optimal grasping methods.

Apart from vision-based grasping algorithms, there have also been data-driven approaches. These algorithms are blind to the type of object and instead look for certain features on it. This eliminates the need to create 3D models of the objects or to estimate their positions, allowing robots to handle very complex objects with ease. However, this method requires fairly high-quality training data covering a variety of environments, objects to grasp, and other physical aspects of the robot. Training data has also taken a “learning by example” approach: data has been produced by remote robot operation as well as by directly editing the training data to achieve faster learning and better performance. Although this improves performance, this approach to data collection is not scalable, and such high-quality training data is difficult to obtain.

Commonly Used Terms and Definitions

Listed here are some commonly used terms and their definitions:

  • Object: The physical object to pick up
  • Pure Object: A simple component of the object
  • Pure Object Parameters: Values that determine the size and shape of the pure object; these values change between objects.
  • Specificity: The precision and accuracy of the calculations; equivalently, the level of detail to which something is calculated
    • In the case of our project, this will be defined as the voxel size
  • Simple Object: Object that can be represented using one pure object
  • Complex Object: Object that can only be represented using multiple pure objects
    • Note: Whether an object is simple or complex depends on the specificity of the scenario

Simulation Fundamentals


Due to COVID-19 restrictions, as well as the cost of actual hardware, we decided to do all of our testing through simulations. To do this, we designed all of our hardware using SolidWorks, a GUI CAD (Computer-Aided Design) tool, and used PyBullet, a virtual physics simulator, to run our simulations.

Designing the Hand

The objective of the hand is to mimic the design and function of the human hand while matching the performance of current mechanical hands. Our initial design was based on the Open Bionics Brunel V2.0 hand, a 3D-printable hand with two joints per finger and a two-axis thumb. Although it had all the mechanical functionality our scenario requires, its detailed internal components made it very difficult to simulate with the computing power available to us. As a result, we created our own hand design, seen to the right. This hand is scaled to match the size of a human hand and includes joints that mimic the motion of human fingers as closely as possible. Each finger has three joints except for the thumb, which has two. The thumb is attached to the thumb mount at a joint that allows it to move in front of the palm, and the thumb mount is attached to the palm at a joint that lets the thumb move laterally, closer to and further from the rest of the fingers. The hand has more functionality than most commercially available mechanical hands, which is not ideal since we would like to match current hardware as closely as possible; however, it provides the level of functionality we need.

Designing the Arm

The arm design is based on BCN3D’s Moveo arm, which features six axes of rotation. These six axes provide more mobility and dexterity to the arm and hand than a simpler arm with a single joint. In the figure below, the pink square represents where the hand is mounted, and the green arrows represent the directions in which each joint can move. The original Moveo arm has a two-finger claw attached at the top; in our design, we mount our human-mimicking hand instead to create the most realistic, human-like robotic arm possible. The majority of the arm is 3D printable and the remaining parts are relatively low-cost hardware, making it a very realistic option for real-world use. A diagram of the arm with annotations can be seen below.

Designing the Setting

In the setting (as seen above), there are green and red sections and a metal stand. The metal stand holds cameras and lights. Cameras serve one of two purposes: data input, where images are given to the software for the robot to process, or monitoring, where they are used to judge the chosen grasp against other algorithms. The object the robot needs to pick up is placed in the green section so that its shape can be identified. The solid matte green background improves the initial meshing by removing textures that can interfere with the color-comparison stages of mesh generation. Once the robot has determined the optimal grasp, it picks the object up and places it in the red section.
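The role of the matte green background in the color-comparison stage can be illustrated with a small channel-dominance mask. This is a hedged sketch, not the project's actual meshing code; the `foreground_mask` helper and its `dominance` threshold are illustrative assumptions.

```python
import numpy as np

def foreground_mask(rgb, dominance=1.3):
    """Mark pixels where green does NOT dominate as foreground (the object).

    rgb: (H, W, 3) uint8 image array.
    dominance: how strongly green must exceed red and blue to count as
               background (an illustrative value, not from the project).
    """
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    background = (g > dominance * r) & (g > dominance * b)
    return ~background

# A tiny 1x2 image: one pure-green pixel, one gray pixel.
img = np.array([[[10, 200, 10], [100, 100, 100]]], dtype=np.uint8)
mask = foreground_mask(img)
# The green pixel is masked out as background; the gray pixel is kept.
```

In practice a real pipeline would also blur and clean up the mask, but the matte, textureless background is what makes even this simple rule reliable.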

Creating URDF Simulation Files

In order to run the simulation, all of the CAD files have to be exported as a URDF file using the SolidWorks to URDF exporter (SW2URDF). To do that, we had to configure the parts of the hand into a link tree, as seen on the right. The trunk of the link tree is the base part, and every other link extends from the base.
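As a rough illustration of the link-tree idea, the hierarchy can be modeled as a dictionary of parent-to-child links. The link names below are hypothetical (the real names come from the SW2URDF export), and the loading helper assumes pybullet is installed:

```python
# Hypothetical link tree: the palm is the trunk (base part); the thumb
# mount and finger bases branch off it, and joints chain outward.
HAND_LINK_TREE = {
    "palm": ["thumb_mount", "index_base", "middle_base", "ring_base", "pinky_base"],
    "thumb_mount": ["thumb_joint1"],
    "index_base": ["index_joint1"],
    "index_joint1": ["index_joint2"],
}

def count_links(tree, root="palm"):
    """Count every link reachable from the trunk of the link tree."""
    return 1 + sum(count_links(tree, child) for child in tree.get(root, []))

def load_hand(urdf_path):
    """Load the exported URDF into PyBullet (assumes pybullet is installed)."""
    import pybullet as p  # imported lazily so the sketch runs without it
    p.connect(p.DIRECT)   # headless physics server
    return p.loadURDF(urdf_path, useFixedBase=True)
```

The tree structure matters because SW2URDF derives each joint from a parent-child link pair; a link with no parent path back to the trunk would not export correctly.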

Creating the Designed Hardware

Most of this hardware can be made using either a CNC or a 3D printer. The setting, which is simple and doesn’t have many fine details, could be made using wood and a CNC. However, the arm, which has small and complex parts, would need to be made using a 3D printer so that not as much material is wasted. The hand could be 3D printed, but it would not be able to move, since the hand was designed solely for simulation purposes. The hand design does not include any motors, tendon threads, or any mechanisms to make it move, making it useless to print.

Collecting Data and Data Processing

Simulation Logic and Data Collection

The purpose of the simulation is to produce data on the best gripping motion for each shape, depending on its unique parameters such as radius, height, width, length, etc. Because an optimal grip is defined as the grip requiring the least amount of movement and the minimum amount of pressure on the hand, we collected the torque and position values for each joint on the hand. Thus, it was only necessary to import the hand and each shape into the simulation. An overview of the process is detailed below: 

To generate each shape with a range of sizes, we used OpenSCAD’s Python extension, SolidPython, along with the Python package itertools. We generated data for four different shapes: cones, cylinders, ellipsoids, and rectangular prisms. The program takes in a range of values, from which itertools finds all possible permutations of a length dependent on the shape. For example, rectangular prisms have three parameters (width, length, and height), so the program finds all possible permutations of length three. Because SolidPython has no built-in function to directly build ellipsoids, we scaled a sphere with a vector instead; this requires four parameters (radius, x, y, and z), creating many more shapes for ellipsoids than for the other three. After SolidPython generates the OpenSCAD code for each shape, the shapes are rendered as .scad files and exported via OpenSCAD’s command line to a single STL file that is continually updated with each shape. This STL file is referenced in the URDF file loaded into the PyBullet simulation, so the shape in the simulation updates with each step of the simulation.
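The parameter sweep can be sketched as follows. `itertools.product` stands in for the permutation step described above (it includes repeated values, so cubes are covered), and `render_prism` shows one hedged way the SolidPython call might look; the value ranges and helper names are illustrative.

```python
import itertools

def parameter_grid(values, n_params):
    """All parameter tuples of the given length drawn from `values`.

    Rectangular prisms need length 3 (width, length, height);
    ellipsoids need length 4 (radius plus x/y/z scale factors).
    """
    return list(itertools.product(values, repeat=n_params))

# Illustrative size range in millimeters:
prisms = parameter_grid([10, 20, 30], 3)      # 3**3 = 27 shapes
ellipsoids = parameter_grid([10, 20, 30], 4)  # 3**4 = 81 shapes

def render_prism(w, l, h, path="shape.scad"):
    """Render one prism to a .scad file (assumes solidpython is installed;
    the OpenSCAD CLI would then convert the .scad file to STL)."""
    from solid import cube, scad_render_to_file  # lazy import, hedged
    scad_render_to_file(cube([w, l, h]), path)
```

The four-parameter ellipsoid grid being 3x larger than the prism grid (81 vs. 27 here) is exactly the blow-up the paragraph above describes.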

After the shapes are generated, the hand and current shape iteration are imported into a PyBullet simulation on a plane. Then, the hand must move to grip the shape with a predefined gripping strategy. We currently have defined two gripping methods for each of the four shapes: a two-fingered grip, meant for small objects, and a full-palm grasp, meant for larger objects. For small cylinders, the thumb and index fingers grip the two bases; for large cylinders, the full palm grasps the middle of the cylinder body, wrapping the fingers around. For all rectangular prisms and ellipsoids, the hand will hold the narrower side, whether with two fingers or with the full palm. For cones, the hand will hold it at the base. To delineate the border between “small” and “large” objects, each object is tested with both grips, and the torque and position values for each joint are exported to a CSV file using Pandas. An artificial limit between the two grips is set by the user based on these torque and position values.
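The small/large grip decision and the CSV export might be sketched like this; the `choose_grip` helper, its threshold value, and the row layout are assumptions for illustration, not the project's exact code:

```python
def choose_grip(size, threshold):
    """Pick between the two defined grips using the user-set size limit.

    The threshold is the artificial limit the user chooses after
    inspecting the exported torque and position values.
    """
    return "two_finger" if size < threshold else "full_palm"

def export_joint_states(rows, path="joint_states.csv"):
    """Export per-joint torque/position rows to CSV (assumes pandas)."""
    import pandas as pd  # lazy import, hedged
    pd.DataFrame(rows).to_csv(path, index=False)

# Illustrative threshold: cylinders under 40 mm radius get two fingers.
grips = [choose_grip(r, threshold=40) for r in (10, 35, 60)]
```

Each simulated grasp would append one row per joint (joint index, torque, position, grip used) before `export_joint_states` writes the file.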

Unfortunately, the simulation was not finished within the time frame of the program; we finished generating the shapes, importing them into the simulation, and retrieving the torque and position values for each joint, but the code for each grip is unfinished. More gripping techniques may be necessary to cover all types of objects, especially very large ones. 

Data Processing and Produced Trendlines

After each grip has been set to be used for a certain range of sizes, trend lines are made for each grip to compare hand position to the size of each shape. An example is shown below: 

This data was retrieved from very preliminary test trials. The joint position values are retrieved by PyBullet’s built-in getJointStates() function. After the hand moves to grip the designated object, the position values are saved to a CSV file to create the graph. Each joint’s position is tracked, and each line in the graph corresponds to the movement of a certain joint as the cylinder radius increases. In general, as the radius increases, most joints have to rotate more, with the direction depending on the joint; however, most of the base joints generally do not move much. After more trials are done with each grip, these trend lines would be used to determine where to grasp a certain pure shape: once the shape to be gripped has been determined, its parameters would be plugged into the corresponding trend line to find where each joint should be positioned to properly grip the object. Because we were unable to finish the code for the gripping techniques, we do not currently have all of the trend lines completed.
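A minimal sketch of how such a trend line could be fit and then reused, assuming a simple polynomial model; the numbers are synthetic stand-ins for the getJointStates() output, not data from our trials:

```python
import numpy as np

def fit_joint_trendline(radii, positions, degree=1):
    """Fit a polynomial trend line of joint position vs. cylinder radius."""
    return np.polyfit(radii, positions, degree)

def predict_position(coeffs, radius):
    """Evaluate the trend line to get an initial joint position."""
    return float(np.polyval(coeffs, radius))

# Synthetic joint that rotates 0.02 rad per mm of radius, offset 0.1 rad.
radii = np.array([10.0, 20.0, 30.0, 40.0])
positions = 0.02 * radii + 0.1
coeffs = fit_joint_trendline(radii, positions)
# Plugging a new radius into the trend line gives the joint's target:
target = predict_position(coeffs, 25.0)  # approximately 0.6 rad
```

One such fit per joint, per grip, per shape class would reproduce the family of lines in the graph above.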

Pure Object and Parameter Recognition

System Overview

The complete Pure Object Recognition (POR) system’s purpose is to find pure objects and their size from multiple photos of the object. This algorithm mainly consists of SfM (Structure from Motion) concepts which will be briefly described below.

Input: Camera data

  1. Use the ODM implementation of OpenSfM to generate a mesh from images
  2. Classify voxel groups into:
    1. Rectangular Prisms
    2. Cylinders
    3. Cones
    4. Ellipsoids
  3. Collect the size data of each of the identified pure objects
  4. Select an object to grasp based on how similar each identified object is to the pure object
  5. Use the generated trendlines [explained above] to calculate the initial values
  6. Perform a grasp validity test to ensure there is a proper grip on the object

Output: Confirm grasp and perform placement

(All steps of the flowchart are described in much more detail below.)

Mesh and Voxel Generation

The first step in the POR system is to create a predicted 3D model of the object. This will allow the software to estimate the shape and size of the object and predict the shape of the object that is not visible to the camera.

Input: Multiple Images

  1. Look for recognizable objects in the provided images
  2. Is there a recognized object?
    1. If yes: 
      1. Run the collected image data and classification through a Neural Network trained on the ShapeNet dataset
      2. Generate a mesh from the given point cloud
      3. Voxelize the mesh and export it to a file 
    2. If no:
      1. Run the ODM software on the images
      2. Voxelize the given mesh and export it to a file

Output: File containing Voxelized mesh

The first step in this process is to determine whether the object is recognizable. This step can significantly improve performance: if the object is recognizable, it is much easier to build a predicted mesh using already-known data. If an object is recognizable, its classification and the image data can be run through a neural network trained on the ShapeNet dataset. The network generates a point cloud that can be meshed and then turned into voxels using a simple recursive algorithm. Examples of each can be seen below.
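A minimal voxelization sketch, assuming the mesh has already been reduced to surface points; simple NumPy binning stands in for the recursive algorithm mentioned above, and the voxel size corresponds to the "specificity" from the definitions section:

```python
import numpy as np

def voxelize(points, voxel_size):
    """Convert a point cloud of shape (N, 3) into a binary 3D voxel grid.

    Each point is binned into the cube of side `voxel_size` that
    contains it; the grid origin is the minimum corner of the cloud.
    """
    pts = np.asarray(points, dtype=float)
    idx = np.floor((pts - pts.min(axis=0)) / voxel_size).astype(int)
    grid = np.zeros(idx.max(axis=0) + 1, dtype=bool)
    grid[tuple(idx.T)] = True  # mark every occupied cell
    return grid

# Two points one unit apart along x, voxel size 1 -> a 2x1x1 grid.
grid = voxelize([[0, 0, 0], [1, 0, 0]], voxel_size=1.0)
```

Shrinking `voxel_size` raises the specificity (and the cost of every later stage), which is the trade-off the definitions section alludes to.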

(Figure: example point cloud, mesh, and voxel representations.)

If the object is not recognizable, the process becomes a bit more complex. To obtain results similar to those from the ShapeNet model, we instead use OpenSfM (Structure from Motion). OpenSfM takes multiple images of a setting and stitches them into a point cloud using an incremental reconstruction algorithm. This complex algorithm can be reduced to three main steps. First, find pairs of images that can create the initial reconstruction; images with a large overlap are usually the best.

After that, the algorithm bootstraps the reconstruction, testing each image pair until one works as the initial reconstruction. Once an initial reconstruction is found, more images are added one at a time to build out the point cloud. These processes are often used to create 3D interactive maps, where GPS data can aid the reconstruction. A diagram of this process can be seen below.

Image attribution: OpenMVG, licensed under the Mozilla Public License v2.

Instead of using our own implementation of OpenSfM, we decided to use OpenDroneMap (ODM), whose implementation of OpenSfM performed much better than the one we produced. In addition, ODM provides the option to run a Node server that can be accessed through a Python API. Instead of point clouds, ODM generates meshes from the given images, which we convert directly to voxels. Although ODM can produce decent results under the right conditions, textures or shadows heavily interfere with the meshing algorithm. Examples can be seen below:

(Figures: example initial images and their output meshes.)

Due to the meshing errors caused by texture issues, the setting for this project was designed with a matte green screen and with light and camera mounts, ensuring there are no shadows and no issues with the texture of the background. Although this works within the confines of this project, the approach is not scalable, and the underlying meshing problem will eventually have to be fixed.

Voxel Group Classification

Once the voxelized mesh has been obtained, voxel relation analysis can be performed to identify pure objects in the voxel mesh. Voxels are much easier to compute with than meshes because they are binary 3D arrays. This makes voxel relation analysis straightforward: each voxel is represented by an array indicating whether another voxel sits directly next to it along each dimension (X, Y, and Z).

A flowchart of the process can be seen below:

To work with voxels, the main tools used here were NumPy and PyVista, which allowed for the creation and visualization of the voxels. One of the main components of the algorithm shown above is estimating edges as straight lines or curves using differences in height: the differences form a sequence, and each sequence is analyzed to judge whether its graph is a straight line or a curve. Another important note is that if an object is over-classified, meaning it receives two different classifications, it defaults to a rectangular prism. This is because a rectangular prism most closely fits the widest range of shapes, and an object that is not a rectangular prism can usually still be held as one.
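The adjacency bookkeeping behind voxel relation analysis can be sketched with NumPy alone. This counts the filled 6-neighbors (along X, Y, and Z) of every voxel; it is an illustration of the relation arrays described above, not the project's classifier:

```python
import numpy as np

def neighbor_counts(grid):
    """For each filled voxel, count filled face-neighbors along X, Y, Z.

    Because the grid is a binary 3D array, adjacency reduces to six
    shifted comparisons of the array against itself.
    """
    g = grid.astype(int)
    padded = np.pad(g, 1)  # zero border so shifts never wrap into data
    counts = np.zeros_like(g)
    for axis in range(3):
        for shift in (-1, 1):
            counts += np.roll(padded, shift, axis=axis)[1:-1, 1:-1, 1:-1]
    return counts * g  # only report counts for filled voxels

# A 2x1x1 bar of voxels: each voxel touches exactly one filled neighbor.
bar = np.ones((2, 1, 1), dtype=bool)
counts = neighbor_counts(bar)
```

Downstream, patterns in these counts (flat faces, curved shells) are what let the classifier separate prisms and cylinders from cones and ellipsoids.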

Pure Object Parameter Recognition

Once the segmented voxel mesh has been created, finding the parameters of each voxel group is relatively easy, because the voxels are a set size and can be counted to find general lengths. With voxel counting, all the software has to know is what to count, which is also simple: there are only four classifications, so there are only four methods of parameter collection. A flowchart and explanation are below.

Input: Segmented Voxel Mesh / Stacked shape classification array

Input: Parameter calculations by pure object type 

  1. Create a list of all voxel segments and the corresponding classifications
  2. Run calculation method depending on voxel classification
  3. Save the parameters in an array
  4. Repeat step one until all the parameters of the segments have been calculated

Output: Array of object parameters

All the counting is done through NumPy and collects the following parameters:

  • Rectangular Prism: Length, Width, Height
  • Cylinder: Radius 1, Radius 2, Height
  • Ellipsoid: Radius 1, Radius 2, Radius 3
  • Cone: Radius 1, Radius 2, Height
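The counting approach can be sketched for the rectangular-prism case; the other three classes would count radii analogously. The helper below and its units are illustrative, not the project's exact method:

```python
import numpy as np

def prism_parameters(voxels, voxel_size):
    """Estimate length, width, and height of a rectangular-prism voxel
    group by counting occupied voxels along each axis.

    Because every voxel has a known set size, the extent in voxels
    times the voxel size gives an approximate physical dimension.
    """
    filled = np.argwhere(voxels)                       # (N, 3) indices
    extent = filled.max(axis=0) - filled.min(axis=0) + 1
    length, width, height = (extent * voxel_size).tolist()
    return length, width, height

# A 4x2x3 block of voxels with 5 mm voxels -> roughly 20 x 10 x 15 mm.
block = np.ones((4, 2, 3), dtype=bool)
params = prism_parameters(block, voxel_size=5.0)
```

The error of every estimate is bounded by the voxel size, which is why the grasp-validity test later re-checks the grip physically.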

Performing the Grasp

Checking Grasp Validity

At this stage, the software has decided to grasp a certain object and knows how to hold it. The grasp validity test is a sequence of short tests to make sure that there is a solid grasp on the object and that the method being used to hold the object is the most optimal for the given grip type. A flowchart and explanation are below.

Input: Initial grasping values

  1. Perform the initial grasp, decreasing all values so that the hand position is wider than the calculated value
  2. Increase finger positions to the IAP (initial applied pressure) value
  3. Constantly accelerate upwards and measure the acceleration changes while moving.
    1. If the acceleration difference is less than the ADT (acceleration difference tolerance):
      1. Grasp is validated
    2. Else:
      1. Increase forces applied by each of the fingers
      2. Even out the pressure between each of the fingers
      3. Start over from step 1

The initial values here (IAP, GTH, MTV, ADT) are all set by a human, as they are calibrated values; depending on these values, this process can be very short or very long. In addition, because of the voxelization and the number of transformations the data goes through in the algorithm, this step acts as the final barrier before the task is performed: it checks that the calculations are correct and accounts for the error range that voxelizing the meshes introduces. Once the grasp is validated, the robotic arm performs a hardcoded task to move from the green section to the red section and drop off the object.
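The validation loop above can be sketched as follows, with the simulator calls stubbed out. `measure_accel_diff`, `increase_force`, the round limit, and the slip numbers are stand-ins for illustration; only the ADT tolerance is taken from the text:

```python
def validate_grasp(measure_accel_diff, increase_force, adt, max_rounds=10):
    """Sketch of the grasp-validity loop: accelerate upward, compare the
    measured acceleration difference against the tolerance (ADT), and
    tighten the grip until the object tracks the hand."""
    for _ in range(max_rounds):
        if measure_accel_diff() < adt:
            return True          # grasp validated
        increase_force()         # tighten and even out finger pressure
    return False                 # give up after max_rounds attempts

# Toy stand-in: the object slips less each time the grip tightens.
slip = [0.5, 0.3, 0.05]          # acceleration differences per attempt
state = {"i": 0}

def measure():
    return slip[min(state["i"], len(slip) - 1)]

def tighten():
    state["i"] += 1

ok = validate_grasp(measure, tighten, adt=0.1)
```

In the real system, `measure` would come from the simulator's accelerometer readings and `tighten` would redistribute finger forces, but the control flow is the same.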


Current Results 

Throughout the past eight weeks, we have created a Computer-Aided Design (CAD) model of the hardware using SolidWorks, built the first prototype of the mesh generation algorithm, designed parts of the voxel classification, and collected test data on finger placement defined by pure object parameters. As explained above, a CAD model is used to improve the quality of the design without physical hardware. Our CAD model of the hand, based on the human hand, has three hinges per finger and is sized to match the average human hand. Additionally, we created a CAD model of a six-axis arm and the first prototype of the mesh generation algorithm, which helps the hand recognize the pure objects (e.g., cone, cylinder, rectangular prism, ellipsoid) that make up each chosen object. To do so, camera input from the robot, displaying an image of the object, was employed. Building on the mesh generation algorithm, we then created parts of the voxel classification algorithm, in which groups of voxels (3D pixels) are analyzed and segmented into different pure shapes, each with unique parameters. Lastly, using our object creation code, which generates shapes of different sizes across a set range, and our simulation code, which creates a simulation in PyBullet, we collected some data on finger placement to eventually determine trend lines for the optimal grasp based on the size of each shape.

Future Directions 

Due to the time constraints and physical limitations posed by COVID-19, we were not able to completely meet our objectives. Future steps for this project would be to obtain more training data with a wider variety of shapes and sizes, along with more gripping techniques for varying complex objects. We propose running more trials and collecting data to create precise, definitive trend lines that would determine how to optimally grasp certain pure objects. Currently, we use a complex mathematical algorithm to segment the voxel mesh, but an edge detection or deep learning approach could greatly expedite the process. As for the hardware, we would use a Computer Numerical Control (CNC) router to construct the setting and 3D print the majority of the parts for the arm; the hand, however, would either need to be redesigned or replaced by an existing hand model, as our current model is designed for simulation purposes only.

Regarding applications of the hardware, we expect to explore the possibility of applying our research and testing data to the creation of a more dexterous robotic prosthetic hand. Although the current state of our software is not collaborative, because it cannot work in conjunction with a human, pure object recognition could be added to prosthetic limbs to improve their functionality. Such a task would also require additional technology to predict what the user wants to hold and when to let go of a certain object.


We would like to thank everyone who supported our project. We would like to acknowledge Professor Tsachy Weissman of Stanford’s Electrical Engineering Department and head of the Stanford Compression Forum for his guidance throughout this project. In addition, we would like to acknowledge Cindy Nguyen, STEM to SHTEM Program Coordinator, for the constant check-ins and chats, and thank Suzanne Sims for all of the behind-the-scenes work. We would like to thank Shubham Chandak for being our mentor and advising us. Thank you to all of the alumni who presented in the past eight weeks or gave us input regarding our project. Lastly, thank you to past researchers; your work has helped and inspired our project.


