In 2020, from June to August, 64 high school students attended the STEM to SHTEM (Science, Humanities, Technology, Engineering and Mathematics) summer program hosted by Prof. Tsachy Weissman and the Stanford Compression Forum. During this summer program, the high schoolers pursued fun research projects in various domains under the supervision of 34 mentors. The entire collection of the high schoolers’ reports can be found below.
In 2019, from June to August, 40 high school students attended the STEM to SHTEM (Science, Humanities, Technology, Engineering and Mathematics) summer program hosted by Prof. Tsachy Weissman and the Stanford Compression Forum. During this summer program, the high schoolers pursued fun research projects in various domains under the supervision of 18 mentors. The entire collection of the high schoolers’ reports can be found below.
Renee Aziz, Stella Chen, Raken Ann Estacio, Kyle Heller, Anthony Sky Ng-Thow-Hing, Jonathan Sneh, Shirley Wang, Patrick Zhu
COVID-19 has challenged the status quo in every aspect of society; theatre and theatrical performances have been swept up in the uncertainty and face multiple challenges to continuing in their current form. As such, virtual adaptations have become a shaky new ground of experimentation as theatre companies, both amateur and professional, struggle to maintain performances and ensure theatre’s future. Our work through Stanford’s STEM to SHTEM internship allowed us to experiment with various forms of virtual experiences in order to produce a performance that could fulfill the requirements of theatre while confined to an entirely virtual space. The final production, entitled “YOU ARE HERE (AND HERE AND THERE)”, focused on themes of relativity, perspective, and morals, and built an entirely online performance consisting of multiple platforms and story “tracks” for different audience members to experience. In this paper, we will walk through our writing process and the technology and platforms used to build the performance, discuss our experience as the creators and crew, examine audience feedback, and discuss the tentative future of this form of performance, as well as the prospects this opens up for theatre as a whole.
The COVID-19 pandemic has ravaged the world, tearing down corporations, disrupting education, and forever altering society as a whole. Theatre is no exception from the chaos as many state and county health orders barred the gatherings required to produce and perform theatrical works. Traditional methods of theatre have been crippled, unable to maintain normal operation. These unprecedented times have forced the theatre community to experiment with technological implementations for live performance and adjust the parameters of theatre, venturing into risky, unexplored realms.
As thespians started to explore the viable possibilities of digital platforms such as video conferencing software, the shortcomings of virtual theatre were swiftly revealed. Instead of being fully immersed in a play on a stage, the experience of audience members is confined by a single rectangular screen. The lively ambiance emitted by characters has dissipated; instead, they have become shallow and one-dimensional. Much of what has characterized Western theatrical performances over the past millennia is evaporating before our eyes.
All of this leads to a new, pertinent question: Can technology resolve the jeopardized state of theatre? If not used wisely, technology itself could easily complicate problems in the theatrical world. Some attempts to modernize traditional Broadway musicals through the production of films, such as the adaptation of Cats, drew widespread criticism, as the “life” of the performance was suppressed by a one-dimensional forced perspective. In an environment where digital plays seem to be the only way to keep theatrical culture alive, theatre professionals are forced to reevaluate the core characteristics of theatre and how technology can be used to enhance, rather than eliminate, those values. Clearly, there are differing definitions of these characteristics within the theatre community.
Through Stanford’s STEM TO SHTEM summer internship, we created a theatrical performance called “YOU ARE HERE (AND HERE AND THERE)” consisting of multiple platforms and three different paths for audience members to explore. In this paper, we will introduce the technical aspects of the performance we created, discuss our observations as creators as well as the audience’s perspective, and talk about the prospective future of theatre through the deeper implications of our performance.
We began our production by exploring the platform most popularly used for virtual performances – Zoom video conferencing. Turning our cameras on and off served as a virtual alternative to entering and leaving the stage. The virtual background feature was used in place of backdrops and sets, with experimentation in exploiting the spotty background-filtering software to create illusions of floating objects or portals. We used the video filter software Snap Camera to alter our faces and apply masks for characters, mimicking the costume set-ups for traditional theatre while also giving us access to effects not available in conventional theatre without costly prosthetics.
Even with its numerous features, Zoom was still a lackluster performance platform. Video was choppy and prone to latency. The audience was in control of their own settings and could accidentally see our “backstage”. Instead of enhancing our performance, the technology felt like a poor translator: Zoom forcefully adapted a physical performance into a digital one. To combat the one-dimensionality of Zoom, we decided to take a multi-platform approach instead. Hours of research went into deciding the optimal platforms to use. As a team, we emphasized the notion that each platform chosen must bring an aspect into our digital performance not achievable through traditional means.
We started by drafting a story that could take advantage of the live interactive experience that characterizes theatre. The final draft of the script centered on themes of spacetime, celestial bodies, and perspective, using multiple different pathways to enable greater interactivity and cast-audience connection. The story sees the audience as a group of space explorers taking their final exam to receive a space exploration license. After a lecture about stars—led by an eccentric face-filtered alien professor—their exam is hacked by a mysterious “star merchant”, who leads the explorers to buy property on a star and transports them to one of three celestial objects: a black dwarf, neutron star, or black hole. We divided audience members into three separate “tracks”, one for each star type. To navigate the separate tracks, we developed our own dynamically loading website that progressed the audience members through a quiz, as well as serving as a controlling hub and gateway between acts of the performance. As the quiz concluded, audience members returned from their track with different overall experiences. Each track used different digital platforms in their performances.
On the “Black Dwarf” track, audience members received a transmission from a space-protection agent through the live streaming platform Twitch. We wanted to emulate the feeling of being in a spaceship and having the ship’s screens and controls hijacked. Twitch’s emphasis on the stream itself and limited interactivity best helped us achieve that effect. Initially, we used Zoom, but we noticed that it was hard to immerse and engage an audience through a monologue. Many people easily got “Zoom fatigued.” At the end of the Twitch stream, the audience was transported to a star, which turned into a Black Dwarf. The space-protection agent forced them to aid her in restoring the star, transporting them to a different planet to meet the locals and collect materials. This planet was entirely built out of a web of Google Docs, which acted as live chat-rooms with clickable links and images. Audience members were tasked with retrieving fuel, and saw characters interact through a text-format. At times, this experience almost ventured into the world of video games due to a high level of interactivity.
On the “Neutron Star” track, the scene started with an argument between a mother and her son. Audience members dialed in to a Zoom phone call to give the participants a sense that they were eavesdropping on the conversation, making the entire scene feel more intimate and connected. The track then progressed to a platform called High Fidelity, which emulated the surface of a neutron star. This platform allowed people to join two-dimensional rooms, where audience members moved around the map and talked to each other through spatial audio. The two-dimensional properties of High Fidelity represented the powerful gravity of a neutron star that would instantly flatten anyone within the celestial body’s vicinity. Throughout the scene, our actors interacted with audience members by asking them for their opinions mid-argument, encouraging them to vocalize their own opinions about the practicality of living on a neutron star.
The “Black hole” track opened in a traditional Zoom room with virtual backgrounds configured to give the appearance of being on a spaceship. The scene centered around two ship captains bickering about their morals. One character had more capitalistic values and supported harvesting stars while the other was an environmentalist and believed in preserving space. The Zoom meeting was designed to be interactive by having audience members perform specific actions, such as using objects around them as props. This forced them to stay engaged with the scene and provided them with agency. Through Zoom’s screen share feature, we played a video to simulate traveling through outer space. We synthesized prerecorded effects and live performance by having actors react to the different events occurring in the video. Following the Zoom scene, the audience was taken to a YouTube livestream where they were led through a virtual tour of a star. Here, the audience was engaged by answering questions in the live chat and taking a poll, causing them to ponder their own ideas regarding the tradeoff between capitalism and environmentalism.
The closing scene aimed to connect the theme of perspective to contemporary moments while bringing audience members back into reality. Through Google Earth, we displayed the houses of each audience member, emphasizing connectivity despite being all around the world, and creating a sense of intimacy and interaction. It also served to cement our theme of perspective by showing that, while they had wildly different experiences in the show, they all lived in the same reality.
Throughout the performance, we additionally utilized OpenAI’s GPT-2 text generation model to generate certain characters’ lines, as well as the poetry used in the closing Google Earth sequence. We used it mainly because of the experimental nature of the project, but also to add a new constraint: writing around the AI-generated segments pushed us to be more creative.
During performance runs, we also had a subset of the crew act as “tech support”, allowing audience members to refer to them whenever they got lost while switching between or within platforms. Our performance tech support crew had roles that could be seen as analogous to what a stage manager would do for a physical performance, giving cues and ensuring that things went smoothly.
The virtual space provided a unique safety net; messages could be sent to actors through Slack or Zoom chat to give cues or raise alerts, even while they performed. As the performances were taking place, we had various technical difficulties both from the audience and from the crew. For example, not every actor or member of our team was able to tell what was happening in other pathways. When we felt that audience members were not getting the experience we intended, we attempted to resolve the issue through Slack or private messaging in the Zoom chat. Throughout the performance, we communicated constantly and relied heavily on our improvisation skills, mitigating the impact of technical issues on the performance.
Sometimes audience members would take a path not assigned to them. Some users did not transition to the website or tried to join the wrong Zoom room, causing them to jump to a different track; this likely points to an issue with this form of performance, namely a skill-based barrier to entry for less technologically experienced audience members. One audience member returned to the performance to experience a different pathway but instead was placed on the same pathway three times. Internet connection was thankfully not a huge issue. Most performers did not experience their internet cutting out, but if it did, their roles were covered. Almost no audience members experienced connectivity issues, and for those who did, their internet problems were resolved quickly.
Even though our performances contained a few technical errors and bits of unplanned improvisation, we were delighted with our results. We surveyed audience members one week after our performances and many respondents expressed awe in the creativity allowed by the art form. The fact that, even a week removed from the performance, audience members felt a lasting sense of amazement demonstrates that a virtual space does not limit the emotional impact a performance can leave. Our belief that this form of performance is viable going forward was validated by our audience feedback.
Audience members had conflicting opinions pertaining to the connection they felt with other participants. Nearly three quarters of survey respondents felt a significant connection with others, especially when using Zoom (where everyone’s faces were visible) or in the YouTube livestream (where audience members could discuss their opinions in the comment section). On the other hand, there were times when the audience felt confused. For instance, some reported that the instructions were not clear in the beginning, while others felt overwhelmed due to quickly switching between platforms. Both issues could make the performance hard to navigate.
Over three quarters of audience members polled considered this performance to be “theatre.” Some expressed that this was more engaging than traditional theatre at times. One anonymous audience member wrote:
“For me, theatre is about real-time interaction among the actors, between the actors and the audience, and sometimes among the audience members, in ways that are adaptive and may influence the experience. This performance had all of those elements, and to a stronger extent than in traditional theatre.”
However, some audience members disagreed with this sentiment. They believed that it was a “digital performance” or an “experiment” rather than theatre. Regardless, most of the audience reported a shift in their perception of what a virtual performance entails. One anonymous audience member described:
“It became clear that one can become extremely engaged and immersed, no less and even more than with traditional theatre, via virtual theatre when it’s done right, and that virtual theatre has a lot of potential to develop substantially given how effectively it was carried out in this performance using existing technology.”
Despite all the challenges that we faced during the performance, we were able to broaden the scope of theatre while exploring the immense possibilities that technology had to offer.
Our performance, through the abilities of our actors and unique platforms, created a compelling story—one that left viewers to contemplate the unstable relativity of time, the ethics between good and bad, and unity during isolation. Our team whittled down the crucial facets of theatre into a few components/principles: theatre must captivate the audience and leave them with a newfound perspective. It must usher them into contemplation about fundamental ideas within our world and how we function. It is not the setting that makes the actor; it is the actor that carves out the space. An excellent actor should be able to intrigue viewers from any environment, in-person or virtual. Audiences can still be moved to tears or into fits of laughter even in the comfort of their own homes. The pure, unbridled feeling of seeing actors perform and elegantly present a story should be the focus of theatre.
Although we did not have the physical space that theatre traditionally requires, we incorporated unconventional environments for performance such as Twitch and High Fidelity, ultimately altering our vision of theatre and of what constitutes a theatrical performance. The three paths provided each group of audience members a different perspective, creating room for discussion. The closing Google Earth scene gave the audience a feeling of uncertainty and disorientation, mirroring the same emotions created by the COVID-19 pandemic. Our show’s interactivity added to the engagement aspect of the performance: audience members were constantly on their toes, needing to move from platform to platform.
Modern society is becoming increasingly technological, and this move towards technology has only been accelerated by the pandemic. To adapt to this almost completely virtual way of life, we must be willing to broaden the scope of what we define as theatre, and discover new ways to interact with audiences. We have been forced to alter our perception by reconsidering some elements of theatre that we had originally thought of as crucial to the performance (e.g., a physical space). The changes that COVID-19 has brought upon the theatre industry should not be seen as disadvantages; instead, they should be viewed as opportunities for experimentation that could potentially transform the artform. As Willem Dafoe proclaimed, “With theatre, you have to be ready for anything.”
Our unprecedented performance has opened up an avenue of theatrical production that must be explored. As artists navigate this new environment, we must experiment with the range of resources available to revolutionize the vision of theatre in the 21st century. The possibilities of experimenting with virtual platforms and different technologies are endless; it is up to us to uncover what their roles in theatre are.
 Donaldson, Kayleigh. “Why The Cats Movie Is So Bad.” ScreenRant, 15 Jan. 2020, screenrant.com/cats-movie-bad-reasons-cgi-songs/.
 You Are Here (And Here And There). By Byte-Size, STEM-TO-SHTEM. 4-25 Jul. 2020, Online. Performance.
Evan Huang, Joanne Hui, Michelle Lu, Ganesh Pimpale, Jennifer Song
Robotic dexterity and adaptivity are highly valued in industrial settings such as manufacturing plants and assembly lines because they reduce both delays and the need for human involvement. Consequently, robotic manipulators are often modeled after the human hand, which is considered to be one of the most versatile mechanisms for object manipulation given its powerful grip and its ability to manipulate small objects with great precision. Although hardware with the potential to mimic the human hand does exist, there are few options for intelligent software that can autonomously handle objects in conjunction with this hardware. As a step towards producing this software, we created a pure object identification algorithm to discern the optimal means of holding a complex object. The algorithm proceeds by deconstructing complex objects into pure shapes with different parameters, which are then used to determine the grasp that requires the least movement from the hand’s initial position and the least pressure applied to the hand and object. The program is also able to validate the grasp and, upon confirmation, undergo a test process involving the optimal grasp and pure object.
In addition to its complexity and utility, the human hand is known to be one of the most intricate mechanisms of the human body. Furthermore, most products are manufactured to be used by human-like hands. Giving such dexterity to robots allows them to interact with products designed for human use, granting functionality toward this software for social and interactive robots. Its utility is also highly valued in industrial scenarios where the dexterity and adaptiveness of robots directly impact the amount of human involvement needed and the number of manufacturing delays. Using hardware and software that allows for robotic dexterity enables robots to fix misplaced or improperly assembled products in a more efficient manner that requires less human interaction.
In the field of robotic dexterity, there have been projects focusing on this topic, but the majority of the work has been done using a two-finger claw or suction cups. The current research provides information on the main challenges and on algorithms that can be repurposed for this project. However, these algorithms were not created for the same hardware, making them difficult to transfer. Consequently, only research done in pure image and data processing can transfer to this project.
Vision-based grasping algorithms are not entirely new: they have been used in the past for claw-based mechanisms, but they have not been intelligently translated to human-like hardware. Some vision-based systems utilize feature detection, which searches for certain places to grasp the object, and often rely on object identification. Other vision-based grasping algorithms use object identification and determine the grasping method from the classification of the object. Although this method works for common and recognizable objects, complex objects can be hard to classify, leading to nonoptimal grasping methods.
Apart from vision-based grasping algorithms, there have also been data-driven approaches. These algorithms are blind to the type of object and instead look for certain features on the object. This eliminates the need to create 3D models of the objects and to estimate the position of the objects, allowing robots to work with ease regarding very complex objects. However, this method calls for fairly high-quality training data, which also requires a variety of environments, objects to grasp, and other physical aspects of the robot. Training data has also taken a “learning by example” approach as data has been produced by remote robot operation as well as directly editing the training data to achieve faster learning and better performance. Although there has been an increase in performance, this approach of data collection is not scalable and it is difficult to obtain such high-quality training data.
Commonly Used Terms and Definitions
Listed here are some commonly used terms and their definitions:
Object: The physical object to pick up
Pure Object: A simple component of the object
Pure Object Parameters: Values that determine the size and shape of the pure object; these values change between objects.
Specificity: The amount of precision and accuracy the calculations have OR The level of detail something is calculated to
In the case of our project, this will be defined as the voxel size
Simple Object: Object that can be represented using one pure object
Complex Object: Object that can only be represented using multiple pure objects
Note: Whether an object is simple or complex depends on the specificity of the scenario
Due to COVID-19 restrictions, as well as the cost of actual hardware, we decided to do all of our testing through simulations. To do this, we designed all of our hardware in SolidWorks, a GUI CAD (Computer-Aided Design) tool, and ran our simulations in PyBullet, a physics simulator.
Designing the Hand
The objective of the hand is to mimic the design and function of the human hand and match the performance of current mechanical hands. Our initial design was based on the Open Bionics Brunel V2.0 hand. This is a 3D printable hand that has two joints per finger and a two-axis thumb. Although the hand had all the mechanical functionality needed for our scenario, the level of detail of its internal components made it very difficult to simulate with the computing power available to us. As a result, we created our own hand design, as seen to the right. This hand is scaled to match the size of a human hand and includes joints that mimic the motion of human fingers as accurately as possible. Each finger has three joints except for the thumb, which has two. The thumb is attached to the thumb mount at a joint that allows it to move in front of the palm. The thumb mount is attached to the palm at a joint that allows the thumb to move laterally closer to and further from the rest of the fingers. The hand has more functionality than most commercially available mechanical hands, which is not ideal because we would like to match current hardware as closely as possible. However, it still provides the level of functionality that we need.
Designing the Arm
The arm design is based on BCN3D’s Moveo arm, which features six axes of rotation. Those six axes provide more mobility and dexterity to the arm and hand compared to a simpler arm with just one joint. In the figure below, the pink square represents where the hand would be mounted onto, and the green arrows represent the directions in which each joint is able to move. In the original model of the Moveo arm, there is a two-finger claw attached at the top. However, in our design, we mount our human-mimicking hand to create the most realistic and human-like robotic arm possible. The majority of the arm is 3D printable and the rest of the parts constitute relatively low-cost hardware, making it a very realistic option for real-world use. A diagram of the arm and some annotations can be seen below.
Designing the Setting
In the setting (as seen above), there is a green and red section and a metal stand. The metal stand is used to hold cameras and lights. Cameras can have one of two purposes: data input, where the data are given to the software for the robot to process, or monitoring, where they judge the grasp used compared to other algorithms. The object that the robot needs to pick up is placed in the green section so that its shape can be identified. The solid matte green background helps with performance in the initial meshing by removing the issue of textures that can interfere with the color comparison stages of the mesh generation. Once the robot has determined the optimal grasp, it picks the object up and places it into the red section.
Creating URDF Simulation Files
In order to run the simulation, all of the CAD files have to be exported as a URDF file using the SolidWorks to URDF exporter (SW2URDF). To do that, we had to configure the parts of the hand into a link tree, as seen on the right. The trunk of the link tree is the base part and then each other link extends from the base.
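To make the link-tree idea concrete, here is a minimal sketch in Python that turns a child-to-parent table into a URDF skeleton. The link names, the robot name, and the choice of revolute joints are assumptions made for illustration; the actual file was produced by the SW2URDF exporter, and a loadable URDF would also need geometry, joint axes, and joint limits.

```python
import xml.etree.ElementTree as ET

# Hypothetical link tree (child link -> parent link). "palm" is the base
# (trunk); the thumb has two joints and the one finger shown has three.
LINK_PARENTS = {
    "thumb_mount": "palm",
    "thumb": "thumb_mount",
    "index_proximal": "palm",
    "index_middle": "index_proximal",
    "index_distal": "index_middle",
}


def build_urdf(robot_name, base_link, link_parents):
    """Build a bare URDF element tree: one <link> per part, one <joint> per edge."""
    robot = ET.Element("robot", name=robot_name)
    ET.SubElement(robot, "link", name=base_link)
    for child, parent in link_parents.items():
        ET.SubElement(robot, "link", name=child)
        # Assumption: every joint is revolute; axis and limits are omitted.
        joint = ET.SubElement(
            robot, "joint", name=f"{parent}_to_{child}", type="revolute"
        )
        ET.SubElement(joint, "parent", link=parent)
        ET.SubElement(joint, "child", link=child)
    return robot


robot = build_urdf("sketch_hand", "palm", LINK_PARENTS)
urdf_text = ET.tostring(robot, encoding="unicode")
```

Because the structure is a tree, the number of joints is always one less than the number of links, which is a quick sanity check on any exported file.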
Creating the Designed Hardware
Most of this hardware can be made using either a CNC or a 3D printer. The setting, which is simple and doesn’t have many fine details, could be made using wood and a CNC. However, the arm, which has small and complex parts, would need to be made using a 3D printer so that not as much material is wasted. The hand could be 3D printed, but it would not be able to move, since the hand was designed solely for simulation purposes. The hand design does not include any motors, tendon threads, or any mechanisms to make it move, making it useless to print.
Collecting Data and Data Processing
Simulation Logic and Data Collection
The purpose of the simulation is to produce data on the best gripping motion for each shape, depending on its unique parameters such as radius, height, width, length, etc. Because an optimal grip is defined as the grip requiring the least amount of movement and the minimum amount of pressure on the hand, we collected the torque and position values for each joint on the hand. Thus, it was only necessary to import the hand and each shape into the simulation. An overview of the process is detailed below:
To generate each shape in a range of sizes, we used SolidPython, a Python interface to OpenSCAD, along with the Python module itertools. We generated data for four different shapes: cones, cylinders, ellipsoids, and rectangular prisms. The program takes in a range of values, from which itertools is used to find all possible permutations of a length that depends on the shape. For example, rectangular prisms have three parameters (width, length, and height), so the program finds all possible permutations of length three. Because there is no built-in function in SolidPython to directly build ellipsoids, we scaled a sphere with a vector instead. However, this requires four parameters (radius, x, y, and z), creating many more shapes for ellipsoids than for the other three shapes. After SolidPython generates the OpenSCAD code for each shape, the shapes are rendered as .scad files and exported using OpenSCAD’s command line to a single STL file that is continually overwritten with each shape. This STL file is referenced in the URDF file that is loaded into the PyBullet simulation. Because the same STL file is overwritten on every iteration, the shape in the simulation is updated with each step of the simulation.
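The parameter enumeration described above can be sketched as follows. This is an illustrative reconstruction rather than the project's code: the shape-to-parameter table follows the text, but the function names are hypothetical, and the OpenSCAD source is written as plain strings instead of going through SolidPython so that the sketch has no dependencies.

```python
from itertools import permutations

# Parameters per pure shape, following the description in the text.
PARAM_NAMES = {
    "cylinder": ("radius", "height"),
    "cone": ("radius", "height"),
    "prism": ("width", "length", "height"),
    "ellipsoid": ("radius", "x", "y", "z"),
}


def parameter_sets(shape, values):
    """Every ordered selection of distinct values, one value per parameter."""
    return list(permutations(values, len(PARAM_NAMES[shape])))


def scad_source(shape, params):
    """Render one shape as a line of OpenSCAD source (hand-written strings)."""
    if shape == "cylinder":
        r, h = params
        return f"cylinder(h={h}, r={r});"
    if shape == "cone":
        r, h = params
        # A cone is a cylinder whose top radius is zero.
        return f"cylinder(h={h}, r1={r}, r2=0);"
    if shape == "prism":
        w, l, h = params
        return f"cube([{w}, {l}, {h}]);"
    if shape == "ellipsoid":
        r, x, y, z = params
        # No ellipsoid primitive: scale a sphere by a vector instead.
        return f"scale([{x}, {y}, {z}]) sphere(r={r});"
    raise ValueError(f"unknown shape: {shape}")


values = [1, 2, 3, 4, 5]
counts = {shape: len(parameter_sets(shape, values)) for shape in PARAM_NAMES}
```

With five candidate values, the four-parameter ellipsoid yields 120 parameter sets against 60 for prisms and 20 for cylinders or cones, which is why ellipsoids dominate the generated data. Note that `permutations` never repeats a value within one set, so a shape with two equal dimensions only appears if the value list itself contains duplicates; `itertools.product` would be the alternative if such shapes are wanted.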
After the shapes are generated, the hand and current shape iteration are imported into a PyBullet simulation on a plane. Then, the hand must move to grip the shape with a predefined gripping strategy. We currently have defined two gripping methods for each of the four shapes: a two-fingered grip, meant for small objects, and a full-palm grasp, meant for larger objects. For small cylinders, the thumb and index fingers grip the two bases; for large cylinders, the full palm grasps the middle of the cylinder body, wrapping the fingers around. For all rectangular prisms and ellipsoids, the hand will hold the narrower side, whether with two fingers or with the full palm. For cones, the hand will hold it at the base. To delineate the border between “small” and “large” objects, each object is tested with both grips, and the torque and position values for each joint are exported to a CSV file using Pandas. An artificial limit between the two grips is set by the user based on these torque and position values.
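One way to turn the "least movement, least pressure" criterion and the user-set size limit into code is sketched below. The cost formula, its weighting, the column names, and the function names are all assumptions made for illustration; the standard-library `csv` module stands in for Pandas to keep the sketch dependency-free.

```python
import csv
import io

FIELDS = ["shape", "size", "grip", "joint", "position", "torque"]


def grip_cost(initial_positions, final_positions, torques, torque_weight=1.0):
    """Assumed cost: total joint movement plus weighted total absolute torque."""
    movement = sum(abs(f - i) for i, f in zip(initial_positions, final_positions))
    pressure = sum(abs(t) for t in torques)
    return movement + torque_weight * pressure


def choose_grip(size, limit):
    """Apply the user-defined limit between the two gripping methods."""
    return "two_finger" if size < limit else "full_palm"


def rows_to_csv(rows):
    """Serialize per-joint measurements to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical measurements for one cylinder tested with one grip.
rows = [
    {"shape": "cylinder", "size": 1.5, "grip": "two_finger",
     "joint": "index_proximal", "position": 0.5, "torque": 1.0},
    {"shape": "cylinder", "size": 1.5, "grip": "two_finger",
     "joint": "thumb", "position": -0.5, "torque": -2.0},
]
csv_text = rows_to_csv(rows)
cost = grip_cost([0.0, 0.0], [0.5, -0.5], [1.0, -2.0])
```

Comparing `grip_cost` for both grips across a range of sizes is one way a user could pick the limit that `choose_grip` then applies.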
Unfortunately, the simulation was not finished within the time frame of the program; we finished generating the shapes, importing them into the simulation, and retrieving the torque and position values for each joint, but the code for each grip is unfinished. More gripping techniques may be necessary to cover all types of objects, especially very large ones.
Data Processing and Produced Trendlines
After each grip has been set to be used for a certain range of sizes, trend lines are made for each grip to compare hand position to the size of each shape. An example is shown below:
This data was retrieved from very preliminary test trials. The joint position values are retrieved by PyBullet’s built-in getJointStates() function. After the hand moves to grip the designated object, the position values are saved and added to a CSV file to create the graph. Each joint’s position is kept track of, and each line in the graph corresponds to the movement of a certain joint as cylinder radius increases. In general, as the cylinder’s radius increases, most joints will have to rotate more, with the direction dependent on which joint it is. However, most of the base joints generally do not move much. After more trials are done with each grip, these trend lines would be used for determining where to grasp a certain pure shape; after the shape to be gripped has been determined, the parameters for the shape would be plugged into the corresponding trendline to see where each joint should position in order to properly grip the object. Because we were unable to finish the code for gripping techniques, we do not currently have all of the trend lines completed.
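As a rough illustration of how such trend lines could be built and used, the sketch below fits a least-squares line to one joint's position against cylinder radius and then evaluates it for a new radius. The linear form, the joint name, and the sample values are assumptions for illustration only; they are not measurements from the project.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_positions(trendlines, size):
    """Evaluate each joint's trend line at the given shape parameter."""
    return {
        joint: slope * size + intercept
        for joint, (slope, intercept) in trendlines.items()
    }


# Hypothetical samples: cylinder radius vs. recorded joint position.
radii = [1.0, 2.0, 3.0, 4.0]
index_proximal = [0.3, 0.5, 0.7, 0.9]

trendlines = {"index_proximal": fit_line(radii, index_proximal)}
targets = predict_positions(trendlines, 5.0)
```

In the full pipeline, the recorded positions would come from the CSV file produced by the simulation, with one fitted line per joint and per grip.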
Pure Object and Parameter Recognition
The purpose of the complete Pure Object Recognition (POR) system is to identify pure objects and their sizes from multiple photos of an object. The algorithm relies mainly on Structure from Motion (SfM) concepts, which are briefly described below.
Input: Camera data
Use the ODM implementation of OpenSfM to generate a mesh from images
Classify voxel groups into the four pure-object types (rectangular prism, cylinder, sphere, cone)
Collect the size data of each of the identified pure objects
Select an object to grasp based on how similar each identified object is to the pure object
Use the generated trendlines [explained above] to calculate the initial values
Perform a grasp validity test to ensure there is a proper grip on the object
Output: Confirm grasp and perform placement
(All steps of the flowchart are described in more detail below.)
Mesh and Voxel Generation
The first step in the POR system is to create a predicted 3D model of the object. This allows the software to estimate the shape and size of the object and to predict the parts of the object that are not visible to the camera.
Input: Multiple Images
Look for recognizable objects in the provided images
Is there a recognized object?
If yes:
Run the collected image data and classification through a neural network trained on the ShapeNet dataset
Generate a mesh from the given point cloud
Voxelize the mesh and export it to a file
If no:
Run the ODM software on the images
Voxelize the given mesh and export it to a file
Output: File containing Voxelized mesh
The first step in this process is to determine whether the object is recognizable. This step can significantly improve performance: if the object is recognizable, it is much easier to build a predicted mesh from already known data. In that case, the classification of the object and the image data can be run through a neural network trained on the ShapeNet dataset. The network generates a point cloud that can be meshed and then turned into voxels using a simple recursive algorithm. Examples of each can be seen below.
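To illustrate what a voxelized result looks like as data, the sketch below bins a point cloud directly into a binary occupancy grid. This is a deliberately simplified stand-in: it skips the meshing step and is not the recursive algorithm mentioned above. The function name voxelize_points and the sample points are hypothetical.

```python
import numpy as np


def voxelize_points(points, voxel_size):
    """Bin an (N, 3) array of points into a binary occupancy grid."""
    points = np.asarray(points, dtype=float)
    origin = points.min(axis=0)
    # Integer voxel index of every point, measured from the minimum corner.
    indices = np.floor((points - origin) / voxel_size).astype(int)
    shape = tuple(int(n) for n in indices.max(axis=0) + 1)
    grid = np.zeros(shape, dtype=bool)
    grid[indices[:, 0], indices[:, 1], indices[:, 2]] = True
    return grid


# Three made-up points; the first two fall into the same voxel.
cloud = np.array([[0.0, 0.0, 0.0], [0.9, 0.0, 0.0], [2.5, 1.2, 0.1]])
grid = voxelize_points(cloud, voxel_size=1.0)
print(grid.shape, int(grid.sum()))
```

A smaller voxel_size preserves more detail at the cost of a larger array.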
If the object is not recognizable, the process becomes a bit more complex. To obtain results similar to those from the ShapeNet model, we use OpenSfM, an open-source Structure from Motion (SfM) library. OpenSfM takes multiple images of a scene and stitches them into a point cloud using an incremental reconstruction algorithm. This complex algorithm can be reduced to three main steps. First, find pairs of images that can serve as the initial reconstruction; images with a large overlap are usually the best candidates.
Second, the algorithm bootstraps the reconstruction, testing each image pair until one works as the initial reconstruction. Third, once an initial reconstruction is found, more images are added one at a time to build up the point cloud. These processes are often used to create interactive 3D maps, so GPS data can help in the reconstruction of the images. A diagram of this process can be seen below.
Image Attribution: OpenMVG, licensed under the Mozilla Public License 2.0
Instead of using our own implementation of OpenSfM, we decided to use OpenDroneMap (ODM), as it has its own implementation of OpenSfM that performed much better than the one we produced. In addition, ODM provides the option to create a Node server that can then be referenced by a Python API. Instead of providing point clouds, ODM generates meshes from the given images, which we convert directly to voxels. Although ODM can produce decent results under the right conditions, textures and shadows interfere heavily with the meshing algorithm. Examples can be seen below:
Because of the meshing errors caused by textures, the setting for this project has been designed with a matte green screen and with light and camera mounts, so that there are no shadows and no texture issues in the background. Although this works within the confines of this project, the approach is not scalable and the underlying problem will have to be fixed.
Voxel Group Classification
Once the voxelized mesh has been obtained, voxel relation analysis can be performed to identify pure objects in it. Voxels are much easier to process than meshes because they are binary 3D arrays. This makes voxel relation analysis straightforward: each voxel is represented by an array whose values indicate whether another voxel lies directly next to it along each dimension (X, Y, and Z).
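One possible way to build such a per-voxel neighbor array with NumPy is sketched below. The function name neighbor_flags, the six-flag layout, and the ordering of the flags are assumptions made for illustration, not a description of our actual data structure.

```python
import numpy as np


def neighbor_flags(grid):
    """Return an (X, Y, Z, 6) boolean array of face-neighbor occupancy.

    The last axis is ordered (-X, +X, -Y, +Y, -Z, +Z). Flags of empty
    voxels are all False.
    """
    grid = np.asarray(grid, dtype=bool)
    # Pad with empty voxels so that rolling never wraps occupied data around.
    padded = np.pad(grid, 1, constant_values=False)
    core = (slice(1, -1),) * 3
    flags = np.zeros(grid.shape + (6,), dtype=bool)
    k = 0
    for axis in range(3):
        for step in (-1, 1):
            # Rolling by -step moves the neighbor at index i + step to index i.
            shifted = np.roll(padded, -step, axis=axis)[core]
            flags[..., k] = grid & shifted
            k += 1
    return flags


# A row of three occupied voxels along X.
row = np.ones((3, 1, 1), dtype=bool)
flags = neighbor_flags(row)
print(flags[0, 0, 0], flags[1, 0, 0])
```

In this layout, a voxel whose six flags are all True lies in the interior of a solid, while a voxel with at least one False flag lies on its surface.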
A flowchart of the process can be seen below:
To work with voxels, the main tools used were NumPy and PyVista, which allowed for the creation and visualization of the voxels. One of the main components of the algorithm, as shown above, is to estimate whether edges are straight lines or curves from differences in height. The differences are collected into a sequence, and the type of sequence determines whether the edge is judged to be a straight line or a curve. Another important note is that if an object is over-classified, meaning it receives two different classifications, it defaults to a rectangular prism. This is because rectangular prisms give the closest fit to the widest range of shapes, and an object that is not a rectangular prism can usually still be held as one.
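The idea of judging an edge by its sequence of height differences can be sketched with a simple heuristic: a straight edge, whether flat or sloped, has roughly constant differences, while a curved edge has differences that drift. The function name classify_edge and the tolerance of one voxel are assumptions for illustration; our actual sequence test may differ.

```python
def classify_edge(heights, tolerance=1):
    """Label a height profile along an edge as "straight" or "curve".

    heights is a sequence of voxel heights sampled along the edge. The
    tolerance (in voxels) absorbs the staircase effect of voxelization.
    """
    if len(heights) < 3:
        return "straight"  # too few samples to detect curvature
    diffs = [b - a for a, b in zip(heights, heights[1:])]
    spread = max(diffs) - min(diffs)
    return "straight" if spread <= tolerance else "curve"


print(classify_edge([0, 1, 1, 2, 2, 3]))           # gently sloped staircase
print(classify_edge([0, 3, 4, 5, 5, 5, 4, 3, 0]))  # arch-like profile
```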
Pure Object Parameter Recognition
Once the segmented voxel mesh has been created, finding the parameters of each voxel group is relatively easy, because the voxels have a fixed size and can be counted to find general lengths. With voxel counting, all the software has to know is what to count, which is also simple: there are only four classifications, so there are only four methods of parameter collection. A flowchart and explanation are below.
Create a list of all voxel segments and the corresponding classifications
Run calculation method depending on voxel classification
Save the parameters in an array
Repeat step one until all the parameters of the segments have been calculated
Output: Array of object parameters
All the counting is done through NumPy and collects the following parameters:
Rectangular Prism: Length, Width, Height
Cylinder: Radius 1, Radius 2, Height
Sphere: Radius 1, Radius 2, Radius 3
Cone: Radius 1, Radius 2, Height
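The parameter lists above can be approximated by counting voxels along each axis of a segment's bounding box. The sketch below assumes that each segment is axis-aligned and that Z is the height axis for cylinders and cones; the function names, label strings, and dictionary keys are hypothetical.

```python
import numpy as np


def extents(segment, voxel_size):
    """Bounding-box size of the occupied voxels along X, Y and Z."""
    occupied = np.argwhere(segment)
    counts = occupied.max(axis=0) - occupied.min(axis=0) + 1
    return tuple(float(c) * voxel_size for c in counts)


def measure(segment, label, voxel_size):
    """Collect the parameters for one classified voxel segment."""
    dx, dy, dz = extents(segment, voxel_size)
    if label == "rectangular_prism":
        return {"length": dx, "width": dy, "height": dz}
    if label in ("cylinder", "cone"):
        # For a cone, the widest cross-section is the base.
        return {"radius_1": dx / 2, "radius_2": dy / 2, "height": dz}
    if label == "sphere":
        return {"radius_1": dx / 2, "radius_2": dy / 2, "radius_3": dz / 2}
    raise ValueError(f"unknown classification: {label!r}")


# A 4 x 2 x 6 voxel block inside a larger grid, with 0.5-unit voxels.
segment = np.zeros((6, 6, 6), dtype=bool)
segment[1:5, 2:4, 0:6] = True
print(measure(segment, "rectangular_prism", voxel_size=0.5))
```

Because the counts are multiplied by the voxel size, every parameter carries an error of up to one voxel, which is one reason a later validity check on the grasp is useful.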
Performing the Grasp
Checking Grasp Validity
At this stage, the software has decided to grasp a certain object and knows how to hold it. The grasp validity test is a sequence of short tests to make sure that there is a solid grasp on the object and that the method being used to hold the object is optimal for the given grip type. A flowchart and explanation are below.
Input: Initial grasping values
Perform the initial grasp, decreasing all values so that the hand opens wider than the calculated position
Increase finger positions to the IAP (initial applied pressure) value
Continuously accelerate upwards and measure the changes in acceleration while moving
If the acceleration difference is less than the ADT (acceleration difference tolerance):
Grasp is validated
Otherwise:
Increase the forces applied by each of the fingers
Even out the pressure between the fingers
Start over from step 1
The initial values here (IAP, GTH, MTV, ADT) are all set by a human, as they are calibrated values. Depending on these values, this process can be very short or very long. In addition, because of the voxelization and the number of transformations the data goes through in the algorithm, this step acts as the final check before the task is performed. It verifies that the calculations are correct and accounts for the error range introduced by voxelizing the meshes. Once the grasp is validated, the robotic arm performs a hardcoded task: it moves from the green section to the red section and drops off the object.
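The retry loop of the grasp validity test can be sketched as follows. This is an illustration under assumptions rather than our implementation: the lift measurement is passed in as a function, the pressure step and attempt limit are invented parameters, and fake_lift is a toy stand-in for the simulation.

```python
def validate_grasp(measure_accel_difference, initial_pressure, tolerance,
                   pressure_step, max_attempts=20):
    """Tighten the grip until the lift test passes.

    measure_accel_difference(pressures) is a caller-supplied function that
    lifts the object with the given per-finger pressures and returns the
    acceleration difference. initial_pressure plays the role of the IAP and
    tolerance the role of the ADT. Returns the validated per-finger
    pressures, or None if no attempt passes.
    """
    pressures = list(initial_pressure)
    for _ in range(max_attempts):
        if measure_accel_difference(pressures) < tolerance:
            return pressures  # grasp is validated
        # Increase the force, then even out the pressure across the fingers.
        raised = [p + pressure_step for p in pressures]
        mean = sum(raised) / len(raised)
        pressures = [mean] * len(raised)
    return None


def fake_lift(pressures):
    """Toy stand-in for the simulation: slip shrinks as total pressure grows."""
    return max(0.0, 1.0 - 0.1 * sum(pressures))


result = validate_grasp(fake_lift, initial_pressure=[1.0, 2.0],
                        tolerance=0.25, pressure_step=0.5)
print(result)
```

The attempt limit is a safeguard added for the sketch so that badly calibrated values cannot cause an endless loop.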
Throughout the past eight weeks, we created a Computer-Aided Design (CAD) model of the hardware using SolidWorks, built the first prototype of the mesh generation algorithm, designed parts of the voxel classification, and collected test data on finger placement defined by pure object parameters. As explained above, a CAD model is used to improve the quality of the design without physical hardware. Our CAD model of the hand, modeled on the human hand, has three hinges per finger and is sized to match the average human hand. We also created a CAD model of a six-axis arm. The first prototype of the mesh generation algorithm helps the hand recognize the pure objects (e.g., cone, cylinder, rectangular prism, ellipsoid) that make up each chosen object, using camera input from the robot that shows an image of the object. Building on the mesh generation algorithm, we then created parts of the voxel classification algorithm, in which groups of voxels (3D pixels) are analyzed and segmented into different pure shapes, such as cones, cylinders, rectangular prisms, and ellipsoids, each with its own parameters. Lastly, using our object creation code, which generates shapes of different sizes across a set range, and our simulation code, which creates a simulation in PyBullet, we collected some data on finger placement, which will eventually be used to determine trend lines for the optimal grasp based on the size of each shape.
Due to time constraints and the physical limitations posed by COVID-19, we were not able to completely meet our objectives. Future steps for this project would be to obtain more training data with a wider variety of shapes and sizes, along with more gripping techniques for varied complex objects. We propose to run more trials and collect more data to create precise, definitive trend lines that can determine how to optimally grasp certain pure objects. Currently, we use a complex mathematical algorithm to segment the voxel mesh; an edge detection or deep learning approach could greatly expedite the process. As for the hardware, we would use a Computer Numerical Control (CNC) router to construct the setting and 3D print a majority of the parts for the arm, but the hand would need to be either redesigned or replaced by an existing hand model, as our current model is designed for simulation purposes only.
Regarding applications of the hardware, we hope to explore applying our research and testing data to the creation of a more dexterous robotic prosthetic hand. Although our software is not currently collaborative, in that it cannot work in conjunction with a human, pure object recognition could be added to prosthetic limbs to improve their functionality. This type of task would also need additional technology to predict what the user wants to hold and when to let go of an object.
We would like to thank everyone who supported our project. We would like to acknowledge Professor Tsachy Weissman of Stanford’s Electrical Engineering Department, head of the Stanford Compression Forum, for his guidance throughout this project. In addition, we would like to acknowledge Cindy Nguyen, STEM to SHTEM Program Coordinator, for the constant check-ins and chats, and Suzanne Sims for all of the behind-the-scenes work. We would like to thank Shubham Chandak for being our mentor and advising us. Thank you to all of the alumni who presented in the past eight weeks or gave us input regarding our project. Lastly, thank you to past researchers; your work has helped and inspired our project.
 Shreeyak S. Sajjan, Matthew Moore, Mike Pan, Ganesh Nagaraja, Johnny Lee, Andy Zeng, & Shuran Song. (2019). ClearGrasp: 3D Shape Estimation of Transparent Objects for Manipulation.
 Shuran Song, Andy Zeng, Johnny Lee, & Thomas Funkhouser. (2019). Grasping in the Wild: Learning 6DoF Closed-Loop Grasping from Low-Cost Demonstrations.
 Zeng, A., Song, S., Lee, J., Rodriguez, A., & Funkhouser, T. (2019). TossingBot: Learning to Throw Arbitrary Objects with Residual Physics.
 Zeng, A., Song, S., Yu, K.T., Donlon, E., Hogan, F., Bauza, M., Ma, D., Taylor, O., Liu, M., Romo, E., Fazeli, N., Alet, F., Dafle, N., Holladay, R., Morona, I., Nair, P., Green, D., Taylor, I., Liu, W., Funkhouser, T., & Rodriguez, A. (2018). Robotic Pick-and-Place of Novel Objects in Clutter with Multi-Affordance Grasping and Cross-Domain Image Matching. In Proceedings of the IEEE International Conference on Robotics and Automation.
 Song, S., Yu, F., Zeng, A., Chang, A., Savva, M., & Funkhouser, T. (2017). Semantic Scene Completion from a Single Depth Image. Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition.
 R. Jonschkowski, C. Eppner, S. Höfer, R. Martín-Martín, and O. Brock. Probabilistic multi-class segmentation for the Amazon Picking Challenge. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–7, Oct 2016.
 J. Redmon and A. Angelova, “Real-time grasp detection using convolutional neural networks,” in ICRA, 2015.
 D’Avella, S., Tripicchio, P., and Avizzano, C. (2020). A study on picking objects in cluttered environments: Exploiting depth features for a custom low-cost universal jamming gripper. Robotics and Computer-Integrated Manufacturing.
 C. Wu, “Towards Linear-Time Incremental Structure from Motion,” 2013 International Conference on 3D Vision – 3DV 2013, Seattle, WA, 2013.
 Saxena, A., Driemeyer, J., & Ng, A. Y. (2008). Robotic Grasping of Novel Objects using Vision. The International Journal of Robotics Research, 27(2), 157–173.
 Billard, A., & Kragic, D. (2019, June 21). Trends and challenges in robot manipulation. Science.
 Ji, S., Huang, M., & Huang, H. (2019, April 2). Robot Intelligent Grasp of Unknown Objects Based on Multi-Sensor Information.