AR SketchNet

'ARSketchNet: Real-time 3D Mesh Reconstruction and Editing from Conceptual 3D Sketches in Augmented Reality using Neural Implicit Learning'

ARSketchNet takes a 3D sketch drawn in AR space using a handheld mobile phone and predicts and reconstructs a mesh of the corresponding sketch, which is displayed back into the AR space. The model leverages the complementary affordances of 3D sketching (immersive, unconstrained, life-sized) and pretrained ML models (precise, constrained, ergonomic) as interactions and tools for in-situ 3D prototyping.

Company

Deeplocal

Course

MIT 4.453 Creative Machine Learning for Designers

Year

2022

Advisor

Caitlin Mueller & Renauld Drevian

Client

Google

Team

Vishal Vaidhyanathan


Research

The applications of AR in design have been growing beyond visualization into a collaborative space of collective intelligence between human and AI. Interactive interfacing between designers and three-dimensional tools at the early stages of design ideation reduces the friction of rapid design prototyping. Developments in motion-tracking algorithms allow real-time tracking of mobile devices, and therefore enable access to 3D mid-air sketching in Augmented Reality (AR). We utilize 3D-GAN, a recent model that generates 3D objects from a probabilistic space, as a method of exploring a design space that interpolates between various objects at the scale of furniture, to develop a model that generalizes mesh geometries from 3D sketches. In this research, we present a novel algorithmic workflow for data preparation and processing of 3D sketch-mesh pairs, and use a Multi-Layer Perceptron (MLP) based regression for mesh reconstruction. We also present a frontend AR drawing application, with a user-sketch processing workflow to reconstruct meshes in place in AR. Through this research, we also show that this approach is modular, easy to deploy, and robust to style changes.

A new era of human-computer interaction is unfolding with advances in tools that augment human capabilities and creative workflows. Information expression in two-dimensional space provides a limited mode of prototyping in creative spatial design. Augmented Reality (AR) accompanies designers by bringing virtual information not only to their immediate surroundings in real time, but also to any indirect view of the real-world environment. While AR was mainly used to enhance the user's ability to visualize and interact with the real world, today AR merged with AI capabilities could be part of a collective intelligence in which a collaborative design process between human and machine exists.

Study 1 | Bi-directional Ribbed Undulations
Study 2 | Localized Swirls
Study 3 | Bi-directional Ripples

Concept

Drone Flight

DJI Matrice 300 RTK Enterprise

Data Processing

Processed information from the flight to ensure dataset quality, remove errors due to excessive motion, and extract flight-path information such as areas, heights, GPS data, etc.

Information Stack

Looked at different map information to extract useful input for our site simulator and agent ML training.

Design

The application and interface developed in Unity include:

  • Training and Simulation Environment: the 3D scanned site mesh with texture.
  • Analytics Board: includes the three agents, their score, front and top cameras, and time to destination.
  • Heat Map: a path-tracking map that updates in real time with the movement taken by the agents.
  • Sea-level Rise: three sea-level rise visualizations (3', 5', & 6') from a dropdown menu.

Agents
Computation

ML Agents

Ray Perception Sensor 3D, a perception sensor that returns a compressed representation of the observation: obstacle (positive / red), void (negative / white).

Through trial and error, we learnt that enabling the agent with multiple perception sensors enhanced the training results.

NPC Behavior

NPC: Non-Player Character

Game Interface

Basic Interface

Flood City Simulations

ML trained simulations on several scenarios:

  • Base Case: no sea-level rise
  • SLR: 3 Feet
  • SLR: 5 Feet
  • SLR: 6 Feet
Base Case: Today

Travel behavior and paths are almost evenly distributed.

Geometry Analysis

The project begins by looking closely at irregular forms: how can useful information be extracted and utilized in a robotic system? The initial step is to 3D scan rocks and compute their surface deformation, the slope of their faces, and how computational workflows could estimate their forms.

Computation:
Inverse Kinematics

Parametric simulation of the delta robot, enabled by the capacitive tracking of the six pads, allows the model to understand the three key positions the delta arms will tween between, within the bounding limits of the servo motors.
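
As an illustrative sketch of the tweening logic described above, written in Python rather than the project's actual robot-control code; the joint limits, key positions, and function names are assumptions:

import numpy as np

# Hypothetical servo limits (degrees) shared by the three delta arms.
SERVO_MIN, SERVO_MAX = -30.0, 100.0

def tween_key_positions(key_angles, steps=50):
    """Linearly interpolate between successive key positions (one angle per arm),
    clamping every frame to the servo bounds."""
    key_angles = np.asarray(key_angles, dtype=float)   # shape (n_keys, 3)
    frames = []
    for a, b in zip(key_angles[:-1], key_angles[1:]):
        for t in np.linspace(0.0, 1.0, steps):
            frames.append(np.clip((1 - t) * a + t * b, SERVO_MIN, SERVO_MAX))
    return np.stack(frames)

# Example: home -> pick -> place key positions for the three arms (illustrative values).
trajectory = tween_key_positions([[0, 0, 0], [45, 30, 45], [20, 80, 20]])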

Computation:
Geo Orientation

Running simulations to orient the rocks along an axis that will enable a smoother and more efficient contact with the gripper.

Sensor Tech:
Capacitive

Capacitive sensors are used as the pickup detection system: once a rock is placed on one of the six pads, the information is relayed to the processor to initiate a pickup command in the queue.

Cantilevers

A support arm projects forward for the next rock to rest on temporarily.

The arm projects forward to act as a support base or scaffolding for the rock above. This arm is gently pulled out after the rock on top is stabilized by another rock on top of it.

Machine Details

The machine is composed of several systems working together to construct novel and precise compositions.

  • A delta robot
  • A motorized rail for the delta robot
  • A motorized support cantilever arm
  • 6 PCB electronic boards
  • 6 capacitive sensor plates
Systems
Electronic Systems & Capacitive Base
Delta Robot

Technical Details

A technical inside look into the components of each system.

Gripper

Site Typologies

The project was studied in three distinct contexts. First, the primary testing site was Sumner Street in Cambridge, MA: a small-scale neighborhood strip that connects several Harvard University buildings and features a mix of residential housing structures and academic centers. The site is situated near busy squares and has become an active line of transportation after the area was converted into a one-way street. Given its proximity to our base location, it serves as an ideal testing ground for both the development and the product testing phases. The assumption for this site choice is to understand how air quality is affected by various times of the day, which dictate traffic in and out of academic and office spaces in the area.

The second site studied is an edge condition between nature and infrastructure: the Harvard Arnold Arboretum, adjacent to the Forest Hills bus and train stations. The three sensors are placed beginning at the street intersection and extending into the wilderness of the Arboretum.

The third site is Downtown Boston where, reflecting on learnings from Dabberdt's 1973 urban diffusion simulation for carbon monoxide in urban canyons at Stanford's Environmental Lab, sensors are placed along different sides of the canyon to understand the disparity between readings within one small urban area.

Simulation: Gases

To simulate the effect of four gases, we collect information on the characteristics of these gases in real-world open-air scenarios. The four gases are Carbon Dioxide, Carbon Monoxide, Ground-level Ozone, and Nitrogen Dioxide. The features are color, density, flammability, and melting and boiling temperatures. In simulation, these behaviors are translated into decay, movement vectors, and color for each of the gases.
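
A minimal sketch of how such gas characteristics might be encoded as per-gas simulation parameters; the field names and numeric values below are illustrative assumptions, not the project's actual Niagara settings:

from dataclasses import dataclass

@dataclass
class GasParams:
    name: str
    color: tuple          # RGBA used to tint the particles
    density_kg_m3: float  # drives the buoyancy component of the movement vector
    decay_rate: float     # fraction of particles removed per second (assumed)

# Illustrative values only; real densities and colors would come from the collected data.
GASES = [
    GasParams("CO2", (0.8, 0.8, 0.8, 0.3), 1.98, 0.02),
    GasParams("CO",  (0.6, 0.6, 0.9, 0.3), 1.14, 0.05),
    GasParams("O3",  (0.5, 0.7, 1.0, 0.3), 2.14, 0.10),
    GasParams("NO2", (0.7, 0.4, 0.2, 0.3), 1.88, 0.07),
]

def movement_vector(gas: GasParams, air_density=1.20):
    # Simplified buoyancy: denser-than-air gases sink, lighter ones rise.
    return (0.0, 0.0, air_density - gas.density_kg_m3)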

Shown below: the simulation of each of the gases in Unreal Engine 5.0 using the Niagara particle system.

Interface Design

We design a standalone desktop application / interface that provides users with several features, including:

  • The main perspective viewpoint in the background
  • Four viewpoint (eye-level) options: Human, Insect, Bird, and Pet
  • A top-view navigation mini-map
  • Navigation compass
  • Date & time stamps
  • 24-hour day-to-night simulation timeline
  • A personal exposure monitoring window

Interfacing the Atmospheric Milieu

Digital Twins of the Sites

Testing at each of the sites with three sensors in real-time.

Site 1: Sumner St., Cambridge, MA
Site 2: Harvard Arnold Arboretum, Boston, MA
Site 3: Downtown Boston, Boston, MA

Computational Layers

To activate the 3D environment and capture personal-exposure levels with low computational power, a stack of several mathematical layers was developed on a two-dimensional plane.

1 - The 120m Site

Technical Systems

Systems Architecture

Calibration Network

Calibrating low-cost custom air sensors by synchronizing and collecting data in real time from the Environmental Protection Agency (EPA), then comparing them against
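
As a sketch of what such a calibration step could look like, assuming synchronized readings from a low-cost sensor and an EPA reference monitor; this is a simple least-squares linear correction for illustration only, not the project's documented method:

import numpy as np

def calibrate(raw_readings, epa_reference):
    """Fit a linear correction raw -> reference via least squares."""
    raw = np.asarray(raw_readings, dtype=float)
    ref = np.asarray(epa_reference, dtype=float)
    A = np.vstack([raw, np.ones_like(raw)]).T
    gain, offset = np.linalg.lstsq(A, ref, rcond=None)[0]
    return gain, offset

def apply_calibration(raw_readings, gain, offset):
    return gain * np.asarray(raw_readings, dtype=float) + offset

# Example with hypothetical synchronized hourly readings.
gain, offset = calibrate([12.0, 18.5, 25.1], [10.2, 16.8, 23.9])
corrected = apply_calibration([14.0, 20.0], gain, offset)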

Artifacts OnSite

On-site testing (Sumner St., Harvard Arboretum, Downtown Boston).

All images, video, and text courtesy of Deeplocal (Pittsburgh, PA).

BUILT FOR KIDS

BUTTONS AND LEVERS
FOR HANDS-ON PLAY

9,300 GAMES PLAYED

DURING THE 11-DAY
IOWA STATE FAIR

Making Through Scavenging

The story of a fall 'ghaf' tree within the UAE's desert scape.

Geometric Cataloging

Generative Abstraction of Average Paths & Kinks.

Reciprocal Frame Structure

Project information coming soon.

Locking Mechanism

Project information coming soon.

Digital . Physical

Project information coming soon.

Painterly Chairs

Project information coming soon.

Spatial Implementation

Project information coming soon.

Human-AI Collaboration

Despite earlier apocalyptic predictions about workers being replaced by intelligent machines as Artificial Intelligence (AI) finds its way into businesses, the course of AI seems to be taking a new tactic: actively searching for ways to collaborate with humans. Human intelligence, paralleled with AI capabilities and computational speeds, is a novel gateway to modern design tools. Humans and AI actively enhance each other's complementary strengths: the leadership, teamwork, creativity, and social skills of the former, and the speed, scalability, and quantitative capabilities of the latter [15]. Unlike the commonly known human-AI interaction, collaboration involves mutual goal understanding, task co-management, and shared progress tracking [14]. In a creative collaborative creation process, human-AI collaboration could occur in multiple forms. For example, humans could initiate a design workflow with a first parameter that resembles an intent, from which an AI model could begin to complete, suggest, or analyze the content, giving live feedback that enriches the process. The user then navigates through the options to steer the next steps. Using reinforcement learning techniques, a reward-and-penalty system is implemented to curate the process as the AI model continues to learn from the human input. The result is faster prototyping, a wider range of design exploration, and more efficient analysis. A similar example is when an AI model transforms and visualizes simple creations made by humans, in real time, into new, more complex forms.

3D Sketching

For many, sketching, which ranges from quick gestures to refined drawings, is a root step in any creation process. It has predominantly been flattened onto a two-dimensional surface, losing much of the design information a simple gestural sketch could embody. The interactive design experience and process that is usually initiated with a 'sketch' is further enriched by moving beyond the two-dimensional surface into a three-dimensional environment: space as a canvas of creative and interactive prototyping. Spatial 3D sketching has long been limited to VR hardware and typically requires special hardware for motion capture and immersive stereoscopic displays. Recent advances in motion-tracking algorithms on smartphones, such as Concurrent Odometry and Mapping (COM) [10] and Visual Inertial Odometry (VIO) [8], enable the use of mobile phones as both stylus and screen for mid-air 3D sketching. Mid-air sketching with a mobile phone is a technique that enables designers and creators to draw virtual gestures or curves directly in the air in 3D. Unlike VR, AR allows users to author 3D sketches that are more directly linked to real-world objects or environments [7], making it a powerful tool for rapid prototyping.

Fologram

Fologram is a mixed reality application for Rhinoceros and Grasshopper 3D that is compatible with HoloLens 2, HoloLens 1, iOS, and Android [1]. It accurately positions digital content in 3D space, streams and interacts with models, works with gestures and markers, and models and controls data flow. Some of its most common use cases within the architectural space, beyond visualization, are overlaying fabrication instructions for complex and atypical systems. It allows various levels of control of the geometry, layers, and Grasshopper parameters in real time through its interface.

Deep Learning for 3D Design

Deep networks have been used for various tasks around 3D object understanding and geometry generation. Li et al. [9], Su et al. [13], and Girdhar et al. [5] proposed to learn a joint embedding of 3D shapes and synthesized images for 3D object recognition. Using a recurrent network, Wu et al. [16], Xiang et al. [18], and Choy et al. [3] attempted to reconstruct 3D objects from a wide selection of images. Other attempts at understanding 3D voxel-based geometry representation explored autoencoder-based deep networks [5], [12]. Generative Adversarial Networks (GAN) [6] introduce the use of a generator and an adversarial discriminator, where the discriminator classifies real objects and objects created by the generator, while the generator attempts to confuse it. This is a particularly favored framework for 3D object modeling: 3D objects are highly structured, and the model is able to capture the structural difference between two 3D objects [17]. Furthermore, a 3D GAN can be used to sample through the probabilistic latent space, creating a wide range of 3D data for design exploration and model training.

Approach

The biggest disadvantage of employing a 3D GAN to explore ideas is that the majority of the latent vector's dimensions are incomprehensible to a human designer. Furthermore, due to the enormous number of parameters, such a design space is difficult to explore. In this research, we aim to present a new, creative way of human-AI collaboration through the medium of AR for the use-case of interior design; particularly, we focus on designing and placing furniture in a given space. To do this, we develop an AR-based 3D drawing app, where users can use their phones to move in three dimensions and draw strokes that resemble the furniture they intend. The reason for using this mode of human-computer interaction is to allow users to move around their space and conceptualise objects to-scale and in-position in their space through an easy-to-use AR drawing interface. Through interactive AR-based 3D sketching, this project also aims at creating a natural approach to interacting with 3D GANs. We also present a novel algorithmic workflow for processing these sketches and representing them automatically for use in a machine learning pipeline. We evaluate several machine learning algorithms to predict the mesh / mesh representations and propose a Multi-Layer Perceptron (MLP) based regression method to reconstruct 3D meshes from user-input sketches through weight-encoded neural implicit [4] prediction.

We begin by preparing our data-set. For our proposed problem, we need a data-set that comprises mesh-sketch pairs. There are several large-scale 3D object data-sets available, like ShapeNet [2], that can be labeled with appropriate data. However, in our problem, we aim to predict meshes that are able to "morph" themselves to different sizes and proportions of sketch strokes. So we would not only require perfect geometries, but also interpolations between several geometries, to get hybridized results that work well with sketches of diverse shapes and sizes. Hence, we decide to generate our data artificially. To do this, we choose to use the pre-trained chair and sofa generators (Figure 1) from [17]. It presents a novel method to generate 3D objects with an architecture called 3D Generative Adversarial Network (3D-GAN), which uses recent breakthroughs in volumetric convolutional networks and generative adversarial nets to generate 3D objects from a probabilistic latent space.

Figure below: 3D GAN generator architecture

Data-set Preparation - Sampling from 3D GAN

We first sample 1000 randomly generated 128-dimensional latent vectors that are distributed normally in the range [-2, 2]. We feed these latent vectors into the 3D GAN to generate corresponding 3D meshes from the latent space of the generator. Each sampled latent vector, corresponding to the generated mesh, is stored in a CSV file for later use. We also save the generated meshes as .obj files. Some of the generated geometries are shown below.
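
A minimal sketch of this sampling step, assuming the pre-trained 3D-GAN generator is available as a PyTorch module and that a helper exists to export voxel grids to .obj; both the generator argument and the export helper are assumptions, not code from the project:

import csv
import torch

def sample_latents_and_meshes(generator, n_samples=1000, latent_dim=128):
    """generator: assumed pre-trained 3D-GAN generator mapping (1, latent_dim) -> voxel grid."""
    # Normally distributed latents, clipped to the stated [-2, 2] range.
    latents = torch.randn(n_samples, latent_dim).clamp_(-2.0, 2.0)
    with open("latents.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for i, z in enumerate(latents):
            writer.writerow([i] + z.tolist())
            with torch.no_grad():
                voxels = generator(z.unsqueeze(0))
            save_voxels_as_obj(voxels, f"mesh_{i:04d}.obj")  # assumed helper (e.g. marching cubes + OBJ export)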

Figure below: Generated 3D objects from 3D GAN latent space

Data-set Preparation - 3D Sketch Generation

For our problem, there were no 3D sketch data-sets available. Hence, we design an algorithmic approach to generate our sketches from the mesh geometries sampled from the 3D GAN, in a parametric design tool called Grasshopper3D [11]. We aim to generate sketches that are natural-looking and organic, that would look like a quick conceptual sketch, similar to what users would draw. In order to achieve such sketch strokes from our sampled meshes, the saved .obj files of the meshes are loaded iteratively and several points are sampled from the geometry. The number of sampled points is decided qualitatively based on how well they can represent the mesh morphology. In our case, 128 samples were identified to be sufficient. These sampled points are then sorted according to their spatial proximity in world coordinates. The sorting algorithm is shown below.

Once we have the sorted point list, the points are joined by a polyline in order. This creates a dense and jagged line-like curve along the sorted points. The polyline is then simplified and smoothed using B-spline approximation to a degree-3 curve. All the variables used in processing the line are parametric, and the final threshold values are chosen qualitatively for random samples from the 3D GAN. The final smoothed curve resembles a conceptual sketch stroke, similar to what we were aiming for. The overall process is illustrated in Figure 3. This process is repeated for all 1000 sampled meshes, and corresponding sketch strokes are recorded. Some of the generated sketches and their corresponding mesh samples are shown in the figures below.

Algorithm to generate 3D sketch strokes from a mesh (clockwise): sampling points on the mesh, sampled points, sorting points based on spatial proximity, polyline connecting sampled points, reduced polyline, B-spline smoothing to get the sketch curve.
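
The following is an illustrative Python/SciPy translation of the sorting and smoothing steps; the project performs them in Grasshopper3D, so the function names and the smoothing parameter here are assumptions:

import numpy as np
from scipy.interpolate import splprep, splev

def sort_by_proximity(points):
    """Greedy nearest-neighbour ordering of sampled points in world coordinates."""
    points = np.asarray(points, dtype=float)
    remaining = list(range(len(points)))
    order = [remaining.pop(0)]
    while remaining:
        last = points[order[-1]]
        dists = np.linalg.norm(points[remaining] - last, axis=1)
        order.append(remaining.pop(int(np.argmin(dists))))
    return points[order]

def sketch_stroke(points, n_out=128, smoothing=0.05):
    """Polyline through the sorted points, smoothed with a degree-3 B-spline approximation."""
    sorted_pts = sort_by_proximity(points)
    tck, _ = splprep(sorted_pts.T, s=smoothing, k=3)
    u = np.linspace(0.0, 1.0, n_out)
    return np.stack(splev(u, tck), axis=1)   # (n_out, 3) smoothed stroke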

Vector-based Sketch Representation

Once we have the generated sketch strokes for the sample meshes, there needs to be a way to represent them. This representation is also required to process user-drawn sketches for use in a machine learning pipeline. We also want this representation to be proportional in dimension to the latent vectors that were used to sample different meshes from the 3D GAN. The latent vectors are 128-dimensional, and hence we initially decide the sketch representation vectors to be 128-dimensional as well. In order to do so, we sample 128 points from the generated sketch stroke of each mesh. Then, a bounding box with a centroid is created around the sketch stroke, and each of the sampled points from the previous step is connected to the bounding-box centroid. This creates a list of 128 vectors, each anchored at the centroid and connected to the corresponding sampled point. These vectors are normalised to the bounding-box dimensions, to finally fall in the [0, 1] domain. We then use this list of normalized vectors (taken in order of points) to represent our sketch vector. The overall process is illustrated in the figures below.
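
A sketch of this representation step in NumPy, assuming 128 points have already been resampled along the stroke; reducing each normalised centroid-to-point vector to a single value is one plausible reading of the 128-dimensional representation described above, not a confirmed detail:

import numpy as np

def sketch_vector(stroke_points):
    """128-dimensional sketch representation: one normalised centroid-to-point vector per sample."""
    pts = np.asarray(stroke_points, dtype=float)          # (128, 3) points sampled along the stroke
    bb_min, bb_max = pts.min(axis=0), pts.max(axis=0)
    centroid = (bb_min + bb_max) / 2.0
    extent = np.where(bb_max - bb_min > 0, bb_max - bb_min, 1.0)
    vectors = (pts - centroid) / extent                   # normalised to the bounding-box dimensions
    # One plausible reduction to 128 dimensions: the length of each normalised vector,
    # which falls in [0, 1] for points inside the unit-scaled bounding box.
    return np.linalg.norm(vectors, axis=1)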

K-Nearest Neighbor Regression

Our initial attempt to map sketches to meshes was to predict, from the generated sketch representation vectors, the latent vectors that were used to sample meshes. We use a K-nearest neighbor (KNN) regression algorithm to achieve this. The intuition behind using KNN was to identify the latent vector that is closest (in terms of distance) to the input sketch representation vector, and sample a mesh geometry from the 3D GAN with the predicted latent vector. KNN is also non-parametric given the number of closest neighbors (K value), which means we avoid the issue of average collapse, where we would get a mean prediction across all input sketch representation vectors. We first run a hyper-parameter search using cross-validation, with the following parameter-space dictionary.

{'n_neighbors': [1, 2, 3, ..., 25],
 'weights': ['uniform', 'distance'],
 'p': [1, 2, 3]}

The hyper-parameter search yielded 14 neighbors, 'distance' weights (where each neighbor is weighted inversely proportional to its distance), and a p value of 1 (Manhattan distance).
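
A sketch of that search with scikit-learn, assuming X_sketch holds the 128-dimensional sketch representation vectors and Y_latent the corresponding 3D-GAN latent vectors (both arrays are assumed, not shown in the text):

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_neighbors": list(range(1, 26)),
    "weights": ["uniform", "distance"],
    "p": [1, 2, 3],
}

search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5)
search.fit(X_sketch, Y_latent)   # X_sketch: (1000, 128) sketch vectors, Y_latent: (1000, 128) latents
print(search.best_params_)       # reported result: n_neighbors=14, weights='distance', p=1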

Overall training cycle for KNN regressor (below).

Preliminary Results & Inferences

Using KNN regression, we found that the model performed satisfactorily, with a cross-validation accuracy of 71% (5 folds). These predictions were then fed back into the AR app, displaying the sampled mesh in the user's environment. However, the results were very close to each other, predominantly predicting latent vectors that generated similar-looking meshes. While the semantic features of the mesh, like hands, legs, and shape, were well mapped, the overall shape, scale, and proportions of the mesh did not match the input sketch. Some of the predictions in the AR app interface are shown below.

We noticed that the predicted meshes were semantically correct - where if the sketch indicates no hands, the mesh has no hands as well. However, they are either very similar in nature (1st, 2nd and 3rd images in Figure 7) or undesirable (4th image in Figure 7). We also noticed that for two very different latent vectors, the sampled meshes were very similar qualitatively. This suggested that rather than mapping sketches to latent vectors and sampling meshes, there needed to be a much more direct representation of the meshes. And consequently, we also need similar embeddings for the sketches for everything to work smoothly.

Neural Implicit Representation

In our subsequent iteration, we represent the meshes and the sketches as neural implicits [4]. A neural implicit representation is a function that determines whether a queried point lies inside, outside, or on the surface of our geometry. In this paper, we use the Signed Distance Function (SDF), which returns, at any point, the distance to the outer surface of the geometry, with the sign indicating whether the point is inside or outside. We recognize that it is easy to represent our meshes and sketches using neural implicit functions. By training a model on these values, we can map between the sketch and mesh representations bidirectionally. We begin by sampling 100,000 points around the mesh and sketch objects independently, using Latin Hypercube Sampling. Then, we accept 10,000 samples from them based on their distance from the mesh/sketch. We do this to make sure that all our samples are useful, i.e., lie close to the object. The final samples and their corresponding distances are recorded as points and floats respectively. This process is repeated iteratively for all 1000 mesh objects and sketch strokes, giving us four feature vectors: sketchSamples, sketchDistances, meshSamples, and meshDistances.
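
A sketch of the point-sampling step, assuming a callable that returns the signed distance to a given geometry; SciPy's Latin Hypercube sampler and the acceptance-by-nearest-distance rule stand in for whatever exact implementation was used:

import numpy as np
from scipy.stats import qmc

def sample_sdf_points(signed_distance, bounds=(-1.0, 1.0), n_raw=100_000, n_keep=10_000):
    """Latin Hypercube sample around an object, keeping the points closest to its surface."""
    sampler = qmc.LatinHypercube(d=3)
    pts = qmc.scale(sampler.random(n_raw), [bounds[0]] * 3, [bounds[1]] * 3)
    d = np.array([signed_distance(p) for p in pts])   # signed distance per sampled point
    keep = np.argsort(np.abs(d))[:n_keep]             # accept the samples nearest the surface
    return pts[keep], d[keep]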

The overall data representation process is illustrated in the diagram below. These vectors are then fed into a simple MLP regression in different combinations and evaluated. First, we train on pairs of sketchSamples and meshDistances to be able to predict the implicit mesh representation given the sketch samples. Once we have the meshDistances, we can reconstruct the mesh using algorithms like marching cubes. In our next iteration, we train using pairs of sketchDistances and meshDistances; the intuition behind this experiment was to be able to predict meshDistances from sketchDistances. Finally, we train the model with pairs of sketchSamples and meshSamples to be able to map directly between the two sets of sample points. These experiments are further explained in the next section.

Above: Accepted points from Latin Hypercube sampling around meshes and sketches with decreasing distance visibility. We felt that at the 10,000 sample mark, the points were spread well around both the sketch and the mesh.

MLP Regression - Training Experiments

The model is a simple MLP Regressor, with 8 hidden layers of 512 dimensions each as shown in Figure 8. We use ReLU activation and Adam Optimizer with a learning rate of 0.0001. The model is trained for 200 epochs with a batch-size of 50 samples. We use L1 Loss for training.
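
A sketch of this regressor in PyTorch under the stated settings (8 hidden layers of 512 units, ReLU, Adam at 1e-4, L1 loss, 200 epochs, batch size 50); the data loader and the per-point 3-in / 1-out dimensions are assumptions that correspond to the Exp.1 pairing:

import torch
import torch.nn as nn

class ImplicitMLP(nn.Module):
    def __init__(self, in_dim=3, out_dim=1, width=512, depth=8):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        layers.append(nn.Linear(d, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = ImplicitMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

for epoch in range(200):          # 200 epochs, batches of 50 samples
    for x, y in loader:           # assumed DataLoader yielding (input, target) pairs
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()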

The training data used for the model is as follows:

Exp.1 X: sketchSamples, Y: meshDistances

Here the goal was to predict meshDistances on the sketch sample points in order to get an SDF of the mesh for reconstruction. So, when a user draws a sketch, we can sample points around the sketch and get an SDF of the mesh on those samples from the MLP. The results were really interesting, with more diversity in predictions. Along with semantic features like arms or legs, even the scale and proportions are matched to a great extent. Training loss is shown below.

Exp.2 X: sketchDistances, Y: meshDistances

The objective of this experiment was to be able to predict an SDF of the mesh from the sketchDistances for reconstruction. This accommodates multiple strokes drawn by the user: if the user draws a new stroke on an existing stroke, instead of sampling points again, we can recalculate the distances and directly get the mesh SDF for reconstruction. The results were qualitatively good, similar to the previous experiment. Training loss is shown below.

Exp.3 X: sketchSamples, Y: meshSamples

The intuition behind this final experiment was to see if point-to-point regression would yield better results. On training with sketchSamples and meshSamples, the results were fairly similar to the previous two experiments. All experiment predictions are shown below.

Conclusion

While the KNN regression performed satisfactorily in predicting latent vectors for input sketch vectors, the MLP regression with SDF-based neural implicit representation worked best for the current problem. Through this paper, we aim to create a robust, modular, and easy-to-use AR application that translates mid-air-drawn 3D sketches into mesh geometries using an MLP-based regression model. The front-end interface provides various controls and tools that aid in the drawing process and the visualization of the 3D geometries. Simple 3D gestural sketches suffice for an accurate prediction, allowing users with a minimal drawing skill set to use the tool and prototype equally.

References

[1] Fologram. https://fologram.com.

[2] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.

[3] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In European conference on computer vision, pages 628–644. Springer, 2016.

[4] Thomas Davies, Derek Nowrouzezahrai, and Alec Jacobson. On the effectiveness of weight-encoded neural implicit 3d shapes. arXiv preprint arXiv:2009.09808, 2020.

[5] Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision, pages 484–499. Springer, 2016.

[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.

[7] Kin Chung Kwan and Hongbo Fu. Mobi3dsketch: 3d sketching in mobile ar. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–11, 2019.

[8] Mingyang Li and Anastasios I Mourikis. High-precision, consistent ekf-based visual-inertial odometry. The International Journal of Robotics Research, 32(6):690–711, 2013.

[9] Yangyan Li, Hao Su, Charles Ruizhongtai Qi, Noa Fish, Daniel Cohen-Or, and Leonidas J Guibas. Joint embeddings of shapes and images via cnn image purification. ACM transactions on graphics (TOG), 34(6):1–12, 2015.

[10] Esha Nerurkar, Simon Lynen, and Sheng Zhao. System and method for concurrent odometry and mapping, Oct. 13 2020. US Patent 10,802,147.

[11] David Rutten, Robert McNeel, et al. Grasshopper3d. Robert McNeel & Associates: Seattle, WA, USA, 2007.

[12] Abhishek Sharma, Oliver Grau, and Mario Fritz. Vconv-dae: Deep volumetric shape learning without object labels. In European Conference on Computer Vision, pages 236–250. Springer, 2016.

[13] Hao Su, Charles R Qi, Yangyan Li, and Leonidas J Guibas. Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views. In Proceedings of the IEEE international conference on computer vision, pages 2686–2694, 2015.

[14] Dakuo Wang, Elizabeth Churchill, Pattie Maes, Xiangmin Fan, Ben Shneiderman, Yuanchun Shi, and Qianying Wang. From human-human collaboration to human-ai collaboration: Designing ai systems that can work together with people. In Extended abstracts of the 2020 CHI conference on human factors in computing systems, pages 1–6, 2020.

[15] H. James Wilson and Paul R. Daugherty. Collaborative intelligence: Humans and AI are joining forces. Harvard Business Review.

[16] Jiajun Wu, Tianfan Xue, Joseph J Lim, Yuandong Tian, Joshua B Tenenbaum, Antonio Torralba, and William T Freeman. Single image 3d interpreter network. In European Conference on Computer Vision, pages 365–382. Springer, 2016.

[17] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. Advances in neural information processing systems, 29, 2016.

[18] Yu Xiang, Wongun Choi, Yuanqing Lin, and Silvio Savarese. Data-driven 3d voxel patterns for object category recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1903–1911, 2015.

Diagrammatic Relationships

Figure-ground building to site diagrams and assembly massing diagrams

Designing Experiences

Designing the processional experience, programmatic layout, and relation to site / connectivities.

Spatial Design in Parts

Gallery Space

Sectional Relationships

The project centers on exploring the relationship between site, landscape, and building in relation to time. The design process required the negotiation and resolution of tensions between material massing, programmatic organizational forms, and architectural systems to respond to the subject of inquiry: an art campus.

Salient Experiences


Material Form

Shifting continuously between the screen and the lab, the processes enabled physical material prototyping to be a core part of the design decisions.

Formal Explorations

Sectionally Trapped Objects

Programs as Systems

The diagrams below indicate the sub-programs the farm houses: Collection center, 3D printing powder factory, Bee farms, Mycelium grown coffee and mushroom spaces, green farming and lab grown meat. The factory is able to grow proportionally to the density of the neighborhood the farm is located in.

Formally, we explored abstract gestural and figural relationships of a shell structure to house sectionally trapped volumes. The shell tears, splits, and folds to correspond to programmatic and pragmatic needs. The shell thickens and hybridizes with a denser cluster of columns for structural requirements.

The project uses physical models as a tool to explore structural and organizational systems of the tower to house various programs: Bee farm, hydroponics, mycelium grown beans and mushroom, composting stations, water storage tank, 3d printing sugars and lab grown meat.

The program multiplies differently for different sites to understand the scale the tower could serve in different locations.

Ordering Process

An application interface is developed to simulate the ordering and delivery process.

Digital-Physical

Digital design to fabrication precision using 3-axis CNC router to mill segments of each shell that are then assembled manually.

Exhibition

The project is on display on an integrative base that combines the physical models, project drawings, catalogues, and informatics.

Generative Systems

Located along the cultural corridor of Dubai, at Al Serkal's end of Sheikh Zayed Road, the aim was to push the project below ground and create sunken courtyards that, together with the inverted structure, create a spatial dialogue between the programs underneath and above, forming a coherent public landscape at ground level.

Technical Section

A detailed look at how the inverted dome structure and cast-in-fabric systems could be constructed in large-scale scenarios.

Experience Prototyping

Investigating the spatial and experiential qualities of the design and material systems.

Project Plans and Systems

The project looks inwards towards volumetric domes of varying sizes, creating a dynamic interior. As you walk down the large welcoming stairs, the project reveals itself and transforms from a massive and dense structure into an open floor-plan void. Modules are subtracted to create the sunken courtyard voids.

Project Sections

Longitudinal and cross sections.

Visuals

A technical inside look into the components of each system.

Aerial View

🛩️ Status Quo

A growing demand for drone systems in highly constrained indoor environments - especially warehouse inventory management.

Problem Space

The use of single drones may lead to payload limitations that are proportional to their size and hardware. More powerful and larger drones might prove to be expensive and unscalable.

Solution

Warehouses can make use of swarms of multiple small drones, each with a simple RGB camera, for moving large payloads together.

🙄 We’ve always done it this way..

Where human driven navigation might not be possible or efficient - autonomous drones are seen as a highly valid solution.

🛬 Approach

We use a digital-twin drone system and a simulated virtual warehouse environment as our source of data.

This tackles the problems of data unavailability and tedious manual annotation.

🧰 System Setup

We create a simulated environment of a large warehouse on the Unity3D platform. We make sure the lighting, materials, and textures mimic real warehouses.

We develop a digital-twin of the drone - that mimics an actual drone in terms of physical, technical and visual features. We create the digital-twin of the whole two-drone system with payload. 

We set up virtual RGB sensors on each digital drone to collect our input data.

We recognize the resemblance of a two-drone system to a stereoscopic vision system. We use a CNN to stereo-match and, in real time, reconstruct the environment immediately in front of both drones.

Since we consider the input pipeline to be a stereo-vision system, the two drones can perceive all non-occluded obstacles "seen" by both drones.

🔀 Domain Randomization

To gain diversity in data collection, we capitalise on the simulated environment by creating a procedural generation algorithm that randomizes our environment for each run.

→ We randomize obstacle positions, orientations, materials and textures.

→ We also vary environment lighting to diversify our synthetic dataset for different illumination settings.

🤖 Synthetic Data Pipeline

The digital-twin system (2 drones + payload) is completely physics informed.

→ We develop a depth shader to render depth maps along with the RGB images for each drone in the system.

→ The digital-twin system can now be used to explore the environment, capture input RGB images, and record depth-map ground-truth labels. We also log the Euler rotations.

Data Collection

We collect ~600 x 4 images. 

- 1200 RGB images from left and right

- 1200 Depth-map labels from left and right

- 600 Euler rotations (camera odometry)

👀 Perception model

→ We study two CNN architectures, PSMNet and RecNet (deep learning for stereo-matching).

→ We build a custom CNN combining PSMNet for disparity estimation and RecNet for 3D reconstruction.

→ We use PSMNet for the first part as it helps in efficiently extracting image features - like edges or textures - that can be used to predict the disparity more accurately through feature matching.

→ We use our synthetically generated RGB images from the left and right drones as inputs and the generated depth-map labels as the targets and train the PSMNet model to estimate disparity. 

→ We then use the RecNet architecture to reconstruct the 3D model of the environment from the disparity image output by PSMNet. The overall trained model should be able to reconstruct a voxel environment given a stereoscopic RGB image pair as input.

Network Architecture

→ The RGB stereo images are input to a CNN that calculates the 2D feature maps in the images. Both the left and right images go through a CNN module for feature extraction and these two modules have weight sharing. 

→ Then, they pass through Spatial Pyramid Pooling (SPP) modules that harvest these features and convert them into concatenated representations of different sizes. They are then passed through a convolution layer for feature fusion from previous layers.

→ Then the image features from both sides are used to form a 4D cost volume, which is fed into a 3D CNN for cost-volume regularization and disparity regression to get the disparity map prediction. The predictions are evaluated with a smooth L1 loss.

→ Then the disparity and the RGB image are fed to RecNet, that includes 6 convolution layers, one fully connected layer, and 7 deconvolution layers to finally build the volumetric mesh. 
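
A sketch of the 4D cost-volume construction described above, following the general PSMNet idea of concatenating left features with disparity-shifted right features; the feature maps themselves are assumed inputs from the shared-weight CNN branches:

import torch

def build_cost_volume(left_feat, right_feat, max_disp):
    """Concatenate left features with right features shifted by each candidate disparity.
    Inputs: (B, C, H, W) feature maps. Output: (B, 2C, max_disp, H, W) cost volume."""
    b, c, h, w = left_feat.shape
    volume = left_feat.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, :c, d] = left_feat
            volume[:, c:, d] = right_feat
        else:
            volume[:, :c, d, :, d:] = left_feat[:, :, :, d:]
            volume[:, c:, d, :, d:] = right_feat[:, :, :, :-d]
    return volume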

Perception Predictions

We initially split the data-set 60-40 into training and testing sets.

We implement the training using the PyTorch library

We train with the Adam optimizer and a learning rate of 0.01.

During the training process, the images and labels are randomly flipped horizontally and vertically for 20% of the data samples. The maximum disparity was set to 50 m.

→ We started from the pre-trained PSMNet model and trained with our training data-set for 100 epochs with a batch size of 10.
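
A sketch of this fine-tuning setup in PyTorch; only the hyper-parameters come from the text, while the dataset class, the PSMNet constructor, the checkpoint path, and the loss wiring are assumptions for illustration:

import torch
from torch.utils.data import DataLoader, random_split

dataset = WarehouseStereoDataset(flip_prob=0.2)        # assumed dataset applying random H/V flips to 20% of samples
n_train = int(0.6 * len(dataset))                      # 60-40 train/test split
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
loader = DataLoader(train_set, batch_size=10, shuffle=True)

model = PSMNet(maxdisp=50)                             # assumed constructor; maximum disparity of 50
model.load_state_dict(torch.load("psmnet_pretrained.pth"), strict=False)   # assumed checkpoint path
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):                               # 100 epochs of fine-tuning
    for left, right, depth_gt in loader:
        optimizer.zero_grad()
        disparity = model(left, right)
        loss = torch.nn.functional.smooth_l1_loss(disparity, depth_gt)
        loss.backward()
        optimizer.step()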

Latent Features

The latent space is a high-dimensional collection of numbers. These numbers define positions in the shape field (imagine like a frame in a video). 

Specific (groups of) numbers correspond to specific ‘parts’ of the output shape.

Reference: Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling @CSAIL https://arxiv.org/pdf/1610.07584.pdf .

Reference 2: https://www.computationalarchitecturelab.org/ (accessed dec 2022).

Designer-Centric

Designer-Centric Mapping of the Space Field

For a designer to understand, interpret, and control navigating a shape field (latent space), we need to design an interface between the designer and the AI. This interface needs to synergistically support the designer's creative process of association, exploration, and visualisation.

To remove any medium-induced cognitive loads, we design a simple 4 axis tangible interface and a supporting visualizer for the designer to ‘design’.

The process involves two actions:

  1. Selecting the Design Language (Associations)
  2. Searching variations of the selected design (Explorations)

Space Field

Control Interface

Control Interface: We design a 4-axis regularized interaction space that a user can control using a spacemouse (3dconnexion.com):  

Latent Space Mapping [Domain, Size] 

We then map the latent vectors from the space field as follows: 

Latent Space Control

Along each vector axis, an intensity factor is controllable as given below. The intensity factor determines how much influence each dimension has.

Intensity + Search = Enactive Design

Adjusting the intensities changes the relative-proportions of each feature. The search maintains this relative-proportion, but varies it periodically.

Putting it all together

For a standard vanilla GAN, we have a latent dimension of 64. The working of the input system for this latent space is illustrated below.
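
A sketch of how the four spacemouse axes and their intensity factors could drive a 64-dimensional latent vector; the block-wise grouping of dimensions and all identifier names are assumptions for illustration, not the project's implementation:

import numpy as np

LATENT_DIM = 64
N_AXES = 4
GROUP = LATENT_DIM // N_AXES     # assumed: each axis drives a block of 16 latent dimensions

def latent_from_input(axis_values, intensities):
    """axis_values and intensities are 4-vectors read from the spacemouse and the intensity controls."""
    z = np.zeros(LATENT_DIM)
    for i, (v, k) in enumerate(zip(axis_values, intensities)):
        z[i * GROUP:(i + 1) * GROUP] = v * k    # each axis fills its own group of dimensions
    return np.clip(z, -2.0, 2.0)                # keep within the generator's sampling range

# Example read: axis deflections and per-axis intensity factors.
z = latent_from_input([0.3, -0.8, 0.1, 0.5], [1.0, 0.5, 2.0, 1.0])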

Spatializing the Space Field

How can we have a spatial intuition to control the interface (e.g., to retrace our steps)?

Mapping + Recording Spatial Inputs

We also map and record the output mesh in the corresponding location on the axis of input. This is done in order to also understand spatially the trace of the design process; this is in case the designer wants to retrace steps, etc. 

Interface

Data Curation

Curating a dataset of architectural buildings.

buildingNET dataset

2,000 3D models of residential, religious, commercial, and government buildings.

SDF:
Signed-Distance Function

Neural implicits: using a simple MLP to encode 3D geometry.

There are of course the obvious suspects: meshes, point-clouds, NURBS, voxels, etc. However, there is another powerful way to represent 3D geometry: functions that define whether or not any point in the 3D space is inside or outside of the geometry described.

A function that takes in a point and outputs its signed distance to the surface of the geometry.
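
A minimal worked example of such a function, the analytic SDF of a sphere:

import numpy as np

def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Negative inside the sphere, zero on its surface, positive outside."""
    return np.linalg.norm(np.asarray(point) - np.asarray(center)) - radius

print(sphere_sdf((0.0, 0.0, 0.5)))   # -0.5 : half a unit inside the surface
print(sphere_sdf((0.0, 2.0, 0.0)))   #  1.0 : one unit outside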

Demo

Generative Structural Growth

Using physics engines to inflate geometry based on vectorized pressure and material deformation to pack within one another vertically while maintaining control over composition and program distribution.

The Space of All Possible Solutions

Plotting the generative compositions to qualitatively and quantitatively measure spatial composition and structural stability.

Beams with Embedded Ribs

Selecting segments of the vertical observation tower structure to understand material fabrication using non-linear design to digital tools processes.

First, a double-shell structure sandwiches a stress-capable metal outer shell that wraps around a GFRC interior envelope.

Finish Material as Cast Form Work

Second, a decorative GFRP panel wraps around a GFRC shell with custom material deposition along ribs that are generated in an FEM structural analysis.

Thickened Point Connectors

Last, a network of post-tensioned steel members is used to enable the cantilever of large surface spans. The web network of cables is hidden in between the two shells, and forces are directed to the ground through a solid compression column.

Orthographic Drawings

Training Datasets

We curated a data set of moments that could resemble possible memories, such as the trays at Gund Hall, identifying all four seasons, skylines in NYC, etc.

16 Channels OpenBCI EEG Electrode Cap


Image Weights Through Gaze Tracking

Project information coming soon.

Network Architecture

Project information coming soon.

Crypto-Scope

Project information coming soon.

Onsite Data

Project information coming soon.

Onsite Testing Data

Project information coming soon.

Study 3 | Beams with Embedded Ribs
Study 1 | Bi-directional Ribbed Undulations
Study 4 | Finish Material as Cast Form Work
Study 2 | Localized Swirls
Study 5 | Thickened Point Connectors
Study 3 | Bi-directional Ripples

Geometry Analysis

Designed Sub-Divisions for Guided Complexity

Investigating the relationship between embedded mathematical inputs and visualization structures to guide multi-dimensional behavior on rigid surface architecture.

Here, a dissected circular surface form is re-shaped into custom subdivisions to allow central forces to distort and transform its form into a pinch-like structure, with ribs as structural elements around the folded areas.

Guided Complexity:
Embedded Beams


Guided Material Deposition:
Part-to-part Arch

FEM Analysis for Localized GFRC Thicknesses

Surface undulations translated into a curved arch. Stress and strain analyses were performed. The resultant ideal thicknesses were then translated into ribbed textures with depths corresponding to the required forces.

Guided Material Deposition:
Structural Texture

Self-Supporting Alternating Panel Configuration

Scaling the system in Study 3 of double-shell assemblies on a double-curved surface with a sharp overhang.

Cantilevers

Embedded 2D Data into 3D Structures

A computational feedback loop between 2D patterns that are translated into vertical projections. The 2D patterns are generatively created by indicating factors such as density, thickness, and distance between lines. Those three main factors compute the vertical growth as well as the second layer of ribs that result from the fabric material qualities fed into the physics engine.

Fabrication Process

GFRC, multi-part molds, mold treatment, spray gun, compressed air, and safety tools.
