Publications in 2021
Here is a list of my published work in 2021 with my excellent collaborators.
Apoorva Beedu, Zhile Ren, Varun Agrawal, Irfan Essa
VideoPose: Estimating 6D object pose from videos Technical Report
2021.
Abstract | Links | BibTeX | Tags: arXiv, computer vision, object detection, pose estimation
@techreport{2021-Beedu-VEOPFV,
title = {VideoPose: Estimating 6D object pose from videos},
author = {Apoorva Beedu and Zhile Ren and Varun Agrawal and Irfan Essa},
url = {https://arxiv.org/abs/2111.10677},
doi = {10.48550/arXiv.2111.10677},
year = {2021},
date = {2021-11-01},
urldate = {2021-11-01},
journal = {arXiv preprint arXiv:2111.10677},
abstract = {We introduce a simple yet effective algorithm that uses convolutional neural networks to directly estimate object poses from videos. Our approach leverages the temporal information from a video sequence, and is computationally efficient and robust to support robotic and AR domains. Our proposed network takes a pre-trained 2D object detector as input, and aggregates visual features through a recurrent neural network to make predictions at each frame. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with state-of-the-art algorithms. Further, with a speed of 30 fps, it is also more efficient than the state-of-the-art, and therefore applicable to a variety of applications that require real-time object pose estimation.},
keywords = {arXiv, computer vision, object detection, pose estimation},
pubstate = {published},
tppubtype = {techreport}
}
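A minimal sketch (my own, not the authors' released code) of the recurrent aggregation idea described in the VideoPose abstract above: per-frame visual features from a pre-trained 2D detector are aggregated by a GRU, and a linear head regresses a per-frame 6D pose, here parameterized as a unit quaternion plus translation. The feature dimension and head layout are assumptions for illustration.

import torch
import torch.nn as nn

class RecurrentPoseEstimator(nn.Module):
    """Sketch: aggregate per-frame detector features with a GRU and
    regress a per-frame 6D pose (quaternion + translation)."""

    def __init__(self, feat_dim=256, hidden_dim=512):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.pose_head = nn.Linear(hidden_dim, 7)  # 4 quaternion + 3 translation

    def forward(self, frame_feats):
        # frame_feats: (batch, time, feat_dim), e.g. features cropped from a
        # pre-trained 2D detector around the detected object in each frame.
        h, _ = self.gru(frame_feats)
        out = self.pose_head(h)
        quat = nn.functional.normalize(out[..., :4], dim=-1)  # unit quaternion
        trans = out[..., 4:]
        return quat, trans

# Usage on dummy features for two 30-frame clips:
model = RecurrentPoseEstimator()
q, t = model(torch.randn(2, 30, 256))
print(q.shape, t.shape)  # torch.Size([2, 30, 4]) torch.Size([2, 30, 3])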
Tianhao Zhang, Hung-Yu Tseng, Lu Jiang, Weilong Yang, Honglak Lee, Irfan Essa
Text as Neural Operator: Image Manipulation by Text Instruction Proceedings Article
In: ACM International Conference on Multimedia (ACM-MM), ACM Press, 2021.
Abstract | Links | BibTeX | Tags: computer vision, generative media, google, multimedia
@inproceedings{2021-Zhang-TNOIMTI,
title = {Text as Neural Operator: Image Manipulation by Text Instruction},
author = {Tianhao Zhang and Hung-Yu Tseng and Lu Jiang and Weilong Yang and Honglak Lee and Irfan Essa},
url = {https://dl.acm.org/doi/10.1145/3474085.3475343
https://arxiv.org/abs/2008.04556},
doi = {10.1145/3474085.3475343},
year = {2021},
date = {2021-10-01},
urldate = {2021-10-01},
booktitle = {ACM International Conference on Multimedia (ACM-MM)},
publisher = {ACM Press},
abstract = {In recent years, text-guided image manipulation has gained increasing attention in the multimedia and computer vision community. The input to conditional image generation has evolved from image-only to multimodality. In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects. The inputs of the task are multimodal, including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image. We propose a GAN-based method to tackle this problem. The key idea is to treat text as neural operators to locally modify the image feature. We show that the proposed model performs favorably against recent strong baselines on three public datasets. Specifically, it generates images of greater fidelity and semantic relevance, and when used as an image query, leads to better retrieval performance.},
keywords = {computer vision, generative media, google, multimedia},
pubstate = {published},
tppubtype = {inproceedings}
}
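A loose sketch of the "text as neural operator" idea from the abstract above: a text-instruction embedding is mapped to channel-wise scale and shift parameters that modify the image features only inside a predicted spatial mask. This is an illustrative simplification under assumed dimensions, not the paper's exact architecture.

import torch
import torch.nn as nn

class TextOperator(nn.Module):
    """Sketch: use a text embedding to produce channel-wise scale/shift
    that edits image features only where a spatial mask is active."""

    def __init__(self, text_dim=128, feat_ch=64):
        super().__init__()
        self.to_scale_shift = nn.Linear(text_dim, 2 * feat_ch)
        self.to_mask = nn.Conv2d(feat_ch + text_dim, 1, kernel_size=1)

    def forward(self, img_feat, text_emb):
        # img_feat: (B, C, H, W); text_emb: (B, text_dim)
        B, C, H, W = img_feat.shape
        scale, shift = self.to_scale_shift(text_emb).chunk(2, dim=-1)
        scale = scale.view(B, C, 1, 1)
        shift = shift.view(B, C, 1, 1)
        text_map = text_emb.view(B, -1, 1, 1).expand(B, text_emb.shape[1], H, W)
        mask = torch.sigmoid(self.to_mask(torch.cat([img_feat, text_map], dim=1)))
        edited = scale * img_feat + shift               # text acts as an operator on features
        return mask * edited + (1 - mask) * img_feat    # modify only the masked region

op = TextOperator()
out = op(torch.randn(1, 64, 32, 32), torch.randn(1, 128))
print(out.shape)  # torch.Size([1, 64, 32, 32])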
Peggy Chi, Nathan Frey, Katrina Panovich, Irfan Essa
Automatic Instructional Video Creation from a Markdown-Formatted Tutorial Proceedings Article
In: ACM Symposium on User Interface Software and Technology (UIST), ACM Press, 2021.
Abstract | Links | BibTeX | Tags: google, human-computer interaction, UIST, video editing
@inproceedings{2021-Chi-AIVCFMT,
title = {Automatic Instructional Video Creation from a Markdown-Formatted Tutorial},
author = {Peggy Chi and Nathan Frey and Katrina Panovich and Irfan Essa},
url = {https://doi.org/10.1145/3472749.3474778
https://research.google/pubs/pub50745/
https://youtu.be/WmrZ7PUjyuM},
doi = {10.1145/3472749.3474778},
year = {2021},
date = {2021-10-01},
urldate = {2021-10-01},
booktitle = {ACM Symposium on User Interface Software and Technology (UIST)},
publisher = {ACM Press},
abstract = {We introduce HowToCut, an automatic approach that converts a Markdown-formatted tutorial into an interactive video that presents the visual instructions with a synthesized voiceover for narration. HowToCut extracts instructional content from a multimedia document that describes a step-by-step procedure. Our method selects and converts text instructions to a voiceover. It makes automatic editing decisions to align the narration with edited visual assets, including step images, videos, and text overlays. We derive our video editing strategies from an analysis of 125 web tutorials and apply Computer Vision techniques to the assets. To enable viewers to interactively navigate the tutorial, HowToCut's conversational UI presents instructions in multiple formats upon user commands. We evaluated our automatically-generated video tutorials through user studies (N=20) and validated the video quality via an online survey (N=93). The evaluation shows that our method was able to effectively create informative and useful instructional videos from a web tutorial document for both reviewing and following.},
keywords = {google, human-computer interaction, UIST, video editing},
pubstate = {published},
tppubtype = {inproceedings}
}
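A small illustrative sketch (not the HowToCut system itself) of the first stage the abstract describes: parsing a Markdown-formatted tutorial into ordered steps, each with the instruction text that would become the synthesized voiceover and any referenced visual asset. The Markdown conventions assumed here (numbered steps, inline image/video links) are my own simplification.

import re

def parse_markdown_tutorial(md_text):
    """Split a Markdown tutorial into steps: narration text plus visual assets."""
    steps = []
    for line in md_text.splitlines():
        line = line.strip()
        m = re.match(r"^\d+\.\s+(.*)", line)      # numbered step, e.g. "1. Whisk the eggs"
        if not m:
            continue
        body = m.group(1)
        assets = re.findall(r"!\[[^\]]*\]\(([^)]+)\)", body)        # ![alt](asset)
        narration = re.sub(r"!\[[^\]]*\]\([^)]+\)", "", body).strip()
        steps.append({"narration": narration, "assets": assets})
    return steps

tutorial = """
# Pancakes
1. Whisk the eggs and milk. ![whisking](whisk.jpg)
2. Fold in the flour.
3. Cook on a hot pan until golden. ![pan](pan.mp4)
"""
for i, step in enumerate(parse_markdown_tutorial(tutorial), 1):
    print(i, step["narration"], step["assets"])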
Karan Samel, Zelin Zhao, Binghong Chen, Shuang Li, Dharmashankar Subramanian, Irfan Essa, Le Song
Neural Temporal Logic Programming Technical Report
2021.
Abstract | Links | BibTeX | Tags: activity recognition, arXiv, machine learning, openreview
@techreport{2021-Samel-NTLP,
title = {Neural Temporal Logic Programming},
author = {Karan Samel and Zelin Zhao and Binghong Chen and Shuang Li and Dharmashankar Subramanian and Irfan Essa and Le Song},
url = {https://openreview.net/forum?id=i7h4M45tU8},
year = {2021},
date = {2021-09-01},
urldate = {2021-09-01},
abstract = {Events across a timeline are a common data representation, seen in different temporal modalities. Individual atomic events can occur in a certain temporal ordering to compose higher-level composite events. Examples of a composite event are a patient's medical symptom or a baseball player hitting a home run, caused by distinct temporal orderings of patient vitals and player movements, respectively. Such salient composite events are provided as labels in temporal datasets, and most works optimize models to predict these composite event labels directly. We focus on uncovering the underlying atomic events and their relations that lead to the composite events within a noisy temporal data setting. We propose Neural Temporal Logic Programming (Neural TLP), which first learns implicit temporal relations between atomic events and then lifts logic rules for composite events, given only the composite event labels for supervision. This is done by efficiently searching the combinatorial space of all temporal logic rules in an end-to-end differentiable manner. We evaluate our method on video and on healthcare data, where it outperforms the baseline methods for rule discovery.},
howpublished = {https://openreview.net/forum?id=i7h4M45tU8},
keywords = {activity recognition, arXiv, machine learning, openreview},
pubstate = {published},
tppubtype = {techreport}
}
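A hedged sketch of one ingredient the abstract describes: a differentiable "before" relation between predicted atomic-event times, which a rule-learning layer could combine into composite-event scores. The temperature, the soft conjunction, and the event-time parameterization below are assumptions for illustration, not the paper's formulation.

import torch

def soft_before(t_a, t_b, temperature=1.0):
    """Differentiable truth value of the temporal relation 'a before b'.
    t_a, t_b: predicted occurrence times (tensors of the same shape)."""
    return torch.sigmoid((t_b - t_a) / temperature)

def soft_and(*truth_values):
    """Soft conjunction (product t-norm) of fuzzy truth values."""
    out = truth_values[0]
    for v in truth_values[1:]:
        out = out * v
    return out

# Composite-event rule sketch:
# "symptom if (vital_spike before medication) and (medication before recovery)"
t_spike = torch.tensor([2.0])
t_med = torch.tensor([5.0])
t_recovery = torch.tensor([9.0])
score = soft_and(soft_before(t_spike, t_med), soft_before(t_med, t_recovery))
print(float(score))  # close to 1.0: the rule for the composite event is satisfied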
Nathan Frey, Peggy Chi, Weilong Yang, Irfan Essa
Automatic Style Transfer for Non-Linear Video Editing Proceedings Article
In: Proceedings of CVPR Workshop on AI for Content Creation (AICC), 2021.
Links | BibTeX | Tags: computational video, CVPR, google, video editing
@inproceedings{2021-Frey-ASTNVE,
title = {Automatic Style Transfer for Non-Linear Video Editing},
author = {Nathan Frey and Peggy Chi and Weilong Yang and Irfan Essa},
url = {https://arxiv.org/abs/2105.06988
https://research.google/pubs/pub50449/},
doi = {10.48550/arXiv.2105.06988},
year = {2021},
date = {2021-06-01},
urldate = {2021-06-01},
booktitle = {Proceedings of CVPR Workshop on AI for Content Creation (AICC)},
keywords = {computational video, CVPR, google, video editing},
pubstate = {published},
tppubtype = {inproceedings}
}
AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa
Unsupervised Discovery of Actions in Instructional Videos Proceedings Article
In: British Machine Vision Conference (BMVC), 2021.
Abstract | Links | BibTeX | Tags: activity recognition, computational video, computer vision, google
@inproceedings{2021-Piergiovanni-UDAIV,
title = {Unsupervised Discovery of Actions in Instructional Videos},
author = {AJ Piergiovanni and Anelia Angelova and Michael S. Ryoo and Irfan Essa},
url = {https://arxiv.org/abs/2106.14733
https://www.bmvc2021-virtualconference.com/assets/papers/0773.pdf},
doi = {10.48550/arXiv.2106.14733},
year = {2021},
date = {2021-06-01},
urldate = {2021-06-01},
booktitle = {British Machine Vision Conference (BMVC)},
number = {arXiv:2106.14733},
abstract = {In this paper, we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos. Instructional videos contain complex activities and are a rich source of information for intelligent agents, such as autonomous robots or virtual assistants, which can, for example, automatically `read' the steps from an instructional video and execute them. However, videos are rarely annotated with atomic activities, their boundaries, or duration. We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos. We propose a sequential stochastic autoregressive model for temporal segmentation of videos, which learns to represent and discover the sequential relationship between different atomic actions of the task, and which provides automatic and unsupervised self-labeling for videos. Our approach outperforms the state-of-the-art unsupervised methods with large margins. We will open source the code.
},
keywords = {activity recognition, computational video, computer vision, google},
pubstate = {published},
tppubtype = {inproceedings}
}
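A loose sketch of the kind of autoregressive temporal-segmentation model the abstract mentions: a recurrent model reads frame features and emits, per frame, a distribution over K latent atomic actions conditioned on the previous frame's action distribution; the per-frame argmax then gives an unsupervised segmentation. K, the dimensions, and the conditioning scheme are assumptions, not the published model.

import torch
import torch.nn as nn

class AutoregressiveSegmenter(nn.Module):
    """Sketch: per-frame distribution over K atomic actions, conditioned on
    frame features and the previous frame's action distribution."""

    def __init__(self, feat_dim=128, hidden_dim=256, num_actions=8):
        super().__init__()
        self.num_actions = num_actions
        self.gru = nn.GRUCell(feat_dim + num_actions, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, frames):
        # frames: (batch, time, feat_dim) per-frame video features
        B, T, _ = frames.shape
        h = frames.new_zeros(B, self.gru.hidden_size)
        prev = frames.new_zeros(B, self.num_actions)
        probs = []
        for t in range(T):
            h = self.gru(torch.cat([frames[:, t], prev], dim=-1), h)
            prev = torch.softmax(self.head(h), dim=-1)
            probs.append(prev)
        return torch.stack(probs, dim=1)  # (batch, time, num_actions)

seg = AutoregressiveSegmenter()
p = seg(torch.randn(1, 50, 128))
print(p.argmax(dim=-1))  # unsupervised per-frame atomic-action labels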
Harish Haresamudram, Irfan Essa, Thomas Ploetz
Contrastive Predictive Coding for Human Activity Recognition Journal Article
In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 5, no. 2, pp. 1–26, 2021.
Abstract | Links | BibTeX | Tags: activity recognition, IMWUT, machine learning, ubiquitous computing
@article{2021-Haresamudram-CPCHAR,
title = {Contrastive Predictive Coding for Human Activity Recognition},
author = {Harish Haresamudram and Irfan Essa and Thomas Ploetz},
url = {https://doi.org/10.1145/3463506
https://arxiv.org/abs/2012.05333},
doi = {10.1145/3463506},
year = {2021},
date = {2021-06-01},
urldate = {2021-06-01},
journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies},
volume = {5},
number = {2},
pages = {1--26},
abstract = {Feature extraction is crucial for human activity recognition (HAR) using body-worn movement sensors. Recently, learned representations have been used successfully, offering promising alternatives to manually engineered features. Our work focuses on effective use of small amounts of labeled data and the opportunistic exploitation of unlabeled data that are straightforward to collect in mobile and ubiquitous computing scenarios. We hypothesize and demonstrate that explicitly considering the temporality of sensor data at representation level plays an important role for effective HAR in challenging scenarios. We introduce the Contrastive Predictive Coding (CPC) framework to human activity recognition, which captures the long-term temporal structure of sensor data streams. Through a range of experimental evaluations on real-life recognition tasks, we demonstrate its effectiveness for improved HAR. CPC-based pre-training is self-supervised, and the resulting learned representations can be integrated into standard activity chains. It leads to significantly improved recognition performance when only small amounts of labeled training data are available, thereby demonstrating the practical value of our approach.},
keywords = {activity recognition, IMWUT, machine learning, ubiquitous computing},
pubstate = {published},
tppubtype = {article}
}
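A compact sketch of the Contrastive Predictive Coding setup the abstract describes, adapted to body-worn sensor windows: a 1D convolutional encoder produces per-timestep latents, a GRU summarizes the past into a context vector, and a linear predictor is trained with an InfoNCE-style loss to pick the true future latent among in-batch negatives. The kernel size, dimensions, and single-step prediction horizon are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SensorCPC(nn.Module):
    def __init__(self, in_ch=3, z_dim=64, c_dim=128):
        super().__init__()
        self.encoder = nn.Conv1d(in_ch, z_dim, kernel_size=5, padding=2)  # per-step latents
        self.context = nn.GRU(z_dim, c_dim, batch_first=True)
        self.predict = nn.Linear(c_dim, z_dim)  # predict the next latent step

    def infonce_loss(self, x):
        # x: (batch, channels, time) raw accelerometer/gyroscope windows
        z = self.encoder(x).transpose(1, 2)   # (B, T, z_dim)
        c, _ = self.context(z[:, :-1])        # context over steps 1..T-1
        pred = self.predict(c[:, -1])         # predicted latent at step T
        target = z[:, -1]                     # true future latent
        logits = pred @ target.t()            # (B, B): other rows act as negatives
        labels = torch.arange(x.size(0), device=x.device)
        return F.cross_entropy(logits, labels)

model = SensorCPC()
loss = model.infonce_loss(torch.randn(16, 3, 100))  # 16 windows, 3 axes, 100 samples
loss.backward()
print(float(loss))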
Anh Truong, Peggy Chi, David Salesin, Irfan Essa, Maneesh Agrawala
Automatic Generation of Two-Level Hierarchical Tutorials from Instructional Makeup Videos Proceedings Article
In: ACM CHI Conference on Human Factors in Computing Systems, 2021.
Abstract | Links | BibTeX | Tags: CHI, computational video, google, human-computer interaction, video summarization
@inproceedings{2021-Truong-AGTHTFIMV,
title = {Automatic Generation of Two-Level Hierarchical Tutorials from Instructional Makeup Videos},
author = {Anh Truong and Peggy Chi and David Salesin and Irfan Essa and Maneesh Agrawala},
url = {https://dl.acm.org/doi/10.1145/3411764.3445721
https://research.google/pubs/pub50007/
http://anhtruong.org/makeup_breakdown/},
doi = {10.1145/3411764.3445721},
year = {2021},
date = {2021-05-01},
urldate = {2021-05-01},
booktitle = {ACM CHI Conference on Human Factors in Computing Systems},
abstract = {We present a multi-modal approach for automatically generating hierarchical tutorials from instructional makeup videos. Our approach is inspired by prior research in cognitive psychology, which suggests that people mentally segment procedural tasks into event hierarchies, where coarse-grained events focus on objects while fine-grained events focus on actions. In the instructional makeup domain, we find that objects correspond to facial parts while fine-grained steps correspond to actions on those facial parts. Given an input instructional makeup video, we apply a set of heuristics that combine computer vision techniques with transcript text analysis to automatically identify the fine-level action steps and group these steps by facial part to form the coarse-level events. We provide a voice-enabled, mixed-media UI to visualize the resulting hierarchy and allow users to efficiently navigate the tutorial (e.g., skip ahead, return to previous steps) at their own pace. Users can navigate the hierarchy at both the facial-part and action-step levels using click-based interactions and voice commands. We demonstrate the effectiveness of segmentation algorithms and the resulting mixed-media UI on a variety of input makeup videos. A user study shows that users prefer following instructional makeup videos in our mixed-media format to the standard video UI and that they find our format much easier to navigate.},
keywords = {CHI, computational video, google, human-computer interaction, video summarization},
pubstate = {published},
tppubtype = {inproceedings}
}
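A tiny illustrative sketch of the grouping step the abstract describes: fine-grained action steps (already extracted from the transcript) are grouped by facial part to form the coarse-level events of the two-level hierarchy. The keyword lists and input format are assumptions for illustration, not the paper's heuristics.

FACIAL_PART_KEYWORDS = {
    "eyes": ["eye", "eyelid", "lash", "brow"],
    "lips": ["lip", "mouth"],
    "cheeks": ["cheek", "blush"],
    "skin": ["face", "skin", "foundation", "concealer"],
}

def group_steps_by_facial_part(action_steps):
    """action_steps: list of (start_sec, end_sec, transcript_text).
    Returns {facial_part: [steps]} forming the coarse level of the hierarchy."""
    hierarchy = {}
    for start, end, text in action_steps:
        lowered = text.lower()
        part = next((p for p, kws in FACIAL_PART_KEYWORDS.items()
                     if any(k in lowered for k in kws)), "other")
        hierarchy.setdefault(part, []).append((start, end, text))
    return hierarchy

steps = [
    (12.0, 25.0, "Apply foundation evenly across the face"),
    (40.0, 55.0, "Blend the eyeshadow into the crease"),
    (70.0, 82.0, "Line the lips before adding lipstick"),
]
for part, grouped in group_steps_by_facial_part(steps).items():
    print(part, [s[2] for s in grouped])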
Dan Scarafoni, Irfan Essa, Thomas Ploetz
PLAN-B: Predicting Likely Alternative Next Best Sequences for Action Prediction Technical Report
no. arXiv:2103.15987, 2021.
Abstract | Links | BibTeX | Tags: activity recognition, arXiv, computer vision
@techreport{2021-Scarafoni-PPLANBSAP,
title = {PLAN-B: Predicting Likely Alternative Next Best Sequences for Action Prediction},
author = {Dan Scarafoni and Irfan Essa and Thomas Ploetz},
url = {https://arxiv.org/abs/2103.15987},
doi = {10.48550/arXiv.2103.15987},
year = {2021},
date = {2021-03-01},
urldate = {2021-03-01},
journal = {arXiv},
number = {arXiv:2103.15987},
abstract = {Action prediction focuses on anticipating actions before they happen. Recent works leverage probabilistic approaches to describe future uncertainties and sample future actions. However, these methods cannot easily find all alternative predictions, which are essential given the inherent unpredictability of the future, and current evaluation protocols do not measure a system's ability to find such alternatives. We re-examine action prediction in terms of its ability to predict not only the top predictions, but also top alternatives with the accuracy@k metric. In addition, we propose Choice F1: a metric inspired by F1 score which evaluates a prediction system's ability to find all plausible futures while keeping only the most probable ones. To evaluate this problem, we present a novel method, Predicting the Likely Alternative Next Best, or PLAN-B, for action prediction which automatically finds the set of most likely alternative futures. PLAN-B consists of two novel components: (i) a Choice Table which ensures that all possible futures are found, and (ii) a "Collaborative" RNN system which combines both action sequence and feature information. We demonstrate that our system outperforms state-of-the-art results on benchmark datasets.
},
keywords = {activity recognition, arXiv, computer vision},
pubstate = {published},
tppubtype = {techreport}
}
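A short sketch of the evaluation idea in the abstract: accuracy@k over ranked next-action predictions, plus an F1-style score that balances covering all plausible future actions against predicting too many. The exact definition of Choice F1 is the paper's; the set-based variant below is an assumed simplification for illustration only.

def accuracy_at_k(ranked_predictions, ground_truth, k):
    """Fraction of samples whose true next action appears in the top-k predictions."""
    hits = sum(1 for preds, gt in zip(ranked_predictions, ground_truth) if gt in preds[:k])
    return hits / len(ground_truth)

def plausible_set_f1(predicted_set, plausible_set):
    """Assumed simplification of a Choice-F1-style score: harmonic mean of
    precision and recall between predicted alternatives and plausible futures."""
    predicted_set, plausible_set = set(predicted_set), set(plausible_set)
    if not predicted_set or not plausible_set:
        return 0.0
    tp = len(predicted_set & plausible_set)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_set)
    recall = tp / len(plausible_set)
    return 2 * precision * recall / (precision + recall)

ranked = [["pour", "stir", "cut"], ["cut", "peel", "wash"]]
truth = ["stir", "wash"]
print(accuracy_at_k(ranked, truth, k=2))                       # 0.5
print(plausible_set_f1(["pour", "stir"], ["stir", "cut", "mix"]))  # 0.4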
Vincent Cartillier, Zhile Ren, Neha Jain, Stefan Lee, Irfan Essa, Dhruv Batra
Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views Proceedings Article
In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), AAAI, 2021.
Abstract | Links | BibTeX | Tags: AAAI, AI, embodied agents, first-person vision
@inproceedings{2021-Cartillier-SMBASRFEV,
title = {Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views},
author = {Vincent Cartillier and Zhile Ren and Neha Jain and Stefan Lee and Irfan Essa and Dhruv Batra},
url = {https://arxiv.org/abs/2010.01191
https://vincentcartillier.github.io/smnet.html
https://ojs.aaai.org/index.php/AAAI/article/view/16180/15987},
doi = {10.48550/arXiv.2010.01191},
year = {2021},
date = {2021-02-01},
urldate = {2021-02-01},
booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
publisher = {AAAI},
abstract = {We study the task of semantic mapping -- specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment and asked to build an allocentric top-down semantic map (`what is where?') from egocentric observations of an RGB-D camera with known pose (via localization sensors). Importantly, our goal is to build neural episodic memories and spatio-semantic representations of 3D spaces that enable the agent to easily learn subsequent tasks in the same space -- navigating to objects seen during the tour (`Find chair') or answering questions about the space (`How many chairs did you see in the house?').
Towards this goal, we present Semantic MapNet (SMNet), which consists of: (1) an Egocentric
Visual Encoder that encodes each egocentric RGB-D frame, (2) a Feature Projector that projects egocentric features to appropriate locations on a floor-plan, (3) a Spatial Memory Tensor of size floor-plan length × width × feature-dims that learns to accumulate projected egocentric features, and (4) a Map Decoder that uses the memory tensor to produce semantic top-down maps. SMNet combines the strengths of (known) projective camera geometry and neural representation learning. On the task of semantic mapping in the Matterport3D dataset, SMNet significantly outperforms competitive baselines by 4.01-16.81% (absolute) on mean-IoU and 3.81-19.69% (absolute) on Boundary-F1 metrics. Moreover, we show how to use the spatio-semantic allocentric representations built by SMNet for the task of ObjectNav and Embodied Question Answering.},
keywords = {AAAI, AI, embodied agents, first-person vision},
pubstate = {published},
tppubtype = {inproceedings}
}
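A simplified numpy sketch of the projection-and-accumulation step in the SMNet abstract: per-pixel 3D world coordinates (obtained from depth and the known camera pose) carry egocentric features, which are scattered into a top-down spatial memory grid. Here per-cell max pooling stands in for the learned accumulation; the grid size, cell resolution, and pooling rule are assumptions for illustration.

import numpy as np

def project_to_topdown(world_xyz, features, grid_size=128, cell_m=0.1):
    """world_xyz: (N, 3) world coordinates of egocentric pixels (already posed).
    features: (N, D) features for those pixels.
    Returns a (grid_size, grid_size, D) top-down memory via per-cell max pooling."""
    D = features.shape[1]
    memory = np.zeros((grid_size, grid_size, D), dtype=features.dtype)
    # Map x/z world coordinates to grid cells centered on the map.
    cols = np.clip((world_xyz[:, 0] / cell_m + grid_size // 2).astype(int), 0, grid_size - 1)
    rows = np.clip((world_xyz[:, 2] / cell_m + grid_size // 2).astype(int), 0, grid_size - 1)
    for r, c, f in zip(rows, cols, features):
        memory[r, c] = np.maximum(memory[r, c], f)   # accumulate projected features
    return memory

pts = np.random.uniform(-5, 5, size=(1000, 3)).astype(np.float32)
feats = np.random.rand(1000, 16).astype(np.float32)
topdown = project_to_topdown(pts, feats)
print(topdown.shape)  # (128, 128, 16)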