A searchable list of some of my publications is below. You can also access my publications from the following sites.
My ORCID is https://orcid.org/0000-0002-6236-2969
Publications:
Gong Zhang, Kihyuk Sohn, Meera Hahn, Humphrey Shi, Irfan Essa
FineStyle: Fine-grained Controllable Style Personalization for Text-to-image Models Proceedings Article
In: Advances in Neural Information Processing Systems (NeurIPS), 2024.
Abstract | Links | BibTeX | Tags: computer vision, generative AI, generative media, machine learning, NeurIPS
@inproceedings{2024-Zhang-FFCSPTM,
title = {FineStyle: Fine-grained Controllable Style Personalization for Text-to-image Models},
author = {Gong Zhang and Kihyuk Sohn and Meera Hahn and Humphrey Shi and Irfan Essa},
url = {https://neurips.cc/virtual/2024/poster/96863
https://openreview.net/forum?id=1SmXUGzrH8},
year = {2024},
date = {2024-12-11},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
abstract = {Few-shot fine-tuning of text-to-image (T2I) generation models enables people to create unique images in their own style using natural languages without requiring extensive prompt engineering. However, fine-tuning with only a handful, as little as one, of image-text paired data prevents fine-grained control of style attributes at generation. In this paper, we present FineStyle, a few-shot fine-tuning method that allows enhanced controllability for style personalized text-to-image generation. To overcome the lack of training data for fine-tuning, we propose a novel concept-oriented data scaling that amplifies the number of image-text pair, each of which focuses on different concepts (e.g., objects) in the style reference image. We also identify the benefit of parameter-efficient adapter tuning of key and value kernels of cross-attention layers. Extensive experiments show the effectiveness of FineStyle at following fine-grained text prompts and delivering visual quality faithful to the specified style, measured by CLIP scores and human raters.
},
keywords = {computer vision, generative AI, generative media, machine learning, NeurIPS},
pubstate = {published},
tppubtype = {inproceedings}
}
Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama
Photorealistic Video Generation with Diffusion Models Proceedings Article
In: European Conference on Computer Vision (ECCV), 2024.
Abstract | Links | BibTeX | Tags: arXiv, computational video, computer vision, generative AI, google
@inproceedings{2024-Gupta-PVGWDM,
title = {Photorealistic Video Generation with Diffusion Models},
author = {Agrim Gupta and Lijun Yu and Kihyuk Sohn and Xiuye Gu and Meera Hahn and Li Fei-Fei and Irfan Essa and Lu Jiang and José Lezama
},
url = {https://walt-video-diffusion.github.io/
https://arxiv.org/abs/2312.06662
https://arxiv.org/pdf/2312.06662
},
doi = {10.48550/arXiv.2312.06662},
year = {2024},
date = {2024-07-25},
urldate = {2024-07-25},
booktitle = {European Conference on Computer Vision (ECCV)},
abstract = {We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation consisting of a base latent video diffusion model, and two video super-resolution diffusion models to generate videos of 512×896 resolution at 8 frames per second.},
keywords = {arXiv, computational video, computer vision, generative AI, google},
pubstate = {published},
tppubtype = {inproceedings}
}
Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, Gang Li, Sangpil Kim, Irfan Essa, Feng Yang
Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation Proceedings Article
In: Proceedings of European Conference on Computer Vision (ECCV), 2024.
Abstract | Links | BibTeX | Tags: arXiv, computer vision, ECCV, generative AI, google, reinforcement learning
@inproceedings{2024-Lee-PPMRLFTG,
title = {Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation},
author = {Seung Hyun Lee and Yinxiao Li and Junjie Ke and Innfarn Yoo and Han Zhang and Jiahui Yu and Qifei Wang and Fei Deng and Glenn Entis and Junfeng He and Gang Li and Sangpil Kim and Irfan Essa and Feng Yang
},
url = {https://arxiv.org/abs/2401.05675
https://arxiv.org/pdf/2401.05675
https://dl.acm.org/doi/10.1007/978-3-031-72920-1_26},
doi = {10.48550/arXiv.2401.05675},
year = {2024},
date = {2024-07-25},
urldate = {2024-07-25},
booktitle = {Proceedings of European Conference on Computer Vision (ECCV)
},
abstract = {Recent works have demonstrated that using reinforcement learning (RL) with multiple quality rewards can improve the quality of generated images in text-to-image (T2I) generation. However, manually adjusting reward weights poses challenges and may cause over-optimization in certain metrics. To solve this, we propose Parrot, which addresses the issue through multi-objective optimization and introduces an effective multi-reward optimization strategy to approximate Pareto optimal. Utilizing batch-wise Pareto optimal selection, Parrot automatically identifies the optimal trade-off among different rewards. We use the novel multi-reward optimization algorithm to jointly optimize the T2I model and a prompt expansion network, resulting in significant improvement of image quality and also allow to control the trade-off of different rewards using a reward related prompt during inference. Furthermore, we introduce original prompt-centered guidance at inference time, ensuring fidelity to user input after prompt expansion. Extensive experiments and a user study validate the superiority of Parrot over several baselines across various quality criteria, including aesthetics, human preference, text-image alignment, and image sentiment.
},
keywords = {arXiv, computer vision, ECCV, generative AI, google, reinforcement learning},
pubstate = {published},
tppubtype = {inproceedings}
}
Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, Lu Jiang
VideoPoet: A large language model for zero-shot video generation Best Paper Proceedings Article
In: Proceedings of International Conference on Machine Learning (ICML), 2024.
Abstract | Links | BibTeX | Tags: arXiv, best paper award, computational video, computer vision, generative AI, google, ICML
@inproceedings{2024-Kondratyuk-VLLMZVG,
title = {VideoPoet: A large language model for zero-shot video generation},
author = {Dan Kondratyuk and Lijun Yu and Xiuye Gu and José Lezama and Jonathan Huang and Grant Schindler and Rachel Hornung and Vighnesh Birodkar and Jimmy Yan and Ming-Chang Chiu and Krishna Somandepalli and Hassan Akbari and Yair Alon and Yong Cheng and Josh Dillon and Agrim Gupta and Meera Hahn and Anja Hauth and David Hendon and Alonso Martinez and David Minnen and Mikhail Sirotenko and Kihyuk Sohn and Xuan Yang and Hartwig Adam and Ming-Hsuan Yang and Irfan Essa and Huisheng Wang and David A. Ross and Bryan Seybold and Lu Jiang
},
url = {https://arxiv.org/pdf/2312.14125
https://arxiv.org/abs/2312.14125
https://sites.research.google/videopoet/},
doi = {10.48550/arXiv.2312.14125},
year = {2024},
date = {2024-07-23},
urldate = {2024-07-23},
booktitle = {Proceedings of International Conference on Machine Learning (ICML)},
abstract = {We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
},
keywords = {arXiv, best paper award, computational video, computer vision, generative AI, google, ICML},
pubstate = {published},
tppubtype = {inproceedings}
}
Xingqian Xu, Jiayi Guo, Zhangyang Wang, Gao Huang, Irfan Essa, Humphrey Shi
Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models Proceedings Article
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8682–8692, 2024.
Abstract | Links | BibTeX | Tags: arXiv, computer vision, CVPR, generative AI
@inproceedings{2024-Xu-PDTTTDM,
title = {Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models},
author = {Xingqian Xu and Jiayi Guo and Zhangyang Wang and Gao Huang and Irfan Essa and Humphrey Shi
},
url = {https://openaccess.thecvf.com/content/CVPR2024/papers/Xu_Prompt-Free_Diffusion_Taking_Text_out_of_Text-to-Image_Diffusion_Models_CVPR_2024_paper.pdf
https://openaccess.thecvf.com/content/CVPR2024/html/Xu_Prompt-Free_Diffusion_Taking_Text_out_of_Text-to-Image_Diffusion_Models_CVPR_2024_paper.html
https://arxiv.org/abs/2305.16223
},
doi = {10.48550/arXiv.2305.16223},
year = {2024},
date = {2024-06-18},
urldate = {2024-06-18},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
},
pages = {8682--8692},
abstract = {Text-to-image (T2I) research has grown explosively in the past year, owing to the large-scale pre-trained diffusion models and many emerging personalization and editing approaches. Yet one pain point persists: the text prompt engineering, and searching high-quality text prompts for customized results is more art than science. Moreover, as commonly argued: "an image is worth a thousand words" - the attempt to describe a desired image with texts often ends up being ambiguous and cannot comprehensively cover delicate visual details, hence necessitating more additional controls from the visual domain. In this paper, we take a bold step forward: taking "Text" out of a pretrained T2I diffusion model to reduce the burdensome prompt engineering efforts for users. Our proposed framework, Prompt-Free Diffusion, relies on only visual inputs to generate new images: it takes a reference image as "context", an optional image structural conditioning, and an initial noise, with absolutely no text prompt. The core architecture behind the scene is Semantic Context Encoder (SeeCoder), substituting the commonly used CLIP-based or LLM-based text encoder. The reusability of SeeCoder also makes it a convenient drop-in component: one can also pre-train a SeeCoder in one T2I model and reuse it for another. Through extensive experiments, Prompt-Free Diffusion is experimentally found to (i) outperform prior exemplar-based image synthesis approaches; (ii) perform on par with state-of-the-art T2I models using prompts following the best practice; and (iii) be naturally extensible to other downstream applications such as anime figure generation and virtual try-on, with promising quality. Our code and models will be open-sourced.
},
keywords = {arXiv, computer vision, CVPR, generative AI},
pubstate = {published},
tppubtype = {inproceedings}
}
Harish Haresamudram, Irfan Essa, Thomas Plötz
A Washing Machine is All You Need? On the Feasibility of Machine Data for Self-Supervised Human Activity Recognition Proceedings Article
In: International Conference on Activity and Behavior Computing (ABC) 2024, 2024.
Abstract | Links | BibTeX | Tags: activity recognition, behavioral imaging, wearable computing
@inproceedings{2024-Haresamudram-WMNFMDSHAR,
title = {A Washing Machine is All You Need? On the Feasibility of Machine Data for Self-Supervised Human Activity Recognition},
author = {Harish Haresamudram and Irfan Essa and Thomas Plötz
},
url = {https://ieeexplore.ieee.org/abstract/document/10651688},
doi = {10.1109/ABC61795.2024.10651688},
year = {2024},
date = {2024-05-24},
booktitle = {International Conference on Activity and Behavior Computing (ABC) 2024},
abstract = {Learning representations via self-supervision has emerged as a powerful framework for deriving features for automatically recognizing activities using wearables. The current de-facto protocol involves performing pre-training on (large-scale) data recorded from human participants. This requires effort as recruiting participants and subsequently collecting data is both expensive and time-consuming. In this paper, we investigate the feasibility of an alternate source of data for its suitability to lead to useful representations, one that requires substantially lower effort for data collection. Specifically, we examine whether data collected by affixing sensors on running machinery, i.e., recording non-human movements/vibrations can also be utilized for self-supervised human activity recognition. We perform an extensive evaluation of utilizing data collected on a washing machine as the source and observe that state-of-the-art methods perform surprisingly well relative to when utilizing large-scale human movement data, obtaining within 5-6% F1-score on some target datasets, and exceeding on others. In scenarios with limited access to annotations, models trained on the washing-machine data perform comparably or better than end-to-end training, thereby indicating their feasibility and potential for recognizing activities. These results are significant and promising because they have the potential to substantially lower the efforts necessary for deriving effective wearables-based human activity recognition systems.
},
keywords = {activity recognition, behavioral imaging, wearable computing},
pubstate = {published},
tppubtype = {inproceedings}
}
Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation Proceedings Article
In: Proceedings of International Conference on Learning Representations (ICLR), 2024.
Abstract | Links | BibTeX | Tags: AI, arXiv, computer vision, generative AI, google, ICLR
@inproceedings{2024-Yu-LMBDVG,
title = {Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation},
author = {Lijun Yu and José Lezama and Nitesh B. Gundavarapu and Luca Versari and Kihyuk Sohn and David Minnen and Yong Cheng and Vighnesh Birodkar and Agrim Gupta and Xiuye Gu and Alexander G. Hauptmann and Boqing Gong and Ming-Hsuan Yang and Irfan Essa and David A. Ross and Lu Jiang},
url = {https://arxiv.org/abs/2310.05737
https://arxiv.org/pdf/2310.05737},
doi = {10.48550/arXiv.2310.05737},
year = {2024},
date = {2024-05-14},
urldate = {2024-05-14},
booktitle = {Proceedings of International Conference on Learning Representations (ICLR)
},
abstract = {While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VCC) according to human evaluations, and (2) learning effective representations for action recognition tasks.
},
keywords = {AI, arXiv, computer vision, generative AI, google, ICLR},
pubstate = {published},
tppubtype = {inproceedings}
}
Harish Haresamudram, Irfan Essa, Thomas Ploetz
Towards Learning Discrete Representations via Self-Supervision for Wearables-Based Human Activity Recognition Journal Article
In: Sensors, vol. 24, no. 4, 2024.
Abstract | Links | BibTeX | Tags: activity recognition, arXiv, wearable computing
@article{2023-Haresamudram-TLDRSWHAR,
title = {Towards Learning Discrete Representations via Self-Supervision for Wearables-Based Human Activity Recognition},
author = {Harish Haresamudram and Irfan Essa and Thomas Ploetz},
url = {https://arxiv.org/abs/2306.01108
https://www.mdpi.com/1424-8220/24/4/1238},
doi = {10.48550/arXiv.2306.01108},
year = {2024},
date = {2024-02-24},
urldate = {2023-06-01},
journal = {Sensors},
volume = {24},
number = {4},
abstract = {Human activity recognition (HAR) in wearable computing is typically based on direct processing of sensor data. Sensor readings are translated into representations, either derived through dedicated preprocessing, or integrated into end-to-end learning. Independent of their origin, for the vast majority of contemporary HAR, those representations are typically continuous in nature. That has not always been the case. In the early days of HAR, discretization approaches have been explored - primarily motivated by the desire to minimize computational requirements, but also with a view on applications beyond mere recognition, such as, activity discovery, fingerprinting, or large-scale search. Those traditional discretization approaches, however, suffer from substantial loss in precision and resolution in the resulting representations with detrimental effects on downstream tasks. Times have changed and in this paper we propose a return to discretized representations. We adopt and apply recent advancements in Vector Quantization (VQ) to wearables applications, which enables us to directly learn a mapping between short spans of sensor data and a codebook of vectors, resulting in recognition performance that is generally on par with their contemporary, continuous counterparts - sometimes surpassing them. Therefore, this work presents a proof-of-concept for demonstrating how effective discrete representations can be derived, enabling applications beyond mere activity classification but also opening up the field to advanced tools for the analysis of symbolic sequences, as they are known, for example, from domains such as natural language processing. Based on an extensive experimental evaluation on a suite of wearables-based benchmark HAR tasks, we demonstrate the potential of our learned discretization scheme and discuss how discretized sensor data analysis can lead to substantial changes in HAR.},
howpublished = {arXiv:2306.01108},
keywords = {activity recognition, arXiv, wearable computing},
pubstate = {published},
tppubtype = {article}
}
Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang
SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs Proceedings Article
In: Advances in Neural Information Processing Systems (NeurIPS), 2023.
Abstract | Links | BibTeX | Tags: arXiv, computational video, computer vision, generative AI, NeurIPS
@inproceedings{2023-Yu-SSPAMGWFL,
title = {SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs},
author = {Lijun Yu and Yong Cheng and Zhiruo Wang and Vivek Kumar and Wolfgang Macherey and Yanping Huang and David A. Ross and Irfan Essa and Yonatan Bisk and Ming-Hsuan Yang and Kevin Murphy and Alexander G. Hauptmann and Lu Jiang},
url = {https://arxiv.org/abs/2306.17842
https://openreview.net/forum?id=CXPUg86A1D
https://proceedings.neurips.cc/paper_files/paper/2023/hash/a526cc8f6ffb74bedb6ff313e3fdb450-Abstract-Conference.html},
doi = {10.48550/arXiv.2306.17842},
year = {2023},
date = {2023-12-11},
urldate = {2023-12-11},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
abstract = {In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.},
howpublished = {Advances in Neural Information Processing Systems (NeurIPS) (arXiv:2306.17842v2)},
keywords = {arXiv, computational video, computer vision, generative AI, NeurIPS},
pubstate = {published},
tppubtype = {inproceedings}
}
Kihyuk Sohn, Nataniel Ruiz, Kimin Lee, Daniel Castro Chin, Irina Blok, Huiwen Chang, Jarred Barber, Lu Jiang, Glenn Entis, Yuanzhen Li, Yuan Hao, Irfan Essa, Michael Rubinstein, Dilip Krishnan
StyleDrop: Text-to-Image Generation in Any Style Proceedings Article
In: Advances in Neural Information Processing Systems (NeurIPS), 2023.
Abstract | Links | BibTeX | Tags: arXiv, computer vision, generative AI, google, NeurIPS
@inproceedings{2023-Sohn-STGS,
title = {StyleDrop: Text-to-Image Generation in Any Style},
author = {Kihyuk Sohn and Nataniel Ruiz and Kimin Lee and Daniel Castro Chin and Irina Blok and Huiwen Chang and Jarred Barber and Lu Jiang and Glenn Entis and Yuanzhen Li and Yuan Hao and Irfan Essa and Michael Rubinstein and Dilip Krishnan},
url = {https://arxiv.org/abs/2306.00983
https://openreview.net/forum?id=KoaFh16uOc
https://proceedings.neurips.cc/paper_files/paper/2023/hash/d33b177b69425e7685b0b1c05bd2a5e4-Abstract-Conference.html},
doi = {10.48550/arXiv.2306.00983},
year = {2023},
date = {2023-12-11},
urldate = {2023-12-11},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
abstract = {Pre-trained large text-to-image models synthesize impressive images with an appropriate use of text prompts. However, ambiguities inherent in natural language and out-of-distribution effects make it hard to synthesize image styles, that leverage a specific design pattern, texture or material. In this paper, we introduce StyleDrop, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. The proposed method is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. It efficiently learns a new style by fine-tuning very few trainable parameters (less than 1% of total model parameters) and improving the quality via iterative training with either human or automated feedback. Better yet, StyleDrop is able to deliver impressive results even when the user supplies only a single image that specifies the desired style. An extensive study shows that, for the task of style tuning text-to-image models, StyleDrop implemented on Muse convincingly outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion. More results are available at our project website.},
howpublished = {arXiv:2306.00983},
keywords = {arXiv, computer vision, generative AI, google, NeurIPS},
pubstate = {published},
tppubtype = {inproceedings}
}
Nikolai Warner, Meera Hahn, Jonathan Huang, Irfan Essa, Vighnesh Birodkar
Text and Click inputs for unambiguous open vocabulary instance segmentation Proceedings Article
In: Proceedings of the British Machine Vision Conference (BMVC), 2023.
Abstract | Links | BibTeX | Tags: arXiv, BMVC, computer vision, google, image segmentation
@inproceedings{2023-Warner-TACIFUOVIS,
title = {Text and Click inputs for unambiguous open vocabulary instance segmentation},
author = {Nikolai Warner and Meera Hahn and Jonathan Huang and Irfan Essa and Vighnesh Birodkar},
url = {https://doi.org/10.48550/arXiv.2311.14822
https://arxiv.org/abs/2311.14822
https://arxiv.org/pdf/2311.14822.pdf},
doi = {10.48550/arXiv.2311.14822},
year = {2023},
date = {2023-11-24},
urldate = {2023-11-24},
booktitle = {Proceedings of the British Machine Vision Conference (BMVC)},
abstract = {Segmentation localizes objects in an image on a fine-grained per-pixel scale. Segmentation benefits by humans-in-the-loop to provide additional input of objects to segment using a combination of foreground or background clicks. Tasks include photoediting or novel dataset annotation, where human annotators leverage an existing segmentation model instead of drawing raw pixel level annotations. We propose a new segmentation process, Text + Click segmentation, where a model takes as input an image, a text phrase describing a class to segment, and a single foreground click specifying the instance to segment. Compared to previous approaches, we leverage open-vocabulary image-text models to support a wide-range of text prompts. Conditioning segmentations on text prompts improves the accuracy of segmentations on novel or unseen classes. We demonstrate that the combination of a single user-specified foreground click and a text prompt allows a model to better disambiguate overlapping or co-occurring semantic categories, such as "tie", "suit", and "person". We study these results across common segmentation datasets such as refCOCO, COCO, VOC, and OpenImages. Source code available here.
},
keywords = {arXiv, BMVC, computer vision, google, image segmentation},
pubstate = {published},
tppubtype = {inproceedings}
}
K. Niranjan Kumar, Irfan Essa, Sehoon Ha
Words into Action: Learning Diverse Humanoid Robot Behaviors using Language Guided Iterative Motion Refinement Proceedings Article
In: CoRL Workshop on Language and Robot Learning Language as Grounding (with CoRL 2023), 2023.
Abstract | Links | BibTeX | Tags: arXiv, CoRL, robotics, vision & language
@inproceedings{2023-Kumar-WIALDHRBULGIM,
title = {Words into Action: Learning Diverse Humanoid Robot Behaviors using Language Guided Iterative Motion Refinement},
author = {K. Niranjan Kumar and Irfan Essa and Sehoon Ha},
url = {https://doi.org/10.48550/arXiv.2310.06226
https://arxiv.org/abs/2310.06226
https://arxiv.org/pdf/2310.06226.pdf
https://www.kniranjankumar.com/words_into_action/
},
doi = {10.48550/arXiv.2310.06226},
year = {2023},
date = {2023-11-01},
urldate = {2023-11-01},
booktitle = {CoRL Workshop on Language and Robot Learning Language as Grounding (with CoRL 2023)},
abstract = {We present a method to simplify controller design by enabling users to train and fine-tune robot control policies using natural language commands. We first learn a neural network policy that generates behaviors given a natural language command, such as “walk forward”, by combining Large Language Models (LLMs), motion retargeting, and motion imitation. Based on the synthesized motion, we iteratively fine-tune by updating the text prompt and querying LLMs to find the best checkpoint associated with the closest motion in history.},
keywords = {arXiv, CoRL, robotics, vision & language},
pubstate = {published},
tppubtype = {inproceedings}
}
K. Niranjan Kumar, Irfan Essa, Sehoon Ha
Cascaded Compositional Residual Learning for Complex Interactive Behaviors Journal Article
In: IEEE Robotics and Automation Letters, vol. 8, iss. 8, pp. 4601–4608, 2023.
Abstract | Links | BibTeX | Tags: IEEE, reinforcement learning, robotics
@article{2023-Kumar-CCRLCIB,
title = {Cascaded Compositional Residual Learning for Complex Interactive Behaviors},
author = {K. Niranjan Kumar and Irfan Essa and Sehoon Ha},
url = {https://ieeexplore.ieee.org/document/10152471},
doi = {10.1109/LRA.2023.3286171},
year = {2023},
date = {2023-06-14},
urldate = {2023-06-14},
journal = {IEEE Robotics and Automation Letters},
volume = {8},
issue = {8},
pages = {4601--4608},
abstract = {Real-world autonomous missions often require rich interaction with nearby objects, such as doors or switches, along with effective navigation. However, such complex behaviors are difficult to learn because they involve both high-level planning and low-level motor control. We present a novel framework, Cascaded Compositional Residual Learning (CCRL), which learns composite skills by recursively leveraging a library of previously learned control policies. Our framework combines multiple levels of pre-learned skills by using multiplicative skill composition and residual action learning. We also introduce a goal synthesis network and an observation selector to support combination of heterogeneous skills, each with its unique goals and observation space. Finally, we develop residual regularization for learning policies that solve a new task, while preserving the style of the motion enforced by the skill library. We show that our framework learns joint-level control policies for a diverse set of motor skills ranging from basic locomotion to complex interactive navigation, including navigating around obstacles, pushing objects, crawling under a table, pushing a door open with its leg, and holding it open while walking through it. The proposed CCRL framework leads to policies with consistent styles and lower joint torques, and successfully transfer to a real Unitree A1 robot without any additional fine-tuning.},
keywords = {IEEE, reinforcement learning, robotics},
pubstate = {published},
tppubtype = {article}
}
Kihyuk Sohn, Albert Shaw, Yuan Hao, Han Zhang, Luisa Polania, Huiwen Chang, Lu Jiang, Irfan Essa
Learning Disentangled Prompts for Compositional Image Synthesis Technical Report
2023.
Abstract | Links | BibTeX | Tags: arXiv, computer vision, generative AI, google, prompt engineering
@techreport{2023-Sohn-LDPCIS,
title = {Learning Disentangled Prompts for Compositional Image Synthesis},
author = {Kihyuk Sohn and Albert Shaw and Yuan Hao and Han Zhang and Luisa Polania and Huiwen Chang and Lu Jiang and Irfan Essa},
url = {https://arxiv.org/abs/2306.00763},
doi = {10.48550/arXiv.2306.00763},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
abstract = {We study domain-adaptive image synthesis, the problem of teaching pretrained image generative models a new style or concept from as few as one image to synthesize novel images, to better understand the compositional image synthesis. We present a framework that leverages a pre-trained class-conditional generation model and visual prompt tuning. Specifically, we propose a novel source class distilled visual prompt that learns disentangled prompts of semantic (e.g., class) and domain (e.g., style) from a few images. Learned domain prompt is then used to synthesize images of any classes in the style of target domain. We conduct studies on various target domains with the number of images ranging from one to a few to many, and show qualitative results which show the compositional generalization of our method. Moreover, we show that our method can help improve zero-shot domain adaptation classification accuracy.
},
howpublished = {arXiv:2306.00763},
keywords = {arXiv, computer vision, generative AI, google, prompt engineering},
pubstate = {published},
tppubtype = {techreport}
}
Kihyuk Sohn, Yuan Hao, José Lezama, Luisa Polania, Huiwen Chang, Han Zhang, Irfan Essa, Lu Jiang
Visual Prompt Tuning for Generative Transfer Learning Proceedings Article
In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.
Tags: computer vision, CVPR, generative AI, generative media, google
@inproceedings{2022-Sohn-VPTGTL,
title = {Visual Prompt Tuning for Generative Transfer Learning},
author = {Kihyuk Sohn and Yuan Hao and José Lezama and Luisa Polania and Huiwen Chang and Han Zhang and Irfan Essa and Lu Jiang},
url = {https://arxiv.org/abs/2210.00990
https://openaccess.thecvf.com/content/CVPR2023/papers/Sohn_Visual_Prompt_Tuning_for_Generative_Transfer_Learning_CVPR_2023_paper.pdf
https://openaccess.thecvf.com/content/CVPR2023/supplemental/Sohn_Visual_Prompt_Tuning_CVPR_2023_supplemental.pdf},
doi = {10.48550/ARXIV.2210.00990},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
booktitle = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
abstract = {Transferring knowledge from an image synthesis model trained on a large dataset is a promising direction for learning generative image models from various domains efficiently. While previous works have studied GAN models, we present a recipe for learning vision transformers by generative knowledge transfer. We base our framework on state-of-the-art generative vision transformers that represent an image as a sequence of visual tokens to the autoregressive or non-autoregressive transformers. To adapt to a new domain, we employ prompt tuning, which prepends learnable tokens called prompt to the image token sequence, and introduce a new prompt design for our task. We study on a variety of visual domains, including visual task adaptation benchmark~\cite{zhai2019large}, with varying amount of training images, and show effectiveness of knowledge transfer and a significantly better image generation quality over existing works.},
keywords = {computer vision, CVPR, generative AI, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
MAGVIT: Masked Generative Video Transformer Proceedings Article
In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.
Tags: computational video, computer vision, CVPR, generative AI, generative media, google
@inproceedings{2023-Yu-MMGVT,
title = {MAGVIT: Masked Generative Video Transformer},
author = {Lijun Yu and Yong Cheng and Kihyuk Sohn and José Lezama and Han Zhang and Huiwen Chang and Alexander G. Hauptmann and Ming-Hsuan Yang and Yuan Hao and Irfan Essa and Lu Jiang},
url = {https://arxiv.org/abs/2212.05199
https://magvit.cs.cmu.edu/
https://openaccess.thecvf.com/content/CVPR2023/papers/Yu_MAGVIT_Masked_Generative_Video_Transformer_CVPR_2023_paper.pdf
https://openaccess.thecvf.com/content/CVPR2023/supplemental/Yu_MAGVIT_Masked_Generative_CVPR_2023_supplemental.pdf},
doi = {10.48550/ARXIV.2212.05199},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
booktitle = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
abstract = {We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu/.},
keywords = {computational video, computer vision, CVPR, generative AI, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
Dina Bashkirova, José Lezama, Kihyuk Sohn, Kate Saenko, Irfan Essa
MaskSketch: Unpaired Structure-guided Masked Image Generation Proceedings Article
In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.
Tags: computer vision, CVPR, generative AI, generative media, google
@inproceedings{2023-Bashkirova-MUSMIG,
title = {MaskSketch: Unpaired Structure-guided Masked Image Generation},
author = {Dina Bashkirova and José Lezama and Kihyuk Sohn and Kate Saenko and Irfan Essa},
url = {https://arxiv.org/abs/2302.05496
https://openaccess.thecvf.com/content/CVPR2023/papers/Bashkirova_MaskSketch_Unpaired_Structure-Guided_Masked_Image_Generation_CVPR_2023_paper.pdf
https://openaccess.thecvf.com/content/CVPR2023/supplemental/Bashkirova_MaskSketch_Unpaired_Structure-Guided_CVPR_2023_supplemental.pdf},
doi = {10.48550/ARXIV.2302.05496},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
booktitle = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
abstract = {Recent conditional image generation methods produce images of remarkable diversity, fidelity and realism. However, the majority of these methods allow conditioning only on labels or text prompts, which limits their level of control over the generation result. In this paper, we introduce MaskSketch, an image generation method that allows spatial conditioning of the generation result using a guiding sketch as an extra conditioning signal during sampling. MaskSketch utilizes a pre-trained masked generative transformer, requiring no model training or paired supervision, and works with input sketches of different levels of abstraction. We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image, such as scene layout and object shape, and we propose a novel sampling method based on this observation to enable structure-guided generation. Our results show that MaskSketch achieves high image realism and fidelity to the guiding structure. Evaluated on standard benchmark datasets, MaskSketch outperforms state-of-the-art methods for sketch-to-image translation, as well as unpaired image-to-image translation approaches.},
keywords = {computer vision, CVPR, generative AI, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
Erik Wijmans, Manolis Savva, Irfan Essa, Stefan Lee, Ari S. Morcos, Dhruv Batra
Emergence of Maps in the Memories of Blind Navigation Agents Best Paper Proceedings Article
In: Proceedings of International Conference on Learning Representations (ICLR), 2023.
Tags: awards, best paper award, computer vision, google, ICLR, machine learning, robotics
@inproceedings{2023-Wijmans-EMMBNA,
title = {Emergence of Maps in the Memories of Blind Navigation Agents},
author = {Erik Wijmans and Manolis Savva and Irfan Essa and Stefan Lee and Ari S. Morcos and Dhruv Batra},
url = {https://arxiv.org/abs/2301.13261
https://wijmans.xyz/publication/eom/
https://openreview.net/forum?id=lTt4KjHSsyl
https://blog.iclr.cc/2023/03/21/announcing-the-iclr-2023-outstanding-paper-award-recipients/},
doi = {10.48550/ARXIV.2301.13261},
year = {2023},
date = {2023-05-01},
urldate = {2023-05-01},
booktitle = {Proceedings of International Conference on Learning Representations (ICLR)},
abstract = {Animal navigation research posits that organisms build and maintain internal spatial representations, or maps, of their environment. We ask if machines -- specifically, artificial intelligence (AI) navigation agents -- also build implicit (or 'mental') maps. A positive answer to this question would (a) explain the surprising phenomenon in recent literature of ostensibly map-free neural-networks achieving strong performance, and (b) strengthen the evidence of mapping as a fundamental mechanism for navigation by intelligent embodied agents, whether they be biological or artificial. Unlike animal navigation, we can judiciously design the agent's perceptual system and control the learning paradigm to nullify alternative navigation mechanisms. Specifically, we train 'blind' agents -- with sensing limited to only egomotion and no other sensing of any kind -- to perform PointGoal navigation ('go to Δ x, Δ y') via reinforcement learning. Our agents are composed of navigation-agnostic components (fully-connected and recurrent neural networks), and our experimental setup provides no inductive bias towards mapping. Despite these harsh conditions, we find that blind agents are (1) surprisingly effective navigators in new environments (~95% success); (2) they utilize memory over long horizons (remembering ~1,000 steps of past experience in an episode); (3) this memory enables them to exhibit intelligent behavior (following walls, detecting collisions, taking shortcuts); (4) there is emergence of maps and collision detection neurons in the representations of the environment built by a blind agent as it navigates; and (5) the emergent maps are selective and task dependent (e.g. the agent 'forgets' exploratory detours). Overall, this paper presents no new techniques for the AI audience, but a surprising finding, an insight, and an explanation.},
keywords = {awards, best paper award, computer vision, google, ICLR, machine learning, robotics},
pubstate = {published},
tppubtype = {inproceedings}
}
José Lezama, Tim Salimans, Lu Jiang, Huiwen Chang, Jonathan Ho, Irfan Essa
Discrete Predictor-Corrector Diffusion Models for Image Synthesis Proceedings Article
In: International Conference on Learning Representations (ICLR), 2023.
Tags: computer vision, generative AI, generative media, google, ICLR, machine learning
@inproceedings{2023-Lezama-DPDMIS,
title = {Discrete Predictor-Corrector Diffusion Models for Image Synthesis},
author = {José Lezama and Tim Salimans and Lu Jiang and Huiwen Chang and Jonathan Ho and Irfan Essa},
url = {https://openreview.net/forum?id=VM8batVBWvg},
year = {2023},
date = {2023-05-01},
urldate = {2023-05-01},
booktitle = {International Conference on Learning Representations (ICLR)},
abstract = {We introduce Discrete Predictor-Corrector diffusion models (DPC), extending predictor-corrector samplers in Gaussian diffusion models to the discrete case. Predictor-corrector samplers are a class of samplers for diffusion models, which improve on ancestral samplers by correcting the sampling distribution of intermediate diffusion states using MCMC methods. In DPC, the Langevin corrector, which does not have a direct counterpart in discrete space, is replaced with a discrete MCMC transition defined by a learned corrector kernel. The corrector kernel is trained to make the correction steps achieve asymptotic convergence, in distribution, to the correct marginal of the intermediate diffusion states. Equipped with DPC, we revisit recent transformer-based non-autoregressive generative models through the lens of discrete diffusion, and find that DPC can alleviate the compounding decoding error due to the parallel sampling of visual tokens. Our experiments show that DPC improves upon existing discrete latent space models for class-conditional image generation on ImageNet, and outperforms continuous diffusion models and GANs, according to standard metrics and user preference studies.},
keywords = {computer vision, generative AI, generative media, google, ICLR, machine learning},
pubstate = {published},
tppubtype = {inproceedings}
}
Yi-Hao Peng, Peggy Chi, Anjuli Kannan, Meredith Morris, Irfan Essa
Slide Gestalt: Automatic Structure Extraction in Slide Decks for Non-Visual Access Proceedings Article
In: ACM CHI Conference on Human Factors in Computing Systems (CHI), 2023.
Tags: accessibility, CHI, google, human-computer interaction
@inproceedings{2023-Peng-SGASESDNA,
title = {Slide Gestalt: Automatic Structure Extraction in Slide Decks for Non-Visual Access},
author = {Yi-Hao Peng and Peggy Chi and Anjuli Kannan and Meredith Morris and Irfan Essa},
url = {https://research.google/pubs/pub52182/
https://dl.acm.org/doi/fullHtml/10.1145/3544548.3580921
https://doi.org/10.1145/3544548.3580921
https://www.youtube.com/watch?v=pK08aMRx4qo},
year = {2023},
date = {2023-04-23},
urldate = {2023-04-23},
booktitle = {ACM CHI Conference on Human Factors in Computing Systems (CHI)},
abstract = {Presentation slides commonly use visual patterns for structural navigation, such as titles, dividers, and build slides. However, screen readers do not capture such intention, making it time-consuming and less accessible for blind and visually impaired (BVI) users to linearly consume slides with repeated content. We present Slide Gestalt, an automatic approach that identifies the hierarchical structure in a slide deck. Slide Gestalt computes the visual and textual correspondences between slides to generate hierarchical groupings. Readers can navigate the slide deck from the higher-level section overview to the lower-level description of a slide group or individual elements interactively with our UI. We derived slide consumption and authoring practices from interviews with BVI readers and sighted creators and an analysis of 100 decks. We performed our pipeline with 50 real-world slide decks and a large dataset. Feedback from eight BVI participants showed that Slide Gestalt helped navigate a slide deck by anchoring content more efficiently, compared to using accessible slides.},
keywords = {accessibility, CHI, google, human-computer interaction},
pubstate = {published},
tppubtype = {inproceedings}
}
Other Publication Sites
A few more sites that aggregate research publications: Academia.edu, Bibsonomy, CiteULike, Mendeley.
Copyright/About
[Please see the Copyright Statement that may apply to the content listed here.]
This list of publications is produced by using the teachPress plugin for WordPress.