A searchable list of some of my publications is below. You can also access my publications from the sites listed under Other Publication Sites at the end of this page, and via my ORCID profile.
Publications:
Gong Zhang, Kihyuk Sohn, Meera Hahn, Humphrey Shi, Irfan Essa
FineStyle: Fine-grained Controllable Style Personalization for Text-to-image Models Proceedings Article
In: Advances in Neural Information Processing Systems (NeurIPS), 2024.
@inproceedings{2024-Zhang-FFCSPTM,
title = {FineStyle: Fine-grained Controllable Style Personalization for Text-to-image Models},
author = {Gong Zhang and Kihyuk Sohn and Meera Hahn and Humphrey Shi and Irfan Essa},
url = {https://neurips.cc/virtual/2024/poster/96863
https://openreview.net/forum?id=1SmXUGzrH8},
year = {2024},
date = {2024-12-11},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
abstract = {Few-shot fine-tuning of text-to-image (T2I) generation models enables people to create unique images in their own style using natural languages without requiring extensive prompt engineering. However, fine-tuning with only a handful, as little as one, of image-text paired data prevents fine-grained control of style attributes at generation. In this paper, we present FineStyle, a few-shot fine-tuning method that allows enhanced controllability for style personalized text-to-image generation. To overcome the lack of training data for fine-tuning, we propose a novel concept-oriented data scaling that amplifies the number of image-text pairs, each of which focuses on different concepts (e.g., objects) in the style reference image. We also identify the benefit of parameter-efficient adapter tuning of key and value kernels of cross-attention layers. Extensive experiments show the effectiveness of FineStyle at following fine-grained text prompts and delivering visual quality faithful to the specified style, measured by CLIP scores and human raters.
},
keywords = {computer vision, generative AI, generative media, machine learning, NeurIPS},
pubstate = {published},
tppubtype = {inproceedings}
}
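As a rough illustration of the parameter-efficient adapter tuning named in the FineStyle abstract (updating only the key and value kernels of cross-attention layers), here is a minimal PyTorch sketch. It is not the FineStyle implementation; the class, dimensions, and helper below are assumptions for illustration.

import torch
import torch.nn as nn

# Minimal sketch (not the FineStyle code): a cross-attention layer whose
# key/value projections are the only trainable parameters.
class CrossAttention(nn.Module):
    def __init__(self, dim_img=320, dim_txt=768):
        super().__init__()
        self.to_q = nn.Linear(dim_img, dim_img, bias=False)
        self.to_k = nn.Linear(dim_txt, dim_img, bias=False)  # tuned
        self.to_v = nn.Linear(dim_txt, dim_img, bias=False)  # tuned
        self.to_out = nn.Linear(dim_img, dim_img)

    def forward(self, x, text):
        q, k, v = self.to_q(x), self.to_k(text), self.to_v(text)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.to_out(attn @ v)

def mark_kv_trainable(model):
    """Freeze all weights, then re-enable only the key/value kernels."""
    for p in model.parameters():
        p.requires_grad = False
    for name, p in model.named_parameters():
        if "to_k" in name or "to_v" in name:
            p.requires_grad = True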
Dina Bashkirova, José Lezama, Kihyuk Sohn, Kate Saenko, Irfan Essa
MaskSketch: Unpaired Structure-guided Masked Image Generation Proceedings Article
In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.
@inproceedings{2023-Bashkirova-MUSMIG,
title = {MaskSketch: Unpaired Structure-guided Masked Image Generation},
author = { Dina Bashkirova and José Lezama and Kihyuk Sohn and Kate Saenko and Irfan Essa},
url = {https://arxiv.org/abs/2302.05496
https://openaccess.thecvf.com/content/CVPR2023/papers/Bashkirova_MaskSketch_Unpaired_Structure-Guided_Masked_Image_Generation_CVPR_2023_paper.pdf
https://openaccess.thecvf.com/content/CVPR2023/supplemental/Bashkirova_MaskSketch_Unpaired_Structure-Guided_CVPR_2023_supplemental.pdf},
doi = {10.48550/ARXIV.2302.05496},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
booktitle = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
abstract = {Recent conditional image generation methods produce images of remarkable diversity, fidelity and realism. However, the majority of these methods allow conditioning only on labels or text prompts, which limits their level of control over the generation result. In this paper, we introduce MaskSketch, an image generation method that allows spatial conditioning of the generation result using a guiding sketch as an extra conditioning signal during sampling. MaskSketch utilizes a pre-trained masked generative transformer, requiring no model training or paired supervision, and works with input sketches of different levels of abstraction. We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image, such as scene layout and object shape, and we propose a novel sampling method based on this observation to enable structure-guided generation. Our results show that MaskSketch achieves high image realism and fidelity to the guiding structure. Evaluated on standard benchmark datasets, MaskSketch outperforms state-of-the-art methods for sketch-to-image translation, as well as unpaired image-to-image translation approaches.},
keywords = {computer vision, CVPR, generative AI, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
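The structure guidance described in the MaskSketch abstract relies on the observation that self-attention maps encode scene layout and object shape. As a hedged sketch of that idea, the snippet below ranks candidate samples by how closely their self-attention maps match those of the guiding sketch; the cosine similarity used here is an illustrative stand-in, not the paper's exact structure distance.

import numpy as np

def structure_similarity(attn_sketch, attn_candidate):
    """Cosine similarity between flattened self-attention maps.

    Both inputs are arrays of shape (num_layers, num_tokens, num_tokens).
    """
    a = attn_sketch.reshape(-1)
    b = attn_candidate.reshape(-1)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pick_best_candidate(attn_sketch, candidate_attns):
    """Return the index of the candidate whose structure best matches the sketch."""
    scores = [structure_similarity(attn_sketch, c) for c in candidate_attns]
    return int(np.argmax(scores))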
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
MAGVIT: Masked Generative Video Transformer Proceedings Article
In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.
@inproceedings{2023-Yu-MMGVT,
title = {MAGVIT: Masked Generative Video Transformer},
author = {Lijun Yu and Yong Cheng and Kihyuk Sohn and José Lezama and Han Zhang and Huiwen Chang and Alexander G. Hauptmann and Ming-Hsuan Yang and Yuan Hao and Irfan Essa and Lu Jiang},
url = {https://arxiv.org/abs/2212.05199
https://magvit.cs.cmu.edu/
https://openaccess.thecvf.com/content/CVPR2023/papers/Yu_MAGVIT_Masked_Generative_Video_Transformer_CVPR_2023_paper.pdf
https://openaccess.thecvf.com/content/CVPR2023/supplemental/Yu_MAGVIT_Masked_Generative_CVPR_2023_supplemental.pdf},
doi = {10.48550/ARXIV.2212.05199},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
booktitle = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
abstract = {We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at this https URL.},
keywords = {computational video, computer vision, CVPR, generative AI, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
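Masked video token modeling, the training objective named in the MAGVIT abstract, can be illustrated with a generic sketch: quantize a clip into a (time, height, width) grid of token ids, replace a random subset with a mask token, and train the transformer to recover the original ids at those positions. The codebook size, mask ratio, and helper below are assumptions; MAGVIT's multi-task conditioning and embedding method are more involved.

import numpy as np

MASK_ID = 1024                      # assumed codebook of 1024 plus a [MASK] id
rng = np.random.default_rng(0)

def mask_video_tokens(tokens, mask_ratio=0.6):
    """tokens: int array of shape (T, H, W) produced by a 3D tokenizer."""
    flat = tokens.reshape(-1).copy()
    n_mask = int(mask_ratio * flat.size)
    idx = rng.choice(flat.size, size=n_mask, replace=False)
    targets = flat[idx].copy()      # what the transformer must predict
    flat[idx] = MASK_ID             # corrupt the input at masked positions
    return flat.reshape(tokens.shape), idx, targets

tokens = rng.integers(0, 1024, size=(4, 16, 16))   # toy 4x16x16 token grid
masked_tokens, positions, targets = mask_video_tokens(tokens)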
Kihyuk Sohn, Yuan Hao, José Lezama, Luisa Polania, Huiwen Chang, Han Zhang, Irfan Essa, Lu Jiang
Visual Prompt Tuning for Generative Transfer Learning Proceedings Article
In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.
@inproceedings{2022-Sohn-VPTGTL,
title = {Visual Prompt Tuning for Generative Transfer Learning},
author = {Kihyuk Sohn and Yuan Hao and José Lezama and Luisa Polania and Huiwen Chang and Han Zhang and Irfan Essa and Lu Jiang},
url = {https://arxiv.org/abs/2210.00990
https://openaccess.thecvf.com/content/CVPR2023/papers/Sohn_Visual_Prompt_Tuning_for_Generative_Transfer_Learning_CVPR_2023_paper.pdf
https://openaccess.thecvf.com/content/CVPR2023/supplemental/Sohn_Visual_Prompt_Tuning_CVPR_2023_supplemental.pdf},
doi = {10.48550/ARXIV.2210.00990},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
booktitle = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
abstract = {Transferring knowledge from an image synthesis model trained on a large dataset is a promising direction for learning generative image models from various domains efficiently. While previous works have studied GAN models, we present a recipe for learning vision transformers by generative knowledge transfer. We base our framework on state-of-the-art generative vision transformers that represent an image as a sequence of visual tokens to the autoregressive or non-autoregressive transformers. To adapt to a new domain, we employ prompt tuning, which prepends learnable tokens called prompt to the image token sequence, and introduce a new prompt design for our task. We study on a variety of visual domains, including visual task adaptation benchmark (Zhai et al., 2019), with varying amount of training images, and show effectiveness of knowledge transfer and a significantly better image generation quality over existing works.},
keywords = {computer vision, CVPR, generative AI, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
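The prompt tuning described in this abstract prepends a small set of learnable token embeddings to the image token sequence while the pretrained transformer stays frozen. A minimal PyTorch sketch under assumed dimensions, without the paper's specific prompt design:

import torch
import torch.nn as nn

class PromptedTokens(nn.Module):
    """Prepend learnable prompt embeddings to a frozen model's token sequence."""
    def __init__(self, num_prompts=16, dim=768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, image_token_embeddings):
        # image_token_embeddings: (batch, seq_len, dim)
        batch = image_token_embeddings.shape[0]
        p = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([p, image_token_embeddings], dim=1)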
José Lezama, Tim Salimans, Lu Jiang, Huiwen Chang, Jonathan Ho, Irfan Essa
Discrete Predictor-Corrector Diffusion Models for Image Synthesis Proceedings Article
In: International Conference on Learning Representations (ICLR), 2023.
@inproceedings{2023-Lezama-DPDMIS,
title = {Discrete Predictor-Corrector Diffusion Models for Image Synthesis},
author = {José Lezama and Tim Salimans and Lu Jiang and Huiwen Chang and Jonathan Ho and Irfan Essa},
url = {https://openreview.net/forum?id=VM8batVBWvg},
year = {2023},
date = {2023-05-01},
urldate = {2023-05-01},
booktitle = {International Conference on Learning Representations (ICLR)},
abstract = {We introduce Discrete Predictor-Corrector diffusion models (DPC), extending predictor-corrector samplers in Gaussian diffusion models to the discrete case. Predictor-corrector samplers are a class of samplers for diffusion models, which improve on ancestral samplers by correcting the sampling distribution of intermediate diffusion states using MCMC methods. In DPC, the Langevin corrector, which does not have a direct counterpart in discrete space, is replaced with a discrete MCMC transition defined by a learned corrector kernel. The corrector kernel is trained to make the correction steps achieve asymptotic convergence, in distribution, to the correct marginal of the intermediate diffusion states. Equipped with DPC, we revisit recent transformer-based non-autoregressive generative models through the lens of discrete diffusion, and find that DPC can alleviate the compounding decoding error due to the parallel sampling of visual tokens. Our experiments show that DPC improves upon existing discrete latent space models for class-conditional image generation on ImageNet, and outperforms continuous diffusion models and GANs, according to standard metrics and user preference studies},
keywords = {computer vision, generative AI, generative media, google, ICLR, machine learning},
pubstate = {published},
tppubtype = {inproceedings}
}
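The sampling procedure described in the DPC abstract alternates a predictor step (an ordinary ancestral step of the discrete diffusion model) with a learned corrector step that moves the intermediate state toward the correct marginal. The loop below is schematic; predictor_step and corrector_step are placeholder callables, not the paper's interfaces.

def sample_dpc(init_tokens, predictor_step, corrector_step,
               num_steps=16, corrector_passes=1):
    """Schematic discrete predictor-corrector sampling loop."""
    tokens = init_tokens
    for t in reversed(range(num_steps)):
        # Predictor: one ancestral denoising step at diffusion time t.
        tokens = predictor_step(tokens, t)
        # Corrector: learned MCMC-style transition toward the time-t marginal.
        for _ in range(corrector_passes):
            tokens = corrector_step(tokens, t)
    return tokens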
José Lezama, Huiwen Chang, Lu Jiang, Irfan Essa
Improved Masked Image Generation with Token-Critic Proceedings Article
In: European Conference on Computer Vision (ECCV), Springer, 2022, ISBN: 978-3-031-20050-2.
@inproceedings{2022-Lezama-IMIGWT,
title = {Improved Masked Image Generation with Token-Critic},
author = {José Lezama and Huiwen Chang and Lu Jiang and Irfan Essa},
url = {https://arxiv.org/abs/2209.04439
https://rdcu.be/c61MZ},
doi = {10.1007/978-3-031-20050-2_5},
isbn = {978-3-031-20050-2},
year = {2022},
date = {2022-10-28},
urldate = {2022-10-28},
booktitle = {European Conference on Computer Vision (ECCV)},
volume = {13683},
publisher = {Springer},
abstract = {Non-autoregressive generative transformers recently demonstrated impressive image generation performance, and orders of magnitude faster sampling than their autoregressive counterparts. However, optimal parallel sampling from the true joint distribution of visual tokens remains an open challenge. In this paper we introduce Token-Critic, an auxiliary model to guide the sampling of a non-autoregressive generative transformer. Given a masked-and-reconstructed real image, the Token-Critic model is trained to distinguish which visual tokens belong to the original image and which were sampled by the generative transformer. During non-autoregressive iterative sampling, Token-Critic is used to select which tokens to accept and which to reject and resample. Coupled with Token-Critic, a state-of-the-art generative transformer significantly improves its performance, and outperforms recent diffusion models and GANs in terms of the trade-off between generated image quality and diversity, in the challenging class-conditional ImageNet generation.},
keywords = {computer vision, ECCV, generative AI, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
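The Token-Critic sampling loop summarized in the abstract can be sketched as follows: the generator fills every masked position, the critic scores each token, the least plausible tokens are re-masked, and the process repeats with a shrinking re-mask fraction. The callables and schedule below are assumptions for illustration.

import numpy as np

def sample_with_critic(generator, critic, seq_len, mask_id, num_steps=8):
    """generator(tokens) fills masked positions; critic(tokens) scores each token."""
    tokens = np.full(seq_len, mask_id, dtype=np.int64)
    for step in range(num_steps):
        tokens = generator(tokens)              # propose values for all masks
        scores = critic(tokens)                 # per-token plausibility scores
        frac = 1.0 - (step + 1) / num_steps     # assumed linear re-mask schedule
        n_remask = int(frac * seq_len)
        if n_remask == 0:
            break                               # final iteration: keep everything
        worst = np.argsort(scores)[:n_remask]   # reject the least plausible tokens
        tokens[worst] = mask_id                 # they will be resampled next round
    return tokens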
Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan Hao, Haifeng Gong, Irfan Essa
BLT: Bidirectional Layout Transformer for Controllable Layout Generation Proceedings Article
In: European Conference on Computer Vision (ECCV), 2022, ISBN: 978-3-031-19789-5.
@inproceedings{2022-Kong-BLTCLG,
title = {BLT: Bidirectional Layout Transformer for Controllable Layout Generation},
author = {Xiang Kong and Lu Jiang and Huiwen Chang and Han Zhang and Yuan Hao and Haifeng Gong and Irfan Essa},
url = {https://arxiv.org/abs/2112.05112
https://rdcu.be/c61AE},
doi = {10.1007/978-3-031-19790-1_29},
isbn = {978-3-031-19789-5},
year = {2022},
date = {2022-10-25},
urldate = {2022-10-25},
booktitle = {European Conference on Computer Vision (ECCV)},
volume = {13677},
abstract = {Creating visual layouts is a critical step in graphic design. Automatic generation of such layouts is essential for scalable and diverse visual designs. To advance conditional layout generation, we introduce BLT, a bidirectional layout transformer. BLT differs from previous work on transformers in adopting non-autoregressive transformers. In training, BLT learns to predict the masked attributes by attending to surrounding attributes in two directions. During inference, BLT first generates a draft layout from the input and then iteratively refines it into a high-quality layout by masking out low-confident attributes. The masks generated in both training and inference are controlled by a new hierarchical sampling policy. We verify the proposed model on six benchmarks of diverse design tasks. Experimental results demonstrate two benefits compared to the state-of-the-art layout transformer models. First, our model empowers layout transformers to fulfill controllable layout generation. Second, it achieves up to 10x speedup in generating a layout at inference time than the layout transformer baseline. Code is released at https://shawnkx.github.io/blt.},
keywords = {computer vision, ECCV, generative AI, generative media, google, vision transformer},
pubstate = {published},
tppubtype = {inproceedings}
}
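A rough illustration of the layout representation implied by the BLT abstract: each element contributes a handful of discrete attributes (category, position, size) to one flat sequence, and conditional generation masks the attributes to be predicted. The encoding below is an assumption for illustration, not BLT's exact scheme or its hierarchical sampling policy.

MASK = -1   # placeholder id for attributes the model must generate

def layout_to_sequence(elements):
    """elements: list of dicts with integer fields category, x, y, w, h."""
    seq = []
    for e in elements:
        seq.extend([e["category"], e["x"], e["y"], e["w"], e["h"]])
    return seq

def mask_geometry(seq, attrs_per_element=5):
    """Keep categories, mask x/y/w/h so the model fills in the geometry."""
    out = list(seq)
    for i in range(0, len(out), attrs_per_element):
        out[i + 1:i + attrs_per_element] = [MASK] * (attrs_per_element - 1)
    return out

layout = [{"category": 3, "x": 10, "y": 4, "w": 40, "h": 12}]
print(mask_geometry(layout_to_sequence(layout)))   # [3, -1, -1, -1, -1]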
Peggy Chi, Tao Dong, Christian Frueh, Brian Colonna, Vivek Kwatra, Irfan Essa
Synthesis-Assisted Video Prototyping From a Document Proceedings Article
In: Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pp. 1–10, 2022.
@inproceedings{2022-Chi-SVPFD,
title = {Synthesis-Assisted Video Prototyping From a Document},
author = {Peggy Chi and Tao Dong and Christian Frueh and Brian Colonna and Vivek Kwatra and Irfan Essa},
url = {https://research.google/pubs/pub51631/
https://dl.acm.org/doi/abs/10.1145/3526113.3545676},
doi = {10.1145/3526113.3545676},
year = {2022},
date = {2022-10-01},
urldate = {2022-10-01},
booktitle = {Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology},
pages = {1--10},
abstract = {Video productions commonly start with a script, especially for talking head videos that feature a speaker narrating to the camera. When the source materials come from a written document -- such as a web tutorial, it takes iterations to refine content from a text article to a spoken dialogue, while considering visual compositions in each scene. We propose Doc2Video, a video prototyping approach that converts a document to interactive scripting with a preview of synthetic talking head videos. Our pipeline decomposes a source document into a series of scenes, each automatically creating a synthesized video of a virtual instructor. Designed for a specific domain -- programming cookbooks, we apply visual elements from the source document, such as a keyword, a code snippet or a screenshot, in suitable layouts. Users edit narration sentences, break or combine sections, and modify visuals to prototype a video in our Editing UI. We evaluated our pipeline with public programming cookbooks. Feedback from professional creators shows that our method provided a reasonable starting point to engage them in interactive scripting for a narrated instructional video.},
keywords = {computational video, generative media, google, human-computer interaction, UIST, video editing},
pubstate = {published},
tppubtype = {inproceedings}
}
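The pipeline described in this abstract decomposes a source document into scenes, each pairing editable narration with visual elements placed in a layout. A minimal sketch of such a scene representation follows; the field names are hypothetical, not the Doc2Video schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class VisualElement:
    kind: str           # e.g. "keyword", "code_snippet", "screenshot"
    content: str
    layout_slot: str    # e.g. "left_panel", "full_screen"

@dataclass
class Scene:
    narration: str                          # sentence(s) for the synthesized instructor
    visuals: List[VisualElement] = field(default_factory=list)

def document_to_scenes(sections):
    """Turn (heading, body) pairs into one draft scene per section."""
    scenes = []
    for heading, body in sections:
        visuals = [VisualElement(kind="keyword", content=heading, layout_slot="left_panel")]
        scenes.append(Scene(narration=body, visuals=visuals))
    return scenes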
Tianhao Zhang, Hung-Yu Tseng, Lu Jiang, Weilong Yang, Honglak Lee, Irfan Essa
Text as Neural Operator: Image Manipulation by Text Instruction Proceedings Article
In: ACM International Conference on Multimedia (ACM-MM), ACM Press, 2021.
@inproceedings{2021-Zhang-TNOIMTI,
title = {Text as Neural Operator: Image Manipulation by Text Instruction},
author = {Tianhao Zhang and Hung-Yu Tseng and Lu Jiang and Weilong Yang and Honglak Lee and Irfan Essa},
url = {https://dl.acm.org/doi/10.1145/3474085.3475343
https://arxiv.org/abs/2008.04556},
doi = {10.1145/3474085.3475343},
year = {2021},
date = {2021-10-01},
urldate = {2021-10-01},
booktitle = {ACM International Conference on Multimedia (ACM-MM)},
publisher = {ACM Press},
abstract = {In recent years, text-guided image manipulation has gained increasing attention in the multimedia and computer vision community. The input to conditional image generation has evolved from image-only to multimodality. In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects. The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image. We propose a GAN-based method to tackle this problem. The key idea is to treat text as neural operators to locally modify the image feature. We show that the proposed model performs favorably against recent strong baselines on three public datasets. Specifically, it generates images of greater fidelity and semantic relevance, and when used as an image query, leads to better retrieval performance.},
keywords = {computer vision, generative media, google, multimedia},
pubstate = {published},
tppubtype = {inproceedings}
}
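The "text as neural operator" idea above uses the instruction embedding to modify image features locally. One common way to realize text conditioning of features is a predicted per-channel scale and shift (FiLM-style modulation); the sketch below uses that as an illustrative stand-in, not the paper's exact operator.

import torch
import torch.nn as nn

class TextOperator(nn.Module):
    """Illustrative FiLM-style modulation of image features by a text embedding."""
    def __init__(self, text_dim=512, feat_channels=256):
        super().__init__()
        self.to_scale = nn.Linear(text_dim, feat_channels)
        self.to_shift = nn.Linear(text_dim, feat_channels)

    def forward(self, image_feat, text_emb):
        # image_feat: (batch, channels, h, w); text_emb: (batch, text_dim)
        scale = self.to_scale(text_emb).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(text_emb).unsqueeze(-1).unsqueeze(-1)
        return image_feat * (1 + scale) + shift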
Hsin-Ying Lee, Lu Jiang, Irfan Essa, Madison Le, Haifeng Gong, Ming-Hsuan Yang, Weilong Yang
Neural Design Network: Graphic Layout Generation with Constraints Proceedings Article
In: Proceedings of European Conference on Computer Vision (ECCV), 2020.
@inproceedings{2020-Lee-NDNGLGWC,
title = {Neural Design Network: Graphic Layout Generation with Constraints},
author = {Hsin-Ying Lee and Lu Jiang and Irfan Essa and Madison Le and Haifeng Gong and Ming-Hsuan Yang and Weilong Yang},
url = {https://arxiv.org/abs/1912.09421
https://rdcu.be/c7sqw},
doi = {10.1007/978-3-030-58580-8_29},
year = {2020},
date = {2020-08-01},
urldate = {2020-08-01},
booktitle = {Proceedings of European Conference on Computer Vision (ECCV)},
keywords = {computer vision, content creation, ECCV, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
Other Publication Sites
A few more sites that aggregate research publications: Academia.edu, Bibsonomy, CiteULike, Mendeley.
Copyright/About
[Please see the Copyright Statement that may apply to the content listed here.]
This list of publications is produced by using the teachPress plugin for WordPress.