A searchable list of some of my publications is below. You can also access my publications from the following sites.
My ORCID is
Publications:
Gong Zhang, Kihyuk Sohn, Meera Hahn, Humphrey Shi, Irfan Essa
FineStyle: Fine-grained Controllable Style Personalization for Text-to-image Models Proceedings Article
In: Advances in Neural Information Processing Systems (NeurIPS), 2024.
Tags: computer vision, generative AI, generative media, machine learning, NeurIPS
@inproceedings{2024-Zhang-FFCSPTM,
title = {FineStyle: Fine-grained Controllable Style Personalization for Text-to-image Models},
author = {Gong Zhang and Kihyuk Sohn and Meera Hahn and Humphrey Shi and Irfan Essa},
url = {https://neurips.cc/virtual/2024/poster/96863
https://openreview.net/forum?id=1SmXUGzrH8},
year = {2024},
date = {2024-12-11},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
abstract = {Few-shot fine-tuning of text-to-image (T2I) generation models enables people to create unique images in their own style using natural language without requiring extensive prompt engineering. However, fine-tuning with only a handful, as few as one, of image-text pairs prevents fine-grained control of style attributes at generation. In this paper, we present FineStyle, a few-shot fine-tuning method that allows enhanced controllability for style-personalized text-to-image generation. To overcome the lack of training data for fine-tuning, we propose a novel concept-oriented data scaling that amplifies the number of image-text pairs, each of which focuses on a different concept (e.g., objects) in the style reference image. We also identify the benefit of parameter-efficient adapter tuning of the key and value kernels of cross-attention layers. Extensive experiments show the effectiveness of FineStyle at following fine-grained text prompts and delivering visual quality faithful to the specified style, measured by CLIP scores and human raters.
},
keywords = {computer vision, generative AI, generative media, machine learning, NeurIPS},
pubstate = {published},
tppubtype = {inproceedings}
}
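For readers curious what the parameter-efficient adapter tuning mentioned in the FineStyle abstract could look like in code, below is a minimal, hypothetical PyTorch-style sketch (not the paper's implementation): only small low-rank adapters on the key and value projections of a cross-attention layer are trained, while the pretrained weights stay frozen. The class and variable names are my own for illustration.

import torch
import torch.nn as nn

class KVAdapterCrossAttention(nn.Module):
    # Illustrative only: freeze the pretrained q/k/v projections and train
    # small low-rank adapters on the key and value paths.
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.k_adapter = nn.Sequential(nn.Linear(dim, rank, bias=False),
                                       nn.Linear(rank, dim, bias=False))
        self.v_adapter = nn.Sequential(nn.Linear(dim, rank, bias=False),
                                       nn.Linear(rank, dim, bias=False))
        for p in (*self.to_q.parameters(), *self.to_k.parameters(),
                  *self.to_v.parameters()):
            p.requires_grad = False  # only the adapters receive gradients

    def forward(self, image_tokens, text_tokens):
        q = self.to_q(image_tokens)
        k = self.to_k(text_tokens) + self.k_adapter(text_tokens)
        v = self.to_v(text_tokens) + self.v_adapter(text_tokens)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v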
Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, Gang Li, Sangpil Kim, Irfan Essa, Feng Yang
Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation Proceedings Article
In: Proceedings of the European Conference on Computer Vision (ECCV), 2024.
Tags: arXiv, computer vision, ECCV, generative AI, google, reinforcement learning
@inproceedings{2024-Lee-PPMRLFTG,
title = {Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation},
author = {Seung Hyun Lee and Yinxiao Li and Junjie Ke and Innfarn Yoo and Han Zhang and Jiahui Yu and Qifei Wang and Fei Deng and Glenn Entis and Junfeng He and Gang Li and Sangpil Kim and Irfan Essa and Feng Yang
},
url = {https://arxiv.org/abs/2401.05675
https://arxiv.org/pdf/2401.05675
https://dl.acm.org/doi/10.1007/978-3-031-72920-1_26},
doi = {10.48550/arXiv.2401.05675},
year = {2024},
date = {2024-07-25},
urldate = {2024-07-25},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
abstract = {Recent works have demonstrated that using reinforcement learning (RL) with multiple quality rewards can improve the quality of generated images in text-to-image (T2I) generation. However, manually adjusting reward weights poses challenges and may cause over-optimization in certain metrics. To solve this, we propose Parrot, which addresses the issue through multi-objective optimization and introduces an effective multi-reward optimization strategy to approximate Pareto optimality. Utilizing batch-wise Pareto-optimal selection, Parrot automatically identifies the optimal trade-off among different rewards. We use the novel multi-reward optimization algorithm to jointly optimize the T2I model and a prompt expansion network, resulting in a significant improvement in image quality and allowing control over the trade-off between different rewards using a reward-related prompt during inference. Furthermore, we introduce original prompt-centered guidance at inference time, ensuring fidelity to user input after prompt expansion. Extensive experiments and a user study validate the superiority of Parrot over several baselines across various quality criteria, including aesthetics, human preference, text-image alignment, and image sentiment.
},
keywords = {arXiv, computer vision, ECCV, generative AI, google, reinforcement learning},
pubstate = {published},
tppubtype = {inproceedings}
}
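The batch-wise Pareto-optimal selection described in the Parrot abstract can be illustrated with a short, generic sketch (assumptions: a NumPy array with one row per generated image and one column per reward; this is not the paper's code):

import numpy as np

def pareto_optimal_mask(rewards: np.ndarray) -> np.ndarray:
    # Return a boolean mask marking the non-dominated samples in a batch.
    n = rewards.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # Sample i is dominated if some other sample is >= on every reward
        # and strictly better on at least one.
        dominated = np.all(rewards >= rewards[i], axis=1) & \
                    np.any(rewards > rewards[i], axis=1)
        mask[i] = not dominated.any()
    return mask

batch_rewards = np.array([[0.7, 0.2],   # e.g., aesthetics, text-image alignment
                          [0.5, 0.9],
                          [0.4, 0.1]])
print(pareto_optimal_mask(batch_rewards))  # [ True  True False]

Only the samples on this per-batch Pareto front would then be used for the policy update.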
Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama
Photorealistic Video Generation with Diffusion Models Proceedings Article
In: European Conference on Computer Vision (ECCV), 2024.
Tags: arXiv, computational video, computer vision, generative AI, google
@inproceedings{2024-Gupta-PVGWDM,
title = {Photorealistic Video Generation with Diffusion Models},
author = {Agrim Gupta and Lijun Yu and Kihyuk Sohn and Xiuye Gu and Meera Hahn and Li Fei-Fei and Irfan Essa and Lu Jiang and José Lezama
},
url = {https://walt-video-diffusion.github.io/
https://arxiv.org/abs/2312.06662
https://arxiv.org/pdf/2312.06662
},
doi = {10.48550/arXiv.2312.06662},
year = {2024},
date = {2024-07-25},
urldate = {2024-07-25},
booktitle = {European Conference on Computer Vision (ECCV)},
abstract = {We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation consisting of a base latent video diffusion model, and two video super-resolution diffusion models to generate videos of 512×896 resolution at 8 frames per second.},
keywords = {arXiv, computational video, computer vision, generative AI, google},
pubstate = {published},
tppubtype = {inproceedings}
}
Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, Lu Jiang
VideoPoet: A large language model for zero-shot video generation (Best Paper) Proceedings Article
In: Proceedings of International Conference on Machine Learning (ICML), 2024.
Tags: arXiv, best paper award, computational video, computer vision, generative AI, google, ICML
@inproceedings{2024-Kondratyuk-VLLMZVG,
title = {VideoPoet: A large language model for zero-shot video generation},
author = {Dan Kondratyuk and Lijun Yu and Xiuye Gu and José Lezama and Jonathan Huang and Grant Schindler and Rachel Hornung and Vighnesh Birodkar and Jimmy Yan and Ming-Chang Chiu and Krishna Somandepalli and Hassan Akbari and Yair Alon and Yong Cheng and Josh Dillon and Agrim Gupta and Meera Hahn and Anja Hauth and David Hendon and Alonso Martinez and David Minnen and Mikhail Sirotenko and Kihyuk Sohn and Xuan Yang and Hartwig Adam and Ming-Hsuan Yang and Irfan Essa and Huisheng Wang and David A. Ross and Bryan Seybold and Lu Jiang
},
url = {https://arxiv.org/pdf/2312.14125
https://arxiv.org/abs/2312.14125
https://sites.research.google/videopoet/},
doi = {10.48550/arXiv.2312.14125},
year = {2024},
date = {2024-07-23},
urldate = {2024-07-23},
booktitle = {Proceedings of International Conference on Machine Learning (ICML)},
abstract = {We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
},
keywords = {arXiv, best paper award, computational video, computer vision, generative AI, google, ICML},
pubstate = {published},
tppubtype = {inproceedings}
}
Xingqian Xu, Jiayi Guo, Zhangyang Wang, Gao Huang, Irfan Essa, Humphrey Shi
Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models Proceedings Article
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8682–8692, 2024.
Tags: arXiv, computer vision, CVPR, generative AI
@inproceedings{2024-Xu-PDTTTDM,
title = {Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models},
author = {Xingqian Xu and Jiayi Guo and Zhangyang Wang and Gao Huang and Irfan Essa and Humphrey Shi
},
url = {https://openaccess.thecvf.com/content/CVPR2024/papers/Xu_Prompt-Free_Diffusion_Taking_Text_out_of_Text-to-Image_Diffusion_Models_CVPR_2024_paper.pdf
https://openaccess.thecvf.com/content/CVPR2024/html/Xu_Prompt-Free_Diffusion_Taking_Text_out_of_Text-to-Image_Diffusion_Models_CVPR_2024_paper.html
https://arxiv.org/abs/2305.16223
},
doi = {10.48550/arXiv.2305.16223},
year = {2024},
date = {2024-06-18},
urldate = {2024-06-18},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
},
pages = {8682--8692},
abstract = {Text-to-image (T2I) research has grown explosively in the past year, owing to the large-scale pre-trained diffusion models and many emerging personalization and editing approaches. Yet one pain point persists: the text prompt engineering, and searching high-quality text prompts for customized results, is more art than science. Moreover, as commonly argued, "an image is worth a thousand words" - the attempt to describe a desired image with texts often ends up being ambiguous and cannot comprehensively cover delicate visual details, hence necessitating additional controls from the visual domain. In this paper we take a bold step forward: taking "Text" out of a pretrained T2I diffusion model, to reduce the burdensome prompt engineering efforts for users. Our proposed framework, Prompt-Free Diffusion, relies on only visual inputs to generate new images: it takes a reference image as "context", an optional image structural conditioning, and an initial noise, with absolutely no text prompt. The core architecture behind the scenes is the Semantic Context Encoder (SeeCoder), substituting the commonly used CLIP-based or LLM-based text encoder. The reusability of SeeCoder also makes it a convenient drop-in component: one can also pre-train a SeeCoder in one T2I model and reuse it for another. Through extensive experiments, Prompt-Free Diffusion is experimentally found to (i) outperform prior exemplar-based image synthesis approaches; (ii) perform on par with state-of-the-art T2I models using prompts following the best practice; and (iii) be naturally extensible to other downstream applications such as anime figure generation and virtual try-on, with promising quality. Our code and models will be open-sourced.
},
keywords = {arXiv, computer vision, CVPR, generative AI},
pubstate = {published},
tppubtype = {inproceedings}
}
Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation Proceedings Article
In: Proceedings of the International Conference on Learning Representations (ICLR), 2024.
Tags: AI, arXiv, computer vision, generative AI, google, ICLR
@inproceedings{2024-Yu-LMBDVG,
title = {Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation},
author = {Lijun Yu and José Lezama and Nitesh B. Gundavarapu and Luca Versari and Kihyuk Sohn and David Minnen and Yong Cheng and Vighnesh Birodkar and Agrim Gupta and Xiuye Gu and Alexander G. Hauptmann and Boqing Gong and Ming-Hsuan Yang and Irfan Essa and David A. Ross and Lu Jiang},
url = {https://arxiv.org/abs/2310.05737
https://arxiv.org/pdf/2310.05737},
doi = {10.48550/arXiv.2310.05737},
year = {2024},
date = {2024-05-14},
urldate = {2024-05-14},
booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
abstract = {While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VVC) according to human evaluations, and (2) learning effective representations for action recognition tasks.
},
keywords = {AI, arXiv, computer vision, generative AI, google, ICLR},
pubstate = {published},
tppubtype = {inproceedings}
}
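The "visual tokenizer" in the entry above maps pixel-space features to discrete ids. As a rough illustration of that idea (a generic vector-quantization lookup, not the MAGVIT-v2 tokenizer itself), the core step can be as simple as a nearest-codebook search:

import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    # features: (N, D) encoder outputs; codebook: (K, D) learned entries.
    # Returns the id of the nearest codebook entry for each feature.
    d = (features.pow(2).sum(1, keepdim=True)
         - 2 * features @ codebook.t()
         + codebook.pow(2).sum(1))
    return d.argmin(dim=1)

codebook = torch.randn(1024, 64)        # hypothetical vocabulary of visual tokens
patch_features = torch.randn(16, 64)    # features for 16 image/video patches
print(quantize(patch_features, codebook).shape)  # torch.Size([16])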
Kihyuk Sohn, Nataniel Ruiz, Kimin Lee, Daniel Castro Chin, Irina Blok, Huiwen Chang, Jarred Barber, Lu Jiang, Glenn Entis, Yuanzhen Li, Yuan Hao, Irfan Essa, Michael Rubinstein, Dilip Krishnan
StyleDrop: Text-to-Image Generation in Any Style Proceedings Article
In: Advances in Neural Information Processing Systems (NeurIPS), 2023.
Tags: arXiv, computer vision, generative AI, google, NeurIPS
@inproceedings{2023-Sohn-STGS,
title = {StyleDrop: Text-to-Image Generation in Any Style},
author = {Kihyuk Sohn and Nataniel Ruiz and Kimin Lee and Daniel Castro Chin and Irina Blok and Huiwen Chang and Jarred Barber and Lu Jiang and Glenn Entis and Yuanzhen Li and Yuan Hao and Irfan Essa and Michael Rubinstein and Dilip Krishnan},
url = {https://arxiv.org/abs/2306.00983
https://openreview.net/forum?id=KoaFh16uOc
https://proceedings.neurips.cc/paper_files/paper/2023/hash/d33b177b69425e7685b0b1c05bd2a5e4-Abstract-Conference.html},
doi = {10.48550/arXiv.2306.00983},
year = {2023},
date = {2023-12-11},
urldate = {2023-12-11},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
abstract = {Pre-trained large text-to-image models synthesize impressive images with an appropriate use of text prompts. However, ambiguities inherent in natural language and out-of-distribution effects make it hard to synthesize image styles that leverage a specific design pattern, texture or material. In this paper, we introduce StyleDrop, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. The proposed method is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. It efficiently learns a new style by fine-tuning very few trainable parameters (less than 1% of total model parameters) and improving the quality via iterative training with either human or automated feedback. Better yet, StyleDrop is able to deliver impressive results even when the user supplies only a single image that specifies the desired style. An extensive study shows that, for the task of style tuning text-to-image models, StyleDrop implemented on Muse convincingly outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion. More results are available at our project website: this https URL},
howpublished = {arXiv:2306.00983},
keywords = {arXiv, computer vision, generative AI, google, NeurIPS},
pubstate = {published},
tppubtype = {inproceedings}
}
Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang
SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs Proceedings Article
In: Advances in Neural Information Processing Systems (NeurIPS), 2023.
Tags: arXiv, computational video, computer vision, generative AI, NeurIPS
@inproceedings{2023-Yu-SSPAMGWFL,
title = {SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs},
author = {Lijun Yu and Yong Cheng and Zhiruo Wang and Vivek Kumar and Wolfgang Macherey and Yanping Huang and David A. Ross and Irfan Essa and Yonatan Bisk and Ming-Hsuan Yang and Kevin Murphy and Alexander G. Hauptmann and Lu Jiang},
url = {https://arxiv.org/abs/2306.17842
https://openreview.net/forum?id=CXPUg86A1D
https://proceedings.neurips.cc/paper_files/paper/2023/hash/a526cc8f6ffb74bedb6ff313e3fdb450-Abstract-Conference.html},
doi = {10.48550/arXiv.2306.17842},
year = {2023},
date = {2023-12-11},
urldate = {2023-12-11},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
abstract = {In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.},
howpublished = {Advances in Neural Information Processing Systems (NeurIPS) (arXiv:2306.17842v2)},
keywords = {arXiv, computational video, computer vision, generative AI, NeurIPS},
pubstate = {published},
tppubtype = {inproceedings}
}
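A toy sketch of the central idea in the SPAE abstract, i.e. quantizing visual features against a frozen LLM's own word embeddings so the resulting tokens are ordinary words (names and shapes are assumptions, not the paper's API):

import torch
import torch.nn.functional as F

def to_lexical_tokens(visual_features, word_embeddings, vocab):
    # visual_features: (N, D); word_embeddings: (V, D) from a frozen LLM;
    # vocab: list of V strings. Each visual feature is mapped to the word
    # whose embedding is most similar under cosine similarity.
    sims = F.normalize(visual_features, dim=-1) @ \
           F.normalize(word_embeddings, dim=-1).t()
    ids = sims.argmax(dim=-1)
    return [vocab[i] for i in ids.tolist()]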
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
MAGVIT: Masked Generative Video Transformer Proceedings Article
In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.
Tags: computational video, computer vision, CVPR, generative AI, generative media, google
@inproceedings{2023-Yu-MMGVT,
title = {MAGVIT: Masked Generative Video Transformer},
author = {Lijun Yu and Yong Cheng and Kihyuk Sohn and José Lezama and Han Zhang and Huiwen Chang and Alexander G. Hauptmann and Ming-Hsuan Yang and Yuan Hao and Irfan Essa and Lu Jiang},
url = {https://arxiv.org/abs/2212.05199
https://magvit.cs.cmu.edu/
https://openaccess.thecvf.com/content/CVPR2023/papers/Yu_MAGVIT_Masked_Generative_Video_Transformer_CVPR_2023_paper.pdf
https://openaccess.thecvf.com/content/CVPR2023/supplemental/Yu_MAGVIT_Masked_Generative_CVPR_2023_supplemental.pdf},
doi = {10.48550/ARXIV.2212.05199},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
booktitle = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
abstract = {We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at this https URL.},
keywords = {computational video, computer vision, CVPR, generative AI, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
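Masked video token modeling, mentioned in the MAGVIT abstract, can be summarized in a few lines: replace a random subset of discrete tokens with a MASK id and train the transformer to recover the originals at those positions. The snippet below is an illustrative sketch under those assumptions, not the released code.

import torch
import torch.nn.functional as F

VOCAB_SIZE, MASK_ID = 1024, 1024   # MASK uses one extra id beyond the vocabulary

def masked_token_loss(model, tokens: torch.Tensor, mask_ratio: float = 0.5):
    # tokens: (B, L) discrete ids produced by a (3D) video tokenizer.
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_ratio
    inputs = tokens.masked_fill(mask, MASK_ID)
    logits = model(inputs)                    # (B, L, VOCAB_SIZE)
    # Cross-entropy only on the masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])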
Dina Bashkirova, José Lezama, Kihyuk Sohn, Kate Saenko, Irfan Essa
MaskSketch: Unpaired Structure-guided Masked Image Generation Proceedings Article
In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.
Tags: computer vision, CVPR, generative AI, generative media, google
@inproceedings{2023-Bashkirova-MUSMIG,
title = {MaskSketch: Unpaired Structure-guided Masked Image Generation},
author = { Dina Bashkirova and José Lezama and Kihyuk Sohn and Kate Saenko and Irfan Essa},
url = {https://arxiv.org/abs/2302.05496
https://openaccess.thecvf.com/content/CVPR2023/papers/Bashkirova_MaskSketch_Unpaired_Structure-Guided_Masked_Image_Generation_CVPR_2023_paper.pdf
https://openaccess.thecvf.com/content/CVPR2023/supplemental/Bashkirova_MaskSketch_Unpaired_Structure-Guided_CVPR_2023_supplemental.pdf},
doi = {10.48550/ARXIV.2302.05496},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
booktitle = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
abstract = {Recent conditional image generation methods produce images of remarkable diversity, fidelity and realism. However, the majority of these methods allow conditioning only on labels or text prompts, which limits their level of control over the generation result. In this paper, we introduce MaskSketch, an image generation method that allows spatial conditioning of the generation result using a guiding sketch as an extra conditioning signal during sampling. MaskSketch utilizes a pre-trained masked generative transformer, requiring no model training or paired supervision, and works with input sketches of different levels of abstraction. We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image, such as scene layout and object shape, and we propose a novel sampling method based on this observation to enable structure-guided generation. Our results show that MaskSketch achieves high image realism and fidelity to the guiding structure. Evaluated on standard benchmark datasets, MaskSketch outperforms state-of-the-art methods for sketch-to-image translation, as well as unpaired image-to-image translation approaches.},
keywords = {computer vision, CVPR, generative AI, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
Kihyuk Sohn, Yuan Hao, José Lezama, Luisa Polania, Huiwen Chang, Han Zhang, Irfan Essa, Lu Jiang
Visual Prompt Tuning for Generative Transfer Learning Proceedings Article
In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.
Tags: computer vision, CVPR, generative AI, generative media, google
@inproceedings{2022-Sohn-VPTGTL,
title = {Visual Prompt Tuning for Generative Transfer Learning},
author = {Kihyuk Sohn and Yuan Hao and José Lezama and Luisa Polania and Huiwen Chang and Han Zhang and Irfan Essa and Lu Jiang},
url = {https://arxiv.org/abs/2210.00990
https://openaccess.thecvf.com/content/CVPR2023/papers/Sohn_Visual_Prompt_Tuning_for_Generative_Transfer_Learning_CVPR_2023_paper.pdf
https://openaccess.thecvf.com/content/CVPR2023/supplemental/Sohn_Visual_Prompt_Tuning_CVPR_2023_supplemental.pdf},
doi = {10.48550/ARXIV.2210.00990},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
booktitle = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
abstract = {Transferring knowledge from an image synthesis model trained on a large dataset is a promising direction for learning generative image models from various domains efficiently. While previous works have studied GAN models, we present a recipe for learning vision transformers by generative knowledge transfer. We base our framework on state-of-the-art generative vision transformers that represent an image as a sequence of visual tokens to the autoregressive or non-autoregressive transformers. To adapt to a new domain, we employ prompt tuning, which prepends learnable tokens called prompts to the image token sequence, and introduce a new prompt design for our task. We study a variety of visual domains, including the Visual Task Adaptation Benchmark (Zhai et al., 2019), with varying amounts of training images, and show the effectiveness of knowledge transfer and a significantly better image generation quality over existing works.},
keywords = {computer vision, CVPR, generative AI, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
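The prompt tuning recipe described above (prepending learnable tokens to the image token sequence while the transformer stays frozen) reduces to very little code. A hypothetical sketch, with made-up names and dimensions:

import torch
import torch.nn as nn

class PromptedGenerativeTransformer(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int, num_prompts: int = 32):
        super().__init__()
        self.backbone = backbone              # pretrained generative ViT, frozen
        for p in self.backbone.parameters():
            p.requires_grad = False
        # The only trainable parameters: the prompt embeddings.
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (B, L, D); prepend (B, P, D) learned prompts.
        b = token_embeddings.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        return self.backbone(torch.cat([prompts, token_embeddings], dim=1))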
Kihyuk Sohn, Albert Shaw, Yuan Hao, Han Zhang, Luisa Polania, Huiwen Chang, Lu Jiang, Irfan Essa
Learning Disentangled Prompts for Compositional Image Synthesis Technical Report
2023.
Tags: arXiv, computer vision, generative AI, google, prompt engineering
@techreport{2023-Sohn-LDPCIS,
title = {Learning Disentangled Prompts for Compositional Image Synthesis},
author = {Kihyuk Sohn and Albert Shaw and Yuan Hao and Han Zhang and Luisa Polania and Huiwen Chang and Lu Jiang and Irfan Essa},
url = {https://arxiv.org/abs/2306.00763},
doi = {10.48550/arXiv.2306.00763},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
abstract = {We study domain-adaptive image synthesis, the problem of teaching pretrained image generative models a new style or concept from as few as one image to synthesize novel images, to better understand compositional image synthesis. We present a framework that leverages a pre-trained class-conditional generation model and visual prompt tuning. Specifically, we propose a novel source class distilled visual prompt that learns disentangled prompts of semantic (e.g., class) and domain (e.g., style) from a few images. The learned domain prompt is then used to synthesize images of any class in the style of the target domain. We conduct studies on various target domains with the number of images ranging from one to a few to many, and show qualitative results that demonstrate the compositional generalization of our method. Moreover, we show that our method can help improve zero-shot domain adaptation classification accuracy.
},
howpublished = {arXiv:2306.00763},
keywords = {arXiv, computer vision, generative AI, google, prompt engineering},
pubstate = {published},
tppubtype = {techreport}
}
José Lezama, Tim Salimans, Lu Jiang, Huiwen Chang, Jonathan Ho, Irfan Essa
Discrete Predictor-Corrector Diffusion Models for Image Synthesis Proceedings Article
In: International Conference on Learning Representations (ICLR), 2023.
Tags: computer vision, generative AI, generative media, google, ICLR, machine learning
@inproceedings{2023-Lezama-DPDMIS,
title = {Discrete Predictor-Corrector Diffusion Models for Image Synthesis},
author = {José Lezama and Tim Salimans and Lu Jiang and Huiwen Chang and Jonathan Ho and Irfan Essa},
url = {https://openreview.net/forum?id=VM8batVBWvg},
year = {2023},
date = {2023-05-01},
urldate = {2023-05-01},
booktitle = {International Conference on Learning Representations (ICLR)},
abstract = {We introduce Discrete Predictor-Corrector diffusion models (DPC), extending predictor-corrector samplers in Gaussian diffusion models to the discrete case. Predictor-corrector samplers are a class of samplers for diffusion models, which improve on ancestral samplers by correcting the sampling distribution of intermediate diffusion states using MCMC methods. In DPC, the Langevin corrector, which does not have a direct counterpart in discrete space, is replaced with a discrete MCMC transition defined by a learned corrector kernel. The corrector kernel is trained to make the correction steps achieve asymptotic convergence, in distribution, to the correct marginal of the intermediate diffusion states. Equipped with DPC, we revisit recent transformer-based non-autoregressive generative models through the lens of discrete diffusion, and find that DPC can alleviate the compounding decoding error due to the parallel sampling of visual tokens. Our experiments show that DPC improves upon existing discrete latent space models for class-conditional image generation on ImageNet, and outperforms continuous diffusion models and GANs, according to standard metrics and user preference studies.},
keywords = {computer vision, generative AI, generative media, google, ICLR, machine learning},
pubstate = {published},
tppubtype = {inproceedings}
}
Tianhao Zhang, Weilong Yang, Honglak Lee, Hung-Yu Tseng, Irfan Essa, Lu Jiang
Image manipulation by text instruction Patent
2023.
Tags: content creation, generative AI, google, media generation, patents
@patent{2023-Zhang-IMTI,
title = {Image manipulation by text instruction},
author = {Tianhao Zhang and Weilong Yang and Honglak Lee and Hung-Yu Tseng and Irfan Essa and Lu Jiang},
url = {https://patents.google.com/patent/US11562518},
year = {2023},
date = {2023-01-01},
urldate = {2023-01-01},
abstract = {A method for generating an output image from an input image and an input text instruction that specifies a location and a modification of an edit applied to the input image using a neural network is described. The neural network includes an image encoder, an image decoder, and an instruction attention network. The method includes receiving the input image and the input text instruction; extracting, from the input image, an input image feature that represents features of the input image using the image encoder; generating a spatial feature and a modification feature from the input text instruction using the instruction attention network; generating an edited image feature from the input image feature, the spatial feature and the modification feature; and generating the output image from the edited image feature using the image decoder.},
howpublished = {US Patent # US11562518},
keywords = {content creation, generative AI, google, media generation, patents},
pubstate = {published},
tppubtype = {patent}
}
José Lezama, Huiwen Chang, Lu Jiang, Irfan Essa
Improved Masked Image Generation with Token-Critic Proceedings Article
In: European Conference on Computer Vision (ECCV), Springer, 2022, ISBN: 978-3-031-20050-2.
Tags: computer vision, ECCV, generative AI, generative media, google
@inproceedings{2022-Lezama-IMIGWT,
title = {Improved Masked Image Generation with Token-Critic},
author = {José Lezama and Huiwen Chang and Lu Jiang and Irfan Essa},
url = {https://arxiv.org/abs/2209.04439
https://rdcu.be/c61MZ},
doi = {10.1007/978-3-031-20050-2_5},
isbn = {978-3-031-20050-2},
year = {2022},
date = {2022-10-28},
urldate = {2022-10-28},
booktitle = {European Conference on Computer Vision (ECCV)},
volume = {13683},
publisher = {Springer},
abstract = {Non-autoregressive generative transformers recently demonstrated impressive image generation performance, and orders of magnitude faster sampling than their autoregressive counterparts. However, optimal parallel sampling from the true joint distribution of visual tokens remains an open challenge. In this paper we introduce Token-Critic, an auxiliary model to guide the sampling of a non-autoregressive generative transformer. Given a masked-and-reconstructed real image, the Token-Critic model is trained to distinguish which visual tokens belong to the original image and which were sampled by the generative transformer. During non-autoregressive iterative sampling, Token-Critic is used to select which tokens to accept and which to reject and resample. Coupled with Token-Critic, a state-of-the-art generative transformer significantly improves its performance, and outperforms recent diffusion models and GANs in terms of the trade-off between generated image quality and diversity, in the challenging class-conditional ImageNet generation.},
keywords = {computer vision, ECCV, generative AI, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
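One refinement step of critic-guided sampling, as described in the Token-Critic abstract, might look roughly like this (generator and critic are hypothetical callables; this is a sketch, not the paper's code): the generator fills every masked position, the critic scores how "real" each token looks, and the least plausible tokens are re-masked for the next iteration.

import torch

def critic_guided_step(generator, critic, tokens, mask_id, num_to_remask):
    # tokens: (B, L) ids, where masked positions hold mask_id.
    logits = generator(tokens)                                # (B, L, V)
    sampled = torch.distributions.Categorical(logits=logits).sample()
    filled = torch.where(tokens == mask_id, sampled, tokens)
    scores = critic(filled)                                   # (B, L) realism scores
    remask = scores.topk(num_to_remask, dim=1, largest=False).indices
    return filled.scatter(1, remask, mask_id)                 # re-mask weakest tokens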
Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan Hao, Haifeng Gong, Irfan Essa
BLT: Bidirectional Layout Transformer for Controllable Layout Generation Proceedings Article
In: European Conference on Computer Vision (ECCV), 2022, ISBN: 978-3-031-19789-5.
Tags: computer vision, ECCV, generative AI, generative media, google, vision transformer
@inproceedings{2022-Kong-BLTCLG,
title = {BLT: Bidirectional Layout Transformer for Controllable Layout Generation},
author = {Xiang Kong and Lu Jiang and Huiwen Chang and Han Zhang and Yuan Hao and Haifeng Gong and Irfan Essa},
url = {https://arxiv.org/abs/2112.05112
https://rdcu.be/c61AE},
doi = {10.1007/978-3-031-19790-1_29},
isbn = {978-3-031-19789-5},
year = {2022},
date = {2022-10-25},
urldate = {2022-10-25},
booktitle = {European Conference on Computer Vision (ECCV)},
volume = {13677},
abstract = {Creating visual layouts is a critical step in graphic design. Automatic generation of such layouts is essential for scalable and diverse visual designs. To advance conditional layout generation, we introduce BLT, a bidirectional layout transformer. BLT differs from previous work on transformers in adopting non-autoregressive transformers. In training, BLT learns to predict the masked attributes by attending to surrounding attributes in two directions. During inference, BLT first generates a draft layout from the input and then iteratively refines it into a high-quality layout by masking out low-confidence attributes. The masks generated in both training and inference are controlled by a new hierarchical sampling policy. We verify the proposed model on six benchmarks of diverse design tasks. Experimental results demonstrate two benefits compared to the state-of-the-art layout transformer models. First, our model empowers layout transformers to fulfill controllable layout generation. Second, it achieves up to 10x speedup in generating a layout at inference time over the layout transformer baseline. Code is released at https://shawnkx.github.io/blt.},
keywords = {computer vision, ECCV, generative AI, generative media, google, vision transformer},
pubstate = {published},
tppubtype = {inproceedings}
}
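The draft-then-refine loop in the BLT abstract (keep confident layout attributes, re-mask the rest) can be sketched generically as follows; the model interface and the one-quarter re-masking schedule are assumptions for illustration only.

import torch

def refine_layout(model, attrs, mask_id, steps: int = 4):
    # attrs: (B, L) discrete layout-attribute ids, some equal to mask_id.
    for step in range(steps):
        probs = torch.softmax(model(attrs), dim=-1)   # (B, L, V)
        conf, pred = probs.max(dim=-1)                # per-position confidence
        attrs = torch.where(attrs == mask_id, pred, attrs)
        if step < steps - 1:                          # keep the final pass clean
            k = attrs.shape[1] // 4                   # re-mask least confident quarter
            low = conf.topk(k, dim=1, largest=False).indices
            attrs = attrs.scatter(1, low, mask_id)
    return attrs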
Other Publication Sites
A few more sites that aggregate research publications: Academia.edu, BibSonomy, CiteULike, Mendeley.
Copyright/About
[Please see the Copyright Statement that may apply to the content listed here.]
This list of publications is produced by using the teachPress plugin for WordPress.