NEWS
_______
_______
Title: Leveraging Text Signed Distance Function Map for Boundary-Aware Guidance in Scene Text Segmentation
Authors: Ho Jun Kim and Hak Gu Kim
Abstract: Scene text segmentation aims to predict pixel-wise text regions in an image, enabling in-image text editing or removal. One of the primary challenges is to suppress noise, including non-text regions, while predicting intricate text boundaries. To deal with this, traditional approaches explicitly employ a text detection or recognition module. However, such approaches tend to highlight noise around the text and, because they do not sufficiently consider text boundaries, they fail to accurately predict the fine details of text. In this paper, we introduce the text signed distance function (SDF) map, which encodes distance information from text boundaries, into scene text segmentation to explicitly provide text boundary information. Through a spatial cross-attention mechanism, we encode a text-attended feature from the text SDF map. Then, both visual and text-attended features are used to decode the text segmentation map. Our approach not only mitigates confusion between text and complex backgrounds by eliminating false positives such as logos and texture blobs located far from the text, but also effectively captures fine details of complex text patterns by leveraging text boundary information. Extensive experiments demonstrate that leveraging the text SDF map yields superior performance on various scene text segmentation datasets.
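For readers unfamiliar with SDF maps, the following is a minimal sketch of how a text SDF map could be computed from a binary text mask. The sign convention, clipping value, and normalization below are illustrative assumptions, not the paper's exact preprocessing.

import numpy as np
from scipy.ndimage import distance_transform_edt

def text_sdf_map(text_mask: np.ndarray, clip: float = 32.0) -> np.ndarray:
    # text_mask: binary (H, W) array with 1 on text pixels and 0 on background.
    inside = distance_transform_edt(text_mask)        # distance to the nearest background pixel
    outside = distance_transform_edt(1 - text_mask)   # distance to the nearest text pixel
    sdf = outside - inside                            # ~0 on the text boundary (assumed sign convention)
    return np.clip(sdf, -clip, clip) / clip           # clipped and scaled to [-1, 1] for network input

A segmentation label can be converted beforehand with mask = (label > 0).astype(np.uint8).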
Hyun Wook Kim and Byung Chan Hwang joined our lab as Master's students.
Hyun Wook Kim received a B.S. degree from CK College in Feb. 2025.
Byung Chan Hwang received a B.S. degree from Sejong Univ. in Feb. 2025.
Welcome to IRIS@CAU!
Title: 2nd Workshop on Integrating Image Processing with Large-Scale Language/Vision Models for Advanced Visual Understanding
Organizers: Yong Man Ro (KAIST), Wen-Huang Cheng (National Taiwan Univ.), and Hak Gu Kim (Chung-Ang Univ.)
Short abstract: This workshop aims to bridge the gap between conventional image processing techniques and the latest advancements in large-scale vision and language models. Recent developments in large-scale models have revolutionized image processing tasks, significantly enhancing capabilities in visual object understanding, image classification, and generative image synthesis. Furthermore, the large-scale models have opened new avenues for human-machine multimodal interactive dialogue systems, where the synergy between visual and linguistic processing enables more intuitive and dynamic interactions. This workshop will provide a platform for researchers and practitioners to explore how cutting-edge large-scale models integrate with image processing methods and foster innovation across diverse applications. Discussions will extend beyond conventional tasks to address the role of vision-language models in Generative AI and their use in multimodal systems, such as virtual assistants that interact seamlessly using images, text, and speech.
Prof. Kim was selected as an outstanding full-time faculty member based on the course evaluations for the 2nd semester of 2024.
The outstanding faculty member designation was awarded to full-time faculty teaching undergraduate courses who ranked in the top 30% based on course evaluations.
Course: Application to Pattern Recognition (56122)
Department: School of Computer Science and Engineering
Our M.S. student Hyung Kyu Kim has won the silver prize for the Best Paper Award at IPIU 2025.
Since its inception in 1989, the Workshop on Image Processing and Image Understanding (IPIU) has been Korea’s premier academic conference in image processing. It has played a pivotal role in advancing image processing and computer vision while serving as a vital bridge for the growth of the domestic research community.
Title: Audio-Lip Motion Memory Network for Personalized Speech-driven 3D Facial Animation
Authors: Hyung Kyu Kim and Hak Gu Kim
Title: MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection
Authors: Taeheon Kim, Sangyun Chung, Damin Yeom, Youngjoon Yu, Hak Gu Kim, and Yong Man Ro
Abstract: Multispectral pedestrian detection is attractive for around-the-clock applications due to the complementary information between RGB and thermal modalities. However, current models often fail to detect pedestrians in certain cases (e.g., thermal-obscured pedestrians), particularly due to the modality bias learned from statistically biased datasets. In this paper, we investigate how to mitigate modality bias in multispectral pedestrian detection using a Large Language Model (LLM). Accordingly, we design a Multispectral Chain-of-Thought (MSCoT) prompting strategy, which prompts the LLM to perform multispectral pedestrian detection. Moreover, we propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework that integrates MSCoT prompting into multispectral pedestrian detection. To this end, we design a Language-driven Multi-modal Fusion (LMF) strategy that enables fusing the outputs of MSCoT prompting with the detection results of vision-based multispectral pedestrian detection models. Extensive experiments validate that MSCoTDet effectively mitigates modality biases and improves multispectral pedestrian detection.
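As a rough illustration of the idea (not the paper's actual prompt or its LMF strategy), a chain-of-thought style prompt over per-modality descriptions and a simple score fusion might look like the sketch below; the template wording, field names, and the log-space averaging are assumptions.

import math

# Hypothetical prompt template: reason over each modality, then combine (illustrative only).
MSCOT_PROMPT = (
    "RGB description of the region: {rgb_caption}\n"
    "Thermal description of the region: {thermal_caption}\n"
    "Step 1: What does the RGB evidence alone suggest about a pedestrian being present?\n"
    "Step 2: What does the thermal evidence alone suggest?\n"
    "Step 3: Combining both, answer with a probability in [0, 1] that a pedestrian is present."
)

def fuse_scores(vision_score: float, llm_score: float, alpha: float = 0.5) -> float:
    # Geometric-mean style fusion of the vision detector's confidence and the LLM-derived confidence.
    eps = 1e-6
    return math.exp(alpha * math.log(vision_score + eps) + (1.0 - alpha) * math.log(llm_score + eps))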
Seungjae Lee joined our lab as a Master's student.
Seungjae Lee received a B.S. degree from Myungji Univ. in Feb. 2024.
Welcome to IRIS@CAU!
Title: Unveiling the Potential of Multimodal Large Language Models for Scene Text Segmentation via Semantic-Enhanced Features
Authors: Ho Jun Kim*, Hyung Kyu Kim*, Sangmin Lee (UIUC), and Hak Gu Kim (*equal contribution)
Abstract: Scene text segmentation aims to accurately identify text areas within a scene while disregarding non-textual elements such as background imagery or graphical elements. However, current text segmentation models often fail to accurately segment text regions due to complex background noise or varied font styles and sizes. To address this issue, it is essential to consider not only visual information but also the semantic information of text in scene text segmentation. For this purpose, we propose a novel semantic-aware scene text segmentation framework, which incorporates multimodal large language models (MLLMs) to fuse visual, textual, and linguistic information. By leveraging semantic-enhanced features from MLLMs, the scene text segmentation model can remove false positives that are visually confusing but not recognized as text. Both qualitative and quantitative evaluations demonstrate that MLLMs improve scene text segmentation performance.
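As a sketch of what fusing MLLM-derived semantic features with visual features can look like (the layer sizes, single attention block, and residual design are assumptions, not the proposed architecture):

import torch
import torch.nn as nn

class SemanticFusion(nn.Module):
    def __init__(self, vis_dim: int = 256, sem_dim: int = 4096, heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(sem_dim, vis_dim)  # project MLLM hidden states to the visual feature width
        self.attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens: torch.Tensor, sem_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N, vis_dim) flattened image features; sem_tokens: (B, M, sem_dim) MLLM features.
        sem = self.proj(sem_tokens)
        fused, _ = self.attn(query=vis_tokens, key=sem, value=sem)
        return self.norm(vis_tokens + fused)  # residual fusion of visual and semantic information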
Title: Analyzing Visible Articulatory Movements in Speech Production for Speech-Driven 3D Facial Animation
Authors: Hyung Kyu Kim, Sangmin Lee (UIUC), and Hak Gu Kim
Abstract: Speech-driven 3D facial animation aims to generate realistic facial meshes from input speech signals. However, due to a limited understanding of visible articulatory movements, current state-of-the-art methods result in inaccurate lip and jaw movements. Moreover, traditional evaluation metrics such as lip vertex error (LVE) often fail to reflect the quality of visual results. Based on our observations, we reveal the problems of existing evaluation metrics and raise the necessity of separate evaluation approaches for the three 3D axes. A comprehensive analysis shows that most recent methods struggle to precisely predict lip and jaw movements in 3D space.
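For reference, the lip vertex error mentioned above is commonly computed as the per-frame maximum L2 error over lip vertices, averaged over frames; the axis-wise error below illustrates the kind of separate per-axis evaluation the abstract argues for (exact protocols vary between papers).

import numpy as np

def lip_vertex_error(pred: np.ndarray, gt: np.ndarray, lip_idx) -> float:
    # pred, gt: (T, V, 3) vertex sequences; lip_idx: indices of the lip vertices.
    err = np.linalg.norm(pred[:, lip_idx] - gt[:, lip_idx], axis=-1)  # (T, L) per-vertex L2 error
    return float(err.max(axis=1).mean())                              # max over lip vertices, mean over frames

def per_axis_error(pred: np.ndarray, gt: np.ndarray, lip_idx) -> np.ndarray:
    # Mean absolute error reported separately along the x, y, and z axes.
    return np.abs(pred[:, lip_idx] - gt[:, lip_idx]).mean(axis=(0, 1))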
Title: Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection
Authors: Taeheon Kim (KAIST), Sebin Shin (KAIST), Youngjoon Yu (KAIST), Hak Gu Kim, and Yong Man Ro (KAIST)
Abstract: RGBT multispectral pedestrian detection has emerged as a promising solution for safety-critical applications that require day/night operation. However, the modality bias problem remains unsolved, as multispectral pedestrian detectors learn the statistical bias in datasets. Specifically, datasets in multispectral pedestrian detection are mainly distributed between ROTO (day) and RXTO (night) data; the majority of pedestrian labels statistically co-occur with their thermal features. As a result, multispectral pedestrian detectors show poor generalization on examples beyond this statistical correlation, such as ROTX data. To address this problem, we propose a novel Causal Mode Multiplexer (CMM) framework that effectively learns the causalities between multispectral inputs and predictions. Moreover, we construct a new dataset (ROTX-MP) to evaluate modality bias in multispectral pedestrian detection. ROTX-MP mainly includes ROTX examples not present in previous datasets. Extensive experiments demonstrate that our proposed CMM framework generalizes well on existing datasets (KAIST, CVC-14, FLIR) and the new ROTX-MP. We will release the new dataset to the public for future research.
San Ah Jeong and Jung Jae Yu joined our lab as Master's students.
San Ah Jeong received a B.S. degree from Korea National University of Transportation (KNUT) in Feb. 2024.
Jung Jae Yu received a B.S. degree from Sun Moon Univ. in Feb. 2024.
Welcome to IRIS@CAU!
Title: Photometric Stereo Super Resolution via Complex Surface Structure Estimation
Authors: Han-nyoung Lee and Hak Gu Kim
Abstract: Photometric stereo, which derives per-pixel surface normals from shading cues, faces challenges in capturing high-resolution (HR) images in linear response systems. We address the representation of HR surface normals from low-resolution (LR) photometric stereo images. To represent fine details of the surface normal in the HR domain, we propose a novel plug-in high-frequency representation module named the Complex Surface Structure (CSS) estimator. When combined with a conventional photometric stereo model, CSS is capable of representing intricate surface structures in 2D Fourier space. We show that photometric stereo super-resolution (SR) with our CSS estimator provides high-fidelity surface normal representations in higher resolution from the LR inputs. Experiments demonstrate that our results are quantitatively and qualitatively better than those of the existing deep learning-based SR work.
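The general idea of representing missing detail in 2D Fourier space can be sketched as below; this toy module (not the paper's CSS estimator) predicts a complex-valued correction to the spectrum of an upsampled normal map, and the 1x1 convolution is an illustrative placeholder for a learned predictor.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HighFreqRefiner(nn.Module):
    def __init__(self, ch: int = 3):
        super().__init__()
        self.net = nn.Conv2d(2 * ch, 2 * ch, kernel_size=1)  # predicts a spectral correction

    def forward(self, normals_up: torch.Tensor) -> torch.Tensor:
        # normals_up: (B, 3, H, W) normal map bilinearly upsampled to the target resolution.
        spec = torch.fft.rfft2(normals_up, norm="ortho")
        corr = self.net(torch.cat([spec.real, spec.imag], dim=1))
        real, imag = torch.chunk(corr, 2, dim=1)
        refined = spec + torch.complex(real, imag)            # add the predicted high-frequency content
        out = torch.fft.irfft2(refined, s=normals_up.shape[-2:], norm="ortho")
        return F.normalize(out, dim=1)                        # re-normalize to unit surface normals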
Title: Super-Resolution Neural Radiance Field via Learning High Frequency Details for High-Fidelity Novel View Synthesis
Authors: Han-nyoung Lee and Hak Gu Kim
Abstract: While neural rendering approaches facilitate photorealistic rendering in novel view synthesis tasks, the challenge of high-resolution rendering persists due to the substantial costs of acquiring data and training. Recently, several studies have proposed rendering high-resolution scenes by either super-sampling points or using reference images, aiming to restore details missing in low-resolution (LR) images. However, super-sampling is computationally expensive, and reference-based methods require high-resolution (HR) images at inference time. In this paper, we propose a novel super-resolution (SR) neural radiance field (NeRF) framework for high-fidelity novel view synthesis. To represent high-fidelity HR images from the captured LR images, we learn a mapping function that maps LR rendered images into the Fourier space, restores the missing high-frequency details, and renders HR images at a higher resolution. Experiments demonstrate that our results are quantitatively and qualitatively better than those of existing SR methods in novel view synthesis. By visualizing the estimated dominant frequency components, we provide visual interpretations of the performance improvement.
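The toy function below illustrates why the Fourier space is a natural place for this: upsampling amounts to filling in the high-frequency coefficients that the LR rendering lacks. Here they are simply left at zero; a learned mapping like the one described above would predict them instead (all details are illustrative assumptions).

import torch

def fourier_upsample(img_lr: torch.Tensor, scale: int = 2) -> torch.Tensor:
    # img_lr: (B, C, H, W) low-resolution rendered image.
    B, C, H, W = img_lr.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img_lr, norm="ortho"), dim=(-2, -1))
    spec_hr = torch.zeros(B, C, H * scale, W * scale, dtype=spec.dtype, device=spec.device)
    top, left = (H * scale - H) // 2, (W * scale - W) // 2
    spec_hr[..., top:top + H, left:left + W] = spec           # missing high-frequency bands stay zero
    spec_hr = torch.fft.ifftshift(spec_hr, dim=(-2, -1))
    return torch.fft.ifft2(spec_hr, norm="ortho").real * scale  # rescale amplitude after ortho transforms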
Hyung Kyu Kim and Kyo-Seok Lee joined our lab as Master's students.
Hyung Kyu Kim is expected to graduate from the School of Computer Science & Engineering (CSE), Konkuk Univ. in Feb. 2023.
Kyo-Seok Lee is expected to graduate from the Department of Medical IT, Eulji Univ. in Feb. 2023.
Welcome to IRIS@CAU!
Ho Jun Kim and Jiwoo Hwang joined our lab as a Master's student and a research intern, respectively.
Ho Jun Kim received a B.S. degree from Dankook Univ. in Feb. 2022.
Jiwoo Hwang is double majoring in the School of Computer Arts & Computer Science and Engineering (CSE), Chung-Ang Univ.
Welcome to IRIS@CAU!
Title: Natural-Looking Adversarial Examples from Freehand Sketches
Authors: Hak Gu Kim, Davide Nanni (EPFL), and Sabine Süsstrunk (EPFL)
Abstract: Deep neural networks (DNNs) have achieved great success in image classification and recognition compared to previous methods. However, recent works have reported that DNNs are highly vulnerable to adversarial examples that are intentionally generated to mislead their predictions. Here, we present a novel freehand sketch-based natural-looking adversarial example generator that we call SketchAdv. To generate a natural-looking adversarial example from a sketch, we force the encoded edge information (i.e., the visual attributes) to be close to the latent random vector fed to the edge generator and the adversarial example generator. This preserves the spatial consistency between the adversarial example generated from the random vector and the edge information. In addition, through a sketch-edge encoder with a novel sketch-edge matching loss, we reduce the gap between edges and sketches. We evaluate the proposed method on several dominant classes of SketchyCOCO, a benchmark dataset for sketch-to-image translation. Our experiments show that SketchAdv produces visually plausible adversarial examples while remaining competitive with other adversarial attack methods.
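A minimal sketch of the two objectives described above, with assumed functional forms (MSE for the latent consistency and a cosine-based matching term); this is an illustration, not the paper's exact losses.

import torch
import torch.nn.functional as F

def latent_and_matching_losses(edge_code, z, sketch_emb, edge_emb):
    # Keep the encoded edge information close to the latent vector driving the generators.
    latent_consistency = F.mse_loss(edge_code, z)
    # Pull sketch and edge embeddings together to reduce the sketch-edge gap.
    sketch_edge_matching = 1.0 - F.cosine_similarity(sketch_emb, edge_emb, dim=-1).mean()
    return latent_consistency, sketch_edge_matching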
Han-nyoung Lee and Seon Ho Park joined our lab as a Master's student and a research intern, respectively.
Han-nyoung Lee is expected to graduate from the School of Integrative Engineering (Digital Imaging), Chung-Ang Univ. in Feb. 2022.
Seon Ho Park is double majoring in the Department of Brain & Cognitive Sciences and Statistics, Ewha Womans Univ.
Welcome to IRIS@CAU again!
Welcome to the Immersive Reality and Intelligent Systems Lab (IRIS Lab) at Chung-Ang Univ. (CAU)!
The main goal of the IRIS Lab is to develop state-of-the-art machine learning/deep learning-based intelligent systems that create the future of immersive reality (e.g., AR/VR/Metaverse), i.e., the convergence of AI and reality.
For more information, please visit our research & publications pages.