NEWS
_______
_______
Title: Leveraging Text Signed Distance Function Map for Boundary-Aware Guidance in Scene Text Segmentation
Authors: Ho Jun Kim and Hak Gu Kim
Abstract: Scene text segmentation aims to predict pixel-wise text regions in an image, enabling in-image text editing or removal. One of the primary challenges is to suppress noise, including non-text regions, while predicting intricate text boundaries. To deal with this, traditional approaches explicitly employ a text detection or recognition module. However, such approaches tend to highlight noise around the text and, because they do not sufficiently consider text boundaries, they fail to accurately predict the fine details of text. In this paper, we introduce the text signed distance function (SDF) map, which encodes distance information from text boundaries, into scene text segmentation to explicitly provide text boundary information. Through a spatial cross-attention mechanism, we encode a text-attended feature from the text SDF map. Then, both visual and text-attended features are used to decode the text segmentation map. Our approach not only mitigates confusion between text and complex backgrounds by eliminating false positives such as logos and texture blobs located far from the text, but also effectively captures fine details of complex text patterns by leveraging text boundary information. Extensive experiments demonstrate that leveraging the text SDF map yields superior performance on various scene text segmentation datasets.
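For readers unfamiliar with SDF maps, the following is a minimal sketch of how a text SDF map could be computed from a binary text mask. The sign convention, clipping value, and normalization below are illustrative assumptions, not the paper's exact preprocessing.

import numpy as np
from scipy.ndimage import distance_transform_edt

def text_sdf_map(text_mask: np.ndarray, clip: float = 32.0) -> np.ndarray:
    # text_mask: binary (H, W) array with 1 on text pixels and 0 on background.
    inside = distance_transform_edt(text_mask)        # distance to the nearest background pixel
    outside = distance_transform_edt(1 - text_mask)   # distance to the nearest text pixel
    sdf = outside - inside                            # ~0 on the text boundary (assumed sign convention)
    return np.clip(sdf, -clip, clip) / clip           # clipped and scaled to [-1, 1] for network input

A segmentation label can be converted beforehand with mask = (label > 0).astype(np.uint8).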
Hyun Wook Kim and Byung Chan Hwang joined our lab as Master's students.
Hyun Wook Kim received a B.S. degree from CK College in Feb. 2025.
Byung Chan Hwang received a B.S. degree from Sejong Univ. in Feb. 2025.
Welcome to IRIS@CAU!
Title: 2nd Workshop on Integrating Image Processing with Large-Scale Language/Vision Models for Advanced Visual Understanding
Organizers: Yong Man Ro (KAIST), Wen-Huang Cheng (National Taiwan Univ.), and Hak Gu Kim (Chung-Ang Univ.)
Short abstract: This workshop aims to bridge the gap between conventional image processing techniques and the latest advancements in large-scale vision and language models. Recent developments in large-scale models have revolutionized image processing tasks, significantly enhancing capabilities in visual object understanding, image classification, and generative image synthesis. Furthermore, the large-scale models have opened new avenues for human-machine multimodal interactive dialogue systems, where the synergy between visual and linguistic processing enables more intuitive and dynamic interactions. This workshop will provide a platform for researchers and practitioners to explore how cutting-edge large-scale models integrate with image processing methods and foster innovation across diverse applications. Discussions will extend beyond conventional tasks to address the role of vision-language models in Generative AI and their use in multimodal systems, such as virtual assistants that interact seamlessly using images, text, and speech.
Prof. Kim was selected as an outstanding full-time faculty member based on the course evaluations for the 2nd semester of 2024.
The outstanding faculty member designation was awarded to full-time faculty teaching undergraduate courses who ranked in the top 30% based on course evaluations.
Course: Application to Pattern Recognition (56122)
Department: School of Computer Science and Engineering
Our M.S. student Hyung Kyu Kim has won the silver prize for the Best Paper Award at IPIU 2025.
Since its inception in 1989, the Workshop on Image Processing and Image Understanding (IPIU) has been Korea’s premier academic conference in image processing. It has played a pivotal role in advancing image processing and computer vision while serving as a vital bridge for the growth of the domestic research community.
Title: Audio-Lip Motion Memory Network for Personalized Speech-driven 3D Facial Animation
Authors: Hyung Kyu Kim and Hak Gu Kim
Title: MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection
Authors: Taeheon Kim, Sangyun Chung, Damin Yeom, Youngjoon Yu, Hak Gu Kim, and Yong Man Ro
Abstract: Multispectral pedestrian detection is attractive for around-the-clock applications due to the complementary information between RGB and thermal modalities. However, current models often fail to detect pedestrians in certain cases (e.g., thermal-obscured pedestrians), particularly due to the modality bias learned from statistically biased datasets. In this paper, we investigate how to mitigate modality bias in multispectral pedestrian detection using a Large Language Model (LLM). Accordingly, we design a Multispectral Chain-of-Thought (MSCoT) prompting strategy, which prompts the LLM to perform multispectral pedestrian detection. Moreover, we propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework that integrates MSCoT prompting into multispectral pedestrian detection. To this end, we design a Language-driven Multi-modal Fusion (LMF) strategy that enables fusing the outputs of MSCoT prompting with the detection results of vision-based multispectral pedestrian detection models. Extensive experiments validate that MSCoTDet effectively mitigates modality biases and improves multispectral pedestrian detection.
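As a rough illustration of the idea (not the paper's actual prompt or its LMF strategy), a chain-of-thought style prompt over per-modality descriptions and a simple score fusion might look like the sketch below; the template wording, field names, and the log-space averaging are assumptions.

import math

# Hypothetical prompt template: reason over each modality, then combine (illustrative only).
MSCOT_PROMPT = (
    "RGB description of the region: {rgb_caption}\n"
    "Thermal description of the region: {thermal_caption}\n"
    "Step 1: What does the RGB evidence alone suggest about a pedestrian being present?\n"
    "Step 2: What does the thermal evidence alone suggest?\n"
    "Step 3: Combining both, answer with a probability in [0, 1] that a pedestrian is present."
)

def fuse_scores(vision_score: float, llm_score: float, alpha: float = 0.5) -> float:
    # Geometric-mean style fusion of the vision detector's confidence and the LLM-derived confidence.
    eps = 1e-6
    return math.exp(alpha * math.log(vision_score + eps) + (1.0 - alpha) * math.log(llm_score + eps))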
Seungjae Lee joined our lab as a Master's student.
Seungjae Lee received a B.S. degree from Myungji Univ. in Feb. 2024.
Welcome to IRIS@CAU!
Title: Unveiling the Potential of Multimodal Large Language Models for Scene Text Segmentation via Semantic-Enhanced Features
Authors: Ho Jun Kim*, Hyung Kyu Kim*, Sangmin Lee (UIUC), and Hak Gu Kim (*equal contribution)
Abstract: Scene text segmentation aims to accurately identify text areas within a scene while disregarding non-textual elements such as background imagery or graphical elements. However, current text segmentation models often fail to accurately segment text regions due to complex background noise or varied font styles and sizes. To address this issue, it is essential to consider not only visual information but also the semantic information of text in scene text segmentation. For this purpose, we propose a novel semantic-aware scene text segmentation framework, which incorporates multimodal large language models (MLLMs) to fuse visual, textual, and linguistic information. By leveraging semantic-enhanced features from MLLMs, the scene text segmentation model can remove false positives that are visually confusing but not recognized as text. Both qualitative and quantitative evaluations demonstrate that MLLMs improve scene text segmentation performance.
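As a sketch of what fusing MLLM-derived semantic features with visual features can look like (the layer sizes, single attention block, and residual design are assumptions, not the proposed architecture):

import torch
import torch.nn as nn

class SemanticFusion(nn.Module):
    def __init__(self, vis_dim: int = 256, sem_dim: int = 4096, heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(sem_dim, vis_dim)  # project MLLM hidden states to the visual feature width
        self.attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens: torch.Tensor, sem_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N, vis_dim) flattened image features; sem_tokens: (B, M, sem_dim) MLLM features.
        sem = self.proj(sem_tokens)
        fused, _ = self.attn(query=vis_tokens, key=sem, value=sem)
        return self.norm(vis_tokens + fused)  # residual fusion of visual and semantic information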
Title: Analyzing Visible Articulatory Movements in Speech Production for Speech-Driven 3D Facial Animation
Authors: Hyung Kyu Kim, Sangmin Lee (UIUC), and Hak Gu Kim
Abstract: Speech-driven 3D facial animation aims to generate realistic facial meshes from input speech signals. However, due to a limited understanding of visible articulatory movements, current state-of-the-art methods result in inaccurate lip and jaw movements. Moreover, traditional evaluation metrics such as lip vertex error (LVE) often fail to reflect the quality of visual results. Based on our observations, we reveal the problems of existing evaluation metrics and raise the necessity of separate evaluation approaches for the three 3D axes. A comprehensive analysis shows that most recent methods struggle to precisely predict lip and jaw movements in 3D space.
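For reference, the lip vertex error mentioned above is commonly computed as the per-frame maximum L2 error over lip vertices, averaged over frames; the axis-wise error below illustrates the kind of separate per-axis evaluation the abstract argues for (exact protocols vary between papers).

import numpy as np

def lip_vertex_error(pred: np.ndarray, gt: np.ndarray, lip_idx) -> float:
    # pred, gt: (T, V, 3) vertex sequences; lip_idx: indices of the lip vertices.
    err = np.linalg.norm(pred[:, lip_idx] - gt[:, lip_idx], axis=-1)  # (T, L) per-vertex L2 error
    return float(err.max(axis=1).mean())                              # max over lip vertices, mean over frames

def per_axis_error(pred: np.ndarray, gt: np.ndarray, lip_idx) -> np.ndarray:
    # Mean absolute error reported separately along the x, y, and z axes.
    return np.abs(pred[:, lip_idx] - gt[:, lip_idx]).mean(axis=(0, 1))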
Title: Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection
Authors: Taeheon Kim (KAIST), Sebin Shin (KAIST), Youngjoon Yu (KAIST), Hak Gu Kim, and Yong Man Ro (KAIST)
Abstract: RGBT multispectral pedestrian detection has emerged as a promising solution for safety-critical applications that require day/night operation. However, the modality bias problem remains unsolved, as multispectral pedestrian detectors learn the statistical bias in datasets. Specifically, datasets in multispectral pedestrian detection are mainly distributed between ROTO (day) and RXTO (night) data; the majority of pedestrian labels statistically co-occur with their thermal features. As a result, multispectral pedestrian detectors show poor generalization on examples beyond this statistical correlation, such as ROTX data. To address this problem, we propose a novel Causal Mode Multiplexer (CMM) framework that effectively learns the causalities between multispectral inputs and predictions. Moreover, we construct a new dataset (ROTX-MP) to evaluate modality bias in multispectral pedestrian detection. ROTX-MP mainly includes ROTX examples not present in previous datasets. Extensive experiments demonstrate that our proposed CMM framework generalizes well on existing datasets (KAIST, CVC-14, FLIR) and the new ROTX-MP. We will release the new dataset to the public for future research.
San Ah Jeong and Jung Jae Yu joined our lab as Master's students.
San Ah Jeong received a B.S. degree from Korea National University of Transportation (KNUT) in Feb. 2024.
Jung Jae Yu received a B.S. degree from Sun Moon Univ. in Feb. 2024.
Welcome to IRIS@CAU!
Title: Photometric Stereo Super Resolution via Complex Surface Structure Estimation
Authors: Han-nyoung Lee and Hak Gu Kim
Abstract: Photometric stereo, which derives per-pixel surface normals from shading cues, faces challenges in capturing high-resolution (HR) images in linear response systems. We address the representation of HR surface normals from low-resolution (LR) photometric stereo images. To represent fine details of the surface normal in the HR domain, we propose a novel plug-in high-frequency representation module named the Complex Surface Structure (CSS) estimator. When combined with a conventional photometric stereo model, CSS is capable of representing intricate surface structures in 2D Fourier space. We show that photometric stereo super-resolution (SR) with our CSS estimator provides high-fidelity surface normal representations in higher resolution from the LR inputs. Experiments demonstrate that our results are quantitatively and qualitatively better than those of the existing deep learning-based SR work.
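The general idea of representing missing detail in 2D Fourier space can be sketched as below; this toy module (not the paper's CSS estimator) predicts a complex-valued correction to the spectrum of an upsampled normal map, and the 1x1 convolution is an illustrative placeholder for a learned predictor.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HighFreqRefiner(nn.Module):
    def __init__(self, ch: int = 3):
        super().__init__()
        self.net = nn.Conv2d(2 * ch, 2 * ch, kernel_size=1)  # predicts a spectral correction

    def forward(self, normals_up: torch.Tensor) -> torch.Tensor:
        # normals_up: (B, 3, H, W) normal map bilinearly upsampled to the target resolution.
        spec = torch.fft.rfft2(normals_up, norm="ortho")
        corr = self.net(torch.cat([spec.real, spec.imag], dim=1))
        real, imag = torch.chunk(corr, 2, dim=1)
        refined = spec + torch.complex(real, imag)            # add the predicted high-frequency content
        out = torch.fft.irfft2(refined, s=normals_up.shape[-2:], norm="ortho")
        return F.normalize(out, dim=1)                        # re-normalize to unit surface normals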
Title: Super-Resolution Neural Radiance Field via Learning High Frequency Details for High-Fidelity Novel View Synthesis
Authors: Han-nyoung Lee and Hak Gu Kim
Abstract: While neural rendering approaches facilitate photorealistic rendering in novel view synthesis tasks, the challenge of high-resolution rendering persists due to the substantial costs of acquiring data and training. Recently, several studies have proposed rendering high-resolution scenes by either super-sampling points or using reference images, aiming to restore details missing in low-resolution (LR) images. However, super-sampling is computationally expensive, and reference-based methods require high-resolution (HR) images at inference time. In this paper, we propose a novel super-resolution (SR) neural radiance field (NeRF) framework for high-fidelity novel view synthesis. To represent high-fidelity HR images from the captured LR images, we learn a mapping function that maps LR rendered images into the Fourier space, restores the missing high-frequency details, and renders HR images at a higher resolution. Experiments demonstrate that our results are quantitatively and qualitatively better than those of existing SR methods in novel view synthesis. By visualizing the estimated dominant frequency components, we provide visual interpretations of the performance improvement.
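The toy function below illustrates why the Fourier space is a natural place for this: upsampling amounts to filling in the high-frequency coefficients that the LR rendering lacks. Here they are simply left at zero; a learned mapping like the one described above would predict them instead (all details are illustrative assumptions).

import torch

def fourier_upsample(img_lr: torch.Tensor, scale: int = 2) -> torch.Tensor:
    # img_lr: (B, C, H, W) low-resolution rendered image.
    B, C, H, W = img_lr.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img_lr, norm="ortho"), dim=(-2, -1))
    spec_hr = torch.zeros(B, C, H * scale, W * scale, dtype=spec.dtype, device=spec.device)
    top, left = (H * scale - H) // 2, (W * scale - W) // 2
    spec_hr[..., top:top + H, left:left + W] = spec           # missing high-frequency bands stay zero
    spec_hr = torch.fft.ifftshift(spec_hr, dim=(-2, -1))
    return torch.fft.ifft2(spec_hr, norm="ortho").real * scale  # rescale amplitude after ortho transforms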
Hyung Kyu Kim and Kyo-Seok Lee joined our lab as Master's students.
Hyung Kyu Kim is expected to graduate from the School of Computer Science & Engineering (CSE), Konkuk Univ. in Feb. 2023.
Kyo-Seok Lee is expected to graduate from the Department of Medical IT, Eulji Univ. in Feb. 2023.
Welcome to IRIS@CAU!
Ho Jun Kim and Jiwoo Hwang joined our lab as a Master's student and a research intern, respectively.
Ho Jun Kim received a B.S. degree from Dankook Univ. in Feb. 2022.
Jiwoo Hwang is double majoring in the School of Computer Arts & Computer Science and Engineering (CSE), Chung-Ang Univ.
Welcome to IRIS@CAU!
Title: Natural-Looking Adversarial Examples from Freehand Sketches
Authors: Hak Gu Kim, Davide Nanni (EPFL), and Sabine Süsstrunk (EPFL)
Abstract: Deep neural networks (DNNs) have achieved great success in image classification and recognition compared to previous methods. However, recent works have reported that DNNs are highly vulnerable to adversarial examples that are intentionally generated to mislead their predictions. Here, we present a novel freehand sketch-based natural-looking adversarial example generator that we call SketchAdv. To generate a natural-looking adversarial example from a sketch, we force the encoded edge information (i.e., the visual attributes) to be close to the latent random vector fed to the edge generator and the adversarial example generator. This preserves the spatial consistency between the adversarial example generated from the random vector and the edge information. In addition, through a sketch-edge encoder with a novel sketch-edge matching loss, we reduce the gap between edges and sketches. We evaluate the proposed method on several dominant classes of SketchyCOCO, a benchmark dataset for sketch-to-image translation. Our experiments show that SketchAdv produces visually plausible adversarial examples while remaining competitive with other adversarial attack methods.
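A minimal sketch of the two objectives described above, with assumed functional forms (MSE for the latent consistency and a cosine-based matching term); this is an illustration, not the paper's exact losses.

import torch
import torch.nn.functional as F

def latent_and_matching_losses(edge_code, z, sketch_emb, edge_emb):
    # Keep the encoded edge information close to the latent vector driving the generators.
    latent_consistency = F.mse_loss(edge_code, z)
    # Pull sketch and edge embeddings together to reduce the sketch-edge gap.
    sketch_edge_matching = 1.0 - F.cosine_similarity(sketch_emb, edge_emb, dim=-1).mean()
    return latent_consistency, sketch_edge_matching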
Han-nyoung Lee and Seon Ho Park joined our lab as a Master's student and a research intern, respectively.
Han-nyoung Lee is expected to graduate from the School of Integrative Engineering (Digital Imaging), Chung-Ang Univ. in Feb. 2022.
Seon Ho Park is double majoring in the Department of Brain & Cognitive Sciences and Statistics, Ewha Womans Univ.
Welcome to IRIS@CAU again!
Welcome to the Immersive Reality and Intelligent Systems Lab (IRIS Lab) at Chung-Ang Univ. (CAU)!
The main goal of the IRIS Lab is to develop state-of-the-art machine learning/deep learning-based intelligent systems that create the future of immersive reality (e.g., AR/VR/Metaverse), i.e., the convergence of AI and reality.
For more information, please visit our research & publications pages.