Deep learning has advanced vision tasks such as classification, segmentation, and detection. However, in real-world scenarios, models often encounter domains that differ from those seen during training, which can lead to substantial performance degradation. To mitigate the effects of such distribution shifts, domain generalization (DG) aims to enable models to generalize effectively to unseen target domains. Recent DG approaches employ generative models, such as diffusion models, to augment data using text prompts. However, these methods rely on domain-specific textual inputs and require costly fine-tuning, which limits their scalability. We propose TRIDENT, a framework that overcomes these limitations by eliminating the need for text prompts and leveraging the linear structure of CLIP embeddings. TRIDENT decomposes image embeddings into three components, domain, class, and attribute, enabling precise control over semantic content. By reassembling these components, we generate semantically valid and structurally coherent synthetic samples across domains, allowing efficient and diverse data synthesis without retraining diffusion models. TRIDENT operates through lightweight embedding-space manipulation, significantly reducing computational overhead. Extensive experiments on standard DG benchmarks (e.g., PACS, VLCS, and OfficeHome) demonstrate that TRIDENT achieves competitive or superior performance compared with existing approaches. Furthermore, qualitative evaluations and comprehensive analyses confirm the validity of the proposed decomposition strategy and the diversity of the synthesized samples.
This paper proposes TRIDENT, a novel framework that addresses these issues by eliminating the dependency on text prompts and instead separating and manipulating semantic content directly within CLIP's image embedding space. Because TRIDENT works with visual information alone, it can generate data effectively without any textual input. Leveraging the linear structure of CLIP embeddings, TRIDENT decomposes each image embedding into three components: domain, class, and attribute. By incorporating attributes into the decomposition, we aim to enhance domain generalization by ensuring that generated images retain the instance-specific visual properties that underlie their natural diversity; a minimal illustrative sketch of this idea follows.
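To make the decomposition concrete, the sketch below illustrates one way such a split could work under a simple additive model. All names, the additive model itself, and the residual-based attribute estimate are assumptions introduced here for illustration, not the authors' implementation: each embedding is modeled as a global mean plus a domain offset, a class offset, and an instance-specific attribute residual, and a synthetic embedding is formed by swapping in a different domain offset.

import numpy as np

def decompose(emb, domains, classes):
    # Assumed additive model: e_i = mu + d[domain_i] + c[class_i] + a_i,
    # where a_i is the instance-specific attribute residual.
    mu = emb.mean(axis=0)
    d = {k: emb[domains == k].mean(axis=0) - mu for k in np.unique(domains)}
    c = {k: emb[classes == k].mean(axis=0) - mu for k in np.unique(classes)}
    a = emb - mu \
        - np.stack([d[k] for k in domains]) \
        - np.stack([c[k] for k in classes])
    return mu, d, c, a

def reassemble(mu, d, c, a_i, new_domain, keep_class):
    # Swap in a different domain offset while keeping the class offset
    # and the instance's attribute residual.
    e = mu + d[new_domain] + c[keep_class] + a_i
    return e / np.linalg.norm(e)  # CLIP image embeddings are unit-normalized

# Toy usage with random stand-ins for 512-d CLIP image embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 512))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
domains = rng.choice(np.array(["photo", "sketch"]), size=100)
classes = rng.choice(np.array(["dog", "cat"]), size=100)

mu, d, c, a = decompose(emb, domains, classes)
synthetic = reassemble(mu, d, c, a[0], "sketch", classes[0])

In the full method, reassembled embeddings of this kind serve as conditioning for image synthesis with a pretrained diffusion model, which, per the abstract, requires no retraining.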
Compared to existing methods, TRIDENT offers several key advantages:

- It requires no text prompts, removing the dependency on domain-specific textual inputs.
- It requires no fine-tuning or retraining of the diffusion model, which keeps the approach scalable.
- It operates through lightweight manipulation in the embedding space, significantly reducing computational overhead.
- Its decomposition into domain, class, and attribute components provides precise control over semantic content, yielding diverse yet semantically valid and structurally coherent synthetic samples.
BibTeX citation:

@ARTICLE{Choi_2025_TRIDENT,
  author={Choi, Yoonyoung and Yu, Geunhyeok and Hwang, Hyoseok},
  journal={IEEE Access},
  title={TRIDENT: Text-Free Data Augmentation Using Image Embedding Decomposition for Domain Generalization},
  year={2025},
  volume={13},
  pages={139816-139830},
  keywords={Domain generalization;data augmentation;data generation;embedding decomposition;CLIP;stable diffusion},
  doi={10.1109/ACCESS.2025.3596371}
}