TRIDENT: Text-Free Data Augmentation Using Image Embedding Decomposition for Domain Generalization

Kyung Hee University
IEEE Access

Figure 1. Brief overview of the TRIDENT framework.


Figure 2. Augmented data expands source data diversity across domains via TRIDENT.

Abstract

Deep learning has advanced vision tasks such as classification, segmentation, and detection. However, in real-world scenarios, models often encounter domains that differ from the ones seen during training, which can lead to substantial performance degradation. To mitigate the effects of distribution shifts, domain generalization (DG) aims to enable models to generalize effectively to unseen target domains. Recent DG approaches use generative models like diffusion models to augment data with text prompts. However, these methods rely on domain-specific textual inputs and require costly fine-tuning, which limits their scalability. We propose TRIDENT, a framework that overcomes these limitations by eliminating the need for text prompts and leveraging the linear structure of CLIP embeddings. TRIDENT decomposes image embeddings into three components—domain, class, and attribute—enabling precise control over semantic content. By reassembling each embedding component, we generate semantically valid and structurally coherent synthetic samples across domains. This allows efficient and diverse data synthesis without retraining diffusion models. TRIDENT operates through lightweight embedding-space manipulation, significantly reducing computational overhead. Extensive experiments on standard DG benchmarks (e.g., PACS, VLCS, and OfficeHome) demonstrate that TRIDENT achieves competitive or superior performance compared with existing approaches. Furthermore, qualitative evaluations and comprehensive analyses confirm that TRIDENT not only enables efficient and diverse data synthesis but also demonstrates the effectiveness of the proposed decomposition strategy.

Main Figure


Figure 3. Overview of the TRIDENT framework: (1) Extract mean embeddings, which serve as prototypes for decomposing the meaning of image embeddings, (2) Train TRIDENT, which is designed to decompose CLIP image embeddings, and (3) Generate data via unCLIP.
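
As a rough illustration of step (1), the sketch below averages precomputed CLIP image features per domain and per class to obtain prototype embeddings. The tensor names and label conventions here are our own assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def mean_prototypes(feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Average L2-normalized CLIP image embeddings per label value.

    feats:  (N, D) image embeddings from a frozen CLIP image encoder
    labels: (N,)  integer domain ids or class ids
    returns (K, D) with one unit-norm prototype per label id
    """
    feats = F.normalize(feats, dim=-1)
    num_labels = int(labels.max()) + 1
    protos = torch.stack([feats[labels == k].mean(dim=0) for k in range(num_labels)])
    return F.normalize(protos, dim=-1)

# Hypothetical usage with source-domain data:
# domain_protos = mean_prototypes(feats, domain_ids)
# class_protos  = mean_prototypes(feats, class_ids)
```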

Contributions

This paper proposes a novel framework, TRIDENT, which addresses the limitations noted above by eliminating dependency on text prompts while enabling simple separation and manipulation of semantics within the image embedding space. TRIDENT uses the visual information carried by CLIP image embeddings to generate data effectively without relying on text. Leveraging the linear structure of CLIP's embedding space, TRIDENT decomposes image embeddings into three components: domain, class, and attribute. By incorporating attributes into the decomposition, we ensure that generated images retain the essential visual properties underlying their natural diversity, thereby enhancing domain generalization.
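
The decomposer itself is a learned model, but the linearity assumption can be pictured with a toy projection: anchor the domain and class parts of an embedding on their prototypes and keep the residual as the attribute. The sketch below is that toy picture under our own assumptions (unit-norm prototypes, a single embedding), not the paper's trained network.

```python
import torch
import torch.nn.functional as F

def decompose(e: torch.Tensor, domain_proto: torch.Tensor, class_proto: torch.Tensor):
    """Toy linear split of one CLIP image embedding: e ~ d + c + a."""
    d = (e @ domain_proto) * domain_proto  # domain component
    c = (e @ class_proto) * class_proto    # class component
    a = e - d - c                          # attribute residual
    return d, c, a

def reassemble(d, c, a, new_domain_proto):
    """Swap in another domain's prototype; keep class and attribute."""
    d_new = d.norm() * new_domain_proto    # transfer the domain magnitude
    return F.normalize(d_new + c + a, dim=-1)
```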

Compared to existing methods, TRIDENT offers several key advantages:

  • Elimination of Text Dependency: TRIDENT fundamentally resolves issues associated with textual ambiguity, constraints of fixed text prompts, and inefficiency by removing reliance on text prompts.
  • Decomposition into Domain, Class, and Attribute Components: Disentangling image embeddings into these three components enables targeted preservation or modification of specific information.
  • Practicality and Efficiency: TRIDENT achieves practical and efficient data generation by leveraging pretrained diffusion models, such as Stable unCLIP, without requiring costly model fine-tuning (see the generation sketch after this list).
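
Because reassembled embeddings live in CLIP's image-embedding space, they can be decoded with an off-the-shelf unCLIP model. Below is a minimal sketch assuming diffusers' StableUnCLIPImg2ImgPipeline and its image_embeds argument, which bypasses the pipeline's own image encoder; the placeholder embedding and 1024-dim ViT-H size are assumptions made to keep the snippet self-contained.

```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline

# Public Stable unCLIP checkpoint; no fine-tuning involved.
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to("cuda")

# Placeholder for a reassembled embedding from the decomposition step above
# (the real input would come from reassemble(); shape (1, 1024) for ViT-H CLIP).
recombined_embed = torch.randn(1, 1024, dtype=torch.float16, device="cuda")

# image_embeds conditions the diffusion model directly on the CLIP embedding.
image = pipe(image_embeds=recombined_embed, noise_level=0).images[0]
image.save("trident_sample.png")
```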

Experiments

Results

BibTeX

@ARTICLE{Choi_2025_TRIDENT,
    author={Choi, Yoonyoung and Yu, Geunhyeok and Hwang, Hyoseok},
    journal={IEEE Access}, 
    title={TRIDENT: Text-Free Data Augmentation Using Image Embedding Decomposition for Domain Generalization}, 
    year={2025},
    volume={13},
    number={},
    pages={139816-139830},
    keywords={Semantics;Training;Diffusion models;Visualization;Data models;Computational modeling;Data collection;Overfitting;Deep learning;Data augmentation;Domain generalization;data augmentation;data generation;embedding decomposition;CLIP;stable diffusion},
    doi={10.1109/ACCESS.2025.3596371}
}