Deep learning has advanced vision tasks such as classification, segmentation, and detection. However, in real-world scenarios, models often encounter domains that differ from those seen during training, which can lead to substantial performance degradation. To mitigate the effects of such distribution shifts, domain generalization (DG) aims to enable models to generalize effectively to unseen target domains. Recent DG approaches employ generative models, such as diffusion models, to augment data using text prompts. However, these methods rely on domain-specific textual inputs and require costly fine-tuning, which limits their scalability. We propose TRIDENT, a framework that overcomes these limitations by eliminating the need for text prompts and leveraging the linear structure of CLIP embeddings. TRIDENT decomposes image embeddings into three components, domain, class, and attribute, enabling precise control over semantic content. By reassembling these components, we generate semantically valid and structurally coherent synthetic samples across domains, allowing efficient and diverse data synthesis without retraining diffusion models. TRIDENT operates through lightweight embedding-space manipulation, significantly reducing computational overhead. Extensive experiments on standard DG benchmarks (e.g., PACS, VLCS, and OfficeHome) demonstrate that TRIDENT achieves competitive or superior performance compared with existing approaches. Furthermore, qualitative evaluations and comprehensive analyses confirm the validity of the proposed decomposition strategy and the diversity of the synthesized samples.
This paper proposes TRIDENT, a novel framework that addresses these issues by eliminating the dependency on text prompts and instead separating and manipulating semantic content directly within CLIP's image embedding space. Because TRIDENT works with visual information alone, it can generate data effectively without any textual input. Leveraging the linear structure of CLIP embeddings, TRIDENT decomposes each image embedding into three components: domain, class, and attribute. By incorporating attributes into the decomposition, we aim to enhance domain generalization by ensuring that generated images retain the instance-specific visual properties that underlie their natural diversity; a minimal illustrative sketch of this idea follows.
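To make the decomposition concrete, the sketch below illustrates one way such a split could work under a simple additive model. All names, the additive model itself, and the residual-based attribute estimate are assumptions introduced here for illustration, not the authors' implementation: each embedding is modeled as a global mean plus a domain offset, a class offset, and an instance-specific attribute residual, and a synthetic embedding is formed by swapping in a different domain offset.

import numpy as np

def decompose(emb, domains, classes):
    # Assumed additive model: e_i = mu + d[domain_i] + c[class_i] + a_i,
    # where a_i is the instance-specific attribute residual.
    mu = emb.mean(axis=0)
    d = {k: emb[domains == k].mean(axis=0) - mu for k in np.unique(domains)}
    c = {k: emb[classes == k].mean(axis=0) - mu for k in np.unique(classes)}
    a = emb - mu \
        - np.stack([d[k] for k in domains]) \
        - np.stack([c[k] for k in classes])
    return mu, d, c, a

def reassemble(mu, d, c, a_i, new_domain, keep_class):
    # Swap in a different domain offset while keeping the class offset
    # and the instance's attribute residual.
    e = mu + d[new_domain] + c[keep_class] + a_i
    return e / np.linalg.norm(e)  # CLIP image embeddings are unit-normalized

# Toy usage with random stand-ins for 512-d CLIP image embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 512))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
domains = rng.choice(np.array(["photo", "sketch"]), size=100)
classes = rng.choice(np.array(["dog", "cat"]), size=100)

mu, d, c, a = decompose(emb, domains, classes)
synthetic = reassemble(mu, d, c, a[0], "sketch", classes[0])

In the full method, reassembled embeddings of this kind serve as conditioning for image synthesis with a pretrained diffusion model, which, per the abstract, requires no retraining.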
Compared to existing methods, TRIDENT offers several key advantages:

- It requires no text prompts, removing the dependency on domain-specific textual inputs.
- It requires no fine-tuning or retraining of the diffusion model, which keeps the approach scalable.
- It operates through lightweight manipulation in the embedding space, significantly reducing computational overhead.
- Its decomposition into domain, class, and attribute components provides precise control over semantic content, yielding diverse yet semantically valid and structurally coherent synthetic samples.
BibTeX citation:

@ARTICLE{Choi_2025_TRIDENT,
  author={Choi, Yoonyoung and Yu, Geunhyeok and Hwang, Hyoseok},
  journal={IEEE Access},
  title={TRIDENT: Text-Free Data Augmentation Using Image Embedding Decomposition for Domain Generalization},
  year={2025},
  volume={13},
  pages={139816-139830},
  keywords={Domain generalization;data augmentation;data generation;embedding decomposition;CLIP;stable diffusion},
  doi={10.1109/ACCESS.2025.3596371}
}