Athena Seminar Series: Towards Synthesizing More Informative Task-driven Datasets
Abstract:
Synthetic data generation has become an increasingly powerful tool for overcoming the limitations of collecting and curating large real-world datasets for model training. Yet, fundamental questions remain about how synthetic data stores task-relevant information and how it can best be generated. In this talk, we bring together two complementary lines of work that aim to deepen our understanding of synthetic dataset construction. First, we examine dataset distillation, which compresses large datasets into a compact collection of synthetic examples while retaining important task-specific information. We discuss what distilled data actually represents, how it encodes task-relevant information about early training dynamics, and why it cannot simply substitute for real data. Second, we investigate text-to-image (T2I) models as generative engines for synthetic training data, focusing on the challenge of producing diverse, semantically aligned samples. We introduce a fine-tuning strategy, Beyond OBjects (BOB), which leverages class-agnostic attributes such as background and pose to guide model adaptation, mitigating overfitting while preserving generative diversity. Together, these perspectives offer both conceptual insights and practical advances toward building more effective, interpretable, and generalizable synthetic datasets in the era of large-scale data.
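To make the first line of work concrete, here is a minimal, purely illustrative sketch of dataset distillation via gradient matching, one well-known formulation from the literature; it is not necessarily the formulation discussed in the talk. The toy "real" data loader, the model architecture, and all hyperparameters below are placeholder assumptions.

```python
# Illustrative dataset distillation via gradient matching (one common
# formulation; not necessarily the speaker's method). All shapes, class
# counts, and hyperparameters are placeholders for a real setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
NUM_CLASSES, IMG_DIM, N_SYN_PER_CLASS = 10, 28 * 28, 1

# Learnable synthetic dataset: a handful of images with fixed labels.
syn_x = torch.randn(NUM_CLASSES * N_SYN_PER_CLASS, IMG_DIM, requires_grad=True)
syn_y = torch.arange(NUM_CLASSES).repeat_interleave(N_SYN_PER_CLASS)
opt = torch.optim.Adam([syn_x], lr=0.1)

def make_model():
    return nn.Sequential(nn.Linear(IMG_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_CLASSES))

def sample_real_batch(batch_size=64):
    # Stand-in for a real data loader; in practice this samples real images.
    return torch.randn(batch_size, IMG_DIM), torch.randint(0, NUM_CLASSES, (batch_size,))

for step in range(100):
    # Fresh random init each step: the synthetic set is tuned to reproduce
    # gradients seen early in training, not a single converged model.
    model = make_model()
    params = list(model.parameters())

    real_x, real_y = sample_real_batch()
    g_real = torch.autograd.grad(F.cross_entropy(model(real_x), real_y), params)
    g_syn = torch.autograd.grad(F.cross_entropy(model(syn_x), syn_y), params,
                                create_graph=True)

    # Push the gradient induced by the tiny synthetic set toward the
    # gradient induced by a real batch, then update the synthetic images.
    loss = sum(F.mse_loss(a, b.detach()) for a, b in zip(g_syn, g_real))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the model is re-initialized at every step, the synthetic examples are optimized to mimic gradients from randomly initialized networks, which is one concrete way to read the abstract's claim that distilled data encodes information about early training dynamics.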