Dimitris N. Chorafas Foundation Award - 2025 - Oguzhan Fatih Kar

© 2025 EPFL

Scaling the modalities in multimodal foundation models

EPFL thesis n°10572

Thesis director: Prof. Amir Roshan Zamir

For his outstanding contributions to creating diverse, scalable, and high-quality datasets that advance multimodal AI models and enhance their robustness and real-world applicability.

Having a single neural network handle a wide range of tasks and modalities has been a long-standing goal. Such a model offers notable advantages, including test-time computational efficiency, modality fusion, and reduced model size. Our goal in this thesis is to make progress towards building unified multimodal foundation models that can process diverse inputs such as images, text, 3D, semantics, and other sensory data to solve a wide variety of real-world tasks, including scene understanding, generation, and retrieval.
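To make the unified-model idea more concrete, the hypothetical sketch below shows one common way such models are built: each modality is mapped to discrete tokens, tagged with a modality embedding, and processed by a single shared transformer that predicts tokens for any modality. The names, sizes, and PyTorch-based design here are illustrative assumptions, not the architecture developed in the thesis.

```python
# Illustrative sketch only: a single transformer consuming several modalities
# as one token sequence. Vocabulary, sizes, and modality names are assumptions.
import torch
import torch.nn as nn

VOCAB = 1024                                   # assumed shared discrete-token vocabulary
MODALITIES = ["rgb", "depth", "semantics", "text"]
D_MODEL = 256

class UnifiedMultimodalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB, D_MODEL)
        self.modality_emb = nn.Embedding(len(MODALITIES), D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB)   # predicts tokens of any modality

    def forward(self, tokens, modality_ids):
        # tokens, modality_ids: (batch, seq_len) integer tensors
        x = self.token_emb(tokens) + self.modality_emb(modality_ids)
        return self.head(self.encoder(x))       # (batch, seq_len, VOCAB)

# Toy usage: each modality contributes a short chunk of discrete tokens.
chunks, mods = [], []
for m_id, _ in enumerate(MODALITIES):
    chunks.append(torch.randint(0, VOCAB, (2, 16)))           # 16 tokens per modality
    mods.append(torch.full((2, 16), m_id, dtype=torch.long))
tokens, modality_ids = torch.cat(chunks, dim=1), torch.cat(mods, dim=1)
logits = UnifiedMultimodalModel()(tokens, modality_ids)
print(logits.shape)  # torch.Size([2, 64, 1024])
```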

Our approach addresses three core challenges: 1) obtaining diverse and high-quality training data, 2) building a scalable training framework, and 3) evaluating and benchmarking the resulting models. It enables diverse capabilities, including strong out-of-the-box vision performance, any-conditional and steerable generation, cross-modal retrieval, and multi-sensory fusion, all in a single model. Finally, we analyze the capabilities of the resulting models both qualitatively and quantitatively on a broad range of benchmarks. Our evaluations also include a "status check" of the leading closed-weight multimodal foundation models, enabling a direct comparison with specialist models.