The last few years have seen significant advances in machine learning for vision and natural language processing (NLP), driven by training general models on vast and diverse datasets and unlocking a jump in performance and capabilities. This approach has led to the emergence of “foundation models” that exhibit a remarkable capacity to leverage information gathered from a variety of sources when attempting to solve unseen tasks. For example, “large language models” have triggered a renaissance in NLP: it is now standard practice to fine-tune or prompt a generalist model, rather than train a specialist model from scratch. However, a similar paradigm shift has yet to occur for applications of machine learning to scientific datasets, an unrealized opportunity that our “Polymathic AI” research initiative seeks to address.
Introducing the Polymathic AI initiative: our goal is to accelerate the development of versatile foundation models tailored for numerical datasets and scientific machine learning tasks. The challenge we are undertaking is to build AI models that leverage information from heterogeneous datasets across different scientific fields, which, unlike domains such as natural language processing, do not share a unifying representation (i.e., text). Such models can then serve as strong baselines or be further fine-tuned by scientists for specific applications. This approach has the potential to democratize AI in science by providing off-the-shelf models with stronger priors (i.e., background knowledge) for shared general concepts such as causality, measurement, and signal processing, as well as more specialized shared concepts like wave-like behavior, which would otherwise need to be learned from scratch.
To reach this goal, we are bringing together a team of machine learning researchers and domain scientists covering a wide variety of disciplines. In addition, we are guided by a scientific advisory group of world-leading experts.
Much preliminary research is required to build a true foundation model for science. We are concentrating our efforts on the fundamentals of this space, and have thus far published research on key architectural components, from adapting language models for numerical data[1] to demonstrating the transferability of surrogate models trained on diverse physical systems[2] to learning shared embeddings for multi-modal scientific data[3]. We encourage you to find out more at this link.
We are excited about the potential of this research direction to redefine the landscape of scientific machine learning, and Polymathic AI represents an ambitious step towards realizing that goal.