The rise of data-driven science
Events such as the COVID-19 pandemic have starkly illustrated the need for ever-faster cycles of scientific discovery. The pandemic set off one of the greatest races in the history of scientific discovery, one that demanded unprecedented agility and speed. This need is not confined to healthcare, however: challenges such as the climate emergency, which are arguably of even greater magnitude, are placing significant pressure on the speed of materials discovery.
Fortunately, our tools for performing such discovery cycles are transforming—with data, artificial intelligence (AI) and hybrid cloud being used in new ways to break through long-standing bottlenecks. Science has seen a number of major paradigm shifts, driven by the advent and advancement of core underlying technology. The last two decades have seen the emergence of the Fourth Paradigm of big-data-driven science, dominated by an “exa-flood” of data and the associated systems and analytics to process it. Now, with the maturation of AI and robotic technology, alongside the further scaling of high-performance computing and hybrid cloud technologies, we are entering a new paradigm where the key is not any one individual technology, but instead how heterogeneous capabilities work together to achieve results greater than the sum of their parts.
Accelerating the materials discovery cycle
A typical materials discovery effort can be decomposed into a series of phases: (1) specification of a research question, (2) collection of relevant existing data, (3) formation of a hypothesis, and (4) experimentation and testing of this hypothesis, which may in turn generate new knowledge and lead to a new hypothesis. This process, while conceptually simple, has many significant bottlenecks that can hinder its successful execution.
AI, high-performance computing (HPC) and robotic automation are helping to accelerate and enrich all stages of the discovery cycle by improving the generation of, access to and reasoning over a wide variety of data, which in turn allows efforts to scale further. As shown in Fig. 2, including these technologies moves the cycle towards a community-driven, closed-loop process that addresses challenges at each step: extracting and integrating knowledge at scale, using deep generative models to propose new hypotheses automatically, and automating testing and experimentation with robotic labs.
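The closed-loop idea can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: `run_experiment` stands in for a real synthesis-and-test step, `propose_hypotheses` stands in for a generative model, and the "material" is just an integer whose hidden property peaks at 7. It is not the workflow described in this article, only the shape of the loop.

```python
def run_experiment(x):
    """Phase 4 stand-in: a hidden property that peaks at x = 7 (hypothetical)."""
    return -(x - 7) ** 2


def propose_hypotheses(knowledge, step=1):
    """Phase 3 stand-in: suggest untested neighbours of the best known candidate."""
    best = max(knowledge, key=knowledge.get)
    return [c for c in (best - step, best + step) if c not in knowledge]


def discovery_loop(seed_data, max_cycles=20):
    """Repeat hypothesise -> test -> learn until no new candidates are proposed."""
    knowledge = dict(seed_data)              # phase 2: existing data
    for _ in range(max_cycles):
        candidates = propose_hypotheses(knowledge)
        if not candidates:                   # nothing new to test, so stop
            break
        for c in candidates:
            knowledge[c] = run_experiment(c)  # results feed the next cycle
    best = max(knowledge, key=knowledge.get)
    return best, knowledge[best]


# Start from a single prior measurement at x = 2.
best, value = discovery_loop({2: run_experiment(2)})
print(best, value)
```

The point of the sketch is that each cycle's results change what is proposed next, which is what distinguishes a closed loop from a one-pass screen.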
Extracting knowledge from the literature
Historical materials science data is embedded in unstructured patents, papers, reports, and datasheets. Automated platforms are needed to ingest these documents, extract the data and ultimately present it to users for query and downstream use. The IBM DeepSearch platform provides a holistic solution to this challenge, consisting of the Corpus Conversion Service (CCS) and Corpus Processing Service (CPS). The CCS leverages state-of-the-art AI models to convert documents from PDF to a structured JSON format, while the CPS builds document-centric Knowledge Graphs (KGs) and supports rich queries and data extraction.
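To make the idea of a document-centric knowledge graph concrete, here is a minimal sketch. It does not use the DeepSearch API or its JSON schema: the field names (`doc_id`, `materials`, `properties`), the document identifiers and the material names are all invented for illustration, and the graph is a plain adjacency map.

```python
import json
from collections import defaultdict

# Hypothetical converted documents; the schema is an assumption, not CCS output.
converted = json.loads("""
[
  {"doc_id": "paper-1",  "materials": ["compound-A"],               "properties": ["Tc"]},
  {"doc_id": "patent-7", "materials": ["compound-A", "compound-B"], "properties": ["Tc"]}
]
""")


def build_graph(documents):
    """Link each document node to the material and property nodes it mentions."""
    edges = defaultdict(set)
    for doc in documents:
        doc_node = ("document", doc["doc_id"])
        for kind, key in (("material", "materials"), ("property", "properties")):
            for name in doc[key]:
                node = (kind, name)
                edges[doc_node].add(node)
                edges[node].add(doc_node)    # undirected: store both directions
    return edges


def documents_mentioning(graph, material):
    """Return the sorted ids of documents linked to the given material node."""
    neighbours = graph.get(("material", material), set())
    return sorted(name for kind, name in neighbours if kind == "document")


graph = build_graph(converted)
print(documents_mentioning(graph, "compound-A"))
```

Keeping documents as first-class nodes means every extracted fact can be traced back to its source, which matters when the same material is reported differently in different places.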
Key open challenges in the use of unstructured data in materials discovery include data access, entity resolution, and complex ad hoc queries. Data access is an issue because much of the content of interest, particularly technical papers and domain-specific databases, is not yet open access. The entity resolution problem in materials is also often complex, as the material entity may be specified in a diffuse fashion across multiple modalities in the document. Finally, as capabilities to collect and organize materials data improve, there is a natural expectation that more complex queries should be supported, progressing from existence (‘Has this material been made?’) through performance (‘What’s the highest recorded Tc?’) to hypothesis (‘Could a Heusler compound be useful in this spintronic device?’).
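The first two query levels can be sketched over a handful of structured records. The records, material names and values below are invented, and `canonical` is only a crude stand-in for entity resolution, which in real documents has to reconcile names, formulas, figures and tables. Hypothesis-level queries are out of reach of a lookup like this.

```python
# Invented records standing in for data extracted from documents.
RECORDS = [
    {"material": "compound-A", "property": "Tc", "value": 35.0, "unit": "K"},
    {"material": "Compound A", "property": "Tc", "value": 38.5, "unit": "K"},
    {"material": "compound-B", "property": "Tc", "value": 12.0, "unit": "K"},
]


def canonical(name):
    """Crude entity resolution: ignore case, spaces and punctuation."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def has_been_reported(records, material):
    """Existence query: does any record refer to this material?"""
    key = canonical(material)
    return any(canonical(r["material"]) == key for r in records)


def highest_recorded(records, prop):
    """Performance query: the record with the largest value for a property."""
    matches = [r for r in records if r["property"] == prop]
    return max(matches, key=lambda r: r["value"]) if matches else None


print(has_been_reported(RECORDS, "COMPOUND-A"))
print(highest_recorded(RECORDS, "Tc"))
```

Note that the performance query is only as good as the entity resolution beneath it: without `canonical`, the two spellings of the same compound would be treated as different materials.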
Accelerating in silico materials discovery
Simulation gives us the means to generate data on hypothetical materials that are necessarily absent from the literature. However, choosing a simulation protocol is itself a complex decision, and a poor choice can doom a discovery campaign before it has begun. Even if an accurate protocol exists, the computational expense of executing it may severely limit the size of the design space being searched.
AI has been used to address some of these issues, with the emergence of techniques like Bayesian optimization and machine-learned potentials. Bayesian optimization can dynamically improve candidate prioritization, allowing more accurate, and therefore more expensive, models to be run on a smaller number of candidates, thus improving on the traditional ‘virtual high throughput screening’ approach. Machine-learned potentials enable access to quantum-chemical-like accuracies at a fraction of the cost, while AI-assisted calibration of simulation outputs can improve their agreement with real-world data.
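A minimal sketch of Bayesian optimization over a discrete candidate set is given below, using only the Python standard library. The choices here are assumptions made for illustration rather than anything prescribed by this article: a zero-mean Gaussian process with an RBF kernel as the surrogate, expected improvement as the acquisition function, and `expensive_simulation` as a cheap toy objective standing in for a costly simulation protocol.

```python
import math


def rbf(a, b, length=1.5):
    """Squared-exponential kernel with unit variance."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L


def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def solve_upper(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = solve_upper(L, solve_lower(L, ys))
    k_star = [rbf(x_new, a) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(rbf(x_new, x_new) - sum(t * t for t in v), 1e-12)
    return mean, math.sqrt(var)


def expected_improvement(mean, std, best):
    """Expected improvement over the best value seen so far (maximization)."""
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mean - best) * cdf + std * pdf


def expensive_simulation(x):
    """Toy objective in (0, 1], peaked at x = 6 (hypothetical)."""
    return math.exp(-((x - 6) ** 2) / 8)


def bayes_opt(candidates, initial, budget):
    """Spend `budget` evaluations on the candidates with the highest EI."""
    xs = list(initial)
    ys = [expensive_simulation(x) for x in xs]
    for _ in range(budget):
        untested = [c for c in candidates if c not in xs]
        if not untested:
            break
        best = max(ys)
        nxt = max(untested,
                  key=lambda c: expected_improvement(*gp_posterior(xs, ys, c), best))
        xs.append(nxt)
        ys.append(expensive_simulation(nxt))
    i = ys.index(max(ys))
    return xs[i], ys[i], len(xs)


best_x, best_y, n_evals = bayes_opt(list(range(11)), initial=[0, 10], budget=6)
print(best_x, best_y, n_evals)
```

The sketch refits the surrogate from scratch for every candidate, which is wasteful but keeps the code short. In practice one would factorize once per iteration and use an established library, but the prioritization logic is the same: only candidates the surrogate considers promising or uncertain are sent to the expensive model.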
Accelerating in vitro materials discovery
To counter the intractably vast materials design space, we adopt an AI-driven generative modeling approach to collaborate with human experts and augment their creativity. Deep generative models can automatically generate new candidate chemicals, molecules and materials, expanding both the discovery space and the creativity of scientists. Our experience is that generative models can accelerate early materials ideation processes by 100x.
These generative models work under review and control by human experts, who tune and reinforce the models with domain knowledge. Looking forward, these models will need to evolve in their coverage of materials classes, extend beyond materials composition to processing and form, and effectively capture and encode application constraints based on human knowledge.
Autonomous synthesis and characterization
At the end of the design cycle, we face the need to accelerate the synthesis and testing of the large number of materials hypotheses that have been generated. Recent advances in AI-enabled digitization of common tasks in chemical synthesis, including forward reaction prediction, retrosynthetic analysis, and inference of experimental protocols, are being combined with a rapid expansion of automation in the synthesis laboratory.
One of the most recent efforts is RoboRXN, which integrates cloud, AI, and commercial automation to assist chemists all the way from the selection of synthetic routes to the actual synthesis of the molecule. Future challenges in this space include the generation and integration of in silico chemical data, the further integration of analytical chemistry and application-specific testing, and the expansion and adaptation of these technologies to other materials classes.
A prototype for accelerated materials discovery
The work described above illustrates a prototype for the future of accelerated materials discovery. This irregular and heterogeneous discovery workflow was enabled by the OpenShift hybrid cloud computing framework, which allows a single researcher to orchestrate the available resources across multiple data centers to execute the necessary steps.
In this prototype, a series of sophisticated applications, algorithms, and computational systems are seamlessly orchestrated to accelerate cycles of learning and support human scientists in their quest for knowledge. We have seen tangible examples of this acceleration across all stages of the discovery process, and we strongly believe that the commoditization and democratization of such diverse workflows will fundamentally alter the way we respond to emerging discovery challenges.