
Algorithm mines materials science literature for new discoveries

By Rich Pell



In their study, the researchers collected 3.3 million abstracts of published materials science papers and fed them into an algorithm called Word2vec. By analyzing relationships between words, the algorithm was able to predict discoveries of new thermoelectric materials years in advance and to suggest as-yet-unknown materials as thermoelectric candidates.

“Without telling it anything about materials science, it learned concepts like the periodic table and the crystal structure of metals,” says Anubhav Jain, a scientist in Berkeley Lab’s Energy Storage & Distributed Resources Division. “That hinted at the potential of the technique. But probably the most interesting thing we figured out is, you can use this algorithm to address gaps in materials research, things that people should study but haven’t studied so far.”

Their study, say the researchers, establishes that text mining of scientific literature can uncover hidden knowledge, and that pure text-based extraction can establish basic scientific knowledge. The project, they say, was motivated by the difficulty of making sense of the overwhelming number of published studies.

“In every research field there’s 100 years of past research literature, and every week dozens more studies come out,” says Berkeley Lab scientist Gerbrand Ceder, who helped lead the study. “A researcher can access only a fraction of that. We thought, can machine learning do something to make use of all this collective knowledge in an unsupervised manner – without needing guidance from human researchers?”

The researchers collected the 3.3 million abstracts from papers published in more than 1,000 journals between 1922 and 2018. Word2vec took each of the approximately 500,000 distinct words in those abstracts and turned each into a 200-dimensional vector, or an array of 200 numbers.
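As a rough illustration of what "fed into Word2vec" involves, the sketch below shows how one common Word2vec variant (skip-gram) turns raw text into training examples: each word is paired with the words that appear within a small window around it. The article does not say which variant or window size was used, so the skip-gram choice, the window of 2, the helper name `skipgram_pairs`, and the sample sentence are all assumptions made for illustration.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word and its neighbors.

    A neighbor is any other token at most `window` positions away.
    These pairs are the kind of input a skip-gram model learns from.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Made-up sentence standing in for one abstract (not taken from the study).
abstract = "the half-heusler compound shows a high thermoelectric power factor"
tokens = abstract.lower().split()

pairs = skipgram_pairs(tokens, window=2)
for center, context in pairs[:5]:
    print(center, "->", context)
```

A trained model would then adjust one vector per distinct word (200 numbers each, in the study) so that words sharing many contexts end up with similar vectors.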

“What’s important is not each number, but using the numbers to see how words are related to one another,” says Jain, who leads a group working on discovery and design of new materials for energy applications using a mix of theory, computation, and data mining. “For example you can subtract vectors using standard vector math. Other researchers have shown that if you train the algorithm on nonscientific text sources and take the vector that results from ‘king minus queen,’ you get the same result as ‘man minus woman.’ It figures out the relationship without you telling it anything.”

Similarly, say the researchers, when trained on materials science text, the algorithm was able to learn the meaning of scientific terms and concepts such as the crystal structure of metals based simply on the positions of the words in the abstracts and their co-occurrence with other words. For example, just as it solved the equation “king – queen + man,” it could figure out that for the equation “ferromagnetic – NiFe + IrMn” the answer would be “antiferromagnetic.”
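The vector arithmetic behind such an analogy can be sketched in a few lines. The three-dimensional vectors below are hand-made toy values chosen so the example works, not the 200-dimensional embeddings from the study, and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Toy embeddings, invented for illustration only.
# Dimension 0: "property word", dimension 1: "compound name",
# dimension 2: magnetic ordering (+1 ferro, -1 antiferro).
toy_vectors = {
    "ferromagnetic":     [1.0, 0.0, 1.0],
    "antiferromagnetic": [1.0, 0.0, -1.0],
    "NiFe":              [0.0, 1.0, 1.0],
    "IrMn":              [0.0, 1.0, -1.0],
    "thermoelectric":    [1.0, 0.2, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c).

    The three input words are excluded from the candidates.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


answer = analogy("ferromagnetic", "NiFe", "IrMn", toy_vectors)
print(answer)  # antiferromagnetic
```

With real embeddings the result is rarely an exact match; the nearest word by cosine similarity is taken as the answer.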

Word2vec was even able to learn the relationships between elements on the periodic table when the vector for each chemical element was projected onto two dimensions (image). To see if the algorithm could predict novel thermoelectric materials, the researchers took the top thermoelectric candidates suggested by the algorithm – which ranked each compound by the similarity of its word vector to that of the word “thermoelectric” – and then ran calculations to verify the algorithm’s predictions.
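The ranking step described above amounts to sorting compounds by how close their vectors are to the vector for “thermoelectric.” A minimal sketch follows, assuming cosine similarity as the closeness measure; the vectors, the placeholder compound names, and the function `rank_candidates` are hypothetical and do not come from the study.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_candidates(target, candidates, top_n):
    """Return the top_n (name, score) pairs, most similar to target first."""
    scored = [(name, cosine_similarity(target, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy stand-in for the embedding of the word "thermoelectric".
thermoelectric_vector = [0.9, 0.1, 0.3]

# Placeholder compounds with invented vectors.
compound_vectors = {
    "compound_A": [0.8, 0.2, 0.3],
    "compound_B": [0.1, 0.9, 0.0],
    "compound_C": [0.5, 0.5, 0.5],
}

ranking = rank_candidates(thermoelectric_vector, compound_vectors, top_n=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In the study, the top-ranked names were then passed to separate calculations for verification; that step is outside the scope of this sketch.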

Of the top 10 predictions, they found all had computed power factors slightly higher than the average of known thermoelectrics; the top three candidates had power factors above the 95th percentile of known thermoelectrics.

They then tested whether the algorithm could make predictions “in the past” by giving it only abstracts published up to a certain point in time, such as the year 2000. Of the top predictions, a significant number turned up in later studies – four times more than if materials had just been chosen at random. For example, three of the top five predictions from a model trained on data up to the year 2008 have since been discovered, and the remaining two contain rare or toxic elements.
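One way to picture this retrospective test is as a comparison of two hit rates: the share of the model's top predictions that were reported after the cutoff year, versus the share for the whole candidate pool (what random picking would achieve on average). The sketch below uses invented material names and years; none of the numbers are from the study, and `hit_rate` is a hypothetical helper.

```python
def hit_rate(materials, first_reported, cutoff_year):
    """Fraction of materials first reported as thermoelectric after cutoff_year.

    Materials missing from first_reported are treated as never reported.
    """
    if not materials:
        return 0.0
    hits = sum(
        1 for m in materials
        if m in first_reported and first_reported[m] > cutoff_year
    )
    return hits / len(materials)


cutoff_year = 2000

# Invented candidate pool and invented "first reported" years.
all_candidates = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8"]
first_reported = {"m1": 2005, "m3": 2011}

# Invented top predictions from a model trained only on pre-cutoff text.
top_predictions = ["m1", "m2", "m3", "m4"]

model_rate = hit_rate(top_predictions, first_reported, cutoff_year)
baseline_rate = hit_rate(all_candidates, first_reported, cutoff_year)
enrichment = model_rate / baseline_rate

print(f"model: {model_rate:.2f}, random baseline: {baseline_rate:.2f}, "
      f"enrichment: {enrichment:.1f}x")
```

An enrichment well above 1 suggests the ranking carries information that random selection does not.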

“I honestly didn’t expect the algorithm to be so predictive of future results,” says Jain. “I had thought maybe the algorithm could be descriptive of what people had done before but not come up with these different connections.”

“I was pretty surprised when I saw not only the predictions but also the reasoning behind the predictions, things like the half-Heusler structure, which is a really hot crystal structure for thermoelectrics these days,” he says. “This study shows that if this algorithm were in place earlier, some materials could have conceivably been discovered years in advance.”

Along with their study, the researchers are releasing the top 50 thermoelectric materials predicted by the algorithm. They say they will also release the word embeddings people need to build their own applications, for example to search for a better topological insulator material.

Next, the researchers are working on a smarter, more powerful search engine, allowing researchers to search abstracts in a more useful way. For more, see “Unsupervised word embeddings capture latent knowledge from materials science literature.”

Berkeley Lab

