Algorithm clarifies ‘big data’ clusters
Rice University scientists have developed a big data technique that could have a significant impact on health care.
The Rice lab of bioengineer Amina Qutub designed an algorithm called “progeny clustering” that is being used in a hospital study to identify which treatments should be given to children with leukemia.
Details of the work appear today in Nature’s online journal Scientific Reports.
Clustering is important for its ability to reveal information in complex sets of data like medical records. The technique is used in bioinformatics — a topic of interest to Rice scientists who work closely with fellow Texas Medical Center institutions.
“Doctors who design clinical trials need to know how to group patients so they receive the most appropriate treatment,” Qutub said. “First, they need to estimate the optimal number of clusters in their data.” The more accurate the clusters, the more personalized the treatment can be, she said.
Separating groups by a single data point, like eye color, would be easy, she said. Separating people by the types of proteins in their bloodstreams is more difficult.
“That’s the kind of data that’s become prevalent everywhere in biology, and it’s good to have,” Qutub said. “We want to know hundreds of features about a single person. The problem is identifying how to use all that data.”
The Rice algorithm provides a way to ensure the number of clusters is as accurate as possible, she said. The algorithm extracts characteristics about patients from a data set and mixes and matches them randomly to create artificial populations: the “progeny,” or descendants, of the parent data. The characteristics appear in roughly the same ratios in descendants as they do among the parents.
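The mixing-and-matching step can be illustrated with a minimal Python sketch. It assumes each characteristic is drawn independently, with replacement, from the parent records, which is one simple way to keep the ratios roughly intact. The function name `make_progeny` and the toy patient records are hypothetical and are not taken from the Rice paper or its software.

```python
import random


def make_progeny(parents, n_progeny, seed=0):
    """Build artificial records by drawing each feature independently,
    with replacement, from the parent records (an assumed sampling scheme)."""
    rng = random.Random(seed)
    features = list(parents[0].keys())
    progeny = []
    for _ in range(n_progeny):
        # Each feature of a child may come from a different parent.
        child = {f: rng.choice(parents)[f] for f in features}
        progeny.append(child)
    return progeny


# Hypothetical toy data, not from the study.
parents = [
    {"eye_color": "brown", "protein_a": "high"},
    {"eye_color": "brown", "protein_a": "low"},
    {"eye_color": "blue", "protein_a": "low"},
    {"eye_color": "brown", "protein_a": "low"},
]

progeny = make_progeny(parents, n_progeny=1000)
brown_share = sum(c["eye_color"] == "brown" for c in progeny) / len(progeny)
print(f"brown eyes: parents 0.75, progeny {brown_share:.2f}")
```

Because every value in a child is copied from some parent, a trait carried by three of four parents should turn up in roughly three of four progeny, while the combinations of traits are new.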
These characteristics, called dimensions, can be anything: as simple as hair color or place of birth, or as detailed as one’s blood cell count or the proteins expressed by tumor cells. For even a small population, each individual may have hundreds or even thousands of dimensions.
By creating progeny that share the parents’ dimensions, the Rice algorithm increases the size of the data set. With this additional data, distinct patterns become more apparent, allowing the algorithm to optimize the number of clusters that warrant attention from doctors and scientists.
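One way to picture how progeny could help pick a cluster count is the rough Python sketch below. It rests on several assumptions that the article does not spell out: the data are clustered with plain k-means, progeny are sampled within each cluster, and a candidate count is scored by how often progeny from the same cluster are regrouped together versus how often progeny from different clusters are. The helper names, the smoothing constant and the toy data are all hypothetical, and the scoring rule may differ from the one in the published method.

```python
import random


def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave the centroid of an empty cluster in place
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels


def stability(points, k, rng, n_per_cluster=10):
    """Assumed score: same-origin progeny regrouped together, divided by
    different-origin progeny grouped together (plus a small constant)."""
    labels = kmeans(points, k, rng)
    progeny, origin = [], []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue
        for _ in range(n_per_cluster):
            # Mix features from different members of the same cluster.
            child = tuple(rng.choice(members)[d]
                          for d in range(len(points[0])))
            progeny.append(child)
            origin.append(c)
    new_labels = kmeans(progeny, k, rng)
    same_hit = same_all = diff_hit = diff_all = 0
    for i in range(len(progeny)):
        for j in range(i + 1, len(progeny)):
            together = new_labels[i] == new_labels[j]
            if origin[i] == origin[j]:
                same_all += 1
                same_hit += together
            else:
                diff_all += 1
                diff_hit += together
    true_rate = same_hit / same_all
    false_rate = diff_hit / max(diff_all, 1) + 0.01  # avoid dividing by zero
    return true_rate / false_rate


# Hypothetical toy data: two well-separated groups of 20 points each.
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
points += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(20)]

scores = {k: stability(points, k, rng) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(scores, "highest score at k =", best_k)
```

A higher score suggests groupings that survive the reshuffling of features. Scores from a single run on data this small are noisy, so the output is best read as an illustration of the idea, not as a benchmark.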
Qutub and lead researcher Wendy Hu, a graduate student in her lab at Rice’s BioScience Research Collaborative, said their technique is just as reliable as state-of-the-art clustering evaluation algorithms, but at a fraction of the computational cost. In lab tests, progeny clustering compared favorably to other popular methods, they wrote, and was the only method to successfully discover clinically meaningful groupings in an acute myeloid leukemia reverse phase protein array data set.
Progeny clustering also allows researchers to determine the ideal number of clusters in small populations, Qutub said.
The algorithm was put to work in an ongoing trial involving patients with leukemia at Texas Children’s Hospital. There, Qutub said, “progeny clustering allowed them to design a robust clinical trial, even though that trial did not involve a large number of children. It meant they didn’t have to wait to enroll more.”
Technologies that gather data about patients — from sophisticated hospital equipment to simple wrist-worn health monitors — are advancing rapidly, Qutub said. That puts a premium on tools that can decipher growing mountains of data. Ten patients, for example, may be few in number but there may be hundreds or thousands of dimensions for each, she said.
“Big data is just numbers, but the numbers don’t have any value if you don’t get information from them,” Hu said. “My job is to look at these numbers and use computational tools and insights from biology to generate new information. This can help us know more about diseases and come up with therapeutic solutions and diagnostic schemes and identify new drug targets.”
“If you don’t know how to handle that data, you’re at a loss,” Qutub said. “You won’t know the best way to group people or assign them to a certain therapy or exercise regime.”
The algorithm could apply to any data set, Qutub said. “We could just as easily use it for a population of voters to see who should get campaign materials from a candidate,” she said. “Progeny clustering has a lot of possible applications.”
The lab plans to make the algorithm available free through its website, she said.
Co-authors of the paper are Steven Kornblau, a professor in the departments of Leukemia and Stem Cell Transplantation at the University of Texas MD Anderson Cancer Center, and John Slater, an assistant professor of biomedical engineering at the University of Delaware.
The National Science Foundation, the Leukemia and Lymphoma Society and the Howard Hughes Medical Institute Med-Into-Grad Fellowship supported the research.