March 2005, Issue 4

Creating Customized Computer Search Engines for Research Data

Carla E. Brodley, PhD, joined the Department of Computer Science in September 2004. Her main areas of interest are computer security and data mining. Data mining is the creation and use of computer algorithms to explore and organize data. Brodley enjoys the challenges of applying both areas of her work to real-world problems. She collaborates with researchers in various fields to plan data collection and analysis, and then develops computer programs to deal with the data. In a way, it's like having a customized search engine created just for your research.

After receiving her PhD in computer science from the University of Massachusetts Amherst in 1994, Brodley began a ten-year career on the faculty of the School of Electrical and Computer Engineering at Purdue University. While at Purdue, Brodley collaborated with radiologists and computer vision researchers to create methods for automatically generating procedures for content-based retrieval of medical images. She also worked with earth scientists on the task of mapping global land cover from satellite images.

Brodley uses both supervised and unsupervised computer learning in data mining programs. Supervised learning occurs when you provide the computer with labeled examples of the categories into which you want your data separated. The larger and more accurate the set of examples, the better the quality of the data separations. The computer software learns generalizations from the labeled examples and then applies these generalizations as a step-by-step procedure (an algorithm) to classify new unlabeled data. “For example, in earth science you have satellite images, and you've come up with a way to represent them on a grid such that each grid represents some variable,” Brodley explains. “And what you'd like to do is decide what exactly is on that part of the earth's surface. Is it tundra? Coniferous forest? Deciduous forest? And so we would try to have a computer program learn how to do this classification automatically rather than have a human being do the labeling.”
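
To make the idea of learning from labeled examples concrete, here is a minimal sketch in Python. It is not Brodley's method: it uses a nearest-centroid classifier, one of the simplest supervised techniques, and the two features (greenness and surface temperature), their values, and the function names are all invented for illustration.

```python
from collections import defaultdict
from math import dist


def train_centroids(examples):
    """Average the feature vectors of each label: {label: centroid}."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify(centroids, features):
    """Assign the label whose centroid is closest to the new example."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical labeled grid cells: (greenness, surface temperature) -> land cover.
labeled = [
    ((0.10, 0.15), "tundra"),
    ((0.15, 0.10), "tundra"),
    ((0.70, 0.40), "coniferous forest"),
    ((0.75, 0.45), "coniferous forest"),
    ((0.80, 0.80), "deciduous forest"),
    ((0.85, 0.75), "deciduous forest"),
]

centroids = train_centroids(labeled)
prediction = classify(centroids, (0.72, 0.42))  # a new, unlabeled grid cell
print(prediction)
```

The generalization learned here is just one average point per category. Real land-cover classifiers use many more features and far more examples, but the two steps are the same: learn from labeled data, then label new data.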

Unsupervised learning occurs when you ask the computer to look for trends in data and then cluster data that have similar, specified characteristics. The software will also show outliers or anomalies in the data. “Unsupervised learning is when you don't have any labels but want to separate your data into homogeneous groups for better understanding,” Brodley explains. “For example, you might want to look at eating trends within a population or customer patterns in electrical usage.”
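
Grouping unlabeled data can be sketched with k-means clustering, a standard technique chosen here only as an example; the article does not say which clustering methods Brodley uses. The customer figures (daytime and nighttime electricity use) are hypothetical, and starting from the first k points is a simplification that keeps the sketch deterministic.

```python
from math import dist


def k_means(points, k, iterations=20):
    """Group points into k clusters; returns (assignments, centroids)."""
    # Simplifying assumption: start from the first k points.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Step 1: attach every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: dist(centroids[c], p))
            for p in points
        ]
        # Step 2: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical customers: (daytime kWh, nighttime kWh). No labels are given.
usage = [(1.0, 5.0), (1.2, 4.8), (6.0, 1.0), (5.8, 1.2), (0.9, 5.2), (6.1, 0.9)]

assignments, centroids = k_means(usage, k=2)
print(assignments)
```

No one told the program that there are "night users" and "day users"; the two groups emerge from the data, and it is left to the researcher to interpret what each group means.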

Brodley works with independent and time-dependent data points, and both types of data can be analyzed with supervised or unsupervised computer learning programs. For example, blood pressure data points from one patient are usually independent of the data points from another patient, but a single patient’s data points are often related to, and dependent on, events that occurred earlier. Computers can learn to predict future data points from past data points. Computers can also be programmed to look for inconsistencies with past data. “An application of this is in computer security,” says Brodley. “When you send data out from your computer, it goes over the network in something called packets, and you can look and see if that behavior is normal, or whether someone has taken over your machine and is using it to spam other machines. Another example of computer learning is trying to automatically detect which emails are spam.”
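
Checking new data against past data can be sketched as follows. This is an illustrative assumption, not the approach used in Brodley's security work: each new measurement is compared with the mean and standard deviation of a short trailing window, and flagged if it falls too far outside. The packet counts, window size, and threshold are invented.

```python
from statistics import mean, stdev


def find_anomalies(series, window=5, threshold=3.0):
    """Return the indexes of values that deviate sharply from recent history."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        # A perfectly flat history (sigma == 0) is skipped in this sketch.
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies


# Hypothetical outgoing packets per minute from one machine.
packets_per_minute = [40, 42, 38, 41, 39, 40, 43, 400, 41, 39, 42, 40]

print(find_anomalies(packets_per_minute))
```

One limitation of so simple a rule is that a large spike inflates the statistics of the windows that follow it, so a second unusual value arriving soon afterward could go unnoticed.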

For each new application, the project's lead scientist consults with Brodley, and together they develop a list of features that are of interest and relevant to determining each label or group. Brodley develops new algorithms for each new project, and admits that some projects can be exceedingly difficult because of the excess of unimportant data (measurement ‘noise’) that pervades real-world experiments. Data-reduction algorithms can be used to clean up the data and get rid of some of this noise before a data-mining algorithm is used.
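
One very simple form of data reduction, shown here purely as an illustration, is to drop features that barely change from one example to the next, since they offer little to help separate the data. The article does not describe the data-reduction algorithms Brodley uses; the function name, the variance threshold, and the sample values below are assumptions.

```python
from statistics import pvariance


def drop_low_variance_features(rows, min_variance=0.01):
    """Keep only the columns whose variance reaches min_variance."""
    columns = list(zip(*rows))
    kept = [i for i, col in enumerate(columns) if pvariance(col) >= min_variance]
    reduced = [[row[i] for i in kept] for row in rows]
    return reduced, kept


# Hypothetical measurements: column 1 is constant, column 2 only jitters.
rows = [
    [0.10, 5.0, 1.00],
    [0.80, 5.0, 1.01],
    [0.45, 5.0, 0.99],
    [0.90, 5.0, 1.00],
]

reduced, kept = drop_low_variance_features(rows)
print(kept)
```

A threshold like this has to be chosen with the lead scientist, because a feature that looks nearly constant may still matter in a particular domain.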

Brodley is enthusiastic about her work. “I like solving a problem in a domain that I'm not an expert in, because then I get to learn about it. I got to learn about remote sensing and about diagnosis based on high-resolution CT images of the lung. I recently learned about dairy cows because we're using a wireless sensor to track their behavior and health. I'm doing this with researchers from Purdue, but I hope to talk to researchers working at the veterinary school here about looking at animal behaviors.”

Brodley sees many possibilities for interdisciplinary collaboration at Tufts. She is discussing with Professor Gregory Crane of the Classics Department potential applications of computer learning programs in the Perseus Project, which is an electronic database on Archaic and Classical Greece. One such application might include a supervised learning program that uses optical character recognition to verify or correct illegible Greek symbols in ancient texts.

“I'm interested in doing data mining with researchers from all of the Tufts schools. That's when I really feel good – when we do something that improves other areas of science significantly,” Brodley concludes.

For more information, go to http://www.cs.tufts.edu/~brodley/.

Tufts University, Office of the Vice Provost
Health Sciences Campus: (617) 636-6550
Medford Campus: (617) 627-3417
Copyright 2005 Tufts University. All Rights Reserved.
