Demystifying Big Data
Originally published in DukEngineer, the student-written magazine of the Pratt School of Engineering
How much information does one need to answer a question reliably? The mathematical tools used to understand the tradeoffs between data and uncertainty can be found in the fields of engineering, signal processing, information theory and statistics.
Galen Reeves, an assistant professor with joint appointments in the Department of Electrical and Computer Engineering and the Department of Statistical Science, emphasizes the significance of a technical perspective on information.
“Although the early applications of information theory were focused on signal processing and communication applications, the mathematical foundations apply much more broadly to problems in data analysis and statistical inference,” says Reeves. “So a lot of the analysis is, in a sense, a mathematical understanding of what it means to communicate.”
Digging into Data
Many of the most important scientific and technological advances of the next several decades will follow from our ability to collect, understand and communicate massive amounts of data. The sheer volume of data now available has made the ability to understand and analyze it invaluable. And according to Reeves, these vast amounts of information have created critical and complex questions.
“Data is powerful,” says Reeves. “The information in medical data could greatly improve medicine. But how do you get it out, and what happens if you make a mistake? If you have the right data but conduct the wrong analysis, you can make the wrong conclusions. And that can result in a detrimental outcome.”
Reeves describes the study of information models as sitting at the “boundary” between two directions, one mathematical and the other applied. One type of model closely resembles reality but is difficult to draw conclusions from. The other contains more mathematical abstractions and is easier to study. While it is not always clear how the latter relates to the real world, it makes it possible to draw deep, interesting conclusions.
High-dimensional statistical inference problems sometimes exhibit phase transitions, in which a small change in information leads to large changes in measures of uncertainty (e.g. probability of error or posterior variance). The study of where and why these phase transitions occur provides new ways to characterize problems and to analyze tradeoffs between information, computation and structure. Such an understanding can lead to new ways of thinking and new solutions.
For example, social networks follow this “phase transition” behavior. When observing which celebrities, politicians and other notable figures influence one another, the data is initially chaotic. Lines of influence connect a few figures but portray nothing clear about the relationships as a whole. But with a small bit of information, key players can suddenly be identified.
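To make the idea concrete, here is a minimal sketch of a phase transition in a toy network. It is not Reeves's model or data: it assumes a simple random graph in which "influence" links are drawn between randomly chosen people, and the function name `largest_component_fraction` and its parameters are illustrative. The point is only that the share of people in the largest connected group stays tiny until the average number of links per person passes roughly one, and then grows sharply.

```python
import random


def largest_component_fraction(n, avg_degree, seed=0):
    """Fraction of nodes in the largest connected group of a random graph.

    Assumption for illustration: links are placed between uniformly random
    pairs of nodes, which is a toy stand-in for a real social network.
    """
    rng = random.Random(seed)
    parent = list(range(n))  # union-find forest: each node starts alone
    size = [1] * n           # size of the group rooted at each node

    def find(x):
        # Follow parent pointers to the group's root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # n * avg_degree / 2 links give each node avg_degree links on average.
    for _ in range(int(n * avg_degree / 2)):
        a, b = find(rng.randrange(n)), find(rng.randrange(n))
        if a != b:
            if size[a] < size[b]:
                a, b = b, a
            parent[b] = a      # merge the smaller group into the larger one
            size[a] += size[b]

    return max(size[find(i)] for i in range(n)) / n


if __name__ == "__main__":
    for avg_degree in (0.5, 0.8, 1.0, 1.2, 1.5, 2.0):
        frac = largest_component_fraction(5000, avg_degree)
        print(f"average links per person {avg_degree}: "
              f"largest group covers {frac:.1%} of the network")
```

Running the loop shows the jump: a small change in the amount of link information near the threshold produces a large change in how much of the network hangs together.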
The significance of these phase transitions extends into every area that entails massive amounts of data. “How do you understand these phase transitions?” asks Reeves. “How do you use them to inform algorithms used in a Google search or in solving some other basic problems?”
Blessings and Curses
Analyzing the vast quantities of data involved in these phase transitions is not always possible with today’s computers. In his research, Reeves looked at the limits of computation: computers are still unable to accomplish tasks that “blow up exponentially in complexity.” Because there is not enough computing power in the world to check all possible states, statisticians rely on algorithms and suboptimal searches that are computationally efficient.
Reeves says this phenomenon is the “curse of dimensionality.” Coined by Richard Bellman, the phrase refers to how the number of possible states grows exponentially with the number of parameters, and how a problem with a million observed variables can “blow up” to a million unknown values.
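A back-of-the-envelope sketch shows why checking every state is hopeless. It assumes, purely for illustration, that each variable is a yes/no value and that a computer can check one billion states per second; the names `num_states` and `years_to_enumerate` are hypothetical helpers, not anything from Reeves's work.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def num_states(num_variables, values_per_variable=2):
    """Number of joint states when every variable takes a few values."""
    return values_per_variable ** num_variables


def years_to_enumerate(num_variables, checks_per_second=1e9):
    """Years needed to visit every yes/no state at an assumed checking rate."""
    return num_states(num_variables) / checks_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    for d in (10, 50, 100):
        print(f"{d} yes/no variables: {num_states(d):.3e} states, "
              f"about {years_to_enumerate(d):.3e} years to check them all")
```

Each added variable doubles the count, so even a hundred yes/no variables are far beyond exhaustive search, which is why efficient but suboptimal algorithms are used instead.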
On the other hand, there is also the “blessing of dimensionality.” As data is scaled up, the number of random interactions becomes so large that the macroscopic behavior becomes predictable and nonrandom.
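One simple way to see this effect, sketched here under the assumption that each random interaction is a fair coin flip, is to measure how much the average of many flips varies from one experiment to the next. The function name `spread_of_average` is illustrative only.

```python
import random
import statistics


def spread_of_average(n, trials=100, seed=0):
    """Standard deviation, across repeated trials, of the mean of n coin flips."""
    rng = random.Random(seed)
    averages = [
        sum(rng.random() < 0.5 for _ in range(n)) / n
        for _ in range(trials)
    ]
    return statistics.pstdev(averages)


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(f"{n} flips per trial: spread of the average = "
              f"{spread_of_average(n):.4f}")
```

Each individual flip is unpredictable, yet the spread of the average shrinks roughly in proportion to one over the square root of the number of flips, so the large-scale behavior becomes nearly deterministic.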
“The analysis can become beautiful when the data are large and complex,” says Reeves. “But much of this math cannot be brute forced via numerical simulation. Instead, one sometimes needs to go back to the pencil and paper to understand the mathematical properties of data.”
Lucy Zhang is a freshman double majoring in electrical and computer engineering and computer science.