Computational Sprints in Concert
Game theory provides a new framework to optimize processor sprints and power consumption for large server farms
By Ken Kingery
Squeezing as many computational operations as possible out of every processing second is important for any tech firm. But for companies that rely on warehouse-sized server farms, like Google, Facebook and Amazon, the need for speed compounds exponentially.
These technological giants make enormous investments in their hardware to gain a competitive advantage. Less obvious, however, are the investments in software that can keep all of a facility’s processors operating at peak efficiency simultaneously.
The challenge becomes immediately evident in the realm of power consumption. Most everyone has heard the gentle murmur of their laptop ramp up to a mild roar when it performs a computationally intensive task—shortly followed by the computer’s fan kicking on to dissipate the heat generated by all that power coursing through the system.
In computer processing terms, this sudden increase in performance and power consumption is called a sprint. And if you’ve got a warehouse full of hundreds of thousands of processors that all decide they need to enter this mode at the same time, you’ve got an issue.
“It’s a tricky problem because you want your processors to be able to draw extra power to enhance their performance if they need to. But if they all do it at the same time, they’re going to trip the circuit breaker and draw on backup batteries,” said Benjamin Lee, associate professor of electrical and computer engineering at Duke University. “That’s called a power emergency, and it’s a pretty disruptive event for these facilities.”
There are a few ways to prevent this situation from happening. Each processor could have its power consumption capped at a relatively low level, but that would also prevent them from operating to their full potential. Or, you could allow every processor to sprint whenever it wants to and just make sure there’s enough power to handle the worst-case scenario—but that’s a tall task for any electrical system, and electricity isn’t free. A third option is to require permission to sprint from a global overseer managing the overall power consumption, but in the time it would take for such an exchange to take place, much of the benefit from the sprint would already be lost.
Large server farms often opt for a combination of the first two options—they allow processors to sprint whenever they want, but force large subsets of the system to throttle back if the power consumption levels become too high.
“What you would like to do, though, is have a framework that allows processors to figure out when they get the most benefit from drawing that extra power and then judiciously choose when to do it,” said Lee.
In a 2016 paper, Lee outlined a plan for a better option based on game theory. Three years later, companies such as Qualcomm and Arm are exploring ways to put it into practice. Because of its potential impact and broad interest, the paper is being recognized as a Research Highlight in the February 2019 edition of the Communications of the Association for Computing Machinery.
In Lee’s framework, each processor runs its own independent job, and the user who submitted the job wants its processor to gobble up as much power as it needs regardless of the others’ power needs. With this scenario in mind, Lee and his then-students Songchun Fan and Seyed Majid Zahedi came up with their framework to optimize a facility full of processors wanting to sprint periodically. The system works in two parts: it sets a threshold of performance gain at which an individual processor will choose to sprint, and it shares global information about what every processor is doing at long, fixed intervals, such as every 30 or 60 minutes.
The processors’ sprinting strategies have been optimized to account for what else might be happening in the system, but on the microsecond level, each individual processor is deciding whether or not to sprint on its own.
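To make the idea of a threshold strategy concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method from Lee's paper: the function names, the uniform random "gain" values and the fixed breaker limit are all hypothetical, and the threshold is simply passed in rather than derived from a game-theoretic analysis. Each simulated processor decides locally whether to sprint, and the sketch counts how often too many sprint at once.

```python
import random


def should_sprint(gain, threshold):
    """Local decision: sprint only if the expected gain meets the threshold."""
    return gain >= threshold


def simulate(num_processors, threshold, breaker_limit, epochs, seed=0):
    """Return (total sprints, breaker trips) over a number of epochs.

    Assumption: each processor's gain per epoch is drawn uniformly from
    [0, 1) as a stand-in for real workload behavior.
    """
    rng = random.Random(seed)
    sprints = 0
    trips = 0
    for _ in range(epochs):
        # Every processor decides on its own, with no central coordinator.
        sprinting = sum(
            should_sprint(rng.random(), threshold)
            for _ in range(num_processors)
        )
        sprints += sprinting
        # Assumption: the breaker trips when too many sprint in one epoch.
        if sprinting > breaker_limit:
            trips += 1
    return sprints, trips


if __name__ == "__main__":
    for threshold in (0.5, 0.8, 0.9):
        sprints, trips = simulate(100, threshold, 20, 1000)
        print(threshold, sprints, trips)
```

In this toy model, raising the threshold makes processors pickier about when to sprint, so the simulated breaker trips less often, at the cost of fewer sprints overall. Choosing a threshold that balances those two effects is the part the real framework optimizes.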
According to Lee, it works pretty well; the framework can help a large network complete four to six times more tasks per second than the strategies such server systems typically use.
“Very rarely do the processors oversubscribe and trip the circuit breakers,” said Lee. “But even when they do, that might have still been the best decision. The system probably really needed the performance at that time, and it was willing to trip the circuit breaker to get it.”
CITATION: "Distributed Strategies for Computational Sprints." Songchun Fan, Seyed Majid Zahedi, Benjamin C. Lee. Communications of the ACM, February 2019, Vol. 62 No. 2, Pages 98-106. DOI: 10.1145/3299885