Negative 40 degrees… So cold, they have to heat the air used to cool the data center.
The IceCube Lab data center has over 1,200 computing cores and three petabytes of storage, and it is tethered to the IceCube Observatory, a neutrino detector with strings of optical sensors buried a kilometer deep in the Antarctic ice.
IceCube observes bursts of neutrinos from cataclysmic astronomical events, which helps to study both “dark matter” and the physics of neutrinos.
Running this kind of IT operation at one of the most hostile and remote locations in the world creates a whole set of challenges few in IT have ever experienced.
Laying Cables – Photograph by National Science Foundation/F. Descamps
A trip to the Amundsen-Scott South Pole Station is as close to visiting another planet as you can get on Earth, with “a palette of whites, blues, greys, and blacks,” Gitt says. In summer, it feels like two o’clock in the afternoon 24 hours a day, which “can do interesting things to your diurnal cycle,” says Barnet. In winter, the outside world is lit only by moonlight and the Aurora Australis.
With a maximum population of 150 at the base during the Austral summer, South Pole IT professionals-in-residence are limited to a select few. And they don’t get to stay long—most of the WIPAC IT team only stays for a few months in the summer, during which they have to complete all planned IT infrastructure projects.
The rest of the year, WIPAC’s US-based employees remotely walk the wintering members of the IceCube team—usually chosen for their physical science credentials, not for their IT skills—through tasks over satellite calls. Systems are monitored over a multiplexed Iridium satellite connection. “You just try to collect as much information as you can and do what you can remotely,” says Ralf Auer, WIPAC’s South Pole systems administrator.
The wintering-over team “can do a lot of the physical maintenance,” says Barnet, but routine IT tasks can sometimes feel foreign to physical scientists. The result is that Auer, Barnet, and the others back home have to visualize what the over-winter team is seeing in order to walk them through fixes. (Barnet compares it to being an air traffic controller talking down an airliner flown by a flight attendant.) So part of the team’s job is to make these hands-on tasks as simple as possible for the scientists and to handle as much as possible remotely over whatever bandwidth they can get.
South Pole Station does have other IT support, but it’s not on WIPAC’s payroll. The general IT and communications support team for Amundsen-Scott peaks at five to eight people during the summer and shrinks to four or five during the winter, according to Gitt. More stay at McMurdo Station, the main logistics support base in Antarctica: 35 to 40 at the peak of the summer research season, “depending on the projects,” he says. Smaller stations run by NSF may have only two or three IT and communications people total. (Lockheed Martin just took over the NSF contract, but the award is currently under protest.)
The most reliable form of communication available is the Iridium satellite network. Individual Iridium connections aren’t exactly blazing: they support a data rate of only about 2,400 bits per second. But according to Gitt, Raytheon did a lot to coax as much bandwidth as possible out of Iridium, including multiplexing Iridium connections, compressing e-mails to shrink them, and even adding a wireless server with file-sharing and e-mail services to containerized Iridium ground stations to support some of the smaller field stations.
“We even dropped the package size down further in size and put it on some of the traverse tractors,” Gitt says—providing a data lifeline to expeditions crossing the frozen continent.
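To get a feel for what multiplexing and compression buy on a 2,400-bit-per-second link, here is a minimal back-of-the-envelope sketch. Only the per-connection rate comes from the figures above; the channel counts, message size, and compression ratio are hypothetical, and the model ignores protocol overhead and dropped calls.

```python
IRIDIUM_BPS = 2_400  # per-connection data rate quoted above


def aggregate_bps(channels: int, per_channel_bps: int = IRIDIUM_BPS) -> int:
    """Ideal combined rate of several multiplexed connections (no overhead)."""
    if channels < 1:
        raise ValueError("need at least one channel")
    return channels * per_channel_bps


def transfer_seconds(size_bytes: int, channels: int,
                     compression_ratio: float = 1.0) -> float:
    """Seconds to send a message after compression over `channels` links.

    compression_ratio is original size / compressed size (hypothetical value).
    """
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    compressed_bits = size_bytes * 8 / compression_ratio
    return compressed_bits / aggregate_bps(channels)


if __name__ == "__main__":
    email = 100_000  # a hypothetical 100 KB e-mail
    print(f"1 channel, uncompressed: {transfer_seconds(email, 1):.0f} s")
    print(f"12 channels, 2:1 compression: {transfer_seconds(email, 12, 2.0):.0f} s")
```

Under these assumptions, a 100 KB message that would tie up a single connection for more than five minutes goes through in about 14 seconds over twelve multiplexed connections with 2:1 compression.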
The higher-bandwidth access is limited to about 10 hours a day of broadband coverage from NASA’s Tracking and Data Relay Satellite System (TDRSS) and the GOES-3 satellite, a weather satellite launched in 1978 that lost its weather imaging capabilities and now provides 1-megabit-per-second data transmission for eight hours a day. TDRSS provides the most bandwidth, with transmission speeds of up to 150 megabits per second.
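Those numbers can be put side by side with the roughly one terabyte a day that the detector produces. The sketch below is rough arithmetic only: it treats the quoted rates as sustained peak throughput with no overhead, uses decimal units (1 TB = 10^12 bytes), and its function names are illustrative.

```python
def daily_capacity_gb(rate_mbps: float, hours_per_day: float) -> float:
    """Upper bound on data moved per day, in gigabytes (10**9 bytes)."""
    bits = rate_mbps * 1_000_000 * hours_per_day * 3600
    return bits / 8 / 1e9


def hours_to_send(size_tb: float, rate_mbps: float) -> float:
    """Hours needed to move size_tb terabytes (10**12 bytes) at a sustained rate."""
    bits = size_tb * 1e12 * 8
    return bits / (rate_mbps * 1_000_000) / 3600


if __name__ == "__main__":
    # GOES-3: 1 Mbit/s for eight hours a day
    print(f"GOES-3 daily ceiling: {daily_capacity_gb(1, 8):.1f} GB")
    # TDRSS at its quoted 150 Mbit/s peak, one day of detector output
    print(f"1 TB at 150 Mbit/s: {hours_to_send(1.0, 150):.1f} hours")
```

Even at the TDRSS peak rate, one day of raw output would need close to 15 hours of transmission, more than the roughly 10 hours of broadband coverage available, while GOES-3 tops out at about 3.6 GB a day.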
Getting People There
Getting people and gear to the Pole in the first place is a Herculean effort. “Since the station is closed and completely inaccessible from the beginning of March until October,” Barnet says, “any work you’re going to do (on the infrastructure) has to happen from November to January. So it poses a significant logistical challenge.” And because everything is dependent on weather, “You can draw up the most beautiful plan, and then you may end up spending time you planned to do IT work doing dishes or lugging food around instead because your cargo’s not in.”
The Amundsen-Scott South Pole Station is at the end of a 9,000-mile logistics chain. Anything or anyone bound for the Pole has to get from the US to Christchurch, New Zealand, before being loaded onto an Air Force C-17 Globemaster bound for the ice runway at McMurdo Station. Then ski-equipped LC-130 transport planes from the New York State Air National Guard handle the final 800 miles or so of the trip—with a little rocket-assisted takeoff help.
“You get a very strong appreciation for the sound dampening in commercial aircraft,” says Barnet, describing the trip. “But the food is better these days on the National Guard flights than it is on most airlines.”
That long supply chain poses a support challenge for the IceCube data center team. Getting vendor support isn’t nearly as simple as it is back home. “We’re not exactly in the established vendor support dialogue path,” Barnet notes, “where if we say we’ve got a failed hard drive, they’ll say, ‘We’ll overnight you another one.’ We have to keep sufficient spares in order to be able to operate through winter.”
And 30-day warranty replacement doesn’t exactly work when you’re at 90 degrees south latitude. “If it’s the right 30 days, we’re fine,” he says. “But that almost never happens.”
You might not think that cooling would be a problem in one of the coldest places on Earth. But ironically, it’s a big concern. “You had to be careful about when you could do maintenance,” Gitt says. “At the South Pole Station, there are times we couldn’t crack open equipment bays because the equipment would start to crack from the cold.” At other times, electronic equipment would actually overheat because it had been designed for cold weather.
“150 machines can produce a lot of heat,” adds Auer. “You can’t just open a door because the temperature would drop too quickly, and we’d lose hardware because hard drives would die.”
Then there’s the matter of figuring out what temperature, exactly, the equipment should be cooled to. Barnet says it’s difficult to get “reasonable engineering estimates” for the equipment from vendors because of the altitude. “You do start worrying about things like the machine sizes,” he says. IceCube uses 2U servers, but when the team considered smaller servers, they worried whether the smaller machines could move enough of the thinner air through them for proper cooling.
Keeping the server room at 65 degrees Fahrenheit when the outside temperature can be -40 F, or -100 F in winter, meant investing some thought into how to get cooling air into the servers. The IceCube team runs an HVAC system, without air conditioning, to handle the cooling, using vents to bring in outside air in a controlled fashion. But the environment doesn’t necessarily allow itself to be controlled all the time: Barnet says that the vents for the HVAC system often freeze in position.
Working at Extremes
There are “other fun and games” driven by the extremes of the climate. One of them is that there’s nearly zero humidity at the Pole. When working in the server space, Barnet says, “We have to be very careful, wear anti-static jackets everywhere we go and make sure we’re always properly grounded.”
The lack of humidity played havoc for a while with tape storage. For a number of reasons, all of the data that comes out of the IceCube Observatory, about a terabyte a day, is written to tape. “That’s a lot of tape,” Barnet says. “We found that tape gets really, really cranky down there, and we’ve wrestled with this from day one.” After trying a “wide variety” of fixes, the team determined that the problem was related to low humidity. The tape drives might develop internal static; humidifying the tapes made them perform better.
At one point, the team put the tapes in a greenhouse at the station before use; they also considered running a humidifier inside the server room, but concerns about condensation drew protests from others in the station. “It got other people a little upset,” Barnet says. Auer adds that right now they’re not doing anything to the tapes—the current batch seems to work fine without being pre-steamed.
Another major issue: power availability. “In the Northern hemisphere, we have a good power grid,” Auer says, “but down there, everything is run off two generators for the entire station, one of them active. They provide the power for the station and all of the outbuildings, and they’re basically the only source of power for our experiments. If something goes wrong in the power plant down there, every minor problem is immediately visible in IceCube, and nobody ever knows how long it will take to restore power.”
But perhaps the biggest challenge is simply getting to the servers themselves. Even during the summer, there’s limited access to the server room—it’s in an out-building separate from the station. During the winter, the weather could keep the onsite team from getting access at all for days at a time.
Despite all the challenges—or perhaps because of them—Auer and Barnet think their jobs are pretty cool, both literally and figuratively. “When you can tell people, we’re going to the South Pole, and we run a data center that has about 150 servers and provides 99.5-plus percent uptime, that’s just cool,” Auer says.
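For a sense of scale, an uptime percentage translates directly into a downtime budget. The sketch below is a simple conversion, assuming the figure is measured over a full 365-day year (the measurement window is not stated above); the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_budget_hours(uptime_percent: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of allowed downtime in a period for a given uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime must be between 0 and 100 percent")
    return (100 - uptime_percent) / 100 * period_hours


if __name__ == "__main__":
    print(f"{downtime_budget_hours(99.5):.1f} hours per year")
```

At 99.5 percent, that budget is just under 44 hours a year, which has to cover every generator hiccup, frozen vent, and weather-blocked trip to the server room.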
Originally posted 2015-09-06 07:33:03.