Artwork by Michelle Henry
Since it launched in February 2004, Facebook has grown into an online community where 890 million people log on each day. People share thoughts, photos, links and videos and like, share and comment on each other’s status updates. Users share two billion photos on the site each day.
All that online activity also creates unprecedented challenges in data management and distributed systems. With so many photos, and so many people constantly accessing them, what is the best way to store them? Questions like these interest Wyatt Lloyd, assistant professor in the USC Viterbi Computer Science Department.
“Facebook is the motivation for a lot of research in distributed systems—one computer can only do so much, so very often you need lots of computers,” said Lloyd. “Because of this giant social network, they’re running into problems people haven’t seen before, and they’re a huge and successful company, so they’re doing things on a larger scale than anyone has done before.”
Lloyd did an internship with Facebook while finishing his Ph.D. at Princeton University. He then did postdoctoral research there before coming to USC Viterbi in fall 2014. His main project focus has been improving photo storage and delivery.
When a user logs into Facebook and looks at a photo, that photo could be delivered in a variety of ways. If the user viewed it recently, it might be stored in the web browser’s local cache; if it’s not there, Facebook looks for the photo in a regional “Edge cache.” These are strategically placed around the country and abroad to maximize the accessibility of recently viewed photos. Los Angeles has one, for instance, and the greater New York metropolitan area has several.
If the photo is not in the browser and not in a regional Edge cache, Facebook looks in the next layer, called the “Origin cache.” Lastly, if the picture is still nowhere to be found, Facebook sends the request to the hard drives spread across its various datacenters, where all the photos are stored.
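The lookup order described above can be sketched as a chain of caches checked in sequence. The layer names mirror the article; the dict-based caches and the function names are illustrative assumptions, not Facebook’s actual code.

```python
def make_layers():
    """Build toy cache layers and a backend store (hypothetical data)."""
    browser_cache = {}
    edge_cache = {}
    origin_cache = {}
    # The hard-drive backend always has every photo.
    backend = {"photo_123": b"...jpeg bytes..."}
    return [browser_cache, edge_cache, origin_cache], backend

def fetch_photo(photo_id, caches, backend):
    """Check each cache layer in order; on a miss everywhere, hit the drives."""
    for layer in caches:
        if photo_id in layer:
            return layer[photo_id]          # cache hit: stop early
    photo = backend[photo_id]               # slowest path: hard drives
    for layer in caches:
        layer[photo_id] = photo             # populate caches for next time
    return photo
```

After one fetch, subsequent requests for the same photo are served from the fastest layer instead of the drives.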
Why all these layers of caches before the hard drives where the photos most certainly are? Hard drives are slow.
Because of the inherent physical limits of spinning disks, a single hard drive can retrieve only about 80 photos per second. When you have hundreds of millions of users uploading and requesting photos constantly, that’s just not going to cut it.
In contrast, the layers of caches Facebook uses store photos on flash cards, which are orders of magnitude faster. They can deliver approximately 40,000 photos per second, estimates Lloyd.
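Plugging in the two rates above makes the gap concrete. The demand figure here is a made-up round number purely for illustration.

```python
# Back-of-the-envelope arithmetic using the figures cited in the article.
hdd_rate = 80          # photos per second from one hard drive
flash_rate = 40_000    # photos per second from one flash device (Lloyd's estimate)

demand = 1_000_000     # hypothetical photo requests per second, for illustration
hdds_needed = demand // hdd_rate        # devices required with drives alone
flashes_needed = demand // flash_rate   # devices required with flash caches
```

At that hypothetical load, drives alone would need 12,500 devices where flash needs 25 — which is why the cache layers exist.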
Lloyd and collaborators from Cornell and Facebook studied the algorithms those flash-based caches use to decide which photos to keep. Many use an algorithm called “first in, first out,” or FIFO. Just like it sounds, the cache stores recently viewed photos, and when it’s full, it deletes the oldest one to make room for the newest.
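A minimal FIFO cache can be sketched in a few lines; this is a generic illustration of the policy, not the implementation used in Facebook’s caches.

```python
from collections import OrderedDict

class FIFOCache:
    """First-in, first-out: when full, evict whatever was inserted earliest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()   # preserves insertion order

    def get(self, photo_id):
        # Lookups do not change eviction order under FIFO.
        return self.store.get(photo_id)

    def put(self, photo_id, photo):
        if photo_id in self.store:
            return                                # already cached
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)        # evict the oldest insertion
        self.store[photo_id] = photo
```

Note that even a heavily viewed photo gets evicted once enough newer photos arrive, which is FIFO’s main weakness.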
That works reasonably well, but Lloyd ran simulations to test an algorithm called LRU, for “least recently used.” Under LRU, every view refreshes a photo’s standing, so when the flash card is full, the photo that has gone longest without being viewed is deleted and frequently viewed photos tend to stay. This algorithm, while theoretically superior, was more taxing on the flash cards: the constant writing and rewriting needed to keep photos ordered by recency wears the devices out faster.
To solve this, Lloyd and his collaborators designed a system that implements these effective algorithms without wearing out the flash quite so fast.
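The LRU policy itself can be sketched as follows. Again, this is a textbook illustration rather than the wear-aware system Lloyd’s team built; in an in-memory dict the reordering is cheap, but the analogous bookkeeping on flash is what causes the extra writes the article describes.

```python
from collections import OrderedDict

class LRUCache:
    """Least-recently-used: every hit refreshes a photo's position, and
    when the cache is full the longest-unviewed photo is evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()   # front = least recently used

    def get(self, photo_id):
        if photo_id not in self.store:
            return None
        self.store.move_to_end(photo_id)          # mark as recently used
        return self.store[photo_id]

    def put(self, photo_id, photo):
        if photo_id in self.store:
            self.store.move_to_end(photo_id)
        elif len(self.store) >= self.capacity:
            self.store.popitem(last=False)        # evict least recently used
        self.store[photo_id] = photo
```

Unlike FIFO, a photo that keeps getting viewed keeps surviving eviction.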
In addition to his work on photo delivery systems and algorithms, Lloyd explores the optimal way to store the photos in the short and long term.
Facebook keeps multiple copies of each photo on various hard drives as insurance against hard drive failure, just like individuals back up their work in multiple locations. But again, with the vast amount of data Facebook works with, this can be a logistical hurdle.
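The core idea of replication, keeping copies on distinct machines so one failure cannot lose a photo, can be sketched like this. The hash-based placement and the three-copy count are illustrative assumptions, not Facebook’s actual scheme.

```python
import itertools

def place_replicas(photo_id, hosts, copies=3):
    """Pick `copies` distinct hosts for a photo using a toy stable hash,
    so a single drive failure cannot lose it (illustrative only)."""
    start = sum(photo_id.encode()) % len(hosts)   # deterministic toy hash
    ring = itertools.cycle(hosts)                 # walk hosts in a ring
    return list(itertools.islice(ring, start, start + copies))
```

Each photo lands on a fixed set of distinct hosts, so any one host can be rebuilt from the others.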
Lloyd researched ways to maximize the space Facebook has for photos by storing newer photos differently than old photos.
Engineers describe data storage in temperatures: hot and warm. “Hot” refers to new, popular, frequently requested information. “Warm” data is a bit older and perhaps doesn’t need to be as readily accessible.
This is similar to the way we store items in our own homes. Frequently used items we keep on a shelf or table, things we need to have around but don’t use every day go in the closet, and things we need to keep but almost never use go in the attic.
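A simple age-based version of that temperature routing might look like the sketch below. The 30-day threshold is a made-up number for illustration; real systems would also weigh how often a photo is actually requested, not just its age.

```python
import time

HOT_WINDOW_DAYS = 30   # illustrative cutoff, not Facebook's actual threshold

def storage_tier(upload_ts, now=None):
    """Route a photo to 'hot' or 'warm' storage by age (toy heuristic)."""
    now = time.time() if now is None else now
    age_days = (now - upload_ts) / 86400
    return "hot" if age_days <= HOT_WINDOW_DAYS else "warm"
```

New uploads go to fast, heavily replicated hot storage; older photos migrate to denser, cheaper warm storage.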
While Facebook is unique in its data storage and accessibility needs, the research Lloyd and others conduct will not just benefit data behemoths. “The things we’re learning apply to smaller web services too,” said Lloyd.
But when he’s not researching photo storage and delivery, Lloyd does enjoy being a Facebook user.
“I like Facebook quite a bit,” Lloyd said. “I think about Facebook in a technical sense more than I use Facebook because it’s a wonderful motivating use case for tons of things that I think about and work on. But I am definitely a daily active user. I even signed my grandma up on Facebook.”