
Why We Created Our Own Platform-as-a-Service at LinkedIn

June 06, 2016
By Special Guest
Steven Ihde, Head of the LPS project at LinkedIn

We often hear that the only constant in the technology industry is change. But especially when building infrastructure, that doesn’t mean change happens continuously. It happens iteratively, alternating between periods of stability and periods of rapid, sometimes disruptive, progress. Companies need to establish a firm foundation for business operations, but must periodically absorb some disruption to cast off technical debt and take that foundation to the next level.

At LinkedIn, we’ve re-architected our hosting systems four times over our company’s history and now host close to a thousand different services, and hundreds of thousands of service instances, in our data centers. These changes were driven by our business needs as we scaled to serve over 400 million LinkedIn members. This increasing scale also drives changes in our underlying hosting infrastructure.

As we grew, our people and processes were creaking under the increased load of running all of these services. It was taking us days or sometimes weeks to push out a new service, and it required coordination with multiple teams. Two years ago, we set out to make sure that deploying and running LinkedIn’s services on our infrastructure was as easy as (or easier than) using a public cloud service like Amazon EC2.

We knew that infrastructure is iterative. We also knew that the time between iterations is often based on how forward-looking your designs are, so we started looking at the hottest and most innovative projects that we could find. We looked at OpenStack, Docker, rkt, and similar projects. We found lots of good ideas, but back in 2014, most of them weren’t ready for full-scale deployment. Often they introduced unwanted overhead or were opinionated and would have forced us to re-implement parts of our software stack that were already serving our needs well.

During our search for resources, we did come upon several useful projects that would help us make progress. The increased focus on containerization helped us zero in on the problems encountered by early adopters and the solutions they engineered. Work on OpenStack and Mesosphere gave us insight into different approaches to resource allocation. In mid-June of 2015, Docker open sourced much of its core container code via the runC project.

We set out to build our own system, using both significant open source components and our own established technologies. The goals of this project were to find a way to reduce our hardware footprint, while at the same time increasing productivity by making our software stack more application-oriented. Our bet was that by abstracting the problems of deployment, resource provisioning, and dependency management at scale, we’d be massively increasing productivity for our software engineers and SRE team members.

We asked ourselves what this ideal hosting environment would need to do in practical terms and came up with the following criteria:

  • Enable service owners and engineers to manage the lifecycle of their own services;
  • Relentlessly optimize our infrastructure for maximum performance while maintaining robustness and adaptability;
  • Automatically compensate for human error and unexpected events by bringing more applications or resources online to maintain high availability;
  • Perhaps most importantly, achieve all of the above without implementing any “hacks” or incurring extra technical debt.
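The availability criterion above — bringing more instances online to compensate for failures or load spikes — boils down to a control loop that compares observed load against a target. Here is a minimal sketch of such a scaling policy; the function name, thresholds, and proportional-scaling rule are hypothetical illustrations, not LinkedIn’s actual implementation.

```python
import math


def desired_instances(current, cpu_utilization, target=0.6,
                      min_instances=2, max_instances=100):
    """Return the instance count that would bring average CPU
    utilization back toward the target (hypothetical policy).

    Assumes total load is conserved across instances, so:
        new_count * target ≈ current * cpu_utilization
    """
    if current == 0:
        return min_instances
    needed = math.ceil(current * cpu_utilization / target)
    # Clamp to a floor (for redundancy) and a ceiling (for cost).
    return max(min_instances, min(max_instances, needed))


# A sudden spike pushes 10 instances to 90% CPU: scale out.
print(desired_instances(10, 0.9))   # → 15
# Demand drops to 30% CPU: scale back in.
print(desired_instances(10, 0.3))   # → 5
```

In practice such a loop would also need hysteresis and cooldown periods so that noisy metrics don’t cause the system to flap between scaling up and down.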

When we started working on this new platform, we found that many of the systems that made our services much more elegant, automated, and user-friendly for engineers already contained control plane APIs that could be leveraged by LPS. Services like Nuage, inGraphs, and AutoAlerts provide the functionality to automatically provision data stores, provide operational and performance metrics, and monitor applications to ensure that new application instances are spun up when they are needed.

Another key enabler of LPS is the transition to a next generation of data center architecture. The new ultra-low latency data center design we’ve adopted allows service instances in any part of the data center to communicate with each other seamlessly.

Today, LPS is an integrated system that presents an entire data center as a single resource pool to application developers. Our resource allocation and containerization system, Rain, shares machines between different applications and eliminates unused and underused resources to increase cost efficiency. It uses Linux kernel namespaces and cgroups to ensure security and resource isolation between jobs. Rain also takes advantage of libcontainer via runC.

RACE, the Resource Allocation and Control Engine, handles scaling needs for our applications. For instance, RACE can scale up application instances in case of sudden spikes due to increased demand or failover scenarios. It can also scale down instances when they’re not needed. Other parts of the system are still in development (like Maestro, which provides a global view of the LPS system along with “gold record” application blueprints), but since implementing parts of LPS, we’ve seen a 95 percent time savings for deploying new applications.
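To make the namespace-and-cgroup isolation concrete, here is a minimal OCI-style fragment of the kind of `config.json` that runC consumes: the `namespaces` list gives each job its own process, filesystem, network, IPC, and hostname view, while the `resources` section applies cgroup limits. This is a generic sketch based on the OCI runtime specification, not Rain’s actual configuration.

```json
{
  "ociVersion": "1.0.2",
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "mount" },
      { "type": "network" },
      { "type": "ipc" },
      { "type": "uts" }
    ],
    "resources": {
      "memory": { "limit": 2147483648 },
      "cpu": { "shares": 512, "quota": 200000, "period": 100000 }
    }
  }
}
```

With these example limits, the job is capped at roughly 2 GiB of memory and, since the CPU quota is twice the period, about two CPUs’ worth of time per scheduling interval — the kind of per-job budget that lets a system like Rain safely pack multiple applications onto one machine.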

Building a private cloud is a long-term commitment, and not one every company could or should make. There are many hurdles to overcome, from integrating existing systems to the challenge of running your own infrastructure at scale. But if your company considers technology to be a core competency, then I strongly recommend that you start planning for the next time your architecture needs to be shaken up.

Edited by Maurice Nagle
