Exercises in Performance

One of the things we like to pride ourselves on here at Zoosk is the user experience, and one key facet of that experience is performance. Performance is often assumed during the development cycle, but it is rarely called out explicitly as a “feature.” A blog post on High Scalability put it best:

The less interactive a site becomes the more likely users are to click away and do something else. Latency is the mother of interactivity. Though it’s possible through various UI techniques to make pages subjectively feel faster, slow sites generally lead to higher customer defection rates, which lead to lower conversion rates, which results in lower sales. Yet for some reason latency isn’t a topic talked a lot about for web apps. We talk a lot about building high-capacity sites, but very little about how to build low-latency sites. We apparently do so at the expense of our immortal bottom line.

Many techniques have been developed over the years to address latency and interactivity, the most obvious being the rise of Web 2.0 and the dynamic loading of content after the page renders. Ajax has been the go-to remedy for low-latency websites, allowing for non-blocking page loads and deferring high-cost operations until the content is actually needed or requested. This has gotten us far, but we can do better.

Enter BigPipe

In 2010, Facebook published a blog post titled BigPipe: Pipelining web pages for high performance. In it, they outline a rather novel approach to delivering highly dynamic, content-driven websites with low latency. I will leave you to read the article for the specifics, but to summarize: a page is broken down into highly decoupled widgets, or pagelets as they call them. Each pagelet is represented on the page by a div, which acts as a placeholder for where the content will be displayed once it has been processed. This skeleton is flushed to the user immediately; however, instead of closing the connection, the server keeps the request alive. In the background, asynchronous processes are triggered which fetch and process the data for each pagelet. Upon completion, each process flushes some data to the page as JSON, which initializes the pagelet with the desired content. Once all processes have completed, the request is closed and the page is “complete.” Facebook found that by moving to BigPipe, they were able to cut time-to-interact latency roughly in half. This is a significant win, and one I wanted to see if we could replicate here at Zoosk.
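
To make that flow concrete, here is a minimal sketch of what the browser might receive over the lifetime of a single BigPipe response. The pagelet names and the onPageletArrive helper are purely illustrative, not Facebook’s actual implementation.

    <!-- 1. The skeleton, flushed immediately: one placeholder div per
         pagelet, plus a small bootstrap script. -->
    <div id="pagelet_profile"></div>
    <div id="pagelet_messages"></div>
    <script>
      // Each chunk the server flushes later in this same response calls this.
      function onPageletArrive(payload) {
        document.getElementById(payload.id).innerHTML = payload.content;
      }
    </script>

    <!-- 2. Flushed later, as each pagelet finishes on the server. -->
    <script>
      onPageletArrive({"id": "pagelet_profile", "content": "<span>profile markup</span>"});
    </script>

Everything above arrives over a single connection; the browser renders the skeleton right away and fills in each pagelet as its script chunk streams in.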

Javascript to the rescue

In order to test BigPipe with our product, I decided to try out a new stack, one which lent itself more readily to asynchronous programming. We have typically been a LAMP shop here at Zoosk, but for this, I decided to use Node.js. Having most recently rebuilt our site as an HTML5 single-page application targeting smartphones, I found the Javascript event loop ideally suited to the task at hand. For the test case, I chose to rewrite our internal admin tool. While not a user-facing application, the admin tool is the backbone of our user operations team and an ideal candidate for optimization. The end result? We reduced TTFB (time to first byte) by 50 percent and page load time by 35-50 percent. For an initial, unoptimized proof of concept, this was a huge win, and I am convinced that with further tuning we could drop the latency even further.
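
To give a sense of why the event loop is such a good fit, here is a stripped-down sketch of the server side of this approach in Node.js. The pagelets, their data fetchers, and the markup are all hypothetical; the point is the shape of the response: flush the skeleton, keep the response open, and flush each pagelet the moment its data comes back.

    var http = require('http');

    // Hypothetical pagelet data sources; in practice these would hit internal
    // services or a database. Each takes a callback(err, html).
    function fetchProfile(cb)  { setTimeout(function () { cb(null, '<b>profile</b>'); }, 50); }
    function fetchMessages(cb) { setTimeout(function () { cb(null, '<b>messages</b>'); }, 200); }

    var pagelets = { pagelet_profile: fetchProfile, pagelet_messages: fetchMessages };

    // The skeleton: one placeholder div per pagelet plus the onPageletArrive
    // bootstrap from the client-side sketch above.
    function renderSkeleton(ids) {
      return '<html><body>' +
        ids.map(function (id) { return '<div id="' + id + '"></div>'; }).join('') +
        '<script>function onPageletArrive(p){' +
        'document.getElementById(p.id).innerHTML=p.content;}</script>';
    }

    http.createServer(function (req, res) {
      res.writeHead(200, { 'Content-Type': 'text/html' });

      var ids = Object.keys(pagelets);

      // 1. Flush the skeleton immediately; the connection stays open.
      res.write(renderSkeleton(ids));

      var remaining = ids.length;

      // 2. Start every pagelet fetch in parallel on the event loop.
      ids.forEach(function (id) {
        pagelets[id](function (err, html) {
          // 3. Flush each pagelet as soon as its data is ready.
          res.write('<script>onPageletArrive(' +
            JSON.stringify({ id: id, content: err ? 'Something went wrong.' : html }) +
            ');</script>');

          // 4. End the response once the last pagelet has been flushed.
          if (--remaining === 0) res.end('</body></html>');
        });
      });
    }).listen(8080);

Nothing here blocks: the forEach merely registers callbacks, and the event loop flushes each chunk as the corresponding fetch completes.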

The devil is in the details

While the Facebook article does a good job of outlining the BigPipe methodology at a high level, I want to draw attention to a few items that were not immediately obvious candidates for optimization:

  • SSL – Here at Zoosk, we take user privacy extremely seriously, which is why we host an all-SSL experience. While SSL keeps your data secure, it does come at a price: it adds overhead to serving a request, with most of that overhead coming from the handshake protocol. Every external network resource, be it CSS, Javascript, an image, an Ajax request, etc., incurs this overhead, which can lead to a higher-latency experience. By moving what would typically be an Ajax-heavy experience to BigPipe, you avoid much of this cost. Instead of several Ajax requests each negotiating the SSL handshake, the browser makes ONE request, which can be authenticated and have its SSL terminated at the edge of your network (see the sketch after this list). All internal processing can then use faster transport protocols within the confines of your network.
  • Physical locality – Dovetailing with the point above, BigPipe is able to take advantage of the principle of locality. The location of a user and the quality of their connection can play a large role in the latency required to serve a request. How fast is their DNS server? How close are they to their local exchange? Is their ISP shaping or throttling bandwidth? How many hops are there before reaching our cage? Are they on a mobile network? Every connection a user’s browser makes is affected by these factors, and more, while requests within our cage take just a few milliseconds. In a typical Ajax-heavy application, the page has to wait until DOMContentLoaded before firing off any additional requests. That initial request carries overhead that must be paid and introduces a stall in the request pipeline, and each additional Ajax request incurs the same cost again. By switching to BigPipe, a user makes just one (potentially) costly request to our servers. From that point on, we can make fast requests within our cage and flush the results as soon as we get them, keeping the request pipeline from stalling.
  • Browser connections – Speaking of pipeline stalls, one source of stalls in the rendering pipeline is the maximum number of simultaneous connections a browser can make per host. Currently, most browsers have an upper bound of 4-6 connections (this drops to 1-2 connections on mobile devices over 3G). As pages have become more dynamic and interactive, it has become increasingly easy to hit these hard limits. Once that threshold has been reached, no new requests can be made to that domain until a previous request completes, and your page stalls out. By using BigPipe, you return the connections that would have been used for Ajax requests to the available pool.

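To illustrate the first two points, here is a rough sketch of terminating SSL on the single long-lived BigPipe request and then fanning out to backend services over plain HTTP inside the network. The certificate paths, internal host name, and endpoint are placeholders, not our actual topology.

    var https = require('https');
    var http  = require('http');
    var fs    = require('fs');

    // TLS is negotiated once, at the edge, for the single BigPipe request.
    // The certificate paths below are placeholders.
    var tlsOptions = {
      key:  fs.readFileSync('key.pem'),
      cert: fs.readFileSync('cert.pem')
    };

    https.createServer(tlsOptions, function (req, res) {
      res.writeHead(200, { 'Content-Type': 'text/html' });

      // Flush the skeleton and the onPageletArrive bootstrap right away.
      res.write('<html><body><div id="pagelet_profile"></div>' +
        '<script>function onPageletArrive(p){' +
        'document.getElementById(p.id).innerHTML=p.content;}</script>');

      // Inside the cage, fetch pagelet data over plain HTTP -- no extra
      // handshakes, and each round trip takes only a few milliseconds.
      http.get({ host: 'profile.internal', path: '/profile' }, function (apiRes) {
        var body = '';
        apiRes.on('data', function (chunk) { body += chunk; });
        apiRes.on('end', function () {
          res.write('<script>onPageletArrive(' +
            JSON.stringify({ id: 'pagelet_profile', content: body }) +
            ');</script>');
          res.end('</body></html>');
        });
      });
    }).listen(443);
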
Over the coming months, we will be applying the lessons learned to our core product, resulting in a snappier, more delightful Zoosk.