Engineering Blog - February 24, 2015

Evolution of the Photo System at Zoosk

Chandra Vijayarenu

Because Zoosk is a dating website, photos are an integral part of our users’ experiences. Having a good profile photo helps Zoosk’s users make good first impressions. This is why we are continuously trying to identify better ways for users to upload, edit, and maintain their photo galleries.

A user’s photo gallery consists of information about:

all the photos uploaded by the user
edits the user has made to the photos
which photos are actively visible on the user’s profile

This information is stored in a relational database.

Photo System v1
The initial version of Zoosk’s photo system was a library of helper functions written in PHP that defined the interface to our underlying distributed file storage systems, such as MogileFS and Amazon S3, and to the ImageMagick extension. The gallery information about the photos was stored in a relational database.

Photo System v2
One of the first enhancements we wanted to make to Zoosk’s photo system was to convert it into a service, so that we could separate it from Zoosk’s core codebase and remove library dependencies, like ImageMagick, from our API servers. To achieve this we built a Thrift interface between our API tier and the photo tier, then moved all the necessary libraries and the photo relational database behind the service. The service was implemented in PHP using the ImageMagick library. Although this solved our code maintainability and library dependency problems, it did not add any benefits for the user. The system still had a lot of flaws:

The photo transcoding was sequential. Each time a Zoosk user uploaded a single photo, we generated 12 different sizes of that photo, which were then used all over the website and across different mobile apps. This photo generation happened synchronously, so the user had to wait for all of the photos to be generated before he or she could see one photo uploaded.
With the addition of devices featuring retina display, such as the iPad, new challenges arose. None of the 12 existing photo sizes could be served on a high-resolution device. Since the new size generation would also be done synchronously, adding the new high-resolution sizes would increase the photo upload time significantly. This also meant that we had to generate new high-resolution images for all of the current photos the user already had in his or her photo gallery. (This was finally achieved with the help of 100 Amazon EC2 instances working tirelessly for three weeks.)
We were not using a CDN, nor were we taking full advantage of S3 header settings to set the cache timeout.
The photo gallery information was part of the user database cluster, and the photo system had no knowledge of the gallery’s business logic. Because of this, any change in the photo system had to be communicated back to the API tier using an extra Thrift network call.

To address these issues we took the following steps.

Introduction of CDN. We experimented with some of the CDN providers and noticed an improvement in the load time of users’ profile pages.
S3 cache timeout. Because image files are static and never change, it made sense to set the cache timeout to a high value so that they stayed cached for as long as possible.
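As a rough illustration of the cache-timeout idea, the helper below builds the HTTP headers one might attach to an image when storing it. The function name, the JPEG content type, and the one-year value are assumptions for the sketch; the actual timeout we used is not stated here.

```python
# Assumed value: the post only says the timeout was set "to a high value".
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def immutable_image_headers(max_age=ONE_YEAR_SECONDS):
    """Build long-lived cache headers for an image that never changes."""
    return {
        "Content-Type": "image/jpeg",  # assumed; the example URL ends in .jpg
        # "public" lets shared caches such as a CDN keep the object.
        "Cache-Control": f"public, max-age={max_age}",
    }
```

Headers like these would be set as object metadata at upload time, so both browsers and a CDN can reuse the image without revalidating it.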

Even with these enhancements, the time it took a Zoosk user to upload a photo did not improve.

Photo System v3
Photo System v3 was truly a dynamic photo generation system. First we hosted the system on Amazon EC2 so that we could decrease the time for access to S3. (S3 was our backend photo storage system, so it made logical sense to have this photo system in EC2.) We also moved the gallery from the user database to Photo System v3, which allowed us to independently maintain Zoosk member galleries and not worry about calling the API tier back.

This is how the new system worked:

The user uploaded a photo.
We validated the photo and returned a photo id back to the API tier.
The API tier generated a URL based on the photo id and the required photo size (for example: https://cdn-zoosk/photoid/size.jpg). The URL included all the information needed to generate the photo dynamically.
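The URL scheme from the last step can be sketched as a pair of helpers: one that the API tier might use to compose a URL, and one that the photo tier might use to recover the photo id and size from it. The function names and the "200x200" size format are hypothetical; only the overall URL shape comes from the example above.

```python
from urllib.parse import urlparse

CDN_BASE = "https://cdn-zoosk"  # host taken from the example URL


def build_photo_url(photo_id, size):
    """Compose a CDN URL carrying the photo id and the requested size."""
    return f"{CDN_BASE}/{photo_id}/{size}.jpg"


def parse_photo_url(url):
    """Recover (photo_id, size) from a URL produced by build_photo_url."""
    path = urlparse(url).path  # for example "/12345/200x200.jpg"
    photo_id, filename = path.strip("/").split("/")
    size, _extension = filename.rsplit(".", 1)
    return photo_id, size
```

Because everything needed to render the image is in the URL, the URL also works as a natural cache key for the CDN.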

From the photo id, we got all the crop information needed from the database. This involved getting the edit information applied by the user and also the EXIF information present in the image itself. (Most photos carry EXIF information, which tells us about the image, such as its orientation, height, and width.) This EXIF information was used along with the edits the user made to produce the resulting image. The size of the resulting image came from the URL. This solved most of our problems.
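To make the rendering step more concrete, here is a minimal sketch of how the EXIF orientation, the user's crop, and the requested size could be combined into a render plan. It only computes geometry and does not touch pixels. The `Crop` type, the function names, and the assumption that crop coordinates are stored relative to the correctly oriented image are all illustrative, not a description of our actual code.

```python
from dataclasses import dataclass

# EXIF orientation values 5 to 8 involve a 90-degree rotation,
# so the displayed width and height are swapped relative to the stored ones.
ROTATED_ORIENTATIONS = {5, 6, 7, 8}


@dataclass
class Crop:
    left: int
    top: int
    width: int
    height: int


def oriented_size(width, height, exif_orientation):
    """Return the (width, height) of the image as it should be displayed."""
    if exif_orientation in ROTATED_ORIENTATIONS:
        return height, width
    return width, height


def plan_render(stored_w, stored_h, exif_orientation, crop, target_w):
    """Return the clamped crop box and the output (width, height)."""
    img_w, img_h = oriented_size(stored_w, stored_h, exif_orientation)
    if crop is None:  # no user edit: use the whole image
        crop = Crop(0, 0, img_w, img_h)
    # Clamp the crop so it stays inside the oriented image.
    left = max(0, min(crop.left, img_w - 1))
    top = max(0, min(crop.top, img_h - 1))
    width = max(1, min(crop.width, img_w - left))
    height = max(1, min(crop.height, img_h - top))
    # Scale to the requested width, preserving the crop's aspect ratio.
    target_h = max(1, round(target_w * height / width))
    return Crop(left, top, width, height), (target_w, target_h)
```

For example, a 4000x3000 photo stored with orientation 6 and no user crop, requested at a width of 300, yields a 300x400 output, because the displayed image is portrait.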

Advantages of our current photo system:

The photo is generated dynamically. Now we can generate an image of any size and also don’t have to worry about generating images from older photos.
The upload time for the Zoosk user decreased from 2 seconds to 200 milliseconds.
The gallery information is contained in the photo system, so maintenance is easy to manage.
CDN in front of our dynamic photo system ensures that the images are being cached.
Load on our API tier decreased.

Migrating from Photo System v2 to Photo System v3
One of the biggest challenges of building such a big system was managing the switch from Photo System v2 to Photo System v3. Photo System v2 had been live for close to six years and held millions of members’ profile photos. We also had a few hundred terabytes of images in S3 buckets that were being served by Photo System v2 and needed to be migrated to the new system. On top of this, we were receiving live photo uploads at a rate of hundreds of thousands a day.

How did we migrate photos from an old system to a new system without affecting the user experience?

To achieve this we designed some changes in the API tier. We rolled this out in phases.

Phase 1: Dual write and read from Photo System v2
We built Photo System v3 in parallel with Photo System v2 and started to dual write. When a Zoosk member uploaded a photo, it was written to Photo System v2, and an asynchronous job was then created to add the same photo to Photo System v3. Since this was done as an asynchronous job, it did not affect the user experience. During this phase we also ran distributed jobs to copy existing photos from Photo System v2 to Photo System v3. Considerable care was taken to make sure both systems were in sync.
Phase 2: Dual write and read from Photo System v3
Once the migration caught up and both systems were in sync, we switched the reads from Photo System v2 to Photo System v3. This was seamless for users on the web and mobile apps.
Phase 3: Deprecate Photo System v2
Once the switch happened, we stopped writing to Photo System v2 and deprecated the system.
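The three phases can be sketched as a small router in the API tier that decides where writes and reads go. This is a toy model under stated assumptions: the dictionaries stand in for the two photo systems, the deque stands in for the asynchronous job queue, and all class and method names are hypothetical. The description above does not say how phase 2 ordered its writes, so writing to both stores directly in that phase is an assumption.

```python
from collections import deque
from enum import Enum


class Phase(Enum):
    DUAL_WRITE_READ_V2 = 1
    DUAL_WRITE_READ_V3 = 2
    V3_ONLY = 3


class PhotoRouter:
    def __init__(self, phase):
        self.phase = phase
        self.v2 = {}            # stand-in for Photo System v2
        self.v3 = {}            # stand-in for Photo System v3
        self.pending = deque()  # stand-in for the asynchronous job queue

    def upload(self, photo_id, data):
        if self.phase is Phase.DUAL_WRITE_READ_V2:
            self.v2[photo_id] = data
            self.pending.append(photo_id)  # copied to v3 by a later job
        elif self.phase is Phase.DUAL_WRITE_READ_V3:
            self.v3[photo_id] = data       # assumed ordering, see lead-in
            self.v2[photo_id] = data
        else:
            self.v3[photo_id] = data

    def run_pending_jobs(self):
        """Drain the queue, copying each queued photo from v2 to v3."""
        while self.pending:
            photo_id = self.pending.popleft()
            self.v3[photo_id] = self.v2[photo_id]

    def backfill(self):
        """Copy photos that exist only in v2, like the distributed copy jobs."""
        for photo_id, data in self.v2.items():
            self.v3.setdefault(photo_id, data)

    def in_sync(self):
        """True when every v2 photo is present and identical in v3."""
        return all(k in self.v3 and self.v3[k] == v for k, v in self.v2.items())

    def read(self, photo_id):
        if self.phase is Phase.DUAL_WRITE_READ_V2:
            return self.v2[photo_id]
        return self.v3[photo_id]
```

The point of the design is that the phase is a single switch: reads only move to v3 once `in_sync()` holds, and v2 can be dropped later without touching any caller.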

Conclusion
We built a dynamic photo system that can generate different sizes of photos on the fly and significantly reduced the photo upload time. This also reduced the overall response time of the website and increased user engagement by 2%.