Engineering Series: Keeping Upverter Up

We’ve been pretty cagey in the past about a lot of our engineering efforts at Upverter.  Today, we want to start lifting the veil a bit and talk about some of the things we’ve done under the hood to keep the Upverter platform stable, despite huge feature pushes.

Stability starts with culture.  We enforce a pretty stringent engineering culture, augmented by a handful of software systems: all code changes get (quite brutally) code reviewed by two other engineers using our custom-modded version of Rietveld, before Buildbot runs them against a battery of tests and packages everything for deployment.


We generally avoid big deployments or “release management” since they basically act as risk capacitors.  Instead, everyone on the team can deploy any code that has passed code review and the test suite at any time – and they do.  We usually deploy several times a day.

Overwhelmingly, our stability stems from these kinds of ‘best practices’.  However, we have over 120,000 lines of JavaScript running client-side in people’s browsers, and that means there’s a huge surface area for client-side stability problems to arise, despite any amount of testing.  Furthermore, it can be a harrowing experience for a hardware engineer if their editor keeps running into errors.

The good news is that instead of having to wait for your software distributor to send you a new version, at Upverter we’re able to deploy fixes to our servers as soon as we see errors occur.  To keep an eye on the stability of connected clients, we have a big dashboard in the main engineering space:

[Image: the live-error dashboard]

The dashboard displays all the key data for managing live errors on the site.  It shows us how many times an error has occurred (based on a hash of the stack trace), which users are affected, and what part of the code base is responsible.  We also see the times of first and last occurrence.  Since our last revision, all new errors are automatically posted to our task management tool, Asana, and the engineer tasked with the fix is synced back to the dashboard using the Asana API.
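Bucketing errors by a hash of the stack trace can be sketched roughly like this. The normalization and hash function below are illustrative choices, not Upverter's actual code: stripping line and column numbers keeps the same logical error in one bucket across slightly different builds.

```javascript
// Sketch: deduplicate client errors by hashing a normalized stack trace.
// normalizeStack and hashStack are hypothetical names for illustration.
function normalizeStack(stack) {
  // Strip line/column numbers so the same logical error from slightly
  // different builds still hashes to the same bucket.
  return stack
    .split('\n')
    .map(function (line) { return line.replace(/:\d+:\d+/g, ''); })
    .join('\n');
}

function hashStack(stack) {
  // Simple 32-bit FNV-1a hash; any stable hash works for bucketing.
  var normalized = normalizeStack(stack);
  var hash = 0x811c9dc5;
  for (var i = 0; i < normalized.length; i++) {
    hash ^= normalized.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}
```

With this scheme, two crashes at the same frame but different line offsets (`app.js:10:5` vs. `app.js:12:7`) count as one error on the dashboard, while genuinely different stacks land in separate buckets.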

[Image: error details on the dashboard]

To track down complex bugs, we send a lot of data back with every error.  Client-side, we take advantage of Google Closure’s global error handler and add a bunch of extra contextual information to the stack trace, including the entire history of the client session: what tools were used, what shapes were placed, and when.  Additionally, users are given the opportunity to submit reproduction steps after their design reloads.
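The shape of such an error handler might look like the sketch below. The function and field names are hypothetical, and the transport is elided; in a Closure-based client the hook would come from the library's error handler (`goog.debug.ErrorHandler`), with `window.onerror` as the plain-browser analogue shown here:

```javascript
// Sketch: a global error handler that attaches the session's action log
// to each error report. Names are illustrative, not Upverter's code.
var sessionHistory = [];

function recordAction(tool, detail) {
  // Called by editor tools as the user works, building a replayable log.
  sessionHistory.push({ tool: tool, detail: detail, at: Date.now() });
}

function reportError(error) {
  // Bundle the stack trace with everything the user did this session,
  // then ship it to the server (transport elided here).
  return {
    message: error.message,
    stack: error.stack || '',
    history: sessionHistory.slice(), // copy: reporting must not mutate the log
    reportedAt: Date.now()
  };
}

// Browser hook; guarded so the sketch also loads outside a browser.
if (typeof window !== 'undefined') {
  window.onerror = function (msg, src, line, col, error) {
    reportError(error || new Error(msg));
  };
}
```

The key design point is that the history is recorded continuously, not reconstructed after the fact, so every report arrives with the full lead-up to the crash.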

Here’s what our engineers see:

[Image: the error report as seen by engineers]

Finally, we can also browse the connection history to ensure there wasn’t any kind of network problem that contributed to the error:

[Image: client connection history]

We’re also able to reuse the session-history data to track how far into a session errors typically occur, and whether significant disconnects and reconnects precede the crash.
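Mining the session history for these two signals is straightforward once every event carries a timestamp and a type. The event schema below is an assumption for illustration:

```javascript
// Sketch: crash-pattern queries over a session's event log.
// Each event is assumed to be { type: string, at: millisecondTimestamp }.
function timeToError(events) {
  // How long into the session did the first error occur?
  var start = events[0].at;
  var error = events.filter(function (e) { return e.type === 'error'; })[0];
  return error ? error.at - start : null;
}

function disconnectsBefore(events) {
  // Count disconnect events preceding the first error.
  var count = 0;
  for (var i = 0; i < events.length; i++) {
    if (events[i].type === 'error') break;
    if (events[i].type === 'disconnect') count++;
  }
  return count;
}
```

Aggregated across many sessions, these two numbers separate "crashes on load" bugs from "crashes after an hour of editing" bugs, and flag errors that are really network problems in disguise.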

Once the problem is diagnosed, the patch goes into code review, and it’s wash-rinse-repeat!

Sure beats waiting for the next version.
