Stability starts in culture. We enforce a pretty stringent engineering culture, augmented by a handful of software systems: all code changes get (quite brutally) code reviewed by two other engineers using our custom-modded version of Rietveld, before buildbot runs it against a battery of tests and packages everything for deployment.
We generally avoid big deployments or “release management” since they basically act as risk capacitors. Instead, everyone on the team can deploy any code that has passed code review and the test suite at any time – and they do. We usually deploy several times a day.
Overwhelmingly, our stability stems from these kinds of ‘best practices’. However, we have over 120,000 lines of JavaScript running client-side in people’s browsers, and that means there’s a huge surface area for client-side stability problems to arise, despite any amount of testing. Furthermore, it can be a harrowing experience for a hardware engineer if their editor keeps running into errors.
The good news is that instead of having to wait for your software distributor to send you a new version, at Upverter we’re able to deploy fixes to our servers as soon as we see errors happen. To keep an eye on the stability of connected clients, we have a big dashboard in the main engineering space:
The dashboard displays all the key data for managing live errors on the site. It shows us how many times each error has occurred (counted per hash of the stack trace), which users are affected, and what part of the code base is responsible. We also see times of first and last occurrence. Since our last revision, all new errors are automatically posted to our task management tool, Asana, and the engineer tasked with the fix is sync’d back to the dash using the Asana API.
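To make the "hash of the stack trace" idea concrete, here is a minimal sketch of how error reports could be bucketed for a dashboard like this. It is not our production code: the report fields (`stack`, `user`, `time`), the `ErrorGroup` class and the choice of SHA-1 are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ErrorGroup:
    """Aggregate view of one distinct error (hypothetical structure)."""
    count: int = 0
    users: set = field(default_factory=set)
    first_seen: float = 0.0
    last_seen: float = 0.0


def stack_hash(stack_trace: str) -> str:
    # Strip per-line whitespace so trivial formatting differences
    # do not split one error into several groups.
    lines = [line.strip() for line in stack_trace.strip().splitlines()]
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()


def group_errors(reports):
    """Bucket reports (dicts with 'stack', 'user', 'time') by stack trace hash."""
    groups = {}
    for report in reports:
        key = stack_hash(report["stack"])
        group = groups.get(key)
        if group is None:
            group = ErrorGroup(first_seen=report["time"], last_seen=report["time"])
            groups[key] = group
        group.count += 1
        group.users.add(report["user"])
        group.first_seen = min(group.first_seen, report["time"])
        group.last_seen = max(group.last_seen, report["time"])
    return groups
```

Each entry in the returned dictionary corresponds to one row on the dash: occurrence count, affected users, and first and last occurrence.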
In order to track down complex bugs, we send a lot of data back with every error. Client-side, we take advantage of Google Closure’s global error handler, and add a bunch of extra contextual information to the stack trace, including the entire history of the client session: what tools were used, what shapes were placed, and when. Additionally, users are given the opportunity to submit reproduction steps after their design reloads.
Here’s what our engineers see:
Finally, we can also browse the connection history to ensure there wasn’t any kind of network problem that contributed to the error:
We’re able to re-use the session history information to track how long into sessions errors typically occur, and whether there are significant disconnect/reconnects prior to the crash.
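As a rough illustration of that kind of analysis, the sketch below takes a session history and an error timestamp and reports how far into the session the error happened and how many disconnects preceded it. The event format (a list of `(timestamp, kind)` pairs), the `"disconnect"` label, the 60-second window and the function name are assumptions for the example, not our actual schema.

```python
def summarize_session(events, error_time, window=60.0):
    """Summarize a session history relative to an error.

    events: list of (timestamp_seconds, kind) tuples, e.g. (12.5, "disconnect").
    error_time: timestamp of the error, in the same clock as the events.
    window: how many seconds before the error to look for disconnects.
    """
    session_start = min(timestamp for timestamp, _ in events)
    disconnects = sum(
        1
        for timestamp, kind in events
        if kind == "disconnect" and error_time - window <= timestamp <= error_time
    )
    return {
        "seconds_into_session": error_time - session_start,
        "disconnects_before_error": disconnects,
    }
```

Running this over every reported error gives a distribution of time-to-error, and a quick flag for crashes that followed a flaky connection.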
Once the problem is diagnosed, the patch goes into code review, and it’s wash-rinse-repeat!
Sure beats waiting for the next version.