As we’ve shared previously, we’ve spent the last several months on a major upgrade to the infrastructure that runs Watchman Monitoring. This is a big investment in the future of the platform: it lets us ship updates, new features, and bug fixes far faster than before. As with any migration of this size, it also comes with the occasional bump along the way, and we want to be transparent about one we hit this week.
What happened
We completed the migration to the new infrastructure at the beginning of this week. Shortly after, we noticed a handful of customers reporting that their Mac clients weren’t updating to the latest version. While our team was investigating and triaging that issue, a change made during troubleshooting inadvertently left the application in a state that would fail the next time it restarted.
Overnight, the application went through a minor, routine restart. Because of that lingering change, it came back up incorrectly and was effectively down. As soon as a team member was available and saw the downtime, we triaged it and brought everything back online.
We understand exactly why it happened and have already put a process in place so it doesn’t happen again.
The Mac client update issue
We’ve also resolved the underlying Mac client update problem. Any Mac clients that have recently been failing to update should pick up the update automatically on their next check-in; no action is needed on your end.
Looking ahead
Overall, the move to our new infrastructure has gone according to plan, and we’re excited about what it unlocks. That said, with a change this large, we’ll likely continue to find small edge cases over the coming weeks. If you notice anything that used to work one way and now behaves differently, please let us know.
Coming out of this, we’ve launched a status page where you can always check on the health of Watchman Monitoring: https://status.watchmanmonitoring.com
Going forward, if you’re ever unsure whether the app is up or experiencing an issue, this is the place to look. We’ll post status updates there during any incident, and it reflects the live health of the service.
