Here in DC, members of the general public who want a COVID vaccine register for one on a website where appointments are made available in batches. Eligibility has recently expanded to include people under 65 with certain high-risk medical conditions, which means a larger group than ever is attempting to book one of the 4000 or so appointment slots when they’re initially released.

As you might imagine, the appointment-booking website is absolutely buckling under the strain. It’s been a frustrating experience, made worse by how DC’s CTO has described the situation:

Throughout the crashes and subsequent apologies, the refrain has been, “the system is experiencing high demand! Thirty thousand people are trying to use it at once! Keep refreshing! The real problem is that DC needs more vaccine!”

And yes, DC does need more vaccine, just like everyone else, but plentiful vaccine isn’t going to make a website able to handle an entirely predictable number of visitors.

DC is a city of 700,000 people. I do not know the precise number of people who are currently eligible to be vaccinated, but it’s not at all a stretch to think at least 100,000 (and probably more) people in this city are seniors, part of a critical workforce group, or have extremely common health conditions like diabetes or high blood pressure. But the website is absolutely collapsing at around 30,000 concurrent connections.

Now, I am not a developer or server engineer. But I do work with people who are, keeping some extremely high-traffic, high-profile websites online. And more to the point, I spend a lot of time explaining to non-technical clients how the decisions they make can affect the availability of their high-traffic websites. Based on that experience, I can tell you that 30,000 concurrent users is just not that many.

When you need a web service to be available to a large number of users, there are broadly three ways you can improve it:

  1. Increase technical resources: You buy more processor cycles, more bandwidth, more servers, until you can handle all your traffic through pure brute strength. This can be expensive; cloud tech has made it more accessible than it used to be, but costs can still be unpredictable and high.
  2. Increase efficiency: You reduce the amount of work the server has to do per user. You design your application to have fewer clicks, fewer writes to the database, fewer files that have to be served.
  3. Manage demand: You reduce the number of users that hit the site all at once. An application and environment that would fail at 30,000 users can often handle 15,000 just fine, so how do you make sure all 30,000 people aren’t using it at the same time?

The DC form makes inadequate use of all three of these strategies.

Managing demand is another way to say “flattening the curve.”

This is probably the most important one in the case of COVID vaccines. There’s a limited supply available every week. DC is trying to ensure equity in the process by prioritizing slots for the less-affluent wards hardest hit by COVID. As a result, only about 3000-4000 appointments become available at a time. So the very fact that 30,000 users are all trying to book appointments at once is a problem (and the real number is probably much higher; 30,000 is just the point at which the server runs out of resources). Close to 90% of users are going to come away disappointed even if the website performs perfectly.
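To make that arithmetic concrete, here is a quick back-of-the-envelope sketch in Python. The function name is mine, and the 30,000 figure is only the point at which the site falls over, so treat the result as a floor rather than a measurement.

```python
def share_left_without_slot(concurrent_users: int, slots: int) -> float:
    """Fraction of users who cannot book even if the site never crashes."""
    if concurrent_users <= 0:
        raise ValueError("concurrent_users must be positive")
    # No more people can book than there are slots (or users).
    booked = min(slots, concurrent_users)
    return 1 - booked / concurrent_users


# Figures from the post: roughly 3000-4000 slots, 30,000 concurrent users.
for slots in (3000, 4000):
    share = share_left_without_slot(30_000, slots)
    print(f"{slots} slots: {share:.0%} of users leave without an appointment")
```

With 3000 slots that works out to 90% of users, and with 4000 slots to about 87%. No amount of server capacity changes that number.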

Rather than throwing 4000 appointments on the website at 9 AM and letting the Hunger Games happen, what if DC had registered residents in advance (say, during the months we were all staying home, waiting for vaccines to be approved), and then started contacting people as they became eligible? Most of the information in the form is pretty static: name, address, occupation, medical condition. The only information that has to be established at the time of making the appointment is the actual appointment time/location and the person’s COVID status.

Pre-registration and contacting people when their turn comes up means people are only being offered an appointment when there’s one available for them. It also increases equity for residents who don’t have reliable home internet access, or a flexible job that lets them refresh a browser for an hour while trying to get through. And for web performance, it drastically cuts down on the number of people simultaneously using a form that writes to the database, and reduces the amount of data being written at once.
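Here is a minimal sketch of what “contact people when their turn comes up” could look like, assuming a pre-registration list already exists. Everything in it is hypothetical: the `Registration` fields, the numeric priority scheme, and the `next_invitation_batch` name are illustrations, not a description of anything DC actually runs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Registration:
    resident_id: str
    priority: int          # hypothetical: lower number = invited sooner
    registered_order: int  # tie-breaker: earlier registrations go first


def next_invitation_batch(
    waitlist: List[Registration], slots_available: int
) -> Tuple[List[Registration], List[Registration]]:
    """Invite only as many people as there are appointments to offer."""
    ordered = sorted(waitlist, key=lambda r: (r.priority, r.registered_order))
    cutoff = max(0, slots_available)
    return ordered[:cutoff], ordered[cutoff:]


waitlist = [
    Registration("resident-a", priority=2, registered_order=1),
    Registration("resident-b", priority=1, registered_order=2),
    Registration("resident-c", priority=1, registered_order=3),
]
invited, still_waiting = next_invitation_batch(waitlist, slots_available=2)
print([r.resident_id for r in invited])        # the two priority-1 residents
print([r.resident_id for r in still_waiting])  # everyone else keeps their place
```

The point of the design is that the number of people hitting the booking form is capped by the number of slots, not by the number of eligible residents.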

If you only have to accommodate fewer than 5000 people at a time on your form, efficiency in the form itself matters less, and making sure you have enough server resources gets a lot cheaper. But let’s just say that it’s a few months from now, vaccine is more plentiful, and you can release 30,000 appointments at once. What then?

Efficiency is about design choices as well as technical choices.

Even at high demand, you can make your form as lightweight as possible. I think DC’s form has made a good start at that: there’s minimal styling, no decorative assets, and the couple of external services invoked (CAPTCHA and maps) are critical for the functioning of the form. But there are some weird UX/UI choices happening that cause some inefficiency. For example, after you’ve entered your contact information and answered the COVID status questions, the page reloads and shows you the data you just typed in, with a checkbox asking you to attest that all that data is true. That’s an entire page request that doesn’t need to be there.

Additionally, when you look for an appointment slot, the form defaults to showing you the next week, and if the next week is booked up, it shows no appointments available. There might be appointments available after that, but then you have to adjust the date picker and try again. Why doesn’t the display just start from the next available appointment? Again, this is multiple page loads that don’t have to happen, not to mention just a confusing UX.
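Starting the display from the next available appointment is not a hard lookup. Here is a rough sketch, assuming the application can produce a count of open slots per date; the `first_available_date` name, the data shape, and the dates are all made up for illustration.

```python
from datetime import date
from typing import Dict, Optional


def first_available_date(
    open_slots_by_date: Dict[date, int], start: date
) -> Optional[date]:
    """Return the earliest date on or after `start` with an open slot."""
    candidates = [
        day for day, open_slots in open_slots_by_date.items()
        if day >= start and open_slots > 0
    ]
    # None means nothing is open at all, which is worth telling the user too.
    return min(candidates, default=None)


# Hypothetical availability: the coming week is full, later dates are not.
availability = {
    date(2021, 3, 1): 0,
    date(2021, 3, 3): 0,
    date(2021, 3, 9): 5,
    date(2021, 3, 12): 2,
}
print(first_available_date(availability, date(2021, 3, 1)))
```

One lookup like this replaces the cycle of “no appointments available,” adjust the date picker, and reload.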

Throw money, er, server resources at the problem.

We’ve already established that we shouldn’t have a situation where 100,000 eligible people are competing for 3000 slots. But if you’re going to do that, you should probably contract with your cloud provider (Microsoft Azure in this case) to allow 100,000+ concurrent connections. Not doing that just makes for an aggravating situation, even for the people who do manage to score one of the coveted appointments. It’s expensive, and that’s why the other ways of managing this problem are so important, but the entire point of cloud infrastructure is that it’s easier to provision additional resources on demand.

Despite these criticisms, I don’t want to come down too hard on DC, or other state and local governments, for not having their act completely together on this. The pandemic has been mismanaged from the top and that has left governments that should have been able to cooperate with a federal plan having to make it up as they go along. But DC is full of civic-minded tech experts who could and would have advised on these systems had they been asked to. Instead, we have a situation where trying to get scheduled for a COVID vaccine is like trying to score concert tickets before they sell out. We can and should do better.