⚠️ Warning: This book was published in 2014. Some of the details and code samples may be outdated.

The Launch

Launching is always a stressful experience. No matter how much preparation you do upfront, production traffic is going to throw you a curve ball. During the launch you want to have a 360º view of your systems so you can identify problems early and react quickly when the unexpected happens.

Your War Room: Monitoring the Launch

The metrics and instrumentation we discussed earlier will give you a high-level overview of what’s happening on your servers, but during the launch, you want finer-grained, real-time statistics of what’s happening within the individual pieces of your stack.

Based on the load testing you did earlier, you should know where hot spots might flare up. Set up your cockpit with specialized tools so you can watch these areas like a hawk when production traffic hits.

Server Resources

htop is like the traditional top process viewer on steroids. It can be installed on Ubuntu systems with apt install htop.

_images/htop.png

Use htop to keep an eye on server-level metrics such as RAM and CPU usage. It will show you which processes are using the most resources per server. htop has a few other nifty tricks up its sleeve including the ability to:

  • send signals to running processes (useful for reloading uWSGI with a SIGHUP)

  • list open files for a process via lsof

  • trace library and syscalls via ltrace and strace

  • renice CPU intensive processes

What to Watch

  • Is the load average safe? During peak operation, it should not exceed the number of CPU cores.

  • Are any processes constantly using all of a CPU core? If so, can you split the process up across more workers to take advantage of multiple cores?

  • Is the server swapping (Swp)? If so, add more RAM or reduce the number of running processes.

  • Are any Python processes using excessive memory (greater than 300MB RES)? If so, you may want to use a profiler to determine why.

  • Are Varnish, your cache, and your database using lots of memory? That’s what you want. If they aren’t, double-check your configurations.
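If you'd rather script the load average check than eyeball it, here is a minimal sketch. It assumes a Unix-like host where Python's os.getloadavg() is available. The function names are made up for illustration, and the "one unit of load per core" threshold is simply the rule of thumb from the list above.

```python
import os


def load_per_core(load_average: float, cpu_cores: int) -> float:
    """Normalize a load average by the number of CPU cores."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return load_average / cpu_cores


def load_is_safe(load_average: float, cpu_cores: int) -> bool:
    """Rule of thumb: load should not exceed the number of cores."""
    return load_per_core(load_average, cpu_cores) <= 1.0


def current_load_is_safe() -> bool:
    """Check the 1-minute load average of the machine running this code."""
    one_minute, _five, _fifteen = os.getloadavg()
    # os.cpu_count() can return None, so fall back to a single core.
    return load_is_safe(one_minute, os.cpu_count() or 1)
```

Calling current_load_is_safe() from a cron job or a small loop gives you a crude alarm for the first item on the list. It is no substitute for watching htop during the launch.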

Varnish

Varnish is unique in that it doesn’t log to file by default. Instead, it comes bundled with a suite of tools that will give you all sorts of information about what it’s doing in realtime. The output of each of these tools can be filtered via tags1 and a special query language2 which you’ll see examples of below.

varnishstat

_images/varnishstat.png

You’ll use varnishstat to see your current hit-rate and the cumulative counts as well as ratios of different events, e.g. client connections, cache misses, backend connection failures, etc.

Note

The hit rate displayed in the upper-right can be deceiving. A pass in Varnish is not considered a cache miss, so the hit rate only measures the percentage of requests served from the cache for requests that can be served from the cache. If you want a true measure of requests served out of Varnish’s cache versus requests that are served from your backend, you’ll need to take into account the s_pass value as well.
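To make the difference concrete, here is a small sketch that computes both numbers from the cache_hit, cache_miss, and s_pass counters. The function name is hypothetical. It also assumes the displayed figure is simply hits divided by hits plus misses, which ignores the time-window averaging varnishstat does.

```python
from typing import Tuple


def cache_hit_rates(cache_hit: int, cache_miss: int, s_pass: int) -> Tuple[float, float]:
    """Return (displayed_rate, effective_rate) as fractions between 0 and 1.

    displayed_rate ignores passes, like the figure in varnishstat.
    effective_rate counts passes as requests that reached the backend.
    """
    lookups = cache_hit + cache_miss
    total = lookups + s_pass
    displayed = cache_hit / lookups if lookups else 0.0
    effective = cache_hit / total if total else 0.0
    return displayed, effective


# Example: 900 hits, 100 misses, and 1000 passes.
displayed, effective = cache_hit_rates(900, 100, 1000)
print(f"displayed: {displayed:.0%}, effective: {effective:.0%}")
```

In this made-up example the displayed hit rate looks healthy, yet fewer than half of all requests were actually served from the cache.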

varnishhist

_images/varnishhist.png

varnishhist is a neat tool that will create a histogram of response times. Cache hits are displayed as a | and misses are #. The x-axis is the time it took Varnish to process the request in logarithmic scale. 1e-3 is 1 millisecond while 1e0 is 1 second.

varnishtop

_images/varnishtop.png

varnishtop is a continuously updated list of the most common log entries with counts. This isn’t particularly useful until you add some filtering to the results. Here are a few incantations you might find handy:

  • varnishtop -b -i "BereqURL" Cache misses by URL – a good place to look for improving your hit rate

  • varnishtop -c -i "ReqURL" Most requested URLs on the client side (cache hits and misses alike)

  • varnishtop -i ReqMethod Incoming request methods, e.g. GET, POST, etc.

  • varnishtop -c -i RespStatus Response codes returned – sanity check that Varnish is not throwing errors

  • varnishtop -I "ReqHeader:User-Agent" User agents

varnishlog

varnishlog is similar to tailing a standard log file. On its own, it will spew everything from Varnish’s shared memory log, but you can filter it to see exactly what you’re looking for. For example:

  • varnishlog -b -g request -q "BerespStatus eq 404" \
    -i "BerespStatus,BereqURL"

    A stream of URLs that came back as a 404 from the backend.

What to Watch

  • Is your hit rate acceptable? “Acceptable” varies widely depending on your workload. On a read-heavy site with mostly anonymous users, it’s feasible to attain a hit rate of 90% or better.

  • Are URLs you expect to be cached actually getting served from cache?

  • Are URLs that should not be cached bypassing the cache?

  • What are the top URLs bypassing the cache? Can you tweak your VCL so they are cached?

  • Are there common 404s or permanent redirects you can catch in Varnish instead of Django?

1

https://hpd.sh/varnish-vsl

2

https://hpd.sh/varnish-vsl-query

uWSGI

uwsgitop shows statistics from your uWSGI process updated in realtime. It can be installed with pip install uwsgitop, and it connects to the stats socket (see uWSGI Tuning) of your uWSGI server via uwsgitop 127.0.0.1:1717.

_images/uwsgitop.png

It will show you, among other things:

  • number of requests served

  • average response time

  • bytes transferred

  • busy/idle status

Of course, you can also access the raw data directly to send to your metrics server:

uwsgi --connect-and-read 127.0.0.1:1717
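If you want to do the same from Python, a sketch along these lines could work. It assumes the stats server sends a single JSON document and then closes the connection. It also assumes each entry under workers carries status, requests, and avg_rt (treated here as microseconds). Treat those field names as assumptions and check them against the output of your own uWSGI version. The function names are made up for illustration.

```python
import json
import socket


def read_stats(host: str = "127.0.0.1", port: int = 1717) -> dict:
    """Read the JSON document the stats socket sends on connect."""
    chunks = []
    with socket.create_connection((host, port), timeout=5) as conn:
        while True:
            data = conn.recv(4096)
            if not data:  # the server closed the connection: we have it all
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


def summarize(stats: dict) -> dict:
    """Boil the per-worker data down to a few launch-day numbers."""
    workers = stats.get("workers", [])
    busy = sum(1 for w in workers if w.get("status") == "busy")
    requests = sum(w.get("requests", 0) for w in workers)
    # Rough estimate: weight each worker's average by its request count.
    weighted = sum(w.get("avg_rt", 0) * w.get("requests", 0) for w in workers)
    avg_ms = weighted / requests / 1000 if requests else 0.0
    return {
        "workers": len(workers),
        "busy": busy,
        "requests": requests,
        "avg_response_ms": avg_ms,
    }
```

Something like summarize(read_stats()) could then be shipped to your metrics server on a timer.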

What to Watch

  • Is the average response time acceptable (less than 1 second)? If not, you should look into optimizing at the Django level as described in The Build.

  • Are all the workers busy all the time? If there is still CPU and RAM to spare (htop will tell you that), you should add workers or threads. If there are no free resources, add more application servers or upgrade the resources available to them.

Celery

Celery provides both the inspect command3 to see point-in-time snapshots of activity as well as the events command4 to see a realtime stream of activity.

_images/celery_events.png

While both these tools are great in a pinch, Celery’s add-on web interface, flower5, offers more control and provides graphs to visualize what your queue is doing over time.

_images/celery_dashboard.png _images/celery_monitor.png
3

https://hpd.sh/celery-monitoring-commands

4

https://hpd.sh/celery-monitoring-events

5

https://hpd.sh/celery-flower

What to Watch

  • Are all tasks completing successfully?

  • Is the queue growing faster than the workers can process tasks? If your server has free resources, add Celery workers; if not, add another server to process tasks.
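A simple way to answer the second question is to sample the queue length at a fixed interval and look at the trend. The sketch below assumes you already have those samples. With a Redis broker, for example, the length of the list named after the queue (celery by default) is one way to get them. The function name is hypothetical.

```python
from typing import Sequence


def queue_growth_per_second(lengths: Sequence[int], interval_seconds: float) -> float:
    """Average change in queue length per second across evenly spaced samples.

    A positive result means tasks arrive faster than workers finish them.
    """
    if len(lengths) < 2:
        raise ValueError("need at least two samples")
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    elapsed = interval_seconds * (len(lengths) - 1)
    return (lengths[-1] - lengths[0]) / elapsed


# Three samples taken 30 seconds apart: the backlog grows by 2 tasks/second.
print(queue_growth_per_second([100, 160, 220], 30))
```

A rate that stays positive for several minutes is the signal to add workers or another server.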

Memcached

memcache-top6 will give you basic stats such as hit rate, evictions per second, and read/writes per second.

_images/memcache-top.png

It’s a single Perl script that can be downloaded and run without any other dependencies:

curl -L http://git.io/h85t > memcache-top
chmod +x memcache-top

Running it without any arguments will connect to a local memcached instance, or you can pass the instances flag to connect to multiple remote instances:

./memcache-top --instances=10.0.0.1,10.0.0.2,10.0.0.3
6

https://github.com/lincolnloop/memcache-top

What to Watch

  • How’s your hit rate? It should be in the nineties. If it isn’t, find out where you’re missing so you can take steps to improve. It could be due to a high eviction rate or a poor caching strategy for your workload.

  • Are connections and usage well balanced across the servers? If not, you’ll want to investigate a more efficient hashing algorithm, or modify the function that generates the cache keys.

  • Is the time spent per operation averaging less than 2ms? If not, you may be maxing out the hardware (swapping, network congestion, etc.). Adding additional servers or giving them more resources will help handle the load.
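The hit rate itself is easy to derive from the raw counters. As a minimal sketch, the functions below parse the text that memcached returns for its stats command (lines of the form STAT name value) and compute the hit rate from get_hits and get_misses. The function names are made up for illustration.

```python
def parse_stats(raw: str) -> dict:
    """Turn the text reply of memcached's stats command into a dict."""
    stats = {}
    for line in raw.splitlines():
        parts = line.strip().split(None, 2)
        # Only keep well-formed "STAT <name> <value>" lines; skip END.
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def hit_rate(stats: dict) -> float:
    """Fraction of get requests that were served from the cache."""
    hits = int(stats.get("get_hits", 0))
    misses = int(stats.get("get_misses", 0))
    total = hits + misses
    return hits / total if total else 0.0


reply = "STAT get_hits 950\r\nSTAT get_misses 50\r\nSTAT evictions 7\r\nEND\r\n"
stats = parse_stats(reply)
print(f"hit rate: {hit_rate(stats):.0%}, evictions: {stats['evictions']}")
```

Keep in mind these counters are cumulative since the server started, so compare two snapshots if you want the rate over the launch window only.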

Database

pg_top

Monitor your Postgres database activity with pg_top. It can be installed via apt install ptop (yes, ptop not pg_top) on Ubuntu. It not only shows you statistics for the current query, but also per-table (press R) and per-index (press X) statistics. Press E and type in the PID to explain a query in-place. The easiest way to run it is as the postgres user on the same machine as your database:

sudo -u postgres pg_top -d <your_database>
_images/pg_top.png

pg_stat_statements

On any recent version of Postgres, the pg_stat_statements extension 7 is a goldmine. On Ubuntu, it can be installed via apt install postgresql-contrib. To turn it on, add the following line to your postgresql.conf file:

shared_preload_libraries = 'pg_stat_statements'

Then create the extension in your database:

psql -c "CREATE extension pg_stat_statements;"

Once enabled, you can perform lookups like this to see which queries are the slowest or are consuming the most time overall.

-- On PostgreSQL 13 and later, the total_time column is named total_exec_time.
SELECT
  calls,
  round((total_time/1000/60)::numeric, 2) as total_minutes,
  round((total_time/calls)::numeric, 2) as average_ms,
  query
FROM pg_stat_statements
ORDER BY 2 DESC
LIMIT 100;

The best part is that it will normalize the queries, basically squashing out the variables and making the output much more useful.

_images/pg_stat_statements.png

The Postgres client’s output can be a bit hard to read by default. For line wrapping and a few other niceties, start it with the following flags:

psql -P border=2 -P format=wrapped -P linestyle=unicode

For MySQL users, pt-query-digest8 from the Percona Toolkit will give you similar information.

7

https://hpd.sh/pg-stat-statements

8

https://hpd.sh/pt-query-digest

pgBadger

While it won’t give you realtime information, it’s worth mentioning pgBadger9 here. If you prefer graphical interfaces or need more detail than what pg_stat_statements gives you, pgBadger has your back. You can use it to build pretty HTML reports of your query logs offline.

9

https://hpd.sh/pgbadger

mytop

The MySQL counterpart to pg_top is mytop. It can be installed with apt install mytop on Ubuntu. Use e and the query ID to explain it in-place.

_images/mytop.png

Tip

Since disks are often the bottleneck for databases, you’ll also want to look at your iowait time. You can see this via top as X%wa in the Cpu(s) row. This will tell you how much CPU time is spent waiting for disks. You want it to be zero or very close to it.
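On Linux, the same figure can be derived from two snapshots of the first line of /proc/stat. The sketch below assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal) and ignores the guest fields, since those are already counted in user and nice. The function names are hypothetical.

```python
def parse_cpu_line(line: str) -> list:
    """Parse the aggregate 'cpu' line of /proc/stat into integer counters."""
    fields = line.split()
    if len(fields) < 6 or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:]]


def iowait_percent(before: str, after: str) -> float:
    """Percentage of CPU time spent waiting on I/O between two snapshots."""
    old = parse_cpu_line(before)[:8]
    new = parse_cpu_line(after)[:8]
    deltas = [n - o for n, o in zip(new, old)]
    total = sum(deltas)
    # Index 4 is iowait in the assumed field order.
    return 100.0 * deltas[4] / total if total else 0.0


# Two made-up snapshots taken a few seconds apart.
first = "cpu  100 0 50 800 50 0 0 0 0 0"
second = "cpu  200 0 100 1550 150 0 0 0 0 0"
print(iowait_percent(first, second))
```

In practice you would read the first line of /proc/stat, sleep for a few seconds, read it again, and pass both lines in.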

What to Watch

  • Make sure the number of connections is well under the maximum connections you’ve configured. If not, bump up the maximum, investigate if that many connections are actually needed, or look into a connection pooler.

  • Watch out for “Idle in transaction” connections. If you do see them, they should go away quickly. If they hang around, one of the applications accessing your database might be leaking connections.

  • Are queries running for more than a second? They could be waiting on a lock or require some optimization. Make sure your database isn’t tied up working on these queries for too long.

  • Check for query patterns that are frequently displayed. Could they be cached or optimized away?
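For Postgres, the first two checks can be scripted against pg_stat_activity. The sketch below is one way to do it, under a few assumptions: you run the query with whatever database client you already use and hand the rows over as dictionaries, and the thresholds (80% of the connection limit, 5 seconds idle in transaction) are arbitrary starting points. The function name is made up for illustration.

```python
from typing import Iterable, Mapping

# Query to run with your database client of choice.
ACTIVITY_SQL = """
SELECT pid,
       state,
       EXTRACT(EPOCH FROM (now() - state_change)) AS seconds_in_state
FROM pg_stat_activity
"""


def connection_report(
    rows: Iterable[Mapping],
    max_connections: int,
    idle_txn_limit: float = 5.0,
    headroom: float = 0.8,
) -> dict:
    """Flag a crowded connection pool and lingering idle transactions."""
    rows = list(rows)
    stuck = [
        row["pid"]
        for row in rows
        if row.get("state") == "idle in transaction"
        # seconds_in_state can be NULL for some backends, so default to 0.
        and (row.get("seconds_in_state") or 0) > idle_txn_limit
    ]
    return {
        "connections": len(rows),
        "near_limit": len(rows) > max_connections * headroom,
        "stuck_pids": stuck,
    }
```

The PIDs in stuck_pids are the ones worth tracing back to an application, since they are the likely connection leaks.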

When Disaster Strikes

Despite all your preparation, it’s very possible your systems simply won’t keep up with real-world traffic. Response times will skyrocket, tying up all available uWSGI workers, and requests will start timing out at the load balancer or web accelerator level. If you are unlucky enough to experience this, chances are good that either your application servers, database servers, or both are bogging down under excessive load. In these cases, you want to look for the quickest fix possible. Don’t rule out throwing more CPUs at the problem for a short-term band-aid. Cloud servers cost pennies per hour and can get you out of a bind while you look for longer-term optimizations.

Application Server Overload

If the load is spiking on your application servers but the database is still humming along, the quickest remedy is to simply add more application servers to the pool (scaling horizontally). It will ease the congestion by spreading load across more CPUs. Keep in mind this will push more load down to your database, but hopefully it still has cycles to spare.

Once you have enough servers to bring load back down to a comfortable level, you’ll want to use your low-level toolkit to determine why they were needed. One possibility is a low cache hit rate on your web accelerators.

Note

We had a launch that looked exactly like this. We flipped the switch to the new servers and watched as load quickly increased on the application layer. This was expected as the caches warmed up, but the load never turned the corner; it just kept increasing. We expected the need for three application servers, launched with four, but ended up scaling to eight to keep up with the traffic. This was well outside of our initial estimates so we knew there was a problem.

We discovered that the production web accelerators weren’t functioning properly and made adjustments to fix the issue. This let us drop three application servers out of the pool, but it was still more than we expected. Next we looked at which Django views were consuming the most time. It turned out the views that calculated redirects for legacy URLs were not only very resource intensive, but, as expected, getting heavy traffic during the launch. Since these redirects never changed, we added a line in Varnish to cache the responses for one year.

With this and a few more minor optimizations, we were able to drop back down to our initially planned three servers, humming along at only 20% CPU utilization during normal operation.

Database Server Overload

Database overload is a little more concerning because it isn’t as simple to scale out horizontally. If your site is read-heavy, adding a replica (see Read-only Replicas) can still be a relatively simple fix to buy some time for troubleshooting. In this scenario, you’ll want to review the steps we took in Database Optimization and see if there’s anything you missed earlier that you can apply to your production install.

Note

We deployed a major rewrite for a client that exhibited pathological performance on the production database at launch. None of the other environments exhibited this behavior. After a couple of dead-end leads, we reviewed the slow query log of the database server. One particular query stood out that was extremely simple, but ate up the bulk of the database’s processing power. It looked something like:

SELECT ... FROM app_table WHERE fk_id=X

EXPLAIN told us we weren’t using an index to do the lookup, so it was scanning the entire massive table in memory. A review of the table’s indexes showed that the index on the foreign key referenced in the WHERE clause was absent.

The culprit was an incorrectly applied database migration that happened long before the feature actually launched, which explained why we didn’t see it in the other environments. A single SQL command to manually add the index immediately dropped the database load to almost zero.

Application & Database Server Overload

If both your application and database are on fire, you may have more of a challenge on your hands. Adding more application servers is only going to exacerbate the situation with your database. There are two ways to attack this problem.

You can start from the bottom up and look to optimize your database. Alleviating pressure on your database will typically make your application more performant and relieve pressure there as well.

Alternatively, if you can take pressure off your application servers by tuning your web accelerator, it will trickle down to the database and save you cycles there as well.

Note

A while back, we launched a rewrite for a very high-traffic CMS, then watched as the load skyrocketed across the entire infrastructure. We had done plenty of load testing, so the behavior certainly took us by surprise.

We focused on the load on the primary database server, hoping a resolution there would trickle up the stack. While watching mytop we noticed some queries that weren’t coming from the Django stack. An external application was running queries against the same database. This was expected, but its traffic was so low nobody expected it to make a dent in the beefy database server’s resources. It turned out that it triggered a number of long-running queries that tied up the database, bringing the Django application to its knees. Once we identified the cause, the solution was simple. The external application only needed read access to the database, so we pointed it to a replica database. It immediately dropped the load on the primary database and gave us the breathing room to focus on longer-term optimizations.


Once you’ve weathered the storm of the launch, it’s time to let out a big sigh of relief. The hardest work is behind you, but that doesn’t mean your job is done. In the next chapter, we’ll discuss maintaining your site and making sure it doesn’t fall into disrepair.