Where developers come to connect, share, build and be inspired.

18

SQL Queries Kill Website Scalability

4002 views

Have you ever designed a system that stays up even when the database goes down?

I have.

The number of people who are lining up to trash me for this post probably looks like the start of the Boston Marathon. But it’s true. Queries and joins are bad for your website performance. So is connecting to a database in the first place.

We can probably all agree that the fastest websites you can write wouldn’t use databases at all.

Nobody's ever said "I think I'll add some database queries to this page to speed it up."

Assuming a typical db powered website, it’s almost always the db that causes the website to crash when there’s a spike in load. If you could remove all external connections from your app layer (databases, external cache pools, etc.) then you truly could horizontally scale to an unlimited degree because there are no blocking subsystems keeping you from building pages. You only need to light up as many web servers as are necessary to handle your load.

Are you always aware of which pages are making db connections? If so, do you know what the db queries are doing? Probably not, due to the fact that there are often no clear default indicators during development indicating db utilization. For example, in PHP, a common session storage strategy is the database. When enabled, PHP automatically connects to the database to perform a combination of SELECT, INSERT, or UPDATE queries on every page load. You wouldn’t notice this unless you monitor the query log. Many profilers wouldn't even pick these queries up since the connection is managed by a different db connection than the one the rest of the site uses.

A better approach here would be to not use the database for session management. Hm..

"But my users need to register. Clearly I need to write database records to have record keys generated that don’t collide with other records."

I think most developers think this way just because that’s the good old standard way of thinking. After all, how would your new record get an ID if the database were down? You need some kind of unique non-colliding key generator.

But what if you just made the keys in your code? Enter the good old UUID function (or any highly unique function for that matter. Yes, UUID's are larger than INT's but its the price you pay for durability). Every one of my tables always has as UUID value and it is by this UUID that I refer to my persistent information. I never write sites which refer to objects via numeric key like this:

GET /users/show/1

Avoiding the database means you can keep accepting user registrations even if the database is down.

Write the data to some other more robust and scalable backend, such as a local file and/or cache and/or queue. As long as you eventually write the data to the db later on, you’re good to go. If you like numeric primary database keys, you can still let the db set them later. You don’t need them though because you’ve keyed your records with UUID’s.

"But on the next page, I would need to SELECT the user out of the system that was just created."

Much like an ORM does relational data retrieval for you, it’s not that hard to write something that retrieves your stored data for you. For example, in PHP, some pseudo code might look like this:

$userModel = $userService->getByUuid($uuid);

If the database is down, what could that getter do to retrieve the user object we wrote in the last step?

If, during the write operation, we wrote the user data to the cache (among other backends), it would be findable, even if the database is down. The best available copy is in the cache! We don’t care how the service gets the best available version of this record, we just know that it gets it for us in the fastest way possible. This durable data access magic is encapsulated in a service which looks and works like many other systems you have seen before.

This pattern is the premise of a data access system I once helped design which I referred to as "Data Layer". How Data Layer works will be the subject of another post.

Getting back to SQL, What's My Problem With JOIN’s

  • A JOIN’s is like telling a waiter to get multiple orders at once. It's part of their job, but if you try to give that waiter 1,000 orders all at the same time, the waiter does a pretty bad job.

To cope with this, apps use a query result cache pattern. If your app has 1,000 different queries, you might have hundreds of different cached copies of query results.

However, when you have many different cached query results, any changes to the underlying data will cause all of the queries which have cached the changed data obsolete.

  • As load increases, JOIN's cause lock contention and deadlocks. When you scale and you eventually start getting lock contention and deadlocks from queries you're doing at runtime, there’s almost nothing you can do other than redesign your code.

SQL Query Result Cache is Fine. What is this cache problem you speak of?

Lets say we’ve got a movie website which has, among other things, a movie detail page and also another page which shows a list of a user’s favorite 80’s movies.

Earlier in my career, when coding the favorite movies page, I would have suggested that you join users to userfavoritemovies and again to movies where the user id is the guy who’s list you’re looking at. "That way, you get everything you want in one database result and you can even cache the query result."

I have grown to hate this approach.

Why? Because you probably have other pages on the site with different queries that join to the movies table too. Those queries also have their query result caches with their own private copy of the database information.

What happens when the underlying information about a movie changes?

In this scenario, you have three basic options:

  • Do nothing, and let your site serve out different inconsistent information about your movies from page to page until the TTL's expire. This option is horrible.
  • Try to remove all of the cached queries that might refer to that movie. This option is also horrible because in a busy site, you need a hydrated cache layer or else things might come to a grinding halt.
  • Write the app differently without joins and with smart record-level caching.

In my opinion, the best solution is the one without JOIN’s, and also with little-to-no database connections, and also where the data is current.

The pseudo code PHP API would look like this:

$userModel = $userService->getByUuid($uuid);

$favoriteMoviesCollection = $userService->getUserFavoriteMovies($userModel);

The above code should look pretty standard. Except you should now start being able to imagine how and why it's working without JOIN's and database connections.

Under the hood, getUserFavoriteMovies() is pulling a cache-backed list of ID's which was pre-heated the last time Johnny updated his favorite movies. This ID list is then then iterated over, one at a time, pulling each movie record out of the elastic memcache-backed data access system. With this approach, the database was not used and all of the records are cache-backed and all of the data is current.

Plus, this pattern allows for a website to remain up even when the database goes down.

Best of all, this pattern of accessing records individually also supports the sharding of a database where as in a JOIN-based system, one MySQL server cannot join to records which are on another shard.

Lastly, using an individual record CRUD-style data access pattern allows you to more seamlessly integrate other backends in such a way that your app logic wouldn't care if the movie record you've pulled was stored in MongoDB, MySQL, files, or anything else. You'd be able to use any kind of backend storage because you didn't write your code to be dependent on the a JOIN feature which is only supported by an RDBMS.

That's the end of the article. If you really like writing queries, try my sql coding challenge.

Comments

  • 53f421eabf21e9641b354ef0c192e4cc_normal
    typicalprog

    PHP can already store session information in the file system, memcache, or redis. There's no reason to reinvent the wheel.

    Adding a UUID to every table as a key is a terrible idea. You bloat your primary key and every index from 4 bytes for the usual INT UNSIGNED to 32 bytes for the UUID, and replace a fast integer comparison with a slow string comparison, for every key match. That might explain the JOIN problems you're having.

    If you need to expose a database row ID in your URLs but want to make it tamper-proof it's better to use a salted SHA1 or even MD5 hash along with the ID. This is a fairly common practice to protect HTML FORM fields from tampering. Even better is using PHP's session storage so you aren't exposing raw database IDs in the browser.

    Throwing out JOIN means not using a relational model (it's called unnormalized data). It may be appealing at first but it will cause duplicate and inconsistent data, and you'll be writing application code to make your app ACID-compliant. You are unlikely to get that right -- RDBMSs have a 30 year head start on you.

    Writing that RDBMSs are not scalabale or that SQL queries "were the leading cause of website fatalities" in 2012 (where did you get that from?) just advertises inexperience. Every big enterprise application and almost every big web site runs on top of RDBMSs, leveraging several decades of reliability, redundancy, optimization, and scalability experience. Do you think your bank uses Mongo or some home-rolled UUID-based system without JOINs? No, they use Oracle.

    In my considerable experience working with broken web sites and databases (that's what I do for a living) I'd say the leading cause of website fatalities is bad programming. If your database server is slow or crashing you should be looking at your database design and queries, not coming to the conclusion that RDBMSs suck. If I crash on my motorcycle it's alway my own fault, and I don't conclude that I know more about building motorcycles than Kawasaki does.

  • Blank-mugshot
    wildlyinaccurate

    If your application is hitting the database enough to crash it then you need to optimise your queries for cacheability, and maybe even revisit the database design. You should probably also get a professional DB admin to set the database server up properly.

    Rolling your own UUID-based system and throwing away JOINs is going to be a nightmare for maintainability and I'll bet it scales worse than a well-configured RDBMS.

    Besides, any good ORM will take care of performance and caching for you. Look into something like Doctrine, which has caching built-in and is super smart when it comes to single big JOINs vs multiple smaller queries.

  • Avatar-ika
    irakli

    @jonathan,

    I second your sentiment. JOINs are evil for scalability. "SQL optimization" is rarely an answer, because they can help performance at low scale, but not as much: scalability.

    At NPR we've been successfully doing something similar to what you're describing for years, now. Our content management system stores data in a highly-reliable and consistent, fully normalized SQL store, but at publish we "burn" the same content as completely de-normalized content objects that can be served super-fast at run-time. Years ago, we used to use file storage for storing those de-normalized documents. They are now stored in an efficient document database.

  • Blank-mugshot
    milesj

    I agree with most of what's stated here, excluding using UUIDs. When I set off to write my own model/DBAL/ORM system, the primary goal was to not use any kind of join internally. This allows for the ORM to hit DBs in other shards, other servers, hell, even other DB engines (as long as the FKs match). It even supports result set query caching so that every record is cached by ID and pulled in, allowing everything to populate and refresh the cache relatively quick. I like to think of it as a lightweight alternative to Doctrine.

    https://github.com/titon/model

  • Logo_normal
    usef

    use nosql database.

  • Jfaytwitter_normal
    jfay_dba

    A swing and a miss from "typicalprog". The author is talking about design patterns that scale. RDBMS certainly can scale, but only to a point at which time, you're looking at very expensive and non-profitable redesign. If you can take some of the learning pain from highly-available and extremely busy websites and systems, apply that past experience to your own designs, you're in a better position.

  • 0adb059_normal
    sqlnightowl

    What kind of DB system are you running? Our DB has been up/running for the past 3 years, 8 months, 4 days and assorted hours. We took it down to upgrade the hardware and DB from SQL Server 2005 to SQL Server 2008R2. It runs in a cluster and we have planned failovers for maintenance, but the database is still up/available. Our web/app serevers are hardly that robust and restart, on average, once a week. Maybe you're solving the wrong problem, I'd fix the DB availability issue first. I do agree that session state shouldn't be stored in an RDBMS, I have seen some fairly slick key-value implementations using noSQL as well as a couple of state server implementaitons that scale quite well. IMHO: you have to use the right tool for the job. BTW: ORMs kill DB performance. Most of the code generated by an ORM where more than one table is involved absolutely suck. If the issue with ORM's ever gets fixed then I may be persuaded, but that day isn't here, yet.

  • 53f421eabf21e9641b354ef0c192e4cc_normal
    typicalprog

    jfay_dba... I know the author refers to "patterns" but the article is a mixture of very specific advice complete with code examples and unsubstantiated claims about the evils of relational JOINs. I don't see any actual patterns here, just straw man arguments.

    It's obvious that architecting an application to not hit the database any more than necessary will probably speed things up: "We can probably all agree that the fastest websites you can write wouldn’t use databases at all." Yes, I agree, but that is neither a design pattern nor is it scalable.

    Switching gears from PHP session management to using UUIDs as primary keys to relational joins the author writes "A JOIN’s is like telling a waiter to get multiple orders at once. It's part of their job, but if you try to give that waiter 1,000 orders all at the same time, the waiter does a pretty bad job." Actually a JOIN is more like having the waiter bring the entire order out at once to serve a table rather than making separate trips to the kitchen for each dish.

    If you decide not to store related data in normalized tables and use the highly-optimized RDBMS to do the work you have two alternatives: Do it yourself by writing the join lookup code in the application (an anti-pattern I see quite a bit, perpetuated by simplistic ORMs), or by abandoning normalization and stuffing everything in a single table. Neither of those scale nearly as well as the relational solution.

    The author has not made a clear case against the relational model or RDBMSs, just some vague statements about JOINs killing websites. Nor has the author offered a workable alternative. There are no design patterns at all that I can discern. The advice about UUIDs and not using JOINs is terrible and flies in the face of years of relational database experience, including some real time-tested patterns.

    Sure, there are applications where the relational model is not the best fit. Those tend to be at the very low end (toy to-do list apps and recipe managers) and at the very high end (web search engine data). I don't think a database of favorite movies is going to tax a relational database.

Add a comment