Have you ever designed a system that stays up even when the database goes down?
The number of people who are lining up to trash me for this post probably looks like the start of the Boston Marathon. But it’s true. Queries and joins are bad for your website performance. So is connecting to a database in the first place.
We can probably all agree that the fastest websites you can write wouldn’t use databases at all.
Nobody's ever said "I think I'll add some database queries to this page to speed it up."
Assuming a typical db powered website, it’s almost always the db that causes the website to crash when there’s a spike in load. If you could remove all external connections from your app layer (databases, external cache pools, etc.) then you truly could horizontally scale to an unlimited degree because there are no blocking subsystems keeping you from building pages. You only need to light up as many web servers as are necessary to handle your load.
Are you always aware of which pages are making db connections? If so, do you know what the db queries are doing? Probably not, due to the fact that there are often no clear default indicators during development indicating db utilization. For example, in PHP, a common session storage strategy is the database. When enabled, PHP automatically connects to the database to perform a combination of SELECT, INSERT, or UPDATE queries on every page load. You wouldn’t notice this unless you monitor the query log. Many profilers wouldn't even pick these queries up since the connection is managed by a different db connection than the one the rest of the site uses.
A better approach here would be to not use the database for session management. Hm..
"But my users need to register. Clearly I need to write database records to have record keys generated that don’t collide with other records."
I think most developers think this way just because that’s the good old standard way of thinking. After all, how would your new record get an ID if the database were down? You need some kind of unique non-colliding key generator.
But what if you just made the keys in your code? Enter the good old UUID function (or any highly unique function for that matter. Yes, UUID's are larger than INT's but its the price you pay for durability). Every one of my tables always has as UUID value and it is by this UUID that I refer to my persistent information. I never write sites which refer to objects via numeric key like this:
Avoiding the database means you can keep accepting user registrations even if the database is down.
Write the data to some other more robust and scalable backend, such as a local file and/or cache and/or queue. As long as you eventually write the data to the db later on, you’re good to go. If you like numeric primary database keys, you can still let the db set them later. You don’t need them though because you’ve keyed your records with UUID’s.
"But on the next page, I would need to SELECT the user out of the system that was just created."
Much like an ORM does relational data retrieval for you, it’s not that hard to write something that retrieves your stored data for you. For example, in PHP, some pseudo code might look like this:
$userModel = $userService->getByUuid($uuid);
If the database is down, what could that getter do to retrieve the user object we wrote in the last step?
If, during the write operation, we wrote the user data to the cache (among other backends), it would be findable, even if the database is down. The best available copy is in the cache! We don’t care how the service gets the best available version of this record, we just know that it gets it for us in the fastest way possible. This durable data access magic is encapsulated in a service which looks and works like many other systems you have seen before.
This pattern is the premise of a data access system I once helped design which I referred to as "Data Layer". How Data Layer works will be the subject of another post.
Getting back to SQL, What's My Problem With JOIN’s
- A JOIN’s is like telling a waiter to get multiple orders at once. It's part of their job, but if you try to give that waiter 1,000 orders all at the same time, the waiter does a pretty bad job.
To cope with this, apps use a query result cache pattern. If your app has 1,000 different queries, you might have hundreds of different cached copies of query results.
However, when you have many different cached query results, any changes to the underlying data will cause all of the queries which have cached the changed data obsolete.
- As load increases, JOIN's cause lock contention and deadlocks. When you scale and you eventually start getting lock contention and deadlocks from queries you're doing at runtime, there’s almost nothing you can do other than redesign your code.
SQL Query Result Cache is Fine. What is this cache problem you speak of?
Lets say we’ve got a movie website which has, among other things, a movie detail page and also another page which shows a list of a user’s favorite 80’s movies.
Earlier in my career, when coding the favorite movies page, I would have suggested that you join users to userfavoritemovies and again to movies where the user id is the guy who’s list you’re looking at. "That way, you get everything you want in one database result and you can even cache the query result."
I have grown to hate this approach.
Why? Because you probably have other pages on the site with different queries that join to the movies table too. Those queries also have their query result caches with their own private copy of the database information.
What happens when the underlying information about a movie changes?
In this scenario, you have three basic options:
- Do nothing, and let your site serve out different inconsistent information about your movies from page to page until the TTL's expire. This option is horrible.
- Try to remove all of the cached queries that might refer to that movie. This option is also horrible because in a busy site, you need a hydrated cache layer or else things might come to a grinding halt.
- Write the app differently without joins and with smart record-level caching.
In my opinion, the best solution is the one without JOIN’s, and also with little-to-no database connections, and also where the data is current.
The pseudo code PHP API would look like this:
$userModel = $userService->getByUuid($uuid);
$favoriteMoviesCollection = $userService->getUserFavoriteMovies($userModel);
The above code should look pretty standard. Except you should now start being able to imagine how and why it's working without JOIN's and database connections.
Under the hood, getUserFavoriteMovies() is pulling a cache-backed list of ID's which was pre-heated the last time Johnny updated his favorite movies. This ID list is then then iterated over, one at a time, pulling each movie record out of the elastic memcache-backed data access system. With this approach, the database was not used and all of the records are cache-backed and all of the data is current.
Plus, this pattern allows for a website to remain up even when the database goes down.
Best of all, this pattern of accessing records individually also supports the sharding of a database where as in a JOIN-based system, one MySQL server cannot join to records which are on another shard.
Lastly, using an individual record CRUD-style data access pattern allows you to more seamlessly integrate other backends in such a way that your app logic wouldn't care if the movie record you've pulled was stored in MongoDB, MySQL, files, or anything else. You'd be able to use any kind of backend storage because you didn't write your code to be dependent on the a JOIN feature which is only supported by an RDBMS.
That's the end of the article. If you really like writing queries, try my sql coding challenge.