Redis & Go: Building a simple swear word filter
I will be honest. I am fairly new to Go and Redis and even with my limited skill-set with these two technologies I've been able to build a pretty bad-ass swear word filter together for my company.
Warning: This post contains a small set of swear words that you may find offensive. If you do, than you are simply not my target audience. ;)
At this stage, I would consider this tool a prototype that is still in development but I'm going to share with you the recipe for my tooling and some of the choices I made. Also keep in mind, this system is very simple to build and nothing advanced but boy is it freakin' FAST and offers a great starting point.
Let's start with the ingredients list that we will need:
- At least Redis 2.6 - http://www.redis.io
- At least Go 1.1 or higher - http://golang.org/doc/install
- A database of swear words - http://www.infochimps.com/datasets/list-of-bad-words
- Redigo - a Go binding for Redis - https://github.com/garyburd/redigo
- Sugar and Spice and everything nice. (I have all girls, they made me add this.)
Why Go?
Well there are many great choices out there. But I like speed and I like simple. And I also like Go's amazing concurrency support. As it turns out, Go for me has been a great tool for prototyping things in a similar fashion to Python but you also get the benefit of really fast execution speed right off the bat. Go is one of those technologies that really just behaves as advertised, uses little memory and is super stable for building server based applications. I've had a Go web server running on my companies intranet that's been running for about 3 months without a reboot and it's just happy as a clam utilizing 2.3 megs of RAM and when a web page request is made it's also happy to answer with a response with no hesitation. I like that in a product.
Why Redis?
I needed a mini-project to help me learn more about Redis. They say it's fast. They say it's stable. They say it's simple to model your data. They say it's persistent. They say it requires zero configuration. They say it's basically memcached plus more. Out of the box you get the basic key/value storage just like memcached in addition to: hashes, sets, sorted sets, lua scripting, atomic counters, and a boat-load of other convenient functions. If you want a super easy way to play with it head on over to: http://try.redis.io where you can enter commands to your hearts content.
Why Redigo?
Redis has a large selection of client libraries available but since we're using Go, we obviously need to use a Go based client library. I spent some time researching some of the different Redis client libraries for Go and Redigo seams to offer the most support for Redis with a nice Gothic api. Additionally the author Gary Burd understands Go very well. For those that have no idea WTF I just said: Apparently Gothic is Go's way of saying idiomatic Go code just like Python has Pythonic.
//To install Redigo
go get github.com/garyburd/redigo/redis
Why Infochimps database of swear words?
Well, you don't have to use this data-set. However, they did provide a lot of various combinations of curse words that we'd like to filter out. As an example, we obviously need to filter out words like: ass, shit, fuck, etc. But I'm also interested in filtering out combinations of creative spellings like: @ss, sh!t, fcuk, etc. The reality here is that, our solution is only as good as our dataset. If our filter list only has 400 curse words, that leaves a lot of room for people to get creative and spell words using special symbols, or simply gives them the room to come up with word combinations like: fuckface.
The fact of the matter is, my solution is not bullet proof, but it is simple. I'm not making use of any kind of heuristics or machine learning. Furthermore I'm not doing any kind of natural language processing or word roots, stemming, you get the picture. My solution is more of a brute force nature of just feed in data to redis and hopefully you catch a good 85% of the cases.
Feel free to use your own dataset or come up with them yourself to test yourself on how dirty you really are. ;)
On to the technical implementation
It's pretty straightforward here is my amazing implementation:
- Using a pipeline, loop over all your words one by one, adding each words to a Redis set. A pipeline will offer extremely fast performance because it will not wait for Redis to respond for each word.
- If you have multiple data sets of words like what Infochimps provides, you are sure to have duplicates. Who cares? This is why we are using a Redis set. It will only hold distinct curse words.
- As an optimization, make sure you add each curse word in lowercase only. This will eliminate the need to have a set containing all capital versions and all lower case versions as well as all combinations of capital/lowercase. In other words we don't want this in our set: Fuck FUCK fUcK we simply want: fuck f*ck fcuk, etc.
- Once your set is populated, Redis in its default configuration will eventually backup this dataset to disk in a redis dump file. You can then take this file and copy wherever you need for safe keeping. Redis upon reboot, will load this file when found in the correct location.
- If you ever need to add to your database, just have Redis crank on more curse word files, adding each word individually, you can expand your set overtime as you get better curse word data.
- The above steps only need to be done once or as needed when you discover new data sets of curse words.
- Now that we have a redis set of anywhere from hundreds of words, to hundreds of thousands, Redis is now populated and ready to do it's work.
- The good news is Redis should have no trouble storing hundreds of thousands of curse words in memory, so all lookups will be RAM based look ups.
- The other good news is because we added all our curse words to a Redis set, Redis will be able to do matches internally using a super fast hashing algorithm which will do a match in constant time also known as O(1). So, if we're are checking a single word to see if it exists within our bad-words set within Redis we can do something like:
//This will run in constant-time or O(1) resulting in FAST matches
SISMEMBER bad-words fuck
//Returns: 1 for true, 0 for false
- Keep in mind, Redis is not going to do a hash comparison per word in your curse word set, that would be bad. Basically, that would mean Redis is really doing a linear search, also known as O(n) to find a match. Nope Redis, is smart enough and will only need to do one comparison per word that you need to check.
Are you still with me?
If you are still with me, congratulations our first part is done. We have a working Redis database that simply has one key in it called: bad-words. This key is the name of a Redis set in memory that holds our database of curse words. At this point we will build a Go based tool that will allow us to check some input of data that we will store in Redis as another set named: user-words to see if any curse words are present within it.
The word checker app
Alright, let's break down the app below that actually takes some input data and verifies if a curse word is found. Please consider a few things: This app is a prototype, and it's a simple console based application that simple accepts some text from the user and validates this text against Redis upon the user pressing the ENTER key.
The basic structure of the app is as follows:
- Define our package main
- Import the required modules including the Redigo client
- Define our redis instance TCP address and port (in this case localhost)
- Create a single redis connection
- Define our submitData function
- Define our main function which kicks off the process
Now for the details of what's happening:
The app is going to do a blocking loop forever using the empty for construct in Go. Blocking is important because that means it's going to wait for the user to enter a line of text. When the user enters a line of text the app is going to determine if the user simply entered the string "q" then we are going to break out of the for loop and end the session. Anything else, and the app is going to take the user's input and split the line into a string slice. So basically you will have this: "Some text with a bad fucking word" and it will be converted to this: []string of {"Some", "text", "with", "a", "bad", "fucking", "word"}.
Once that string slice has been created we simply call the submitData(input []string) function which will then validate if we have curse words in our string slice with Redis. Now we have a few options here. We can individually loop over each word and send it to Redis seeing if a given word is in the bad-words set or we can package up all of the words and do our test in a sort of batch job. So in this case we are going to add our user submitted words into a user-words set on Redis and take the intersection of user-words and bad-words to give us a newly returned set. This newly returned set will have whatever words were found in both sets. If the returned set comes back empty, that means the check came out clean. Otherwise, it will include the set of bad words found. You can then take this returned set of bad words and do a validation warning to the end user telling them that some perhaps 5 bad words came back and for your viewing pleasure here they are, and then proceed to regurgitate those words on the screen in all their glory.
Please note: In the diagram, the bad-words set is only showing a subset of all the curse words, obviously there could be hundreds of thousands of curse words. The user-words set also stores the user submitted words in an undefined order. That is the nature of sets, their order cannot be relied upon. And in the middle, we have the intersection of both sets; in other words whatever words were found to be in both sets when the SINTER command was executed.
I chose to do it as a Redis multi batch transaction. Here is the core mechanic of this script and how it works:
//1. delete any existing set known as "user-words"
c.Send("DEL", "user-words")
//2. create and store our newly submitted set of words in "user-words"
c.Send("SADD", redis.Args{}.Add("user-words").AddFlat(input)...)
//3. take intersection of both sets
c.Send("SINTER", "user-words", "bad-words")
//4. check the MULTI-BULK response from the EXEC which will also contain any found values
//6. Done!
Here is the code in its entirety:
//The code in its entirety
package main
import (
"bufio"
"fmt"
"github.com/garyburd/redigo/redis"
"log"
"os"
"strings"
"time"
)
const (
ADDRESS = "127.0.0.1:6379"
)
var (
c, err = redis.Dial("tcp", ADDRESS)
)
/*
Submits data to our redis instance
*/
func submitData(input []string) {
if err != nil {
log.Fatal(err)
}
c.Send("MULTI")
//1. delete from temp set
c.Send("DEL", "user-words")
//2. store in a temp set
c.Send("SADD", redis.Args{}.Add("user-words").AddFlat(input)...)
//3. take intersection of both sets
c.Send("SINTER", "user-words", "bad-words")
reply, err := c.Do("EXEC")
if err != nil {
fmt.Println(err)
}
values, _ := redis.Values(reply, nil)
curse_words, err := redis.Strings(values[2], nil)
if err != nil {
fmt.Println(err)
}
if (len(curse_words)) > 0 {
for _, v := range curse_words {
fmt.Println(">>Found: ", v)
}
} else {
fmt.Println(">>Nothing found")
}
}
func main() {
for {
fmt.Println(">>Please enter some text with swear words than press Enter or \"q\" to exit")
bio := bufio.NewReader(os.Stdin)
line, _, _ := bio.ReadLine()
if string(line) == "q" {
break
}
terms := strings.Split(string(line), " ")
submitData(terms)
}
fmt.Println("Session Ended!")
}
In Closing
In this example, the whole point was to demonstrate an easy way to build a Redis curse word filter using Go. I know this example looks overly simple but it actually works surprisingly well. Redis will do my lookups in O(Mx2) where M is the size of []string slice and 2 is the number of sets we are working with on a given lookup. This translates to extremely fast lookups. On my MacBookPro, for a half a page of text, Redis only takes ~340 microseconds to do its work. Yes, that is MICROSECONDS, not milliseconds. I cannot stress to you dear reader how fast that is due to the power of redis sets. You will find the network latency will take far greater time than the time it's taking Redis to do these matches.
One more observation: Because we doing a redis transaction to handle the work we don't have to worry about multiple concurrent callers stepping on each other's toes by using the same named "user-words" set as our temp set to hold the current submitted words. Redis will queue all the commands locally and then make sure it executes the steps above as a single atomic unit of work.
Additional Improvements
- Convert the Go code into a web service
- Use connection pooling so that concurrent goroutines can safely share connections to Redis
- Currently, punctuation will mess up the user words. If you submit the following: "Today was a bad day. Fuck!", you will be submitting a slice that looks like this to Redis: {"Today", "was", "a", "bad", "day.", "Fuck!"} and so you probably won't get a match on "Fuck!" because we aren't doing substring detection.
If you liked this article
Please consider up-voting it which is encouraging to do more writing. Also check out my other article on using an open-source set implementation for Go: https://coderwall.com/p/latrzg?i=2&p=1&q=author%3Adeckarep&t%5B%5D=deckarep
Written by Ralph Caraveo
Related protips
5 Responses
Nice article! You'll be featured in our next RedisWeekly ( http://redisweekly.com )!
Awesome, glad to hear that @FGRibreau
+1 This is very helpful. I have a question. Where do you add "bad-words" to the key store in the example?
Hi,
I was just trying to run your code with redis. I typed word like fuck and it says "nothing found". am i missing something as per your blog? or it is as expected ?
this (A database of swear words - http://www.infochimps.com/datasets/list-of-bad-words) link is not opening. if i create my own word database where should i put this file and what would be the extension of file.
/go$ go run go-redis.go
Please enter some text with swear words than press Enter or "q" to exit
fuck
Nothing found
Please enter some text with swear words than press Enter or "q" to exit
fuck
Nothing found
Please enter some text with swear words than press Enter or "q" to exit
```
Very neat trick! Thanks a bunch.