JSON database engine
Summary
I had an idea for creating a new database engine: an engine that stores everything in one huge JSON document and performs queries and retrievals based on JSONPath. This would require a change in database perspective, because there would be no tables, collections, or databases; everything would be either a value or an array.
Storing Examples
Let's consider storing a list of users:
{
"users": [{
"id": 1,
"name": "John Doe 1"
}, {
"id": 2,
"name": "John Doe 2"
}, {
"id": 3,
"name": "John Doe 3"
}]
}
Retrieving an element from this storage will require:
Query: users.*.id == 3
This implies the object path against which the query is evaluated.
Pick: users.*
This implies the object path at which we "cut" and retrieve the data.
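To make the query/pick mechanics concrete, here is a minimal sketch in Python; match_path, the list-of-parts pattern, and the tuple trails are my assumptions, not part of any spec:

# Hypothetical sketch of query/pick over a JSON document (match_path is my name).
def match_path(node, parts, trail=()):
    """Yield (trail, value) for every concrete path matching e.g. ['users', '*', 'id']."""
    if not parts:
        yield trail, node
        return
    head, rest = parts[0], parts[1:]
    if head == "*":
        items = enumerate(node) if isinstance(node, list) else node.items()
        for key, child in items:
            yield from match_path(child, rest, trail + (key,))
    elif isinstance(node, dict) and head in node:
        yield from match_path(node[head], rest, trail + (head,))

doc = {"users": [{"id": 1, "name": "John Doe 1"},
                 {"id": 2, "name": "John Doe 2"},
                 {"id": 3, "name": "John Doe 3"}]}

# Query: users.*.id == 3 -> collect the trails where the predicate holds
hits = [trail for trail, value in match_path(doc, ["users", "*", "id"]) if value == 3]

# Pick: users.* -> "cut" the matching trails at that depth and return the objects
print([doc["users"][trail[1]] for trail in hits])  # [{'id': 3, 'name': 'John Doe 3'}]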
Let's now consider a more complex situation:
{
"companies": [{
"name": "Company 1",
"city": "Munich",
"employees": ["John 1", "John 2"]
}, {
"name": "Company 2",
"city": "Munich",
"employees": ["John 3", "John 4"]
}]
}
Let's say we now want to find out which companies from Munich have an employee called John 3.
Query: companies.*.city == Munich and companies.*.employees.* == John 3
Pick: companies.*
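Continuing the hypothetical sketch above (reusing the match_path helper), the two conditions can be evaluated separately and joined by intersecting the matching array indices:

# Reuses the match_path helper from the earlier sketch.
doc2 = {"companies": [
    {"name": "Company 1", "city": "Munich", "employees": ["John 1", "John 2"]},
    {"name": "Company 2", "city": "Munich", "employees": ["John 3", "John 4"]},
]}

# Query: companies.*.city == Munich and companies.*.employees.* == John 3
in_munich = {t[1] for t, v in match_path(doc2, ["companies", "*", "city"]) if v == "Munich"}
has_john3 = {t[1] for t, v in match_path(doc2, ["companies", "*", "employees", "*"]) if v == "John 3"}

# Pick: companies.* -> the company objects satisfying both conditions
print([doc2["companies"][i] for i in sorted(in_munich & has_john3)])
# [{'name': 'Company 2', 'city': 'Munich', 'employees': ['John 3', 'John 4']}]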
The list of examples could go on and grow in complexity, but that is not the scope of this post.
A factor that needs to be taken into consideration is uniqueness of key/value pairs, meaning that if a key/value pair occurs more than once, it is stored only once. If any of the existing usages is modified, either a new entry is created or an existing one is linked.
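As a rough illustration of that uniqueness idea, here is a minimal sketch assuming a flat path -> value model (intern, put, and the numeric id scheme are hypothetical):

# Minimal interning sketch: each distinct value is stored once and
# entries reference it by id (intern/put/ids are illustrative, not a spec).
values = {}   # value -> id
by_id = {}    # id -> value
entries = {}  # path -> value id

def intern(value):
    if value not in values:
        vid = len(values)
        values[value] = vid
        by_id[vid] = value
    return values[value]

def put(path, value):
    # Rebinding a path never mutates a shared value; it just points the
    # path at a (possibly new) interned entry, so other usages keep theirs.
    entries[path] = intern(value)

put("companies.0.city", "Munich")
put("companies.1.city", "Munich")  # duplicate value links to the same id
put("companies.0.city", "Berlin")  # modifying one usage creates a new entry
print(by_id[entries["companies.1.city"]])  # Munich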
Design ideas
In order to achieve this structure I had several thoughts; the most promising ones were the following:
- Ordered path strings with values, for example:
users.0.id = 1
users.0.name = 'John Doe 1'
users.1.id = 2
users.1.name = 'John Doe 2'
This approach needs a fast search at each level and has to keep the list ordered every time a new entry is inserted (see the first sketch after this list).
- Neural-network approach:
Each entry (key = value) is a neuron / node / cell. Based on the JSON structure, the engine builds and stores a chained tree structure with references for each entity, and a search propagates itself through the network. A big advantage here would be a programming language with high parallelism (Erlang, Rust, etc.); see the second sketch after this list.
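A minimal sketch of the ordered-path-strings idea, assuming an in-memory Python model (flatten and the bisect-based prefix scan are illustrative choices, not a fixed design):

import bisect

# Flatten every leaf into a sorted (path, value) entry, e.g. ('users.0.id', 1).
def flatten(node, prefix=""):
    items = enumerate(node) if isinstance(node, list) else node.items()
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(child, (dict, list)):
            yield from flatten(child, path)
        else:
            yield path, child

doc = {"users": [{"id": 1, "name": "John Doe 1"},
                 {"id": 2, "name": "John Doe 2"}]}
paths = sorted(flatten(doc))

# Inserting a new entry keeps the list ordered via binary search.
bisect.insort(paths, ("users.2.id", 3))

# All entries under a prefix sit in one contiguous, binary-searchable range.
lo = bisect.bisect_left(paths, ("users.1.",))
hi = bisect.bisect_left(paths, ("users.1/",))  # '/' sorts right after '.'
print(paths[lo:hi])  # [('users.1.id', 2), ('users.1.name', 'John Doe 2')]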
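And a toy, sequential version of the node-network idea; in a real engine each branch of the propagation is independent work that a highly parallel runtime could fan out across processes or threads (Node, build, and propagate are illustrative names):

from dataclasses import dataclass, field

@dataclass
class Node:
    key: object                # dict key or array index
    value: object = None       # leaf value, if any
    children: list = field(default_factory=list)

def build(key, data):
    """Turn a JSON structure into a chained tree of reference-linked nodes."""
    node = Node(key)
    if isinstance(data, (dict, list)):
        items = enumerate(data) if isinstance(data, list) else data.items()
        node.children = [build(k, v) for k, v in items]
    else:
        node.value = data
    return node

def propagate(node, parts):
    """Propagate a pattern through the network; each child branch is independent."""
    if not parts:
        yield node
        return
    head, rest = parts[0], parts[1:]
    for child in node.children:
        if head == "*" or child.key == head:
            yield from propagate(child, rest)

root = build(None, {"users": [{"id": 1}, {"id": 2}]})
print([n.value for n in propagate(root, ["users", "*", "id"])])  # [1, 2]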
Conclusion
Now you might wonder why anyone would need this kind of thing. Well, for me the answer is simple. I work with a lot of JSON (retrieve, process, push), and with any DB I have used, at some point I would need to build an "abstraction" over it. MongoDB is one of the closest to this concept, but still, I would need to create collections, know where each part of the JSON is stored, handle document IDs, and take care that the JSON in one document is not too "huge". By using this concept, I would just need to either push a value to a key or append it to an array, and I could go endlessly deep.
As I said at the beginning, this is just an idea, so I accept any thoughts on it; I never said that it is complete, perfect, or efficient. It is just a thought.
Written by Dan Harabagiu
4 Responses
I'd recommend using a JSON-based document store, such as MongoDB.
As I said in the conclusion, using MongoDB or any JSON-based document store would mean that I would need to handle databases, collections, and documents, which is a bit of overhead for the scope of the idea above.
If I used one DB, one collection, and one document, I am not sure what MongoDB's document size limit is, considering that with the above concept I would store a few GB of data.
Anyhow, thanks for the hint.
How exactly would you maintain this object in memory? In order to operate on it, you would have to have the entire dataset loaded, and as you state you would like to store a couple of gigabytes, you would need to maintain that object space in memory for it to be searchable. I can't see the advantage here over a traditional document datastore like MongoDB, where documents are limited to 16 MB for this exact reason.
The idea would be not to store it in memory, and definitely not to have the whole dataset in memory :), that would be very stupid. The idea was to distribute the tree structure across data files and be able to search them fast. The whole concept moves away from the "document" concept, while the data structure would logically "be" only one document.
One of the ideas I had for storing this information was the following:
* Each level of the JSON tree would be stored in a separate file, with a text index attached to it.
* Each entry in the file would consist of the "leaf" (deep key) it relates to, the key, and the value.
So for the root of the JSON I would have a file, say "DB_0", where each record looks like:
deepKey <sep> keyName <sep> value (or key pointer to a deeper level) <end>
So for level 0, let's say we would have for example:
NULL <sep> key1 <sep> keyvalue1 <end>
NULL <sep> key2 <sep> keyvalue2 <end>
NULL <sep> key3 <sep> #deepKeyID1 <end>
and for level 1 we would have a file called "DB_1" with the same record layout; for example we could have:
deepKeyID1 <sep> keylvl1 <sep> keylvl1value1 <end>
deepKeyID1 <sep> keylvl2 <sep> keylvl2value2 <end>
Considering that each file would additionally have an index built into or adjacent to it, for fast text file search, the lookup process would be the following:
for query: find "key1" -> the application looks into the root file "DB_0" and returns the value of key1.
for query: find "key3.keylvl1" -> the application looks into "DB_0", gets the value of key3, sees that it points to a deeper key, opens "DB_1", looks for deep_key == value of key3 and key == keylvl1, and returns the value.
for query: find "key3.*" -> the application looks into "DB_0", gets the value of key3, sees that it points to a deeper key, opens "DB_1", collects every record with deep_key == value of key3, and builds the resulting JSON object from their keys and values.
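A toy, in-memory model of that lookup walk, with each level file represented as a list of (deep_key, key, value) records (the '#' pointer convention follows the examples above; find and the dict layout are illustrative):

# Toy model of the per-level files: each "DB_n" is a list of
# (deep_key, key, value) records; a leading '#' marks a pointer downward.
levels = {
    0: [(None, "key1", "keyvalue1"),
        (None, "key2", "keyvalue2"),
        (None, "key3", "#deepKeyID1")],
    1: [("deepKeyID1", "keylvl1", "keylvl1value1"),
        ("deepKeyID1", "keylvl2", "keylvl2value2")],
}

def find(path):
    """Walk one level file per path segment, following '#' pointers downward."""
    deep_key = None
    for depth, part in enumerate(path.split(".")):
        if part == "*":
            # Gather every record at this level hanging off the current pointer.
            return {k: v for dk, k, v in levels[depth] if dk == deep_key}
        value = next(v for dk, k, v in levels[depth]
                     if dk == deep_key and k == part)
        if not value.startswith("#"):
            return value
        deep_key = value[1:]  # follow the pointer into the next level file

print(find("key1"))          # keyvalue1
print(find("key3.keylvl1"))  # keylvl1value1
print(find("key3.*"))        # {'keylvl1': 'keylvl1value1', 'keylvl2': 'keylvl2value2'}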
With some extrapolation more cases can be built, and arrays could even be stored using fixed numeric keys (restricting the user from using numbers as keys).
I am ignoring from the start here the caching of the most-used entries, and also the additional concept of storing each "value" only once in a different table and just using referral keys for values (if the value Germany or New York exists multiple times, it would be stored only once).
The possible problem I see here is that the deeper-level files would be considerably larger than the root ones. Maybe partitioning or sharding based on source_key would help.
Cheers,
Dan.
PS: I updated this comment several times to match the markup language.