Web 3.0 Smartups: Moving Beyond the Relational Database
By Jeff Sayre
<Smartups Series Part 3 of 5>
Today’s Web-based services are dealing with substantially higher volumes of data. But the challenges of data storage and management in the Social Web go beyond the issue of increasing data volume. In Web 3.0, data are significantly more complex and difficult to define ahead of time.
Unfortunately, many existing Web-2.0 startups continue to use only a RDBMS (relational database management system) model for meeting all their data storage and management needs—and some of these startups are starting to see the problems with that decision.
As I state in Part 2 of the Smartup Series:
The Social Web is an emergent property of the Web of Data. It is a logical outcome of the Web’s increasing social connectivity and the semantification of data to make it machine understandable, discoverable, and open.
Square Pegs and Round Holes
What does this mean to Web-3.0 smartups? Highly structured and predefined data models (such as those mandated by an RDBMS) are having a harder time at effectively handling the rich, complex, dynamic datasets that are becoming increasingly common in Social Web-based services.
Smartups are beginning to discover that it is easier to map social-data objects onto NOSQL databases than the traditional RDBMS models. Does this mean that the days are numbered for the RDBMS? No. As we will discuss shortly, a RDBMS may still play an important role in a smartup’s data model toolbox.
Note: Throughout this article, I’ll use the acronym NOSQL versus the NoSQL variant which is sometimes aligned with the “No to SQL” movement as in SQL sucks. NOSQL is now generally considered to mean “Not only SQL,” which in my opinion is the proper message. Some people refer to the non-SQL-alternative DBMSs as post-relational, but that is a misnomer as well as some of these models were in use before the relational model and SQL were even invented. You may occasional also see them referred to as non-relational.
Looking Beyond the RDBMS Box
In an attempt to future proof their application stack, making it as scalable as possible, many Web-2.0 startups continue to use a RDBMS (often MySQL) with a sharded instead of normalized data schema. They myopically choose the same database model and schema as the vast majority of startups that came before them. These startups are not thinking outside of the box. Instead they are following the herd deeper into the ever-expanding box.
Why do they do this? One reason is the false belief that they are on the cutting edge of the proverbial web-scale architecture debate. However, a RDBMS engine–whether sharded or normalized–is not the best storage solution for every data-asset type of a Social Web service.
RDBMS’s ability to scale to size quickly decreases as the complexity of the dataset increases.
Web-2.0 startups that insist on using RDBMSs as their only data warehousing model will see their database engine performance decline as their datasets grow—in size and complexity. They will also see their competitors–those Web-3.0 savvy companies practicing the art of keen smartup–have more success in managing their backends.
When it comes to building scalable services that sit on top of the Web, a one-size-fits-all data partitioning scheme is no longer desirable. In fact, it may be impractical. The days when a RDBMS could be the do all and end all of most Web-based companies back-end data system are over.
The Limits of the RDBMS for the Social Web
Think about the nature of RDBMS use in the enterprise space versus the Web-based social networking space. RDBMSs were designed for the needs of the enterprise, not the Web. Having to serve a few thousand or even hundred thousand employees at a large company is not the same thing as having to serve several million, or more, users on a highly-trafficked social network.
The Web-based Internet is a different arena than the LAN and WAN-based enterprise space. RDBMSs are fantastic for transactional-based data where data persistence (durability) is mandated—such as storage of data for financial applications, product records for large manufacturers, and a number of other enterprise-specific cases.
RDBMSs have worked well with very structured, tabular data. But the Web of Data is moving toward more complex, highly-connected datasets that are more difficult to pigeonhole into predefined table definitions. As we will see later, when the Social Web’s data complexity and connectivity are taken into account, the issues with exclusively relying on a RDBMS become clear.
When it comes to building the next, biggest Web-3.0 destination in the world, relying solely on a SQL-based database is not the best choice. They are terrible at scaling in situations where you have millions of concurrent users. And if you are building the newest killer Social Web service, you better plan ahead for the possibility of having millions of users.
The moral of the story? If your smartup is primarily a user-centric, Social-Web service, then there is little benefit derived from using a database engine optimized for transactional data integrity. Instead of relying on a RDBMS for the data store, one of the new breeds of NOSQL data stores may be better suited.
Scaling RDBMSs versus NOSQL Databases
Note: This section is my best, current understanding of the nuances between read and write operations in relational versus non-relational database models. If you have a clearer understanding, please provide any suggested corrections in the comments section.
Here are a few definitions you need to better understand scaling:
- Scaling up (also called vertical scaling) occurs within a single server
- Scaling out (also called horizontal scaling) occurs by adding additional servers
In brief, scaling really means nothing more than improving the speed of data input and output. It is about increasing data access operations per second. Thus, when faced with data access bottleneck issues, a startup first scales up (within a given server) then scales out by spreading the data across additional servers.
This is accomplished by spending money on hardware in two, primary ways. First, by improving a given box’s (server’s) performance by adding more memory, faster hard drives (to speed up I/O—input/output; writes and reads), and faster network interface cards. Then, if that is not sufficient, more boxes are thrown at the problem, expanding beyond a single box.
NOSQL databases do not have complex SQL-based read operations. Instead, they push the burden of computation to the write end of the equation.
What are the issues with scaling a RDBMS versus a NOSQL database?
In a typical RDBMs, efficiency in reads can be obtained be replication of data across multiple database instances—that can be done on the same physical server but it’s best if done across multiple servers. However, replication does not scale writes. To gain write efficiency requires the partitioning of data creating what is commonly referred to as a distributed database system. In a RDBMS, data can be partitioned either via sharding (horizontal) or functional partitioning (vertical).
When it comes to RDBMSs, once your write volume, database size, or some nebulous combination of both, get to the point were it is too much for one machine to handle, your only option is to partition the database. This is not a trivial process. There are a number of sharding schemes from which to pick, each with advantages and disadvantages.
But, there is more to the read versus write issue. By design, relational databases are more read-burdened than write-burdened. This is primarily the result of the computational requirements of the SQL operations which are read intensive. So, even with a properly sharded RDBMS database, the issue of being read-burdened still exists.
What does this mean in practice? Why do the SQL-based read operations cause the bottle neck? The answer is that distributed joins are difficult to perform efficiently. Complex joins are slow enough when performed on a single database, but when the database is sharded over multiple machines, the task becomes harder. The more machines, the greater the challenge.
Using a NOSQL database model does not mean you will never have to distribute your data across multiple machines. Once you have scaled up as far as your single server node will allow, you will have to shard the DB and scale out.
Even with a dataset that encodes lots of complexity, a graph database can theoretically scale up with several billions of objects and their associated relations without a noticeable decline in performance. More than likely, a RDBMS would have to have been sharded and scaled out much sooner and the performance would noticeably suffer.
Note: Some of the NOSQL database models distribute their data via other methods than sharding. Furthermore, not all of the NOSQL DBs are distributed. However, those that are not are working on making their model truly distributed. This is a topic beyond the scope of this article.
A Web-3.0 smartup can never know when it will need to scale next. It must remain flexible if it is to successfully meet the growing needs of its data ecosystem. When it comes to scalability, being read-burdened is not a desirable scenario since response time (returning data for display) is more mission critical to Social-Web smartups than write time.
When it comes to the data storage and management needs of your smartup, there is one, big question that you should ask: How can we model our entire data ecosystem for seamless scalability? Cost, effort, and long-term effectiveness are three big factors to consider when answering that question.
Meet the NOSQL Family: the New Kid On the Block?
If you think that NOSQL is the new kid on the block, you could not be further from the truth.
Some of the NOSQL database models are older than the Relational DB model. At least in academia, graph databases existed before the relational model was first presented in a June 1970 paper by Edgar Frank Codd. But RDBMSs, for a number of reasons, caught on and were instrumental in helping to build a successful, global IT industry. Their use has also been very influential in the successful development and proliferation of Web-based services on the Internet, from Web 1.0 through Web-2.0.
Below is a listing of some of the more prominent NOSQL database solutions. They are not listed in any particular order of importance nor do I necessarily endorse them by their inclusion in the chart. For an exhaustive list of NOSQL databases, visit this link.
The Proof is in the Digital Pudding
Some of the Web’s largest properties have experienced scaling issues with RDBMS-based data warehousing. What did they do? They went looking for acceptable, non-relational database models to help them effectively and efficiently scale their data operations. They discovered that the NOSQL database model was a perfect fit for part of their data storage needs.
Some chose to code their own columnar database storage solution: Google created Bigtable; Facebook created Cassandra. Others chose to code their own key-value storage solution: Amazon created Dynamo; LinkedIn created Voldemort. Cassandra and Voldemort have since been open sourced.
What were the benefits? Instead of using its existing MySQL database to power its new search feature, Facebook created Cassandra. It accrued a significant performance boost (see slide 21). With one of its 50GB data stores, what used to take MySQL roughly a third of a second to write or read a data element, Cassandra sliced the write time by a factor or 2,500 and the read time by a factor of 22. In both cases, that is a significant gain in performance.
Twitter is another great example. Up until early 2010, it was common to see the Fail Whale on a regular basis, often multiple times a day. Even with Twitter’s dedicate operations staff, they still had difficulties keeping their hundreds of MySQL machines efficiently working. Eventually they realized that throwing more money at beefing up servers, putting more servers online, and hiring more IT staff was not the solution.
Their solution? Twitter switched from MySQL to a NOSQL solution. At first they tried Cassandra but apparently that did not meet their needs. Twitter has since switched back to MySQL, but with a very, crucially-important twist.
They are not using MySQL as a RDBMS. Instead, they are using it as a data bucket, a storage engine. They’ve created their own Graph Database with an equally impressive middleware layer, called Gizzard, that handles the partitioning layer. They’ve also open sourced their solutions.
So, although they are using MySQL, they have created their own NOSQL graph database solution. You can read the linked-to article to see the results.
The moral of these stories? SQL-based databases are terrible at scaling. They are best used for highly complex, transactional-based data where data persistence (durability, concurrency) is mandated—such as storage of data for financial applications and online stores.
This is where the power of non-relational data models flourish in the Web sphere. NOSQL databases do not have complex SQL-based read operations. Instead, they push the burden of computation to the write end of the equation. This results in a marked improvement in overall performance right off the bat even on a single machine. But the real beauty is that scaling many of the non-relational databases is a relatively trivial task—unlike their RDBMS brethren.
To SQL or Not To SQL. Is That the Question?
The issue of whether to use a RDBMS, a NOSQL database, or some combination of the two is not something with which to be concerned if you envision your site as being small or even medium sized. In that case, moderate scalability will most likely not be too big of an issue. It is probably best to use the tools with which you are most familiar.
A niche social-networking site using an RDBMS with a decent data schema should perform adequately with many thousands of unique users. For instance, the vast majority of blog installs–even those with more sophisticated addons such as the new BuddyPress plugin suite for Worpdress—should perform fine under moderate load even though the data foundation is setup for sharded partitioning on a MySQL backend.
The data schema of some of these Open Source blogging and social networking platforms are pre-configured for sharded data storage. This does not mean that these systems are self scaling. In fact, they are not that easy to scale. Once a second physical server is required, a significant manual reconfiguration of the database serving environment is necessary to split the shards between different servers.
Whereas there are plugins and alternate data access classes that can be utilized, the point here is that the use of a traditional RDBMS does not lend itself to easy scaling as it is the read end, not the write end, of the equation that is the real bottleneck. And sharding is done to improve the write end, not the read end.
Comparison of Database Models
The following graph is my compilation and interpretation of others’ attempts at graphically representing the database-model ecosystem in terms of scalability and data complexity. It is an idealized representation at best, used more for demonstrating some salient differences between models. Although each database model is positioned relative to each other based on what others have reported, the placements are approximations and are not meant to be exact.
One final, important note about the graph to the left. The appearance of a single, sweeping arc to represent RDBMS performance should not be interpreted to mean that the NOSQL database models do not have their own performance arcs. They do. See the next chart (insert Performance Arcs chart).
Please do not use this as the definitive proof that NOSQL DBs are better than relational DBs. That is not the case. Each database model has its use. Each database model is a different tool used for a different job. The major point is that in the Social Web, smartups cannot use a RDBMS as the only tool for every job. A RDBMS is not a universal tool.
For an RDBMS, scalability and performance become more difficult as the complexity of the dataset increases. This of course has significant ramifications for the Social Web.
The Social/Semantic Zone: I’ve created an imaginary bifurcation point–more accurately a transition zone–that I call the Social/Semantic Zone. Whereas all the DBMS models can handle semantically-encoded data, there are differences in performance as the complexity of the dataset increases. Once again, this zone is rather arbitrary. The point is to draw attention to the real issue of the difficulties of managing all data types as the number of social connections and complexity of datasets increase.
90% Threshold: This is another rather arbitrary line with a non-arbitrary meaning. Whereas the positioning of the line is not exacting, the message is simple. Approximately 90 percent of all Web-based companies will never experience the data volume that the top 10 percent of the most successful Web-2.0 social networks do. This does not necessarily mean you have nothing to worry about and can just fall back on exclusively using a RDBMS. It does mean, though, that unless your smartup skyrockets to fame, you should have an easier job of managing your data ecosystem as long as you’ve picked the proper database models (tools) for your smartup’s data storage and management needs.
Whereas in general the above graph is a good representation of the relative strengths of each database model, real-world experience in using the various models, the quality of the business-intelligence (business logic) coding in the application layer, and the complexity within each smartup’s data ecosystems affect results. As an example, some engineers are better than others at tuning and optimizing their RDBMS whereas inexperienced engineers trying out a NOSQL database model for the first time may significantly undertune the interface.
The important take away is that certain database models are better at scaling versus handling complex datasets. The most interesting point is that a RDBMS’s ability to scale to size quickly decreases as the complexity of the dataset increases.
This last point demonstrates that all things being equal, there are clearly distinct usage cases where leveraging the proper database model will accrue significant increases in efficiency and performance.
Major sources for charts:
- A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
- Django and Neo4j – Domain modeling that kicks ass
- NOSQL: scaling to size and scaling to complexity
- MongoDB Design Philosophy
Moving Beyond the RDBMS
The days when relational databases are used for the entire data storage needs of a Web-based startup are over. The Web is shifting from a monopolistic to a pluralistic database technology culture. As startups power up to become smartups, they’ll analyze their data needs, determining the suite of database models (RDBMS and NOSQL) that best fit their data asset ecosystem.
When designing the back-end data system for your Web-3.0 smartup, the relevant thought process should be thinking about database layers instead of a single database solution. Only through properly assessing your smartup’s overall data asset ecosystem, will you be able to determine how wide and deep that database layer may need to be.
Whereas it is possible that some smartups might be able to stick with using just one type of database storage engine, it is more than likely that at least two different types may be required at some point. The optimal data warehousing solution may not be apparent for sometime. However, if a smartup is practicing lean (and keen), agile-focused development, the best solution will emerge over time as system performance is continually fine tuned based on user experience and feedback.
This coming Monday, September 20, 2010, I will be publishing article four, the last article, in my Powering Startups to Become Smartups series. We will explore the seemingly-less technical, but just as important topic of the Web 3.0 Business Space.
- Graphs in the database: SQL meets social networks
- Scalable Databases – From Relational Databases To Polyglot Persistence
- Is the Relational Database Doomed?
- InfoGrid Core Ideas
- Graph Databases and the Future of Large-Scale Knowledge Management
- A detailed analysis of part of the NOSQL landscape
- Interview With Eliot Horowitz – CTO Of 10gen / MongoDB
</Smartups Series Part 3 of 5>
Continue on to Part 4—Web 3.0 Smartups: the New Web Business Space
Other Smartup Series Installments