So, welcome to the track on NoSQL databases, although it seems like we've already had a couple of tracks on NoSQL databases. My name is Martin Fowler. Steve Vinoski's hosting the track, and he asked me to kick things off. Most of this track is going to be about the practical experience of people making use of NoSQL databases, but this talk is the exception, because this is really an introduction to what NoSQL databases are all about. I'm going to do my best to cram into 50 minutes as much useful information as I can that will help

give you a context for understanding a lot of what goes on in the later talks. And the first part of this is I'm going to talk a little bit about the history of NoSQL databases, because, as with many things, to understand why something is the way it is, it's useful to know how on earth it got there in the first place. Now, when I started in the computer industry in the mid-80s, it was just at the point at which relational databases were really coming in and beginning their rise. It's kind of hard to

imagine that there was a time without relational databases, but I remember when they were the new hot thing and people were arguing about whether they would be any good or not. And they've brought us many benefits. Obviously they look after the persistence of our data. They're also very important in the fact that they manage concurrency through transactions. SQL has become a de facto standard language for talking to these databases. It's not entirely standard, but it's standard enough that once you know SQL, you can talk to these different tools. They've also become very important

for many organizations for integration and reporting, which as we'll see has both its upsides and downsides. So SQL databases are a really good thing, but they also have some problems. And the most obvious problem is one that most application developers run into as they're working with them, which is that we assemble structures of objects in memory, often as a kind of cohesive whole, and then in order to save it off to the database, we have to strip it out into bits so that it goes into those individual rows and

individual tables. A single logical structure for our user interface and for our processing in memory ends up being splattered across lots and lots of tables. This is referred to as the impedance mismatch problem: the fact that we have these two very different models of how to look at things, and the fact that we have to match them causes difficulties. This is what leads to object-relational mapping frameworks and all that kind of stuff. Now, the impedance mismatch problem is a sufficiently awkward problem that in the mid-'90s people said, well, we

think relational databases are going to go away and object databases are going to come in; that way we can take our in-memory structures and save them directly to disk without any of this mapping between the two. But we know what happened there. We didn't see the object databases. People like me, who thought that they were going to be a dominant thing in the future, we were wrong, and you still listen to me, but oh well. I guess you're easily taken in. But we argue endlessly about why it is that object databases didn't actually fulfill that potential.

And I think at the heart of it is the fact that SQL databases had become an integration mechanism: many people integrated different applications through SQL databases, and as a result that made it very hard for any other kind of technology to come in. That led to relational continuing to be dominant right through into the 2000s. So relational has had 20 years of complete dominance of certainly the enterprise data space, and plenty of other ones as well. I mean, we saw with the science work at the Large Hadron Collider, they didn't really

want to use relational databases, but they had to, to some degree at least. What changed really was the rise of the internet, and particularly sites that have lots and lots of traffic: a big internet site such as an Amazon or a Google or a Betfair or something of that kind. As you get large amounts of traffic coming into your data, what do you do? Well, you need to scale things. And the one obvious route is to scale things up: buy bigger boxes. But that approach has problems. It costs a lot

and there are real limits as to how far you can go. So, as I hope you all know, a lot of organizations, most famously Google, use a completely different approach: lots and lots of little boxes, just basically CPUs, motherboards, disks, commodity hardware, all thrown into these massive grids. But here there's an issue for the data storage. SQL was designed to run on those big boxes, designed to run as a single-node system. It does not work very well with large clusters of little boxes. And several of the big data players understood this.

They tried. I've talked to several people who have attempted to spread relational databases and run them across clusters. The usual term that comes up in conversation when they describe how they tried to do this is "unnatural acts". It's very hard to do. So a couple of organizations said, "We've had enough of this. We need to do something different." And they developed their own data storage systems that were really quite different from relational databases. And they started talking a little bit about them, published papers that talked about what they were up

to. And it is this that really inspired a whole new movement of databases, which is the NoSQL movement. Now, it's important at this point to talk a little bit about where this term NoSQL comes from. A lot of people complain about it, quite reasonably, because they say, "Well, it's a really odd term, trying to define a movement by something it's not." And the origin is really very simple. There was this guy in London, Johan Oskarsson. He'd done a lot of work with Hadoop and things like that. He had

to go to a conference in California. He wanted to take a look at all of these various interesting databases that were poking around at the time. And he proposed a meetup, a little meeting where people could discuss ideas. And of course, if you're going to do that in the late 2000s, you absolutely need something that's really, really important. You need a Twitter hashtag. So he asked around: well, what would be a good hashtag? It's got to be short. It's got to be unique, so we can easily search on it. And a guy came

up with the hashtag NoSQL. That's all NoSQL was ever meant to be: a Twitter hashtag to advertise a single meeting at one point in time. The fact that it has now become the name of a whole movement was completely accidental. Nobody thought that was going to be the case. So, you know, this is the way language often goes. It's very unpredictable, fits and starts. So there was a whole bunch of people who turned up to that meeting. By the way, this is the list of people there. That's not what we call

the whole set of NoSQL databases, since a lot of databases that weren't at that meeting are now considered part of that NoSQL umbrella. So this inevitably leads you to the question of, well, what is the definition of NoSQL? And this is something I had to think about when writing a book about the subject. It's important, if you're going to write a book about something, to define what it is you're writing about. My conclusion is we cannot define NoSQL databases, because of this very odd history. What we can do is we can

identify some common characteristics of NoSQL databases, and there's a whole bunch of these. Obviously NoSQL databases are not relational. It's actually more about non-relational than it is about no SQL. Obviously there's a strong leaning towards cluster friendliness, the ability to run on large clusters, because that's where the original spark from Google and Amazon came from. But that's not an absolute characteristic. There are some NoSQL databases that aren't really focused around running on clusters. Most of these databases, rather interestingly, are open source. So most of the things we generally call NoSQL databases are open source. There

are commercial tools that like to call themselves NoSQL databases, and maybe over time that will no longer be a common characteristic, but it is still a common characteristic at the moment. Perhaps most importantly, they're all things that have come out of the 21st-century website culture. There are plenty of databases out there, going back long before relational databases, that do not use SQL or the relational model. But we don't call such things as IMS or MUMPS, for those who have heard of either of those things,

NoSQL databases. So that's what I see as the common characteristics. I'll mention the last one in a moment. So one of the things that's interesting about NoSQL databases is they use different data models to the relational model, obviously, since the name says that. And if we plot a picture of the most commonly referred to NoSQL databases, typically what we see is that they get divided into four broad chunks based on their data model. And let's dig into these data models a little bit more. So the simplest data model to talk about

is that of the key-value store. The basic idea is you have a key, and you go to the database and say, grab me the value for this key. The database knows absolutely nothing about what's in that value. It could be a single number. It could be some complex document. It could be an image. The database doesn't know, doesn't care. And you can think of this basically as just a hash map, but persistent on disk. Simple as that. Another data model that's very common is the document data model. Now the document data model

thinks of a database as this storage of a whole mass of different documents, where each document is some complex data structure. Usually that data structure is represented in the form of JSON, because JSON is what's fashionable these days. I mean, you could do it in XML, but who wants to be seen wearing XML in public? No one. So we have these different documents that all flash around, and the usual document databases will allow you to say, give me a document that has these fields with these values. You can query into the document

structure, and you can usually retrieve portions of a document or update portions of a document. So there's a big difference there to the key-value store, where it's a very opaque structure; the document is much more transparent. One thing to notice right away about document databases, and indeed all NoSQL databases, is that they don't tend to have a set schema. With a relational database, you can only put the data into the database as long as it fits in the schema that you've defined for that database. With almost all NoSQL

databases, basically you can shove in anything you like. Any stuff you like, it just goes in there. And the NoSQL people will talk endlessly about how this increases your flexibility. It makes it easier to migrate data over time. It's all absolutely wonderful. And as usual, that's not really the entire truth. I mean, usually when you're talking to a database, you want to get some specific pieces of data out of it. You're going to say, I would like the price. I would like the quantity. I would like the customer. As soon as you're doing that,

what you're doing is you're setting up an implicit schema. You are assuming that an order has a price field. You are assuming that the order has a quantity field. You're assuming that it is called price and not cost or price to customer or whatever other thing you could think of. That implicit schema is still in place, and you've got to manage that implicit schema in many ways in a similar approach to the way that you manage the more strict relational schema. So schemaless is really a bit of

a wussy term here. Now, having no fixed storage schema does give you some options that you don't get with relational databases, and there is a difference, and there are advantages in terms of flexibility as well, but you can't ignore the fact that you are always dealing with an implicit schema. The only time you don't have to worry about an implicit schema is if you do something like: give me all the fields in this record and throw them up on the screen, field name, value. And occasionally you want to do that, but

most of the time you actually want to do something more interesting. So I've talked about two data models, key-value and document data models, and I've presented them as two quite different things, but actually the line between these two is a hell of a lot more fuzzy than that. Many key-value data stores allow you to store metadata about the value. This of course allows you to build more complicated indexes. I mean, if you want to get all the orders for a particular customer, you don't want to search every order in the

database to find them, the moral equivalent of a table scan. You want to index that. So key-value databases typically allow you to store various metadata things, which kind of makes them feel a bit like document databases, right? And then on a document database, yeah, you can do all sorts of queries against a thing, but often there's an ID, and often when you actually look that up, you do it by saying, give me the thing with that particular ID. And that ID is effectively the same as the key in a key-value store. So

the boundary between a key-value and a document database, as I said, is somewhat blurry. And I've often heard a particular database sometimes described as key-value and sometimes described as document. In reality, I wouldn't worry too much about the difference between them. Think of it as a kind of first approximation to work with, but it's not actually that important as it goes on. What is important, though, is that both key-value and document databases have this common notion that you're taking some complex structure that you can save as a single unit into

the database. Whether it be a relatively transparent document or a completely opaque value, that notion still exists. And that commonality made me think, well, we really need some term to describe databases that work kind of like that. And so for the book I came up with a term, an aggregate-oriented database, one that allows you to store these big complex structures. And where did the term aggregate come from? It comes from this book here, written by Eric Evans, Domain-Driven Design. How many people have read Domain-Driven Design? Hopefully a

good few of you. Excellent book. It really talks about how to think about modeling domains. And one of the key concepts in the early part of Domain-Driven Design is that often when we want to model things, we have to group things together into natural aggregates, because when we're talking to a database, even a relational database, it makes sense to think of those aggregates when we're storing and retrieving data. If we're modeling orders, for instance, usually we'll have separate classes for the orders and the line items. That's pretty much a standard object

101 model. But we think of the order as a whole thing, a single unit. So an aggregate may be many objects in many classes. It may be quite a complex structure. But when we're talking about persisting it or retrieving it from memory, we think of it as one thing to cross back and forth. Now, in a relational database we have to splatter that aggregate across a whole bunch of tables. But the nice thing about an aggregate-oriented database is we can save that aggregate as a single unit in the terms of the database itself.
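
To make the contrast concrete, here is a minimal Python sketch of the order example. It is only an illustration under stated assumptions: the field names, the `to_relational_rows` helper, and the in-memory `AggregateStore` class are all hypothetical and do not correspond to any particular database's API.

```python
import json

# Hypothetical order aggregate: the order and its line items form one unit.
order = {
    "id": 1001,
    "customer": "Ann",
    "line_items": [
        {"product": "NoSQL Distilled", "price": 30.0, "quantity": 2},
        {"product": "Refactoring", "price": 45.0, "quantity": 1},
    ],
}


def to_relational_rows(order):
    """Split the aggregate into the rows a two-table relational schema needs."""
    order_row = {"id": order["id"], "customer": order["customer"]}
    item_rows = [
        {"order_id": order["id"], "line_no": n, **item}
        for n, item in enumerate(order["line_items"], start=1)
    ]
    return {"orders": [order_row], "line_items": item_rows}


class AggregateStore:
    """Toy in-memory stand-in for a key-value or document store (assumption)."""

    def __init__(self):
        self._data = {}

    def put(self, key, aggregate):
        # The whole aggregate is serialized and written as a single value.
        self._data[key] = json.dumps(aggregate)

    def get(self, key):
        # One lookup by key brings back the whole aggregate.
        return json.loads(self._data[key])


# Relational style: one logical order is spread over several rows in two tables.
rows = to_relational_rows(order)

# Aggregate style: the same order goes in and comes out as one unit.
store = AggregateStore()
store.put("order:1001", order)
loaded = store.get("order:1001")
```

The relational form needs one row per line item plus a foreign key back to the order, while the aggregate form is written and read back whole with a single key.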

So for a key-value database the aggregate is the value. In a document database, the aggregate is the document, and that becomes the single unit that we move back and forth. And I certainly find this is a much easier way to think about the commonality of these classes of databases. Now, the third data model I'm going to briefly describe is that of column-family databases. This is the more complicated data model of these. It is another aggregate-oriented database. However, the column-family database basically says we have, think of it

as a single key, they call it a row key, and then within that we can store multiple column families, where each column family is a combination of columns that kind of fit together. The column family here is effectively your aggregate, and you address it by a combination of the row key and the column family name. Now, column families can also be kind of different. Look at the lower one here. That is effectively a list of items, the various orders for a customer. So that doesn't feel so much like a typical

record structure that you might know about, but it is of course the same as storing an array in a document or something of that kind. So again, you get something of that kind of rich structure that you can build in here. Column-family databases give you a slightly more complex data model to work with, but the benefit you get is again in terms of retrieval: you can more easily pull individual columns and things of that kind out. But again, the broad data model is that of an aggregate-oriented picture.
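
As a rough sketch of that addressing scheme (row key, then column family name, then column), the following Python models a column-family store as nested dictionaries. The `ColumnFamilyStore` class, the `profile` and `orders` family names, and the sample values are assumptions made up for illustration; real column-family databases have their own APIs and storage layouts.

```python
from collections import defaultdict


class ColumnFamilyStore:
    """Toy model: row key -> column family name -> column name -> value."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        self._rows[row_key][family][column] = value

    def get_family(self, row_key, family):
        # The column family is the aggregate: address it by row key + family name.
        return dict(self._rows[row_key][family])

    def get_column(self, row_key, family, column):
        # Individual columns can be pulled out without reading the whole family.
        return self._rows[row_key][family][column]


store = ColumnFamilyStore()

# A record-like family: a fixed handful of named columns.
store.put("cust-1234", "profile", "name", "Martin")
store.put("cust-1234", "profile", "billing_city", "Boston")

# A list-like family: one column per order, so the columns act as a list of items.
store.put("cust-1234", "orders", "OR1001", "2 items")
store.put("cust-1234", "orders", "OR1002", "1 item")

profile = store.get_family("cust-1234", "profile")
order_ids = sorted(store.get_family("cust-1234", "orders"))
name = store.get_column("cust-1234", "profile", "name")
```

The `orders` family shows the list-like case described above, where the column names themselves carry the data, while `profile` looks like a conventional record.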

So the great thing about this is that now when you're taking your aggregate in memory, instead of spreading it across lots of individual records, you get to store the whole thing in the database in one go. And the database knows what your aggregate boundaries are. Now, this is interesting. Where it becomes really useful is when we talk about running the system across clusters, because if you're going to distribute data, what you want to do is distribute the data that tends to be accessed together. And so the aggregate tells you what data is

going to be accessed together. So by placing different aggregates on different nodes across your cluster, you know that when somebody says, "Oh, give me the details about this particular order," you're only going to go to one node on the cluster instead of shooting around goodness knows how many to pick up different rows from different tables. So aggregate orientation naturally fits in very nicely with storing data on large clusters. And that's of course part of the whole thing with Bigtable and Dynamo. Both effectively went for a cluster-oriented approach. Bigtable very much a

column-family style approach, Dynamo much more a key-value store, but it makes running on clusters efficiently way more straightforward. And that's really been, as I said, the driving factor here. However, nothing is perfect, and aggregate orientation isn't always a good thing. Let's imagine we've got our order system and we want to look at the data like this. We want to say, given a particular product, tell me the revenue, tell me the past revenue. We now don't care about orders at all. We only care about what's going on with individual line items of

many orders, grouping them together by product. Effectively what we're doing is we're saying we want to change the aggregation structure from one where orders aggregate line items to one where products aggregate line items. The product now becomes the root of the aggregate. Now, in a relational database this is straightforward. We just query the data differently. It's very straightforward to rearrange the data into the structures we might want in different cases. With an aggregate-oriented database it's a pain in the neck. You can do it, and what they'll typically do is they will run

map-reduce jobs to rearrange all your data into different aggregate forms, and probably keep those persistent, or maybe even do incremental updates, but it's always going to be more complicated. So being aggregate-oriented is an advantage if most of the time you're using the same aggregate to push data back and forth into persistence. It is a disadvantage if you want to slice and dice your data in different ways. So what I've done so far is I've managed to cover some of these models. I've basically taken the document, column-family and key

value and lumped them together under this aggregate-oriented category. And I think that's a useful abstraction, at least at the level of what I can say in 50 minutes. There's one very noticeable outlier that you see, though, and that is graph databases. Graph databases are not aggregate-oriented at all. They use a completely different data model. A graph database's data model is basically that of a node-and-arc graph structure. Not a bar chart or anything like that, but just nodes and arcs. Something that hopefully we're familiar with, at least from a

few boring computer science classes. And the nice thing about a graph database is that it's very good at handling moving across relationships between things. Relational databases, you might think, with the word relation in there, that they're good at handling relationships. But of course, relation doesn't mean relationship. It means something in set theory. And actually, relational databases are not terribly good at jumping across relationships. You have to set up foreign keys. You have to do joins. If you do too many joins, you can get in a mess. If you've modeled a graph structure or

a hierarchy, a special form of graph structure, in a relational database, you'll have had this experience. It's not straightforward. Relational databases aren't good at this. So graph databases come in and say, "Yeah, we can handle jumping around relationships left, right, and center. We make it easy to do, and we optimize to make it fast to do that kind of thing." Furthermore, we can come up with an interesting query language that is designed around allowing you to query graph structures. This kind of query here, this is Cypher from Neo4j, is all

about saying, well, given a certain graph structure, let me use that graph structure to express a more complex query. And you can do some very interesting graph-oriented queries in graph databases, things that would be very, very difficult to write in terms of SQL as well as a pig to run in terms of performance. So in many ways you can kind of think that they've gone in opposite directions. Aggregate-oriented databases take a lot of stuff that's scattered around and put it into bigger lumps, while graph-oriented databases kind of break things apart

into even smaller units and let you play with those smaller units more carefully. I mean, you can still model relationships in aggregate-oriented databases, just as you can in relational databases. You basically refer to IDs in different documents, but it's a lot more messy. So part of your decision as to whether a NoSQL database is going to be interesting to you is: how do you work with your data? Do you tend to work with the same aggregates all the time, which would lead you towards an aggregate-oriented approach? Do you want

to really break things up and jump across lots and lots of relationships in a complex structure? That would lead you to a graph approach. Or is the tabular structure working well for you, in which case you want to stay with a relational approach? So NoSQL divides into those two categories. All of these are schemaless. So the graph databases as well allow you to add any bits of data to any node. You have all that flexibility, but with the same caution about implicit schemas as well. So that is kind of half of the picture, the

data model part. Now I'm going to move on to another issue, which is about consistency, and effectively dealing with lots of people trying to modify the same data at the same time. You've probably heard something like this: relational databases, they are ACID. They do the familiar ACID transactions that we all know and love. Atomic, consistent, isolated, durable. NoSQL, they don't do any of that kind of thing. And of course, NoSQL people will say, well, we do BASE, which is an even more contrived and meaningless acronym than ACID. And I won't even

attempt to tell you what it is, because I can only remember what it is on Tuesdays. But basically what it boils down to is, if you've got a single unit of information and you want to split it across several tables, what you don't want is to be caught in a position where you only get to write half the data and somebody else reads it, or you get to write half the data and somebody takes the same order and writes a different half of the data, and things get really messy. In that kind of situation,

you need to have this mechanism to effectively give you atomic updates. And that's really what transactions are all about: atomic updates, so that you either succeed or fail and nobody comes in the middle and messes things up. Now, when it comes to our nicely organized set of NoSQL databases, the first thing to point out is graph databases do tend to follow ACID updates, which makes sense. They decompose the data even more than relational databases do. So they've got even more of a need to make sure they use transactions to wrap

things together. So if anybody tells you, "Oh, NoSQL databases, they don't do ACID," you now know an immediate rejoinder: "Ah, but graph databases do." Now, aggregate-oriented databases actually don't need transactions as much, because the aggregate is a kind of bigger, richer structure. In fact, if you read the Domain-Driven Design book, one of the things they point out is that the aggregates in domain-driven design are transaction boundaries. You shouldn't let transactions cross aggregate boundaries, because if you do it'll just be complicated to manage the concurrency of your system. So the

domain-driven design community, from the beginning, even before NoSQL, said keep your transactions within a single aggregate, and that's effectively what you do in the world of aggregate-oriented databases. Any aggregate's update is going to be atomic. It's going to be isolated. It's going to be consistent within itself. It's only when you update multiple documents in a document database that you have to worry about the fact that you haven't got ACID transactions. But that problem occurs much more rarely than you'd think. So that's the first line about the ACID/BASE thing: some

databases are fully ACID anyway, and the aggregate-oriented databases that aren't, they are ACID within their aggregates, which is kind of what really matters. But there's also a bit more to thinking about consistency even than that, because even in a relational world ACID transactions don't mean we get to be completely consistent and don't have to worry about update anomalies. I will walk you through what hopefully is a very familiar scenario to point this out and also to illustrate how you deal with some of this. So imagine we have some typical multi-layered

system. We've got a person talking to a browser. The browser talks to a server. The server talks to a single database, and we're going to have two people talking to the same data in the same database at the same time, although through different browsers and servers. And here's the basic little scenario. We begin with both people, left and right, taking the same piece of data, with a GET request essentially. They bring it up onto the browser screen, and now the human being goes, "I need to make some changes to this," and eventually the guy on the left,

I always get my left and right confused, says, "Okay, I've got my updated data, let's post some changes," and then shortly afterwards the guy on the right says, "I've updated my data, now let's post some changes." Now of course, if we let that happen just like that: warning, conflict. This is a write conflict. Two people have updated the same piece of information. They weren't aware of each other's updates, and they've got themselves in trouble. ACID to the rescue, right? What do we do? Well, what we have to do to prevent this conflict is we wrap

the entire interaction, from getting the data onto the screen to posting it back again, in a transaction. That way, we make sure the database will ensure that we don't get a conflict. Effectively, one of them will be told, "No, you've got to do this again. Retrieve your data again." We don't get conflicts. Problem solved. How many people do this on your production systems? Somebody. Yeah. Occasionally, you can get away with this. Most of the time you can't. Why? Because holding a transaction open for that length of time while you've got a user looking at and updating

the data through the UI, that's going to really suck the performance out of your system. Right? And I want to stress you can do this in some circumstances. If your performance needs are really very minor, you've only got a handful of people using the system at once, you might be able to get away with this approach. And it is advantageous to do so, because a whole lot of problems go away if you do this. But for most systems, you can't afford to hold transactions open that

long. And in fact, most people who write about building systems like this will tell you never to do this. Don't hold transactions open for a user interaction. What they say instead is you just wrap the transaction around that update, that last bit of updating the database. And that's a good thing, because that stops a collision where one half-done update mixes up with another half-done update, and you get some tables updated over here and some different tables updated differently over there, and the result is an inconsistent mess. But you still effectively get a

conflict, because the two people made updates of the same piece of information without knowing the other person did that. And this is what typically might happen even in an aggregate-oriented database if you have to modify more than one aggregate, because you might find one person modifies the first one and then goes over to the second one, the other person does it the other way around, and as a result you could end up with an inconsistency between aggregates. Now, if you've come across this, which you probably have, you will probably also know

how to solve this. Basically, you use a technique which in one of my previous books I referred to as an offline lock. The usual way of implementing this is that you give each data record, or each aggregate at least, a version stamp, and when you retrieve it, you retrieve the version stamp with the aggregate data. When you post, you provide the version stamp of where you read from, and then for the first guy everything works out okay. The version stamp gets incremented, and then when the second person tries to

post, they've still got the old version stamp, and then you know something's up and you can do whatever conflict resolution approach you take. You use the same basic techniques again when you're working with a NoSQL database. The nice thing is you don't have to worry about transactions for this problem so much, because the aggregate gives you that natural unit of update. It is your transaction boundary. But once you cross aggregates, then you've got to think about juggling version stamps and doing something of that kind. But it's not really very different to what

you have to do with a relational database, because offline locks force you to do this juggling with version stamps anyway. So yeah, you don't get these ACID transactions to the same degree that you do with a relational database, but the impact is not as great as some people think, because we actually have to deal with this stuff all the time anyway. Now, when we talk about consistency, I find it useful to think about actually two kinds of consistency. The consistency I've been talking about so far is what I call logical consistency. These consistency issues

occur whether you're running on a cluster of machines or whether you're running on one single machine. You always have to worry about these kinds of consistency issues. Now when you start spreading data across multiple machines, this can introduce more problems. When it comes to distributing data, broadly you can talk about it in two different ways. One is sharding data: taking one copy of the data and putting it on different machines, so that each piece of data lives in only one place, but you're using lots of machines. Sharding doesn't really change the picture very much. You

still get the same logical consistency problems that you do with a single machine. They're exacerbated to some degree, but the basic problems are still the same. Another thing, however, that's common to do with clusters of machines is to replicate data, to put the same piece of data in lots of places, and this can be advantageous in terms of performance because now you've got more nodes handling the same set of requests. It can also be very valuable in terms of resilience. If one of your nodes goes down, the other replicas can still keep going. So hence

people talk a lot about availability and resilience with these cluster-oriented approaches. However, as soon as you replicate data, a new class of consistency problem starts coming in, and again I'll illustrate with a simple example. So here we have two people, myself and my co-author Pramod, and we both want to book a particular hotel room. So we send in our booking requests, and we happen to be on different continents: Pramod's in India, I'm in the US. We send our requests to our local processing nodes. Now the processing nodes at this point

need to communicate. They need to go, "Oh, hang on, what's going on here?" and the system as a whole needs to come up with some kind of decision, essentially ensuring that one of us has to sleep on the streets. In this case, me. This is what happens 99.99-whatever percent of the time. However, let's take a variation on this example. Again, we both want to book our hotel room, but now the communication line has gone down. The two nodes cannot communicate. We send in our requests. What happens? Well, actually, there are two broad alternatives. Alternative

one is the system says, "Our communication line's gone down. Sorry, we can't take your hotel bookings at the moment. Please try again later." The alternative is the system says, "Yes, we'll accept your booking. Thank you very much, because we're really reliable and up to date and all the rest of it." And they proceed to double book the hotel room. I'm not that friendly with Pramod. We're good friends, but you know, we have our limits. We may not want to share that hotel room. So basically what we're seeing is a choice. It's a choice between

consistency, which means I'm not going to do anything if my communication line's down, and availability, which says yes, I'm going to keep going, but at the risk of introducing inconsistent behavior. Now, the vital thing to realize here is that this is a choice, and it's a choice that can only be made by knowing about the business rules, the domain rules that you're working with. I mean, it may sound really awful to say, "Oh, we're going to double book a hotel room, possibly with complete strangers." I mean, that would be bad. But actually,

maybe the hotels have ways of dealing with this. Maybe they have a block of rooms that they always keep available till the last moment for emergencies, and they can just use one of them. Or maybe they just send out an apologetic groveling letter and some frequent-sleeper points to try and make me happy. There are various ways in business that people will deal with inconsistencies as they crop up. Now, I'm not saying you should always go for availability over consistency, but what is true is that it's always a domain choice. It is the business people who

will have to decide what's more important: the risk of double booking the last room in the hotel, or the fact that we have to bring down the site and say, "Sorry, we can't accept any orders at the moment," which is kind of bad for business. This is one of the things that drove Dynamo. They wanted to make sure that the shopping cart was always available, that you could always put things in the shopping cart. Why is this? Because it's America. What's the most important thing to do in America? Shopping. We must maintain our retail destiny.

We must always be able to shop. And what happens? You come to checkout and you go, "Why is this item in here twice?" Or, "I'm sure I put the so-and-so in here. Ah, computers, they make mistakes. Let me just fix it." And if the worst happens, you actually send out the order, you get duplicate stuff, you ring up Amazon, "Sorry, sorry, sorry," and you get it all back. That's much better than someone actually not being able to shop for a few seconds. So, the point is, it's a business choice. So, this

then ties into something you'll hear endlessly about whenever someone talks about this stuff, which is the CAP theorem. Everybody, who's heard of the CAP theorem? How many people understand the CAP theorem? Some of you. It's actually pretty straightforward. It's described very badly, though. Well, no, not very badly, but in a way I don't think is terribly useful. They say there are these three concepts up here, consistency, availability, and partition tolerance, and you get to pick any two. This is true, but I think it's easier to reformulate it. It's a bit clearer if you say: if you've got a system that can get

a network partition, which basically means communication between different nodes in a cluster breaking down (and if you have a distributed system, by the way, you are going to get network partitions), then if you get a network partition you have a choice: do you want to be consistent, or do you want to be available? And that's really what the CAP theorem boils down to. If you've got a single database running on a single server, it's not going to partition. You don't have to worry. You can be as available as that node is, and you're going to be

consistent. You can maintain everything. But as soon as you have a distributed system, you have to make that choice. But that isn't a single binary choice right across your system. You actually have a spectrum. You can trade off levels of consistency and availability. I'm not going to go into how. Just trust me, you can. Furthermore, it can vary depending on the particular operation you want to do. Certain operations can be highly consistent. Certain other operations can be highly available. Any of the databases that do

this kind of stuff will give you all the knobs and tweaks to do this, and so you're going to have to learn how to trade them off. And actually, most of the time, you aren't trading off consistency versus availability. It's not availability that's the issue, and it's not even dealing with network partitions that's the issue. A lot of the time what you're doing is trading off consistency versus response time, because the more you want to have consistency across a cluster of nodes, the more nodes have to get involved in the

conversation. Again, think of that hotel case. The two nodes had to communicate. That's going to slow down the response time. So you might say, even if the network is up, I'm going to let each node book its own hotel stuff and sort it out later. Even with the network up, I'd still get you the faster response time rather than doing all the communication I need to get the consistency. And again, that's a business decision. Another thing that Amazon said was, we want to always get people shopping fast, because what's the

most important thing in America? Shopping. So therefore we want really rapid response times, and even if all the nodes are available and we could give you a completely consistent solution, we want to be quick. And then it also helps that merging shopping carts, dealing with the inconsistency of shopping carts, is relatively easy. Oh, they asked for this over here, they asked for that over there. Well, clearly they want both, because this is America. Everybody wants everything. Taking stuff out of shopping carts? Why would we want to encourage that? And in fact, this is a

broader trade-off in terms of computing. This is really just another aspect of a general concurrency trade-off between safety and liveness. I mean, if you've gone to concurrency classes and you've heard people talk about that, this should actually seem fairly familiar in those kinds of terms. Now, what I really wanted to do with this little segment on consistency was focus on giving you a feel for how consistency is different, particularly in the aggregate-oriented NoSQL world, as opposed to how you may have thought about consistency so far. There's a lot of topics

I could have talked about here that I just haven't got time to talk about. The important thing to go away with is realizing that you have to think about consistency issues differently, essentially because you've got this different data model and the possibility of replicated data. And in particular you have to think about it in terms of this consistency-availability trade-off, and that it's not up to just us as techies to make that decision. It is actually up to the way the business wants to work as to where we make these trade-offs.
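
A minimal sketch of the version-stamp technique from this segment, in Python. It is only an illustration under stated assumptions: `AggregateStore`, its `get` and `post` methods, and `StaleVersionError` are hypothetical names made up for this example, not something from the talk or from any real database API, and the store is just an in-memory dictionary.

```python
class StaleVersionError(Exception):
    """Raised when a post carries an out-of-date version stamp."""


class AggregateStore:
    """Hypothetical in-memory store keeping one version stamp per aggregate."""

    def __init__(self):
        self._data = {}  # key -> (version, aggregate)

    def get(self, key):
        # Return the version stamp together with the aggregate data.
        version, aggregate = self._data.get(key, (0, None))
        return version, aggregate

    def post(self, key, aggregate, read_version):
        # Accept the update only if the caller read the current version.
        current_version, _ = self._data.get(key, (0, None))
        if read_version != current_version:
            raise StaleVersionError(
                f"{key}: read version {read_version}, current is {current_version}"
            )
        self._data[key] = (current_version + 1, aggregate)
        return current_version + 1


store = AggregateStore()
store.post("order-1", {"items": ["book"]}, read_version=0)

# Two people read the same aggregate and get the same version stamp.
v_first, _ = store.get("order-1")
v_second, _ = store.get("order-1")

# The first post succeeds and the stamp is incremented.
store.post("order-1", {"items": ["book", "pen"]}, read_version=v_first)

# The second post still carries the old stamp, so the conflict is detected.
try:
    store.post("order-1", {"items": ["book", "ink"]}, read_version=v_second)
    conflict_detected = False
except StaleVersionError:
    conflict_detected = True
```

What happens after the conflict is detected (retry, merge, ask the user) is left out on purpose, since that is the conflict resolution approach you choose for your domain.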

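The hotel-booking choice from this segment can be sketched the same way. This is a toy model, not how any particular database behaves: `BookingNode`, its `prefer` setting, the `link_up` flag, and the `Unavailable` exception are all hypothetical names invented here to show the two alternatives during a network partition.

```python
class Unavailable(Exception):
    """Raised when a node refuses a request in order to stay consistent."""


class BookingNode:
    """Hypothetical node holding a local replica of the booked rooms."""

    def __init__(self, name, prefer):
        self.name = name
        self.prefer = prefer   # "consistency" or "availability"
        self.peer = None
        self.link_up = True    # False simulates a network partition
        self.bookings = {}     # room -> guest, as seen by this node

    def book(self, room, guest):
        if self.link_up:
            # Normal case: consult the peer so only one guest gets the room.
            if room in self.bookings or room in self.peer.bookings:
                return False
            self.bookings[room] = guest
            self.peer.bookings[room] = guest
            return True
        if self.prefer == "consistency":
            # Partitioned: refuse rather than risk a double booking.
            raise Unavailable(f"{self.name} cannot reach its peer")
        # Partitioned but available: accept locally and reconcile later.
        if room in self.bookings:
            return False
        self.bookings[room] = guest
        return True


def make_pair(prefer):
    us, india = BookingNode("us", prefer), BookingNode("india", prefer)
    us.peer, india.peer = india, us
    return us, india


def partition(a, b):
    a.link_up = b.link_up = False


# Availability first: both requests succeed, and the room is double booked.
us, india = make_pair("availability")
partition(us, india)
both_accepted = us.book("room-101", "Martin") and india.book("room-101", "Pramod")
double_booked = us.bookings["room-101"] != india.bookings["room-101"]

# Consistency first: the node refuses to take bookings during the partition.
us, india = make_pair("consistency")
partition(us, india)
try:
    us.book("room-101", "Martin")
    refused = False
except Unavailable:
    refused = True
```

Which of the two settings is right is not something the code can decide; it depends on how the business would rather deal with a double booking versus a refused customer.
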
And if you want more, well, I'm going to tell you to buy my book anyway, so you know what to do. Okay. So in the last little segment I'm going to talk a bit about when and why you might want to use a NoSQL database. And the way I think of it is there are two drivers that push us towards a NoSQL database. The first one is the one that I've already talked about as the real driver for the whole NoSQL movement itself, and that is you've got to deal with large amounts of data. If

you've got more data than you can comfortably or economically fit onto a single database server, you're going to have to deal with some pain. You can either take the pain of trying to run a relational database across a cluster, or you can go into this new NoSQL stuff. And yeah, most of the time I think I'd go for the NoSQL stuff, because running relational databases across clusters is still somewhat of a black art. So big amounts of data is a big issue. Now some people have said, and

one of the reviewers' comments on my book was, "Yeah, but only very few organizations have to worry about this stuff. If you're Google or Amazon, yes. Pretty much everybody else, no." As I read that, what I heard in my head was, "640K is enough for almost everybody." The reality is there are tons of data coming at us, and every organization is going to be capturing and processing more and more data. So this large-scale data problem is only going to grow, and that is a factor. But actually this is not the main reason

I think why most people go into NoSQL. There was a survey I saw in the track on Monday that pointed out that most people actually aren't interested in big amounts of data for NoSQL databases. What they want is to be able to develop more easily. So a good example of this is I have some friends who work on the Guardian newspaper and website. How many people have heard of the Guardian? Good English-language newspaper. Many of you, good. And you know, they're dealing with articles. They're saving

articles, updating articles, pushing articles back and forth. The article for them is a natural aggregate. Spreading that article's data and metadata across relational databases is a pain in the neck. It's awkward. But taking it as a single thing, a single article, and pushing it into the database, that's much more straightforward. The impedance mismatch problem is drastically reduced if you've got a natural aggregate, and many of the projects that I've talked to in ThoughtWorks that have used a NoSQL database have gone that route. They've said, our data model doesn't really

fit very well with relational. One of these NoSQL options is better. It might be a natural aggregate, in which case they've gone the aggregate-oriented route, or it might be that we've got something that feels much like a graph structure, so we go the graph database route. And that I think is the most common reason at the moment why people are using NoSQL databases, because you're effectively getting rid of that impedance mismatch problem. Now of course that raises a question, because that was of course the promise of object databases. They were going

to get rid of the impedance mismatch problem but they got clobbered because databases are being used for integration. Why is that same problem not hitting us now? Well, it is hitting us, but it's Greatly reduced because now more and more people are saying we don't want to integrate that way. We want to hide our databases inside a broader application or service. And then we want to use some kind of serviceoriented interaction between the two, which may be web services. It can be something as really disgusting as SOAP on ESBs with god knows what thrown in.

But the point is applications are now controlling access to data. And if you're in a scenario where you can do that, where you can effectively encapsulate your database, then the integration issue becomes a lot less serious. And that I think is a very important enabler to make it possible for NoSQL databases to thrive. And this is a good practice anyway. Even if you've got relational databases, you do not want to be integrating through integration databases. They cause no end of trouble, believe me, if you haven't experienced it yourself. So it's much better

to try and encapsulate something like that. And if you're going to do that, then you've got much more freedom over what database to use. And I think that's going to be a real driving force towards this. Another thing that's encouraging people to use these databases is dealing with analytics. We all know about data warehousing. The usual data warehousing project, as far as I can tell, is that a salesman turns up from one of the big companies and says, "Oh, you want to do data warehousing?" Well, here's this project plan by which every

piece of data you could possibly have in your organization is all put into one place so that everybody can get at it easily. And it's a multi-year project with lots and lots of very diverse stakeholders. We know that story. I mean, have people come across these big data warehousing projects that they felt have succeeded? There's usually one or two. No one's prepared to admit it. Oh, you're prepared to admit it. But most of them go badly. What we look for instead is a different approach that says, let's focus on one particular

problem and see how we grab the data for that. And the data, by the way, might not be in well-known relational or even NoSQL stores. It might be scattered around in log files or, you know, what truly runs most enterprises, which is Excel spreadsheets. Well, let's get at that data and let's poke it and pull it together. And NoSQL databases play an important role in this. The graph databases allow you to easily do graph-like analytics on the database, which is really quite nice. The aggregate-oriented databases are generally less good at this because

they can't slice and dice so well, but what they can do is store large quantities of data. So if you are pulling stuff off devices or log files or the like, then they become very attractive. And of course that's what's given a big advantage to the likes of Amazon, because they're able to mine all this information. So with all of this, does this mean that NoSQL is the future of databases? That relational databases are going to disappear and we're all going to be doing NoSQL stuff? I don't think so. I think really the future

is something that I refer to as polyglot persistence. And what this means is we think that there's going to be room for lots and lots of different kinds of databases, with relational databases still playing a big role. And if you're building an application, maybe you'll use lots of different databases as part of your application. Certainly across an organization, you'll use lots of databases. And what you're doing is choosing the appropriate database for the nature of the problem that you're working with. And because there are different natures of problems, there are different data stores. But

the idea that, whatever your problem is, the answer is a relational database will go away. Now this is great. It gives us lots of opportunities for the future. But as every cynic knows, every opportunity is really a problem. And there are plenty of them. You've now got to think about this kind of stuff. You've got to decide what is the appropriate NoSQL database for a problem. You've got to deal with organizational issues. Relational DBAs are not going to like this. In fact, for some people, that's a big advantage. But let's not go there.

NoSQL databases are immature. They don't have the tools and the experience and the knowledge of how to work with them well that we've had from 20 years of relational databases. And all of these consistency issues can still end up biting you. So when it comes to what kind of project, again I start with the drivers. If you want rapid time to market, fast cycle time, you need to be quick. Easier development is really important. Therefore if you can do that with NoSQL databases, that's a reason to go with them. Similarly, if you've got a

very data-intensive project, then obviously NoSQL's ability to deal with large amounts of data is very important. But I think there's another overriding goal as well, which is: is your project really important to the competitive advantage of your business? That's what I refer to as a strategic project. Because if it's a strategic project, then it's worth taking on the extra risk, the unknowns of dealing with an immature and not so well-known technology, which is what NoSQL databases are. If, on the other hand, you've got a project that's what I call a utility project, it's kind of

straightforward, it's not really vital to the business's operation, then that may not be the best place to bring in an unknown like this. In that kind of situation, you're probably better off with the familiar, at least for a few years. But there are lots of strategic projects out there, and certainly our experience over the last two or three years at ThoughtWorks has been very positive with NoSQL databases. I've heard remarkably few complaints, and ThoughtWorkers always complain about what they're working with. So I certainly am very much convinced that NoSQL

databases have an important part to play in the spectrum of future developments. And the rest of the talks in this track will explore different ways in which they've been used. So I hope you found that helpful. If you want more depth, the book is very thin. My target was 150 pages and I only missed it by two, so it's 152 pages. It's a quick overview, a bit more than what I just gave you, and I hope that will be handy. If you go to that page on my website, I collect together various other

things that I've done or talked about in terms of NoSQL. And thank you for listening to me. [applause]

Introduction to NoSQL • Martin Fowler • GOTO 2012