Tony Marston has published an interesting critique of my posting about why intelligent databases are helpful. The response is thought-provoking and I suggest my readers read it. Nonetheless, I believe that the use cases where he is correct are becoming narrower over time, and so I will explain my thoughts here.
A couple of preliminary points must be made first, though. There is a case to be made for intelligent databases, for NoSQL, and for all sorts of approaches. None of these is as close to a magic bullet as some proponents would like to think, and there will be cases where each approach wins out, because design is the art of endlessly making tradeoffs. My approach has nothing to do with what is proper; it has to do instead with preserving the ability to use the data in other ways, for other applications, as I see this as the core strength of relational database management systems. Moreover, I used to agree with Tony, but I have since changed my mind, primarily because I have begun working with data environments where the single-application-stack assumption for application design doesn't work well. ERP is a great example, and I will go into why below.
Of course, in some cases data doesn't need to be reused, and an RDBMS may still be useful; this is usually the case where ad-hoc reporting is a larger requirement than scalability and rapid development. In other cases, where data reuse is not an issue, an RDBMS really brings nothing to the table, and NoSQL solutions of various sorts may be better.
My own belief is that a large number of database systems out there operate according to the model of one database to many applications. I also think that the better this model is understood, the more even designers of single-application databases can design with it in mind, or at least ask whether this is the direction they want to go.
And so with the above in mind, on to the responses to specific points:
Well, isn't that what it is supposed to be? The database has always been a dumb data store, with all the business logic held separately within the application. This is an old idea which is now reinforced by the single responsibility principle. It is the application code which is responsible for the application logic while the database is responsible for the storing and retrieval of data. Modern databases also have the ability to enforce data integrity, manage concurrency control, backup, recovery and replication while also controlling data access and maintaining database security. Just because it is possible for it to do other things as well is no reason to say that it should do other things.
The above opinion works quite well in a narrow use case, namely where one application and only one application uses the database. In this case, the database tables can be modelled more or less directly after the application's object model, and then the only thing the RDBMS brings to the table is some level of ad hoc reporting, at the cost of added development time and complexity compared to some NoSQL solutions. (NoSQL solutions are inherently single-application databases and therefore are not usable where the considerations below exist.)
The problem is that when you start moving from single applications into enterprise systems, the calculation becomes very different. It is not uncommon to have several applications sharing a single database, and the databases often must be designed to make it quick and easy to add new applications on top of the same database.
However, as soon as you make this step, something awful happens when you try to use the same approach used for single-application databases: because the database is based on the first application's object model, all subsequent applications must have intimate knowledge of the first application's object model, which is something of an anti-pattern.
A second major problem appears at the same time: while a single application can probably be trusted to enforce meaningful data constraints, applications can't and shouldn't trust each other to input meaningful data. The chance of oversights, where many applications are each responsible for checking all inputs for sanity and ensuring that only meaningful data is stored, is very high, and the consequences can be quite severe. Therefore things like check constraints suddenly become irreplaceable when one makes the jump from one data entry application to two against the same database.
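To make that concrete, here is a minimal sketch of the kind of declarative constraint I mean, in PostgreSQL-flavored SQL. The invoice table and its columns are hypothetical rather than taken from any particular schema, and the foreign key assumes a customer table already exists:

    -- Hypothetical invoice table; the names are purely illustrative.
    -- The point is that the database enforces these rules once, for every
    -- application that writes to the table, rather than each application
    -- re-implementing (and possibly forgetting) the same checks.
    CREATE TABLE invoice (
        id           serial PRIMARY KEY,
        customer_id  integer NOT NULL REFERENCES customer (id),
        invoice_date date NOT NULL,
        due_date     date NOT NULL,
        amount       numeric(12,2) NOT NULL,
        CONSTRAINT amount_positive   CHECK (amount > 0),
        CONSTRAINT due_after_invoice CHECK (due_date >= invoice_date)
    );

Every application that writes to this table gets those rules enforced, whether or not its programmers remembered to implement them.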
Just because a database has more features does not mean that you should bend over backwards to use them. The English language contains a huge number of words that I have never used, but does that make my communication with others unintelligible? I have used many programming languages, but I have rarely used every function that has ever been listed in the manual, but does that mean that my programs don't work? With a relational database it is possible to use views, stored procedures, triggers or other advanced functionality, but their use is not obligatory. If I find it easier to do something in my application code, then that is where I will do it.
Indeed. However, if you aren't thinking in terms of what the RDBMS can do, but rather just in terms of it as a dumb storage layer, you will miss out on these features when they would be of benefit. Therefore it is important to be aware of what that approach costs when asking "should my database be usable by more than one application?"
That question becomes surprisingly useful when paired with questions like "should data be able to be fed into the database by third-party tools?"
Where did this idea come from? The application has to access the database at some point in time anyway, so why do people like you keep insisting that it should be done indirectly through as many intermediate layers as possible? I have been writing database applications for several decades, and the idea that when you want to read from or write to the database you should go indirectly through an intermediate component instead of directly with an SQL query just strikes me as sheer lunacy. This is a continuation of the old idea that programmers must be shielded completely from the workings of the database, mainly because SQL is not object oriented and therefore too complicated for their closed minds. If modern programmers who write for the web are supposed to know HTML, CSS and Javascript as well as their server-side language, then why is it unreasonable for a programmer who uses a database to know the standard language for getting data in and out of a database?
The idea comes from the necessity of ensuring that the database is useful to more than one application. Once that happens, the issue becomes one of avoiding too much intimate knowledge of data structures being shared between components.
However, one thing I am not saying is that SQL-type interfaces are categorically out. It would be a perfectly reasonable approach to encapsulation to build updatable views matching the object models of each application, and indeed reporting frameworks perhaps should be built this way to the extent possible, at least where multiple applications share a database. The problem this solves is accommodating changes to the schema required by one application but not by a second application. One could even use an ORM at that point.
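As a sketch of what that kind of encapsulation might look like, again in PostgreSQL-flavored SQL with hypothetical table and view names, one application's object model can be backed by a view rather than by the base table:

    -- Hypothetical shared storage used by several applications.
    CREATE TABLE customer (
        id         serial PRIMARY KEY,
        name       text NOT NULL,
        email      text,
        created_at timestamp NOT NULL DEFAULT now()
    );

    -- An updatable view matching one application's object model.  If the
    -- base table later changes for the benefit of another application,
    -- the view can absorb that change so this application's code (or its
    -- ORM, pointed at the view) does not have to.
    CREATE VIEW app1_customer AS
        SELECT id, name, email
          FROM customer;

In PostgreSQL a simple single-table view like this is automatically updatable, so the application can insert, update, and delete through it as if it were a table; more complex mappings can be made writable with rules or INSTEAD OF triggers.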
If your programmers have to spend large amounts of time in writing boilerplate code for basic read/write/update/delete operations for your database access then you are clearly not using the right tools. A proper RAD (Rapid Application Development) toolkit should take care of the basics for you, and should even ease the pain of dealing with changes to the database structure. Using an advanced toolkit it should be possible to create working read/write/update/delete transactions for a database table without having to write a single line of SQL. I have done this with my RADICORE toolkit, so why can't you? Simple JOINs can be handled automatically as well, but more complex queries need to be specified within the table class itself. I have implemented different classes for different database servers, so customers of my software can choose between MySQL, PostgreSQL, Oracle and SQL Server.
How does this work though when you have multiple applications piping heavy read/write workloads through the same database, but where each application has unique data structure requirements and hence its own object model?
The very idea that your software data structure should be used to generate your database schema clearly shows that you consider your software structure, as produced from your implementation of Object Oriented Design (OOD) to be far superior to that of the database.....
It seems I was misunderstood. I was saying that the generation of database structures from Rails is a bad thing. I am in complete agreement with Tony on that part I think.
You may think that it slows down development, but I do not. You still have to analyse an application's data requirements before you can go through the design process, but instead of going through Object Oriented design as well as database design I prefer to ignore OOD completely and spend my time in getting the database right. Once this has been done I can use my RAD toolkit to generate all my software classes, so effectively I have killed two birds with one stone. It also means that I don't have to waste my time with mock objects while I'm waiting for the database to be built as it is already there, so I can access the real thing instead of an approximation. As for using an ORM, I never have the problem that they were designed to solve, so I have no need of their solution.
This is an interesting perspective. I was comparing it though to initial stages of design and development in agile methodologies.
Here's the basic thing. Agile developers like to start out, when virtually nothing is known about the requirements, with a prototype designed to flesh out those requirements, and then refine that prototype until you get something that arguably works.
I am saying that, in the beginning at least, it is worth asking what the data relating to the topics being entered actually is, and how to model it in a way that is neutral of business rules.
I would agree though that the time spent there on looking at the data, modelling issues, and normalization is usually recouped later, so I suppose we are in agreement.
The problem occurs when CRUD operations are not really as simple as this sounds, and when complex constraints must be enforced across multiple data entry applications. A good example here is posting GL (general ledger) transactions, where each transaction must be balanced. This is far simpler to do in a stored procedure than anywhere else, because the set of lines has emergent constraints that apply beyond the scope of a single record. Also, if the application interface is abstracted from the low-level storage, a stored procedure may be necessary as part of the mapping of views to relations.
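A minimal sketch of that idea, assuming hypothetical gl_transaction and gl_line tables and PostgreSQL's PL/pgSQL rather than the schema or code of any particular ERP, might look like this:

    -- Hypothetical journal tables; the names are illustrative only.
    CREATE TABLE gl_transaction (
        id          serial PRIMARY KEY,
        description text NOT NULL,
        post_date   date NOT NULL
    );

    CREATE TABLE gl_line (
        transaction_id integer NOT NULL REFERENCES gl_transaction (id),
        account_id     integer NOT NULL,
        amount         numeric(14,2) NOT NULL  -- debits positive, credits negative
    );

    -- Post a whole transaction at once, refusing to store an unbalanced set.
    CREATE FUNCTION post_gl_transaction(
        in_description text,
        in_post_date   date,
        in_accounts    integer[],
        in_amounts     numeric[]
    ) RETURNS integer LANGUAGE plpgsql AS $$
    DECLARE
        new_id integer;
        total  numeric;
    BEGIN
        -- The balancing rule applies to the set of lines as a whole.
        SELECT sum(a) INTO total FROM unnest(in_amounts) AS a;
        IF total IS NULL OR total <> 0 THEN
            RAISE EXCEPTION 'Transaction does not balance (sum = %)', total;
        END IF;

        INSERT INTO gl_transaction (description, post_date)
             VALUES (in_description, in_post_date)
          RETURNING id INTO new_id;

        INSERT INTO gl_line (transaction_id, account_id, amount)
             SELECT new_id, acct, amt
               FROM unnest(in_accounts, in_amounts) AS lines(acct, amt);

        RETURN new_id;
    END;
    $$;

No per-row check constraint can express "the lines of a transaction sum to zero", but a set-oriented procedure can, and every application that posts through it, for example with SELECT post_gl_transaction('Office supplies', current_date, ARRAY[5010, 1000], ARRAY[49.99, -49.99]), gets the rule enforced in one place.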
This is another area where OO people, in my humble opinion, make a fundamental mistake. If you build your classes around the database structure and have one class per table, and you understand how a database works, you should realise that there are only four basic operations that can be performed on a database table - Create, Read, Update and Delete (which is where the CRUD acronym comes from). In my framework this means that every class has an insertRecord(), getData(), updateRecord() and deleteRecord() method by default, which in turn means that I do not have to waste my time inventing unique method names which are tied to a particular class. Where others have a createCustomer(), createProduct(), createOrder() and createWhatever() method I have the ubiquitous insertRecord() method which can be applied to any object within my application. This makes use of the OO concept of polymorphism, so it should be familiar to every OO programmer. Because of this I have a single page controller for each of the methods, or combination of methods, which can be performed on an object, and I can reuse the same page controller on any object within my application. I have yet to see the same level of reuse in other application frameworks, but isn't a high level of code reuse what OOP is supposed to be about?
I am often told "but that is not the way it is done!" which is another way of saying "it is not the way I was taught". This tells me that either the critic's education was deficient, or he has a closed mind which is not open to other techniques, other methods or other possibilities.
I can even make changes to a table's structure, such as adding or deleting a column, or changing a column's size or type, without having to perform major surgery on my table class. I simply change the database, import the new structure into my data dictionary, and then export the changes to my application. I don't have to change any code unless the structural changes affect any business rules. My ERP application started off with 20 database tables but this has grown to over 200 over the years, so I am speaking from direct experience.
So if you are doing an ERP application and at the same time putting all business logic in your application, you do not expect any third party applications to hit your database, right? So really you get the same basic considerations I am arguing for, except by making the database into the abstraction layer rather than encapsulating the database inside an abstraction layer.
If we think about it this way, it depends on what platform you are developing on. We like to have the database be that platform, and that means treating interfaces to it as an API. Evidently you prefer to force all access to go through your application. I think we get a number of benefits from our approach, including language neutrality for third party components, better attention to performance on complex write operations, and more. That there is a cost, however, cannot be argued with.
I would conclude by saying that we agree on a fair number of the principles of database design. We agree normalization is important, and I would add that this leads to application-neutral data storage. Where we disagree is where the API/third party tool boundary should be. I would prefer to put in the database whatever seems to belong to the general task of data integrity, storage, retrieval, and structural presentation, while leaving what we do with this data to the multiple third party applications which utilize the same database.