Simone Bruno: September 2009

Software Development is one of the most entropic fields of human activity. "Errare humanum est, but if you want to screw up everything you need a computer", as I read somewhere. Without proper management of all factors that come into play in the lifecycle of a software project, failure is the most probable epilogue of the story. Every experienced programmers know very well that delivering a software product is not simply a matter of writing code. Coding is only a part (very important, of course) of the game, but it's not the whole game, and losing the overall picture is as easy as dangerous. I am currently working on a project to manage structural databases changes and data migrations between different database versions. It's a common mistake not to think about these issues since the very first phases of design and development of software products: usually architects and developers focus on good database design, on proper Object Relational Mapping, and on performance issues. Project managers and deployers begin to think about database migrations when a new software release is dropped, which requires database changes. Is the currently deployed database already compatible with the new release, or not? How can we determine this? If changes are needed, which scripts do need to be run? Which schemas need to be changed, all of the schemas used by the application or only a subset of them? The first time I had to think about these issues, it was for a pretty huge and complex project (the Vodafone Live! portal), in a quite complicated scenario: there were localized releases of software developed on different CVS branches (localizations were needed for functional adaptations of the software for the countries that would use it), and whenever a new release was dropped deployers needed to know how to properly migrate the database and how to change the schemas so that it could work with the new version. Finally, after talking a bit about this with some colleagues (mainly Peter Moore, Mida Boghetich, and Alexis Konstantopoulos, thank you guys) I came out with some basic principles to approach this issue.

The most important thing was to define a good database versioning mechanism. When I say "database version" I mean a unique identifier that describes the full set of Users and, for each User, Schema Objects (tables, constraints, indexes, stored procedures) that are required by given version(s) of the software. We choose to name database versions using the CVS labels used to tag the software modules responsible of the database creation. Then we started re-organizing the existing database migration scripts (mainly sql scripts) as point to point migrations, that means migrations from a database version to another. (The database version needs to be a separated, independent concept from the software version, because different software versions can possibly require the same database version; consequently, not all releases do imply the need of a database migration). At this point we needed to solve the following problem: how can we check that a given database installation is compliant with a given database version? In other words, how can we run a database compliance test on an existing database instance, to see if the database structure is compliant with a given db version or it has been corrupted, or not properly installed, or a past migration has not been run properly? For each database version, a textual description of all users was generated in a definite format, and to do that a dedicated tool was written which was able to connect to an existing database instance and generate the desired textual description of the database, so that comparing the actual database structure with the expected one was as simple as diffing two text files (The tool simply queries the database metadata tables to get all the required information, and formats the result according to a definite set of rules). The last ingredient of the recipe was a simple algorithm (let's call it the pathfinder algorithm) to compose existing atomic migrations in order to run generic ones. For example, let's imagine that the following database migrations have been written: A to B, B to C, and C to D. If we need to migrate from A to D, we will need to run these three migrations in the correct order; this is precisely the pathfinder's job: we tell the pathfinder which migration we need to run (Version Alfa to Version Omega), and it will tell us which existing migrations we need to run, and in which order (i.e., Alfa to X, X to Y, Y to F34, F34 to H67, H67 to Omega). Of course the provided examples are pretty simple, but you can easily imagine that if many database versions need to be supported, it's convenient to have a dedicated tool to compute which scripts need to be run and in which order.

To sum up, the solution that was successfully used to properly manage database changes was made up of the following "ingredients":

A Database versioning mechanism: for each software release it must be clear what is the required database version, and this must be identified univocally with a string identifier
Point to Point migrations: all sql scripts to manage schema changes and data migrations need to be written so that it's perfectly clear what the source database version and the target database version are.
A pathfinder tool, able to compose the existing point to point migrations into generic ones
Compliance tests, to automatically check if a database installation is compliant with a given database version

The previous description is just a short introduction to the approach I am currently using to manage database changes. Actually, I am working on a simple java tool to automate many of the processes implied by the adoption of the described key concepts. Furthermore, there are several existing tools and resources that can be very useful to handle database changes, or simply to provide inspiration regarding these topics: www.liquibase.org, www.dbunit.org; I'd recommend also the "migrations" chapter in the Doctrine Manual (http://www.doctrine-project.org).

Simone Bruno

Tuesday, September 29, 2009

Managing databases changes

Followers

Blog Archive