I recently read a great blog post on smartdatacollective.com, From Master Data to Master Graph by Peter Perera. I found myself agreeing with almost everything in the post, particularly once I realised he was using terminology slightly differently from how we would here at Magnitude Software (in particular, I suspect he and I think of different things when we refer to “MDM Applications”). What’s interesting is that although I’m in broad agreement with the arguments made, I’m not yet convinced by the post’s conclusion (which, as I understood it, is that graph database technology is the best foundation for master data management systems).
Let’s start with the bits we agree on. First, an MDM implementation absolutely needs a data model (what we’d describe as a business information model) that is independent of the models used by any particular source system or application, such as a CRM application. A master data repository is likely to receive data from tens of source applications, and make trusted data available to many more. What’s more, these systems are likely to come and go over time, so tying your master data model to any single application model is likely to be a mistake. Jeff Kerr from our product management team has a recent blog post on how to go about creating an application-neutral business information model in practice.
The second big area where I agree with Peter’s blog post is that you should use what you could call a “relationship-centric” data model. In addition to tracking relationships between customers, suppliers, and the like, it’s important to understand that concepts as important as ‘Customer’ should themselves usually be treated as a handful of separate entities, with important relationships between them. For example, what one team calls ‘Customer’, other teams may call ‘Account’, ‘Contact Person’, ‘Project’, ‘Location’, ‘Counter Party’, or ‘Debtor’. There is no right answer here – these are probably all valid (but distinct) concepts, and the key is to figure out how they relate to each other, and to ensure you can capture that in your MDM implementation. This kind of relationship-centric modelling in fact lies at the heart of Kalido technology, to the extent that some of our innovations in this area are covered by US Patent 70035014.
Now, let’s come on to my big point of disagreement. If I understood the blog correctly, it’s arguing that the best way forward is to build your master repository on top of a specialized NoSQL graph database (as opposed to, for example, relational database technology).
The way I see it, assuming you have a suitably flexible, relationship-centric, data model, the choice of underlying database technology becomes mainly one of engineering tradeoffs. You certainly don’t need a specialized graph database in order to store graph-style data; here at Kalido we have been successfully storing graph-style data within relational database technology for almost two decades (and a large part of our software is dedicated to doing precisely this).
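To make the idea of graph-style data in a relational store concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is purely illustrative and is not a description of how our own software stores data; the `entity` and `association` tables, the sample rows, and the `neighbours` helper are all names I've made up for the example.

```python
import sqlite3

# Hypothetical schema: one table for nodes, one for typed edges.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (
        id   INTEGER PRIMARY KEY,
        kind TEXT NOT NULL,   -- e.g. 'Account', 'Contact Person', 'Location'
        name TEXT NOT NULL
    );
    CREATE TABLE association (
        from_id INTEGER NOT NULL REFERENCES entity(id),
        to_id   INTEGER NOT NULL REFERENCES entity(id),
        kind    TEXT NOT NULL,
        PRIMARY KEY (from_id, to_id, kind)
    );
""")

# Made-up sample data.
conn.executemany("INSERT INTO entity VALUES (?, ?, ?)", [
    (1, "Account", "Acme Ltd"),
    (2, "Contact Person", "Jo Smith"),
    (3, "Location", "Leeds depot"),
])
conn.executemany("INSERT INTO association VALUES (?, ?, ?)", [
    (2, 1, "works_for"),
    (1, 3, "delivers_to"),
])


def neighbours(conn, entity_id):
    """Return (association kind, entity kind, name) for each outgoing edge."""
    rows = conn.execute(
        """
        SELECT a.kind, e.kind, e.name
        FROM association a
        JOIN entity e ON e.id = a.to_id
        WHERE a.from_id = ?
        ORDER BY a.kind, e.name
        """,
        (entity_id,),
    )
    return rows.fetchall()
```

A single hop like `neighbours(conn, 1)` is an ordinary join; new kinds of entity or association are just new rows, not new tables.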
That’s not to say that using graph database technology doesn’t have its advantages – it does. Graph databases make many types of queries easier to write and faster to execute, and can allow you to ask questions that some relational databases would find impossible. A simple example would be any query that requires you to traverse an unknown number of associations, something like “Is there a chain of ‘friend’ associations between these two of my customers?” To write that in SQL you’d need to use a recursive query, something that has varying levels of support on different RDBMS implementations and BI tools, and is in general very fiddly to write.
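For a flavour of what that recursive query looks like, here is a rough sketch of the ‘friend’ chain question using a recursive common table expression. I'm using SQLite through Python's `sqlite3` module only because it is easy to run; the `friend` table, its columns `a` and `b`, and the `friends_connected` function are assumptions for illustration, and the exact syntax will differ on other databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: each row is an undirected 'friend' association.
conn.execute("CREATE TABLE friend (a INTEGER NOT NULL, b INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO friend VALUES (?, ?)",
    [(1, 2), (2, 3), (3, 1), (4, 5)],  # includes a cycle 1-2-3-1
)


def friends_connected(conn, start_id, target_id):
    """True if a chain of 'friend' associations links the two customers."""
    row = conn.execute(
        """
        WITH RECURSIVE reachable(id) AS (
            SELECT ?
            UNION  -- UNION (not UNION ALL) drops repeats, so cycles terminate
            SELECT CASE WHEN f.a = r.id THEN f.b ELSE f.a END
            FROM friend f
            JOIN reachable r ON r.id IN (f.a, f.b)
        )
        SELECT EXISTS (SELECT 1 FROM reachable WHERE id = ?)
        """,
        (start_id, target_id),
    ).fetchone()
    return bool(row[0])
```

Even in this small case you have to think about edge direction and cycle handling yourself, which is the fiddliness I mean; a graph query language would typically express the same question as a single path pattern.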
On the other hand, any technology choice will involve trade-offs that need to be carefully considered. Speaking as an engineer, if I were considering using a graph database for MDM, I’d be thinking about the following factors:
- Has the technology proven itself in terms of stability and robustness?
- How easy will it be for consuming systems to access data stored in the database? Does it support industry-standard protocols such as JDBC/ODBC?
- How easily and at what level of performance can it execute non-graph style queries? Is it possible to formulate the full range of queries you could with SQL?
- What are the performance characteristics for bulk data import and export?
- Does it provide full ‘ACID’ transactions to ensure data integrity?
- Does it allow you to define data validation and consistency constraints?
- How mature is its operational tooling for things like monitoring, backup, migrations, failover and disaster recovery?
- How easily does it scale?
Some graph database implementations probably do very well on some of these criteria – but I’m skeptical as to whether any relatively new ‘NoSQL’ technology excels on all these fronts when compared to relational technology that has matured over tens of years. I’d also be surprised if graph database platforms even attempted to be competitive when it comes to traditional relational-style queries (at least those that aren’t a good match for graph models) – after all, that’s not their aim.
My intuition is that for an enterprise MDM installation, the vast majority of queries won’t hit the graph databases’ sweet spot – they’re simple things like asking for a list of all account status codes, or requesting details of an individual customer, or loading a reference hierarchy into a data warehouse – things that relational databases are great at. I’m not sure I’d want to optimize for the minority of graph-style queries, at the expense of the more traditional workload.
So, what’s the solution? We have an important class of queries that graph databases are perfect for, but we’re not yet sure we want to make the trade-offs required to rely on a graph database to act as the core of our master data repository. I think in most cases the correct solution is to use graph database technology where needed, but to consider it to be just another consuming system for your MDM hub. A successful MDM installation is already likely to have dozens of downstream systems consuming its data, from data warehouses and analytics platforms, to operational systems such as CRMs, to internal line-of-business applications needing ad-hoc access to master data. What’s needed is an architecture where more of these systems can easily be added over time; you’re being asked for graph analytics today, but who knows what will be needed tomorrow – predictive analytics perhaps? It should be easy to use your MDM software to stream changes to any analytics platform, graph or otherwise, allowing the best of both worlds.
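As a rough sketch of that architecture (not a description of any particular product), imagine the hub keeping a log of changes and pushing each one to whatever consumers are registered. Everything here is hypothetical: the `Change`, `Hub`, `GraphIndex` and `RelationCounter` names, the in-memory log, and the sample data are stand-ins for real integration machinery.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One master data change: an association between two records."""
    subject: str
    relation: str
    target: str


class Hub:
    """Toy MDM hub that streams changes to any number of consumers."""

    def __init__(self):
        self._log = []
        self._consumers = []

    def subscribe(self, consumer):
        # Replay history first so a consumer added later can catch up.
        for change in self._log:
            consumer(change)
        self._consumers.append(consumer)

    def publish(self, change):
        self._log.append(change)
        for consumer in self._consumers:
            consumer(change)


class GraphIndex:
    """Stand-in for a graph database fed from the hub."""

    def __init__(self):
        self.edges = defaultdict(set)

    def __call__(self, change):
        self.edges[change.subject].add((change.relation, change.target))


class RelationCounter:
    """Stand-in for some other analytics consumer added later."""

    def __init__(self):
        self.counts = defaultdict(int)

    def __call__(self, change):
        self.counts[change.relation] += 1


hub = Hub()
graph = GraphIndex()
hub.subscribe(graph)
hub.publish(Change("Jo Smith", "works_for", "Acme Ltd"))
hub.publish(Change("Acme Ltd", "delivers_to", "Leeds depot"))

# A new consumer arrives after the fact and still sees the full history.
counter = RelationCounter()
hub.subscribe(counter)
hub.publish(Change("Sam Lee", "works_for", "Acme Ltd"))
```

The point of the sketch is that the graph store is just one subscriber among many: the hub does not need to know or care what kind of technology sits behind each consumer.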
So while I don’t think we necessarily agree on the best choice of low-level database technology for MDM, the key takeaway for me is that none of this is possible unless your MDM implementation is built on a system-agnostic, relationship-centric, cross-domain business model. Those should be considered core business requirements for a solution. The choice of persistence technology is just an implementation detail.
What are your thoughts? Leave a reply below!