In recent times there has been a belief that one should aim for a Single Version of the Truth (SVoT) when considering Data Management solutions.
This has been driven primarily by the risk of dirty data, which Chris Johnson, Senior Product Manager, Market Data at HSBC Securities Services, has estimated costs the financial industry between 15% and 25% in lost revenue, with other industry sectors losing far more. Can we quantify the real cost of the recent incident in which a spreadsheet was manually edited and saved, resulting in the reported loss of 16,000 COVID test results?
Was there ever a single version of the truth?
In the early days of computing, a mainframe stood standalone, unconnected to any other devices or the outside world. Each program or computer focussed on a specific business domain and managed its own data. Life was simple then, and one can argue that there was a single version of the truth.
As the years have progressed, with the universal adoption of the relational database model, network connectivity and the success of the Internet, we have seen the proliferation of data. As corporations needed to remain agile to obtain business advantage or penetrate new markets, it is not surprising that this resulted in what is known as Multiple Versions of the Truth (MVoT). This does not refer to an alternative version of the facts, but to the use of slightly different variants or subsets of the information for related business contexts.
A classic example of this can be seen in the lifecycle of a trade of a financial instrument. As the transaction progresses along the trade lifecycle, different identifiers and stage-specific information are required: an exchange ticker at the initiation of a deal, a settlement identifier in the later stages, and others along the way. Each is considered the truth within its given business domain. In addition, other information is required, for example to calculate the risk to the financial institutions involved in the transaction. Several business units are involved in the workflow, and each has its own operational interpretation of the truth.
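To make this concrete, here is a minimal sketch of the idea; the instrument, identifier schemes and stage names are illustrative assumptions, not taken from any particular system:

```python
# Multiple Versions of the Truth along a trade lifecycle: each stage
# refers to the same instrument by a different, equally valid identifier.
# All values below are illustrative examples.
TRADE_LIFECYCLE_IDS = {
    "initiation":   {"scheme": "exchange_ticker", "value": "VOD.L"},
    "confirmation": {"scheme": "isin",            "value": "GB00BH4HKS39"},
    "settlement":   {"scheme": "sedol",           "value": "BH4HKS3"},
}

def identifier_for(stage: str) -> str:
    """Return the identifier that is 'the truth' within the given stage."""
    ident = TRADE_LIFECYCLE_IDS[stage]
    return f"{ident['scheme']}:{ident['value']}"

for stage in TRADE_LIFECYCLE_IDS:
    print(stage, "->", identifier_for(stage))
```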
Axioms can also behave differently within different model representations. A workflow taxonomy comprises multiple steps, and these steps are grouped within blocks. When the same workflow is executed, however, the relationship between these constructs changes: the block becomes a completion milestone, where the steps within it are expected to be completed before the next block of steps can be initiated.
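A minimal sketch of that execution-time reading, assuming a workflow is simply an ordered list of blocks, each holding the steps it groups (all names here are illustrative):

```python
# Sketch: at execution time a block acts as a completion milestone.
# Every step inside a block must finish before the next block may start.
from typing import Callable

def run_workflow(blocks: list[list[Callable[[], None]]]) -> None:
    for i, block in enumerate(blocks, start=1):
        for step in block:
            step()                                    # run every step in the block
        print(f"milestone: block {i} complete")       # gate before the next block

run_workflow([
    [lambda: print("  validate input"), lambda: print("  enrich data")],
    [lambda: print("  publish result")],
])
```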
At this juncture, attempts are underway to construct industry-specific semantic ontologies to support specific business requirements. Existing ontologies include the Global Legal Entity Identifier (multiple sectors), the Financial Industry Business Ontology (finance) and SNOMED Clinical Terms (health). Early adopters are attempting to overlay their Big Data repositories with conceptual semantic graph models, with the aim of emulating a single version of the truth. Nonetheless, even the ontologies mentioned above suffer from inconsistent assertions or vagueness in their models. Another observation is that organisations attempt to start semantic modelling prematurely, before all stakeholders are on board. A pragmatic approach is to live with multiple versions of the truth and model this knowledge accordingly, focussing on a specific business context or business domain. An overarching single version of the truth can come after the stakeholders in each business unit have confidence in their own model of the truth.
The truth is out there
Maybe if the Federal Bureau of Investigation's X-Files department had adopted the appropriate data quality processes, the TV series might have finished after the first six episodes, rather than continuing for nine seasons.
So, let us examine what is understood by Data Quality Management. Most believe that it is only achievable if one adheres to its five pillars:
1. Team Composition
The first thing one should consider is the knowledge and capabilities of staff. A database designer, for example, will talk in terms of tables, columns, and primary and foreign keys. A business analyst or data architect will use terms like classes, objects and relationships, whilst others within the organisation may use other terminology. One needs to consider who will implement and maintain the solution: its implementation is only as effective as the business expertise of the individuals involved. Ideally, one should adopt a framework that reduces the need for the involvement of technical staff, such as an IT development team, and removes any potential limitations associated with deploying inappropriate technologies, thus avoiding the risk of the “tail wagging the dog”.
2. Data Profiling
This is the process of examining the data available from diverse sources, primarily taking the form of a data discovery function. In a perfect world this can be automated: point the service at the relevant datasets and it undertakes a first assessment, identifying the data formats, minimum/maximum lengths and values, and datetime ranges of the datasets to be ingested.
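As a rough illustration, a first-pass profile of an ingested dataset might look like the sketch below (using pandas; the per-column summaries mirror the checks described above, and the sample data is invented):

```python
# First-pass profiling sketch: infer formats, min/max lengths and values,
# and datetime ranges for each column of an ingested dataset.
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    report = {}
    for col in df.columns:
        s = df[col].dropna()
        entry = {"dtype": str(df[col].dtype), "nulls": int(df[col].isna().sum())}
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            entry["range"] = (s.min(), s.max())            # datetime range
        elif pd.api.types.is_numeric_dtype(df[col]):
            entry["min"], entry["max"] = s.min(), s.max()  # value range
        else:
            lengths = s.astype(str).str.len()
            entry["min_len"], entry["max_len"] = int(lengths.min()), int(lengths.max())
        report[col] = entry
    return report

df = pd.DataFrame({"id": ["A1", "B22"], "qty": [3, None],
                   "ts": pd.to_datetime(["2020-01-01", "2020-06-30"])})
print(profile(df))
```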
3. Data Quality
This covers the creation of an appropriate Data Catalogue containing metadata, and the modelling of relationships within and between datasets.
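A catalogue entry can be as simple as structured metadata plus declared relationships. The sketch below gives the flavour; the field names are assumptions for illustration, not a standard:

```python
# Sketch of a minimal data catalogue: metadata per dataset, plus named
# relationships within and between datasets. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class CatalogueEntry:
    name: str
    owner: str
    description: str
    columns: dict[str, str]                      # column -> logical type
    relationships: list[tuple[str, str, str]] = field(default_factory=list)
    # each relationship: (local column, relation, target "dataset.column")

trades = CatalogueEntry(
    name="trades",
    owner="front_office",
    description="Executed trades, one row per fill",
    columns={"trade_id": "string", "instrument_id": "string", "qty": "integer"},
    relationships=[("instrument_id", "references", "instruments.isin")],
)
```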
4. Data Reporting
In addition to data profiling capabilities, any data warehouse solution should incorporate data cleansing activities and data quality logging. This is a prerequisite for the final pillar, Data Resolution and Repair.
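In practice this can be as simple as recording every cleansing action against the record it touched, so the log can feed the resolution stage. A minimal sketch, in which the rule name, fields and repair are assumptions for illustration:

```python
# Sketch: log each data quality finding during cleansing so the report
# can drive the final pillar, Data Resolution and Repair.
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
quality_log = logging.getLogger("data_quality")

def cleanse_row(row: dict) -> dict:
    cleaned = dict(row)
    if cleaned.get("qty") is not None and cleaned["qty"] < 0:
        quality_log.info("rule=non_negative_qty record=%s old=%s new=0",
                         cleaned.get("trade_id"), cleaned["qty"])
        cleaned["qty"] = 0   # illustrative repair; real rules will vary
    return cleaned

print(cleanse_row({"trade_id": "T42", "qty": -5}))
```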
5. Data Resolution and Repair
The framework will need to provide full traceability, enabling the user to modify metadata and workflows, and to edit and replay data without always needing to reach back to the originating source.
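One way to picture "edit and replay" is an append-only record of each ingested value together with any corrections, so a pipeline can be re-run from the stored history rather than from the originating source. A minimal sketch under those assumptions:

```python
# Sketch: append-only trace of raw values and corrections, so data can
# be edited and replayed without going back to the originating source.
class TracedValue:
    def __init__(self, source: str, raw):
        self.source = source
        self.history = [("ingested", raw)]   # full lineage, oldest first

    def amend(self, new_value, reason: str) -> None:
        self.history.append((f"amended: {reason}", new_value))

    def current(self):
        return self.history[-1][1]           # replay uses the latest value

price = TracedValue(source="vendor_feed", raw=101.5)
price.amend(101.05, reason="decimal point corrected by user")
print(price.current(), price.history)
```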
Truth is in the eye of the owner
It is worth highlighting that it is not just about understanding the truth, in its various forms; there is also the question, “Can your proposed solution handle the volume?” Not necessarily the volume of information ingested, but the solution's capability to deliver fast query and response times on large datasets.
The ideal self-service data preparation framework should effectively support the Data Quality Management requirements of your organisation, bringing the user’s attention to data loss and quality issues. This results in less dirty data, lower remediation costs and, ultimately, fewer of the financial implications associated with loss of reputation. The Finworks Data Platform has achieved this on several occasions for high-profile clients where the quality of the data is an essential expectation.
Martin Sexton is a Senior Business Analyst at Finworks
For further information
We will be presenting at the Big Data Analytics 2020 conference