According to Gartner, "transforming into a (truly) digital business is the number one priority of most organisations. However, a digital business cannot exist without data and analytics. If an organisation struggles with digital transformation, perhaps they haven't given enough thought to data and the potential for valuable insights."
Data is the lifeblood of successful enterprises. Growing organisations must tap into all of the available information and develop ways to use it as a valuable asset. By exploring, organising and analysing data from every transaction and touchpoint, organisations can gain vital insights about their customers and potential market opportunities. Ideally, these insights should drive product roadmap priorities and map out where and how digital activities can maximise business impact. Driving a business strategy with data to continuously improve and shape the customer experience is one of the greatest challenges facing companies today.
As artificial intelligence becomes increasingly ubiquitous and the number of connected devices and systems grows, the volume of data will continue to expand exponentially. Successful companies can leverage a wide variety of data streams to drive and optimise their customers' digital experiences. Data pipelines for AI and sensor-driven analytics must be designed with scalability, security, and availability in mind to meet the needs of modern enterprises and the expectations of end users.
We conducted a Q&A with Chris Gojlo, our Data Architect, about the challenges of dealing with data and about building bulk data processing tools for one of our major strategic clients.
SC: Given the exponential pace of change and the scale of innovation in almost every industry, what are the main challenges large organisations face when dealing with data?
CG: With the advent of cloud computing, some of the important problems of the recent past regarding the mechanisms for storing and capturing large, raw datasets have been sufficiently addressed and are no longer crucial to the discussion. Data-driven insight is a key competitive advantage in any industry today, but obtaining information from raw data can still take days or weeks. Addressing the bottlenecks that slow down time-to-insight across data discovery, transformation, processing, and production is now a common challenge in large organisations. The focus has shifted from the data processing infrastructure, which has been largely commodified, to new models for effective data management.
Inefficient data governance leads to the acceptance of low-quality data circulating in the corporate system, with incomplete or invalid data points. Poor-quality datasets shared with downstream components force each consumer to create its own quality protection mechanisms, which adds complexity to the data pipeline and latency to the data channel. The absence of consistent rules for data discovery, of data reuse policies, and of regulations governing how associations across the data are created is another key factor contributing to longer time-to-insight.
As data streams can flow through different channels, using a variety of formats and semantics, it is essential to ensure that the data dictionary supports real-time data profiling and auditing. This can improve operational efficiency, ensure compliance and increase customer satisfaction as data problems are flagged as soon as they occur.
All of this has contributed to the continuing trend of creating self-service data roadmaps to support data discovery, data quality scoring, lineage, and governance. The self-service approach, coupled with automation across data governance processes, can be used to build the capabilities required to democratise data in an organisation and reduce time-to-insight, which is a substantial bottleneck in modern data architectures.
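As a loose illustration of the real-time profiling idea mentioned above, the sketch below checks incoming records against a hypothetical data dictionary and flags incomplete or invalid data points as soon as they arrive. The field names, rules and thresholds are invented for the example; they are not taken from any specific system discussed in this article.

```python
from datetime import datetime, timezone

# Hypothetical data dictionary: expected fields, types and simple validity rules.
DATA_DICTIONARY = {
    "customer_id": {"type": str, "required": True},
    "order_total": {"type": float, "required": True, "min": 0.0},
    "channel":     {"type": str, "required": False, "allowed": {"web", "mobile", "store"}},
}

def profile_record(record: dict) -> list[str]:
    """Return a list of quality issues for a single record; empty if none."""
    issues = []
    for field, rules in DATA_DICTIONARY.items():
        value = record.get(field)
        if value is None:
            if rules.get("required"):
                issues.append(f"missing required field '{field}'")
            continue
        if not isinstance(value, rules["type"]):
            issues.append(f"field '{field}' has type {type(value).__name__}, expected {rules['type'].__name__}")
            continue
        if "min" in rules and value < rules["min"]:
            issues.append(f"field '{field}' is below the minimum of {rules['min']}")
        if "allowed" in rules and value not in rules["allowed"]:
            issues.append(f"field '{field}' has unexpected value '{value}'")
    return issues

def audit(record: dict) -> None:
    """Flag problems as soon as a record is seen, rather than leaving them for downstream consumers."""
    issues = profile_record(record)
    if issues:
        print(f"[{datetime.now(timezone.utc).isoformat()}] quality alert: {issues} in {record}")

audit({"customer_id": "C-1001", "order_total": -5.0, "channel": "fax"})
```

In practice the alert would be routed to a monitoring or governance tool rather than printed, but the principle is the same: data problems are surfaced at the point of entry, not days later.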
SC: What is the role of data architecture and data governance in building a solid bridge between the consumer of data and the data provider?
CG: A successful enterprise data strategy should have a clear plan on how to unleash the value of the data assets to serve business purposes. It requires a systematic approach to data governance by managing data assets in terms of ownership, integrity, compliance, access methods, and relationships between different datasets.
With increased regulatory demands on businesses to implement compliance policies that protect sensitive data, companies are required to build data-driven solutions aligned with the current regulatory framework and to evaluate existing systems against new rules and restrictions. This calls for a coherent strategy covering all the processes that define how sensitive data is stored, processed and accessed, and how it is guarded against breaches. Such a strategy is an important factor in ensuring uninterrupted and secure access to valuable data for both internal and external customers. The data architecture and data governance model are part of this strategy.
The bridge between the data service and data recipients is always based on trust. This trust can be built by providing high-quality data, with metrics and metadata describing the structure and semantics that reflect business requirements. Improving the reliability of the service, not only operationally but also in terms of delivering data that is relevant to the customer, is part of this process too.
SC: How were the challenges you have mentioned addressed in the design of bulk data processing tools?
CG: Understanding the problem domain and the business context is a critical starting point. By leveraging Domain-Driven Design (DDD) as a method for developing an understanding of the problem space, we were able to create and refine our set of conceptual models for the solution within the target domain. This approach helped us structure the data architecture in line with the domain model and, at the same time, control the complexity of the solution architecture itself. It also proved very useful in exposing potential misalignments between different services operating in the same domain.
As our data processing solution operates within a bounded context, with strong information architecture requirements, we had to address specific issues related to data governance, especially around ownership, data quality, integrity, security and compliance.
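As a toy illustration of the bounded-context idea (the context names and fields below are invented for the example, not the client's actual domain model), the same real-world concept can be deliberately modelled in different ways in different contexts, with an explicit translation at the boundary:

```python
from dataclasses import dataclass

# In a hypothetical "billing" bounded context, a customer is whatever billing needs to know.
@dataclass
class BillingCustomer:
    account_id: str
    billing_address: str
    payment_terms_days: int

# In a hypothetical "data processing" bounded context, the same person is modelled
# only by the attributes that matter for bulk processing and governance checks.
@dataclass
class DataSubject:
    record_id: str
    source_system: str
    consent_given: bool

# Each context owns its model; translation happens explicitly at the boundary
# rather than by sharing one model across the whole organisation.
def to_data_subject(customer: BillingCustomer, source_system: str, consent_given: bool) -> DataSubject:
    return DataSubject(record_id=customer.account_id, source_system=source_system, consent_given=consent_given)
```

Keeping the models separate like this is one way the misalignments mentioned above become visible: every assumption one context makes about another has to pass through a translation function.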
Two areas were particularly important for our client: the ability to audit data changes across the processing pipeline (data lineage) and data quality assurance. The self-service aspect of our solution allowed users, for example, to automatically check the quality of data coming from a multitude of sources against a spectrum of data quality factors, without manual and time-consuming effort. Managing data traceability from a single service and exposing changes as data audit events has led to a number of improvements and benefits from a compliance and security standpoint, but also for the end users. Being able to find out what happened to your data at any stage of the process is critical to building user trust.
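To make the lineage idea concrete, here is a minimal sketch of the kind of audit event a pipeline stage might emit whenever it changes a record. The event fields, stage names and example values are assumptions for illustration only, not the schema used in the client's solution.

```python
import json
import uuid
from datetime import datetime, timezone

def emit_audit_event(stage: str, record_id: str, change: str, details: dict) -> dict:
    """Build a data-audit event describing what a pipeline stage did to a record.

    In a real pipeline this would be published to an event stream or audit log;
    here the event is simply printed and returned.
    """
    event = {
        "event_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "stage": stage,          # e.g. "ingest", "normalise", "enrich" (hypothetical stage names)
        "record_id": record_id,  # identifier of the data item being traced
        "change": change,        # short description of the transformation
        "details": details,      # before/after values or references to the rule applied
    }
    print(json.dumps(event))
    return event

# Example: record what a (hypothetical) normalisation stage did to one record.
emit_audit_event(
    stage="normalise",
    record_id="order-42",
    change="currency converted to GBP",
    details={"from": {"amount": 100, "currency": "USD"}, "to": {"amount": 79, "currency": "GBP"}},
)
```

A consumer who later asks "what happened to my data?" can then replay the events for a given record_id and see every stage it passed through.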
SC: You have mentioned data quality. What is the impact of poor data quality on organisations and their operations? With massive datasets available for exploration, does this have any relevance to data science and AI?
CG: In general, without high-quality data, companies are unable to recognise and react to changes in the market landscape, correctly assess their position relative to competitors or provide accurate and reliable analytics. Worst of all, this can lead to ill-conceived business strategies.
There are two aspects to consider. The first is using automation and AI to improve data quality and data management processes. Data quality is a multidimensional problem, as it covers consistency, integrity, accuracy, and completeness. Traditional data profiling, combined with machine learning and semantic analysis techniques, can help to tackle more complex problems such as disambiguation and relationship extraction.
We have implemented smart data validation and correction in our bulk data processing tools to ensure that high-quality data is available to the consumers regardless of their relative position in the business process or distance from the original data source.
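As a rough sketch of what automated validation and correction can look like (the rules and field names below are illustrative assumptions, not the actual implementation), a pipeline step might repair issues it can fix safely and flag everything else for review:

```python
def validate_and_correct(record: dict) -> tuple[dict, list[str]]:
    """Return a corrected copy of the record plus a list of issues that could not be fixed automatically.

    Illustrative rules only: normalise text fields, coerce numeric strings,
    and refuse to guess when a value cannot be repaired safely.
    """
    corrected = dict(record)
    unresolved = []

    # Consistency: normalise the country code if present; flag it if missing.
    country = corrected.get("country")
    if isinstance(country, str):
        corrected["country"] = country.strip().upper()
    elif country is None:
        unresolved.append("missing country")

    # Accuracy: coerce a numeric string into a number, but do not invent a value.
    amount = corrected.get("amount")
    if isinstance(amount, str):
        try:
            corrected["amount"] = float(amount)
        except ValueError:
            unresolved.append(f"amount '{amount}' is not numeric")

    return corrected, unresolved

record, issues = validate_and_correct({"country": " gb ", "amount": "19.99"})
print(record, issues)   # {'country': 'GB', 'amount': 19.99} []
```

The design choice worth noting is the split between corrections the system applies automatically and issues it only reports: correcting silently where the fix is unambiguous keeps latency low, while anything ambiguous is surfaced rather than guessed.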
The second aspect concerns the use of data in data science and AI pipelines. Access to verified, consistent and reliable data is extremely important at all stages of a data science project, for example in exploratory analysis and hypothesis testing. In machine learning, training data is the dataset of labelled images, video, audio and other data sources used to train the ML algorithm. Anything that could mislead the learning algorithm is not high-quality data.
A dataset, regardless of its volume, can contain biased data that does not provide enough information for the model to learn the problem, or that is unrepresentative of reality in some way. Models trained on biased data not only produce inaccurate results, they also pose ethical, legal and safety problems. We should not assume that operating with high volumes of data reduces the need for data quality checkpoints. ML/AI can deal effectively with noise in the data, but only if the quality issues in the training dataset have been resolved.
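As a small illustration of the kind of checkpoint this implies (the labels and the 10% threshold are arbitrary choices for the example), one cheap sanity check is to look at the label distribution of a training set before any model is trained:

```python
from collections import Counter

def label_balance_report(labels: list[str], min_share: float = 0.1) -> dict:
    """Report the share of each label and flag classes whose share falls below min_share.

    A skewed label distribution is only one symptom of biased or unrepresentative
    data, but it is easy to detect before training starts.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    under_represented = [label for label, share in shares.items() if share < min_share]
    return {"shares": shares, "under_represented": under_represented}

# Hypothetical labels for a small training set.
print(label_balance_report(["cat"] * 90 + ["dog"] * 8 + ["bird"] * 2))
```

Checks like this do not guarantee a representative dataset, but they catch the obvious cases early, before an imbalanced training set turns into an inaccurate or unfair model.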