Historically, IT organizations have defined data strategy with a focus on storage. They’ve built comprehensive plans for sizing and managing their platforms and they’ve developed sophisticated methods for handling data retention. While this is certainly important, it actually addresses the tactical aspects of content storage – it’s not planning for how to improve all of the ways you acquire, store, manage, share and use data.
A data strategy must address data storage, but it must also take into account the way data is identified, accessed, shared, understood and used. To be successful, a data strategy has to include each of the different disciplines within data management. Only then will it address all of the issues related to making data accessible and usable so that it can support today’s multitude of processing and decision-making activities.
There are five core components of a data strategy that work together as building blocks to comprehensively support data management across an organization: identify, store, provision, process and govern.
Identify data and understand its meaning regardless of structure, origin or location
One of the most basic constructs for using and sharing data within a company is establishing a means to identify and represent the content. Whether it’s structured or unstructured content, manipulating and processing data isn’t feasible unless the data value has a name, a defined format and value representation (even unstructured data has these details). Establishing consistent data element naming and value conventions is core to using and sharing data. These details should be independent of how the data is stored (in a database, file, etc.) or the physical system where it resides.
It’s also important to have a means of referencing and accessing metadata associated with your data (definition, origin, location, domain values, etc.). In much the same way that an accurate card catalog helps an individual retrieve a book from a library, successful data usage depends on the existence of metadata (to help retrieve specific data elements). Consolidating business terminology and meaning into a business data glossary is a common means of addressing part of the challenge.
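As a rough illustration of the metadata described above, a business data glossary can be thought of as a catalog keyed by business name. The sketch below is a minimal in-memory version in Python. The class names, field names, sample element and system names are all hypothetical and chosen only for illustration; a real glossary would normally live in a dedicated metadata tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlossaryEntry:
    """Metadata for one business data element (field names are illustrative)."""
    name: str                  # agreed business name, independent of any system
    definition: str            # plain-language meaning
    data_format: str           # expected format of the value
    origin: str                # system that creates the value
    locations: tuple = ()      # physical places where the element is stored
    domain_values: tuple = ()  # allowed values, if the element is coded


class BusinessGlossary:
    """A tiny in-memory catalog keyed by case-insensitive business name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        key = entry.name.lower()
        if key in self._entries:
            raise ValueError(f"{entry.name!r} is already defined")
        self._entries[key] = entry

    def lookup(self, name):
        return self._entries[name.lower()]


glossary = BusinessGlossary()
glossary.register(GlossaryEntry(
    name="customer_status",
    definition="Current standing of the customer relationship",
    data_format="single uppercase letter",
    origin="billing",  # hypothetical source system
    locations=("billing.cust.stat_cd", "crm.accounts.status"),
    domain_values=("A", "I", "P"),
))

entry = glossary.lookup("Customer_Status")
print(entry.origin, entry.domain_values)
```

The point of the sketch is that the name, format and domain values are recorded once, separately from the databases or files listed under `locations`.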
Persist data in a structure and location that supports easy, shared access and processing
Data storage is one of the basic capabilities in a company’s technology portfolio – yet it is a complex discipline. Most IT organizations have mature methods for identifying and managing the storage needs of individual application systems; each system receives sufficient storage to support its own processing and storage requirements. Whether dealing with transactional processing applications, analytical systems or even general purpose data storage (files, email, pictures, etc.), most organizations use sophisticated methods to plan capacity and allocate storage to the various systems. Unfortunately, this approach only reflects a “data creation” perspective. It does not encompass data sharing and usage.
The gap in this approach is that there’s rarely a plan for efficiently managing the storage required to share and move data between systems. The reason is simple: the most visible data sharing in the IT world is transactional in nature. Transactional details between applications are moved and shared to complete a specific business process. Bulk data sharing isn’t well-understood and is often perceived as a one-off or infrequent occurrence.
As organizations have evolved and data assets have grown, it has become clear that storing all data in a single location isn’t feasible. It’s not that we can’t build a system large enough to hold the content. The problem is that the size and distributed nature of our organizations – and the diversity of our data sources – make loading data into a single platform impractical. Not everyone needs access to all of the company’s data; people need access to specific data to support their individual needs.
Package data so it can be reused and shared, and provide rules and access guidelines for the data
In the early days of IT, most application systems were built as individual, independent data processing engines that contained all of the data necessary to perform their defined duties. There was little or no thought given to sharing data across applications. Data was organized and stored for the convenience of the application that collected, created and stored the content.
When the occasional request for data came up, an application developer created an extract by either dumping that data into a file or building a one-off program to support another application’s request. The developer didn’t think about ongoing data provisioning needs, or data reuse or sharing. At that time, data sharing was infrequent. Today, data sharing is definitely not a specialized need or an infrequent occurrence – data is often used by 10 other systems to support additional business processes and decision making.
But most application systems were not designed to share data. The logic and rules required to decode data for use by others are rarely documented or even known outside of the application development team. Most IT organizations don’t provide budget or staff resources to address nontransactional data sharing. Instead, it’s handled as a courtesy or convenience – and often addressed as a personal favor between staff members.
When data is shared, it’s usually packaged at the convenience of the application developer, not the data user. Such an approach might have been acceptable in years past, when just a few systems and a couple of teams needed access. But it’s completely impractical in today’s world where IT manages dozens of systems that rely on data from multiple sources to support individual business processes. Packaging and sharing data at the convenience of a single source developer – instead of the individuals managing 10 downstream systems that require the data – is ridiculous. And expecting individuals to learn the idiosyncrasies of dozens of source application systems just so they can use the data is an incredible waste of time.
Figure: Customer details are stored and referenced differently in each operational application.
Sharing data is no longer a specialized technical capability to be addressed by application architects and programmers. It has become a production business need. Businesses are dependent on data being shared and distributed to support both operational and analytical needs. Sharing data can’t be managed as a courtesy; the method for packaging and sharing data can’t be treated as a one-off need.
If a company’s data is truly a corporate asset, then all data must be packaged and prepared for sharing. To treat data as an asset instead of a burden of doing business, a data strategy has to address data provisioning as a standard business process.
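To make the idea of provisioning concrete, the following Python sketch packages customer records from two source applications into one shared layout, so that downstream users never need to learn each source’s idiosyncrasies. Everything here is an assumption made for illustration: the source names (`billing`, `crm`), the raw field names, and the shared record layout are hypothetical.

```python
# Hypothetical raw records, as each source application happens to store them.
billing_row = {"CUST_NO": "00042", "NM": "SMITH, ANA", "STAT_CD": "A"}
crm_row = {"customerId": 42, "firstName": "Ana", "lastName": "Smith", "active": True}


def from_billing(row):
    """Decode the billing layout: padded ID, 'LAST, FIRST' name, status code."""
    last, first = [part.strip().title() for part in row["NM"].split(",")]
    return {
        "customer_id": int(row["CUST_NO"]),
        "first_name": first,
        "last_name": last,
        "is_active": row["STAT_CD"] == "A",
    }


def from_crm(row):
    """Decode the CRM layout: numeric ID, separate name fields, boolean flag."""
    return {
        "customer_id": row["customerId"],
        "first_name": row["firstName"],
        "last_name": row["lastName"],
        "is_active": bool(row["active"]),
    }


# One documented decoder per source, maintained once and reused by everyone.
PROVISIONERS = {"billing": from_billing, "crm": from_crm}


def provision(source, row):
    """Return a record in the shared layout, whatever the source looks like."""
    return PROVISIONERS[source](row)


shared_from_billing = provision("billing", billing_row)
shared_from_crm = provision("crm", crm_row)
print(shared_from_billing == shared_from_crm)
```

The decoding rules that would otherwise live only in the heads of the application team are captured in one place, which is the shift from sharing as a courtesy to provisioning as a standard process.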
Move and combine data residing in disparate systems, and provide a unified, consistent data view
Data generated from applications is a treasure trove of knowledge – but data is a raw commodity at the time of creation. It hasn’t been prepared, transformed or corrected to make it “ready to use.” Process is the component of data strategy that addresses the activities required to evolve data from a raw ingredient into a finished good.
Source system data is much like a raw ingredient in a manufacturing process. For a manufacturer to construct a product (let’s say a box of cereal), it must acquire a large quantity of raw ingredients (flour, fruit, nuts, cardboard, printing ink, etc.) and develop a manufacturing process to build and deliver a box of cereal to the grocer’s shelf. A box filled with flour, nuts and ink isn’t ready to use; baking, processing, packing and shipping are required to make a product that’s ready to use and available on the grocer’s shelf.
Data generated from an application is very much a raw ingredient. At most companies, data originates from both internal and external sources. Internal data is generated from dozens (if not hundreds) of application systems. External data may be delivered from a variety of different sources (cloud applications, business partners, data providers, government agencies, etc.). While this data is often rich with information, it wasn’t packaged in a manner to be integrated with the unique combination of sources that exist within each individual company. To make the data ready to use, a series of steps is necessary to transform, correct and format the data. The result of this process is a small set of homogeneous data sets that a data user can merge or integrate through data preparation tasks specific to their individual needs (analytics, transaction processing, data sharing, etc.).
It’s common for companies to establish a centralized team to address data cleansing, standardization, transformation and integration for the data warehouse. Unfortunately, many have learned that this type of processing isn’t unique to a data warehouse. Most data users (applications, analytics users, developers, etc.) require ready-to-use data – so these users end up taking on the development effort themselves. Developing code to identify and match records across these individual sources can be quite complex, particularly when some systems require data from 20 or more sources.
Developers spend enormous time building logic to match and link values across a multitude of sources. Unfortunately, as each new development team requires access to individual data sources, they reconstruct or reinvent the logic needed to link values across the same data sources. The tragedy of data integration is that this rework happens with each new project because the learnings of the past are never captured for reuse.
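One way to picture capturing that logic for reuse is a single, shared matching function that every project calls instead of rewriting. The Python sketch below is deliberately simplistic and rests on stated assumptions: the record fields (`name`, `postal_code`), the source names and the exact-match rule on a normalized key are all hypothetical, and real record matching usually needs fuzzier techniques.

```python
import re
from collections import defaultdict


def match_key(name, postal_code):
    """Build a simple, deterministic key for linking customer records."""
    normalized_name = re.sub(r"[^a-z]", "", name.lower())
    normalized_postal = re.sub(r"\s+", "", postal_code).upper()
    return f"{normalized_name}|{normalized_postal}"


def link_records(sources):
    """Group records from many sources by match key.

    `sources` maps a source name to a list of dicts that each carry
    "name" and "postal_code" (an assumed, illustrative layout).
    """
    linked = defaultdict(list)
    for source_name, records in sources.items():
        for record in records:
            key = match_key(record["name"], record["postal_code"])
            linked[key].append((source_name, record))
    return dict(linked)


sources = {
    "orders": [{"name": "Ana Smith", "postal_code": "sw1a 1aa"}],
    "support": [
        {"name": "ANA  SMITH", "postal_code": "SW1A1AA"},
        {"name": "Bo Chen", "postal_code": "10001"},
    ],
}

linked = link_records(sources)
for key, matches in linked.items():
    print(key, [source for source, _ in matches])
```

Because the normalization rules sit in `match_key`, a correction made for one project benefits every other team that links the same sources.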
While most organizations have initiatives to address code reuse and collaboration for application development, they have not focused this effort on delivering data that is ready to use and promotes sharing and reuse. It’s not practical (nor is it appropriate) for data users to become developers. Making data ready to use is about offering tools and establishing processes to produce data that individuals can use – without IT involvement.
Establish, manage and communicate information policies and mechanisms for effective data usage
Since data is still often perceived as a byproduct of application processing, few organizations have fully developed the methods and processes needed to manage data outside the context of an application and across the enterprise. While many have begun investing in data governance, many of those initiatives are still in their infancy.
Most data governance initiatives start by addressing specific tactical issues (e.g., data accuracy, business rule definition or terminology standards) and are confined to specific organizations or project efforts. As governance awareness grows, and as data sharing and usage issues gain visibility, governance initiatives often broaden in scope. As those initiatives expand, organizations may establish a set of information policies, rules and methods to ensure uniform data usage, manipulation and management.
But all too often data governance is perceived as a rigor specific only to users and the analytics environment. In fact, data governance applies to all applications, systems and staff members. The biggest challenge with data governance is adoption – because data governance is an overarching set of information policies and rules that everyone must respect and follow.
The reason for establishing a strong governance process is to ensure that once data is decoupled from the application that created it, the rules and details of the data are known and respected by all other data constituents. The role governance plays within an overall data strategy is to ensure that data is managed consistently across the company.
Whether it is for determining security details, data correction logic, data naming standards or even establishing new data rules, effective data governance makes sure data is consistently managed, manipulated and accessed. Decisions about how data is processed, manipulated or shared aren’t made by an individual developer; they’re established by the rules and policies of data governance.
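As a small illustration of rules being set centrally rather than by each developer, the Python sketch below checks a data set’s column names against two governance rules before it is shared. The rules themselves (a snake_case naming standard and a short list of restricted elements) and all names in the code are assumptions invented for this example, not policies taken from any particular organization.

```python
import re

# Hypothetical governance rules, defined once for the whole company.
NAMING_RULE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
RESTRICTED_ELEMENTS = {"tax_id", "date_of_birth"}


def check_data_set(columns, requester_cleared):
    """Return a list of rule violations for a data set about to be shared."""
    violations = []
    for column in columns:
        if not NAMING_RULE.match(column):
            violations.append(f"{column}: name does not follow the naming standard")
        if column in RESTRICTED_ELEMENTS and not requester_cleared:
            violations.append(f"{column}: restricted element requires clearance")
    return violations


problems = check_data_set(["customer_id", "CustName", "tax_id"], requester_cleared=False)
for problem in problems:
    print(problem)
```

A check like this could run wherever data is packaged for sharing, so the same standards apply regardless of which team produced the data.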
The purpose of data governance isn’t to limit data access or insert a harsh, unusable level of rigor that interferes with usage. Its premise is simply to ensure that data becomes easier to access, use and share. The rigor introduced by a data governance effort shouldn’t be overwhelming or burdensome. While data governance may initially affect developers’ productivity (because of the new processes and work activities), the benefits to downstream data constituents and dramatic improvements in productivity should more than counteract the initial impact.
It should be no surprise that a data strategy has to include data governance. It’s simply impractical to move forward – without an integrated governance effort – in establishing a plan and road map to address all the ways you capture, store, manage and use information. Data governance provides the necessary rigor over the data content as changes occur to the technology, processing and methodology areas associated with the data strategy effort.