In part 1 of this blog series we discussed the need for a data model and how it supports the data platform. In part 2 the focus is on four characteristics of the data that affect how it is processed: volume, velocity, variety, and value. The content builds toward the core elements that one should consider.
One may ask why we chose to address the data platform ahead of the data characteristics. This order appears to run contrary to what project management and business analysis methodologies guide the industry to do. We did this for one simple reason. When a hardware or software vendor comes to do a demo or a presentation, the conversation naturally starts from the tool and expands to the business case. Starting at a hardware or software demo almost automatically precludes an investigation of the data held within an organization and instead starts the conversation from the point of what data the tool can manage. This becomes especially important when an organization is trying to leverage an investment in a current toolset before purchasing another piece of software, hardware, or cloud subscription. Now that the tendency to talk technology first has been satisfied, we can take a more deliberate approach in discussing the characteristics of the data: volume, velocity, variety, and value.
Big Data is characterized by huge amounts of data. On a Cloudera community site entry there is anecdotal evidence of a Hadoop cluster holding up to 200 million files and 100 petabytes of data. If your organization is working with data sets of that size, then perhaps Hadoop will be a viable option. However, not all data types will reach that size. Perhaps this data size would correspond with 15-minute interval readings for a large utility company such as Consolidated Edison, which will have 4 million smart meters by the end of 2020 as noted in a Utility Dive article. The volume on this would be 4 reads per hour x 24 hours, which gives us 96 reads per day per meter. Multiply this by 4 million and that gives us 384 million data points per day, or a little over 140 billion data points per year. That seems like some big data to try to track. However, what if your volume is a little bit lower? What if you are pulling data on work orders, and perhaps your organization performs 100,000 work orders per year? In that case a relational database may be more aligned with the storage needs of that type of data.
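The meter reading arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and the 365-day year is an assumption that does not come from the cited article.

```python
# Figures taken from the example in the text.
METERS = 4_000_000      # smart meters
READS_PER_HOUR = 4      # one read every 15 minutes
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365     # assumption: a non-leap calendar year

reads_per_meter_per_day = READS_PER_HOUR * HOURS_PER_DAY
reads_per_day = reads_per_meter_per_day * METERS
reads_per_year = reads_per_day * DAYS_PER_YEAR

print(f"{reads_per_meter_per_day} reads per meter per day")
print(f"{reads_per_day:,} reads per day")
print(f"{reads_per_year:,} reads per year")
```

Changing the constants gives a quick estimate for a utility of a different size or a different read interval.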
The takeaway from this section is that the volume of different types of data should be considered. Not all data types will have the same volume. Mixing all the data into one large data store may lead to less efficient data retrieval, since the high-volume data needs to be sifted to find the low-volume data. One way that an enterprise data model supports the selection of the tool set for a data type is by classifying what the data type is (e.g., customer data vs. meter reading data). This allows the profiling of the data to determine the true volume vs. the silo volume. I would invite you to think on the true volume vs. the silo volume for a minute. Silo volumes of data come from two sources. One source is multiple copies made of the same data set across silos because each silo feels it must have its own copy. This leads to an artificial inflation of the data by a factor of 2 at a minimum (one original record and one copy record). Often the artificial inflation is by a much larger factor depending on how the data is manipulated, managed, and stored. The second way that the silo volume skews data is through silo definitions. As an example, if a utility has gas and electric operations, the organization may choose to manage gas readings in one silo and electric readings in another silo. The reality is that these are both meter readings, just of a different commodity. Combining these readings into a common data type and managing them as a common data type could provide significant reporting and data service efficiencies. The siloed volumes of data mask the true volume through data types, data duplications, and inconsistent definitions, to name a few factors. Once the true volume of data is identified, the organization can start to identify management technologies within the data platform for a specific data type.
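The difference between silo volume and true volume can be illustrated with a small profiling sketch. The silo names and record keys below are hypothetical, and the sketch assumes a record can be identified by a (meter id, timestamp) pair, which a real profiling exercise would need to confirm.

```python
# Hypothetical silos, each holding record keys of the form (meter_id, timestamp).
billing_silo = {("M1", "2020-01-01T00:15"), ("M1", "2020-01-01T00:30")}
operations_silo = {
    ("M1", "2020-01-01T00:15"),
    ("M1", "2020-01-01T00:30"),
    ("M2", "2020-01-01T00:15"),
}
analytics_silo = {("M2", "2020-01-01T00:15")}

silos = [billing_silo, operations_silo, analytics_silo]

# Silo volume counts every copy; true volume counts each distinct record once.
silo_volume = sum(len(silo) for silo in silos)
true_volume = len(set().union(*silos))
inflation_factor = silo_volume / true_volume

print(f"silo volume: {silo_volume}, true volume: {true_volume}, "
      f"inflation factor: {inflation_factor:.1f}")
```

Here every record exists in two silos, so the silo volume is twice the true volume, which matches the minimum inflation factor described above.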
To illustrate velocity, we can take the example in the prior section. The velocity of the meter data would be 4 million data points every 15 minutes, 16 million data points per hour, or 384 million data points per day. The velocity of the work orders, at 100,000 per year, would be roughly 275 per day if they were spread evenly across the calendar.
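The two velocities can be put side by side with a short calculation. This is a rough sketch: it assumes every meter reports in every 15-minute interval and that the 100,000 work orders per year from the prior section are spread evenly over 365 calendar days, neither of which is likely to hold exactly in practice.

```python
METERS = 4_000_000
INTERVAL_MINUTES = 15
WORK_ORDERS_PER_YEAR = 100_000
DAYS_PER_YEAR = 365  # assumption: even spread over calendar days

# Meter reading velocity at different time scales.
reads_per_interval = METERS
reads_per_hour = reads_per_interval * (60 // INTERVAL_MINUTES)
reads_per_day = reads_per_hour * 24
reads_per_second = reads_per_interval / (INTERVAL_MINUTES * 60)

# Work order velocity under the even-spread assumption.
work_orders_per_day = WORK_ORDERS_PER_YEAR / DAYS_PER_YEAR

print(f"meter reads: {reads_per_hour:,} per hour, {reads_per_day:,} per day, "
      f"about {reads_per_second:,.0f} per second on average")
print(f"work orders: about {work_orders_per_day:.0f} per day")
```

The gap between the two rates is several orders of magnitude, which is the reason the two data types are unlikely to be served well by the same ingestion technology.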
Another example of data with high volume and low velocity would be an asset. A utility may have several hundred thousand assets in an inventory management system, but the general rate of change for those assets in the inventory management system may be relatively slow. A technology servicing this data in the data platform will need to be conscious of the tradeoff between the volume and velocity.
The typical definition of velocity revolves around the rate of data input into a storage repository. What many organizations are starting to realize is that velocity applies not only to storage but also to data consumption. This is where data profiling would note whether the data is write once, read many (as with a meter reading); write many, read once (for example, asset data that is updated often but read infrequently); or write many, read many (for example, an asset analytic that is constantly being calculated and constantly consumed). The technology selected for a specific data type would need to support the write and consumption profile as part of the data platform responsible for providing the correct information object.
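One way to picture this kind of profiling is a small classifier over observed write and read counts. The function name, the `many` threshold, and the sample counts below are all hypothetical; a real profile would come from measured access patterns.

```python
def access_profile(writes_per_record, reads_per_record, many=2):
    """Label a data type by how often a typical record is written and read.

    `many` is an illustrative threshold: counts at or above it are "many".
    """
    write_label = "write-many" if writes_per_record >= many else "write-once"
    read_label = "read-many" if reads_per_record >= many else "read-once"
    return f"{write_label}/{read_label}"


# Hypothetical observed counts per record for three data types.
observed = {
    "meter reading": (1, 50),
    "asset data": (40, 1),
    "asset analytic": (500, 500),
}

profiles = {name: access_profile(w, r) for name, (w, r) in observed.items()}
for name, profile in profiles.items():
    print(f"{name}: {profile}")
```

Grouping data types by the resulting label gives a first cut at which ones could share a technology within the data platform.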
An enterprise data model supports the discovery of velocity in a couple of different ways. One way is the classification and reconciliation of data types. As an example, if a utility has multiple AMI systems, the enterprise data model can help reconcile these into one stream, which allows the velocity to be managed from a meter reading perspective rather than as the fragmented velocity of meter readings from head end A, head end B, etc. The second way that an enterprise data model supports this is through the assignment of the source of data and the promotion of data cleanliness in support of data governance. An example of this may be that meter-related data held in the meter data management system is also carried in the customer information system. If meter data is pulled from both systems, the velocity could be falsely inflated, which would lead to the incorrect application of a technology to that data type and suboptimal data management performance.
One of the defining aspects of big data is that it needs to be able to handle variety. The typical big data definition of variety is structured, semi-structured, and unstructured data from heterogeneous sources. In thinking about variety, an organization may want to push the definition a little bit further than just the structure needed to support data storage. One of the things Xtensible has discovered in some client engagements is that some technologies also limit the schemas that can be used within them or make it harder to implement data schema extensions. Relational databases have often come under criticism for not supporting variety due to the fixed schema needed to define a table, but some table-based data storage technologies supporting key-value pairs can offer as much flexibility as JSON-driven metadata stores. However, some data is highly structured and cannot support variety. As an example, the data elements that go into an invoice are governed by regulation, and the addition of non-specified elements could lead to incorrectly generated invoices and potentially financial penalties.
Much of the big data literature focuses on importing and reporting on large varieties of data. An aspect that is sometimes missed is the ability to create relationships between the various data types to provide meaning and business value. Perhaps an organization needs to relate the usage data at a meter point to the various customers served there, based on a date range, to determine how much energy is the base load for a house vs. how much energy is driven by the customers living in it. This type of query is best suited to a data reporting structure rather than a file storage structure because of the processing needed to answer it. In this scenario, one technology would be great for tracking the usage readings, a little less great for tracking the customer to usage point changes, and not very good at all for reporting out the data answering the question at hand. Another technology would stack up differently. The point is that no one technology can do all of it well.
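The usage question above amounts to a date-range join between two data types. The sketch below shows the shape of that join in plain Python; the usage point id, customer names, dates, and kWh values are all made up for illustration, and a reporting database would express the same logic as a join with a range condition.

```python
from datetime import date

# Hypothetical usage readings: (usage_point, reading_date, kwh).
readings = [
    ("UP-1", date(2020, 1, 10), 30.0),
    ("UP-1", date(2020, 2, 10), 28.0),
    ("UP-1", date(2020, 3, 10), 45.0),
    ("UP-1", date(2020, 4, 10), 47.0),
]

# Hypothetical occupancy: (usage_point, customer, start, end), end inclusive.
occupancy = [
    ("UP-1", "customer-A", date(2020, 1, 1), date(2020, 2, 29)),
    ("UP-1", "customer-B", date(2020, 3, 1), date(2020, 12, 31)),
]


def usage_by_customer(readings, occupancy):
    """Total kWh per customer, matching each reading to whoever was
    associated with the usage point on the reading date."""
    totals = {}
    for point, day, kwh in readings:
        for occ_point, customer, start, end in occupancy:
            if point == occ_point and start <= day <= end:
                totals[customer] = totals.get(customer, 0.0) + kwh
    return totals


totals = usage_by_customer(readings, occupancy)
print(totals)
```

Every reading has to be compared against the occupancy periods, so the cost lies in relating the two data sets rather than in storing either one, which is why the storage technology and the reporting technology may end up being different.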
The enterprise data model provides a true picture of the variety of data. Using the meter reads example, a traditional approach would indicate that each type of meter reading coming from each of the different AMI systems is a different type of data that needs to be accommodated. However, the use of an enterprise data model that treats them all as one common meter reading type could reduce that variety by 50% if two AMI systems are involved (two types become one). The reduction in variety becomes 80% if five AMI systems are involved (five types become one), as could happen when gas, electric, and water meters are all involved. This management of variety can have a significant impact on the technology selected.
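The percentages follow from a simple ratio: if n source-specific reading types collapse into one common type, the reduction is (n - 1) / n. A minimal sketch of that calculation, with a hypothetical function name and an optional parameter for cases where more than one common type is retained:

```python
def variety_reduction(source_types, canonical_types=1):
    """Fraction of data types eliminated when source-specific types
    are reconciled into a smaller set of common types."""
    if source_types <= 0 or not 0 < canonical_types <= source_types:
        raise ValueError("need 0 < canonical_types <= source_types")
    return (source_types - canonical_types) / source_types


for systems in (2, 5):
    print(f"{systems} AMI systems -> {variety_reduction(systems):.0%} reduction")
```

The reduction grows with each additional source system, so the benefit of a common definition is largest in the organizations with the most fragmented landscape.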
One final V I would like to discuss is not among the traditional three (volume, velocity, and variety) originally identified for Big Data. This final V, value, provides the reason all of this is being done. The reason for embarking on a big data project is to improve the value of the data to the organization. The reason for embarking on the development of a data platform is the same. A data platform is an attempt to manage and reconcile data so that the organization can extract maximum value from it.
The management of data along the lines of an enterprise data model helps enhance the value of the data platform by supporting the selection of the correct technologies for a specific data type with specific volume, velocity, and variety dimensions. This will also allow the grouping of data types by volume, velocity, and variety dimensions for common management, data governance, and data security. This will help reduce false starts in technology adoption due to faulty assumptions and incorrect data management practices.
At Xtensible, we understand that sometimes the discussion starts with the technology, but we also understand how critical it is to bring the discussion back to the data being managed within the data platform. As you look at your data platform landscape, it would be useful to reflect on whether or not your data platform represents a cohesive set of managed tools or a disparate set of siloed technology sets. The data characteristics dictate how the technology platform will be used, and the limitations of the data platform can also impact how the data is managed. As all of this comes together, we still have to ask the question of whether or not your data model can support your data platform.
Does your data model allow your data platform to operate at its peak operational efficiency regardless of the technologies used by your data platform? If the answer is no, maybe not, or I am not really sure, have a conversation with us at Xtensible to see if we can suggest improvements. Contact us at Sales@Xtensible.net to learn more.
James Meyer is an Associate Consultant at Xtensible.