As enterprises mature their big data capabilities, they are finding it increasingly difficult to extract value from their data. This is primarily due to two reasons:
- Organizational immaturity with regard to change management based on the findings of data science.
- Scalability limitations that reduce the efficiency of the data science team.
This leads to disappointment, as encouraging early prototypes fail to deliver on promises. At The Data Incubator, a data science hiring and training firm, and Dell EMC Services, a big data professional services firm, we’ve worked with hundreds of Fortune 500 clients through growing pains as their big data capabilities mature. From those interactions, we’ve identified five key drivers that can help maturing enterprises reach monetization faster. Companies that want to capitalize on their early data science success need to embrace these five drivers.
1. Consolidate data into a single data lake to avoid data sprawl
As organizations grow into big data maturity, deployments of Hadoop and other big data technologies often pop up throughout the enterprise. The initial decentralized approach allows for faster adoption, but eventually results in silos of data and technology. These silos are problematic because data is often duplicated across the deployments, which can create compliance issues and certainly raises the overall cost of maintenance. Furthermore, multiple systems that do not interact well can hinder and discourage analyses by data scientists and steepen the learning curve for anyone looking to start analyzing data. More importantly, providing visibility through reports and analytics across these silos is nearly impossible, preventing upper management from having a clear picture of the business. Successful clients have found tremendous value in consolidating the data into a single lake.
2. Provide users with the appropriate level of access to data
For organizations that have consolidated data into a centralized lake, the next challenge is providing the right level of access to the data. To perform advanced analytics, data scientists require a few things: access to large amounts of data, the ability to augment the existing data with outside data sources, and the ability to model the data using cutting-edge tools and libraries. This is often the exact opposite of what risk-averse IT administrators want to provide, which results in lost productivity for the data scientists. Data security is an important consideration, especially for clients in financial services or health care. But IT policy must balance security against the productivity of the people who work with the data. Successful clients have often sidestepped this problem by offering analytical sandboxes, independent of the production system, to the data science community. Sandboxes let data scientists experiment and iterate freely as they perform their work. They also postpone the complex questions around permissioning to a later stage, after business value has been more tangibly established, so that managers can make more informed business decisions.
3. Strike a balance between governance and freedom
For some organizations, restrictions are not the concern; in fact, the problem is the opposite. In these cases, IT administrators dial back restrictions on the data lake and give users free rein. This may seem ideal to some users, but when expensive queries hog all the computational resources, or data becomes corrupted, everyone on the system suffers. Without governance and structure, data lakes quickly become uninhabitable data swamps, with lagoons of unsupported tables. The key here is to find the right balance between giving users the freedom to choose their tools and experiment, and providing a consistent quality of service to the operational environment.
4. Align data initiatives with business goals
Far too many organizations, early in their big data deployments, move quickly to establish data platforms and make technology choices without considering the business strategy along the way. This mentality of “if we build it, they will come” may seem innocent initially. After all, how harmful can it be to build out a data lake? It turns out that if technology choices and business processes are put into place without understanding how the business will actually take advantage of the underlying system, then there is a good chance the deployed platform won’t meet the needs of the business and will be scrapped in favor of something else. On paper, the solution to this is simple: IT and the business must work together to define the requirements for the system prior to implementation. In practice, this is often the most difficult thing to do, and it requires persistence and strong leadership from both sides to bring the parties together.
5. Create a data infrastructure with the ability to scale
Most good data lake implementations follow the tried-and-true guidance of deploying on commodity, bare-bones infrastructure. This is fine, until it isn’t. Once these deployments reach dozens of servers and hundreds of terabytes of data with dozens of analytical users, provisioning sandboxes becomes a full-time job—and it shouldn’t be. Two things can help streamline this process:
- Containerize the compute environments so that new sandboxes can be deployed with the click of a button.
- Decouple the data storage from the compute environment and provide read-only access from the containerized sandboxes to the data.
Docker and Kubernetes provide excellent tooling for both steps. This setup gives analysts flexibility and easy access to data whose integrity is protected, while allowing the compute and storage tiers to scale independently. The result is lower total cost of ownership and easier overall maintenance.
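To make the two steps above concrete, here is a minimal sketch of how a sandbox could be described for Kubernetes: one containerized environment per analyst, with the data lake mounted read-only. The function name, the image name, the claim name "data-lake-pvc", the mount path, and the resource limits are all hypothetical placeholders chosen for illustration, not details from any particular deployment.

```python
import json


def sandbox_pod_manifest(user, image, data_claim, cpu="2", memory="8Gi"):
    """Build a Kubernetes Pod manifest (as a plain dict) for one analyst sandbox.

    All argument values are illustrative assumptions; adapt them to your cluster.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"sandbox-{user}",
            "labels": {"app": "sandbox", "owner": user},
        },
        "spec": {
            "containers": [
                {
                    "name": "notebook",
                    "image": image,
                    # Cap compute so one sandbox cannot starve the others.
                    "resources": {"limits": {"cpu": cpu, "memory": memory}},
                    # The shared data is visible inside the container, but not writable.
                    "volumeMounts": [
                        {"name": "lake", "mountPath": "/data", "readOnly": True}
                    ],
                }
            ],
            "volumes": [
                {
                    "name": "lake",
                    # Storage lives outside the Pod, so it scales separately from compute.
                    "persistentVolumeClaim": {
                        "claimName": data_claim,
                        "readOnly": True,
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = sandbox_pod_manifest(
        user="alice",
        image="example.org/ds-notebook:latest",  # hypothetical image
        data_claim="data-lake-pvc",  # hypothetical claim name
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and submitted with `kubectl apply -f`, which accepts JSON as well as YAML. Generating the manifest from a small function is one way to turn sandbox provisioning into a repeatable, one-step operation rather than a manual task.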
As enterprise data efforts mature, they run into many new barriers that companies still prototyping big data initiatives do not face. This is perfectly natural and a healthy sign of growth. By focusing on these five drivers, companies can build on the successes of their big data pilots and drive long-term value with data.