The world of advanced big data platforms is a strange place. Like a Gilbert and Sullivan operetta, there is drama, farce, and mayhem in every act. Once in a long while, the curtain rises, time stands still, and as if by magic, it all works. Platform engineering at global scale is an art form: a delicate balance of craft, money, personalities, and politics.
With the commoditization of IT, however, there is much less craft and little art. Studies have shown that 60 to 80 percent of all IT projects fail, wasting billions of dollars annually. The end results are not simply inefficient; they are frequently unusable. Projects that do finish are often late, over budget, or missing most of their requirements.
There is immense pressure on CIOs to turn their IT infrastructure into something as much a commodity as the plumbing in their office buildings. Yet deploying platforms on the scale required for cloud computing or big data will be the most complex projects IT groups undertake.
Managing complex projects of this magnitude requires a healthy IT culture not only to ensure the successful discovery of the insights the business craves, but to continuously deliver those insights in a cost-effective way.
Computing platforms deeply impact the corporation they serve, not to mention the end users, vendors, partners, and shareholders. This Möbius strip of humanity and technology lies at the heart of the very model of a modern major enterprise. A socially productive IT organization is a prerequisite for success with big data.
Humans organized themselves into hierarchies well before the water cooler appeared. In a corporate organization, hierarchies try to balance the specialization of labor against the details only specialists worry about, distilling minutiae so that leaders can make informed business decisions without being confused or overwhelmed.
Distilling minutiae relies on preserving the right amount of detail and abstracting the rest. Because details are not created equal, the goal of abstraction is to prioritize the right details and mask the ones that cause confusion and fear, both of which do a crackerjack job of impairing judgment.
When done well, a lot of good decisions can be made very quickly and sometimes course corrections can mitigate bad decisions. Since organizations are made up of people whose motivation, emotions, and behavior combine with their understanding of topics to produce those judgments, it is rarely done well, let alone efficiently.
In large organizations, IT departments are set up as hierarchies of specialization to achieve economies of scale, at the expense of having generalists organized around the platform. Silos are a result of hierarchies, which need to organize people into economically effective groups. In IT, these silos are groups of specialists.
A group of database administrators (DBAs) is a set of specialists who scale more economically as their group grows from supporting tens of databases to hundreds. DBAs are specialists in databases, but not in storage. Storage Admins are specialists with spindles, but inexperienced at tuning SQL queries. However, fixing poor platform performance often requires genuinely collaborative work among specialties, and merely attending meetings together doesn't cut it.
Smaller silos within silos often emerge in large corporations; for example, storage administration and database administration are typically collected together in the Operations silo, whereas UI design and application programming are contained in the Development silo. If it’s politically difficult for DBAs to communicate with Storage Admins, then DBAs and UI designers are barely aware of each other’s existence.
Although enterprises like to organize employees into silos and sub-silos, the platform is not well served, and whenever the platform fails to scale, recover, or accommodate new business, each silo is potentially implicated. All computing platforms span horizontally across organizations from the physical plant all the way out to the firewall.
Big data platforms also span horizontally, but they are even more extreme—technically, financially, and politically. The silo structure is not well suited for developing and managing platforms at global scale.
Though they have administrative and economic value, silos suppress cross-functional awareness and discourage generalists with a working knowledge of the platform who could fill a very important technical and diplomatic role. Some organizations have an Infrastructure or Reference Architecture group populated by individuals who seem to be the stewards of the platform.
For the platform to be properly represented, this group needs both technical and non-technical expertise; instead, it's often staffed with seasoned technical experts who have deep expertise in a limited area of the platform, and it frequently reports into Operations with little representation from Development, Marketing, or Finance.
If the infrastructure group is given the authority to behave unilaterally, it compromises the diplomatic mission. There is always a fine line between diplomacy, moral suasion, and unilateralism. Done well, this group serves both the platform and business. Done poorly, this group ends up being just another silo.
Other companies construct “tiger teams” by forcing subject matter experts from a number of different silos to work together temporarily. In contrast, when teams of specialists in the ’50s and ’60s needed to develop a working knowledge of those old mainframe systems, they were given the latitude and time to cross-pollinate their skills as specialists in one area and generalists in others.
Learning to be a working generalist is never a part-time job; specialists must be given the time to understand the rest of the platform. Not all specialists will be comfortable with or adept at cultivating breadth of knowledge, so the casting of tiger teams is critical. Tiger teams fail when members are miscast or never allowed to forget which silo they really work for.
If it’s hard for IT departments to tear down silos, imagine how hard it will be for the industry. Silos partially arose from the ashes of the one-stop-shop, single-vendor mainframe model. Vendors specializing in network or storage products found it easier to sell to network or storage groups and so reinforced the emerging silos of specialization.
The products from these companies were optimized for the specific demographics of specialists, so they evolved away from platform awareness and toward the particular needs of each silo of subject-matter expertise. Poor interoperability among multiple vendors' products is the best example of this force in action, and over time the platform became obscured.
Some vendors are attempting a revival of the one-stop approach—mostly to increase the growth of their business, not necessarily to benefit the platform, their customers, or big data. But customers have distant (or recent, if they own the odd PC) memories of one-stop that may not be all that pleasant.
Being the "one throat to choke" is harder than it looks and, on closer inspection, the larger vendors attempting one-stop shops today can't tear down their own internal silos (oddly enough, vendors have organizations, too). They end up operating as several competing businesses under one brand.
Vendors who are now attempting one-stop shops still prefer the silo model, especially if they have franchise strength. Vendors who aspire to use big data to grow their current portfolio of products certainly don’t want to sacrifice their existing revenue base.
For some vendors, it will be a zero-sum game. For others, it will be far less than zero, because the economic rules of the big data ecosystem are unlike those of the current enterprise ecosystem. Like Kodak, whose business and margins were based on film instead of memories, traditional enterprise vendors will need to base their big data offerings on insight, not on capture or strand.
In the past decade, customers have grown increasingly dependent on advice from vendors. The codependency between vendors and IT departments is a well-entrenched, natural consequence of silos. It is now difficult for IT groups to be informed consumers, and the commoditization of staff has not helped.
Drained of advanced engineering talent, IT has outsourced this expertise to service companies or even vendors. For example, when enterprises take advice on how to do disaster recovery from their storage and database vendors, they get completely different answers. Vendors always try to convince customers that their silo-centric solution is superior; however, vendors don’t always have their customers’ best interests in mind.
Like performance and scalability, disaster recovery is one of the tougher problems in platform engineering. Even with the benefit of a platform perspective, doing DR well requires a delicate balance of speed, budget, and a good set of dice.
Attempting it from within silos is far more painful, since silo-centric solutions are usually about how to avoid being implicated in the event of an actual disaster. Most solutions consist of a piecemeal strategy cobbled together from competing vendors. Once again, the platform takes it on the chin.
The plethora of intertwined software and hardware that makes up a platform stubbornly refuses to operate like a dozen independent silos. Disaster recovery and performance problems are tough to triage even on a modest enterprise platform, and they take on exponentially greater complexity in a 400-node cluster. Commercial supercomputers must transcend both the mechanics and the politics of silos to be successful.
When performance problems are triaged within silos, the result is often like a game of schoolyard tag. The group with the most glaring symptoms gets tagged. If that’s storage, then the admins must either find and fix the storage problem or “prove” that it wasn’t their fault. The storage group rarely understands application code and they are not encouraged to cross-pollinate with application developers.
Likewise, many developers are far removed from the physical reality of the platform underneath them and they have no incentive to understand what happens to the platform when they add a seemingly insignificant feature that results in an extra 300,000 disk reads.
Whenever something goes seriously wrong within a computing platform, the organization of silos demands accountability. There are usually a couple of platform-aware individuals lurking within IT departments; they’re the ones who determine that the “insignificant” feature caused the “storage” performance problem.
The good news for each silo is that it’s not just their fault. The bad news is that often it’s everyone’s fault.
Advances such as Java and hypervisors and a general reliance on treating the computing platform in abstract terms have reinforced the notion that it is no longer necessary to understand how computers actually work. Big data is about performance and scalability first, so knowing what the hardware is doing with the software will become important again.
When everything is working as planned and being delivered on time within budget, silos of specialists are economical and make sense to the organization. When platforms fail and the underlying problem is masked by silos, statements like “perception is reality” start to swirl around the water cooler.
If you hear this enough where you work, you should start to move toward higher ground. As various groups scramble to make sure the problem is not theirs, the combination of fear and ignorance starts to set in and results in impaired judgment or panic.
Silos must compete for budgets, up-time stats, and political credibility, which frequently leaves the platform and business undefended. When the organization is more important than the business, companies can become their own worst enemy.
Organizations of any size are composed of humans who have varying tolerances for fear and risk. To make good judgments, our brains must discern and sort complex and competing bits of information. Fear does weird things to the human brain, disrupting its ability to make good judgments. The impairment from this disruption can lead to dysfunctional behavior.
But we are not robots; making decisions without any emotion at all is considered a psychological disorder. Studies have shown that every decision has an emotional component, no matter how insignificant. Research subjects whose emotional pathways are fully disrupted find it difficult even to choose what cereal to have for breakfast.
Decision-making is a delicate balance of signals between the emotional part of the brain (amygdala) and the thinking part (ventromedial prefrontal cortex).
What does good judgment have to do with big data? Everything.
For organizations amped on ego and ambition, combined with the intolerance for error that comes with the complexity and scale of big data, this means a lot of decisions will have to be made quickly without all the information. And that requires judgment.
Fear and anger are two sides of the same impairment coin. The emotional, primitive part of the brain is responsible for fight or flight. Fear is flight, anger is fight—both are good at impairing judgment. For example, if I’m meeting with someone whom I’ve clashed with on a previous project, my emotional perception of their opinion will be distorted by my feelings toward them. I might unconsciously misinterpret what they are saying due to body language, tone of voice, or choice of words—all of which communicate information. Also, as I listen for subtext, plot their downfall, or construct a whole new conspiracy theory, I might not be really listening to them at all.
Making good decisions isn’t just about not being emotionally impaired by fear or anger. It also isn’t about knowing all the details, but about prioritizing just the right few. Since there are always too many details, the human brain must learn how to find those few that matter. Finding them requires our old reptile and new mammal brains to dance; fear and anger definitely kill the music.
A business relies on its staff to sort and prioritize details every day. Experience and informed judgment are required; it's called business acumen, or an educated guess. When you guess right, you are a hero; when you guess wrong, that's OK. You just need to guess again, often with no new information; mistakes are how humans learn.
Big data is a new, completely uncharted world with problems that have never been encountered. Right or wrong, those who guess faster will learn faster. Prepare to make a lot of mistakes and learn a lot.
Small companies have a cultural acceptance of risk that gets diluted as the company grows. Small companies may appear reckless when viewed from the risk-averse end of the spectrum where large companies operate, but an acceptance of calculated risk is not recklessness. Risk aversion often seems safe (better safe than sorry), yet you can be sorry if you are too safe.
Every year, surfers from all over the world attempt to surf 20-foot waves off the coast of Northern California at Maverick's. From the viewpoint of risk aversion, these surfers seem like lunatics. Because a reckless surfer is a dead surfer, surfers must be effective risk technicians.
Similarly, rock climbers on the face of El Capitan in Yosemite National Park, especially the free-climbing variety, are also considered lunatics. In exchange for assessing risk well, a skill built on invaluable intuition and years of experience, surfers and climbers are rewarded with exhilarating experiences.
An organization’s operating risk spectrum is the gap between aversion and recklessness. In business, being risk averse is more about perception of risk than actual risk, so the gap between aversion and recklessness often contains competitors who are willing to take on more risk.
If you don’t believe there is a large gap, then you might be complacent about the competition, but the gap can be wide enough to accommodate both competitors and opportunities for new business.
Disruptive forces like big data also widen the gap, and accurately perceiving this gap relies heavily on how well your old and new brains can get along. Making better decisions requires us to become better at accurately assessing risk.
Probability is an idea; outcome is an experience. Humans tend to perceive risk based on outcome rather than probability. Like most mathematics, probability describes how the natural world functions at an empirical level, but an idea never feels as real as an experience.
Using the classic case of driving versus flying, although we know it’s far riskier to drive down US Interstate 5 than to catch the shuttle to Burbank, this doesn’t wash with the human psyche.
If a plane crashes, the outcome of something very bad happening (i.e., death) is almost certain. A car crash, however, is far more survivable, even though an accident on the highway is far more likely than one on that commuter flight. Because you have a better chance of surviving the outcome, driving seems less risky.
Severity of outcome has no bearing on the probability of the accident in the first place, but this is how our brains work. Good risk technicians must fight this instinct in order to do things mere mortals would never dream of—surf the waves at Maverick’s or surf the capital burndown of a startup that takes on IBM.
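A toy expected-loss calculation makes this concrete. The numbers below are purely illustrative, not actual accident statistics; what matters is the shape of the comparison between a likely-but-survivable event and an unlikely-but-fatal one:

```python
# Illustrative numbers only: two modes of travel, one with a far higher
# chance of an accident but a far more survivable outcome.
drive = {"p_accident": 1e-4, "p_fatal_given_accident": 0.01}
fly   = {"p_accident": 1e-7, "p_fatal_given_accident": 0.95}

def p_death(mode):
    # Expected risk = probability of the accident * severity of the outcome.
    return mode["p_accident"] * mode["p_fatal_given_accident"]

# The outcome of a plane crash is near-certain death, yet the expected
# risk of the flight is still an order of magnitude lower than driving.
print(p_death(drive) > p_death(fly))
```

Our brains fixate on the severity term and neglect the probability term, which is exactly the instinct a good risk technician learns to override.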
Deterministic risk analysis is another example of aversion. In an attempt to protect the business from all possible outcomes, instead of all probable ones, organizations assume the worst: deterministic analysis assumes that every possible failure will happen, whereas probabilistic analysis assumes that the components most likely to fail are the ones that actually do. Being a better risk technician will help to optimize the platform.
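The difference can be sketched in a few lines of Python applied to spare-capacity planning for a cluster. The component counts and annual failure rates below are hypothetical, chosen only to illustrate how far apart the two approaches land:

```python
import math

# Hypothetical inventory for a 400-node cluster:
# component: (count per cluster, annual failure probability)
components = {
    "disk":         (400 * 12, 0.03),   # 12 disks per node, ~3% AFR
    "power_supply": (400 * 2,  0.01),
    "node_dimm":    (400 * 8,  0.005),
}

def deterministic_spares(components):
    """Assume every component can fail: stock one spare per component."""
    return {name: count for name, (count, _) in components.items()}

def probabilistic_spares(components, margin=2.0):
    """Stock spares for the *expected* number of annual failures,
    padded by a safety margin."""
    return {name: math.ceil(count * p * margin)
            for name, (count, p) in components.items()}

det = deterministic_spares(components)
prob = probabilistic_spares(components)
print(det["disk"])   # 4800 spare disks under worst-case planning
print(prob["disk"])  # 288 spares covering twice the expected failures
```

Even with a 2x safety margin, planning for probable failures rather than all possible ones changes the spares budget by more than an order of magnitude, which is the kind of optimization a platform-aware risk technician can defend.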
One sure-fire way to get accountants and controllers mad at you is to ask them to quantify qualitative risk. It turns out that, although this quantification doesn't happen in spreadsheets, it happens intuitively all the time, in the thousands of business decisions made every day.
An easy example of qualitative risk analysis is found when making a decision about recruiting new employees. The decision to hire one candidate over another, though a subjective judgment, involves the brain doing what it does well: making decisions with qualitative information.
There is nothing mathematical about intuition, so it's an unmentionable word in many organizations. This is not because it isn't used every day to make decisions, but because it appears to be biased, non-linear, or random.
Good intuition is far from random and can allow for very quick decision making. Having the patience to listen for and recognize good intuition in others makes it possible for people to make better decisions faster.
Organizations don’t kill projects; people kill projects, and sometimes projects kill people. All are bad clichés, but it seems that some organizations have become bad clichés, too. Getting humans to work as a well-oiled machine is the hardest part of the soft platform—hard to understand, hard to preserve the innovation, and hard to change.
Changes in organizational behavior happen at a glacial rate relative to the technology and business conditions that accompany trends like big data. Humans simply can’t change their patterns of behavior fast enough to keep up with technological advances. It’s a cultural impedance mismatch.
The rate of acceptance of big data—which came up quickly on the heels of the cloud computing craze—will be necessarily slow and erratic. “We just figured out clouds, and now you want us to do what?”
Big data also brings with it a shift in the demographics of professionals as open source programmers and data scientists bring energy and new approaches to an established industry. Anthropology and technology are converging to produce a major shift in how everyone consumes data—enterprises, customers, agencies, and even the researchers studying how humans behave in organizations.