Information siloing used to be a problem, but that was a long time ago. Right?
Today, the silo lives on mostly as a minotaur of the imagination, a bogeyman we invoke when we want to tease the young folk about just how good they have it.
When I was your age we didn’t have ubiquitous mobile connectivity, cloud, open APIs, RESTful services, or [censored] emojis. We used to have to walk from one building to another—in the snow—when we needed to “exchange” data. We called it sneakernet. And we liked it!
Alas, the information silo is still very much with us. It is still much too difficult for people to discover, consume, and exchange data with one another. Innovation has helped to address some of the physical or technological barriers to data exchange; people and process barriers remain, however. The ideal is a situation in which a business analyst could search for, access, and consume information – seamlessly querying against OLTP database systems, data warehouses, and non-relational data stores alike; searching text files, spreadsheets, workbooks, etc. – much as she uses Google to search for and consume information.
The reality is one in which search and discovery are as much a manual process as a technological one, such that the self-service experience entails about as much data-detective legwork as technology assistance. The self-service user could be forgiven for feeling a lot like a bloodhound, loosed in a swamp of data and instructed to cast about for relevant data sets. Once she finds the data she needs, she must use a complex suite of tools – including the eternal, ineradicable spreadsheet – to prepare and integrate it for analysis. This process isn’t in any sense seamless. It isn’t coordinated or rationalized. The technical term for it is “willy-nilly.”
Siloing is particularly problematic in decision support. This has always struck me as richly ironic, inasmuch as data access/exchange is the lifeblood of business intelligence (BI) and analytics, to say nothing of other, even more advanced analytics practices.
Ironic or no, the reality of siloing in decision support is irrefragable. It encompasses not only the distribution (and isolation) of data, applications, and services but also the distribution (and isolation) of practices. This last point is worth emphasizing. We tend to speak of a “silo” as a dis-integrated or unmanaged IT resource of some kind: a “data silo,” an “application silo,” etc.
But siloing in this conventional sense is only the visible manifestation of siloing at a more basic level: that of the human-initiated practices that produce silos of isolated data sources and information systems in the first place. Think about it. The so-called “spreadmart” is one of the earliest and most recognizable forms of siloing. The term, a portmanteau of “spreadsheet” and “data mart,” was and still is used pejoratively by IT people. But the spreadmart is interesting for several reasons. In its original, pejorative use, it lifted the lid on a suppressed spreadsheet-based analytics culture that (even prior to the codification of the term “spreadmart” in a 2002 article) was at once thriving and deeply entrenched. Owing to its popularity among business people, the spreadmart had, by the first decade of the 21st century, become a problem for IT.
People and practices create the silo. The silo itself is usually a sign, a symptom, of something else: an underlying problem or unmet demand – in this case, an essential asymmetry between the priorities and purposes of IT, on the one hand, and those of business people, on the other. The rise of the spreadsheet-cum-spreadmart, the original self-service tool, was not primarily a demand for simplified access to data, nor, essentially, for new ease-of-use or access features.
At its core, the popularity of the spreadmart, like that of self-service itself, was an expression of a demand: people wanted to take some control back from an IT colossus that had become obsessed with controlling access to resources – with an emphasis on restriction, as distinct from accessibility.
Siloing didn’t begin and end with the spreadmart, of course; if anything, the availability of self-service BI and analytic toolsets compounded the problem. In real-world usage, each of these tools constitutes a silo unto itself – not just in the sense that the first self-service tools lacked core data management features (e.g., metadata management and data lineage capabilities) as well as data synchronization features, but in the sense that each of these tools expects to consume data in certain specific ways. If you’re preparing or modeling data for use with Qlik, you’re going to want to structure it one way; if for Tableau, another. This arrangement is so common that most of us don’t even think to question it. We expect siloing. We’re even willing to trade one kind of siloing for another.
So it isn’t surprising that we embraced the idea of an all-in-one silo – e.g., the data lake, the enterprise data hub – to displace the distributed silos we could not extirpate by other means. This was one proposed use case for Hadoop, which, to its credit, did address a few hard, knotty problems. However, because Hadoop was conceived of as a distributed data store – not, strictly speaking, as a database – it introduced a slew of strangely familiar problems, too. It lacked core data management features, it exposed untraditional (Pig, Scala, even Python itself) or unwieldy (HiveQL) programming interfaces, and, in its earlier incarnations, was tightly coupled to MapReduce, which proved to be a comparatively inflexible data processing engine.
Enterprises knew all of this before they invested billions of dollars in Hadoop products and services. They chose to ignore it, seduced first by the vision of a Hadoop-based silo-to-end-all-silos and, second, by the promise that Hadoop itself would improve. The YARN scheduler would improve Hadoop’s workload management feature set and break its dependency on MapReduce. Mesos and Kubernetes could be used to complement YARN, especially in conjunction with interactive workloads. Spark would supplant MapReduce as a scalable parallel processing engine capable of running in Hadoop or on a standalone basis. And, last, a succession of technologies (HCatalog, Apache Atlas) would solve the Hadoop platform’s data and metadata management issues. Things haven’t played out quite this way, of course. The upshot is that – eight years on – the silo-to-end-all-silos constitutes still another gigantic silo.
The whole depressing cycle strikes me as a gruesome example of what Simone de Beauvoir (among other philosophers) has called “bad faith.” By this she meant a kind of willing self-deception: a need to deceive oneself. It’s what happens when someone insists that something is true (or untrue) even though, in some way, at some level, she knows the reverse is the case.
We’re masters of bad faith. We keep doing the same thing, keep reprising the same mistake, because we need to. We allow ourselves to be convinced that this time the paradigm really has shifted, and, in shifting, has created a situation in which the old rules don’t apply. We want (and need) to believe in the power of new technology to magically solve old, hard, knotty problems: even the class of old, hard, knotty problems that amount, in effect, to iron laws. Problems that are products of intractable physical, technological, economic etc. constraints.
It’s kind of like the old Times Square sidewalk shell game, except that we – IT, the line-of-business, industry talking heads, so-called “influencers,” etc. – keep getting off the bus and coming back for more. It’s driven me and many of my simpatico friends to the brink of despair.
The good news is that there actually is a bright harbinger of hope on the horizon. If you see self-service as, in part, a bottom-up response to IT’s general unwillingness to take seriously the needs, priorities, and purposes of the business, you can also see that this latest clash between top-down and bottom-up has helped fuel demand for a new, pragmatic solution to the age-old problem of data access and exchange – a “middle-out” solution, so to speak, to borrow the name of the MacGuffin-like compression algorithm that is a running gag in the HBO series Silicon Valley. “Pragmatic” in this context means a solution that strikes a balance between the needs, priorities, and purposes of both IT and the business to achieve what Alexander Hamilton once called a “prudent mean”: i.e., a genuine, respectful compromise.
I’m talking about the emergence of self-service-oriented data management platforms that are data- and infrastructure-agnostic. Such platforms promise to simplify – and, to the degree practicable or desirable, make seamless – the process of exchanging data between and among diverse reporting and analytic tools. These platforms would permit IT to do its job, too, e.g., by enforcing (more or less strict) data access, data transformation, data retention, and data exchange policies. Believe it or not, this no longer looks like the stuff of a pipe dream.
I’ll say more about this in a follow-up post.
(A note on those “hard, knotty problems” Hadoop did address: first, it provided a means by which to distribute both data processing and data storage; second, it exposed, via its built-in compute engine (MapReduce), a crude means of parallelizing data processing workloads. MPP databases addressed both of these problems, too – and are, moreover, superior SQL processing platforms. But Hadoop, unlike an MPP database, was better suited to storing non-relational data – especially files.)
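For readers who have never touched Hadoop, the pattern it popularized – map over distributed partitions, shuffle by key, reduce – can be sketched in a few lines of Python. This is a toy illustration of the programming model only, not Hadoop’s actual API; the in-memory list of partitions here merely stands in for data blocks distributed across a cluster.

```python
from collections import defaultdict

def map_phase(partition):
    # Mapper: emit (word, 1) pairs, as in the classic word-count example.
    return [(word, 1) for line in partition for word in line.split()]

def shuffle(mapped):
    # Shuffle: group intermediate pairs by key so each key's values land together.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

# Two "partitions" standing in for distributed storage blocks.
partitions = [["silo data silo"], ["data lake data"]]

# In Hadoop, mappers run on the nodes holding each partition; here we just loop.
mapped = [pair for p in partitions for pair in map_phase(p)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'silo': 2, 'data': 3, 'lake': 1}
```

Crude as it is, the pattern scales because each mapper touches only its own partition – which is precisely why early Hadoop’s tight coupling to this one engine was so limiting for workloads that don’t decompose this way.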