Last week I had the opportunity to attend the inaugural Leverage Big Data conference put on by Tabor Communications. Following a non-traditional format, the event brings together a curated, cross-disciplinary group of industry leaders, executives and vendors for several days packed full of keynotes, panels, assigned “boardroom” case study discussions, one-on-one meetings and social events. The result is an intimate and open sharing of ideas, expectations and predictions.
This year’s theme was Data-Driven Infrastructure: The Engines of Big Data.
Attendees came from myriad industries and backgrounds. Keynotes covered cancer research, space exploration, data science and informatics, and biotech, along with their data woes and successes. Panels with active audience participation dove deeper into related subjects. Everyone participated: researcher, business leader and vendor alike. How else could we get a more holistic view of the issues at hand?
And there are issues, make no mistake about it. Big Data is still a relatively new industry, if we can even call it that—even now that epithet may be premature. It is struggling for identity inside and out, and as new players join the fray, the line between what it is and what it isn’t continues to blur, making self-definition even more elusive.
Here are some of the key concerns I took away from the event:
- The problems haven’t gone away — In some ways it is a little disheartening that we are still discussing the same Big Data problems. Not only that, as we come to understand Big Data better, newer, subtler questions are arising. “How do we store the data?” and “How do we move the data?” are joined by “How are we going to do that with the exponential deluge of data on the near horizon?”
- Nomenclature normalcy may be lost forever — We are building our own modern-day Tower of Babel, though in our case everyone started out speaking their own language, working from their own plans and constructing it after their own image. In other words, we began already under the burden of the curse. This phenomenon is not due solely to the open-arms embrace of unstructured data from sources of varying repute. Even well-intentioned structured data carries the syntax, nomenclature, mental models and ontologies of its creators. Unfortunately, even researchers in the same field working on the same problem may have drastically different ways of presenting the data.
- The definition of “data” is growing — The days when most data looked like the entries in Scrooge’s accounting books, carefully penned out by Bob Cratchit, are gone forever. Tabular data is now joined by images, time series, and all manner of abstract formats. Correlating these disparate types of data is becoming more difficult.
- We’re still playing at being a Pokémon Master — The slogan “Gotta Catch ’em All” captures the feeling I still get from this industry. We deploy all these sensors, attach to all these feeds and start gathering up information. We aren’t really sure yet what we are going to do with all of it, but we want it… we NEED it… almost like an addict looking for the next hit. We’re afraid that if we don’t, we might miss the one critical piece of data that would provide the key insight—an insight that will only become available sometime in the future, once we’ve further perfected our analysis techniques. While I may be able to make an exception for scientific research data, in general we need more curation. Otherwise, we run the risk of becoming hoarders, buried alive in our own homes, too afraid to clear out the clutter.
- The gut bestows validity — Given a large enough data set (which is always the case with Big Data), false correlations will manifest. Every good data scientist and practitioner knows this and the other pitfalls, like Simpson’s Paradox, that await the Big Data nouveau riche. Flush with data and devoid of the past 200 years of statistical science, we semi-blindly fumble our way into a “brand new world” of bad decision making. According to one keynote speaker, despite what we may be told or expect, only a small fraction of large-company leaders allow data to trump their gut. Trusting the gut over the data invites confirmation bias, which can lead to ignoring important information and making bad decisions. A recent example of a major “mistaking correlation for causation” failure is Google’s attempt to predict flu outbreaks before the CDC. Sure, it worked the first two years, but the third was a disaster.
- Enduring talent shortage — We don’t have enough trained, experienced people. Enough said.
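The false-correlations point above is easy to demonstrate. As a minimal sketch (my own illustration, not from the conference): if you measure enough independent variables, some pairs will appear meaningfully correlated by chance alone. The threshold of 0.3 and the data sizes here are arbitrary choices for the demonstration.

```python
# Demonstration: purely random, independent variables still produce
# "significant-looking" pairwise correlations when there are enough of them.
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_vars = 100, 200

# Every column is independent noise -- no real relationships exist.
data = rng.standard_normal((n_samples, n_vars))
corr = np.corrcoef(data, rowvar=False)

# Count distinct variable pairs whose |correlation| exceeds 0.3.
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
spurious = int((np.abs(corr[mask]) > 0.3).sum())
total = int(mask.sum())
print(f"Pairs with |r| > 0.3 by chance alone: {spurious} of {total}")
```

With 200 columns there are 19,900 distinct pairs, so even a low per-pair probability of a chance correlation yields dozens of spurious "findings"—exactly the trap waiting for anyone who treats every strong correlation in a large data set as a real effect.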
All of this may sound like doom and gloom, but it’s not. In my opinion, it is a healthy self-evaluation of where we currently are in this emerging field. Gartner’s Hype Cycle is going to play out, and the majority of attendees believed we are still heading down into the Trough of Disillusionment. But the truth of the matter is, if we have this level of self-awareness of the issues, problems and pitfalls at this stage, we’re actually well on our way toward maturity.
Here are some of my key takeaway “reasons to be excited” from the last few days:
- Cross-functional, cross-discipline, cross-industry collaboration — More than anything else, I was happily surprised and encouraged by the level of collaboration I saw at the conference. Whether people were there mainly to learn or to impart lessons learned, the sharing was contagious. Each of us came at these problems from slightly different angles with slightly different backgrounds, and through open sharing, these unique perspectives bring a much more holistic view to the problems. I may have an answer for what you are seeing, and you for me. If this trend continues, I’m confident we’ll work our way through the problems.
- People are integral to success — As I noted in a recent article, people and process are complementary. With Big Data, it’s so easy to get caught up in… well… the data. We can’t forget the people component. Solving the Big Data problems requires a more comprehensive approach: not only developing new talent, but also supporting and improving existing people and the processes that support them. It was clear from the discussions that this issue is recognized and that people are already working to address it.
- HPC and Big Data are becoming BFFs — That’s right, best friends forever. While I don’t think either will ever subsume the other (just as with HPC and Cloud), each has much to learn from the other. For a conference dedicated to Big Data, HPC was constantly part of the discussion as a needed underlying technology. We still have a way to go before this union is finalized, and we’re still figuring out how to make it work. However, we’re on the road and know where we need to go. Each technology will end up supporting the other.
So, all in all, it was a great conference with a front-row view into the vista of the future. It’s going to be a fun and exciting ride.