A look at how Fortune 500 companies use Big Data and the path to Big Data Analytics

“Yeah, we use Big Data.” I think you’d be hard pressed to find a Fortune 500 company that doesn’t use Big Data today. The term came onto the tech scene in 2005, when it was popularized by O’Reilly Media to describe datasets that were too large and cumbersome to manage using traditional BI tools. Fast forward to 2018: working with Big Data is done and we can move on to bigger and brighter problems… except we can't. If anything, there is more work to do with Big Data now than ever before.

In the mid 2000s, companies were coming to stark realizations about the current state of their data and what the future would look like. The internet was growing exponentially, web apps were collecting more data than ever, internet speeds were getting faster, and dreams of a mobile world promised a future of immense data growth. In a world where existing databases were already bloated, difficult to manage, and running in a closet in the backroom, the bottleneck was imminent. tl;dr: MySQL wasn’t going to cut it.

When I was a Young Warehouse

Many companies, especially Google and Yahoo, were feeling the pain of large amounts of data sitting in small databases. As the data scaled, simple queries began shutting down databases. To solve this problem, datasets were split among multiple databases, which made getting a clear picture of all the data time-consuming and resource-intensive. Soon, simple questions started to go unanswered. As Google sought a solution to these pains, they arrived at Leslie Lamport’s paper The Part-Time Parliament and his follow-up, Paxos Made Simple.

In The Part-Time Parliament, Leslie Lamport outlines an intricate “distributed” voting system used by a fictional ancient Greek society, and Paxos Made Simple restates the same algorithm in plain terms. He takes these principles and lays out the math and logic for applying this scheme to a distributed group of computers. Google expanded on these principles and, in 2003, released a paper titled The Google File System. Yahoo expanded on this idea and in 2006 released the Hadoop Distributed File System (HDFS).

And what’s the significance of the founding of Hadoop? If you work with Big Data, you almost certainly use Hadoop. It has laid out the framework for nearly all cloud storage and is an important part of the puzzle that is Big Data today. This point leads into the next one:

Hadoop == awesome Big Data Storage but Hadoop != awesome Big Data Analytics

Big Data, I Love You, But You’re Bringing Me Down

Prior to writing this article, I was discussing this idea with a buddy of mine who is a Senior Hadoop Engineer for a Fortune 500 company. He introduced me to a term which encompasses the Big Data struggle: Data Gravity. Data Gravity is when your dataset is so large and cumbersome that you feel like it’s weighing your company down. It’s so heavy that simple MapReduce jobs, like a count of empty files, become planned events. So heavy that transforming data and joining two tables has to be scheduled out. Even so heavy that users would rather keep messy, unnecessary files than go through the pain of deleting data from such a bogged-down dataset.

And just like gravity in the vacuum of space, this data gravity can lead to a Data Black Hole - sucking up time, resources, and the souls of jaded engineers. To make matters worse, the landscape has changed so that companies are receiving large amounts of data from vendors, partners, subsidiaries, VCs, etc. Wouldn’t it be nice if all these datasets could quickly join, and transform, and prune, and aggregate, and…

Data gravity is an issue that virtually all large companies face. Nowadays, data is coming from many disparate sources, often in different formats, managed by different teams. Data is siloed and the access is severely limited and controlled. It’s a complex landscape and the first step is to get a holistic view of all of your data.

Fortune 500 companies can store Big Data, but in doing so, they run into issues like Data Gravity, which end up pushing analytics further down the road. Business users have historically pushed technical teams to come up with a solution to the problem. Unfortunately, there was no simple solution and this challenge has resulted in a general air of distrust between business users and technical teams.

This landscape is tough to navigate and often the solution lies with outside experts who know how to find and address the needs of both the business and technical sides of the company. The advantages of hiring outside experts lie in de-risking and accelerating. De-risking because the investment cost required to hire a consulting team is much lower than creating and managing a new intracompany team. Similarly, if something goes wrong or your company’s goals change, it’s much easier to stop paying consultants than to disband a team. Accelerating because you don’t have to worry about hiring and onboarding new employees.

Starting with a discovery process with both business users and technical teams, broad and specific goals can begin to be mapped out. From there, existing datasets can be assessed. The result is a solution that can handle the complexity of the data as well as the specific goals of invested teams.

And Wouldn't it Be Nice to Store Together 

Once there’s a better understanding of the data landscape, companies can then begin Extracting and Loading (EL) data with Apache Airflow, an open-source tool originally developed at Airbnb that programmatically authors, schedules, and monitors workflows. One of the biggest challenges with extracting and loading data is managing the myriad of scripts, dependencies, and time-sensitive events. Airflow provides a platform to manage these by using an intuitive UI which maps out job relationships, itemizes errors, and links directly to the logs of the different scripts and processes.
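
To make the idea of job relationships concrete, here is a minimal sketch that uses only Python's standard library (`graphlib`, Python 3.9+) rather than Airflow's own API. The task names are hypothetical. Each task lists the tasks it depends on, and the sorter returns an order in which a scheduler could safely run them.

```python
from graphlib import TopologicalSorter

# Hypothetical extract-and-load tasks.
# Each key maps to the set of tasks that must finish before it can start.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "load_orders": {"extract_orders"},
    "load_customers": {"extract_customers"},
    "build_report": {"load_orders", "load_customers"},
}


def run_order(deps):
    """Return one valid execution order for the given dependency map.

    Raises graphlib.CycleError if the tasks depend on each other in a loop.
    """
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

A real scheduler adds retries, logging, and timing on top of this, but the core guarantee is the same: nothing runs before the jobs it depends on.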

Airflow helps manage Spark jobs so that data can be properly exported and transformed from Hadoop. It can also run SQL queries in an automated and scalable way across many transactional databases, such as Postgres or MySQL. And in a similar way, Python can be used to convert CSVs into objects and load them into the data warehouse.
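
As a rough sketch of that last step, the snippet below parses CSV text into typed Python objects and loads them into a table. The `Order` fields and sample rows are made up for illustration, and an in-memory SQLite database stands in for the real data warehouse connection.

```python
import csv
import io
import sqlite3
from dataclasses import astuple, dataclass


@dataclass
class Order:
    order_id: int
    customer: str
    amount: float


def parse_orders(csv_text):
    """Convert CSV rows into typed Order objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        Order(int(row["order_id"]), row["customer"], float(row["amount"]))
        for row in reader
    ]


def load_orders(conn, orders):
    """Create the target table if needed and bulk-insert the orders."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [astuple(order) for order in orders],
    )
    conn.commit()


# Hypothetical sample data; in practice this would come from a file or an upstream extract.
SAMPLE_CSV = "order_id,customer,amount\n1,acme,19.99\n2,globex,5.00\n3,acme,42.50\n"

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse connection
load_orders(conn, parse_orders(SAMPLE_CSV))
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Converting to typed objects before loading means a malformed row fails loudly at parse time instead of quietly landing in the warehouse.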

Ultimately, for Airflow to be successfully deployed, there needs to be a strong understanding of all the underlying data and the underlying relationships between it all. But even as organizational work changes and evolves, Airflow maintains a flexible environment that facilitates change.

Once the data is moved and organized in a single data warehouse, the analytical magic can begin.

Gotta Keep ‘em Separated

When it comes to data warehouses, Snowflake is one to consider for numerous reasons. Snowflake is a cloud-based elastic data warehouse that keeps data storage and computational work separated. It does so by building on existing cloud infrastructure, such as AWS and Azure. What’s unique about it is that while it is a fully managed, cloud-based service, it can be paired with your existing instances of AWS or Azure so that your in-house dev team can have access. On AWS, Snowflake stores your data in a specifically curated and managed instance of S3. For compute, Snowflake spins up an Amazon EC2 “compute warehouse” which, depending on what you’re trying to do, can be spun up smaller, larger, or with many parallel instances at once.

A typical Snowflake instance has separate compute warehouses for ingesting data, scheduled jobs, a BI tool, and ad-hoc analytics. Depending on the specific organizational needs, these warehouses can easily scale up or down, and will automatically turn off when they’re not in use. And since you’re billed per second of use, you can scale up your warehouse to run a job for an hour and then have it automatically turn off upon completion, rather than waiting for your Spark job to run all night only to find out it’s failed in the morning.
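
Here is a back-of-the-envelope sketch of why per-second billing matters. The credit rate, the price per credit, and the 60-second minimum charge are all assumptions for illustration, so check them against your own Snowflake contract and documentation.

```python
def warehouse_cost(seconds_running, credits_per_hour, price_per_credit, minimum_seconds=60):
    """Estimate the cost of one warehouse run under per-second billing.

    minimum_seconds models a minimum charge each time the warehouse starts
    (assumed here, not taken from the article).
    """
    billable_seconds = max(seconds_running, minimum_seconds)
    return billable_seconds / 3600 * credits_per_hour * price_per_credit


# Hypothetical numbers: an 8-credit-per-hour warehouse at $3.00 per credit.
one_hour_job = warehouse_cost(3600, credits_per_hour=8, price_per_credit=3.0)
always_on_day = warehouse_cost(24 * 3600, credits_per_hour=8, price_per_credit=3.0)
```

Under these made-up rates, the one-hour job costs $24, while leaving the same warehouse on all day costs $576.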

Snowflake is also efficient at computational processes, meaning any transformations, joins, or aggregations run quickly. From the beginning it was planned to be a SQL-based data warehouse, so it has all your favorite SQL abilities, such as subqueries, common table expressions (CTEs), and window functions, plus a few bonuses such as functions to work with semi-structured data like JSON, XML, or ORC. And because transforming is a breeze, you can focus on using Airflow to extract and load into Snowflake and let Snowflake take care of the rest.
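
For readers who haven't met CTEs or window functions, here is a small sketch of both in one query. It runs against an in-memory SQLite database (version 3.25 or newer) as a stand-in, with made-up table and column names. The SQL is plain standard syntax, and Snowflake's semi-structured functions are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("acme", "2018-01-01", 10.0),
        ("acme", "2018-01-02", 20.0),
        ("globex", "2018-01-01", 5.0),
        ("globex", "2018-01-03", 15.0),
    ],
)

QUERY = """
WITH daily AS (                      -- common table expression
    SELECT customer, order_day, SUM(amount) AS total
    FROM orders
    GROUP BY customer, order_day
)
SELECT customer,
       order_day,
       SUM(total) OVER (             -- window function: running total per customer
           PARTITION BY customer ORDER BY order_day
       ) AS running_total
FROM daily
ORDER BY customer, order_day
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The CTE names an intermediate result so the main query stays readable, and the window function adds a running total without collapsing the rows.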

So to recap, we have Hadoop as the chassis of the car, acting as the underlying storage technology for Big Data. Then we can factor in Airflow, which is the plumbing and electrical wiring managing all the extract and load work. Finally there's Snowflake, which is the engine transforming and querying all of the data. Now we just need a user interface to control our car and a nice shiny paint job to show it off - sounds like we need a BI platform.

Can Any Tool Find Me, Some Data to Love

If you’re in need of a BI platform, you needn’t look further than Looker. At its core, Looker manages your BI logic through a modeling language called LookML. With LookML, you’re able to define the relationships between datasets, define metrics, and define any necessary transformations.

LookML is a SQL abstraction layer which writes out SQL queries so that end users can query the data warehouse directly, even without knowing how to code. And as it’s a modern coding language, it supports practices such as Git version control, object-oriented programming, and a built-in linter. Along with the language, Looker has developed a thriving online community around Looker Blocks, or reusable code snippets. Once logic is modeled, it can quickly and simply be edited, appended, or removed, allowing transformations to take only seconds or minutes, rather than hours or days.
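
To give a feel for what a SQL abstraction layer does, here is a toy sketch in Python. It is not LookML syntax and not how Looker is implemented. The model structure, field names, and `build_query` helper are all hypothetical. The point is only that once dimensions and measures are defined in one place, the SQL can be generated from whichever fields a user picks.

```python
# Hypothetical model: field names map to the SQL expressions behind them.
MODEL = {
    "table": "orders",
    "dimensions": {"customer": "customer", "order_day": "order_day"},
    "measures": {"total_revenue": "SUM(amount)", "order_count": "COUNT(*)"},
}


def build_query(model, dimensions, measures):
    """Generate a SQL string from the fields a user selected."""
    dims, meas = model["dimensions"], model["measures"]
    unknown = [f for f in dimensions if f not in dims]
    unknown += [f for f in measures if f not in meas]
    if unknown:
        raise KeyError(f"fields not defined in the model: {unknown}")

    select = [f"{dims[d]} AS {d}" for d in dimensions]
    select += [f"{meas[m]} AS {m}" for m in measures]
    sql = "SELECT " + ", ".join(select) + " FROM " + model["table"]

    # Aggregates need a GROUP BY over the selected dimensions (by position).
    if dimensions and measures:
        positions = ", ".join(str(i + 1) for i in range(len(dimensions)))
        sql += " GROUP BY " + positions
    return sql


sql = build_query(MODEL, ["customer"], ["total_revenue"])
print(sql)
```

Changing a metric's definition in the model changes every query built from it, which is why edits take minutes rather than days.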

Once a model is created, business users have access to Explores, where they can build queries from an “Excel-like” interface and query the data warehouse directly. From there, users can format how data is presented through Looker’s visualization interface, which was recently enhanced by the visualization capabilities in Looker 6, Looker’s latest release. Once users have created a report, or “Look,” they’re satisfied with, they can add it to a dashboard of similar Looks, schedule it to be sent to their email, or even set it to be conditionally sent based on changes in the data.

Lastly, Looker has a powerful API, as well as a white-labeling service, which means sharing your data between apps, teams, partners, and customers is baked in. With this, you can build dashboards for specific customers, or send reports specifically to your C-Suite team. For these reasons, Looker is considered a BI platform, rather than just a BI tool.

… but where to start?

No matter how large your organization is or how “big” your Big Data is, we at DAS42 believe there is a solution that can align with your goals and lead your organization to data accountability, data visibility, and data stability. Reach out to info@das42.com for more information on how you can do Big Data Analytics with the DAS42 team.

Joey Bryan is an Associate Data Analyst at DAS42