Distributing Big Data

Andrew Hicks
Dec 27, 2020
4 min read

Working as an analyst in the data sciences, one is expected to know a lot about how to manage, aggregate and present data so that end-users can utilize it for its indented purposes. One of the most useful ways to manage big data is by storing it in a database and then querying directly against it for analysis and reporting. It is not uncommon where the skills of the business don't enable end-users to directly query such tables. It is often the case that you find teams who package data for distribution. Not all users need direct access, and for some, a dashboard is all they need. But in some cases, having the skills to be able to tap directly into that data is exactly what can give those users the information that they need to make the best decisions.

In December of 2020, I ran into such a case. I had developed a dynamic marketing dashboard in Power BI for the entire consumer organization. This dashboard gave insights into the following information:

Household Information: Number of enabled households, subscribers, subscribers by product, revenue by product, and household-enabled broadband speed.

Demographic Information: Estimated age range of subscriber, estimated number of adults per household, estimated household income, and estimated indicator of if a subscriber has a credit card.

Other Information: Household sale indicator, household disconnect indicator, household campaign marketing touches by type, and hierarchal view by State, CBSA, Zip, and household identifier.

This dashboard had a single dynamic control page that allows its user to filter through many attributes to conform the dashboard to address the business question that they are trying to address. It was long thought that such a marketing view was not feasible, but with the support of my team, I was able to turn what was impossible possible.

This dashboard was built off a SQL table infrastructure, that all together consists of just short of one-hundred million records. In fact, the dataset was so large that I first composed a proof-of-concept with just our fiber customers. To get the full view into a dashboard, I had to upgrade all my application to 64-bit processing capacity just so I could utilize more then 2GB of RAM.

My manager was so excited about the fiber marketing dashboard version that he began sharing it across the marketing organization including his boss, a vice president, who absolutely loved it. One of the teams with whom we shared it had an issue though. They had a lot of salespeople who could make good use of our dashboard but might want direct access to the details behind the data. The simple solution to this is to give them access to the SQL tables and let them run their own queries against it to address whatever business question that they were dealing with at the time. This was not something that that team felt ready or willing to do. There was political resistance to training salespeople how to run SQL commands. It appeared that they wanted the information handed to them in Excel which at the time, due to the massive size of the dataset, did not seem remotely possible.

My manager and I had many conversations about what the best and most efficient way we could distribute the underlying data. We looked at the possibility of using Hyperion Essbase since many users at the company have access to it but were concerned with its maintenance. We looked at the possibility of connecting the database to Access but were concerned that users lacked the skills to use it. I even began building a Python GUI standalone application that users could click through a control panel and download an associated Excel file with the sample they selected. I produced a proof-of-concept sample of this application, but a deployable version would have required more development time and routine maintenance.

Finally, the solution became apparent. There is a function in a published Power BI dashboard called Analyze in Excel that connects a spreadsheet to the published dashboard though an OLAP connection. The spreadsheet depicts a pivot table with all the fields available from the dashboard. None of the data is stored in the spreadsheet, so the file size is quite small. When you bring in an attribute into the pivot table, it connects to the published dashboard and sends back the results. I had to build in some measures in Power BI to get the pivot table value section to work, but after that it worked brilliantly and fulfilled the business need.

Anyone to whom we distribute the spreadsheet model will have access to tens of millions of records if they have been granted permissions to access the dashboard. Teaching the sales workforce technical skills or maintaining an aggregated report is no longer a concern. Some may have a bias against dashboards assuming that it lacks the ability to make data useful outside of its constructed scope. With some research and development, I was able to prove otherwise.

Distributing Big Data

Recent Posts

Comments