Analytics Stack – Predicting the future of modern data teams

I see a systemic change in how organizations are leveraging their analytics teams. As a result, a completely new analytics stack is being developed, with people of different skillsets each working on a different layer of it. This evolution is similar to how software development teams took shape over the last ~20 years. HBR has labelled data scientist the sexiest job of the 21st century, and McKinsey recently published a report calling the present the age of analytics.

Companies are convinced that they need to invest in their data teams, but the young professionals who would take up these roles are often unaware of what they are signing up for. Plenty of articles tell CIOs and CDOs how to leverage analytics, but nobody is telling potential analysts how things are changing from their perspective and what new roles are open to them, so that they can hone their skills accordingly.

Right now most of us in this field are self-taught, so I feel it is increasingly important to carve out time to understand the new opportunities rather than just follow existing career paths. Everyone today wants to be a data scientist, but I believe there soon won’t be any such generalist role. Mature analytics teams have workflows so complex that they have to break the work down into multiple jobs and dedicate analysts to one part of the problem. The middle-layer manager of those analysts would be the only one with a grip on the whole project, but they’d spend little time on technicalities and a big part of it on people management. This is the classic individual contributor vs manager conflict: managers often have a good grip on the existing tools and frameworks but find it hard to innovate because they are no longer directly connected with the technical world. Thus, although smaller teams have data scientists right now, as those teams mature these folks will have to either pick a technical segment to specialize in or become people managers.

This post is an attempt to deconstruct the workflows and list the roles available in a typical analytical IT team. I’ll try to follow it up with a post on how to acquire the skills needed for these roles. I’m also learning this way!

The biggest reason we could not take every business decision based on hard data was that storing data was too expensive. So we had to prioritize: important financial data was stored and archived digitally, but transactional data was mostly still recorded on paper, or stored only at an aggregated level instead of as individual records. At best we were using data to publish the financial health of companies. Trading firms were the pioneers, creating quant jobs to predict outcomes and possibly alter decisions, as there was a lot of potential to increase profits just by having faster access to data. For all other industries, standardized month-end reporting was enough. Not anymore! You’d be surprised to know that hard disk drive capacity has increased 1 million times in the last 5 decades!

As data storage costs have come down drastically, bulky transactional data is now being stored and archived. It is also being parsed through models to identify patterns and trends, as an additional source of insights for increasing profits or decreasing costs. Models and algorithms that were earlier limited to research are now employed in daily routine and are even getting standardized into plug-and-play models, making it easier to skim through the datasets that store these mundane daily transactions. This practice will only get deeper and sharper as we scale things up!

But big data doesn’t always mean good data. A lot of the time the transaction data is just noise, not really insightful at a macro level to the human eye. A general business user can only do simple analytical operations in their head, like spotting correlations. Thus, often the only way to discover hidden insights is to assume some correlations and iterate through the standard templates known to us, a kind of trial and error to see if we can guess the complex relations. In my experience, analysis at this level is often the equivalent of searching for a needle in a haystack. It involves working at such microscopic levels that, to focus, one has to disassociate from other aspects such as managing the people who execute strategies formed on the basis of these insights. Thus these roles are often very technical and disconnected from their business impact, and a company can afford to get to such analysis only when it has solved all the easier problems!
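The trial-and-error search described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the column names and figures are made up, and Pearson correlation stands in for whichever "standard template" you would actually try first.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def rank_correlations(columns):
    """Try every pair of columns and rank the pairs by absolute correlation."""
    pairs = []
    for a, b in combinations(columns, 2):
        pairs.append((a, b, pearson(columns[a], columns[b])))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Hypothetical daily figures, made up purely for illustration.
columns = {
    "discount_pct": [0, 5, 10, 15, 20],
    "units_sold": [10, 12, 14, 16, 18],
    "returns": [3, 1, 4, 1, 5],
}

for a, b, r in rank_correlations(columns):
    print(f"{a} vs {b}: r = {r:+.2f}")
```

The strongest pairs float to the top of the list, which is about as far as this brute-force approach goes: it only finds linear relations between two columns at a time, and a human still has to judge whether a high score means anything.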

So, although big data is a newfound asset, the regular business user’s life hasn’t changed a lot; they still use Excel for most of their ad hoc analysis. Thus, I believe the bigger impact will be delivered by making it increasingly easy for an average user to store regular data in an online, connected environment instead of isolated offline MS Excel/Access silos. I say so because the majority of businesses don’t really generate so much data that a single hard drive is not enough for them. The problem they really face is that it takes a lot of manual effort to convert the non-standardized raw data in their ERP, CRM and other systems into usable insights. Plus, the data is hardly ever available at the instant you need it: some analyst needs to understand your request and create an ad hoc report to serve you, and this asynchronous approach translates into slower overall execution.

System interfaces are getting redesigned as well; the monopoly of SAP for ERP and Salesforce for CRM will soon end. Many options are coming up that are mobile friendly for easier data entry and whose databases are open to connections from dashboarding tools. To summarize, the infrastructure changes now in place have changed the game and the roles of the players playing it.

And, because systems are getting redesigned, I see a lot of scope for data analysts to get involved in the design of these new systems. Whoever is leading the analytics efforts at a firm needs to manage 3 gears that, in a way, run in series: webforms for users to enter data into databases, database tables to store and process data, and dashboards to visualize and analyse data for users. These 3 fields are individually big enough for anyone new to specialize in!
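To make the three gears concrete, here is a minimal Python sketch of them running in series. Everything in it is my own assumption for illustration: an in-memory SQLite database stands in for the real database, and the `orders` table, the function names and the sample submissions are all hypothetical.

```python
import sqlite3

# Gear 2: a database table (in-memory SQLite here, as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "customer TEXT NOT NULL, region TEXT NOT NULL, amount REAL NOT NULL)"
)


def submit_form(customer, region, amount):
    """Gear 1: what a webform handler would do with one submission."""
    conn.execute(
        "INSERT INTO orders (customer, region, amount) VALUES (?, ?, ?)",
        (customer, region, amount),
    )
    conn.commit()


def dashboard_summary():
    """Gear 3: the aggregated query a dashboard tile would run."""
    rows = conn.execute(
        "SELECT region, COUNT(*), SUM(amount) "
        "FROM orders GROUP BY region ORDER BY region"
    )
    return rows.fetchall()


# Hypothetical submissions, made up for illustration.
submit_form("Asha", "North", 120.0)
submit_form("Ben", "South", 80.0)
submit_form("Chitra", "North", 60.0)

for region, order_count, total in dashboard_summary():
    print(region, order_count, total)
```

Because the dashboard query reads the same table the form writes to, every new submission shows up in the summary the next time it runs. That is the whole point of keeping the gears connected.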

Backend Data Engineers are responsible for ensuring forms are connected to the correct fields in databases and that tables are interconnected with each other. The bulk of traditional IT teams would soon be concentrated in this role. We’d always want custom enterprise apps to be created in Java, but that industry has reached its saturation point, with hardly any innovation. I see much more effort going into capturing data from new kinds of input devices and then storing the big data in newer, more efficient ways. Once data pipelines are laid down, engineers are needed to fetch data and push it through other pipes. We live in an interconnected world where not only websites but soon products would be communicating with each other via APIs. A backend engineer is also laying the wires for these connections, to make sure every data point needed for analysis, internal or external, is made available.

Front Data Analysts are responsible for pulling aggregated data from tables into BI dashboards. More often than not they need to enrich the data before publishing it. Because we keep adding to the types of data tables available for storage and the technologies on which they are based, a lot of upcoming IT roles ask candidates simply to be able to read data from these new kinds of tables. Another breed is focusing on the enrichment part. Often we’d simply want the dimensions to be grouped, e.g. the data gives us just pincodes but we want to compare the performance of states. At times the end result of a custom analysis is a few extra columns storing new ratios, coefficients, probabilities, relations, etc., e.g. based on a customer’s shopping history, identifying 5 items on which they can be offered a discount. This post here explains the enrichment of data using machine learning technologies.
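The pincode example is the simplest kind of enrichment, and a short Python sketch shows the idea. The lookup table here is a tiny illustrative sample and the revenue figures are made up; a real project would load a complete pincode reference table instead.

```python
from collections import defaultdict

# Illustrative lookup from pincode to state. A real project would load a
# complete reference table instead of this three-row sample.
PINCODE_TO_STATE = {
    "110001": "Delhi",
    "400001": "Maharashtra",
    "411001": "Maharashtra",
}

# Hypothetical sales records keyed only by pincode (figures are made up).
sales = [
    {"pincode": "110001", "revenue": 500},
    {"pincode": "400001", "revenue": 300},
    {"pincode": "411001", "revenue": 200},
    {"pincode": "999999", "revenue": 50},  # not in the lookup
]


def revenue_by_state(records, lookup):
    """Enrich each record with a state, then total revenue per state."""
    totals = defaultdict(int)
    for record in records:
        # Unmapped pincodes are kept under "Unknown" rather than dropped.
        state = lookup.get(record["pincode"], "Unknown")
        totals[state] += record["revenue"]
    return dict(totals)


print(revenue_by_state(sales, PINCODE_TO_STATE))
```

Keeping unmapped pincodes in an "Unknown" bucket is a deliberate choice in this sketch: silently dropping them would make the state totals look cleaner than the underlying data really is.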

The third class is of designers who visualize the data in the form of charts and tables. No matter how rich the data is, if it is not represented correctly, not everyone will find it actionable. This subject can be more art than science. It may seem simple, but being the last layer in the architecture, this is where a lot of collaboration happens. And with the ever-growing features in dashboarding tools, business users will soon ditch PowerPoint in favor of creating live dashboards. Which is the most commonly used form of visual? It’s not a bar or a line chart; data tables are still by far the most commonly used visuals, and I don’t see that changing anytime soon. Why? Because that’s how people consume data. Only once they’ve got acquainted with the order of magnitude and relevance do they start comparing one item with another or looking at the trend, and that’s where graphs come into the picture. Thus Excel and similar spreadsheet tools will remain strong, and scripting languages like VBA and Google Apps Script won’t die. Their roles are surely changing though: instead of being used for data transfer and manipulation, they are increasingly being used for data representation.

You’d be surprised if I tell you that last year I had created this complete stack within G Suite and I was managing a multi-million dollar business on it. It is the easiest way to get started with your business, especially when you’re still prototyping and changing your tables frequently. Within G Suite, you can use Forms to let users enter new records from their mobile devices. These forms are connected with Google Sheets, thus creating new records in the sheets in real time. These sheets can then be interconnected with each other using the IMPORTRANGE function, which lets you turn them into master tables. Data Studio can then display dashboards based on live data in Google Sheets, which can be your first ERP and CRM. Thus, without any upfront investment, you can create a real-time workflow and build prototypes.

But I see these workflows getting simpler still when their makers put a GUI layer on top that allows non-technical users to drag and drop their customizations instead of coding them. I love the work that AirTable and Podio have done in this area. There are many other tools that are more specific in their use case, like industry-specific CRMs and workflow apps. They provide you a complete playground that eases the work of a back-end engineer to a great extent. Obviously when you scale up the project, you’d need your own custom databases, but let’s not go there for now.

Alteryx is a great tool that fits in the middle layer and helps the front-end engineer break down and re-aggregate data as needed without coding. This is why I strongly believe there will soon be tools with built-in drag-and-drop modules that predict complex correlations between variables and create new labels/fields within the table; feature engineering is what this process is called. Maybe Alteryx will scale up in this area, or maybe newer tools like SparkBeyond will get better and become part of the toolkit. For those who can’t afford Alteryx or would like a no-constraint playground, the only option left is to build a custom script in R or Python. It is the preferred path right now due to the lack of alternatives, but soon only specialists will need code to manipulate data. This is why I’m not too worried about learning every new coding language; it is more important to know the difference between the techniques and which one to apply to a given problem.

Lastly, the layer that is already part of the mainstream analytics workflow will only get better with time. Tableau and Power BI are 2 popular tools here, and it’s important to have a good command of at least one of them.

Excel and Google Sheets will soon be used only at this last step of data representation, instead of the current approach where they are used at all 3 steps. Still, I would say it is important to have a good command of VBA and Google Apps Script. Their relevance will not be diminished by all the other tools that I mentioned above.

Once you move over to enterprise-level tools instead of these starter ones, you’d be seeing HTML/Android forms on your devices for data entry, MySQL/Mongo databases storing your data, maybe Hadoop to manage the big data, PHP/Python to connect the two, and Tableau/Power BI to pull data from the databases, with maybe some add-ons depending on your use cases. So, essentially the workflow remains the same, but because of the added complexity due to scale, you might end up working on only one of the pieces.

The best (or worst) thing is that none of this knowledge exists in our coursework anywhere! So, the field is pretty much open to anyone irrespective of their field of study. It’s best to get some hands-on experience through internships and see what you like best and then go deeper into that area.

Hopefully, we’ll see changes in the curriculum of schools and colleges, and these fields will become stand-alone subjects. We’re seeing these distinctions at graduate level, but I believe we don’t need as many researchers here. More kids need to take up these jobs straight out of college, but they need to be better prepared. Just an aptitude and a will to learn is not good enough, as it favors self-learners. This field is application driven; we’re modern-age engineers who need to manage the flow of data and make sure everyone is fed well :-)