Today’s post is from Sean McClure. Sean is the Director of Data Science at Space-Time Insight, a leading provider of advanced analytics software for organizations looking to leverage machine learning for their business applications. Having worked across diverse industries, and alongside many talented professionals, Sean has seen the blend of approaches required to convert raw data successfully into real world value. Sean’s passion is working with cross-discipline teams to build the next generation of adaptive, data-driven applications.
What Data Science Is and How It Builds Better Products
It would hardly be an overstatement to say we live in a software economy. Many of the products we use are software applications that help people and organizations become more efficient at everyday tasks. Data science is poised to bring big changes to the world of software by using analysis to help teams build adaptive, self-learning products. Yet even as data science becomes more prominent across industries, most projects suffer from a severe disconnect between the efforts of data scientists and the applications being built. By advocating the use of APIs to better connect data science to software, this article asserts that the next generation of products will require data science and product development to be done in concert.
Organizations are a product of their choices. Whether it’s hiring candidates, positioning messages, or pivoting products, companies attempt to select the best option among many. An obvious consequence of this is the preoccupation with making good decisions, yet sound decision-making only comes when companies have learned something about their environment. After all, choices don’t get plucked from thin air, but are based on the concepts acquired through information and experience.
Analytics attempts to assist in this learning process by using data and analysis. By ingesting data, spotting trends, and uncovering patterns, analytics takes a numerical approach to gleaning something that can be acted on. If interesting patterns can be discovered inside an organization’s data, the company stands a chance of resting its choices on something more informed and justifiable.
An up-and-coming field inside the broader scope of analytics is data science, and its meteoric rise in the world of business is due to the way it uses analysis to help decision making. Data science goes beyond mere trend spotting and instead looks to automate the learning that is required to connect an organization’s data to its decisions.
When you automate and scale the workable knowledge that binds raw information to decisions, the benefit is substantial. It dramatically enhances the professional’s ability to execute on what is known about their environment and transfers knowledge to anyone in the organization exposed to the insight.
This is possible thanks to a core piece of technology inside the data scientist’s toolkit: the aptly named machine learning. Machine learning uses statistical models to learn general ‘concepts’ about the relationships in data. This allows software to move beyond the hard-coded rules of traditional programming and instead use models to learn where markets will move and prescribe optimal actions. In the age of information, the idea of handing over some of the required learning to software seems like a godsend, and has made machine learning one of the most talked about technologies in the world today.
But to make machine learning relevant to our businesses we must do better than predictive models and prescribed actions. We need to bridge the gap between how analytics are created and how they are consumed. At the consumption end of that spectrum are the people who use products and the decisions they make. People bring with them nuances that make decision making something more than simply choosing among options.
As we look across business functions and industries, we find a wide array of professionals, each with their own set of unique approaches to acting on information. These differences are not mere biases to be stamped out by the rigor of analysis; rather, they are methods and procedures discovered over time and tailored through years of experience. Any analysis whose final home rests in the hands of decision makers must remain accountable to how insight is consumed.
This reframes the primary purpose of machine learning when it is used in data science. In an academic setting, successful machine learning translates into finding a model that fits data well and shows good predictive accuracy. This effort has brought about decades of innovation in new algorithms and better models. But in the enterprise, a well-fitting model is just the beginning, as the primary goal is to find a model that fits the product. Statistical validity alone will not lead to an application that delivers relevant business results. The outputs of machine learning must touch on the various facets of how people work, and deliver something more than raw capability. In data science, machine learning needs to help us build great products.
Data science is less about finding the most predictive model and more about discovering ways to make analysis work with people.
Making great products is as much about discovery as it is about design. Variations of the product must be exposed to stakeholders and their use cases. Users need to touch and feel what does and doesn’t work and inform the product team of what’s missing. This discovery process requires the ability to swap software features in and out as the team works towards something relevant. By rapidly changing the angle of approach, software can become what it needs to be, with the overall process directed by the natural forces of user interaction.
Achieving this level of “swappability” requires an abstraction where there is less concern about how something works and more focus on what it can do. The art of abstraction is to use a high-level understanding of something in order to more readily grasp its benefits. In the world of software development, this abstraction often comes in the form of the API (Application Programming Interface). APIs are like gateways that allow different components of software to talk to each other. By way of abstraction, APIs expose only the pieces of a program that are needed to hook into its core capabilities, while hiding the more complex parts. This allows different functionality to be added to and removed from software, where it can be tried and tested to see if it improves the product.
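To make the idea of swappability concrete, here is a minimal Python sketch. Every name in it (Forecaster, LastValueForecaster, MovingAverageForecaster, next_order_quantity) is hypothetical and invented for illustration; nothing here comes from a real library or from any product mentioned in this post. The application function depends only on a small interface, so one forecasting feature can be replaced by another without touching the application code.

```python
from typing import Protocol, Sequence


class Forecaster(Protocol):
    """Hypothetical interface: states what a forecaster can do, not how."""

    def predict(self, history: Sequence[float]) -> float: ...


class LastValueForecaster:
    """Simplest possible implementation: repeat the most recent value."""

    def predict(self, history: Sequence[float]) -> float:
        return history[-1]


class MovingAverageForecaster:
    """Alternative implementation: average the last `window` values."""

    def __init__(self, window: int = 3) -> None:
        self.window = window

    def predict(self, history: Sequence[float]) -> float:
        recent = history[-self.window:]
        return sum(recent) / len(recent)


def next_order_quantity(forecaster: Forecaster, sales: Sequence[float]) -> int:
    """Application code: it knows only the interface, never the model inside."""
    return round(forecaster.predict(sales))


sales = [10.0, 12.0, 14.0]
print(next_order_quantity(LastValueForecaster(), sales))       # 14
print(next_order_quantity(MovingAverageForecaster(3), sales))  # 12
```

Swapping the model is a one-line change at the call site, which is the kind of try-it-and-see loop the paragraph above describes.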
APIs and the Future of Data Science
APIs enhance an organization’s ability to collaborate, allow companies to leverage the work of others, and bring more agility to how software is made. Importantly, APIs help collapse the gap between creation and consumption by providing the ability to rapidly pivot software features in response to what is being demanded by the end user.
Nowhere is the gap between creation and consumption starker than in data science. In machine learning, almost all of the innovation has occurred inside ivory towers, divorced from the concerns of making enterprise products. Although open source libraries have made data science more accessible, they don’t lift analysis out from the workbench and into the actual software people use. The act of creating analytics remains a separate activity from that of making applications, where the two interests only meet when both have arrived at something ‘complete’ on their own terms.
Data science and product need to grow together, in a holistic fashion where both analysis and software are developed in tune with how people use an application. This requires more than finding better ways to deploy an analyst’s model into production. It means connecting the entire process of data science to the application, as it is built. Data science is an iterative activity involving a workflow of steps that include data gathering, data preparation, model building, model validation, and model improvement. Each of these steps needs to be hooked into the application so that product feedback can advise the data science approach as the software is created.
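One way to picture steps that are "hooked into the application" is sketched below in Python. This is an illustration under stated assumptions, not a description of any existing tool: the Workflow class, its register and on_step methods, and the step names are all hypothetical. Each workflow step is a plain function, and the application subscribes a listener that is called after every step, so the product side can observe intermediate results and respond to them while the software is still being built.

```python
from typing import Any, Callable, Dict, List

# Step names taken from the workflow described above; the order is fixed.
STEPS = ["gather", "prepare", "build", "validate", "improve"]


class Workflow:
    """Hypothetical workflow whose steps are visible to the application."""

    def __init__(self) -> None:
        self._steps: Dict[str, Callable[[Any], Any]] = {}
        self._listeners: List[Callable[[str, Any], None]] = []

    def register(self, name: str, fn: Callable[[Any], Any]) -> None:
        """Attach a function to one of the known workflow steps."""
        if name not in STEPS:
            raise ValueError(f"unknown step: {name}")
        self._steps[name] = fn

    def on_step(self, listener: Callable[[str, Any], None]) -> None:
        """Let the application observe the result of every step."""
        self._listeners.append(listener)

    def run(self, state: Any = None) -> Any:
        """Run registered steps in order, notifying listeners after each."""
        for name in STEPS:
            if name in self._steps:
                state = self._steps[name](state)
                for listener in self._listeners:
                    listener(name, state)
        return state


wf = Workflow()
wf.register("gather", lambda _: [3.0, None, 5.0, 4.0])  # made-up sample data
wf.register("prepare", lambda rows: [r for r in rows if r is not None])
wf.register("build", lambda rows: sum(rows) / len(rows))  # a mean as the "model"

seen: List[str] = []
wf.on_step(lambda name, state: seen.append(name))  # the application's hook

model = wf.run()
print(seen)   # ['gather', 'prepare', 'build']
print(model)  # 4.0
```

In a real product the listener might update a dashboard or record user feedback; here it only records which steps ran.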
This holistic view of data science and product is not possible when the tools data scientists use are detached from the “stack” of technology put in place to make software. Every other role on the product team uses tools that allow their work to make an immediate impact on the application. But in data science this is not the case. Analysis efforts get “thrown over the fence” after something has been discovered, leaving developers with the task of trying to integrate uninformed results and foreign languages into existing applications.
Data science needs to be a part of the application itself. This means developers and data scientists must work together to allow data science to happen in the same environment where the software is being created. To make this a reality, data science APIs would need to capture the rich variety of approaches used by data scientists to solve problems. This is only possible with clever abstractions that would satisfy the fidelity needed to allow great data science to happen within the end-to-end application.
The precursor to abstraction is that which is encountered most often. In every API we see how often-needed code gets lifted out of its lower-level complexities and made more readily available to the rest of the software. This makes product development a more rapid, agile and creative process. There are many steps in data science that are encountered on virtually every project. Being able to rapidly execute these tasks via APIs that are connected directly to the application would not only be a dramatically more effective way to build data-driven products, it is arguably the future of data science itself. As different data are gathered, new attributes considered, and various learning algorithms attempted, data science efforts would be verified as they should be: not just by statistics but through the continuous integration of new features into software that matters.
Companies live and die by the decisions they make, and those decisions are only as strong as what has been learned. While machine learning brings automation to the way this learning happens, its innovations remain largely disconnected from how people use software. In the evolution of product development, we have seen that breaking down the walls between software components allows us to more readily discover the right application. With the help of APIs, data science can emerge from the weeds and become an integral part of how we discover tomorrow’s great products.
We at MuleSoft help enable the data science culture with our Anypoint Platform, which makes data integration simple for business-critical analysis. Within Anypoint Platform we have DataWeave, which provides powerful data integration with an easy-to-use graphical data mapping interface, and CloudHub, which helps organizations create repeatable integration applications, making it easy to automate data integration processes.