As a lifelong developer, no matter what my role is within IBM, I tend to gravitate toward the software nerds in the room and take an interest in what their daily lives are like. I get to see and work with new customers all the time, and I am finding more and more data science efforts being included in the development landscape. I think we have seen data science go from literally a science experiment outside of IT to now being a significant portion of the development ecosystem. And that brings with it some of the typical development challenges we know and love.
How many data scientists do you have?
A colleague of mine once asked a large customer “how many data scientists do you have?” The answer was around 3000. Wow!! That seems like a lot even with the recent explosion of data science and AI. But exploring further, this customer’s definition of a data scientist was all those people charged with making decisions based on data. This includes everyone from the most junior business analyst to the PhD data scientist. At the risk of being cliché, today data is the world’s most valuable resource, and that can definitely be said of today’s corporations large and small.
With an ecosystem of so many “data scientists”, IT organizations now have a significant challenge in providing platforms and services to accommodate this fast-changing community. Some of the same challenges exist in this space that we have been dealing with in traditional software development:
- Collaboration – In the early stages, data scientists were given computers with as much horsepower as they could find and they went at it. Hopefully they would come up with something useful for the organization. Now we have teams of people that need to work together to curate data, build, train, and deploy models, and create applications utilizing those models. This requires platforms that facilitate sharing of resources and collaboration between team members. The platform needs to provide user- and role-based security, along with auditing, source management, and change management capabilities. Corporations should be interested in knowing who, when, why, and how these decision engines were created that are driving the future of the company.
- Environment creation – The ability to create environments to crunch data must be agile and on-demand. Spinning up Jupyter notebooks in a matter of seconds is a requirement. Deep learning experts need the ability to create a GPU farm quickly to perform neural network training runs and then shut it down automatically when done. On-demand Hadoop farms are needed instead of making a “request to IT” for resources. Many of the cloud techniques and benefits of environments-on-demand we excel at in traditional systems need to be applied to the systems needed for data science. Also, open-source language libraries present a configuration management nightmare. Keeping track of which version of Python and which library versions are in use at any given point requires some configuration management control.
- Cost control – Not every data science problem requires a neural network to solve. Not every data set needs to be stored in a Hadoop cluster or a data warehouse. Just like we don’t create fully featured production environments for every stage of development, we also need to be aware of the costs and benefits of data science environments. It may be OK to use an exported data set stored in object storage for the “development” efforts instead of the expense of a Hadoop cluster. Not every problem requires the compute power of NASA to solve.
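That configuration management control can be as simple as recording a pinned manifest of versions a notebook was validated with and checking the running environment against it. A minimal sketch, using only the Python standard library; the manifest contents here are hypothetical:

```python
import sys
from importlib import metadata

# Hypothetical manifest: the Python and library versions a notebook
# was validated with. Real projects might keep this in source control.
PINNED = {"python": "3.11", "numpy": "1.26.4"}

def check_environment(pinned):
    """Compare the running environment against a pinned manifest.

    Returns a list of human-readable mismatches; an empty list means
    the environment matches the manifest.
    """
    problems = []
    running = f"{sys.version_info.major}.{sys.version_info.minor}"
    if running != pinned["python"]:
        problems.append(f"python: have {running}, want {pinned['python']}")
    for lib, want in pinned.items():
        if lib == "python":
            continue
        try:
            have = metadata.version(lib)  # installed version, if any
        except metadata.PackageNotFoundError:
            have = None
        if have != want:
            problems.append(f"{lib}: have {have}, want {want}")
    return problems
```

Running `check_environment(PINNED)` at the top of a notebook surfaces drift before it silently changes results.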
You think technology changes fast…
A similar challenge exists in the data space. There is no arguing that technology is changing faster than ever before. Many would argue that data is growing and changing even faster. The idea that data stands still while we develop, train, and deploy models is a fallacy. This problem of “changing the tires while driving down the road” has to be dealt with. Chief data officers and the keepers of corporate data must provide capabilities and platforms to keep up with the data science community’s needs. This concept can be examined from a few perspectives:
- Finding data – Searching for and gaining the appropriate access to the right data can be a daunting task and can often produce incorrect results. Left to their own devices, no two people will find and use the same input data. Organizations need to approach the idea of cataloging data in a way that makes data easy to find. Applying social media techniques like tagging and ratings can help users shopping for data find not only appropriate data but data that others have used successfully. Success in building data-driven decision systems depends directly on quickly finding and utilizing the correct data.
- Masking data – To go along with finding data, data stewards also need to be responsible for ensuring sensitive data is properly masked or even hidden depending on how it is classified. Creating a new “cleansed” data source is not the answer, as this is time consuming and forces the cleansed copy to be continually updated. A better approach is to use a policy-driven masking technology that masks data in flight. The original data source is not modified; instead, based on a user’s role and/or a masking policy, the sensitive data is either hidden altogether, pseudonymized, or anonymized.
- Use data where it lives – Data changes very fast, and AI opportunities need to be addressed in days or weeks, not months. The traditional process of shaping and morphing data into a data lake or data warehouse using ETL tools and ETL developers is too time consuming. Data needs to be consumed where it lives, and tools need to be given to data scientists to wrangle, morph, or join the data quickly without the need for heavyweight ETL.
- Retrain models – Trained models are only as good as the data used to train them. However, data trends change over time and there is no guarantee that historic data accurately predicts tomorrow’s results. So, the concept of “continuous delivery” of trained models needs to be adopted. Periodically retraining models with up-to-date data and validating each newly trained model against historical data sets is important. Using automation to do this on a periodic basis confirms that old model assumptions still hold, or signals that a new approach needs to be taken.
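The tagging-and-ratings idea for finding data can be sketched with a toy in-memory catalog. The data set names, tags, and ratings below are hypothetical, and a real catalog would back this with a search index:

```python
# Toy catalog entries: each data set carries user-applied tags and an
# aggregate rating from people who have used it successfully.
CATALOG = [
    {"name": "customer_churn_2023", "tags": {"customers", "churn"}, "rating": 4.6},
    {"name": "raw_clickstream",     "tags": {"web", "events"},      "rating": 3.1},
    {"name": "customer_master",     "tags": {"customers", "crm"},   "rating": 4.9},
]

def shop_for_data(tags, min_rating=0.0, catalog=CATALOG):
    """Return catalog entries matching any requested tag, best-rated
    first, so users find data that others have used successfully."""
    hits = [d for d in catalog
            if d["tags"] & set(tags) and d["rating"] >= min_rating]
    return sorted(hits, key=lambda d: d["rating"], reverse=True)
```

A query like `shop_for_data(["customers"], min_rating=4.0)` surfaces the well-rated customer data sets first, mimicking the “shopping” experience described above.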
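Policy-driven, in-flight masking can be illustrated with a small sketch. The roles, column names, and policy below are illustrative assumptions, not a real product’s API; the key property is that rows are masked as they flow to the user while the original source is untouched:

```python
import hashlib

# Hypothetical masking policy: per role, how each sensitive column is
# treated. "hide" drops the value, "pseudonymize" replaces it with a
# stable token, and anything unlisted passes through in the clear.
POLICY = {
    "analyst":   {"ssn": "hide", "email": "pseudonymize"},
    "scientist": {"ssn": "pseudonymize", "email": "pseudonymize"},
    "steward":   {},  # data stewards see the data unmasked
}

def pseudonymize(value):
    # Stable token: the same input always maps to the same token,
    # so joins across tables still line up after masking.
    return "tok_" + hashlib.sha256(value.encode()).hexdigest()[:10]

def mask_row(row, role):
    """Apply the role's masking policy to a row in flight; the
    original data source is never modified."""
    rules = POLICY.get(role, {})
    out = {}
    for col, val in row.items():
        action = rules.get(col, "clear")
        if action == "hide":
            continue  # drop the column entirely for this role
        out[col] = pseudonymize(val) if action == "pseudonymize" else val
    return out
```

Because the tokens are deterministic, a scientist can still group or join on a pseudonymized email without ever seeing the real address.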
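The periodic retrain-and-validate loop is essentially a champion/challenger check. A minimal sketch, assuming models are simple callables and the holdout set is a list of (input, label) pairs; real pipelines would plug in actual trained models and richer metrics:

```python
def accuracy(model, holdout):
    """Fraction of holdout examples the model predicts correctly."""
    correct = sum(1 for x, y in holdout if model(x) == y)
    return correct / len(holdout)

def promote_if_better(champion, challenger, holdout, margin=0.0):
    """Keep the current (champion) model unless the freshly retrained
    challenger beats it on up-to-date holdout data by at least
    `margin`; returns whichever model should serve traffic."""
    if accuracy(challenger, holdout) > accuracy(champion, holdout) + margin:
        return challenger
    return champion
```

Run on a schedule, this either quietly confirms the old model still holds up against current data, or promotes the retrained one, which is the “continuous delivery” of models described above.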
Data science and artificial intelligence are here to stay and will become a large portion of the development ecosystem at many organizations. The IT and data teams that support these efforts need to begin to address these challenges. The explosion of data science is making a profound impact on IT organizations. And this is just the beginning.
IBM is taking the lead in bringing AI to the masses with the open source initiative CODAIT. Read more about it here: https://developer.ibm.com/code/2018/03/20/creating-a-center-of-gravity-around-open-source-data-and-ai-technologies/
Take a look at IBM’s on-demand cloud-based platform addressing these concerns – IBM Watson Studio. Try it for free here: https://www.ibm.com/cloud/watson-studio