There’s been a lot of attention in recent months on the computing power requirements of Artificial Intelligence and the rise of the various pieces of hardware (GPUs, ASICs, TPUs…) used to train massive data models. Surprisingly, there’s been much less hype around software and, specifically, around the data science and analytics software that appears to be just as important as hardware in the AI journey and that benefits, accordingly, from a massive revenue opportunity.
The training of AI models indeed involves several steps, each one requiring a different type of software.
The first step, the gathering of data, usually involves a mix of web scraping (the extraction of data from web pages), open and non-open data repositories (in other words, online libraries of academic research, books…) and proprietary datasets. Once all the raw data has been collected, it is assembled and stored according to its type:
1) if the data is structured – meaning that it has a standardized format and a defined data model (just like an Excel file) – then it can be stored in a relational database like the ones sold by Oracle or Microsoft, or in an open-source solution like MySQL or PostgreSQL. Examples of structured datasets include financial time series, e-commerce transactions, sensor temperature readings or GPS records…
2) if the data is unstructured – meaning that it has no identifiable structure and does not conform to a data model – then it has to be stored in a so-called NoSQL database provided by companies like MongoDB or Amazon (DynamoDB), or in a cloud-based data lake from Snowflake or privately held Databricks. Examples of unstructured datasets include academic research papers, audio files, pictures, social media data flows, emails, videos… (a minimal sketch of both storage paths follows this list).
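To make the distinction concrete, here is a minimal, purely illustrative Python sketch of both storage paths; the connection strings, table and collection names are hypothetical placeholders, not references to any specific setup.

```python
# Purely illustrative: structured rows in PostgreSQL vs. raw documents in MongoDB.
# Connection strings, table and collection names are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine   # relational (structured) storage
from pymongo import MongoClient        # NoSQL (unstructured) storage

# Structured data: a small table of daily index levels written to a relational database.
prices = pd.DataFrame(
    {"date": pd.to_datetime(["2024-01-02", "2024-01-03"]),
     "msci_acwi": [727.1, 722.4]}
)
engine = create_engine("postgresql://user:password@localhost:5432/markets")
prices.to_sql("index_prices", engine, if_exists="append", index=False)

# Unstructured data: free-form documents (e.g. scraped articles) stored as-is in MongoDB.
client = MongoClient("mongodb://localhost:27017")
client["research"]["articles"].insert_one(
    {"source": "web_scrape", "title": "Example article", "body": "<p>Raw HTML text…</p>"}
)
```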
The second step, data cleansing and denoising, is by far the most important; as the old adage goes: “Garbage in = Garbage out!”. If the dataset consists of numerical values, data scientists will scan for and correct outliers, missing data, multicollinearity and so on. If the dataset is unstructured, the cleaning process depends on the target model: for Natural Language Processing models it will consist of removing HTML tags and page numbers, lowercasing text and many other operations, while for Computer Vision models pictures will be resized, reoriented and transformed in many other ways.
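As an illustration of what this cleaning can look like in practice, the short Python sketch below handles a numerical column and a raw text snippet; the column name, percentile thresholds and regular expressions are assumptions.

```python
# Illustrative cleaning helpers (column name, percentile thresholds and regular
# expressions are assumptions), using pandas and the standard library only.
import re
import pandas as pd

def clean_numeric(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Fill missing values with the median and clip outliers at the 1st/99th percentiles."""
    df = df.copy()
    df[col] = df[col].fillna(df[col].median())
    low, high = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(lower=low, upper=high)
    return df

def clean_text(raw: str) -> str:
    """Strip HTML tags, collapse whitespace and lowercase, as a first NLP cleaning pass."""
    text = re.sub(r"<[^>]+>", " ", raw)       # remove HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text.lower()
```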
The third step, feature selection and engineering, is clearly where the know-how of data scientists makes the difference in terms of the model’s accuracy. In the case of a numerical (structured) dataset, data scientists will select what they believe are the AI model’s most relevant inputs, called features, and will also create synthetic inputs (engineered from the dataset) to try to further improve the model’s accuracy.
As an example, let’s consider an AI model aiming to forecast the sign of next month’s return of the MSCI AC World index. The available dataset consists of hundreds of financial time series such as equity and bond indices, yield spreads, commodity prices… Using a subset of this data – the price of gold, the dollar index, oil and the Nasdaq index, for example – would be feature selection, while using the correlation between the Consumer Staples and MSCI AC World indices (calculated/engineered from the dataset) would be feature engineering. This is obviously a quantitative process in which the features are statistically measured and compared with one another and with the model’s expected output, to check whether their inclusion in the model makes any sense. The resulting feature set is then normalized in order to avoid any scaling disparity that could skew the Machine Learning (ML) model.
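The minimal sketch below illustrates this selection/engineering/normalization sequence. It assumes a pandas DataFrame named data holding monthly prices in columns called msci_acwi, consumer_staples, gold, dxy, oil and nasdaq; all of these names are illustrative.

```python
import pandas as pd

# Monthly returns computed from the (assumed) price DataFrame `data`.
returns = data.pct_change().dropna()

# Feature selection: keep a subset of the raw series.
features = returns[["gold", "dxy", "oil", "nasdaq"]].copy()

# Feature engineering: a 12-month rolling correlation between Consumer Staples
# and the MSCI AC World index, derived from the same dataset.
features["staples_acwi_corr"] = (
    returns["consumer_staples"].rolling(12).corr(returns["msci_acwi"])
)

# Normalization (z-score) so that no feature dominates purely because of its scale.
features = ((features - features.mean()) / features.std()).dropna()
```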
The example above assumes that the dataset also contains the output needed to train the model. In the AI world, the model’s output is called the label. In our particular example, forecasting the sign of the MSCI AC World’s next-month return, the label would be either 0 (negative return) or 1 (positive return) and would be easily generated from the index’s time series. The labelling of the output data is another important step in the preprocessing phase.
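Continuing the same hypothetical sketch, the label can be derived by shifting the index’s return series one month forward:

```python
# 1 if next month's MSCI AC World return is positive, 0 otherwise.
label = (returns["msci_acwi"].shift(-1) > 0).astype(int)

# The last observation has no "next month" yet, so it is dropped from the labelled dataset.
dataset = features.assign(label=label).iloc[:-1]
```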
The last preprocessing step consists of splitting the selected dataset into training, validation and test sets. Data splitting may involve several different strategies (random, stratified, cross-validation…) that will not be addressed here. The main point to retain is that the model is first tuned on the training set, its accuracy is then compared with that of other model configurations on the validation set (hyperparameter tuning) and, finally, it is validated (or not) on the test set.
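A possible sketch of a random split with scikit-learn, reusing the hypothetical dataset built above (the proportions are arbitrary, and a chronological split would often be preferred for financial time series):

```python
from sklearn.model_selection import train_test_split

X, y = dataset.drop(columns="label"), dataset["label"]

# 70% training, 15% validation, 15% test via two successive random splits.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)
```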
All of the preprocessing operations described above are mainly done with Python scripts relying on open-source libraries like SciPy, Pandas or spaCy, just to name a few. Anaconda, an open-source distribution of Python and its scientific computing libraries, is clearly AI researchers’ favorite development platform.
Overall, it is estimated that the majority of time spent on Machine Learning is in data preparation (well above 50%), suggesting that companies offering dedicated software platforms and tools for this processing phase should be large beneficiaries of the proliferation of AI applications and enterprise spending growth on data science and ML technology.
The data science software market, which already exceeds $100 billion, is thus expected to grow by more than 30% over the coming years. That said, not every software player will emerge as a winner, as competition is intensifying and market share shifts are under way.
The market is structuring itself around, on one side, open-source (and thus free) solutions and, on the other, large platforms offering a wide range of data storage and preparation functionalities. While the main cloud providers unsurprisingly have a strong presence in this space, data storage leaders such as Databricks, MongoDB and Snowflake are gradually expanding their tools to offer platforms combining multiple capabilities across structured/unstructured data, storage (warehouse/lake), data engineering… and allowing many ML tasks to be automated.
These companies do not come cheap right now (EV-to-Sales well above 10x), but they offer pretty good visibility on 30%+ revenue growth over the coming years and strong margin leverage potential (from current low levels of around 10%), suggesting that valuation multiples should decline fast.
Other players that have been facing specific issues and, accordingly, exhibit more attractive multiples (EV-to-Sales below 3x) could be interesting as well. We are notably thinking of Alteryx – a pioneer in data preparation tools that failed to catch the cloud migration and to become a full data science platform – and Teradata, which has been shifting its business model from storage hardware to a data software platform. For both, M&A could be the end game, as their solutions have obvious strategic value for the leaders that, as we said above, seek to cover the whole spectrum of data science.