Automated machine learning (AutoML) is a broad discipline that covers automating any part of the machine learning application process. By working on the various stages of that process, engineers develop solutions to expedite, enhance and automate parts of the machine learning pipeline. What is the current state of the art in AutoML applications? What types of business problem can be solved better with AutoML? How can AutoML be embedded into familiar project delivery methodologies? In this three-part series we discuss the functionality of a few currently available AutoML tools, practical considerations when adopting AutoML systems, and personal recommendations drawn from several AutoML assessment tasks.
Part 1: Automated Machine Learning
Automated machine learning has been on the AI radar for a few years now. Its aim is to automate the end-to-end process of applying machine learning to real-world problems. AutoML is not automated data science; it is one of many tools in the data science toolkit.
As shown in the figure above, feature engineering and model building and evaluation are traditionally carried out by data scientists with domain-specific knowledge and mathematical and programming skills. However, getting existing algorithms and models to work well for new applications remains a hard problem. An algorithm very rarely works the first time, so the majority of time ends up being spent on tuning.
The rationale for AutoML is to improve the efficiency of model development: if numerous machine learning models must be built, using a variety of algorithms and a number of differing hyperparameter configurations, then this model building can be automated, as can the comparison of model performance and accuracy.
Boosted by cloud technology and its hugely increased computational power, recent AutoML development is reflected in the release of both business-focused systems such as DataRobot, H2O, DarwinAI and OneClickAI, and numerous open source libraries such as AutoWeka, MLBox, auto-sklearn, TPOT, AutoKeras, prophet, etc. Large companies that host cloud infrastructure, such as Amazon (Amazon SageMaker), Microsoft (Azure AutoML) and Google (Google Cloud ML), not only have their own AutoML services but can also integrate their machine learning products with their data storage and visualisation capabilities to deliver end-to-end project pipelines.
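The idea of building many models and comparing them automatically can be sketched in a few lines. The example below is only an illustration under stated assumptions: the synthetic data, the two toy model families (k-nearest-neighbours and nearest-centroid) and all function names are hypothetical and written in plain Python, not taken from any AutoML product. Real tools search far larger spaces with smarter strategies.

```python
import random
from collections import Counter

def make_data(n=120, seed=0):
    """Two noisy 2-D clusters labelled 0 and 1 (synthetic, illustration only)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        centre = 0.0 if label == 0 else 2.0
        data.append(((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label))
    rng.shuffle(data)
    return data

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_factory(k):
    """Return a fit function for k-nearest-neighbours with hyperparameter k."""
    def fit(train):
        def predict(x):
            nearest = sorted(train, key=lambda row: sq_dist(row[0], x))[:k]
            return Counter(label for _, label in nearest).most_common(1)[0][0]
        return predict
    return fit

def centroid_fit(train):
    """Nearest-centroid classifier: predict the class whose mean is closest."""
    by_label = {}
    for x, label in train:
        by_label.setdefault(label, []).append(x)
    centroids = {
        label: tuple(sum(col) / len(xs) for col in zip(*xs))
        for label, xs in by_label.items()
    }
    return lambda x: min(centroids, key=lambda label: sq_dist(centroids[label], x))

def cv_accuracy(fit, data, folds=5):
    """Mean accuracy over a simple k-fold cross-validation."""
    scores = []
    for f in range(folds):
        test = data[f::folds]
        train = [row for i, row in enumerate(data) if i % folds != f]
        predict = fit(train)
        scores.append(sum(predict(x) == y for x, y in test) / len(test))
    return sum(scores) / len(scores)

# The search space: several algorithms, each with its hyperparameter settings.
candidates = {f"knn(k={k})": knn_factory(k) for k in (1, 3, 5)}
candidates["nearest_centroid"] = centroid_fit

data = make_data()
leaderboard = sorted(
    ((name, cv_accuracy(fit, data)) for name, fit in candidates.items()),
    key=lambda item: item[1],
    reverse=True,
)
best_name, best_score = leaderboard[0]

for name, score in leaderboard:
    print(f"{name:18s} {score:.3f}")
```

The leaderboard is the part that matters here: every candidate is scored by the same cross-validation procedure, so the comparison needs no manual intervention once the search space is defined.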
Part 2: Current state of the art
The biggest advantage of AutoML is that it leverages cloud computational power to boost the efficiency of model development and optimisation. For classical machine learning problems such as regression and classification, AutoML can quickly return relatively good results on well-prepared data, giving a first insight into how predictable the business problem is. After interpreting the results, data scientists might further process the data, modify features, etc. to improve performance. Currently, unsupervised-learning and deep-learning capabilities are still limited in most AutoML products. The supported functionality generally includes:
- Automated Feature Engineering: create new feature sets iteratively until the ML model achieves a satisfactory accuracy score. Typical examples include:
- Automated Model Selection and Hyperparameter Tuning: find a machine learning algorithm, and a configuration for it, that can be trained on the observations for those features and then predict a target value for new observations. Typical examples include:
- Automated Deployment: streamline the ML model deployment process. Typical examples include:
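As a rough illustration of the iterative feature engineering described in the list above, the sketch below greedily adds candidate features until a target validation accuracy is reached. Everything in it is an assumption made for the example: the XOR-like synthetic data, the small pool of derived features, the nearest-centroid model and the 0.9 target are hypothetical and do not describe how any particular product works.

```python
import random

def make_data(n=600, seed=1):
    """Synthetic data: the label is 1 when x1 and x2 share a sign (XOR-like)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        rows.append(({"x1": x1, "x2": x2}, int(x1 * x2 > 0)))
    return rows

# Candidate pool: the raw inputs plus a few simple derived features.
CANDIDATES = {
    "x1": lambda r: r["x1"],
    "x2": lambda r: r["x2"],
    "x1^2": lambda r: r["x1"] ** 2,
    "x2^2": lambda r: r["x2"] ** 2,
    "x1*x2": lambda r: r["x1"] * r["x2"],
}

def featurise(rows, names):
    return [([CANDIDATES[name](raw) for name in names], label) for raw, label in rows]

def centroid_accuracy(train, test):
    """Fit a nearest-centroid classifier on train and return accuracy on test."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x):
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return sum(predict(x) == y for x, y in test) / len(test)

def score(names, train_rows, valid_rows):
    return centroid_accuracy(featurise(train_rows, names), featurise(valid_rows, names))

def search_features(train_rows, valid_rows, target=0.9):
    """Greedy forward selection: add the most helpful feature until target is met."""
    selected, best = [], 0.0
    while best < target:
        trials = {
            name: score(selected + [name], train_rows, valid_rows)
            for name in CANDIDATES
            if name not in selected
        }
        if not trials:
            break
        name = max(trials, key=trials.get)
        if trials[name] <= best:
            break  # no remaining candidate improves the score
        selected.append(name)
        best = trials[name]
    return selected, best

data = make_data()
train_rows, valid_rows = data[:400], data[400:]

baseline_score = score(["x1", "x2"], train_rows, valid_rows)  # raw features only
selected, final_score = search_features(train_rows, valid_rows)
print("baseline:", round(baseline_score, 3))
print("selected:", selected, "score:", round(final_score, 3))
```

Note that the score used to pick features is measured on the same validation rows, so it is optimistic; a careful setup would report the final result on a separate held-out test set.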
To complete the machine learning pipeline, AutoML platforms also support:
- Data Sourcing and Data Wrangling: AutoML products usually support a variety of data sources such as local files, live SQL databases, blob storage, Hadoop systems, etc. Most of these tools can also enrich the data set by transforming or combining existing features, using techniques such as kernel PCA, select percentile, select rates, one-hot encoding, imputation, balancing, scaling, feature agglomeration, etc.
- Model Explanation: this covers why a particular model or configuration is better than the others and how that model is built. Common functionality includes feature importance, text explanations, further drill-down into ensemble/regularisation methodologies, performance metrics grids, performance curves vs training speed, etc.
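Three of the data preparation steps named in the list above (imputation, scaling and one-hot encoding) can be sketched in plain Python as follows. The tiny table, the column names and the helper functions are all hypothetical, and the choice of mean imputation and min-max scaling is an assumption made for the example; AutoML tools typically try several such strategies.

```python
# A tiny, made-up table with missing values (None) and one categorical column.
rows = [
    {"age": 34.0, "income": 52000.0, "region": "north"},
    {"age": None, "income": 61000.0, "region": "south"},
    {"age": 46.0, "income": None, "region": "north"},
    {"age": 28.0, "income": 43000.0, "region": "west"},
]

def impute_mean(rows, column):
    """Replace missing values in a numeric column with the observed mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = sum(observed) / len(observed)
    return [fill if r[column] is None else r[column] for r in rows]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in values]

def one_hot(rows, column):
    """Encode a categorical column as one 0/1 indicator per category."""
    categories = sorted({r[column] for r in rows})
    encoded = [[1 if r[column] == c else 0 for c in categories] for r in rows]
    return categories, encoded

def prepare(rows, numeric, categorical):
    """Build a numeric feature matrix plus the matching column names."""
    columns, matrix = [], [[] for _ in rows]
    for col in numeric:
        scaled = min_max_scale(impute_mean(rows, col))
        columns.append(col)
        for out, value in zip(matrix, scaled):
            out.append(value)
    for col in categorical:
        categories, encoded = one_hot(rows, col)
        columns.extend(f"{col}={c}" for c in categories)
        for out, bits in zip(matrix, encoded):
            out.extend(bits)
    return columns, matrix

# In a real pipeline the fill values and scaling ranges would be computed on
# the training split only and then reused on new data.
columns, matrix = prepare(rows, numeric=["age", "income"], categorical=["region"])
print(columns)
for line in matrix:
    print(line)
```

Each step is a small, mechanical transformation, which is why this stage of the pipeline lends itself to automation.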
Part 3: Practical considerations
Current AutoML products still have various limitations and constraints. Their purpose is to assist data scientists and free them from the burden of repetitive and less demanding tasks. Typical AutoML challenges include:
- Unsupervised Learning: as unsupervised learning does not rely on labelled datasets, there is no clear measure of success that can be used to assess the quality of results and compare algorithms directly.
- Complex Data Types: most AutoML systems were initially designed to work with structured, tabular or relational data, and were later extended to handle unstructured data such as text and images. However, network data and web data are still not supported by any AutoML product at the time of writing.
- Feature Engineering Embedded with Domain Knowledge: some AutoML systems, such as DataRobot and H2O Driverless AI, offer automatic feature engineering, but none of them can incorporate domain knowledge into the ML process.
Therefore, although AutoML systems can address many of the machine learning challenges we face today, deciding when to adopt AutoML for project deliverables should be treated as an integrated business decision. It is also worth following the recent progress of AutoML development across vendors. The following questions may be helpful in judging whether to use AutoML.
Rui Wang, Data Scientist, I&D Insight Generation
Rui is an AI Engineer in the Insights & Data practice in the UK. She has witnessed the industrialization of various AI methodologies and products over the past decade. “I am still interested in coding up basic mathematic functions for my machine learning models in order to fully capture the details, however over the years I gradually understand the efficiency boost provided by the various packages and tools, which can better serve dynamic business needs.” Rui has been involved in DataRobot and Dataiku training, and has delivered projects using Azure AutoML (certified MCSA) and AWS SageMaker.