This blog is part of a series on the introduction of the Oracle AI Cloud as part of the Oracle PaaS offering. In the first blog, The business case for the Oracle AI Cloud
, we describe the capabilities and usage scenarios. The second blog, Prepping for the Oracle AI Cloud: libraries and tools
, provides an overview of Python and the available data science and machine learning libraries in the Oracle AI Cloud. In this blog, we will delve into the basics of machine learning and take a detailed look at the scikit-learn library, in preparation for the next blog on the “Deep Learning Frameworks” feature of the Oracle AI Cloud.
Machine learning makes it easier to extract answers from data. It uses data, lots of data, to train a model to make predictions. The goal of this training is an accurate model that answers (or predicts) our questions correctly based on a set of measurements of specific data features. The distribution of those features then helps in finding groupings, also known as classifications, which in turn give input to the prediction. By continuously evaluating the results, we are able to improve the learning algorithm.
The Iris flower data set
is a well-known example of linear discriminant analysis. In the 1930s, three species of Iris were investigated based on four features: the length and width of the petal (the inner part of the flower) and the length and width of the sepal (the leaf-like outer part). The data set contains 50 samples of each species, 150 in total, which helped identify the three species based on these features. Using this data to train a machine learning model helped to answer questions about the various Iris species.
In the image below, the data set is loaded into a Jupyter Notebook (part of the Introduction to Machine Learning with Python book, O’Reilly, see Resources), with the three Iris species denoted in different colors. The data in the graph makes it possible to identify the three species, as well as the overlap areas in which the choice is ambiguous. The classifications seen in the graph represent the distribution of the petal and sepal measurements for each Iris species.
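As a starting point, the same data set can be loaded with scikit-learn itself, which ships a copy of it. A minimal sketch of inspecting the raw material before any training:

```python
from sklearn.datasets import load_iris

# Load the classic Iris data set bundled with scikit-learn
iris = load_iris()

# 150 samples with 4 measured features each:
# sepal length/width and petal length/width (in cm)
print(iris.data.shape)        # (150, 4)
print(iris.feature_names)
print(iris.target_names)      # setosa, versicolor, virginica
```

Each row of `iris.data` is one measured flower; `iris.target` holds the matching species label, which is exactly the per-row outcome that supervised learning needs.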
Machine learning thrives on data. The end result of a machine learning investigation depends on the quality of the data: as always, garbage in equals garbage out. Even the most sophisticated machine learning algorithm is useless without the right data. The process of preparing and running the data through machine learning involves multiple steps and feedback loops. It starts with cleansing the data, where the basic question is: which data is important for determining the classification? For example, when you want to investigate why specific chatbot conversations have been abandoned, you need to leave out the audit trails of chats that completed successfully.
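The chatbot example above can be sketched as a simple filtering step. This uses pandas, and the data and column names (`status`, `turns`) are purely illustrative:

```python
import pandas as pd

# Hypothetical chatbot audit-trail records; columns are illustrative
chats = pd.DataFrame({
    "conversation_id": [101, 102, 103, 104],
    "status": ["abandoned", "completed", "abandoned", "completed"],
    "turns": [2, 8, 1, 6],
})

# To investigate abandoned conversations, drop the successful chats
abandoned = chats[chats["status"] == "abandoned"]
```

Only the rows relevant to the question at hand move on to feature extraction and training.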
The next step in the machine-learning process involves selecting the data features that need to be measured. These features are then used by the machine-learning training model to find the groupings of data or classifications. These features need to be extracted from the data that is brought in. The next step is to run through (parts of) the data to check if it makes sense and then to run the data through a model (see below) and check the outcomes. As a result, it is possible to ascertain which features only add noise and can be removed, and which additional features are needed. In this respect, comparing the results obtained from other models can provide additional insights.
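One way to check which features carry signal and which only add noise is automated feature selection. A minimal sketch with scikit-learn's `SelectKBest`, scoring each Iris feature against the species labels:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()

# Score each feature against the labels and keep the two strongest
selector = SelectKBest(f_classif, k=2)
reduced = selector.fit_transform(iris.data, iris.target)

# 'reduced' now holds only the two most informative features per sample
```

Running a model on `reduced` versus the full feature matrix, and comparing the outcomes, is the kind of feedback loop described above.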
Linear algebra and statistics provide the basis for machine learning. Andreas C. Müller and Sarah Guido provide a comprehensive introduction to theory and practice, as well as a practical overview of models, in their book “Introduction to Machine Learning with Python.” Luckily, the Python libraries abstract away many of these complexities and make the different calculation methods readily usable. All you need to do is:
- Bring the data in the right shape
- Determine what you would like to investigate
- Decide which model will help you with your question
- Train the model with measurements.
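The four steps above can be sketched end to end with scikit-learn on the Iris data. The choice of k-nearest neighbors here is just one reasonable option for this question:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Bring the data in the right shape: feature matrix X, labels y
X, y = load_iris(return_X_y=True)

# 2./3. The question (which species?) and a model choice (k-NN);
#    hold back part of the data to evaluate the trained model
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = KNeighborsClassifier(n_neighbors=3)

# 4. Train the model with measurements, then evaluate it
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

Evaluating on data the model has not seen during training is what makes the accuracy figure a meaningful check of the feedback loop.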
Three types of machine learning can be distinguished: supervised learning, unsupervised learning, and reinforcement learning.
In supervised learning, the algorithm is fed data (features) together with an outcome for every data row (classification). In the Iris example, an Iris species is provided for every combination of petal and sepal length/width. The learning algorithm can then relate a new measurement to the existing combinations and make a prediction.
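Predicting the species for a new measurement can be sketched as follows; the measurement values are made up for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Train on all labeled measurements (supervised: features + outcomes)
knn = KNeighborsClassifier(n_neighbors=1).fit(iris.data, iris.target)

# A new, unlabeled measurement: sepal length/width, petal length/width (cm)
new_flower = [[5.0, 2.9, 1.0, 0.2]]
prediction = iris.target_names[knn.predict(new_flower)[0]]  # 'setosa'
```

The small petal (1.0 cm by 0.2 cm) places this flower firmly among the setosa samples, so the model relates it to that group.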
In unsupervised learning, input features are provided but no output is given. The machine learning system must identify the classification groups on its own.
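A minimal unsupervised sketch on the same Iris measurements, deliberately ignoring the species labels and letting k-means clustering find three groups on its own:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Labels are deliberately discarded: only the features go in
X, _ = load_iris(return_X_y=True)

# Ask k-means to find three groups using the measurements alone
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_   # one cluster assignment per sample
```

The clusters found this way can then be compared against the known species to see how well the structure in the features alone matches the real classification.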
In reinforcement learning, an additional element is added to address interactions with the environment: the system receives rewards for performing specific actions. Reinforcement learning is used, for instance, in gaming or operations.
Supervised learning is a good starting point for a machine-learning endeavor because it aligns with available knowledge and provides fast results. The scikit-learn library supports an extensive list of supervised models, the most well-known of which are:
- Linear models
- Nearest neighbors
- Gaussian processes
- Naive Bayes
- Decision trees.
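Because scikit-learn gives all estimators the same `fit`/`predict` interface, two of the model families listed above can be compared side by side on the same data. A sketch using cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Evaluate two of the listed model families on identical folds
for name, model in [("naive Bayes", GaussianNB()),
                    ("decision tree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```

Comparing results across models in this way is one practical input to the model-selection question discussed next.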
Apart from getting the right data and determining the features, selecting the model is a difficult choice. Luckily, the scikit-learn website provides a cheat sheet to help determine which model (or estimator, as scikit-learn calls it) to choose:
- Introduction to Machine Learning with Python (O’Reilly)
A good introduction to machine learning theory, directly translated into practical examples with Python libraries.
Machine learning is about creating a model that helps predict answers to questions based on training. Three types of machine learning can be distinguished: supervised (training with output known), unsupervised (no classifications known upfront), and reinforcement learning (learn by doing).
The Oracle AI Cloud supports supervised machine learning with the inclusion of the scikit-learn library. This library provides a multitude of models that support a variety of data scenarios.
In the next blog, we will dive into the infrastructure supporting the Oracle AI Framework.
This blog series was co-authored by Léon Smiers and Johan Louwers. Léon Smiers is an Oracle ACE and a thought leader on Oracle cloud within Capgemini. Johan Louwers is an Oracle ACE Director and global chief architect for Oracle technology. Both can be contacted for more information about this and other topics via email: Leon.Smiers@capgemini.com and Johan.Louwers@capgemini.com.