This blog looks at how we can determine the smartness of an Artificial Intelligence by measuring its performance on real-world problems.
How Well Does the Curve Fit the Data?
If our curves do not fit the data well, then our models do not explain reality. To be intelligent, a system needs to be right at least most of the time, so it is important to find a way to measure how well the model fits the data. This is really the key to implementing Artificial Intelligence: a clear and unambiguous way of measuring how successfully an AI solves a real-world problem. Unfortunately, this is much harder than it sounds. The difficulty starts in the lab, as researchers and scientists will usually begin by constraining their problem in order to control the variables and apply the scientific method to test the efficacy of the solution. The AI is often found to perform well within the constrained environment, but adding all the variability of the real world impairs its performance to the point where it is little better than the naive solution. For example, in stock market forecasting the naive solution is normally taken to be using today’s price as the prediction of tomorrow’s price, and many AI-based forecasters struggle to improve on this simple approach in the long run. There are two main reasons (other than the model simply being wrong) why this occurs: over-fitting and peeking.
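As a rough sketch of the naive baseline idea, the following compares a "tomorrow equals today" forecast with a stand-in model on the same days. The price series, the function names and the moving-average forecaster are all invented for illustration; they are not taken from any real forecaster or data set.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def naive_forecast(prices):
    """Naive baseline: predict each day's price as the previous day's price."""
    return prices[:-1]


def moving_average_forecast(prices, window=3):
    """Hypothetical stand-in 'model': mean of the previous `window` days."""
    return [sum(prices[i - window:i]) / window for i in range(window, len(prices))]


# Made-up closing prices, for illustration only.
prices = [100.0, 101.0, 103.0, 102.0, 104.0, 107.0, 106.0, 108.0]
window = 3

# Score both forecasters on exactly the same days so the comparison is fair.
actual = prices[window:]
naive_error = mean_absolute_error(actual, naive_forecast(prices)[window - 1:])
model_error = mean_absolute_error(actual, moving_average_forecast(prices, window))

print(f"naive error: {naive_error:.3f}")
print(f"model error: {model_error:.3f}")
print("model beats the naive baseline:", model_error < naive_error)
```

The point of the comparison is that a model's error only means something relative to the baseline: a forecaster that cannot beat the naive error on the same days has not learned anything useful.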
Over-fitting
Creating and refining an AI solution often involves some form of optimisation of a set of parameters controlling the behaviour of the AI. An example is adjusting the weights in the artificial synapses of an Artificial Neural Network with a process called back-propagation, which repeatedly uses the errors appearing at the network’s output, measured against data known as the training set, to modify the weights so that the errors are gradually reduced to a minimum. If this is done too aggressively, or on too small a training set, the patterns learned by the AI can be artifacts of the data set itself rather than of the underlying processes that generate the data being observed.
For example, if we think back to the weight and height metrics in a previous blog, imagine a case where the data we are using happens to be a database containing only basketball players. If we applied this model to the general population, it would tend to over-estimate people’s height for a given weight or, conversely, under-estimate their weight for a given height. The model of the real world that the AI has learnt is too specific, because the data it has seen represents only a narrow part of the population.
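The basketball example can be sketched in a few lines. All the numbers here are invented, and the straight-line fit is only an assumed stand-in for whatever model was used in the earlier post.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented (weight in kg, height in cm) pairs, for illustration only.
basketball_players = [(85, 193), (95, 201), (100, 203), (110, 211)]
general_population = [(60, 165), (70, 172), (80, 178), (90, 183)]

# Train on the narrow sample only.
slope, intercept = fit_line(basketball_players)

# Signed error: positive means the model predicts people taller than they are.
signed_errors = [(slope * w + intercept) - h for w, h in general_population]
mean_signed_error = sum(signed_errors) / len(signed_errors)
print(f"mean over-estimate of height: {mean_signed_error:.1f} cm")
```

With these made-up figures every prediction for the general population comes out too tall, which is the bias described above.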
What we want is a model that can “generalise” across data from the real world and maintain a reasonable degree of accuracy. We can do this by making sure that the data we use to train the AI contains examples of all the variations observed in the real world, at frequencies similar to those with which they occur in the real world. Sometimes this is not possible, because we don’t have access to historical data, or because the phenomenon is not well understood, so we cannot say that every possible permutation has yet occurred and been captured as data.
Also, the processes underpinning a phenomenon can be in a state of flux: even though we model them accurately from past data, tomorrow a new or modified process may cause the real world and the model to diverge. This is often known as a shifting baseline.
Another technique used to manage over-fitting is to terminate the learning process early, so that no single data point is too influential on what is learned from the data set as a whole. In this way erroneous or poorly represented data points (either too frequent or not frequent enough) do not skew the results, and the AI focuses not on the fine details of the data set but on the broad relationship represented by the data set as a whole.
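One common way to decide when to stop is to watch the error on a separate validation set and halt once it stops improving. The sketch below assumes that approach, which may not be the exact variant meant here. The data, the straight-line model, the learning rate and the `patience` parameter are all illustrative choices.

```python
def mse(w, b, data):
    """Mean squared error of the line y = w * x + b over (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train_with_early_stopping(train, validation, lr=0.01, max_epochs=1000, patience=5):
    """Gradient descent on a straight-line model, halted when validation error stalls."""
    w, b = 0.0, 0.0
    best = (mse(w, b, validation), w, b)  # (validation error, w, b) of the best model so far
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch in range(1, max_epochs + 1):
        # Gradients of the training error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / len(train)
        grad_b = sum(2 * (w * x + b - y) for x, y in train) / len(train)
        w -= lr * grad_w
        b -= lr * grad_b
        epochs_run = epoch

        val_error = mse(w, b, validation)
        if val_error < best[0]:
            best = (val_error, w, b)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error has stopped improving
    return best, epochs_run


# Invented data: the underlying relationship is roughly y = x,
# but the last training point is an erroneous reading.
train = [(0, 0.1), (1, 0.9), (2, 2.2), (3, 2.8), (4, 9.0)]
validation = [(0.5, 0.5), (1.5, 1.5), (2.5, 2.5), (3.5, 3.5)]

best, epochs_run = train_with_early_stopping(train, validation)
print(f"stopped after {epochs_run} epochs, best validation error {best[0]:.3f}")
```

Training to completion would pull the line towards the erroneous point. Stopping at the best validation error keeps the weights from before that point starts to dominate.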
Peeking
Another problem that is notoriously difficult to avoid is known as peeking. Peeking occurs when knowledge of future data leaks into the environment used to evolve the AI’s learning. This can be as simple as measuring the performance of the AI based on the errors it generates on the data used to train it, or as insidious as the researcher or scientist unconsciously introducing bias into the AI by tweaking parameters with knowledge of what exists in the data set used for testing the AI’s performance. This is potentially so significant that best practice is to avoid keeping the “testing” data set in the same storage environment as the training data, and to ensure that the researchers themselves have never been exposed to the test data in any way, including reading about its characteristics in text written by a third party. In the real world there is no way of knowing what future data will look like, so the best way to replicate the real world in the training and testing environment is to ensure that data from the future does not pollute the present and thereby skew the measured performance of the AI. Anyone who has worked in this field has, at some point, been tripped up by seemingly miraculous results, only to find later that they had somehow incorporated impossible future knowledge into their environment.
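The simplest form of peeking, scoring a model on the data it was trained on, can be illustrated with a small sketch. The observations are invented, and the nearest-neighbour "model" is chosen only because it memorises its training data, which makes the effect easy to see.

```python
def chronological_split(observations, train_fraction=0.75):
    """Split time-ordered data so every test point comes after every training point."""
    cut = int(len(observations) * train_fraction)
    return observations[:cut], observations[cut:]


def nearest_neighbour_predict(train, x):
    """Predict y for x by copying the y of the closest training x (a memorising model)."""
    _, closest_y = min(train, key=lambda pair: abs(pair[0] - x))
    return closest_y


def mean_absolute_error(train, data):
    """Average absolute error of the memorising model over (x, y) pairs in `data`."""
    return sum(abs(nearest_neighbour_predict(train, x) - y) for x, y in data) / len(data)


# Made-up (day, value) observations in time order.
observations = [(1, 10.0), (2, 12.0), (3, 11.0), (4, 13.0),
                (5, 15.0), (6, 14.0), (7, 17.0), (8, 18.0)]
train, test = chronological_split(observations)

# Peeking: scored on data the model has already memorised.
error_on_training_data = mean_absolute_error(train, train)
# Honest: scored on later data the model has never seen.
error_on_unseen_data = mean_absolute_error(train, test)

print("error on training data:", error_on_training_data)
print("error on unseen data:", error_on_unseen_data)
```

The training-set score looks perfect while the held-out score does not, and only the second says anything about how the model would behave on tomorrow's data.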
Error Measures and Cost Functions
As previously mentioned, the aim of an AI is to understand the real world, and the way we measure how well it does this is to compare its response with what the real world is actually doing and then assign a measure of error to the difference. We then evolve or train the system to reduce the error it makes in estimating the real-world problem we have set it. Although this seems quite straightforward, the way the measurement is constructed can affect how well the AI learns to generalise. Different measures are also better suited to different learning algorithms, because some have specific mathematical characteristics that can be exploited to make the learning process more efficient.
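To make the effect of the choice of measure concrete, here is a small sketch comparing two widely used error measures, mean absolute error and mean squared error, on two sets of invented predictions. The numbers are chosen purely for illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the errors, treating every unit of error equally."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average squared error, which penalises large misses far more than small ones."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [10.0, 12.0, 11.0, 13.0]
steady = [11.0, 13.0, 12.0, 14.0]        # off by 1 on every point
one_big_miss = [10.0, 12.0, 11.0, 17.0]  # perfect except for a single miss of 4

for name, predicted in [("steady", steady), ("one big miss", one_big_miss)]:
    mae = mean_absolute_error(actual, predicted)
    mse = mean_squared_error(actual, predicted)
    print(f"{name}: mean absolute error {mae}, mean squared error {mse}")
```

Mean absolute error rates the two sets of predictions as equally good, while mean squared error rates the one with the single large miss as four times worse. Which ranking is "right" depends on whether occasional large errors matter more than frequent small ones in the domain at hand.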
The way in which we interpret and measure the accuracy of the model can be very important. This is analogous to the idea that statistics can be used to support any view you desire, as in the saying “Lies, damned lies, and statistics”. It would be unfair to say that error measures and cost functions are misleading, but it is accurate to say that the same measurement can be interpreted differently in different circumstances, and what works well for one domain may not work well in another.