Yahoo recently wrote up their key scientific challenges. It's interesting to compare their machine learning challenges with those facing data mining and machine learning efforts in oil well drilling. The list was made by John Langford, who comments that the challenges are general enough to have applications outside Yahoo. And indeed, three of the five challenges mirror challenges in analysing drilling time series:
- The problem of nonstationary data is the most obvious and is shared by most real-world time series. The multivariate time series changes abruptly when the driller switches between tasks or “drilling modes”. A given flow rate could indicate business as usual in one mode and spell disaster in another. Therefore, alarm systems and other analyses don’t get far without somehow recognizing these modes.
- The second of Langford’s challenges relates to label complexity and the lack of labeled data. In a drilling time series, the labels are sparse and coarse-grained. They point out a few unusual events and summarize stretches of routine operation that are orders of magnitude longer than the mode changes mentioned above. Manual labeling with the help of an expert is possible but time-consuming and expensive. Among the solutions, Langford mentions semi-supervised methods, an avenue of exploration shared with intrusion detection in computer networks, a domain similar to drilling in that it’s only feasible to manually label a tiny part of the data.
- The last challenge has been named Exploration. As Langford puts it: “You can’t rewind a user and try a different action, so you only get feedback for the chosen action”. As with nonstationarity, this is typical for real-world data, and it’s perhaps not surprising that drilling faces a similar challenge. The high cost of drilling a well and the consequences of making a mistake mean that some sequences of actions or choices of parameters are never found in real-world data.
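
To make the point about drilling modes concrete, here is a minimal Python sketch of a mode-aware flow-rate alarm. It assumes the mode of each sample is already known, which is the hard part in practice. The mode names (`drilling`, `tripping`), the limits and the helper names (`FlowLimits`, `flow_alarm`, `alarms`) are all made up for illustration and are not real drilling parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowLimits:
    """Normal flow-rate range for one mode (illustrative units: litres/minute)."""
    low: float
    high: float


# Hypothetical modes and limits, chosen only to illustrate the idea.
LIMITS_BY_MODE = {
    "drilling": FlowLimits(low=1500.0, high=2500.0),
    "tripping": FlowLimits(low=0.0, high=200.0),
}


def flow_alarm(mode: str, flow_rate: float) -> bool:
    """Return True if flow_rate lies outside the normal range for this mode."""
    limits = LIMITS_BY_MODE[mode]
    return not (limits.low <= flow_rate <= limits.high)


def alarms(samples):
    """Yield (index, mode, flow_rate) for every sample that trips its mode's limits."""
    for i, (mode, flow) in enumerate(samples):
        if flow_alarm(mode, flow):
            yield i, mode, flow


# The same flow rate is routine in one mode and alarming in another.
samples = [("drilling", 2000.0), ("tripping", 2000.0), ("tripping", 50.0)]
print(list(alarms(samples)))  # [(1, 'tripping', 2000.0)]
```

A single global threshold would either miss the second sample or raise false alarms on the first, which is why recognizing the mode has to come before the alarm logic.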
So the challenges appear very general. But will the solutions be as generally useful as the problems? In my PhD I’ve found that examples from biology, proteomics and intrusion detection help me to understand where the stumbling blocks in drilling time series analysis lie. (Perhaps a topic for later posts?) But I’ve also tried applying methods from a similar problem, only to see them fail due to seemingly small differences in the problem statements.
In any case, the last two challenges are problems I don’t seem to have come across in any of the introductory machine learning texts I’ve read. The subjects are easily explained, so have I read the wrong books, or have they been blind spots in the curriculum until recently?