Machine Learning: The Data Challenge

by Mark Wickham

Machine learning (ML) and deep learning (DL) have gone mainstream. As software developers, we must consider how these technologies could positively impact any new project we implement. While algorithms and platforms receive much of the attention, in this post I will discuss the importance of the underlying data foundation.

With algorithms, there are passionate debates about which learning style is the best and which specific algorithm we should employ for particular problem types. Any good text on machine learning will cover the "alphabet soup" of algorithms including CNN, RNN, K-Nearest Neighbors, K-Means Clustering, and Random Forest, just to name a few. There are many factors to consider when choosing the best algorithm and platform for your problem. In Practical Java Machine Learning, I present two different approaches for selecting the best algorithm: a data-driven approach (shown below) and a functional approach driven by some basic questions about what you are trying to achieve.

Fig. 1 

The debate over platforms is equally intense. The cloud providers offer many viable choices, including Google's TensorFlow and Apache MXNet.

Rather than focus on the algorithms, platforms, or programming languages, I want to discuss the importance of the underlying data required to drive our ML solutions. Data is the foundation of any ML solution. Building ML or DL models on top of a shaky data foundation will always result in failure. Especially with DL, small data integrity issues become magnified as they propagate through the hidden layers of DL algorithms. It is no coincidence that we see so much focus on the data scientist position in the current job market.

Organizations and individuals must consider several important factors when creating ML solutions: 

  • Data acquisition costs: We need data, and for deep learning, we need lots of it. Most organizations do not have enough high-quality data for deep learning solutions.
  • Data preprocessing costs: We need the data in a format we can use. We typically need "labeled" data, and much of the existing data is "unlabeled".
  • Talent acquisition costs: In the past, we would hire software engineers to create our solutions. With a new focus on data, we need talent with slightly adjusted skill sets.
  • Model generation costs: Once the data is organized, we need to build a model, and this often requires large processing resources.
  • Platform hosting costs: Once the model is built, we need to host it, providing access to clients who utilize the model's predictive power.

Not surprisingly, many of the costs are data-related. Let's discuss some strategies to address the acquisition and preprocessing costs.


The Data Dilemma

Data is the fuel for machine learning. Acquiring data sounds simple enough, as we are accustomed to searching the Internet and finding just about anything. There certainly is a lot of data out there. In Practical Java Machine Learning I take an in-depth look at potential sources of data we can use for ML applications. 

Some of the more promising sources include:

  • Public government data.
  • Private data such as social media data.
  • Personal data we may capture for ourselves using the greatest data collection device ever created: the smartphone.
  • Image and audio data. Yes, those .jpg, .mp3, and video files are in fact data.
  • Synthetic data: data we create ourselves or derive from underlying "real" data.
  • Sensor data. Arguably the category with the most explosive growth potential.

Across these many potential sources of data, we find that not all data is created equal. So what is "high-quality" data? We can divide data into two high-level categories: structured and unstructured. Much of the existing data is unstructured or unlabeled. The chart below shows this imbalance.


Fig. 2 

Notice in the algorithm selection flowchart shown above that the type of data (structured or unstructured) determines which algorithm to use. For structured/labeled data we choose the classification family of algorithms, and for unlabeled data we choose the clustering algorithms. 

A majority of useful ML applications employ classification algorithms. Structured data is critical for classification because the algorithms require labeled training data to create the models. Once created, they can then be used to predict outcomes for new data samples.
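
To make the role of labels concrete, here is a minimal sketch of a nearest-neighbor classifier in plain Java. It is not taken from the book: the class name, the two-attribute samples, and the labels "A" and "B" are invented for illustration. The point is only that every training sample must carry a label before the model can predict one for new data.

```java
import java.util.ArrayList;
import java.util.List;

public class NearestNeighborSketch {

    // One labeled training example: numeric attributes plus the known label.
    private static final class LabeledSample {
        final double[] attributes;
        final String label;

        LabeledSample(double[] attributes, String label) {
            this.attributes = attributes;
            this.label = label;
        }
    }

    private final List<LabeledSample> training = new ArrayList<>();

    // "Training" a nearest-neighbor model just means remembering labeled samples.
    public void train(double[] attributes, String label) {
        training.add(new LabeledSample(attributes.clone(), label));
    }

    // Predicts by copying the label of the closest remembered sample.
    public String predict(double[] attributes) {
        if (training.isEmpty()) {
            throw new IllegalStateException("No labeled training data available");
        }
        LabeledSample best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (LabeledSample sample : training) {
            double distance = squaredDistance(sample.attributes, attributes);
            if (best == null || distance < bestDistance) {
                bestDistance = distance;
                best = sample;
            }
        }
        return best.label;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        NearestNeighborSketch model = new NearestNeighborSketch();
        // Hypothetical labeled data: two attributes per sample.
        model.train(new double[] {1.0, 1.0}, "A");
        model.train(new double[] {1.2, 0.8}, "A");
        model.train(new double[] {5.0, 5.0}, "B");
        model.train(new double[] {4.8, 5.3}, "B");

        System.out.println(model.predict(new double[] {1.1, 0.9})); // prints A
        System.out.println(model.predict(new double[] {5.1, 4.9})); // prints B
    }
}
```

Without the label argument to train(), there would be nothing for predict() to return, which is exactly why unlabeled data cannot feed a classifier directly.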

Given this imbalance, the key question becomes: how can we acquire sufficient labeled data to create effective ML classification models?

A few strategies to consider:

  1. Perhaps the most obvious solution for the data imbalance is to manually label unlabeled data. The problem here is that this approach can be very expensive, especially for the huge datasets which are required for DL. This reality has led some analysts to conclude that organizations, or even countries, where low-cost labor is accessible may have a significant advantage in achieving AI dominance. China understands this well as it attempts to move up the value chain.
  2. We can restructure unlabeled data in traditional SQL database tables into useful labeled data stored in a NoSQL database, such as Cassandra. This approach is more feasible than the manual approach because it can be automated. For example, we might be able to extract input data, or "attributes", from one SQL table, and add the critical "label" value from another SQL table. We can convert the combined labeled entry to JSON and store it in a NoSQL database, which has the advantage of being highly scalable and well suited for ML. Such an operation could be completely automated.
  3. Interestingly, we can use ML clustering algorithms to help us label unlabeled data. There is something satisfying about using ML to help solve the unlabeled data issue which plagues ML. In Practical Java Machine Learning, I use a sample data set of the Old Faithful geyser in Yellowstone National Park to demonstrate clustering algorithms. If we store the cluster results with the original data, we have effectively labeled the unlabeled dataset. Later in the book, I create an Old Faithful classification application for the Raspberry Pi device using the newly labeled geyser dataset.
  4. We can purchase private labeled data, or even create synthetic labeled data. As an example, I recently visited a casino in Macao where I came up with an idea for a new ML application. Macao is interesting. The gaming revenues are five times those of the Las Vegas Strip, and for only one reason: Baccarat. Baccarat is a card game where two hands are dealt. To implement the Baccarat application, I needed Baccarat hand card data, and lots of it. I discovered that you can purchase such data of actual dealt card hands. Presumably it took a lot of manual effort for someone to compile this dataset. But, as developers, we can easily write a Java or Python script to create such data for us, giving us unlimited free synthetic data. Assuming we have a good random number generator, this synthetic data should be just as effective when we feed it into the ML algorithm.
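
Strategy 2 can be sketched without a real database. In the example below, two in-memory maps stand in for the SQL "attributes" table and the SQL "label" table. The table contents, the column names (durationMin, amountUsd), and the label value are all hypothetical, and the JSON is assembled by hand only to keep the sketch self-contained. A real pipeline would presumably read the rows through JDBC and write the documents with the NoSQL store's own client library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LabelJoinSketch {

    // Joins attribute rows with label rows on a shared key and returns one
    // JSON document per labeled row. Rows without a label are skipped.
    // Assumes all numeric values are finite (NaN and Infinity are not valid JSON).
    public static List<String> joinToJson(Map<Integer, Map<String, Double>> attributeTable,
                                          Map<Integer, String> labelTable) {
        List<String> documents = new ArrayList<>();
        for (Map.Entry<Integer, Map<String, Double>> row : attributeTable.entrySet()) {
            String label = labelTable.get(row.getKey());
            if (label == null) {
                continue; // still unlabeled, so not usable for classification yet
            }
            StringBuilder json = new StringBuilder("{");
            json.append("\"id\":").append(row.getKey());
            for (Map.Entry<String, Double> column : row.getValue().entrySet()) {
                json.append(",\"").append(escape(column.getKey())).append("\":")
                    .append(column.getValue());
            }
            json.append(",\"label\":\"").append(escape(label)).append("\"}");
            documents.add(json.toString());
        }
        return documents;
    }

    // Minimal escaping of backslashes and double quotes for this sketch.
    static String escape(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        // Hypothetical "attributes" table: primary key -> column values.
        Map<Integer, Map<String, Double>> attributes = new LinkedHashMap<>();
        Map<String, Double> row1 = new LinkedHashMap<>();
        row1.put("durationMin", 3.5);
        row1.put("amountUsd", 120.0);
        attributes.put(1, row1);
        Map<String, Double> row2 = new LinkedHashMap<>();
        row2.put("durationMin", 0.5);
        row2.put("amountUsd", 15.0);
        attributes.put(2, row2);

        // Hypothetical "label" table: primary key -> label. Row 2 has no label.
        Map<Integer, String> labels = new LinkedHashMap<>();
        labels.put(1, "approved");

        for (String document : joinToJson(attributes, labels)) {
            System.out.println(document);
        }
    }
}
```

Running main prints a single document, {"id":1,"durationMin":3.5,"amountUsd":120.0,"label":"approved"}, because only row 1 has a matching label.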

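Strategy 3 can be sketched with a basic k-means loop. This is not the book's implementation, which I am not reproducing here. The class name is invented, the starting centroids are simply supplied by the caller, and the six geyser-style samples (eruption duration and waiting time, in minutes) are made-up values rather than rows from the real Old Faithful dataset. The cluster index returned for each row is what we would store next to the original data as its new label.

```java
import java.util.Arrays;

public class ClusterLabelSketch {

    // Runs a basic k-means and returns the cluster index assigned to each point.
    // The caller supplies the starting centroids, which also fixes k.
    public static int[] kMeansLabels(double[][] points, double[][] initialCentroids,
                                     int maxIterations) {
        int k = initialCentroids.length;
        int dims = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initialCentroids[c].clone();
        }
        int[] labels = new int[points.length];
        Arrays.fill(labels, -1);

        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int nearest = 0;
                double best = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double distance = 0.0;
                    for (int j = 0; j < dims; j++) {
                        double diff = points[i][j] - centroids[c][j];
                        distance += diff * diff;
                    }
                    if (distance < best) {
                        best = distance;
                        nearest = c;
                    }
                }
                if (labels[i] != nearest) {
                    labels[i] = nearest;
                    changed = true;
                }
            }
            if (!changed) {
                break; // assignments are stable, so the clustering has converged
            }
            // Update step: move each centroid to the mean of its assigned points.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[labels[i]]++;
                for (int j = 0; j < dims; j++) {
                    sums[labels[i]][j] += points[i][j];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int j = 0; j < dims; j++) {
                        centroids[c][j] = sums[c][j] / counts[c];
                    }
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Made-up samples: {eruption duration, waiting time}, both in minutes.
        double[][] samples = {
            {1.8, 54.0}, {2.0, 51.0}, {1.9, 56.0},
            {4.3, 80.0}, {4.5, 84.0}, {4.1, 78.0}
        };
        double[][] start = {samples[0], samples[3]};
        int[] labels = kMeansLabels(samples, start, 20);
        for (int i = 0; i < samples.length; i++) {
            System.out.println(Arrays.toString(samples[i]) + " -> cluster " + labels[i]);
        }
    }
}
```

With these samples, the first three rows land in cluster 0 and the last three in cluster 1. The indexes themselves carry no meaning, so a person still has to decide what each cluster represents before the labels are used to train a classifier.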

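For strategy 4, the sketch below shows one way such a script might generate labeled Baccarat hands in Java. It is my own illustration, not the application described above. It assumes an eight-deck shoe, ignores suits because they do not affect scoring, applies the standard third-card drawing rules, and writes each round as a comma-separated row of my own devising: player cards, banker cards, both totals, and the outcome. The seed passed to java.util.Random makes the data reproducible.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class BaccaratDataSketch {

    // Ranks run from 1 (ace) to 13 (king). Ace counts 1, 2-9 count face value,
    // and 10, jack, queen, king count 0.
    public static int pointValue(int rank) {
        return rank >= 10 ? 0 : rank;
    }

    // A hand's total is the sum of its card values modulo 10.
    public static int total(List<Integer> hand) {
        int sum = 0;
        for (int rank : hand) {
            sum += pointValue(rank);
        }
        return sum % 10;
    }

    // Banker drawing rule. playerThird is the point value of the player's
    // third card, or -1 if the player stood on two cards.
    public static boolean bankerDraws(int bankerTotal, int playerThird) {
        if (playerThird < 0) {
            return bankerTotal <= 5;
        }
        switch (bankerTotal) {
            case 0: case 1: case 2: return true;
            case 3: return playerThird != 8;
            case 4: return playerThird >= 2 && playerThird <= 7;
            case 5: return playerThird >= 4 && playerThird <= 7;
            case 6: return playerThird == 6 || playerThird == 7;
            default: return false;
        }
    }

    // Builds a shuffled shoe containing four copies of each rank per deck.
    static List<Integer> newShoe(int decks, Random random) {
        List<Integer> shoe = new ArrayList<>();
        for (int copy = 0; copy < decks * 4; copy++) {
            for (int rank = 1; rank <= 13; rank++) {
                shoe.add(rank);
            }
        }
        Collections.shuffle(shoe, random);
        return shoe;
    }

    static int draw(List<Integer> shoe) {
        return shoe.remove(shoe.size() - 1);
    }

    static String join(List<Integer> hand) {
        StringBuilder text = new StringBuilder();
        for (int rank : hand) {
            if (text.length() > 0) {
                text.append('-');
            }
            text.append(rank);
        }
        return text.toString();
    }

    // Deals one round: "playerCards,bankerCards,playerTotal,bankerTotal,outcome".
    static String dealRound(List<Integer> shoe) {
        List<Integer> player = new ArrayList<>();
        List<Integer> banker = new ArrayList<>();
        player.add(draw(shoe));
        banker.add(draw(shoe));
        player.add(draw(shoe));
        banker.add(draw(shoe));
        // With a "natural" 8 or 9 on either side, nobody draws a third card.
        if (total(player) < 8 && total(banker) < 8) {
            int playerThird = -1;
            if (total(player) <= 5) {
                int card = draw(shoe);
                player.add(card);
                playerThird = pointValue(card);
            }
            if (bankerDraws(total(banker), playerThird)) {
                banker.add(draw(shoe));
            }
        }
        int p = total(player);
        int b = total(banker);
        String outcome = p > b ? "PLAYER" : (b > p ? "BANKER" : "TIE");
        return join(player) + "," + join(banker) + "," + p + "," + b + "," + outcome;
    }

    // Generates the requested number of labeled rounds from a seeded generator.
    public static List<String> generate(int rounds, long seed) {
        Random random = new Random(seed);
        List<Integer> shoe = newShoe(8, random);
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < rounds; i++) {
            if (shoe.size() < 6) {
                shoe = newShoe(8, random); // a round uses at most six cards
            }
            rows.add(dealRound(shoe));
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : generate(5, 42L)) {
            System.out.println(row);
        }
    }
}
```

Calling generate with the same seed returns the same rows, which helps when an experiment needs to be repeated. Raising the rounds argument is all it takes to produce a dataset as large as the model requires.
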
While algorithms and platforms are very important for ML applications, the most important success factor is starting off with a solid data foundation. Any ML platform you choose to work with will have excellent tools enabling you to clean, process, and visualize your data. My advice to anyone who wishes to implement ML applications is to focus on your data first. Spending time up front acquiring, understanding, and organizing your data will pay dividends when you ultimately use your model to make predictions on the back end.

About the Author

Mark Wickham has been an active developer for many years, mostly in Java. He is passionate about exploring advances in artificial intelligence and machine learning using Java. New software approaches, applied to the ever-expanding volume of data we now have available to us, enable us to create Java solutions that were not previously conceivable. He is a frequent speaker at developer conferences. His popular classes cover practical topics such as connectivity, push messaging, and audio/video. Mark has led software development teams for Motorola, delivering infrastructure solutions to global telecommunications customers. While at Motorola, Mark also led product management and product marketing teams in the Asia Pacific region. Mark has been involved in software and technology for more than 30 years and began to focus on the Android platform in 2009, creating private cloud and tablet-based solutions for the enterprise. Mark majored in Computer Science and Physics at Creighton University, and later obtained an MBA from the University of Washington and the Hong Kong University of Science and Technology. Mark is also active as a freelance video producer and photographer, and he enjoys recording live music. Previously Mark wrote Practical Android (Apress, 2018).

This article was contributed by Mark Wickham, author of Practical Java Machine Learning.