2/12/18

Think Like a Data Scientist

By Mark Wickham


You've probably seen those crazy statistics about the amount of data being created and stored on a daily basis. Experts suggest that 90% of the data in the world has been created in just the past 2 years! IBM stated that more than 2.5 exabytes (2.5 billion gigabytes) of data are generated every day. You've heard the term "Big Data". It is used in many different and often confusing ways, but it typically refers to datasets larger than 1 terabyte (TB). You may not be working with data at these scales on your own projects, but if you're like me, all this new data means you are facing new design considerations when architecting your software projects.

In my new book Practical Android, I devoted the very first chapter, Chapter 1: Introduction to JSON, to explaining the approach I use for managing data in several of the projects presented later in the book. Although the chapter doesn't have much Android code, it is possibly the most important chapter in the book. JSON stands for JavaScript Object Notation. It is a very lightweight, text-based, flexible data exchange format. I won't cover all the details here (you can get those from the chapter), but suffice it to say that JSON can represent almost any data structure, is very easy to use, and is available on virtually every platform.

Back in 2012, I designed an Android-based restaurant ordering system. I needed a way to represent customer orders. The orders would be stored on a server and frequently passed from tablet to tablet and from Android client to server. Although the general concept of NoSQL databases first began to gain traction in 2009, at the time I designed my app there was not much debate about the merits of NoSQL database solutions. For my project, I felt compelled to avoid a MySQL database solution in favor of a file-based solution, partly because Android tablets back then were not very powerful and Android did not include all of the functionality we have in today’s builds. The system needed to be simple, yet future-proof. JSON proved to be a perfect solution. To date, the system continues to run and has processed and stored over 1 million records.
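To make the file-based idea concrete, here is a minimal sketch of a customer order stored as one JSON file per order. It is written in Python for brevity and is not the code from the original system; the field names (`order_id`, `table`, `items`) and the helper functions are assumptions made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical order record. The field names are illustrative only,
# not the schema used by the restaurant system described above.
order = {
    "order_id": "A-1001",
    "table": 12,
    "items": [
        {"name": "Kung Pao Chicken", "qty": 1, "price": 38.0},
        {"name": "Jasmine Tea", "qty": 2, "price": 12.0},
    ],
}


def save_order(order, directory):
    """Write one order to its own JSON file and return the file path."""
    path = os.path.join(directory, f"{order['order_id']}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(order, f, indent=2)
    return path


def load_order(path):
    """Read an order back from its JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def order_total(order):
    """Sum quantity times price over every item in the order."""
    return sum(item["qty"] * item["price"] for item in order["items"])


if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    saved_path = save_order(order, folder)
    print(load_order(saved_path) == order)  # the record survives the round trip
    print(order_total(order))
```

Because the file is plain text, the same record can be passed between a client and a server without any schema migration, which is the property that made the format attractive for a system that had to stay simple.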

It turns out that representing unstructured data with JSON managed by NoSQL databases is now very common. Some amazing complementary technologies have emerged. The open source Apache Cassandra NoSQL database project is a perfect complement to JSON file-based architectures. If you are working on Amazon's AWS, DynamoDB offers similar functionality. Additionally, you may have a requirement for streaming data. Streaming solutions provide real-time access to the data. The open source Apache Kafka project can easily handle streaming requirements for JSON data. Kafka was created by LinkedIn, which needed a way to share its massive data across services and geographies. It has since been released as open source and is used by many of the apps and services you use every day.

What does all this have to do with thinking like a data scientist? Well, if you're like me, you may have noticed that the software development process is changing (again!). It used to be that we would define the requirements, identify possibly useful third-party libraries, and then start writing and integrating code. But remember all that data that exists today? Today, we must start by considering how this data impacts our solution, decide how we can organize it, and then let the restructured data drive our software architecture. JSON is not the most elaborate solution for data organization, but its elegance and simplicity can produce excellent results.

I've been working on a Trader Bot app. Trader Bots typically use historical stock market data to place trades, attempting to outperform the overall market. In my case, I will be attempting to replicate a trading strategy that has been executed successfully by real option traders. The traders followed a strict methodology. Each trade has a large amount of data associated with it, which needs to be organized. You probably guessed it: we are talking about implementing a machine learning or, more likely, a deep learning app. Deep learning builds models with neural network algorithms, which adds complexity. In both cases, we build a model with existing data and then use the model to make predictions, in this example predictions of successful option trades.

With all the progress made in AI, these types of apps are becoming more and more popular. While implementing them, I have found that the 80/20 rule applies. It turns out that choosing the algorithm, building a model, and writing the actual Java code for the app are the "20" part. Most of the work, the "80" part, is "data wrangling": creating and manipulating all the data associated with each of the historical trades so the model can make useful predictions for future trades.
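One common wrangling step is turning a nested JSON record into a flat row of named values that a model can consume. The sketch below shows that step in Python under stated assumptions: the trade record and its field names (`symbol`, `entry`, `legs`) are invented for illustration and are not taken from the Trader Bot, and the `flatten` helper is a hypothetical name.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts and lists into one dict with dotted keys.

    Dict keys and list indexes are joined with "." so that every leaf
    value ends up under a unique, readable column name.
    """
    if isinstance(record, dict):
        items = record.items()
    elif isinstance(record, list):
        items = ((str(index), value) for index, value in enumerate(record))
    else:
        # A leaf value: store it under the key path built so far.
        return {prefix: record}

    flat = {}
    for key, value in items:
        name = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, name))
    return flat


# Hypothetical trade record, for illustration only.
trade = {
    "symbol": "XYZ",
    "entry": {"price": 2.5, "delta": 0.3},
    "legs": [{"strike": 100}, {"strike": 105}],
}

if __name__ == "__main__":
    for column, value in flatten(trade).items():
        print(column, value)
```

Flattening every historical trade this way yields rows with consistent column names such as `entry.price` and `legs.1.strike`, which is usually the shape that model-building tools expect.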

In Practical Android, JSON is used to store configuration data for several of the apps. The structure of those JSON files is flat, typically a single Array with a few Objects. For the Trader Bot, we will need to store more complex data representing hundreds or even thousands of aspects of each trade. In my Introduction to JSON, I covered JSONLint.com, a useful site that validates JSON for correctness. For creating more complex JSON files like the ones the Trader Bot app will require, I suggest using JSON Editor, an easy-to-use graphical editor that lets you build JSON files interactively. The open source project is available at github.com/josdejong/jsoneditor.
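The same kind of correctness check can also be run locally before a file ships with an app. The following is a minimal sketch using Python's standard `json` module, not a description of how JSONLint.com works internally; the `validate_json` helper and the sample configuration entries are assumptions for illustration.

```python
import json


def validate_json(text):
    """Return (True, None) for valid JSON, else (False, error location and message)."""
    try:
        json.loads(text)
        return True, None
    except json.JSONDecodeError as err:
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"


# A flat configuration file: a single Array holding a few Objects.
# The entries are hypothetical.
flat_config = '[{"name": "timeout", "value": 30}, {"name": "retries", "value": 3}]'

# The same style of file with a trailing comma, which JSON does not allow.
broken_config = '[{"name": "timeout", "value": 30,}]'

if __name__ == "__main__":
    print(validate_json(flat_config))
    print(validate_json(broken_config))
```

A check like this catches syntax errors only. It does not confirm that the file contains the fields an app expects, so a graphical editor remains useful for building and reviewing larger, deeply nested files.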


Figure 1. D3 Radial Dendrogram Visualization of a JSON File

Because D3 is JavaScript based, it works well inside the Android WebView control. D3 can provide a simple way to integrate amazing visualizations within your apps. If you are interested in using D3 to visualize JSON, or more generally in using D3 on Android, I have included a simple Data Visualization Android app on my GitHub site which can help you get started: github.com/Wickapps/Android-Data-Visualization. The app will show you how to display any D3 visualization on Android.

Thinking like a data scientist means taking control of your data. JSON, together with helpful tools such as JSON Editor and D3 visualization and complementary open source technologies such as the Cassandra NoSQL database and Kafka streaming, can produce powerful, highly scalable distributed solutions. These tools and technologies have the added benefit of being compatible with many machine learning and deep learning engines. Who knows, perhaps your next implementation could unexpectedly reach Big Data status.

Reference Links:
Cassandra Apache Project: cassandra.apache.org
Kafka Apache Project: kafka.apache.org
Android Data Visualization App: github.com/Wickapps/Android-Data-Visualization
D3 Visualization Library: d3js.org
D3 Visualization Gallery: github.com/d3/d3/wiki/Gallery
JSON Editor: github.com/josdejong/jsoneditor

About the Author:

Mark Wickham is a frequent speaker at Android developer conferences. He has been teaching Android since 2013. His popular classes cover practical topics such as connectivity, push messaging, and audio/video. A freelance Android developer, Mark has lived and worked in Beijing since 2000. Mark has led software development teams for Motorola, delivering infrastructure solutions to global telecommunications customers. While at Motorola, Mark also led product management and product marketing teams in the Asia Pacific region. Mark has been involved in software and technology for more than 30 years and began to focus on the Android platform in 2009, creating private cloud and tablet-based solutions for the enterprise. Mark majored in Computer Science and Physics at Creighton University, and later obtained an MBA from the University of Washington and the Hong Kong University of Science and Technology. Mark is also active as a freelance video producer and photographer, and enjoys recording live music.

Want more? Pick up Mark's book, Practical Android: 14 Complete Projects on Advanced Techniques and Approaches, now available on Apress.com.