Think Like a Data Scientist
By Mark Wickham
You've probably seen those crazy statistics about the amount of data being created and stored on a daily basis. Experts suggest that 90% of the data in the world has been created in just the past 2 years! IBM stated that more than 2.5 exabytes (2.5 billion gigabytes) of data is generated every day. You've heard the term "Big Data". It is used in many different, often confusing, ways, but typically refers to datasets larger than 1 terabyte (TB). You may not be working with data at these scales on your own projects, but if you're like me, all this new data means you are facing new design considerations when architecting your software projects.
Back in 2012, I designed an Android-based restaurant ordering system. I needed a way to represent customer orders. The orders would be stored on a server and frequently passed from tablet to tablet and from Android client to server. Although the general concept of NoSQL databases first began to gain traction in 2009, at the time I designed my app there was not much debate about the merits of NoSQL database solutions. For my project, I felt compelled to avoid a MySQL database solution in favor of a file-based solution, partially because Android tablets back then were not very powerful and Android did not include all of the functionality we have in today’s builds. The system needed to be simple, yet future-proof. JSON proved to be a perfect solution. To date, the system continues to run and has processed and stored over 1 million records.
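As a rough illustration of the file-based approach described above, the following sketch stores one order per JSON file and reads it back. It is written in Python for brevity rather than in the Java of the original app, and the field names (order_id, table, items) and helper functions are hypothetical, not the schema or code of the actual system.

```python
import json
import tempfile
from pathlib import Path


def save_order(order: dict, directory: Path) -> Path:
    """Write one order to its own JSON file and return the file path."""
    path = directory / f"order-{order['order_id']}.json"
    path.write_text(json.dumps(order, indent=2), encoding="utf-8")
    return path


def load_order(path: Path) -> dict:
    """Read an order back from its JSON file."""
    return json.loads(path.read_text(encoding="utf-8"))


# Hypothetical order; the field names are illustrative assumptions.
order = {
    "order_id": 1001,
    "table": 12,
    "items": [
        {"name": "Kung Pao Chicken", "qty": 1, "price": 38.0},
        {"name": "Steamed Rice", "qty": 2, "price": 3.0},
    ],
}

# Round-trip the order through a file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_order(order, Path(tmp))
    restored = load_order(saved_path)

# The restored record is an ordinary dict, so totals are easy to compute.
total = sum(item["qty"] * item["price"] for item in restored["items"])
```

Because each record is plain text, the same file can be passed between client and server or inspected by hand without a database engine.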
It turns out that representing unstructured data with JSON managed by NoSQL databases is now very common. Some amazing complementary technologies have emerged. The open source Apache Cassandra NoSQL database project is a perfect complement to JSON file-based architectures. If you are working on Amazon's AWS, DynamoDB offers similar functionality. Additionally, you may have a requirement for streaming data. Streaming solutions provide real-time access to the data. The open source Apache Kafka project can easily handle streaming requirements for JSON data. Kafka was created by LinkedIn, which needed a way to share its massive data across services and geographies. It has since been released as open source and is used by many of the apps and services you use every day.
What does all this have to do with thinking like a data scientist? Well, if you're like me, you may have noticed that the software development process is changing (again!). It used to be that we would define the requirements, identify possibly useful third-party libraries, and then start writing and integrating code. But remember all that data that exists today? Today, we must start by considering how this data impacts our solution, decide how we can organize it, and then let the restructured data drive our software architecture. JSON is not the most elaborate solution for data organization, but its elegance and simplicity can produce excellent results.
I've been working on a Trader Bot app. Trader Bots typically use historical stock market data to place trades, attempting to outperform the overall market. In my case, I will be attempting to replicate a trading strategy that has been executed successfully by real option traders. The traders followed a strict methodology. Each trade has a large amount of data associated with it, and that data needs to be organized. You probably guessed it: we are talking about implementing a machine learning or, more likely, a deep learning app. Deep learning builds models with neural network algorithms, which adds complexity. In both cases, we build a model from existing data and then use it to make predictions about the future, in this example successful option trades.
With all the progress made in AI, these types of apps are becoming more and more popular. While implementing them, I have found that the 80/20 rule applies. It turns out that choosing the algorithm, building a model, and writing the actual Java code for the app are the "20" part. Most of the work is going to be "data wrangling": creating and manipulating all the data associated with each of the historical trades so the model can make useful predictions for future trades.
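One common data-wrangling step is flattening a nested JSON record into a single row of named values that a model can consume. The sketch below shows one way to do that in Python. The trade record and its fields (symbol, legs, result) are invented for illustration and are not taken from the Trader Bot itself.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts and lists into one dict with dotted keys."""
    if isinstance(record, dict):
        items = record.items()
    elif isinstance(record, list):
        items = enumerate(record)
    else:
        # A scalar value: emit it under the key path built so far.
        return {prefix: record}
    flat = {}
    for key, value in items:
        name = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, name))
    return flat


# Hypothetical trade record; every field name here is an assumption.
trade_json = """
{
  "symbol": "XYZ",
  "legs": [
    {"type": "call", "strike": 100},
    {"type": "put", "strike": 90}
  ],
  "result": {"profit": 1.5}
}
"""

row = flatten(json.loads(trade_json))
# row maps keys such as "legs.0.strike" and "result.profit" to scalar values
```

Dotted keys keep the origin of each value visible, which helps when a record grows to hundreds of fields. Note that empty objects and arrays produce no keys in this sketch.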
In Practical Android, JSON is used to store configuration data for several of the apps. The structure of the JSON files is flat, typically a single Array with a few Objects. For the Trader Bot, we will need to store more complex data representing hundreds or even thousands of aspects of each trade. In my Introduction to JSON, I covered JSONLint.com, a useful site that validates JSON for correctness. For creating more complex JSON files like those the Trader Bot app will require, I suggest using JSON Editor, an easy-to-use graphical editor that lets you build JSON files interactively. The open source project is available at github.com/josdejong/jsoneditor.
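A validation check similar in spirit to the one described above can also be run locally. The sketch below uses Python's standard json module as a rough stand-in. It is not how JSONLint works internally, and the configuration strings are made-up examples of the flat "single Array with a few Objects" shape.

```python
import json


def validate_json(text):
    """Return (True, None) for valid JSON, else (False, error description)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    return True, None


# Hypothetical flat configuration: one array holding a few objects.
flat_config = (
    '[{"name": "server", "value": "example.com"},'
    ' {"name": "port", "value": 8080}]'
)

# The same idea with a trailing comma, which JSON does not allow.
broken_config = '[{"name": "server", "value": "example.com",}]'

ok, error = validate_json(flat_config)
bad_ok, bad_error = validate_json(broken_config)
```

The error description includes the line and column where parsing failed, which is usually enough to locate a stray comma or missing quote in a larger file.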
Figure 1. D3 Radial Dendrogram Visualization of a JSON File
Thinking like a data scientist means taking control of your data. JSON and helpful JSON tools such as JSON Editor and D3 visualization, combined with complementary open source technologies such as the Cassandra NoSQL database and Kafka streaming, can produce powerful, highly scalable distributed solutions. These tools and technologies have the added benefit of being compatible with many machine learning and deep learning engines. Who knows, perhaps your next implementation could unexpectedly reach Big Data status.
Cassandra Apache Project: cassandra.apache.org
Kafka Apache Project: kafka.apache.org
Android Data Visualization App: github.com/Wickapps/Android-Data-Visualization
D3 Visualization Library: d3js.org
D3 Visualization Gallery: github.com/d3/d3/wiki/Gallery
JSON Editor: github.com/josdejong/jsoneditor
About the Author:
Mark Wickham is a frequent speaker at Android developer conferences. He has been teaching Android since 2013. His popular classes cover practical topics such as connectivity, push messaging, and audio/video. As a freelance Android developer, Mark has lived and worked in Beijing since 2000. Mark has led software development teams for Motorola, delivering infrastructure solutions to global telecommunications customers. While at Motorola, Mark also led product management and product marketing teams in the Asia Pacific region. Mark has been involved in software and technology for more than 30 years and began to focus on the Android platform in 2009, creating private cloud and tablet-based solutions for the enterprise. Mark majored in Computer Science and Physics at Creighton University, and later obtained an MBA from the University of Washington and the Hong Kong University of Science and Technology. Mark is also active as a freelance video producer and photographer, and enjoys recording live music.
Want more? Pick up Mark's book, Practical Android: 14 Complete Projects on Advanced Techniques and Approaches, now available on Apress.com.