28/03/2017

Data by Set or by Row?

By Stefan Ardeleanu

Programming against a database benefits from a different style than is applied to non-database programming. Database programming relies on the relational model, and there is a distinction to be made between atomic, or row-by-row programming, and holistic, set-based programming.

Generally, we can admit that a Java or C# programmer has a style different to the style of a SQL programmer. The problem arises when a Java programmer, for example, as the most representative kind of programmer today, needs to read or write to or from the database. Programmers often take an atomic or row-by-row approach to database access. The atomic approach is not wrong and it will generate the desired results. However, the effects of row-by-row processing are dramatic and the consequences are terrible for the database.

The first casualty is often performance, especially as data sets become larger. Holistic, set-based approaches perform better often from the beginning, and they are better able to be optimized automatically by the database engine.

Imagine that I am grocery shopping. I have a basket full of chocolates. I go to the cashier to pay for them. There are ten chocolates in the basket. Instead of paying all at once, I pay for one chocolate at a time. Obviously, a huge queue forms behind me. People in the shop get upset. I am taking too long.

The cashier hands me a bag. I look at her, very surprised. “What do you want me to do with a bag?” I ask her. The cashier stares at me. “Don’t you know what these bags are for? They allow you to add many goods at one time and transport them easily. If you hand me your goods one by one, you don’t just act illogically and against common sense, you waste your time and that of others. Here is a tool, the bag. Learn to use the tool the right way.”

My story carries two points. The first goes to performance. My performance was very poor and the main consequence was high wait time in the queue that I caused by paying for my goods one-by-one. The second point goes to maintainability: It is easier to maintain one bag than ten.

In the database, using the one-by-one approach may not cause too much harm during the testing phase of a new system, when the quantity of data is generally low and the variety of data is generally reduced. In production, however, row-by-row processing has a huge impact. Very often, this inappropriate style may cause issues. Whenever I have been involved in solving performance problems, I almost always have had to rewrite large pieces of code and replace someone’s atomic approach with a holistic approach that used the relational database engine as it was designed to be used.

Another advantage of the holistic approach is the ease of debugging. I am referring now to specific systems where the goal is to move data between various systems, such as a data migration system, a replication system, or an ETL system. If such a system is built entirely in SQL, as I recommend, apart from all the advantages described so far, the debugging of the system will be easier as well.

In these specific systems, you don’t have the classic debug functionalities with variables and watch points. There is a much simpler debug method. Whenever you have an error of a certain type, like a constraint violation or a conversion error, there is an error handling procedure that shows you the place where the error occurred. In most cases, the reason for the error will be a data set. Take that data set, analyze it, and figure out which rows in the data set caused the error.

Let’s go back to my shopping example. The goal is to move some data, my pieces of candy, from one source to a destination. If I think holistically, I can put my pieces of chocolate into one bag, and hand the bag to the cashier in one go. Furthermore, the cashier can optimize the checkout process by ringing up the total price for the bag rather than entering the price of each individual piece.

Row-by-row database programming is me handing my chocolates to the cashier one-by-one, and leaving the cashier no option but to ring them up one-by-one. Experience database programmers will forego such inefficient processing and build their skill set toward being able to work in sets, using the database engine as it is designed to be used, and enabling the engine to automatically optimize their queries.

About the Author

Stefan Ardeleanu was born in Bucharest, Romania in 1967. He graduated Math and Philosophy, and he was a math teacher for 10 years. Then he began a career in software development. He felt attracted by databases from the beginning, so his entire career in software industry is related to databases, and especially to developing applications that run on them. Stefan has many years’ experience in systems such as Oracle, SQL Server, DB2, and PostgreSQL. He has experience in OLTP and data warehousing, and with replication systems. Stefan is a passionate SQL guy, a database trainer, and he has delivered courses on Oracle products as an Oracle partner, especially database development courses and business intelligence courses. Stefan has published works in both Romanian and English.

This article is excerpted from Relational Database Programming by Stefan Ardeleanu, ISBN 978-1-4842-2079-5.