Blog Archive for / 2008 / 01 /

Database Tip: Eliminate Duplicate Data

Friday, 25 January 2008

Storing duplicated data in your database is a bad idea for several reasons:

The duplicated data occupies more space — if you store two copies of the same data in your database, it takes twice as much space.
If duplicated data is updated, it must be changed in more than one place, which is more complex and may require more code than just changing it in one location.
Following on from the previous point — if data is duplicated, then it is easy to miss one of the duplicates when updating, leading to different copies having different information. This may lead to confusion, and errors further down the line.

Coincidental Duplication

It is worth noting that some duplication is coincidental — it is worth checking out whether a particular instance of duplication is coincidental or not before eliminating it. For example it is common for a billing address to be the same as a delivery address, and it may be that for all existing entries in the table it is the same, but they are still different concepts, and therefore need to be handled as such (though you may manage to eliminate the duplicate storage where they are the same).

Duplication Between Tables

One of the benefits of Using an artificial primary key is that you can avoid duplication of data between the master table and those tables which have foreign keys linked to that table. This reduces the problems described above where the duplication is in the foreign key, but is only the first step towards eliminating duplication within a given table.

If there is duplication of data between tables that is not due to foreign key constraints, and is not coincidental duplication, then it is possibly worth deleting one of the copies, or making both copies reference the same row in a new table.

Duplication Between Rows Within a Table

Typically duplication between rows occurs through the use of a composite primary key, along with auxiliary data. For example, a table of customer orders might include the full customer data along with each order entry:

CUSTOMER_ORDERS

CUSTOMER_NAME CUSTOMER_ADDRESS ORDER_NUMBER ITEM QUANTITY

Sprockets Ltd Booth Ind Est, Boston 200804052 Widget 23 450

Sprockets Ltd Booth Ind Est, Boston 200804052 Widget Connector 900

Foobar Inc Baz Street, London 200708162 Widget Screw size 5 220

Foobar Inc Baz Street, London 200708162 Widget 42 55

CUSTOMER_ORDERS
CUSTOMER_NAME	CUSTOMER_ADDRESS	ORDER_NUMBER	ITEM	QUANTITY
Sprockets Ltd	Booth Ind Est, Boston	200804052	Widget 23	450
Sprockets Ltd	Booth Ind Est, Boston	200804052	Widget Connector	900
Foobar Inc	Baz Street, London	200708162	Widget Screw size 5	220
Foobar Inc	Baz Street, London	200708162	Widget 42	55

In order to remove duplication between rows, the data needs to split into two tables: the duplicated data can be stored as a single row in one table, referenced by a foreign key from the other table. So, the above example could be split into two tables: a CUSTOMER_ORDERS table, and an ORDER_ITEMS table:

CUSTOMER_ORDERS

CUSTOMER_NAME CUSTOMER_ADDRESS ORDER_NUMBER

Sprockets Ltd Booth Ind Est, Boston 200804052

Foobar Inc Baz Street, London 200708162

ORDER_ITEMS

ORDER_NUMBER ITEM QUANTITY

200804052 Widget 23 450

200804052 Widget Connector 900

200708162 Widget Screw size 5 220

200708162 Widget 42 55

CUSTOMER_ORDERS
CUSTOMER_NAME	CUSTOMER_ADDRESS	ORDER_NUMBER
Sprockets Ltd	Booth Ind Est, Boston	200804052
Foobar Inc	Baz Street, London	200708162

ORDER_ITEMS
ORDER_NUMBER	ITEM	QUANTITY
200804052	Widget 23	450
200804052	Widget Connector	900
200708162	Widget Screw size 5	220
200708162	Widget 42	55

The ORDER_NUMBER column would be the primary key of the CUSTOMER_ORDERS table, and a foreign key in the ORDER_ITEMS table. This isn't the only duplication in the original table, though — what if one customer places multiple orders? In this case, not only are the customer details duplicated for every item on an order, they are duplicated for every item on every order by that customer. This duplication is still present in the new schema, but in this case it is a business decision whether to keep it — if a customer changes address, do you update the old orders with the new address, or do you leave those entries alone, since that was the address that order was delivered to? If the delivered-to address is important, then this is coincidental duplication as described above, if not, then it too can be eliminated by splitting the CUSTOMER_ORDERS table into two.

The Downsides of Eliminating Duplication

The benefits of eliminating duplication might seem obvious, but there are potential downsides too. For example:

If the application is already released, you need to provide upgrade code to change existing databases over to the new schema without losing any data.
If you split tables in order to reduce duplication, your SQL can get more complicated, as you need more table joins.

Conclusion

As with everything in software development it's a trade-off. However, as the database gets larger, and more data gets stored, the costs of storing duplicate data increase, as do the costs of changing the schema of an existing database. For this reason, I believe that it is worth designing your schema to eliminate duplication as soon as possible — preferably before there's any data in it!

Posted by Anthony Williams
[/ database /] permanent link
Tags: database, duplication
Stumble It! | Submit to Reddit | Submit to DZone

Comment on this post

If you liked this post, why not subscribe to the RSS feed or Follow me on Twitter? You can also subscribe to this blog by email using the form on the left.

The Most Popular Articles of 2007

Monday, 14 January 2008

Now we're getting into 2008, here's a list of the 10 most popular articles on the Just Software Solutions website for 2007:

Implementing drop-down menus in pure CSS (no JavaScript)
How to implement drop-down menus in CSS in a cross-browser fashion (with a teensy bit of JavaScript for IE).
Elegance in Software and Elegance in Software part 2
What makes software elegant?
Reduce Bandwidth Usage by Supporting If-Modified-Since in PHP
Optimize your website by allowing browsers to cache pages that haven't changed
Introduction to C++ Templates (PDF)
How to use and write C++ templates.
Using CSS to Replace Text with Images
How to use CSS to display titles and logos as images whilst allowing search engines and users with text-only browsers to see the text.
Testing on Multiple Platforms with VMWare
The benefits of using VMWare for testing your code or website on multiple platforms
10 Years of Programming with POSIX Threads
A review of "Programming with POSIX Threads" by David Butenhof, 10 years after publication.
Review of Test Driven Development — A Practical Guide, by Dave Astels
This book will help you to learn TDD.
Implementing Synchronization Primitives for Boost on Windows Platforms
The technical details behind the current implementation of boost::mutex on Windows.
Building on a Legacy
How to handle legacy code.

Posted by Anthony Williams
[/ news /] permanent link
Tags: popular, articles
Stumble It! | Submit to Reddit | Submit to DZone

Comment on this post

If you liked this post, why not subscribe to the RSS feed or Follow me on Twitter? You can also subscribe to this blog by email using the form on the left.

The Future of Concurrency in C++: ACCU 2008

Monday, 07 January 2008

I am pleased to start 2008 with some good news: I will be speaking on "The Future of Concurrency in C++" at ACCU 2008.

Here's the synopsis:

With the next version of the C++ Standard (C++0x), concurrency support is being added to C++. This means a new memory model with support for multiple threads of execution and atomic operations, and a new set of library classes and functions for managing threads and synchronizing data. There are also further library enhancements planned for the next technical report (TR2). This talk will provide an overview of the new facilities, including an introduction to the new memory model, and an in-depth look at how to use the new library. Looking forward to TR2, this talk will cover the proposed library extensions, and how facilities like futures will affect the programming model.

I hope to see you there!

Posted by Anthony Williams
[/ news /] permanent link
Tags: news, concurrency, threading, accu
Stumble It! | Submit to Reddit | Submit to DZone

Comment on this post

If you liked this post, why not subscribe to the RSS feed or Follow me on Twitter? You can also subscribe to this blog by email using the form on the left.

Previous Entries Later Entries

Just Software Solutions

About Us

Technical Writings

Subscribe to Blog

Blog Archives

Blog Archive for / 2008 / 01 /

Database Tip: Eliminate Duplicate Data

Friday, 25 January 2008

Coincidental Duplication

Duplication Between Tables

Duplication Between Rows Within a Table

The Downsides of Eliminating Duplication

Conclusion

The Most Popular Articles of 2007

Monday, 14 January 2008

The Future of Concurrency in C++: ACCU 2008

Monday, 07 January 2008