Blog Archive for / database /
Database Tip: Eliminate Duplicate Data
Friday, 25 January 2008
Storing duplicated data in your database is a bad idea for several reasons:
- The duplicated data occupies more space — if you store two copies of the same data in your database, it takes twice as much space.
- If duplicated data is updated, it must be changed in more than one place, which is more complex and may require more code than just changing it in one location.
- Following on from the previous point — if data is duplicated, then it is easy to miss one of the duplicates when updating, leading to different copies having different information. This may lead to confusion, and errors further down the line.
Coincidental Duplication
It is worth noting that some duplication is coincidental, so check whether a particular instance of duplication is coincidental before eliminating it. For example, it is common for a billing address to be the same as a delivery address, and it may even be the same for every existing entry in the table, but they are still different concepts, and therefore need to be handled as such (though you may manage to eliminate the duplicate storage where they happen to be the same).
Duplication Between Tables
One of the benefits of using an artificial primary key is that you can avoid duplicating data between the master table and the tables that have foreign keys referencing it. This reduces the problems described above where the duplication is in the foreign key, but it is only the first step — duplication within a given table still needs to be addressed.
If there is duplication of data between tables that is not due to foreign key constraints, and is not coincidental duplication, then it is possibly worth deleting one of the copies, or making both copies reference the same row in a new table.
Duplication Between Rows Within a Table
Typically duplication between rows occurs through the use of a composite primary key, along with auxiliary data. For example, a table of customer orders might include the full customer data along with each order entry:
CUSTOMER_ORDERS

| CUSTOMER_NAME | CUSTOMER_ADDRESS | ORDER_NUMBER | ITEM | QUANTITY |
|---|---|---|---|---|
| Sprockets Ltd | Booth Ind Est, Boston | 200804052 | Widget 23 | 450 |
| Sprockets Ltd | Booth Ind Est, Boston | 200804052 | Widget Connector | 900 |
| Foobar Inc | Baz Street, London | 200708162 | Widget Screw size 5 | 220 |
| Foobar Inc | Baz Street, London | 200708162 | Widget 42 | 55 |
In order to remove duplication between rows, the data needs to be split into two tables: the duplicated data can be stored as a single row in one table, referenced by a foreign key from the other table. So, the above example could be split into two tables: a CUSTOMER_ORDERS table, and an ORDER_ITEMS table:
CUSTOMER_ORDERS

| CUSTOMER_NAME | CUSTOMER_ADDRESS | ORDER_NUMBER |
|---|---|---|
| Sprockets Ltd | Booth Ind Est, Boston | 200804052 |
| Foobar Inc | Baz Street, London | 200708162 |
ORDER_ITEMS

| ORDER_NUMBER | ITEM | QUANTITY |
|---|---|---|
| 200804052 | Widget 23 | 450 |
| 200804052 | Widget Connector | 900 |
| 200708162 | Widget Screw size 5 | 220 |
| 200708162 | Widget 42 | 55 |
The ORDER_NUMBER column would be the primary key of the CUSTOMER_ORDERS table, and a foreign key in the ORDER_ITEMS table. This isn't the only duplication in the original table, though — what if one customer places multiple orders? In this case, not only are the customer details duplicated for every item on an order, they are duplicated for every item on every order by that customer. This duplication is still present in the new schema, but in this case it is a business decision whether to keep it — if a customer changes address, do you update the old orders with the new address, or do you leave those entries alone, since that was the address that order was delivered to? If the delivered-to address is important, then this is coincidental duplication as described above; if not, then it too can be eliminated by splitting the CUSTOMER_ORDERS table into two.
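As a sketch of where this ends up (the CUSTOMERS table and its CUSTOMER_ID column are illustrative additions, not part of the original example), the fully split schema might look something like this:

CREATE TABLE CUSTOMERS (
    CUSTOMER_ID INTEGER PRIMARY KEY, -- artificial key; see the primary key tip below
    CUSTOMER_NAME VARCHAR(100) NOT NULL,
    CUSTOMER_ADDRESS VARCHAR(200) NOT NULL
);

CREATE TABLE CUSTOMER_ORDERS (
    ORDER_NUMBER INTEGER PRIMARY KEY,
    CUSTOMER_ID INTEGER NOT NULL REFERENCES CUSTOMERS(CUSTOMER_ID) -- customer details stored once
);

CREATE TABLE ORDER_ITEMS (
    ORDER_NUMBER INTEGER NOT NULL REFERENCES CUSTOMER_ORDERS(ORDER_NUMBER),
    ITEM VARCHAR(100) NOT NULL,
    QUANTITY INTEGER NOT NULL
);

Each piece of customer and order data is now stored exactly once, at the cost of an extra join when the full order details are needed.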
The Downsides of Eliminating Duplication
The benefits of eliminating duplication might seem obvious, but there are potential downsides too. For example:
- If the application is already released, you need to provide upgrade code to change existing databases over to the new schema without losing any data.
- If you split tables in order to reduce duplication, your SQL can get more complicated, as you need more table joins.
Conclusion
As with everything in software development it's a trade-off. However, as the database gets larger, and more data gets stored, the costs of storing duplicate data increase, as do the costs of changing the schema of an existing database. For this reason, I believe that it is worth designing your schema to eliminate duplication as soon as possible — preferably before there's any data in it!
Posted by Anthony Williams
[/ database /] permanent link
Tags: database, duplication
Database Tip: Use an Artificial Primary Key
Monday, 19 November 2007
If your data has a clear "master" field or combination of fields which can uniquely identify each row (such as the customer name in a table of customers, or the ISBN in a table of books), it is tempting to use that as the primary key of the table. However, my advice is: don't do that; use a separate, numeric, artificial primary key instead. Yes, it's an extra column on the table, requiring extra space, and you will have to generate it somehow, but that's not a big deal. Every database vendor provides some way of auto-generating unique key values (e.g. `SEQUENCE`s in Oracle, and `AUTOINCREMENT` fields in SQLite), so populating it is easy, and the complications it saves are more than worth the trade-off. You can still maintain the uniqueness of the master columns by applying a unique index to those columns.
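As a minimal sketch of the idea (the column types and the dependent orders table are illustrative assumptions, using SQLite-style syntax), the master table gets an artificial key plus a unique constraint on the natural columns, and dependent tables reference only the artificial key:

CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY AUTOINCREMENT, -- artificial, auto-generated key
    customer_first_name VARCHAR(50) NOT NULL,
    customer_last_name VARCHAR(50) NOT NULL,
    customer_address VARCHAR(200) NOT NULL,
    UNIQUE (customer_first_name, customer_last_name, customer_address) -- uniqueness of the "master columns"
);

CREATE TABLE orders (
    order_number INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id) -- small numeric foreign key
);

If the natural key later needs an extra column (the father-and-son case below), only the UNIQUE constraint changes; the foreign keys are untouched.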
Save Space on Foreign Keys
Firstly, if you have any tables that are associated with the master table, and therefore have foreign key columns that refer to the primary key of your master, then having a separate primary key can actually save space overall, as the data for the master columns doesn't have to be duplicated across all the linked tables. This is especially important if there is more than one "master column", such as `customer_first_name` and `customer_last_name`, or if the data for these columns is large.
Changing the master data
Secondly, if the "master columns" are actually the primary key of your table, changing the data in them is potentially problematic, especially if they are used as a foreign key in other tables. Many online services use a customer's email address as their "master column": each customer has one email address, and one email address refers to one customer. That's fine until a customer changes their email address. Obviously, you don't want to lose all data associated with a given customer just because they changed their email address, so you need to update the row rather than delete the old one and insert a new one. If the `email_address` column is the primary key of the table, and therefore used as the foreign key in other tables, then you've got to update the data not just in the master table, but in each dependent table too.

This is not impossible, but it's certainly more complex and time consuming. If you miss a table, the transaction may not complete due to foreign key constraint violations, or (worse) the transaction may complete, but some of the data may be orphaned. Also, in some database engines, the constraint violation will fire when you change either the master table or the dependent table, so you need to execute a special SQL statement to defer the constraint checking until `COMMIT` time. If you use an auto-generated primary key, then only the data in the master table needs changing.
Changing the master columns
Finally, if the primary key is auto-generated, then not only is it easy to change the data in the master columns, but you can actually change what the master columns are. For example, if you initially decide that `customer_first_name` and `customer_last_name` make an ideal primary key, then you're stuck if you then get another customer with the same name. OK, so you make `customer_first_name`, `customer_last_name` and `customer_address` the primary key. Oops — now you've got to duplicate the address information across all the dependent tables. Now you encounter two people with the same name at the same address (e.g. father and son), so you need to add a new designator to the key (e.g. Henry Jones Junior, Richard Wilkins III). Again, you need to update all the dependent tables. If the primary key is auto-generated, there's no problem — just update the unique constraint on the master table to include the appropriate columns, and all is well, with the minimum of fuss.
Simplify your code
It's not going to simplify your code much, but using an auto-generated numeric key means that this is all you need to store as an identifier inside your program to refer to a particular row: much easier than storing the data from a combination of columns. Also, it's much easier to write code to update the data on one table than on multiple tables.
Conclusion
Don't use real table data as the primary key for a table: instead, use a separate, numeric, auto-generated column as the primary key. This will simplify the connections between tables, and make your life easier if the structure of the database or the data in the key columns changes.
Related Posts
In previous posts on Database Design, I've talked about:
- Using parameterized queries,
- Creating appropriate indexes,
- Using transactions, and
- Delaying the use of a database
Posted by Anthony Williams
[/ database /] permanent link
Tags: database, primary key, foreign key
Database Tip: Use Parameterized Queries
Monday, 03 September 2007
This post is the third in a series of Database Tips.
When running an application with a database backend, a high percentage of SQL statements are likely to have variable data. This might be data obtained from a previous query, or it might be data entered by the user. In either case, you've got to somehow combine this variable data with the fixed SQL string.
String Concatenation
One possibility is just to incorporate the data into the SQL statement directly, using string concatenation, but this has two potential problems. Firstly, it means that the actual SQL statement parsed by the database is different every time. Many databases can skip parsing for repeated uses of the same SQL statement, so by using a different statement every time there is a performance hit. Secondly, and more importantly, it places the responsibility on you to ensure that the variable data will behave correctly as part of the statement. This is particularly important for web-based applications, as a common attack used by crackers is a "SQL injection" attack — by taking advantage of poor quoting by the application when generating SQL statements, it is possible to input data which will end the current SQL statement, and start a new one of the cracker's choosing. For example, if string data is just quoted in the SQL using plain quotes (`'data'`), then data that contains a quote and a semicolon will end the statement. This means that if the data is `'; update login_table set password='abc';` then the initial `';` will end the statement from the application, and the database will then run the next one, potentially setting everyone's password to "abc".
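To make the attack concrete, suppose the application builds a lookup by concatenating the user's input between plain quotes — say `SELECT name FROM customers WHERE name='<input>'` (a hypothetical query, reusing the customers table from the examples below). With the malicious input above, the text sent to the database becomes:

SELECT name FROM customers WHERE name=''; update login_table set password='abc';'

The first statement now matches nothing useful, and the injected update is handed to the database as a second statement; whether it actually executes depends on the driver, but the application has clearly lost control of the SQL being run.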
Parameterized Queries
A solution to both these problems can be found in the form of Parameterized Queries. In a parameterized query, the variable data in the SQL statement is replaced with a placeholder such as a question mark, which indicates to the database engine that this is a parameter. The database API can then be used to set the parameters before executing the query. This neatly solves the first problem with string concatenation — the query seen by the database engine is the same every time, so giving the database the opportunity to avoid parsing the statement every time. Most parameterized query APIs will also allow you to reuse the same query with multiple sets of parameters, thus explicitly caching the parsed query.
Parameterized queries also solve the SQL injection problem — most APIs can send the data directly to the database engine, marked as a parameter rather than quoting it. Even when the data is quoted within the API, this is then the database driver's responsibility, and is thus more likely to be reliable. In either case, the application developer is relieved of the responsibility for correctly quoting the data, thus avoiding SQL injection attacks.
A third benefit of parameterized queries is that data doesn't have to be converted to a string representation. This means that, for example, floating point numbers can be correctly transferred to the database without first converting to a potentially inaccurate string representation. It also means that the statement might run slightly faster, as the string representation of data often requires more storage than the data itself.
The Parameterized Query Coding Model
Whereas running a simple SQL statement consists of just two parts — execute the statement and, optionally, retrieve the results — using parameterized queries often requires five:
1. Parse the statement (often called preparing the statement).
2. Bind the parameter values to the parameters.
3. Execute the statement.
4. Optionally, retrieve the results.
5. Close or finalize the statement.
The details of each step depend on the particular database API, but most APIs follow the same outline. In particular, as mentioned above, most APIs allow you to run steps 2 to 4 several times before running step 5.
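Usually these steps are driven through the client API of your host language, but MySQL, for example, also exposes the same model at the SQL level, which makes the individual steps easy to see (a minimal sketch, reusing the customers query from below):

PREPARE stmt FROM 'SELECT name FROM customers WHERE id = ?'; -- step 1: parse/prepare
SET @id = 42;                                                -- step 2: bind the parameter value
EXECUTE stmt USING @id;                                      -- steps 3 and 4: execute and retrieve
DEALLOCATE PREPARE stmt;                                     -- step 5: close/finalize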
Placeholders
A parameterized query includes placeholders for the actual data to be passed in. In the simplest form, these placeholders can often be just a question mark (e.g. `SELECT name FROM customers WHERE id=?`), but most APIs also allow for named placeholders by prefixing an identifier with a marker such as a colon or an at-sign (e.g. `INSERT INTO books (title,author) VALUES (@title,@author)`). The use of named placeholders can be beneficial when the same data is needed in multiple parts of the query — rather than binding the data twice, you just use the same placeholder name. Named placeholders are also easier to get right in the face of SQL statements with large numbers of parameters, or if the SQL statement is changed — it is much easier to ensure that the correct data is associated with a particular named parameter than to ensure that it is associated with the correctly-numbered parameter, as it is easy to lose count of parameters, or change their order when changing the SQL statement.
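As a sketch of the variety you may meet (these particular forms are the ones SQLite accepts; other engines support different subsets, so check your own database's documentation):

SELECT name FROM customers WHERE id = ?;   -- anonymous positional placeholder
SELECT name FROM customers WHERE id = ?1;  -- explicitly numbered placeholder
SELECT name FROM customers WHERE id = :id; -- named placeholder, colon prefix
SELECT name FROM customers WHERE id = @id; -- named placeholder, at-sign prefix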
Recommendations
Look up the API for parameterized queries for your database. In SQLite, it's the APIs surrounding `sqlite3_stmt`; for MySQL, it's the Prepared Statements API; and for Oracle, the OCI parameterized statements API does the trick.
If your database API supports it, use named parameters, or at least explicit numbering (e.g. `?1,?2,?3` rather than just `?,?,?`) to help avoid errors.
Posted by Anthony Williams
[/ database /] permanent link
Database Tip: Use Transactions
Monday, 27 August 2007
Here's another Database Tip, to follow on from my previous one on creating appropriate indexes. This time the focus is transactions.
For any experienced database developer, using transactions might seem an obvious suggestion, but lightweight databases may require configuration in order to use transactions. For example, MySQL tables use the `MyISAM` engine by default, which doesn't support transactions — in order to use transactions you need to set the storage engine of your tables to `InnoDB` or `BDB`. Also, whereas in Oracle every statement occurs within a transaction, and you need an explicit `COMMIT` or `ROLLBACK` to end the transaction, in databases such as MySQL and SQLite every statement is its own transaction (and thus committed to the database immediately) unless you explicitly begin a transaction with a `BEGIN` statement, or other database configuration command.
Benefits of Transactions
The primary benefit of using transactions is data integrity. Many database uses require storing data in multiple tables, or in multiple rows of the same table, in order to maintain a consistent data set. Using transactions ensures that other connections to the same database see either all the updates or none of them. This also applies in the case of interrupted connections — if the power goes off in the middle of a transaction, the database engine will roll back the transaction so it is as if it had never been started. If each statement is committed independently, then other connections may see partial updates, and there is no opportunity for automatic rollback on error.
A secondary benefit of using transactions is speed. There is often an overhead associated with actually committing the data to the database. If you've got 1000 rows to insert, committing after every row can cause quite a performance hit compared to committing once after all the inserts. Of course, this can work the other way too — if you do too much work in a transaction then the database engine can consume lots of space storing the not-yet-committed data or caching data for use by other database connections in order to maintain consistency, which causes a performance hit. As with every optimisation, if you're changing the boundaries of your transactions to gain performance, then it is important to measure both before and after the change.
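None of the examples below show an explicit rollback, so here is a minimal sketch (the table and column names are made up for illustration) of abandoning a half-finished update rather than leaving inconsistent data visible to other connections:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1; -- first half of a transfer
-- the matching credit fails (constraint violation, application error, ...)
ROLLBACK; -- undo the debit too; other connections never see the partial update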
Using Transactions in Oracle
Oracle databases are always in transaction mode, so all that's needed is to decide where to put the `COMMIT` or `ROLLBACK`. When one transaction is finished, another is automatically started. There are some additional options that can be specified for advanced usage — see the Oracle documentation for these.
INSERT INTO foo (a,b) VALUES (1,'hello');
INSERT INTO foo (a,b) VALUES (2,'goodbye');
COMMIT;
INSERT INTO foo (a,b) VALUES (3,'banana');
COMMIT;
Using Transactions in SQLite
In SQLite, if you wish a transaction to cover more than one statement, then you must use a `BEGIN` or `BEGIN TRANSACTION` statement. The transaction ends when you execute a `COMMIT` or `ROLLBACK` statement, and the database reverts to auto-commit mode where each statement has its own transaction.
BEGIN;
INSERT INTO foo (a,b) VALUES (1,'hello');
INSERT INTO foo (a,b) VALUES (2,'goodbye');
COMMIT;
INSERT INTO foo (a,b) VALUES (3,'banana');
-- implicit COMMIT at end of statement
-- with no preceding BEGIN
Using Transactions in MySQL
As mentioned above, by default all tables use the `MyISAM` storage engine, so transaction support is disabled. By changing a table to use the `InnoDB` or `BDB` storage engine, transaction support can be enabled. For new tables, this is done using the `ENGINE` or `TYPE` parameters on the `CREATE TABLE` statement:
CREATE TABLE foo (
    bar INTEGER PRIMARY KEY,
    baz VARCHAR(80) NOT NULL
) ENGINE = InnoDB;
Existing tables can be changed to use the `InnoDB` storage engine using `ALTER TABLE`:
ALTER TABLE foo ENGINE = InnoDB;
You can also change the default storage engine using the `default-storage-engine` server option on the server command line, or in the server configuration file.
Once all the tables you're using have storage engines that support transactions, you have two choices. For a given connection, you can set the `AUTOCOMMIT` session variable to `0`, in which case every statement within that connection is part of a transaction, as for Oracle, or you can leave `AUTOCOMMIT` set to `1` and start transactions explicitly, as for SQLite. In auto-commit mode for MySQL, transactions are started with `BEGIN`, `BEGIN WORK` or `START TRANSACTION`. To disable `AUTOCOMMIT` for the current connection, use the `SET` statement:
SET AUTOCOMMIT=0;
You can configure the database to run this statement immediately upon opening a connection using the `init_connect` server variable. This can be set in the configuration file, or using the following command:
SET GLOBAL init_connect='SET AUTOCOMMIT=0';
MySQL also supports additional transaction options — check the documentation for details.
Automatic `ROLLBACK` and `COMMIT`
One thing to watch out for is code that causes an automatic `ROLLBACK` or `COMMIT`. Most databases cause an automatic `ROLLBACK` of any open transaction when a connection is closed, so it is important to make sure that all changes are committed before closing the connection.

Also worth watching out for are commands that cause an automatic `COMMIT`. The list varies depending on the database, but generally DDL statements such as `CREATE TABLE` will cause an automatic commit. It is probably best to avoid interleaving DML statements with any other type of statement in order to avoid surprises. Check your database documentation for details.
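As a sketch of the kind of surprise this can cause (MySQL-style syntax; the table names are illustrative):

BEGIN;
INSERT INTO foo (a,b) VALUES (1,'hello');
CREATE TABLE audit_log (id INTEGER PRIMARY KEY); -- DDL: causes an implicit COMMIT of the INSERT above
ROLLBACK; -- too late: there is nothing left to roll back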
Posted by Anthony Williams
[/ database /] permanent link
Database Tip: Create Appropriate Indexes
Monday, 20 August 2007
One of the simplest things you can do to speed up database access is create appropriate indexes.
There are several aspects to this. Firstly, you need to identify which columns are used for queries on a given table; in particular, which columns appear in the `WHERE` clause of time consuming queries. If a query is only done once, or the table only has five rows in it so queries are always quick, then there is no benefit to adding indexes. It's not just straight-forward `SELECT` statements that need checking — `UPDATE` and `DELETE` statements can have `WHERE` clauses too.
Having identified which columns are used in queries, it is important to also note which combinations are used. A database engine will tend to only use one index per table (though some can use more, depending on the query), so if your time-consuming queries use multiple columns from the same table in the `WHERE` clause, then it's probably better to have an index that covers all of them rather than separate indexes on each.
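For example (the table and column names are hypothetical), if the slow queries filter orders by both customer and date, one composite index will usually serve them better than two single-column indexes:

-- one index covering both columns used together in the WHERE clause
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);

-- a query this index can serve
SELECT order_number, total FROM orders WHERE customer_id = 42 AND order_date >= '2007-01-01';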
The cost of indexes
Adding an index to a table does have a downside. Any modification to a table with an index (such as inserts and deletes) will be slower, as the database will have to update the index, and the index will occupy space in the database. That can be a high price to pay for faster queries if the table is modified frequently. In particular, an index that is never used, or that covers the same columns as another index, is just dead weight. It is important to remember that `PRIMARY KEY` and `UNIQUE` columns automatically have indexes that cover them.
Timing is everything
As with every optimization, it is important to profile both before and after any change, and this includes checking the performance of the rest of the application too. Don't be afraid to remove an index if it isn't helping, and bear in mind that it's also possible to improve performance by rewriting queries, particularly where there are joins or subqueries involved.
Posted by Anthony Williams
[/ database /] permanent link
Delay Using a Database
Wednesday, 08 August 2007
A client of ours makes hardware that generates data files, and a few years ago I wrote a piece of software for them to help manage those data files. Initially it only had to deal with a few data files, so it indexed them on start up. Then someone tried to use it with several thousand data files, and start-up times got too slow, so I modified the indexing code to dump the current index to an XML file on shutdown, which it then loaded at startup. This has worked well, but now they're using it to handle hundreds of thousands of files, and the startup and shutdown times are again becoming significant due to the huge size of the XML file. Also, the data file access times are now getting larger due to the size of the in-memory index. We've now been hired again to address the issue, so this time I'm going to use a SQLite database for the index — no start up delay, no shutdown delay, and faster index lookup.
What lessons can be learned from this experience? Should I have gone for SQLite in the first instance? I don't think so. Using a simple in-memory map for the initial index was the simplest thing that could possibly work, and it has worked for a few years. The XML index file was a small change, and it kept the application working for longer. Now the application does need a database, but the implementation is certainly more complex than the initial in-memory map. By using the simple implementation first, the application was completed more quickly — not only did this save my client money on the development, but it meant they could begin using it sooner. It also meant that, now I come to add the database code, the requirements are better known and there is already a whole suite of tests for how the index should behave. It has taken less than a day to write the database indexing code, whereas it could easily have taken several days at the beginning.
I think people are often too keen to jump straight to using a database, when they could often get by for now with something far simpler. That doesn't mean that requirements won't evolve, and that a database won't be required in the end, but this time can often be put off for years, thus saving time and money. In this instance I happened to use SQLite, which is free, but many people jump straight to Oracle, or SQL Server, which have expensive licenses and are often overkill for the problem at hand. Just think how much money you could save by putting off the purchase of that SQL Server license for a year or two.
Don't be scared into buying a full-featured enterprise level RDBMS at the beginning of your project; simple in-memory maps or data files will often suffice for a long time, and when they're no longer sufficient you'll know more about what you do need from your RDBMS. Maybe SQLite will do, or maybe it won't — in any case you've saved money.
Posted by Anthony Williams
[/ database /] permanent link
Design and Content Copyright © 2005-2024 Just Software Solutions Ltd. All rights reserved.