Data is everywhere and can be sourced in numerous ways. In its raw form, data can be tedious to work with because it is often unstructured, inconsistent and unreliable. The process of arranging this raw data into a more structured and consistent format that enables efficient querying, storage optimization and data integrity is called database normalization.
Terms To Know
Primary Key: A column, or set of columns, that uniquely identifies each row in a table.
Composite Key: A combination of multiple columns used together to uniquely identify each row in a table.
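As a quick illustration, the SQL sketch below shows how each kind of key might be declared. The table and column names here are hypothetical and used only for this example.

```sql
-- Hypothetical tables used only to illustrate the two kinds of keys.

-- A single-column primary key: customer_id uniquely identifies each row.
CREATE TABLE customers (
    customer_id   INTEGER PRIMARY KEY,
    customer_name VARCHAR(100)
);

-- A composite key: no single column is unique on its own, but the
-- combination of order_id and product_id identifies each row.
CREATE TABLE order_items (
    order_id   INTEGER,
    product_id INTEGER,
    quantity   INTEGER,
    PRIMARY KEY (order_id, product_id)
);
```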
What is Database Normalization?
Database normalization is a design technique used to optimize databases by organizing data into a structured and consistent format. The process involves transforming data from lower normal forms (least structured and consistent) to higher normal forms (more structured and consistent).
Objectives of Database Normalization
Normalizing a database improves the general state of its data by:
Minimizing Redundancy: Database redundancy occurs when the same piece of data exists in multiple places within a database. In the table below, details like the customer's address and email are repeated several times. Duplicate data wastes storage space and can lead to inconsistencies.
product_name | customer_name | customer_email | customer_address | transaction_date |
Jigsaw | John Doe | johnnyd@mail | 10 Downing Str | 10/12/2024 |
Hammer | John Doe | johnnyd@mail | 10 Downing Str | 10/12/2024 |
Lotion | John Doe | johnnyd@mail | 10 Downing Str | 10/12/2024 |
Biscuit | Mary Graham | maryg@mail | Radisson Close | 8/10/2024 |
Lip Gloss | Nneka Obi | nenyo@mail | Upper Iweka | 6/10/2024 |
Improving Query Efficiency: Organized, structured data helps databases process queries more efficiently, leading to faster retrieval times and reducing the need for complex filtering or aggregation logic.
Ensuring Data Integrity: Normalization improves data integrity by ensuring that data stays consistent across all tables and fields.
Preventing Anomalies: Database redundancy can cause anomalies such as the following (a short SQL sketch after this list illustrates each one):
Update Anomaly: This occurs when updating a single piece of data requires changes in every place that data is stored.
In the table above, an update anomaly would arise if John Doe changed their address and we had to update every cell where that address appears. If, while updating the records, we mistakenly skip a cell, we introduce an inconsistency into the database.
Insertion Anomaly: This occurs when a new record cannot be inserted into a table without also inserting information about a different entity.
Example:
Consider a table that stores information about students and their courses.
In the table below, if we create a new course, Mathematics, we cannot add it directly to the table without first associating it with a student.
Student Id | Student Name | Course ID | Course Name |
1 | Ike Umeh | 103 | Chemistry |
2 | Sandra Drea | 104 | Physics |
3 | Nancy Owens | 105 | Biology |
Deletion Anomaly: This occurs when deleting a piece of data unintentionally removes information about a separate entity. For example, in the table above, if we remove Physics from our list of courses, the record for the student “Sandra Drea” is also deleted.
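To make these anomalies concrete, here is a minimal SQL sketch, assuming a single combined table for each example. The table names (sales, student_courses) and the new address are illustrative, not part of a real schema.

```sql
-- Update anomaly: the customer's address is repeated on every sales row,
-- so one change must be applied everywhere it appears; missing a row
-- leaves the data inconsistent.
UPDATE sales
SET customer_address = '12 New Street'      -- hypothetical new address
WHERE customer_name = 'John Doe';

-- Insertion anomaly: in a combined students/courses table we cannot
-- record the new Mathematics course without also naming a student,
-- because the student columns would otherwise be left empty.
-- INSERT INTO student_courses (course_id, course_name)
-- VALUES (106, 'Mathematics');              -- no student to attach it to

-- Deletion anomaly: removing the only row that mentions Physics also
-- deletes the last record of the student enrolled in it.
DELETE FROM student_courses
WHERE course_name = 'Physics';
```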
First Normal Form (1NF):
This is the lowest level of normalization and is built on the following principles:
Each row in the table must have a unique identifier, i.e. a primary key.
Each value in the table must be indivisible (atomic).
Example:
The table below records the purchase of items at a grocery store. It clearly does not adhere to first normal form: the items purchased by John are stored as a single list, which violates the rule that every value must be atomic.
Secondly, the rows of the table are not uniquely identifiable because it has no primary key.
Customer Name | Item Purchased | Contact Address | Customer email | Transaction Date |
John | Jigsaw, hammer, lotion | 10 Downing Str | johnnyd@mail | 10/12/2024 |
Mary | Biscuit | Radisson Close | maryg@mail | 8/12/2024 |
Nneka | Lip Gloss | Upper Iweka | nnenyg@mail | 6/10/2024 |
To organize the table above in 1NF, we first introduce a unique identifier in the form of a Customer ID, then split the Item Purchased column so that each individual item is saved in a cell of its own, as shown below:
Table In First Normal Form:
Customer id | Customer name | Item purchased | Contact address | Customer email | Transaction date |
1 | John | Jigsaw | 10 Downing Str | johnnyd@mail | 10/12/2024 |
1 | John | Hammer | 10 Downing Str | johnnyd@mail | 10/12/2024 |
1 | John | Lotion | 10 Downing Str | johnnyd@mail | 10/12/2024 |
2 | Mary | Biscuit | Radisson Close | maryg@mail | 8/12/2024 |
3 | Nneka | Lip Gloss | Upper Iweka | nnenyg@mail | 6/10/2024 |
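One possible way to declare this 1NF table in SQL is sketched below. The column types are assumptions, and the composite primary key reflects the fact that each customer appears once per item purchased.

```sql
-- A possible declaration of the 1NF table; types are assumptions.
CREATE TABLE purchases_1nf (
    customer_id      INTEGER,
    customer_name    VARCHAR(100),
    item_purchased   VARCHAR(100),      -- one atomic item per row, no lists
    contact_address  VARCHAR(200),
    customer_email   VARCHAR(100),
    transaction_date DATE,
    PRIMARY KEY (customer_id, item_purchased)
);
```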
Second Normal Form (2NF):
Second normal form seeks to eliminate partial dependencies.
Its rules are as follows:
The table must already be in first normal form (1NF).
Non-primary attributes (non-key columns) must depend on the entire primary key, not just part of it.
Example:
Now that we have a table in 1NF in the example above, we can check whether it satisfies the constraints of 2NF.
To normalize the table into 2NF, we need to identify the composite primary key and any partial dependencies.
Our 1NF table has a composite primary key of Customer ID and Item Purchased, since each transaction is uniquely identified by this combination.
Partial Dependency:
- Customer Name, Customer Email, and Customer Address depend only on Customer ID, not on the entire primary key (Customer ID and Item Purchased).
We can split the table into two separate tables (a Transactions table and a Customer table) to eliminate the partial dependency:
Transactions Table
Customer ID | Item Purchased | Transaction Date |
1 | Jigsaw | 10/12/2024 |
1 | Hammer | 10/12/2024 |
1 | Lotion | 10/12/2024 |
2 | Biscuit | 8/12/2024 |
3 | Lip Gloss | 6/10/2024 |
Customer Table
Customer ID | Customer Name | Customer Email | Customer Address |
1 | John | johnnyd@mail | 10 Downing Str |
2 | Mary | maryg@mail | Radisson Close |
3 | Nneka | nnenyg@mail | Upper Iweka |
We have now eliminated the partial dependency by ensuring that customer information depends only on the Customer ID.
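Sketched in SQL, the 2NF split might look like the following. Types are assumptions, and the foreign key reference simply makes the relationship between the two tables explicit.

```sql
-- Customer details now live in one place, keyed by customer_id alone.
CREATE TABLE customers (
    customer_id      INTEGER PRIMARY KEY,
    customer_name    VARCHAR(100),
    customer_email   VARCHAR(100),
    customer_address VARCHAR(200)
);

-- Each transaction row keeps only what depends on the full composite key.
CREATE TABLE transactions (
    customer_id      INTEGER REFERENCES customers (customer_id),
    item_purchased   VARCHAR(100),
    transaction_date DATE,
    PRIMARY KEY (customer_id, item_purchased)
);
```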
Third Normal Form (3NF):
Third normal form aims to eliminate transitive dependencies.
Its rules are as follows:
Must first be in second normal form
No transitive dependencies: non-primary attributes must not depend on other non-primary attributes.
Example:
The 2NF tables above are already in 3NF, as they contain no transitive dependencies.
Consider instead the Employee table below:
Employee ID | Name | Department | Manager |
1 | Chisom | Finance | Jake |
2 | Kene | IT | Jide |
Transitive Dependency:
Employee ID → Name: The Name attribute depends directly on Employee ID (the primary key), which is a valid dependency.
Employee ID → Department: The Department depends directly on Employee ID, which is also a valid dependency.
The Manager, however, does not depend directly on Employee ID. Instead, it depends on Department, which in turn depends on Employee ID. The dependency chain is Employee ID → Department → Manager, which makes it a transitive dependency.
To eliminate this transitive dependency, we split the table above into two separate tables: an Employee table and a Department table.
Employee Table
Employee ID | Name | Department |
1 | Chisom | Finance |
2 | Kene | IT |
Department Table
Department | Manager |
Finance | Jake |
IT | Jide |
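A corresponding SQL sketch of the 3NF split might look like this. Types are assumptions, and using the department name as the key is purely for illustration.

```sql
-- Manager now depends only on the department, not on any employee.
CREATE TABLE departments (
    department VARCHAR(50) PRIMARY KEY,
    manager    VARCHAR(100)
);

-- Each employee references a department; the manager is looked up through
-- the departments table rather than being repeated on every employee row.
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    name        VARCHAR(100),
    department  VARCHAR(50) REFERENCES departments (department)
);
```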
Conclusion
Database Normalization is a very important design technique for creating reliable and efficient databases. Normalization minimizes redundancy, ensures data integrity and prevents anomalies. From the foundational First Normal Form (1NF) to the more advanced Third Normal Form (3NF), each level of normalization builds on the previous to enhance database performance and reliability. While normalization brings significant advantages, it’s also essential to balance it with practical considerations, such as query complexity and performance in specific use cases.
By understanding and applying normalization principles, developers and data engineers can build databases that stand the test of time, supporting scalable and consistent data management for applications of any size.