Database Normalization For Beginners

Data is everywhere and can be collected in countless ways. In its raw form, data is often tedious to work with: unstructured, inconsistent, and unreliable. The process of organizing this raw data into a structured, consistent format that enables efficient querying, optimized storage, and data integrity is called database normalization.

Terms To Know

  • Primary Key: A column or set of columns whose values uniquely identify each row in a table

  • Composite Key: A key made up of two or more columns whose combined values uniquely identify each row in a table
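
To make these two terms concrete, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are invented for this illustration, not taken from a real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Primary key: customer_id alone uniquely identifies each row.
conn.execute("""
    CREATE TABLE customers (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT
    )
""")

# Composite key: neither column is unique on its own, but the
# combination (customer_id, item_purchased) identifies each row.
conn.execute("""
    CREATE TABLE transactions (
        customer_id      INTEGER,
        item_purchased   TEXT,
        transaction_date TEXT,
        PRIMARY KEY (customer_id, item_purchased)
    )
""")
```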

What is Database Normalization?

Database normalization is a design technique used to optimize databases by organizing data into a structured and consistent format. The process involves transforming data from lower normal forms (less structured and consistent) to higher normal forms (more structured and consistent).

Objectives of Database Normalization

Normalizing a database improves the overall quality of its data by:

  1. Minimizing Redundancy: Redundancy occurs when the same piece of data exists in multiple places within a database. In the table below, details like the customers’ addresses and emails are repeated numerous times. Duplicate data wastes storage space and can lead to inconsistencies.

    | product_name | customer_name | customer_email | customer_address | transaction_date |
    | --- | --- | --- | --- | --- |
    | Jigsaw | John Doe | johnnyd@mail | 10 Downing Str | 10/12/2024 |
    | Hammer | John Doe | johnnyd@mail | 10 Downing Str | 10/12/2024 |
    | Lotion | John Doe | johnnyd@mail | 10 Downing Str | 10/12/2024 |
    | Biscuit | Mary Graham | maryg@mail | Radisson Close | 8/10/2024 |
    | Lip Gloss | Nneka Obi | nenyo@mail | Upper Iweka | 6/10/2024 |

  2. Enabling Efficient Querying: Organized, structured data helps the database process queries more efficiently, leading to faster retrieval times. It also reduces the need for complex filtering or aggregation logic.

  3. Ensuring Data Integrity: Normalizing a database improves the integrity of its data by ensuring that data remains consistent across all fields.

  4. Preventing Anomalies: Database redundancy can cause anomalies like:

  • Update Anomaly: This occurs when updating a single piece of data requires changes in multiple places where that data is stored.

    In the table above, if John Doe changes their address, we have to update every cell where that address is stored. If we mistakenly skip a cell during the update, we introduce an inconsistency into the database.

  • Insertion Anomaly: This occurs when inserting a new record into a table is impossible without also inserting information about a different entity.

    Example: consider a table that stores information about students and their courses. In the table below, if we create a new course (Mathematics), we cannot add it directly without also associating it with a student.

| Student ID | Student Name | Course ID | Course Name |
| --- | --- | --- | --- |
| 1 | Ike Umeh | 103 | Chemistry |
| 2 | Sandra Drea | 104 | Physics |
| 3 | Nancy Owens | 105 | Biology |
  • Deletion Anomaly: This occurs when deleting one piece of data unintentionally removes a separate entity. For example, in the table above, if we remove Physics from our list of courses, the record for the student “Sandra Drea” is deleted with it. All three anomalies are demonstrated in the sketch below.
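
The sketch below reproduces all three anomalies using Python's built-in sqlite3 module. The table definitions mirror the examples above; the replacement address '5 Abbey Road' and the NOT NULL constraints are assumptions made purely for the demonstration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Update anomaly: the denormalized purchases table repeats John's
# address on every row, so one address change must touch them all.
conn.execute("CREATE TABLE purchases (product TEXT, customer TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("Jigsaw", "John Doe", "10 Downing Str"),
     ("Hammer", "John Doe", "10 Downing Str"),
     ("Lotion", "John Doe", "10 Downing Str")],
)
cur = conn.execute(
    "UPDATE purchases SET address = '5 Abbey Road' WHERE customer = 'John Doe'"
)
print(cur.rowcount, "rows updated for a single address change")  # -> 3

# Insertion anomaly: the student/course table mixes two entities,
# so a new course cannot be stored without student data.
conn.execute("""
    CREATE TABLE student_courses (
        student_id   INTEGER NOT NULL,
        student_name TEXT    NOT NULL,
        course_id    INTEGER NOT NULL,
        course_name  TEXT    NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO student_courses VALUES (?, ?, ?, ?)",
    [(1, "Ike Umeh", 103, "Chemistry"),
     (2, "Sandra Drea", 104, "Physics"),
     (3, "Nancy Owens", 105, "Biology")],
)
try:
    conn.execute(
        "INSERT INTO student_courses (course_id, course_name) VALUES (106, 'Mathematics')"
    )
except sqlite3.IntegrityError as err:
    print("Insertion anomaly:", err)  # NOT NULL constraint failed

# Deletion anomaly: dropping the Physics course also erases the only
# record of the student Sandra Drea.
conn.execute("DELETE FROM student_courses WHERE course_name = 'Physics'")
remaining = conn.execute(
    "SELECT COUNT(*) FROM student_courses WHERE student_name = 'Sandra Drea'"
).fetchone()[0]
print("Rows left for Sandra Drea:", remaining)  # -> 0
```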

First Normal Form (1NF):

This is the lowest level of normalization and is built on the following principles:

  • Each row in the table must have a unique identifier, i.e., a primary key

  • Each value in the table must be indivisible (atomic)

Example:

The table below records the purchase of items at a grocery store. It violates 1NF in two ways: the items purchased by John are stored as a list in a single cell, breaking the rule that each value must be atomic, and the table has no primary key, so individual rows cannot be uniquely identified.

| Customer Name | Item Purchased | Contact Address | Customer Email | Transaction Date |
| --- | --- | --- | --- | --- |
| John | Jigsaw, hammer, lotion | 10 Downing Str | johnnyd@mail | 10/12/2024 |
| Mary | Biscuit | Radisson Close | maryg@mail | 8/12/2024 |
| Nneka | Lip Gloss | Upper Iweka | nenyo@mail | 6/10/2024 |

To bring the table above into 1NF, we first introduce a unique identifier in the form of a Customer ID, then split the Item Purchased column so that each item occupies a row of its own, as shown below:

Table In First Normal Form:

| Customer ID | Customer Name | Item Purchased | Contact Address | Customer Email | Transaction Date |
| --- | --- | --- | --- | --- | --- |
| 1 | John | Jigsaw | 10 Downing Str | johnnyd@mail | 10/12/2024 |
| 1 | John | Hammer | 10 Downing Str | johnnyd@mail | 10/12/2024 |
| 1 | John | Lotion | 10 Downing Str | johnnyd@mail | 10/12/2024 |
| 2 | Mary | Biscuit | Radisson Close | maryg@mail | 8/12/2024 |
| 3 | Nneka | Lip Gloss | Upper Iweka | nenyo@mail | 6/10/2024 |
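
In code, the 1NF version might look like the following sketch (sqlite3 again; the table name purchases_1nf and the choice of a composite primary key are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Every column holds a single, atomic value, and each row is uniquely
# identified (here by the combination of customer and item).
conn.execute("""
    CREATE TABLE purchases_1nf (
        customer_id      INTEGER,
        customer_name    TEXT,
        item_purchased   TEXT,  -- one item per row, never a list
        contact_address  TEXT,
        customer_email   TEXT,
        transaction_date TEXT,
        PRIMARY KEY (customer_id, item_purchased)
    )
""")

# The multi-valued cell "Jigsaw, hammer, lotion" becomes three rows.
conn.executemany(
    "INSERT INTO purchases_1nf VALUES (?, ?, ?, ?, ?, ?)",
    [(1, "John", "Jigsaw", "10 Downing Str", "johnnyd@mail", "10/12/2024"),
     (1, "John", "Hammer", "10 Downing Str", "johnnyd@mail", "10/12/2024"),
     (1, "John", "Lotion", "10 Downing Str", "johnnyd@mail", "10/12/2024")],
)
```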

Second Normal Form (2NF):

Second normal form seeks to eliminate partial dependencies.

Its rules are as follows:

  • Must already be in first normal form (1NF)

  • Non-primary attributes (non-key columns) must depend on the entire primary key, not just part of it.

Example:

Now that the table from the example above is in 1NF, we can check whether it satisfies the constraints of 2NF.

To normalize it into 2NF, we need to identify the composite primary key and any partial dependencies.

Our 1NF table has a composite primary key of (Customer ID, Item Purchased), since each transaction is uniquely identified by this combination.

Partial Dependency:

  • Customer Name, Customer Email, and Customer Address depend only on the Customer ID, not on the entire primary key (Customer ID, Item Purchased).

We can split the table into two separate tables (a Transactions table and a Customer table) to eliminate the partial dependency:

Transactions Table

| Customer ID | Item Purchased | Transaction Date |
| --- | --- | --- |
| 1 | Jigsaw | 10/12/2024 |
| 1 | Hammer | 10/12/2024 |
| 1 | Lotion | 10/12/2024 |
| 2 | Biscuit | 8/12/2024 |
| 3 | Lip Gloss | 6/10/2024 |

Customer Table

| Customer ID | Customer Name | Customer Email | Customer Address |
| --- | --- | --- | --- |
| 1 | John | johnnyd@mail | 10 Downing Str |
| 2 | Mary | maryg@mail | Radisson Close |
| 3 | Nneka | nenyo@mail | Upper Iweka |

We have now eliminated the partial dependency by ensuring that customer information depends only on the Customer ID.
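
As a sketch, the 2NF split could be defined like this (the table names and the foreign-key constraint are illustrative choices, not requirements of 2NF itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Customer details now depend on the whole key of their own table
# (customer_id), so the partial dependency is gone.
conn.execute("""
    CREATE TABLE customers (
        customer_id      INTEGER PRIMARY KEY,
        customer_name    TEXT,
        customer_email   TEXT,
        customer_address TEXT
    )
""")
conn.execute("""
    CREATE TABLE transactions (
        customer_id      INTEGER REFERENCES customers(customer_id),
        item_purchased   TEXT,
        transaction_date TEXT,
        PRIMARY KEY (customer_id, item_purchased)
    )
""")

conn.execute("INSERT INTO customers VALUES (1, 'John', 'johnnyd@mail', '10 Downing Str')")
conn.execute("INSERT INTO transactions VALUES (1, 'Jigsaw', '10/12/2024')")

# A join reassembles the original wide view whenever it is needed.
print(conn.execute("""
    SELECT c.customer_name, t.item_purchased, t.transaction_date
    FROM transactions t JOIN customers c USING (customer_id)
""").fetchall())
```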

Third Normal Form (3NF):

Third normal form aims to eliminate transitive dependencies.

Its rules are as follows:

  • Must first be in second normal form (2NF)

  • No transitive dependencies: non-primary attributes must not depend on other non-primary attributes.

Example:

The 2NF tables above are already in 3NF, as they contain no transitive dependencies.

Consider the Employee table below:

| Employee ID | Name | Department | Manager |
| --- | --- | --- | --- |
| 1 | Chisom | Finance | Jake |
| 2 | Kene | IT | Jide |

Transitive Dependency:

  • Employee ID → Name: The Name attribute depends directly on the Employee ID (the primary key), which is a valid dependency.

  • Employee ID → Department: The Department depends directly on the Employee ID, which is also a valid dependency.

  • Employee ID → Department → Manager: The Manager does not depend directly on the Employee ID. Instead, it depends on the Department, which in turn depends on the Employee ID. This chain of dependencies is transitive.

To eliminate this transitive dependency, we split the table above into two separate tables: an Employee table and a Department table.

Employee Table

| Employee ID | Name | Department |
| --- | --- | --- |
| 1 | Chisom | Finance |
| 2 | Kene | IT |

Department Table

| Department | Manager |
| --- | --- |
| Finance | Jake |
| IT | Jide |
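
A sketch of the 3NF split in code (the foreign-key constraint and the replacement manager 'Ada' are assumptions added for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Manager now depends on Department in its own table, so no non-key
# column depends on another non-key column.
conn.execute("""
    CREATE TABLE departments (
        department TEXT PRIMARY KEY,
        manager    TEXT
    )
""")
conn.execute("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT,
        department  TEXT REFERENCES departments(department)
    )
""")

conn.executemany("INSERT INTO departments VALUES (?, ?)",
                 [("Finance", "Jake"), ("IT", "Jide")])
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, "Chisom", "Finance"), (2, "Kene", "IT")])

# Changing a department's manager is now a single-row update that is
# reflected for every employee in that department through a join.
conn.execute("UPDATE departments SET manager = 'Ada' WHERE department = 'IT'")
```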

Conclusion

Database normalization is a fundamental design technique for creating reliable and efficient databases. Normalization minimizes redundancy, ensures data integrity, and prevents anomalies. From the foundational First Normal Form (1NF) to the more advanced Third Normal Form (3NF), each level of normalization builds on the previous one to enhance database performance and reliability. While normalization brings significant advantages, it is also essential to balance it against practical considerations, such as query complexity and performance in specific use cases.
By understanding and applying normalization principles, developers and data engineers can build databases that stand the test of time, supporting scalable and consistent data management for applications of any size.