Content uploaded by Nicola Antonio Roberto Amato
Author content
All content in this area was uploaded by Nicola Antonio Roberto Amato on Oct 07, 2023
Content may be subject to copyright.
1
Mastering database normalization:
A comprehensive exploration of normal forms
By Nicola Amato
https://www.researchgate.net/profile/Nicola-Antonio-Roberto-Amato
https://scholar.google.com/citations?user=9T6sjlYAAAAJ&hl=it
https://apps.ualberta.ca/directory/person/nicolaan
http://nicola-amato.blogspot.com/
Database normalization is a process used in relational database design to organize data efficiently and
reduce data redundancy while ensuring data integrity. It involves breaking down large tables into smaller,
related tables and defining relationships between them. The main goals of database normalization are to
eliminate data anomalies, reduce data duplication, and make the database more manageable.
The normalization process is typically divided into several normal forms, each with its own rules and
requirements. The most commonly used normal forms are:
1. First Normal Form (1NF):
• Each table must have a primary key, which uniquely identifies each row.
• Each column in the table must contain atomic (indivisible) values.
• The values in each column must be of the same data type.
2. Second Normal Form (2NF):
• The table must be in 1NF.
• All non-key attributes (columns) must be fully functionally dependent on the entire
primary key. This means that every non-key attribute must depend on the entire primary
key, not just a part of it.
3. Third Normal Form (3NF):
• The table must be in 2NF.
• There should be no transitive dependencies, meaning that non-key attributes should not
depend on other non-key attributes.
4. Boyce-Codd Normal Form (BCNF):
• The table must be in 3NF.
• Every non-key attribute must be functionally dependent on the superkey, which is any
set of attributes that uniquely identifies a row.
5. Fourth Normal Form (4NF):
• The table must be in BCNF.
• It deals with multi-valued dependencies, ensuring that no non-key attribute is dependent
on other non-key attributes in a way that creates unnecessary duplication of data.
6. Fifth Normal Form (5NF):
• The table must be in 4NF.
• It addresses join dependencies, ensuring that tables are organized to minimize the need
for complex joins in queries.
2
Database designers use these normal forms as guidelines to structure their databases efficiently,
depending on the specific requirements of their applications. While achieving higher normal forms can
reduce data redundancy and improve data integrity, it's essential to strike a balance between normalization
and performance, as highly normalized databases may require more complex queries and joins. In
practice, the level of normalization applied to a database depends on the specific use case and
performance considerations.
Let's now delve into the intricacies of the normal forms.
First Normal Form (1NF)
1NF is the fundamental step in database normalization. To meet the requirements of 1NF, a table
must satisfy the following criteria:
1. Unique Rows: Each row in the table must be unique, meaning there should be a way to
distinguish one row from another. This requirement is typically fulfilled by having a primary key,
which is a column (or a combination of columns) with unique values for each row.
2. Atomic Values: Each column in the table should contain atomic (indivisible) values. In other
words, a column should not contain multiple values or complex data types. It should only hold
simple, single-valued data.
3. Homogeneous Data Types: All values in a column should have the same data type. For
example, if you have a column for "Age," all values in that column should be numeric integers.
If you have a column for "Names," all values should be text strings.
Example:
Consider a table that stores information about students and their courses. Here's an example of a table
that violates 1NF:
Student_ID
Student_Name
Courses
1
Alice
Math, Physics, Chemistry
2
Bob
Biology, History
3
Carol
Math, History
In this example:
1. Unique Rows: The rows are unique because each row has a distinct Student_ID.
2. Atomic Values: The "Courses" column violates 1NF because it contains multiple values (e.g.,
"Math, Physics, Chemistry" in the first row). This column should only contain atomic values.
To bring this table into 1NF, you can split the "Courses" column into separate rows for each course,
creating a new table for courses:
3
Students Table (1NF):
Student_ID
Student_Name
1
Alice
2
Bob
3
Carol
Courses Table (1NF):
Student_ID
Course
1
Math
1
Physics
1
Chemistry
2
Biology
2
History
3
Math
3
History
Now, both tables are in 1NF. The "Courses" column in the original table was non-atomic and had to
be split into separate rows in the "Courses" table to meet the 1NF requirements. This separation allows
for more efficient data storage and querying while maintaining data integrity.
Second Normal Form (2NF)
2NF builds upon the foundation of 1NF and introduces a new requirement related to the relationship
between non-key attributes and the primary key of a table. To achieve 2NF, a table must satisfy the
following conditions:
1. The table must already be in 1NF.
2. All non-key attributes (columns) must be fully functionally dependent on the entire primary key.
This means that every non-key attribute should depend on the entire primary key, not just a part
of it.
Example:
Let's continue with the student and course registration example from 1NF:
4
Students Table (1NF):
Student_ID
Student_Name
1
Alice
2
Bob
3
Carol
Courses Table (1NF):
Student_ID
Course
1
Math
1
Physics
1
Chemistry
2
Biology
2
History
3
Math
3
History
In this example, the "Courses" table has a composite primary key consisting of both "Student_ID"
and "Course." However, it's not yet in 2NF because the "Course" column depends only on "Student_ID"
and not on the entire primary key (i.e., not on both "Student_ID" and "Course").
To bring the "Courses" table into 2NF, we need to split it into two tables:
Students Table (1NF - unchanged):
Student_ID
Student_Name
1
Alice
2
Bob
3
Carol
Enrollments Table (2NF):
Student_ID
Course
1
Math
1
Physics
1
Chemistry
5
Student_ID
Course
2
Biology
2
History
3
Math
3
History
In this new structure, the "Enrollments" table has a composite primary key of both "Student_ID" and
"Course." Now, the "Course" column depends on both parts of the primary key, which satisfies the
requirements of 2NF. This design eliminates partial dependencies and ensures that each non-key attribute
depends on the entire primary key.
Third Normal Form (3NF)
3NF is the next step in the process of database normalization. To achieve 3NF, a table must satisfy
the following conditions:
1. The table must already be in 2NF.
2. There should be no transitive dependencies, meaning that non-key attributes should not depend
on other non-key attributes.
In simpler terms, 3NF ensures that data is organized in a way that prevents redundant storage of
information related to non-key attributes.
Example:
Let's continue with our student and course registration example from the previous normal forms:
Students Table (1NF - unchanged):
Student_ID
Student_Name
1
Alice
2
Bob
3
Carol
Enrollments Table (2NF - unchanged):
Student_ID
Course
1
Math
1
Physics
6
Student_ID
Course
1
Chemistry
2
Biology
2
History
3
Math
3
History
Now, let's say we want to store additional information about the courses, such as the course instructor
and the course schedule. We might create a new table like this:
Course Details Table:
Course
Instructor
Schedule
Math
Dr. Smith
MWF 9-10
Physics
Dr. Johnson
TTh 11-12
Chemistry
Dr. Davis
MWF 1-2
Biology
Dr. Wilson
TTh 9-10
History
Dr. Clark
MWF 3-4
In this case, the "Course Details" table contains information about courses, including their instructors
and schedules. However, this design introduces a transitive dependency between the "Course" column
and the "Instructor" column, both of which are non-key attributes. The "Instructor" column depends on
the "Course" column, which itself depends on the "Student_ID" in the "Enrollments" table.
To bring this into 3NF, we need to create a separate table for course details, linking it to the
"Enrollments" table through the "Course" column:
Enrollments Table (2NF - unchanged):
Student_ID
Course
1
Math
1
Physics
1
Chemistry
2
Biology
2
History
3
Math
3
History
7
Course Details Table (3NF):
Course
Instructor
Schedule
Math
Dr. Smith
MWF 9-10
Physics
Dr. Johnson
TTh 11-12
Chemistry
Dr. Davis
MWF 1-2
Biology
Dr. Wilson
TTh 9-10
History
Dr. Clark
MWF 3-4
Now, the "Course" column in the "Enrollments" table is solely responsible for identifying the course,
and the "Course Details" table contains information about courses, satisfying the requirements of 3NF.
This normalization ensures that there are no transitive dependencies among non-key attributes.
Boyce-Codd Normal Form (BCNF)
BCNF is a higher level of normalization that builds upon the concepts of 1NF, 2NF, and 3NF. To
achieve BCNF, a table must satisfy the following conditions:
1. The table must already be in 3NF.
2. Every non-key attribute (column) must be functionally dependent on the superkey, which is any
set of attributes that uniquely identifies a row.
In simpler terms, BCNF ensures that there are no partial dependencies, and all non-key attributes are
fully dependent on the entire primary key or a superkey.
Example:
Let's continue with our student and course registration example from the previous normal forms:
Students Table (1NF - unchanged):
Student_ID
Student_Name
1
Alice
2
Bob
3
Carol
8
Enrollments Table (2NF - unchanged):
Student_ID
Course
1
Math
1
Physics
1
Chemistry
2
Biology
2
History
3
Math
3
History
Course Details Table (3NF - unchanged):
Course
Instructor
Schedule
Math
Dr. Smith
MWF 9-10
Physics
Dr. Johnson
TTh 11-12
Chemistry
Dr. Davis
MWF 1-2
Biology
Dr. Wilson
TTh 9-10
History
Dr. Clark
MWF 3-4
In the 3NF structure, we have a primary key in the "Enrollments" table, which consists of both
"Student_ID" and "Course." However, this table still contains a partial dependency. For example, the
"Instructor" and "Schedule" attributes in the "Course Details" table depend only on the "Course"
attribute, which is part of the primary key. This dependency violates BCNF because non-key attributes
should be fully dependent on the entire primary key.
To bring this into BCNF, we need to create a separate table for course details and link it to the
"Enrollments" table through a superkey that uniquely identifies each row in the "Course Details" table.
Here's how the tables might look after applying BCNF:
Enrollments Table (3NF - unchanged):
Student_ID
Course
1
Math
1
Physics
1
Chemistry
2
Biology
2
History
9
Student_ID
Course
3
Math
3
History
Course Information Table (BCNF):
Course
Instructor
Schedule
Math
Dr. Smith
MWF 9-10
Physics
Dr. Johnson
TTh 11-12
Chemistry
Dr. Davis
MWF 1-2
Biology
Dr. Wilson
TTh 9-10
History
Dr. Clark
MWF 3-4
In this BCNF structure, the "Course Information" table has its own primary key, and the "Course"
column in the "Enrollments" table is now fully dependent on the primary key of the "Enrollments" table
(i.e., both "Student_ID" and "Course"). This design eliminates partial dependencies and ensures that all
non-key attributes are fully dependent on the superkey, meeting the requirements of BCNF.
Fourth Normal Form (4NF)
4NF is a further level of normalization that builds upon the concepts of 1NF, 2NF, 3NF, and BCNF.
To achieve 4NF, a table must satisfy the following conditions:
1. The table must already be in BCNF.
2. It addresses multi-valued dependencies, ensuring that there are no non-trivial multi-valued
dependencies among non-key attributes.
In simpler terms, 4NF ensures that there are no sets of non-key attributes that are functionally
dependent on the primary key and exhibit multi-valued dependencies.
Example:
Let's continue with our student and course registration example from the previous normal forms:
Students Table (1NF - unchanged):
Student_ID
Student_Name
1
Alice
2
Bob
10
Student_ID
Student_Name
3
Carol
Enrollments Table (2NF - unchanged):
Student_ID
Course
1
Math
1
Physics
1
Chemistry
2
Biology
2
History
3
Math
3
History
Course Information Table (3NF - unchanged):
Course
Instructor
Schedule
Math
Dr. Smith
MWF 9-10
Physics
Dr. Johnson
TTh 11-12
Chemistry
Dr. Davis
MWF 1-2
Biology
Dr. Wilson
TTh 9-10
History
Dr. Clark
MWF 3-4
In the 3NF and BCNF structures, there are no partial dependencies or functional dependencies that
violate those normal forms. However, let's introduce a new requirement to demonstrate 4NF. Suppose
we want to track additional information about the textbooks used in each course, including their titles
and authors. Here's an example:
Textbooks Table:
Course
Textbook_Title
Textbook_Author
Math
Calculus
Dr. Johnson
Physics
Physics Fund.
Dr. Davis
Chemistry
Chem Principles
Dr. Smith
Biology
Biology 101
Dr. Wilson
History
World History
Dr. Clark
11
In this "Textbooks" table, the "Course" column contains values that are functionally dependent on
the primary key (i.e., the combination of "Course" and "Textbook_Title"). However, there is a multi-
valued dependency between "Course" and "Textbook_Author" because a course can have multiple
textbooks, each with its own author.
To bring this into 4NF, we need to create separate tables for courses, textbooks, and textbook authors
while maintaining the relationships between them. Here's how the tables might look after applying 4NF:
Enrollments Table (BCNF - unchanged):
Student_ID
Course
1
Math
1
Physics
1
Chemistry
2
Biology
2
History
3
Math
3
History
Course Information Table (BCNF - unchanged):
Course
Instructor
Schedule
Math
Dr. Smith
MWF 9-10
Physics
Dr. Johnson
TTh 11-12
Chemistry
Dr. Davis
MWF 1-2
Biology
Dr. Wilson
TTh 9-10
History
Dr. Clark
MWF 3-4
Textbooks Table (4NF):
Course
Textbook_Title
Math
Calculus
Physics
Physics Fund.
Chemistry
Chem Principles
Biology
Biology 101
History
World History
12
Textbook Authors Table (4NF):
Textbook_Title
Textbook_Author
Calculus
Dr. Johnson
Physics Fund.
Dr. Davis
Chem Principles
Dr. Smith
Biology 101
Dr. Wilson
World History
Dr. Clark
Now, the "Textbooks" table is in 4NF. Each table has its own primary key, and there are no non-
trivial multi-valued dependencies among non-key attributes. This separation allows us to maintain data
integrity and avoid data redundancy while meeting the requirements of 4NF.
Fifth Normal Form (5NF)
Fifth Normal Form (5NF), also known as Project-Join Normal Form (PJ/NF), is a higher level of
database normalization that addresses join dependencies. Achieving 5NF ensures that tables are
organized to minimize the need for complex joins in queries. To reach 5NF, a table must meet the
following requirements:
1. It must already be in 4NF.
2. It should not have any join dependencies, meaning that it should not rely on decomposing tables
and joining them to retrieve data.
In simpler terms, 5NF aims to eliminate any need for joining tables to obtain information. This level
of normalization is particularly relevant in scenarios where minimizing data redundancy and maintaining
data integrity are paramount.
Let's illustrate 5NF with an example:
Consider a database for a library that tracks information about books, authors, and publishers. Here
are three tables:
Books Table:
Book_ID
Title
Author_ID
Publisher_ID
1
"Book 1"
1
1
2
"Book 2"
2
2
3
"Book 3"
1
3
13
Authors Table:
Author_ID
Author_Name
1
"Author A"
2
"Author B"
Publishers Table:
Publisher_ID
Publisher_Name
1
"Publisher X"
2
"Publisher Y"
3
"Publisher Z"
In the above structure, we have normalized the data up to 4NF by separating information about books,
authors, and publishers while eliminating multi-valued and other dependencies. However, we still need
to join these tables to retrieve comprehensive book information.
To achieve 5NF, we can create a new table that eliminates the need for joins and stores all the relevant
information:
Book Information Table (5NF):
Book_ID
Title
Author_Name
Publisher_Name
1
"Book 1"
"Author A"
"Publisher X"
2
"Book 2"
"Author B"
"Publisher Y"
3
"Book 3"
"Author A"
"Publisher Z"
In this 5NF structure, we've combined data from the Books, Authors, and Publishers tables into a
single table, "Book Information." This eliminates the need for joins to obtain comprehensive book
information, as all relevant data is now available within one table.
By achieving 5NF, we've reduced complexity and redundancy in the database design, making it easier
to query and maintain while ensuring data integrity. However, it's important to note that achieving 5NF
may not always be necessary or practical in every database design, as it can lead to larger, denormalized
tables that may have performance trade-offs. The level of normalization should be balanced with the
specific requirements of the application.
14
Conclusion
In conclusion, the journey through the various Normal Forms (NFs) of database normalization has
provided us with a comprehensive understanding of the principles and techniques that underpin efficient
and well-structured relational databases. From the foundational First Normal Form (1NF) to the
advanced Fifth Normal Form (5NF), each level of normalization addresses specific aspects of data
organization, integrity, and redundancy.
We have learned that normalization is a crucial process in database design, striking a delicate balance
between the reduction of data redundancy and the maintenance of data integrity. By adhering to these
normalization principles, we can create database structures that minimize anomalies, ensure data
accuracy, and facilitate efficient querying and maintenance.
While the journey through the NFs has shown us the importance of progressively refining our data
structures, it's important to recognize that the choice of which normalization level to pursue depends on
the specific requirements of each database system. Over-normalization can lead to complexity, increased
storage requirements, and potential performance trade-offs, especially in scenarios where read-heavy
operations are prevalent.
In the ever-evolving landscape of database management, it's essential for database professionals to
remain adaptable and pragmatic in their approach. The NFs provide us with a valuable toolkit, allowing
us to design databases that strike the right balance between structure and performance, tailored to the
unique demands of each application.
In summary, normalization is not just a theoretical exercise but a practical discipline that empowers
us to create databases that are both robust and efficient. By leveraging the lessons learned from the NFs,
we can ensure that our data remains an asset, enabling better decision-making, improved scalability, and
a solid foundation for future growth in the ever-expanding world of data-driven applications.
References
Codd, E. F. (June 1970). A Relational Model of Data for Large Shared Data Banks, Communications of the
ACM. 13 (6): 377–387. doi:10.1145/362384.362685. S2CID 207549016.
Carlos Coronel, Steven Morris, Database Systems: Design, Implementation, & Management, Cengage
Learning; 13th edition (January 1, 2018)
Connolly, T.M. & Begg, C.E. (2015). Database Systems: A Practical Approach to Design, Implementation and
Management. Pearson, Global (Sixth) Edition
Kent, W. (1983) A Simple Guide to Five Normal Forms in Relational Database Theory, Communications of
the ACM, vol. 26, pp. 120–125
MariaDB, Database Normalization: 5th Normal Form and Beyond. KnowledgeBase.