This post delves into the database design process by examining what an entity-relationship (E-R) model is, and the different ways that it can be designed.
The design process
One of the most important parts of database design is designing the schema. For any business domain, there are many different ways this schema can be designed. While there is usually no schema that is objectively the best, some are much more effective than others.
There are 2 major pitfalls that should be avoided during the design process:
- Data redundancy: a schema with redundant data has the same information stored in multiple tables. Every time you repeat information, you increase the likelihood that you update one piece of data but forget to update it somewhere else. Redundant data makes it harder to ensure that the data is accurate and consistent. There are some cases where it is desirable to store the same data in multiple locations, but it should be done intentionally.
- Incompleteness: incomplete schemas make certain business requirements very difficult or impossible to perform. Before designing the database schema, it’s important to gather the business requirements and make sure they’re taken into account when creating the design.
Even if you avoid these pitfalls, it is still possible that you still have many other good alternatives to choose from. Creating E-R models for these good schemas can help highlight the advantages and disadvantages of each alternative, making it easier to make the most informed choice.
Entities and relationships
E-R stands for entity-relationship. An E-R model is a way of modeling a database by describing the entities that are stored in the database and the relationships between those entities. It’s worth spending some time to learn about this type of model because relational schemas end up mapping very easily from E-R models.
What is an entity? It is nothing more than some real-world object. Customers, books, and orders are all examples of entities that could exist in a business model.
In a database, each entity is represented as a set of attributes. One or more of these attributes is used to uniquely distinguish each entity from all the others. For example, a customer might be represented by the following attributes: name, social security number, and date of birth. In this case, the social security number would be the attribute that uniquely identifies each customer. The unique identifier is called the key.
A relationship is straightforward: it represents an association among multiple entities. Suppose we have the following entities:
Suppose that Loan 1 and Loan 2 belong to Tom, Loan 3 belongs to Sally, and Tom does not own any loans. From the entity sets above, we can’t represent this information. However, we can use the following relationship set to represent it:
From this example, we can see how the combination of both entities and relationships can be used to model a wide variety of business domains.
Almost all businesses have a number of constraints on how they represent entities or the relationships between those entities. Let’s take the example of customers and loans again. What if the business only allowed each customer to have one loan at a time, but any given loan could be owned by multiple customers? This represents a mapping cardinality constraint, meaning that it puts a restriction on how many other entities a given entity can be associated with.
There are 4 types of mapping cardinalities:
- One-to-one: an entity in A can be associated with at most one entity in B, and vice-versa.
- One-to-many: an entity in A is associated with any number of entities in B, but an entity in B can only be associated with a single entity in A.
- Many-to-one: an entity in A can only be associated with a single entity in B, but an entity in B can be associated with any number of entities in A.
- Many-to-many: there is no restriction on how many entities either A or B can be related to.
(Note: it’s useful to understand this terminology because it is used quite often in different areas of software development)
Now, suppose the business requires that every loan must belong to at least one customer, but customers don’t necessarily need to have a loan. This is another example of a constraint that exists in the relationship between different entities. However, this type of constraint is called a participation constraint. A participation constraint is either total or partial. From our example, loans have total participation in the borrower relation because every loan must belong to a customer, meaning that every loan will show up at least once in the borrower relation. On the other hand, customers have only partial participation because not every customer needs to show up in the borrower relation since some of them might not have loans.
Now that we have an idea of what entities, relationships, and constraints are, we can show what a basic E-R diagram looks like using the example above.
The example above shows the basic building blocks of an E-R diagram:
- Rectangles: represent entity sets
- Ellipses: represent attributes
- Diamonds: represent relationship sets
- Arrows: specify the mapping cardinality of a relationship, i.e. one-to-many, many-to-many, etc.
In the above example, we can see that customers have a one-to-many relationship with loans. This is because of the arrow that points from the borrower relationship to the customer entity set.
E-R diagrams can get more complicated than the one above, and there are other symbols that represent different details about entities and relationships. The point of this post isn’t to go into all these nuances, however.
E-R design issues
As mentioned at the beginning of this post, there are many different ways in which entities and relationships can be represented. There are three major design issues to keep in mind when constructing an E-R model.
Using entity sets vs attributes
The first design issue is the use of entity sets vs attributes in representing data. Take an example of employees and their telephone number(s). The following E-R diagram shows one way in which we could represent these data.
In this diagram, we are adding the telephone number as an attribute on the employee entity. This could be a feasible option in some business cases, but in others it might not be. Couldn’t a telephone be its own distinct entity? If we decided that it was, we would probably want to use the following model.
In this case, we’re representing telephone as its own entity set and then relating it back to an employee using the emp_telephone relationship.
Let’s compare the two models above. Even though the differences seem simple, there are some important implications to these different models.
In the first model, it is implied that each employee has only one telephone number. The model wouldn’t work well if employees could have more than one telephone. Additionally, if an employee didn’t have a telephone number, we would have to use the null value for that attribute, which is not ideal. Finally, what if we are interested in some other information about the telephone? Should those telephone attributes really belong on the employee entity set?
If the business cases were such that all employees had exactly one telephone number and no other information about the phones was of interest, the above model could work and would have the advantage of simplicity. However, one guarantee of software development is that things change. The first model lacks the flexibility to accommodate changes in telephone information.
In the second model, we treat telephones as a separate entity from employees. This provides us with much more flexibility than the first model. If an employee has multiple telephone numbers, we could easily represent that with the help of the emp_telephone relationship. If an employee didn’t have a phone at all, we wouldn’t need any null values – that employee simply wouldn’t show up in the emp_telephone relationship set. Finally, having telephone as its own entity makes it easy to represent any additional information about telephones that we might need.
Using entity sets vs relationship sets
Let’s move on to the second common design issue. This is the issue of determining whether to represent something as an entity or a relationship. Take the example of customers and loans that was used above. In a banking system, customers and loans both belong to certain banks. Instead of representing loans as an entity, couldn’t we represent them as a relationship between customers and banks? That type of model would look like the following:
Clearly, loans could work as either entities or relationships. Which one is preferable? Consider the case where a loan can be held jointly by multiple customers. This would require multiple entries in the loan relationship, one for each customer, which would mean that the loan_number and amount attributes would have to be duplicated across each of these entries. This has the problem of the data redundancy pitfall mentioned at the beginning of this post. So even though it’s possible to represent things as entities or relationships, sometimes it is preferable to choose one over the other.
The guideline offered is that relationships should describe actions between different entities. If it doesn’t describe an action, it might be better represented as an entity.
Using binary vs n-ary relationship sets
All E-R relationships can be broken down into binary relationships. Sometimes it’s desirable to break down relationships that involve more than 2 entities into multiple relationships, but sometimes it’s not.
Consider the case where you wanted to relate a child to his/her parents. You could choose to use a binary relationship between child and parent. In this case, the relationship would look like the following:
Notice what happens if a child does not have a mother or father: we have to represent their value as null. How can we avoid this? Instead, we could split this into 2 different relationships: child_mother and child_father. This gives us the same ability to show how a child relates to his/her parents without having to use null values. This is an example where it is preferable to break down a relationship into binary form.
Let’s take a different example. Suppose we have employees who work at certain branches of their company and who have certain jobs. One way of representing these relationships is as follows:
As we can see, this is a ternary relationship between employee, branch and job. Should we break it down into binary relationships between employee/branch and employee/job? If we took that approach, we would not be able to express that employee Bob works at Branch A as a manager and works at Branch B as an accountant, but is not a manager at Branch B and is not an accountant at Branch A. Therefore, in this situation it would be better for us to keep the ternary relationship instead of breaking it down into binary relationships like we did with the last one.
These 2 examples illustrate the importance of considering whether relationships should involve more than 2 entities, or whether they should be broken down.
Weak entity sets
It is possible to have entity sets whose attributes are not sufficient to form a primary key. As mentioned in the previous post, a primary key is one or more attributes which uniquely identify each entity within an entity set, or each relationship within a relationship set.
Take a look at the following E-R diagram representing this situation:
In this example, payment is a weak entity set. When you think of a payment on a loan, it would usually have a sequential numbering (i.e. the first payment is #1, the second is #2, etc.). That means that we could have multiple entries in the payment set which have the same payment number but belong to different loans. As we can see, no combination of the attributes in the payment set can act as a primary key. That’s why payment is a weak entity set.
How do we know what payments are associated with what loans, then? We need to use a relationship set which can associate payments with loans. When an entity set is dependent upon another entity set to identify its entities, it is owned by that set. In this case, payments are owned by loans. Loan is called the owner entity set and loan_payment is called the identifying relationship. The nature of weak entity sets is that they have total participation in their identifying relationship. In other words, every entry in the payment set must show up in the loan_payment set.
While it’s true that weak entity sets don’t have a primary key, how do we differentiate weak entities that are all associated with the same owner entity? Let’s rephrase this same question using an example to make it less abstract. Say we have made 5 payments on Loan #1. How do we distinguish those 5 payments from each other? When we look at it this way, we can see that the payment_number is sufficient to distinguish these 5 payments from each other. Even though there might be multiple entries in the whole payments set that have a payment number of 1, an individual loan is only ever going to have one payment with number 1. We call the payment number the discriminator. Even though the discriminator is not a unique identifier in the entire set, it is a unique identifier for the particular owner entity that owns it.
Note: These are my notes from Chapter 3 of Database System Concepts, which is about database systems. The examples here are based off of that book, they’re not my examples.