Entity Modelling

www.entitymodelling.org - entity modelling introduced from first principles - relational database design theory and practice - dependent type theory


Data Modelling

We use the term 'Data Modelling' to cover both database design and, in a broad sense, the specification of the structure of messages. Entity modelling, though traditionally used as a precursor to relational data modelling, is equally suitable as a method of specifying hierarchical data structures; in fact, supporting relational, hierarchical and network models of data from a single specification was the raison d'être of the notation as originally proposed by Chen in his 1976 paper 'The Entity-Relationship Model: Toward a Unified View of Data'¹. This is particularly well illustrated in Barker's Entity Modelling book².

ER modelling can be used to specify data at three distinct, increasingly prescriptive levels:

  • conceptual — entities and relationships only,
  • logical — entities, relationships and non-referential attributes,
  • physical — as logical but with message structure completed by the addition of attributes representing relationships.

The most prescriptive level, that of a physical model, is precisely a data model, and such models can be classified as relational or hierarchical; most significantly, and in accord with the proposal of Chen, one of each can be generated algorithmically from a well-formulated logical model. Looking back, it seems that hierarchically structured data took an intellectual back seat for a while during the theoretical development and popularisation of the relational model of data, but it has since made a comeback through the widespread adoption of the structured markup language XML. Via appropriate physical models, both relational Data Definition Language (DDL) and hierarchical XML schemas (DTD, XSD or the like) can be generated automatically from a single logical ER model³.
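As a rough illustration of generating both kinds of physical model from one logical model, the following Python sketch derives relational DDL and a DTD from the same description. It is not the algorithm referred to above: the class names `Entity` and `Composition`, the functions `to_relational_ddl` and `to_dtd`, the column types, and the employee/dependant attribute names are all assumptions made for the example, and each child type is assumed to have a single parent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    key: str                 # identifying attribute (hypothetical simplification)
    attributes: tuple = ()   # other non-referential attributes


@dataclass(frozen=True)
class Composition:
    parent: str   # the "one" end of the 1:n relationship
    child: str    # the "many" end, whose instances depend on the parent


def to_relational_ddl(entities, compositions):
    """Relational physical model: the relationship becomes a foreign key column."""
    by_name = {e.name: e for e in entities}
    parent_of = {c.child: by_name[c.parent] for c in compositions}
    statements = []
    for e in entities:
        columns = [f"{e.key} INTEGER PRIMARY KEY"]
        columns += [f"{a} TEXT" for a in e.attributes]
        if e.name in parent_of:
            p = parent_of[e.name]
            columns.append(f"{p.name}_{p.key} INTEGER REFERENCES {p.name}({p.key})")
        statements.append(f"CREATE TABLE {e.name} ({', '.join(columns)});")
    return statements


def to_dtd(entities, compositions):
    """Hierarchical physical model: the relationship becomes element nesting."""
    children = {}
    for c in compositions:
        children.setdefault(c.parent, []).append(c.child)
    lines = []
    for e in entities:
        nested = children.get(e.name, [])
        content = "(" + ", ".join(f"{n}*" for n in nested) + ")" if nested else "EMPTY"
        lines.append(f"<!ELEMENT {e.name} {content}>")
        attrs = [f"{e.key} CDATA #REQUIRED"] + [f"{a} CDATA #IMPLIED" for a in e.attributes]
        lines.append(f"<!ATTLIST {e.name} {' '.join(attrs)}>")
    return lines


# One logical model, two physical models.
entities = [Entity("employee", "emp_no", ("name",)),
            Entity("dependant", "dep_no", ("name",))]
compositions = [Composition("employee", "dependant")]

ddl = to_relational_ddl(entities, compositions)
dtd = to_dtd(entities, compositions)
```

In the relational output the relationship is carried by an extra column on the dependant table, whereas in the DTD it is carried by nesting dependant elements inside employee elements, so no extra attribute is needed there.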

Historically, E. F. Codd's meta-theory, presented by him in 1970 as the relational model of data⁴, emerged fully formed: the meta-concepts of table, column and primary key are defined, as is that of a foreign key, which enables one table to cross-reference the rows of another. His is a theory of what data is, and this theory has come to underpin the majority of corporate databases. Each such database, in accord with Codd's prescriptions, holds a meta-description of its own units of storage (the tables, columns and keys): what their names are and how they fit together to enable navigation through the data. This description is the core of what is described as a relational schema. The development of the relational model of data was strongly influenced by the predicate calculus representation of formal logic, but arguably this meta-mathematics that influenced Codd has been overtaken by later 20th-century meta-mathematics in the form of type theory and category theory; these are more diagrammatic in form and lead not to the relational model of data but to versions of the binary entity-relationship model. It is these other meta-mathematical disciplines that influence this presentation and lead to meaningful improvements in relational design methodology. Paradoxically, each such improvement in relational design methodology undermines the pre-eminence enjoyed by the relational model.

Codd has described various tests of the goodness of a schema, applicable, it must be remembered, only with cognisance of the possibilities among the data that it is designed to hold, i.e. its intended usage. In the first instance three tests were described, and a schema is said to be in 1st, 2nd or 3rd normal form according to which of these successive tests it passes. The process of fixing a deficient schema is described as normalisation of the schema. Normalisation is therefore a method for transforming one relational schema into another that is deemed more suitable for the purpose at hand.

Subsequently, the relations of Codd's model are more abstractly presented, as either entities or as n-ary relationships, in Chen's entity-relationship model of data; in the approach of Chen there is emphasis on a diagrammatic representation of the model. Chen describes a method for constructing a relational schema (in the sense of Codd) from an entity-relationship schema (ER-schema). He states that normalisation of the relational schema might be required after construction from an ER-schema — though why this might be is not explained. We will explain in a later section the fundamental reason why this is so.

After Chen's 1976 paper, coming into and through the 1980s, came the concurrent development of Computer Aided Software Engineering (CASE) tools, including Meta-CASE tools, and of semi-formalised and, in some instances, officially standardised methodologies and notations supporting structured systems analysis and development. Universally in the methodologies from this time, the terms entity and relationship introduced in Chen's paper were retained within a logical modelling phase, and Chen's transformation step into relational database design, inclusive of a normalisation step, was likewise retained. Though the terms and the overall shape of the process were retained, the concepts behind these terms were adjusted. Most noticeably, relationships are now binary relationships, and at an early stage in these methodologies many-many relationships are eliminated in favour of many-one relationships. At this point there has been a conceptual volte-face, for a many-one binary relationship, implementation considerations aside, is a thinly disguised pointer between records of a file, such as in a VSAM file system, or a link between records in the network database model. It can be conceptualised, abstractly, as a function between sets of like-typed entities, leading some authors to describe a functional model of data⁵.

The entity-relationship diagrams of these software analysis methods and the accompanying CASE tools that emerged in the 1980s bear more resemblance to notations that preceded the work of Codd and Chen, such as Bachman's data structure diagrams from 1973⁶, than to the diagrams of Chen. Among the many variants of binary entity-relationship diagram, as summarised in the book of Rosemary Rock-Evans⁷, three stand out: those found, respectively, in SSADM/Barker-Ellis (now adopted by Oracle), in Clive Finkelstein and James Martin's Information Engineering, and in IDEF.
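The elimination of many-many relationships in favour of many-one relationships, and the reading of a many-one relationship as a function, can be made concrete with a small Python sketch. The student and course data, the intermediate `enrolments` entity type and the helper `inverse` are hypothetical names chosen for the illustration; a dictionary stands in for a function between sets of entities.

```python
# A hypothetical many-many relationship between students and courses,
# held as a set of pairs.
enrolled = {("ann", "maths"), ("ann", "physics"), ("bob", "maths")}

# Eliminate it by introducing an intermediate entity type with one
# instance per pair; the integer ids are arbitrary surrogate identifiers.
enrolments = {i: pair for i, pair in enumerate(sorted(enrolled), start=1)}

# Each enrolment is related to exactly one student and exactly one course,
# so both relationships are many-one and can be read as functions.
student_of = {e: student for e, (student, course) in enrolments.items()}
course_of = {e: course for e, (student, course) in enrolments.items()}


def inverse(f):
    """The one-many inverse of a many-one function, as a dict of sets."""
    inv = {}
    for x, y in f.items():
        inv.setdefault(y, set()).add(x)
    return inv


# The original many-many relationship is recoverable from the two functions.
recovered = {(student_of[e], course_of[e]) for e in enrolments}
maths_enrolments = inverse(course_of)["maths"]
```

Following `student_of` or `course_of` from an enrolment is the pointer-like navigation described above; following `inverse(...)` goes from the one end back to the many.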

Chen's paper introduced the idea of entities being dependent on binary relationships with others for both their identification and their existence:

Theoretically, any kind of relationship may be used to identify entities. For simplicity, we shall restrict ourselves to the use of only one kind of relationship: the binary relationships with 1:n mapping in which the existence of the n entities on one side of the relationship depends on the existence of one entity on the other side of the relationship. For example, one employee may have n ( = 0, 1, 2, . . .) dependants, and the existence of the dependants depends on the existence of the corresponding employee. This method of identification of entities by relationships with other entities can be applied recursively until the entities which can be identified by their own attribute values are reached. For example, the primary key of a department in a company may consist of the department number and the primary key of the division, which in turn consists of the division number and the name of the company.
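Chen's recursive identification can be sketched in a few lines of Python. The dictionary layout, the attribute names and the function `primary_key` are assumptions made for this illustration only; the example data follows the company, division and department case in the quotation.

```python
# Each entity type maps to (its own identifying attributes, the entity type
# it depends on for identification, or None if it identifies itself).
IDENTIFICATION = {
    "company": (["company_name"], None),
    "division": (["division_number"], "company"),
    "department": (["department_number"], "division"),
}


def primary_key(entity_type, model=IDENTIFICATION):
    """Collect key attributes by recursing up the identifying relationships
    until an entity type identified by its own attributes is reached."""
    own_attributes, identified_by = model[entity_type]
    inherited = primary_key(identified_by, model) if identified_by else []
    return inherited + own_attributes


department_key = primary_key("department")
```

Here `department_key` lists the company name, then the division number, then the department number, which is the composite key described in the quotation with the outermost identifier first.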

In many cases, software methodologies and supporting CASE tools introduced an intermediate step between the ER model and the relational model, naming the intermediary model the physical design model to contrast it with the logically descriptive model that precedes it. We follow this approach but, by a significant methodological improvement described in later sections, are able to eliminate the normalisation step.

Figure 8
Traditional methodology for relational data design includes a manual normalisation step

Following PCTE⁸, we use the term composition relationship for Chen's 'binary relationships with 1:n mapping in which the existence of the n entities on one side ... depends on the existence of one entity on the other side', and we use the term reference relationship for binary relationships which are neither composition relationships nor their inverses. We shall also describe the inverses of composition relationships as dependency relationships. A similar distinction had been made earlier by the designers of the CAIS⁹ specification, in which the two kinds of relationship were distinguished as primary and secondary; their rationale for the distinction was as follows¹⁰:

[Entities] and relationships may form a general graph or bowl of spaghetti. However, this raises various practical problems of deletion and garbage collection, long term naming, and unconnected sub-graphs. CAIS therefore designates certain relationships as primary (and all others as secondary) and requires that all [Entities] and primary relationships in the database form a single tree structure.

This distinction between composition and reference made by both CAIS and then PCTE served the goal of modelling computer file systems within a database framework, see figure 9 for example.

Figure 9
An ER model of a folder system, modelling the hierarchical structure as a recursive composition relationship and shortcuts as reference relationships.
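The practical points raised in the CAIS rationale (deletion, long-term naming) can be illustrated with a minimal Python sketch of such a folder system. The class `Folder`, the function `delete` and the policy of dropping shortcuts whose target has been deleted are assumptions made for this example; they are not taken from CAIS or PCTE.

```python
class Folder:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # inverse of the composition relationship
        self.children = []        # composition: these folders form a single tree
        self.shortcuts = []       # reference: may point anywhere in the tree
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """A stable name, obtained by following composition up to the root."""
        if self.parent is None:
            return "/"
        return self.parent.path().rstrip("/") + "/" + self.name

    def subtree(self):
        """This folder and everything composed beneath it (shortcuts ignored)."""
        yield self
        for child in self.children:
            yield from child.subtree()


def delete(folder, root):
    """Delete a non-root folder: composed descendants go with it, and any
    shortcut that would be left dangling is dropped (an assumed policy)."""
    doomed = set(folder.subtree())
    folder.parent.children.remove(folder)
    for survivor in root.subtree():
        survivor.shortcuts = [t for t in survivor.shortcuts if t not in doomed]


root = Folder("")
docs = Folder("docs", root)
reports = Folder("reports", docs)
home = Folder("home", root)
home.shortcuts.append(reports)    # a reference, not a second parent

paths_before = [f.path() for f in root.subtree()]
delete(docs, root)
paths_after = [f.path() for f in root.subtree()]
```

Because only composition contributes to `path` and `subtree`, every folder has exactly one name and deletion has a well-defined extent, whatever shortcuts exist.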

In this presentation we shall not assume that all composition relationships are identifying nor, vice-versa, that only composition relationships may be identifying. To depict ER-schemas we use a variant of the Barker-Ellis notation. Figure 10 is a meta-model of this notation — it is an ER schema describing ER schemas.

In cases where we wish to distinguish composition relationships from reference relationships, we draw the diagram top down: an anonymous root entity type (the 'absolute') is introduced at the top of the diagram; relationships leaving the lower edges of boxes are composition relationships, and they always meet the top edge of the box representing the dependent type; reference relationships meet boxes from one side or the other. We note that there is a structural resemblance to diagrams drawn by Bachman. To summarise, for composition relationships the crow's feet point down; at this point the notation converges with that of SSADM, for which one explanation says: 'there are no dead crows'. Our diagrams also have reference relationships, and for these the crow's feet point sideways (the crows, presumably, at rest). The entity types which have the fewest instances occur at the top of our diagrams whereas, in what seems an odd choice, they occur to the bottom right in the diagram style promoted in Barker's Entity Modelling book.

Figure 10
The logical ER meta-model. A simple version of the logical ER model of a logical ER model.

1 Chen, Peter Pin-Shan. The Entity-Relationship Model: Toward a Unified View of Data. ACM Trans. Database Syst., 1(1):9–36, March 1976.
2 Barker, Richard. Case Method: Entity Relationship Modelling. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1990.
3 We will return to this theme later, but it has been said that relational data models generated in this way will naturally tend to be well-formulated data models (i.e. to be in normal form). This is definitely not the case unless account is taken of reference scope constraints as described here in later sections.
4 Codd, E. F. A Relational Model of Data for Large Shared Data Banks. Commun. ACM, 13(6):377–387, June 1970.
5 Buneman, Peter and Frankel, Robert E. FQL: A Functional Query Language. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, SIGMOD '79, 52–58, New York, NY, USA, 1979, ACM; Shipman, David W. The Functional Data Model and the Data Languages DAPLEX. ACM Trans. Database Syst., 6(1):140–173, March 1981.
6 Bachman, Charles W. The Programmer As Navigator. Commun. ACM, 16(11):653–658, Nov. 1973.
7 Rock-Evans, Rosemary. An Introduction to Data and Activity Analysis. QED Information Sciences, Inc., Wellesley, MA, USA, 1989.
8 Boudier, Gerard and Gallo, Ferdinando and Minot, Regis and Thomas, Ian. An Overview of PCTE and PCTE+. In Proceedings of the Third ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, SDE 3, 248–257, New York, NY, USA, 1988, ACM.
9 Oberndorf, Patricia A. The Common Ada Programming Support Environment (APSE) Interface Set (CAIS). IEEE Trans. Software Eng., 14(6):742–748, 1988.
10 Munck, Robert and Oberndorf, Patricia and Ploedereder, Erhard and Thall, Richard. An Overview of DOD-STD-1838A (Proposed) the Common APSE Interface Set: Revision. In Proceedings of the Third ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, SDE 3, 235–247, New York, NY, USA, 1988, ACM.