Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Databases

GitHub last commit

A database is “a persistent organised store of related data”. Databasese are a way of collating data together, storing different values about many items.

TermsDefinitions
TableA collection of records
RecordOne row of a table
AttributeHeading of a column, a characteristic of an entity
FieldOne individual element in the table
EntityThe thing that the data is about in the real world
Primary keyA unique identifier for each record in a table. Often things like UIDs
Secondary keyA field that is unique but used for searching, rather than identification
Foreign keyAn attribute of an entity that links to another entity. Often times its a primary key from one table being used as a field in a second table

Databases will often contain multiple tables, where one entity correlates to one table.
Each record within the table will correlate to a single instance of that entity.

This can be visualised as this table representing an arbitrary entity:
Table: “Custom Name 1”

Attribute 1Attribute 2
FieldField

Flat file databases

Flat file databases are a type of database where all data is stored in one massive table.
FFDBs are considered inefficient, due to the fact they suffer from data duplication and result in large sets of unorganised data.

Data duplication can be dealt with using normalisation

Relational databases

Relational databases solve the problem of data duplication.
A relatioal database is where each entity has a separate table and relationships beteen the entities are modelled.

Relationships

There are different types of relationships:

  • One to one
  • One to many
  • Many to many

Many to many should be avoided where possible, since this is the main cause of data duplication in databases.
Relationships can be represented using an entity relationship diagram.

One to one relationships are represented with a single line.

_____          _____
|   | _________|   |
|   |          |   | 
-----          -----

One to many relationships are represented as:

_____           _____
|   |        /- |   |
|   | ______/___|   |
|   |       \   |   | 
-----        \- -----

many to many relationships are represented as

_____            _____
|   | -\       /- |   |
|   |___\_____/___|   |
|   |   /     \   |   | 
----- -/       \- -----

Relational database design

Splitting different entities into multiple databases and then using relationships to link them instead of creating one big table can help reduce the chance of data duplication.

Populating databaes

For a database to be useful, the data that is captured must be accurate and error-free.

Note

This spec is too old bro

Manual data capture

When the data collected is not automated, it has to be captured manually.
Manual data capture traditionally involved paper forms which were manually entered into the database.

Automated data capture

Data that is collected automatically is often done using Optical Character Recognition (OCR) or Optical Mark Recognition (OMR).
OCR/OMR uses specifically formatted forms to make it easier to recognise characters on a page.
The outputs from OCR/OMR can then be fed into the database.

Modern forms of ADC include things such as Google Forms, which can perform different validation techniques for us.

Validation and verification

When inputting data into the database, it’s important to ensure that the data is complete and accurate.

Validation is the act of ensuring that the data is of the correct type.

Verification is the act of ensuring that the data is correct and accurate.
This generally has to be done manually, due to how open ended the data can be.

Exchanging data

Databases can be accessed through an API in order to read/write to the database.
Often, this will be done by passing an API key along with the request, as this identifies you.
The API key will often have permissions, so you can only read databases, or access specific DBs.

Data can also be exchanged using many different formats.
The following formats are highly common:

  • CSV
  • JSON
  • TOML
  • XML
  • YAML

Normalisation

Normalisation a technique that can reduce data duplication within a database by reorganising the contents of a relational database.
Doing this will reduce redundancy and improve data integrity because there will be fewer places where the same data is stored.

All normalisation problems can be solved by subdividing the data into more smaller tables.

First Normal Form

First Normal Form (1NF) is the first stage of normalisation. At this point:

  • Each field contains a single atomic value
  • The value cannot reasonably be split any further
  • There are no repeated attributes (subject1, subject2, subject3, all as different headers in the same table)
  • Each row is uniquely identifiable (commonly with something like a UID)

Second Normal Form

Second Normal Form (2NF) is the second stage of normalisation. There are two rules:

  • The database must be in 1NF
  • All non key fields (the non unique values) are fully dependent on the primary key

A composite primary key is where a unique identifier is achieved by combining value from more than one column.
They work since while each key isn’t unique on their own, they can be combined to produce a unique key.

Third Normal Form

Third normal form (3NF) depends on:

  • Being in 2NF
  • It must have no non-key dependencies

If one column’s depends on the data stored within another column, such as a postcode in column 2 and a city in column 3, this is considered a dependency and therefore this DB is not in 3NF.

Handling and Interacting with Databases

To use a database, you need a way of interacting with databases. This is often done through a GUI or an API.

The interface must be able to perform CRUD actions:

  • Create
  • Retrive
  • Update
  • Delete

DBMS are necessary since they act as an abstraction/access layer over the raw database.

App1   App2    App3
 |      |       |
-------DBMS------
        ||
        ||
=====DATABASE====

Doing this means that multiple applications can access the data, while reducing the risk of data racing.
DBMS also acts as a way of ensuring consistent functionality.
Every app accesses an access layer instead of the raw DB, so they never have access to the raw data. This can improve the integrity of the database since a misbehaving app has to make its requests through the DBMS.

DBMS is also responsible for:

  • Backups and atomic transactions
  • Access permissions
  • Supporting queries through request languages like SQL
  • Enabling fearless concurrency
  • Enforcing relational integrity

Important

Atomic transactions are requests that are made as one singular request. There is only a success where it fully completes successfully or a fail where the entire request is rolled back.
Fearless concurrency is a term from the Rust community, where data can be mutated without race conditions or any other errors when multiple parties are accessing or modifying the same data.

Transactions

A transaction is any change made to a database.

Consider an order from an online marketplace (we’ll use Temu because its funny)
Making an order from Temu involves several transactions because we are making multiple requests to the database. A non exhaustive list of transactions could include:

  1. Logging into the website
  2. Fetching the product
  3. Purchasing the product

ACID

ACID is an acronym statig what a transaction should be able to do and what conditions it should meet.

Atomicity

Atomicity is the concept where each transaction must either fully complete or fully fail. In the case of a fail, all changes made during the transaction must be reverted to before the transaction was made.
Atomicity ensures that there are no half changes made which could lead to corrupted data.

Consistency

Consistency

Isolation

Durability

Record locking

Transactions must also utilise Record Locking in order to prevent multiple users/transactions from mutating data at the same time.
We cannot have multiple users accessing the data at the same time, so the DBMS must let one user go through the data at a time.

When a user is actively performing a transaction on a record in a database, it becomes locked, where no other transaction can be made to that data while the lock is in place. After the user is done, the lock is unlocked, and someone else can lock the data to mutate it.
When multiple users are trying to access the data at the same time, they must wait for their turn to lock the data.

There is also a risk of deadlocks or cyclical dependencies. The DBMS should manage this.

Note

If this concept is difficult, try to learn about it in the form of memory concurrency. Things such as Mutex locks etc.