Databases
A database is “a persistent organised store of related data”. Databasese are a way of collating data together, storing different values about many items.
| Terms | Definitions |
|---|---|
| Table | A collection of records |
| Record | One row of a table |
| Attribute | Heading of a column, a characteristic of an entity |
| Field | One individual element in the table |
| Entity | The thing that the data is about in the real world |
| Primary key | A unique identifier for each record in a table. Often things like UIDs |
| Secondary key | A field that is unique but used for searching, rather than identification |
| Foreign key | An attribute of an entity that links to another entity. Often times its a primary key from one table being used as a field in a second table |
Databases will often contain multiple tables, where one entity correlates to one table.
Each record within the table will correlate to a single instance of that entity.
This can be visualised as this table representing an arbitrary entity:
Table: “Custom Name 1”
| Attribute 1 | Attribute 2 |
|---|---|
| Field | Field |
Flat file databases
Flat file databases are a type of database where all data is stored in one massive table.
FFDBs are considered inefficient, due to the fact they suffer from data
duplication and result in large sets of unorganised data.
Data duplication can be dealt with using normalisation
Relational databases
Relational databases solve the problem of data duplication.
A relatioal database is where each entity has a separate table and relationships beteen the entities are modelled.
Relationships
There are different types of relationships:
- One to one
- One to many
- Many to many
Many to many should be avoided where possible, since this is the main cause of data duplication in databases.
Relationships can be represented using an entity relationship diagram.
One to one relationships are represented with a single line.
_____ _____
| | _________| |
| | | |
----- -----
One to many relationships are represented as:
_____ _____
| | /- | |
| | ______/___| |
| | \ | |
----- \- -----
many to many relationships are represented as
_____ _____
| | -\ /- | |
| |___\_____/___| |
| | / \ | |
----- -/ \- -----
Relational database design
Splitting different entities into multiple databases and then using relationships to link them instead of creating one big table can help reduce the chance of data duplication.
Populating databaes
For a database to be useful, the data that is captured must be accurate and error-free.
Note
This spec is too old bro
Manual data capture
When the data collected is not automated, it has to be captured manually.
Manual data capture traditionally involved paper forms which were manually entered into the database.
Automated data capture
Data that is collected automatically is often done using Optical Character Recognition (OCR) or Optical Mark Recognition (OMR).
OCR/OMR uses specifically formatted forms to make it easier to recognise characters on a page.
The outputs from OCR/OMR can then be fed into the database.
Modern forms of ADC include things such as Google Forms, which can perform different validation techniques for us.
Validation and verification
When inputting data into the database, it’s important to ensure that the data is complete and accurate.
Validation is the act of ensuring that the data is of the correct type.
Verification is the act of ensuring that the data is correct and accurate.
This generally has to be done manually, due to how open ended the data can be.
Exchanging data
Databases can be accessed through an API in order to read/write to the database.
Often, this will be done by passing an API key along with the request, as this identifies you.
The API key will often have permissions, so you can only read databases, or access specific DBs.
Data can also be exchanged using many different formats.
The following formats are highly common:
- CSV
- JSON
- TOML
- XML
- YAML
Normalisation
Normalisation a technique that can reduce data duplication within a database
by reorganising the contents of a relational database.
Doing this will reduce redundancy and improve data integrity because
there will be fewer places where the same data is stored.
All normalisation problems can be solved by subdividing the data into more smaller tables.
First Normal Form
First Normal Form (1NF) is the first stage of normalisation. At this point:
- Each field contains a single atomic value
- The value cannot reasonably be split any further
- There are no repeated attributes (subject1, subject2, subject3, all as different headers in the same table)
- Each row is uniquely identifiable (commonly with something like a UID)
Second Normal Form
Second Normal Form (2NF) is the second stage of normalisation. There are two rules:
- The database must be in 1NF
- All non key fields (the non unique values) are fully dependent on the primary key
A composite primary key is where a unique identifier is achieved by combining value from more than one column.
They work since while each key isn’t unique on their own, they can be combined to produce a unique key.
Third Normal Form
Third normal form (3NF) depends on:
- Being in 2NF
- It must have no non-key dependencies
If one column’s depends on the data stored within another column, such as a postcode in column 2 and a city in column 3, this is considered a dependency and therefore this DB is not in 3NF.
Handling and Interacting with Databases
To use a database, you need a way of interacting with databases. This is often done through a GUI or an API.
The interface must be able to perform CRUD actions:
- Create
- Retrive
- Update
- Delete
DBMS are necessary since they act as an abstraction/access layer over the raw database.
App1 App2 App3
| | |
-------DBMS------
||
||
=====DATABASE====
Doing this means that multiple applications can access the data, while reducing the risk of data racing.
DBMS also acts as a way of ensuring consistent functionality.
Every app accesses an access layer instead of the raw DB, so they never have access to the raw data. This can improve the integrity of the database since a misbehaving app has to make its requests through the DBMS.
DBMS is also responsible for:
- Backups and atomic transactions
- Access permissions
- Supporting queries through request languages like SQL
- Enabling fearless concurrency
- Enforcing relational integrity
Important
Atomic transactions are requests that are made as one singular request. There is only a success where it fully completes successfully or a fail where the entire request is rolled back.
Fearless concurrency is a term from the Rust community, where data can be mutated without race conditions or any other errors when multiple parties are accessing or modifying the same data.
Transactions
A transaction is any change made to a database.
Consider an order from an online marketplace (we’ll use Temu because its funny)
Making an order from Temu involves several transactions because we are making multiple requests to the database. A non exhaustive list of transactions could include:
- Logging into the website
- Fetching the product
- Purchasing the product
ACID
ACID is an acronym statig what a transaction should be able to do and what conditions it should meet.
Atomicity
Atomicity is the concept where each transaction must either fully complete or fully fail. In the case of a fail, all changes made during the transaction must be reverted to before the transaction was made.
Atomicity ensures that there are no half changes made which could lead to corrupted data.
Consistency
Consistency
Isolation
Durability
Record locking
Transactions must also utilise Record Locking in order to prevent multiple users/transactions from mutating data at the same time.
We cannot have multiple users accessing the data at the same time, so the DBMS must let one user go through the data at a time.
When a user is actively performing a transaction on a record in a database, it becomes locked, where no other transaction can be made to that data while the lock is in place. After the user is done, the lock is unlocked, and someone else can lock the data to mutate it.
When multiple users are trying to access the data at the same time, they must wait for their turn to lock the data.
There is also a risk of deadlocks or cyclical dependencies. The DBMS should manage this.
Note
If this concept is difficult, try to learn about it in the form of memory concurrency. Things such as Mutex locks etc.