Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

1.3.1 Compression, Encryption and Hashing

GitHub last commit

Compression is the act of reducing the size of data through different encoding techniques in order to represent the same information with less bits.

Uses of compression

Compression is useful for several reasons:

  • Transferring compressed files across a network results in less data transferred overall
  • Transferring compressed files across a network results in faster transfers
  • Compressing files reduces size usage on the filesystem

Types of compression

While there are many uses of compression, there are also many ways to compress a file. These methods are grouped into 2 categories.

Lossy compression

Lossy compression is where data that is determined to be unnecessary is permanently stripped from the file.
This results in greater compression ratioes (lower file size), but a longer compression time.

This is most common in media files, where a lot of data can be removed, while still maintaining most of the quality of the original media. While some users may be able to perceive the loss of data, the aim is for it to be a minimal loss in quality, or completely negligible.

Formatwhy
MPEGscertain pixel data might be omitted if it doesn’t change often
JPGsThings like plain backgrounds can have some pixels stripped of their data if its just a blob of a single colour

Lossless compression

Lossless compression is where the data is compressed, while ensuring that the original file can be extracted from the compressed file. This involves rearranging the data into some form that is more efficient.
Lossless compression results in significantly less compression ratios than lossy, as it must be able to recover the original file.

This compression type will analyse the sequenceing for the data within the file to find some kind of repetition that can be reproduced in a shorter way. The two main algorithms are:

  • Run length encoding
  • Dictionary encoding

Run length encoding is where frequently repeated data sequentially is converted into a singular phrase that can be repeated.

Consider the following sequence:

QUUIURWWWWWWIOOOOOOPIUUEW
Some long chains of text can be compressed to reduce the file size.

QUUIUR$6WI$6O$PIUUEW
This takes up less space, while still being able to represent the same data.

You can also recursively apply this.
Consider the following sequence:

UUUUU_UUUUU_UUUUU_UUUUU_A

This can be compressed to:

$4($5(U)_)A

The repeat is repeated, and short sequences that are repeated are also able to be compressed in this manner.

Dictionary encoding is another lossless technique where instead of replacing sequences with repetition, it creates a dictionary and substitutes the repetants (idk) for a reference to the dictionary entry that contains that word or sequence.

Consider the sentence:
Rust is my favourite language ever. Rust is very Rusty and I love to utilise Rust's ecosystem. Rust in general is the best programming language ever! Rust is way better than C++!

A dictionary could be created that contains the word Rust.
The above sentence could become:
$1 is my favourite language ever. $1 is very $1y and I love to utilise $1's ecosystem. $1 in general is the best programming language ever! $1 is way better than C++!

Dictionary encoding relies on repetition across the file, and sometimes will result in increased file sizes depending on the amount of repetition in the file.

Encryption

Encryption is the process of converting plaintext into ciphertext to prevent data from being interpreted/understood.
Modern encryption makes use of complicated ciphers (algorithms) in order to prevent the ciphertext from being interpreted by third parties.

Cipher strength is the number of possibilities for how a message can be encrypted. This is the metric used to determine how resistant an encryption algorithm is to brute force attacks.
Generally, the strength of a cipher will depend on the length of the key, where more options for keys means that it would be harder to pick the correct the right key.

Cryptanalysis is where a code is being broken without the key being known. Cryptanalysis generally utilises a brute force approach, however it is not completely uncommon for exploits to be found within algorithms.

Symmetric key encryption

Symmetrical key encryption is where both parties

Asymmetric key encryption

Asymmetric encryption is where 2 different keys are generated, known as the public and private key. The public key can be distributed and does not have to be kept secret. The private key must be kept to the individual.
Data encrypted by the public key can only be decrypted by the private key. Since only one key (the private key) can decrypt and it is never shared, it’s significantly more secure than symmetric keys.

The private key can also generate the public key, however the private key is almost completely impossible to be derived from the public key.

Note

Asymmetric key encryption incredibly common, particularly for thing such as digital signatures. This will be extended upon in hashing

Hashing

Hashing is where an algorithm is applied to data to mathematically produce a fixed length string based on a set of given data. It is a one way algorithm, and the original data cannot be recovered from the hash.

Important

Hashing and encryption both produce unreadable strings, however encryption is a reversible process while hashing isn’t

Hashing can be used to ensure data integrity, as if the data changes in any way, no matter how small, the hash will also change.

Good hashing algorithms will:

  • Have a sufficient hash length
  • Be able to produce a fixed hash length given any sized input
  • Low chance of collision
  • Be relatively fast/efficient
  • Irreversible

Hashing in authentication

Hashing is generally utilised in cases where we want to compare sensitive information to each other. For example, instead of comparing rawtext passwords together, we can instead compare the hashes of two passwords.

Doing this means that on account creation, the server can instead store the hash of the password, rather than the password itself. This has the benefit of protecting the password in case of a server breach, where hackers would only get the hash of the password, which is irreversible and mostly useless.
When the user logs in, the hash of their input will be compared with the password hash, therefore the rawtext password is only visible during account creation, and the user’s device when they log in.

In addition to storing the hash on the server, the data can also be salted before it is hashed to further protect the original password.
Salting is where additional random data is added ontop of the password to change the hash. This means that even if the same passsword is used, the added salt will change the final hash.

The salt is stored on the server with the hash so that it can be added to the password when the user tries to log in.

Hashing in data structures

Hashes can speed up access to records within data structures.
Instead of searching through large databases, we can instead hash our input to produce the numerical index of the data, therefore allowing us to access the data in O(1) time.

Digital signatures

Digital signatures are a method of ensuring authenticity.

If a file is digitally signed, there will be additional data on the end of the file that is essentially a file hash, produced by the sender’s private key. The sender will encrypt the hash, producing an encrypted message digest. This message digest is the signature.

This signature can then be decrypted by the sender’s public key, proving that the original data came from the sender’s private key. To further ensure integrity, the decrypted file hash can then be compared against the sent file to ensure that the file and sender are both correct.

The process of creating a signature is:

  1. The file is hashed using a standard hashing algorithm like SHA256
  2. The hash of the file is encrypted with the sender’s private key, producing the encrypted message digest (digital signature)
  3. The digital signature is sent along with the file, or appended to the file
  4. The recipient can then use the sender’s public key to decrypt the digital signature, outputting the file’s hash.
  5. The file can then be compared to the file hash, therefore confirming the file and the sender’s integrity.

In the case of the wrong digital signature being decrypted by what is assumed to be the correct public key, the attempt at decrypting will fail, or produce a garbage output. Therefore, the sender cannot be confirmed and the sent data should be treated with caution.