If you’re going to be compressing and encrypting some data, you should do the compression first. Why? There are several reasons:
- Compressing it last won’t reduce the file size much. Good encryption should make any input data (especially redundant data) appear random. But compression works by removing redundancy, and doesn’t work well on random data. You can see a good example of this here, where encrypting a file and then compressing it actually made it larger than the original!
- Compressing first should decrease the effectiveness of some attacks. Compression works by reducing the redundancy in the data, and a common cryptanalysis method, frequency analysis, relies on finding repeated data. With less redundancy left to find, frequency analysis should be less effective.
- Brute force attacks will take longer. A brute force attack tries many keys, decrypts the data with each one, and checks whether the output makes any sense. If you compress first, an attacker must decrypt the data and then decompress it before checking the output. This takes much longer, and an attacker who doesn’t know you’re compressing the data at all might never break the encryption.
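To make the first point concrete, here is a minimal Python sketch. It makes one big assumption: instead of running a real cipher, it uses `os.urandom` to produce random bytes as a stand-in for ciphertext, on the grounds that good encryption output should look random. The sample sentence and variable names are made up for illustration.

```python
import os
import zlib

# Redundant plaintext: the same sentence repeated many times.
plaintext = b"Compression works by removing redundancy. " * 500

# Stand-in for ciphertext (assumption): random bytes of the same length,
# since the output of good encryption should look random.
fake_ciphertext = os.urandom(len(plaintext))

compressed_plain = zlib.compress(plaintext, 9)
compressed_cipher = zlib.compress(fake_ciphertext, 9)

print("original size:          ", len(plaintext))
print("compressed plaintext:   ", len(compressed_plain))
print("compressed 'ciphertext':", len(compressed_cipher))
```

The redundant plaintext should shrink to a small fraction of its size, while the random bytes should come out slightly larger than they went in, because zlib still has to add its own header and bookkeeping.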
I wanted to see how effective the third point was, so I wrote a Python script that encrypted a short message and used a brute force attack to break it. Then I repeated the experiment, but compressed it using gzip before encrypting it. Here’s how long it took on average, in seconds, to guess a single password:
| Password length | Zipped (s) | Not zipped (s) |
As you can see, the message that was compressed before being encrypted took about 9 times as long to break.
Details: I chose a short message, “a message”, to encrypt. Because gzip is a block compression algorithm, an attacker only needs to decompress the first bytes rather than the whole file, so I kept the message short to simulate this. I used 128-bit AES with a password made up only of lowercase letters. In each iteration, a random password was chosen and both the zipped and not-zipped versions were tested. The test was run 1000 times for each password length.
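A rough sketch of this kind of experiment, not the original script, might look like the following. Python's standard library has no AES, so this version swaps in a toy XOR cipher with a SHA-256-derived keystream (`toy_cipher`, a hypothetical helper that is not secure). The "makes sense" check, the fixed two-letter password, and all function names are also assumptions made for illustration.

```python
import gzip
import hashlib
import itertools
import string
import time

MESSAGE = b"a message"
ALPHABET = string.ascii_lowercase
ALLOWED = set((ALPHABET + " ").encode())  # bytes we accept as "sensible" text


def toy_cipher(data: bytes, password: str) -> bytes:
    """XOR data with a SHA-256-derived keystream.

    Stand-in for AES (assumption). NOT secure. Applying it twice with the
    same password returns the original data.
    """
    stream = b""
    counter = 0
    while len(stream) < len(data):
        block = password.encode() + counter.to_bytes(4, "big")
        stream += hashlib.sha256(block).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def looks_like_text(data: bytes) -> bool:
    """Crude 'does the output make sense' check: lowercase letters and spaces."""
    return len(data) > 0 and all(b in ALLOWED for b in data)


def brute_force(ciphertext: bytes, length: int, zipped: bool):
    """Try every lowercase password of the given length.

    Returns (password or None, average seconds per guess).
    """
    start = time.perf_counter()
    guesses = 0
    for candidate in itertools.product(ALPHABET, repeat=length):
        password = "".join(candidate)
        guesses += 1
        plain = toy_cipher(ciphertext, password)
        if zipped:
            try:
                plain = gzip.decompress(plain)
            except Exception:
                # Any decompression failure means this key is wrong.
                continue
        if looks_like_text(plain):
            return password, (time.perf_counter() - start) / guesses
    return None, (time.perf_counter() - start) / guesses


secret = "qz"  # fixed here for repeatability; the experiment used random passwords
plain_ct = toy_cipher(MESSAGE, secret)
zipped_ct = toy_cipher(gzip.compress(MESSAGE), secret)

found_plain, per_guess_plain = brute_force(plain_ct, len(secret), zipped=False)
found_zipped, per_guess_zipped = brute_force(zipped_ct, len(secret), zipped=True)

print("not zipped:", found_plain, f"{per_guess_plain:.2e} s per guess")
print("zipped:    ", found_zipped, f"{per_guess_zipped:.2e} s per guess")
```

The timings this prints depend on the machine and on the toy cipher, so they are not expected to match the ratio reported above. The sketch only shows where the extra decompression step sits in the attacker's loop.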