Base 64 Encoding Implementation in C++
This tutorial will discuss encoding in base_64 in C++.
First, we will discuss base_64 encoding and why and where it is required. Later, we will discuss encoding/decoding base_64 in C++.
Encoding Scheme Base_64
Base_64 is a binary-to-text encoding scheme: it represents binary data as an ASCII string.
What sets base_64 apart from other such schemes is that it translates the data into radix 64. The name base_64 comes from the mathematical definition of bases.
The base is the number of basic digits in a number system. For example, base_2 has only 2 basic digits, 0 and 1.
Base_8, the octal number system, has 8 basic digits, 0 to 7.
Similarly, base_16 has 16 basic digits with values 0 to 15, where we use A to F to represent 10 to 15. In base_64, there are 64 basic digits consisting of:
- 26 capital letters, A to Z
- 26 small letters, a to z
- 10 digits, 0 to 9
- 2 signs, + and /
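As a quick illustration of this digit set, the sketch below stores the 64 characters in index order and looks up a digit by value and a value by digit. The names `kAlphabet`, `digit_for`, and `value_of` are illustrative only and do not come from any library.

```cpp
#include <cassert>
#include <string>

// The 64 digits in index order:
// A-Z (0 to 25), a-z (26 to 51), 0-9 (52 to 61), + (62), / (63).
const std::string kAlphabet =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Map a 6-bit value (0 to 63) to its digit.
char digit_for(unsigned value) {
  assert(value < 64);
  return kAlphabet[value];
}

// Map a digit back to its value, or return -1 if the character
// is not one of the 64 digits.
int value_of(char digit) {
  std::string::size_type pos = kAlphabet.find(digit);
  return pos == std::string::npos ? -1 : static_cast<int>(pos);
}
```

With this ordering, the value 0 maps to A, 26 to a, 52 to 0, and 63 to /.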
Base_64 encoding is commonly used to transfer data over media that are designed to deal with ASCII text. It helps keep the data intact while it is transmitted over such media.
The main applications are email via MIME and storing complex data in XML. Base_64 is also used by the Privacy-Enhanced Mail (PEM) format.
Encoding Base_64 Steps
For encoding, we have the data as a binary string and operate on each character in the string. We perform the following steps to encode in base_64.
- Take the ASCII value of each character.
- Find the 8-bit binary of the ASCII values.
- Convert the 8-bit groups (obtained in step 2) into 6-bit groups by rearranging the bits (this requires some bit operations, discussed later).
- Convert the 6-bit binaries to their corresponding decimal values.
- Using the basic digits of base_64 (already discussed), assign the respective base_64 character to each decimal value.
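The five steps can also be followed literally by writing the bits out as text. The following is only a sketch: the helper name `encode_group_via_bits` is hypothetical, it handles exactly one full group of 3 characters, and it uses `std::bitset` for the binary conversions, trading speed for readability.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

const std::string kDigits =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode exactly 3 characters by following the steps one at a time.
std::string encode_group_via_bits(const std::string& group) {
  assert(group.size() == 3);
  std::string bits;
  for (unsigned char c : group) {
    // Steps 1 and 2: character code -> 8-bit binary text.
    bits += std::bitset<8>(c).to_string();
  }
  std::string out;
  for (std::size_t pos = 0; pos < bits.size(); pos += 6) {
    // Step 3: take the next 6 bits. Step 4: read them as a number.
    unsigned long value = std::bitset<6>(bits.substr(pos, 6)).to_ulong();
    // Step 5: pick the digit with that index.
    out += kDigits[value];
  }
  return out;
}
```

For instance, `encode_group_via_bits("Man")` yields `TWFu`. A practical encoder works on the bytes directly with masks and shifts instead of building text.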
Here, we will discuss the details of step 3, which is the conversion from 8-bit groups to 6-bit groups.
Procedure to Convert 8-Bit Groups to 6-Bit Groups
In base_64 encoding, as discussed at the start, we have 64 primary characters/digits, whereas we usually read and write data in bytes. 1 byte has 8 bits and can store 0 to 255, meaning a byte can represent 256 unique values.
6 bits can represent only 64 unique values, so each output byte stores 1 digit/character of the base_64 encoding scheme in its lower 6 bits and keeps the last (highest) 2 bits 0.
Each character/ASCII value takes 8 bits. Because 2 bits of every output byte are unused, the encoded data requires more storage than the original data.
For base_64 encoding, we must regroup the 8-bit values into 6-bit values without any data loss.
If we take the LCM of 8 and 6, we get 24. 3 bytes have 24 bits, but if we use only 6 out of 8 bits (the last 2 bits are not used), we need 4 bytes for 24 bits. Hence, without any data loss, we can convert every 3 groups of 8 bits into 4 groups of 6 bits.
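The arithmetic behind the 3-to-4 grouping can be checked by the compiler in a few lines. This sketch assumes a C++17 compiler for `std::lcm`; the constant names are illustrative.

```cpp
#include <numeric>

constexpr int kBitsPerByte = 8;   // bits in one input byte
constexpr int kBitsPerDigit = 6;  // bits carried by one base-64 digit

// Smallest number of bits that splits evenly into 8-bit and 6-bit groups.
constexpr int kGroupBits = std::lcm(kBitsPerByte, kBitsPerDigit);
constexpr int kBytesPerGroup = kGroupBits / kBitsPerByte;    // input bytes
constexpr int kDigitsPerGroup = kGroupBits / kBitsPerDigit;  // output digits

static_assert(kGroupBits == 24, "one group holds 24 bits");
static_assert(kBytesPerGroup == 3, "3 input bytes per group");
static_assert(kDigitsPerGroup == 4, "4 output digits per group");
```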
The first step is to group the data into sets of 3 bytes. If the last group has fewer bytes, the group is completed by adding bytes with a 0 value.
Next, each set of 3 bytes is regrouped into 4 bytes using the following operations. Let’s consider the set of 3 bytes as t1, t2 & t3 and the set of 4 bytes as f1, f2, f3 & f4.
f1 = ( t1 & 0xfc ) >> 2
Consider the mask 0xfc (equivalent to binary 11111100). We apply a bit-wise AND operation between the first byte of the set and the mask, then shift the result right by 2 bits.
The shift operation moves the left 6 bits to the right, and the last (leftmost) 2 bits become 0.
The mask 0xfc has its first (rightmost) 2 bits 0, so the AND operation clears the first 2 bits of the first byte of the set, which means only the last 6 bits of the first byte are considered here. The first 2 bits (ignored in this operation) are handled in the next operation.
f2 = ( ( t1 & 0x03 ) << 4 ) + ( ( t2 & 0xf0 ) >> 4 )
Here the mask 0x03 (binary 00000011) is applied to the first byte with an AND operation, which means only the first 2 bits are considered; the last 6 bits were already used in the previous operation. The left shift by 4 moves these 2 bits of the first byte to the left, making them the fifth & sixth bits of the expression.
The mask 0xf0 (binary 11110000) is applied to the second byte with an AND operation, which means only the last 4 bits are considered. The right shift by 4 moves these 4 bits to the right to make them the first 4 bits of the expression.
The first part of the expression can have only the fifth & sixth bits on, and the second part only the first 4 bits, so collectively only the first 6 bits can be on and the last 2 bits are off.
Finally, we add the two parts to get a byte with the last 2 bits off. In this step, we have obtained another 6-bit value; the first byte is now completely used, and so are the last 4 bits of the second byte.
f3 = ( ( t2 & 0x0f ) << 2 ) + ( ( t3 & 0xc0 ) >> 6 )
The mask 0x0f (binary 00001111) is applied to the second byte with an AND operation, which means only the first 4 bits are considered. The left shift by 2 moves these 4 bits to the left to make them the third, fourth, fifth, and sixth bits of the expression and leaves room for the first 2 bits.
Next, the mask 0xc0 (binary 11000000) is applied to the third byte with an AND operation, which means only the last 2 bits are considered. The right shift by 6 moves these 2 bits to the right to make them the first & second bits of the expression.
Finally, both results are combined to get the third 6-bit value. At this point, we have used all of the second byte of the set and 2 bits of the third byte.
f4 = t3 & 0x3f
Finally, the third byte needs only an AND operation, where the mask 0x3f (binary 00111111) has the first 6 bits on and the last 2 off. The operation keeps the remaining 6 bits of the third byte.
We have already discussed the 64 basic digits used in base_64. In the next step, each byte from the set of 4 bytes (obtained using bit operations) is converted into its base_64 character and concatenated into a string.
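The four expressions for f1 to f4 can be collected into one small function. This is a minimal sketch; the name `split_group` and the alias `Byte` are assumptions made for illustration and are not part of the tutorial's own code.

```cpp
#include <array>
#include <cassert>

using Byte = unsigned char;

// Split three 8-bit bytes into four 6-bit values with the masks
// 0xfc, 0x03, 0xf0, 0x0f, 0xc0 and 0x3f.
std::array<Byte, 4> split_group(Byte t1, Byte t2, Byte t3) {
  std::array<Byte, 4> f{};
  // f1: upper 6 bits of t1.
  f[0] = static_cast<Byte>((t1 & 0xfc) >> 2);
  // f2: lower 2 bits of t1, then upper 4 bits of t2.
  f[1] = static_cast<Byte>(((t1 & 0x03) << 4) + ((t2 & 0xf0) >> 4));
  // f3: lower 4 bits of t2, then upper 2 bits of t3.
  f[2] = static_cast<Byte>(((t2 & 0x0f) << 2) + ((t3 & 0xc0) >> 6));
  // f4: lower 6 bits of t3.
  f[3] = static_cast<Byte>(t3 & 0x3f);
  for (Byte v : f) assert(v < 64);  // every value fits in 6 bits
  return f;
}
```

As a check, the bytes of `Man` (77, 97, 110) split into 19, 22, 5 and 46, which are the indices of the characters T, W, F and u.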
Let’s encode the word PLAY. In the first step, we make sets of 3 characters each. The first set is PLA.
The second set is Y\0\0. Here, \0 is a null character added to complete the set.
The ASCII values of these characters are 80 76 65 89 0 0. The corresponding binary values are 01010000 01001100 01000001 01011001 00000000 00000000.
Now let’s do the bit operations on the first set.
f1 = (01010000 & 11111100) >> 2 = 01010000 >> 2 = 00010100 = 20
(01010000 & 00000011) << 4 = 00000000 << 4 = 00000000, first part of the expression
(01001100 & 11110000) >> 4 = 01000000 >> 4 = 00000100, second part of the expression
f2 = 00000000 + 00000100 = 00000100 = 4, adding the results of the first and second parts
(01001100 & 00001111) << 2 = 00001100 << 2 = 00110000, first part of the expression
(01000001 & 11000000) >> 6 = 01000000 >> 6 = 00000001, second part of the expression
f3 = 00110000 + 00000001 = 00110001 = 49, adding the results of the first and second parts
f4 = 01000001 & 00111111 = 00000001 = 1
Now repeat the operations on the next set, where the second and third values are 0; therefore, the results will be:
f1 = 00010110 = 22
f2 = 00010000 = 16
f3 = 0
f4 = 0
Next, we have to convert these values into base_64. Also, we have to place sentinel/special characters in the last 2 bytes so that the decoding process can recognize them and decode accordingly.
In the first set, we have f1 = 20, f2 = 4, f3 = 49 & f4 = 1. The corresponding base_64 characters are UExB.
In the next set, we have f1 = 22, f2 = 16, f3 = 0 & f4 = 0. Here, the corresponding base_64 characters are WQ^^, where caret signs are used as special characters, so collectively the string is UExBWQ^^.
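The PLAY example can be checked mechanically. The sketch below is not the tutorial's encoder: it packs each group into a single 24-bit integer instead of using the individual masks, and the name `encode_with_caret` is hypothetical. It uses the caret as the filler character, as this tutorial does (standard Base64 uses = instead).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

const std::string kDigits =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode text, filling a short final group with '^'.
std::string encode_with_caret(const std::string& text) {
  assert(kDigits.size() == 64);
  std::string out;
  for (std::size_t i = 0; i < text.size(); i += 3) {
    // Number of real bytes in this group (1, 2 or 3).
    std::size_t n = text.size() - i < 3 ? text.size() - i : 3;
    std::uint32_t group = 0;
    for (std::size_t k = 0; k < 3; ++k) {
      // Missing bytes are treated as 0.
      unsigned char byte =
          k < n ? static_cast<unsigned char>(text[i + k]) : 0;
      group = (group << 8) | byte;
    }
    for (std::size_t k = 0; k < 4; ++k) {
      // Digit k holds bits 23-18, 17-12, 11-6 or 5-0 of the group.
      unsigned value = (group >> (18 - 6 * k)) & 0x3f;
      // n input bytes produce n + 1 digits; the rest is filler.
      out += k <= n ? kDigits[value] : '^';
    }
  }
  return out;
}
```

Under these assumptions, `encode_with_caret("PLAY")` returns `UExBWQ^^`.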
The decoding process is simply the reverse process; you can quickly get both methods from the C++ code below.
Base_64 Encoding Implementation in C++
Encoding in C++ is a straightforward process. We can quickly implement the steps we have discussed.
We will discuss the code in phases, and finally, we will give the complete code with 2 examples.
First, for the base_64 conversion, we define a constant string with the basic digits/characters of base_64.
const string base64_chars =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
Before discussing the encoding and decoding functions, we need some definitions. Primarily, these are the masks used in encoding and decoding.
6 of these masks were already discussed while explaining the conversion from 8-bit groups to 6-bit groups.
Some of these masks are also used in the decoding process to convert from 6-bit groups to 8-bit groups, and 2 more masks are required there. In total, we have 8 masks.
typedef unsigned char UC;
typedef unsigned int UI;
#define EXTRA '^'
#define MASK1 0xfc
#define MASK2 0x03
#define MASK3 0xf0
#define MASK4 0x0f
#define MASK5 0xc0
#define MASK6 0x3f
#define MASK7 0x30
#define MASK8 0x3c
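As a side note, the same 8 masks could be written as typed constants, which lets the compiler check how they fit together. This is only a sketch of an alternative to the `#define` lines; the `kMask` names are hypothetical, and the checks assume the pairing of masks used in the encoding expressions and in the decode function.

```cpp
constexpr unsigned kMask1 = 0xfc;  // 11111100
constexpr unsigned kMask2 = 0x03;  // 00000011
constexpr unsigned kMask3 = 0xf0;  // 11110000
constexpr unsigned kMask4 = 0x0f;  // 00001111
constexpr unsigned kMask5 = 0xc0;  // 11000000
constexpr unsigned kMask6 = 0x3f;  // 00111111
constexpr unsigned kMask7 = 0x30;  // 00110000
constexpr unsigned kMask8 = 0x3c;  // 00111100

// Each encoding pair splits an 8-bit byte into two non-overlapping parts.
static_assert((kMask1 | kMask2) == 0xff && (kMask1 & kMask2) == 0, "t1");
static_assert((kMask3 | kMask4) == 0xff && (kMask3 & kMask4) == 0, "t2");
static_assert((kMask5 | kMask6) == 0xff && (kMask5 & kMask6) == 0, "t3");

// Each decoding pair splits a 6-bit value into two non-overlapping parts.
static_assert((kMask7 | kMask4) == 0x3f && (kMask7 & kMask4) == 0, "f2");
static_assert((kMask8 | kMask2) == 0x3f && (kMask8 & kMask2) == 0, "f3");
```

If a mask were mistyped, one of the `static_assert` lines would fail at compile time.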
Next, consider the encoding function. We make sets of 3 characters.
Then we convert each set into a group of 4 characters with the bit operations already discussed in detail. Finally, we convert each byte of the group of 4 into its base_64 character and concatenate them to create the encoded string.
Here is the code:
string encode_base64(UC const* buf, UI bufLen) {
  string encoded = "";
  UI i = 0, j = 0, k = 0;
  UC temp_a_3[3], temp_4[4];
  for (i = 0; i < bufLen; i += 3) {
    // Copy up to 3 input bytes into the current set.
    for (j = i, k = 0; j < bufLen && j < i + 3; j++) temp_a_3[k++] = *(buf++);
    // Complete a short last set with 0 bytes.
    for (; k < 3; k++) temp_a_3[k] = '\0';
    // Regroup 3 bytes of 8 bits into 4 values of 6 bits.
    temp_4[0] = (temp_a_3[0] & MASK1) >> 2;
    temp_4[1] = ((temp_a_3[0] & MASK2) << 4) + ((temp_a_3[1] & MASK3) >> 4);
    temp_4[2] = ((temp_a_3[1] & MASK4) << 2) + ((temp_a_3[2] & MASK5) >> 6);
    temp_4[3] = temp_a_3[2] & MASK6;
    // n input bytes produce n + 1 base_64 characters.
    for (j = i, k = 0; j < bufLen + 1 && j < i + 4; j++, k++)
      encoded += base64_chars[temp_4[k]];
    // Fill the rest of the group with the sentinel character.
    for (; k < 4; k++) encoded += EXTRA;
  }
  return encoded;
}
The function takes 2 parameters: the first is the raw data to encode, and the second is the length of the data. We declare 2 arrays of sizes 3 and 4. Inside the loop, we store the data in the first array of size 3.
Next, if the last set has fewer bytes, we add null characters to complete it. Then we have 4 statements converting the 8-bit data into 6-bit values by bit operations.
Lastly, in the second-to-last loop, we convert the group of 4 6-bit values into base_64 characters.
The last loop appends extra characters to complete a set of 4 bytes. Next, we have the decode function.
vector<UC> decode_base64(string const& encoded) {
  UI i = 0, j = 0, k = 0, in_len = encoded.size();
  UC temp_a_3[3], temp_4[4];
  vector<UC> decoded;
  for (i = 0; i < in_len; i += 4) {
    // Replace each character by its index in the base_64 character set.
    for (j = i, k = 0; j < i + 4 && encoded[j] != EXTRA; j++)
      temp_4[k++] = base64_chars.find(encoded[j]);
    // Use 0 for the positions filled with the sentinel character.
    for (; k < 4; k++) temp_4[k] = '\0';
    // Regroup 4 values of 6 bits into 3 bytes of 8 bits.
    temp_a_3[0] = (temp_4[0] << 2) + ((temp_4[1] & MASK7) >> 4);
    temp_a_3[1] = ((temp_4[1] & MASK4) << 4) + ((temp_4[2] & MASK8) >> 2);
    temp_a_3[2] = ((temp_4[2] & MASK2) << 6) + temp_4[3];
    // Keep only the bytes that carry data: stop at the first sentinel.
    for (j = i, k = 0; k < 3 && encoded[j + 1] != EXTRA; j++, k++)
      decoded.push_back(temp_a_3[k]);
  }
  return decoded;
}
This function takes the encoded message and does the reverse operation, which includes the following steps.
- Get the index of each character in the base_64 character set and make a set of 4 bytes.
- Use 0 in place of the special characters stored by the encoding process.
- Next, convert the set of 4 bytes into a set of 3 bytes by reverse bit operations (we are not going into the details of these operations here).
- Finally, append each set of 3 bytes to get the complete decoded message.
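For readers who want to see the reverse bit operations in isolation, here is a minimal sketch. The name `join_group` and the alias `Byte` are illustrative assumptions; the masks 0x30, 0x0f, 0x3c and 0x03 are the ones the decode function uses.

```cpp
#include <array>
#include <cassert>

using Byte = unsigned char;

// Rebuild three 8-bit bytes from four 6-bit values (each 0 to 63).
std::array<Byte, 3> join_group(Byte f1, Byte f2, Byte f3, Byte f4) {
  assert(f1 < 64 && f2 < 64 && f3 < 64 && f4 < 64);
  std::array<Byte, 3> t{};
  // t1: all 6 bits of f1, then the upper 2 bits of f2.
  t[0] = static_cast<Byte>((f1 << 2) + ((f2 & 0x30) >> 4));
  // t2: lower 4 bits of f2, then the upper 4 bits of f3.
  t[1] = static_cast<Byte>(((f2 & 0x0f) << 4) + ((f3 & 0x3c) >> 2));
  // t3: lower 2 bits of f3, then all 6 bits of f4.
  t[2] = static_cast<Byte>(((f3 & 0x03) << 6) + f4);
  return t;
}
```

For example, the values 1, 1, 1 and 0 (the characters BBBA) join back into the bytes 4, 16 and 64.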
Finally, here is the complete code with 2 examples of encoding and decoding.
#include <iostream>
#include <string>
#include <vector>
using namespace std;
typedef unsigned char UC;
typedef unsigned int UI;
#define EXTRA '^'
#define MASK1 0xfc
#define MASK2 0x03
#define MASK3 0xf0
#define MASK4 0x0f
#define MASK5 0xc0
#define MASK6 0x3f
#define MASK7 0x30
#define MASK8 0x3c
const string base64_chars =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

string encode_base64(UC const* buf, UI bufLen) {
  string encoded = "";
  UI i = 0, j = 0, k = 0;
  UC temp_a_3[3], temp_4[4];
  for (i = 0; i < bufLen; i += 3) {
    // Copy up to 3 input bytes into the current set.
    for (j = i, k = 0; j < bufLen && j < i + 3; j++) temp_a_3[k++] = *(buf++);
    // Complete a short last set with 0 bytes.
    for (; k < 3; k++) temp_a_3[k] = '\0';
    // Regroup 3 bytes of 8 bits into 4 values of 6 bits.
    temp_4[0] = (temp_a_3[0] & MASK1) >> 2;
    temp_4[1] = ((temp_a_3[0] & MASK2) << 4) + ((temp_a_3[1] & MASK3) >> 4);
    temp_4[2] = ((temp_a_3[1] & MASK4) << 2) + ((temp_a_3[2] & MASK5) >> 6);
    temp_4[3] = temp_a_3[2] & MASK6;
    // n input bytes produce n + 1 base_64 characters.
    for (j = i, k = 0; j < bufLen + 1 && j < i + 4; j++, k++)
      encoded += base64_chars[temp_4[k]];
    // Fill the rest of the group with the sentinel character.
    for (; k < 4; k++) encoded += EXTRA;
  }
  return encoded;
}

vector<UC> decode_base64(string const& encoded) {
  UI i = 0, j = 0, k = 0, in_len = encoded.size();
  UC temp_a_3[3], temp_4[4];
  vector<UC> decoded;
  for (i = 0; i < in_len; i += 4) {
    // Replace each character by its index in the base_64 character set.
    for (j = i, k = 0; j < i + 4 && encoded[j] != EXTRA; j++)
      temp_4[k++] = base64_chars.find(encoded[j]);
    // Use 0 for the positions filled with the sentinel character.
    for (; k < 4; k++) temp_4[k] = '\0';
    // Regroup 4 values of 6 bits into 3 bytes of 8 bits.
    temp_a_3[0] = (temp_4[0] << 2) + ((temp_4[1] & MASK7) >> 4);
    temp_a_3[1] = ((temp_4[1] & MASK4) << 4) + ((temp_4[2] & MASK8) >> 2);
    temp_a_3[2] = ((temp_4[2] & MASK2) << 6) + temp_4[3];
    // Keep only the bytes that carry data: stop at the first sentinel.
    for (j = i, k = 0; k < 3 && encoded[j + 1] != EXTRA; j++, k++)
      decoded.push_back(temp_a_3[k]);
  }
  return decoded;
}

int main() {
  // Example 1: numeric characters.
  vector<UC> myData = {'6', '7', '8', '9'};
  string encoded = encode_base64(&myData[0], myData.size());
  cout << "Encoded String: " << encoded << '\n';
  vector<UC> decoded = decode_base64(encoded);
  cout << "Decoded Data: ";
  for (size_t i = 0; i < decoded.size(); i++) cout << (char)decoded[i] << ' ';
  cout << '\n';
  // Example 2: numeric values, printed as integers.
  myData = {4, 16, 64};
  encoded = encode_base64(&myData[0], myData.size());
  cout << "Encoded String: " << encoded << '\n';
  decoded = decode_base64(encoded);
  cout << "Decoded Data: ";
  for (size_t i = 0; i < decoded.size(); i++) cout << (int)decoded[i] << ' ';
  cout << '\n';
  return 0;
}
In main, we have 2 data sets. The first set contains numeric characters and the second contains numeric values; therefore, in the last loop, we print the decoded message with a cast to integer.
Output:
Encoded String: Njc4OQ^^
Decoded Data: 6 7 8 9
Encoded String: BBBA
Decoded Data: 4 16 64
The first set has 4 characters (4 bytes), and the encoded message Njc4OQ^^ has 8 characters (the last 2 characters are extra). In the second set, there are 3 bytes, and the encoded message BBBA has 4 bytes.
In base_64 encoding, every character carries at most 6 bits of data and corresponds to one of the 64 base_64 primary characters. As a result, the encoded message requires about 33% more storage than the original data (4 bytes for every 3).
Despite the extra storage, the advantage is that special characters are handled safely. Therefore, this encoding scheme can transfer data over text-only media while keeping its integrity intact.
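The storage cost can be computed directly from the input length. The two helpers below are a sketch with hypothetical names (`encoded_length`, `overhead_percent`); they assume the output is always filled up to whole groups of 4 characters, as the encoder in this tutorial does.

```cpp
#include <cassert>
#include <cstddef>

// Length of the encoded output for n input bytes:
// 4 characters for every started group of 3 bytes.
std::size_t encoded_length(std::size_t n) {
  return (n + 2) / 3 * 4;
}

// Extra output bytes relative to the input, as a percentage.
// Only meaningful for n > 0.
double overhead_percent(std::size_t n) {
  assert(n > 0);
  double in = static_cast<double>(n);
  double out = static_cast<double>(encoded_length(n));
  return 100.0 * (out - in) / in;
}
```

For 3 input bytes the overhead is about 33%, while for 4 input bytes it is 100%, because the second group is mostly filler. The 33% figure is therefore the value approached for long inputs.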