UTF-8 Validation in C++

Overview

UTF-8 is a variable-width character encoding used for electronic communication. Defined by the Unicode Standard, its name is derived from Unicode Transformation Format – 8-bit. UTF-8 can encode all 1,112,064 valid character code points in Unicode using one to four one-byte code units. The task is to write a program that detects whether the given data is a valid UTF-8 encoding. This is a bit manipulation problem, and the implementation involves a lot of detail. In this task, I will be using C++ to carry out UTF-8 validation.

Thought Process

The conditions for data to be UTF-8 encoded were stated in the problem set. Note that for an n-byte UTF-8 character with n greater than 1, the first n bits of the first byte are 1, followed by a 0 in bit n + 1; a one-byte character simply starts with a 0. The next n - 1 bytes must all have 10 as their two most significant bits. Given the elements in our array, we process them one after the other to check for validity. For the first element, say, the code computes its binary representation and keeps only 8 bits of it (one byte) for processing. As part of the conditions stated in the task, we check whether the two most significant bits of a continuation byte equal 10. To check the leading-bit condition, we declare a separate variable and shift its bits to the right with the right-shift operator (>>), testing one bit at a time.
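The leading-bit rule above can be sketched in C++ with a single-bit mask that is shifted to the right. This is a minimal illustration rather than the original implementation; the helper names countLeadingOnes and expectedLength are hypothetical, and the input is assumed to be an int whose lowest 8 bits hold the byte.

```cpp
// Hypothetical helper: counts the leading 1 bits in the low 8 bits of `value`
// by moving a single-bit mask from the most significant bit to the right.
int countLeadingOnes(int value) {
    int count = 0;
    for (int mask = 0x80; mask != 0 && (value & mask) != 0; mask >>= 1) {
        ++count;
    }
    return count;
}

// Hypothetical helper: number of bytes in the character that starts with
// `value`, or 0 if `value` cannot start a character.
int expectedLength(int value) {
    int ones = countLeadingOnes(value & 0xFF);  // assumption: only the low 8 bits matter
    if (ones == 0) return 1;                    // 0xxxxxxx: one-byte character
    if (ones >= 2 && ones <= 4) return ones;    // 110xxxxx, 1110xxxx, 11110xxx
    return 0;                                   // 10xxxxxx or five or more leading 1 bits
}
```

For example, 197 is 11000101 in binary, so expectedLength(197) reports a two-byte character, while 130 (10000010) is a continuation byte and cannot start one.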

Once the bytes of one character have been checked, processing moves on to the start of the next UTF-8 character, so the code never reprocesses an element that has already been validated.
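Putting the steps together, a single pass over the array could look like the sketch below. It is one possible implementation under stated assumptions, not necessarily the original code: the function name validUtf8 and the std::vector<int> input are assumed, and only the lowest 8 bits of each element are treated as a byte. It checks the bit patterns described here and nothing more; it does not reject overlong encodings or code points outside the Unicode range.

```cpp
#include <vector>

// Assumed signature: each element of `data` carries one byte in its low 8 bits.
bool validUtf8(const std::vector<int>& data) {
    int remaining = 0;  // continuation bytes still expected for the current character
    for (int value : data) {
        int byte = value & 0xFF;  // keep only the low 8 bits

        if (remaining > 0) {
            // A continuation byte must look like 10xxxxxx.
            if ((byte & 0xC0) != 0x80) return false;
            --remaining;
            continue;
        }

        // Count leading 1 bits with a mask shifted to the right.
        int ones = 0;
        for (int mask = 0x80; mask != 0 && (byte & mask) != 0; mask >>= 1) {
            ++ones;
        }

        if (ones == 0) continue;                  // 0xxxxxxx: one-byte character
        if (ones == 1 || ones > 4) return false;  // stray continuation or too long
        remaining = ones - 1;                     // first byte of an n-byte character
    }
    // A truncated character leaves continuation bytes outstanding.
    return remaining == 0;
}
```

With this sketch, validUtf8({197, 130, 1}) returns true, because 197 opens a two-byte character, 130 continues it, and 1 is a one-byte character. validUtf8({235, 140, 4}) returns false, because 235 opens a three-byte character but 4 does not start with 10.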

Conclusion

A program that validates user-entered data against a set of predefined conditions can help in many real-life applications, ranging from password authentication to large-scale security verification. This task is about UTF-8 validation, but the same design process can be applied to most validation-related programs.

Complexity Analysis

Time Complexity: O(N), where N is the number of elements in the array, since each element is processed once. Space Complexity: O(1).