How Computers Represent and Process Data
Bits, bytes, and the binary system
At the heart of every computer lies the binary system, a method of representing data using only two states: 0 and 1. These states are known as bits, short for binary digits. A bit is the smallest unit of data in a computer and can represent one of two values, typically interpreted as off (0) or on (1).
We’ve already discussed the hardware components that allow these values to be represented in the computer. Now we’ll look in more detail at the kinds of data that can be represented and the operations that can be performed using these methods.
A group of eight bits forms a byte, which is a fundamental unit of data storage in computing. With eight bits, a byte can represent 256 different values, ranging from 0 to 255. For example, the number 1 would be 00000001, and the number 201 would be 11001001. Bytes are the building blocks for representing various types of data, including characters, numbers, and more complex structures.
To understand how computers use the binary system to represent data, consider the example of a simple number. In the decimal system, the number 5 is represented as '5'. In binary, the same number is represented as '101'. Each position in a binary number represents a power of 2, starting from the rightmost digit.
For instance, the binary number '101' can be broken down as follows: the rightmost digit (1) represents 2⁰ (equal to 1), the middle digit (0) represents 2¹ (equal to 2), and the leftmost digit (1) represents 2² (equal to 4). Adding these values together (1 * 2² + 0 * 2¹ + 1 * 2⁰) gives us the decimal number 5.
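To make this concrete, here is a short Python sketch that expands a binary string digit by digit, alongside Python’s built-in base-2 conversion:

```python
binary = "101"

# Expand each digit: the rightmost digit is multiplied by 2^0,
# the next by 2^1, and so on.
value = 0
for position, digit in enumerate(reversed(binary)):
    value += int(digit) * 2 ** position

print(value)          # 5
print(int("101", 2))  # 5 -- Python's built-in conversion agrees
```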
Binary representation is not limited to numbers. It is also used to encode characters, images, and other types of data. For example, the letter 'A' is represented in binary as '01000001'. This encoding is part of a standardized system known as ASCII (American Standard Code for Information Interchange), which assigns a unique binary value to each character. By converting characters into binary, computers can store and manipulate text efficiently.
In addition to representing data, the binary system is crucial for data processing. Computers use binary arithmetic to perform calculations, logical operations, and data manipulation. Binary arithmetic involves operations such as addition, subtraction, multiplication, and division, all performed using binary numbers. These operations are carried out by the computer's central processing unit (CPU), which is designed to handle binary data at high speeds.
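As a small illustration, Python lets us write numbers as binary literals and inspect the results in binary, so we can watch this arithmetic happen:

```python
a = 0b0101  # 5 written in binary
b = 0b0011  # 3 written in binary

# The CPU carries these operations out on binary values directly;
# bin() simply shows us the binary form of each result.
print(bin(a + b))  # 0b1000  (5 + 3 = 8)
print(bin(a * b))  # 0b1111  (5 * 3 = 15)
```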
Logical operations, such as AND, OR, and NOT, are also performed using binary values. These operations are fundamental to decision-making processes in computer programs. For example, an AND operation takes two binary inputs and produces a result of 1 only if both inputs are 1. Otherwise, the result is 0. As we’ve previously discussed, these logical operations enable computers to execute complex algorithms and make decisions based on binary data.
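Python exposes these operations as bitwise operators, which makes the truth tables easy to check:

```python
# AND (&) is 1 only when both inputs are 1.
print(1 & 1)  # 1
print(1 & 0)  # 0

# OR (|) is 1 when at least one input is 1.
print(1 | 0)  # 1
print(0 | 0)  # 0

# NOT (~) flips a bit; masking with & 1 keeps just the lowest bit.
print(~0 & 1)  # 1
print(~1 & 1)  # 0
```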
Booleans are a simple data type that can hold one of two values: true or false. In this sense, they closely mirror the concept of a bit. Booleans are used in logical operations and control flow statements, such as if-else conditions and loops. They are essential for decision-making processes in programs, allowing the computer to execute different actions based on certain conditions.
Variables and data types
Variables are named storage locations in a computer's memory. They allow computers to store, manipulate, and retrieve information efficiently. A data type is a classification for a variable, which specifies the type of data that a variable can hold.
For example, your computer might store a variable called ‘password’. The computer stores your password in that allocated part of memory and accesses it when needed. When you change your password, the data value within the variable changes, but the variable ‘password’ itself continues to exist.
Data types are like labels for variables, which describe what kind of data they contain. In the case of the ‘password’ example given above, the data type would be a sequence of raw text, which we call a ‘string’. A string stored as the password might look like this: “Il0vekinnu123”. The ‘password’ variable is labelled by the computer as a string, meaning it knows to treat it as a line of raw text, rather than something else.
So, we’d have a variable, which would be the ‘password’ container; a data type, which would be ‘string’; and a value that’s stored in that variable, ‘Il0vekinnu123’.
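In Python, which works out data types automatically, the three pieces look like this (using the example value from above):

```python
password = "Il0vekinnu123"  # a variable holding a string value

print(type(password))  # <class 'str'> -- the data type
print(password)        # Il0vekinnu123 -- the stored value

password = "newSecret456"  # the value changes; the variable remains
```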
In addition to the string data type we mentioned, there are several basic data types that are used in most computer programs.
Integers are whole numbers and are one of the most commonly used data types in computing. Any regular, whole number is an integer. They are typically stored in a fixed number of bits, such as 8, 16, 32, or 64 bits.
Often, integers exist as signed and unsigned versions. Signed data types can represent both positive and negative values, while unsigned data types can only represent non-negative values.
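The available range follows directly from the bit width. A quick sketch of the formulas:

```python
# An n-bit unsigned integer spans 0 to 2^n - 1; a signed one
# gives up one bit to the sign, spanning -2^(n-1) to 2^(n-1) - 1.
def unsigned_range(n):
    return 0, 2 ** n - 1

def signed_range(n):
    return -(2 ** (n - 1)), 2 ** (n - 1) - 1

print(unsigned_range(8))  # (0, 255)
print(signed_range(8))    # (-128, 127)
print(signed_range(16))   # (-32768, 32767)
```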
Floating-point numbers, on the other hand, are used to represent real numbers that have a fractional component, such as 3.14 or -0.001. These numbers are stored in a format that includes a sign bit (used to indicate positive or negative), an exponent (to define the scale of the number), and a mantissa (which holds the significant digits of the number).
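We can peek at those three fields with Python’s struct module. This sketch packs 3.14 as a 32-bit float and slices the bits apart:

```python
import struct

# Pack 3.14 as a 32-bit IEEE 754 float, then reinterpret the same
# four bytes as an unsigned integer so we can slice out bit fields.
bits = struct.unpack(">I", struct.pack(">f", 3.14))[0]

sign = bits >> 31                # 1 bit: 0 means positive
exponent = (bits >> 23) & 0xFF   # 8 bits: the scale of the number
mantissa = bits & 0x7FFFFF       # 23 bits: the significant digits

print(sign, exponent, mantissa)  # 0 128 4781507
```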
Another basic data type is a Boolean. This is a data value that can either be ‘True’ or ‘False’, and is very useful for many programming tasks. For example, a program might have a Boolean variable called ‘user_is_authorised’, and the value could be either ‘True’ or ‘False’. The program can access that variable to check whether a given user is allowed to view something.
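In code, that check is a one-liner. The variable name here is just the illustrative one from the paragraph above:

```python
user_is_authorised = False

if user_is_authorised:
    print("Showing the private page")
else:
    print("Access denied")  # this branch runs
```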
In addition to the basic data types, many programming languages support more complex data structures, such as arrays, lists, and objects. These are often collections of other data types or specifically designed custom data types.
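For instance, Python provides lists and dictionaries as built-in collections (the values below are invented purely for illustration):

```python
scores = [72, 85, 91]                   # a list: an ordered collection
user = {"name": "Ada", "active": True}  # a dictionary: key-value pairs

print(scores[0])     # 72 -- list items are accessed by position
print(user["name"])  # Ada -- dictionary values are accessed by key
```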
While data types seem like a basic concept, their misuse can lead to some surprising and sometimes catastrophic errors. One famous example is the "Nuclear Gandhi bug" in the 1991 game Civilization. In Civilization, leaders (the players' enemies) are assigned an aggression level, stored, so the story goes, in an unsigned 8-bit integer. Remember, unsigned integers can only hold non-negative values.
When a leader changes their government to a democracy, the aggression level is reduced by 2, and Gandhi supposedly started with a level of 1. Because the resulting level of -1 cannot be stored in a range of 0-255, it wrapped around and was stored as 255.
This increased aggression level made the famously peace-loving leader suddenly attack other leaders with nuclear weapons. Though the game’s designer later stated that this was an urban legend, the story is still widely used to explain integer overflow.
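Whether or not the bug was real, the arithmetic is easy to reproduce. A sketch of 8-bit unsigned wraparound, using the story’s numbers:

```python
aggression = 1   # Gandhi's supposed starting level
aggression -= 2  # adopting democracy reduces aggression by 2

# An unsigned 8-bit integer holds 0-255; values below 0 wrap
# around. The % 256 mimics that hardware behaviour.
stored = aggression % 256
print(stored)  # 255 -- maximum aggression
```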
The Gandhi bug may be an urban legend, but there are other, real-life cases where integer overflow errors have caused huge issues.
One famous example is the crash of the Ariane 5 rocket during its maiden flight in 1996, which is predominantly attributed to an integer overflow error. The error itself was very simple: an attempted conversion of a 64-bit floating-point value into a 16-bit signed integer produced a number too large to fit.
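The failure mode is easy to simulate. In this sketch the velocity value is invented, not the actual flight datum; packing it into 16 bits fails in a way loosely analogous to the operand error that doomed the guidance system:

```python
import struct

velocity = 40000.0  # illustrative value, too large for 16 bits

# A 16-bit signed integer tops out at 32767, so the conversion
# cannot succeed and raises an error instead.
try:
    struct.pack(">h", int(velocity))
except struct.error as exc:
    print("conversion failed:", exc)
```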
These are just a few examples, and similar errors involving data types can lead to financial miscalculations, incorrect scientific results, or even malfunctions in critical systems. Careful selection and understanding of data types are essential for ensuring the reliability and accuracy of any program.
Character encoding, ASCII, and Unicode
When you type a letter on your keyboard, how does your computer know which letter you pressed? The answer lies in character encoding, a system that translates characters into a format that computers can understand and process. One of the most fundamental and widely used character encoding schemes is ASCII, which stands for American Standard Code for Information Interchange.
ASCII was developed in the early 1960s and has since become a cornerstone of computer systems. It uses a 7-bit binary number to represent each character, allowing for 128 unique symbols. These symbols include the English alphabet (both uppercase and lowercase), digits, punctuation marks, and control characters like the newline or carriage return. For example, the uppercase letter 'A' is represented by the binary number 01000001, which is 65 in decimal form. Similarly, the lowercase 'a' is represented by 01100001, or 97 in decimal.
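You can check these mappings yourself, since Python exposes the character-to-number translation directly:

```python
print(ord("A"))                 # 65 -- the ASCII code for 'A'
print(ord("a"))                 # 97
print(chr(65))                  # A  -- back from number to character
print(format(ord("A"), "08b"))  # 01000001 -- the same code in binary
```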
The beauty of ASCII lies in its simplicity and universality. Because it uses only 7 bits, it fits neatly into the 8-bit byte, the basic unit of data in most computer systems. This makes it highly efficient for storage and transmission. ASCII's straightforward mapping of characters to numbers also makes it easy to implement in hardware and software, ensuring compatibility across different systems and platforms.
However, ASCII has its limitations. With only 128 possible characters, it cannot accommodate the vast array of symbols used in languages other than English. To address this, various extensions and alternative encoding schemes have been developed.
One of the most notable is Unicode, which aims to provide a unique representation (called a “code point”) for every character in every language. Unicode’s code space allows for over a million unique code points, and its encodings can use up to 32 bits per character. Despite its broader scope, Unicode maintains backward compatibility with ASCII, meaning that the first 128 characters in Unicode are identical to those in ASCII.
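UTF-8, the most common Unicode encoding, shows this backward compatibility in action: ASCII characters still occupy a single byte, while other characters take more:

```python
# UTF-8 is a variable-width encoding of Unicode code points.
print("A".encode("utf-8"))  # b'A' -- 1 byte, identical to ASCII
print("é".encode("utf-8"))  # b'\xc3\xa9' -- 2 bytes
print("€".encode("utf-8"))  # b'\xe2\x82\xac' -- 3 bytes

print(ord("€"))  # 8364 -- the euro sign's Unicode code point
```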
Character encoding is not just about representing letters and numbers; it also plays a crucial role in data integrity and security. When data is transmitted over a network, it is often encoded to ensure that it arrives at its destination without errors.
Encoding schemes like Base64 are used to convert binary data into a text format that can be easily transmitted over text-based protocols like HTTP and SMTP. This is particularly important for sending binary files, such as images or executable programs, over the internet.
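Python’s standard library ships with a base64 module, so a minimal round trip looks like this:

```python
import base64

data = b"\x00\xffbinary bytes"  # arbitrary binary data

encoded = base64.b64encode(data)  # a safe, text-only representation
print(encoded)                    # b'AP9iaW5hcnkgYnl0ZXM='

decoded = base64.b64decode(encoded)  # recovers the original bytes
print(decoded == data)               # True
```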
Moreover, character encoding is essential for data storage and retrieval. When you save a text file, the characters are encoded into a binary format that can be written to disk. When you open the file, the binary data is decoded back into characters that you can read. If the encoding scheme is not correctly specified, you may end up with garbled text, a common issue when dealing with files created on different systems or in different languages.
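That garbling is easy to reproduce: encode text under one scheme and decode it under another.

```python
text = "café"

saved = text.encode("utf-8")  # written to disk as UTF-8 bytes

# Reading the bytes back with the wrong encoding garbles the text.
print(saved.decode("latin-1"))  # cafÃ©
print(saved.decode("utf-8"))    # café -- correct when schemes match
```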