Data Representation in Computers

Peter Ladkin

Computers `store' numbers in hardware as digits: for example, the number `12493' consists of five digits, each of which is between 0 and 9. We call it a `decimal representation' (decimal = based on 10) because there are ten such digits, and because this number represents

1x10,000 + 2x1,000 + 4x100 + 9x10 + 3
in other words, the number obtained by adding one times ten thousand to two times one thousand, then adding four times one hundred, then nine times ten, and finally three. The numbers 10,000, 1,000, 100 and 10 are powers of ten, obtained by multiplying 10 by itself repeatedly. The number 10 is called the base of the number system.
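
As a small illustration, one can check this expansion with a few lines of a programming language such as Python:

    # Check the place-value expansion of 12493 in base 10.
    digits = [1, 2, 4, 9, 3]     # the digits of 12493, most significant first
    value = 0
    for d in digits:
        value = value*10 + d     # multiply the running total by the base, then add the digit
    print(value)                 # prints 12493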

Computer hardware is built to store numbers in `binary'. There are just two binary digits, 0 and 1, so computer hardware works mostly in base 2. In base 2, the number 10010 represents

1x16 + 0x8 + 0x4 + 1x2 + 0
The numbers 16, 8, 4, 2 are obtained by multiplying 2 by itself repeatedly. Notice this is similar to base 10, but using 2 instead of 10. The reason for using binary notation in computer hardware is that it is very easy to implement physically. A single binary digit (0 or 1) is called a `bit' when it is implemented in computer hardware.
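
The same check works in base 2. This sketch expands the binary digits of 10010 and prints the value, eighteen:

    # Expand the binary number 10010 (base 2).
    bits = [1, 0, 0, 1, 0]       # the binary digits, most significant first
    value = 0
    for b in bits:
        value = value*2 + b      # same rule as base 10, with base 2
    print(value)                 # prints 18, i.e. 1x16 + 1x2
    print(int('10010', 2))       # Python's built-in conversion gives the same result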

Computer hardware must be able to represent not only numbers, but also characters (for text) and all kinds of other `data'. To do this, these other kinds of data are `coded', as I shall now explain.

First, I must describe `bytes'. The bits are grouped together physically into `bytes'. A byte has a fixed number of bits. Typical sizes for bytes are 8 bits and 16 bits. An example of an 8-bit byte is 10011101 and of a 16-bit byte is 1001001101001001. There are 256 different 8-bit bytes and 65,536 different 16-bit bytes. The size of a byte is determined when the computer hardware (the chip) is designed, and is physically fixed. It cannot be altered. But of course different chips can have different byte sizes. It is a convenience, and not a mathematical necessity, that bytes are mostly designed to have 8 bits.
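
These counts come about because each bit independently takes one of two values, so n bits allow 2 multiplied by itself n times different patterns; a tiny check:

    # Number of distinct patterns of n bits is 2 multiplied by itself n times.
    print(2**8)      # 256 different 8-bit bytes
    print(2**16)     # 65536 different 16-bit bytes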

Bytes are used as codes in the following way. A character between `a' and `z' is assigned a certain specific fixed binary number, say between 0 and 255, as an identifier. For example, a lower case `k' is assigned the number 107 (01101011 as an 8-bit byte), and an upper case `K' the number 75 (01001011). If a computer program or hardware `knows' that it should be seeing a character, it can read the number and interpret it as the intended character. A code between 0 and 255 can be represented in an 8-bit byte. Examples of codes are EBCDIC (an IBM code from the 1960's), ASCII (the `American standard' character set, from which the example with `k' is taken) and ISO-8859-1 (the International standard character set). ASCII actually uses only the numbers between 0 and 127 to represent the rather restricted range of characters used in the US (no ü, é, ç or ß). Because numbers between 0 and 127 can be represented using 7 bits only, it is often referred to as a `7-bit code'. The ISO-8859-1 standard is an 8-bit code, using the numbers between 0 and 255 as codes.
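
To see these codes concretely, one can ask a language such as Python for the numbers behind individual characters; the ASCII values of `k' and `K' below match the example above, and ISO-8859-1 (Latin-1) supplies a code for `ü' as well:

    # Character codes as numbers, and numbers back as characters.
    print(ord('k'), ord('K'))          # 107 75  (the ASCII codes from the example)
    print(format(ord('k'), '08b'))     # 01101011, the same code written as an 8-bit byte
    print('ü'.encode('latin-1'))       # b'\xfc', i.e. code 252 in ISO-8859-1
    print(chr(75))                     # 'K': interpreting the number 75 as a character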

With Email, one often hears of `7-bit' and `8-bit' transmission problems: in many older computer communications, only 7 bits of a byte were used to code a character, because ASCII required only 7 bits. So if one sends a message using ISO-8859-1 (as is likely if one is communicating in a language other than English), one bit of information will be lost, which means that a different character could be read from the one that was sent, and the message thereby garbled. More recent Email communication standards (`MIME') also specify a way to code 8-bit data as 7 bits, so that it may be decoded by the receiver and correctly read. This is indeed a complicated but necessary fix to a design problem (deciding that 7 bits were enough) that could have been avoided. But we're stuck with it now. It's somewhat like the Year 2000 problem.
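
For example, MIME's `quoted-printable' transfer encoding rewrites each 8-bit code as a short sequence of 7-bit ASCII characters. A rough sketch using Python's standard quopri module:

    import quopri

    text = 'Grüße'                          # German text containing 8-bit characters
    raw = text.encode('latin-1')            # the 8-bit ISO-8859-1 codes: ü = 0xFC, ß = 0xDF
    encoded = quopri.encodestring(raw)      # quoted-printable: every byte is now 7-bit ASCII
    print(encoded)                          # e.g. b'Gr=FC=DFe'
    print(quopri.decodestring(encoded).decode('latin-1'))   # the receiver recovers 'Grüße'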

Bytes themselves are grouped into `words', usually consisting of two, four, or eight bytes. Words are used for storing computer instructions and other data, such as numbers, for which more than the size of a byte is required for coding. One hears of `16-bit', `32-bit' and `64-bit' computers (meaning processor chips). This means the word size is 16, 32 or 64 bits respectively. One also hears, in communications, of 8-bit or 16-bit (or 32-bit or ...) data paths. This means that data is transferred in pieces of this size. If you have a 32-bit processor with 8-bit data paths, every word in the processor must be broken up into four 8-bit bytes to be transferred to, say, disk, and each of these four bytes is sent separately.
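
As a rough sketch of that last point, the following lines break a 32-bit word into four 8-bit bytes, most significant first, as it might be sent over an 8-bit data path:

    # Split a 32-bit word into four 8-bit bytes (most significant byte first).
    word = 0x12345678                                  # an arbitrary 32-bit value
    parts = [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    print([format(b, '08b') for b in parts])           # the four bytes as bit patterns
    reassembled = (parts[0] << 24) | (parts[1] << 16) | (parts[2] << 8) | parts[3]
    print(reassembled == word)                         # True: nothing was lost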

The byte size and word size of computer hardware are fixed. They cannot be altered. That means that any data which doesn't standardly fit into this format must be coded so that it does. This is usually done by programs (that is, software). One way in which this can be done for dates is to code each decimal digit into bytes. One usually uses a single 8-bit byte for two decimal digits, using 4 bits for each digit. Here are the codings:

    0 = 0000    5 = 0101
    1 = 0001    6 = 0110
    2 = 0010    7 = 0111
    3 = 0011    8 = 1000
    4 = 0100    9 = 1001

So, for example, 93 will be coded as `9 followed by 3': 10010011. This method of coding decimal digits directly (known as binary-coded decimal, or BCD) is used by many financial computing systems, such as those produced by IBM in the 1960's through 1980's.
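
A brief sketch of this packing in Python (the digit codes are those in the table above):

    # Pack two decimal digits into one 8-bit byte, 4 bits per digit (BCD).
    def pack_bcd(tens, units):
        return (tens << 4) | units       # high 4 bits = first digit, low 4 bits = second

    byte = pack_bcd(9, 3)
    print(format(byte, '08b'))           # 10010011, the coding of 93 given above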

Using this method, a date represented as DD-MM-YY, say 30-10-98, can be coded in three bytes, as 00110000, 00010000, 10011000. A date represented with a four-digit year, preferably as YYYY-MM-DD in conformance with the International Standard for dates, ISO-8601 (also EN 28601 and DIN 5008), would be 1998-10-30, coded in four bytes as 00011001, 10011000, 00010000, 00110000.
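
Carrying the sketch one step further, a four-digit-year date in YYYY-MM-DD form packs into four such bytes:

    # Code the ISO-8601 date 1998-10-30 as four BCD bytes.
    def bcd_bytes(digit_pairs):
        return [(a << 4) | b for (a, b) in digit_pairs]

    date_bytes = bcd_bytes([(1, 9), (9, 8), (1, 0), (3, 0)])   # 19 98 10 30
    print([format(b, '08b') for b in date_bytes])
    # ['00011001', '10011000', '00010000', '00110000'], as in the text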