Representation of Characters in Computers

by Herbert J. Bernstein

© Copyright 2000 Herbert J. Bernstein

Computers are designed to work internally with numbers. To handle characters, we must choose a number for each character. There are many ways to do this. For example, we might choose the approach followed by Control Data Corporation (CDC) in designing "Display Code" in the early 1960s (see, for example, the "Scope 3.4 Reference Manual," Control Data Corporation, Minneapolis, 1976, publication 60307200):

CDC Display Code (ASCII variant)
Symbol  Octal  Decimal  Hex     Symbol  Octal  Decimal  Hex
:       00     0        0       5       40     32       20
A       01     1        1       6       41     33       21
B       02     2        2       7       42     34       22
C       03     3        3       8       43     35       23
D       04     4        4       9       44     36       24
E       05     5        5       +       45     37       25
F       06     6        6       -       46     38       26
G       07     7        7       *       47     39       27
H       10     8        8       /       50     40       28
I       11     9        9       (       51     41       29
J       12     10       A       )       52     42       2A
K       13     11       B       $       53     43       2B
L       14     12       C       =       54     44       2C
M       15     13       D       (SP)    55     45       2D
N       16     14       E       ,       56     46       2E
O       17     15       F       .       57     47       2F
P       20     16       10      #       60     48       30
Q       21     17       11      [       61     49       31
R       22     18       12      ]       62     50       32
S       23     19       13      %       63     51       33
T       24     20       14      "       64     52       34
U       25     21       15      _       65     53       35
V       26     22       16      !       66     54       36
W       27     23       17      &       67     55       37
X       30     24       18      ?       70     56       38
Y       31     25       19      '       71     57       39
Z       32     26       1A      >       72     58       3A
0       33     27       1B      <       73     59       3B
1       34     28       1C      @       74     60       3C
2       35     29       1D      \       75     61       3D
3       36     30       1E      ^       76     62       3E
4       37     31       1F      ;       77     63       3F

This character set is no longer in use, but it helps us to understand some of the issues in the representation of characters in computers.

Packing characters into words

Most character sets have far fewer symbols than the range of numbers that could fit into a computer word. For example, there are only 64 symbols in the Display Code table above, but the machines on which it was used had a byte size of 12 bits, which could hold 4096 distinct numbers, and a full word of 60 bits. Notice that the 64 characters can be represented by all the two-digit octal numbers, 00 through 77, using just 6 bits. This made it natural to break up each 12-bit byte into two 6-bit fields and to pack 10 characters into each 60-bit word. For machines with different numbers of characters and different numbers of bits per word, as we shall see, different packings of characters into words are used, but the common practice is to allocate some number of bits per character in some order.
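As a rough illustration of this packing, the following Python sketch packs ten 6-bit Display Code characters into a single 60-bit integer, with the first character in the most significant bits. The function names and the choice of Python are assumptions made for illustration, and only the codes 00 through 57 octal from the table above are included.

```python
import string

# Display Code symbols for octal codes 00 through 57, taken from the table
# above: ":" is 00, A-Z are 01-32, 0-9 are 33-44, then + - * / ( ) $ = SP , .
DISPLAY = ":" + string.ascii_uppercase + string.digits + "+-*/()$= ,."

def pack_word(text):
    """Pack up to 10 characters into a 60-bit integer, 6 bits per character."""
    if len(text) > 10:
        raise ValueError("a 60-bit word holds at most 10 six-bit characters")
    word = 0
    for ch in text.ljust(10):          # pad with spaces (code 55 octal)
        word = (word << 6) | DISPLAY.index(ch)
    return word

def unpack_word(word):
    """Recover the 10 characters from a 60-bit integer."""
    codes = [(word >> shift) & 0o77 for shift in range(54, -1, -6)]
    return "".join(DISPLAY[c] for c in codes)

word = pack_word("HERBERT")
print(format(word, "020o"))   # 20 octal digits, two per character
print(unpack_word(word))
```

Each character occupies exactly two octal digits of the printed word, which is why octal was the natural notation on these machines.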

Character ordering

When we write English, we write the characters from left to right. In some other languages characters may be written from right to left. We face similar decisions in placing groups of characters into words in a computer, and in writing numbers into the bytes and words of a computer. If we place a number into a sequence of bytes forming a word with the most significant binary digits in the first bytes of the sequence, we call that big-endian. If we place a number into a sequence of bytes forming a word with the least significant digits in the first bytes of the sequence, we call it little-endian. Similarly, we could pack characters into words that hold numbers with the first character in a string aligned to the most significant digits, to the least significant digits, or in some other order. Sometimes, to deal with characters from one machine on another, we have to swap every pair of characters, or reorganize them in groups of four, etc.
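The two byte orders can be seen directly in a short Python sketch. It assumes modern 8-bit bytes and a 4-byte word, which differ from the 12-bit bytes of the CDC machines discussed above; the value 0x12345678 is an arbitrary choice.

```python
value = 0x12345678

big = value.to_bytes(4, byteorder="big")        # most significant byte first
little = value.to_bytes(4, byteorder="little")  # least significant byte first

print(big.hex())      # 12345678
print(little.hex())   # 78563412

# Reading the bytes back with the wrong convention gives a different number.
print(hex(int.from_bytes(big, byteorder="little")))   # 0x78563412
```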

Thus the string "HERBERT", padded with three spaces to fill a 60-bit word, could appear in any of the following orders of characters, each shown with its octal representation, depending on the choices made in the hardware and software design of the machine:

HERBERT   
10052202052224555555
   TREBREH
55555524220502220510
EHBRRE T  
05100222220555245555
  T ERRBHE
55552455052222021005
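The four orderings listed above can be reproduced with a short Python sketch. The helper names are hypothetical, and the table covers only Display Code values 00 through 57 octal, which is enough for this string.

```python
import string

# Display Code symbols for octal codes 00 through 57.
DISPLAY = ":" + string.ascii_uppercase + string.digits + "+-*/()$= ,."

def octal_digits(chars):
    """Two octal digits per character, in the order given."""
    return "".join(format(DISPLAY.index(ch), "02o") for ch in chars)

def swap_pairs(chars):
    """Exchange the two characters within each successive pair."""
    return "".join(chars[i + 1] + chars[i] for i in range(0, len(chars), 2))

text = "HERBERT".ljust(10)   # pad with spaces to fill ten character positions

orders = {
    "forward": text,
    "reversed": text[::-1],
    "pairs swapped": swap_pairs(text),
    "pairs swapped, reversed": swap_pairs(text)[::-1],
}

for name, chars in orders.items():
    print(f"{name:24} {chars!r} {octal_digits(chars)}")
```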

Collating sequence

In sorting character strings, the numeric values assigned to characters can be helpful in sorting the strings into lexicographic (dictionary) order, provided we assign numeric values to letters of the alphabet in dictionary order without gaps. Many earlier character sets did not have such a design. CDC display code did. However, the introduction of lower case letters then complicated the issue, since we then face the decision of whether to place upper and lower case letters together or in separate runs.

The most common current decision is to choose a code in which the letters of the alphabet are assigned numbers in alphabetic order for each case and font. This then facilitates such operations as lexicographic sorting. We will see this in the ISO character set, the dominant character set in use today.
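The effect of keeping upper and lower case letters in separate runs can be seen in a small Python sketch. It relies on Python comparing strings by character code, which for these letters gives the same order as the codes discussed in this section; the word list is made up for illustration.

```python
words = ["pear", "Zebra", "apple", "Mango"]

# A plain sort compares character codes, so every upper case letter
# (codes 65-90) sorts before every lower case letter (codes 97-122).
by_code = sorted(words)

# Folding case in the sort key interleaves the two runs.
dictionary_order = sorted(words, key=str.lower)

print(by_code)            # ['Mango', 'Zebra', 'apple', 'pear']
print(dictionary_order)   # ['apple', 'Mango', 'pear', 'Zebra']
```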

The ISO character set

The most commonly used character set is called the ISO character set. The U.S. national variant is called ASCII (American Standard Code for Information Interchange). An excellent reference, with bibliography, can be found in "A view of the history of the ISO character code," by R.W. Bemer in the (now defunct) Honeywell Computer Journal, vol 6, #4, 1972, pp 274-286. This character set has been greatly extended and augmented in recent years, and is the basis of a rich variety of character sets, including some which use 16-bit characters or multiple character sequences to represent characters in languages which require thousands of symbols.

It is very important to have a clear understanding of the original ISO character set, since it is taken for granted in most computing and communications systems. Even in the declining number of systems that use other sets, e.g. some EBCDIC or BCD based systems, functional equivalents to the ISO set can be found.

The original ISO set contains 128 characters, grouped into 8 columns of 16 characters each, with columns and rows both numbered from 0. Columns 0 and 1 are used for various communications control characters and data delimiters, such as carriage return (CR) and line feed (LF). Columns 2 through 7 are used, with one exception (DEL), for printable characters. The digits are in column 3, the upper case alphabet in columns 4 and 5, and the lower case alphabet in columns 6 and 7. Some characters are allowed to vary from country to country to allow for special national symbols. For example, column 2, row 3, is # in the United States, but the symbol for the pound in sterling areas.

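        Column
Row     0     1     2     3     4     5     6     7
0       NUL   DLE   SP    0     @     P     `     p
1       SOH   DC1   !     1     A     Q     a     q
2       STX   DC2   "     2     B     R     b     r
3       ETX   DC3   #     3     C     S     c     s
4       EOT   DC4   $     4     D     T     d     t
5       ENQ   NAK   %     5     E     U     e     u
6       ACK   SYN   &     6     F     V     f     v
7       BEL   ETB   '     7     G     W     g     w
8       BS    CAN   (     8     H     X     h     x
9       HT    EM    )     9     I     Y     i     y
10      LF    SUB   *     :     J     Z     j     z
11      VT    ESC   +     ;     K     [     k     {
12      FF    FS    ,     <     L     \     l     |
13      CR    GS    -     =     M     ]     m     }
14      SO    RS    .     >     N     ^     n     ~
15      SI    US    /     ?     O     _     o     DEL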

The binary code for any entry can be found by composing the bits of the column number with the bits of the row number. For example, CR is in column 0, row 13, and thus has the binary code 0001101 = 13 decimal = 15 octal = 0D hexadecimal. We will refer to the entries by column/row. When writing out the representation of character strings, hexadecimal is often used.
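The composition of column and row bits can be sketched in Python as follows. The function names are hypothetical; the arithmetic simply places the 3-bit column number above the 4-bit row number.

```python
def code_from_position(column, row):
    """Combine a 3-bit column number and a 4-bit row number into a 7-bit code."""
    if not (0 <= column <= 7 and 0 <= row <= 15):
        raise ValueError("column must be 0-7 and row must be 0-15")
    return (column << 4) | row

def position_from_code(code):
    """Split a 7-bit code into its (column, row) pair."""
    return code >> 4, code & 0x0F

cr = code_from_position(0, 13)
print(format(cr, "07b"), cr, format(cr, "o"), format(cr, "02X"))
# 0001101 13 15 0D
print(position_from_code(ord("A")))   # (4, 1)
```

Because a row number is exactly one hexadecimal digit, the column/row pair and the two-digit hexadecimal code carry the same information.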

The first two columns of control characters provide us with a practical example of encoding data for efficiency. At the time this set was formed, these were the most common communications functions, so each was assigned a single character instead of some string of characters. The final control character, DEL, at the end of the set had to be left there to conform to the practicalities of handling paper tape.

NUL (0/0) stands for null and is usually used as a time fill character with no other effect.

SOH (0/1) stands for start of header, and is usually used to mark the beginning of addressing and control information in a message.

STX (0/2) stands for start of text, and is intended to flag the actual start of end-user information in a message.

ETX (0/3) stands for end of text to signal the end of a message. This is a bit like "over" in radio communications.

EOT (0/4) stands for end of transmission, and is intended to signal a final disconnect (like "over and out").

ENQ (0/5) stands for enquire, and is intended to be used to ask for some sort of status or identifying message in return.

ACK (0/6) stands for acknowledgement, to signal proper reception of a message.

BEL (0/7) stands for bell, and is intended to generate some sort of attention signal, usually for an operator.

BS (0/8) stands for backspace, as on a typewriter.

HT (0/9) stands for horizontal tab, as on a typewriter. In the absence of a clear definition of the tab stops, HT is a useless code. A Digital Equipment Corporation convention is one tab stop every 8 columns.

LF (0/10) stands for line feed, i.e. advance one line on a terminal. In some systems this implies a return to the left margin. In other systems there is no such implication.

VT (0/11) stands for vertical tab, similar to horizontal tab, except in a downward direction. On a few systems it has been used to space upwards instead.

FF (0/12) stands for form feed, i.e. advance to the top of the next page.

CR (0/13) stands for carriage return, i.e. return to the first column on a terminal, usually without a line advance to permit overprinting.

SO (0/14) and SI (0/15) stand for shift out and shift in, for an escape to another printing character set (SO) and return to the standard set (SI). When more than one alternate character set is required, an escape sequence with ESC should be used.

DLE (1/0) stands for data link escape. It is intended as an escape character to provide additional communications control characters, e.g. additional ACKs.

DC1 (1/1), DC2 (1/2), DC3 (1/3) and DC4 (1/4) are device controls 1 - 4. They are intended for such functions as starting and stopping auxiliary equipment like paper tape. It is rather common practice to use DC3 to stop transmission and DC1 to restart it. They are control-S and control-Q, respectively, on most keyboards.
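The correspondence between control-S, control-Q and the device controls follows from the usual keyboard convention, assumed here, that holding the control key keeps only the low five bits of the letter's code. That moves a letter from column 4 or 5 to the same row of column 0 or 1. A minimal Python sketch, with a hypothetical function name:

```python
def control_code(letter):
    """Code conventionally sent by holding the control key with a letter."""
    return ord(letter.upper()) & 0x1F   # keep only the low five bits

DC1, DC3 = 0x11, 0x13   # column 1, rows 1 and 3

print(control_code("Q") == DC1)   # True
print(control_code("S") == DC3)   # True
```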

NAK (1/5) stands for negative acknowledge, usually to signal a garbled message.

SYN (1/6) stands for synchronous idle. It is used in systems that transmit continuous streams of bits without character delimiters, both to start character framing synchronization and to fill time without stopping transmission by sending a character which is supposed to be ignored.

ETB (1/7) stands for end of transmission block. It is similar to ETX, but marks a block termination in the middle of a message. The message would continue with more blocks.

CAN (1/8) stands for cancel, i.e. cancel the current message, but is sometimes used in place of NAK.

EM (1/9) stands for end of medium, e.g. the end of a roll of paper tape or of a reel of magnetic tape.

SUB (1/10) stands for substitute character. It was intended to hold the place of a character garbled in transmission, but it is usually used simply as a flag for some special sort of message, e.g. logical end of file on some Digital Equipment Corporation systems.

ESC (1/11) stands for escape and is usually used as the first character of a group of characters to perform some expansion of the data delimiter control character set, e.g. ESC @ for character insert, ESC A for cursor up, ESC B for cursor down, ESC C for cursor right, and ESC D for cursor left.

FS (1/12), GS (1/13), RS (1/14) and US (1/15), also known as IS4, IS3, IS2 and IS1, are intended to be information separators, marking file, group, record and unit boundaries respectively.

DEL (7/15) stands for delete. Historically, it was intended to work for paper tape as an erasure -- punch out all the holes over a bad character -- but in practice it is often used to mean delete the previous character.

It would be an interesting exercise to examine current communications traffic to see how many of these characters still deserve their place in the character set.

This leaves 95 printable characters. For most purposes, remembering the following landmarks is sufficient. The first printable character is SP, for space, in column 2, row 0, 40 octal, 20 hexadecimal, 32 decimal. The digits 0 through 9 are in column 3, starting in row 0, with the digit 0 having code 60 octal, 30 hexadecimal, 48 decimal. The upper case alphabet starts in column 4, row 1, with A having code 101 octal, 41 hexadecimal, 65 decimal. The lower case alphabet starts in column 6, row 1, with a having code 141 octal, 61 hexadecimal, 97 decimal. The characters #, $, @, [, \, ], ^, `, {, |, } and ~ are allowed to have national variations. The set given is for the United States. This set is not organized for minimal entropy, but for clarity, a decision which makes the ISO set less than optimal for secure systems.
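These landmarks are enough for simple character arithmetic, as the following Python sketch suggests. The function names are hypothetical, and the sketch assumes the input characters lie in the ranges named in the comments.

```python
ZERO, UPPER_A, LOWER_A = 0o60, 0o101, 0o141   # landmarks from the text

def digit_value(ch):
    """Numeric value of a digit character '0' through '9'."""
    return ord(ch) - ZERO

def to_lower(ch):
    """Move an upper case letter two columns right; leave others alone."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + (LOWER_A - UPPER_A))
    return ch

print(digit_value("7"))        # 7
print(to_lower("H"))           # h
print(LOWER_A - UPPER_A)       # 32, i.e. two columns of 16
```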

As an example of using the ISO set, consider the sentence, "I love you." Let us use the sequence CR-LF as a line terminator. Then the sequence of characters to be sent (not including the quotation marks) is:

4/9 2/0 6/12 6/15 7/6 6/5 2/0 7/9 6/15 7/5 2/14 0/13 0/10

As a hexadecimal sequence, this would be

49206C6F766520796F752E0D0A
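The same two sequences can be generated with a few lines of Python. This is only a sketch; it uses Python's built-in "ascii" codec, which corresponds to the U.S. variant of the set.

```python
message = "I love you." + "\r\n"      # CR-LF as the line terminator

codes = message.encode("ascii")       # one 7-bit code per character

# Column is the high 3 bits, row is the low 4 bits of each code.
column_row = " ".join(f"{c >> 4}/{c & 0x0F}" for c in codes)
hex_string = codes.hex().upper()

print(column_row)
print(hex_string)
```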

As we stated earlier, in the ISO set the upper case alphabet and the lower case alphabet are each given in unbroken sequences in order. This was not always the case. The early Baudot code (actually the Murray code) used the ordering:

BLANK T CR O SPACE H N M LF L R G I P C V E Z D B S Y F X A W J FS U Q K LS

where FS means figure shift, like SO, and LS means letter shift, like SI. One figure shift set was

" 5 CR 9 SPACE # , . LF ) 4 & 8 0 : ; 3 " $ ? BEL 6 ! / - 2 FS 7 1 ( LS

EBCDIC, an IBM code, while not as bad, also has breaks in the alphabet, and has the lower case letters before the upper case letters, while the ISO set has the upper before the lower. There were also such codes as BCD, FIELDATA, EXCESS-THREE, CDC DISPLAY, etc. It is fortunate that we are leaving this tower of Babel behind. (This may be wishful thinking, but it is the right attitude nonetheless.)



Portions of this web page were derived from
H. J. Bernstein and M. G. Goldstein, "G22.2262 Data Communications Lecture Notes, Fall 1983", New York University, Computer Science Department, Courant Institute of Mathematical Sciences, NY, 1984
Last Updated on 15 February 2000
By Herbert J. Bernstein
Email: yaya@bernstein-plus-sons.com