Unicode introduction

The difference between normal and unicode strings.
Normal strings in Python are sequences of 8-bit bytes, and each byte is treated as one character. The relationship between a character and its numerical value is defined by an "encoding".
Because normal strings are 8-bit, a single encoding can represent at most 256 different characters. This is a problem for texts that mix character sets (Latin with Greek and mathematical symbols, to name one typical example) and for character sets that are by nature larger than 256 characters.
To resolve this limitation of 8-bit encodings, Unicode [1] was conceived. It maps the characters of the different alphabets into one system and gives each character a unique number. Unicode characters can no longer fit into 8 bits, so different methods (encodings) of storing them have been devised [2], [3], [4], [5].
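To make the difference between a character's Unicode number and its encoded bytes concrete, here is a minimal sketch. It assumes Python 3 (the listings on this page are Python 2), where text strings are unicode strings by default. The helper name `encoded_forms` and the choice of encodings are made up for illustration.

```python
# Sketch for Python 3; the helper name encoded_forms is hypothetical.

def encoded_forms(char, encodings=("iso8859-2", "utf-8", "utf-16-le")):
    """Return a dict mapping each encoding name to the bytes it produces for char."""
    return {name: char.encode(name) for name in encodings}

char = "\u011b"  # LATIN SMALL LETTER E WITH CARON

# The Unicode number of the character; it does not fit into one byte.
print(ord(char))  # 283

# The same character becomes a different byte sequence in each encoding.
for name, raw in encoded_forms(char).items():
    print(name, list(raw))
```

The character has one number (283), but one, two, or two differently ordered bytes depending on the encoding chosen to store it.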
By default, Python assumes that normal strings are ASCII. When the text is in a different encoding, it has to be decoded with that encoding to convert it into Python's internal unicode representation. The result is a unicode string instead of a normal string.
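The decoding step can be sketched as follows. This assumes Python 3, where `bytes` plays the role of the normal 8-bit string and `str` is the unicode string. The sample bytes are chosen for illustration.

```python
# Sketch for Python 3: bytes stand in for the page's "normal" strings.

raw = b"\xec\xb9\xe8"  # three letters stored as ISO 8859-2 bytes

try:
    raw.decode("ascii")  # ASCII only defines the values 0..127
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the encoding the bytes were written in gives real characters.
text = raw.decode("iso8859-2")

print(ascii_ok)                # False
print([ord(c) for c in text])  # [283, 353, 269]
```

Decoding with the wrong encoding either fails, as with ASCII here, or silently produces the wrong characters, so the encoding has to be known in advance.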
As long as you only read the text and write it back out without manipulating it, the bytes pass through unchanged and input and output cause no problems.
iso8859-2.1.txt [iso8859-2]
ěščřřžžýýááíů

Text v kódování iso-8859-2.
Source: (unicode1-1.py)
# read the raw bytes and print them unchanged
f = open("iso8859-2.1.txt", "r")
print f.read()
f.close()
stdout: [iso8859-2]
ěščřřžžýýááíů

Text v kódování iso-8859-2.
Run time: 19.0 ms
utf-8.1.txt [utf-8]
Czech (česky)		Dobrý den
Danish (Dansk)		Hej, Goddag
English			Hello
Esperanto		Saluton (Eĥoŝanĝo ĉiuĵaŭde)
Estonian		Tere, Tervist
FORTRAN			PROGRAM
Finnish (Suomi)		Hei
French (Français)	Bonjour, Salut
German (Deutsch Nord)	Guten Tag
German (Deutsch Süd)	Grüß Gott
Greek (Ελληνικά)	Γειά σας
Hebrew			שלום
Source: (unicode1-2.py)
# read the raw bytes and print them unchanged
f = open("utf-8.1.txt", "r")
print f.read()
f.close()
stdout: [utf-8]
Czech (česky)		Dobrý den
Danish (Dansk)		Hej, Goddag
English			Hello
Esperanto		Saluton (Eĥoŝanĝo ĉiuĵaŭde)
Estonian		Tere, Tervist
FORTRAN			PROGRAM
Finnish (Suomi)		Hei
French (Français)	Bonjour, Salut
German (Deutsch Nord)	Guten Tag
German (Deutsch Süd)	Grüß Gott
Greek (Ελληνικά)	Γειά σας
Hebrew			שלום

Run time: 18.9 ms
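The two listings above work because the bytes are copied to the output without being interpreted. A minimal sketch of the same idea, assuming Python 3 and a temporary file in place of the sample files (the sample text is chosen for illustration):

```python
# Sketch for Python 3: copy bytes through a file without decoding them.
import os
import tempfile

original = "Dobr\u00fd den".encode("utf-8")  # sample text stored as UTF-8 bytes

# Write the bytes to a temporary file, then read them back in binary mode.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(original)
    path = handle.name
try:
    with open(path, "rb") as handle:
        copied = handle.read()
finally:
    os.remove(path)

print(copied == original)            # True: nothing was reinterpreted
print(len(copied))                   # 10 bytes
print(len(copied.decode("utf-8")))   # 9 characters
```

The byte count and the character count differ, which is harmless while the data is only copied but matters as soon as the text is manipulated.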
Once you try to manipulate the string, you will find that, by default, Python only knows what to do with ASCII characters. It cannot know what the other bytes mean because it does not know the encoding. Once you convert the string to a unicode string using the decode method with the right encoding, the manipulation works without a problem.
iso8859-2.1.txt [iso8859-2]
ěščřřžžýýááíů

Text v kódování iso-8859-2.
Source: (unicode1-3.py)
f = open("iso8859-2.1.txt", "r")
text = f.read()
f.close()

# use the raw 8-bit string directly: only the ASCII letters are upper-cased
print text.upper()

print "------------------------------"

# decode it, do the transformation and encode it back
text2 = text.decode('iso8859-2')
print text2.upper().encode('iso8859-2')

print "------------------------------"
print "type( text) =", type(text)
print "type( text2)=", type(text2)
stdout: [iso8859-2]
ěščřřžžýýááíů

TEXT V KóDOVáNí ISO-8859-2.
------------------------------
ĚŠČŘŘŽŽÝÝÁÁÍŮ

TEXT V KÓDOVÁNÍ ISO-8859-2.
------------------------------
type( text) = <type 'str'>
type( text2)= <type 'unicode'>
Run time: 18.7 ms
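The same contrast can be reproduced in Python 3, which is an assumption here since the listing above is Python 2. In Python 3 the 8-bit string is a `bytes` object, and its `upper` method only converts ASCII letters. The sample word is taken from the test file.

```python
# Sketch for Python 3: upper-casing bytes versus upper-casing decoded text.

raw = "k\u00f3dov\u00e1n\u00ed".encode("iso8859-2")  # one word as 8-bit data

# Upper-casing the undecoded bytes only touches the ASCII letters.
bytes_upper = raw.upper()

# Decode, transform, encode back: every letter is upper-cased.
text_upper = raw.decode("iso8859-2").upper().encode("iso8859-2")

print(bytes_upper.decode("iso8859-2") == "K\u00f3DOV\u00e1N\u00ed")  # True
print(text_upper.decode("iso8859-2") == "K\u00d3DOV\u00c1N\u00cd")   # True
```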
When working with unicode strings, we can use the same ord function to obtain a character's number. To convert a number back to a character, we use the function unichr instead of chr.
iso8859-2.1.txt [iso8859-2]
ěščřřžžýýááíů

Text v kódování iso-8859-2.
Source: (unicode1-4.py)
f = open("iso8859-2.1.txt", "r")
text = f.read()
f.close()

# decode it, then print each character, its number, and the character rebuilt from that number
text2 = text.decode('iso8859-2')
for char in text2:
    to_print = "%s %4d  %s" % (char, ord(char), unichr(ord(char)))
    print to_print.encode('utf-8')
stdout: [utf-8]
ě  283  ě
š  353  š
č  269  č
ř  345  ř
ř  345  ř
ž  382  ž
ž  382  ž
ý  253  ý
ý  253  ý
á  225  á
á  225  á
í  237  í
ů  367  ů

   10  


   10  

T   84  T
e  101  e
x  120  x
t  116  t
    32   
v  118  v
    32   
k  107  k
ó  243  ó
d  100  d
o  111  o
v  118  v
á  225  á
n  110  n
í  237  í
    32   
i  105  i
s  115  s
o  111  o
-   45  -
8   56  8
8   56  8
5   53  5
9   57  9
-   45  -
2   50  2
.   46  .
Run time: 21.8 ms
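For readers on Python 3 (an assumption; the listing above is Python 2), `unichr` no longer exists and `chr` accepts the full range of Unicode numbers. A minimal sketch of the same round trip:

```python
# Sketch for Python 3: chr takes over the role of Python 2's unichr.

text = "\u011b\u0161\u010d"  # the first three letters of the test file

# characters -> numbers
code_points = [ord(char) for char in text]

# numbers -> characters
rebuilt = "".join(chr(number) for number in code_points)

print(code_points)      # [283, 353, 269]
print(rebuilt == text)  # True
```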