Matching and data extraction

How to check whether a string conforms to a given pattern and, optionally, extract some data from it.
One of the basic uses of regexps is to check input data for a specific feature. For instance, you can read a file line by line and process only those lines that conform to a specific regexp, or handle differently formatted lines in different parts of your program.
For this purpose we use the match method of the corresponding compiled regular expression (or the module-level re.match function). Match tries to match the regexp against the string starting at its beginning. This differs from Searching, which looks for a match anywhere in the string.
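To make the difference between matching and searching concrete, here is a minimal sketch written for Python 3. The sample string and the variable names are made up for illustration and do not come from the examples on this page.

```python
import re

text = "price: ACD1356 56.9"      # made-up sample string
pattern = r"[A-Z]{3}[0-9]{3,6}"   # 3 uppercase letters, then 3-6 digits

# match() only succeeds if the pattern fits at position 0 of the string
at_start = re.match(pattern, text)

# search() scans the string and reports the first fit anywhere in it
anywhere = re.search(pattern, text)

print(at_start)            # None, because the string starts with "price"
print(anywhere.group(0))   # ACD1356
```

Because match is tied to the start of the string, it is the natural choice for validating whole lines, while search suits looking for a fragment inside longer text.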
The following code demonstrates a short program that checks data in a file line-by-line.
Input: (regexp3-1.txt)
ACD1356 56.9
AMK951 606.0
BJK008 599
aBC121 899.9
CAD35989645 89
CD468 75.3
ALLL14 79.5
ARR3687 X.3
ADD9863 36.
Source: (regexp3-1.py)
"""This short program checks whether all data in a file conform to a simple
standard. On each line there is an ID (let's say of a product) in a specific form
of 3 uppercase letters and 3-6 digits, followed by a space and a price."""

import re

# a raw string keeps the backslashes intact for the regexp engine
regexp = r"^[A-Z]{3}[0-9]{3,6} [0-9]+(\.[0-9]+)?$"
# alternatives - r"[A-Z]{3}[0-9]{3,6} [0-9]+(\.[0-9]+)?" - no start and end
#                (will allow the last line in the file)
# alternatives - r"^[A-Z]{3}\d{3,6} \d+(\.\d+)?$" - uses \d
#                (shorter, but may be more difficult to read)

with open("regexp3-1.txt", "r") as f:
    for line in f:
        if not re.match(regexp, line):
            print("!!", line.strip())
        else:
            print("ok", line.strip())
stdout:
ok ACD1356 56.9
ok AMK951 606.0
ok BJK008 599
!! aBC121 899.9
!! CAD35989645 89
!! CD468 75.3
!! ALLL14 79.5
!! ARR3687 X.3
!! ADD9863 36.
Run time: 23.1 ms
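The effect of the anchors mentioned in the program's comments can be checked in isolation. The following sketch (Python 3, variable names chosen for illustration) applies the anchored and the unanchored pattern to the last line of the sample data, and also shows re.fullmatch as one possible alternative to writing the end anchor by hand.

```python
import re

anchored = r"^[A-Z]{3}[0-9]{3,6} [0-9]+(\.[0-9]+)?$"
unanchored = r"[A-Z]{3}[0-9]{3,6} [0-9]+(\.[0-9]+)?"

line = "ADD9863 36.\n"   # last line of the sample data, newline included

# with "$" the trailing "." is left over, so the line is rejected
anchored_ok = bool(re.match(anchored, line))

# without "$" a matching prefix is enough, so the line is accepted
prefix = re.match(unanchored, line)

# "$" also matches just before a trailing newline, so valid lines
# read from a file do not need to be stripped first
newline_ok = bool(re.match(anchored, "BJK008 599\n"))

# fullmatch() requires the pattern to cover the entire string
full = re.fullmatch(unanchored, line.strip())

print(anchored_ok)       # False
print(prefix.group(0))   # ADD9863 36
print(newline_ok)        # True
print(full)              # None
```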
The following modification of the previous program introduces parentheses as the basic means of grouping and data extraction.
Source: (regexp3-2.py)
"""This short program checks whether all data in a file conform to a simple
standard. On each line there is an ID (let's say of a product) in a specific form
of 3 uppercase letters and 3-6 digits, followed by a space and a price.
For each matching item it parses the data, and at the end it prints the sum of
the prices of all such items."""

import re

price_sum = 0
# we use groups (parentheses) to be able to extract the data
regexp = r"^([A-Z]{3}[0-9]{3,6}) ([0-9]+(\.[0-9]+)?)$"

with open("regexp3-1.txt", "r") as f:
    for line in f:
        m = re.match(regexp, line)  # a 'match' object, or None if no match
        if not m:
            print("!!", line.strip())
        else:
            print("ok", line.strip())
            price = float(m.group(2))  # group 2 holds the whole price
            price_sum += price

print("Total price of valid items is", price_sum)
stdout:
ok ACD1356 56.9
ok AMK951 606.0
ok BJK008 599
!! aBC121 899.9
!! CAD35989645 89
!! CD468 75.3
!! ALLL14 79.5
!! ARR3687 X.3
!! ADD9863 36.
Total price of valid items is 1261.9
Run time: 24.1 ms
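Groups are numbered by the position of their opening parenthesis, which is why the price is group 2 even though the pattern contains a nested group for the fractional part. The sketch below (Python 3) applies the same regexp to two lines of the sample data to show what each group holds.

```python
import re

regexp = r"^([A-Z]{3}[0-9]{3,6}) ([0-9]+(\.[0-9]+)?)$"

m = re.match(regexp, "ACD1356 56.9")
print(m.group(0))   # ACD1356 56.9  (group 0 is always the whole match)
print(m.group(1))   # ACD1356       (the ID)
print(m.group(2))   # 56.9          (the whole price)
print(m.group(3))   # .9            (the optional fractional part)

# an optional group that did not take part in the match yields None
m_int = re.match(regexp, "BJK008 599")
print(m_int.groups())   # ('BJK008', '599', None)
```

If the inner parentheses are needed only for grouping, writing them as (?:\.[0-9]+)? keeps them out of the numbering.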
The last modification of this program uses a precompiled regexp object. Precompiling can save the computer some time and the programmer some effort when many regexps are used in different parts of a program.
Source: (regexp3-3.py)
"""This short program checks whether all data in a file conform to a simple
standard. On each line there is an ID (let's say of a product) in a specific form
of 3 uppercase letters and 3-6 digits, followed by a space and a price.
For each matching item it parses the data, and at the end it prints the sum of
the prices of all such items.
This version uses a precompiled regexp object.
"""

import re

price_sum = 0
# we use groups (parentheses) to be able to extract the data
regexp = re.compile(r"^([A-Z]{3}[0-9]{3,6}) ([0-9]+(\.[0-9]+)?)$")

with open("regexp3-1.txt", "r") as f:
    for line in f:
        m = regexp.match(line)  # a 'match' object, or None if no match
        if not m:
            print("!!", line.strip())
        else:
            print("ok", line.strip())
            price = float(m.group(2))  # group 2 holds the whole price
            price_sum += price

print("Total price of valid items is", price_sum)
stdout:
ok ACD1356 56.9
ok AMK951 606.0
ok BJK008 599
!! aBC121 899.9
!! CAD35989645 89
!! CD468 75.3
!! ALLL14 79.5
!! ARR3687 X.3
!! ADD9863 36.
Total price of valid items is 1261.9
Run time: 23.5 ms
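The convenience for the programmer shows best when one compiled object is shared by several pieces of code. The following is only a sketch of that idea (Python 3). The names ITEM_RE, is_valid and price_of are hypothetical helpers introduced here, not part of the programs above.

```python
import re

# compiled once, then reused by both helper functions below
ITEM_RE = re.compile(r"^([A-Z]{3}[0-9]{3,6}) ([0-9]+(\.[0-9]+)?)$")


def is_valid(line):
    """Return True if the line has the expected 'ID price' form."""
    return ITEM_RE.match(line) is not None


def price_of(line):
    """Return the price as a float, or None for a malformed line."""
    m = ITEM_RE.match(line)
    return float(m.group(2)) if m else None


lines = ["ACD1356 56.9", "aBC121 899.9", "BJK008 599"]
validity = [is_valid(item) for item in lines]
prices = [price_of(item) for item in lines]

print(validity)   # [True, False, True]
print(prices)     # [56.9, None, 599.0]
```

Keeping the pattern in one named object also means that a change to the format has to be made in a single place.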
The next program extracts references from an HTML file in a relatively strict format.
Input: (references.html)
<html>
  <head>
    <title>References</title>
  </head>
  <body>
    <ul>
      <li>Preston P. N., Tennant G.: <i>Chem. Rev.</i> <b>1972</b>, <i>72</i>, 627.</li>
      <li>Smith D. M. in: <i>Chemistry of Heterocyclic Compounds</i> (P. N. Preston, Ed.), Vol. 40, pp. 287–329. John Wiley &amp; Sons, New York 1981.</li>
      <li>Kirby A. J.: <i>Adv. Phys. Org. Chem.</i> <b>1980</b>, <i>17</i>, 183.</li>
      <li>Mandolini L. J.: <i>Adv. Phys. Org. Chem.</i> <b>1986</b>, <i>22</i>, 1.</li>
      <li>Page M. I., Jencks W. P.: <i>Gazz. Chim. Ital.</i> <b>1987</b>, <i>117</i>, 455.</li>
    </ul>
  </body>
</html>
Source: (regexp3-4.py)
"""This short program parses references in HTML."""

import re

# groups: authors, journal, year, volume (digits plus an optional letter), page
# [A-Za-z] is used because the range [A-z] would also accept [ \ ] ^ _ and `
regexp = r"^<li>([^:]+):\s+<i>(.*)</i>\s+<b>(\d+)</b>, <i>(\d+[A-Za-z]?)</i>, (\d+)\.?</li>$"

with open("references.html", "r") as f:
    for line in f:
        line = line.strip()
        if line.startswith("<li>"):
            m = re.match(regexp, line)
            if not m:
                print("!!", line)
            else:
                print(m.groups())
stdout:
('Preston P. N., Tennant G.', 'Chem. Rev.', '1972', '72', '627')
!! <li>Smith D. M. in: <i>Chemistry of Heterocyclic Compounds</i> (P. N. Preston, Ed.), Vol. 40, pp. 287–329. John Wiley &amp; Sons, New York 1981.</li>
('Kirby A. J.', 'Adv. Phys. Org. Chem.', '1980', '17', '183')
('Mandolini L. J.', 'Adv. Phys. Org. Chem.', '1986', '22', '1')
('Page M. I., Jencks W. P.', 'Gazz. Chim. Ital.', '1987', '117', '455')
Run time: 22.6 ms
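With five groups, remembering which index holds which field becomes error-prone. One possible refinement, sketched here for Python 3, is to use named groups and read the result as a dictionary. The group names (authors, journal, year, volume, page) and the name REFERENCE_RE are assumptions chosen for this illustration.

```python
import re

# the same pattern as above, split over several lines and with named groups
REFERENCE_RE = re.compile(
    r"^<li>(?P<authors>[^:]+):\s+<i>(?P<journal>.*)</i>\s+"
    r"<b>(?P<year>\d+)</b>, <i>(?P<volume>\d+[A-Za-z]?)</i>, "
    r"(?P<page>\d+)\.?</li>$"
)

line = ("<li>Kirby A. J.: <i>Adv. Phys. Org. Chem.</i> "
        "<b>1980</b>, <i>17</i>, 183.</li>")

m = REFERENCE_RE.match(line)
ref = m.groupdict()   # maps each group name to the matched text

print(ref["authors"])                # Kirby A. J.
print(ref["journal"], ref["year"])   # Adv. Phys. Org. Chem. 1980
print(ref["volume"], ref["page"])    # 17 183
```

The numbered access still works on such a match object, so m.group(3) and m.group("year") return the same text.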