Looking around

This section shows how to write a regexp that checks the surrounding context for specific conditions without consuming any of the string.
A regular expression can contain parts that consume none of the processed string and serve only to specify the context in which another part of the expression must match.
Such parts are called look-ahead and look-behind assertions.
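As a quick orientation, Python's re module has four assertion forms: (?=...) positive lookahead, (?!...) negative lookahead, (?<=...) positive lookbehind and (?<!...) negative lookbehind. The sketch below applies each to a made-up price string (the string and variable names are illustrative assumptions, not part of the original slides) and shows that the matched text never includes the context.

```python
import re

# hypothetical sample string, used only for this illustration
s = "price: 100 USD, 200 EUR"

# (?=...)  positive lookahead: digits followed by " USD"
followed = re.findall(r"\d+(?= USD)", s)

# (?!...)  negative lookahead: whole numbers NOT followed by " USD"
# (\b keeps the engine from backtracking to a partial number such as "10")
not_followed = re.findall(r"\b\d+\b(?! USD)", s)

# (?<=...) positive lookbehind: digits preceded by ": "
preceded = re.findall(r"(?<=: )\d+", s)

# (?<!...) negative lookbehind: whole numbers NOT preceded by ": "
not_preceded = re.findall(r"(?<!: )\b\d+", s)

print(followed, not_followed, preceded, not_preceded)

# the assertion is zero-width: the match holds the digits only
m = re.search(r"\d+(?= USD)", s)
print(m.group())
```

In every case the context (" USD" or ": ") is tested but left out of the result, which is what "does not consume" means in practice.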
The program below shows usage of a positive lookbehind assertion.
Source: (regexp8-1.py)
import re

text = '''Follow <a href="address">this link</a> to the HTML tutorial.
It will teach you how to create HTML links using the tag "a"
with the attribute href="something".'''

# we would like to find all links in the text but ignore false alarms,
# such as the href="something" in the last sentence of the text

# this naive way does not work here
print(re.findall(r'href=".*?"', text))

# use a positive lookbehind assertion (?<=XXX)
print(re.findall(r'(?<=<a\s)href=".*?"', text))
stdout:
['href="address"', 'href="something"']
['href="address"']
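One practical caveat for the lookbehind used above: Python's re module accepts only fixed-width lookbehind patterns, so <a\s (exactly one whitespace character) works, while <a\s+ is rejected when the pattern is compiled. The sketch below illustrates this on a hypothetical variant of the text with two spaces after "<a", and shows one possible workaround, a capturing group. It is an illustration under these assumptions, not part of the original example.

```python
import re

# hypothetical variant of the tutorial text: two spaces after "<a"
text = 'Follow <a  href="address">this link</a>.'

# a variable-width lookbehind is rejected at compile time
try:
    re.compile(r'(?<=<a\s+)href=".*?"')
    rejected = False
except re.error as exc:
    rejected = True
    print("rejected:", exc)

# the fixed-width lookbehind compiles, but it requires exactly
# one whitespace character right before href, so it finds nothing here
fixed = re.findall(r'(?<=<a\s)href=".*?"', text)
print(fixed)

# possible workaround: match the prefix normally and capture only
# the part we want; findall() returns the contents of the group
workaround = re.findall(r'<a\s+(href=".*?")', text)
print(workaround)
```

The capturing group does consume the "<a" prefix, which is harmless for findall() but matters when the consumed text must stay available, for example in re.split().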
The modified version below uses a negative lookbehind assertion to get the opposite result.
Source: (regexp8-3.py)
import re

text = '''Follow <a href="address">this link</a> to the HTML tutorial.
It will teach you how to create HTML links using the tag "a"
with the attribute href="something".'''

# positive lookbehind assertion (?<=XXX):
# href must be preceded by "<a" and one whitespace character
print(re.findall(r'(?<=<a\s)href=".*?"', text))

# negative lookbehind assertion (?<!XXX):
# href must NOT be preceded by "<a" and one whitespace character
print(re.findall(r'(?<!<a\s)href=".*?"', text))
stdout:
['href="address"']
['href="something"']
The following code demonstrates another kind of problem that a lookahead assertion can solve for us. It is taken from the slide Splitting strings.
Source: (regexp9-2.py)
"""Split sentences, version 2. Approx. complicates the problem"""

import re

text = """I have approx. 2.5 kg of meat. I am going to cook lunch for 8 people.
They will come at 12:30. I need to hurry."""

# dot and whitespace
print(re.split(r"\.\s", text))

# only a dot and whitespace followed by an uppercase letter,
# but this cuts off the uppercase letter
print(re.split(r"\.\s[A-Z]", text))

# lookbehind and lookahead assertions fix the problem:
# only the whitespace between them is consumed
print(re.split(r"(?<=\.)\s(?=[A-Z])", text))
stdout:
['I have approx', '2.5 kg of meat', 'I am going to cook lunch for 8 people', 'They will come at 12:30', 'I need to hurry.']
['I have approx. 2.5 kg of meat', ' am going to cook lunch for 8 people', 'hey will come at 12:30', ' need to hurry.']
['I have approx. 2.5 kg of meat.', 'I am going to cook lunch for 8 people.', 'They will come at 12:30.', 'I need to hurry.']
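The last pattern still consumes one character, the whitespace between the sentences, and that is why the whitespace disappears from the result. As a small sketch of the difference, the variant below moves the whitespace into the lookbehind so that the whole pattern is zero-width. This assumes Python 3.7 or newer, where re.split() also splits on empty matches; the variable names are illustrative.

```python
import re

text = """I have approx. 2.5 kg of meat. I am going to cook lunch for 8 people.
They will come at 12:30. I need to hurry."""

# the whitespace is part of the match, so split() drops it
consumed = re.split(r"(?<=\.)\s(?=[A-Z])", text)

# purely zero-width pattern: nothing is matched, so nothing is dropped
# (splitting on empty matches needs Python 3.7 or newer)
zero_width = re.split(r"(?<=\.\s)(?=[A-Z])", text)

print(consumed)
print(zero_width)

# joining the zero-width pieces gives back the original text unchanged
print("".join(zero_width) == text)
```

Which variant is preferable depends on whether the separators should survive: the zero-width version keeps the trailing space or newline on each sentence, while the version from the listing removes it.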