Greedy regexp

How much of a string a regexp matches, and how to influence it.
Regular expressions are greedy by default: each repetition matches as much of the string as possible while still allowing the rest of the pattern to match. This can lead to unwanted (and, for beginners, often unexpected) results.
Source: (regexp4-5.py)
    import re

    text = "abcdef"

    # a single repetition: greedy "*", then non-greedy "*?" and "+?"
    print(re.match("[a-z]*", text).group(0))
    print(re.match("[a-z]*?", text).group(0))
    print(re.match("[a-z]+?", text).group(0))
    print("--------------------")

    # two groups competing for the same characters
    m = re.match("([a-z]*)([a-z]*)", text)
    print("%s:%s" % (m.group(1), m.group(2)))

    m = re.match("([a-z]*?)([a-z]*)", text)
    print("%s:%s" % (m.group(1), m.group(2)))

    m = re.match("([a-z]+)([a-z]+)", text)
    print("%s:%s" % (m.group(1), m.group(2)))

    m = re.match("([a-z]+?)([a-z]+)", text)
    print("%s:%s" % (m.group(1), m.group(2)))

    m = re.match("([a-z]+?)([a-z]+?)", text)
    print("%s:%s" % (m.group(1), m.group(2)))
stdout:
abcdef

a
--------------------
abcdef:
:abcdef
abcde:f
a:bcdef
a:b
Run time: 31.8 ms
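A non-greedy repetition matches as little as possible only as long as the rest of the pattern can still match; when more is needed, it grows. The following is a minimal sketch of that behaviour (Python 3 syntax; the variable names are illustrative and not part of the original listings).

```python
import re

text = "abcdef"

# Nothing follows the non-greedy repetition, so the empty string is enough.
lazy_alone = re.match("[a-z]*?", text).group(0)

# A literal "f" must follow, so the non-greedy group has to grow to "abcde".
lazy_before_f = re.match("([a-z]*?)f", text).group(1)

# fullmatch() requires the whole string to match, so even "*?" takes it all.
lazy_full = re.fullmatch("[a-z]*?", text).group(0)

print(repr(lazy_alone))
print(repr(lazy_before_f))
print(repr(lazy_full))
```

So "non-greedy" does not mean "matches one character or nothing"; it means the shortest match that still lets the whole regexp succeed.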
The interesting input in the following example is the last string. At first sight you might expect it to be split into ('I have ', '2', ' dogs and 3 cats'). However, because the first ".*" is greedy and "." also matches digits, it matches as much of the string as possible while still allowing the rest of the regexp to match.
Source: (regexp4-1.py)
    import re

    regexp = re.compile("(.*)([0-9]+)(.*)")  # finds a number in a string
    strings = ["I have 3 dogs.", "I had 2 hot dogs", "2 white snakes", "I have 2 dogs and 3 cats"]

    for string in strings:
        m = regexp.match(string)
        print(m.groups())
stdout:
('I have ', '3', ' dogs.')
('I had ', '2', ' hot dogs')
('', '2', ' white snakes')
('I have 2 dogs and ', '3', ' cats')
Run time: 23.6 ms
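The same greediness also cuts multi-digit numbers apart: the first group takes everything except the single digit that "[0-9]+" needs in order to match. A small sketch in Python 3 syntax, using an input string that is not among the original examples:

```python
import re

regexp = re.compile("(.*)([0-9]+)(.*)")

# Hypothetical input, chosen because it contains a two-digit number.
m = regexp.match("I have 12 dogs")
groups = m.groups()
print(groups)
```

Here the first group ends up as 'I have 1' and the number group holds only '2'.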
The solution to this problem is to follow the repetition operator (be it "*" or any other) with "?", which makes it non-greedy (see Syntax).
Source: (regexp4-2.py)
    import re

    regexp = re.compile("(.*?)([0-9]+)(.*)")  # finds a number in a string, non-greedy
    strings = ["I have 3 dogs.", "I had 2 hot dogs", "2 white snakes", "I have 2 dogs and 3 cats"]

    for string in strings:
        m = regexp.match(string)
        print(m.groups())
stdout:
('I have ', '3', ' dogs.')
('I had ', '2', ' hot dogs')
('', '2', ' white snakes')
('I have ', '2', ' dogs and 3 cats')
Run time: 23.8 ms
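The "?" suffix works with every repetition operator in Python's re module, not just "*". The sketch below (Python 3 syntax; the test string and variable names are only illustrative) compares each greedy form with its non-greedy counterpart.

```python
import re

text = "aaaa"

# (greedy, non-greedy) forms of each repetition operator
pairs = [("a*", "a*?"), ("a+", "a+?"), ("a?", "a??"), ("a{2,3}", "a{2,3}?")]

results = {}
for greedy, lazy in pairs:
    results[greedy] = re.match(greedy, text).group(0)
    results[lazy] = re.match(lazy, text).group(0)
    print("%s: %r   %s: %r" % (greedy, results[greedy], lazy, results[lazy]))
```

Each non-greedy form returns the shortest match its operator allows: nothing for "*?" and "??", one character for "+?", and two for "{2,3}?".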
The problem of greediness is even more pronounced when one tries to process HTML code.
Source: (regexp4-3.py)
    import re

    regexp = re.compile("(<.*>)(.*)(</.*>)")  # separate markup from the text of a string
    strings = ["<h1>Title</h1>", "<p>I have <b>12</b> dogs.</p>"]

    for string in strings:
        m = regexp.match(string)
        print(m.groups())
stdout:
('<h1>', 'Title', '</h1>')
('<p>I have <b>12</b>', ' dogs.', '</p>')
Run time: 22.8 ms
Source: (regexp4-4.py)
    import re

    regexp = re.compile("(<.*?>)(.*)(</.*>)")  # separate markup from the text, non-greedy opening tag
    strings = ["<h1>Title</h1>", "<p>I have <b>12</b> dogs.</p>"]

    for string in strings:
        m = regexp.match(string)
        print(m.groups())
stdout:
('<h1>', 'Title', '</h1>')
('<p>', 'I have <b>12</b> dogs.', '</p>')
Run time: 22.7 ms
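Actually stripping the tags with re.sub, rather than splitting the string around them, runs into the same pitfall. This is only a rough sketch in Python 3 syntax with illustrative variable names; it assumes simple markup with no ">" inside attribute values.

```python
import re

html = "<p>I have <b>12</b> dogs.</p>"

# Greedy: "<.*>" runs from the first "<" to the last ">", so everything goes.
greedy_result = re.sub("<.*>", "", html)

# Non-greedy: "<.*?>" stops at the first ">" after each "<", so only tags go.
lazy_result = re.sub("<.*?>", "", html)

print(repr(greedy_result))
print(repr(lazy_result))
```

The greedy pattern leaves an empty string, while the non-greedy one leaves 'I have 12 dogs.'.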

Exercises

  1. What happens when we make both the first and the second "*" in the last example (regexp4-4.py) non-greedy?