
## Greedy regexp

How much of a string a regular expression matches, and how to influence it.
Regular expressions are greedy by default: each quantifier matches as much of the string as it can while still allowing the whole expression to match. This can lead to unwanted (and, for beginners, often unexpected) results.
Source: (regexp4-5.py)
```python
import re

text = "abcdef"

print(re.match(r"[a-z]*", text).group(0))   # greedy: takes everything
print(re.match(r"[a-z]*?", text).group(0))  # non-greedy: takes nothing
print(re.match(r"[a-z]+?", text).group(0))  # non-greedy: takes one character
print("--------------------")

# Two groups compete for the same characters; the first one chooses first.
m = re.match(r"([a-z]*)([a-z]*)", text)
print("%s:%s" % (m.group(1), m.group(2)))

m = re.match(r"([a-z]*?)([a-z]*)", text)
print("%s:%s" % (m.group(1), m.group(2)))

m = re.match(r"([a-z]+)([a-z]+)", text)
print("%s:%s" % (m.group(1), m.group(2)))

m = re.match(r"([a-z]+?)([a-z]+)", text)
print("%s:%s" % (m.group(1), m.group(2)))

m = re.match(r"([a-z]+?)([a-z]+?)", text)
print("%s:%s" % (m.group(1), m.group(2)))
```
stdout:
```
abcdef

a
--------------------
abcdef:
:abcdef
abcde:f
a:bcdef
a:b
```
Run time: 31.8 ms
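The `?` suffix is not limited to `*` and `+`. The following sketch, written for Python 3, applies it to the optional quantifier `?` and to the counted forms `{m,n}` and `{m,}` on the same `text`. The `pairs` and `results` names are only illustrative.

```python
import re

text = "abcdef"

# Each pair is (greedy pattern, non-greedy pattern); they differ only by the trailing "?".
pairs = [
    (r"[a-z]?", r"[a-z]??"),          # zero or one character
    (r"[a-z]{2,4}", r"[a-z]{2,4}?"),  # between two and four characters
    (r"[a-z]{2,}", r"[a-z]{2,}?"),    # two or more characters
]

results = []
for greedy, lazy in pairs:
    long_match = re.match(greedy, text).group(0)
    short_match = re.match(lazy, text).group(0)
    results.append((long_match, short_match))
    print("%s -> %r, %s -> %r" % (greedy, long_match, lazy, short_match))
```

In every pair the greedy form takes the upper end of its allowed range and the non-greedy form takes the lower end.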
The interesting input in the following example is the last string. At first sight you might expect it to be split into ('I have ', '2', ' dogs and 3 cats'). However, because the first ".*" is greedy and "." also matches digits, it consumes as much of the string as possible while still allowing the rest of the regexp to match, so the number that gets captured is the last one in the string.
Source: (regexp4-1.py)
```python
import re

regexp = re.compile(r"(.*)([0-9]+)(.*)")  # finds a number in a string
strings = ["I have 3 dogs.", "I had 2 hot dogs", "2 white snakes", "I have 2 dogs and 3 cats"]

for string in strings:
    m = regexp.match(string)
    print(m.groups())
```
stdout:
```
('I have ', '3', ' dogs.')
('I had ', '2', ' hot dogs')
('', '2', ' white snakes')
('I have 2 dogs and ', '3', ' cats')
```
Run time: 23.6 ms
The solution to this problem is to follow the quantifier (be it * or any other) with "?", which makes it non-greedy (see Syntax).
Source: (regexp4-2.py)
```python
import re

regexp = re.compile(r"(.*?)([0-9]+)(.*)")  # finds a number in a string, non-greedy
strings = ["I have 3 dogs.", "I had 2 hot dogs", "2 white snakes", "I have 2 dogs and 3 cats"]

for string in strings:
    m = regexp.match(string)
    print(m.groups())
```
stdout:
```
('I have ', '3', ' dogs.')
('I had ', '2', ' hot dogs')
('', '2', ' white snakes')
('I have ', '2', ' dogs and 3 cats')
```
Run time: 23.8 ms
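Greediness also matters when the number has more than one digit, because the greedy ".*" leaves "[0-9]+" only the single digit it needs. The sketch below, written for Python 3, runs both patterns from the examples above on a made-up input, "I have 12 dogs", which is not one of the original test strings.

```python
import re

greedy = re.compile(r"(.*)([0-9]+)(.*)")
lazy = re.compile(r"(.*?)([0-9]+)(.*)")

string = "I have 12 dogs"  # illustrative input, not from the original examples

greedy_groups = greedy.match(string).groups()
lazy_groups = lazy.match(string).groups()

print(greedy_groups)  # the greedy ".*" swallows the first digit of the number
print(lazy_groups)    # the non-greedy ".*?" stops before the first digit
```

With the greedy pattern the number is cut to its last digit, '2'; with the non-greedy pattern the second group holds the whole number, '12'.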
The problem of greediness is even more pronounced when one tries to process HTML code. In the second string below, the greedy "<.*>" runs on to the last ">" that still leaves a closing tag after it, so the first group swallows most of the content.
Source: (regexp4-3.py)
```python
import re

regexp = re.compile(r"(<.*>)(.*)(</.*>)")  # split into opening tag, content, closing tag
strings = ["<h1>Title</h1>", "<p>I have <b>12</b> dogs.</p>"]

for string in strings:
    m = regexp.match(string)
    print(m.groups())
```
stdout:
```
('<h1>', 'Title', '</h1>')
('<p>I have <b>12</b>', ' dogs.', '</p>')
```
Run time: 22.8 ms
Making the first quantifier non-greedy stops the first group at the end of the opening tag.
Source: (regexp4-4.py)
```python
import re

regexp = re.compile(r"(<.*?>)(.*)(</.*>)")  # split into opening tag, content, closing tag
strings = ["<h1>Title</h1>", "<p>I have <b>12</b> dogs.</p>"]

for string in strings:
    m = regexp.match(string)
    print(m.groups())
```
stdout:
```
('<h1>', 'Title', '</h1>')
('<p>', 'I have <b>12</b> dogs.', '</p>')
```
Run time: 22.7 ms
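The same difference shows up when the tag pattern is used on its own. As a minimal sketch (Python 3, reusing the second test string from above), `re.findall` lists what "<.*>" and "<.*?>" each match, and `re.sub` with the non-greedy form removes the markup from this simple string. The variable names are only illustrative.

```python
import re

string = "<p>I have <b>12</b> dogs.</p>"

greedy_tags = re.findall(r"<.*>", string)   # one match, from the first "<" to the last ">"
lazy_tags = re.findall(r"<.*?>", string)    # one match per tag
stripped = re.sub(r"<.*?>", "", string)     # replace every tag with an empty string

print(greedy_tags)
print(lazy_tags)
print(stripped)
```

The greedy pattern returns the whole string as a single "tag", while the non-greedy one returns the four tags separately and leaves 'I have 12 dogs.' after substitution.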

#### Exercise

1. What happens when we make both the first and the second "*" in the last example non-greedy?