Python-Ref > Regular expressions (regexp) > Multiline matches and searches
 
 

<-^^->
Klíčová slova
Moduly
Knihovní funkce

Multiline matches and searches

How to search for text that spans multiple lines.
By default the "." character in Regexps matches everything with the exception of newline character "\n". This means that searching for parts of text that span multiple lines is not possible by simply using the ".".
Expand/Shrink
<html>
<head>
  <title>Example HTML</title>
</head>
<body>
  <p>Single line paragraph.</p>

  <p>Paragraph that
  was split to several
  lines.</p>
  
  <p>Another paragrapth split to more lines, this time it is not so obvious.
</p>
</body>
</html>
Zdroj: (regexp10-1.py)
  1   import re
  2   
  3   f = file( "re-example.html", "r")
  4   text = f.read()
  5   f.close()
  6   
  7   for para in re.finditer( "<p>.*</p>", text):
  8     print "paragraph:", para.group( 0)
  9     print 
stdout:
paragraph: <p>Single line paragraph.</p>

Doba běhu: 22.6 ms
To be able to find all paragraphs in the previous example, we must include the newline into our pattern. This may be done either by implicitly including it somehow..
Expand/Shrink
<html>
<head>
  <title>Example HTML</title>
</head>
<body>
  <p>Single line paragraph.</p>

  <p>Paragraph that
  was split to several
  lines.</p>
  
  <p>Another paragrapth split to more lines, this time it is not so obvious.
</p>
</body>
</html>
Zdroj: (regexp10-2.py)
  1   import re
  2   
  3   f = file( "re-example.html", "r")
  4   text = f.read()
  5   f.close()
  6   
  7   for para in re.finditer( "<p>(\n|.)*?</p>", text):
  8     # we must use non-greedy version, otherwise the whole text will be
  9     # eaten in one go
 10     print "paragraph:", para.group( 0)
 11     print 
stdout:
paragraph: <p>Single line paragraph.</p>

paragraph: <p>Paragraph that
  was split to several
  lines.</p>

paragraph: <p>Another paragrapth split to more lines, this time it is not so obvious.
</p>

Doba běhu: 22.7 ms
..or altering the default behaviour of the "." using the option re.S (re.DOTALL). This makes the "." match even the newline character. This second approach is usually preferred, because it makes the regexp easier, especially for more complicated patterns.
Expand/Shrink
<html>
<head>
  <title>Example HTML</title>
</head>
<body>
  <p>Single line paragraph.</p>

  <p>Paragraph that
  was split to several
  lines.</p>
  
  <p>Another paragrapth split to more lines, this time it is not so obvious.
</p>
</body>
</html>
Zdroj: (regexp10-3.py)
  1   import re
  2   
  3   f = file( "re-example.html", "r")
  4   text = f.read()
  5   f.close()
  6   
  7   for para in re.finditer( "<p>.*?</p>", text, re.S):
  8     # we must use non-greedy version, otherwise the whole text will be
  9     # eaten in one go
 10     print "paragraph:", para.group( 0)
 11     print 
stdout:
paragraph: <p>Single line paragraph.</p>

paragraph: <p>Paragraph that
  was split to several
  lines.</p>

paragraph: <p>Another paragrapth split to more lines, this time it is not so obvious.
</p>

Doba běhu: 23.1 ms