Searching

How to find a particular piece of information in a string.
It is not always possible or desirable to describe the structure of a whole string in a regexp. In such cases it is more useful to search for just the part of the data that is of interest.
Python provides three functions for regexp searching: search, findall and finditer. search returns a match object for the first match (or None if there is no match), findall returns a list of all matching strings, and finditer allows iteration over all matches in the form of match objects.
Source: (regexp5-1.py)
import re

text = "Mary and Jane went to the cinema, while we remained at home with John and played Quake 7"
name_regexp = "[A-Z][a-z]+"   # a capital letter followed by lowercase letters

print "--- search ---"
m = re.search( name_regexp, text)   # match object for the first match only
print m
print m.group( 0)

print "--- findall ---"
for m in re.findall( name_regexp, text):   # here m is a plain string
    print m

print "--- finditer ---"
for m in re.finditer( name_regexp, text):  # here m is a match object
    print m
    print m.group(0)
stdout:
--- search ---
<_sre.SRE_Match object at 0x2b2b419df7b0>
Mary
--- findall ---
Mary
Jane
John
Quake
--- finditer ---
<_sre.SRE_Match object at 0x2b2b419df7b0>
Mary
<_sre.SRE_Match object at 0x2b2b419df800>
Jane
<_sre.SRE_Match object at 0x2b2b419df7b0>
John
<_sre.SRE_Match object at 0x2b2b419df800>
Quake
Run time: 22.9 ms
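Two details of match objects are easy to miss in the listing above: search returns None when nothing matches, and every match object also records where in the string the match was found. The following is a minimal sketch of both points, not part of the original examples. The shortened text and the variable names found and positions are made up for illustration, and the code uses single-argument print(...) calls so that it should behave the same under Python 2 and Python 3.

```python
import re

text = "Mary and Jane went to the cinema"
name_regexp = "[A-Z][a-z]+"

# search returns None when nothing matches, so test the result before using it
found = re.search("[0-9]+", text)
if found is None:
    print("no number found")

# match objects carry the position of the match in the searched string
positions = [(m.group(0), m.start(), m.end())
             for m in re.finditer(name_regexp, text)]
for name, start, end in positions:
    print("%s at %d-%d" % (name, start, end))
# prints:
# no number found
# Mary at 0-4
# Jane at 9-13
```

Calling m.group(0) on the result of a failed search raises an AttributeError, which is why the check for None comes first.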
The findall function changes its behaviour slightly when groups are present in the regexp. In that case it returns only the content of the groups rather than the whole match. With one group, each item in the returned list is a string; with more groups, each item is a tuple containing all the groups.
Source: (regexp5-2.py)
import re

text = "We need a piece of string 15 cm long, about 2in of duck tape and 3m of rope."

total = 0
# two groups: the number and the unit; a raw string keeps \s intact
for m in re.findall( r"([0-9]+)\s*([a-z]{1,2})", text):
    print m
    length, unit = m          # each item is a (number, unit) tuple of strings
    if unit == "cm":
        total += float( length)
    elif unit == 'm':
        total += float( length)*100
    elif unit == "in":
        total += float( length)*2.54
    else:
        print "unknown unit %s" % unit

print "We need a total of %.1f cm of stuff." % total
stdout:
('15', 'cm')
('2', 'in')
('3', 'm')
We need a total of 320.1 cm of stuff.
Run time: 22.9 ms
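The way findall reacts to groups can be compared side by side by running nearly the same regexp with zero, one and two groups. This is only an illustrative sketch: the shortened text and the variable names are made up here. The last variant uses non-capturing groups, written (?:...), which is one way to keep the grouping while still getting the whole match back.

```python
import re

text = "15 cm, 2in and 3m"

# no groups: each item is the whole matched string
no_groups = re.findall(r"[0-9]+\s*[a-z]{1,2}", text)
# one group: each item is the string captured by that group
one_group = re.findall(r"([0-9]+)\s*[a-z]{1,2}", text)
# two groups: each item is a tuple with one string per group
two_groups = re.findall(r"([0-9]+)\s*([a-z]{1,2})", text)
# non-capturing groups do not count, so the whole match is returned again
non_capturing = re.findall(r"(?:[0-9]+)\s*(?:[a-z]{1,2})", text)

print(no_groups)      # ['15 cm', '2in', '3m']
print(one_group)      # ['15', '2', '3']
print(two_groups)     # [('15', 'cm'), ('2', 'in'), ('3', 'm')]
print(non_capturing)  # ['15 cm', '2in', '3m']
```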
The following code demonstrates the searching capabilities of regexps on a simple URL extraction function. It could serve as the basis of a simple, naive web spider.
Source: (regexp5-3.py)
import re
import urllib2   # library for accessing the web

def get_text_from_URL( url):
    """tries to get the content of a URL"""
    try:
        f = urllib2.urlopen( url)   # try to open the URL for reading
    except urllib2.URLError:        # if an error occurs...
        return None                 # ...the page is not available, return None

    # the code below will be executed only if the URL was successfully opened
    text = f.read()                # read the content
    f.close()                      # close the URL
    return text


def get_links_from_text( text):
    """returns all links from a string, sorts links into 4 categories:
    http-links - absolute links, using http protocol,
    non-http-links - absolute links, using other than http protocol,
    relative-links - links relative to the current document,
    local-links - links to other parts of the same document"""
    http_urls = []
    other_abs_urls = []
    relative_urls = []
    local_urls = []
    for address in re.findall( 'href="(.*?)"', text, re.IGNORECASE):
        # findall returns only the groups from regexp if they are present
        if re.match( 'http://.*', address):      # http protocol
            http_urls.append( address)
        elif re.match( '[a-z]+://.*', address):  # non-http protocol
            other_abs_urls.append( address)
        elif re.match( '#.*', address):          # local link
            local_urls.append( address)
        else:                                    # must be relative link
            relative_urls.append( address)
    return http_urls, other_abs_urls, relative_urls, local_urls


if __name__ == "__main__":
    import sys
    if len( sys.argv) <= 1:
        url = "http://www.python.org"
    else:
        url = sys.argv[1]

    t = get_text_from_URL( url)
    if t is not None:   # None means the URL could not be opened
        print "*** reading %s ***" % url
        hl, ol, rl, ll = get_links_from_text( t)
        print "--- absolute http links ---"
        for a in hl:
            print a
        print "--- absolute non-http links ---"
        for a in ol:
            print a
        print "--- relative links ---"
        for a in rl:
            print a
        print "--- local links ---"
        for a in ll:
            print a
    else:
        print "!! URL could not be opened - %s" % url
stdout:
*** reading http://www.python.org ***
--- absolute http links ---
http://www.python.org/channews.rdf
http://aspn.activestate.com/ASPN/Cookbook/Python/index_rss
http://python-groups.blogspot.com/feeds/posts/default
http://www.showmedo.com/latestVideoFeed/rss2.0?tag=python
http://www.awaretek.com/python/index.xml
http://pyfound.blogspot.com/feeds/posts/default
http://www.python.org/dev/peps/peps.rss
http://docs.python.org/
http://pypi.python.org/pypi
http://wiki.python.org/moin/PythonWebsiteCreatingNewTickets
http://wiki.python.org/moin/WebProgramming
http://wiki.python.org/moin/CgiScripts
http://www.zope.org/
http://www.djangoproject.com/
http://www.turbogears.org/
http://pyxml.sourceforge.net/topics/
http://wiki.python.org/moin/DatabaseProgramming/
http://www.egenix.com/files/python/mxODBC.html
http://sourceforge.net/projects/mysql-python
http://wiki.python.org/moin/GuiProgramming
http://wiki.python.org/moin/WxPython
http://wiki.python.org/moin/TkInter
http://wiki.python.org/moin/PyGtk
http://wiki.python.org/moin/PyQt
http://wiki.python.org/moin/NumericAndScientific
http://www.pasteur.fr/recherche/unites/sis/formation/python/index.html
http://www.pentangle.net/python/handbook/
http://www.ibiblio.org/obp/pyBiblio/
http://osl.iu.edu/~lums/swc/
http://www.amk.ca/python/howto/sockets/
http://twistedmatrix.com/trac/
http://buildbot.sf.net
http://www.edgewall.com/trac/
http://roundup.sourceforge.net/
http://wiki.python.org/moin/IntegratedDevelopmentEnvironments
http://www.pygame.org/news.html
http://www.alobbs.com/pykyra
http://www.vrplumber.com/py3d.py
http://pycon.org/
http://www.swa.hpi.uni-potsdam.de/dls/dls08/
http://jython.eventwax.com/jython-sprint
http://wiki.python.org/jython/RoadMap
http://wiki.python.org/moin/PythonBugDay
http://bugs.python.org
http://us.pycon.org/2008/registration/financial-aid/
http://www.swa.hpi.uni-potsdam.de/s3/
http://www.xs4all.com/
http://www.pollenation.net/
--- absolute non-http links ---
--- relative links ---
/styles/screen-switcher-default.css
/styles/netscape4.css
/styles/print.css
/styles/largestyles.css
/styles/defaultfonts.css
/search-pysite.xml
/search-pywiki.xml
/search-pybooks.xml
/search-pydocs.xml
/search-pymodules.xml
/search-pycheese.xml
/search-pythonlist.xml
/
/search
/about/
/news/
/doc/
/download/
/community/
/psf/
/dev/
/links/
/download/releases/2.5.2
/ftp/python/2.5.2/python-2.5.2.msi
/ftp/python/2.5.2/Python-2.5.2.tar.bz2
/community/jobs
/psf/donations/
/about/success/usa
about/success/rackspace
about/success/ilm
about/success/astra
about/success/honeywell
about/success
/about/quotes
/community/sigs/current/edu-sig
/about/apps
/about/apps
/about/apps
/about/apps
/psf/license
/psf
/about
/download
/download/releases/2.5.2/
/download/releases/2.5.2/
/channews.rdf
/about/website
/about/legal
--- local links ---
#left-hand-navigation
#content-body
Run time: 218.0 ms
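The link classification can also be tried without network access by running the same regexps over a small HTML fragment. The sketch below is an illustration only: the fragment, its example.org address and the helper name classify_link are hypothetical and do not come from the listing above. The trailing .* of the original patterns is left out, because re.match only anchors at the start of the string and the result is the same without it.

```python
import re

# hypothetical HTML fragment used in place of a downloaded page
html = '''<a href="http://www.python.org/doc/">docs</a>
<a HREF="ftp://ftp.example.org/file.tgz">archive</a>
<a href="/about/">about</a>
<a href="#content-body">skip</a>'''

def classify_link(address):
    """returns the category name of one link (hypothetical helper)"""
    if re.match('http://', address):        # http protocol
        return "http"
    elif re.match('[a-z]+://', address):    # non-http protocol
        return "other"
    elif re.match('#', address):            # local link
        return "local"
    else:                                   # must be relative link
        return "relative"

# the single group makes findall return only the address inside the quotes
links = re.findall('href="(.*?)"', html, re.IGNORECASE)
categories = [(classify_link(a), a) for a in links]
for category, address in categories:
    print("%-8s %s" % (category, address))
```

With these tests, an address starting with https:// ends up in the "other" category, just as it would in the original function.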