Splitting strings

How to split strings using regular expressions.
Source: (regexp9-3.py)
import re

text = '''The occupations of A, B, and C are many and varied. In
the older arithmetics they contented themselves with
doing "a certain piece of work." This statement of the
case however, was found too sly and mysterious, or possibly
lacking in romantic charm. It became the fashion to define
the job more clearly and to set them at walking matches,
ditch-digging, regattas, and piling cord wood. At times,
they became commercial and entered into partnership,
having with their old mystery a "certain" capital. Above
all they revel in motion. When they tire of
walking-matches--A rides on horseback, or borrows a
bicycle and competes with his weaker-minded associates
on foot. Now they race on locomotives; now they row; or
again they become historical and engage stage-coaches;
or at times they are aquatic and swim. If their occupation
is actual work they prefer to pump water into cisterns,
two of which leak through holes in the bottom and one of
which is water-tight. A, of course, has the good one; he
also takes the bicycle, and the best locomotive, and the
right of swimming with the current. Whatever they do they
put money on it, being all three sports. A always wins.
'''

# Plain split() leaves many odd tokens: some with punctuation,
# some with hyphens, etc. Print every token that is not purely letters.
# ([A-Za-z] is used because [A-z] would also match [, \, ], ^, _ and `.)
words = text.split()
print("Normal:", [word for word in words if not re.match("^[A-Za-z]+$", word)])
print()

# Splitting on a character class of separators removes most of them.
words2 = re.split("[- \n.,;]", text)
print("Regexp:", [word for word in words2 if word and not re.match("^[A-Za-z]+$", word)])
print()
print(words2)
stdout:
Normal: ['A,', 'B,', 'varied.', '"a', 'work."', 'however,', 'mysterious,', 'charm.', 'matches,', 'ditch-digging,', 'regattas,', 'wood.', 'times,', 'partnership,', '"certain"', 'capital.', 'motion.', 'walking-matches--A', 'horseback,', 'weaker-minded', 'foot.', 'locomotives;', 'row;', 'stage-coaches;', 'swim.', 'cisterns,', 'water-tight.', 'A,', 'course,', 'one;', 'bicycle,', 'locomotive,', 'current.', 'it,', 'sports.', 'wins.']

Regexp: ['"a', '"', '"certain"']

['The', 'occupations', 'of', 'A', '', 'B', '', 'and', 'C', 'are', 'many', 'and', 'varied', '', 'In', 'the', 'older', 'arithmetics', 'they', 'contented', 'themselves', 'with', 'doing', '"a', 'certain', 'piece', 'of', 'work', '"', 'This', 'statement', 'of', 'the', 'case', 'however', '', 'was', 'found', 'too', 'sly', 'and', 'mysterious', '', 'or', 'possibly', 'lacking', 'in', 'romantic', 'charm', '', 'It', 'became', 'the', 'fashion', 'to', 'define', 'the', 'job', 'more', 'clearly', 'and', 'to', 'set', 'them', 'at', 'walking', 'matches', '', 'ditch', 'digging', '', 'regattas', '', 'and', 'piling', 'cord', 'wood', '', 'At', 'times', '', 'they', 'became', 'commercial', 'and', 'entered', 'into', 'partnership', '', 'having', 'with', 'their', 'old', 'mystery', 'a', '"certain"', 'capital', '', 'Above', 'all', 'they', 'revel', 'in', 'motion', '', 'When', 'they', 'tire', 'of', 'walking', 'matches', '', 'A', 'rides', 'on', 'horseback', '', 'or', 'borrows', 'a', 'bicycle', 'and', 'competes', 'with', 'his', 'weaker', 'minded', 'associates', 'on', 'foot', '', 'Now', 'they', 'race', 'on', 'locomotives', '', 'now', 'they', 'row', '', 'or', 'again', 'they', 'become', 'historical', 'and', 'engage', 'stage', 'coaches', '', 'or', 'at', 'times', 'they', 'are', 'aquatic', 'and', 'swim', '', 'If', 'their', 'occupation', 'is', 'actual', 'work', 'they', 'prefer', 'to', 'pump', 'water', 'into', 'cisterns', '', 'two', 'of', 'which', 'leak', 'through', 'holes', 'in', 'the', 'bottom', 'and', 'one', 'of', 'which', 'is', 'water', 'tight', '', 'A', '', 'of', 'course', '', 'has', 'the', 'good', 'one', '', 'he', 'also', 'takes', 'the', 'bicycle', '', 'and', 'the', 'best', 'locomotive', '', 'and', 'the', 'right', 'of', 'swimming', 'with', 'the', 'current', '', 'Whatever', 'they', 'do', 'they', 'put', 'money', 'on', 'it', '', 'being', 'all', 'three', 'sports', '', 'A', 'always', 'wins', '', '']
Run time: 17.0 ms
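One detail in the output above: a character class without a quantifier splits at every single separator character, so sequences such as ", " or "--" leave empty strings in the result. That is why the listing filters with `if word`. A minimal sketch of an alternative, assuming it is acceptable to treat a whole run of separators as one delimiter; the `sample` string and the variable names are made up for illustration:

```python
import re

# Hypothetical sample; any text with adjacent separators behaves the same way.
sample = "ditch-digging, regattas, and walking-matches--A rides."

# One split per separator character: adjacent separators yield empty strings.
single = re.split(r"[- \n.,;]", sample)

# One split per run of separators: only an empty string at the edge remains.
runs = re.split(r"[- \n.,;]+", sample)

# Drop whatever empty strings are left.
words = [word for word in runs if word]

print(single)
print(runs)
print(words)
```

The `+` version still returns an empty string when the text starts or ends with a separator, so the final filter is kept.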
If basic string splitting with str.split is not powerful enough for the task at hand, use the regular-expression split, re.split.
Source: (regexp9-1.py)
"""Split sentences"""

import re

text = """I have 2.5 kg of meat. I am going to cook lunch for 8 people.
They will come at 12:30. I need to hurry."""

# normal string split: also breaks the decimal number 2.5
print(text.split("."))

# slightly better version: misses the dot followed by a newline
print(text.split(". "))

# regexp version: a dot followed by any whitespace character
print(re.split(r"\.\s", text))

# better regexp version: the lookbehind preserves the dots (.)
print(re.split(r"(?<=\.)\s", text))
stdout:
['I have 2', '5 kg of meat', ' I am going to cook lunch for 8 people', '\nThey will come at 12:30', ' I need to hurry', '']
['I have 2.5 kg of meat', 'I am going to cook lunch for 8 people.\nThey will come at 12:30', 'I need to hurry.']
['I have 2.5 kg of meat', 'I am going to cook lunch for 8 people', 'They will come at 12:30', 'I need to hurry.']
['I have 2.5 kg of meat.', 'I am going to cook lunch for 8 people.', 'They will come at 12:30.', 'I need to hurry.']
Run time: 22.2 ms
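The lookbehind `(?<=\.)` in the last call is zero-width: it only checks that a dot precedes the whitespace, so the dot is not part of the delimiter and stays attached to its sentence. For comparison, here is a small sketch of the other way `re.split` can keep delimiters, a capturing group, which returns the dots as separate list items. The `sample` text and the variable names are invented for illustration:

```python
import re

sample = "One. Two. Three."

# The dot is consumed as part of the delimiter and disappears.
consumed = re.split(r"\.\s", sample)

# Capturing group: the captured dot comes back as its own list item.
captured = re.split(r"(\.)\s", sample)

# Lookbehind: the dot is only checked, so it stays on the sentence.
lookbehind = re.split(r"(?<=\.)\s", sample)

print(consumed)    # ['One', 'Two', 'Three.']
print(captured)    # ['One', '.', 'Two', '.', 'Three.']
print(lookbehind)  # ['One.', 'Two.', 'Three.']
```

The lookbehind form is usually the more convenient one for sentences, since no second pass is needed to glue the dots back on.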
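The following code shows an even subtler problem: an abbreviation such as "approx." ends with a dot followed by whitespace, yet it does not end a sentence. This can again be fixed with lookaround assertions.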
Source: (regexp9-2.py)
"""Split sentences, version 2. "Approx." complicates the problem"""

import re

text = """I have approx. 2.5 kg of meat. I am going to cook lunch for 8 people.
They will come at 12:30. I need to hurry."""

# dot and whitespace: wrongly splits after "approx."
print(re.split(r"\.\s", text))

# only a dot and whitespace that come before an uppercase letter,
# but this cuts off the uppercase letter
print(re.split(r"\.\s[A-Z]", text))

# lookbehind and lookahead assertions fix the problem:
# neither the dot nor the uppercase letter is consumed
print(re.split(r"(?<=\.)\s(?=[A-Z])", text))
stdout:
['I have approx', '2.5 kg of meat', 'I am going to cook lunch for 8 people', 'They will come at 12:30', 'I need to hurry.']
['I have approx. 2.5 kg of meat', ' am going to cook lunch for 8 people', 'hey will come at 12:30', ' need to hurry.']
['I have approx. 2.5 kg of meat.', 'I am going to cook lunch for 8 people.', 'They will come at 12:30.', 'I need to hurry.']
Run time: 23.2 ms
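The final pattern can be wrapped in a small helper. The sketch below is only an illustration: `split_sentences` and `SENTENCE_BOUNDARY` are hypothetical names, and the approach assumes that every sentence starts with an uppercase ASCII letter. The second call shows a limitation of that assumption: an abbreviation followed by a capitalized word is still treated as a sentence boundary.

```python
import re

# Whitespace that follows a dot and precedes an uppercase ASCII letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=\.)\s(?=[A-Z])")


def split_sentences(text):
    """Split text at the boundaries described above (hypothetical helper)."""
    return SENTENCE_BOUNDARY.split(text)


# "approx." is followed by a digit, so no split happens there.
first = split_sentences("I have approx. 2.5 kg of meat. I need to hurry.")
print(first)   # ['I have approx. 2.5 kg of meat.', 'I need to hurry.']

# "Dr." is followed by a capital letter, so the heuristic splits too early.
second = split_sentences("I met Dr. Smith today. He was late.")
print(second)  # ['I met Dr.', 'Smith today.', 'He was late.']
```

Handling such cases reliably needs more than one regular expression, for example a list of known abbreviations.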