String handling is one of Python’s greatest strengths. String manipulation in many languages can be a complex and tedious process. But with Python we can perform quite complex operations on strings with only a small amount of code. For example, consider a situation where we want to remove punctuation characters from a string. We can remove punctuation in Python in a vast number of ways. But let’s begin with a fairly straightforward example that uses a loop.
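punctuatedString = "This, string, just has-way too many punctuation marks in it."
# The backslash is doubled so Python does not read "\," as an escape sequence.
punctuationMarks = """!()-[]{};:'"\\,./?@#$%^&*_~"""
punctRemoved = ""
for char in punctuatedString:
    # Keep only the characters that are not punctuation marks.
    if char not in punctuationMarks:
        punctRemoved = punctRemoved + char
print(punctRemoved)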
We begin this example with a string named punctuatedString that contains far too many punctuation marks. Next, we create a string called punctuationMarks filled with various punctuation marks. We also create an empty string, punctRemoved, to use as a receptacle for the characters we keep as we strip punctuation from the sentence. We then loop over punctuatedString one character at a time and append each character to punctRemoved only if it does not appear in punctuationMarks. Any punctuation mark is skipped, whether that be an apostrophe, exclamation mark, parenthesis, square bracket, or some other type of punctuation mark.
Of course, we might want to remove other punctuation marks from the original string. In that case we’d simply need to add the additional characters to punctuationMarks. At the end of the loop we could also assign the contents of punctRemoved back to punctuatedString. But in any case, once we print punctRemoved to the screen we can see how well this has worked.
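The loop above can also be wrapped in a small reusable function. The sketch below is one possible way to do that, not the only one: the names strip_punctuation and extraMarks are hypothetical and chosen purely for illustration. It also shows how extra characters can be appended to punctuationMarks when a text contains marks that the original list does not cover.

```python
def strip_punctuation(text, marks):
    """Return text with every character that appears in marks removed."""
    # str.join builds the result in one pass instead of repeated concatenation.
    return "".join(char for char in text if char not in marks)


punctuatedString = "This, string, just has-way too many punctuation marks in it."
punctuationMarks = """!()-[]{};:'"\\,./?@#$%^&*_~"""

punctRemoved = strip_punctuation(punctuatedString, punctuationMarks)
print(punctRemoved)

# Hypothetical extra marks, added here only to show how the list can be extended.
extraMarks = "«»"
quoted = strip_punctuation("«Hello», world!", punctuationMarks + extraMarks)
print(quoted)
```

Using join with a generator expression gives the same result as the loop, and it avoids creating a new intermediate string on every iteration.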
Of course, Python being Python, there are many other ways to go about this type of string manipulation. One of the more flexible options uses regular expressions, or regex. Regex works through pattern matching, which makes it a natural fit for text cleaning. Consider how we use regular expressions in the following example.
import re
punctuatedString = "This, string, just has-way too many punctuation marks in it."
newString = re.sub(r'[^\w\s]','',punctuatedString)
print(newString)
We begin by importing re to give us access to regular expressions. Next, we return to the same punctuatedString used within our first example. We create a new string, newString, and assign the result of re.sub to it. The sub function replaces every match of a pattern with the replacement text. Here the pattern [^\w\s] matches any character that is neither a word character (a letter, digit, or underscore) nor whitespace, and the replacement is an empty string, so every match is deleted. We pass in punctuatedString as the final argument and get back the newly stripped string. Finally, we print out the result. This kind of cleanup is also a common step before removing stop words with libraries like nltk.
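One detail of this pattern is easy to miss: \w counts the underscore as a word character, so [^\w\s] leaves underscores in place. The sketch below compares the original pattern with a variant that also removes underscores. The sample text and the variable names are made up for illustration.

```python
import re

sampleText = "snake_case, isn't it?"

# Original pattern: removes anything that is not a word character or whitespace.
# The underscore counts as a word character, so it survives.
keepsUnderscore = re.sub(r'[^\w\s]', '', sampleText)
print(keepsUnderscore)  # snake_case isnt it

# Variant: the extra alternative "|_" also matches underscores.
dropsUnderscore = re.sub(r'[^\w\s]|_', '', sampleText)
print(dropsUnderscore)  # snakecase isnt it
```

Which behavior is preferable depends on the text being cleaned. Identifiers such as snake_case usually read better with the underscore kept.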
The Python standard library also offers a wealth of string-related functionality, both as methods on str and in its string module. We can see how powerful this functionality is when looking at the following example.
import string
punctuatedString = "This, string, just has-way too many punctuation marks in it."
print(punctuatedString.translate(str.maketrans('', '', string.punctuation)))
Note that if we exclude the import and the string declaration, we’ve essentially performed everything in a single line of code. In this instance, a lot of the power comes from translate, which is a method of str rather than a function in the string module. This method replaces or deletes characters according to a translation table. We build that table with another str function, maketrans. The list of characters to delete comes from a constant in the string module called punctuation.
The string.punctuation constant simply gives us a string containing the ASCII characters that are considered punctuation. This frees us from the necessity of thinking up a list of punctuation marks on our own. It all comes predefined through string.punctuation. If we need a different set, we can filter string.punctuation, for example with a comprehension, to mix and match punctuation. The maketrans function converts the punctuation characters provided by string.punctuation into a table usable with the translate method. Every character in the third argument, here string.punctuation, is mapped to None, which tells translate to delete it. The first two parameters work as a pair: each character in the first string is replaced by the character at the same position in the second. Since we pass two empty strings, no replacements take place. We are left with just a string with all of the punctuation removed.
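To make the translation table less abstract, the sketch below inspects what maketrans returns and then builds two variations. The names marksToRemove, keepHyphens, and hyphenToSpace are hypothetical and exist only for this illustration.

```python
import string

# string.punctuation is a plain str constant, not a function.
print(string.punctuation)

# The table is a dict keyed by Unicode code point; None means "delete".
table = str.maketrans('', '', string.punctuation)
print(table[ord(',')])  # None

punctuatedString = "This, string, just has-way too many punctuation marks in it."

# Filter string.punctuation with a generator expression to keep hyphens.
marksToRemove = "".join(c for c in string.punctuation if c != "-")
keepHyphens = punctuatedString.translate(str.maketrans('', '', marksToRemove))
print(keepHyphens)

# The first two arguments replace characters pairwise: "-" becomes " ".
# The third argument still lists the characters to delete.
hyphenToSpace = "has-way, ok.".translate(str.maketrans('-', ' ', ',.'))
print(hyphenToSpace)  # has way ok
```

Keeping the hyphen, or turning it into a space, avoids fusing "has-way" into "hasway" as the earlier examples do.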