Master Data Transformation in Pandas with These Three Useful Techniques

A dive into filtering, manipulating, and functioning

Photo by Milad Fakurian on Unsplash

Think back to the last time you worked with a nicely formatted data set. Well-named columns, minimal missing values, and proper organization. It’s a nice feeling — almost freeing — to be blessed with data that you don’t need to clean and transform.

Well, it’s nice until you snap out of your daydream and resume tinkering away at the hopeless shamble of broken rows and nonsensical labels in front of you.

There’s no such thing as clean data (in its original form). If you’re a data scientist, you know this. If you’re just starting out, you should accept this. You will need to transform your data in order to work with it effectively.

Let’s talk about three ways to do so.

Filtering — but Explained Properly

Let’s talk about filtering — but a little more deeply than you may be used to doing. As one of the most common and useful data transformation operations, filtering effectively is a must-have skill for any data scientist. If you know Pandas, it’s likely one of the first operations you learned to do.

Let’s review, using my favorite, oddly versatile example: a DataFrame of student grades, aptly called grades:

Image By Author
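(If you'd like to follow along, here is a minimal stand-in for grades. The names and scores here are hypothetical, picked only so the outputs below line up with the article.)

import pandas as pd

# Hypothetical stand-in data: the pictured DataFrame isn't reproduced in the text,
# so these names and scores are made up, chosen so that only Jack and Hermione
# land at 90 or above.
grades = pd.DataFrame({
    'Name': ['Jack', 'Hermione', 'Ron', 'Draco'],
    'Score': [95, 93, 78, 62]
})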

We’re going to filter out any scores below 90, because on this day we’ve decided to be poorly trained educators who only cater to the top students (please don’t ever actually do this). The standard line of code for accomplishing this is as follows:

grades[grades['Score'] >= 90]
Image By Author

That leaves us with Jack and Hermione. Cool. But what exactly happened here? Why does the above line of code work? Let’s dive a little deeper by looking at the output of the expression inside of the outer brackets above:

grades['Score'] >= 90
Image By Author

Ah, okay. That makes sense. It appears that this line of code returns a Pandas Series object holding Boolean (True/False) values, the result of evaluating the >= 90 comparison against each individual row. This is the key intermediate step. It is this Series of Booleans that then gets passed into the outer brackets, filtering the rows accordingly.

For the sake of completeness, I'll also mention that the same behavior can be achieved using the loc indexer:

grades.loc[grades['Score'] >= 90]
Image By Author

There are a number of reasons we might choose to use loc (one of which being that it actually allows us to filter rows and columns through a single operation), but that opens up a Pandora’s Box of Pandas operations that is best left to another article.
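Just to give a small taste without opening that box: assuming the DataFrame also has a 'Name' column (as in the stand-in above), a single loc call can filter rows and select columns at once.

# Rows with Score >= 90, restricted to the 'Name' column, in one operation.
grades.loc[grades['Score'] >= 90, ['Name']]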

For now, the important learning goal is this: when we filter in Pandas, the confusing syntax isn’t some kind of weird magic. We simply need to break it down into its two component steps: 1) getting a Boolean Series of the rows which satisfy our condition, and 2) using the Series to filter out the entire DataFrame.
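Written out explicitly, those two steps look like this:

# Step 1: build a Boolean Series marking the rows that satisfy the condition.
mask = grades['Score'] >= 90

# Step 2: pass that Series into the brackets to keep only the matching rows.
top_students = grades[mask]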

Why is this useful, you might ask? Well, generally speaking, it’s likely to lead to confusing bugs if you just use operations without understanding how they actually work. Filtering is a useful and incredibly common operation, and you now know how it works.

Let’s move on.

The Beauty of Lambda Functions

Sometimes, your data requires transformations that simply aren't built into Pandas. Try as you might, no amount of scouring Stack Overflow or diligently exploring the Pandas documentation reveals a solution to your problem.

Enter lambda functions — a useful language feature that integrates beautifully with Pandas.

As a quick review, here’s how lambdas work:

>>> add_function = lambda x, y: x + y
>>> add_function(2, 3)
5

Lambda functions behave no differently from regular functions; they simply have a more concise, single-expression syntax:

  • Function name to the left of the equal sign
  • The lambda keyword to the right of the equal sign (similarly to the def keyword in a traditional Python function definition, this lets Python know we’re defining a function).
  • Parameter(s) after the lambda keyword, to the left of the colon.
  • Return value to the right of the colon.

Now then, let’s apply lambda functions to a realistic situation.

Data sets often have their own formatting quirks, specific to variations in data entry and collection. As a result, the data you’re working with might have oddly specific issues that you need to address. For example, consider the simple data set below, which stores people’s names and their incomes. Let’s call it monies.

Image By Author
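(As before, here is a minimal hypothetical stand-in for monies so the snippets below can be run; the real names and incomes aren't shown in the text.)

import pandas as pd

# Hypothetical stand-in data for the monies DataFrame.
monies = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Income': [50000, 62000, 75000]
})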

Now, as this company's Master Data Highnesses, we have been given some top-secret information: everyone in this company will be given a 10% raise plus an additional $1000. This is probably too specific a calculation to find a built-in method for, but it's straightforward enough with a lambda function:

update_income = lambda num: num + (num * .10) + 1000

Then, all we need to do is use this function with the Pandas apply function, which lets us apply a function to every element of the selected Series:

monies['New Income'] = monies['Income'].apply(update_income)
monies
Image By Author

And we're done! A brilliant new DataFrame consisting of exactly the information we needed, all in two lines of code. To make it more concise still, we could have defined the lambda function inside apply directly, a cool tip worth keeping in mind.
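That inline version would look something like this:

# Same transformation, with the lambda defined directly inside apply.
monies['New Income'] = monies['Income'].apply(lambda num: num + (num * .10) + 1000)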

I’ll keep the point here simple.

Lambdas are extremely useful, and thus, you should use them. Enjoy!

Series String Manipulation Functions

In the previous section, we talked about the versatility of lambda functions and all the cool things they can help you accomplish with your data. This is excellent, but you should be careful not to get carried away. It’s incredibly common to get so caught up in one familiar way of doing things that you miss out on simpler shortcuts Python has blessed programmers with. This applies to more than just lambdas, of course, but we’ll stick with that for the moment.

For example, let’s say that we have the following DataFrame called names which stores people’s first and last names:

Image By Author
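(One more hypothetical stand-in, this time for names, so the code below is runnable.)

import pandas as pd

# Hypothetical stand-in data for the names DataFrame.
names = pd.DataFrame({
    'First Name': ['Harry', 'Hermione', 'Ron'],
    'Last Name': ['Potter', 'Granger', 'Weasley']
})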

Now, due to space limitations in our database, we decide that instead of storing a person’s entire last name, it’s more efficient to simply store their last initial. Thus, we need to transform the 'Last Name' column accordingly. With lambdas, our attempt at doing so might look something like the following:

names['Last Name'] = names['Last Name'].apply(lambda s: s[:1])
names
Image By Author

This clearly works, but it’s a bit clunky, and therefore not as Pythonic as it could be. Luckily, with the beauty of string manipulation functions in Pandas, there is another, more elegant way (for the purpose of the next line of code, just go ahead and assume we haven’t already altered the 'Last Name' column with the above code):

names['Last Name'] = names['Last Name'].str[:1]
names
Image By Author

Ta-da! The .str accessor of a Pandas Series lets us slice every string in the Series with a specified string operation, just as if we were working with each string individually.

But wait, it gets better. Since .str effectively lets us access the normal functionality of a string through the Series, we can also apply a range of string functions to help process our data quickly! For instance, say we decide to convert both columns into lowercase. The following code does the job:

names['First Name'] = names['First Name'].str.lower()
names['Last Name'] = names['Last Name'].str.lower()
names
Image By Author

Much more straightforward than going through the hassle of defining your own lambda functions and calling the string functions inside them. Not that I don't love lambdas, but everything has its place, and simplicity should always take priority in Python.

I’ve only covered a few examples here, but a large collection of string functions is at your disposal [1].
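To give just a flavor, here are a few that come up constantly (a quick sketch using the names DataFrame from above; each call returns a new Series without modifying names):

# A small sampler of built-in Series string methods.
names['First Name'].str.upper()              # uppercase every first name
names['Last Name'].str.strip()               # trim surrounding whitespace
names['First Name'].str.contains('a')        # Boolean Series: does each name contain 'a'?
names['First Name'].str.replace('a', '@')    # replace characters within each string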

Use them liberally. They’re excellent.

Final Thoughts and Recap

Here’s a little data transformation cheat sheet for you:

  1. Filter like you mean it. Learn what’s really going on so you know what you’re doing.
  2. Love your lambdas. They can help you manipulate data in amazing ways.
  3. Pandas loves strings as much as you do. There’s a lot of built-in functionality — you may as well use it.

Here's one final piece of advice: there is no single "correct" way to transform a data set. It depends on the data at hand as well as the unique problem you are looking to solve. While there's no set method you can follow each time, there is a useful collection of tools worth having at your disposal. In this article, I discussed three of them.

I encourage you to go out and find some more.

References

[1] https://www.aboutdatablog.com/post/10-most-useful-string-functions-in-pandas
