Data Visualization With Pandas

Republished by Plato


Pandas has been aiding us so far in the phase of Data Preprocessing. Though, in one instance, while creating Histograms, we’ve also utilized another module from Pandas – plotting.

We’ve purposefully avoided it so far, because introducing it earlier would raise more questions than it answered. Namely, Pandas and Matplotlib were such a common and ubiquitous duo that Pandas has started integrating Matplotlib’s functionality. It relies heavily on Matplotlib to do any actual plotting, and you’ll find many Matplotlib functions wrapped in the source code. Alternatively, you can use other backends for plotting, such as Plotly and Bokeh.

However, Pandas also introduces us to a couple of plots that aren’t a part of Matplotlib’s standard plot types, such as KDEs, Andrews Curves, Bootstrap Plots and Scatter Matrices.

The plot() function of a Pandas DataFrame uses the backend specified by plotting.backend, and depending on the kind argument – generates a plot using the given library. Since a lot of these overlap – there’s no point in covering plot types such as line, bar, hist and scatter. They’ll produce much the same plots with the same code as we’ve been doing so far with Matplotlib.
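As a minimal sketch of how the backend is selected, the option can be inspected through Pandas’ options API. The commented-out line assumes Plotly is installed, which the text above does not guarantee:

```python
import pandas as pd

# Matplotlib is the default plotting backend
default_backend = pd.get_option("plotting.backend")
print(default_backend)  # matplotlib

# Switching backends is a one-line option change, for example:
# pd.set_option("plotting.backend", "plotly")  # assumption: plotly is installed
```

Once the option is changed, subsequent plot() calls are routed to the chosen library, so the arguments it understands may differ from the Matplotlib ones used in this chapter.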

We’ll only briefly take a look at the plot() function since the underlying mechanism has been explored so far. Instead, let’s focus on some of the plots that we can’t already readily do with Matplotlib.

This will be a very short chapter, as Pandas’ plotting and visualization capabilities pale in comparison to Matplotlib – but it’s still useful to know of some of these, as well as be aware of the ability to plot from DataFrames directly.

The DataFrame.plot() Function

The plot() function accepts x and y features, and a kind argument. Alternatively, to avoid the kind argument, you can also call DataFrame.plot.kind() instead, and pass in the x and y features.

The accepted values for the kind argument are: line, bar, barh (horizontal bar), hist, box, kde, density (synonym for kde), area, pie, scatter and hexbin (similar to a Heatmap).

import pandas as pd
from matplotlib import pyplot as plt

df = pd.read_csv('gdp_csv.csv')
# Keep only the rows for the European Union
df_eu = df.loc[df['Country Name'] == 'European Union']

df_eu.plot.bar(x='Year', y='Value')
df_eu.plot.line(x='Year', y='Value')
# Pass the column to summarize as y; x would be used as the index instead
df_eu.plot.box(y='Value')

plt.show()

Instead of providing the Series instances like we do for Matplotlib – it’s enough to provide the column names, and since you’re calling the plot() function on the DataFrame you’re visualizing, it’s easy for Pandas to map these string values to the appropriate column names.
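To illustrate the two call styles side by side, here is a small sketch on a made-up DataFrame. The name df_demo and its values are invented for the example and stand in for the filtered GDP data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

# Made-up data standing in for the filtered GDP DataFrame
df_demo = pd.DataFrame({"Year": [2000, 2001, 2002, 2003],
                        "Value": [1.0, 1.5, 1.2, 2.0]})

# Style 1: generic plot() call with an explicit kind
ax_kind = df_demo.plot(kind="line", x="Year", y="Value")

# Style 2: the accessor method named after the plot type
ax_method = df_demo.plot.line(x="Year", y="Value")

# Both calls return a Matplotlib Axes holding the same line data
same_data = (ax_kind.lines[0].get_ydata() == ax_method.lines[0].get_ydata()).all()
print(same_data, ax_kind.figure is ax_method.figure)
```

Both styles take column names as plain strings, and because no ax is passed, each call draws on its own Figure.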

Each of these calls makes a new Figure instance and plots the appropriate values on it.


To plot multiple Axes on the same Figure, you can create a Figure and one or more Axes via plt.subplots() and assign the appropriate Axes to the ax argument of each plot.plot_type() call:

import pandas as pd
from matplotlib import pyplot as plt

df = pd.read_csv('gdp_csv.csv')
# Keep only the rows for the European Union
df_eu = df.loc[df['Country Name'] == 'European Union']

# One Figure with three vertically stacked Axes
fig, ax = plt.subplots(3, 1)

# Pass the column to summarize as y; x would be used as the index instead
df_eu.plot.box(y='Value', ax=ax[0])
df_eu.plot.line(x='Year', y='Value', ax=ax[1])
df_eu.plot.bar(x='Year', y='Value', rot=45, ax=ax[2])

plt.show()

Some of the standard arguments, such as the rotation argument, are actually different in the plot() calls. For example, rotation is shortened to rot. This makes it a bit tricky to just switch between Matplotlib and Pandas, as you’ll most likely end up in the documentation pages, just checking which arguments can be applied and which can’t.
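The difference in argument names can be sketched with the same bar chart drawn twice, once through Pandas and once through Matplotlib. The df_demo data is made up for the example, and tick_params() is just one of several Matplotlib ways to rotate tick labels:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
from matplotlib import pyplot as plt
import pandas as pd

# Made-up data for illustration
df_demo = pd.DataFrame({"Year": [2000, 2001, 2002],
                        "Value": [1.0, 1.5, 1.2]})

fig, (ax_pd, ax_mpl) = plt.subplots(2, 1)

# Pandas: the tick label rotation is passed as `rot`
df_demo.plot.bar(x="Year", y="Value", rot=45, ax=ax_pd)

# Matplotlib: the same bars, with the rotation set through tick_params()
ax_mpl.bar(df_demo["Year"].astype(str), df_demo["Value"])
ax_mpl.tick_params(axis="x", labelrotation=45)

# Read the rotation back from the first tick label of each Axes
pd_rotation = ax_pd.get_xticklabels()[0].get_rotation()
mpl_rotation = ax_mpl.get_xticklabels()[0].get_rotation()
print(pd_rotation, mpl_rotation)
```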

Now, instead of creating a new Figure or Axes instance, each of these plots will be stationed in the appropriate Axes instances we’ve supplied them with.

In general, plotting with Pandas is convenient and quick – but even so, for plotting Bar Plots, Line Plots and Box Plots, you’ll probably want to go with Matplotlib. It’s both the underlying engine Pandas inevitably uses, and also has more customization options, and you won’t have to remember a new set of arguments that you can use with Pandas plots.

That being said, for some plots, you might prefer Pandas, since they’d have to be manually made in Matplotlib, and some of them are such a hassle to make that it’s not worth the effort, such as KDE lines.

Pandas’ Plotting Module

What DataFrames have to offer in terms of visualization isn’t too new to us. However, the underlying module they call, pandas.plotting, is. The plotting module has several functions that we can use, such as autocorrelation_plot(), bootstrap_plot(), and scatter_matrix().

Each of these accepts either a Series or a DataFrame, depending on the type of visualization they’re producing, as well as certain parameters for plotting specification and styling purposes.

Bootstrap Plot

Bootstrapping is the process of randomly sampling (with replacement) a dataset, and calculating measures of accuracy such as bias, variance and confidence intervals for the random samples. “With replacement”, in practical terms, means that each randomly selected element can be selected again. Without replacement means that each randomly selected element is removed from the pool before the next draw.
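The difference between the two sampling modes, and how resampling leads to a confidence interval, can be sketched with NumPy. The data and the number of resamples below are made up for illustration, and the percentile interval is only one of several ways to build a bootstrap confidence interval:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded so the sketch is repeatable
data = np.arange(10)                 # made-up data: the integers 0..9

# With replacement: an element may be drawn more than once
with_repl = rng.choice(data, size=data.size, replace=True)

# Without replacement: every drawn element leaves the pool
without_repl = rng.choice(data, size=data.size, replace=False)

# Bootstrap the mean: resample with replacement many times
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(1000)
])

# The spread of the resampled means gives a rough 95% confidence interval
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={data.mean():.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

Drawing all ten elements without replacement only reshuffles the data, which is why resampling with replacement is what makes the bootstrapped statistics vary.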

A Bootstrap Plot, created by Pandas, bootstraps the mean, median and mid-range statistics of a dataset, based on the sample size, after which the plot is shown via plt.show(). The default arguments for size and samples are 50 and 500 respectively.

This means that for a feature, 50 values are randomly sampled and a summary statistic (mean/median/mid-range) is calculated for them. This process is repeated 500 times, so in the end, we’ve got 500 values of each summary statistic:

import pandas as pd
from matplotlib import pyplot as plt

df = pd.read_csv('./datasets/gdp_csv.csv')
# Keep only the rows for the European Union
df_eu = df.loc[df['Country Name'] == 'European Union']

pd.plotting.bootstrap_plot(df_eu['Value'])

plt.show()
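To make the procedure more concrete, the three statistics can be computed by hand. This is a rough sketch rather than Pandas’ actual implementation: the values Series is made up, and it assumes each resample is drawn with replacement as described above, while Pandas’ own sampling details may differ:

```python
import random
import statistics
import pandas as pd

random.seed(0)  # seeded so the sketch is repeatable

# Made-up stand-in for df_eu['Value']
values = pd.Series(range(1, 58), dtype=float)
population = values.tolist()

size, samples = 50, 500  # the default arguments mentioned above

means, medians, midranges = [], [], []
for _ in range(samples):
    # Assumption: each resample is drawn with replacement
    resample = random.choices(population, k=size)
    means.append(statistics.mean(resample))
    medians.append(statistics.median(resample))
    midranges.append((min(resample) + max(resample)) / 2)

print(len(means), len(medians), len(midranges))  # 500 500 500
```

Each of the three lists holds one value per resample, and the distribution of those values is what a Bootstrap Plot visualizes.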

Autocorrelation Plot

Autocorrelation Plots are used to check for randomness in time-series data. Autocorrelations are calculated for a range of time lags, and if the data is truly random – the autocorrelation will be near zero at every lag. If not – one or more of the autocorrelations will be significantly non-zero.

Let’s plot two Autocorrelation Plots – one with our Value feature, and one with a Series filled with random values:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

df = pd.read_csv('gdp_csv.csv')
# Filter DataFrame for the EU
df_eu = df.loc[df['Country Name'] == 'European Union']
# Generate 50 random integers between 0 and 100, and turn into a Series
random_series = pd.Series(np.random.randint(0, 100, size=50))

# Plot Autocorrelation Plot for the *Value* feature
pd.plotting.autocorrelation_plot(df_eu['Value'])
# Plot Autocorrelation Plot for the *random_series*
pd.plotting.autocorrelation_plot(random_series)

plt.show()

The Autocorrelation Plot for the random_series should revolve around 0, since it’s random data, while the plot for the Value feature won’t.

It’s worth noting that Autocorrelation measures one form of randomness, as uncorrelated, but non-random data does exist. If it’s non-random, but doesn’t have any significant correlations – the Autocorrelation Plot would indicate that the data is random.
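The numbers behind such a plot can be approximated with Series.autocorr(), which computes the correlation between a Series and a lagged copy of itself. The two series below are made up for illustration, and the estimator used by autocorrelation_plot() is not necessarily identical, so treat this as a sketch of the idea:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded so the sketch is repeatable

# Made-up series: a steady upward trend vs. pure noise
trend_series = pd.Series(np.arange(200, dtype=float) + rng.normal(0, 5, size=200))
noise_series = pd.Series(rng.normal(0, 1, size=200))

# Compare the autocorrelation of both series at a few lags
for lag in (1, 5, 10):
    print(lag,
          round(trend_series.autocorr(lag=lag), 2),
          round(noise_series.autocorr(lag=lag), 2))
```

The trending series stays strongly correlated with its lagged copies, while the noise series hovers around zero at every lag.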

Scatter Matrices

Scatter Matrices plot a grid of Scatter Plots for all features against all features. Since this inevitably compares each feature with itself, as well – the diagonal where this happens is typically replaced with a Histogram of that feature, rather than a Scatter Plot. Scatter Matrices are also known as Pair Plots, and Seaborn offers a pairplot() function just for this.

The scatter_matrix() function accepts a DataFrame and produces a Scatter Matrix for all of its numerical features, and returns a 2D array of Axes instances that comprise the Scatter Matrix. To tweak anything about them, you’ll want to iterate through them:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

df = pd.read_csv('worldHappiness2019.csv')

axes = pd.plotting.scatter_matrix(df, diagonal='hist')

for ax in axes.flatten():
    # Rotate back to 0 degrees since they're automatically rotated by 90
    ax.yaxis.label.set_rotation(0)
    # As to not overlap with the Axes instances, set the ticklabel
    # alignment to 'right'
    ax.yaxis.label.set_ha('right')

plt.show()

This results in a rather large Scatter Matrix of all the features against all other features.

You can also pass in the diagonal argument, which accepts 'hist' or 'kde' to specify what type of distribution plot you’d like to plot on the diagonal, as well as alpha, specifying the translucency of the markers in the Scatter Plots.
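As a small sketch of these arguments, the call below uses a made-up three-column DataFrame (df_demo is invented for the example) and inspects the returned array of Axes:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded so the sketch is repeatable

# Made-up data: 100 rows, three numerical features
df_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

# Histograms on the diagonal, half-transparent markers elsewhere
axes = pd.plotting.scatter_matrix(df_demo, diagonal="hist", alpha=0.5)

# One Axes per pair of features
print(axes.shape)  # (3, 3)
```

Swapping diagonal="hist" for diagonal="kde" draws density curves on the diagonal instead, which additionally requires SciPy to be installed.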

Time Stamp: February 14, 2022