Data Visualization With Pandas

Republished by Plato


Pandas has been aiding us so far in the phase of Data Preprocessing. Though, in one instance, while creating Histograms, we’ve also utilized another module from Pandas – plotting.

We’ve purposefully avoided it so far, because introducing it earlier would have raised more questions than it answered. Namely, Pandas and Matplotlib were such a common and ubiquitous duo that Pandas started integrating Matplotlib’s functionality. It relies heavily on Matplotlib to do any actual plotting, and you’ll find many Matplotlib functions wrapped in the source code. Alternatively, you can use other backends for plotting, such as Plotly and Bokeh.

However, Pandas also introduces us to a couple of plots that aren’t a part of Matplotlib’s standard plot types, such as KDEs, Andrews Curves, Bootstrap Plots and Scatter Matrices.

The plot() function of a Pandas DataFrame uses the backend specified by plotting.backend and, depending on the kind argument, generates a plot using the given library. Since a lot of these overlap, there’s no point in covering plot types such as line, bar, hist and scatter. They’ll produce much the same plots with the same code as we’ve been doing so far with Matplotlib.
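As a minimal sketch of that dispatch, assuming a default Pandas installation with Matplotlib available, the snippet below reads the plotting.backend option and checks what plot() hands back. The small DataFrame is made up for illustration and is not the GDP dataset used in this chapter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display

import pandas as pd
from matplotlib.axes import Axes

# Hypothetical data, only for illustration
df = pd.DataFrame({"Year": [2018, 2019, 2020], "Value": [1.0, 1.5, 1.2]})

# The backend is a regular Pandas option
backend = pd.get_option("plotting.backend")
print(backend)

# With the Matplotlib backend, plot() returns a Matplotlib Axes instance
ax = df.plot(x="Year", y="Value", kind="line")
print(isinstance(ax, Axes))
```

Switching backends would go through pd.set_option("plotting.backend", ...), provided the corresponding library is installed and registers itself as a Pandas backend.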

We’ll only briefly take a look at the plot() function, since the underlying mechanism has already been explored. Instead, let’s focus on some of the plots that we can’t already readily do with Matplotlib.

This will be a very short chapter, as Pandas’ plotting and visualization capabilities pale in comparison to Matplotlib, but it’s still useful to know of some of these, as well as to be aware of the ability to plot from DataFrames directly.

The DataFrame.plot() Function

The plot() function accepts x and y features, and a kind argument. Alternatively, to avoid the kind argument, you can also call DataFrame.plot.kind() instead, with kind replaced by the plot type (for example, DataFrame.plot.bar()), and pass in the x and y features.

The accepted values for the kind argument are: line, bar, barh (horizontal bar), hist, box, kde, density (synonym for kde), area, pie, scatter and hexbin (similar to a Heatmap).
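To illustrate that the two calling styles are interchangeable, here is a small sketch on a made-up DataFrame (the column names and values are assumptions for the example). It draws the same Bar Plot once through the kind argument and once through the plot.bar() method, then compares the bar heights.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display

import pandas as pd

# Hypothetical data, only for illustration
df = pd.DataFrame({"Year": [2018, 2019, 2020], "Value": [1.0, 1.5, 1.2]})

# Style 1: plot type given through the kind argument
ax_kind = df.plot(x="Year", y="Value", kind="bar")

# Style 2: plot type given through the accessor method
ax_method = df.plot.bar(x="Year", y="Value")

# Each bar is a rectangle patch on its Axes; compare their heights
heights_kind = [patch.get_height() for patch in ax_kind.patches]
heights_method = [patch.get_height() for patch in ax_method.patches]
print(heights_kind == heights_method)
```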

import pandas as pd
from matplotlib import pyplot as plt

df = pd.read_csv('gdp_csv.csv')
# Keep only the rows for the European Union
df_eu = df.loc[df['Country Name'] == 'European Union']

df_eu.plot.bar(x='Year', y='Value')
df_eu.plot.line(x='Year', y='Value')
# y selects the column to summarize; passing x would only set it as the index
df_eu.plot.box(y='Value')
plt.show()

Instead of providing the Series instances like we do for Matplotlib, it’s enough to provide the column names, and since you’re calling the plot() function on the DataFrame you’re visualizing, it’s easy for Pandas to map these string values to the appropriate columns.
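The difference between the two approaches can be sketched side by side. The DataFrame below is hypothetical; the first call passes the columns themselves to Matplotlib, while the second passes only their names to Pandas, and both should end up plotting the same values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display

import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical data, only for illustration
df = pd.DataFrame({"Year": [2018, 2019, 2020], "Value": [1.0, 1.5, 1.2]})

# Matplotlib: pass the Series instances themselves
fig, ax_mpl = plt.subplots()
ax_mpl.plot(df["Year"], df["Value"])

# Pandas: pass only the column names
ax_pd = df.plot.line(x="Year", y="Value")

# Both Axes now hold one line with the same y-values
mpl_y = list(ax_mpl.lines[0].get_ydata())
pd_y = list(ax_pd.lines[0].get_ydata())
print(mpl_y == pd_y)
```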

Each of these calls makes a new Figure instance and plots the appropriate values on it:


To plot multiple axes on the same Figure, you can make a Figure and one or more Axes via plt.subplots() and assign the appropriate Axes to the ax argument of each plot.plot_type() call:

import pandas as pd
from matplotlib import pyplot as plt

df = pd.read_csv('gdp_csv.csv')
# Keep only the rows for the European Union
df_eu = df.loc[df['Country Name'] == 'European Union']

# One Figure with three vertically stacked Axes
fig, ax = plt.subplots(3, 1)

# y selects the column to summarize; passing x would only set it as the index
df_eu.plot.box(y='Value', ax=ax[0])
df_eu.plot.line(x='Year', y='Value', ax=ax[1])
df_eu.plot.bar(x='Year', y='Value', rot=45, ax=ax[2])
plt.show()

Some of the standard arguments, such as the rotation argument, are actually named differently in the plot() calls. For example, rotation is shortened to rot. This makes it a bit tricky to just switch between Matplotlib and Pandas, as you’ll most likely end up in the documentation pages, checking which arguments can be applied and which can’t.
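A short sketch of that naming difference, on hypothetical data: the Pandas call takes rot, while on the Matplotlib side the tick labels are rotated through the rotation property. The use of plt.setp() here is just one of several ways to do this in Matplotlib and is chosen for brevity.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display

import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical data, only for illustration
df = pd.DataFrame({"Year": [2018, 2019, 2020], "Value": [1.0, 1.5, 1.2]})

fig, (ax_pd, ax_mpl) = plt.subplots(2, 1)

# Pandas: tick label rotation is passed as rot
df.plot.bar(x="Year", y="Value", rot=45, ax=ax_pd)

# Matplotlib: the same effect goes through the rotation property
ax_mpl.bar(df["Year"].astype(str), df["Value"])
plt.setp(ax_mpl.get_xticklabels(), rotation=45)

# Render once so the tick labels are in their final state
fig.canvas.draw()

pd_rotations = [label.get_rotation() for label in ax_pd.get_xticklabels()]
mpl_rotations = [label.get_rotation() for label in ax_mpl.get_xticklabels()]
print(pd_rotations, mpl_rotations)
```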

Now, instead of creating a new Figure or Axes instance, each of these plots will be stationed in the appropriate Axes instances we’ve supplied them with:


In general, plotting with Pandas is convenient and quick – but even so, for plotting Bar Plots, Line Plots and Box Plots, you’ll probably want to go with Matplotlib. It’s both the underlying engine Pandas inevitably uses, and also has more customization options, and you won’t have to remember a new set of arguments that you can use with Pandas plots.

That being said, for some plots you might prefer Pandas, since they’d have to be made manually in Matplotlib, and some of them, such as KDE lines, are such a hassle to make that it’s not worth the effort.

Pandas’ Plotting Module

What DataFrames have to offer in terms of visualization isn’t too new to us. However, the underlying module they call, pandas.plotting, is. The plotting module has several functions that we can use, such as autocorrelation_plot(), bootstrap_plot() and scatter_matrix().

Each of these accepts either a Series or a DataFrame, depending on the type of visualization it produces, as well as certain parameters for plotting specification and styling purposes.

Bootstrap Plot

Bootstrapping is the process of randomly sampling (with replacement) a dataset, and calculating measures of accuracy such as bias, variance and confidence intervals for the random samples. “With replacement”, in practical terms, means that each randomly selected element can be selected again. Without replacement means that each randomly selected element is removed from the pool before the next one is drawn.
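The distinction between the two sampling modes can be sketched with NumPy. The values and the seed are arbitrary assumptions for the example; the point is only that replace=True may repeat elements, while replace=False never does.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, only for repeatability
data = np.array([10, 20, 30, 40, 50])  # hypothetical values

# With replacement: an element can be drawn more than once
with_replacement = rng.choice(data, size=5, replace=True)

# Without replacement: each element can be drawn at most once
without_replacement = rng.choice(data, size=5, replace=False)

print(with_replacement)
print(without_replacement)
```

Drawing five elements without replacement from a pool of five simply returns a shuffled copy of the pool, which is why bootstrapping relies on replacement to get samples that differ from the original data.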

A Bootstrap Plot, created by Pandas, bootstraps the mean, median and mid-range statistics of a dataset, based on the sample size, after which the plot is shown via plt.show(). The default arguments for size and samples are 50 and 500 respectively.

This means that, for a feature, a random subsample of 50 values is drawn and the summary statistics (mean, median and mid-range) are calculated for it. This process is repeated 500 times, so in the end, we’ve got 500 values of each summary statistic:

import pandas as pd
from matplotlib import pyplot as plt

df = pd.read_csv('./datasets/gdp_csv.csv')
# Keep only the rows for the European Union
df_eu = df.loc[df['Country Name'] == 'European Union']

# Defaults: size=50, samples=500
pd.plotting.bootstrap_plot(df_eu['Value'])
plt.show()
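To make the procedure more concrete, here is a hand-rolled sketch of the same idea in NumPy. It is not Pandas’ internal implementation, just an illustration under stated assumptions: the feature is synthetic, the seed is arbitrary, and the size and samples values mirror the defaults mentioned above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatability

# Hypothetical feature: 200 normally distributed values
values = rng.normal(loc=100, scale=15, size=200)

size, samples = 50, 500  # mirrors the bootstrap_plot() defaults

means, medians, midranges = [], [], []
for _ in range(samples):
    # Draw one subsample of `size` elements, with replacement
    subsample = rng.choice(values, size=size, replace=True)
    means.append(subsample.mean())
    medians.append(np.median(subsample))
    # Mid-range: the midpoint between the smallest and largest value
    midranges.append((subsample.min() + subsample.max()) / 2)

# 500 values per statistic; their spread hints at the statistic's uncertainty
print(len(means), np.mean(means), np.std(means))
```

Plotting a histogram of each of the three lists would give a rough equivalent of the distributions shown in a Bootstrap Plot.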


Autocorrelation Plot

Autocorrelation Plots are used to check for randomness in time-series data. Autocorrelations are calculated for a range of time lags, and if the data is truly random, they will all be near zero. If not, one or more of the autocorrelations will be significantly different from zero.

Let’s plot two Autocorrelation Plots, one with our Value feature, and one with a Series filled with random values:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

df = pd.read_csv('gdp_csv.csv')
# Filter DataFrame for the EU
df_eu = df.loc[df['Country Name'] == 'European Union']
# Generate 50 random integers between 0 and 100, and turn into a Series
random_series = pd.Series(np.random.randint(0, 100, size=50))

# Plot Autocorrelation Plot for the *Value* feature
pd.plotting.autocorrelation_plot(df_eu['Value'])
# Plot Autocorrelation Plot for the *random_series*
pd.plotting.autocorrelation_plot(random_series)
plt.show()

The Autocorrelation Plot for the random_series should revolve around 0, since it’s random data, while the plot for the Value feature won’t:


It’s worth noting that Autocorrelation measures only one form of randomness, as uncorrelated but non-random data does exist. If the data is non-random but doesn’t have any significant correlations, the Autocorrelation Plot would still indicate that the data is random.
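For a single number rather than a whole plot, a Series can report its autocorrelation at a given lag through autocorr(). The sketch below uses two synthetic Series, which are assumptions for the example: one of random integers and one with a steady upward trend. Note that autocorr() computes a plain Pearson correlation between the Series and its shifted copy, so its values may differ slightly from what the plot shows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)  # arbitrary seed, only for repeatability

# Random data: no relationship between neighboring values
random_series = pd.Series(rng.integers(0, 100, size=500))

# Trending data: each value is strongly tied to the previous one
trend_series = pd.Series(np.arange(500, dtype=float))

# Correlation of each Series with itself shifted by one step
random_lag1 = random_series.autocorr(lag=1)
trend_lag1 = trend_series.autocorr(lag=1)

print(random_lag1)  # expected to be close to 0
print(trend_lag1)   # expected to be close to 1
```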

Scatter Matrices

Scatter Matrices plot a grid of Scatter Plots for all features against all features. Since this inevitably compares each feature with itself as well, the diagonal where this happens is typically replaced with a histogram of that feature, rather than a Scatter Plot. Scatter Matrices are also known as Pair Plots, and Seaborn offers a pairplot() function just for this.

The scatter_matrix() function accepts a DataFrame, produces a Scatter Matrix for all of its numerical features, and returns a 2D array of the Axes instances that comprise the Scatter Matrix. To tweak anything about them, you’ll want to iterate through them:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

df = pd.read_csv('worldHappiness2019.csv')

axes = pd.plotting.scatter_matrix(df, diagonal='hist')

for ax in axes.flatten():
    # Rotate back to 0 degrees since they're automatically rotated by 90
    ax.yaxis.label.set_rotation(0)
    # As to not overlap with the Axes instances, set the ticklabel
    # alignment to 'right'
    ax.yaxis.label.set_ha('right')

plt.show()

This results in a rather large Scatter Matrix of all the features against all other features:


You can also pass in the diagonal argument, which accepts 'hist' or 'kde' to specify what type of distribution plot you’d like to plot on the diagonal, as well as alpha, specifying the translucency of the markers in the Scatter Plots.
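As a quick sketch of these two arguments, the snippet below builds a Scatter Matrix from a small synthetic DataFrame (three made-up features, arbitrary seed) with histograms on the diagonal and half-transparent markers. Passing diagonal='kde' instead should work the same way, though that option relies on SciPy being installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)  # arbitrary seed, only for repeatability

# Hypothetical DataFrame with three numerical features
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

# Histograms on the diagonal, half-transparent markers elsewhere
axes = pd.plotting.scatter_matrix(df, diagonal="hist", alpha=0.5)

# One Axes per feature pair: a 3x3 grid for three features
print(axes.shape)
```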

Timestamp: February 14, 2022