Error when using Seaborn in a Jupyter notebook (pyspark)

pyspark jupyter seaborn

301 views

1 answer

Author reputation: 1

I am trying to visualize data using Seaborn. I created a DataFrame using SQLContext in pyspark, but calling lmplot on it raises an error, and I am not sure what I am missing. My code is below (I am running it in a Jupyter notebook):

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from pyspark.sql import SQLContext

# sc is the SparkContext provided by the pyspark notebook session
sqlContext = SQLContext(sc)

# Load the CSV through the spark-csv data source
df = sqlContext.read.load('file:///home/cloudera/Downloads/WA_Sales_Products_2012-14.csv',
                          format='com.databricks.spark.csv',
                          header='true', inferSchema='true')

sns.lmplot(x='Quantity', y='Year', data=df)

Error trace:
---------------------------------------------------------------------------
 TypeError                                 Traceback (most recent call last)
<ipython-input-86-2a2b43993475> in <module>()
----> 2 sns.lmplot(x='Quantity', y='Year', data=df)

/home/cloudera/anaconda3/lib/python3.5/site-packages/seaborn/regression.py in lmplot(x, y, data, hue, col, row, palette, col_wrap, size, aspect, markers, sharex, sharey, hue_order, col_order, row_order, legend, legend_out, x_estimator, x_bins, x_ci, scatter, fit_reg, ci, n_boot, units, order, logistic, lowess, robust, logx, x_partial, y_partial, truncate, x_jitter, y_jitter, scatter_kws, line_kws)
    557                        hue_order=hue_order, size=size, aspect=aspect,
    558                        col_wrap=col_wrap, sharex=sharex, sharey=sharey,
--> 559                        legend_out=legend_out)
    560 
    561     # Add the markers here as FacetGrid has figured out how many levels of the

/home/cloudera/anaconda3/lib/python3.5/site-packages/seaborn/axisgrid.py in __init__(self, data, row, col, hue, col_wrap, sharex, sharey, size, aspect, palette, row_order, col_order, hue_order, hue_kws, dropna, legend_out, despine, margin_titles, xlim, ylim, subplot_kws, gridspec_kws)
    255         # Make a boolean mask that is True anywhere there is an NA
    256         # value in one of the faceting variables, but only if dropna is True
--> 257         none_na = np.zeros(len(data), np.bool)
    258         if dropna:
    259             row_na = none_na if row is None else data[row].isnull()

TypeError: object of type 'DataFrame' has no len()

Any help or pointer is appreciated. Thank you in advance:-)

Asked by Neo on December 27, 2017

Answers (1)


0

Author reputation: 490

sqlContext.read.load(...) returns a Spark DataFrame, not a pandas DataFrame. Seaborn does not convert a Spark DataFrame into a pandas DataFrame automatically: as the traceback shows, it calls len(data), which a Spark DataFrame does not support.
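To see why the call fails, here is a minimal sketch that reproduces the same TypeError without Spark. The DataFrame class below is a hypothetical stand-in, not the real pyspark.sql.DataFrame; it only mimics the one relevant property, namely that the object offers count() but no __len__, so len() fails the same way it does at line 257 of axisgrid.py in the traceback.

```python
class DataFrame:
    """Hypothetical stand-in for a Spark DataFrame (NOT the real pyspark class).

    Like the real one, it exposes count() but defines no __len__.
    """

    def __init__(self, rows):
        self._rows = list(rows)

    def count(self):
        # Spark DataFrames report their size through count(), not len()
        return len(self._rows)


def describe_len(obj):
    """Return len(obj), or the TypeError message if obj has no length."""
    try:
        return len(obj)
    except TypeError as exc:
        return str(exc)


spark_like = DataFrame([(1, 2012), (3, 2013)])
message = describe_len(spark_like)
print(message)
print(spark_like.count())
```

The first print shows the same kind of message as in the question, which is what seaborn runs into when it is handed anything that is not list-like or a pandas DataFrame.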

Try:

sns.lmplot(x='Quantity', y='Year', data=df.toPandas())

df.toPandas() returns the Spark DataFrame as a pandas DataFrame.
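If the Spark DataFrame is large, it may be worth trimming it before the conversion, since toPandas() brings all rows to the driver. A rough sketch, assuming only the plotted columns are needed; to_plot_frame and max_rows are names made up for this example, while select(), limit() and toPandas() are regular Spark DataFrame methods:

```python
def to_plot_frame(spark_df, columns, max_rows=None):
    """Convert a Spark DataFrame to pandas, keeping only what the plot needs.

    to_plot_frame and max_rows are hypothetical names for this sketch.
    spark_df is expected to be a Spark DataFrame (anything that provides
    select(), limit() and toPandas() works).
    """
    # Keep only the columns that will be plotted
    selected = spark_df.select(*columns)

    # Optionally cap the number of rows pulled to the driver
    if max_rows is not None:
        selected = selected.limit(max_rows)

    # Collect the (reduced) data into a pandas DataFrame
    return selected.toPandas()
```

With the DataFrame from the question this could be called as sns.lmplot(x='Quantity', y='Year', data=to_plot_frame(df, ['Quantity', 'Year'])). Note that limit() keeps an arbitrary subset of rows rather than a random sample, so a capped plot only reflects that subset.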

Answered by DrEigelb on 27.12.2017 06:22