Use Pandas DataFrames in scikit-learn FeatureUnion and Pipelines

27 April 2018 marrrcin python , scikit-learn , machine-learning , pandas

Pandas is a popular data wrangling library among data engineers and data scientists who use Python. When it comes to machine learning with Python, scikit-learn is the top pick, not only for Jupyter-based experiments but also for full machine learning pipelines. Unfortunately, pandas and scikit-learn do not play well together when you try to use scikit-learn's Pipeline and FeatureUnion abstractions. This post shows how to deal with those problems using just a few lines of code.

TL;DR

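How to use pandas DataFrames with scikit-learn FeatureUnion and Pipelines: replace FunctionTransformer with a small custom transformer, and subclass FeatureUnion so that it merges results with pd.concat instead of np.hstack.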

Environment

  • Python 3.6.4 (should work on 2.7.* too!)
  • scikit-learn 0.19.1
  • pandas 0.22.0

The problem

The problem with using DataFrames in scikit-learn emerges when you want to preserve the features that pandas provides, i.e. column names and ease of indexing, mapping and filtering. By default, scikit-learn does accept DataFrames, but it strips them down to plain numpy arrays, which lack programmers' favourite DataFrame features.

Consider the following code:

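import pandas as pd
from sklearn.datasets import load_iris
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Load the iris data into a DataFrame with named columns
raw_data = load_iris()
data = pd.DataFrame(raw_data["data"], columns=raw_data["feature_names"])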

pipeline = FeatureUnion([
    ("1", make_pipeline(
        FunctionTransformer(lambda X: X.loc[:, ["sepal length (cm)"]]),
        # other transformations
    )),
    ("2", make_pipeline(
        FunctionTransformer(lambda X: X.loc[:, ["sepal width (cm)"]]),
        # other transformations
    ))
])

X = pipeline.fit_transform(data)
print(X["sepal length (cm)"].mean())
print(X["sepal width (cm)"].mean())

The idea is to apply different transformations to different columns and then merge the results for further processing or model training. It looks readable and seems like it should work, but if you try to run it you will get the following error:

FunctionTransformer(...),
AttributeError: 'numpy.ndarray' object has no attribute 'loc'

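The reason is that FunctionTransformer converts any input that is not a numpy array into a numpy array before calling the function (in scikit-learn 0.19 its validate parameter defaults to True), so the DataFrame is lost at that point.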
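As an aside, FunctionTransformer also accepts validate=False, which skips that conversion. The snippet below is a minimal sketch of this alternative, assuming a small made-up DataFrame with columns "a" and "b"; it is not the approach this post takes, and it only addresses the first of the two problems described here.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Illustrative data; the column names are made up for this sketch
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# validate=False asks FunctionTransformer to skip input validation,
# so the DataFrame reaches the lambda unchanged
select_a = FunctionTransformer(lambda X: X.loc[:, ["a"]], validate=False)
out = select_a.fit_transform(df)

print(type(out).__name__)
print(list(out.columns))
```

The custom transformer in the next section gives the same behaviour without depending on the value of validate.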

Solution part 1

The solution for transforming columns is simple: we need a custom transformer class that preserves object types:

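from sklearn.base import BaseEstimator, TransformerMixin


class PandasTransform(TransformerMixin, BaseEstimator):
    def __init__(self, fn):
        # fn: callable that receives the input (e.g. a DataFrame) untouched
        self.fn = fn

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data
        return self

    def transform(self, X, y=None, copy=None):
        # No validation or conversion, so a DataFrame stays a DataFrame
        return self.fn(X)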
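To see that such a transformer really keeps the DataFrame intact, here is a small self-contained check. It repeats the class so that it can run on its own, and the "width"/"height" data is made up for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline


class PandasTransform(TransformerMixin, BaseEstimator):
    def __init__(self, fn):
        self.fn = fn

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None, copy=None):
        return self.fn(X)


# Illustrative data, not taken from the iris example
df = pd.DataFrame({"width": [1.0, 2.0], "height": [3.0, 4.0]})

# Two chained steps: select one column, then scale it
pipe = make_pipeline(
    PandasTransform(lambda X: X.loc[:, ["width"]]),
    PandasTransform(lambda X: X * 10),
)
out = pipe.fit_transform(df)

print(type(out).__name__)
print(out)
```

Pipeline itself passes each step's output to the next step as-is, which is why the second lambda still receives a DataFrame.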

Then replace the FunctionTransformer in the previous example with PandasTransform:

pipeline = FeatureUnion([
    ("1", make_pipeline(
        PandasTransform(lambda X: X.loc[:, ["sepal length (cm)"]]),
        # other transformations
    )),
    ("2", make_pipeline(
        PandasTransform(lambda X: X.loc[:, ["sepal width (cm)"]]),
        # other transformations
    ))
])

X = pipeline.fit_transform(data)
print(X["sepal length (cm)"].mean())
print(X["sepal width (cm)"].mean())

It's better this time, because the transform executes and the variable X contains the transformed data. The problem is that a different exception occurred:

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Why? Because scikit-learn's FeatureUnion joins the transformer outputs with np.hstack, which strips DataFrames down to numpy arrays just as FunctionTransformer did previously.
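The effect can be reproduced without scikit-learn at all. The following sketch, using two made-up single-column DataFrames, shows that np.hstack returns a plain array and that indexing it by column name raises the same kind of IndexError:

```python
import numpy as np
import pandas as pd

# Two illustrative single-column DataFrames
left = pd.DataFrame({"a": [1.0, 2.0]})
right = pd.DataFrame({"b": [3.0, 4.0]})

# np.hstack converts its inputs to arrays, so the column names are dropped
stacked = np.hstack([left, right])
print(type(stacked).__name__)
print(stacked.shape)

try:
    stacked["a"]
except IndexError as err:
    print("IndexError:", err)
```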

Solution part 2: Pandas DataFrame Feature Union

The scikit-learn team is aware of this missing feature, however the GitHub issue is still unresolved. The solution to the FeatureUnion problem is simply to add support for pandas DataFrames to it.

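import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.externals.joblib import Parallel, delayed
# Private helpers: their signatures match scikit-learn 0.19.1
# and may differ in other versions
from sklearn.pipeline import FeatureUnion, _fit_transform_one, _transform_one


class PandasFeatureUnion(FeatureUnion):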
    def fit_transform(self, X, y=None, **fit_params):
        self._validate_transformers()
        result = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_transform_one)(trans, weight, X, y,
                                        **fit_params)
            for name, trans, weight in self._iter())

        if not result:
            # All transformers are None
            return np.zeros((X.shape[0], 0))
        Xs, transformers = zip(*result)
        self._update_transformer_list(transformers)
        if any(sparse.issparse(f) for f in Xs):
            Xs = sparse.hstack(Xs).tocsr()
        else:
            Xs = self.merge_dataframes_by_column(Xs)
        return Xs

    def merge_dataframes_by_column(self, Xs):
        return pd.concat(Xs, axis="columns", copy=False)

    def transform(self, X):
        Xs = Parallel(n_jobs=self.n_jobs)(
            delayed(_transform_one)(trans, weight, X)
            for name, trans, weight in self._iter())
        if not Xs:
            # All transformers are None
            return np.zeros((X.shape[0], 0))
        if any(sparse.issparse(f) for f in Xs):
            Xs = sparse.hstack(Xs).tocsr()
        else:
            Xs = self.merge_dataframes_by_column(Xs)
        return Xs

The above code looks long, but there are only a few things going on there:

  1. The PandasFeatureUnion class extends scikit-learn's built-in FeatureUnion

  2. It overrides the transform and fit_transform methods

  3. Both transform and fit_transform were copied from the scikit-learn source, and the only change was

    from:

    if any(sparse.issparse(f) for f in Xs):
        Xs = sparse.hstack(Xs).tocsr()
    else:
        Xs = np.hstack(Xs)
    

    to:

    if any(sparse.issparse(f) for f in Xs):
        Xs = sparse.hstack(Xs).tocsr()
    else:
        Xs = self.merge_dataframes_by_column(Xs)
    
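The replacement boils down to a column-wise pd.concat. Here is a minimal sketch of what that call does, using made-up frames. One thing worth knowing is that pd.concat aligns rows on the index, so a transformer that changes or resets the index can introduce missing values in the merged result:

```python
import pandas as pd

# Illustrative frames sharing the default index 0, 1, 2
left = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"b": [4.0, 5.0, 6.0]})

# Column-wise merge keeps the column names, unlike np.hstack
merged = pd.concat([left, right], axis="columns")
print(merged)

# If one frame carries a different index, rows are aligned by label
# and the gaps are filled with NaN
shifted = right.copy()
shifted.index = [1, 2, 3]
misaligned = pd.concat([left, shifted], axis="columns")
print(misaligned)
```

In the iris example every transformer only selects columns, so the indices match and no missing values appear.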

Results

After combining PandasTransform and PandasFeatureUnion, the pipeline should work like a charm and the output data will still be in DataFrame format.

pipeline = PandasFeatureUnion([
    ("1", make_pipeline(
        PandasTransform(lambda X: X.loc[:, ["sepal length (cm)"]]),
        # other transformations
    )),
    ("2", make_pipeline(
        PandasTransform(lambda X: X.loc[:, ["sepal width (cm)"]]),
        # other transformations
    ))
])

X = pipeline.fit_transform(data)
print(X["sepal length (cm)"].mean())
print(X["sepal width (cm)"].mean())
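Because PandasFeatureUnion relies on private scikit-learn helpers from version 0.19.1, it may not import on other versions. The sketch below is a simplified stand-in that uses only public API. SimplePandasUnion is a hypothetical name introduced here, and it deliberately leaves out parallelism, transformer weights and sparse outputs, so treat it as an illustration of the idea and not as a drop-in replacement.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import load_iris


class PandasTransform(TransformerMixin, BaseEstimator):
    def __init__(self, fn):
        self.fn = fn

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return self.fn(X)


class SimplePandasUnion(TransformerMixin, BaseEstimator):
    """Hypothetical, simplified union: no n_jobs, weights or sparse support."""

    def __init__(self, transformers):
        # transformers: list of (name, transformer) pairs
        self.transformers = transformers

    def fit(self, X, y=None):
        # Fit a fresh copy of every transformer on the same input
        self.fitted_ = [clone(t).fit(X, y) for _, t in self.transformers]
        return self

    def transform(self, X):
        # Merge the DataFrame outputs column by column
        frames = [t.transform(X) for t in self.fitted_]
        return pd.concat(frames, axis="columns")


raw_data = load_iris()
data = pd.DataFrame(raw_data["data"], columns=raw_data["feature_names"])

union = SimplePandasUnion([
    ("1", PandasTransform(lambda X: X.loc[:, ["sepal length (cm)"]])),
    ("2", PandasTransform(lambda X: X.loc[:, ["sepal width (cm)"]])),
])

X = union.fit_transform(data)
print(X["sepal length (cm)"].mean())
print(X["sepal width (cm)"].mean())
```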

Summary

With just a few tricks and hacks you can use DataFrames alongside scikit-learn Pipelines to build your machine learning pipelines. I hope this post will help you!
