Wed 02 January 2019

Using a Custom JSONEncoder for Pandas and Numpy

Recently, I had a friend ask me to glance at some data science work he was doing. He was puzzled why his output, upon attempting to send it to a remote server for processing, was crashing the entire thing. The project was using a pretty standard toolset - Pandas, Numpy, and so on. After looking at it for a minute, I realized he was running into a JSON encoding issue regarding certain data types in Pandas and Numpy.

The fix is relatively straightforward, if you know what you're looking for. I didn't see too much concrete info floating around after a cursory search, so I figured I'd throw it here in case some other wayward traveler needs it.

Creating and Using a Custom JSONEncoder

It all comes down to instructing your json.dumps() call to use a custom encoder. If you're familiar with the Django world, you've probably run into this with the DjangoJSONEncoder serializer. We essentially want to coerce Pandas and Numpy-specific types to core Python types, and then JSON generation more or less works. Here's an example of how to do so, with comments to explain what's going on.

import numpy
from json import JSONEncoder

class CustomJSONEncoder(JSONEncoder):
    def default(self, obj_to_encode):
        """Pandas and Numpy have some specific types that we want to ensure
        are coerced to Python types, for JSON generation purposes. This attempts
        to do so where applicable.
        """
        # Pandas dataframes have a to_json() method, so we'll check for that and
        # return it if so.
        if hasattr(obj_to_encode, 'to_json'):
            return obj_to_encode.to_json()

        # Numpy objects report themselves oddly in error logs, but this generic
        # type mostly captures what we're after.
        if isinstance(obj_to_encode, numpy.generic):
            return numpy.asscalar(obj_to_encode)
        
        # ndarray -> list, pretty straightforward.
        if isinstance(obj_to_encode, numpy.ndarray):
            return obj_to_encode.to_list()

        # If none of the above apply, we'll default back to the standard JSON encoding
        # routines and let it work normally.
        return super().default(obj_to_encode)

With that, it's a one-line change to use it as our JSON encoder of choice:

json.dumps({
    'my_pandas_type': pandas_value,
    'my_numpy_type': numpy_value
}, cls=CustomJSONEncoder)

Wrapping Up

Now, returning and serializing Pandas and Numpy-specific data types should "just work". If you're the Django type, you could optionally subclass DjangoJSONEncoder and apply the same approach with easy serialization of your model instances.

Ryan around the Web