Wednesday, June 12, 2024

Worst Pandas Bugs and Misfeatures

An informal collection of some of the worst experiences I've had working with pandas.


Using Timedelta for months or days

>>> import datetime
>>> import pandas as pd
>>> datetime.datetime(2018, 10, 1, 0, 0) + pd.to_timedelta(range(0, 12), unit='M')
/Users/kwilliams/git/dispatcher/rush-springs-simulations/venv/lib/python3.7/site-packages/pandas/util/_decorators.py:208: FutureWarning: M and Y units are deprecated and will be removed in a future version.
  return func(*args, **kwargs)
DatetimeIndex(['2018-10-01 00:00:00', '2018-10-31 10:29:06',
               '2018-11-30 20:58:12', '2018-12-31 07:27:18',
               '2019-01-30 17:56:24', '2019-03-02 04:25:30',
               '2019-04-01 14:54:36', '2019-05-02 01:23:42',
               '2019-06-01 11:52:48', '2019-07-01 22:21:54',
               '2019-08-01 08:51:00', '2019-08-31 19:20:06'],
              dtype='datetime64[ns]', freq=None)


There is no conceivable universe where someone would want those results.

Similar problem with days:
>>> pd.to_datetime('2019-03-10T00:00:00').tz_localize('America/Chicago') + pd.to_timedelta(1, unit='d')
Timestamp('2019-03-11 01:00:00-0500', tz='America/Chicago')


The root of the problem is that Timedelta objects represent a fixed physical duration (internally, an integer count of nanoseconds), and a month or day interval doesn't fit into that worldview because months and days don't have constant lengths.
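You can see the fixed-duration worldview directly; a Timedelta's .value attribute is its raw nanosecond count, so a "day" is hard-coded as exactly 24 hours:
>>> pd.Timedelta(1, unit='d').value
86400000000000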

Better alternative - use pd.DateOffset (which unfortunately can't be vectorized):
>>> pd.DatetimeIndex([datetime.datetime(2018, 10, 1, 0, 0) + pd.DateOffset(months=i) for i in range(0, 12)])
DatetimeIndex(['2018-10-01', '2018-11-01', '2018-12-01', '2019-01-01',
               '2019-02-01', '2019-03-01', '2019-04-01', '2019-05-01',
               '2019-06-01', '2019-07-01', '2019-08-01', '2019-09-01'],
              dtype='datetime64[ns]', freq=None)
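The same trick fixes the DST example from earlier, since DateOffset(days=1) advances the calendar day while keeping the wall-clock time (a quick sketch; worth verifying on your pandas version):
>>> pd.to_datetime('2019-03-10T00:00:00').tz_localize('America/Chicago') + pd.DateOffset(days=1)
Timestamp('2019-03-11 00:00:00-0500', tz='America/Chicago')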

Using date_range() for months

>>> pd.date_range(start='2018-10-01', end='2018-12-01', freq='1M')
DatetimeIndex(['2018-10-31', '2018-11-30'], dtype='datetime64[ns]', freq='M')


The beginning date isn't even what we passed as 'start'. This is because 'M' doesn't mean "months", it means "month-end frequency".

Better alternatives:

For the specific case where the desired times are the beginning of months, you can use freq='MS' for "month start frequency":
>>> pd.date_range(start='2018-10-01', end='2018-12-01', freq='MS')
DatetimeIndex(['2018-10-01', '2018-11-01', '2018-12-01'], dtype='datetime64[ns]', freq='MS')
For the general case where you want the same day/time in each repeating element, the DateOffset list comprehension from above generalizes; here's a sketch with an arbitrary mid-month start:
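>>> start = pd.Timestamp('2018-10-15 06:30')
>>> pd.DatetimeIndex([start + pd.DateOffset(months=i) for i in range(3)])
DatetimeIndex(['2018-10-15 06:30:00', '2018-11-15 06:30:00',
               '2018-12-15 06:30:00'],
              dtype='datetime64[ns]', freq=None)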

Using replace() 

>>> dt = pd.to_datetime(datetime.datetime(2018, 11, 4, 1, 0, 0)).tz_localize("America/Chicago", ambiguous=False)
>>> dt
Timestamp('2018-11-04 01:00:00-0600', tz='America/Chicago')
>>> dt.replace(minute=0, second=0, microsecond=0, nanosecond=0)
Timestamp('2018-11-04 01:00:00-0500', tz='America/Chicago')


Notice how making a change that shouldn't change anything actually changed the timezone offset. Presumably replace() re-resolves the (ambiguous) wall time from scratch, and without the ambiguous=False hint it lands back on the DST side.

Better alternative: Timestamp.floor() seems to behave, as long as you pass the same ambiguous flag you localized with; or you can sidestep DST entirely by doing the replace() in UTC and converting back. A sketch (I haven't chased down every edge case):
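>>> dt.floor('h', ambiguous=False)
Timestamp('2018-11-04 01:00:00-0600', tz='America/Chicago')
>>> dt.tz_convert('UTC').replace(minute=0, second=0, microsecond=0, nanosecond=0).tz_convert('America/Chicago')
Timestamp('2018-11-04 01:00:00-0600', tz='America/Chicago')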
One exception - it seems to be okay to use replace() to remove a `tz` attribute:
>>> dt = pd.to_datetime(datetime.datetime(2018, 11, 4, 1, 0, 0)).tz_localize("America/Chicago", ambiguous=False)
>>> dt
Timestamp('2018-11-04 01:00:00-0600', tz='America/Chicago')
>>> dt.replace(tzinfo=None)
Timestamp('2018-11-04 01:00:00')

Ingesting UTC-Offset Data



The best format for datetimes in text files (e.g. CSV files) is ISO-8601 format, with explicitly given offsets from UTC, either as a numeric offset like "-0600" or "-06:00", or "Z" for a zero-hour offset (UTC/GMT).

Unfortunately, using Pandas to read in such data can be fraught if the values have multiple offsets, e.g. if they span a Daylight-Saving-Time transition. For example:
from io import StringIO
import pandas as pd
 
data = """
x,y
2019-03-09 04:00:00-0600,42
2019-03-10 04:00:00-0500,43
2019-03-11 04:00:00-0500,44
"""
 
df = pd.read_csv(StringIO(data), parse_dates=['x'])
print(df.x)
 
# 0    2019-03-09 04:00:00-06:00
# 1    2019-03-10 04:00:00-05:00
# 2    2019-03-11 04:00:00-05:00
# Name: x, dtype: object
Notice that even though Pandas has parsed these strings as dates, it doesn't represent them as a datetime column; that's why it says "dtype: object". If you try to treat it as a datetime, you'll get burned:
df.x.dt
# Traceback (most recent call last):
#   File "<input>", line 1, in <module>
#   File "/Users/kwilliams/git/dispatcher/optos-caiso/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 5458, in __getattr__
#     return object.__getattribute__(self, name)
#   File "/Users/kwilliams/git/dispatcher/optos-caiso/venv/lib/python3.7/site-packages/pandas/core/accessor.py", line 180, in __get__
#     accessor_obj = self._accessor(obj)
#   File "/Users/kwilliams/git/dispatcher/optos-caiso/venv/lib/python3.7/site-packages/pandas/core/indexes/accessors.py", line 494, in __new__
#     raise AttributeError("Can only use .dt accessor with datetimelike values")
# AttributeError: Can only use .dt accessor with datetimelike values
 
pd.to_datetime(df.x)
# Traceback (most recent call last):
#   File "/Users/kwilliams/git/dispatcher/optos-caiso/venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2085, in objects_to_datetime64ns
#     values, tz_parsed = conversion.datetime_to_datetime64(data)
#   File "pandas/_libs/tslibs/conversion.pyx", line 335, in pandas._libs.tslibs.conversion.datetime_to_datetime64
# ValueError: Array must be all same time zone
The best solution is usually to convert them to UTC, and then (optionally) convert to the timezone you want:
pd.to_datetime(df.x, utc=True).dt.tz_convert("America/Chicago")
# 0   2019-03-09 04:00:00-06:00
# 1   2019-03-10 04:00:00-05:00
# 2   2019-03-11 04:00:00-05:00
# Name: x, dtype: datetime64[ns, America/Chicago]
Unfortunately this requires you to know independently what timezone you want to apply, even though the data already specified its own zone offset information. Pandas just doesn't have a way to represent a datetime column with multiple offsets, so it converts it to a different type altogether.
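If you do need the per-row offsets, you can pull them out of the object column before converting, since each parsed value still knows its own fixed offset (a sketch):
df.x.map(lambda ts: ts.utcoffset())
# 0   -1 days +18:00:00
# 1   -1 days +19:00:00
# 2   -1 days +19:00:00
# Name: x, dtype: timedelta64[ns]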

Incidentally, this problem won't manifest until your data has multiple offsets - you get a different "dtype" if the offsets all happen to be the same:
data = """
x,y
2019-05-09 04:00:00-0500,42
2019-05-10 04:00:00-0500,43
2019-05-11 04:00:00-0500,44
"""
 
df = pd.read_csv(StringIO(data), parse_dates=['x'])
print(df.x)
# 0   2019-05-09 04:00:00-05:00
# 1   2019-05-10 04:00:00-05:00
# 2   2019-05-11 04:00:00-05:00
# Name: x, dtype: datetime64[ns, pytz.FixedOffset(-300)]

Storing arrays in Pandas cells

You may want to store a Numpy array (or a list, or other similar composite object) in a single cell of a Pandas DataFrame. This is going to be difficult. For example, you'll have trouble appending/prepending entries to that array. A minimal sketch of the failure mode (the exact error text varies by pandas version):
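import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2]})
df['arr'] = [np.array([1, 2]), np.array([3, 4])]  # building the column works

# Writing a new array back into one cell is where it goes wrong: .loc tries to
# align the iterable with the index and typically raises something like
# "ValueError: Must have equal len keys and value when setting with an iterable"
df.loc[0, 'arr'] = np.append(df.loc[0, 'arr'], 99)

# .at usually works, because it assigns the single object cell directly:
df.at[0, 'arr'] = np.append(df.at[0, 'arr'], 99)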
Alternatives? None are great; try either exploding the array into multiple cells (sketched below), or use the technique shown in the SO thread to do what you need to do.
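For the exploding option, one way (a sketch) is to spread each array across its own columns:
wide = pd.DataFrame(df['arr'].tolist(), index=df.index).add_prefix('arr_')
# columns arr_0, arr_1, ... with NaN padding for shorter arrays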

Appending to a DataFrame


Appending to a DataFrame is absurdly slow, by design. In fact, DataFrame.append() was deprecated in Pandas 1.4 and removed entirely in 2.0 in favor of concat(), so that's two reasons not to use it.

Observe the following benchmark:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import time
 
n = int(1e5)
 
# Slow & deprecated way
start = time.perf_counter()
df = pd.DataFrame({'x': [4], 'y': 5})
for i in range(n):
    df = df.append({'x': i, 'y': i}, ignore_index=True)
print(f"Direct appending: {time.perf_counter() - start} seconds")
 
# Medium and annoying way
start = time.perf_counter()
cache = []
cache.append(pd.DataFrame({'x': [4], 'y': 5}))
for i in range(n):
    cache.append(pd.DataFrame({'x': [i], 'y': i}))
df = pd.concat(cache)
print(f"Concatenation: {time.perf_counter() - start} seconds")
 
# Fast and semi-annoying way
start = time.perf_counter()
cache = []
cache.append({'x': 4, 'y': 5})
for i in range(n):
    cache.append({'x': i, 'y': i})
df = pd.DataFrame(data=cache)
print(f"Delayed creation: {time.perf_counter() - start} seconds")
Output:
Direct appending: 39.961711916999995 seconds
Concatenation: 7.264249749999998 seconds
Delayed creation: 0.12632250000000056 seconds
