Wednesday, June 12, 2024

Worst Pandas Bugs and Misfeatures

An informal collection of some of the worst experiences I've had working with pandas.


Using Timedelta for months or days

>>> import datetime
>>> import pandas as pd
>>> datetime.datetime(2018, 10, 1, 0, 0) + pd.to_timedelta(range(0, 12), unit='M')
/Users/kwilliams/git/dispatcher/rush-springs-simulations/venv/lib/python3.7/site-packages/pandas/util/_decorators.py:208: FutureWarning: M and Y units are deprecated and will be removed in a future version.
  return func(*args, **kwargs)
DatetimeIndex(['2018-10-01 00:00:00', '2018-10-31 10:29:06',
               '2018-11-30 20:58:12', '2018-12-31 07:27:18',
               '2019-01-30 17:56:24', '2019-03-02 04:25:30',
               '2019-04-01 14:54:36', '2019-05-02 01:23:42',
               '2019-06-01 11:52:48', '2019-07-01 22:21:54',
               '2019-08-01 08:51:00', '2019-08-31 19:20:06'],
              dtype='datetime64[ns]', freq=None)


There is no conceivable universe where someone would want those results.

A similar problem occurs with days:
>>> pd.to_datetime('2019-03-10T00:00:00').tz_localize('America/Chicago') + pd.to_timedelta(1, unit='d')
Timestamp('2019-03-11 01:00:00-0500', tz='America/Chicago')


The root of the problem is that a Timedelta represents a fixed physical duration (internally, a number of nanoseconds), and calendar months and days don't fit into that worldview because they don't have constant lengths: months run from 28 to 31 days, and a day spanning a DST transition is 23 or 25 hours long.
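You can see the fixed-duration representation directly: a one-day Timedelta is exactly 24 hours' worth of nanoseconds, no matter what calendar day it gets added to.
>>> pd.Timedelta(days=1) == pd.Timedelta(hours=24)
True
>>> pd.Timedelta(days=1).value  # stored as nanoseconds
86400000000000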

Better alternative - use pd.DateOffset (which unfortunately can't be vectorized):
>>> pd.DatetimeIndex([datetime.datetime(2018, 10, 1, 0, 0) + pd.DateOffset(months=i) for i in range(0, 12)])
DatetimeIndex(['2018-10-01', '2018-11-01', '2018-12-01', '2019-01-01',
               '2019-02-01', '2019-03-01', '2019-04-01', '2019-05-01',
               '2019-06-01', '2019-07-01', '2019-08-01', '2019-09-01'],
              dtype='datetime64[ns]', freq=None)
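DateOffset also appears to do the right thing for the DST example above, since pd.DateOffset(days=1) means "same wall-clock time tomorrow" rather than "86,400 seconds later":
>>> pd.to_datetime('2019-03-10T00:00:00').tz_localize('America/Chicago') + pd.DateOffset(days=1)
Timestamp('2019-03-11 00:00:00-0500', tz='America/Chicago')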

Using date_range() for months

>>> pd.date_range(start='2018-10-01', end='2018-12-01', freq='1M')
DatetimeIndex(['2018-10-31', '2018-11-30'], dtype='datetime64[ns]', freq='M')


The beginning date isn't even what we passed as 'start'. This is because 'M' doesn't mean "months", it means "month-end frequency".

Better alternatives:

For the specific case where the desired times are the beginning of months, you can use freq='MS' for "month start frequency":
>>> pd.date_range(start='2018-10-01', end='2018-12-01', freq='MS')
DatetimeIndex(['2018-10-01', '2018-11-01', '2018-12-01'], dtype='datetime64[ns]', freq='MS')
For the general case where you want the same day and time of day for each repeating element, the best option seems to be the DateOffset trick from the previous section, as sketched below.
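A minimal sketch, reusing the DateOffset list comprehension from above (the anchor timestamp here is just an example):
>>> start = pd.Timestamp('2018-10-15 06:30')
>>> pd.DatetimeIndex([start + pd.DateOffset(months=i) for i in range(3)])
DatetimeIndex(['2018-10-15 06:30:00', '2018-11-15 06:30:00',
               '2018-12-15 06:30:00'],
              dtype='datetime64[ns]', freq=None)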

Using replace() 

>>> dt = pd.to_datetime(datetime.datetime(2018, 11, 4, 1, 0, 0)).tz_localize("America/Chicago", ambiguous=False)
>>> dt
Timestamp('2018-11-04 01:00:00-0600', tz='America/Chicago')
>>> dt.replace(minute=0, second=0, microsecond=0, nanosecond=0)
Timestamp('2018-11-04 01:00:00-0500', tz='America/Chicago')


Notice how making a change that shouldn't change anything actually changed the timezone offset. Evidently replace() rebuilds the timestamp from its wall-clock fields and re-localizes it, and for an ambiguous time like 01:00 on a fall-back night it silently lands on the other side of the fold.

Better alternative: do the truncation in UTC, where there are no DST transitions, and convert back to the local zone afterward.
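A minimal sketch of that round trip, using the same dt as above:
>>> dt.tz_convert("UTC").replace(minute=0, second=0, microsecond=0, nanosecond=0).tz_convert("America/Chicago")
Timestamp('2018-11-04 01:00:00-0600', tz='America/Chicago')
The offset survives because the replace() happens on a UTC timestamp, where no wall-clock time is ever ambiguous.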
One exception - it seems to be okay to use replace() to remove a `tz` attribute:
>>> dt = pd.to_datetime(datetime.datetime(2018, 11, 4, 1, 0, 0)).tz_localize("America/Chicago", ambiguous=False)
>>> dt
Timestamp('2018-11-04 01:00:00-0600', tz='America/Chicago')
>>> dt.replace(tzinfo=None)
Timestamp('2018-11-04 01:00:00')
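Incidentally, tz_localize(None) does the same thing, and is arguably the more idiomatic spelling:
>>> dt.tz_localize(None)
Timestamp('2018-11-04 01:00:00')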

Ingesting UTC-Offset Data



The best format for datetimes in text files (e.g. CSV files) is ISO-8601 format, with explicitly given offsets from UTC, either as a numeric offset like "-0600" or "-06:00", or "Z" for a zero-hour offset (UTC/GMT).

Unfortunately, using Pandas to read in such data can be fraught if the values have multiple offsets, e.g. if they span a Daylight-Saving-Time transition. For example:
from io import StringIO
import pandas as pd
 
data = """
x,y
2019-03-09 04:00:00-0600,42
2019-03-10 04:00:00-0500,43
2019-03-11 04:00:00-0500,44
"""
 
df = pd.read_csv(StringIO(data), parse_dates=['x'])
print(df.x)
 
# 0    2019-03-09 04:00:00-06:00
# 1    2019-03-10 04:00:00-05:00
# 2    2019-03-11 04:00:00-05:00
# Name: x, dtype: object
Notice that even though Pandas has parsed these strings as dates, it doesn't represent them as a datetime column; that's why it says "dtype: object". If you try to treat it as a datetime column, it blows up:
df.x.dt
# Traceback (most recent call last):
#   File "<input>", line 1, in <module>
#   File "/Users/kwilliams/git/dispatcher/optos-caiso/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 5458, in __getattr__
#     return object.__getattribute__(self, name)
#   File "/Users/kwilliams/git/dispatcher/optos-caiso/venv/lib/python3.7/site-packages/pandas/core/accessor.py", line 180, in __get__
#     accessor_obj = self._accessor(obj)
#   File "/Users/kwilliams/git/dispatcher/optos-caiso/venv/lib/python3.7/site-packages/pandas/core/indexes/accessors.py", line 494, in __new__
#     raise AttributeError("Can only use .dt accessor with datetimelike values")
# AttributeError: Can only use .dt accessor with datetimelike values
 
pd.to_datetime(df.x)
# Traceback (most recent call last):
#   File "/Users/kwilliams/git/dispatcher/optos-caiso/venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2085, in objects_to_datetime64ns
#     values, tz_parsed = conversion.datetime_to_datetime64(data)
#   File "pandas/_libs/tslibs/conversion.pyx", line 335, in pandas._libs.tslibs.conversion.datetime_to_datetime64
# ValueError: Array must be all same time zone
The best solution is usually to convert them to UTC, and then (optionally) convert to the timezone you want:
pd.to_datetime(df.x, utc=True).dt.tz_convert("America/Chicago")
# 0   2019-03-09 04:00:00-06:00
# 1   2019-03-10 04:00:00-05:00
# 2   2019-03-11 04:00:00-05:00
# Name: x, dtype: datetime64[ns, America/Chicago]
Unfortunately this requires you to know independently what timezone you want to apply, even though the data already specified its own zone offset information. Pandas just doesn't have a way to represent a datetime column with multiple offsets, so it converts it to a different type altogether.

Incidentally, this problem won't manifest until your data has multiple offsets - you get a different "dtype" if the offsets all happen to be the same:
data = """
x,y
2019-05-09 04:00:00-0500,42
2019-05-10 04:00:00-0500,43
2019-05-11 04:00:00-0500,44
"""
 
df = pd.read_csv(StringIO(data), parse_dates=['x'])
print(df.x)
# 0   2019-05-09 04:00:00-05:00
# 1   2019-05-10 04:00:00-05:00
# 2   2019-05-11 04:00:00-05:00
# Name: x, dtype: datetime64[ns, pytz.FixedOffset(-300)]

Storing arrays in Pandas cells

You may want to store a Numpy array (or a list, or other similar composite object) in a single cell of a Pandas DataFrame. This is going to be difficult. For example, you'll have trouble even replacing the array wholesale, let alone appending/prepending entries to it.
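A minimal sketch of the kind of failure you can hit (the exact error text varies by pandas version; constructing the frame works, but replacing a cell's array via .loc typically doesn't):
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'],
                   'arr': [np.array([1, 2]), np.array([3, 4])]})  # storing the arrays works
df.loc[0, 'arr'] = np.array([1, 2, 3])  # but replacing one this way fails
# ValueError: Must have equal len keys and value when setting with an iterable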
Alternatives? None are great; try either exploding the array into multiple scalar cells, or use the technique shown in the Stack Overflow thread on this problem to do what you need to do.

Appending to a DataFrame


Appending to a DataFrame is absurdly slow, by design: each append() copies the entire frame, so building a frame row by row is quadratic in the number of rows. On top of that, in Pandas 1.4 append() was deprecated in favor of concat(), so that's two reasons not to use it.

Observe the following benchmark:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import time
 
n = int(1e5)
 
# Slow & deprecated way
start = time.perf_counter()
df = pd.DataFrame({'x': [4], 'y': 5})
for i in range(n):
    df = df.append({'x': i, 'y': i}, ignore_index=True)
print(f"Direct appending: {time.perf_counter() - start} seconds")
 
# Medium and annoying way
start = time.perf_counter()
cache = []
cache.append(pd.DataFrame({'x': [4], 'y': 5}))
for i in range(n):
    cache.append(pd.DataFrame({'x': [i], 'y': i}))
df = pd.concat(cache)
print(f"Concatenation: {time.perf_counter() - start} seconds")
 
# Fast and semi-annoying way
start = time.perf_counter()
cache = []
cache.append({'x': 4, 'y': 5})
for i in range(n):
    cache.append({'x': i, 'y': i})
df = pd.DataFrame(data=cache)
print(f"Delayed creation: {time.perf_counter() - start} seconds")
Output:
Direct appending: 39.961711916999995 seconds
Concatenation: 7.264249749999998 seconds
Delayed creation: 0.12632250000000056 seconds