Worst Pandas Bugs and Misfeatures
An informal collection of some of the worst experiences I've had working with pandas.
Using Timedelta for months or days
There is no conceivable universe where someone would want those results.
Similar problem with days:
The root of the problem is that Timedelta objects only represent a fixed number of hours/minutes/seconds/micros/nanos (or perhaps just a number of seconds?), and month or day intervals don't fit into that worldview because they don't have constant lengths.
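To make the fixed-length behavior concrete, here is a minimal sketch (not the original example from this post) using a tz-aware timestamp just before a Daylight-Saving transition. The Helsinki date is my own choice for illustration.

```python
import pandas as pd

# Midnight on the day Helsinki falls back from +03:00 to +02:00.
ts = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")

# A one-day Timedelta is exactly 24 hours of elapsed time...
assert pd.Timedelta(days=1) == pd.Timedelta(hours=24)

# ...so the result lands an hour short of the next calendar day.
shifted = ts + pd.Timedelta(days=1)
print(shifted)  # 2016-10-30 23:00:00+02:00
```

The day in question is 25 hours long, so adding 24 hours never reaches the next midnight.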
Better alternative - use pd.DateOffset (which unfortunately can't be vectorized):
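A minimal sketch of what pd.DateOffset does differently, assuming example dates of my own choosing (these are not the original snippets):

```python
import pandas as pd

# Calendar-aware day arithmetic: the wall-clock time is preserved
# even though this particular day is 25 hours long.
ts = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")
next_day = ts + pd.DateOffset(days=1)
print(next_day)  # 2016-10-31 00:00:00+02:00

# Calendar-aware month arithmetic: the day is clipped to the end of
# the target month when it doesn't exist there.
next_month = pd.Timestamp("2020-01-31") + pd.DateOffset(months=1)
print(next_month)  # 2020-02-29 00:00:00
```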
Using date_range() for months
The first date in the result isn't even what we passed as 'start'. This is because 'M' doesn't mean "months", it means "month-end frequency" (Pandas 2.2 and later spell this alias 'ME').
Better alternatives:
For the specific case where the desired times are the beginning of months, you can use freq='MS' for "month start frequency":
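As a small sketch of the 'MS' case (the start date and number of periods are just illustrative values):

```python
import pandas as pd

# 'MS' = month start frequency: one entry at the first day of each month.
month_starts = pd.date_range(start="2020-01-01", periods=3, freq="MS")
print(month_starts)
# DatetimeIndex(['2020-01-01', '2020-02-01', '2020-03-01'], ...)
```

Note that this only returns the 'start' value itself when 'start' already falls on the first of a month.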
For the general case where you want to get the same day/time for each repeating element:
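One way to do this, sketched here under the assumption that a pd.DateOffset is passed as the frequency (the original snippet may have differed), is:

```python
import pandas as pd

# Repeat the same day-of-month and time-of-day, one calendar month apart.
monthly = pd.date_range(
    start="2020-01-15 09:30",
    periods=4,
    freq=pd.DateOffset(months=1),
)
print(monthly)
# 2020-01-15 09:30, 2020-02-15 09:30, 2020-03-15 09:30, 2020-04-15 09:30
```

Be careful with start days of 29 through 31: a month that lacks that day gets clipped to its last day, so check the output if your dates are near the end of a month.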
Using replace()
Notice how making a change that shouldn't change anything actually changed the timezone offset.
Better alternative:
One exception - it seems to be okay to use replace() to remove a `tz` attribute:
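A minimal sketch of that exception, using an illustrative timestamp of my own. It also checks the result against tz_localize(None), which is the dedicated method for dropping a timezone:

```python
import pandas as pd

aware = pd.Timestamp("2021-07-01 12:00", tz="America/Chicago")

# Dropping the timezone keeps the wall-clock time and discards the offset.
naive = aware.replace(tzinfo=None)
print(naive)  # 2021-07-01 12:00:00

# Same result as the purpose-built method.
print(naive == aware.tz_localize(None))  # True
```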
Ingesting UTC-Offset Data
The best format for datetimes in text files (e.g. CSV files) is ISO-8601 format, with explicitly given offsets from UTC, either as a numeric offset like "-0600" or "-06:00", or "Z" for a zero-hour offset (UTC/GMT).
Unfortunately, using Pandas to read in such data can be fraught if the values have multiple offsets, e.g. if they span a Daylight-Saving-Time transition. For example:
Notice that even though Pandas has parsed these strings as dates, it doesn't represent them as a datetime column; that's why it says "dtype: object". If you try to treat it as a datetime column, you'll get an error:
The best solution is usually to convert them to UTC, and then (optionally) convert to the timezone you want:
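Here is a sketch of that two-step conversion. The sample strings and the "America/Chicago" zone are assumptions for illustration, chosen so the values straddle a Daylight-Saving transition:

```python
import pandas as pd

# Two ISO-8601 strings with different UTC offsets (before/after spring-forward).
raw = pd.Series([
    "2021-03-14T01:30:00-06:00",
    "2021-03-14T03:30:00-05:00",
])

# Step 1: normalize everything to UTC, which gives a real datetime column.
utc = pd.to_datetime(raw, utc=True)

# Step 2 (optional): convert to the timezone you actually want.
local = utc.dt.tz_convert("America/Chicago")
print(local)
```

After the conversion the column has a proper tz-aware datetime dtype, so the .dt accessor works on it.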
Unfortunately this requires you to know independently what timezone you want to apply, even though the data already specified its own zone offset information. Pandas just doesn't have a way to represent a datetime column with multiple offsets, so it converts it to a different type altogether.
Incidentally, this problem won't manifest until your data has multiple offsets - you get a different "dtype" if the offsets all happen to be the same:
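A quick sketch of the single-offset case, with made-up sample values:

```python
import pandas as pd

# Every value carries the same -06:00 offset.
same_offset = pd.Series([
    "2021-01-01T00:00:00-06:00",
    "2021-01-02T00:00:00-06:00",
])

parsed = pd.to_datetime(same_offset)

# This time the result is a genuine tz-aware datetime column with a
# fixed offset, not an object column.
print(parsed.dtype)
```

This is why the problem tends to show up months after the code was written, the first time the data crosses a DST boundary.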
Storing arrays in Pandas cells
You may want to store a Numpy array (or a list, or other similar composite object) in a single cell of a Pandas DataFrame. This is going to be difficult. For example, you'll have trouble appending/prepending entries to that array:
Alternatives? None are great; try either exploding the array into multiple cells, or use the technique shown in the SO thread to do what you need to do.
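One possible reading of "exploding the array into multiple cells" is DataFrame.explode(), which gives each array element its own row. A minimal sketch, with hypothetical column names "id" and "samples":

```python
import numpy as np
import pandas as pd

# Each cell in "samples" holds a whole array.
df = pd.DataFrame({
    "id": [1, 2],
    "samples": [np.array([10, 20]), np.array([30])],
})

# One row per array element; "id" is repeated to match.
long_form = df.explode("samples", ignore_index=True)
print(long_form)
```

In the long form, adding an entry is just adding a row, at the cost of repeating the other columns.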
Appending to a DataFrame
Appending to a DataFrame with append() is absurdly slow, by design: every call builds a whole new DataFrame. In fact, append() was deprecated in Pandas 1.4 in favor of concat() (and removed entirely in 2.0), so that's two reasons not to use it.
Observe the following benchmark:
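A rough sketch of such a benchmark follows. It is not the original code: because append() no longer exists in current Pandas, it grows the frame with repeated concat() calls instead, which has the same copy-everything-each-time cost. The function names and the row count are my own choices.

```python
import time
import pandas as pd

def grow_row_by_row(n):
    """Grow a DataFrame one row at a time (copies the frame on every step)."""
    df = pd.DataFrame({"a": [0], "b": [0]})
    for i in range(1, n):
        row = pd.DataFrame({"a": [i], "b": [i * 2]})
        df = pd.concat([df, row], ignore_index=True)
    return df

def build_once(n):
    """Collect plain Python rows first, then build the DataFrame once."""
    rows = [{"a": i, "b": i * 2} for i in range(n)]
    return pd.DataFrame(rows)

n = 2000
for func in (grow_row_by_row, build_once):
    started = time.perf_counter()
    func(n)
    elapsed = time.perf_counter() - started
    print(f"{func.__name__}: {elapsed:.3f} s")
```

Timings will vary by machine and Pandas version, so run it yourself rather than trusting anyone's numbers.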
Output: