How to convert a datetime string to a timestamp in dask cudf and then sort the dataframe by this column


I would like to convert a datetime string column to timestamps in dask cudf and then sort the dataframe by that column.

Example:

import dask_cudf as ddf
import cudf

# Sample data (replace with your actual data)
cdf = cudf.DataFrame({
    'city': ['Dallas', 'Bogota', 'Chicago', 'Juarez'],
    'timestamp': ['2019-12-29 14:15:08 UTC', '2019-12-30 10:30:15 UTC', '2019-12-31 18:45:30 UTC', '2020-01-01 03:20:45 UTC']
})

# Create a Dask-cuDF DataFrame
dask_df = ddf.from_cudf(cdf, npartitions=2)

def to_timestamp(x):
    import time
    import datetime
    element = datetime.datetime.strptime(x,"%Y-%m-%d %H:%M:%S UTC")
    return datetime.datetime.timestamp(element)

dask_df['timestamp'] = dask_df['timestamp'].map_partitions(to_timestamp, meta=("timestamp", "str"))

dask_df.head()

I got this error:

TypeError: strptime() argument 1 must be str, not Series

How can I do this for a large dataframe in dask cudf?

========== Update ==========

I have tried this:

   dask_df["timestamp"] = dask_df["timestamp"].map_partitions(to_timestamp, meta=("timestamp", "str"))

and got the same error:

  TypeError: strptime() argument 1 must be str, not Series

Answer by UnicornOnAzur:

This map_partitions thread seems to cover all the tricks of using map_partitions on a row-by-row basis.
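
The TypeError in the question comes from the fact that map_partitions calls the function once per partition and passes the whole Series, not one string at a time. Here is a minimal sketch of a partition-level function. It uses pandas as a CPU stand-in so it runs without a GPU, and the two-way split only imitates partitions. The Dask-cuDF call in the closing comment, and the idea that cudf.to_datetime can take the place of pd.to_datetime there, are assumptions I have not verified on your setup. The sample rows are shuffled so that the sort has a visible effect.

```python
import pandas as pd

FORMAT = "%Y-%m-%d %H:%M:%S UTC"

def to_timestamp(part: pd.Series) -> pd.Series:
    """Convert one whole partition of datetime strings in a single vectorized call."""
    return pd.to_datetime(part, format=FORMAT)

# Sample data from the question, with the rows shuffled
df = pd.DataFrame({
    "city": ["Chicago", "Dallas", "Juarez", "Bogota"],
    "timestamp": [
        "2019-12-31 18:45:30 UTC",
        "2019-12-29 14:15:08 UTC",
        "2020-01-01 03:20:45 UTC",
        "2019-12-30 10:30:15 UTC",
    ],
})

# Imitate two partitions; each one reaches the function as a whole Series
partitions = [df.iloc[:2], df.iloc[2:]]
converted = pd.concat(
    [p.assign(timestamp=to_timestamp(p["timestamp"])) for p in partitions]
)

# Sort by the converted column
result = converted.sort_values("timestamp").reset_index(drop=True)
print(result)

# Assumed Dask-cuDF equivalent (not run here), with cudf.to_datetime
# used inside to_timestamp instead of pd.to_datetime:
#   dask_df["timestamp"] = dask_df["timestamp"].map_partitions(
#       to_timestamp, meta=("timestamp", "datetime64[ns]"))
#   dask_df = dask_df.sort_values("timestamp")
```

Note that the meta passed to map_partitions should describe the output dtype (a datetime type here), not "str" as in the question.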

Furthermore, you can refactor your function somewhat. The import statements can be moved outside the function so they are not re-executed on every call. Since the function only uses datetime, you can drop the import of time. The function could then look like this:

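import datetime

def to_timestamp(x):
    # Parse the string, then convert the resulting datetime to a Unix timestamp
    datetime_object = datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S UTC")
    timestamp = datetime.datetime.timestamp(datetime_object)
    return timestamp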
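
One caveat with the function above: strptime returns a naive datetime here (the "UTC" in the format is matched as literal text), and timestamp() on a naive datetime interprets it in the machine's local timezone. If the strings really are UTC, as their suffix suggests, a variant that attaches the timezone explicitly could look like the sketch below. It still works on a single string, so it is meant for row-level use only.

```python
import datetime

def to_timestamp(x: str) -> float:
    """Parse one 'YYYY-MM-DD HH:MM:SS UTC' string into Unix epoch seconds."""
    naive = datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S UTC")
    # Attach UTC explicitly so timestamp() does not assume local time
    aware = naive.replace(tzinfo=datetime.timezone.utc)
    return aware.timestamp()

print(to_timestamp("2020-01-01 03:20:45 UTC"))
```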