How to map 'created at' field with a custom date format

I am trying to load a CSV file using the built-in CSV data loader, and want to map a particular column to the created_at field. This column is formatted like so: ‘2/11/2022 12:00:00 AM’.
When I run the data loader with the --map-created-at argument pointed to this column, the loading script finishes executing but the items never appear in Squirro, and from the frontend log I can see errors like the following:

/var/log/squirro/frontend/frontend.log:Thread-2 squirro.integration.frontend 2022-06-24 19:34:12,023 WARNING  SquirroClient returned an unexpected response: (400, {'msg': "Value '9/16/2014 12:00:00 AM' does not appear to be a datetime field according to the format string '%Y-%m-%dT%H:%M:%S'. Specify a custom datetime format to handle this value."}, {'error': {'msg': "Value '9/16/2014 12:00:00 AM' does not appear to be a datetime field according to the format string '%Y-%m-%dT%H:%M:%S'. Specify a custom datetime format to handle this value."}})

Is there a way to specify a custom format string to the loader for this field, like how you would use input_format_string in the facets.json config file to map it to a facet? If not, should I consider populating the created_at field using a pipelet, or is the best option to copy the existing CSV data loader code and modify it to parse this date field?

1 Like

Hi Neil,

If you convert your date in a pipelet, the --map-created-at will for sure work.
Btw, the log you want o look out for for data issues is /var/log/squirro/provider.log.

The data provider service is the REST api that the cli data loader sends its data to, and it is validating it.
So any warning about bad data would show up there.

Your idea of input_format_string should work, and its part of this training template:

Copying the data loader, while it will get the job done, is not a good idea, as you won’t be able to benefit from future improvements automatically.

The pipelet approach is most likely best, especially if you have to deal with bad/inconsistent data as it gives you the full power of python date parsing and you can even bring in more powerful time parsing libraries such as arrow.

Best,
Toni

1 Like

Thanks Toni, I’ll go with the pipelet and add a field for --map-created-at to read from. Also thanks for the tip on the provider.log - I generally use squirro_tail_logs but good to know which file the processing error is supposed to appear in.

1 Like

As luck has it, i just had to deal with a cvs and dates today.
So yes, the input_format_string approach works 100%.

A tricky bit is that the strptime() format that we’re supporting has some platform specific caveats that can be very tricky.

E.g i had to deal with single digit days / months so i first tried %-m, as its widely specified, but under Linux this is not supported, and %m is smart enough to deal with both 1 and 01. This then leads to confusing complaints by the dataloader. But once I used. the right string, it just worked.

But in my case I have to in the end user the pipelets way, because this CSV i’m working with is from hell, it changes the date string format multiple times :wink:

1 Like

Ahh, this makes sense now. I created a facet for this column and specified the input format string there, and then --map-created-at magically succeeded at parsing the date. It wasn’t clear to me that data transformations done for facets would affect the mapping to item fields.

Fortunately for me, this CSV is consistent, so I can skip the pipelet!

This is what i had to resort to today:

"""
Fix wonky dates
"""
from datetime import datetime
from squirro.sdk import PipeletV1, require


@require("log")
class SomePipelet(PipeletV1):
    def __init__(self, config):
        self.config = config

    def consume(self, item):
        kw = item.setdefault("keywords", {})

        date_str = kw.get("date_tmp", [None])[0]
        try:
            if date_str:
                if "AM" in date_str or "PM" in date_str:
                    date_obj = datetime.strptime(date_str, "%d/%m/%Y %H:%M %p")
                else:
                    date_obj = datetime.strptime(date_str, "%d/%m/%Y %H:%M")

            else:
                date_obj = datetime.now()
        except ValueError:
            self.log.warning(
                "Value %r is not a valid datetime string. Skipping item id %s",
                date_str,
                item["id"],
            )
            return None

        item["created_at"] = date_obj.strftime("%Y-%m-%dT%H:%M:%S")

        return item
1 Like

:nauseated_face: so sometimes there is an “AM” or “PM” in there even though it’s 24 hour time? how fun!