I am trying to load a CSV file using the built-in CSV data loader, and want to map a particular column to the created_at field. This column is formatted like so: ‘2/11/2022 12:00:00 AM’.
When I run the data loader with the --map-created-at argument pointed to this column, the loading script finishes executing but the items never appear in Squirro, and from the frontend log I can see errors like the following:
/var/log/squirro/frontend/frontend.log:Thread-2 squirro.integration.frontend 2022-06-24 19:34:12,023 WARNING SquirroClient returned an unexpected response: (400, {'msg': "Value '9/16/2014 12:00:00 AM' does not appear to be a datetime field according to the format string '%Y-%m-%dT%H:%M:%S'. Specify a custom datetime format to handle this value."}, {'error': {'msg': "Value '9/16/2014 12:00:00 AM' does not appear to be a datetime field according to the format string '%Y-%m-%dT%H:%M:%S'. Specify a custom datetime format to handle this value."}})
Is there a way to specify a custom format string to the loader for this field, like how you would use input_format_string in the facets.json config file to map it to a facet? If not, should I consider populating the created_at field using a pipelet, or is the best option to copy the existing CSV data loader code and modify it to parse this date field?
If you convert your date in a pipelet, --map-created-at will definitely work.
Btw, the log you want to look at for data issues is /var/log/squirro/provider.log.
The data provider service is the REST API that the CLI data loader sends its data to, and it is the component doing the validation.
So any warnings about bad data show up there.
Your idea of using input_format_string should work, and it's part of the training template.
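To illustrate (my own sketch here, not the template itself): a facets.json entry for a datetime column could look like the following, where the column name "Date Created" is made up:

    {
        "Date Created": {
            "name": "date_created",
            "data_type": "datetime",
            "input_format_string": "%m/%d/%Y %I:%M:%S %p"
        }
    }

That format string matches values like '2/11/2022 12:00:00 AM' from your example.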
Copying the data loader, while it will get the job done, is not a good idea, as you won’t be able to benefit from future improvements automatically.
The pipelet approach is most likely best, especially if you have to deal with bad/inconsistent data as it gives you the full power of python date parsing and you can even bring in more powerful time parsing libraries such as arrow.
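A minimal sketch of such a pipelet, assuming the standard PipeletV1 interface; the keyword name raw_date and the list of formats are placeholders for whatever your data actually contains:

    from datetime import datetime

    from squirro.sdk import PipeletV1


    class CreatedAtPipelet(PipeletV1):
        """Parse a messy CSV date column and populate created_at."""

        # Formats seen in the CSV so far; extend as new variants appear.
        FORMATS = ("%m/%d/%Y %I:%M:%S %p", "%Y-%m-%d %H:%M:%S")

        def consume(self, item):
            # 'raw_date' is a hypothetical keyword holding the raw column value.
            for value in item.get("keywords", {}).get("raw_date", []):
                for fmt in self.FORMATS:
                    try:
                        parsed = datetime.strptime(value, fmt)
                    except ValueError:
                        continue
                    # created_at expects '%Y-%m-%dT%H:%M:%S' (see the error above).
                    item["created_at"] = parsed.strftime("%Y-%m-%dT%H:%M:%S")
                    return item
            return item

If the formats are too inconsistent to enumerate like this, that's where a library such as arrow (or dateutil's fuzzy parser) can replace the format loop.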
Thanks Toni, I’ll go with the pipelet and add a field for --map-created-at to read from. Also thanks for the tip on the provider.log - I generally use squirro_tail_logs but good to know which file the processing error is supposed to appear in.
As luck would have it, I just had to deal with a CSV and dates today.
So yes, the input_format_string approach works 100%.
One tricky bit is that the strptime() format strings we support come with platform-specific caveats.
E.g. I had to deal with single-digit days/months, so I first tried %-m, as it's widely documented, but under Linux it's not supported for parsing, and %m turns out to be smart enough to handle both 1 and 01 anyway. Until then the data loader kept raising confusing complaints, but once I used the right format string, it just worked.
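To illustrate that behaviour with plain Python:

    from datetime import datetime

    # %m and %d accept both single-digit and zero-padded values when parsing,
    # so no platform-specific %-m directive is needed.
    print(datetime.strptime("2/11/2022 12:00:00 AM", "%m/%d/%Y %I:%M:%S %p"))
    # -> 2022-02-11 00:00:00

    # %-m is a strftime (output) extension on some platforms and fails in strptime:
    try:
        datetime.strptime("2/11/2022", "%-m/%-d/%Y")
    except ValueError as exc:
        print(exc)  # '-' is a bad directive in format '%-m/%-d/%Y'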
But in my case I have to use the pipelet approach in the end, because the CSV I'm working with is from hell: it changes the date string format multiple times.
Ahh, this makes sense now. I created a facet for this column and specified the input format string there, and then --map-created-at magically succeeded at parsing the date. It wasn’t clear to me that data transformations done for facets would affect the mapping to item fields.
Fortunately for me, this CSV is consistent, so I can skip the pipelet!
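For reference, my loader call ended up looking roughly like this (cluster, token and project ID are placeholders, and "Date Created" is the facet column from facets.json; adjust the flags to your setup):

    squirro_data_load \
        --cluster https://localhost \
        --token $TOKEN \
        --project-id $PROJECT_ID \
        --source-type csv \
        --source-file data.csv \
        --facets-file facets.json \
        --map-title "Title" \
        --map-created-at "Date Created"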