I am trying to load a CSV file using the built-in CSV data loader, and want to map a particular column to the created_at field. This column is formatted like so: ‘2/11/2022 12:00:00 AM’.
When I run the data loader with the --map-created-at argument pointed to this column, the loading script finishes executing but the items never appear in Squirro, and from the frontend log I can see errors like the following:
/var/log/squirro/frontend/frontend.log:Thread-2 squirro.integration.frontend 2022-06-24 19:34:12,023 WARNING SquirroClient returned an unexpected response: (400, {'msg': "Value '9/16/2014 12:00:00 AM' does not appear to be a datetime field according to the format string '%Y-%m-%dT%H:%M:%S'. Specify a custom datetime format to handle this value."}, {'error': {'msg': "Value '9/16/2014 12:00:00 AM' does not appear to be a datetime field according to the format string '%Y-%m-%dT%H:%M:%S'. Specify a custom datetime format to handle this value."}})
Is there a way to specify a custom format string to the loader for this field, like how you would use input_format_string in the facets.json config file to map it to a facet? If not, should I consider populating the created_at field using a pipelet, or is the best option to copy the existing CSV data loader code and modify it to parse this date field?
If you convert your date in a pipelet, --map-created-at will definitely work.
Btw, the log you want to look at for data issues is /var/log/squirro/provider.log.
The data provider service is the REST API that the CLI data loader sends its data to, and it is the component doing the validation.
So any warnings about bad data show up there.
Your idea of using input_format_string should work, and it's part of the training template.
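To illustrate (my own sketch here, not the template itself): a facets.json entry for a datetime column could look like the following, where the column name "Date Created" is made up:

    {
        "Date Created": {
            "name": "date_created",
            "data_type": "datetime",
            "input_format_string": "%m/%d/%Y %I:%M:%S %p"
        }
    }

That format string matches values like '2/11/2022 12:00:00 AM' from your example.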
Copying the data loader, while it will get the job done, is not a good idea, as you won’t be able to benefit from future improvements automatically.
The pipelet approach is most likely best, especially if you have to deal with bad/inconsistent data as it gives you the full power of python date parsing and you can even bring in more powerful time parsing libraries such as arrow.
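A minimal sketch of such a pipelet, assuming the standard PipeletV1 interface; the keyword name raw_date and the list of formats are placeholders for whatever your data actually contains:

    from datetime import datetime

    from squirro.sdk import PipeletV1


    class CreatedAtPipelet(PipeletV1):
        """Parse a messy CSV date column and populate created_at."""

        # Formats seen in the CSV so far; extend as new variants appear.
        FORMATS = ("%m/%d/%Y %I:%M:%S %p", "%Y-%m-%d %H:%M:%S")

        def consume(self, item):
            # 'raw_date' is a hypothetical keyword holding the raw column value.
            for value in item.get("keywords", {}).get("raw_date", []):
                for fmt in self.FORMATS:
                    try:
                        parsed = datetime.strptime(value, fmt)
                    except ValueError:
                        continue
                    # created_at expects '%Y-%m-%dT%H:%M:%S' (see the error above).
                    item["created_at"] = parsed.strftime("%Y-%m-%dT%H:%M:%S")
                    return item
            return item

If the formats are too inconsistent to enumerate like this, that's where a library such as arrow (or dateutil's fuzzy parser) can replace the format loop.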
Thanks Toni, I’ll go with the pipelet and add a field for --map-created-at to read from. Also thanks for the tip on the provider.log - I generally use squirro_tail_logs but good to know which file the processing error is supposed to appear in.
As luck would have it, I just had to deal with a CSV and dates today.
So yes, the input_format_string approach works 100%.
One tricky bit is that the strptime() format strings we support come with platform-specific caveats.
E.g. I had to deal with single-digit days/months, so I first tried %-m, as it's widely documented, but under Linux it's not supported for parsing, and %m turns out to be smart enough to handle both 1 and 01 anyway. Until then the data loader kept raising confusing complaints, but once I used the right format string, it just worked.
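To illustrate that behaviour with plain Python:

    from datetime import datetime

    # %m and %d accept both single-digit and zero-padded values when parsing,
    # so no platform-specific %-m directive is needed.
    print(datetime.strptime("2/11/2022 12:00:00 AM", "%m/%d/%Y %I:%M:%S %p"))
    # -> 2022-02-11 00:00:00

    # %-m is a strftime (output) extension on some platforms and fails in strptime:
    try:
        datetime.strptime("2/11/2022", "%-m/%-d/%Y")
    except ValueError as exc:
        print(exc)  # '-' is a bad directive in format '%-m/%-d/%Y'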
But in my case I have to use the pipelet approach in the end, because the CSV I'm working with is from hell: it changes the date string format multiple times.
Ahh, this makes sense now. I created a facet for this column and specified the input format string there, and then --map-created-at magically succeeded at parsing the date. It wasn’t clear to me that data transformations done for facets would affect the mapping to item fields.
Fortunately for me, this CSV is consistent, so I can skip the pipelet!
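For reference, my loader call ended up looking roughly like this (cluster, token and project ID are placeholders, and "Date Created" is the facet column from facets.json; adjust the flags to your setup):

    squirro_data_load \
        --cluster https://localhost \
        --token $TOKEN \
        --project-id $PROJECT_ID \
        --source-type csv \
        --source-file data.csv \
        --facets-file facets.json \
        --map-title "Title" \
        --map-created-at "Date Created"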