NY Times Data loader Debugging

Hello all, I’m debugging a custom data loader plugin that I’m building for practice. I’ve written the code to fetch the data from the NY Times and the code to yield it, but I keep hitting the same error: Failed to map abstract to missing column 'abstract'. I’ve pasted the error output and the bash script I use to run the Squirro data loader below:

2022-02-18 02:33:36,566 squirro_data_load[62297] INFO     Loaded schema from data source. 11 columns are available:
2022-02-18 02:33:36,566 squirro_data_load[62297] INFO       - body
2022-02-18 02:33:36,566 squirro_data_load[62297] INFO       - created_at
2022-02-18 02:33:36,566 squirro_data_load[62297] INFO       - document_type
2022-02-18 02:33:36,566 squirro_data_load[62297] INFO       - id
2022-02-18 02:33:36,566 squirro_data_load[62297] INFO       - keywords
2022-02-18 02:33:36,566 squirro_data_load[62297] INFO       - link
2022-02-18 02:33:36,566 squirro_data_load[62297] INFO       - section
2022-02-18 02:33:36,566 squirro_data_load[62297] INFO       - source
2022-02-18 02:33:36,566 squirro_data_load[62297] INFO       - subject
2022-02-18 02:33:36,566 squirro_data_load[62297] INFO       - summary
2022-02-18 02:33:36,566 squirro_data_load[62297] INFO       - title
2022-02-18 02:33:36,566 squirro_data_load[62297] ERROR    Exception: Failed to map abstract to missing column 'abstract'.
Traceback (most recent call last):
  File "/Users/manutej.mulaveesala/python_virtual_environments/squirro/lib/python3.7/site-packages/squirro/dataloader/sq_data_load.py", line 175, in load_from_source
    stats = self._update_source_last_items(stats_gen, cli_mode)
  File "/Users/manutej.mulaveesala/python_virtual_environments/squirro/lib/python3.7/site-packages/squirro/dataloader/sq_data_load.py", line 203, in _update_source_last_items
    for stats in stats_gen:
  File "/Users/manutej.mulaveesala/python_virtual_environments/squirro/lib/python3.7/site-packages/squirro/dataloader/sq_data_load.py", line 225, in create_and_upload_items
    self._validate_mapping(schema)
  File "/Users/manutej.mulaveesala/python_virtual_environments/squirro/lib/python3.7/site-packages/squirro/dataloader/sq_data_load.py", line 387, in _validate_mapping
    raise ValueError(err_msg)
ValueError: Failed to map abstract to missing column 'abstract'.
2022-02-18 02:33:36,569 squirro_data_load[62297] ERROR    Failed to map abstract to missing column 'abstract'.
2022-02-18 02:33:36,569 squirro_data_load[62297] INFO     Total run time: 0:00:00.024078

squirro_data_load -v \
    --token $TOKEN \
    --cluster $CLUSTER \
    --project-id $PROJECT_ID \
    --source-script 'nytimes_data_loader.py' \
    --nytimes-query 'covid OR COVID-19 OR coronavirus or supply chain' \
    --source-batch-size 100 \
    --source-name 'Nytimes v1' \
    --map-abstract 'abstract' \
    --map-title 'title' \
    --map-body 'body' \
    --facets-file 'facets.json' \
    --nytimes-api-key $(getval nytimes_token nytimes) 

Hey @manutej! Thank you for your question and the helpful information you’ve provided to help debug this issue.

The error you’re getting reads ValueError: Failed to map abstract to missing column 'abstract'.

I see that your bash script passes --map-abstract 'abstract'. This tells the data loader to fill the item's abstract from a column named abstract, so it assumes that the input data contains a column with exactly that name.

Are you sure the data source actually provides that field, and that it is named ‘abstract’?

The schema loaded from your data source lists 11 columns, and abstract is not one of them. From that list, summary looks like it might be the relevant column for what you’re trying to achieve. Does that help?
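To make the check concrete, here is a minimal Python sketch of the kind of validation the error message describes. It is an illustration only, not the actual squirro_data_load code: the function name find_missing_columns and the two mapping dictionaries are hypothetical, and the column list is copied from the log output above.

```python
# Illustrative sketch only: this is NOT the squirro_data_load implementation.
from typing import Dict, List

# Columns reported in the "Loaded schema from data source" log lines.
SCHEMA_COLUMNS = [
    "body", "created_at", "document_type", "id", "keywords", "link",
    "section", "source", "subject", "summary", "title",
]


def find_missing_columns(mapping: Dict[str, str], columns: List[str]) -> Dict[str, str]:
    """Return the mappings whose source column is absent from the schema."""
    available = set(columns)
    return {field: col for field, col in mapping.items() if col not in available}


# Mapping from the original command (--map-abstract 'abstract', and so on).
original = {"abstract": "abstract", "title": "title", "body": "body"}

# Candidate fix (an assumption): point the abstract at the existing 'summary' column.
fixed = {"abstract": "summary", "title": "title", "body": "body"}

print(find_missing_columns(original, SCHEMA_COLUMNS))  # {'abstract': 'abstract'}
print(find_missing_columns(fixed, SCHEMA_COLUMNS))     # {}
```

If summary really is the field you want, the equivalent change in the bash script would be to pass --map-abstract 'summary'. Alternatively, the plugin could yield the value under the key abstract so that the original mapping matches.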

Welcome @sciurus_vulgaris :squirro:
