Completing large data loads

Hi,

I have a custom CLI data loader for SharePoint that has to load ~45 GB across ~6k files. The squirro_data_load script ran for about 80 minutes and uploaded ~3.7k of those files, then hit some kind of server-side timeout error and stopped. The full stack trace is lengthy and contains client-specific info, but this should give an idea of what failed:

squirro_client.exceptions.ConnectionError: (None, ConnectionError(MaxRetryError('HTTPConnectionPool(host=\'localhost\', port=81): Max retries exceeded with url: /api/provider/v1/squirro/projects/HJencPjnROa2DQYL9TYvyQ/sources/qsw2g_3IRKqY4IvRTb4etg/push_items?priority=0 (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'localhost\', port=81): Read timed out. (read timeout=55)"))')))

The loader uses the SharePoint delta API for incremental loading, so subsequent runs against the same data source only request files edited since the previous run. But if I used the --reset flag, which resets the delta, my understanding is that it would have to parse all of the same files again. For now I have deleted the data source and started again, hoping for better results.
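
Stripped down, the delta handling in the source script works roughly like this (a simplified sketch, in my case going through the Microsoft Graph drive delta endpoint; token handling and persisting the delta link are left out):

import requests

GRAPH = "https://graph.microsoft.com/v1.0"


def fetch_changed_items(site_id, access_token, delta_link=None):
    """Return (changed_items, new_delta_link) for a SharePoint drive.

    With delta_link=None (first run, or after a reset) Graph enumerates
    the whole drive; with a stored delta link it only returns items
    changed since that link was issued.
    """
    url = delta_link or f"{GRAPH}/sites/{site_id}/drive/root/delta"
    headers = {"Authorization": f"Bearer {access_token}"}
    items, new_delta_link = [], None
    while url:
        resp = requests.get(url, headers=headers, timeout=60)
        resp.raise_for_status()
        page = resp.json()
        items.extend(page.get("value", []))
        # @odata.nextLink pages through the current result set;
        # @odata.deltaLink comes back once the set is exhausted and is
        # what gets stored for the next incremental run.
        new_delta_link = page.get("@odata.deltaLink", new_delta_link)
        url = page.get("@odata.nextLink")
    return items, new_delta_link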

Are there any good tips for completing these uploads without toppling the server? I haven't tinkered with --batch-size, but it seems like increasing it would hurt rather than help - same goes for --parallel-uploaders. The other thing that comes to mind is some kind of rate limit, but it seems I would need to implement that myself (sketched below). Or is there something that could be tweaked server-side?
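
If I do roll my own rate limit, this is roughly what I had in mind - a simple fixed-window throttle inside the source script (the limit, the window, and the hook point are all placeholders):

import time


class Throttle:
    """Fixed-window rate limit: at most max_items items per period seconds."""

    def __init__(self, max_items, period=60.0):
        self.max_items = max_items
        self.period = period
        self.window_start = time.monotonic()
        self.count = 0

    def wait(self, n=1):
        """Block until n more items may be sent in the current window."""
        now = time.monotonic()
        if now - self.window_start >= self.period:
            self.window_start, self.count = now, 0
        if self.count + n > self.max_items:
            # sleep out the remainder of the window, then start a new one
            remaining = self.period - (time.monotonic() - self.window_start)
            if remaining > 0:
                time.sleep(remaining)
            self.window_start, self.count = time.monotonic(), 0
        self.count += n

The batch generator would then call something like throttle.wait(len(batch)) before handing each batch to the loader; where exactly that hook lives depends on the plugin interface.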

For reference, here’s an example of my loader script. Note the size limit of 200 MB - things would be worse otherwise.

squirro_data_load -v \
    --source-script ../sharepoint/sharepoint.py \
    --cluster $CLUSTER \
    --token $TOKEN \
    --project-id $PROJECT_ID \
    --site-url '<URL for your site>' \
    --source-name "<name of your site>" \
    --sharepoint-client-id "$(getval client_id sharepoint)" \
    --sharepoint-client-secret "$(getval client_secret sharepoint)" \
    --sharepoint-tenant-id "$(getval tenant_id sharepoint)" \
    --index-all \
    --file-size-limit 200 \
    --pipeline-workflow-name "SharePoint" \
    --map-id id \
    --map-title name \
    --map-body fallback_body \
    --map-created-at createdDateTime \
    --map-file-name name \
    --map-file-mime file.mimeType \
    --map-file-data content \
    --map-url url \
    --map-flag flag \
    --facets-file facets.json

Hi Neil,

Did the next attempt go through?

The error thrown is a connection timeout against the provider API, i.e. at the point where the data is pushed into Squirro.
That could have a few reasons, but most likely the Squirro server was overloaded and could not respond in time.

The ideal solution would be for the Squirro data loader not to fail, but to retry after some back-off.
We have that in place for some transactions, but here it seems not to have kicked in.
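
Something along these lines is what I mean - a rough sketch, not the actual data loader code; push_batch stands in for whatever sends a batch to the provider API, and the real loader would catch the squirro_client exception types rather than the built-ins:

import random
import time


def push_with_backoff(push_batch, batch, max_attempts=5, base_delay=2.0):
    """Retry a failing push with exponential back-off and jitter.

    Transient connection/timeout errors are retried; the last error is
    re-raised once the attempts are exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return push_batch(batch)
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            time.sleep(delay)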

The rate limit is a great idea; we should add that to the data loader base library so that all plugins can leverage it.

In the meantime, if it fails again and it wasn’t a one-off hang, you should have a closer look at the server metrics, especially CPU load and memory usage.

Best,
Toni
