Completing large data loads


I have a custom CLI data loader for SharePoint that has to load ~45 GB across ~6k files. The squirro_data_load script ran for about 80 minutes and uploaded ~3.7k of those files, then hit some kind of server-side timeout error and stopped. The full stack trace is lengthy and contains client-specific info, but this should give an idea of what failed:

squirro_client.exceptions.ConnectionError: (None, ConnectionError(MaxRetryError('HTTPConnectionPool(host=\'localhost\', port=81): Max retries exceeded with url: /api/provider/v1/squirro/projects/HJencPjnROa2DQYL9TYvyQ/sources/qsw2g_3IRKqY4IvRTb4etg/push_items?priority=0 (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'localhost\', port=81): Read timed out. (read timeout=55)"))')))

The loader uses the SharePoint delta API for incremental loading, meaning that subsequent runs of the loader without deleting the existing data source first would only request files edited since the previous run. But if I used the --reset flag, which should reset the delta, then my understanding is it would need to parse all of the same files again. For now I have deleted the data source and started again, hoping for better results.

Are there any good tips for how to complete these uploads without toppling the server? I haven't tinkered with --batch-size, but it seems like increasing it would hurt rather than help; the same goes for --parallel-uploaders. The other thing that comes to mind is some kind of rate limit, but it seems I would need to implement that myself. Or is there something that could be tweaked server-side?
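For what it's worth, a client-side rate limit could be bolted on around the upload loop without any server-side changes. Below is a minimal sketch of a token-bucket limiter; everything here is hypothetical illustration, not part of the Squirro SDK or the dataloader itself:

```python
import time


class TokenBucket:
    """Simple token-bucket rate limiter: allows roughly `rate` operations
    per second, with bursts of up to `capacity` operations."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum bucket size
        self.tokens = capacity      # start with a full bucket
        self.last = time.monotonic()

    def acquire(self, tokens: float = 1.0) -> None:
        """Block until `tokens` tokens are available, then consume them."""
        while True:
            now = time.monotonic()
            # Refill based on elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((tokens - self.tokens) / self.rate)


# Hypothetical usage in a custom upload loop (`push_batch` is a placeholder
# for whatever call pushes one batch of items):
# bucket = TokenBucket(rate=2.0, capacity=5.0)  # ~2 batches/second
# for batch in batches:
#     bucket.acquire()
#     push_batch(batch)
```

The bucket throttles sustained throughput to `rate` while still allowing short bursts, which tends to smooth load spikes better than a fixed sleep between batches.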

For reference, here’s an example of my loader script. Note the size limit of 200 MB - things would be worse otherwise.

squirro_data_load -v \
    --source-script ../sharepoint/ \
    --cluster $CLUSTER \
    --token $TOKEN \
    --project-id $PROJECT_ID \
    --site-url '<URL for your site>' \
    --source-name "<name of your site>" \
    --sharepoint-client-id "$(getval client_id sharepoint)" \
    --sharepoint-client-secret "$(getval client_secret sharepoint)" \
    --sharepoint-tenant-id "$(getval tenant_id sharepoint)" \
    --index-all \
    --file-size-limit 200 \
    --pipeline-workflow-name "SharePoint" \
    --map-id id \
    --map-title name \
    --map-body fallback_body \
    --map-created-at createdDateTime \
    --map-file-name name \
    --map-file-mime file.mimeType \
    --map-file-data content \
    --map-url url \
    --map-flag flag \
    --facets-file facets.json

Hi Neil,

Did the next attempt go through?

The error thrown is a connection timeout to the provider API, i.e. at the point where the data is loaded into Squirro.
It could have a few reasons, but most likely the Squirro server was overloaded and could not respond in time.

The ideal solution would be for the Squirro data loader not to fail, but to retry after some backoff.
We have this in place for some transactions, but here that seems not to have been the case.
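For reference, the kind of retry the reply describes could be sketched as a wrapper like the one below. This is a generic illustration, not the actual dataloader internals; the push call in the usage comment is a placeholder:

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(ConnectionError, TimeoutError)):
    """Call `func`, retrying transient errors with exponential backoff plus
    jitter. Re-raises the last error once attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))


# Hypothetical usage around a batch push:
# retry_with_backoff(lambda: push_batch(batch))
```

The jitter spreads retries out so that many clients hitting the same overloaded server don't all retry in lockstep.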

The rate limit is a great idea; we should add that to the dataloader base library, so that all plugins can leverage it.

In the meantime, if it fails again and wasn’t a one-off hang, then you should have a closer look at the server metrics, especially CPU load and memory usage.

