I have a custom CLI data loader for SharePoint that has to load ~45 GB across ~6k files. The squirro_data_load script ran for about 80 minutes, uploading ~3.7k of those files, then hit what looks like a server-side timeout and stopped. The full stack trace is lengthy and has client-specific info, but this should give an idea of what failed:
squirro_client.exceptions.ConnectionError: (None, ConnectionError(MaxRetryError('HTTPConnectionPool(host=\'localhost\', port=81): Max retries exceeded with url: /api/provider/v1/squirro/projects/HJencPjnROa2DQYL9TYvyQ/sources/qsw2g_3IRKqY4IvRTb4etg/push_items?priority=0 (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'localhost\', port=81): Read timed out. (read timeout=55)"))')))
The loader uses the SharePoint delta API for incremental loading, so subsequent runs of the loader (without deleting the existing data source first) would only request files edited since the previous run. But if I used the --reset flag, which should reset the delta, my understanding is that it would need to parse all of the same files again. For now I have deleted the data source and started again, hoping for better results.
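For anyone unfamiliar with delta loading, here is a rough sketch of how paging through a Microsoft Graph delta feed generally works. It is not my loader's actual code: collect_delta and fetch_page are hypothetical names, and fetch_page stands in for whatever authenticated HTTP GET the loader performs. The only things taken from the Graph API are the response keys value, @odata.nextLink and @odata.deltaLink.

```python
from typing import Callable, Dict, List, Optional, Tuple


def collect_delta(
    fetch_page: Callable[[str], Dict],
    start_url: str,
) -> Tuple[List[Dict], Optional[str]]:
    """Follow a delta feed to its end.

    fetch_page is a hypothetical callable that performs the GET request
    and returns the decoded JSON body as a dict.
    Returns the changed items plus the delta link to store for the next run.
    """
    items: List[Dict] = []
    delta_link: Optional[str] = None
    url: Optional[str] = start_url

    while url:
        page = fetch_page(url)
        items.extend(page.get("value", []))
        # The final page carries @odata.deltaLink instead of @odata.nextLink.
        delta_link = page.get("@odata.deltaLink", delta_link)
        url = page.get("@odata.nextLink")

    return items, delta_link
```

Starting the next run from the stored delta link is what limits it to changed files. Discarding that link, which is my understanding of what --reset does, means starting from the beginning of the feed again.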
Are there any good tips for completing these uploads without toppling the server? I haven't tinkered with --batch-size, but it seems like increasing it would hurt rather than help, and the same goes for --parallel-uploaders. The other thing that comes to mind is some kind of rate limit, but it seems I would need to implement that myself. Or is there something that could be tweaked server-side?
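If I do end up implementing the rate limit myself, I imagine it would look something like the sketch below: a generator wrapper that slows down how fast items are handed over, based on the size of their file content. This is untested against Squirro and built on assumptions: throttle_by_bytes is a made-up name, I am assuming each item is a dict with the raw file bytes under the content key (as in my --map-file-data mapping), and I have not checked where in the source script it would best be wired in.

```python
import time
from typing import Callable, Dict, Iterable, Iterator


def throttle_by_bytes(
    items: Iterable[Dict],
    max_bytes_per_second: float,
    size_of: Callable[[Dict], int] = lambda item: len(item.get("content") or b""),
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Dict]:
    """Yield items no faster than max_bytes_per_second on average.

    Assumes each item is a dict whose "content" value holds the file bytes.
    clock and sleep are injectable so the pacing can be tested without waiting.
    """
    if max_bytes_per_second <= 0:
        raise ValueError("max_bytes_per_second must be positive")

    next_allowed = clock()
    for item in items:
        now = clock()
        if now < next_allowed:
            # Wait until the previous item's byte budget has been "paid off".
            sleep(next_allowed - now)
        yield item
        # Each item pushes the next permitted time back in proportion to its size.
        next_allowed = max(clock(), next_allowed) + size_of(item) / max_bytes_per_second


# Hypothetical usage inside the source script, capping at roughly 5 MB/s:
# for item in throttle_by_bytes(all_items, max_bytes_per_second=5 * 1024 * 1024):
#     ...hand the item to the loader...
```

This only paces the source side. It does nothing about how the items are batched or how many uploaders run in parallel, so it may or may not be enough to avoid the read timeout.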
For reference, here's an example of my loader script. Note the file size limit of 200 MB; things would be worse without it.
squirro_data_load -v \
    --source-script ../sharepoint/sharepoint.py \
    --cluster $CLUSTER \
    --token $TOKEN \
    --project-id $PROJECT_ID \
    --site-url '<URL for your site>' \
    --source-name "<name of your site>" \
    --sharepoint-client-id "$(getval client_id sharepoint)" \
    --sharepoint-client-secret "$(getval client_secret sharepoint)" \
    --sharepoint-tenant-id "$(getval tenant_id sharepoint)" \
    --index-all \
    --file-size-limit 200 \
    --pipeline-workflow-name "SharePoint" \
    --map-id id \
    --map-title name \
    --map-body fallback_body \
    --map-created-at createdDateTime \
    --map-file-name name \
    --map-file-mime file.mimeType \
    --map-file-data content \
    --map-url url \
    --map-flag flag \
    --facets-file facets.json