Hi,
I have been using an adapted version of the SharePoint data loader to extract files from SharePoint sites, but now I need to add the capability to extract SharePoint list items rather than files in drives. The logic is different enough that it belongs in a completely separate data loader, but the set of SharePoint APIs I'm using is very similar. So, ideally, I would like to extract my SharePoint API client into a separate module that can be shared between the two data loader modules.
I tried doing this locally, and it seemed to run without any issues. Instead of all the code being in one file that is passed to the `--source-script` argument of `squirro_data_load`, I put my API client in a `sharepoint_client.py` file in the same directory, put the data loader class (i.e. the one inheriting `DataSource`) in a `dataloader.py` file that imports the SharePoint client class, and passed the path to the latter as my `--source-script` argument.
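For concreteness, this is roughly the layout I mean. It's a minimal sketch, not my actual code: the class names, the `get_list_items` method, the endpoint, and the hard-coded connection values are all illustrative, and the `DataSource` methods follow the standard custom-loader interface as I understand it from the docs:

```python
# sharepoint_client.py -- the shared API client (illustrative sketch;
# the real client wraps more of the SharePoint REST API)
import requests

class SharePointClient:
    def __init__(self, site_url, token):
        self.site_url = site_url
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"Bearer {token}"

    def get_list_items(self, list_title):
        # Hypothetical convenience method for the list-items loader.
        url = f"{self.site_url}/_api/web/lists/getbytitle('{list_title}')/items"
        resp = self.session.get(url, headers={"Accept": "application/json"})
        resp.raise_for_status()
        return resp.json()["value"]


# dataloader.py -- the file passed to --source-script
from squirro.dataloader.data_source import DataSource
from sharepoint_client import SharePointClient  # local import from same directory

class SharePointListSource(DataSource):
    def connect(self, inc_column=None, max_inc_value=None):
        # Connection details would come from the loader's CLI arguments
        # in the real implementation; hard-coded here for brevity.
        self.client = SharePointClient("https://example.sharepoint.com/sites/x", "TOKEN")

    def disconnect(self):
        pass

    def getDataBatch(self, batch_size):
        yield self.client.get_list_items("MyList")

    def getSchema(self):
        return ["Title", "Id"]

    def getJobId(self):
        return "sharepoint-list-loader"

    def getArguments(self):
        return []
```

The loader is then invoked exactly as before, just pointing `--source-script` at `dataloader.py`.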
However, I have never seen any Squirro data loaders do things this way, even though factoring code into modules is generally best practice and some data loaders can be several hundred lines of code. Are there any Squirro-specific reasons why it’s problematic to have multiple Python files in your data loader directory and import them locally, or gotchas to look out for?