Partner Question - When writing a data loader plugin, what do you do with large binary data?

conor.leddy · June 2, 2022, 2:04pm

Question from Partner:
When writing a data loader plugin, what do you do with large binary data? Is there a way to hand off a file pointer, or do I just read in 10MB of PDF and stuff it in a JSON element, or what?

John Rehwinkel (he/him)

thomas.moellers · June 3, 2022, 9:13am

Hi John,

The filesystem loader plugin is the common way to load files from the file system into Squirro. Have a look at the Data loader reference; the section “Filesystem Options” on that page has more details on the usage.
The configuration below is taken directly from those docs. The --folder flag lets you specify the directory in which your files are located.

squirro_data_load -v \
    --token $TOKEN \
    --cluster $CLUSTER \
    --project-id $PROJECT_ID \
    --source-type filesystem \
    --folder FOLDER_TO_INDEX \
    --map-title "title" \
    --map-file-name "file_name" \
    --map-file-mime "file_mime" \
    --map-file-data "file_data" \
    --map-id "id" \
    --map-url "link" \
    --map-created-at "created_at" \
    --facets-file facets.json

If the files are not located in the file system but, for instance, inside an AWS S3 bucket, this would require a custom connector tailored to the respective platform. In that case you would probably use the boto3 SDK that AWS provides.

Best,

Thomas

jrehwinkel · June 3, 2022, 1:07pm

This is not sufficient: I have pairs of files on the filesystem, one with custom metadata, and another PDF with the contents to index. I want to use a custom loader to upload the PDF contents along with the metadata.

sciurus_vulgaris · June 7, 2022, 7:28am

Hey @jrehwinkel, you can use a pipelet to read the metadata from the JSON file and enrich the document accordingly. Let me know if that helps or if you need any more guidance on this.

tonibirrer · June 21, 2022, 6:11am

Hi John.

Yes, you can indeed just stuff larger objects into the item JSON. If you write a custom data loader plugin, you would provide the binary data base64 encoded.
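To make the base64 approach concrete, here is a minimal sketch of how one PDF and its metadata file could be combined into a single item dictionary. It is only an illustration: build_item is a hypothetical helper, not part of the Squirro API, the metadata file is assumed to be JSON, and the key names (file_name, file_mime, file_data and so on) are simply borrowed from the --map-* flags in the command earlier in this thread.

```python
import base64
import json
import mimetypes
from pathlib import Path


def build_item(pdf_path, metadata_path):
    """Combine a binary file and its JSON metadata file into one item dict.

    Hypothetical helper: the key names mirror the --map-* flags shown
    earlier in the thread and are an assumption, not a fixed schema.
    """
    pdf_path = Path(pdf_path)
    metadata = json.loads(Path(metadata_path).read_text(encoding="utf-8"))

    # Guess the MIME type from the file extension, with a generic fallback.
    mime, _ = mimetypes.guess_type(pdf_path.name)

    item = {
        "id": pdf_path.stem,
        "title": metadata.get("title", pdf_path.name),
        "file_name": pdf_path.name,
        "file_mime": mime or "application/octet-stream",
        # JSON cannot carry raw bytes, so the content is base64 encoded
        # and stored as an ASCII string.
        "file_data": base64.b64encode(pdf_path.read_bytes()).decode("ascii"),
    }

    # Carry over the remaining metadata fields without overwriting
    # the keys set above.
    for key, value in metadata.items():
        item.setdefault(key, value)
    return item
```

A custom loader plugin would then return or yield one such dictionary per file pair. Note that this reads the whole file into memory, which is fine for a 10 MB PDF but worth keeping in mind for much larger files.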

If you have very large ones, e.g. 100 MB or bigger, you might run into a protection setting of our Nginx webserver, but that is easily lifted.
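When judging whether a file will come close to such a limit, keep in mind that base64 turns every 3 bytes into 4 characters, so the encoded payload is roughly a third larger than the file itself. A small sketch of that arithmetic (base64_encoded_size is a hypothetical helper name, and the figure ignores the rest of the item JSON):

```python
def base64_encoded_size(raw_bytes):
    """Return the length of the padded base64 encoding of raw_bytes bytes."""
    # Every group of up to 3 input bytes becomes exactly 4 output characters.
    return 4 * ((raw_bytes + 2) // 3)


raw = 100 * 1024 * 1024  # a 100 MB file
encoded = base64_encoded_size(raw)
print(f"{raw} bytes on disk -> {encoded} bytes once base64 encoded")
```

So a file that looks safely under a request-size limit on disk can still exceed it after encoding.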

Now, having said that, inside Squirro binary files are handled with what we call a StorageContainer.
While we haven’t made this API public (yet), it can be leveraged so that you provide the data loader plugin with a StorageContainer URI instead of the actual binary. Whenever Squirro requires the binary content, it will use the URI and the related StorageContainer to retrieve the binary content ad hoc. This happens both for data processing and for serving it to the end user.

I’ve so far seen two patterns: One is to use S3 or a similar solution for the data storage, and the second is to avoid duplication of data by proxying the data directly from the source system.

Let me know if this sounds useful, and I will share a working example.
