Partner Question - When writing a data loader plugin, what do you do with large binary data?

Question from Partner:
When writing a data loader plugin, what do you do with large binary data? Is there a way to hand off a file pointer, or do I just read in 10MB of PDF and stuff it in a JSON element, or what?

John Rehwinkel (he/him)


Hi John,

The filesystem loader plugin would be a common way to load files from the file system into Squirro. Have a look at the Data loader reference; in the section “Filesystem Options” on that page you will find more details on its usage.
The configuration below is taken directly from those docs. You can see that the --folder flag allows you to specify the directory in which your files are located.

squirro_data_load -v \
    --token $TOKEN \
    --cluster $CLUSTER \
    --project-id $PROJECT_ID \
    --source-type filesystem \
    --folder FOLDER_TO_INDEX \
    --map-title "title" \
    --map-file-name "file_name" \
    --map-file-mime "file_mime" \
    --map-file-data "file_data" \
    --map-id "id" \
    --map-url "link" \
    --map-created-at "created_at" \
    --facets-file facets.json

If the files are not located on the file system but, for instance, inside an AWS S3 bucket, this would require a custom connector tailored to the respective platform. In that case you would probably use the boto3 SDK that AWS provides.
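
For illustration, here is a minimal sketch of what the S3 side of such a connector could look like with boto3; the bucket name and key prefix are placeholders, not anything Squirro-specific:

import boto3

# Sketch: list the objects under a prefix and fetch their raw bytes.
# "my-bucket" and "pdfs/" are placeholders for illustration.
s3 = boto3.client("s3")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="pdfs/"):
    for obj in page.get("Contents", []):
        response = s3.get_object(Bucket="my-bucket", Key=obj["Key"])
        data = response["Body"].read()  # raw bytes of the file
        # hand `data` over to the loader item, e.g. base64-encoded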

Best,

Thomas


This is not sufficient: I have pairs of files on the filesystem, one with custom metadata and a PDF with the contents to index. I want to use a custom loader to upload the PDF contents along with the metadata.

Hey @jrehwinkel, you can use a pipelet to read the metadata from the JSON file and enrich the document accordingly. Let me know if that helps or if you need any more guidance on this.
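
A rough sketch of such a pipelet, assuming the metadata JSON sits next to the PDF with the same base name and that the item’s link field carries the original file path (both of these are assumptions, adjust to your layout):

import json
import os

from squirro.sdk import PipeletV1


class SidecarMetadataPipelet(PipeletV1):
    """Sketch: enrich items with facets read from a sidecar JSON file."""

    def consume(self, item):
        # Assumption: report.pdf has its metadata in report.json next to
        # it, and the item's "link" field holds the original file path.
        pdf_path = item.get("link") or ""
        meta_path = os.path.splitext(pdf_path)[0] + ".json"
        if os.path.isfile(meta_path):
            with open(meta_path) as f:
                metadata = json.load(f)
            # Facets live in the item's "keywords" dict; values are lists.
            keywords = item.setdefault("keywords", {})
            for key, value in metadata.items():
                keywords[key] = value if isinstance(value, list) else [value]
        return item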


Hi John.

Yes, you can indeed just stuff larger objects into the item JSON. If you write a custom data loader plugin, you would provide the binary data base64-encoded.
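
As a minimal sketch, an item for one PDF/metadata pair could be assembled like this; the field names mirror the --map-* flags from the filesystem example above, and the paths are placeholders:

import base64
import json
import os

# Placeholders for one PDF plus its sidecar metadata file.
pdf_path = "report.pdf"
meta_path = "report.json"

with open(pdf_path, "rb") as f:
    file_data = base64.b64encode(f.read()).decode("ascii")

with open(meta_path) as f:
    metadata = json.load(f)

item = {
    "id": pdf_path,
    "title": os.path.basename(pdf_path),
    "file_name": os.path.basename(pdf_path),
    "file_mime": "application/pdf",
    "file_data": file_data,  # base64-encoded binary content
}
item.update(metadata)  # merge the custom metadata for facet mapping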

If you have very large files, e.g. 100 MB or bigger, you might run into a protection setting of our Nginx web server (typically the client_max_body_size request-size limit), but that is easily lifted.

Now, having said that, inside Squirro binary files are handled with what we call a StorageContainer.
While we haven’t made this API public (yet), it can be leveraged so that you provide the data loader plugin with a StorageContainer URI instead of the actual binary. Whenever Squirro requires the binary content, it uses the URI and the related StorageContainer to retrieve it ad hoc. This happens both for data processing and for serving it to the end user.
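
Purely as an illustration of the idea (since the API is not public, the field name and URI scheme below are hypothetical placeholders), the item would then carry a reference instead of the payload:

# Hypothetical sketch only: "file_url" and the storage:// scheme are
# placeholders, as the StorageContainer API is not public.
item = {
    "id": "report.pdf",
    "title": "report.pdf",
    "file_name": "report.pdf",
    "file_mime": "application/pdf",
    # A reference Squirro resolves ad hoc, instead of a base64 payload:
    "file_url": "storage://my-container/pdfs/report.pdf",
}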

I’ve seen two patterns so far: one is to use S3 or a similar solution for the data storage; the second is to avoid duplicating data by proxying it directly from the source system.

Let me know if this sounds useful, and I will share a working example.
