Batch-size vs. source-batch-size arguments

maximilian.hoeschler · July 5, 2022, 12:37pm

Hello dear Squirro community,

during my work on a custom dataloader, I have observed unexpected behaviour concerning the batch-size and source-batch-size arguments when using a load.sh script to manually load extracted information into my Squirro instance.

It seems to me that only the source-batch-size argument is used and the batch-size argument is ignored. I observed this by altering the batch-size argument, which, as I understood it, should be passed automatically to the getDataBatch function as a parameter. However, instead of batch-size elements, source-batch-size elements were passed on to the dataloader. I did not test this explicitly for the UI, but the same issue might occur there too.

Is this intended behaviour? Before observing this, I would have assumed that the source-batch-size argument constrains, for example, the number of elements retrieved from an API in the first instance, and that the batch-size argument then determines the number of elements yielded by the getDataBatch generator.

Best Regards
Maximilian from d-fine

daren.sin · July 6, 2022, 9:56am

Hi Maximilian.

The source-batch-size argument specifies the amount of data to fetch from the data source (e.g. CSV file, some other API) in a single batch. On the other hand, the batch-size argument specifies the amount of data to be loaded into Squirro after fetching the data from the source.

These parameters can be specified in a load.sh script with: --source-batch-size 123 and --batch-size 456 respectively.

You can check out this page, for example, on how to specify these parameters on the frontend: https://squirro.atlassian.net/wiki/spaces/DOC/pages/2214168412/Dataloader+Frontend+Improvements

Hope this helps!

Regards
Daren

maximilian.hoeschler · July 7, 2022, 7:16am

Hello Daren,

thank you for your response. Indeed, this is also how I understood the functionality of both parameters. If I now program a custom dataloader myself, I would expect the batch-size argument from the load.sh script to be passed on automatically to the batch_size parameter of the getDataBatch function. Instead, it seems that by default the source-batch-size argument is handed over. If you look at the getDataBatch function on this page https://squirro.atlassian.net/wiki/spaces/DOC/pages/1859682364/Data+loader+plugin+boilerplate I think the source-batch-size argument would be used in the get_content_from_somewhere() function call, whereas the batch_size parameter is used in the yield loop in a way that would rather fit the batch-size argument.

Of course, this is not crucial, as I can simply get both arguments explicitly with self.args.batch-…; I just thought it might be good to make you aware of the behaviour, as it seems to me like a small bug :)

Best Regards
Maximilian

tonibirrer · July 7, 2022, 12:47pm

Hi Maximilian,

The way to look at the design is that getDataBatch is in charge of dealing with the source system.
Hence its batch size argument is mapped to the --source-batch-size argument.
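
As a rough illustration of that mapping (a sketch only, not Squirro's actual plugin base class; `ExampleSource`, `fetch_page` and the sample records are hypothetical stand-ins for a real plugin and source API), a getDataBatch generator whose batch_size parameter drives the fetch from the source might look like this:

```python
from typing import Dict, Iterator, List

# Hypothetical in-memory "source system" standing in for a real API.
SOURCE_RECORDS = [{"id": i, "title": f"Item {i}"} for i in range(25)]


def fetch_page(offset: int, limit: int) -> List[Dict]:
    """Hypothetical helper: return up to `limit` records starting at `offset`."""
    return SOURCE_RECORDS[offset:offset + limit]


class ExampleSource:
    """Simplified stand-in for a data loader plugin (not the real base class)."""

    def getDataBatch(self, batch_size: int) -> Iterator[List[Dict]]:
        # Assumption: `batch_size` carries the --source-batch-size value,
        # so it controls how much is requested from the source per call.
        offset = 0
        while True:
            page = fetch_page(offset, batch_size)
            if not page:
                break
            yield page
            offset += len(page)


# With a source batch size of 10, 25 records arrive as three source batches.
batch_lengths = [len(batch) for batch in ExampleSource().getDataBatch(10)]
```

Here the size of each yielded list follows the source-side setting; nothing in the plugin refers to --batch-size.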

The --batch-size argument is not exposed to you in the plugin; it is applied further down the pipeline.
So if you yield a list of 10000 items and --batch-size is 100, the dataloader library will break your big list into batches of 100.
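
To make the arithmetic concrete, here is a minimal sketch of that kind of re-batching. It is an illustration under assumptions, not the dataloader library's actual code; `split_into_batches` is a hypothetical helper name.

```python
from typing import Dict, Iterator, List


def split_into_batches(items: List[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Hypothetical helper: cut one large list into chunks of at most `batch_size`."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# One big list, as a plugin might yield it from getDataBatch.
yielded_by_plugin = [{"id": i} for i in range(10000)]

# Downstream, the list is cut into upload batches of 100 items each.
upload_batches = list(split_into_batches(yielded_by_plugin, 100))
```

In this sketch the 10000 items end up as 100 upload batches, regardless of how large the list yielded by the plugin was.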

The driving force here should be the source system anyway: how heavily will the source system be taxed when fetching 1000 items in one go? How big will this batch be? How can you resume if something goes wrong? What is your rate limit?

On the Squirro side, the batch size is of relevance too: 1000 tweets are different from 1000 PDF documents. But ultimately Squirro's API just takes the data, validates it, and then hands it over to the data ingester service to work on.

On the CLI side, I almost always set both args to the same value, based on what I need on the source side. It just makes following the progress on the console easier.

Hope this helps.

Best,
Toni

maximilian.hoeschler · July 7, 2022, 12:57pm

Hi Toni,
hi Daren,

thank you both very much for the detailed explanation! Got it now, I misunderstood the point at which the batch-size argument is used, sorry for the confusion!

Then indeed everything works as intended.

Best Regards
Maximilian
