Hello dear Squirro community,
During my work on a custom dataloader, I have observed some unexpected behaviour concerning the batch-size and source-batch-size arguments when using a load.sh script to manually load extracted information into my Squirro instance.
It seems to me that only the source-batch-size argument is used and the batch-size argument is ignored. I observed this by altering the batch-size argument, which to my understanding should be passed automatically to the getDataBatch function as a parameter. Yet instead of batch-size elements being passed on to the dataloader, source-batch-size elements were used. I did not test this explicitly for the UI, but the same issue might occur there too.
Is this intended behaviour? I would have assumed that the source-batch-size argument constrains the number of elements initially retrieved from an API, say, and that the batch-size argument then determines the number of elements yielded by the getDataBatch generator.
Maximilian from d-fine
The source-batch-size argument specifies the amount of data to fetch from the data source (e.g. a CSV file or some other API) in a single batch. The batch-size argument, on the other hand, specifies the amount of data to be loaded into Squirro after the data has been fetched from the source.
These parameters can be specified in a load.sh script with --source-batch-size 123 and --batch-size 456, respectively.
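For context, a minimal load.sh could look like the sketch below. The squirro_data_load CLI and its connection flags are the standard ones, but treat the flag set and all values here as placeholders; your script will differ:

```shell
#!/bin/bash
# Hypothetical load.sh sketch -- cluster URL, token and project ID are placeholders.
squirro_data_load \
    --cluster "https://your-squirro-instance" \
    --token "$SQUIRRO_TOKEN" \
    --project-id "your-project-id" \
    --source-script my_dataloader_plugin.py \
    --source-name "My Custom Source" \
    --source-batch-size 123 \
    --batch-size 456
```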
You can check out this page for an example of how to specify these parameters on the frontend: https://squirro.atlassian.net/wiki/spaces/DOC/pages/2214168412/Dataloader+Frontend+Improvements
Hope this helps!
Thank you for your response. Indeed, this is also how I would have understood the functionality of both parameters. If I now write a custom dataloader myself, I would expect the batch-size argument from the load.sh script to be passed on to the batch_size parameter of the getDataBatch function automatically. Instead, the source-batch-size argument seems to be handed over by default. Looking at the getDataBatch function on this page https://squirro.atlassian.net/wiki/spaces/DOC/pages/1859682364/Data+loader+plugin+boilerplate, I think the source-batch-size argument would be used in the get_content_from_somewhere() function call, whereas the batch_size parameter is used in the yield loop in a way that would rather fit the batch-size argument.
Of course, this is not too crucial, as I can simply get both arguments explicitly via self.args.batch-…; I just thought it might be good to make you aware of the behaviour, as it looks like a small bug to me :)
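To make the confusion concrete, here is a stripped-down sketch of the boilerplate pattern being discussed. This is not the real plugin code: the Squirro base class is omitted and get_content_from_somewhere() is a stand-in for an actual source call, as in the linked boilerplate page:

```python
# Stand-in for a real source call (CSV reader, REST API, ...):
# returns up to `limit` items from the source system.
def get_content_from_somewhere(limit):
    return [{"id": i, "title": f"item {i}"} for i in range(limit)]

class MySource:
    def getDataBatch(self, batch_size):
        # Per the observation in this thread, `batch_size` here receives
        # the --source-batch-size value, i.e. it controls the source fetch
        # as well as the size of the yielded lists.
        rows = get_content_from_somewhere(limit=batch_size)
        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

# With --source-batch-size 10, the generator yields one list of 10 items.
batches = list(MySource().getDataBatch(batch_size=10))
```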
The way to look at the design is that getDataBatch is in charge of dealing with the source system.
Hence its batch size argument is mapped to the --source-batch-size argument.
The --batch-size argument is not exposed to you in the plugin; it is applied further down the pipeline. So if you yield a list of 10000 items and --batch-size is 100, the dataloader library will break your big list into batches of 100.
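That re-batching step can be sketched in plain Python. This is an illustration of the behaviour described above, not the actual dataloader library code:

```python
def rebatch(items, batch_size):
    """Split one large yielded list into smaller chunks, the way the
    dataloader library applies the --batch-size value downstream."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# A plugin yields 10000 items in one go; with --batch-size 100 the
# library hands over 100 batches of 100 items each to the Squirro API.
big_list = list(range(10000))
batches = list(rebatch(big_list, 100))
```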
The driving force here should be the source system anyway: how heavily will the source system be taxed when fetching 1000 items in one go? How big will such a batch be? How can you resume if something goes wrong? What is your rate limit?
On the Squirro side, the batch size is of relevance too: 1000 tweets are different from 1000 PDF documents. But ultimately Squirro's API just takes the data, validates it, and then hands it over to the data ingester service to work on.
On the CLI side, I almost always set both arguments to the same value, based on what I need on the source side. It just makes following the progress on the console easier.
Hope this helps.
Thank you both very much for the detailed explanation! Got it now; I misunderstood the point at which the batch-size argument is used, sorry for the confusion!
Then indeed everything works as intended.