Technical docs on Squirro and Elasticsearch

Hey Squirro folks - I’m looking for any technical docs on the interaction of Elasticsearch w Squirro. In particular, how is content that is ingested into Squirro subsequently stored and indexed in Elasticsearch?

I’ve seen this content:
https://squirro.atlassian.net/wiki/spaces/DOC/pages/255295524/Managing+Elasticsearch

But I’m looking for more detail. Does anything like this exist?

Thanks,
Chris


Hi Chris,

Before I provide you with additional information, I would like to understand whether your wish to better understand the interaction between Elasticsearch and Squirro is just personal curiosity or is related to other reasons.

Thanks.

Hey Filipo - EK is working w Squirro on a couple of projects now. As a Solutions Architect, it will help me greatly to understand this interaction b/w Squirro and Elasticsearch when developing solutions. I have many years of experience w Enterprise Search and tools like Elasticsearch.

Does this make sense? Feel free to reach out to me personally if you’d like.

Thanks,
Chris

Hey @cmarino! Welcome to the Squirro forum and thank you for your question. I’m really pleased to see you’re keenly interested in how Squirro interacts with Elasticsearch. I’ve alerted one of our search experts here at Squirro to see if he can give a brief explanation on how we do this.

In the meantime, it’s important to note that we do not recommend or support any changes made to the Elasticsearch configurations. Any interaction with the data stored in Elasticsearch indices should happen either through Squirro’s web API or with the Squirro client.

Hi Amin - Thanks for the welcome and thanks for your response. I totally agree w your note in the second paragraph, rest assured I have no need to tamper w any of the Elasticsearch configs or mappings.

It’s still valuable to be able to understand this interaction, esp as I understand documents are indexed at the sentence level (though correct me if this is not the case).

Chris

While I’m not the search expert that Amin mentions, I can at least get you started.

Elasticsearch is fully abstracted by our APIs. So apart from some of the operational challenges of running ES, you should not need to interact with it directly.

When it comes to indexing, we use our own data processing pipeline (aka the ingester service) to validate, transform, and enrich the data. The pipelines then usually include the Index step, which handles the actual indexing into ES. There is nothing special about that step: since all the data preparation is done in Squirro, it is a simple index action using the Python Elasticsearch client.
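
To make the flow concrete, here is a rough sketch of the idea. It is not Squirro's actual code: the step functions, the item fields, and the `FakeIndexClient` stand-in are all hypothetical, and the stand-in only mimics the shape of an `index(index=..., id=..., document=...)` call so the example runs without an ES cluster.

```python
from typing import Callable

Item = dict
Step = Callable[[Item], Item]


def validate(item: Item) -> Item:
    # Hypothetical validation rule: every item needs a non-empty title.
    if not item.get("title"):
        raise ValueError("item needs a title")
    return item


def transform(item: Item) -> Item:
    # Hypothetical transformation: normalise whitespace in the title.
    return {**item, "title": item["title"].strip()}


def enrich(item: Item) -> Item:
    # Hypothetical enrichment: add a word count for the body text.
    return {**item, "word_count": len(item.get("body", "").split())}


class FakeIndexClient:
    """Stand-in for a search client; records calls instead of sending them."""

    def __init__(self) -> None:
        self.indexed = []

    def index(self, index: str, id: str, document: Item) -> None:
        self.indexed.append((index, id, document))


def make_index_step(client, index_name: str) -> Step:
    # By the time this step runs, all preparation is done,
    # so the step is a single plain index call.
    def index_step(item: Item) -> Item:
        client.index(index=index_name, id=item["id"], document=item)
        return item

    return index_step


def run_pipeline(item: Item, steps: list) -> Item:
    # Pass the item through each step in order.
    for step in steps:
        item = step(item)
    return item


client = FakeIndexClient()
steps = [validate, transform, enrich, make_index_step(client, "squirro_v9_example")]
result = run_pipeline({"id": "doc-1", "title": "  Hello ", "body": "a b c"}, steps)
```

The point of the sketch is only the ordering: the item is fully prepared before the index step sees it, which is why that step can stay trivial.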

If you have access to a Squirro installation, you can see the schema definition under /etc/elasticsearch/templates

The main one is squirro_v9.json.

In the end, each Squirro project has its own main index called squirro_v9_<project_id_lowercased>.
This index is where the actual documents are persisted. A single Squirro document can be stored as multiple records, e.g. a 10-page PDF will end up as one main document, plus 10 items representing the pages.
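
As a small illustration of the naming and the one-document-to-many-records idea, here is a sketch. Only the `squirro_v9_` prefix and the lowercasing come from the description above; the record fields (`kind`, `parent`, `page`, `text`) and the ID scheme are made up for the example and do not reflect the real schema.

```python
def project_index_name(project_id: str) -> str:
    # Main index name for a project: fixed prefix plus the lowercased project ID.
    return f"squirro_v9_{project_id.lower()}"


def split_into_records(doc_id: str, pages: list) -> list:
    # One main record for the document itself (hypothetical fields).
    main = {"id": doc_id, "kind": "document", "page_count": len(pages)}
    # One extra record per page, pointing back at the main document.
    page_records = [
        {"id": f"{doc_id}-p{n}", "kind": "page", "parent": doc_id, "page": n, "text": text}
        for n, text in enumerate(pages, start=1)
    ]
    return [main] + page_records


# A 10-page PDF becomes 1 main record plus 10 page records.
records = split_into_records("report", [f"text of page {n}" for n in range(1, 11)])
index_name = project_index_name("AbC123")
```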

I hope this gets you started.
