  • Shamik Ray

Exploring Solr Internals: How it works?

Updated: Aug 30, 2021



Solr is a great search platform built on top of Lucene. It works well out of the box, but there are times when you want to customize it to get something extra done. In this post, we explore one such advanced use case: modifying the way Solr indexes your documents.

This is useful when, for example, we want to add a new field whose value is derived from another field in the document. How can we achieve this if we don’t control the document source?


Before going into how to do this in Solr, we need to understand the concept of update request processors (URPs). Every update request received by Solr is run through a chain of plugins called URPs.


You can write these plugins to do any sort of pre-processing on Solr documents: you can add new fields, or even drop a document that you don’t want indexed.


In fact, a lot of Solr’s own features are written this way, as plugins, so it is essential to understand how they work and how to configure them.


How do you implement URPs?


URPs are created by extending two abstract classes: UpdateRequestProcessor and UpdateRequestProcessorFactory. The factory is used to instantiate a URP when a new request comes in, and the main business logic goes in the request processor. The factory can also take configuration parameters that modify the way the processor works.


Several such small request processors together make up a chain of URPs. They are applied in the order in which they appear in the chain when a new document is indexed.
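To make the chaining idea concrete, here is a minimal plain-Java sketch. It does not use Solr's classes: Processor, ProcessorFactory, AddFieldProcessor, DropMissingIdProcessor, CollectingProcessor and ChainBuilder are all hypothetical names, and a document is modelled as a simple map. It only illustrates how each processor hands the document to the next one, how a factory wires a processor to its successor, and how a document can be dropped by not forwarding it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an update request processor: it holds a
// reference to the next processor and forwards the document to it.
abstract class Processor {
    private final Processor next;

    Processor(Processor next) {
        this.next = next;
    }

    // Default behaviour: pass the document along, if there is a next step.
    void processAdd(Map<String, Object> doc) {
        if (next != null) {
            next.processAdd(doc);
        }
    }
}

// Hypothetical stand-in for the factory: creates a processor wired to `next`.
interface ProcessorFactory {
    Processor getInstance(Processor next);
}

// Adds a field to every document, then forwards it.
class AddFieldProcessor extends Processor {
    private final String name;
    private final Object value;

    AddFieldProcessor(String name, Object value, Processor next) {
        super(next);
        this.name = name;
        this.value = value;
    }

    @Override
    void processAdd(Map<String, Object> doc) {
        doc.put(name, value);
        super.processAdd(doc);
    }
}

// Drops documents without an "id" by simply not forwarding them.
class DropMissingIdProcessor extends Processor {
    DropMissingIdProcessor(Processor next) {
        super(next);
    }

    @Override
    void processAdd(Map<String, Object> doc) {
        if (doc.containsKey("id")) {
            super.processAdd(doc);
        }
    }
}

// Last step: records a copy of the document, standing in for the index write.
class CollectingProcessor extends Processor {
    private final List<Map<String, Object>> sink;

    CollectingProcessor(List<Map<String, Object>> sink) {
        super(null);
        this.sink = sink;
    }

    @Override
    void processAdd(Map<String, Object> doc) {
        sink.add(new HashMap<>(doc));
    }
}

final class ChainBuilder {
    // Builds back to front so the first factory yields the first processor to run.
    static Processor build(List<ProcessorFactory> factories) {
        Processor next = null;
        for (int i = factories.size() - 1; i >= 0; i--) {
            next = factories.get(i).getInstance(next);
        }
        return next;
    }
}
```

Building a chain from three factories (drop, add field, collect) and calling processAdd on the result runs the processors in list order. A document without an "id" never reaches the collecting step.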


A quick look at solrconfig.xml should reveal several examples of update request processor chains, like the one below:

  

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Shown above is an update processor chain named “dedupe”. It generates a signature field from the specified field(s), and that signature is used to de-duplicate documents so that the same document is not indexed multiple times.


In this example, the fields name, features, and cat are used to generate the value of the id field, which identifies duplicates. With overwriteDupes set to false, duplicates are not overwritten. The class that computes the signature is solr.processor.Lookup3Signature.
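The idea of a signature computed from selected fields can be sketched in plain Java. This is only an illustration, not Solr's implementation: it uses the JDK's MD5 digest in place of Lookup3, and SignatureSketch and its signature method are hypothetical names. Documents are modelled as string maps.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;

final class SignatureSketch {
    // Hashes the listed fields (name and value) into a hex string.
    // Fields not in the list do not affect the result.
    static String signature(Map<String, String> doc, List<String> fields) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String field : fields) {
                String value = doc.getOrDefault(field, "");
                md5.update(field.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator between name and value
                md5.update(value.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator between fields
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the digests every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }
}
```

Two documents that agree on name, features, and cat get the same signature even if their other fields differ, which is what lets the signature field act as a duplicate detector.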


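The last two processors are part of the default processor chain and perform essential functions: LogUpdateProcessorFactory logs the update, and RunUpdateProcessorFactory actually executes it against the index. Any custom chain usually contains these processors, and they shouldn’t be removed.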

We can define such chains and individual processors in solrconfig.xml and specify which chain to use while indexing a document, for example with the update.chain request parameter.


Let’s say our documents have a field called “Category”, and we expect its value to come from a known list. If the category value in an incoming document is anything else, we want to change the field value to “Others”.


So our update request processor will look something like this:


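import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class CustomRequestProcessor extends UpdateRequestProcessor {
    …..
    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        // Log is assumed to be a logger declared in the elided part above.
        Log.info("Processing the input document in custom request processor");
        SolrInputDocument doc = cmd.getSolrInputDocument();

        // Assumes "Category" holds a single String value.
        String category = (String) doc.getFieldValue("Category");
        // If the category is not from the predefined list, replace it.
        // ALLOWED_CATEGORIES is a placeholder Set<String> for that list,
        // assumed to be declared in the elided part above.
        if (category != null && !ALLOWED_CATEGORIES.contains(category)) {
            doc.setField("Category", "Others");
        }

        // Pass the command on to the next processor in the chain.
        super.processAdd(cmd);
    }
}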
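The category check itself can be tried out without a running Solr instance. Below is a minimal plain-Java sketch of that rule. CategoryNormalizer is a hypothetical helper, and the allowed values are made up for illustration. It also assumes that a missing category should be left alone, which the post does not specify.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class CategoryNormalizer {
    // Hypothetical list of accepted categories, for illustration only.
    private static final Set<String> ALLOWED = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("Books", "Electronics", "Clothing")));

    static final String FALLBACK = "Others";

    // Returns the category unchanged when it is allowed (or absent),
    // and the fallback value otherwise.
    static String normalize(String category) {
        if (category == null || ALLOWED.contains(category)) {
            return category;
        }
        return FALLBACK;
    }
}
```

Keeping the rule in a small helper like this means it can be unit-tested on its own, with the processor's processAdd method only calling it and writing the result back to the document.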

When this is done, we just need to package the plugin classes into a jar and add a lib directive to solrconfig.xml to tell Solr where the plugins are. They will be loaded when the Solr core (re)loads. We can then reference the factory class in a processor definition like the ones shown above.
