
📘 Custom analyzers

Composed from analyzer building blocks:

  • charFilters (optional): pre-process the characters of the text, filtering or replacing them before tokenization. Types: htmlStrip, icuNormalize, mapping, persian
  • tokenizer (required): splits the text into tokens. Types: edgeGram, keyword, nGram, regexCaptureGroup, regexSplit, standard, uaxUrlEmail, whitespace
  • tokenFilters (optional): process individual tokens after tokenization. Types: asciiFolding, daitchMokotoffSoundex, edgeGram, englishPossessive, flattenGraph, icuFolding, icuNormalizer, kStemming, length, lowercase, nGram, porterStemming, regex, reverse, shingle, snowballStemming, spanishPluralStemming, stempel, stopword, trim, wordDelimiterGraph
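A custom analyzer combining these building blocks is declared in the search index definition. A minimal sketch, assuming one of each block; the analyzer name `htmlStripLowercase` and the field name `description` are illustrative, not from the source:

```json
{
  "analyzers": [
    {
      "name": "htmlStripLowercase",
      "charFilters": [{ "type": "htmlStrip" }],
      "tokenizer": { "type": "standard" },
      "tokenFilters": [{ "type": "lowercase" }]
    }
  ],
  "mappings": {
    "dynamic": false,
    "fields": {
      "description": { "type": "string", "analyzer": "htmlStripLowercase" }
    }
  }
}
```

The blocks run in order: char filters first, then the tokenizer, then each token filter in the listed sequence.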

Custom analyzers

last 4 digits

🔗 Matching on the last 4 digits of a phone number (regex extraction during indexing, keyword analysis at query time)
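A sketch of the technique: a regexCaptureGroup tokenizer emits only the captured trailing four digits at index time, while the query side treats the search term as a single keyword token so it is compared verbatim. The analyzer name `phoneLast4`, the field name `phone`, and the exact pattern are illustrative assumptions:

```json
{
  "analyzers": [
    {
      "name": "phoneLast4",
      "tokenizer": {
        "type": "regexCaptureGroup",
        "pattern": "([0-9]{4})$",
        "group": 1
      }
    }
  ],
  "mappings": {
    "dynamic": false,
    "fields": {
      "phone": {
        "type": "string",
        "analyzer": "phoneLast4",
        "searchAnalyzer": "lucene.keyword"
      }
    }
  }
}
```

With this mapping, a text query for "7890" on `phone` matches any document whose phone number ends in those four digits, because only that capture group was ever indexed.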

reverse token filter

🔗 Example of supporting ‘startsWith’ and ‘endsWith’ queries using the wildcard operator together with the ‘reverse’ token filter
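A sketch of the ‘endsWith’ trick: index the field a second time under a multi analyzer that reverses each token, then run a wildcard query against the reversed field with the search term reversed and the `*` moved to the end. The names `keywordReversed`, `title`, and `reversed` are illustrative:

```json
{
  "analyzers": [
    {
      "name": "keywordReversed",
      "tokenizer": { "type": "keyword" },
      "tokenFilters": [
        { "type": "lowercase" },
        { "type": "reverse" }
      ]
    }
  ],
  "mappings": {
    "dynamic": false,
    "fields": {
      "title": {
        "type": "string",
        "multi": {
          "reversed": { "type": "string", "analyzer": "keywordReversed" }
        }
      }
    }
  }
}
```

To find titles ending in "son", the query reverses the term and uses a trailing wildcard, e.g.:

```json
{
  "wildcard": {
    "path": "title.reversed",
    "query": "nos*",
    "allowAnalyzedField": true
  }
}
```

The point of the reversal is performance: a leading wildcard (`*son`) must scan the term dictionary, while the reversed field turns ‘endsWith’ into a cheap prefix wildcard.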

starts-with

🔗 Example of supporting ‘startsWith’, ‘endsWith’ and ‘contains’ queries using nGrams
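A sketch of the nGram approach: every 2-to-15-character substring of each token is indexed, so ‘contains’ becomes an ordinary text match. The query side must not nGram the search term, hence the keyword search analyzer. The names and gram sizes here are illustrative assumptions:

```json
{
  "analyzers": [
    {
      "name": "ngramContains",
      "tokenizer": { "type": "keyword" },
      "tokenFilters": [
        { "type": "lowercase" },
        { "type": "nGram", "minGram": 2, "maxGram": 15 }
      ]
    }
  ],
  "mappings": {
    "dynamic": false,
    "fields": {
      "title": {
        "type": "string",
        "analyzer": "ngramContains",
        "searchAnalyzer": "lucene.keyword"
      }
    }
  }
}
```

The trade-off is index size: nGrams multiply the number of indexed terms. If only ‘startsWith’ is needed, the edgeGram tokenizer or token filter is the cheaper variant, since it keeps only prefixes.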

Exercises

HTML stripping indexer on English text

🔗 Stripping HTML tags for indexing

From there:

  • with lucene.english, "span" and "alert" (from the HTML markup) are indexed and searchable
  • create a custom analyzer using the htmlStrip char filter so those no longer match, but "look" and "here" (the visible text) still do
  • then evolve it so that "looking" matches too
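One possible shape for the exercise, assuming the field is named `body` (both names here are illustrative): the htmlStrip char filter removes tags and attributes before tokenization, so "span" and "alert" never reach the index while the visible text ("look", "here") does; adding a stemming token filter (kStemming shown here, porterStemming or snowballStemming would also work) then reduces "looking" and "look" to the same indexed term:

```json
{
  "analyzers": [
    {
      "name": "htmlStripStemmed",
      "charFilters": [{ "type": "htmlStrip" }],
      "tokenizer": { "type": "standard" },
      "tokenFilters": [
        { "type": "lowercase" },
        { "type": "kStemming" }
      ]
    }
  ],
  "mappings": {
    "dynamic": false,
    "fields": {
      "body": { "type": "string", "analyzer": "htmlStripStemmed" }
    }
  }
}
```

Because the same analyzer is used at index and query time here, both the stored "looking" and a queried "looking" stem to "look", so either direction of the match works.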