
📘 Custom analyzers

Composed from analyzer building blocks:

  • charFilters (optional): pre-process the characters of the text, filtering or replacing them before tokenization. Types: htmlStrip, icuNormalize, mapping, persian
  • tokenizer (required): splits the text into tokens. Types: edgeGram, keyword, nGram, regexCaptureGroup, regexSplit, standard, uaxUrlEmail, whitespace
  • tokenFilters (optional): process individual tokens after tokenization. Types: asciiFolding, daitchMokotoffSoundex, edgeGram, englishPossessive, flattenGraph, icuFolding, icuNormalizer, kStemming, length, lowercase, nGram, porterStemming, regex, reverse, shingle, snowballStemming, spanishPluralStemming, stempel, stopword, trim, wordDelimiterGraph
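A custom analyzer combining these building blocks is declared in the search index definition. A minimal sketch, assuming one of each block; the analyzer name `htmlStripLowercase` and the field name `description` are illustrative, not from the source:

```json
{
  "analyzers": [
    {
      "name": "htmlStripLowercase",
      "charFilters": [{ "type": "htmlStrip" }],
      "tokenizer": { "type": "standard" },
      "tokenFilters": [{ "type": "lowercase" }]
    }
  ],
  "mappings": {
    "dynamic": false,
    "fields": {
      "description": { "type": "string", "analyzer": "htmlStripLowercase" }
    }
  }
}
```

The blocks run in order: char filters first, then the tokenizer, then each token filter in the listed sequence.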

Custom analyzers

last 4 digits

🔗 Matching on the last 4 digits of a phone number (regex extraction during indexing, keyword analysis at query time)
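A sketch of the technique: a regexCaptureGroup tokenizer emits only the captured trailing four digits at index time, while the query side treats the search term as a single keyword token so it is compared verbatim. The analyzer name `phoneLast4`, the field name `phone`, and the exact pattern are illustrative assumptions:

```json
{
  "analyzers": [
    {
      "name": "phoneLast4",
      "tokenizer": {
        "type": "regexCaptureGroup",
        "pattern": "([0-9]{4})$",
        "group": 1
      }
    }
  ],
  "mappings": {
    "dynamic": false,
    "fields": {
      "phone": {
        "type": "string",
        "analyzer": "phoneLast4",
        "searchAnalyzer": "lucene.keyword"
      }
    }
  }
}
```

With this mapping, a text query for "7890" on `phone` matches any document whose phone number ends in those four digits, because only that capture group was ever indexed.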

reverse token filter

🔗 Example of supporting ‘startsWith’ and ‘endsWith’ queries using the wildcard operator together with the ‘reverse’ token filter
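A sketch of the ‘endsWith’ trick: index the field a second time under a multi analyzer that reverses each token, then run a wildcard query against the reversed field with the search term reversed and the `*` moved to the end. The names `keywordReversed`, `title`, and `reversed` are illustrative:

```json
{
  "analyzers": [
    {
      "name": "keywordReversed",
      "tokenizer": { "type": "keyword" },
      "tokenFilters": [
        { "type": "lowercase" },
        { "type": "reverse" }
      ]
    }
  ],
  "mappings": {
    "dynamic": false,
    "fields": {
      "title": {
        "type": "string",
        "multi": {
          "reversed": { "type": "string", "analyzer": "keywordReversed" }
        }
      }
    }
  }
}
```

To find titles ending in "son", the query reverses the term and uses a trailing wildcard, e.g.:

```json
{
  "wildcard": {
    "path": "title.reversed",
    "query": "nos*",
    "allowAnalyzedField": true
  }
}
```

The point of the reversal is performance: a leading wildcard (`*son`) must scan the term dictionary, while the reversed field turns ‘endsWith’ into a cheap prefix wildcard.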

starts-with

🔗 Example of supporting ‘startsWith’, ‘endsWith’ and ‘contains’ queries using nGrams
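A sketch of the nGram approach: every 2-to-15-character substring of each token is indexed, so ‘contains’ becomes an ordinary text match. The query side must not nGram the search term, hence the keyword search analyzer. The names and gram sizes here are illustrative assumptions:

```json
{
  "analyzers": [
    {
      "name": "ngramContains",
      "tokenizer": { "type": "keyword" },
      "tokenFilters": [
        { "type": "lowercase" },
        { "type": "nGram", "minGram": 2, "maxGram": 15 }
      ]
    }
  ],
  "mappings": {
    "dynamic": false,
    "fields": {
      "title": {
        "type": "string",
        "analyzer": "ngramContains",
        "searchAnalyzer": "lucene.keyword"
      }
    }
  }
}
```

The trade-off is index size: nGrams multiply the number of indexed terms. If only ‘startsWith’ is needed, the edgeGram tokenizer or token filter is the cheaper variant, since it keeps only prefixes.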

Exercises

HTML stripping indexer on English text

🔗 Stripping HTML tags for indexing

From there:

  • with lucene.english, "span" and "alert" (from the HTML markup) are indexed and searchable
  • create a custom analyzer using the htmlStrip char filter so those no longer match, but "look" and "here" (the visible text) still do
  • then evolve it so that "looking" matches too
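One possible shape for the exercise, assuming the field is named `body` (both names here are illustrative): the htmlStrip char filter removes tags and attributes before tokenization, so "span" and "alert" never reach the index while the visible text ("look", "here") does; adding a stemming token filter (kStemming shown here, porterStemming or snowballStemming would also work) then reduces "looking" and "look" to the same indexed term:

```json
{
  "analyzers": [
    {
      "name": "htmlStripStemmed",
      "charFilters": [{ "type": "htmlStrip" }],
      "tokenizer": { "type": "standard" },
      "tokenFilters": [
        { "type": "lowercase" },
        { "type": "kStemming" }
      ]
    }
  ],
  "mappings": {
    "dynamic": false,
    "fields": {
      "body": { "type": "string", "analyzer": "htmlStripStemmed" }
    }
  }
}
```

Because the same analyzer is used at index and query time here, both the stored "looking" and a queried "looking" stem to "look", so either direction of the match works.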