📘 Custom analyzers
Composed from analyzer building blocks:
charFilters
: pre-process characters of the text for filtering/replacing (optional): htmlStrip, icuNormalize, mapping, persian

tokenizers
: splits text into tokens: edgeGram, keyword, nGram, regexCaptureGroup, regexSplit, standard, uaxUrlEmail, whitespace

tokenFilters
: process individual tokens (optional): asciiFolding, daitchMokotoffSoundex, edgeGram, englishPossessive, flattenGraph, icuFolding, icuNormalizer, kStemming, length, lowercase, nGram, porterStemming, regex, reverse, shingle, snowballStemming, spanishPluralStemming, stempel, stopword, trim, wordDelimiterGraph
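These blocks combine into a custom analyzer inside the search index definition. A minimal sketch of the shape (the field name `description` and analyzer name `myAnalyzer` are made up for illustration):

```json
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "description": { "type": "string", "analyzer": "myAnalyzer" }
    }
  },
  "analyzers": [
    {
      "name": "myAnalyzer",
      "charFilters": [ { "type": "htmlStrip" } ],
      "tokenizer": { "type": "standard" },
      "tokenFilters": [ { "type": "lowercase" } ]
    }
  ]
}
```

Each analyzer has exactly one tokenizer; charFilters and tokenFilters are optional arrays applied in order.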
Examples
last 4 digits
🔗 Matching the last 4 digits of a phone number (regex extraction during indexing, keyword analysis at query time)
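The indexing side of this could be sketched with the `regexCaptureGroup` tokenizer, which emits only the captured group as a token. The analyzer name and pattern below are assumptions (the pattern supposes the phone number ends in four digits):

```json
{
  "analyzers": [
    {
      "name": "last4Extractor",
      "tokenizer": {
        "type": "regexCaptureGroup",
        "pattern": "(\\d{4})$",
        "group": 1
      }
    }
  ]
}
```

At query time the field is searched with keyword-style analysis, so a query string like "1234" is compared verbatim against the single extracted token.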
reverse token filter
🔗 Example of being able to do ‘startsWith’ and ‘endsWith’ using wildcard and ‘reverse’ token filter

starts-with
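The ‘endsWith’ half relies on indexing a reversed copy of the text. A hedged sketch, with an assumed analyzer name:

```json
{
  "analyzers": [
    {
      "name": "reversedText",
      "tokenizer": { "type": "keyword" },
      "tokenFilters": [
        { "type": "lowercase" },
        { "type": "reverse" }
      ]
    }
  ]
}
```

An endsWith("son") search then becomes a `wildcard` query for `"nos*"` (the suffix reversed) against the reversed field, with `"allowAnalyzedField": true` set on the operator; startsWith works as a plain `"son*"` wildcard on the normal field.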
🔗 Example of being able to do ‘startsWith’, ‘endsWith’ and ‘contains’ using nGrams

Exercises
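The nGram approach indexes every substring of a bounded length, so one field covers all three match styles. A sketch with assumed name and gram sizes:

```json
{
  "analyzers": [
    {
      "name": "nGramText",
      "tokenizer": { "type": "nGram", "minGram": 3, "maxGram": 7 },
      "tokenFilters": [ { "type": "lowercase" } ]
    }
  ]
}
```

Any query term at least `minGram` characters long can match in the middle of a word (‘contains’); the trade-off versus the wildcard/reverse approach is a larger index.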
HTML stripping indexer on English text
🔗 Stripping HTML tags for indexing

From there:
lucene.english
- "span" and "alert" are indexed/searchable
- create an htmlStrip custom analyzer where those don't match, but "look" and "here" do
- now evolve it so that "looking" matches too
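A sketch of where the exercise is headed (analyzer name assumed): `htmlStrip` removes tags and attributes before tokenizing, so "span" and "alert" never become tokens while the visible text "look here" still does; adding a stemming filter such as `kStemming` then reduces "looking" to "look" at index time, so it matches too.

```json
{
  "analyzers": [
    {
      "name": "htmlStrippingStemmer",
      "charFilters": [ { "type": "htmlStrip" } ],
      "tokenizer": { "type": "standard" },
      "tokenFilters": [
        { "type": "lowercase" },
        { "type": "kStemming" }
      ]
    }
  ]
}
```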