Natural Language Processing (NLP) Functions
This is an experimental feature that is currently in development and is not ready for general use. It will change in unpredictable backwards-incompatible ways in future releases. Set allow_experimental_nlp_functions = 1 to enable it.
detectCharset
The detectCharset function detects the character set of the non-UTF8-encoded input string.
Syntax
detectCharset(text_to_be_analyzed)
Arguments
- text_to_be_analyzed — A collection of strings (or sentences) to analyze. String.
Returned value
- A String containing the code of the detected character set.
Examples
Query:
Result:
detectLanguage
Detects the language of the UTF8-encoded input string. The function uses the CLD2 library for detection, and it returns the 2-letter ISO language code.
The detectLanguage function works best when the input string contains more than 200 characters.
Syntax
detectLanguage(text_to_be_analyzed)
Arguments
- text_to_be_analyzed — A collection of strings (or sentences) to analyze. String.
Returned value
- The 2-letter ISO code of the detected language.
Other possible results:
- un = unknown; no language could be detected.
- other = the detected language does not have a 2-letter code.
Examples
Query:
Result:
detectLanguageMixed
Similar to the detectLanguage function, but detectLanguageMixed returns a Map of 2-letter language codes, each mapped to the percentage of the text written in that language.
Syntax
detectLanguageMixed(text_to_be_analyzed)
Arguments
- text_to_be_analyzed — A collection of strings (or sentences) to analyze. String.
Returned value
- Map(String, Float32): The keys are 2-letter ISO codes and the values are the percentage of the text detected as that language.
Examples
Query:
Result:
detectProgrammingLanguage
Determines the programming language of the given source code. The function calculates all unigrams and bigrams of commands in the source code. Then, using a marked-up dictionary with weights of unigrams and bigrams of commands for various programming languages, it finds the programming language with the largest total weight and returns it.
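The scoring idea described above can be illustrated with a short Python sketch. It is not the server's implementation: the dictionary contents, the weights, the function names, and the choice to return None when nothing matches are all assumptions made for illustration, and the input is taken as an already tokenized list of commands.

```python
from collections import defaultdict

# Hypothetical toy dictionary: language -> {unigram or bigram of commands: weight}.
# These entries and weights are invented for illustration only.
WEIGHTS = {
    "Python": {("def",): 3.0, ("import",): 2.0, ("def", "return"): 4.0},
    "C++": {("#include",): 3.0, ("int",): 1.0, ("int", "return"): 2.0, ("return",): 0.5},
}


def ngrams(tokens):
    """Yield every unigram and every adjacent bigram of the token list as tuples."""
    for i, token in enumerate(tokens):
        yield (token,)
        if i + 1 < len(tokens):
            yield (token, tokens[i + 1])


def detect_programming_language(commands, weights=WEIGHTS):
    """Return the language with the largest summed weight, or None if nothing matched."""
    scores = defaultdict(float)
    for gram in ngrams(commands):
        for language, table in weights.items():
            scores[language] += table.get(gram, 0.0)
    if not scores or max(scores.values()) <= 0.0:
        return None  # assumption: no match is reported as None in this sketch
    return max(scores, key=scores.get)


print(detect_programming_language(["def", "return"]))
print(detect_programming_language(["#include", "int", "return"]))
```

Bigrams let the dictionary reward commands that appear next to each other, which single-command counts alone would miss.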
Syntax
detectProgrammingLanguage(source_code)
Arguments
- source_code — String representation of the source code to analyze. String.
Returned value
- Programming language. String.
Examples
Query:
Result:
detectLanguageUnknown
Similar to the detectLanguage function, except the detectLanguageUnknown function works with non-UTF8-encoded strings. Prefer this version when your character set is UTF-16 or UTF-32.
Syntax
detectLanguageUnknown(text_to_be_analyzed)
Arguments
- text_to_be_analyzed — A collection of strings (or sentences) to analyze. String.
Returned value
- The 2-letter ISO code of the detected language.
Other possible results:
- un = unknown; no language could be detected.
- other = the detected language does not have a 2-letter code.
Examples
Query:
Result:
detectTonality
Determines the sentiment of text data. Uses a marked-up sentiment dictionary, in which each word has a tonality ranging from -12 to 6.
For each text, it calculates the average sentiment value of its words and returns it in the range [-1,1].
This function is limited in its current form. Currently it makes use of the embedded emotional dictionary at /contrib/nlp-data/tonality_ru.zst and only works for the Russian language.
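The averaging step can be sketched in Python as follows. This is only an illustration under stated assumptions: the three dictionary entries and their weights are invented, words are split on whitespace, only words found in the dictionary count toward the average, a text with no known words yields 0.0, and the average is clamped to bring it into [-1, 1]. The embedded dictionary and the server's exact tokenization may differ.

```python
# Hypothetical miniature sentiment dictionary; the weights are invented and
# merely fall inside the documented range of -12 to 6.
TONALITY = {"хороший": 0.5, "плохой": -0.75, "ужасный": -6.0}


def detect_tonality(text, dictionary=TONALITY):
    """Average the weights of known words and clamp the result to [-1, 1]."""
    words = text.lower().split()
    weights = [dictionary[word] for word in words if word in dictionary]
    if not weights:
        return 0.0  # assumption: no known words means neutral sentiment
    average = sum(weights) / len(weights)
    return max(-1.0, min(1.0, average))


print(detect_tonality("хороший плохой"))  # (0.5 - 0.75) / 2 = -0.125
print(detect_tonality("ужасный"))         # -6.0 is clamped to -1.0
```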
Syntax
detectTonality(text)
Arguments
- text — The text to be analyzed. String.
Returned value
- The average sentiment value of the words in text. Float32.
Examples
Query:
Result:
lemmatize
Performs lemmatization on a given word. Needs dictionaries to operate, which can be obtained here.
Syntax
lemmatize(language, word)
Arguments
- language — Language whose rules will be applied. String.
- word — Word that needs to be lemmatized. Must be lowercase. String.
Examples
Query:
Result:
Configuration
This configuration specifies that the dictionary en.bin should be used for lemmatization of English (en) words. The .bin files can be downloaded from here.
stem
Performs stemming on a given word.
Syntax
stem(language, word)
Arguments
- language — Language whose rules will be applied. Use the two-letter ISO 639-1 code. String.
- word — Word that needs to be stemmed. Must be lowercase. String.
Examples
Query:
Result:
Supported languages for stem()
The stem() function uses the Snowball stemming library; see the Snowball website for the current list of supported languages.
- Arabic
- Armenian
- Basque
- Catalan
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Hindi
- Hungarian
- Indonesian
- Irish
- Italian
- Lithuanian
- Nepali
- Norwegian
- Porter
- Portuguese
- Romanian
- Russian
- Serbian
- Spanish
- Swedish
- Tamil
- Turkish
- Yiddish
synonyms
Finds synonyms of a given word. There are two types of synonym extensions: plain and wordnet.
With the plain extension type, you need to provide a path to a simple text file in which each line corresponds to one synonym set. Words in a line must be separated by space or tab characters.
With the wordnet extension type, you need to provide a path to a directory containing a WordNet thesaurus. The thesaurus must contain a WordNet sense index.
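To make the plain file format concrete, the Python sketch below parses such a file and looks up a word. It illustrates the format only, not how the server loads extensions: the sample content and function names are hypothetical, and returning the whole set (including the queried word) or an empty list for an unknown word are assumptions.

```python
def load_plain_synonyms(text):
    """Build an index from each word to the synonym set (line) that contains it."""
    index = {}
    for line in text.splitlines():
        words = line.split()  # str.split() with no argument splits on spaces and tabs
        for word in words:
            # Assumption: if a word occurs in several lines, the first line wins.
            index.setdefault(word, words)
    return index


def find_synonyms(index, word):
    """Return the synonym set containing the word, or an empty list if unknown."""
    return index.get(word, [])


# Hypothetical file content: one synonym set per line, mixed tab and space separators.
SAMPLE = "important\tessential crucial\nbig large huge\n"

index = load_plain_synonyms(SAMPLE)
print(find_synonyms(index, "large"))      # ['big', 'large', 'huge']
print(find_synonyms(index, "essential"))  # ['important', 'essential', 'crucial']
```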
Syntax
synonyms(extension_name, word)
Arguments
- extension_name — Name of the extension in which the search will be performed. String.
- word — Word to search for in the extension. String.
Examples
Query:
Result:
Configuration