Let's have a look. We are doing a simple match_phrase query like this:
No results. However the JSON content of the available document looks fine:
{
.....
"_source": {
"_docType": "jobs",
"header": "IT-Support (m/w)",
"url": "https://www......",
"mainContent": """
Für unseren Standort in München suchen wir ab sofort Mitarbeiter/-innen für den
**IT-Support (m/w)**
.......
- Erfahrung mit den Betriebssystemen (Windows 7, Linux)
.......
"""
}
.....
}
{
"tokens": [
{
"token": "windows",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "7",
"start_offset": 8,
"end_offset": 9,
"type": "word",
"position": 1
}
]
}
{
{
"tokens": [
{
"token": "(windows",
....
},
{
"token": "7",
.....
}
]
}
Ok, that's unexpected. The "(" is part of the token. That's the reason why "windows 7" is not matching. Instead "(windows 7" would match. What did we do wrong?
It becomes apparent that our tokenizer definition of the index is buggy. We don't use the standard tokenizer because it doesn't make sense for some IT terminology (e.g. node.js, .NET). We are using a self-defined pattern for tokenizing. The solution is to add the brackets to the tokenizer pattern. Simplified something like this:
{
"tokenizer": {
"jobpushy_tokenizer": {
"pattern": """,|;|:|\s|-|/|\(|\)""",
"type": "pattern"
}
}
About the #WTF posts on this blog:
These are short articles that show simple solutions to problems that lead to a headache in the first place. This serves two goals: 1.) Remembering the solution (and own stupidity) for the next time 2.) Helping other people with similiar problems.