A Novel Approach to Combined Keyword and Semantic Search with Contextual Reranking for Enhanced LLM Queries
By Eric Boehlke – truevis.com
Abstract
In the era of large language models (LLMs), efficient and accurate information retrieval is crucial. This paper introduces a novel hybrid approach that integrates keyword extraction, text-based search, semantic vector search, and reranking. By combining syntactic and semantic representations, it enhances retrieval and improves LLM-generated responses. We highlight the limitations of semantic vector search on obscure jargon and show that keyword-based methods address these issues.
Introduction
Large language models have advanced natural language processing capabilities, but their effectiveness depends on the quality of contextual information. Traditional keyword and semantic searches have gaps: keyword search might miss conceptually related terms, while semantic search may struggle with domain-specific jargon. We propose a hybrid approach that extracts keywords, performs text and semantic searches, then reranks results, improving contextual relevance for LLM queries.
Background
Keyword-Based Text Search
Keyword search matches exact terms and excels with technical jargon but misses documents that use different wording for similar concepts.
Semantic Vector Search
Semantic search uses embeddings to find conceptually similar documents even without exact term matches. However, it can fail with obscure jargon or terms with multiple meanings.
Limitations with Obscure Jargon
Semantic models trained on general data may misinterpret specialized terms. For example, a query for “set screws” can be pulled toward unrelated senses of “set” (a set of items, to set a value), retrieving irrelevant matches.
Reranking Techniques
Reranking orders search results by relevance after initial retrieval, ensuring the most pertinent documents are prioritized.
Methodology
Our approach proceeds in five steps:
- Extract keywords from the user query
- Run a text-based search using those keywords
- Run a semantic vector search
- Combine and rerank the results
- Provide the top-ranked context to the LLM
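The five steps above can be sketched end-to-end. The helpers here are deliberately toy stand-ins (keyword extraction by splitting, "semantic" scoring by word overlap); the real system described in the following sections uses an LLM, SQLite, and a vector database instead.

```python
def hybrid_search(query, documents, top_k=3):
    """Toy end-to-end sketch of the pipeline: keyword extraction,
    text search, semantic-style search, merge, and rerank."""
    # 1. Keyword extraction (stand-in for the LLM prompt in step 1)
    keywords = {w.lower() for w in query.split() if len(w) > 2}

    # 2. Text-based search: keep documents containing any keyword verbatim
    text_hits = [d for d in documents
                 if any(k in d.lower() for k in keywords)]

    # 3. "Semantic" search stand-in: score by word overlap with the query
    def overlap(doc):
        return len(keywords & {w.lower() for w in doc.split()})
    semantic_hits = sorted(documents, key=overlap, reverse=True)[:top_k]

    # 4. Merge (order-preserving de-duplication) and rerank by the same score
    merged = list(dict.fromkeys(text_hits + semantic_hits))
    reranked = sorted(merged, key=overlap, reverse=True)

    # 5. The top-ranked documents become the LLM context
    return reranked[:top_k]
```
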
1. Keyword Extraction
We use an LLM with a system prompt to extract key terms, requiring that at least one query term appear verbatim. The prompt focuses the model on technical concepts and has it construct a Boolean “OR” search phrase.
key_phrase_system_prompt = """
You are an expert at understanding technical queries in electrical construction building specifications and extracting key concepts for searching documentation. Analyze the given query and follow these steps:
1. Extract the most important terms, prioritizing specific technical concepts over general words.
2. ALWAYS include at least one word from the original query verbatim.
3. Construct a search phrase using Boolean operator 'OR' to refine results.
4. Limit the search phrase to 3-5 key terms for focused results.
5. Never repeat the same term more than once.
Return the extracted search phrase in JSON format under the key "important term", along with a brief explanation:
{
    "important term": "extracted OR search OR phrase",
    "explanation": "Brief rationale for chosen terms"
}
Input query: "ground requirements?"
Output:
{
    "important term": "ground OR grounding OR earthing",
    "explanation": "Included 'ground' verbatim from the query, added related terms 'grounding' and 'earthing' to cover electrical safety concepts."
}
"""
2. Text-Based Search
The extracted keywords are used to query an SQLite database; regular-expression patterns match variations of each term.
import re

def create_word_pattern(term):
    # Sketch of the pattern helper referenced below: match the term as a
    # whole-word prefix, so "ground" also matches "grounding" or "grounded".
    return r'\b' + re.escape(term) + r'\w*'

def parse_search_input(phrase):
    operators = {'AND', 'OR', 'NOT'}
    pattern = re.compile(r'\b(AND|OR|NOT)\b')
    parts = [part.strip() for part in pattern.split(phrase) if part.strip()]
    query_conditions = []
    params = []
    i = 0
    while i < len(parts):
        part = parts[i]
        is_operator = part in operators
        if is_operator and i + 1 < len(parts):
            next_part = parts[i + 1]
            if part == 'AND':
                if query_conditions:
                    query_conditions.append("AND")
                query_conditions.append("t.file_content REGEXP ?")
                params.append(create_word_pattern(next_part))
            elif part == 'OR':
                if query_conditions:
                    query_conditions.append("OR")
                query_conditions.append("t.file_content REGEXP ?")
                params.append(create_word_pattern(next_part))
            elif part == 'NOT':
                if query_conditions:
                    query_conditions.append("AND")
                query_conditions.append("t.file_content NOT REGEXP ?")
                params.append(create_word_pattern(next_part))
            i += 2  # consume both the operator and its operand
        elif not is_operator:
            if query_conditions:
                query_conditions.append("AND")
            query_conditions.append("t.file_content REGEXP ?")
            params.append(create_word_pattern(part))
            i += 1
        else:
            i += 1  # trailing operator with no operand: skip it
    query = ("SELECT t.file_name, t.file_content FROM text_files t WHERE "
             + " ".join(query_conditions))
    return query, params
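The generated query relies on a REGEXP condition, which SQLite does not provide out of the box: the operator raises an error unless a user-defined `regexp()` function is registered on the connection. A minimal, self-contained sketch of that setup:

```python
import re
import sqlite3

# SQLite rewrites "X REGEXP Y" as regexp(Y, X), so the function
# receives the pattern first and the column value second.
def regexp(pattern, value):
    return value is not None and re.search(pattern, value) is not None

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE text_files (file_name TEXT, file_content TEXT)")
conn.execute("INSERT INTO text_files VALUES (?, ?)",
             ("spec.txt", "All grounding conductors shall be copper."))

# A word-prefix pattern like the one create_word_pattern builds:
rows = conn.execute(
    "SELECT file_name FROM text_files WHERE file_content REGEXP ?",
    (r"\bground\w*",)
).fetchall()
```
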
3. Semantic Vector Search
We generate embeddings of the user query and query a vector database (e.g., Pinecone).
import openai

# EMBEDDINGS_MODEL, PINECONE_API_KEY, and PINECONE_INDEX_NAME are
# configuration constants; initialize_pinecone and query_pinecone are
# helpers defined elsewhere.
def generate_embeddings(text):
    response = openai.embeddings.create(model=EMBEDDINGS_MODEL, input=[text])
    return response.data[0].embedding

def search_documents(user_input, top_k):
    query_embeddings = generate_embeddings(user_input)
    index = initialize_pinecone(PINECONE_API_KEY, PINECONE_INDEX_NAME)
    query_results = query_pinecone(index, query_embeddings, top_k)
    # Process query_results to extract document texts
    return query_results
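Under the hood, the vector database ranks stored embeddings against the query embedding by a similarity measure, commonly cosine similarity. A dependency-free illustration of that score:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 for
    identical directions, 0.0 for orthogonal (unrelated) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm if norm else 0.0
```
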
4. Combining and Reranking Results
We merge text-based and semantic results and use a reranker (e.g., Voyage AI) to reorder by relevance.
import voyageai

def truncate_text(text, max_tokens=500):
    # Truncate text to the last 'max_tokens' tokens, approximating
    # tokens by whitespace-separated words (a rough stand-in for a
    # real tokenizer).
    return " ".join(text.split()[-max_tokens:])

def rerank_results(flattened_results, test_query, voyage_api_key):
    vo = voyageai.Client(api_key=voyage_api_key)
    chunk_size = 100  # the reranker accepts a limited batch of documents per call
    all_reranked_results = []
    for i in range(0, len(flattened_results), chunk_size):
        chunk = flattened_results[i:i + chunk_size]
        reranked_chunk = vo.rerank(test_query, chunk, model="rerank-1")
        all_reranked_results.extend(reranked_chunk.results)
    return sorted(all_reranked_results,
                  key=lambda x: x.relevance_score, reverse=True)
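The merging step that precedes reranking can be as simple as an order-preserving de-duplication of the two result lists; `merge_results` here is a hypothetical helper name, not from the original system.

```python
def merge_results(text_results, semantic_results):
    """Combine text-based and semantic result lists into one flat,
    de-duplicated list of document texts to hand to the reranker.
    dict.fromkeys preserves first-seen order while dropping repeats."""
    return list(dict.fromkeys(text_results + semantic_results))
```
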
5. Context to LLM
The top-ranked documents are supplied as context to the LLM, improving the accuracy and relevance of its final response.
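One way to assemble that context, assuming the top-ranked documents are available as plain strings (`build_context` is a hypothetical helper and the prompt layout is an assumption):

```python
def build_context(top_documents, question):
    """Format the top-ranked document texts plus the user question
    into a single prompt string for the LLM."""
    sections = [f"[Document {i + 1}]\n{doc}"
                for i, doc in enumerate(top_documents)]
    return "Context:\n" + "\n\n".join(sections) + f"\n\nQuestion: {question}"
```
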
Application
In electrical construction specifications, precision is key. For “grounding requirements,” we extract “ground OR grounding OR earthing” and retrieve exact matches and semantically similar documents, then rerank.
For “set screws,” keyword search handles domain-specific jargon that semantic search alone might misinterpret.
Conclusion
By combining keyword and semantic searches with reranking, we improve retrieval and context for LLMs, especially in domains with complex jargon.