A Novel Approach to Combined Keyword and Semantic Search with Contextual Reranking for Enhanced LLM Queries

By Eric Boehlke – truevis.com

Abstract
In the era of large language models (LLMs), efficient and accurate information retrieval is crucial. This paper introduces a novel hybrid approach that integrates keyword extraction, text-based search, semantic vector search, and reranking. By combining syntactic and semantic representations, it enhances retrieval and improves LLM-generated responses. We highlight the limitations of semantic vector search with obscure jargon and show that keyword-based methods address these issues.

Introduction

Large language models have advanced natural language processing capabilities, but their effectiveness depends on the quality of contextual information. Traditional keyword and semantic searches have gaps: keyword search might miss conceptually related terms, while semantic search may struggle with domain-specific jargon. We propose a hybrid approach that extracts keywords, performs text and semantic searches, then reranks results, improving contextual relevance for LLM queries.

Background

Keyword-Based Text Search

Keyword search matches exact terms and excels with technical jargon but misses documents that use different wording for similar concepts.

Semantic Vector Search

Semantic search uses embeddings to find conceptually similar documents even without exact term matches. However, it can fail with obscure jargon or terms with multiple meanings.

Limitations with Obscure Jargon

Semantic models trained on general data may misinterpret specialized terms. For example, a query for “set screws” may be misinterpreted because “set” has many common meanings unrelated to the fastener.

Reranking Techniques

Reranking orders search results by relevance after initial retrieval, ensuring the most pertinent documents are prioritized.

Methodology

Our approach:

  • Extract keywords from user query
  • Text-based search using keywords
  • Semantic vector search
  • Combine and rerank results
  • Provide top-ranked context to LLM
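The five steps above can be sketched end to end as a small pipeline. This is a toy illustration only: every helper here is a simplified stand-in (string matching instead of an LLM extractor, character overlap instead of embeddings, word counting instead of a reranking model), used solely to show the shape of the flow.

```python
def extract_keywords(query):
    # stand-in for the LLM-based keyword extractor (Section 1)
    return query.lower().split()

def text_search(keywords, corpus):
    # stand-in for the SQLite REGEXP search (Section 2)
    return [d for d in corpus if any(k in d.lower() for k in keywords)]

def semantic_search(query, corpus, min_overlap=6):
    # crude character-overlap stand-in for embedding similarity (Section 3)
    return [d for d in corpus if len(set(query.lower()) & set(d.lower())) >= min_overlap]

def rerank(query, docs):
    # stand-in for the external reranking model (Section 4)
    words = query.lower().split()
    return sorted(docs, key=lambda d: sum(w in d.lower() for w in words), reverse=True)

def hybrid_search(query, corpus, top_n=3):
    keywords = extract_keywords(query)                                       # 1. extract keywords
    merged = text_search(keywords, corpus) + semantic_search(query, corpus)  # 2 + 3. both searches
    merged = list(dict.fromkeys(merged))                                     # 4a. merge and dedupe
    return rerank(query, merged)[:top_n]                                     # 4b + 5. rerank, take top
```

The real components behind each stand-in are developed in the numbered sections that follow.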

1. Keyword Extraction

We use an LLM system prompt to extract key terms, ensuring inclusion of at least one query term verbatim. This focuses on technical concepts and constructs a Boolean “OR” search phrase.

key_phrase_system_prompt = """
You are an expert at understanding technical queries in electrical construction building specifications and extracting key concepts for searching documentation. Analyze the given query and follow these steps:

1. Extract the most important terms, prioritizing specific technical concepts over general words.
2. ALWAYS include at least one word from the original query verbatim.
3. Construct a search phrase using Boolean operator 'OR' to refine results.
4. Limit the search phrase to 3-5 key terms for focused results.
5. Never repeat the same term more than once.

Return the extracted search phrase in JSON format under the key "important term", along with a brief explanation:

{
  "important term": "extracted OR search OR phrase",
  "explanation": "Brief rationale for chosen terms"
}

Input query: "ground requirements?"

Output:
{
  "important term": "ground OR grounding OR earthing",
  "explanation": "Included 'ground' verbatim from the query, added related terms 'grounding' and 'earthing' to cover electrical safety concepts."
}
"""

2. Text-Based Search

The extracted keywords are used to query an SQLite database; regular-expression patterns match variations of each term (e.g., plurals and suffixes).

import re

def parse_search_input(phrase):
    """Translate a Boolean phrase ('a OR b NOT c') into a SQL WHERE clause.

    Relies on create_word_pattern() (defined elsewhere) to turn each term
    into a REGEXP pattern tolerant of word variations.
    """
    operators = {'AND', 'OR', 'NOT'}
    pattern = re.compile(r'\b(AND|OR|NOT)\b')
    parts = [part.strip() for part in pattern.split(phrase) if part.strip()]
    query_conditions = []
    params = []
    i = 0
    while i < len(parts):
        part = parts[i]
        is_operator = part in operators
        if is_operator and i + 1 < len(parts):
            next_part = parts[i + 1]
            if part == 'NOT':
                # NOT joins with AND and negates the match
                if query_conditions:
                    query_conditions.append("AND")
                query_conditions.append("t.file_content NOT REGEXP ?")
            else:
                # AND / OR join the next term with that operator
                if query_conditions:
                    query_conditions.append(part)
                query_conditions.append("t.file_content REGEXP ?")
            params.append(create_word_pattern(next_part))
            i += 1  # consume the operand along with the operator
        elif not is_operator:
            # bare term with no preceding operator: implicit AND
            if query_conditions:
                query_conditions.append("AND")
            query_conditions.append("t.file_content REGEXP ?")
            params.append(create_word_pattern(part))
        i += 1

    query = ("SELECT t.file_name, t.file_content FROM text_files t WHERE "
             + " ".join(query_conditions))
    return query, params
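The create_word_pattern helper referenced above is not shown in the original listing; one plausible implementation (an assumption, not the verbatim helper) builds a word-boundary pattern that also matches simple suffix variations. Note that SQLite has no built-in REGEXP operator — a regexp function must be registered on the connection for the generated query to run:

```python
import re
import sqlite3

def create_word_pattern(term):
    # Assumed helper: match the term as a whole word, tolerating
    # common suffixes such as plurals and -ing/-ed forms.
    return r"\b" + re.escape(term) + r"(s|es|ing|ed)?\b"

def regexp(pattern, value):
    # SQLite invokes 'X REGEXP Y' as regexp(Y, X); register it per connection
    return value is not None and re.search(pattern, value, re.IGNORECASE) is not None

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
```

With the function registered, the (query, params) pair returned by parse_search_input can be passed directly to conn.execute().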

3. Semantic Vector Search

We generate embeddings of the user query and query a vector database (e.g., Pinecone).

import openai

def generate_embeddings(text):
    # Embed the query text with the configured OpenAI embeddings model
    response = openai.embeddings.create(model=EMBEDDINGS_MODEL, input=[text])
    return response.data[0].embedding

def search_documents(user_input, top_k):
    query_embeddings = generate_embeddings(user_input)
    index = initialize_pinecone(PINECONE_API_KEY, PINECONE_INDEX_NAME)
    query_results = query_pinecone(index, query_embeddings, top_k)
    # Process query_results to extract document texts
    return query_results
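The final processing step, left as a comment above, pulls document text out of the Pinecone query response. A sketch assuming the dict form of the response and that each match stores its chunk text in metadata under the key "text" (the field name is an assumption about how the index was populated):

```python
def extract_document_texts(query_results):
    # Pull stored chunk text from a Pinecone-style query response.
    # Assumes each match carries its text in metadata under 'text'
    # (an assumption about how the index was built).
    matches = query_results.get("matches", [])
    return [m["metadata"]["text"] for m in matches if m.get("metadata")]

# Example with a mocked response in the shape Pinecone returns:
mock_results = {
    "matches": [
        {"id": "doc-1", "score": 0.92, "metadata": {"text": "Grounding electrodes..."}},
        {"id": "doc-2", "score": 0.87, "metadata": {"text": "Set screw torque..."}},
    ]
}
```

The extracted texts are then merged with the keyword-search hits before reranking.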

4. Combining and Reranking Results

We merge text-based and semantic results and use a reranker (e.g., Voyage AI) to reorder by relevance.

def truncate_text(text, max_tokens=500):
    # Keep only the last max_tokens tokens so each document fits the
    # reranker's input limit (a simple whitespace approximation of tokens)
    tokens = text.split()
    return " ".join(tokens[-max_tokens:])
import voyageai

def rerank_results(flattened_results, test_query, voyage_api_key):
    vo = voyageai.Client(api_key=voyage_api_key)
    chunk_size = 100  # rerank in batches to respect the API's document limit
    all_reranked_results = []

    for i in range(0, len(flattened_results), chunk_size):
        chunk = flattened_results[i:i + chunk_size]
        reranked_chunk = vo.rerank(test_query, chunk, model="rerank-1")
        all_reranked_results.extend(reranked_chunk.results)

    # Sort the combined batches by relevance, highest first
    return sorted(all_reranked_results, key=lambda x: x.relevance_score, reverse=True)

5. Context to LLM

Top documents become the LLM’s context, improving accuracy and relevance of the final response.
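Assembling that context can be as simple as joining the top-ranked documents into the prompt. A minimal sketch — the separator, instruction wording, and document cap are illustrative choices, not the exact production prompt:

```python
def build_llm_prompt(user_query, ranked_docs, max_docs=5):
    # Join the top-ranked documents into a single context block,
    # then instruct the model to answer only from that context.
    context = "\n\n---\n\n".join(ranked_docs[:max_docs])
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}"
    )
```

Capping the number of documents (and truncating each one beforehand) keeps the prompt within the model's context window while grounding the answer in the most relevant material.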

Application

In electrical construction specifications, precision is key. For “grounding requirements,” we extract “ground OR grounding OR earthing” and retrieve exact matches and semantically similar documents, then rerank.

For “set screws,” keyword search handles domain-specific jargon that semantic search alone might misinterpret.

Conclusion

By combining keyword and semantic searches with reranking, we improve retrieval and context for LLMs, especially in domains with complex jargon.