Semantic chunking is important for quality retrieval. Here is how to implement it:

### Semantic vs Fixed-Size Chunking

### Semantic Chunking Implementation

```python
import re
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer


class SemanticChunker:
    def __init__(
        self,
        model_name: str = "all-MiniLM-L6-v2",
        similarity_threshold: float = 0.7,
        max_chunk_size: int = 1000,
    ):
        self.model = SentenceTransformer(model_name)
        self.threshold = similarity_threshold
        self.max_size = max_chunk_size

    def chunk(self, text: str) -> List[str]:
        # Split into sentences
        sentences = self._split_sentences(text)
        if not sentences:
            return [text]

        # Get embeddings
        embeddings = self.model.encode(sentences)

        # Group by semantic similarity
        chunks = []
        current_chunk = [sentences[0]]
        current_embedding = embeddings[0]

        for i in range(1, len(sentences)):
            similarity = self._cosine_similarity(current_embedding, embeddings[i])
            current_text = " ".join(current_chunk)

            # Start a new chunk if:
            # 1. a semantic shift is detected, OR
            # 2. the current chunk is too large
            if similarity < self.threshold or len(current_text) > self.max_size:
                chunks.append(current_text)
                current_chunk = [sentences[i]]
                current_embedding = embeddings[i]
            else:
                current_chunk.append(sentences[i])
                # Fold the new sentence into a running average of the
                # chunk's sentence embeddings
                current_embedding = (
                    current_embedding * (len(current_chunk) - 1) + embeddings[i]
                ) / len(current_chunk)

        # Add the final chunk
        if current_chunk:
            chunks.append(" ".join(current_chunk))
        return chunks

    def _split_sentences(self, text: str) -> List[str]:
        return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

    def _cosine_similarity(self, a, b) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

### Integration with LLMWare

```python
from llmware.library import Library


class SemanticLibrary(Library):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.chunker = SemanticChunker()

    def add_document(self, text: str, metadata: dict = None):
        chunks = self.chunker.chunk(text)
        for i, chunk in enumerate(chunks):
            # Record the chunk index so ordering can be reconstructed
            # at query time
            super().add_document(
                chunk,
                metadata={"chunk_index": i, **(metadata or {})},
            )
```

### Benefits
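One practical benefit is that the approach is easy to sanity-check without loading the embedding model: the two helpers can be exercised on their own. A quick check with standalone copies of `_split_sentences` and `_cosine_similarity` (written as module-level functions here purely for illustration):

```python
import re

import numpy as np


def split_sentences(text: str) -> list:
    # Same regex as SemanticChunker._split_sentences: split on ., !, ?
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def cosine_similarity(a, b) -> float:
    # Same formula as SemanticChunker._cosine_similarity
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


sentences = split_sentences("Dogs bark. Cats meow! Do fish swim?")
# → ['Dogs bark', 'Cats meow', 'Do fish swim']

a = np.array([1.0, 0.0])
c = np.array([0.0, 1.0])
sim_parallel = cosine_similarity(a, a)    # parallel vectors → 1.0
sim_orthogonal = cosine_similarity(a, c)  # orthogonal vectors → 0.0
```

With a real model, similarity between adjacent sentences lands between these extremes, and the 0.7 threshold decides where a chunk boundary falls.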
More patterns: https://github.com/KeepALifeUS/autonomous-agents
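For contrast, a minimal fixed-size chunker is a useful baseline when evaluating whether semantic boundaries actually help on your corpus. This is a hypothetical helper sketched for illustration (the name and parameters are assumptions, not part of LLMWare):

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows with overlap.

    Unlike semantic chunking, boundaries here are arbitrary and can cut
    sentences mid-thought; the overlap only partially mitigates that.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Running both chunkers over the same documents and comparing retrieval hit rates is a cheap way to confirm the extra embedding cost of semantic chunking is paying off.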
---
Thank you for this amazing project.
I asked ChatGPT if LLMWare supports semantic chunking and it said yes, but I'm not sure. Can someone help clarify, please?
> Yes, LLMWare supports semantic chunking as part of its parsing capabilities. Semantic chunking involves dividing text into meaningful units to enhance processing and retrieval. LLMWare facilitates this through its TextChunker class, which allows for the configuration of text chunking and extraction parameters. This functionality is crucial in building effective Retrieval-Augmented Generation (RAG) pipelines, as it ensures that parsed text is segmented into coherent chunks, improving the relevance and accuracy of information retrieval.
>
> For example, when parsing documents such as PDFs, LLMWare enables users to configure how text is chunked, ensuring that each segment maintains semantic integrity. This approach enhances the performance of downstream tasks like embedding and querying by preserving the contextual meaning within each chunk.
>
> By integrating semantic chunking into its framework, LLMWare provides a robust toolset for developers aiming to build sophisticated AI workflows that require precise text parsing and information retrieval.
Semantic Chunking link: https://www.pointable.ai/blog/exploring-semantic-chunking