👐 Chunk up the data

Since we are working with large documents, we first need to break them up into smaller chunks before embedding and storing them in MongoDB.

Fill in any <CODE_BLOCK_N> placeholders and run the cells under the "Step 4: Chunk up the data" section of the notebook to chunk up the articles we loaded.

The answers for code blocks in this section are as follows:

CODE_BLOCK_1

Answer
RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4", separators=separators, chunk_size=200, chunk_overlap=30
)
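
To get a feel for what chunk_size=200 and chunk_overlap=30 mean, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not how RecursiveCharacterTextSplitter works internally: the real splitter first splits on the given separators and measures length in tiktoken tokens, whereas this sketch slides a fixed window over a list of placeholder tokens. The function name chunk_tokens and the sample tokens are assumptions made for illustration.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size tokens.

    Each window repeats the last chunk_overlap tokens of the previous one.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


# Placeholder tokens standing in for a tokenized article (hypothetical data).
tokens = [f"t{i}" for i in range(500)]
chunks = chunk_tokens(tokens, chunk_size=200, chunk_overlap=30)
print([len(c) for c in chunks])  # [200, 200, 160]
```

The overlap means the last 30 tokens of one chunk reappear at the start of the next, so a sentence that straddles a chunk boundary is still seen in context by at least one chunk.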

CODE_BLOCK_2

Answer
text_splitter.split_text(text)

CODE_BLOCK_3

Answer
get_chunks(doc, "body")
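
The three answers fit together roughly as follows: a splitter is created, a helper splits one field of a document, and the helper is called once per article. The sketch below shows one way such a helper could look. It is an assumption, not the notebook's actual code: the name get_chunks_sketch, the extra split_text parameter, the stand-in splitter split_on_blank_lines, and the sample document are all hypothetical. In the notebook, the splitting would be done by text_splitter.split_text.

```python
from typing import Callable, Dict, List


def get_chunks_sketch(
    doc: Dict, text_field: str, split_text: Callable[[str], List[str]]
) -> List[Dict]:
    """Return one copy of doc per chunk, with text_field replaced by that chunk."""
    chunked_docs = []
    for chunk in split_text(doc[text_field]):
        chunk_doc = doc.copy()        # keep the other fields (e.g. title) on every chunk
        chunk_doc[text_field] = chunk
        chunked_docs.append(chunk_doc)
    return chunked_docs


def split_on_blank_lines(text: str) -> List[str]:
    """Stand-in splitter (hypothetical): one chunk per non-empty paragraph."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


# Hypothetical article used only for this example.
doc = {"title": "Example article", "body": "First paragraph.\n\nSecond paragraph."}
chunked = get_chunks_sketch(doc, "body", split_on_blank_lines)
print(chunked)
```

Copying the document for each chunk keeps fields such as the title attached to every chunk, which is useful once the chunks are embedded and stored as separate documents.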