# chromem-go

[![Go Reference](https://pkg.go.dev/badge/github.com/philippgille/chromem-go.svg)](https://pkg.go.dev/github.com/philippgille/chromem-go)
[![Build status](https://github.com/philippgille/chromem-go/actions/workflows/go.yml/badge.svg)](https://github.com/philippgille/chromem-go/actions/workflows/go.yml)
[![Go Report Card](https://goreportcard.com/badge/github.com/philippgille/chromem-go)](https://goreportcard.com/report/github.com/philippgille/chromem-go)
[![GitHub Releases](https://img.shields.io/github/release/philippgille/chromem-go.svg)](https://github.com/philippgille/chromem-go/releases)

Embeddable vector database for Go with Chroma-like interface and zero third-party dependencies. In-memory with optional persistence.

Because `chromem-go` is embeddable, it lets you add retrieval augmented generation (RAG) and similar embeddings-based features to your Go app *without having to run a separate database*, much like using SQLite instead of PostgreSQL/MySQL/etc.

It's *not* a library to connect to Chroma and also not a reimplementation of it in Go. It's a database on its own.

The focus is not scale (millions of documents) or number of features, but simplicity and performance for the most common use cases. On a mid-range 2020 Intel laptop CPU you can query 1,000 documents in about 0.5 ms and 100,000 documents in about 40 ms, with very few and small memory allocations. See [Benchmarks](#benchmarks) for details.

> ⚠️ The project is in beta, under heavy construction, and may introduce breaking changes in releases before `v1.0.0`. All changes are documented in the [`CHANGELOG`](./CHANGELOG.md).

## Contents

1. [Use cases](#use-cases)
2. [Interface](#interface)
3. [Features + Roadmap](#features)
4. [Installation](#installation)
5. [Usage](#usage)
6. [Benchmarks](#benchmarks)
7. [Development](#development)
8. [Motivation](#motivation)
9. [Related projects](#related-projects)

## Use cases

With a vector database you can do various things:

- Retrieval augmented generation (RAG), question answering (Q&A)
- Text and code search
- Recommendation systems
- Classification
- Clustering

Let's look at the RAG use case in more detail:

### RAG

The knowledge of large language models (LLMs), even the ones with 30 billion, 70 billion or more parameters, is limited. They don't know anything about what happened after their training ended, they don't know anything about data they were not trained on (like your company's intranet, Jira / bug tracker, wiki or other kinds of knowledge bases), and even data they *do* know they often can't reproduce *exactly*; they start to *hallucinate* instead.

Fine-tuning an LLM can help a bit, but it's meant more to improve the LLM's reasoning about specific topics, or to reproduce the style of written text or code. Fine-tuning does *not* add knowledge *1:1* into the model. Details are lost or mixed up. And the knowledge cutoff (about anything that happened after the fine-tuning) isn't solved either.

=> A vector database can act as the up-to-date, precise knowledge source for LLMs:

1. You store relevant documents that you want the LLM to know in the database.
2. The database stores the *embeddings* alongside the documents. You can either provide them yourself or have them created by specific "embedding models" like OpenAI's `text-embedding-3-small`.
   - `chromem-go` can do this for you and supports multiple embedding providers and models out-of-the-box.
3. Later, when you want to talk to the LLM, you first send the question to the vector DB to find *similar*/*related* content. This is called "nearest neighbor search".
4. In the question to the LLM, you provide this content alongside your question.
5. The LLM can take this up-to-date, precise content into account when answering.

Check out the [example code](examples) to see it in action!

## Interface

Our original inspiration was the [Chroma](https://www.trychroma.com/) interface, whose core API is the following (taken from their [README](https://github.com/chroma-core/chroma/blob/0.4.21/README.md)):

<details><summary>Chroma core interface</summary>

```python
import chromadb
# setup Chroma in-memory, for easy prototyping. Can add persistence easily!
client = chromadb.Client()

# Create collection. get_collection, get_or_create_collection, delete_collection also available!
collection = client.create_collection("all-my-documents")

# Add docs to the collection. Can also update and delete. Row-based API coming soon!
collection.add(
    documents=["This is document1", "This is document2"], # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
    metadatas=[{"source": "notion"}, {"source": "google-docs"}], # filter on these!
    ids=["doc1", "doc2"], # unique for each doc
)

# Query/search 2 most similar results. You can also .get by id
results = collection.query(
    query_texts=["This is a query document"],
    n_results=2,
    # where={"metadata_field": "is_equal_to_this"}, # optional filter
    # where_document={"$contains":"search_string"}  # optional filter
)
```

</details>

Our Go library exposes the same interface:

<details><summary>chromem-go equivalent</summary>

```go
package main

import (
    "context"

    "github.com/philippgille/chromem-go"
)

func main() {
    ctx := context.Background()

    // Set up chromem-go in-memory, for easy prototyping. Can add persistence easily!
    // We call it DB instead of client because there's no client-server separation. The DB is embedded.
    db := chromem.NewDB()

    // Create collection. GetCollection, GetOrCreateCollection, DeleteCollection also available!
    collection, _ := db.CreateCollection("all-my-documents", nil, nil)

    // Add docs to the collection. Update and delete will be added in the future.
    // Can be multi-threaded with AddConcurrently()!
    // We're showing the Chroma-like method here, but more Go-idiomatic methods are also available!
    _ = collection.Add(ctx,
        []string{"doc1", "doc2"}, // unique ID for each doc
        nil, // We handle embedding automatically. You can skip that and add your own embeddings as well.
        []map[string]string{{"source": "notion"}, {"source": "google-docs"}}, // Filter on these!
        []string{"This is document1", "This is document2"},
    )

    // Query/search 2 most similar results. Getting by ID will be added in the future.
    results, _ := collection.Query(ctx,
        "This is a query document",
        2,
        map[string]string{"metadata_field": "is_equal_to_this"}, // optional filter
        map[string]string{"$contains": "search_string"},         // optional filter
    )
    _ = results // Use the results here; Go rejects unused variables.
}
```

</details>

Initially `chromem-go` started with just the four core methods, but we added more over time. We intentionally don't want to cover 100% of Chroma's API surface though.  
We're providing some alternative methods that are more Go-idiomatic instead.

For the full interface see the Godoc: <https://pkg.go.dev/github.com/philippgille/chromem-go>

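To make the two optional filter arguments of `Query` more concrete, here's a rough sketch of how exact-match metadata filters and the `$contains` / `$not_contains` document filters could be evaluated against a single document. The `document` type and the `matches` function are hypothetical and only illustrate the semantics. This is not the library's actual code, and rejecting unknown operators is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// document is a simplified, hypothetical stand-in for a stored document.
type document struct {
	ID       string
	Metadata map[string]string
	Content  string
}

// matches reports whether d passes both filters. A nil or empty filter matches everything.
func matches(d document, where, whereDocument map[string]string) bool {
	// Metadata filter: every key must exist and match exactly.
	for k, want := range where {
		if got, ok := d.Metadata[k]; !ok || got != want {
			return false
		}
	}
	// Document filter: substring checks on the content.
	for op, s := range whereDocument {
		switch op {
		case "$contains":
			if !strings.Contains(d.Content, s) {
				return false
			}
		case "$not_contains":
			if strings.Contains(d.Content, s) {
				return false
			}
		default:
			return false // Assumption: unknown operators reject the document.
		}
	}
	return true
}

func main() {
	d := document{ID: "doc1", Metadata: map[string]string{"source": "notion"}, Content: "This is document1"}
	fmt.Println(matches(d, map[string]string{"source": "notion"}, map[string]string{"$contains": "document1"}))
	fmt.Println(matches(d, map[string]string{"source": "google-docs"}, nil))
}
```
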
## Features

- [X] Zero dependencies on third party libraries
- [X] Embeddable (like SQLite, i.e. no client-server model, no separate DB to maintain)
- [X] Multithreaded processing (when adding and querying documents), making use of Go's native concurrency features
- [X] Experimental WebAssembly binding
- Embedding creators:
  - Hosted:
    - [X] [OpenAI](https://platform.openai.com/docs/guides/embeddings/embedding-models) (default)
    - [X] [Cohere](https://cohere.com/models/embed)
    - [X] [Mistral](https://docs.mistral.ai/platform/endpoints/#embedding-models)
    - [X] [Jina](https://jina.ai/embeddings)
    - [X] [mixedbread.ai](https://www.mixedbread.ai/)
  - Local:
    - [X] [Ollama](https://github.com/ollama/ollama)
    - [X] [LocalAI](https://github.com/mudler/LocalAI)
  - Bring your own (implement [`chromem.EmbeddingFunc`](https://pkg.go.dev/github.com/philippgille/chromem-go#EmbeddingFunc))
  - You can also pass existing embeddings when adding documents to a collection, instead of letting `chromem-go` create them
- Similarity search:
  - [X] Exhaustive nearest neighbor search using cosine similarity (sometimes also called exact search or brute-force search or FLAT index)
- Filters:
  - [X] Document filters: `$contains`, `$not_contains`
  - [X] Metadata filters: Exact matches
- Storage:
  - [X] In-memory
  - [X] Optional immediate persistence (writes one file for each added collection and document, encoded as [gob](https://go.dev/blog/gob), optionally gzip-compressed)
  - [X] Backups: Export and import of the entire DB to/from a single file (encoded as [gob](https://go.dev/blog/gob), optionally gzip-compressed and AES-GCM encrypted)
    - Includes methods for generic `io.Writer`/`io.Reader` so you can plug S3 buckets and other blob storage, see [examples/s3-export-import](examples/s3-export-import) for example code
- Data types:
  - [X] Documents (text)

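For readers new to the term, here's a small sketch of what "exhaustive nearest neighbor search using cosine similarity" means: compare the query vector with *every* stored vector and keep the best `n`. The types, function names and two-dimensional toy embeddings are made up for illustration. This is not the library's implementation, which may differ, for example in how it normalizes vectors or parallelizes the scan.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// doc and result are hypothetical types for this sketch.
type doc struct {
	ID        string
	Embedding []float32
}

type result struct {
	ID         string
	Similarity float32
}

// cosine returns the cosine similarity of a and b.
// Assumption: both vectors have the same length.
func cosine(a, b []float32) float32 {
	var dot, normA, normB float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		normA += float64(a[i]) * float64(a[i])
		normB += float64(b[i]) * float64(b[i])
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return float32(dot / (math.Sqrt(normA) * math.Sqrt(normB)))
}

// nearest scores every document against the query (no index involved)
// and returns the n most similar ones, best first.
func nearest(query []float32, docs []doc, n int) []result {
	res := make([]result, 0, len(docs))
	for _, d := range docs {
		res = append(res, result{ID: d.ID, Similarity: cosine(query, d.Embedding)})
	}
	sort.Slice(res, func(i, j int) bool { return res[i].Similarity > res[j].Similarity })
	if n > len(res) {
		n = len(res)
	}
	return res[:n]
}

func main() {
	docs := []doc{
		{ID: "sky", Embedding: []float32{1, 0}},
		{ID: "leaves", Embedding: []float32{0, 1}},
		{ID: "sea", Embedding: []float32{1, 1}},
	}
	for _, r := range nearest([]float32{1, 0.1}, docs, 2) {
		fmt.Printf("%s %.3f\n", r.ID, r.Similarity)
	}
}
```
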
### Roadmap

- Performance:
  - Use SIMD for dot product calculation on supported CPUs (draft PR: [#48](https://github.com/philippgille/chromem-go/pull/48))
  - Add [roaring bitmaps](https://github.com/RoaringBitmap/roaring) to speed up full text filtering
- Embedding creators:
  - Add an `EmbeddingFunc` that downloads and shells out to [llamafile](https://github.com/Mozilla-Ocho/llamafile)
- Similarity search:
  - Approximate nearest neighbor search with index (ANN)
    - Hierarchical Navigable Small World (HNSW)
    - Inverted file flat (IVFFlat)
- Filters:
  - Operators (`$and`, `$or` etc.)
- Storage:
  - JSON as second encoding format
  - Write-ahead log (WAL) as second file format
  - Optional remote storage (S3, PostgreSQL, ...)
- Data types:
  - Images
  - Videos

## Installation

`go get github.com/philippgille/chromem-go@latest`

## Usage

See the Godoc for a reference: <https://pkg.go.dev/github.com/philippgille/chromem-go>

For full, working examples that use the vector database for retrieval augmented generation (RAG) and semantic search, with either OpenAI or a locally running embedding model and LLM (in Ollama), see the [example code](examples).

### Quickstart

This is taken from the ["minimal" example](examples/minimal):

```go
package main

import (
  "context"
  "fmt"
  "runtime"

  "github.com/philippgille/chromem-go"
)

func main() {
  ctx := context.Background()

  db := chromem.NewDB()

  c, err := db.CreateCollection("knowledge-base", nil, nil)
  if err != nil {
    panic(err)
  }

  err = c.AddDocuments(ctx, []chromem.Document{
    {
      ID:      "1",
      Content: "The sky is blue because of Rayleigh scattering.",
    },
    {
      ID:      "2",
      Content: "Leaves are green because chlorophyll absorbs red and blue light.",
    },
  }, runtime.NumCPU())
  if err != nil {
    panic(err)
  }

  res, err := c.Query(ctx, "Why is the sky blue?", 1, nil, nil)
  if err != nil {
    panic(err)
  }

  fmt.Printf("ID: %v\nSimilarity: %v\nContent: %v\n", res[0].ID, res[0].Similarity, res[0].Content)
}
```

Output:

```text
ID: 1
Similarity: 0.6833369
Content: The sky is blue because of Rayleigh scattering.
```

## Benchmarks

Benchmarked on 2024-03-17 with:

- Computer: Framework Laptop 13 (first generation, 2021)
- CPU: 11th Gen Intel Core i5-1135G7 (2020)
- Memory: 32 GB
- OS: Fedora Linux 39
  - Kernel: 6.7

```console
$ go test -benchmem -run=^$ -bench .
goos: linux
goarch: amd64
pkg: github.com/philippgille/chromem-go
cpu: 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
BenchmarkCollection_Query_NoContent_100-8          13164      90276 ns/op     5176 B/op       95 allocs/op
BenchmarkCollection_Query_NoContent_1000-8          2142     520261 ns/op    13558 B/op      141 allocs/op
BenchmarkCollection_Query_NoContent_5000-8           561    2150354 ns/op    47096 B/op      173 allocs/op
BenchmarkCollection_Query_NoContent_25000-8          120    9890177 ns/op   211783 B/op      208 allocs/op
BenchmarkCollection_Query_NoContent_100000-8          30   39574238 ns/op   810370 B/op      232 allocs/op
BenchmarkCollection_Query_100-8                    13225      91058 ns/op     5177 B/op       95 allocs/op
BenchmarkCollection_Query_1000-8                    2226     519693 ns/op    13552 B/op      140 allocs/op
BenchmarkCollection_Query_5000-8                     550    2128121 ns/op    47108 B/op      173 allocs/op
BenchmarkCollection_Query_25000-8                    100   10063260 ns/op   211705 B/op      205 allocs/op
BenchmarkCollection_Query_100000-8                    30   39404005 ns/op   810295 B/op      229 allocs/op
PASS
ok   github.com/philippgille/chromem-go 28.402s
```

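The `ns/op` column is easier to read as milliseconds per query and nanoseconds per document. The following sketch does that conversion for the `NoContent` rows above. The `run` type and helper names are made up for this example; the numbers are copied from the benchmark output.

```go
package main

import "fmt"

// run is one row of the benchmark output: collection size and ns/op.
type run struct {
	docs    int
	nsPerOp float64
}

// msPerQuery converts ns/op to milliseconds per query.
func msPerQuery(r run) float64 { return r.nsPerOp / 1e6 }

// nsPerDoc spreads the query time evenly over the documents in the collection.
func nsPerDoc(r run) float64 { return r.nsPerOp / float64(r.docs) }

func main() {
	runs := []run{
		{docs: 100, nsPerOp: 90276},
		{docs: 1000, nsPerOp: 520261},
		{docs: 5000, nsPerOp: 2150354},
		{docs: 25000, nsPerOp: 9890177},
		{docs: 100000, nsPerOp: 39574238},
	}
	for _, r := range runs {
		fmt.Printf("%6d docs: %7.3f ms/query, %6.1f ns/doc\n", r.docs, msPerQuery(r), nsPerDoc(r))
	}
}
```

From 25,000 documents upward the cost settles at roughly 400 ns per document, which is what you'd expect from a linear scan over the whole collection.
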
## Development

- Build: `go build ./...`
- Test: `go test -v -race -count 1 ./...`
- Benchmark:
  - `go test -benchmem -run=^$ -bench .` (add `> bench.out` or similar to write to a file)
  - With profiling: `go test -benchmem -run ^$ -cpuprofile cpu.out -bench .`
    - (profiles: `-cpuprofile`, `-memprofile`, `-blockprofile`, `-mutexprofile`)
- Compare benchmarks:
  1. Install `benchstat`: `go install golang.org/x/perf/cmd/benchstat@latest`
  2. Compare two benchmark results: `benchstat before.out after.out`

## Motivation

In December 2023, when I wanted to play around with retrieval augmented generation (RAG) in a Go program, I looked for a vector database that could be embedded in the Go program, just like you would embed SQLite in order to not require any separate DB setup and maintenance. I was surprised when I didn't find any, given the abundance of embedded key-value stores in the Go ecosystem.

At the time most of the popular vector databases like Pinecone, Qdrant, Milvus, Chroma, Weaviate and others were not embeddable at all or only in Python or JavaScript/TypeScript.

Then I found [@eliben](https://github.com/eliben)'s [blog post](https://eli.thegreenplace.net/2023/retrieval-augmented-generation-in-go/) and [example code](https://github.com/eliben/code-for-blog/tree/eda87b87dad9ed8bd45d1c8d6395efba3741ed39/2023/go-rag-openai) which showed that with very little Go code you could create a very basic PoC of a vector database.

That's when I decided to build my own vector database, embeddable in Go, inspired by the ChromaDB interface. ChromaDB stood out for being embeddable (in Python), and by showing its core API in 4 commands on their README and on the landing page of their website.

## Related projects

- Shoutout to [@eliben](https://github.com/eliben) whose [blog post](https://eli.thegreenplace.net/2023/retrieval-augmented-generation-in-go/) and [example code](https://github.com/eliben/code-for-blog/tree/eda87b87dad9ed8bd45d1c8d6395efba3741ed39/2023/go-rag-openai) inspired me to start this project!
- [Chroma](https://github.com/chroma-core/chroma): Looking at Pinecone, Qdrant, Milvus, Weaviate and others, Chroma stood out by showing its core API in 4 commands on their README and on the landing page of their website. It was also putting the most emphasis on its embeddability (in Python).
- The big, full-fledged client-server-based vector databases for maximum scale and performance:
  - [Pinecone](https://www.pinecone.io/): Closed source
  - [Qdrant](https://github.com/qdrant/qdrant): Written in Rust, not embeddable in Go
  - [Milvus](https://github.com/milvus-io/milvus): Written in Go and C++, but not embeddable as of December 2023
  - [Weaviate](https://github.com/weaviate/weaviate): Written in Go, but not embeddable in Go as of March 2024 (only in Python and JavaScript/TypeScript and that's experimental)
- Some non-specialized SQL, NoSQL and Key-Value databases added support for storing vectors and (some of them) querying based on similarity:
  - [pgvector](https://github.com/pgvector/pgvector) extension for [PostgreSQL](https://www.postgresql.org/): Client-server model
  - [Redis](https://github.com/redis/redis) ([1](https://redis.io/docs/interact/search-and-query/query/vector-search/), [2](https://redis.io/docs/interact/search-and-query/advanced-concepts/vectors/)): Client-server model
  - [sqlite-vss](https://github.com/asg017/sqlite-vss) extension for [SQLite](https://www.sqlite.org/): Embedded, but the [Go bindings](https://github.com/asg017/sqlite-vss/tree/8fc44301843029a13a474d1f292378485e1fdd62/bindings/go) require CGO. There's a [CGO-free Go library](https://gitlab.com/cznic/sqlite) for SQLite, but then it's without the vector search extension.
  - [DuckDB](https://github.com/duckdb/duckdb) has a function to calculate cosine similarity ([1](https://duckdb.org/docs/sql/functions/nested)): Embedded, but the Go bindings use CGO
  - [MongoDB](https://github.com/mongodb/mongo)'s cloud platform offers a vector search product ([1](https://www.mongodb.com/products/platform/atlas-vector-search)): Client-server model
- Some libraries for vector similarity search:
  - [Faiss](https://github.com/facebookresearch/faiss): Written in C++; 3rd party Go bindings use CGO
  - [Annoy](https://github.com/spotify/annoy): Written in C++; Go bindings use CGO ([1](https://github.com/spotify/annoy/blob/2be37c9e015544be2cf60c431f0cccc076151a2d/README_GO.rst))
  - [USearch](https://github.com/unum-cloud/usearch): Written in C++; Go bindings use CGO
- Some orchestration libraries, inspired by the Python library [LangChain](https://github.com/langchain-ai/langchain), but with no or only rudimentary embedded vector DB:
  - [LangChain Go](https://github.com/tmc/langchaingo)
  - [LinGoose](https://github.com/henomis/lingoose)
  - [GoLC](https://github.com/hupe1980/golc)
