too many dang databases that look shiny. which of these are good? worth trying? idk. decision paralysis.
https://www.edgedb.com/docs - main-db-focused graph db, postgres core
https://terminusdb.com/products/terminusdb/ - main-db-focused graph db, prolog core (wat)
https://surrealdb.com/ - main-db-focused graph db, realtime functionality,
https://milvus.io/ - vector
https://weaviate.io/developers/weaviate - vector, sleek and easy to use, might not scale as well as milvus but I guess I should just not care
https://clientdb.dev/ - embedded, to compensate if not realtime, looks quite sleek and easy to use?
https://github.com/orbitdb/orbit-db - embedded, janky
https://one-db.org - embedded, unclear if it even works
https://immudb.io/ - serverside, maybe janky, ignore
maybe edgedb with clientdb? or surrealdb with clientdb? milvus for vector query? do I have to maintain multiple schemas then?
tried to get bing to answer this, it just agreed with me that these databases all sound cool and are hard to pick between :P
The word “database” is massively overloaded. Those seem to be storage, indexing and query engines, with no actual data included. They also seem to be quite different in focus, some in-memory intended to replicate and run on a client, some server-oriented for more ACID-like multiuser use, and each with different query properties.
Having done related work for a long, long time, I’d strongly recommend against shiny, and against ever evaluating a vendor product without your own problem statement to test it against. In fact, for almost all tech questions, start with “what do I want to accomplish?”, not “how can I use this?”
Especially for data storage and manipulation, I even more strongly recommend against shiny. Simplicity and older mechanisms are almost always more valuable than the bells and whistles of newer systems.
What data (dimensionality and quantity) are you planning to put in it, and what uses of the data are you anticipating?
Good prompts.
related: I’d like to be able to query what’s needed to display a page in a roamlike ui, which would involve a tree walk.
graph traversal: I want to be able to ask what references what efficiently, get shortest path between two nodes given some constraints on the path, etc.
search: I’d like to be able to query at least 3k (pages), maybe more like 30k (pages + line-level embeddings from lines of editable pages), if not more like 400k (line-level embeddings from all pages) vectors, comfortably; I’ll often want to query vectors while filtering to only relevant types of vector (page vs line, category, etc). milvus claims to have this down pat, weaviate seems shinier and has built in support for generating the embeddings, but according to a test is less performant? also it has fewer types of vector relationships and some of the ones milvus has look very useful, eg
sync: I’d like multiple users to be able to open a webclient (or deno/rust/python/something desktop client?) at the same time and get a realtime-ish synced view. this doesn’t necessarily have to be gdocs grade, but it should work for multiple users straightforwardly and so the serverside should know how to push to the client by default. if possible I want this without special setup. surrealdb specifically offers this, and its storage seems to be solid. but no python client. maybe that’s fine and I can use it entirely from javascript, but then how shall I combine with the vector db?
seems like I really need at least two dbs for this because none of them do both good vector search and good realtimeish sync. but, hmm, docs for surrealdb seem pretty weak. okay, maybe not surrealdb then. edgedb looks nice for main storage, but no realtime. I guess I’ll keep looking for that part.
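For the "what references what" and constrained shortest-path asks above, here is a rough sketch of the query shape in plain Python, mostly to show how small the core logic is before any database gets involved. The edge layout `(src, dst, kind)`, the `allow` hook, and the example data are all made up for illustration; none of this is any particular db's API.

```python
from collections import deque


def backlinks(edges, node):
    """Return the sorted ids of everything that references `node`."""
    return sorted({src for src, dst, _ in edges if dst == node})


def shortest_path(edges, start, goal, allow=lambda src, dst, kind: True):
    """Breadth-first search over (src, dst, kind) edges.

    Returns the list of nodes on a shortest path, or None if there is none.
    `allow` is a hypothetical hook for path constraints, e.g. "only follow
    links of a certain kind".
    """
    # Build an adjacency map once so each expansion is O(out-degree).
    out = {}
    for src, dst, kind in edges:
        out.setdefault(src, []).append((dst, kind))

    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent pointers back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for dst, kind in out.get(node, []):
            if dst not in parent and allow(node, dst, kind):
                parent[dst] = node
                queue.append(dst)
    return None


# Illustrative data only: page ids and link kinds are invented.
EDGES = [
    ("a", "b", "ref"),
    ("b", "d", "ref"),
    ("a", "c", "embed"),
    ("c", "d", "embed"),
    ("d", "e", "ref"),
    ("a", "e", "embed"),
]
```

Calling `shortest_path(EDGES, "a", "e", allow=lambda s, d, k: k == "ref")` restricts the walk to `ref` links. At a few thousand pages this fits in memory easily, so a graph db is mainly buying persistence and a query language rather than raw traversal speed.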
Yeah, it seems likely you’ll end up with 2 or 3 different store/query mechanisms. Something fairly flat and transactional-ish (best-efforts probably fine, not long-disconnected edit resolution) for interactive edits, and something for search/traversal (which will vary widely based on the depth of the traversals, the cardinality of the graph, etc.; it could be a denormalized schema in the same DBM or a different DBM). And perhaps a caching layer for low-latency needs (maybe not a different store/query, but just results caching somewhere). And perhaps an analytics store for asynchronous big-data processing.
Honestly, even if this is pretty big in scope, I’d prototype with Mongo or DynamoDB as my primary store (or a SQL store if you’re into that), using simple adjacency tables for the graph connections. Then either layer a GraphQL processor directly or on a replicated/differently-normalized store.
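To make the adjacency-table idea concrete, here is a minimal sketch using SQLite from the Python standard library, standing in for "a SQL store". The table names (`block`, `child_of`), the `pos` sibling-order column, and the sample rows are assumptions for illustration. The recursive query is the tree walk needed to render one page of nested blocks.

```python
import sqlite3


def make_db():
    """Create an in-memory db with a block table and a parent/child adjacency table."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE block (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
        -- One row per parent/child edge; pos orders siblings.
        CREATE TABLE child_of (
            parent INTEGER NOT NULL REFERENCES block(id),
            child  INTEGER NOT NULL REFERENCES block(id),
            pos    INTEGER NOT NULL,
            PRIMARY KEY (parent, child)
        );
    """)
    db.executemany("INSERT INTO block VALUES (?, ?)", [
        (1, "page"), (2, "first bullet"), (3, "nested bullet"), (4, "second bullet"),
    ])
    db.executemany("INSERT INTO child_of VALUES (?, ?, ?)", [
        (1, 2, 0), (2, 3, 0), (1, 4, 1),
    ])
    return db


def page_tree(db, root_id):
    """Return (depth, text) rows for the subtree under root_id, in display order."""
    return db.execute("""
        WITH RECURSIVE walk(id, depth, path) AS (
            SELECT ?, 0, ''
            UNION ALL
            -- Extend the sort key with the zero-padded sibling position.
            SELECT c.child, w.depth + 1, w.path || printf('%04d/', c.pos)
            FROM walk AS w JOIN child_of AS c ON c.parent = w.id
        )
        SELECT walk.depth, block.text
        FROM walk JOIN block ON block.id = walk.id
        ORDER BY walk.path
    """, (root_id,)).fetchall()
```

`page_tree(make_db(), 1)` returns the whole page in one round trip, depth-first. The zero-padded `path` column is just a cheap way to get a stable display order, and it assumes fewer than 10,000 siblings per parent.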
Can you give me some more clues here? I want to help with this. By vectors, are you talking about similarity vectors between e.g. lines of text, paragraphs, etc.? And to optimize this you would want a vector db?
Why is sync difficult? In my experience any regular postgres db will have pretty snappy sync times. I feel like the text generation times will always be the bottleneck? Or are you thinking more of post-generation weaving?
Maybe I also just don’t understand how different these types of dbs are from a regular postgres..
By sync, I meant server-initiated push for changes. Yep, vectors are sentence/document embeddings.
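For anyone following along, the "query vectors while filtering by type" operation looks roughly like this when done by brute force. This is a sketch, not how milvus or weaviate implement it: the item layout (`id`, `kind`, `vec`) and the toy 2-d vectors are invented for illustration, and real embeddings would have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(items, query, k=3, **filters):
    """Return the ids of the k items most similar to `query`.

    Metadata filters (e.g. kind="line") are applied before scoring, so
    filtered-out vectors cost nothing beyond the filter check.
    """
    candidates = [
        item for item in items
        if all(item.get(field) == want for field, want in filters.items())
    ]
    ranked = sorted(candidates, key=lambda item: cosine(item["vec"], query), reverse=True)
    return [item["id"] for item in ranked[:k]]


# Toy data: ids, kinds and vectors are illustrative only.
ITEMS = [
    {"id": "p1", "kind": "page", "vec": [1.0, 0.0]},
    {"id": "p2", "kind": "page", "vec": [0.0, 1.0]},
    {"id": "l1", "kind": "line", "vec": [0.9, 0.1]},
    {"id": "l2", "kind": "line", "vec": [0.1, 0.9]},
]
```

For example, `search(ITEMS, [1.0, 0.0], k=1, kind="line")` only considers line-level vectors. The exhaustive scan is linear in the number of candidates; a vector db replaces it with an approximate index, which starts to matter more as the collection grows toward the 400k end of the range.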
The main differences from postgres I seek are: 1. I can be lazier setting up schema; 2. realtime push built into the db so I don’t have to build messaging; 3. if it could have surrealdb’s alleged “connect direct from the client” feature and not need serverside code at all, that’d be wonderful.
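To gauge how much "messaging" point 2 actually saves, here is about the smallest version of server-initiated push I can sketch: an in-process fan-out hub built on `asyncio`. The `Hub` class and the shape of the change events are hypothetical. A real server would also need a transport (e.g. one websocket per subscriber draining its queue) and a trigger on writes, which is the part a db with built-in change feeds would handle for you.

```python
import asyncio


class Hub:
    """In-process fan-out: every subscriber gets its own queue of change events."""

    def __init__(self):
        self._subscribers = set()

    def subscribe(self):
        """Register a new client and return the queue it should read from."""
        queue = asyncio.Queue()
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue):
        """Stop delivering events to a client, e.g. when its connection closes."""
        self._subscribers.discard(queue)

    def publish(self, change):
        """Call after a write commits; pushes the change to every open client."""
        for queue in self._subscribers:
            queue.put_nowait(change)


async def demo():
    hub = Hub()
    alice, bob = hub.subscribe(), hub.subscribe()
    hub.publish({"block": 2, "text": "edited"})
    hub.unsubscribe(bob)  # bob disconnects before the second edit
    hub.publish({"block": 4, "text": "edited again"})
    received = [await alice.get(), await alice.get()]
    return received, bob.qsize()
```

Running `asyncio.run(demo())` shows alice receiving both edits while bob only has the first one queued. This only works within a single server process; with several processes you would need something shared between them, for instance Postgres `LISTEN`/`NOTIFY`.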
I’ve seen supabase suggested, as well as rethinkdb and kuzzle.