How to store big data?
Suppose we have a web service that aggregates 20 000 users, and each one
of them is linked to 300 unique user data entities containing whatever.
Here's naive approach on how to design an example relational database that
would be able to store above data:
Create table for users.
Create table for user data.
And thus, user data table contains 6 000 000 rows.
Querying tables that have millions of rows is slow. Especially when you
want to use JOIN, IN and other relational database features. We use
SQLite, but that might be subject to change.
I'd like our web service to be fast, so I thought of following approaches:
Optimize the hell out of queries, do a lot of caching etc. This is nice,
but these are just temporary workarounds. When database grows even
further, these will cease to work.
Rewriting our model layer to use NoSQL technology. This is not possible
due to lack of relational database features and even if we wanted this
approach, early tests made some functionalities even slower than they
already were.
Implement some kind of scalability. (You hear about cloud computing a lot
nowadays.) This is the most wanted option.
Implement some manual solution. For example, I could store all the users
with names beginning with letter "A..M" on server 1, while all other users
would belong to server 2. The problem with this approach is that I have to
redesign our architecture quite a lot and I'd like to avoid that.
Ideally, I'd have some kind of transparent solution that would allow me to
query seemingly uniform database server with no changes to code
whatsoever. The database server would scatter its table data on many
workers in a smart way (much like database optimizers), thus effectively
speeding everything up. (Is this even possible?)
In both cases, achieving interoperability seems like a lot of trouble...
Switching from SQLite to Postgres or Oracle solution. This isn't going to
be cheap, so I'd like some kind of confirmation before doing this.
What are my options? I want all my SELECTs and JOINs with indexed data to
be real-time, but right now every query costs 0.2s which is too much.
No comments:
Post a Comment