Working with big vDatabases
Valentina is a perfect choice to work with a big data. It employs vertical storage approach. It means – each column has own virtual file to be stored in. Most of db vendors use a horizontal storage – it stores a table data in record-by-record manner.
There are some vertical storage advantages:
- There is no limit for a tuple length. In other words you may have up to 16 mln. fields in a table and you should not think of a record length at all. Horizontal storages practically always have some restrictions.
- It’s more speedy in modifying the data. When you choose a record and want to update only couple of fields you should not rewrite all the record – only that fields data will be changed.
- Also there is a huge speed up in getting data. Say you have a table with 100 fields but you want to get only couple of them – so you do something like this: “select f1, f2 from t1″. In this case vertical storage will read only two virtual files. In horizontal storage approach you have to read all the fields anyway.
Last days I tried to work with a pretty big db. Firstly I get the data from
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE17203
Then import it in just created database using vStudio. It taked some time (~1 hour) but there was a debug vStudio build and ~6Gb of data. Finally we got a db – single table, 11 fields, ~120mln. records.
Then we played with it a bit:
- Open/Close this db takes almost zero time.
- “select * from t1″ takes ~15sec. (Most of this time is for locking records. I think it would be times less using “no-lock” cursor)
- Make some field nullable – takes almost zero time.
- Indexing of string field (2Gb field data) – 35 minutes.
- Searching with index – “select * from t1 where Allele1_Forward = ‘T’” – ~27mln records in result - takes 18sec (again – most of the time is a locking records).
- Similar searching without index – 586sec.
- select count(GC_Score) from t1 where Allele1_Forward = ‘T’ – ~2sec (You see – there is no locking records and time reduces from 18sec to only 2sec).
- select sum(GC_Score) from t1 where Allele1_Forward = ‘T’ – ~47sec
- Converting string field to the double type – 20minutes
- “select Allele1_Forward from t1 order by Allele1_Forward” – getting an exception – unable to allocate the memory for sorting data. Must be ok on 64-bit version.
Seems to be a pretty good results.
You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.
Valentina DB senior developer Ivan Samhain has been testing Valentina against NIH human genome data to see how well Valentina performs with complex databases and specifically to the advantage of a columnar database.