Sequencing data management is big business. The business model of companies in that space is based on the well-publicised fact that it takes a lot of disk space to store sequencing data, and that it may be advantageous for many organisations to outsource the management of that data.
Surely, this means that data storage is going to be a major issue for genomics?
I would like to argue that things may not get as bad as often feared.
First, it's necessary to explain why sequencing data currently takes so much disk space. A human genome is around 3 billion bases long, so storing it at one byte per base should take no more than 3 gigabytes (GB). In reality, around 100 GB are needed. That's more than you could get on the newest iPad.
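The arithmetic behind these figures can be sketched in a few lines of Python. The 3 billion bases and the 100 GB figure come from the text above; the function name and the 2-bit packing (enough to distinguish A, C, G and T) are illustrative assumptions, not a description of any particular file format.

```python
GENOME_BASES = 3_000_000_000  # approximate human genome length, as quoted above
OBSERVED_GB = 100             # approximate real-world storage need, as quoted above


def genome_size_gb(bases, bits_per_base):
    """Storage needed for `bases` bases, in gigabytes (1 GB = 10**9 bytes)."""
    return bases * bits_per_base / 8 / 1e9


# One byte (8 bits) per base: the naive text representation.
plain_text_gb = genome_size_gb(GENOME_BASES, 8)

# Two bits per base: hypothetical packing of the four letters A, C, G, T.
packed_gb = genome_size_gb(GENOME_BASES, 2)

# How much larger the real data is than the naive estimate.
overhead_factor = OBSERVED_GB / plain_text_gb

print(f"1 byte per base: {plain_text_gb:.2f} GB")
print(f"2 bits per base: {packed_gb:.2f} GB")
print(f"observed / naive: about {overhead_factor:.0f}x")
```

Under these assumptions the finished genome would fit in 3 GB or less, so the real-world figure is more than thirty times larger than the sequence itself.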
The reason for this is the nature of the data generated by most sequencing machines. They can only sequence segments of DNA that are no more than a few hundred bases long. As a result, each base needs to be sequenced several times, as part of several overlapping segments, to make sure that there are no gaps between segments. Another reason for sequencing bases multiple times is to make sure that errors made by the sequencing machine can be caught.
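To see how repeated sequencing inflates the data volume, here is a rough sketch. The read length of 100 bases, the 30-fold redundancy and the one extra byte per base for a quality score are assumptions chosen for illustration only; they are not figures from this post, and real values vary by machine and project.

```python
GENOME_BASES = 3_000_000_000  # approximate human genome length


def raw_data_estimate(genome_bases, read_length, times_sequenced, bytes_per_base):
    """Return (number of segments, raw size in GB) for one sequenced genome.

    times_sequenced is the average number of times each base is read.
    """
    total_bases = genome_bases * times_sequenced
    num_segments = total_bases // read_length
    size_gb = total_bases * bytes_per_base / 1e9
    return num_segments, size_gb


# Assumed values, for illustration only.
READ_LENGTH = 100       # bases per segment
TIMES_SEQUENCED = 30    # average number of times each base is sequenced

# Bases only, one byte each.
segments, bases_only_gb = raw_data_estimate(
    GENOME_BASES, READ_LENGTH, TIMES_SEQUENCED, bytes_per_base=1
)

# Bases plus one byte per base for a quality score (assumption).
_, with_quality_gb = raw_data_estimate(
    GENOME_BASES, READ_LENGTH, TIMES_SEQUENCED, bytes_per_base=2
)

print(f"segments: {segments:,}")
print(f"bases only: {bases_only_gb:.0f} GB")
print(f"bases plus quality scores: {with_quality_gb:.0f} GB")
```

With these assumed numbers the uncompressed output lands in the same range as the roughly 100 GB mentioned above, which shows that the redundancy, not the genome itself, dominates the storage cost.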
Therefore, the large amount of data generated by sequencing is at least in part due to the limitations of currently used technology. There are four reasons why this is likely to change and why less disk space may be required to store a genome in the future:
- Currently, all sequencing data is stored. For many applications, this may not be necessary. Instead, a lot of disk space could be saved by storing only the positions where the genome differs from what can be expected, for example from a reference genome.
- New sequencing technologies could lead to less sequencing data being generated per genome. Lower error rates would mean that each base has to be sequenced fewer times in order to be sure that no error has been made.
- New technologies that can sequence longer fragments at a time could also reduce the amount of data generated per genome.
- Currently used data formats contain a lot of redundant information. New data formats that make use of compression algorithms could save a lot of disk space.
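The first point in the list, storing only the differences, can be illustrated with a toy example. This is a minimal sketch under strong assumptions: the two sequences are made up, they have the same length, and only single-base substitutions are handled. Real genomes also differ by insertions and deletions, which this sketch ignores. The function names are hypothetical.

```python
def find_differences(reference, sample):
    """Return (position, sample_base) pairs where sample differs from reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [
        (position, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


def rebuild(reference, differences):
    """Reconstruct the sample sequence from the reference and the stored differences."""
    bases = list(reference)
    for position, sample_base in differences:
        bases[position] = sample_base
    return "".join(bases)


# Made-up sequences for illustration.
reference = "ACGTACGTACGTACGT"
sample = "ACGTACCTACGTACGA"

differences = find_differences(reference, sample)
print(differences)

# The full sample can be recovered from the reference plus the short difference list.
assert rebuild(reference, differences) == sample
```

Here 16 bases are represented by two small records. The saving grows with the length of the sequence as long as the sample stays close to the reference.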
Taken together, this implies that the amount of disk space per genome is likely to decrease.
I still think that the overall demand for disk space for sequencing data will grow, but that growth will be driven by the number of genomes being sequenced, not by the storage requirement per genome.
