March 9, 2012

The sequencing data deluge: just another inaccurate apocalyptic prediction?

Sequencing data management is big business.

The business model of companies in that space is based on the well-publicised fact that it takes a lot of disk space to store sequencing data, and that it may be advantageous for many organisations to outsource the management of that data.

Surely, this means that data storage is going to be a major issue for genomics?


I would like to argue that things may not get as bad as often feared.

First, it's necessary to explain why sequencing data currently takes so much disk space. A human genome is around 3 billion bases long, so storing it at one byte per base should take no more than 3 gigabytes (GB). In reality, around 100 GB are needed. That's more than fits on the newest iPad.
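
As a rough check on the 3 GB figure, here is a minimal sketch of the arithmetic. It assumes one byte (a single text character) per base, and also shows the smaller size you would get by packing the four letters A, C, G and T into two bits each. The function name and the two-bit packing are illustrative assumptions, not something specific to any sequencing format.

```python
# Approximate human genome length, as quoted in the text.
GENOME_LENGTH_BASES = 3_000_000_000


def genome_size_gb(bases: int, bits_per_base: int) -> float:
    """Return the size in gigabytes (10**9 bytes) of a plain sequence."""
    return bases * bits_per_base / 8 / 1e9


# One text character per base (assumption: 8 bits each).
one_byte = genome_size_gb(GENOME_LENGTH_BASES, 8)
# Four possible letters fit in 2 bits each (assumption: no ambiguous bases).
two_bit = genome_size_gb(GENOME_LENGTH_BASES, 2)

print(f"{one_byte:.2f} GB at one byte per base")
print(f"{two_bit:.2f} GB at two bits per base")
```

Either way, the finished sequence itself is tiny compared with the roughly 100 GB of data that sequencing currently produces.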

The reason for this is the nature of the data generated by most sequencing machines. They can only sequence segments of DNA that are no more than a few hundred bases long. To make sure that there are no gaps between segments, the segments have to overlap, so each base ends up being sequenced several times. Another reason for sequencing bases multiple times is to make sure that the sequencing machine has made no errors.
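
To see how repeated sequencing inflates the data volume, here is a back-of-the-envelope sketch. The 30-fold figure is an assumed example value, not a number from this post, and the function and parameter names are hypothetical. It counts only the bases themselves; anything else a file format stores per base would be covered by raising `bytes_per_base`.

```python
def raw_sequencing_gb(genome_bases: int, coverage: float,
                      bytes_per_base: float = 1.0) -> float:
    """Estimate raw data volume in GB when each base is read `coverage` times."""
    return genome_bases * coverage * bytes_per_base / 1e9


# Assumption: every base of a 3-billion-base genome is sequenced 30 times.
estimate = raw_sequencing_gb(3_000_000_000, coverage=30)
print(f"{estimate:.0f} GB of raw bases at 30-fold coverage")
```

Under these assumptions the raw reads alone land in the same range as the 100 GB mentioned above.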

Therefore, the large amount of data generated by sequencing is at least in part due to the limitations of currently used technology. There are four reasons why this is likely to change and why less disk space may be required to store a genome in the future:

  • Currently all sequencing data is stored. For many applications, this may not be necessary. Instead, a lot of disk space could be saved by storing only the positions where the genome differs from what is expected, for example from a reference sequence.
  • New sequencing technologies could lead to less sequencing data being generated per genome. Lower error rates would mean that each base has to be sequenced fewer times to be sure that no error has been made.
  • New technologies that can sequence longer fragments at a time could also reduce the amount of data generated per genome.
  • Currently used data formats contain a lot of redundant information. New data formats that make use of compression algorithms could save a lot of disk space.
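
The first point in the list, storing only the differences, can be illustrated with a toy sketch. It assumes the sample and the expected sequence have the same length and differ only by single-base substitutions, which is a simplification (insertions and deletions are ignored). The function names and the short example sequences are made up for illustration.

```python
from typing import List, Tuple


def diff_against_reference(reference: str, sample: str) -> List[Tuple[int, str]]:
    """Return (position, base) pairs where the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles sequences of equal length")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def rebuild(reference: str, diffs: List[Tuple[int, str]]) -> str:
    """Reconstruct the sample from the reference plus the stored differences."""
    bases = list(reference)
    for position, base in diffs:
        bases[position] = base
    return "".join(bases)


reference = "ACGTACGTAC"   # hypothetical expected sequence
sample = "ACGTTCGTAA"      # hypothetical sequenced genome

diffs = diff_against_reference(reference, sample)
print(diffs)
print(rebuild(reference, diffs) == sample)
```

Only the list of differences has to be kept per genome; the expected sequence is stored once and shared.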

Taken together, this implies that the amount of disk space needed per genome is likely to decrease.

I still think that the overall demand for disk space for sequencing data will grow, but that growth will be driven by the number of genomes being sequenced, not by the storage requirement per genome.
