FASTQ Splitter

About

This script divides a large FASTQ file into a set of smaller files of approximately equal size, so that the dataset can be processed in parallel, for instance on a cluster computer.

This tool was made by Kirill Kryukov. It is shared with the hope that it can be useful, but without any warranties.

News

2014-04-24 – Version 0.1.2:

Now accepts FASTQ with sequence name duplicated in '+' line (thanks Andrew!).
Added --check option to do some additional verification of FASTQ correctness.
Added detection of truncated input.
Added reporting elapsed time.

2014-02-10 – Version 0.1.1: Added --eol option.

2014-01-29 – This page was created and version 0.1.0 was uploaded.

Download

Version 0.1.2 (2014-04-24) (3 kB)

(Distributed under the zlib/libpng license, see the source file for details)

Usage

Usage: fastq-splitter.pl [options] <file>...

Options:
    --n-parts <N>              - Divide into <N> parts
    --part-size <N>            - Divide into parts of size <N>
    --measure (all|seq|count)  - Specify whether all data, sequence length, or
                                 number of sequences is used for determining
                                 part sizes ('all' by default).
    --eol (dos|mac|unix)       - Choose end-of-line character ('unix' by default).
    --check                    - Check FASTQ correctness.
    --version                  - Show version.
    --help                     - Show help.

The script supports two strategies: dividing into a given number of parts (--n-parts <N>) and dividing into parts of a given size (--part-size <N>).

It is possible to specify both --n-parts <N> and --part-size <M>. In that case the size of each part will not exceed <M>, and at most <N> parts will be written. This can be useful for extracting a few parts from the beginning of a large FASTQ file without processing the whole file.
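The combined behaviour can be illustrated with a short Python sketch. This is not the script's actual code (the script is written in Perl, and its internals are not shown here); the function name split_by_size, the Record tuple and the measure callback are hypothetical names chosen for illustration. The sketch assumes that a record larger than the limit is still written whole into its own part, and that max_parts, when given, is at least 1.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

# Hypothetical record layout: ('@' line, sequence, '+' line, quality).
Record = Tuple[str, str, str, str]


def split_by_size(
    records: Iterable[Record],
    part_size: int,
    max_parts: Optional[int] = None,
    measure: Callable[[Record], int] = lambda rec: 1,  # default: count records
) -> Iterator[List[Record]]:
    """Greedily group records into parts whose measured size stays within
    part_size, stopping early once max_parts parts have been produced."""
    part: List[Record] = []
    used = 0
    emitted = 0
    for rec in records:
        cost = measure(rec)
        # Close the current part if this record would overflow it.
        if part and used + cost > part_size:
            yield part
            emitted += 1
            if max_parts is not None and emitted >= max_parts:
                return  # the rest of the input is never read
            part, used = [], 0
        # Records are never cut and their order is preserved.
        part.append(rec)
        used += cost
    if part:
        yield part


records = [("@r%d" % i, "ACGT", "+", "IIII") for i in range(10)]
first_two = list(split_by_size(records, part_size=3, max_parts=2))
print([len(p) for p in first_two])  # prints [3, 3]
```

Because the function is a generator that returns as soon as the part limit is reached, the remainder of the input is not consumed, which mirrors the use case of taking only the first few parts of a large file.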

The --measure option controls what is used to determine part sizes. With --measure count, only the number of sequences is used to delimit parts. With --measure seq, the sequence length in base pairs is used. With --measure all, the complete size of each FASTQ entry in bytes (including the sequence name, quality string and end-of-line characters) is used.
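To make the three measures concrete, here is a minimal Python sketch of how the size of a single four-line FASTQ entry could be computed under each of them. The function record_size and the EOL table are hypothetical and are not taken from the script. The mapping of dos, mac and unix to CR LF, CR and LF follows the usual convention, and counting one end-of-line sequence per line in the 'all' measure is an assumption.

```python
# Assumed mapping of the --eol names to end-of-line sequences.
EOL = {"unix": "\n", "dos": "\r\n", "mac": "\r"}


def record_size(name_line, seq, plus_line, qual, measure="all", eol="unix"):
    """Return the size of one FASTQ entry under the given measure."""
    if measure == "count":
        return 1  # every entry counts as one sequence
    if measure == "seq":
        return len(seq)  # sequence length in base pairs
    if measure == "all":
        # Bytes of all four lines plus one end-of-line sequence per line.
        lines = (name_line, seq, plus_line, qual)
        return sum(len(line.encode("ascii")) + len(EOL[eol]) for line in lines)
    raise ValueError("unknown measure: %r" % measure)


entry = ("@r1", "ACGT", "+", "IIII")
print(record_size(*entry, measure="count"))  # prints 1
print(record_size(*entry, measure="seq"))    # prints 4
print(record_size(*entry, measure="all"))    # prints 16
```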

The script won't cut any sequence in the middle, or change the order of sequences.

The --check option controls how strictly the input is validated against the FASTQ format. Without this option the script verifies only the bare minimum necessary for proper parsing. With this option the script additionally verifies that the two names (in the '@' and '+' lines) match, and that the quality string has the same length as the sequence. Processing speed should be about the same in both cases.
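The two extra checks described above can be sketched in Python as follows. This is an illustration only, not the script's code: the function check_record and its error messages are hypothetical, the choice of what counts as the "barest minimum" (the '@' and '+' markers) is an assumption, and so is accepting a bare '+' line with no repeated name.

```python
def check_record(name_line, seq, plus_line, qual, strict=False):
    """Return a list of problems found in one four-line FASTQ entry.

    strict=False mimics a minimal parse check; strict=True adds the
    checks described for --check.
    """
    errors = []
    # Minimal checks (assumed): the marker characters must be present.
    if not name_line.startswith("@"):
        errors.append("name line does not start with '@'")
    if not plus_line.startswith("+"):
        errors.append("separator line does not start with '+'")
    if strict:
        # A bare '+' is assumed to be fine; a repeated name must match.
        if len(plus_line) > 1 and plus_line[1:] != name_line[1:]:
            errors.append("names in '@' and '+' lines differ")
        if len(qual) != len(seq):
            errors.append("quality length differs from sequence length")
    return errors


print(check_record("@r1", "ACGT", "+r1", "IIII", strict=True))  # prints []
print(check_record("@r1", "ACGT", "+r2", "III", strict=True))
```

The second call reports both a name mismatch and a length mismatch, while the same entry passes when strict is left at its default.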

Limitations

Wrapped (multi-line) sequences and quality strings are not currently supported.
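One possible workaround is to unwrap such files before splitting them. The following Python sketch shows a common way to do this; it is not part of FASTQ Splitter, and the function name unwrap_fastq is hypothetical. It assumes that sequence lines never start with '+', and it reads quality lines until their combined length reaches the sequence length, which is needed because a quality line may itself begin with '@' or '+'.

```python
def unwrap_fastq(lines):
    """Yield (name_line, sequence, plus_line, quality) tuples from FASTQ
    text in which sequence and quality may span several lines."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between entries
        if not header.startswith("@"):
            raise ValueError("expected '@' line, got: %r" % header)
        # Collect sequence lines up to the '+' separator.
        seq_parts = []
        plus = None
        for line in it:
            if line.startswith("+"):
                plus = line
                break
            seq_parts.append(line)
        if plus is None:
            raise ValueError("truncated entry: %r" % header)
        seq = "".join(seq_parts)
        # Collect quality lines until they cover the whole sequence.
        qual_parts = []
        qual_len = 0
        while qual_len < len(seq):
            line = next(it, None)
            if line is None:
                raise ValueError("truncated quality: %r" % header)
            qual_parts.append(line)
            qual_len += len(line)
        if qual_len != len(seq):
            raise ValueError("quality longer than sequence: %r" % header)
        yield header, seq, plus, "".join(qual_parts)


wrapped = ["@r1\n", "ACGT\n", "ACG\n", "+\n", "IIII\n", "@II\n",
           "@r2\n", "AC\n", "+r2\n", "II\n"]
for entry in unwrap_fastq(wrapped):
    print("\n".join(entry))
```

Each yielded tuple can be written out as four lines, producing a file in the unwrapped form that the script expects.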

Contact

If you have any questions, comments or suggestions, please contact me.


© 2014 Kirill Kryukov. This page is available under the CC BY 3.0 License.