package fastdoop
Type Members
-
case class
FAIRecord(id: String, length: Long, start: Long, bpsPerLine: Int, bytesPerLine: Int) extends Product with Serializable
FAI (fasta index) record.
Example entries in a FAI file:
ENA|LR865458|LR865458.1 590561804 75 60 61
ENA|LR865459|LR865459.1 685720839 600404651 60 61
ENA|LR865460|LR865460.1 490910922 1297554246 60 61
- id
Sequence ID
- length
Sequence length in base pairs (bps)
- start
Start position of the sequence data (byte offset in the file)
- bpsPerLine
Number of bps per line
- bytesPerLine
Number of bytes per line, including the line terminator
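As an illustration of how these five fields fit together, the sketch below parses one FAI line and computes the byte offset of a base inside the FASTA file. It uses a local stand-in case class with the same fields as FAIRecord. The names FaiEntry, parseFaiLine and byteOffset are hypothetical and not part of fastdoop, columns are assumed to be whitespace-separated as in the example entries, and positions are assumed to be 0-based.

```scala
// Local stand-in mirroring the fields of FAIRecord (hypothetical name).
final case class FaiEntry(id: String, length: Long, start: Long, bpsPerLine: Int, bytesPerLine: Int)

object FaiSketch {
  // Parse one line of a .fai file; columns are assumed to be whitespace-separated.
  def parseFaiLine(line: String): FaiEntry = {
    val cols = line.trim.split("\\s+")
    require(cols.length >= 5, s"expected 5 columns, got ${cols.length}")
    FaiEntry(cols(0), cols(1).toLong, cols(2).toLong, cols(3).toInt, cols(4).toInt)
  }

  // Byte offset in the FASTA file of the base at 0-based position `pos`:
  // skip the whole lines before it (including their line terminators),
  // then add the position within the current line.
  def byteOffset(e: FaiEntry, pos: Long): Long = {
    require(pos >= 0 && pos < e.length, "position out of range")
    e.start + (pos / e.bpsPerLine) * e.bytesPerLine + (pos % e.bpsPerLine)
  }
}
```

For the first example entry, byteOffset(entry, 60) evaluates to 136: the 75-byte start offset plus one full 61-byte line.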
-
class
FASTAlongInputFileFormat extends FileInputFormat[Text, PartialSequence]
A FileInputFormat for reading FASTA files containing sequences of arbitrary length.
- Version
1.0
- See also
FileInputFormat
-
class
FASTAshortInputFileFormat extends FileInputFormat[Text, Record]
A FileInputFormat for reading FASTA files containing short sequences.
- Version
1.0
- See also
FileInputFormat
-
class
FASTQInputFileFormat extends FileInputFormat[Text, QRecord]
A FileInputFormat for reading FASTQ files.
- Version
1.0
- See also
FileInputFormat
-
class
FASTQReadsRecordReader extends RecordReader[Text, QRecord]
This class reads <key, value> pairs from an InputSplit. The input file is in FASTQ format. A FASTQ record has a header line that is the key, the data line, a + line optionally followed by a single-line header string, and a quality line.
Example:
@SRR034939.184 090406_HWI-EAS68_9096_FC400PR_PE_1_1_10 length=100
CCACCTCCTGGGTTCAAGGGGTTCTCTTGCCTCAGCTNNNNNNNNNNNNGGNNNNNNNNNTNNNN
+SRR034939.184 090406_HWI-EAS68_9096_FC400PR_PE_1_1_10 length=100
HDHFHHHHHHFFAFF6?<:<HHHHHHHHHEDHHHF##!!!!!!!!!!!!##!!!!!!!!!#!!!!
...
- Version
1.0
- See also
InputSplit
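The four-line record layout described above can be sketched in plain Scala, independent of Hadoop. FastqEntry and FastqSketch are hypothetical names for illustration only. They are not fastdoop's QRecord or its reader, and the sketch assumes each record occupies exactly four lines.

```scala
// Hypothetical stand-in for one FASTQ record; not fastdoop's QRecord.
final case class FastqEntry(key: String, sequence: String, optionalHeader: String, quality: String)

object FastqSketch {
  // Group the input four lines at a time and check the two marker characters.
  def parse(lines: Iterator[String]): Iterator[FastqEntry] =
    lines.grouped(4).map { g =>
      require(g.length == 4, "truncated FASTQ record")
      val header = g(0)
      val seq    = g(1)
      val plus   = g(2)
      val qual   = g(3)
      require(header.startsWith("@"), s"header must start with '@': $header")
      require(plus.startsWith("+"), s"third line must start with '+': $plus")
      require(seq.length == qual.length, "sequence and quality lengths differ")
      // Drop the '@' and '+' markers; the text after '+' may be empty.
      FastqEntry(header.drop(1), seq, plus.drop(1), qual)
    }
}
```

In this sketch the key is the header without its leading @; whether fastdoop keeps or strips that character is not stated here.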
-
class
IndexedFastaFormat extends FileInputFormat[Text, PartialSequence]
Hadoop input format for FASTA files with an accompanying .fai index file.
- Version
1.0
-
class
IndexedFastaReader extends RecordReader[Text, PartialSequence]
FASTA file reader that uses a faidx (.fai) file to track sequence locations. .fai indexes can be generated by various tools, for example seqkit: https://github.com/shenwei356/seqkit/
This reader can read a mix of full and partial sequences. If the sequence is fully contained in this split, it will be read as a single PartialSequence record. Otherwise, it will be read as multiple records. Partial sequences can be identified and reassembled using their header (corresponding to sequence ID) and seqPosition fields.
Partial sequences are read together with (k-1) bps from the next part to ensure that full k-mers can be processed.
The reader for every split must stream the FAI file. Thus, it is not recommended to use this reader for e.g. short reads, or when the maximum size of a sequence is relatively small. ShortReadsRecordReader and FASTQReadsRecordReader are better suited to such a task. For reading a single long sequence without a FAI index, LongReadsRecordReader can be used instead.
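To make the notion of a sequence being fully or only partly contained in a split concrete, the sketch below classifies one FAI entry against a byte range. This is only an illustration of the arithmetic that the index makes possible, not a description of how IndexedFastaReader is implemented. FaiEntry, SplitSketch and the Coverage values are hypothetical names, and the split is assumed to be the half-open byte range [splitStart, splitEnd).

```scala
// Local stand-in mirroring the fields of FAIRecord (hypothetical name).
final case class FaiEntry(id: String, length: Long, start: Long, bpsPerLine: Int, bytesPerLine: Int)

object SplitSketch {
  sealed trait Coverage
  case object Full extends Coverage     // all bases lie inside the split
  case object Partial extends Coverage  // only some bases lie inside the split
  case object Outside extends Coverage  // no base lies inside the split

  // Exclusive byte offset just past the last base of the sequence.
  def endOffset(e: FaiEntry): Long =
    if (e.length == 0) e.start
    else {
      val last = e.length - 1
      e.start + (last / e.bpsPerLine) * e.bytesPerLine + (last % e.bpsPerLine) + 1
    }

  // Compare the byte range of the sequence data with [splitStart, splitEnd).
  def classify(e: FaiEntry, splitStart: Long, splitEnd: Long): Coverage = {
    val end = endOffset(e)
    if (end <= splitStart || e.start >= splitEnd) Outside
    else if (e.start >= splitStart && end <= splitEnd) Full
    else Partial
  }
}
```

Because every split needs the whole list of entries for this kind of check, the cost of streaming the index grows with the number of sequences, which is consistent with the recommendation above to avoid this reader for short reads.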
- Version
1.0
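The reassembly of partial sequences and the role of the (k-1) overlap can be sketched as follows. Fragment is a hypothetical stand-in for PartialSequence with only the fields needed here. The sketch assumes that seqPosition is the 0-based position of the fragment's first base within the full sequence, which may differ from fastdoop's actual convention, and it stores bases as a String rather than a byte array for brevity.

```scala
// Hypothetical stand-in for a partial sequence record.
final case class Fragment(header: String, seqPosition: Long, bases: String)

object ReassembleSketch {
  // Rebuild each full sequence from its fragments. Fragments may overlap
  // (for example by k-1 bases); bases that are already present are skipped.
  def reassemble(fragments: Seq[Fragment]): Map[String, String] =
    fragments.groupBy(_.header).map { case (id, parts) =>
      val sb = new StringBuilder
      parts.sortBy(_.seqPosition).foreach { f =>
        val skip = (sb.length - f.seqPosition).toInt // bases already written
        require(skip >= 0, s"gap before position ${f.seqPosition} in $id")
        if (skip < f.bases.length) sb.append(f.bases.substring(skip))
      }
      id -> sb.toString
    }

  // All k-mers of one fragment; strings shorter than k yield nothing.
  def kmers(bases: String, k: Int): Iterator[String] =
    bases.sliding(k).filter(_.length == k)
}
```

With an overlap of k-1 bases, every k-mer of the full sequence appears in exactly one fragment, so k-mers can be counted per fragment without reassembling first.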
-
class
LongReadsRecordReader extends RecordReader[Text, PartialSequence]
This class reads <key, value> pairs from an InputSplit. The input file is in FASTA format and contains a single long sequence. A FASTA record has a header line that is the key, and data lines that are the value.
>header...
data ...
Example:
>Seq1
TAATCCCAAATGATTATATCCTTCTCCGATCGCTAGCTATACCTTCCAGGCGATGAACTTAGACGGAATCCACTTTGCTA
CAACGCGATGACTCAACCGCCATGGTGGTACTAGTCGCGGAAAAGAAAGAGTAAACGCCAACGGGCTAGACACACTAATC
CTCCGTCCCCAACAGGTATGATACCGTTGGCTTCACTTCTACTACATTCGTAATCTCTTTGTCAGTCCTCCCGTACGTTG
GCAAAGGTTCACTGGAAAAATTGCCGACGCACAGGTGCCGGGCCGTGAATAGGGCCAGATGAACAAGGAAATAATCACCA
CCGAGGTGTGACATGCCCTCTCGGGCAACCACTCTTCCTCATACCCCCTCTGGGCTAACTCGGAGCAAAGAACTTGGTAA
...
- Version
1.0
- See also
InputSplit
-
class
PartialSequence extends Serializable
This class is used to store fragments of a long input FASTA sequence as an array of bytes.
- Version
1.0
-
class
QRecord extends Serializable
Utility class that represents a sequence from a FASTQ file as a record.
- Version
1.0
-
class
Record extends Serializable
Utility class that represents a sequence from a FASTA file as a record.
- Version
1.0
-
class
ShortReadsRecordReader extends RecordReader[Text, Record]
This class reads <key, value> pairs from an InputSplit. The input file is in FASTA format. A FASTA record has a header line that is the key, and data lines that are the value.
>header...
data ...
Example:
>Seq1
TAATCCCAAATGATTATATCCTTCTCCGATCGCTAGCTATACCTTCCAGGCGATGAACTTAGACGGAATCCACTTTGCTA
CAACGCGATGACTCAACCGCCATGGTGGTACTAGTCGCGGAAAAGAAAGAGTAAACGCCAACGGGCTAGACACACTAATC
CTCCGTCCCCAACAGGTATGATACCGTTGGCTTCACTTCTA
>Seq2
CTACATTCGTAATCTCTTTGTCAGTCCTCCCGTACGTTGGCAAAGGTTCACTGGAAAAATTGCCGACGCACAGGTGCCGG
GCCGTGAATAGGGCCAGATGAACAAGGAAATAATCACCACCGAGGTGTGACATGCCCTCTCGGGCAACCACTCTTCCTCA
TACCCCCTCTGGGCTAACTCGGAGCAAAGAACTTGGTAA
...
- Version
1.0
- See also
InputSplit
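The mapping from a multi-record FASTA file to <key, value> pairs described above can be sketched in plain Scala. FastaEntry and FastaSketch are hypothetical names, not fastdoop's Record or its reader, and the sketch reads from an in-memory line iterator rather than an InputSplit.

```scala
// Hypothetical stand-in for one FASTA record; not fastdoop's Record.
final case class FastaEntry(key: String, value: String)

object FastaSketch {
  // Each '>' line starts a new record; the data lines that follow are
  // concatenated to form the value. Data seen before the first header
  // is ignored in this sketch.
  def parse(lines: Iterator[String]): List[FastaEntry] = {
    val out = List.newBuilder[FastaEntry]
    var key: Option[String] = None
    val data = new StringBuilder

    // Emit the record collected so far, if any, and reset the buffer.
    def flush(): Unit = key.foreach { k =>
      out += FastaEntry(k, data.toString)
      data.clear()
    }

    for (line <- lines.map(_.trim) if line.nonEmpty) {
      if (line.startsWith(">")) {
        flush()
        key = Some(line.drop(1))
      } else if (key.isDefined) {
        data.append(line)
      }
    }
    flush()
    out.result()
  }
}
```

Applied to the example above, this yields two entries with keys Seq1 and Seq2, each holding its data lines joined without line breaks.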