package spark
Provides classes and routines for running Discount on Apache Spark. The main entry point is the Discount class. Once configured, it can be used to create instances of other classes of interest, such as GroupedSegments and CountedKmers.
Type Members
- type AnyMinSplitter = MinSplitter[MinimizerPriorities]
- sealed trait CountMethod extends AnyRef
Defines a strategy for counting k-mers in Spark.
- class CountedKmers extends AnyRef
A collection of counted k-mers represented in encoded form. Each k-mer is represented individually, making this dataset large if cached or persisted.
- final case class Discount(k: Int, minimizers: MinimizerSource = Bundled, m: Int = 10, ordering: MinimizerOrdering = Frequency, sample: Double = 0.01, maxSequenceLength: Int = 1000000, normalize: Boolean = false, method: CountMethod = Auto, partitions: Int = 200)(implicit spark: SparkSession) extends Product with Serializable
Main API entry point for Discount. Also see the command line examples in the documentation for more information on these options.
- k
k-mer length
- minimizers
source of minimizers. See MinimizerSource
- m
minimizer width
- ordering
minimizer ordering. See MinimizerOrdering
- sample
sample fraction for frequency orderings
- maxSequenceLength
max length of a single sequence (for short reads)
- normalize
whether to normalize k-mer orientation during counting. Causes every sequence to be scanned in both the forward and the reverse direction, after which only forward orientation k-mers are kept.
- method
counting method to use (or Auto for automatic selection). See CountMethod
- partitions
number of shuffle partitions/index buckets
- spark
the SparkSession
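The effect of the normalize option can be sketched in plain Scala, without Spark. This is a minimal illustration and not Discount's implementation: the object NormalizeSketch and its methods are hypothetical, sequences are plain strings rather than encoded form, and it assumes that a "forward orientation" k-mer is one that is lexicographically no greater than its reverse complement.

```scala
object NormalizeSketch {
  private val complement = Map('A' -> 'T', 'C' -> 'G', 'G' -> 'C', 'T' -> 'A')

  // Reverse complement of a sequence over the alphabet ACGT
  def reverseComplement(seq: String): String =
    seq.reverse.map(c => complement(c))

  // All k-mers of a sequence, in order of occurrence
  def kmers(seq: String, k: Int): Iterator[String] =
    seq.sliding(k).filter(_.length == k)

  // ASSUMPTION: "forward orientation" means the k-mer is not greater
  // than its reverse complement in lexicographic order
  def isForward(kmer: String): Boolean =
    kmer <= reverseComplement(kmer)

  // Scan both strands, keep only forward orientation k-mers, then count
  def normalizedCounts(seq: String, k: Int): Map[String, Int] = {
    val bothStrands = kmers(seq, k) ++ kmers(reverseComplement(seq), k)
    bothStrands.filter(isForward).toSeq
      .groupBy(identity)
      .map { case (kmer, occurrences) => kmer -> occurrences.size }
  }
}
```

Under this assumption, a k-mer and its reverse complement are counted together under a single representative, so the result does not depend on which strand was sequenced.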
- class FastaOutputFormat[K, V] extends TextOutputFormat[K, V]
- class FastaShortInput extends InputReader
Input reader for FASTA sequences of a fixed maximum length. Uses FASTAshortInputFileFormat
- class FastqShortInput extends InputReader
Input reader for FASTQ short reads. Uses FASTQInputFileFormat
- class GroupedSegments extends AnyRef
A collection of counted super-mers grouped into bins (by minimizer). Super-mers are segments of length >= k where every k-mer shares the same minimizer.
Unlike with the Index, every k-mer in the super-mers is guaranteed to be present.
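The relationship between k-mers, minimizers and super-mers can be sketched as follows. This is a simplified illustration, not the library's code: SuperMerSketch is a hypothetical name, sequences are plain strings, the minimizer ordering is assumed to be lexicographic (Discount defaults to a frequency ordering), and consecutive k-mers are grouped by minimizer value only.

```scala
object SuperMerSketch {
  // Minimizer of a k-mer: its smallest m-mer.
  // ASSUMPTION: lexicographic ordering of m-mers.
  def minimizer(kmer: String, m: Int): String =
    kmer.sliding(m).min

  // Split a sequence into (minimizer, super-mer) pairs, where each super-mer
  // is a maximal run of consecutive k-mers sharing the same minimizer.
  def superMers(seq: String, k: Int, m: Int): List[(String, String)] = {
    val starts = (0 to seq.length - k).toList
    val tagged = starts.map(i => (minimizer(seq.substring(i, i + k), m), i))
    // Each run is (minimizer, start of first k-mer, start of last k-mer)
    val runs = tagged.foldLeft(List.empty[(String, Int, Int)]) {
      case ((min, from, _) :: rest, (mz, i)) if min == mz => (min, from, i) :: rest
      case (acc, (mz, i)) => (mz, i, i) :: acc
    }
    runs.reverse.map { case (min, from, to) => (min, seq.substring(from, to + k)) }
  }
}
```

For example, `SuperMerSketch.superMers("CATGGAAT", 4, 2)` yields three super-mers with minimizers AT, GA and AA. Each has length of at least k = 4, and adjacent super-mers overlap by k - 1 characters so that no k-mer is lost.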
- final case class HashSegment(hash: BucketId, segment: ZeroNTBitArray) extends Product with Serializable
A single hashed sequence segment (super-mer) with its minimizer.
- hash
The minimizer
- segment
The super-mer
- class Index extends AnyRef
A bucketed k-mer index. Indexes store super-mers in a Dataset of ReducibleBucket, where each k-mer is associated with a tag. Typically tags are k-mer counts, and then the Index becomes a multiset of counted k-mers. Indexes are immutable, like other Spark data structures, and operations like filtering return a new Index rather than changing the existing one in place. Indexes can be combined using operations like union, intersect, and subtract, and can be written to disk in various formats. The default format used by the write() and read() methods is bucketed parquet files, which gives good data compression and avoids shuffling when the same Index is used repeatedly.
- case class IndexParams(bcSplit: Broadcast[AnyMinSplitter], buckets: Int, location: String) extends Product with Serializable
Parameters for a k-mer index.
- bcSplit
The broadcast splitter (minimizer scheme/ordering)
- buckets
The number of buckets (Spark partitions) to partition the index into. Note that this is not the same as the number of minimizer bins.
- location
The location (directory/prefix name) where the index is stored
- class IndexedFastaInput extends InputReader
Input reader for FASTA files containing potentially long sequences, with a .fai index. FAI indexes can be created with tools such as seqkit. Uses IndexedFastaFormat
- abstract class InputReader extends AnyRef
A reader that reads input data from one file using a specific Hadoop format.
- class Inputs extends AnyRef
A set of input files that can be parsed into com.jnpersson.discount.hash.InputFragment
- class Kmers extends AnyRef
Convenience methods for interacting with k-mers from a set of input files.
- trait MinimizerSource extends AnyRef
A method for obtaining a set of minimizers for given values of k and m. Except for the case of All, the sets obtained should be universal hitting sets (UHSs).
- final case class Path(path: String) extends MinimizerSource with Product with Serializable
A file, or a directory containing multiple files with names like minimizers_{k}_{m}.txt, in which case the best file will be selected. These files may specify an ordering.
- path
the file, or directory to scan
- sealed trait Rule extends Serializable
k-mer combination (reduction) rules for combining indexes. Most of these support both intersection and union. An intersection is an operation that requires the k-mer to be present in every input index, or it will not be present in the output. A union may preserve the k-mer even if it is present in only one input index. Except for the case of the union Sum reduction, indexes must be compacted prior to reduction, that is, each k-mer must occur in each index with a nonzero value only once.
These rules were inspired by the design of KMC3: https://github.com/refresh-bio/KMC
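The difference between intersection and union rules can be illustrated with a small sketch. It does not use Discount's Rule or Index types: RuleSketch, Counts, intersect and union are hypothetical names, and each index is modeled as a plain Map from k-mer to count, which is compacted by construction because a Map holds each key only once.

```scala
object RuleSketch {
  // A compacted index: each k-mer occurs once, with a single count
  type Counts = Map[String, Int]

  // Intersection: a k-mer must be present in both inputs to appear in the output
  def intersect(a: Counts, b: Counts)(reduce: (Int, Int) => Int): Counts =
    a.collect { case (kmer, x) if b.contains(kmer) => kmer -> reduce(x, b(kmer)) }

  // Union: a k-mer present in only one input is preserved with its own count;
  // the reduction is applied only where both inputs contain the k-mer
  def union(a: Counts, b: Counts)(reduce: (Int, Int) => Int): Counts =
    a.foldLeft(b) { case (acc, (kmer, x)) =>
      acc.updated(kmer, b.get(kmer).fold(x)(y => reduce(x, y)))
    }
}
```

With `a = Map("AAC" -> 2, "ACG" -> 5)` and `b = Map("ACG" -> 3, "CGT" -> 1)`, an intersection with a minimum reduction keeps only ACG, whereas a union with a sum reduction keeps all three k-mers and adds the two counts for ACG.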
- class Sampling extends AnyRef
Routines for creating and managing frequency sampled minimizer orderings.
Value Members
- object All extends MinimizerSource with Product with Serializable
Use all m-mers as minimizers. Can be auto-generated for any m. The initial ordering is lexicographic.
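To illustrate why this set can be generated for any m, the following sketch enumerates all 4^m m-mers over the alphabet ACGT in lexicographic order. AllMmersSketch is a hypothetical name used only for this example, and the library may generate and encode its minimizers differently.

```scala
object AllMmersSketch {
  private val nucleotides = "ACGT"

  // Enumerate every m-mer in lexicographic order by extending each
  // (m - 1)-mer with A, C, G and T in turn
  def allMmers(m: Int): Iterator[String] =
    if (m <= 0) Iterator("")
    else allMmers(m - 1).flatMap(prefix => nucleotides.iterator.map(nt => prefix + nt))
}
```

Because the result is an Iterator, the m-mers are produced lazily. The full set for the default m = 10 has 4^10 = 1048576 entries.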
- object Auto extends CountMethod with Product with Serializable
Indicates that a counting strategy should be auto-selected.
- object Bundled extends MinimizerSource with Product with Serializable
Bundled minimizers on the classpath (only available for some values of k and m).
- object Discount extends SparkTool with Serializable
Main command-line interface to Discount.
- object GroupedSegments
- object HDFSUtil
HDFS helper routines
- object Helpers
- object Index
- object IndexParams extends Serializable
- object Output
Output format helper methods
- object Pregrouped extends CountMethod with Product with Serializable
Pregrouped counting: groups and counts identical super-mers before counting k-mers. Faster for datasets with high redundancy.
- object Rule extends Serializable
- object Sampling
- object Simple extends CountMethod with Product with Serializable
Non-pregrouped: counts k-mers immediately. Faster for datasets with low redundancy.
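The trade-off between the Pregrouped and Simple strategies can be sketched on plain Scala collections. This is an illustration under simplifying assumptions, not the Spark implementation: CountMethodSketch and its methods are hypothetical, and super-mers are plain strings held in memory.

```scala
object CountMethodSketch {
  // All k-mers of a super-mer
  def kmers(superMer: String, k: Int): Iterator[String] =
    superMer.sliding(k).filter(_.length == k)

  // Simple: expand every super-mer into k-mers immediately, then count
  def simple(superMers: Seq[String], k: Int): Map[String, Long] =
    superMers.flatMap(s => kmers(s, k))
      .groupBy(identity)
      .map { case (kmer, xs) => kmer -> xs.size.toLong }

  // Pregrouped: count identical super-mers first, then expand each distinct
  // super-mer once and weight its k-mers by the super-mer's count
  def pregrouped(superMers: Seq[String], k: Int): Map[String, Long] = {
    val grouped = superMers.groupBy(identity).map { case (s, xs) => s -> xs.size.toLong }
    val weighted = grouped.toSeq.flatMap { case (s, n) => kmers(s, k).map(_ -> n) }
    weighted.groupBy(_._1).map { case (kmer, xs) => kmer -> xs.map(_._2).sum }
  }
}
```

Both methods produce the same counts. When many super-mers are identical, pregrouping expands far fewer k-mers, while for data with few repeated super-mers the extra grouping step is pure overhead.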