package spark

Provides classes and routines for running on Apache Spark. The main entry point is the Discount class. Once configured, it can be used to generate other classes of interest, such as GroupedSegments and CountedKmers.

Type Members

type AnyMinSplitter = MinSplitter[MinimizerPriorities]
sealed trait CountMethod extends AnyRef
Defines a strategy for counting k-mers in Spark.
class CountedKmers extends AnyRef
A collection of counted k-mers represented in encoded form. Each k-mer is represented individually, making this dataset large if cached or persisted.
final case class Discount(k: Int, minimizers: MinimizerSource = Bundled, m: Int = 10, ordering: MinimizerOrdering = Frequency, sample: Double = 0.01, maxSequenceLength: Int = 1000000, normalize: Boolean = false, method: CountMethod = Auto, partitions: Int = 200)(implicit spark: SparkSession) extends Product with Serializable
Main API entry point for Discount. Also see the command line examples in the documentation for more information on these options.
k
k-mer length
minimizers
source of minimizers. See MinimizerSource
m
minimizer width
ordering
minimizer ordering. See MinimizerOrdering
sample
sample fraction for frequency orderings
maxSequenceLength
max length of a single sequence (for short reads)
normalize
whether to normalize k-mer orientation during counting. Causes every sequence to be scanned in both forward and reverse, after which only forward orientation k-mers are kept.
method
counting method to use (Auto for automatic selection). See CountMethod
partitions
number of shuffle partitions/index buckets
spark
the SparkSession
class FastaOutputFormat[K, V] extends TextOutputFormat[K, V]
class FastaShortInput extends InputReader
Input reader for FASTA sequences of a fixed maximum length. Uses FASTAshortInputFileFormat
class FastqShortInput extends InputReader
Input reader for FASTQ short reads. Uses FASTQInputFileFormat
class GroupedSegments extends AnyRef
A collection of counted super-mers grouped into bins (by minimizer). Super-mers are segments of length >= k where every k-mer shares the same minimizer.
Unlike with the Index, every k-mer in the super-mers is guaranteed to be present.
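The super-mer definition above can be illustrated with a small, self-contained sketch. It is not Discount's implementation: it works on plain strings rather than encoded Spark data, uses the lexicographically smallest m-mer as a stand-in minimizer ordering, and groups consecutive k-mers by minimizer value rather than by minimizer position. The names SuperMerSketch, minimizer, and superMers are hypothetical.

```scala
// Illustrative only: not part of the Discount API.
object SuperMerSketch {
  /** Lexicographically smallest m-mer in a k-mer (assumed stand-in ordering). */
  def minimizer(kmer: String, m: Int): String =
    kmer.sliding(m).min

  /** Split seq into super-mers: maximal runs of consecutive k-mers that share
    * a minimizer. Returns (minimizer, super-mer) pairs; each super-mer has
    * length >= k. */
  def superMers(seq: String, k: Int, m: Int): List[(String, String)] = {
    require(m <= k && k <= seq.length, "need m <= k <= sequence length")
    val kmerStarts = 0 to (seq.length - k)
    val minimizers = kmerStarts.map(i => minimizer(seq.substring(i, i + k), m))
    // Walk the k-mer start positions, opening a new run whenever the minimizer changes.
    // Each run is (minimizer, first k-mer start, last k-mer start).
    val runs = kmerStarts.foldLeft(List.empty[(String, Int, Int)]) {
      case ((mn, start, _) :: rest, i) if mn == minimizers(i) => (mn, start, i) :: rest
      case (acc, i) => (minimizers(i), i, i) :: acc
    }
    runs.reverse.map { case (mn, first, last) => (mn, seq.substring(first, last + k)) }
  }
}
```

For example, SuperMerSketch.superMers("TTACGGG", 4, 2) yields List(("AC", "TTACGG"), ("CG", "CGGG")): the first three 4-mers share the minimizer AC and merge into one super-mer, while the last 4-mer has minimizer CG.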
final case class HashSegment(hash: BucketId, segment: ZeroNTBitArray) extends Product with Serializable
A single hashed sequence segment (super-mer) with its minimizer.
hash
The minimizer
segment
The super-mer
class Index extends AnyRef
A bucketed k-mer index. Indexes store super-mers in a Dataset of ReducibleBucket, where each k-mer is associated with a tag. Typically tags are k-mer counts, and then the Index becomes a multiset of counted k-mers. Indexes are immutable, like other Spark data structures, and operations like filtering return a new Index rather than changing the existing one in place. Indexes can be combined using operations like union, intersect, and subtract, and can be written to disk in various formats. The default format used by the write() and read() methods is bucketed parquet files, which gives good data compression and avoids shuffling when the same Index is used repeatedly.
case class IndexParams(bcSplit: Broadcast[AnyMinSplitter], buckets: Int, location: String) extends Product with Serializable
Parameters for a k-mer index.
bcSplit
The broadcast splitter (minimizer scheme/ordering)
buckets
The number of buckets (Spark partitions) to partition the index into - NB, not the same as minimizer bins
location
The location (directory/prefix name) where the index is stored
class IndexedFastaInput extends InputReader
Input reader for FASTA files containing potentially long sequences, with a .fai index. FAI indexes can be created with tools such as seqkit. Uses IndexedFastaFormat.
abstract class InputReader extends AnyRef
A reader that reads input data from one file using a specific Hadoop format.
class Inputs extends AnyRef
A set of input files that can be parsed into com.jnpersson.discount.hash.InputFragment
class Kmers extends AnyRef
Convenience methods for interacting with k-mers from a set of input files.
trait MinimizerSource extends AnyRef
A method for obtaining a set of minimizers for given values of k and m. Except for the case of All, the sets obtained should be universal hitting sets (UHSs).
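A universal hitting set for given k and m is a set of m-mers such that every possible k-mer contains at least one of them. The brute-force check below is a minimal sketch of that property over the DNA alphabet A, C, G, T. It is only practical for small k, and the names UhsSketch, allMers, and isUniversalHittingSet are hypothetical rather than part of Discount.

```scala
// Illustrative only: not part of the Discount API.
object UhsSketch {
  private val Alphabet = "ACGT"

  /** All strings of the given length over A, C, G, T (4^length of them). */
  def allMers(length: Int): Iterator[String] =
    if (length == 0) Iterator("")
    else allMers(length - 1).flatMap(prefix => Alphabet.iterator.map(c => prefix + c))

  /** True if every possible k-mer contains at least one m-mer from the candidate set. */
  def isUniversalHittingSet(candidates: Set[String], k: Int, m: Int): Boolean =
    allMers(k).forall(kmer => kmer.sliding(m).exists(candidates.contains))
}
```

The set of all m-mers trivially passes this check, which corresponds to the All source. A set such as Set("AA") fails for k = 4, m = 2, because the 4-mer CCCC contains no member of the set.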
final case class Path(path: String) extends MinimizerSource with Product with Serializable
A file, or a directory containing multiple files with names like minimizers_{k}_{m}.txt, in which case the best file will be selected. These files may specify an ordering.
path
the file, or directory to scan
sealed trait Rule extends Serializable
k-mer combination (reduction) rules for combining indexes. Most of these support both intersection and union. An intersection is an operation that requires the k-mer to be present in every input index, or it will not be present in the output. A union may preserve the k-mer even if it is present in only one input index. Except for the case of the union Sum reduction, indexes must be compacted prior to reduction, that is, each k-mer must occur in each index with a nonzero value only once.
These rules were inspired by the design of KMC3: https://github.com/refresh-bio/KMC
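The difference between union and intersection can be sketched with in-memory maps from k-mer to count. This is an assumption-laden illustration, not the Rule API: a Map is compacted by construction (each k-mer appears once), and the names RuleSketch, Counts, union, and intersect are hypothetical.

```scala
// Illustrative only: not part of the Discount API.
object RuleSketch {
  type Counts = Map[String, Long]

  /** Union: keep every k-mer found in either index; combine counts where both have it. */
  def union(a: Counts, b: Counts)(combine: (Long, Long) => Long): Counts =
    (a.keySet ++ b.keySet).iterator.map { kmer =>
      val merged = (a.get(kmer), b.get(kmer)) match {
        case (Some(x), Some(y)) => combine(x, y)
        case (Some(x), None)    => x
        case (None, Some(y))    => y
        case (None, None)       => 0L // unreachable: kmer came from one of the key sets
      }
      kmer -> merged
    }.toMap

  /** Intersection: keep only k-mers present in both indexes. */
  def intersect(a: Counts, b: Counts)(combine: (Long, Long) => Long): Counts =
    a.collect { case (kmer, x) if b.contains(kmer) => kmer -> combine(x, b(kmer)) }
}
```

With a = Map("ACG" -> 2L, "CGT" -> 1L) and b = Map("CGT" -> 3L, "GTA" -> 5L), a sum-style union RuleSketch.union(a, b)(_ + _) keeps all three k-mers with CGT counted 4 times, while a min-style intersection RuleSketch.intersect(a, b)(_ min _) keeps only CGT with count 1.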
class Sampling extends AnyRef
Routines for creating and managing frequency sampled minimizer orderings.

Value Members

object All extends MinimizerSource with Product with Serializable
Use all m-mers as minimizers. Can be auto-generated for any m. The initial ordering is lexicographic.
object Auto extends CountMethod with Product with Serializable
Indicates that a counting strategy should be selected automatically.
object Bundled extends MinimizerSource with Product with Serializable
Bundled minimizers on the classpath (only available for some values of k and m).
object Discount extends SparkTool with Serializable
Main command-line interface to Discount.
object GroupedSegments
object HDFSUtil
HDFS helper routines
object Helpers
object Index
object IndexParams extends Serializable
object Output
Output format helper methods
object Pregrouped extends CountMethod with Product with Serializable
Pregrouped counting: groups and counts identical super-mers before counting k-mers. Faster for datasets with high redundancy.
object Rule extends Serializable
object Sampling
object Simple extends CountMethod with Product with Serializable
Non-pregrouped: counts k-mers immediately. Faster for datasets with low redundancy.
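The two counting strategies described under Pregrouped and Simple can be contrasted with a minimal sketch on plain Scala collections. It assumes super-mers are given as strings and is not how Discount implements either method on Spark; the names CountMethodSketch, kmers, countSimple, and countPregrouped are hypothetical.

```scala
// Illustrative only: not part of the Discount API.
object CountMethodSketch {
  /** All k-mers of a segment, in order. */
  def kmers(segment: String, k: Int): Iterator[String] =
    segment.sliding(k).filter(_.length == k)

  /** Simple-style: split every super-mer into k-mers right away, then count. */
  def countSimple(superMers: Seq[String], k: Int): Map[String, Long] =
    superMers.flatMap(kmers(_, k))
      .groupBy(identity)
      .map { case (kmer, occurrences) => kmer -> occurrences.size.toLong }

  /** Pregrouped-style: count identical super-mers first, then split each distinct
    * super-mer once and weight its k-mers by the super-mer's count. */
  def countPregrouped(superMers: Seq[String], k: Int): Map[String, Long] = {
    val grouped = superMers.groupBy(identity).map { case (s, xs) => s -> xs.size.toLong }
    val weighted = grouped.toSeq.flatMap { case (s, n) => kmers(s, k).map(_ -> n) }
    weighted.groupBy(_._1).map { case (kmer, xs) => kmer -> xs.map(_._2).sum }
  }
}
```

Both functions return the same counts. The pregrouped variant splits each distinct super-mer only once, which is why it tends to pay off when many super-mers repeat, and adds an extra grouping step that is wasted when few do.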
