
NAME

fu-uniq - Dereplicate sequences and generate abundance information

SYNOPSIS

fu-uniq [options] input.fa > uniq.fa

DESCRIPTION

fu-uniq dereplicates DNA sequences and reports how often each one occurs. It identifies unique sequences by exact sequence matching and can record their abundance using USEARCH-style labels. The output format is customizable.

Key features:

- Dereplicates sequences while maintaining abundance information
- Supports USEARCH-style size annotations
- Flexible sequence naming options
- Handles both FASTA and FASTQ inputs
- Processes gzipped files automatically
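
The idea of dereplication with abundance tracking can be illustrated with a minimal Python sketch. This is not the fu-uniq implementation: the helper names (`parse_fasta`, `dereplicate`), the default `prefix` and `separator` values, and the exact label layout are assumptions chosen for illustration. Only the `;size=N;` annotation follows the USEARCH convention.

```python
from collections import Counter


def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def dereplicate(records, prefix="seq", separator="."):
    """Collapse identical sequences and label each with its abundance.

    Hypothetical naming scheme: <prefix><separator><rank>;size=<count>;
    """
    # Exact matching: two records are duplicates only if the strings are equal.
    counts = Counter(seq for _, seq in records)
    # most_common() orders by decreasing count; ties keep first-seen order.
    for rank, (seq, size) in enumerate(counts.most_common(), start=1):
        yield f"{prefix}{separator}{rank};size={size};", seq


fasta = """>r1
ACGT
>r2
TTGA
>r3
ACGT
>r4
ACGT
""".splitlines()

unique = list(dereplicate(parse_fasta(fasta)))
for label, seq in unique:
    print(f">{label}")
    print(seq)
```

In this toy input, `ACGT` occurs three times and `TTGA` once, so the sketch emits two records, most abundant first.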

OPTIONS

Sequence Processing

Output Formatting

EXAMPLES

Basic deduplication:

# Find unique sequences and add abundance information
fu-uniq input.fa > uniq.fa

Keep only abundant sequences:

# Keep sequences that appear at least 10 times
fu-uniq -m 10 input.fa > abundant.fa
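
For downstream scripts that need to apply a similar threshold to already-dereplicated records, the following Python sketch reads the abundance back from a USEARCH-style `;size=N;` label. The function names and the choice to treat unlabeled records as size 1 are assumptions for illustration, not behavior documented for fu-uniq.

```python
import re

# USEARCH-style annotation: ";size=<integer>" with an optional trailing ";"
SIZE_RE = re.compile(r";size=(\d+);?")


def size_of(label):
    """Return the abundance encoded in a label, or 1 if none is present (assumption)."""
    match = SIZE_RE.search(label)
    return int(match.group(1)) if match else 1


def filter_by_size(records, min_size):
    """Keep (label, sequence) pairs whose abundance is at least min_size."""
    return [(label, seq) for label, seq in records if size_of(label) >= min_size]


records = [
    ("seq.1;size=12;", "ACGT"),
    ("seq.2;size=3;", "TTGA"),
    ("seq.3", "GGCC"),  # no annotation, treated as size 1
]
abundant = filter_by_size(records, 10)
print(abundant)
```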

Custom sequence naming:

# Use custom prefix and separator
fu-uniq -p 'cluster' -s '_' input.fa > clusters.fa

Process multiple files:

# Combine and deduplicate multiple files
fu-uniq file1.fa file2.fa > combined_uniq.fa

Add size as comment:

# Place size information in sequence comment
fu-uniq --size-as-comment input.fa > commented.fa

NOTES

MODERN ALTERNATIVE

This suite of tools has been superseded by SeqFu, a compiled program providing faster and safer tools for sequence analysis. This suite is still maintained because Perl scripts can be more portable under certain circumstances.

SeqFu is available at https://github.com/telatin/seqfu2 and can be installed with BioConda:

conda install -c bioconda seqfu

CITING

Telatin A, Fariselli P, Birolo G. SeqFu: A Suite of Utilities for the Robust and Reproducible Manipulation of Sequence Files. Bioengineering 2021, 8, 59. https://doi.org/10.3390/bioengineering8050059