How to run arraylib interactively
First we import the LibraryExperiment class from the package.
[2]:
from arraylib.libraryexperiment import LibraryExperiment
We generate an experiment object by instantiating the LibraryExperiment class with our chosen parameters.
Required parameters:
- input_dir: path to the directory holding the input fastq files
- exp_design: path to a csv file describing the experimental design (values should be separated by commas). The experimental design file should have the columns Filename, Poolname and Pooldimension (see the example in tests/test_data/full_exp_design.csv).
  - Filename should contain all the unique input fastq filenames.
  - Poolname should indicate to which pool a given file belongs. Multiple files per poolname are allowed.
  - Pooldimension indicates the pooling dimension a pool belongs to. All pools sharing the same pooling dimension should have the same string in the Pooldimension column.

An example of what an exp_design file could look like:
Filename | Poolname | Pooldimension |
---|---|---|
column1.fastq | column1 | columns |
column2.fastq | column2 | columns |
row1.fastq | row1 | rows |
row2.fastq | row2 | rows |
platerow1.fastq | platerow1 | platerows |
platerow2.fastq | platerow2 | platerows |
platecol1.fastq | platecol1 | platecols |
platecol2.fastq | platecol2 | platecols |
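The layout described above can be written and sanity-checked with Python's standard csv module. This is a minimal sketch and not part of arraylib: the helper name check_exp_design and the temporary output path are hypothetical, and the checks only mirror the rules stated above (three required columns, unique filenames).

```python
import csv
import os
import tempfile

REQUIRED_COLUMNS = ["Filename", "Poolname", "Pooldimension"]


def check_exp_design(path):
    """Hypothetical helper: return the rows of an exp_design csv after basic checks."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        rows = list(reader)
    filenames = [row["Filename"] for row in rows]
    if len(filenames) != len(set(filenames)):
        raise ValueError("Filename entries must be unique")
    return rows


# Write a small design file (four rows of the example table) and check it.
design = [
    ("column1.fastq", "column1", "columns"),
    ("column2.fastq", "column2", "columns"),
    ("row1.fastq", "row1", "rows"),
    ("row2.fastq", "row2", "rows"),
]
design_path = os.path.join(tempfile.mkdtemp(), "exp_design.csv")
with open(design_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(REQUIRED_COLUMNS)
    writer.writerows(design)

rows = check_exp_design(design_path)
dimensions = sorted({row["Pooldimension"] for row in rows})
print(dimensions)
```

Running a check like this before instantiating the experiment catches misspelled column headers early, before any reads are processed.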
- gb_ref: path to the genbank reference file
- bowtie_ref: path to the bowtie index files, ending with the basename of your index (if the basename of your index is UTI89 and you store your bowtie2 [1] references in bowtie_ref, it should be bowtie_ref/UTI89). Please visit https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-build-indexer for a manual on how to create bowtie2 indices.
- tn_seq: transposon sequence (e.g. AGATGTGTATAAGAGACAG)
- bar_upstream: upstream sequence of the barcode (e.g. CGAGGTCTCT)
- bar_downstream: downstream sequence of the barcode (e.g. CGTACGCTGC)
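Because bowtie_ref must end with the index basename rather than a directory, it can help to confirm that the index files exist under that prefix. The sketch below is an illustration only and not part of arraylib: missing_index_files is a hypothetical helper, and the suffix list is the set of files bowtie2-build writes for a small index (large indices use the .bt2l extension instead). Empty placeholder files stand in for a real index here.

```python
import os
import tempfile

# File suffixes written by bowtie2-build for a small index.
BT2_SUFFIXES = [".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2"]


def missing_index_files(bowtie_ref, suffixes=BT2_SUFFIXES):
    """Hypothetical helper: list the expected index files that do not exist."""
    return [bowtie_ref + s for s in suffixes if not os.path.isfile(bowtie_ref + s)]


# Demonstration with empty placeholder files standing in for a real index.
index_dir = tempfile.mkdtemp()
bowtie_ref = os.path.join(index_dir, "UTI89")
for suffix in BT2_SUFFIXES[:-1]:  # leave one file out on purpose
    open(bowtie_ref + suffix, "w").close()

missing = missing_index_files(bowtie_ref)
print([os.path.basename(p) for p in missing])
```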
Optional parameters:
- map_quality: minimum bowtie2 alignment quality score for a read to be included
- seq_quality: minimum phred score each base must reach for a read to be included
- tn_mismatches: number of transposon mismatches allowed
- filter_thr: threshold for the local filter (e.g. a threshold of 0.05 would filter out all reads < 0.05 of the maximum read count for a given mutant)
- global_filter_thr: threshold for the global filter (all reads below global_filter_thr will be set to 0)
- min_counts: minimum counts of a barcode to be included in the analysis
- use_barcodes: whether to perform deconvolution based only on barcodes, without genomic alignment
You can also refer to the API to get a detailed description of the input parameter choices.
[3]:
experiment = LibraryExperiment(cores=4,
                               map_quality=30,
                               seq_quality=10,
                               gb_ref="gb_ref",
                               bowtie_ref="bowtie_ref/UTI89",
                               tn_seq="AGATGTGTATAAGAGACAG",
                               tn_mismatches=2,
                               input_dir="input",
                               exp_design="full_exp_design.csv",
                               use_barcodes=False,
                               bar_upstream="CGAGGTCTCT",
                               bar_downstream="CGTACGCTGC",
                               filter_thr=0.05,
                               global_filter_thr=5,
                               min_counts=5)
Read trimming and genomic alignment
We scan each read in the input fastq files in input_dir for the provided transposon sequence tn_seq, allowing up to the chosen number of tn_mismatches. The barcode sequences are detected between the provided bar_upstream and bar_downstream sequences, and the genomic part of each read downstream of the transposon sequence is collected in a compiled .fastq file. As a side effect, a temp directory is created and the compiled and trimmed reads are saved to temp/trimmed_sequences.fastq.
[4]:
# detect transposon and trim sequences
experiment.get_genomic_seq()
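To make the trimming step concrete, the sketch below shows one simple way to locate a transposon sequence with a bounded number of mismatches and to cut a barcode out from between two flanking sequences. It is an illustration under stated assumptions, not arraylib's implementation: the function names are hypothetical, mismatches are counted as substitutions only (Hamming distance), the barcode flanks are matched exactly, and the read is made up.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_transposon_end(read, tn_seq, max_mismatches):
    """Return the index just past the first tn_seq match, or None if absent."""
    for start in range(len(read) - len(tn_seq) + 1):
        window = read[start:start + len(tn_seq)]
        if hamming(window, tn_seq) <= max_mismatches:
            return start + len(tn_seq)
    return None


def extract_barcode(read, upstream, downstream):
    """Return the sequence between exact upstream and downstream flanks, or None."""
    up = read.find(upstream)
    if up == -1:
        return None
    begin = up + len(upstream)
    end = read.find(downstream, begin)
    return read[begin:end] if end != -1 else None


tn_seq = "AGATGTGTATAAGAGACAG"
# Made-up read: barcode flanks, transposon with one mismatch, then genomic sequence.
read = "CGAGGTCTCT" + "ACGTACGT" + "CGTACGCTGC" + "AGATGTGTATAAGAGTCAG" + "TTGACCGGATA"

end = find_transposon_end(read, tn_seq, max_mismatches=2)
genomic = read[end:] if end is not None else None
barcode = extract_barcode(read, "CGAGGTCTCT", "CGTACGCTGC")
print(barcode, genomic)
```

Only the part stored in genomic would be passed on to the aligner. The barcode is kept alongside it so that reads can later be grouped per barcode.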
Then we use bowtie2 to align the trimmed reads to the reference genome in bowtie_ref. The output of bowtie2 is parsed and stored in temp/alignment_result.csv.
[5]:
# align trimmed sequences to reference using bowtie2
experiment.align_genomic_seq()
624483 reads; of these:
624483 (100.00%) were unpaired; of these:
0 (0.00%) aligned 0 times
624483 (100.00%) aligned exactly 1 time
0 (0.00%) aligned >1 times
100.00% overall alignment rate
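bowtie2 reports alignments in SAM format, so parsing its output comes down to reading a few tab-separated fields per read. The following sketch shows how position, strand and mapping quality can be pulled from SAM records and filtered by a minimum quality. It is not arraylib's parser: the function names and the SAM records are made up, and treating map_quality as an inclusive lower bound is an assumption.

```python
def parse_sam_line(line):
    """Extract read name, reference, 1-based position, strand and MAPQ from a SAM record."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "read": fields[0],
        "reference": fields[2],
        "position": int(fields[3]),
        "orientation": "-" if flag & 16 else "+",  # bit 0x10: read maps to the reverse strand
        "mapped": not flag & 4,                    # bit 0x4: read is unmapped
        "mapq": int(fields[4]),
    }


def filter_alignments(sam_lines, map_quality):
    """Keep mapped records whose MAPQ is at least map_quality; skip header lines."""
    records = (parse_sam_line(line) for line in sam_lines if not line.startswith("@"))
    return [r for r in records if r["mapped"] and r["mapq"] >= map_quality]


# Made-up SAM records for illustration only.
sam_lines = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tNC_007946.1\t49\t42\t11M\t*\t0\t0\tTTGACCGGATA\tIIIIIIIIIII",
    "read2\t16\tNC_007946.1\t50\t12\t11M\t*\t0\t0\tTTGACCGGATA\tIIIIIIIIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tTTGACCGGATA\tIIIIIIIIIII",
]
kept = filter_alignments(sam_lines, map_quality=30)
print(kept)
```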
Assembly of count matrix and preprocessing
Once the alignment is finished, we can count the reads of each mutant in each pool and assemble the count matrix using write_count_matrix().
[6]:
# assemble count matrix from aligned reads
experiment.write_count_matrix()
Generating count matrix with 7000 mutants
Generated count matrix!
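Conceptually, the count matrix has one row per mutant and one column per pool. The sketch below builds such a matrix from a list of aligned reads with collections.Counter. It is only an illustration: identifying a mutant by (reference, position, orientation) is an assumption based on the columns shown in the tables in this tutorial, build_count_matrix is a hypothetical helper, and the alignments are made up.

```python
from collections import Counter


def build_count_matrix(alignments, pools):
    """Count reads per (reference, position, orientation) mutant in every pool."""
    counts = Counter(
        ((a["reference"], a["position"], a["orientation"]), a["pool"])
        for a in alignments
    )
    mutants = sorted({mutant for mutant, _ in counts})
    # One row per mutant, with counts listed in the order given by pools.
    return {mutant: [counts[(mutant, pool)] for pool in pools] for mutant in mutants}


# Made-up aligned reads, each tagged with the pool its fastq file belongs to.
alignments = [
    {"reference": "NC_007946.1", "position": 49, "orientation": "-", "pool": "11"},
    {"reference": "NC_007946.1", "position": 49, "orientation": "-", "pool": "11"},
    {"reference": "NC_007946.1", "position": 49, "orientation": "-", "pool": "PR02"},
    {"reference": "NC_007946.1", "position": 54, "orientation": "+", "pool": "10"},
]
pools = ["10", "11", "PR02"]

matrix = build_count_matrix(alignments, pools)
for mutant, row in matrix.items():
    print(mutant, row)
```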
write_count_matrix() generates 4 different count matrices stored as attributes:
- raw_count_mat: stores the raw reads per mutant and pool
- count_mat: stores the raw reads per mutant and pool, but with spurious barcodes filtered out
- normalized_count_mat: stores the reads from count_mat, but normalized to counts per million
- filtered_count_mat: stores the normalized reads, but with filters applied according to filter_thr and global_filter_thr

In practice it is beneficial to apply filters to the normalized count matrix, because spurious reads are often detected. Spurious reads make the deconvolution and location inference much harder, as many mutants become ambiguous. Therefore, we usually want to continue with filtered_count_mat.
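The normalization and the two filters can be sketched in a few lines of plain Python. This is a simplified illustration, not arraylib's code: the function names are hypothetical, the sketch assumes that counts per million are computed per pool (column) and that the local filter compares each entry with the maximum of its mutant's row across all pools, and the numbers are made up. The library may apply these steps in a different order or per pooling dimension.

```python
def counts_per_million(column):
    """Scale the counts of one pool so that they sum to one million."""
    total = sum(column)
    return [c * 1_000_000 / total if total else 0.0 for c in column]


def local_filter(row, filter_thr):
    """Zero every entry below filter_thr times the maximum of the mutant's row."""
    cutoff = filter_thr * max(row)
    return [v if v >= cutoff else 0 for v in row]


def global_filter(row, global_thr):
    """Zero every entry below the absolute threshold global_thr."""
    return [v if v >= global_thr else 0 for v in row]


pool_counts = [10, 30, 60]        # made-up counts of three mutants in one pool
mutant_row = [200, 8, 60, 3]      # made-up counts of one mutant in four pools

cpm = counts_per_million(pool_counts)
locally_filtered = local_filter(mutant_row, filter_thr=0.05)   # cutoff is 10.0
globally_filtered = global_filter(mutant_row, global_thr=5)
print(cpm, locally_filtered, globally_filtered)
```

The local filter removes the 8 because it is small relative to the mutant's strongest pool, while the global filter keeps it because it exceeds the absolute threshold. The two filters therefore target different kinds of spurious reads.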
[7]:
experiment.count_mat
[7]:
| | Feature | Orientation | Barcode | Reference | 1 | 10 | 11 | 12 | 2 | 3 | ... | PC07 | PC08 | PR01 | PR02 | PR03 | PR04 | PR05 | PR06 | PR07 | PR08 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | - | nan | NC_007946.1 | 0 | 0 | 10 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 50 | - | nan | NC_007946.1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 |
2 | 51 | - | nan | NC_007946.1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 |
3 | 53 | - | nan | NC_007946.1 | 0 | 0 | 0 | 0 | 13 | 0 | ... | 10 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 54 | + | nan | NC_007946.1 | 0 | 13 | 0 | 0 | 0 | 0 | ... | 0 | 14 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6995 | 7045 | - | nan | NC_007946.1 | 0 | 10 | 0 | 0 | 0 | 14 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 0 | 11 | 0 |
6996 | 7046 | + | nan | NC_007946.1 | 0 | 10 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 13 | 0 | 0 | 12 | 11 | 0 | 0 |
6997 | 7047 | - | nan | NC_007946.1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 14 | 12 | 0 | 0 | 0 | 11 | 0 |
6998 | 7048 | + | nan | NC_007946.1 | 14 | 0 | 0 | 0 | 10 | 11 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 14 |
6999 | 7050 | + | nan | NC_007946.1 | 0 | 0 | 10 | 0 | 14 | 0 | ... | 12 | 0 | 14 | 0 | 0 | 11 | 13 | 0 | 0 | 0 |
7000 rows × 40 columns
[8]:
experiment.filtered_count_mat
[8]:
| | Feature | Orientation | Barcode | Reference | 1 | 10 | 11 | 12 | 2 | 3 | ... | PC07 | PC08 | PR01 | PR02 | PR03 | PR04 | PR05 | PR06 | PR07 | PR08 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | - | nan | NC_007946.1 | 0.00000 | 0.000000 | 734.322221 | 0.0 | 0.000000 | 0.000000 | ... | 0.00000 | 0.000000 | 0.000000 | 732.945919 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
1 | 50 | - | nan | NC_007946.1 | 0.00000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.00000 | 675.465032 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 664.349959 |
2 | 51 | - | nan | NC_007946.1 | 0.00000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 581.856652 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
3 | 53 | - | nan | NC_007946.1 | 0.00000 | 0.000000 | 0.000000 | 0.0 | 980.022616 | 0.000000 | ... | 496.15480 | 0.000000 | 509.735957 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
4 | 54 | + | nan | NC_007946.1 | 0.00000 | 981.798958 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.00000 | 727.423880 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 721.277692 | 0.000000 | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6995 | 7045 | - | nan | NC_007946.1 | 0.00000 | 755.229968 | 0.000000 | 0.0 | 0.000000 | 1003.152766 | ... | 0.00000 | 0.000000 | 662.656744 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 549.862534 | 0.000000 |
6996 | 7046 | + | nan | NC_007946.1 | 0.00000 | 755.229968 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.00000 | 0.000000 | 0.000000 | 680.592639 | 0.000000 | 0.000000 | 629.954328 | 566.718187 | 0.000000 | 0.000000 |
6997 | 7047 | - | nan | NC_007946.1 | 0.00000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.00000 | 0.000000 | 0.000000 | 732.945919 | 635.930048 | 0.000000 | 0.000000 | 0.000000 | 549.862534 | 0.000000 |
6998 | 7048 | + | nan | NC_007946.1 | 1108.64745 | 0.000000 | 0.000000 | 0.0 | 753.863551 | 788.191459 | ... | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 682.450522 | 0.000000 | 0.000000 | 715.453802 |
6999 | 7050 | + | nan | NC_007946.1 | 0.00000 | 0.000000 | 734.322221 | 0.0 | 1055.408971 | 0.000000 | ... | 595.38576 | 0.000000 | 713.630339 | 0.000000 | 0.000000 | 581.856652 | 682.450522 | 0.000000 | 0.000000 | 0.000000 |
7000 rows × 40 columns
Deconvolution and location inference
We then use the count matrix to infer the most likely location of each mutant using deconvolve().
[9]:
# Inferring most likely locations and writing location summary
experiment.deconvolve()
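The core idea behind the deconvolution is combinatorial: a mutant that has reads in exactly one pool per pooling dimension maps to a single well, whereas reads in several pools of a dimension produce several candidate wells. The sketch below enumerates these candidates with itertools.product. It covers only this enumeration and is not arraylib's inference procedure, which also has to choose predicted wells among the candidates. The helper name possible_wells and the row_column_platecol_platerow naming order are assumptions taken from the well names shown in this tutorial.

```python
from itertools import product


def possible_wells(pool_counts, pool_dimensions, dimension_order):
    """Combine the pools with reads, one per dimension, into candidate well names."""
    hits = {dim: [] for dim in dimension_order}
    for pool in sorted(pool_counts):
        if pool_counts[pool] > 0:
            hits[pool_dimensions[pool]].append(pool)
    if any(not hits[dim] for dim in dimension_order):
        return []  # a dimension without reads means the mutant cannot be placed
    return ["_".join(combo) for combo in product(*(hits[d] for d in dimension_order))]


dimension_order = ["rows", "columns", "platecols", "platerows"]
pool_dimensions = {
    "D": "rows", "F": "rows",
    "3": "columns", "11": "columns",
    "PC02": "platecols", "PR02": "platerows",
}

# Reads in one pool per dimension: a single candidate well.
unambiguous = possible_wells(
    {"11": 10.0, "F": 12.0, "PC02": 12.0, "PR02": 14.0}, pool_dimensions, dimension_order
)
# Reads in two rows and two columns: 2 x 2 candidate wells.
ambiguous = possible_wells(
    {"11": 10.0, "3": 9.0, "D": 11.0, "F": 12.0, "PC02": 12.0, "PR02": 14.0},
    pool_dimensions, dimension_order,
)
print(unambiguous, ambiguous)
```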
We can access the deconvolution results with location_summary to get detailed information about each detected mutant. To get information about each well of the arrayed library, use transposed_location_summary.
[11]:
experiment.location_summary
[11]:
| | Reference | Coordinate | Gene | Locus_tag | Predicted_wells | Gene_orientation | Transposon_orientation | Percent_from_start | Gene_hit | Ambiguity | Possible_wells | Pool_counts | Barcode |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NC_007946.1 | 49 | NaN | NaN | F_11_PC02_PR02 | NaN | - | NaN | False | Unambiguous | F_11_PC02_PR02 | 11:10.0;F:12.0;PC02:12.0;PR02:14.0 | nan |
1 | NC_007946.1 | 50 | NaN | NaN | F_5_PC08_PR08 | NaN | - | NaN | False | Unambiguous | F_5_PC08_PR08 | 5:13.0;F:10.0;PC08:13.0;PR08:13.0 | nan |
2 | NC_007946.1 | 51 | NaN | NaN | D_8_PC05_PR04 | NaN | - | NaN | False | Unambiguous | D_8_PC05_PR04 | 8:10.0;D:10.0;PC05:10.0;PR04:11.0 | nan |
3 | NC_007946.1 | 53 | NaN | NaN | E_2_PC07_PR01 | NaN | - | NaN | False | Unambiguous | E_2_PC07_PR01 | 2:13.0;E:11.0;PC07:10.0;PR01:10.0 | nan |
4 | NC_007946.1 | 54 | NaN | NaN | D_10_PC08_PR06 | NaN | + | NaN | False | Unambiguous | D_10_PC08_PR06 | 10:13.0;D:10.0;PC08:14.0;PR06:14.0 | nan |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6995 | NC_007946.1 | 7045 | Not_found | UTI89_RS00040 | B_10_PC03_PR01;D_3_PC05_PR07;F_7_PC03_PR01 | - | - | 0.55905 | True | Ambiguous | B_10_PC03_PR01;B_10_PC03_PR07;B_10_PC05_PR01;B... | 10:10.0;3:14.0;7:13.0;B:12.0;D:12.0;F:12.0;PC0... | nan |
6996 | NC_007946.1 | 7046 | Not_found | UTI89_RS00040 | B_10_PC05_PR06;C_6_PC03_PR02;G_4_PC05_PR05 | - | + | 0.558351 | True | Ambiguous | B_10_PC03_PR02;B_10_PC03_PR05;B_10_PC03_PR06;B... | 10:10.0;4:13.0;6:14.0;B:10.0;C:14.0;G:11.0;PC0... | nan |
6997 | NC_007946.1 | 7047 | Not_found | UTI89_RS00040 | A_6_PC01_PR02;C_9_PC04_PR03;D_8_PC04_PR07 | - | - | 0.557652 | True | Ambiguous | A_6_PC01_PR02;A_6_PC01_PR03;A_6_PC01_PR07;A_6_... | 6:14.0;8:10.0;9:12.0;A:13.0;C:13.0;D:10.0;PC01... | nan |
6998 | NC_007946.1 | 7048 | Not_found | UTI89_RS00040 | A_1_PC04_PR08;A_2_PC03_PR05;F_3_PC03_PR05 | - | + | 0.556953 | True | Ambiguous | A_1_PC03_PR05;A_1_PC03_PR08;A_1_PC04_PR05;A_1_... | 1:14.0;2:10.0;3:11.0;A:13.0;F:10.0;PC03:10.0;P... | nan |
6999 | NC_007946.1 | 7050 | Not_found | UTI89_RS00040 | A_5_PC07_PR05;D_11_PC07_PR04;H_2_PC02_PR01 | - | + | 0.555556 | True | Ambiguous | A_11_PC02_PR01;A_11_PC02_PR04;A_11_PC02_PR05;A... | 11:10.0;2:14.0;5:14.0;A:13.0;D:11.0;H:13.0;PC0... | nan |
7000 rows × 13 columns
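Several columns of the summary pack multiple values into one semicolon-separated string. The following sketch shows how such strings could be unpacked for downstream analysis, using values copied from the table above. The helper parse_pool_counts is hypothetical and not part of arraylib. Note that the printed table truncates long strings with "...", so the full values should be read from the object itself rather than from the display.

```python
def parse_pool_counts(pool_counts):
    """Turn a 'pool:count;pool:count' string into a dict mapping pool to count."""
    parsed = {}
    for item in pool_counts.split(";"):
        pool, count = item.rsplit(":", 1)
        parsed[pool] = float(count)
    return parsed


# Values taken from the first and the 6995th row of the table above.
counts = parse_pool_counts("11:10.0;F:12.0;PC02:12.0;PR02:14.0")
wells = "B_10_PC03_PR01;D_3_PC05_PR07;F_7_PC03_PR01".split(";")
print(counts, wells)
```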
[12]:
experiment.transposed_location_summary
[12]:
Well (index) | Well | row | col | PC | PR | Reference | Coordinate | Gene | Locus_tag | Gene_orientation | Transposon_orientation | Percent_from_start | Gene_hit | Ambiguity | Barcode |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A_1_PC01_PR01 | A_1_PC01_PR01 | A | 1 | PC01 | PR01 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
A_1_PC01_PR02 | A_1_PC01_PR02 | A | 1 | PC01 | PR02 | NC_007946.1;NC_007946.1 | 689;6692 | thrA;Not_found | UTI89_RS00010;UTI89_RS00040 | +;- | -;- | 0.14372716199756394;0.8057302585604472 | True;True | Unambiguous;Ambiguous | nan |
A_1_PC01_PR03 | A_1_PC01_PR03 | A | 1 | PC01 | PR03 | NC_007946.1;NC_007946.1;NC_007946.1 | 4056;5156;5581 | thrC;nan;yaaA | UTI89_RS00020;nan;UTI89_RS00035 | +;nan;- | +;-;+ | 0.2517482517482518;nan;0.9832689832689833 | True;False;True | Ambiguous;Ambiguous;Ambiguous | nan |
A_1_PC01_PR04 | A_1_PC01_PR04 | A | 1 | PC01 | PR04 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
A_1_PC01_PR05 | A_1_PC01_PR05 | A | 1 | PC01 | PR05 | NC_007946.1 | 5715 | yaaA | UTI89_RS00035 | - | - | 0.810811 | True | Ambiguous | nan |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
H_12_PC08_PR04 | H_12_PC08_PR04 | H | 12 | PC08 | PR04 | NC_007946.1;NC_007946.1 | 3370;3884 | thrB;thrC | UTI89_RS00015;UTI89_RS00020 | +;+ | +;- | 0.6120042872454448;0.11810411810411811 | True;True | Unambiguous;Ambiguous | nan |
H_12_PC08_PR05 | H_12_PC08_PR05 | H | 12 | PC08 | PR05 | NC_007946.1 | 3566 | thrB | UTI89_RS00015 | + | + | 0.822079 | True | Ambiguous | nan |
H_12_PC08_PR06 | H_12_PC08_PR06 | H | 12 | PC08 | PR06 | NC_007946.1 | 5399 | Not_found | UTI89_RS00030 | + | + | 0.56229 | True | Ambiguous | nan |
H_12_PC08_PR07 | H_12_PC08_PR07 | H | 12 | PC08 | PR07 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
H_12_PC08_PR08 | H_12_PC08_PR08 | H | 12 | PC08 | PR08 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
6144 rows × 15 columns
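In the transposed summary, a well that holds more than one mutant lists the values of all its mutants in semicolon-joined strings. The sketch below splits such a row into one record per mutant. It is a hypothetical helper rather than an arraylib function, it assumes every field of a well holds the same number of entries, and the example values are copied from well A_1_PC01_PR02 above.

```python
def split_well_record(record):
    """Split semicolon-joined fields into one dict per mutant in the well."""
    parts = {key: value.split(";") for key, value in record.items()}
    n_mutants = len(next(iter(parts.values())))
    if any(len(values) != n_mutants for values in parts.values()):
        raise ValueError("fields hold different numbers of entries")
    return [{key: values[i] for key, values in parts.items()} for i in range(n_mutants)]


well = {
    "Coordinate": "689;6692",
    "Gene": "thrA;Not_found",
    "Locus_tag": "UTI89_RS00010;UTI89_RS00040",
    "Ambiguity": "Unambiguous;Ambiguous",
}
mutants = split_well_record(well)
for mutant in mutants:
    print(mutant)
```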
References
[1] Langmead, B. and Salzberg, S.L., 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), pp. 357-359.