How to run arraylib interactively
First we import the LibraryExperiment class from the package.
[2]:
from arraylib.libraryexperiment import LibraryExperiment
We generate an experiment object by instantiating the LibraryExperiment class with our chosen parameters.
Required parameters:
input_dir: path to the directory holding the input fastq files
exp_design: path to a csv file describing the experimental design (values should be separated by commas). The experimental design file should have the columns Filename, Poolname and Pooldimension (see the example in tests/test_data/full_exp_design.csv).
Filename should contain all the unique input fastq filenames.
Poolname should indicate to which pool a given file belongs. Multiple files per poolname are allowed.
Pooldimension indicates the pooling dimension a pool belongs to. All pools sharing the same pooling dimension should have the same string in the Pooldimension column.
An example of what an exp_design file could look like:
| Filename | Poolname | Pooldimension |
|---|---|---|
| column1.fastq | column1 | columns |
| column2.fastq | column2 | columns |
| row1.fastq | row1 | rows |
| row2.fastq | row2 | rows |
| platerow1.fastq | platerow1 | platerows |
| platerow2.fastq | platerow2 | platerows |
| platecol1.fastq | platecol1 | platecols |
| platecol2.fastq | platecol2 | platecols |
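The layout above can be written out with Python's standard csv module. The following is a minimal sketch, not part of arraylib: the helper names write_exp_design and check_exp_design are hypothetical, and the file names simply mirror the example table.

```python
import csv
import io

# Rows mirroring the example table: (Filename, Poolname, Pooldimension).
ROWS = [
    ("column1.fastq", "column1", "columns"),
    ("column2.fastq", "column2", "columns"),
    ("row1.fastq", "row1", "rows"),
    ("row2.fastq", "row2", "rows"),
    ("platerow1.fastq", "platerow1", "platerows"),
    ("platerow2.fastq", "platerow2", "platerows"),
    ("platecol1.fastq", "platecol1", "platecols"),
    ("platecol2.fastq", "platecol2", "platecols"),
]


def write_exp_design(handle, rows):
    """Write the design table as comma-separated values (hypothetical helper)."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Filename", "Poolname", "Pooldimension"])
    writer.writerows(rows)


def check_exp_design(rows):
    """Report duplicate filenames and pools assigned to more than one dimension."""
    problems = []
    seen_files = set()
    pool_dims = {}
    for filename, pool, dim in rows:
        if filename in seen_files:
            problems.append(f"duplicate filename: {filename}")
        seen_files.add(filename)
        # setdefault returns the dimension first recorded for this pool.
        if pool_dims.setdefault(pool, dim) != dim:
            problems.append(f"pool {pool} has more than one dimension")
    return problems


# Write to an in-memory buffer; pass an open file handle to write to disk.
buffer = io.StringIO()
write_exp_design(buffer, ROWS)
exp_design_text = buffer.getvalue()
```

To create the actual file, open it with open("full_exp_design.csv", "w", newline="") and pass the handle to write_exp_design.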
gb_ref: path to the GenBank reference file
bowtie_ref: path to the bowtie2 [1] index files, ending with the basename of your index (if the basename of your index is UTI89 and you store your bowtie2 references in bowtie_ref, it should be bowtie_ref/UTI89). See https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-build-indexer for instructions on how to create bowtie2 indices.
tn_seq: transposon sequence (e.g. AGATGTGTATAAGAGACAG)
bar_upstream: sequence upstream of the barcode (e.g. CGAGGTCTCT)
bar_downstream: sequence downstream of the barcode (e.g. CGTACGCTGC)
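A common mistake is to pass the index directory instead of the index prefix as bowtie_ref. The sketch below checks that a prefix such as bowtie_ref/UTI89 points at a complete set of index files. It relies on the file suffixes that bowtie2-build writes (.bt2 for small indices, .bt2l for large ones). The function missing_index_files is a hypothetical helper, not part of arraylib, and the demonstration uses empty placeholder files in a temporary directory.

```python
from pathlib import Path
import tempfile

# Suffixes written by bowtie2-build for a small index; large indices use ".bt2l".
SMALL_SUFFIXES = [".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2"]


def missing_index_files(prefix):
    """Return names of missing index files, or [] if a complete index exists."""
    small = [Path(str(prefix) + suffix) for suffix in SMALL_SUFFIXES]
    large = [Path(str(prefix) + suffix + "l") for suffix in SMALL_SUFFIXES]
    missing_small = [p.name for p in small if not p.exists()]
    missing_large = [p.name for p in large if not p.exists()]
    # Either a complete small index or a complete large index is acceptable.
    if not missing_small or not missing_large:
        return []
    return missing_small


# Demonstration with empty placeholder files (not a real index).
with tempfile.TemporaryDirectory() as tmp:
    ref_dir = Path(tmp) / "bowtie_ref"
    ref_dir.mkdir()
    prefix = ref_dir / "UTI89"
    before = missing_index_files(prefix)
    for suffix in SMALL_SUFFIXES:
        Path(str(prefix) + suffix).touch()
    after = missing_index_files(prefix)
```

Here before lists all six expected files, and after is empty once the placeholder files exist.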
Optional parameters:
map_quality: minimum bowtie2 alignment quality score required to include a read
seq_quality: minimum Phred score for each base required to include a read
tn_mismatches: number of mismatches allowed in the transposon sequence
filter_thr: threshold for the local filter (e.g. a threshold of 0.05 filters out all read counts below 0.05 of the maximum read count for a given mutant)
global_filter_thr: threshold for the global filter (all read counts below this threshold are set to 0)
min_counts: minimum counts of a barcode for it to be included in the analysis
use_barcodes: whether to perform deconvolution based only on barcodes, without genomic alignment
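To make the difference between the local and the global filter concrete, here is a small sketch on the read counts of a single mutant. It is not arraylib's implementation. It assumes that the local maximum is taken across all pools of the mutant, and the function names are hypothetical.

```python
def local_filter(counts, threshold):
    """Zero out counts below threshold times the maximum count of this mutant."""
    cutoff = threshold * max(counts, default=0)
    return [c if c >= cutoff else 0 for c in counts]


def global_filter(counts, threshold):
    """Zero out counts below an absolute threshold."""
    return [c if c >= threshold else 0 for c in counts]


# Made-up read counts of one mutant across six pools.
mutant_counts = [400, 12, 0, 380, 3, 410]

# Local filter: cutoff is 0.05 * 410 = 20.5, so 12 and 3 are removed.
after_local = local_filter(mutant_counts, 0.05)

# Global filter: only counts below 5 are removed, so 12 survives.
after_global = global_filter(mutant_counts, 5)
```

The local filter scales with the abundance of each mutant, whereas the global filter applies the same absolute cutoff to every mutant.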
You can also refer to the API documentation for a detailed description of the input parameters.
[3]:
experiment = LibraryExperiment(cores=4,
                               map_quality=30,
                               seq_quality=10,
                               gb_ref="gb_ref",
                               bowtie_ref="bowtie_ref/UTI89",
                               tn_seq="AGATGTGTATAAGAGACAG",
                               tn_mismatches=2,
                               input_dir="input",
                               exp_design="full_exp_design.csv",
                               use_barcodes=False,
                               bar_upstream="CGAGGTCTCT",
                               bar_downstream="CGTACGCTGC",
                               filter_thr=0.05,
                               global_filter_thr=5,
                               min_counts=5)
Read trimming and genomic alignment
For each read in the input fastq files in input_dir, we search for the provided transposon sequence tn_seq, allowing up to tn_mismatches mismatches. The barcode is detected between the provided bar_upstream and bar_downstream sequences, and the genomic part of the read downstream of the transposon sequence is collected in a compiled .fastq file. As a side effect, a temp directory is created and the compiled, trimmed reads are saved to temp/trimmed_sequences.fastq.
[4]:
# detect transposon and trim sequences
experiment.get_genomic_seq()
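The idea behind this trimming step can be illustrated on a single read. The sketch below is not arraylib's implementation and may differ from what get_genomic_seq() does internally. It assumes that mismatches are counted as substitutions (Hamming distance) and that the barcode flanks are matched exactly. All function names and the example read are made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_with_mismatches(read, pattern, max_mismatches):
    """Start of the first window within max_mismatches of pattern, or -1."""
    n = len(pattern)
    for start in range(len(read) - n + 1):
        if hamming(read[start:start + n], pattern) <= max_mismatches:
            return start
    return -1


def genomic_part(read, tn_seq, max_mismatches):
    """Sequence downstream of the transposon, or None if it is not found."""
    start = find_with_mismatches(read, tn_seq, max_mismatches)
    if start == -1:
        return None
    return read[start + len(tn_seq):]


def barcode_between(read, upstream, downstream):
    """Sequence between exact upstream and downstream flanks, or None."""
    up = read.find(upstream)
    if up == -1:
        return None
    begin = up + len(upstream)
    down = read.find(downstream, begin)
    if down == -1:
        return None
    return read[begin:down]


TN_SEQ = "AGATGTGTATAAGAGACAG"
# Made-up read: flanked barcode, then the transposon with one substitution
# (A -> T at position 16), then the genomic sequence.
read = ("TT" + "CGAGGTCTCT" + "TTGACCAGTA" + "CGTACGCTGC" + "GG"
        + "AGATGTGTATAAGAGTCAG" + "TTGACCGATTA")

genomic = genomic_part(read, TN_SEQ, max_mismatches=2)
strict = genomic_part(read, TN_SEQ, max_mismatches=0)
barcode = barcode_between(read, "CGAGGTCTCT", "CGTACGCTGC")
```

With two mismatches allowed, the genomic part TTGACCGATTA is recovered. With zero mismatches allowed, the read is rejected because its transposon copy carries one substitution.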
Then we use bowtie2 to align the trimmed reads to the reference genome in bowtie_ref. The output of bowtie2 is parsed and stored in temp/alignment_result.csv.
[5]:
# align trimmed sequences to reference using bowtie2
experiment.align_genomic_seq()
624483 reads; of these:
  624483 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    624483 (100.00%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
100.00% overall alignment rate
Assembly of count matrix and preprocessing
Once the alignment is finished we can count the reads of each mutant in each pool and assemble the count matrix using write_count_matrix().
[6]:
# assemble count matrix from aligned reads
experiment.write_count_matrix()
Generating count matrix with 7000 mutants
Generated count matrix!
write_count_matrix() generates 4 different count matrices stored as attributes:
raw_count_mat: stores the raw reads per mutant and pool
count_mat: stores the raw reads per mutant and pool, with spurious barcodes filtered out
normalized_count_mat: stores the reads from count_mat, normalized to counts per million
filtered_count_mat: stores the normalized reads, with filters applied according to filter_thr and global_filter_thr
In practice, it is beneficial to apply filters to the normalized count matrix because spurious reads are often detected. Such reads make the deconvolution and location inference much harder, as many mutants become ambiguous. Therefore, we usually want to continue with filtered_count_mat.
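Counts-per-million normalization can be sketched as follows. This is an illustration, not arraylib's code. It assumes that each pool (column) is scaled separately so that its counts sum to one million, which is consistent with the output further down, where the same raw count gives different normalized values in different pools. The function name and toy matrix are hypothetical.

```python
def counts_per_million(matrix):
    """Scale each column (pool) so that its counts sum to one million."""
    n_cols = len(matrix[0]) if matrix else 0
    totals = [sum(row[j] for row in matrix) for j in range(n_cols)]
    return [
        [row[j] / totals[j] * 1e6 if totals[j] else 0.0 for j in range(n_cols)]
        for row in matrix
    ]


# Toy count matrix: rows are mutants, columns are pools.
toy_counts = [
    [10, 0, 30],
    [0, 20, 10],
    [30, 20, 0],
]
cpm = counts_per_million(toy_counts)
column_sums = [sum(row[j] for row in cpm) for j in range(3)]
```

Each pool here has 40 reads in total, so a raw count of 10 becomes 250000 counts per million.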
[7]:
experiment.count_mat
[7]:
| | Feature | Orientation | Barcode | Reference | 1 | 10 | 11 | 12 | 2 | 3 | ... | PC07 | PC08 | PR01 | PR02 | PR03 | PR04 | PR05 | PR06 | PR07 | PR08 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | - | nan | NC_007946.1 | 0 | 0 | 10 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 50 | - | nan | NC_007946.1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 |
| 2 | 51 | - | nan | NC_007946.1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 |
| 3 | 53 | - | nan | NC_007946.1 | 0 | 0 | 0 | 0 | 13 | 0 | ... | 10 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 54 | + | nan | NC_007946.1 | 0 | 13 | 0 | 0 | 0 | 0 | ... | 0 | 14 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6995 | 7045 | - | nan | NC_007946.1 | 0 | 10 | 0 | 0 | 0 | 14 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 0 | 11 | 0 |
| 6996 | 7046 | + | nan | NC_007946.1 | 0 | 10 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 13 | 0 | 0 | 12 | 11 | 0 | 0 |
| 6997 | 7047 | - | nan | NC_007946.1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 14 | 12 | 0 | 0 | 0 | 11 | 0 |
| 6998 | 7048 | + | nan | NC_007946.1 | 14 | 0 | 0 | 0 | 10 | 11 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 14 |
| 6999 | 7050 | + | nan | NC_007946.1 | 0 | 0 | 10 | 0 | 14 | 0 | ... | 12 | 0 | 14 | 0 | 0 | 11 | 13 | 0 | 0 | 0 |
7000 rows × 40 columns
[8]:
experiment.filtered_count_mat
[8]:
| | Feature | Orientation | Barcode | Reference | 1 | 10 | 11 | 12 | 2 | 3 | ... | PC07 | PC08 | PR01 | PR02 | PR03 | PR04 | PR05 | PR06 | PR07 | PR08 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | - | nan | NC_007946.1 | 0.00000 | 0.000000 | 734.322221 | 0.0 | 0.000000 | 0.000000 | ... | 0.00000 | 0.000000 | 0.000000 | 732.945919 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 1 | 50 | - | nan | NC_007946.1 | 0.00000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.00000 | 675.465032 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 664.349959 |
| 2 | 51 | - | nan | NC_007946.1 | 0.00000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 581.856652 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3 | 53 | - | nan | NC_007946.1 | 0.00000 | 0.000000 | 0.000000 | 0.0 | 980.022616 | 0.000000 | ... | 496.15480 | 0.000000 | 509.735957 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 4 | 54 | + | nan | NC_007946.1 | 0.00000 | 981.798958 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.00000 | 727.423880 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 721.277692 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6995 | 7045 | - | nan | NC_007946.1 | 0.00000 | 755.229968 | 0.000000 | 0.0 | 0.000000 | 1003.152766 | ... | 0.00000 | 0.000000 | 662.656744 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 549.862534 | 0.000000 |
| 6996 | 7046 | + | nan | NC_007946.1 | 0.00000 | 755.229968 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.00000 | 0.000000 | 0.000000 | 680.592639 | 0.000000 | 0.000000 | 629.954328 | 566.718187 | 0.000000 | 0.000000 |
| 6997 | 7047 | - | nan | NC_007946.1 | 0.00000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.00000 | 0.000000 | 0.000000 | 732.945919 | 635.930048 | 0.000000 | 0.000000 | 0.000000 | 549.862534 | 0.000000 |
| 6998 | 7048 | + | nan | NC_007946.1 | 1108.64745 | 0.000000 | 0.000000 | 0.0 | 753.863551 | 788.191459 | ... | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 682.450522 | 0.000000 | 0.000000 | 715.453802 |
| 6999 | 7050 | + | nan | NC_007946.1 | 0.00000 | 0.000000 | 734.322221 | 0.0 | 1055.408971 | 0.000000 | ... | 595.38576 | 0.000000 | 713.630339 | 0.000000 | 0.000000 | 581.856652 | 682.450522 | 0.000000 | 0.000000 | 0.000000 |
7000 rows × 40 columns
Deconvolution and location inference
We then use the count matrix to infer the most likely location of each mutant using deconvolve().
[9]:
# Inferring most likely locations and writing location summary
experiment.deconvolve()
We can access the deconvolution results with location_summary to get detailed information about each detected mutant. To get information about each well in the arrayed library, use transposed_location_summary.
[11]:
experiment.location_summary
[11]:
| | Reference | Coordinate | Gene | Locus_tag | Predicted_wells | Gene_orientation | Transposon_orientation | Percent_from_start | Gene_hit | Ambiguity | Possible_wells | Pool_counts | Barcode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NC_007946.1 | 49 | NaN | NaN | F_11_PC02_PR02 | NaN | - | NaN | False | Unambiguous | F_11_PC02_PR02 | 11:10.0;F:12.0;PC02:12.0;PR02:14.0 | nan |
| 1 | NC_007946.1 | 50 | NaN | NaN | F_5_PC08_PR08 | NaN | - | NaN | False | Unambiguous | F_5_PC08_PR08 | 5:13.0;F:10.0;PC08:13.0;PR08:13.0 | nan |
| 2 | NC_007946.1 | 51 | NaN | NaN | D_8_PC05_PR04 | NaN | - | NaN | False | Unambiguous | D_8_PC05_PR04 | 8:10.0;D:10.0;PC05:10.0;PR04:11.0 | nan |
| 3 | NC_007946.1 | 53 | NaN | NaN | E_2_PC07_PR01 | NaN | - | NaN | False | Unambiguous | E_2_PC07_PR01 | 2:13.0;E:11.0;PC07:10.0;PR01:10.0 | nan |
| 4 | NC_007946.1 | 54 | NaN | NaN | D_10_PC08_PR06 | NaN | + | NaN | False | Unambiguous | D_10_PC08_PR06 | 10:13.0;D:10.0;PC08:14.0;PR06:14.0 | nan |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6995 | NC_007946.1 | 7045 | Not_found | UTI89_RS00040 | B_10_PC03_PR01;D_3_PC05_PR07;F_7_PC03_PR01 | - | - | 0.55905 | True | Ambiguous | B_10_PC03_PR01;B_10_PC03_PR07;B_10_PC05_PR01;B... | 10:10.0;3:14.0;7:13.0;B:12.0;D:12.0;F:12.0;PC0... | nan |
| 6996 | NC_007946.1 | 7046 | Not_found | UTI89_RS00040 | B_10_PC05_PR06;C_6_PC03_PR02;G_4_PC05_PR05 | - | + | 0.558351 | True | Ambiguous | B_10_PC03_PR02;B_10_PC03_PR05;B_10_PC03_PR06;B... | 10:10.0;4:13.0;6:14.0;B:10.0;C:14.0;G:11.0;PC0... | nan |
| 6997 | NC_007946.1 | 7047 | Not_found | UTI89_RS00040 | A_6_PC01_PR02;C_9_PC04_PR03;D_8_PC04_PR07 | - | - | 0.557652 | True | Ambiguous | A_6_PC01_PR02;A_6_PC01_PR03;A_6_PC01_PR07;A_6_... | 6:14.0;8:10.0;9:12.0;A:13.0;C:13.0;D:10.0;PC01... | nan |
| 6998 | NC_007946.1 | 7048 | Not_found | UTI89_RS00040 | A_1_PC04_PR08;A_2_PC03_PR05;F_3_PC03_PR05 | - | + | 0.556953 | True | Ambiguous | A_1_PC03_PR05;A_1_PC03_PR08;A_1_PC04_PR05;A_1_... | 1:14.0;2:10.0;3:11.0;A:13.0;F:10.0;PC03:10.0;P... | nan |
| 6999 | NC_007946.1 | 7050 | Not_found | UTI89_RS00040 | A_5_PC07_PR05;D_11_PC07_PR04;H_2_PC02_PR01 | - | + | 0.555556 | True | Ambiguous | A_11_PC02_PR01;A_11_PC02_PR04;A_11_PC02_PR05;A... | 11:10.0;2:14.0;5:14.0;A:13.0;D:11.0;H:13.0;PC0... | nan |
7000 rows × 13 columns
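The Pool_counts and Possible_wells columns can be related to each other with a short sketch. A mutant detected in exactly one pool per dimension has a single possible well. A mutant detected in several pools of a dimension has one possible well per combination. The code below is an illustration only: the helper names, the dimension labels, and the second example are assumptions, and it does not reproduce how deconvolve() chooses Predicted_wells.

```python
from itertools import product


def parse_pool_counts(text):
    """Turn 'pool:count;pool:count' into a {pool: count} dictionary."""
    counts = {}
    for item in text.split(";"):
        pool, value = item.rsplit(":", 1)
        counts[pool] = float(value)
    return counts


def possible_wells(pools, pool_dimension, dimension_order):
    """Combine every detected pool of one dimension with those of the others."""
    by_dim = {dim: [] for dim in dimension_order}
    for pool in pools:
        by_dim[pool_dimension[pool]].append(pool)
    combos = product(*(sorted(by_dim[dim]) for dim in dimension_order))
    return ["_".join(combo) for combo in combos]


# Assumed dimension labels, ordered to match well names such as F_11_PC02_PR02.
ORDER = ["rows", "columns", "platecols", "platerows"]
DIMENSION = {
    "F": "rows", "B": "rows", "11": "columns",
    "PC02": "platecols", "PR02": "platerows", "PR05": "platerows",
}

# Pool_counts of the first mutant in the table above.
unambiguous = parse_pool_counts("11:10.0;F:12.0;PC02:12.0;PR02:14.0")
single = possible_wells(unambiguous, DIMENSION, ORDER)

# Made-up example: the mutant also appears in row B and plate row PR05.
ambiguous = parse_pool_counts("11:10.0;B:11.0;F:12.0;PC02:12.0;PR02:14.0;PR05:13.0")
several = possible_wells(ambiguous, DIMENSION, ORDER)
```

The first mutant maps to the single well F_11_PC02_PR02, while the made-up example yields 2 x 1 x 1 x 2 = 4 candidate wells.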
[12]:
experiment.transposed_location_summary
[12]:
| Well | Well | row | col | PC | PR | Reference | Coordinate | Gene | Locus_tag | Gene_orientation | Transposon_orientation | Percent_from_start | Gene_hit | Ambiguity | Barcode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A_1_PC01_PR01 | A_1_PC01_PR01 | A | 1 | PC01 | PR01 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| A_1_PC01_PR02 | A_1_PC01_PR02 | A | 1 | PC01 | PR02 | NC_007946.1;NC_007946.1 | 689;6692 | thrA;Not_found | UTI89_RS00010;UTI89_RS00040 | +;- | -;- | 0.14372716199756394;0.8057302585604472 | True;True | Unambiguous;Ambiguous | nan |
| A_1_PC01_PR03 | A_1_PC01_PR03 | A | 1 | PC01 | PR03 | NC_007946.1;NC_007946.1;NC_007946.1 | 4056;5156;5581 | thrC;nan;yaaA | UTI89_RS00020;nan;UTI89_RS00035 | +;nan;- | +;-;+ | 0.2517482517482518;nan;0.9832689832689833 | True;False;True | Ambiguous;Ambiguous;Ambiguous | nan |
| A_1_PC01_PR04 | A_1_PC01_PR04 | A | 1 | PC01 | PR04 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| A_1_PC01_PR05 | A_1_PC01_PR05 | A | 1 | PC01 | PR05 | NC_007946.1 | 5715 | yaaA | UTI89_RS00035 | - | - | 0.810811 | True | Ambiguous | nan |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| H_12_PC08_PR04 | H_12_PC08_PR04 | H | 12 | PC08 | PR04 | NC_007946.1;NC_007946.1 | 3370;3884 | thrB;thrC | UTI89_RS00015;UTI89_RS00020 | +;+ | +;- | 0.6120042872454448;0.11810411810411811 | True;True | Unambiguous;Ambiguous | nan |
| H_12_PC08_PR05 | H_12_PC08_PR05 | H | 12 | PC08 | PR05 | NC_007946.1 | 3566 | thrB | UTI89_RS00015 | + | + | 0.822079 | True | Ambiguous | nan |
| H_12_PC08_PR06 | H_12_PC08_PR06 | H | 12 | PC08 | PR06 | NC_007946.1 | 5399 | Not_found | UTI89_RS00030 | + | + | 0.56229 | True | Ambiguous | nan |
| H_12_PC08_PR07 | H_12_PC08_PR07 | H | 12 | PC08 | PR07 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| H_12_PC08_PR08 | H_12_PC08_PR08 | H | 12 | PC08 | PR08 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
6144 rows × 15 columns
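In transposed_location_summary, a well that contains several mutants stores their annotations as semicolon-separated strings. The sketch below expands such a well into one record per mutant using plain dictionaries. The helper split_well_record is hypothetical and not part of arraylib. The example values are copied from the A_1_PC01_PR02 row above.

```python
def split_well_record(record, fields):
    """Expand semicolon-joined fields of one well into one dict per mutant."""
    columns = [str(record[field]).split(";") for field in fields]
    if len({len(column) for column in columns}) != 1:
        raise ValueError("fields disagree on the number of mutants")
    return [dict(zip(fields, values)) for values in zip(*columns)]


# Values taken from the A_1_PC01_PR02 row of the table above.
well = {
    "Well": "A_1_PC01_PR02",
    "Coordinate": "689;6692",
    "Gene": "thrA;Not_found",
    "Locus_tag": "UTI89_RS00010;UTI89_RS00040",
    "Ambiguity": "Unambiguous;Ambiguous",
}
FIELDS = ["Coordinate", "Gene", "Locus_tag", "Ambiguity"]

mutants = split_well_record(well, FIELDS)
unambiguous = [m for m in mutants if m["Ambiguity"] == "Unambiguous"]
```

This makes it easy to keep, for example, only the unambiguously located mutants of a well.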
References
[1] Langmead, B. and Salzberg, S.L., 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), pp.357-359.