How to run `arraylib` interactively

First we import the LibraryExperiment class from the package.

[2]:

from arraylib.libraryexperiment import LibraryExperiment

We generate an experiment object by instantiating the LibraryExperiment class with our chosen parameters.

Required parameters:

input_dir: path to directory holding the input fastq files
exp_design: path to the comma-separated csv file describing the experimental design. The file must contain the columns Filename, Poolname and Pooldimension (see the example in tests/test_data/full_exp_design.csv).
- Filename should contain all the unique input fastq filenames.
- Poolname should indicate to which pool a given file belongs. Multiple files per poolname are allowed.
- Pooldimension indicates the pooling dimension a pool belongs to. All pools sharing the same pooling dimension should have the same string in the Pooldimension column.

An example of what an exp_design file could look like (shown here as a table; the file itself is comma-separated):

Filename	Poolname	Pooldimension
column1.fastq	column1	columns
column2.fastq	column2	columns
row1.fastq	row1	rows
row2.fastq	row2	rows
platerow1.fastq	platerow1	platerows
platerow2.fastq	platerow2	platerows
platecol1.fastq	platecol1	platecols
platecol2.fastq	platecol2	platecols
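The rules above can be checked before starting a run. The following is a minimal sketch using only the Python standard library; `check_exp_design` is a hypothetical helper and not part of arraylib, and the rule that a pool belongs to exactly one pooling dimension is our reading of the column descriptions above.

```python
import csv
import io

REQUIRED_COLUMNS = ("Filename", "Poolname", "Pooldimension")

def check_exp_design(handle):
    """Return a list of problems found in an experimental design csv."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    seen_files = set()
    pool_dims = {}
    for line_no, row in enumerate(reader, start=2):
        filename, pool, dim = ((row[c] or "").strip() for c in REQUIRED_COLUMNS)
        # every input fastq file should be listed only once
        if filename in seen_files:
            problems.append(f"line {line_no}: duplicate Filename {filename!r}")
        seen_files.add(filename)
        # a pool may span several files, but only one pooling dimension (assumption)
        if pool_dims.setdefault(pool, dim) != dim:
            problems.append(f"line {line_no}: pool {pool!r} has more than one Pooldimension")
    return problems

example = """Filename,Poolname,Pooldimension
column1.fastq,column1,columns
column2.fastq,column2,columns
row1.fastq,row1,rows
row2.fastq,row2,rows
"""
print(check_exp_design(io.StringIO(example)))
```

For a real file, pass an open file handle, for example `open("full_exp_design.csv", newline="")`, instead of the `io.StringIO` object.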

gb_ref: path to genbank reference file
bowtie_ref: path to bowtie index files, ending with the basename of your index (if the basename of your index is UTI89 and you store your bowtie2 [1] references in bowtie_ref, it should be bowtie_ref/UTI89). Please visit https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-build-indexer for instructions on how to create bowtie2 indices.
tn_seq: transposon sequence (e.g. AGATGTGTATAAGAGACAG)
bar_upstream: upstream sequence of barcode (e.g. CGAGGTCTCT)
bar_downstream: downstream sequence of barcode (e.g. CGTACGCTGC)
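Because bowtie_ref must end with the index basename rather than a directory, a quick check that the index files exist can catch a wrong path early. The sketch below relies on the file suffixes that bowtie2-build writes (`.1.bt2` to `.4.bt2`, `.rev.1.bt2`, `.rev.2.bt2`, or the same names ending in `.bt2l` for large indices). `missing_index_files` is a hypothetical helper, not an arraylib function, and the demonstration uses empty placeholder files in a temporary directory.

```python
from pathlib import Path
import tempfile

# Suffixes written by bowtie2-build for a small index; large indices end in "l".
SMALL_SUFFIXES = [".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2"]

def missing_index_files(bowtie_ref):
    """Return the names of index files missing for the given basename path."""
    missing_small = [Path(bowtie_ref + s).name for s in SMALL_SUFFIXES
                     if not Path(bowtie_ref + s).exists()]
    missing_large = [Path(bowtie_ref + s + "l").name for s in SMALL_SUFFIXES
                     if not Path(bowtie_ref + s + "l").exists()]
    # accept either a complete small index or a complete large index
    if not missing_small or not missing_large:
        return []
    return missing_small

with tempfile.TemporaryDirectory() as tmp:
    ref = str(Path(tmp) / "UTI89")
    print(missing_index_files(ref))   # all six small-index files are reported
    for suffix in SMALL_SUFFIXES:
        Path(ref + suffix).touch()    # placeholder files, not a real index
    print(missing_index_files(ref))   # nothing is reported once they exist
```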

Optional parameters:

map_quality: minimum bowtie2 alignment (mapping) quality score for a read to be included
seq_quality: minimum phred score for each base for a read to be included
tn_mismatches: number of transposon mismatches allowed
filter_thr: threshold for the local filter (e.g. a threshold of 0.05 filters out all counts below 0.05 times the maximum read count for a given mutant)
global_filter_thr: threshold for the global filter (all counts below global_filter_thr are set to 0)
min_counts: minimum counts of a barcode to be included in the analysis
use_barcodes: whether to perform deconvolution based only on barcodes, without genomic alignment

You can also refer to the API to get a detailed description of the input parameter choices.

[3]:

experiment = LibraryExperiment(cores=4,
                               map_quality=30,
                               seq_quality=10,
                               gb_ref="gb_ref",
                               bowtie_ref="bowtie_ref/UTI89",
                               tn_seq="AGATGTGTATAAGAGACAG",
                               tn_mismatches=2,
                               input_dir="input",
                               exp_design="full_exp_design.csv",
                               use_barcodes=False,
                               bar_upstream="CGAGGTCTCT",
                               bar_downstream="CGTACGCTGC",
                               filter_thr=0.05,
                               global_filter_thr=5,
                               min_counts=5)

Read trimming and genomic alignment

For each read in the input fastq files in input_dir, we detect the transposon sequence tn_seq, allowing up to tn_mismatches mismatches. The barcode sequence is detected between the provided bar_upstream and bar_downstream sequences, and the genomic part of the read downstream of the transposon sequence is collected in a compiled .fastq file. As a side effect, a temp directory is created and the compiled, trimmed reads are saved to temp/trimmed_sequences.fastq.

[4]:

# detect transposon and trim sequences
experiment.get_genomic_seq()
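To make the trimming step concrete, here is a minimal sketch of the idea on a single read. It is not arraylib's implementation: the function names are hypothetical, the transposon is located with a simple Hamming-distance scan, the barcode is searched for anywhere in the read, and the example read is made up.

```python
import re

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_transposon_end(read, tn_seq, max_mismatches):
    """Return the index just past the first tn_seq match, or None."""
    for start in range(len(read) - len(tn_seq) + 1):
        window = read[start:start + len(tn_seq)]
        if hamming(window, tn_seq) <= max_mismatches:
            return start + len(tn_seq)
    return None

def extract_barcode(read, bar_upstream, bar_downstream):
    """Return the sequence between the two flanking sequences, or None."""
    pattern = re.escape(bar_upstream) + "([ACGTN]+?)" + re.escape(bar_downstream)
    match = re.search(pattern, read)
    return match.group(1) if match else None

def trim_read(read, tn_seq, bar_upstream, bar_downstream, max_mismatches=2):
    """Return (barcode, genomic part) or None if no transposon is found."""
    end = find_transposon_end(read, tn_seq, max_mismatches)
    if end is None:
        return None
    return extract_barcode(read, bar_upstream, bar_downstream), read[end:]

# Made-up read: flank + barcode + flank + transposon (one mismatch) + genome
read = ("AA" + "CGAGGTCTCT" + "TTGACCATGA" + "CGTACGCTGC"
        + "AGATGTGTATAAGAGTCAG" + "GGCATTACGGATC")
print(trim_read(read, "AGATGTGTATAAGAGACAG", "CGAGGTCTCT", "CGTACGCTGC"))
```

Setting `max_mismatches=0` rejects this read, because its transposon copy differs from tn_seq at one position.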

Then we use bowtie2 to align the trimmed reads to the reference genome in bowtie_ref. The output of bowtie2 is parsed and stored in temp/alignment_result.csv.

[5]:

# align trimmed sequences to reference using bowtie2
experiment.align_genomic_seq()

624483 reads; of these:
  624483 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    624483 (100.00%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
100.00% overall alignment rate
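bowtie2 reports alignments in SAM format, where the second column is a bit flag (0x4 marks an unmapped read, 0x10 a reverse-strand alignment), the third and fourth columns give the reference name and leftmost position, and the fifth column is the mapping quality. The sketch below shows how such records could be filtered by a map_quality threshold. `parse_sam_line` is a hypothetical helper, the SAM lines are made up, and how arraylib itself parses the output and defines the insertion coordinate is not shown here.

```python
def parse_sam_line(line, min_mapq=30):
    """Return (reference, position, orientation), or None if the read is dropped."""
    if line.startswith("@"):              # header line
        return None
    fields = line.rstrip("\n").split("\t")
    flag, ref, pos, mapq = int(fields[1]), fields[2], int(fields[3]), int(fields[4])
    if flag & 0x4 or mapq < min_mapq:     # unmapped or low mapping quality
        return None
    orientation = "-" if flag & 0x10 else "+"
    return ref, pos, orientation

# Made-up SAM records for illustration
sam_lines = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tNC_007946.1\t54\t42\t13M\t*\t0\t0\tGGCATTACGGATC\tIIIIIIIIIIIII",
    "read2\t16\tNC_007946.1\t49\t42\t13M\t*\t0\t0\tGATCCGTAATGCC\tIIIIIIIIIIIII",
    "read3\t0\tNC_007946.1\t100\t3\t13M\t*\t0\t0\tGGCATTACGGATC\tIIIIIIIIIIIII",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
hits = [hit for hit in map(parse_sam_line, sam_lines) if hit is not None]
print(hits)
```

Only read1 and read2 pass: read3 falls below the mapping quality threshold and read4 is unmapped.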

Assembly of count matrix and preprocessing

Once the alignment is finished we can count the reads of each mutant in each pool and assemble the count matrix using write_count_matrix().

[6]:

# assemble count matrix from aligned reads
experiment.write_count_matrix()

Generating count matrix with 7000 mutants
Generated count matrix!
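Conceptually, the count matrix has one row per mutant and one column per pool. A minimal sketch of this counting step, using hypothetical aligned reads and a hypothetical helper `build_count_matrix` (not arraylib's implementation), could look like this:

```python
from collections import Counter

# Hypothetical aligned reads as (pool, coordinate, orientation) tuples
aligned_reads = [
    ("row1", 49, "-"), ("row1", 49, "-"), ("column1", 49, "-"),
    ("row2", 54, "+"), ("column1", 54, "+"), ("column1", 54, "+"),
]

def build_count_matrix(reads):
    """Count reads per mutant (coordinate, orientation) and pool."""
    counts = Counter(reads)
    pools = sorted({pool for pool, _, _ in reads})
    mutants = sorted({(coord, orient) for _, coord, orient in reads})
    # one row per mutant, one entry per pool (0 if the mutant was not seen)
    matrix = {
        mutant: [counts[(pool, *mutant)] for pool in pools]
        for mutant in mutants
    }
    return pools, matrix

pools, matrix = build_count_matrix(aligned_reads)
print(pools)
for mutant, row in matrix.items():
    print(mutant, row)
```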

write_count_matrix() generates 4 different count matrices stored as attributes:

raw_count_mat: stores the raw reads per mutant and pool
count_mat: stores the raw reads per mutant and pool, but spurious barcodes are filtered out
normalized_count_mat: stores the reads from count_mat, but normalized to counts per million
filtered_count_mat: stores the normalized reads after applying the filters set by filter_thr and global_filter_thr

In practice, it is beneficial to apply filters to the normalized count matrix, because spurious reads are often detected. Such reads usually make the deconvolution and location inference much harder, as many mutants become ambiguous. Therefore, we usually want to continue with the filtered_count_mat.
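The normalization and the two filters can be illustrated on a toy matrix. This is a sketch under stated assumptions rather than arraylib's code: counts per million are computed per pool (column), the local filter compares each value with the maximum of its mutant (row), and both thresholds are applied to the normalized values. The function names and the toy counts are hypothetical.

```python
def counts_per_million(matrix):
    """Scale each pool (column) so that its counts sum to one million."""
    totals = [sum(column) for column in zip(*matrix)]
    return [[c * 1e6 / t if t else 0.0 for c, t in zip(row, totals)]
            for row in matrix]

def apply_filters(matrix, filter_thr=0.05, global_thr=5):
    """Set values to 0 that fail the local (per-mutant) or global threshold."""
    filtered = []
    for row in matrix:
        local_cutoff = filter_thr * max(row)   # relative to the mutant's maximum
        filtered.append([v if v >= local_cutoff and v >= global_thr else 0.0
                         for v in row])
    return filtered

# Toy counts: rows are mutants, columns are pools
raw = [
    [100,   2,   0],
    [  0,  50,  48],
    [900, 948, 952],
]
normalized = counts_per_million(raw)
filtered = apply_filters(normalized, filter_thr=0.05, global_thr=5)
print(normalized[0])
print(filtered[0])
```

In the first row, the small count in the second pool is below 5% of that mutant's maximum and is therefore set to 0, which is the kind of spurious signal the local filter is meant to remove.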

[7]:

experiment.count_mat

[7]:

	Feature	Orientation	Barcode	Reference	1	10	11	12	2	3	...	PC07	PC08	PR01	PR02	PR03	PR04	PR05	PR06	PR07	PR08
0	49	-	nan	NC_007946.1	0	0	10	0	0	0	...	0	0	0	14	0	0	0	0	0	0
1	50	-	nan	NC_007946.1	0	0	0	0	0	0	...	0	13	0	0	0	0	0	0	0	13
2	51	-	nan	NC_007946.1	0	0	0	0	0	0	...	0	0	0	0	0	11	0	0	0	0
3	53	-	nan	NC_007946.1	0	0	0	0	13	0	...	10	0	10	0	0	0	0	0	0	0
4	54	+	nan	NC_007946.1	0	13	0	0	0	0	...	0	14	0	0	0	0	0	14	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6995	7045	-	nan	NC_007946.1	0	10	0	0	0	14	...	0	0	13	0	0	0	0	0	11	0
6996	7046	+	nan	NC_007946.1	0	10	0	0	0	0	...	0	0	0	13	0	0	12	11	0	0
6997	7047	-	nan	NC_007946.1	0	0	0	0	0	0	...	0	0	0	14	12	0	0	0	11	0
6998	7048	+	nan	NC_007946.1	14	0	0	0	10	11	...	0	0	0	0	0	0	13	0	0	14
6999	7050	+	nan	NC_007946.1	0	0	10	0	14	0	...	12	0	14	0	0	11	13	0	0	0

7000 rows × 40 columns

[8]:

experiment.filtered_count_mat

[8]:

	Feature	Orientation	Barcode	Reference	1	10	11	12	2	3	...	PC07	PC08	PR01	PR02	PR03	PR04	PR05	PR06	PR07	PR08
0	49	-	nan	NC_007946.1	0.00000	0.000000	734.322221	0.0	0.000000	0.000000	...	0.00000	0.000000	0.000000	732.945919	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
1	50	-	nan	NC_007946.1	0.00000	0.000000	0.000000	0.0	0.000000	0.000000	...	0.00000	675.465032	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	664.349959
2	51	-	nan	NC_007946.1	0.00000	0.000000	0.000000	0.0	0.000000	0.000000	...	0.00000	0.000000	0.000000	0.000000	0.000000	581.856652	0.000000	0.000000	0.000000	0.000000
3	53	-	nan	NC_007946.1	0.00000	0.000000	0.000000	0.0	980.022616	0.000000	...	496.15480	0.000000	509.735957	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
4	54	+	nan	NC_007946.1	0.00000	981.798958	0.000000	0.0	0.000000	0.000000	...	0.00000	727.423880	0.000000	0.000000	0.000000	0.000000	0.000000	721.277692	0.000000	0.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6995	7045	-	nan	NC_007946.1	0.00000	755.229968	0.000000	0.0	0.000000	1003.152766	...	0.00000	0.000000	662.656744	0.000000	0.000000	0.000000	0.000000	0.000000	549.862534	0.000000
6996	7046	+	nan	NC_007946.1	0.00000	755.229968	0.000000	0.0	0.000000	0.000000	...	0.00000	0.000000	0.000000	680.592639	0.000000	0.000000	629.954328	566.718187	0.000000	0.000000
6997	7047	-	nan	NC_007946.1	0.00000	0.000000	0.000000	0.0	0.000000	0.000000	...	0.00000	0.000000	0.000000	732.945919	635.930048	0.000000	0.000000	0.000000	549.862534	0.000000
6998	7048	+	nan	NC_007946.1	1108.64745	0.000000	0.000000	0.0	753.863551	788.191459	...	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	682.450522	0.000000	0.000000	715.453802
6999	7050	+	nan	NC_007946.1	0.00000	0.000000	734.322221	0.0	1055.408971	0.000000	...	595.38576	0.000000	713.630339	0.000000	0.000000	581.856652	682.450522	0.000000	0.000000	0.000000

7000 rows × 40 columns

Deconvolution and location inference

We then use the count matrix to infer the most likely location of each mutant using deconvolve().

[9]:

# Inferring most likely locations and writing location summary
experiment.deconvolve()

We can access the deconvolution results with location_summary to get detailed information about each detected mutant. To get information about each well in the arrayed library, use transposed_location_summary.

[11]:

experiment.location_summary

[11]:

	Reference	Coordinate	Gene	Locus_tag	Predicted_wells	Gene_orientation	Transposon_orientation	Percent_from_start	Gene_hit	Ambiguity	Possible_wells	Pool_counts	Barcode
0	NC_007946.1	49	NaN	NaN	F_11_PC02_PR02	NaN	-	NaN	False	Unambiguous	F_11_PC02_PR02	11:10.0;F:12.0;PC02:12.0;PR02:14.0	nan
1	NC_007946.1	50	NaN	NaN	F_5_PC08_PR08	NaN	-	NaN	False	Unambiguous	F_5_PC08_PR08	5:13.0;F:10.0;PC08:13.0;PR08:13.0	nan
2	NC_007946.1	51	NaN	NaN	D_8_PC05_PR04	NaN	-	NaN	False	Unambiguous	D_8_PC05_PR04	8:10.0;D:10.0;PC05:10.0;PR04:11.0	nan
3	NC_007946.1	53	NaN	NaN	E_2_PC07_PR01	NaN	-	NaN	False	Unambiguous	E_2_PC07_PR01	2:13.0;E:11.0;PC07:10.0;PR01:10.0	nan
4	NC_007946.1	54	NaN	NaN	D_10_PC08_PR06	NaN	+	NaN	False	Unambiguous	D_10_PC08_PR06	10:13.0;D:10.0;PC08:14.0;PR06:14.0	nan
...	...	...	...	...	...	...	...	...	...	...	...	...	...
6995	NC_007946.1	7045	Not_found	UTI89_RS00040	B_10_PC03_PR01;D_3_PC05_PR07;F_7_PC03_PR01	-	-	0.55905	True	Ambiguous	B_10_PC03_PR01;B_10_PC03_PR07;B_10_PC05_PR01;B...	10:10.0;3:14.0;7:13.0;B:12.0;D:12.0;F:12.0;PC0...	nan
6996	NC_007946.1	7046	Not_found	UTI89_RS00040	B_10_PC05_PR06;C_6_PC03_PR02;G_4_PC05_PR05	-	+	0.558351	True	Ambiguous	B_10_PC03_PR02;B_10_PC03_PR05;B_10_PC03_PR06;B...	10:10.0;4:13.0;6:14.0;B:10.0;C:14.0;G:11.0;PC0...	nan
6997	NC_007946.1	7047	Not_found	UTI89_RS00040	A_6_PC01_PR02;C_9_PC04_PR03;D_8_PC04_PR07	-	-	0.557652	True	Ambiguous	A_6_PC01_PR02;A_6_PC01_PR03;A_6_PC01_PR07;A_6_...	6:14.0;8:10.0;9:12.0;A:13.0;C:13.0;D:10.0;PC01...	nan
6998	NC_007946.1	7048	Not_found	UTI89_RS00040	A_1_PC04_PR08;A_2_PC03_PR05;F_3_PC03_PR05	-	+	0.556953	True	Ambiguous	A_1_PC03_PR05;A_1_PC03_PR08;A_1_PC04_PR05;A_1_...	1:14.0;2:10.0;3:11.0;A:13.0;F:10.0;PC03:10.0;P...	nan
6999	NC_007946.1	7050	Not_found	UTI89_RS00040	A_5_PC07_PR05;D_11_PC07_PR04;H_2_PC02_PR01	-	+	0.555556	True	Ambiguous	A_11_PC02_PR01;A_11_PC02_PR04;A_11_PC02_PR05;A...	11:10.0;2:14.0;5:14.0;A:13.0;D:11.0;H:13.0;PC0...	nan

7000 rows × 13 columns
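The Possible_wells and Ambiguity columns can be understood as follows: a mutant that appears in exactly one pool per pooling dimension maps to a single well, whereas a mutant that appears in several pools of a dimension is compatible with every combination of those pools. The sketch below reproduces only this combinatorial step from a Pool_counts string; it is not arraylib's implementation and does not attempt the further inference behind Predicted_wells. The pool names follow the table above (rows A to H, columns 1 to 12, PC01 to PC08, PR01 to PR08); the function names and the second example are hypothetical.

```python
from itertools import product

# Pool names per dimension, in the order used for well names (row_col_PC_PR)
DIMENSIONS = [
    list("ABCDEFGH"),
    [str(i) for i in range(1, 13)],
    [f"PC{i:02d}" for i in range(1, 9)],
    [f"PR{i:02d}" for i in range(1, 9)],
]

def parse_pool_counts(text):
    """Turn '11:10.0;F:12.0' into {'11': 10.0, 'F': 12.0}."""
    pairs = (item.split(":") for item in text.split(";"))
    return {pool: float(count) for pool, count in pairs}

def possible_wells(pool_counts, dimensions=DIMENSIONS):
    """Enumerate all wells consistent with the pools a mutant was seen in."""
    hits = [sorted(p for p in dim if pool_counts.get(p, 0) > 0) for dim in dimensions]
    if not all(hits):                 # mutant missing from a whole dimension
        return [], None
    wells = ["_".join(combo) for combo in product(*hits)]
    return wells, "Unambiguous" if len(wells) == 1 else "Ambiguous"

# Pool_counts of the first mutant in the table above
print(possible_wells(parse_pool_counts("11:10.0;F:12.0;PC02:12.0;PR02:14.0")))
# Hypothetical mutant seen in two row pools and two column pools
print(possible_wells(parse_pool_counts("A:5;B:6;1:4;2:7;PC01:9;PR01:3")))
```

With these four dimensions there are 8 × 12 × 8 × 8 = 6144 wells in total.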

[12]:


experiment.transposed_location_summary

[12]:

	Well	row	col	PC	PR	Reference	Coordinate	Gene	Locus_tag	Gene_orientation	Transposon_orientation	Percent_from_start	Gene_hit	Ambiguity	Barcode
Well
A_1_PC01_PR01	A_1_PC01_PR01	A	1	PC01	PR01	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
A_1_PC01_PR02	A_1_PC01_PR02	A	1	PC01	PR02	NC_007946.1;NC_007946.1	689;6692	thrA;Not_found	UTI89_RS00010;UTI89_RS00040	+;-	-;-	0.14372716199756394;0.8057302585604472	True;True	Unambiguous;Ambiguous	nan
A_1_PC01_PR03	A_1_PC01_PR03	A	1	PC01	PR03	NC_007946.1;NC_007946.1;NC_007946.1	4056;5156;5581	thrC;nan;yaaA	UTI89_RS00020;nan;UTI89_RS00035	+;nan;-	+;-;+	0.2517482517482518;nan;0.9832689832689833	True;False;True	Ambiguous;Ambiguous;Ambiguous	nan
A_1_PC01_PR04	A_1_PC01_PR04	A	1	PC01	PR04	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
A_1_PC01_PR05	A_1_PC01_PR05	A	1	PC01	PR05	NC_007946.1	5715	yaaA	UTI89_RS00035	-	-	0.810811	True	Ambiguous	nan
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
H_12_PC08_PR04	H_12_PC08_PR04	H	12	PC08	PR04	NC_007946.1;NC_007946.1	3370;3884	thrB;thrC	UTI89_RS00015;UTI89_RS00020	+;+	+;-	0.6120042872454448;0.11810411810411811	True;True	Unambiguous;Ambiguous	nan
H_12_PC08_PR05	H_12_PC08_PR05	H	12	PC08	PR05	NC_007946.1	3566	thrB	UTI89_RS00015	+	+	0.822079	True	Ambiguous	nan
H_12_PC08_PR06	H_12_PC08_PR06	H	12	PC08	PR06	NC_007946.1	5399	Not_found	UTI89_RS00030	+	+	0.56229	True	Ambiguous	nan
H_12_PC08_PR07	H_12_PC08_PR07	H	12	PC08	PR07	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
H_12_PC08_PR08	H_12_PC08_PR08	H	12	PC08	PR08	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan

6144 rows × 15 columns
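When a well contains several mutants, the per-mutant values are joined with semicolons, as in well A_1_PC01_PR02 above. For downstream analysis it can be convenient to split such an entry into one record per mutant. The following is a minimal sketch using a plain dictionary for one row; `split_well_entry` is a hypothetical helper, not part of arraylib.

```python
def split_well_entry(entry, columns, sep=";"):
    """Split semicolon-joined columns of one well into one dict per mutant."""
    parts = [entry[column].split(sep) for column in columns]
    if len({len(p) for p in parts}) != 1:
        raise ValueError("columns hold different numbers of entries")
    return [dict(zip(columns, values)) for values in zip(*parts)]

# Values copied from well A_1_PC01_PR02 in the table above
well = {
    "Well": "A_1_PC01_PR02",
    "Coordinate": "689;6692",
    "Gene": "thrA;Not_found",
    "Locus_tag": "UTI89_RS00010;UTI89_RS00040",
    "Ambiguity": "Unambiguous;Ambiguous",
}
for record in split_well_entry(well, ["Coordinate", "Gene", "Locus_tag", "Ambiguity"]):
    print(record)
```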

References

[1] Langmead, B. and Salzberg, S.L., 2012. Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), pp.357-359.
