How to run arraylib interactively

First we import the LibraryExperiment class from the package.

[2]:
from arraylib.libraryexperiment import LibraryExperiment

We generate an experiment object by instantiating the LibraryExperiment class with our chosen parameters.

Required parameters:

  • input_dir: path to directory holding the input fastq files

  • exp_design: path to a CSV file describing the experimental design (values separated by commas). The file must contain the columns Filename, Poolname and Pooldimension (see the example in tests/test_data/full_exp_design.csv).

    • Filename should contain all the unique input fastq filenames.

    • Poolname should indicate to which pool a given file belongs. Multiple files per poolname are allowed.

    • Pooldimension indicates the pooling dimension a pool belongs to. All pools sharing the same pooling dimension should have the same string in the Pooldimension column.

An example of what an exp_design file could look like:

Filename          Poolname    Pooldimension
column1.fastq     column1     columns
column2.fastq     column2     columns
row1.fastq        row1        rows
row2.fastq        row2        rows
platerow1.fastq   platerow1   platerows
platerow2.fastq   platerow2   platerows
platecol1.fastq   platecol1   platecols
platecol2.fastq   platecol2   platecols

  • gb_ref: path to the GenBank reference file

  • bowtie_ref: path to the bowtie2 [1] index files, ending with the basename of the index (e.g. if the basename of your index is UTI89 and the index files are stored in the directory bowtie_ref, pass bowtie_ref/UTI89). See https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-build-indexer for instructions on how to create bowtie2 indices.

  • tn_seq: transposon sequence (e.g. AGATGTGTATAAGAGACAG)

  • bar_upstream: upstream sequence of the barcode (e.g. CGAGGTCTCT)

  • bar_downstream: downstream sequence of the barcode (e.g. CGTACGCTGC)
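
Before instantiating the class, it can help to check the exp_design file against the column rules described above. The following is a minimal sketch using only the Python standard library; check_exp_design is a hypothetical helper, not part of arraylib, and the rule that a pool maps to exactly one pooling dimension is our reading of the description above.

```python
import csv
import io

REQUIRED_COLUMNS = ("Filename", "Poolname", "Pooldimension")


def check_exp_design(handle):
    """Return a list of problems found in an experimental design CSV."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    seen_files = set()
    pool_dimension = {}
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        filename = (row["Filename"] or "").strip()
        pool = (row["Poolname"] or "").strip()
        dimension = (row["Pooldimension"] or "").strip()
        if filename in seen_files:
            problems.append(f"line {line_no}: duplicate filename {filename}")
        seen_files.add(filename)
        # Assumption: a pool may span several files but only one dimension.
        if pool_dimension.setdefault(pool, dimension) != dimension:
            problems.append(f"line {line_no}: pool {pool} has two dimensions")
    return problems


example = io.StringIO(
    "Filename,Poolname,Pooldimension\n"
    "column1.fastq,column1,columns\n"
    "row1.fastq,row1,rows\n"
)
print(check_exp_design(example))
```

To check a file on disk, pass an open file handle instead of the io.StringIO object, e.g. open("full_exp_design.csv", newline="").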

Optional parameters:

  • map_quality: minimum bowtie2 mapping quality score for a read to be included

  • seq_quality: minimum phred score for each base for a read to be included

  • tn_mismatches: number of transposon mismatches allowed

  • filter_thr: threshold for the local filter (e.g. a threshold of 0.05 would filter out all read counts < 0.05 of the maximum read count for a given mutant)

  • global_filter_thr: threshold for the global filter (all read counts below global_filter_thr will be set to 0)

  • min_counts: minimum counts of a barcode to be included in the analysis

  • use_barcodes: whether to perform deconvolution based only on barcodes, without genomic alignment

You can also refer to the API to get a detailed description of the input parameter choices.

[3]:
experiment = LibraryExperiment(cores=4,
                               map_quality=30,
                               seq_quality=10,
                               gb_ref="gb_ref",
                               bowtie_ref="bowtie_ref/UTI89",
                               tn_seq="AGATGTGTATAAGAGACAG",
                               tn_mismatches=2,
                               input_dir="input",
                               exp_design="full_exp_design.csv",
                               use_barcodes=False,
                               bar_upstream="CGAGGTCTCT",
                               bar_downstream="CGTACGCTGC",
                               filter_thr=0.05,
                               global_filter_thr=5,
                               min_counts=5)

Read trimming and genomic alignment

For every read in the input fastq files in input_dir, we search for the transposon sequence tn_seq, allowing up to tn_mismatches mismatches. Barcode sequences are detected between the provided bar_upstream and bar_downstream sequences, and the genomic part of the read downstream of the transposon sequence is collected in a compiled .fastq file. As a side effect, a temp directory is created and the compiled, trimmed reads are saved to temp/trimmed_sequences.fastq.

[4]:
# detect transposon and trim sequences
experiment.get_genomic_seq()
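
To make the trimming step more concrete, here is a simplified sketch of what detecting a transposon with mismatches and extracting a barcode could look like for a single read. It is not arraylib's implementation: find_with_mismatches and split_read are hypothetical helpers, mismatches are assumed to be substitutions only, the barcode flanks are matched exactly, and the layout of the toy read (barcode before transposon) is made up for illustration.

```python
def find_with_mismatches(read, pattern, max_mismatches):
    """Return the first index where pattern matches read with at most
    max_mismatches substitutions, or -1 if there is no such position."""
    for start in range(len(read) - len(pattern) + 1):
        window = read[start:start + len(pattern)]
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= max_mismatches:
            return start
    return -1


def split_read(read, tn_seq, tn_mismatches, bar_upstream, bar_downstream):
    """Return (barcode, genomic_part); either value is None if not found."""
    barcode = None
    up = read.find(bar_upstream)
    if up != -1:
        bar_start = up + len(bar_upstream)
        down = read.find(bar_downstream, bar_start)
        if down != -1:
            barcode = read[bar_start:down]

    genomic = None
    tn_start = find_with_mismatches(read, tn_seq, tn_mismatches)
    if tn_start != -1:
        # Keep everything downstream of the transposon sequence.
        genomic = read[tn_start + len(tn_seq):]
    return barcode, genomic


# Toy read: upstream flank, barcode, downstream flank,
# transposon with one substitution, genomic sequence.
read = ("CGAGGTCTCT" + "ACGTACGTAA" + "CGTACGCTGC"
        + "AGATGTGTATAAGAGTCAG" + "TTGACCGATTACA")
print(split_read(read, "AGATGTGTATAAGAGACAG", 2, "CGAGGTCTCT", "CGTACGCTGC"))
```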

Then we use bowtie2 to align the trimmed reads to the reference genome in bowtie_ref. The output of bowtie2 is parsed and stored in temp/alignment_result.csv.

[5]:
# align trimmed sequences to reference using bowtie2
experiment.align_genomic_seq()
624483 reads; of these:
  624483 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    624483 (100.00%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
100.00% overall alignment rate
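
bowtie2 reports alignments in the SAM format, in which each record carries a FLAG field (bit 4 marks an unmapped read, bit 16 a read aligned to the reverse strand) and a mapping quality (MAPQ) per read. The sketch below shows how such records could be parsed and filtered by a map_quality threshold. It is an illustration under these assumptions, not arraylib's parser, and the columns of temp/alignment_result.csv may differ from the dictionary keys used here; the SAM lines are made-up examples.

```python
def parse_sam_line(line):
    """Extract name, reference, position, strand and MAPQ from a SAM record."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "read": fields[0],
        "reference": fields[2],
        "position": int(fields[3]),  # 1-based leftmost mapping position
        "orientation": "-" if flag & 16 else "+",
        "mapq": int(fields[4]),
        "mapped": not (flag & 4),
    }


def filter_alignments(sam_lines, map_quality):
    """Yield mapped records with MAPQ >= map_quality; skip header lines."""
    for line in sam_lines:
        if line.startswith("@"):
            continue
        record = parse_sam_line(line)
        if record["mapped"] and record["mapq"] >= map_quality:
            yield record


# Made-up records: one good alignment, one low-quality, one unmapped.
sam_lines = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tNC_007946.1\t49\t42\t13M\t*\t0\t0\tTTGACCGATTACA\tIIIIIIIIIIIII",
    "read2\t16\tNC_007946.1\t50\t12\t13M\t*\t0\t0\tTTGACCGATTACA\tIIIIIIIIIIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tTTGACCGATTACA\tIIIIIIIIIIIII",
]
for record in filter_alignments(sam_lines, map_quality=30):
    print(record["read"], record["reference"], record["position"], record["orientation"])
```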

Assembly of count matrix and preprocessing

Once the alignment is finished we can count the reads of each mutant in each pool and assemble the count matrix using write_count_matrix().

[6]:
# assemble count matrix from aligned reads
experiment.write_count_matrix()
Generating count matrix with 7000 mutants
Generated count matrix!
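
Conceptually, assembling the count matrix means tallying, for each mutant, how many reads were observed in each pool. The following sketch illustrates the idea with plain Python containers. It assumes that a mutant is identified by its reference, position and orientation, and build_count_matrix is a hypothetical helper; arraylib itself stores the result as a table with additional columns.

```python
from collections import Counter, defaultdict


def build_count_matrix(alignments):
    """Count reads per (reference, position, orientation) mutant and pool.

    alignments is an iterable of (pool, reference, position, orientation).
    Returns the sorted pool names and one row per mutant.
    """
    counts = defaultdict(Counter)
    for pool, reference, position, orientation in alignments:
        counts[(reference, position, orientation)][pool] += 1
    # Pool names are sorted as strings, so "10" sorts before "2".
    pools = sorted({pool for per_pool in counts.values() for pool in per_pool})
    rows = [
        [*mutant, *(per_pool.get(pool, 0) for pool in pools)]
        for mutant, per_pool in sorted(counts.items())
    ]
    return pools, rows


# Made-up alignments for two mutants seen in three pools.
alignments = [
    ("A", "NC_007946.1", 49, "-"),
    ("A", "NC_007946.1", 49, "-"),
    ("1", "NC_007946.1", 49, "-"),
    ("B", "NC_007946.1", 54, "+"),
]
pools, rows = build_count_matrix(alignments)
print(pools)
for row in rows:
    print(row)
```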

write_count_matrix() generates 4 different count matrices stored as attributes:

  • raw_count_mat: stores the raw reads per mutant and pool

  • count_mat: stores the raw reads per mutant and pool, but spurious barcodes are filtered out

  • normalized_count_mat: stores the reads from count_mat, but normalized to counts per million

  • filtered_count_mat: stores normalized reads, but has filters applied according to filter_thr and global_filter_thr
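
Counts per million rescale the counts so that they sum to one million, which makes pools with different sequencing depths comparable. The sketch below assumes that the scaling is done separately for each pool (each column of the count matrix); counts_per_million is a hypothetical helper and not arraylib's function.

```python
def counts_per_million(column):
    """Scale the raw counts of one pool so that they sum to one million."""
    total = sum(column)
    if total == 0:
        # An empty pool stays empty instead of dividing by zero.
        return [0.0 for _ in column]
    return [count * 1_000_000 / total for count in column]


# Made-up raw counts of three mutants in a single pool.
print(counts_per_million([10, 30, 60]))
```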

In practice, it is beneficial to apply filters to the normalized count matrix because spurious reads are often detected. Such reads make the deconvolution and location inference much harder, as many mutants become ambiguous. Therefore, we usually want to continue with filtered_count_mat.
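
The effect of the two filters on a single mutant can be sketched as follows. This is only an illustration of the parameter descriptions given earlier: apply_filters is a hypothetical helper, and we assume that both thresholds are applied to the same values and that a count exactly equal to a threshold is kept. arraylib's actual order of operations and units may differ.

```python
def apply_filters(mutant_counts, filter_thr, global_filter_thr):
    """Zero out weak counts in one mutant's row of the count matrix."""
    # Local filter: relative to the largest count of this mutant.
    local_cutoff = filter_thr * max(mutant_counts, default=0)
    return [
        count if count >= local_cutoff and count >= global_filter_thr else 0.0
        for count in mutant_counts
    ]


# Made-up counts of one mutant across four pools.
print(apply_filters([1000.0, 60.0, 30.0, 4.0], filter_thr=0.05, global_filter_thr=5))
```

In this toy row, the local cutoff is 5% of 1000, so the counts 30 and 4 are set to zero while 1000 and 60 are kept.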

[7]:
experiment.count_mat
[7]:
Feature Orientation Barcode Reference 1 10 11 12 2 3 ... PC07 PC08 PR01 PR02 PR03 PR04 PR05 PR06 PR07 PR08
0 49 - nan NC_007946.1 0 0 10 0 0 0 ... 0 0 0 14 0 0 0 0 0 0
1 50 - nan NC_007946.1 0 0 0 0 0 0 ... 0 13 0 0 0 0 0 0 0 13
2 51 - nan NC_007946.1 0 0 0 0 0 0 ... 0 0 0 0 0 11 0 0 0 0
3 53 - nan NC_007946.1 0 0 0 0 13 0 ... 10 0 10 0 0 0 0 0 0 0
4 54 + nan NC_007946.1 0 13 0 0 0 0 ... 0 14 0 0 0 0 0 14 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6995 7045 - nan NC_007946.1 0 10 0 0 0 14 ... 0 0 13 0 0 0 0 0 11 0
6996 7046 + nan NC_007946.1 0 10 0 0 0 0 ... 0 0 0 13 0 0 12 11 0 0
6997 7047 - nan NC_007946.1 0 0 0 0 0 0 ... 0 0 0 14 12 0 0 0 11 0
6998 7048 + nan NC_007946.1 14 0 0 0 10 11 ... 0 0 0 0 0 0 13 0 0 14
6999 7050 + nan NC_007946.1 0 0 10 0 14 0 ... 12 0 14 0 0 11 13 0 0 0

7000 rows × 40 columns

[8]:
experiment.filtered_count_mat
[8]:
Feature Orientation Barcode Reference 1 10 11 12 2 3 ... PC07 PC08 PR01 PR02 PR03 PR04 PR05 PR06 PR07 PR08
0 49 - nan NC_007946.1 0.00000 0.000000 734.322221 0.0 0.000000 0.000000 ... 0.00000 0.000000 0.000000 732.945919 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 50 - nan NC_007946.1 0.00000 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.00000 675.465032 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 664.349959
2 51 - nan NC_007946.1 0.00000 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.00000 0.000000 0.000000 0.000000 0.000000 581.856652 0.000000 0.000000 0.000000 0.000000
3 53 - nan NC_007946.1 0.00000 0.000000 0.000000 0.0 980.022616 0.000000 ... 496.15480 0.000000 509.735957 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 54 + nan NC_007946.1 0.00000 981.798958 0.000000 0.0 0.000000 0.000000 ... 0.00000 727.423880 0.000000 0.000000 0.000000 0.000000 0.000000 721.277692 0.000000 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6995 7045 - nan NC_007946.1 0.00000 755.229968 0.000000 0.0 0.000000 1003.152766 ... 0.00000 0.000000 662.656744 0.000000 0.000000 0.000000 0.000000 0.000000 549.862534 0.000000
6996 7046 + nan NC_007946.1 0.00000 755.229968 0.000000 0.0 0.000000 0.000000 ... 0.00000 0.000000 0.000000 680.592639 0.000000 0.000000 629.954328 566.718187 0.000000 0.000000
6997 7047 - nan NC_007946.1 0.00000 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.00000 0.000000 0.000000 732.945919 635.930048 0.000000 0.000000 0.000000 549.862534 0.000000
6998 7048 + nan NC_007946.1 1108.64745 0.000000 0.000000 0.0 753.863551 788.191459 ... 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 682.450522 0.000000 0.000000 715.453802
6999 7050 + nan NC_007946.1 0.00000 0.000000 734.322221 0.0 1055.408971 0.000000 ... 595.38576 0.000000 713.630339 0.000000 0.000000 581.856652 682.450522 0.000000 0.000000 0.000000

7000 rows × 40 columns

Deconvolution and location inference

We then use the count matrix to infer the most likely location of each mutant using deconvolve().

[9]:
# Inferring most likely locations and writing location summary
experiment.deconvolve()

We can access the deconvolution results with location_summary to get detailed information about each detected mutant. To get information about each well in the arrayed library, use transposed_location_summary.

[11]:
experiment.location_summary

[11]:
Reference Coordinate Gene Locus_tag Predicted_wells Gene_orientation Transposon_orientation Percent_from_start Gene_hit Ambiguity Possible_wells Pool_counts Barcode
0 NC_007946.1 49 NaN NaN F_11_PC02_PR02 NaN - NaN False Unambiguous F_11_PC02_PR02 11:10.0;F:12.0;PC02:12.0;PR02:14.0 nan
1 NC_007946.1 50 NaN NaN F_5_PC08_PR08 NaN - NaN False Unambiguous F_5_PC08_PR08 5:13.0;F:10.0;PC08:13.0;PR08:13.0 nan
2 NC_007946.1 51 NaN NaN D_8_PC05_PR04 NaN - NaN False Unambiguous D_8_PC05_PR04 8:10.0;D:10.0;PC05:10.0;PR04:11.0 nan
3 NC_007946.1 53 NaN NaN E_2_PC07_PR01 NaN - NaN False Unambiguous E_2_PC07_PR01 2:13.0;E:11.0;PC07:10.0;PR01:10.0 nan
4 NC_007946.1 54 NaN NaN D_10_PC08_PR06 NaN + NaN False Unambiguous D_10_PC08_PR06 10:13.0;D:10.0;PC08:14.0;PR06:14.0 nan
... ... ... ... ... ... ... ... ... ... ... ... ... ...
6995 NC_007946.1 7045 Not_found UTI89_RS00040 B_10_PC03_PR01;D_3_PC05_PR07;F_7_PC03_PR01 - - 0.55905 True Ambiguous B_10_PC03_PR01;B_10_PC03_PR07;B_10_PC05_PR01;B... 10:10.0;3:14.0;7:13.0;B:12.0;D:12.0;F:12.0;PC0... nan
6996 NC_007946.1 7046 Not_found UTI89_RS00040 B_10_PC05_PR06;C_6_PC03_PR02;G_4_PC05_PR05 - + 0.558351 True Ambiguous B_10_PC03_PR02;B_10_PC03_PR05;B_10_PC03_PR06;B... 10:10.0;4:13.0;6:14.0;B:10.0;C:14.0;G:11.0;PC0... nan
6997 NC_007946.1 7047 Not_found UTI89_RS00040 A_6_PC01_PR02;C_9_PC04_PR03;D_8_PC04_PR07 - - 0.557652 True Ambiguous A_6_PC01_PR02;A_6_PC01_PR03;A_6_PC01_PR07;A_6_... 6:14.0;8:10.0;9:12.0;A:13.0;C:13.0;D:10.0;PC01... nan
6998 NC_007946.1 7048 Not_found UTI89_RS00040 A_1_PC04_PR08;A_2_PC03_PR05;F_3_PC03_PR05 - + 0.556953 True Ambiguous A_1_PC03_PR05;A_1_PC03_PR08;A_1_PC04_PR05;A_1_... 1:14.0;2:10.0;3:11.0;A:13.0;F:10.0;PC03:10.0;P... nan
6999 NC_007946.1 7050 Not_found UTI89_RS00040 A_5_PC07_PR05;D_11_PC07_PR04;H_2_PC02_PR01 - + 0.555556 True Ambiguous A_11_PC02_PR01;A_11_PC02_PR04;A_11_PC02_PR05;A... 11:10.0;2:14.0;5:14.0;A:13.0;D:11.0;H:13.0;PC0... nan

7000 rows × 13 columns
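
The Pool_counts column lists the pools in which a mutant was detected. If a mutant appears in exactly one pool per pooling dimension, its well is determined directly; if it appears in several pools of one dimension, several wells are consistent with the data. The sketch below enumerates these candidate wells from a Pool_counts string. possible_wells is a hypothetical helper, the pool-to-dimension mapping is assumed to come from the exp_design file, and the sketch does not reproduce how arraylib selects Predicted_wells for ambiguous mutants.

```python
from itertools import product


def possible_wells(pool_counts, pool_dimension, dimension_order):
    """List every well consistent with the pools a mutant was seen in."""
    hits = {dimension: [] for dimension in dimension_order}
    for entry in pool_counts.split(";"):
        pool, count = entry.split(":")
        if float(count) > 0:
            hits[pool_dimension[pool]].append(pool)
    if any(not pools for pools in hits.values()):
        return []  # the mutant is missing from at least one dimension
    combos = product(*(hits[dimension] for dimension in dimension_order))
    return ["_".join(combo) for combo in combos]


order = ("rows", "columns", "platecols", "platerows")

# First mutant of the location summary above: one pool per dimension.
dims = {"11": "columns", "F": "rows", "PC02": "platecols", "PR02": "platerows"}
print(possible_wells("11:10.0;F:12.0;PC02:12.0;PR02:14.0", dims, order))

# Hypothetical mutant seen in two column pools: two candidate wells.
dims2 = {"1": "columns", "2": "columns", "A": "rows",
         "PC03": "platecols", "PR05": "platerows"}
print(possible_wells("1:14.0;2:10.0;A:13.0;PC03:10.0;PR05:12.0", dims2, order))
```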

[12]:

experiment.transposed_location_summary
[12]:
Well row col PC PR Reference Coordinate Gene Locus_tag Gene_orientation Transposon_orientation Percent_from_start Gene_hit Ambiguity Barcode
Well
A_1_PC01_PR01 A_1_PC01_PR01 A 1 PC01 PR01 nan nan nan nan nan nan nan nan nan nan
A_1_PC01_PR02 A_1_PC01_PR02 A 1 PC01 PR02 NC_007946.1;NC_007946.1 689;6692 thrA;Not_found UTI89_RS00010;UTI89_RS00040 +;- -;- 0.14372716199756394;0.8057302585604472 True;True Unambiguous;Ambiguous nan
A_1_PC01_PR03 A_1_PC01_PR03 A 1 PC01 PR03 NC_007946.1;NC_007946.1;NC_007946.1 4056;5156;5581 thrC;nan;yaaA UTI89_RS00020;nan;UTI89_RS00035 +;nan;- +;-;+ 0.2517482517482518;nan;0.9832689832689833 True;False;True Ambiguous;Ambiguous;Ambiguous nan
A_1_PC01_PR04 A_1_PC01_PR04 A 1 PC01 PR04 nan nan nan nan nan nan nan nan nan nan
A_1_PC01_PR05 A_1_PC01_PR05 A 1 PC01 PR05 NC_007946.1 5715 yaaA UTI89_RS00035 - - 0.810811 True Ambiguous nan
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
H_12_PC08_PR04 H_12_PC08_PR04 H 12 PC08 PR04 NC_007946.1;NC_007946.1 3370;3884 thrB;thrC UTI89_RS00015;UTI89_RS00020 +;+ +;- 0.6120042872454448;0.11810411810411811 True;True Unambiguous;Ambiguous nan
H_12_PC08_PR05 H_12_PC08_PR05 H 12 PC08 PR05 NC_007946.1 3566 thrB UTI89_RS00015 + + 0.822079 True Ambiguous nan
H_12_PC08_PR06 H_12_PC08_PR06 H 12 PC08 PR06 NC_007946.1 5399 Not_found UTI89_RS00030 + + 0.56229 True Ambiguous nan
H_12_PC08_PR07 H_12_PC08_PR07 H 12 PC08 PR07 nan nan nan nan nan nan nan nan nan nan
H_12_PC08_PR08 H_12_PC08_PR08 H 12 PC08 PR08 nan nan nan nan nan nan nan nan nan nan

6144 rows × 15 columns
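
When several mutants are assigned to the same well, their values are joined with semicolons, as in the row for A_1_PC01_PR02 above. A small sketch for splitting such a row back into one record per mutant is given below. split_well_entry is a hypothetical helper that works on a plain dictionary, and it should only be given fields that hold one value per mutant.

```python
def split_well_entry(entry, fields):
    """Turn one row of semicolon-joined values into one dict per mutant."""
    columns = [str(entry[field]).split(";") for field in fields]
    return [dict(zip(fields, values)) for values in zip(*columns)]


# Values copied from the row for well A_1_PC01_PR02.
entry = {
    "Coordinate": "689;6692",
    "Gene": "thrA;Not_found",
    "Ambiguity": "Unambiguous;Ambiguous",
}
for mutant in split_well_entry(entry, ["Coordinate", "Gene", "Ambiguity"]):
    print(mutant)
```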

References

[1] Langmead, B. and Salzberg, S.L., 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), pp. 357-359.