Analyses – pbxplore.analysis
Build occurrence matrix
pbxplore.analysis.count_matrix(pb_seq)
Count the occurrences of each block at each position.
The occurrence matrix has one row per sequence position and one column per block. The columns are ordered as in pbxplore.PB.NAMES.
Parameters: pb_seq – a list of PB sequences.
Returns: pb_count – the occurrence matrix.
Return type: numpy array
Raises: pbxplore.PB.InvalidBlockError – encountered an unexpected PB
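The counting described above can be illustrated with a minimal NumPy sketch. It is not the library's implementation: `count_matrix_sketch` and `PB_NAMES` are hypothetical names, the block order `a` to `p` is an assumption about pbxplore.PB.NAMES, and the sketch assumes that rows index positions along the sequence. Unlike the documented function, it does not validate block names.

```python
import numpy as np

# Assumption: the 16 protein blocks are labelled 'a' through 'p',
# in the order we assume pbxplore.PB.NAMES uses.
PB_NAMES = "abcdefghijklmnop"


def count_matrix_sketch(pb_seq):
    """Count how often each block occurs at each position (illustrative only)."""
    lengths = {len(seq) for seq in pb_seq}
    if len(lengths) != 1:
        raise ValueError("All sequences must have the same length.")
    # 2D array of single characters: one row per input sequence.
    blocks = np.array([list(seq) for seq in pb_seq])
    # Result: one row per position, one column per block.
    counts = np.zeros((blocks.shape[1], len(PB_NAMES)), dtype=int)
    for col, name in enumerate(PB_NAMES):
        # Summing over sequences counts how often `name` appears at each position.
        counts[:, col] = (blocks == name).sum(axis=0)
    return counts


sequences = ["abcd", "abcc", "aacd"]
counts = count_matrix_sketch(sequences)
```

With three sequences of length four, `counts` has four rows and sixteen columns, and each row sums to the number of sequences.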
pbxplore.analysis.read_occurence_file(name)
Read an occurrence matrix from a file. The function returns the matrix as a numpy array, together with the residue indexes.
Parameters: name (str) – name of the file.
Returns:
- count_mat (numpy array) – the occurrence matrix without the residue numbers
- residues (list) – the list of residue indexes
Raises: ValueError – when something is wrong with the file
pbxplore.analysis.plot_map(fname, count_mat, idx_first_residue=1, residue_min=1, residue_max=None)
Generate a map of the distribution of PBs along the protein sequence from an occurrence matrix.
Parameters:
- fname (str) – the path to the file to write in
- count_mat (numpy array) – an occurrence matrix returned by count_matrix
- idx_first_residue (int) – the index of the first residue in the matrix
- residue_min (int) – the lower bound of the protein sequence
- residue_max (int) – the upper bound of the protein sequence
Compare protein block sequences
pbxplore.analysis.compare(header_lst, seq_lst, substitution_mat, fname)
Command line wrapper for the comparison of all sequences with the first one.
When the --compare option is given on the command line, the program compares all the sequences to the first one and writes these comparisons as sequences of digits. Each digit represents the distance between the PB in the target and the PB in the reference at the same position. The digits are normalized to the [0; 9] range.
This function runs the comparison, writes the result to a FASTA file, and prints information about the process to the screen.
Parameters:
- header_lst (list of strings) – the list of sequence headers, ordered as the sequences
- seq_lst (list of strings) – the list of sequences, ordered as the headers
- substitution_mat (numpy.array) – a substitution matrix expressed as similarity scores
- fname (str) – the output file name
Visualize deformability
pbxplore.analysis.compute_neq(count_mat)
Compute the Neq for each residue from an occurrence matrix.
Parameters: count_mat (numpy array) – an occurrence matrix returned by count_matrix.
Returns: neq_array – a 1D array containing the Neq values
Return type: numpy array
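The text above does not state how Neq is derived from the counts. The sketch below assumes the definition commonly used for protein blocks, Neq = exp(-Σ f ln f), where f is the frequency of each block at a given residue. `neq_sketch` is a hypothetical helper and not the library function, and the toy matrix uses only four columns for brevity.

```python
import numpy as np


def neq_sketch(count_mat):
    """Neq per row: exponential of the Shannon entropy of the block frequencies."""
    count_mat = np.asarray(count_mat, dtype=float)
    # Turn counts into frequencies, row by row.
    freq = count_mat / count_mat.sum(axis=1, keepdims=True)
    # Treat 0 * ln(0) as 0 by only taking the log of non-zero frequencies.
    log_freq = np.zeros_like(freq)
    np.log(freq, out=log_freq, where=freq > 0)
    entropy = -(freq * log_freq).sum(axis=1)
    return np.exp(entropy)


counts = np.array([
    [4, 0, 0, 0],  # a single block observed at this residue
    [2, 2, 0, 0],  # two equally frequent blocks
    [1, 1, 1, 1],  # four equally frequent blocks
])
neq = neq_sketch(counts)
```

Under this definition, a residue that always adopts the same block has a Neq of 1, and a residue that visits k blocks with equal frequency has a Neq of k.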
pbxplore.analysis.plot_neq(fname, neq_array, idx_first_residue=1, residue_min=1, residue_max=None)
Generate the Neq plot along the protein sequence.
Parameters:
- fname (str) – the path to the file to write in
- neq_array (numpy array) – an array containing the Neq value associated with each residue number
- idx_first_residue (int) – the index of the first residue in the array
- residue_min (int) – the lower bound of the protein sequence
- residue_max (int) – the upper bound of the protein sequence
pbxplore.analysis.generate_weblogo(fname, count_mat, idx_first_residue=1, residue_min=1, residue_max=None, title='')
Generate a logo representation of PB frequencies along the protein sequence using the weblogo library.
The weblogo reference: G. E. Crooks, G. Hon, J.-M. Chandonia, and S. E. Brenner. ‘WebLogo: A Sequence Logo Generator.’ Genome Research 14:1188–90 (2004). doi:10.1101/gr.849004. http://weblogo.threeplusone.com/
Parameters:
- fname (str) – the path to the file to write in
- count_mat (numpy array) – an occurrence matrix returned by count_matrix
- idx_first_residue (int) – the index of the first residue in the matrix
- residue_min (int) – the lower bound of the residue frame
- residue_max (int) – the upper bound of the residue frame
- title (str) – the title of the weblogo. Default is empty.
Utils
pbxplore.analysis.substitution_score(substitution_matrix, seqA, seqB)
Compute the substitution score to go from seqA to seqB.
Both sequences must have the same length.
The score is expressed either as a similarity or as a distance, depending on the substitution matrix.
pbxplore.analysis.compute_freq_matrix(count_mat)
Compute a PB frequency matrix from an occurrence matrix.
The frequency matrix has one row per sequence position and one column per block. The columns are ordered as in pbxplore.PB.NAMES.
Parameters: count_mat (numpy array) – an occurrence matrix returned by count_matrix.
Returns: freq – the frequency matrix
Return type: numpy array
pbxplore.analysis.compute_score_by_position(score_mat, seq1, seq2)
Compute the substitution score between two sequences, position by position.
The substitution score can represent a similarity or a distance, depending on the score matrix provided. The score matrix should be provided as a 2D numpy array where score[i, j] is the score to switch the PB at the i-th position in pbxplore.PB.NAMES to the PB at the j-th position in pbxplore.PB.NAMES.
The function returns the result as a list of substitution scores to go from seq1 to seq2 for each position. Both sequences must have the same length.
Note: The score to move from or to a Z block (dummy block) is always 0.
Raises: pbxplore.PB.InvalidBlockError – encountered an unexpected PB
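As an illustration of the per-position lookup described above, here is a minimal sketch. `score_by_position_sketch`, `PB_NAMES`, and the toy distance matrix are hypothetical; the block order `a` to `p` is an assumption about pbxplore.PB.NAMES, and a plain ValueError stands in for the library's InvalidBlockError.

```python
import numpy as np

# Assumption: block order 'a' through 'p', as we assume for pbxplore.PB.NAMES.
PB_NAMES = "abcdefghijklmnop"


def score_by_position_sketch(score_mat, seq1, seq2):
    """Return one substitution score per position for two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("Both sequences must have the same length.")
    scores = []
    for pb1, pb2 in zip(seq1, seq2):
        if "Z" in (pb1, pb2):
            # Moving from or to the dummy block Z always scores 0.
            scores.append(0)
            continue
        if pb1 not in PB_NAMES or pb2 not in PB_NAMES:
            raise ValueError("Unexpected PB: {!r} or {!r}".format(pb1, pb2))
        i = PB_NAMES.index(pb1)
        j = PB_NAMES.index(pb2)
        scores.append(score_mat[i, j])
    return scores


# Toy distance matrix for illustration only: |i - j| between block indexes.
indexes = np.arange(len(PB_NAMES))
toy_distance = np.abs(np.subtract.outer(indexes, indexes))

scores = score_by_position_sketch(toy_distance, "Zabc", "Zacc")
total = sum(scores)
```

Summing the per-position list gives a single score for the pair of sequences, which is one plausible way to obtain the value reported by substitution_score.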