Machine Learning News Hubb
Advertisement Banner
  • Home
  • Machine Learning
  • Artificial Intelligence
  • Big Data
  • Deep Learning
  • Edge AI
  • Neural Network
  • Contact Us
  • Home
  • Machine Learning
  • Artificial Intelligence
  • Big Data
  • Deep Learning
  • Edge AI
  • Neural Network
  • Contact Us
Machine Learning News Hubb
No Result
View All Result
Home Artificial Intelligence

Burrows Wheeler in Python. The magical algorithm to index and… | by Cody Glickman, PhD | Sep, 2022

admin by admin
September 10, 2022
in Artificial Intelligence


The magical algorithm to index and compress large string data then find substrings quickly

Photo by SOULSANA on Unsplash

The Burrows Wheeler Transform (BWT) was developed in 1994 by Michael Burrows and David Wheeler. In simple terms, BWT is a string transformation that acts as a preprocessing step for lossless compression. BWT has implementations that exhibit both a linear O(n) performance and space complexity. Originally designed to prepare data for compression with techniques like bzip2, BWT has found prominence in bioinformatics allowing the fast mapping of short reads paving the way for high throughput genetic sequencing.

In this article, we will implement a simple BWT in python then show how to find small subsegments with mismatches using a streamlined suffix array BWT.

BWT Algorithm:

  1. Rotate letters: apple becomes [‘eappl’, ‘leapp’, ‘pleap’, ‘pplea’, ‘apple’]
  2. Sort rotated words alphabetically: [‘apple’, ‘eappl’, ‘leapp’, ‘pleap’, ‘pplea’]
  3. Take last column: elppa

BWT of apple becomes elppa

Simple Implementation of BWT

Implementation of BWT with Suffix Array for Fuzzy String Searching

In this python implementation, we will be generating a suffix array and use that to perform the BWT.

BWT from Suffix Array Algorithm:

  1. Generate suffix array: apple becomes [5, 0, 4, 3, 2, 1]
  2. Implementation of suffix array BWT: apple + [5, 0, 4,3,2,1] becomes: e$lppa
  3. Find match locations: app in apple returns position [0]
  4. Find imperfect matches: apl in apple with mismatches =1 returns positions [0, 1]

The following code chunk defines functions that will be used to find imperfect matches. The terminal function in the chunk below generate_all(“apple”) returns a tuple of unique letters, the BWT output, the LF Map, the index of unique letters, and the suffix array.

generate_all("apple")

Returns:

({‘a’, ‘e’, ‘l’, ‘p’},
‘e$lppa’,
{‘e’: [1, 1, 1, 1, 1, 1, 1, 0],
‘p’: [0, 0, 0, 1, 2, 2, 2, 0],
‘l’: [0, 0, 1, 1, 1, 1, 1, 0],
‘$’: [0, 1, 1, 1, 1, 1, 1, 0],
‘a’: [0, 0, 0, 0, 0, 1, 1, 0]},
{‘a’: 0, ‘e’: 1, ‘l’: 2, ‘p’: 3},
[5, 0, 4, 3, 2, 1])

Identifying (Fuzzy Matching) Substrings within BWT Data

Here we identify matches in a BWT string by the following algorithm. All searches occur in reverse for example finding app in apple starts by finding “p”, then “pp”, then “app”.

Note: Last First (LF) property is the ith occurrence of a letter [X] in the last column corresponds to the ith occurrence of [X] in the first column.

  1. Find the range of last letter in search string in the first column of the BWT
  2. Look at the same range of last column of the BWT
  3. Find the next character of the search in the LF mapping. Set “next character” column [NC] equal to the mapping entry for the observed range in the LF Mapping matrix. Set the next column [NC+1] in the LF mapping matrix equal to the “next character” value in the last + 1 row of the current range.
  4. Find the range for the “next character” in the first row and use NC & NC+1 to find the right subrange within the “next character” range.
find("app", 'apple', mismatches=0)

Returns: [0]

We also implement a way to identify mismatches in the subsequence. The characters must be the same though.

find("apZ", 'apple', mismatches=1)

Returns: [] “Empty return because Z is not in reference string”

find("ape", 'apple', mismatches=1)

Returns: [0]

The full notebook to perform the suffix array BWT can be found on GitHub.

Photo by Annie Spratt on Unsplash

We the following chunks will demonstrate the usefulness and speed of BWT. We will download a copy of Alice in Wonderland from Project Gutenberg and munge the text to start from Chapter 8. This leaves us with a text document with 61235 characters comprising of 27432 words in 3762 lines (Still plenty large for our purposes). We will then look for the number of times the phrase “Off with her head” is mentioned. We will compare our BWT search with a default python string search.

It takes about 3 seconds to create the suffix array and perform the BWT. Here is where the magic happens after creating the BWT, we can perform searches orders of magnitude faster than a standard string search. Even a fuzzy search is almost 100x times faster than a standard string search. Wow!

Speed performance comparison of string searching. Image by author.

The code for this article can be found on my personal GitHub. The speed to map small genetic sequences to a giant genome string is why BWT searching has become popular with bioinformatics tools like Bowtie and BWA. My name is Cody Glickman and I can be found on LinkedIn. Be sure to check out some of my other articles!



Source link

Previous Post

HOG Feature Descriptor. In computer vision there are many… | by Dahi Nemutlu | Sep, 2022

Next Post

HOG Özellik Tanımlayıcısı. Bilgisayarlı görüde, görüntü… | by Dahi Nemutlu | Sep, 2022

Next Post

HOG Özellik Tanımlayıcısı. Bilgisayarlı görüde, görüntü… | by Dahi Nemutlu | Sep, 2022

A Gentle Introduction to Time Series Analysis and Forecasting | by Azmine Toushik Wasi | Sep, 2022

4 Types of Machine Learning and Explained | by Si Thu Khant | Sep, 2022

Related Post

Artificial Intelligence

Dates and Subqueries in SQL. Working with dates in SQL | by Michael Grogan | Jan, 2023

by admin
January 27, 2023
Machine Learning

ChatGPT Is Here To Stay For A Long Time | by Jack Martin | Jan, 2023

by admin
January 27, 2023
Machine Learning

5 steps to organize digital files effectively

by admin
January 27, 2023
Artificial Intelligence

Explain text classification model predictions using Amazon SageMaker Clarify

by admin
January 27, 2023
Artificial Intelligence

Human Resource Management Challenges and The Role of Artificial Intelligence in 2023 | by Ghulam Mustafa Shoaib | Jan, 2023

by admin
January 27, 2023
Deep Learning

Training Neural Nets: a Hacker’s Perspective

by admin
January 27, 2023

© 2023 Machine Learning News Hubb All rights reserved.

Use of these names, logos, and brands does not imply endorsement unless specified. By using this site, you agree to the Privacy Policy and Terms & Conditions.

Navigate Site

  • Home
  • Machine Learning
  • Artificial Intelligence
  • Big Data
  • Deep Learning
  • Edge AI
  • Neural Network
  • Contact Us

Newsletter Sign Up.

No Result
View All Result
  • Home
  • Machine Learning
  • Artificial Intelligence
  • Big Data
  • Deep Learning
  • Edge AI
  • Neural Network
  • Contact Us

© 2023 JNews - Premium WordPress news & magazine theme by Jegtheme.