Part 2: The Python implementation
This article continues our previous discussion of how the trie structure works and how it can be applied to build a real-time autocompletion feature. The first article was dedicated to the theoretical aspects of trie construction and information retrieval. This one is more practical: we will walk through the Python implementation.
In the first part of this article, we explained how the autocomplete feature that we use on a daily basis to perform search queries online works. We saw that this feature relies on a data structure called a trie, which makes it quicker to find queries sharing the same prefix. In case you missed the first part, we strongly encourage you to follow this link for a detailed view of how data is stored in this structure and how the search algorithm can retrieve the matching queries in no time.
In today’s article, we will focus on the practical part. First, we propose a Python implementation of the trie and the search algorithm. Then, we build a Flask API to expose it.
Visually, the trie we would like to build looks like the following: each node stores a character and can have multiple children. Each node should also carry a flag that marks a query’s end.
To achieve this, we start with the building block: the node. We propose a trie node object that stores the character, the children as a dict of nodes, and a count initialized to -1. The count will later be updated with the query’s frequency, marking both the end of a query and its number of occurrences in the historical queries data.
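The original snippet is not reproduced here, but a minimal sketch of such a node could look like the following (the class and attribute names are our own choices; only the dict of children and the count initialized to -1 come from the description above):

```python
class TrieNode:
    """One trie node: a single character, children keyed by character,
    and a count that stays -1 until the node marks the end of a query,
    at which point it holds the query's frequency."""

    def __init__(self, char=""):
        self.char = char       # the character stored in this node
        self.children = {}     # dict mapping character -> TrieNode
        self.count = -1        # -1 means "not the end of a query"
```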
Once the node object is defined, it’s time to build the trie. For this, we loop over the historical queries character by character and insert each one at its right position. This logic is implemented in the insert function.
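A sketch of that insert logic might look like this. The function name `insert` comes from the article; its exact signature, and the bundled `TrieNode` class (repeated so the snippet runs on its own), are our assumptions:

```python
class TrieNode:
    """Trie node: a character, a dict of children, and count = -1
    until the node marks the end of a query."""

    def __init__(self, char=""):
        self.char = char
        self.children = {}
        self.count = -1


def insert(root, query, frequency):
    """Walk the trie character by character from the root, creating
    missing nodes along the way, then store the query's frequency on
    its final node to mark the end of the query."""
    node = root
    for ch in query:
        if ch not in node.children:
            node.children[ch] = TrieNode(ch)
        node = node.children[ch]
    node.count = frequency
```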
At this stage, we have the functions needed to build the trie and store the historical queries in it. The next part of the code is dedicated to query retrieval. The first step is to loop over the input string (the prefix) and traverse the trie until we reach the node containing the last character of the prefix. From there, the matching queries can be retrieved by traversing the trie recursively starting from that node.
Let’s consider the figure below and take ‘h’ as the input string. We first locate the prefix ‘h’, then perform a depth-first traversal of the trie to collect (-elp, -ell, -ide, -i), as shown by the orange arrows. The same applies to the prefix ‘he’: the output would be (-lp, -ll), as shown by the red arrows.
This can be achieved with the code below.
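A possible sketch of this retrieval step, bundled with minimal node and insert helpers so it runs on its own (all names, in particular `search`, are our own choices, not necessarily those of the original code):

```python
class TrieNode:
    def __init__(self, char=""):
        self.char = char
        self.children = {}
        self.count = -1   # -1 means "not the end of a query"


def insert(root, query, frequency):
    node = root
    for ch in query:
        node = node.children.setdefault(ch, TrieNode(ch))
    node.count = frequency


def search(root, prefix):
    """Return every (query, frequency) pair whose query starts
    with the given prefix."""
    # Step 1: walk down to the node holding the prefix's last character.
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []      # no historical query starts with this prefix
        node = node.children[ch]

    # Step 2: depth-first traversal collecting every completed query
    # below that node (count >= 0 marks a query's end).
    results = []

    def dfs(n, suffix):
        if n.count >= 0:
            results.append((prefix + suffix, n.count))
        for child in n.children.values():
            dfs(child, suffix + child.char)

    dfs(node, "")
    return results
```

Note that, as discussed in the first part, the cost of step 1 depends only on the prefix length, and step 2 only visits the subtree below the matched node.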
In order to simulate what happens in real time on a search engine as we start typing a query, we are going to build an API endpoint with Flask.
We will use a real-world historical queries dataset. To our knowledge, this is the biggest open dataset available online, with more than 448 million queries. It can be downloaded here.
However, here we will be using just the first 1 million rows, as the trie building process can be slow without parallelization. So, for the sake of simplicity, we build a 1-million-query trie. Also, in this example, frequencies are generated randomly.
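The loading step could be sketched as follows. The file format (one query per line) and all names here are our assumptions about the original code; the random frequencies match the article, and the node/insert helpers are repeated so the snippet is self-contained:

```python
import random

class TrieNode:
    def __init__(self, char=""):
        self.char = char
        self.children = {}
        self.count = -1


def insert(root, query, frequency):
    node = root
    for ch in query:
        node = node.children.setdefault(ch, TrieNode(ch))
    node.count = frequency


def build_trie(path, limit=1_000_000):
    """Build a trie from the first `limit` rows of the historical
    queries file, assigning each query a random frequency as the
    article does."""
    root = TrieNode()
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            query = line.strip().lower()
            if query:
                insert(root, query, random.randint(1, 1000))
    return root
```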
Yet, if anyone is interested in tackling this project and analyzing performance with the full dataset, contributions to the project repository on GitHub are welcome.
The API we are building here has a single endpoint, /autocomplete, which takes a string as a parameter and returns the top 10 historical queries matching the input prefix.
Notice that the trie is built only once, before running the app. Once this is done, the API is ready to receive requests.
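A self-contained Flask sketch under those constraints: the /autocomplete route matches the article, while the query parameter name `q`, the tiny demo trie (standing in for the 1-million-query build), and the helper names are our own assumptions.

```python
from flask import Flask, jsonify, request

# Minimal trie pieces repeated here so the sketch runs on its own.
class TrieNode:
    def __init__(self, char=""):
        self.char = char
        self.children = {}
        self.count = -1


def insert(root, query, frequency):
    node = root
    for ch in query:
        node = node.children.setdefault(ch, TrieNode(ch))
    node.count = frequency


def search(root, prefix):
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    results, stack = [], [(node, prefix)]
    while stack:
        n, word = stack.pop()
        if n.count >= 0:
            results.append((word, n.count))
        for child in n.children.values():
            stack.append((child, word + child.char))
    return results


app = Flask(__name__)

# Built once, before the app starts serving requests (a handful of
# hypothetical queries here instead of the real dataset).
trie_root = TrieNode()
for q, f in [("help", 11), ("hell", 7), ("hide", 5), ("hi", 20)]:
    insert(trie_root, q, f)


@app.route("/autocomplete")
def autocomplete():
    prefix = request.args.get("q", "").lower()
    matches = search(trie_root, prefix)
    # Keep only the ten most frequent completions.
    top10 = sorted(matches, key=lambda m: m[1], reverse=True)[:10]
    return jsonify([query for query, _ in top10])
```

Sorting by stored frequency is what turns the raw prefix matches into a ranked top-10 suggestion list.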
We can test the implementation by sending a few requests to the API endpoint. As shown in the figure below, the response time here is 3 milliseconds, which is acceptable for a real-time application (an average of 10 ms per query over 10,000 random queries chosen from the 1 million historical queries). Obviously, we are not taking the trie building time into account, as it is incurred once before launching the API; in this case, it was around 30 seconds.
In this article, we have seen how to implement the autocompletion feature using the trie structure. The key idea behind this structure is to arrange the historical data in a smart way so that queries sharing the same prefix are located under the same nodes of the trie.
We have also shown that the complexity of the search algorithm is independent of the size of the historical data. This is crucial, as we need the autocompletion to be scalable and to respond in real time.