Shubham Gupta
September 29, 2023 • 10 min read
When building any search application, having performant data ingestion is crucial. Users expect results instantly, so crawling and indexing content must be lightning-fast.
One way to do this is to store data on the server side rather than on the client side. This lets us mold the data the way we want and makes the search results much more contextual. Technologies like Elasticsearch and OpenSearch are, of course, a great help here.
But before we can store data, we need an efficient and speedy crawler to fetch it for us. This is the story of our experiments with building one of the fastest crawlers out there and how we got it to crawl over 100 documents in under six seconds.
Oh, and you can experience the crawler in action with OSlash Copilot!
We knew right from the start that simply building a crawler wouldn’t do. We wanted to see the disbelief and awe on the faces of OSlash users when they realized just how fast our crawler was. We didn’t want to give them time to notice the process at all. We wanted the data to be ready for search as soon as the users integrated their apps within OSlash Copilot.
The only way to do this? Using a multi-threaded environment.
So, we had two choices. One was to write our own code. The second was to use a library.
Writing our own code would have given us more control and customization, but we wouldn't have been able to move as fast as we would have liked, especially given that the code was going to be concurrent.
These constraints tipped us toward using a library, and we chose Java for its excellent concurrent-programming libraries. This let us focus on productization rather than building threading constructs from scratch.
We started looking for the right Java framework and zeroed in on Akka.
With Akka, we gained a reactive, actor-based architecture that delivered the high-throughput crawling we needed. But all was not well with the system.
Any search app is only as effective as its integrations. While fetching data from connected applications, we ran into a few issues.
We created multiple integrations for apps such as Google Drive, Gmail, Google Calendar, Slack, Freshdesk, Coda, etc. Each integration had a different kind of authentication. Some worked on OAuth 2.0, some used API keys, and for some we had to rely on usernames and passwords. This led to a lot of manual, repetitive, and duplicate effort, undoing much of the efficiency we had gained.
Under the hood, a concurrent crawler is essentially a producer-consumer system: crawler threads produce documents into a buffer while indexing threads consume them. The crux of the producer-consumer problem lies in the following facts:
(i) Producers produce data items and place them in the buffer.
(ii) Consumers consume data items from the buffer.
(iii) The buffer has limited capacity. If the buffer is full, producers must wait until there is space, and if the buffer is empty, consumers must wait until there is data.
(iv) Producers and consumers must operate concurrently, and their interactions should be synchronized to prevent issues like data corruption or deadlock.
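To make this concrete, here is a minimal Rust sketch of the pattern using a bounded channel as the buffer; the document type, buffer size, and item count are illustrative, not our production values.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    // A bounded buffer: producers block when it is full,
    // the consumer blocks when it is empty.
    let (tx, rx) = sync_channel::<String>(8);

    // Producer: pretends to fetch documents and puts them in the buffer.
    let producer = thread::spawn(move || {
        for i in 0..20 {
            let doc = format!("document-{i}");
            // `send` blocks while the buffer already holds 8 items.
            tx.send(doc).expect("consumer hung up");
        }
        // Dropping `tx` closes the channel so the consumer can finish.
    });

    // Consumer: drains the buffer and "indexes" each document.
    let consumer = thread::spawn(move || {
        // `recv` blocks while the buffer is empty and returns Err once
        // all producers are gone, so shutdown cannot deadlock.
        while let Ok(doc) = rx.recv() {
            println!("indexing {doc}");
        }
    });

    producer.join().unwrap();
    consumer.join().unwrap();
}
```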
The crawler would fail when authentication failed. For instance, if someone changed their password, their access key and secret key would also expire, leading to a failure in crawling.
We faced these issues mostly while crawling JIRA tickets and Confluence documentation. Both these tools use access tokens to authenticate requests. When an access token is close to expiration, we use the refresh token to generate a new access token.
But here's the catch: even the refresh token has an expiration time. Once it has been used to generate a new access token, it will eventually expire as well. This means we need to store the refresh token securely to facilitate the creation of new access tokens.
If, for any reason, the process of generating a new access token fails due to an expired refresh token or any other issue, the user will be required to reauthenticate.
On the one hand this ensures that only authorized users have access to the system and their information remains secure. On the other hand, it gave me sleepless nights, trying to find a fix for the issue.
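For illustration, this is roughly what that refresh step looks like with the reqwest crate (with its json feature). The endpoint shown is Atlassian's published OAuth 2.0 token URL; the struct and field names are placeholders for illustration, not our actual code.

```rust
use reqwest::Client;
use serde::Deserialize;

#[derive(Deserialize)]
struct TokenResponse {
    access_token: String,
    refresh_token: Option<String>, // the refresh token may be rotated too
    expires_in: u64,               // seconds until the new access token expires
}

// Exchange a stored refresh token for a fresh access token.
// If this call fails (e.g. the refresh token itself has expired),
// the only recourse is to send the user back through OAuth.
async fn refresh_access_token(
    client: &Client,
    client_id: &str,
    client_secret: &str,
    refresh_token: &str,
) -> Result<TokenResponse, reqwest::Error> {
    client
        .post("https://auth.atlassian.com/oauth/token")
        .json(&serde_json::json!({
            "grant_type": "refresh_token",
            "client_id": client_id,
            "client_secret": client_secret,
            "refresh_token": refresh_token,
        }))
        .send()
        .await?
        .error_for_status()?
        .json::<TokenResponse>()
        .await
}
```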
We partially remedied the problem with a unified index for every integration: data from all connectors was converted into a single format that could be stored in OpenSearch or Elasticsearch.
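The field names below are purely illustrative, but this is roughly what such a unified format looks like: every connector maps its payload into one shape before it is written to the search index.

```rust
use serde::Serialize;

// One shape for every connector: Google Drive files, Slack messages,
// JIRA tickets, etc. all get flattened into this before indexing.
#[derive(Serialize)]
struct UnifiedDocument {
    id: String,         // stable identifier from the source system
    source: String,     // e.g. "gdrive", "slack", "jira"
    title: String,
    body: String,       // cleaned, searchable text
    url: String,        // deep link back to the original item
    updated_at: String, // ISO-8601 timestamp for freshness ranking
}
```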
While we knew we had built a sound crawler, we were also aware of its limitations. The authentication woes made the top of the list.
To fix these issues, we decided to build our own authentication service that could work independently of the integrations. That way we would be able to resolve authentication issues within a single, separate service, making the whole process much faster!
This is how it worked: each workspace had a unique identifier called a "workspace pk," and each integration was identified by an "integration name." Combining the workspace key and the integration name gave us a key that allowed us to obtain a lock and perform authentication whenever the service was called from an integration.
Once the service was called, it provided the necessary authentication credentials, such as client ID and client secret or API key specific to that integration. The integration could then use these credentials to authenticate itself and access the relevant data.
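A simplified sketch of that idea in Rust: the key is just the workspace pk plus the integration name, and a per-key lock ensures only one credential lookup or refresh runs at a time. The types and the in-memory lock map are assumptions for illustration; the real service runs as a separate deployment.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Credentials handed back to an integration once it holds the lock.
#[derive(Clone)]
struct Credentials {
    client_id: String,
    client_secret: String,
}

#[derive(Default)]
struct AuthService {
    // One lock per (workspace, integration) pair so concurrent crawls of the
    // same integration never race each other on token refresh.
    locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
    credentials: Mutex<HashMap<String, Credentials>>,
}

impl AuthService {
    fn key(workspace_pk: &str, integration_name: &str) -> String {
        format!("{workspace_pk}:{integration_name}")
    }

    fn authenticate(&self, workspace_pk: &str, integration_name: &str) -> Option<Credentials> {
        let key = Self::key(workspace_pk, integration_name);

        // Fetch (or create) the lock for this workspace + integration pair.
        let lock = self
            .locks
            .lock()
            .unwrap()
            .entry(key.clone())
            .or_insert_with(|| Arc::new(Mutex::new(())))
            .clone();

        // Hold the lock while reading (or refreshing) credentials.
        let _guard = lock.lock().unwrap();
        self.credentials.lock().unwrap().get(&key).cloned()
    }
}
```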
As we progressed in building keyword-based search, the era of Large Language Models (LLMs) and Generative AI dawned.
Since we wanted to ride the wave and put our existing search expertise to the best use, we gradually combined semantic search and Retrieval-Augmented Generation (RAG) in the product construct of the OSlash Copilot. Here’s how it would work:
The user could paste any public URL (say to their website, knowledge base, help center, or product documentation) into the product interface and ask questions in natural language. The Copilot would then look for the relevant articles and passages where the answer would be found and generate an AI-summary for it, citing the source for reference.
Behind the scenes, our crawler would crawl the input URL, clean the data to rid it of unnecessary attributes irrelevant to the search, and then store it in OpenSearch. Our AI algorithm would then process this stored data, use ML to model it, and generate the relevant content based on the question asked.
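To make the last step of that pipeline concrete, here is a hedged sketch of writing one cleaned page into OpenSearch over its standard REST document API using reqwest; the index name, document shape, and endpoint are assumptions, not our production setup.

```rust
use reqwest::Client;
use serde_json::json;

// Store one cleaned page in OpenSearch via `PUT /{index}/_doc/{id}`.
async fn index_page(
    client: &Client,
    opensearch_url: &str, // e.g. "http://localhost:9200" for a local node
    id: &str,
    url: &str,
    title: &str,
    body: &str,
) -> Result<(), reqwest::Error> {
    client
        .put(format!("{opensearch_url}/pages/_doc/{id}"))
        .json(&json!({
            "url": url,
            "title": title,
            "body": body,
        }))
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}
```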
If you are curious about the setup, let me just say diving into it would require a blog post of its own! Stay tuned!
For now, let’s move on to the next phase of our crawler journey.
After thinking through all of these issues, we finally knew what to do! With the solutions in sight, it was time to design the system and begin development.
We had prior experience writing crawlers in Java, but the crawling wasn't fast enough.
We had also built a file search tool in Rust that was extremely fast. So we picked Rust to implement the crawler.
We chose a library called Spider to crawl URLs and fetch content. It took us about two days to write a crawler with the help of an open source library. The speed of the resulting crawler left even us surprised—we were able to crawl over 100 docs in around eight seconds. It also made us question: Could we make the crawler even faster?
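Before we answer that: if you want to try the same starting point, the snippet below follows the spider crate's documented usage pattern. Exact method names vary between versions, so treat it as a sketch rather than a drop-in.

```rust
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Crawl a site and collect every discovered link.
    let mut website = Website::new("https://example.com");
    website.crawl().await;

    for link in website.get_links() {
        println!("crawled: {}", link.as_ref());
    }
}
```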
We knew our approach was correct, and testing the implementation gave us the confidence to experiment further. We decided to deploy the code on AWS Lambda, an event-driven serverless computing platform. But there was one more hurdle to cross.
Lambda doesn't ship a managed Rust runtime, so we needed a custom one. Our search led us to Cargo Lambda, which we could use to build a custom-runtime Lambda and deploy it. While trying to put this into effect, we ran into yet another issue. The libraries out there for Rust were inefficient, immature, and came with a whole lot of other issues.
“Back to square one?” we scratched our heads in frustration.
We were left with two choices. The first was to find a new library and do a proof of concept (POC). The second was to write it on our own. Even after doing POCs with multiple libraries, we couldn’t go much further.
By this point, we were fed up with wasting our time on open-source libraries that would not work the way we wanted. So one fine day we decided to write our own in-house crawler. Yes, you heard it right!
We started writing our crawler from scratch, customizing it for crawl depth and allowed domains. What’s more? With our own code, we were able to integrate concepts such as multithreading, channels, semaphores, and many other techniques seamlessly.
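As an illustration of how those pieces fit together, here is a condensed sketch of a depth- and domain-bounded crawler that uses tokio tasks and a semaphore to cap concurrent fetches. The names, limits, and the naive link-extraction shortcut are simplified stand-ins, not our actual implementation.

```rust
use std::collections::HashSet;
use std::sync::Arc;
use tokio::sync::Semaphore;

const MAX_DEPTH: usize = 3;
const MAX_CONCURRENT_FETCHES: usize = 32;

// Naive link extraction: pull out href="..." values. A real crawler
// would use an HTML parser and resolve relative URLs properly.
fn extract_links(html: &str, allowed_domain: &str) -> Vec<String> {
    html.split("href=\"")
        .skip(1)
        .filter_map(|rest| rest.split('"').next())
        .filter(|url| url.contains(allowed_domain))
        .map(str::to_string)
        .collect()
}

#[tokio::main]
async fn main() {
    let client = reqwest::Client::new();
    let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT_FETCHES));
    let allowed_domain = "example.com";

    let mut seen: HashSet<String> = HashSet::new();
    let mut frontier = vec!["https://example.com/".to_string()];

    // Breadth-first crawl, one depth level at a time.
    for depth in 0..=MAX_DEPTH {
        let mut handles = Vec::new();

        for url in frontier.drain(..) {
            if !seen.insert(url.clone()) {
                continue; // already crawled
            }
            let client = client.clone();
            let permits = semaphore.clone();

            handles.push(tokio::spawn(async move {
                // The semaphore caps how many pages we fetch at once.
                let _permit = permits.acquire_owned().await.unwrap();
                let body = client.get(&url).send().await.ok()?.text().await.ok()?;
                println!("crawled {url} at depth {depth}");
                Some(body)
            }));
        }

        // Collect this level's pages and queue their links for the next one.
        for handle in handles {
            if let Some(body) = handle.await.unwrap() {
                frontier.extend(extract_links(&body, allowed_domain));
            }
        }
    }
}
```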
When the crawler was ready, we packaged our code as a custom Lambda runtime and deployed it on AWS Lambda. Boom!
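For the curious, the wrapper looks roughly like this. The handler body and event shape are placeholders, but the lambda_runtime plumbing and the `cargo lambda build --release` / `cargo lambda deploy` commands are the standard way to ship Rust on Lambda.

```rust
use lambda_runtime::{service_fn, Error, LambdaEvent};
use serde_json::{json, Value};

// Entry point for the custom Rust runtime on AWS Lambda.
#[tokio::main]
async fn main() -> Result<(), Error> {
    lambda_runtime::run(service_fn(handler)).await
}

// The event carries the crawl request; here we just echo the URL back.
async fn handler(event: LambdaEvent<Value>) -> Result<Value, Error> {
    let url = event.payload["url"].as_str().unwrap_or_default();
    // ...kick off the crawl for `url` here...
    Ok(json!({ "status": "started", "url": url }))
}
```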
The crawler was now able to crawl 100+ pages in a record six seconds! The excitement in the engineering team was palpable and spread to non-engineers at OSlash as well.
I still remember when we demonstrated the crawler in action on the Copilot interface—I’m not exaggerating when I say people gasped.
Does that mean this story has a happy ending? Maybe.
Everyone who tries out OSlash Copilot today is awed by its speed. But of course, the crawler that powers it is still far from perfect.
To give just one example, the crawler was great at crawling specific tags but when it came to crawling the entire HTML code, it was duplicating data, among other issues.
The current phase of our journey in building one of the world’s fastest search crawlers involves overcoming the following pressing issues:
Search that returns all results is of no use to the searcher; only search that returns relevant results that fulfill a need is useful. With this principle in mind, we started cleaning the crawled HTML of duplicated elements, such as headers, footers, and menus identified by common tags, IDs, and classes, which appear on every page and would otherwise pollute every search result. The holy grail of the cleanup remains elusive, but we are slowly inching toward a solution.
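One simple version of that cleanup, sketched with the scraper crate: keep the text of content-bearing elements and drop anything that lives inside header, nav, footer, or aside sections. The selector list is a heuristic shown for illustration, not our exact rules.

```rust
use scraper::{Html, Selector};

// Extract searchable text from a page, skipping boilerplate regions
// that repeat on every page of a site.
fn clean_page_text(html: &str) -> String {
    let document = Html::parse_document(html);
    let content = Selector::parse("h1, h2, h3, p, li").unwrap();

    document
        .select(&content)
        .filter(|element| {
            // Drop elements whose ancestors are boilerplate containers.
            !element.ancestors().any(|node| {
                node.value()
                    .as_element()
                    .map(|el| matches!(el.name(), "header" | "nav" | "footer" | "aside"))
                    .unwrap_or(false)
            })
        })
        .map(|element| element.text().collect::<Vec<_>>().join(" "))
        .collect::<Vec<_>>()
        .join("\n")
}
```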
We started testing the speed and capacity of the crawler on a lot of public knowledge bases. We discovered that websites such as GitHub and Datadog have a HUGE repository of knowledge with tens of thousands of documents. The crawler would simply not index anything for such knowledge-heavy websites.
When we performed a Root Cause Analysis (RCA) to understand this behavior, we found that AWS Lambda caps execution at 15 minutes, so the crawl was cut off abruptly before the crawler could write any data to the database.
To solve this problem we relied on AWS Batch, which has a configurable timeout. We tracked the crawl time ourselves, and once it exceeded 14 minutes and 30 seconds we would trigger AWS Batch with the same crawl configuration so it could finish crawling the data.
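The guard itself is simple; the sketch below shows the idea with std::time::Instant, where `submit_batch_job` and `CrawlConfig` are hypothetical stand-ins for handing the same configuration over to AWS Batch.

```rust
use std::time::{Duration, Instant};

// Lambda kills us at 15 minutes, so stop a little early and hand over.
const LAMBDA_BUDGET: Duration = Duration::from_secs(14 * 60 + 30);

struct CrawlConfig {
    start_url: String,
    allowed_domain: String,
}

// Hypothetical helper that submits the remaining crawl to AWS Batch
// with the same configuration this Lambda invocation received.
async fn submit_batch_job(_config: &CrawlConfig) { /* ... */ }

async fn crawl_with_budget(config: CrawlConfig, mut frontier: Vec<String>) {
    let started = Instant::now();

    while let Some(url) = frontier.pop() {
        if started.elapsed() >= LAMBDA_BUDGET {
            // Out of time on Lambda: let AWS Batch finish the crawl.
            submit_batch_job(&config).await;
            return;
        }
        // ...fetch `url`, store the page, push new links onto `frontier`...
        let _ = url;
    }
}
```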
To further reduce the crawl time, we decided to honor robots.txt, which tells bots which parts of a website they may and may not crawl. By reducing the number of pages the crawler is allowed to visit, we make it even faster.
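A minimal version of that check, assuming we fetch /robots.txt ourselves and only honor simple `Disallow:` prefixes under `User-agent: *`; a production crawler would use a full parser, since the format has more nuance than this.

```rust
// Return true if `path` may be crawled according to a robots.txt body.
// Only handles `User-agent: *` groups and prefix-style Disallow rules.
fn is_allowed(robots_txt: &str, path: &str) -> bool {
    let mut applies_to_us = false;

    for line in robots_txt.lines() {
        let line = line.trim();
        if let Some(agent) = line.strip_prefix("User-agent:") {
            applies_to_us = agent.trim() == "*";
        } else if applies_to_us {
            if let Some(rule) = line.strip_prefix("Disallow:") {
                let rule = rule.trim();
                if !rule.is_empty() && path.starts_with(rule) {
                    return false;
                }
            }
        }
    }
    true
}

fn main() {
    let robots = "User-agent: *\nDisallow: /admin/\nDisallow: /tmp/";
    assert!(is_allowed(robots, "/docs/getting-started"));
    assert!(!is_allowed(robots, "/admin/users"));
    println!("robots.txt rules applied");
}
```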
Finally, there are some security best practices we make sure to adhere to, no matter what—to guarantee user and data security as well as ship a good product. Here is a ready reckoner for you:
Our quest for the fastest enterprise search crawler shows that custom optimization pays off exponentially. The engineering journey continues as we find new ways to squeeze out performance while respecting security and safety concerns.
Stay tuned for more adventures in search!