Inside exalead.com engine (1)

This post is the first of a series about EXALEAD's web search engine activities. We'll first introduce the engine and the team, and the next posts will dive into the technical details of the problems we solved to make all of this work.

History 

Since its founding by search engine pioneers in 2000, EXALEAD has always crawled and indexed the world wide web as a showcase of its software platform’s capabilities and scalability. The exalead.com team is responsible for both the development of the crawler software, and operation of the web search engine. The software components of the crawler are also used in Cloudview, EXALEAD’s search platform. Today, exalead.com is one of the biggest web crawlers in the world, and is run by a team of 6 engineers.

The team belongs to EXALEAD’s Research & Development department, alongside the team responsible for EXALEAD’s main product, Cloudview. Cloudview is powered by EXALEAD’s core search engine software, which the exalead.com team uses to index 16 billion web pages. The exalead.com team is thus the first large-scale “client” of Cloudview, and its web search engine allows everyone to evaluate Cloudview’s semantic features and large scale indexing performance.

Tech & Projects

Exalead.com contains several specialized search indexes:

  • the main web search index, containing 16 billion pages,
  • a news index with 50 million news and blog articles, polling 7600 feeds in near real-time,
  • a Wikipedia index.

We’ve also included image and video search in the web interface (for science). These cover 1 billion images and 100 million videos.

We’ll focus on the main web search engine because it is the largest and most important. It is split into 4 parts:

  • the crawl
  • indexing
  • search indexes
  • front servers.

The 32 crawlers explore the whole web to find new content and update already-known pages, storing everything in a large database.

The 27 indexing servers crunch these documents to extract the text and semantic properties, and build index data structures optimized for search.

The 30 search index servers store this data on fast SSD storage to process user queries and find matching documents.

The 8 front servers serve the exalead.com web application, pre-process user queries, and build the result page.
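
To make the division of labour concrete, here is a deliberately tiny Python sketch of the same four-stage pipeline. None of it is EXALEAD code: the class names, the in-memory page store and the toy inverted index are illustrative assumptions; the real chain runs on Cloudview and dedicated crawl software.

```python
# Minimal sketch (not EXALEAD's actual code) of how the four stages hand
# documents to each other: crawlers fetch pages into a store, indexers turn
# them into a searchable index, and the front layer queries that index.
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    html: str

class Crawler:
    """Fetches pages from the web and writes them to the page store."""
    def __init__(self, store):
        self.store = store

    def crawl(self, url, html):
        self.store[url] = Page(url, html)          # hypothetical page store

class Indexer:
    """Builds a toy inverted index (term -> set of URLs) from the store."""
    def __init__(self, store):
        self.store = store
        self.index = {}

    def build(self):
        for page in self.store.values():
            for term in page.html.lower().split():
                self.index.setdefault(term, set()).add(page.url)
        return self.index

class Front:
    """Pre-processes the query and looks it up in the search index."""
    def __init__(self, index):
        self.index = index

    def search(self, query):
        terms = query.lower().split()              # toy query pre-processing
        hits = [self.index.get(t, set()) for t in terms]
        return set.intersection(*hits) if hits else set()

# Usage: one document flowing through the whole chain.
store = {}
Crawler(store).crawl("http://example.com", "hello search engines")
front = Front(Indexer(store).build())
print(front.search("search engines"))              # {'http://example.com'}
```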

All these servers are physically hosted in Telehouse, Magny-lès-Hameaux.

[Photo: one of our latest blade servers, with 4 attached disk arrays]

The indexing and search are handled entirely by a standard installation of Cloudview. The crawlers are web-specific, and the front application is a custom design built with Django and Python.

Each release of Cloudview is first beta-tested on our pre-prod servers. The bugs caught are then fixed before final releases are deployed to the whole indexing chain and to EXALEAD’s other clients. exalead.com’s volume allows us to find and fix even the most stubborn bugs, for example scaling issues and exceptional conditions that happen only once on the whole web. All of EXALEAD’s clients thus benefit from this large-scale QA. It is worth noting that, even though the team has always used Cloudview pre-releases, exalead.com has never suffered any significant service disruption, thanks to careful operations planning and monitoring.

Exalead.com search engine today

The team is in the middle of a large hardware migration and upgrade project. The crawler storage has already been migrated from a proprietary solution based on LSM-trees to HBase, and the index storage itself will use HDFS starting at the end of the year.

|                           | Up to 2015                           | Next                                        |
|---------------------------|--------------------------------------|---------------------------------------------|
| Pages                     | 16 billion                           | 32 billion                                  |
| Total raw HTML volume     | 500 TB                               | >1 PB                                       |
| Disk storage              | 150 TB (HDFS cache) + 50 TB (index)  | Full HDFS: 300 TB (cache) + 600 TB (index)  |
| Crawl speed               | 3000 pages/second                    | >6000 pages/second                          |
| Total raw indexed volume  | 3 PB/year                            | ~6 PB/year                                  |
| Crawl bandwidth           | ~400 Mbps                            | >800 Mbps                                   |
| Sustained queries         | up to 10 qps                         | at least 30 qps                             |
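
As an illustration of what the crawler-storage migration to HBase means in practice, the sketch below writes and reads one crawled page with the happybase Python client. The table name, column family and row-key scheme (reversed host, a common convention for web tables) are our own assumptions for the example, not EXALEAD's actual schema.

```python
# Minimal sketch, assuming an HBase cluster reachable from this host and a
# pre-created table 'pages' with a column family 'content'. Names and schema
# are illustrative, not EXALEAD's real ones.
import happybase

def row_key(url):
    # Reversed-host keys ("com.example/path") keep pages of one site close
    # together on disk, which helps per-site scans.
    host, _, path = url.partition("://")[2].partition("/")
    return ".".join(reversed(host.split("."))) + "/" + path

connection = happybase.Connection("hbase-master.example")  # hypothetical host
table = connection.table("pages")

# Store one fetched page: raw HTML plus a couple of crawl metadata columns.
table.put(row_key("http://example.com/index.html").encode(), {
    b"content:html": b"<html>...</html>",
    b"content:fetch_time": b"2015-06-01T12:00:00Z",
    b"content:http_status": b"200",
})

# Read it back.
row = table.row(row_key("http://example.com/index.html").encode())
print(row[b"content:html"])
```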

Refresh

The team has developed many algorithms to keep only the most relevant documents of the web in our 16 billion page cache. Having a fixed budget, the crawler mainly refreshes existing documents. To optimize the hardware resources and crawl bandwidth, the crawler uses adaptive refresh: each page’s change rate is inferred from its crawl history, so that important and fast-changing pages are refreshed up to once a day, whereas static pages such as archives are only refreshed once a year.
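
The post does not give the exact formula, but one simple way to implement this kind of adaptive refresh is to treat page changes as a Poisson process, estimate each page's change rate from how often past crawls found it modified, and clamp the resulting refresh interval between a day and a year. The sketch below is our own illustration of that idea, not the crawler's actual scheduler.

```python
# Hedged sketch of adaptive refresh: estimate a page's change rate from its
# crawl history and derive a refresh interval, clamped to [1 day, 1 year].
# The model (Poisson changes, rate = changes / observed time, add-one
# smoothing) is an illustrative assumption, not EXALEAD's published algorithm.

DAY = 86400.0          # seconds
YEAR = 365 * DAY

def estimated_change_rate(crawl_times, changed_flags):
    """crawl_times: sorted fetch timestamps; changed_flags[i] says whether
    the i-th fetch found the page modified since the previous fetch."""
    observed = crawl_times[-1] - crawl_times[0]
    changes = sum(changed_flags[1:])       # the first fetch has no "previous"
    if observed <= 0:
        return 1.0 / YEAR                  # unknown history: assume static
    # Add-one smoothing so a page never gets a zero rate forever.
    return (changes + 1) / observed

def refresh_interval(crawl_times, changed_flags):
    rate = estimated_change_rate(crawl_times, changed_flags)
    interval = 1.0 / rate                  # expected time between changes
    return min(max(interval, DAY), YEAR)   # fast pages ~daily, static ~yearly

# Example: a page fetched 4 times over 10 days and found changed twice is
# refreshed every ~3.3 days; a page never seen changing gets a longer
# interval (here 10 days, growing as its unchanged history accumulates).
times = [0, 3 * DAY, 6 * DAY, 10 * DAY]
print(refresh_interval(times, [False, True, False, True]) / DAY)
print(refresh_interval(times, [False, False, False, False]) / DAY)
```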

Team

The whole exalead.com development and operations are handled by a team of only 6 full-time engineers.

  • Morgan Champenois graduated from Université de Caen in 2008 with a background in software development, search engines and network communications. He joined EXALEAD in 2008 as R&D engineer, to work on the exalead.com crawler software and on data mining with Hadoop.
  • Julien Leproust (@jleproust) graduated from Ecole Centrale Paris in 2004, with a background in software development and network communications. He joined EXALEAD in 2007 as R&D engineer, to work on core platform technologies. He moved on to the web team in 2011, and became team manager in 2012.
  • Omar Bouslama graduated from Université de Technologie de Compiègne in 2010 and earned a master’s degree in distributed computing from Université Pierre et Marie Curie in 2011. He joined EXALEAD in 2011 as R&D engineer and is now in charge of exalead.com operations and of the crawl software components in Cloudview.
  • Riadh Trad (@riadhtrad) earned a PhD from Telecom ParisTech in 2013, with a background in high performance & distributed computing, content-based search engines & machine learning. He joined EXALEAD in 2012 as R&D engineer, to work on Hadoop integration and operations within exalead.com, distributed software development and data mining.
  • Raphaël Claude (@heapoverflow) is a 2013 INSA-Lyon alumnus with a background in distributed algorithms and a focus on statistics & machine learning. He joined the team in 2013 to work on data mining with Hadoop and distributed software development.
  • Fabrice Bacchella (@fbacchella) graduated from Université d’Orsay in 1995, and joined EXALEAD in 2006 as a system administrator, for exalead.com and other resources. He has never moved away from his keyboard since, except to take a few pictures (Abec.Photographe). He is also the main author of JRDS (http://jrds.fr/), monitoring software extensively used at EXALEAD.
  • Andrey Krivonogov is a 6-month intern with the team. He is from Novosibirsk, Russia, and will graduate from École Polytechnique next year. He is busy enhancing our data mining algorithms, especially the page and site relevancy ones.

 

More to come

Crawling billions of web pages may seem easy at first, but EXALEAD’s web team has run into plenty of problems over 15 years, spanning performance and scaling, relevance, and spam fighting. In the coming months, we will publish a series of articles about some of the issues we faced and how we solved them:

  • How to make a web crawler scale to billions of pages
  • Finding information in an ocean of data: document relevance, redundancy, and user queries
  • Identifying and fighting web spammers

Stay tuned!