What is a Search Engine?
A search engine is software that displays relevant webpage results based on the search query keywords entered by a user.
How does a Search Engine work?
This is achieved by using techniques like:
- Web Crawling
- Web Indexing and
- Some intelligent algorithms to gather data.
A few thousand searches were made in the time it took this webpage to load on your device. Isn’t it amazing how a search engine works?
Now, how does Google serve the best results in a fraction of a second? Honestly, we rarely stop to think about it as long as search engines like Google, Bing, Yahoo, and Baidu are there, but the situation would’ve been very different without them. Give us a chance to jump into the world of search engines and see how a search engine works.
Peeping into the history
- During the 1990s, Tim Berners-Lee used to add every new webserver that went online to a list maintained by the CERN webserver.
- Until September 1993, there were no search engines. There existed just the internet.
- There were only a few tools, like Archie, Veronica, and Jughead, which were capable of maintaining a database of file names.
- W3Catalog was the first search engine, created by Oscar Nierstrasz of the University of Geneva.
- Through expert Perl scripting, he brought out the world’s first search engine on September 3, 1993.
- During 1993, many more search engines were launched, like JumpStation, AliWeb, WWW Worm, etc.
- Yahoo! was launched in 1995 as a web directory.
- In 2000, Yahoo! started using Inktomi’s search engine, and then shifted to Microsoft’s Bing in 2009.
It all begins with a crawl
In the beginning, a search engine starts exploring the World Wide Web, following every link it finds on a webpage and storing them in its database.
Now, let’s focus on the activity running in the background.
- A search engine uses Web Crawler software, an internet bot that has been assigned the task of opening all the hyperlinks on a webpage.
- It creates a database of text and metadata from all the links.
- It begins with an initial set of links to visit, called Seeds.
- As it proceeds with visiting those links, it adds the new links it discovers to the existing list of URLs to visit, known as the Crawl Frontier.
- As the crawler navigates through the links, it downloads snapshots of those pages to be processed later, as downloading every page in full would consume a lot of bandwidth and data.
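The seed-and-frontier process above can be sketched as a breadth-first traversal. The link graph below is entirely hypothetical; a real crawler would fetch pages over HTTP instead of reading a dictionary.

```python
from collections import deque

# A hypothetical link graph standing in for the live web:
# each URL maps to the links found on that page.
PAGES = {
    "https://example.com/":  ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/"],
}

def crawl(seeds):
    """Breadth-first crawl: seeds go into the frontier, and every
    newly discovered link is appended until the frontier is empty."""
    frontier = deque(seeds)   # the Crawl Frontier
    visited = set()
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue          # never download the same page twice
        visited.add(url)
        for link in PAGES.get(url, []):
            if link not in visited:
                frontier.append(link)
    return visited

print(sorted(crawl(["https://example.com/"])))
```

Starting from a single seed, the crawl reaches all four pages because every page is linked from some page already in the frontier.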
Now, the web crawler follows certain rules:
- Selection: The crawler decides whether it should download a page or not. The selection rule emphasizes downloading the most relevant content of a web page.
- Re-Visit: The crawler schedules when it should re-open web pages and update its database. The internet these days is very dynamic, so crawlers have to do this frequently, keeping up with new versions of web pages.
- Parallelization: Crawlers run multiple processes to explore links, a setup known as Distributed Crawling. Different processes may end up downloading the same web page, so to avoid duplication the crawler maintains coordination between all the processes.
- Politeness: A crawler usually downloads many webpages from a website as it navigates through it. This increases the load on the host webserver, so the crawler is made to wait for a few seconds after it downloads some data from a server. This intentional, intermittent pause is called Crawl-Delay.
High-level Architecture of a standard Web Crawler:
The above illustration shows the working of a web crawler: it opens the initial list of links, then the links inside those links, and so on.
While it is fairly simple to build a slow crawler that downloads a few pages per second for a short time, building a high-performance system that can download countless pages over several weeks presents many challenges in system design, I/O and network efficiency, and robustness and manageability.
Indexing the crawls
After the initial crawl, the crawler creates an index of all the webpages it finds along the way. Indexing saves a lot of time, since scanning a heap of large documents for the search query on every request would be slow and resource-consuming.
There are various factors that contribute to creating an efficient indexing system for a search engine, like:
- storage techniques used by the indexers,
- size of the index,
- the ability to quickly find the documents containing the searched keywords, etc.
The big problem we have to deal with is process collision: suppose one process wants to search for a document while, meanwhile, another process wants to add a document to the index. This problem escalates with the distributed computing that search engines implement to handle more data.
Types of Index
Forward: In a forward index, all the keywords present in a document are stored as a list against that document. In the beginning, it’s easy to create a forward index, as asynchronous indexers can collaborate with each other this way.
Reverse: The forward indices are then sorted and converted into reverse (inverted) indices, in which each keyword is stored together with the list of documents containing it. Reverse indices ease the process of finding relevant documents for a given search query, which is not the case with forward indices.
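The conversion from a forward index to a reverse index can be sketched in a few lines. The three-document collection here is hypothetical:

```python
# Hypothetical forward index: document id -> keywords found in it.
forward_index = {
    "doc1": ["search", "engine", "crawler"],
    "doc2": ["search", "index"],
    "doc3": ["crawler", "frontier"],
}

def invert(forward):
    """Flip a forward index into a reverse (inverted) index:
    keyword -> sorted list of documents containing it."""
    inverted = {}
    for doc, words in forward.items():
        for word in words:
            inverted.setdefault(word, set()).add(doc)
    return {word: sorted(docs) for word, docs in inverted.items()}

inverted_index = invert(forward_index)
print(inverted_index["search"])   # ['doc1', 'doc2']
```

With the reverse index in hand, answering “which documents mention search?” is a single dictionary lookup instead of a scan over every document.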
Parsing of Documents
Parsing, also called Tokenization, refers to breaking a document down into components like keywords, images, and other media. This step takes the native language into account, along with possible search keywords, which helps in creating an effective indexing system.
Different languages pose real challenges. That’s the reason Baidu is the most widely used search engine in China for the Chinese language. An algorithm may be effective for one language but lag behind in others; for example, the Chinese language has no white spaces between words. Certain webpages have mixed-language content, and the website language is not clearly defined. All these factors increase the workload on the indexing system and reduce overall search efficiency.
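For a space-delimited language like English, a naive tokenizer can be sketched with a single regular expression. As noted above, this exact approach fails for Chinese, which has no white spaces and would need a real word segmenter.

```python
import re

def tokenize(text):
    """Naive tokenizer: lower-case the text and keep alphanumeric runs,
    discarding punctuation. Only works for space-delimited languages."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("How a Search Engine works!"))
# ['how', 'a', 'search', 'engine', 'works']
```

The token lists produced this way are exactly what goes into the forward index described earlier.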
Search engines have the ability to recognize various file formats and successfully extract data from them, and utmost care must be taken in these cases.
Meta Tags help in creating the index very quickly. Moreover, they reduce the web indexer’s efforts and the need to parse the whole document. You’ll find Meta Tags attached at the bottom of this article.
Searching the index
Now the crawler has indexed things: it has learnt how to crawl, how to grab things quickly and efficiently, and how to arrange its findings systematically. Suppose a friend asks it to find something in that arrangement; what will it do? Search queries can be categorized in the following four ways:
Navigational: Sometimes the user wants to go to a specific webpage or website. For example, if you search vishyat on Google to reach that site, it’s called a Navigational Query.
Informational: Informational queries can have thousands of results. They cover general topics that increase the searcher’s knowledge. For example, if you search Narendra Modi, it will show up all the links relevant to the Indian Prime Minister.
Transactional: Sometimes the user wants to perform a specific action using a query. Such queries usually involve a pre-defined set of instructions.
Connectivity: These queries are rarely used; they focus on the connection of the index with a website. For example: How many pages are there on Vishyat.com?
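Once a reverse index exists, answering a multi-keyword query reduces to intersecting the posting lists of its terms. The index below is a hypothetical toy example:

```python
# Hypothetical reverse index: keyword -> set of documents containing it.
index = {
    "search":  {"doc1", "doc2"},
    "engine":  {"doc1", "doc3"},
    "crawler": {"doc3"},
}

def answer(query):
    """Return the documents that contain every keyword of the query,
    by intersecting the posting list of each term."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings))

print(answer("search engine"))   # ['doc1']
```

Real engines then rank the surviving documents by relevance; the intersection step shown here is only the retrieval half of the job.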
Google and Bing have worked hard to create and maintain algorithms that show relevant results for your query. Google claims to calculate search results based on over 250 factors, like:
- quality of the content,
- safety of the webpage, etc.
Big search engine companies have the best minds devising mind-blowing formulae and algorithms to make Search simpler and quicker for you.
Other notable features
Image Search: This search option enables users to search using images.
Voice Search: Google was the first to introduce voice search on its search engine after a lot of hard work, and other search engines have since implemented it as well.
Spam Fighting: Search engines have intelligent algorithms to guard you from spam, which is a message or a file spread mostly to infect machines with viruses.
Location Optimization: Search engines are now capable of displaying results based on the user’s location. If you search weather, it will show the weather at your location; here, Chandigarh.
Understands you better: Nowadays, search engines have become smart enough to understand what the user intends to search, rather than just yielding results based on the keywords entered.
Auto-complete: One of the handiest features, which helps almost every time anyone queries the internet. Auto-complete works on the history of similar searches made by you or other users.
Knowledge Graph: This Google Search feature showcases search results about real people, places, and events.
Parental Control: Parental control options have become more efficient, especially in protecting kids from inappropriate content or webpages.
Search engines have made our lives far simpler with their priceless efforts. They have become so vital these days that we can’t imagine our day-to-day life without them.
Stay alert, stay safe and Google it.