

The World Wide Web has become a fundamental resource of information for an increasing number of activities, and a huge information flow is exchanged today through the Internet for the widest range of purposes. Although large-bandwidth communications give both human users and machines fast access to virtually any kind of content, the unstructured nature of most available information may pose a crucial issue. In principle, humans can best extract relevant information from posted documents and texts; on the other hand, the overwhelming amount of raw data to be processed calls for computer-supported approaches. Thus, in recent years, Web mining research has tackled this issue by applying data mining techniques to Web resources. This chapter deals with the predominant portion of web-based information, i.e., documents embedding natural-language text. The huge amount of textual digital data and the dynamic nature of natural language can make it difficult for an Internet user (either human or automated) to extract the desired information effectively: people face the problem of information overload every day, while search engines often return too many results or biased or inadequate entries.

These difficulties show that: 1) treating web-based textual data effectively is a challenging task, and 2) further improvements are needed in the area of Web mining. In other words, algorithms are required to speed up human browsing or to support the actual crawling process. Application areas that can benefit from the use of these algorithms include marketing, CV retrieval, exploration of laws and regulations, competitive intelligence, web reputation, business intelligence, news article search, topic tracking, and search for innovative technologies.
