Chen Shangyi, Chairman of the Love Shanghai Technical Committee, said: "Love Shanghai handles nearly 100 PB of data every day. 1 PB equals 1 million GB, so this amount of information is equivalent to 5,000 national libraries."
Chinese word segmentation is best understood in contrast to English, where spaces already separate the words.
However, once a program involves massive data processing, both its development difficulty and its development cycle grow considerably. Take a simple example: to decide whether a link has already been crawled, the crawler must check every link it encounters. If memory holds only thousands or tens of thousands of links, even a linear comparison over all of them basically meets the requirement. But what about a hundred thousand, a million, or more? At that scale, a structure such as a red-black tree can barely cope. And at a billion, tens of billions, or hundreds of billions of links? Then the only option is to build an index.
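The difference between a linear comparison and an indexed lookup can be sketched as follows. This is a minimal in-memory illustration, not a description of any real search engine: the names `seen_by_scan` and `SeenIndex` are hypothetical, and storing an 8-byte BLAKE2b fingerprint per link is an assumption made here to keep memory per link small. At tens of billions of links the index would have to live on disk or be spread across machines.

```python
import hashlib


def seen_by_scan(links, url):
    """Linear scan: acceptable for thousands of links, O(n) per lookup."""
    for link in links:
        if link == url:
            return True
    return False


class SeenIndex:
    """Hash index of URL fingerprints: O(1) average time per lookup."""

    def __init__(self):
        self._fingerprints = set()

    @staticmethod
    def _fingerprint(url):
        # An 8-byte digest keeps the per-link footprint small. Collisions
        # are possible but rare at this size (an assumption of the sketch).
        return hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()

    def add(self, url):
        """Record url; return True if it had not been seen before."""
        fingerprint = self._fingerprint(url)
        if fingerprint in self._fingerprints:
            return False
        self._fingerprints.add(fingerprint)
        return True

    def __contains__(self, url):
        return self._fingerprint(url) in self._fingerprints
```

A crawler would call `index.add(url)` for each discovered link and only enqueue the link when the call returns `True`.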
None of the software engineers at Wendao Studio has taken part in developing a large-scale search engine, but we are interested in the subject. Drawing on experience from similar projects and on public information, this article gives a brief overview of the technology behind search engines.
2, Chinese word segmentation: data preprocessing
More and more enterprises now realize the importance of data. As an important source of data, crawlers will be applied in more and more fields in the future.
In English the crawler is usually called a Spider, and the spider image makes it easier to understand: countless links form a huge web, and the search engine's content acquisition program is like a diligent spider crawling across that web, recording every node of interest it encounters for other programs to use.
Chinese word segmentation is an important search engine technology. Whether segmentation is accurate directly determines whether the search results match the intent behind the query.
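To make the idea of segmentation concrete, here is a minimal sketch of forward maximum matching, a simple dictionary-based technique. The article does not say which segmentation algorithm is used, so this is only an illustration; the function name `forward_max_match`, the tiny dictionary, and the `max_word_len` limit are all assumptions made for the example.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Segment text greedily, preferring the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary
        # matches; a single character is always accepted as a fallback.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; a real one holds hundreds of thousands of words.
dictionary = {"中文", "分词", "搜索", "引擎", "搜索引擎", "技术"}
print(forward_max_match("中文分词是搜索引擎技术", dictionary))
# → ['中文', '分词', '是', '搜索引擎', '技术']
```

Because the match is greedy, "搜索引擎" is kept as one word rather than split into "搜索" and "引擎". The same greediness causes errors on ambiguous text, which is why accuracy matters so much for query intent.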
As the source of a search engine's huge volume of data, the crawler is an important part of search engine technology. Wendao Studio has developed its own crawler, so we are quite familiar with this technology.
Crawler technology is applied in many scenarios besides search engines, for example the public opinion analysis systems and data mining systems that are now emerging.
Writing a crawler is not difficult. The author developed a prototype crawler in C++ in only about 500 lines of code, and in Python it would take fewer than 100 lines.
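As a rough illustration of how small a Python crawler can be, the sketch below does a breadth-first crawl using only the standard library. It is not the author's crawler: the names `LinkParser`, `fetch`, and `crawl` and the `max_pages` limit are hypothetical. It also leaves out things a production crawler needs, such as robots.txt handling, politeness delays, and retries.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page as text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=100):
    """Breadth-first crawl; return the visited URLs in visit order."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing the download function in as `fetch_page` keeps the traversal logic separate from network access, so the crawl order can be checked against an in-memory set of pages.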
Several pieces of software from Wendao Studio overlap heavily with search engine technology. For example, the soon-to-launch project projSpider贵族宝贝 is in fact a simple vertical search engine. In addition, our web crawler module, itself an important part of search engine technology, is used in multiple projects.
Crawler (Spider): the data source
Handling such a huge volume of data shows the remarkable technical strength of Love Shanghai.
In addition to