
Simple search engine

Installing

It is recommended to use virtualenv.
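For example, you can create and activate one with Python 3's built-in venv module (a sketch; any virtualenv tool works):

python3 -m venv venv
source venv/bin/activate

Then install the dependencies: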

pip install -r requirements.txt

Testing

If you just want to test and don't want to install a PostgreSQL database, but have Docker installed, just use the docker-compose.yml.

This is for testing only; don't use this docker-compose file in production!
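To bring up the test database, the standard Docker Compose invocation applies:

docker-compose up -d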

Sphinx-search / Manticore-search

You must use Manticore-search because the searx engines rely on its JSON search API.

You can still use Sphinx-search if you don't want to use the JSON search API. Be aware that, as of January 2019, the latest versions of Sphinx-search (the 3.x series) are distributed as closed-source instead of open-source.

Configuration

Database

This project uses PostgreSQL as its database; you can update the login information in the config.py file.
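As a rough sketch, the PostgreSQL settings in config.py might look like the following (the variable names and values here are illustrative assumptions, not the real contents of the file):

# Hypothetical config.py excerpt: adjust to the actual names used in the file
DB_HOST = 'localhost'
DB_PORT = 5432
DB_USER = 'search'
DB_PASSWORD = 'secret'
DB_NAME = 'search'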

Manticore-search

The configuration for this is in the sphinx_search.conf file. To update this file, please refer to the Manticore-search documentation. Keep in mind that the config.py file must be kept up to date in accordance with sphinx_search.conf.
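For instance, if sphinx_search.conf declares an index named neodarznet, config.py presumably needs to reference the same name (a hypothetical illustration, not the file's real layout):

# Hypothetical config.py excerpt: index names must match sphinx_search.conf
SPHINX_HOST = 'localhost'
SPHINX_INDEXES = ['neodarznet']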

Crawling

For now there is an example spider for the neodarz website. To launch all the crawlers, use the following command:

python app.py crawl

You can also specify a single spider to crawl, for example nevrax_crawler, with the command:

python app.py crawl nevrax_crawler

Indexing

Before launching the indexing or searching commands, you must verify that the folder of the path option is present on your system. (Warning: the last component of the path option is the value of the source option; don't create this folder itself, only its parent folder.)

Example with the configuration for the indexer data:

index neodarznet {
    source = neodarznet
    path = /tmp/data/neodarznet
}

Here the folder to create is /tmp/data/
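That parent folder can be created with a plain mkdir before running the indexer:

mkdir -p /tmp/data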

The command for indexing is:

indexer --config sphinx_search.conf --all

Don't forget to launch the crawling command before this ;)

Searching

Before you can search, you must launch the search server:

searchd -c sphinx_search.conf

Updating

To update the database, crawling only URLs that are one week old and whose content has been modified, you can use:

python app.py update

Cron

If you want to use cron to start tasks, you can. But this app also ships a cron-like function which, for the moment, does nothing more than cron does. To start it, just use the following command:

python app.py cron

All the configuration is in the cron.conf file and the syntax is the same as UNIX cron, but with some project-specific additions; check out the first line of the file, which is a comment describing its structure.

Note 1: There is an id column; make sure all ids are different, otherwise the last task erases the previous one with the same id.

Note 2: For the moment this is only implemented for the crawl function.
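As a purely hypothetical illustration (the real structure is documented in the comment on the first line of cron.conf), an entry combining an id, the five UNIX cron fields and a task name might look like:

# id minute hour day month weekday task
1 0 3 * * * crawl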

Enjoy

To start searching, send a POST request to the Manticore-search JSON API, for example:

http POST 'http://localhost:8080/json/search' < mysearch.json

This is the content of mysearch.json:

{
  "index": "neodarznet",
  "query": { "match": { "content": "Livet" } },
  "highlight":
  {
    "fields":
    {
      "content": {},
      "url": {},
      "title": {}
    },
    "pre_tags": "_",
    "post_tags": "_",
  }
}
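If you don't have httpie installed, the same request can be sent with curl:

curl -s -X POST 'http://localhost:8080/json/search' -H 'Content-Type: application/json' -d @mysearch.json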

You can find more information about the HTTP search API in the Manticore-search documentation.

Results are in JSON format. If you want to know which websites are indexed, look in the sphinx_search.conf file for all the lines that start with index.
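A quick way to list them is with grep:

grep '^index' sphinx_search.conf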