Simple search engine

# Installing

It is recommended to use [virtualenv](https://virtualenv.pypa.io).

```
pip install -r requirements.txt
```

## Testing

If you just want to test, don't want to install a PostgreSQL database, but have Docker installed, just use the `docker-compose.yml`. This is only for testing; do not use this docker-compose file in production!

## Sphinx-search / Manticore-search

You must use [Manticore-search](https://manticoresearch.com/) because the searx engines rely on the JSON search API. You can use [Sphinx-search](http://sphinxsearch.com/) instead if you don't want to use the JSON search API. Be aware that, as of January 2019, the latest version of Sphinx-search (the 3.x series) is distributed as closed-source instead of open-source.

# Configuration

## Database

The database used for this project is PostgreSQL; you can update the login information in the `config.py` file.

## Manticore-search

The configuration for this is in the `sphinx_search.conf` file. To update this file, please see the [Manticore-search documentation](https://docs.manticoresearch.com). Keep in mind that `config.py` must be kept up to date and consistent with `sphinx_search.conf`.

# Crawling

For now there is an example spider for the neodarz website. To launch all the crawlers, use the following command:

```
python app.py crawl
```

You can also specify a single spider to crawl, for example `nevrax_crawler`, with the command:

```
python app.py crawl nevrax_crawler
```

# Indexing

Before launching the indexing or searching command, verify that the folder given by the `path` option exists on your system. (Warning: the last component of the `path` option is the value of the `source` option; do not create that folder itself, only its parent folder.) Example with the configuration for the `neodarznet` indexer:

```
index neodarznet {
    source = neodarznet
    path = /tmp/data/neodarznet
}
```

Here the folder to create is `/tmp/data/`.

The command for indexing is:

```
indexer --config sphinx_search.conf --all
```

Don't forget to launch the crawling command before this ;)

# Searching

Before you can search, you must launch the search server:

```
searchd -c sphinx_search.conf
```

# Updating

To update the database and only re-crawl URLs that are one week old and whose content has been modified, you can use:

```
python app.py update
```

# Cron

If you want to use cron to start tasks, you can. But this app also ships a cron-like function which, for the moment, does nothing more than cron itself. To start it, just use the following command:

```
python app.py cron
```

All the configuration is in the `cron.conf` file, and the syntax is the same as [UNIX cron](https://apscheduler.readthedocs.io/en/latest/modules/triggers/cron.html?highlight=cron), with some project-specific additions; check the first line of the file, which is a comment describing its structure.

Note 1: There is an id column; make sure all ids are different, otherwise the last task erases any previous one with the same id.

Note 2: This is only implemented for the crawl function for the moment.
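Since the cron syntax reference above points at APScheduler's cron trigger, the built-in cron function presumably sits on top of [APScheduler](https://apscheduler.readthedocs.io) (an assumption based on that link, not something this README states). A minimal standalone sketch of that trigger, independent of this app's `cron.conf` handling and with made-up schedule values:

```
from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

# Same field semantics as a UNIX cron entry: run every day at 03:00.
# The schedule and the job body are illustrative only.
@sched.scheduled_job('cron', hour=3, minute=0)
def crawl_job():
    print("launch the crawl here, e.g. what `python app.py crawl` does")

sched.start()
```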
# Enjoy

To start searching, send a `POST` request with the Manticore-search JSON API, for example:

```
http POST 'http://localhost:8080/json/search' < mysearch.json
```

This is the content of `mysearch.json`:

```
{
    "index": "neodarznet",
    "query": {
        "match": {
            "content": "Livet"
        }
    },
    "highlight": {
        "fields": {
            "content": {},
            "url": {},
            "title": {}
        },
        "pre_tags": "_",
        "post_tags": "_"
    }
}
```

You can find more information about the HTTP search API in the [Manticore-search documentation](https://docs.manticoresearch.com/latest/html/httpapi_reference.html). Results are in JSON format.

If you want to know which websites are indexed, search the file [sphinx_search.conf](https://git.khaganat.net/neodarz/khanindexer/blob/master/sphinx_search.conf) for all the lines that start with `index`.
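For example, with a standard `grep`:

```
grep '^index' sphinx_search.conf
```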
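If you prefer to query from Python rather than with `http`, here is a minimal sketch using the `requests` library (an assumption: `requests` is not mentioned anywhere in this README; the endpoint and payload are the same as in the example above):

```
import json
import requests

# Same payload as mysearch.json above.
payload = {
    "index": "neodarznet",
    "query": {"match": {"content": "Livet"}},
    "highlight": {
        "fields": {"content": {}, "url": {}, "title": {}},
        "pre_tags": "_",
        "post_tags": "_",
    },
}

# POST against the Manticore-search HTTP JSON endpoint shown above.
response = requests.post("http://localhost:8080/json/search", json=payload)
response.raise_for_status()

# Results come back as JSON; pretty-print them.
print(json.dumps(response.json(), indent=2))
```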