Simple search engine
Installing
It is recommended to use virtualenv.
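For example, one way to set it up (assuming Python 3's built-in venv module; the directory name venv is just a convention):

```shell
# Create an isolated environment and activate it for the current shell.
python3 -m venv venv
. venv/bin/activate
```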
pip install -r requirements.txt
Testing
If you just want to test and don't want to install a PostgreSQL database
but have Docker installed, just use the docker-compose.yml file.
This is only for testing; don't use this docker-compose file in production!
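For reference, a minimal docker-compose.yml of this kind might look like the sketch below; the image tag, credentials, and port are illustrative assumptions, so use the repository's own file and keep the values in accordance with config.py:

```yaml
version: "3"
services:
  db:
    image: postgres:11          # assumed version, adjust as needed
    environment:
      POSTGRES_USER: search     # must match the login in config.py
      POSTGRES_PASSWORD: search
      POSTGRES_DB: search
    ports:
      - "5432:5432"             # default PostgreSQL port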
Sphinx-search / Manticore-search
You must use Manticore-search because the searx engines rely on the JSON search API.
You can use Sphinx-search instead if you don't need the JSON search API, but be aware that, as of January 2019, the latest versions of Sphinx-search (the 3.x series) are distributed as closed-source instead of open-source.
Configuration
Database
The database used for this project is PostgreSQL; you can update the login
information in the config.py file.
Manticore-search
The configuration for this is in the sphinx_search.conf file. To update this
file, please see the Manticore-search documentation.
Keep in mind that you must keep the config.py file up to date and in
accordance with the sphinx_search.conf file.
Crawling
For now there is an example spider for the neodarz website. To launch all the crawlers, use the following command:
python app.py crawl
You can also specify a single spider to crawl, for example nevrax_crawler,
with the command:
python app.py crawl nevrax_crawler
Indexing
Before launching the indexing or searching command, you must verify that the
folder of the path option is present on your system. (Warning: the last word
of the path option is the value of the source option; don't create this
folder, only its parent folder.)
Example with the configuration for the indexer data:
index neodarznet {
source = neodarznet
path = /tmp/data/neodarznet
}
Here the folder to create is /tmp/data/
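With the example above, creating only the parent folder looks like this:

```shell
# Create only the parent folder; the indexer creates the
# /tmp/data/neodarznet part itself from the source option.
mkdir -p /tmp/data
```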
The command for indexing is:
indexer --config sphinx_search.conf --all
Don't forget to launch the crawling command before this ;)
Searching
Before you can search, you must launch the search server:
searchd -c sphinx_search.conf
Updating
To update the database and re-crawl only the URLs that are one week old and whose content has been modified, you can use:
python app.py update
Cron
If you want to use cron to start the tasks, you can. But this app also provides a cron-like function which, for the moment, does nothing more than cron. To start it, just use the following command:
python app.py cron
All the configuration is in the cron.conf file and the syntax is the same
as UNIX cron, with some project-specific additions; check the first line of the
file, which is a comment describing its structure.
Note 1: There is an id column; make sure all ids are different, otherwise the last task erases any previous one with the same id.
Note 2: This is only implemented for the crawl function for the moment.
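As an illustration only (the authoritative column layout is the comment on the first line of cron.conf), an entry combining an id, the usual UNIX cron fields, and a task name could look like:

```
# hypothetical layout: id, standard cron fields, task name
# id  minute  hour  day-of-month  month  day-of-week  task
1     0       3     *             *      *            crawl
```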
Enjoy
To start searching, send a POST request with the Manticore-search JSON API,
for example:
http POST 'http://localhost:8080/json/search' < mysearch.json
This is the content of the mysearch.json file:
{
    "index": "neodarznet",
    "query": { "match": { "content": "Livet" } },
    "highlight": {
        "fields": {
            "content": {},
            "url": {},
            "title": {}
        },
        "pre_tags": "_",
        "post_tags": "_"
    }
}
You can find more information about the available HTTP search API in the Manticore-search documentation.
Results are in JSON format. If you want to know which websites are indexed,
search the sphinx_search.conf file for all the lines that start with index.