Simple search engine

# Installing

It is recommended to use [virtualenv](https://virtualenv.pypa.io).

```
pip install -r requirements.txt
```

## Testing

If you just want to test, don't want to install a PostgreSQL database, but have Docker installed, just use the `docker-compose.yml`. This is only for testing; do not use this docker-compose file in production!

## Sphinx-search / Manticore-search

You must use [Manticore-search](https://manticoresearch.com/) because the searx engines rely on the JSON search API. You can use [Sphinx-search](http://sphinxsearch.com/) instead if you don't want to use the JSON search API. Be aware that, as of January 2019, the latest version of Sphinx-search (the 3.x series) is distributed as closed-source instead of open-source.

# Configuration

## Database

The database used for this project is PostgreSQL; you can update the login information in the `config.py` file.

## Manticore-search

The configuration for this is in the `sphinx_search.conf` file. To update this file, please see the [Manticore-search documentation](https://docs.manticoresearch.com). Keep in mind that `config.py` must be kept up to date and consistent with `sphinx_search.conf`.

# Crawling

For now there is an example spider for the neodarz website. To launch all the crawlers, use the following command:

```
python app.py crawl
```

You can also specify a single spider to crawl, for example `nevrax_crawler`, with the command:

```
python app.py crawl nevrax_crawler
```

# Indexing

Before launching the indexing or searching command, verify that the folder given by the `path` option exists on your system. (Warning: the last component of the `path` option is the value of the `source` option; do not create that folder itself, only its parent folder.) Example with the configuration for the `neodarznet` indexer:

```
index neodarznet {
    source = neodarznet
    path = /tmp/data/neodarznet
}
```

Here the folder to create is `/tmp/data/`.

The command for indexing is:

```
indexer --config sphinx_search.conf --all
```

Don't forget to launch the crawling command before this ;)

# Searching

Before you can search, you must launch the search server:

```
searchd -c sphinx_search.conf
```

# Updating

To update the database and only re-crawl URLs that are one week old and whose content has been modified, you can use:

```
python app.py update
```

# Cron

If you want to use cron to start tasks, you can. But this app also ships a cron-like function which, for the moment, does nothing more than cron itself. To start it, just use the following command:

```
python app.py cron
```

All the configuration is in the `cron.conf` file, and the syntax is the same as [UNIX cron](https://apscheduler.readthedocs.io/en/latest/modules/triggers/cron.html?highlight=cron), with some project-specific additions; check the first line of the file, which is a comment describing its structure.

Note 1: There is an id column; make sure all ids are different, otherwise the last task erases any previous one with the same id.

Note 2: This is only implemented for the crawl function for the moment.
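Since the cron syntax reference above points at APScheduler's cron trigger, the built-in cron function presumably sits on top of [APScheduler](https://apscheduler.readthedocs.io) (an assumption based on that link, not something this README states). A minimal standalone sketch of that trigger, independent of this app's `cron.conf` handling and with made-up schedule values:

```
from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

# Same field semantics as a UNIX cron entry: run every day at 03:00.
# The schedule and the job body are illustrative only.
@sched.scheduled_job('cron', hour=3, minute=0)
def crawl_job():
    print("launch the crawl here, e.g. what `python app.py crawl` does")

sched.start()
```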
# Enjoy

To start searching, send a `POST` request with the Manticore-search JSON API, for example:

```
http POST 'http://localhost:8080/json/search' < mysearch.json
```

This is the content of `mysearch.json`:

```
{
    "index": "neodarznet",
    "query": {
        "match": {
            "content": "Livet"
        }
    },
    "highlight": {
        "fields": {
            "content": {},
            "url": {},
            "title": {}
        },
        "pre_tags": "_",
        "post_tags": "_"
    }
}
```

You can find more information about the HTTP search API in the [Manticore-search documentation](https://docs.manticoresearch.com/latest/html/httpapi_reference.html). Results are in JSON format.

If you want to know which websites are indexed, search the file [sphinx_search.conf](https://git.khaganat.net/neodarz/khanindexer/blob/master/sphinx_search.conf) for all the lines that start with `index`.
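For example, with a standard `grep`:

```
grep '^index' sphinx_search.conf
```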
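If you prefer to query from Python rather than with `http`, here is a minimal sketch using the `requests` library (an assumption: `requests` is not mentioned anywhere in this README; the endpoint and payload are the same as in the example above):

```
import json
import requests

# Same payload as mysearch.json above.
payload = {
    "index": "neodarznet",
    "query": {"match": {"content": "Livet"}},
    "highlight": {
        "fields": {"content": {}, "url": {}, "title": {}},
        "pre_tags": "_",
        "post_tags": "_",
    },
}

# POST against the Manticore-search HTTP JSON endpoint shown above.
response = requests.post("http://localhost:8080/json/search", json=payload)
response.raise_for_status()

# Results come back as JSON; pretty-print them.
print(json.dumps(response.json(), indent=2))
```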