# khanindexer

A simple search engine.

# Installing

It is recommended to use [virtualenv](https://virtualenv.pypa.io).

```
pip install -r requirements.txt
```
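A typical setup might look like this (the environment name `venv` is arbitrary):

```
# Create an isolated Python environment and activate it
virtualenv venv
. venv/bin/activate

# Install the project dependencies inside the environment
pip install -r requirements.txt
```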

## Testing

If you just want to test and don't want to install a PostgreSQL database,
but have Docker installed, simply use the `docker-compose.yml`.

This setup is intended for testing only; do not use the `docker-compose.yml`
file in production!
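With Docker installed, bringing up the test database is typically a single command (assuming a standard `docker-compose` installation):

```
# Start the services defined in docker-compose.yml, in the background
docker-compose up -d
```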

## Sphinx-search / Manticore-search

You should use [Manticore-search](https://manticoresearch.com/) because the
searx engines rely on its JSON search API.

You can use [Sphinx-search](http://sphinxsearch.com/) instead if you do not
need the JSON search API. Be aware that, as of January 2019, the latest
versions of Sphinx-search (the 3.x series) are distributed as closed-source
instead of open-source.

# Configuration

## Database

The database used for this project is PostgreSQL; you can update the login
information in the `config.py` file.

## Manticore-search

The configuration lives in the `sphinx_search.conf` file. To update it, see
the [Manticore-search documentation](https://docs.manticoresearch.com).
Keep in mind that `config.py` must be kept up to date in accordance with the
`sphinx_search.conf` file.

# Crawling

For now there is an example spider for the neodarz website.
To launch all the crawlers, use the following command:

```
python app.py crawl
```

You can also specify a single spider to crawl, for example `nevrax_crawler`,
with the command:

```
python app.py crawl nevrax_crawler
```

# Indexing

Before launching the indexing or searching commands, you must verify that the
folder given by the `path` option exists on your system. (Warning: the last
component of the `path` option is the value of the `source` option; do not
create that folder itself, only its parent folder.)

For example, with the configuration of the `neodarznet` indexer:

```
index neodarznet {
    source = neodarznet
    path = /tmp/data/neodarznet
}
```
Here the folder to create is `/tmp/data/`.
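With that example configuration, only the parent folder has to exist; you can create it with:

```
# Create the parent folder only; the indexer manages /tmp/data/neodarznet itself
mkdir -p /tmp/data
```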

The command for indexing is:
```
indexer --config sphinx_search.conf --all
```

Don't forget to run the crawling command before indexing ;)

# Searching

Before you can search, you must launch the search server:
```
searchd -c sphinx_search.conf
```
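Once the daemon is up, you can check that it answers (assuming your `searchd` build supports the `--status` flag):

```
# Query the running search daemon for its status
searchd -c sphinx_search.conf --status
```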

# Updating

To update the database, crawling only URLs that are at least one week old and
whose content has been modified, you can use:
```
python app.py update
```

# Cron

You can use cron to schedule these tasks if you want.
However, the app also ships with a cron-like function that, for the moment,
does nothing more than cron does. To start it, use the following command:

```
python app.py cron
```

All the configuration is in the `cron.conf` file, and the syntax is the same
as [UNIX cron](https://apscheduler.readthedocs.io/en/latest/modules/triggers/cron.html?highlight=cron),
with some project-specific extensions; check the first line of the file, which
is a comment describing its structure.

Note 1: There is an id column; make sure all ids are different, otherwise a
task will overwrite any previous task with the same id.

Note 2: For the moment this is only implemented for the crawl function.

# Enjoy

To start searching, send a `POST` request to the Manticore-search JSON API,
for example:

```
http POST 'http://localhost:8080/json/search' < mysearch.json
```

Here is the content of `mysearch.json`:

```
{
  "index": "neodarznet",
  "query": { "match": { "content": "Livet" } },
  "highlight":
  {
    "fields":
    {
      "content": {},
      "url": {},
      "title": {}
    },
    "pre_tags": "_",
    "post_tags": "_",
  }
}
```
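If you prefer curl over httpie, an equivalent request might look like this (same endpoint and file as above):

```
# POST the JSON query file to the Manticore HTTP search endpoint
curl -s -X POST 'http://localhost:8080/json/search' \
     -H 'Content-Type: application/json' \
     -d @mysearch.json
```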

You can find more information about the available HTTP search API in the
[Manticore-search documentation](https://docs.manticoresearch.com/latest/html/httpapi_reference.html).

Results are in JSON format. If you want to know which websites are indexed,
search the file [sphinx_search.conf](https://git.khaganat.net/neodarz/khanindexer/blob/master/sphinx_search.conf)
for all the lines that start with `index`.