GSWA WAMEX Search Interface

Overview

This website allows quick searching of all text in the WAMEX collection of reports. There are currently 95,090 searchable reports in the collection, with all text coming from OCR of the PDF files in each report.

Documentation

To perform a basic search, click the search bar and type your query. Then press Enter or click the search icon on the right.

Change the number of results

To change the number of results returned, hover over the “10 results” button and select a different amount from the dropdown. Note that requesting more results increases the load time.


Search Results

The results for your query are displayed in cards. The left side of each card contains the report’s project name, report number and project title. The white section of the card contains an extract showing where your query was found within the document. In the bottom right of the card, you can click “Open Report”, which links to the original report in the GSWA portal. From there the report can be downloaded.

Query Language

Capitals

The search is not case-sensitive, so capitalisation does not matter.

Or

To return results that contain one phrase or another phrase anywhere within the report, use the word " or " with a space on either side.

For example, a query such as gold or copper returns every report that mentions "gold", "copper", or both.

And

The ‘and’ query differs from the ‘or’ query in that it returns a result only if both sides of the " and " are matched within the same section of the report. An important distinction is that the ‘and’ query does not match the two phrases across the whole report, only when they appear in the same section. A section is a paragraph; a paragraph longer than 13 words is split into multiple sections. The reason for this is explained in the Technical Details section of this documentation.

For example, a query such as gold and copper returns a report only if "gold" and "copper" both appear within a single section of that report.
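
The difference between the two query types can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the code the site runs: the function names are hypothetical, long paragraphs are assumed to be cut into consecutive runs of at most 13 words, and matching is simplified to a plain substring test.

```python
import re

MAX_WORDS = 13  # section length limit described above


def split_into_sections(report_text, max_words=MAX_WORDS):
    """Split text into paragraphs, then cut long paragraphs into word-limited sections."""
    sections = []
    for paragraph in re.split(r"\n\s*\n", report_text):
        words = paragraph.split()
        # Assumption: a long paragraph becomes consecutive runs of max_words words.
        for start in range(0, len(words), max_words):
            sections.append(" ".join(words[start:start + max_words]))
    return sections


def matches(report_text, query):
    """Return True if the report matches a single 'or', 'and' or plain query."""
    sections = [s.lower() for s in split_into_sections(report_text)]
    query = query.lower()  # the search is not case-sensitive
    if " or " in query:
        phrases = [p.strip() for p in query.split(" or ")]
        # 'or': any phrase in any section of the report is enough.
        return any(p in s for s in sections for p in phrases)
    if " and " in query:
        phrases = [p.strip() for p in query.split(" and ")]
        # 'and': every phrase must occur together in one section.
        return any(all(p in s for p in phrases) for s in sections)
    return any(query.strip() in s for s in sections)


report = "Gold was found in the northern pit.\n\nCopper assays were taken later."
print(matches(report, "gold or copper"))   # True: each word appears somewhere
print(matches(report, "gold and copper"))  # False: never in the same section
```

In this sketch the second query fails even though both words occur in the report, because they fall in different sections.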

Technical Details

This webpage is hosted on AWS and built using Python and Flask, among other tools. A local Elasticsearch database performs the search.

The ‘and’ query can only match within a section because of how each report is stored in Elasticsearch. A large report, which may be several hundred pages long, must be broken up into smaller documents that are entered into Elasticsearch individually. Otherwise, performance is far too slow to be acceptable.
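
As a rough sketch of what "entered into Elasticsearch individually" could look like, the Python below turns the sections of one report into one small document each. The index name, the field names (`report_number`, `position`, `text`) and the report number used in the example are all hypothetical; the real schema is not described in this documentation. The dictionaries follow the action format accepted by the `bulk` helper in the official Elasticsearch Python client, but nothing here connects to a server.

```python
def build_index_actions(report_number, sections, index_name="wamex-sections"):
    """Yield one bulk-index action per section of a report.

    index_name and the _source field names are assumptions for illustration.
    """
    for position, text in enumerate(sections):
        yield {
            "_index": index_name,
            # A per-section id keeps every chunk of a report separate.
            "_id": f"{report_number}-{position}",
            "_source": {
                "report_number": report_number,  # links the chunk back to its report
                "position": position,            # order of the chunk in the report
                "text": text,                    # the searchable OCR text
            },
        }


actions = list(build_index_actions("A00001", ["first section", "second section"]))
print(len(actions))        # 2
print(actions[1]["_id"])   # A00001-1
```

Because each section carries its report number, a hit on any chunk can still be traced back to the report it came from.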

The ‘paragraph’ is the size we chose to split each document into, because Elasticsearch is far faster at searching many small chunks of text than fewer large ones. This means that when an ‘and’ query is run, Elasticsearch only matches within each small chunk it is given. We cannot feasibly aggregate all of those chunks and compare them to perform an ‘and’ query, because it would be too costly over the 260,000+ files.
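
To make the chunk-level matching concrete, here is a minimal sketch of how a query string might be translated into an Elasticsearch request body. It is an assumption about the general approach rather than the site's actual code: `build_query` and the `text` field are hypothetical names, and only a single " or " or " and " operator per query is handled. The `bool`, `must`, `should`, `minimum_should_match`, `match_phrase` and `size` keys are standard parts of the Elasticsearch query DSL.

```python
import re


def build_query(query, size=10, field="text"):
    """Build a search body for a single 'or', 'and' or plain phrase query."""
    or_parts = re.split(r" or ", query, flags=re.IGNORECASE)
    and_parts = re.split(r" and ", query, flags=re.IGNORECASE)
    if len(or_parts) > 1:
        # 'or': a chunk qualifies if it contains at least one of the phrases.
        clauses = [{"match_phrase": {field: p.strip()}} for p in or_parts]
        bool_query = {"should": clauses, "minimum_should_match": 1}
    elif len(and_parts) > 1:
        # 'and': a single chunk must contain every phrase.
        clauses = [{"match_phrase": {field: p.strip()}} for p in and_parts]
        bool_query = {"must": clauses}
    else:
        bool_query = {"must": [{"match_phrase": {field: query.strip()}}]}
    # size corresponds to the number of results chosen in the dropdown.
    return {"size": size, "query": {"bool": bool_query}}


body = build_query("gold and copper", size=25)
print(body["query"]["bool"]["must"][0])  # {'match_phrase': {'text': 'gold'}}
```

Since every chunk is stored as its own document, the `must` clauses are evaluated against one chunk at a time, which is why an ‘and’ query cannot join phrases that sit in different sections of the same report.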


‘WAMEX Search’ produced by Expedio on behalf of GSWA