As we saw in the previous example, indexing only a document's meta information is usually insufficient. Most of the time, the query string is present only in the document's content. To make that content searchable, we need to parse the document and index its content. ZendSearch\Lucene provides support for indexing the contents of the following document types:
◆ For HTML documents the following are the index document creation methods:
ZendSearch\Lucene\Document\Html::loadHTMLFile($filename)
ZendSearch\Lucene\Document\Html::loadHTML($htmlString)
◆ For Word 2007 documents the following is the index document creation method:
ZendSearch\Lucene\Document\Docx::loadDocxFile($filename)
◆ For PowerPoint 2007 documents the following is the index document creation method:
ZendSearch\Lucene\Document\Pptx::loadPptxFile($filename)
◆ For Excel 2007 documents the following is the index document creation method:
ZendSearch\Lucene\Document\Xlsx::loadXlsxFile($filename)
All these methods return a document of type ZendSearch\Lucene\Document, which can be extended further by adding more index fields to it.
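As a minimal sketch of that idea, the following helper loads an HTML string and then attaches extra fields to the returned document. The function name buildHtmlIndexDocument and the field names owner and upload_id are assumptions made for illustration. The snippet also assumes that ZendSearch is installed and autoloaded (for example, through Composer) before the function is called.

```php
<?php
use ZendSearch\Lucene\Document\Field;
use ZendSearch\Lucene\Document\Html;

/**
 * Hypothetical helper: builds an index document from an HTML string
 * and adds extra fields to it. Field names are illustrative only.
 */
function buildHtmlIndexDocument(string $html, string $owner, int $uploadId)
{
    // Parse the markup into an index document.
    $indexDoc = Html::loadHTML($html);

    // Tokenized, indexed, and stored: searchable by owner name.
    $indexDoc->addField(Field::text('owner', $owner));

    // Stored only: returned with search hits but not searchable.
    $indexDoc->addField(Field::unIndexed('upload_id', (string) $uploadId));

    return $indexDoc;
}
```

The returned document can then be passed to the index's addDocument() method in the same way as a hand-built document.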
So let's get started by indexing the documents that are available in the uploads section.
Perform the following steps for indexing document files:
1. To index Office documents, upload sample Word and Excel documents in the uploads section. In this case, we will upload a Word document and an Excel spreadsheet as follows:
Sample Word 2007 document
Sample Excel 2007 spreadsheet
2. Add the following lines to the indexing function in SearchController, which is located at CommunicationApp/module/Users/src/Users/Controller/SearchController.php, so that the method picks up and indexes Word documents and Excel spreadsheets separately:
if (substr_compare($fileUpload->filename, ".xlsx",
        strlen($fileUpload->filename) - strlen(".xlsx"), strlen(".xlsx")) === 0) {
    // Index an Excel spreadsheet
    $uploadPath = $this->getFileUploadLocation();
    $indexDoc = Lucene\Document\Xlsx::loadXlsxFile(
        $uploadPath . "/" . $fileUpload->filename);
} else if (substr_compare($fileUpload->filename, ".docx",
        strlen($fileUpload->filename) - strlen(".docx"), strlen(".docx")) === 0) {
    // Index a Word document
    $uploadPath = $this->getFileUploadLocation();
    $indexDoc = Lucene\Document\Docx::loadDocxFile(
        $uploadPath . "/" . $fileUpload->filename);
} else {
    // Any other file type: index the meta information only
    $indexDoc = new Lucene\Document();
}
$indexDoc->addField($label);
$indexDoc->addField($owner);
$indexDoc->addField($fileUploadId);
$index->addDocument($indexDoc);
3. Now update the index (navigate to http://comm-app.local/users/search/generateIndex), come back to the Document Search page, and try searching for keywords that are present in the document.
Search results for the content inside Office documents will be as shown in the following screenshot:
What just happened?
In the last task we saw the implementation of indexing and searching the content of Microsoft Office documents. As you can see, it is relatively easy to implement these features using ZendSearch\Lucene.
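The file extension checks used in the indexing function grow quickly as more document types are added. One possible alternative, shown here only as a sketch, is a lookup table keyed by extension. The function name resolveDocumentLoader is hypothetical. The class and method names in the table are the ones listed at the start of this section.

```php
<?php
/**
 * Hypothetical helper: maps a file name to the loader class and
 * static method that can parse it, or null for unsupported types.
 */
function resolveDocumentLoader(string $filename): ?array
{
    $loaders = [
        'html' => ['ZendSearch\Lucene\Document\Html', 'loadHTMLFile'],
        'docx' => ['ZendSearch\Lucene\Document\Docx', 'loadDocxFile'],
        'pptx' => ['ZendSearch\Lucene\Document\Pptx', 'loadPptxFile'],
        'xlsx' => ['ZendSearch\Lucene\Document\Xlsx', 'loadXlsxFile'],
    ];

    // pathinfo() returns an empty string when there is no extension.
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    return $loaders[$extension] ?? null;
}

// Possible use inside the indexing function (not executed here):
// $loader = resolveDocumentLoader($fileUpload->filename);
// $indexDoc = ($loader !== null)
//     ? call_user_func($loader, $uploadPath . "/" . $fileUpload->filename)
//     : new Lucene\Document();
```

Because the extension is lowercased before the lookup, a file named REPORT.XLSX is handled the same way as report.xlsx, which the substr_compare() version does not do.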
Here is a simple task for you before you move on to the next chapter. Now that we have implemented indexing and searching, your task will be to modify the entities so that the index is updated each time changes are made to uploads. If a new upload is made, a document needs to be added to the index; if an upload is deleted, its document should be removed from the index, and so on.
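As a starting point, the removal half of this task could look like the following sketch. It assumes that each upload's ID was also added to the index as a keyword field, because an unIndexed field cannot be searched. The function name removeUploadFromIndex and the field name upload_key are hypothetical, and $index is expected to be an already opened ZendSearch\Lucene index.

```php
<?php
use ZendSearch\Lucene\Index\Term;
use ZendSearch\Lucene\Search\Query\Term as TermQuery;

/**
 * Hypothetical helper: deletes every index document whose keyword
 * field $idField matches the given upload ID. Returns the number
 * of documents removed.
 */
function removeUploadFromIndex($index, int $uploadId, string $idField = 'upload_key'): int
{
    // An exact-match term query; keyword fields are not tokenized.
    $query = new TermQuery(new Term((string) $uploadId, $idField));

    $removed = 0;
    foreach ($index->find($query) as $hit) {
        // $hit->id is the internal document number within the index.
        $index->delete($hit->id);
        $removed++;
    }

    // Flush the deletions so that later searches no longer see them.
    $index->commit();

    return $removed;
}
```

Lucene indexes do not update documents in place, so handling a changed upload would mean calling a helper like this one first and then adding a freshly built document with addDocument().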
Q1. Which of the following field types is not tokenized, yet is indexed and stored?
1. keyword()
2. unStored()
3. text()
4. unIndexed()
Q2. Which of the following file formats is not supported by ZendSearch\Lucene as a valid document format for content indexing?
1. .docx
2. .pdf
3. .xlsx
4. .html
Summary
In this chapter, we learned how to implement a simple search interface using ZendSearch\Lucene. This is useful when implementing search in any web application that you work with. In the next chapter, we will learn how to implement a simple e-commerce store using Zend Framework 2.0.