Thursday, April 24, 2014

Indexing Microsoft Office documents [Search Using Lucene]

As we have seen in the previous example, it is usually insufficient to index only the documents' meta information. Most of the time the query string is present only in the document's content. To make that content searchable, we need to parse the document and index its content; ZendSearch\Lucene supports indexing the contents of the following document types:

        For HTML documents the following are the index document creation methods:
ZendSearch\Lucene\Document\Html::loadHTMLFile($filename)
ZendSearch\Lucene\Document\Html::loadHTML($htmlString)

        For Word 2007 documents the following is the index document creation method:
ZendSearch\Lucene\Document\Docx::loadDocxFile($filename)

        For PowerPoint 2007 documents the following is the index document creation method:
ZendSearch\Lucene\Document\Pptx::loadPptxFile($filename)

        For Excel 2007 documents the following is the index document creation method:
ZendSearch\Lucene\Document\Xlsx::loadXlsxFile($filename)

All these methods return a document of type ZendSearch\Lucene\Document, which can be extended further by adding more index fields to it.

So let's get started by indexing the documents that are available in the uploads section.

Perform the following steps for indexing document files:

1.       To index Office documents, add a new uploads section for sample Word and Excel documents. In this case, we will upload a Word document and an Excel spreadsheet as follows:


Sample Word 2007 document


Sample Excel 2007 spreadsheet

2.       Add the following lines to the indexing function in SearchController (CommunicationApp/module/Users/src/Users/Controller/SearchController.php), so that the method picks up and indexes Word documents and Excel spreadsheets separately:

    if (substr_compare($fileUpload->filename, ".xlsx",
            strlen($fileUpload->filename) - strlen(".xlsx"), strlen(".xlsx")) === 0) {
        // index excel sheet
        $uploadPath = $this->getFileUploadLocation();
        $indexDoc = Lucene\Document\Xlsx::loadXlsxFile(
            $uploadPath . "/" . $fileUpload->filename);
    } elseif (substr_compare($fileUpload->filename, ".docx",
            strlen($fileUpload->filename) - strlen(".docx"), strlen(".docx")) === 0) {
        // index word doc
        $uploadPath = $this->getFileUploadLocation();
        $indexDoc = Lucene\Document\Docx::loadDocxFile(
            $uploadPath . "/" . $fileUpload->filename);
    } else {
        // any other file type: index only the meta information fields
        $indexDoc = new Lucene\Document();
    }
    $indexDoc->addField($label);
    $indexDoc->addField($owner);
    $indexDoc->addField($fileUploadId);
    $index->addDocument($indexDoc);

3.       Now update the index (navigate to http://comm-app.local/users/search/generateIndex), come back to the Document Search page, and try searching for keywords that are present in the document. You should be able to see the search results as shown in the following screenshot:


Search results for the content inside Office documents will be as shown in the following screenshot:


What just happened?
In the last task we saw the implementation of indexing and searching the content of Microsoft Office documents. As you can see, it is relatively easy to implement these features using ZendSearch\Lucene.
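
The extension checks in step 2 can also be written with PHP's built-in pathinfo() function. The following is only a minimal sketch under a few assumptions: the function name loaderForUpload() is hypothetical and not part of ZendSearch\Lucene, the map simply lists the loader methods named earlier in this article, and, unlike the substr_compare() checks, the match is case-insensitive. It only selects a loader name and does not call the library, so it runs on its own.

```php
<?php
// Hypothetical helper (not part of ZendSearch\Lucene): picks the static
// loader method for a file name, or null when content indexing does not apply.
function loaderForUpload($filename)
{
    // Loader methods named in this article, keyed by lower-case extension.
    $loaders = array(
        'xlsx' => 'ZendSearch\Lucene\Document\Xlsx::loadXlsxFile',
        'docx' => 'ZendSearch\Lucene\Document\Docx::loadDocxFile',
        'pptx' => 'ZendSearch\Lucene\Document\Pptx::loadPptxFile',
        'html' => 'ZendSearch\Lucene\Document\Html::loadHTMLFile',
    );

    // pathinfo() returns an empty string when the name has no extension.
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    return isset($loaders[$extension]) ? $loaders[$extension] : null;
}
```

When the helper returns null, the controller would fall back to an empty Lucene\Document, as in the else branch of step 2. Keeping the extensions in one array means that supporting another document type becomes a one-line change.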

Here is a simple task for you before you move on to the next chapter. Now that we have implemented indexing and searching, your task will be to modify the entities so that the index is updated each time changes are made to uploads. If a new upload is made, a document needs to be added to the index, and if an upload is deleted, it should be removed from the index, and so on.

Q1. Which of the following field types is not tokenized, yet is indexed and stored?

1.       keyword()
2.       unStored()
3.       text()
4.       unIndexed()

Q2. Which of the following file formats is not supported by ZendSearch\Lucene as a valid document format for content indexing?

1.       .docx
2.       .pdf
3.       .xlsx
4.       .html


Summary


In this chapter we have learned about implementing a simple search interface using ZendSearch\Lucene. This will be very useful when implementing search in any web application that you work with. In the next chapter we will be learning about implementing a simple e-commerce store using Zend Framework 2.0.
