<?xml version="1.0" encoding="UTF-8" ?>

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
   
    <title>thebhwgroup.com</title>
   
   <link>https://thebhwgroup.com</link>
   <description>We build web and mobile applications for businesses of all sizes.
</description>
   <language>en-us</language>
   <managingEditor>bhwinfo@thebhwgroup.com</managingEditor>
   <atom:link href="rss" rel="self" type="application/rss+xml" />
   
    
	  <item>
        <title>Unified Search for Business Data from Multiple Sources</title>
        <link>https://thebhwgroup.com/blog/2013/11/unified-search-for-business-data</link>
		<author>Jason Gray</author>
		<pubDate>2013-11-25T00:00:00+00:00</pubDate>
		<guid isPermaLink="true">https://thebhwgroup.com/blog/2013/11/unified-search-for-business-data</guid>
		<description><![CDATA[
		   <h2>Background</h2>

One of our clients provides an information-rich website for their customers, containing business data from multiple sources.  Like all companies, our client wants to make their content easily accessible to their website users; however, without a unified search strategy for their business data, they could only provide search for a small portion of it.  The client’s business data comprises static HTML that changes infrequently, reported financial data that changes quarterly, and real-time business data concerning real-estate holdings.  This particular client is a publicly traded company, subject to regulatory and reporting requirements concerning their quarterly reports and supporting documents, and they wanted these quarterly reports to be available for search as well.  As unstructured data, these reports lacked a database or other means of supporting a search.

Viewed as a whole, the site has numerous pages of edited copy with great detail about the company’s expertise, financials, investments, holdings, etc.  There are numerous electronic documents (primarily PDF and Excel quarterly reports) available for download from the site, but not stored in a database.  There is also a database that stores data on real-estate properties.

Today there are many CMS options that can address the search challenges we have presented.  Our client built their custom CMS years ago and has invested a significant amount of time, energy, and money maintaining the system and training their users.  It was important for BHW to provide our client with an option that extends their current system and leverages that investment while addressing these search challenges.  Our client decided to extend their current system rather than adopt a new CMS, build new integrations to their back-end systems, and retrain their users.

The client requested a search mechanism that could handle:
<ol>
	<li>The static “infrequently changing” HTML content.</li>
	<li>The quarterly changing documents and spreadsheets of data.</li>
	<li>The constantly changing data on properties managed through the CMS.</li>
</ol>

The client could easily identify the data that was valuable to include in searching, but they did not have the infrastructure to manage that data in a single place or to push it to one place (database) for storage.  The “website” was the system of record for the valuable data.

<h2>The Problem – Our data comes from multiple sources</h2>

We’ve defined the data we would like to search, but the data is not stored in a consistent place or format.  Focusing on any one of the data sources would leave out huge chunks of valuable data from the others.  We need a way to categorize the data for search so we can point each search result to the correct data source.  Pointing back to the data source allows matching search results to direct the user to a document, a static web page, or a dynamic web page with database content.  We also need a way to keep this search index updated frequently without affecting website traffic.  The website is hosted on a Windows Network Load Balanced (NLB) pair of servers, so we also need our solution to support a web cluster, be redundant, and provide high-availability and scaling options.

<h2>The Solution – Lucene.NET</h2>

We created a solution using Lucene.NET that would allow us to create a search index on:

<ol>
	<li>All web pages in the website directory (with or without dynamic content embedded).
<ul>
	<li>We needed the ability to filter out specific pages that were “template” pages with dynamic data from SQL server tables.</li>
	<li>We needed to focus only on the true “content” of each page and not any of the repeated navigational elements, page titles, meta tags, etc.</li>
</ul>
</li>
	<li>All electronic documents (PDF and XLSX) that could be added to each web page through the CMS.</li>
	<li>Specific data from tables in the SQL Server 2008 database.</li>
</ol>

The Lucene.Net index is generated by a stand-alone scheduled task.  The indexer updates the search index every day in the early morning and system admins can force an update of the index at any time.  The current search index is not affected by the creation of the new index.  The indexer creates new index files within the main website root directory.
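The rebuild-and-swap approach described above can be sketched as follows.  This is a minimal sketch, not the production indexer: the directory paths, the `content` and `type` field names, and the delete-and-move swap are illustrative assumptions on top of the Lucene.Net 2.9 API.

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class IndexRebuilder
{
    // Build the new index in a scratch directory, then swap it into place,
    // so searches against the live index are never interrupted mid-build.
    public static void RebuildIndex(string liveIndexPath, string scratchPath)
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_29);
        var writer = new IndexWriter(FSDirectory.Open(new DirectoryInfo(scratchPath)),
                                     analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
        try
        {
            // One Document per web page, database row, or downloadable asset.
            var doc = new Document();
            doc.Add(new Field("content", "sample page text",
                              Field.Store.YES, Field.Index.ANALYZED));
            doc.Add(new Field("type", "page",
                              Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.AddDocument(doc);
            writer.Optimize();
        }
        finally
        {
            writer.Close();
        }

        // Swap: retire the old index and move the freshly built one into place.
        if (System.IO.Directory.Exists(liveIndexPath))
            System.IO.Directory.Delete(liveIndexPath, true);
        System.IO.Directory.Move(scratchPath, liveIndexPath);
    }
}
```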

{% include full-blog-image.html src="unified_search.png" alt="unified search using Lucene" caption="A process to index business data from multiple sources using Lucene." %}

<h2>The indexer code</h2>
<ul>
	<li>Lucene.Net</li>
	<li>iTextSharp for PDF parsing</li>
</ul>

```csharp
using System.Configuration;
using System.IO;
using System.Net;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using Directory = Lucene.Net.Store.Directory;
using Version = Lucene.Net.Util.Version;
```

<h2>Index steps (high level):</h2>
<ul>
	<li>Index the web pages</li>
</ul>

```csharp
string webroot = ConfigurationManager.AppSettings["Webroot"];
string[] directories = System.IO.Directory.GetDirectories(webroot);
foreach (string directory in directories)
{
  string[] files = System.IO.Directory.GetFiles(directory);
  foreach (string file in files)
  {
      // Request each page through the web server so we index the content
      // exactly as it is served (with any dynamic content rendered).
      string relativeUrl = Path.GetFileName(directory) + "/" + Path.GetFileName(file);
      HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
          ConfigurationManager.AppSettings["WebsiteUrl"] + "/" + relativeUrl);
      HttpWebResponse response = (HttpWebResponse)request.GetResponse();
      Stream stream = response.GetResponseStream();

      // Regular-expression parsing extracts only the parts of the page HTML
      // considered “CONTENT”, skipping the repeated navigation, titles, and
      // meta tags.  Formatting is applied per website section so each search
      // result looks like it belongs to its part of the site.

      // Finally, a Lucene “Document” containing the formatted string of data
      // is added to the index.
  }
}
```

<ul>
	<li>Index the database tables (of interest)
<ul>
	<li>Connect, query, and convert the data into a Lucene document, then add it to the index.</li>
</ul>
</li>
	<li>Index the content of the PDF and XLS files
<ul>
	<li>The pointers to our downloadable document “Assets” are stored in a database table.</li>
	<li>We grab those pointers, open each document with iTextSharp, read its content, convert it to a Lucene document, and add it to the index.</li>
</ul>
</li>
	<li>Store the index on the web server in the main webroot directory, where it is easily accessible from the website search feature.</li>
</ul>
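Reading the text of a PDF asset with iTextSharp looks roughly like the sketch below.  The `content`, `url`, and `type` field names are illustrative assumptions, as is the idea that the caller already resolved the asset pointer from the database into a file path.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using Lucene.Net.Documents;

class PdfAssetIndexer
{
    // Extract the full text of a PDF and wrap it in a Lucene document
    // that records where the result should link back to.
    public static Document ToLuceneDocument(string pdfPath, string linkUrl)
    {
        var text = new StringBuilder();
        var reader = new PdfReader(pdfPath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
        }
        finally
        {
            reader.Close();
        }

        var doc = new Document();
        doc.Add(new Field("content", text.ToString(),
                          Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new Field("url", linkUrl,
                          Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("type", "asset",
                          Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }
}
```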

<h2>Using the Index</h2>

The ASP.Net web application has a custom search feature in the main top-level menu throughout the site and an “advanced search” page that is hooked into the search index.  Search results are returned to the “search results” page in a way that identifies each result as a web page, a downloadable document, or a dynamic property result, and the page formats the result accordingly.

```csharp
//Setup Index
string indexDirectory = ConfigurationManager.AppSettings["IndexDirectory"];
Directory directory = FSDirectory.Open(new DirectoryInfo(indexDirectory));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);

IndexReader indexReader = IndexReader.Open(directory, true);
Searcher indexSearch = new IndexSearcher(indexReader);

var queryParser = new QueryParser(Version.LUCENE_29, "content", analyzer);
var query = queryParser.Parse(searchBy);

TopDocs resultDocs = indexSearch.Search(query, indexReader.MaxDoc());
var hits = resultDocs.scoreDocs;
foreach (var hit in hits)
{
    // As mentioned above, we do quite a bit of custom “configuration” within
    // the search index so that each result is categorized appropriately.
    // The category lets us format the result (color coding based on section
    // of the website) and point the user to the correct web page or
    // downloadable document.
}
```
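As an illustration of that categorization step, a stored type field can drive the formatting on the results page.  The field values and CSS class names below are hypothetical, not the client’s actual scheme:

```csharp
class SearchResultFormatter
{
    // Map an indexed document's type field to the CSS class used to
    // color-code the result on the search results page.
    public static string CssClassFor(string resultType)
    {
        switch (resultType)
        {
            case "page":     return "result-page";      // static or dynamic web page
            case "asset":    return "result-download";  // PDF/XLSX download
            case "property": return "result-property";  // real-estate record
            default:         return "result-generic";
        }
    }
}
```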

<h2>Additional Resources</h2>
<a href="http://www.codeproject.com/Articles/320219/Lucene-Net-ultra-fast-search-for-MVC-or-WebForms">http://www.codeproject.com/Articles/320219/Lucene-Net-ultra-fast-search-for-MVC-or-WebForms</a>
<a href="http://www.codeproject.com/Articles/29755/Introducing-Lucene-Net">http://www.codeproject.com/Articles/29755/Introducing-Lucene-Net</a>

<p align="center"><em>Do you need an expert in web development? With a team of web development specialists covering a wide range of skill sets and backgrounds, <a href="https://thebhwgroup.com/services/web-application-development-company-austin-texas" target="_blank">The BHW Group</a> is prepared to help your company make the transformations needed to remain competitive in today’s high-tech marketplace.</em></p>
		]]></description>

    
        <category>Open Source</category>
    
        <category>Microsoft.Net</category>
    
    
        <category>Development</category>
    

	  </item>
    
	  <item>
        <title>Bug Tracking with Open Source Tool Chains</title>
        <link>https://thebhwgroup.com/blog/2013/10/bug-tracking-with-open-source-tool-chains</link>
		<author>Paul Francis</author>
		<pubDate>2013-10-28T00:00:00+00:00</pubDate>
		<guid isPermaLink="true">https://thebhwgroup.com/blog/2013/10/bug-tracking-with-open-source-tool-chains</guid>
		<description><![CDATA[
		   {% include full-blog-image.html src="test-email_0.png" alt="E-mail Bug History" %}

## The Problem – Manual bug tracking is inefficient
Software projects include numerous stakeholders who must collaborate to successfully launch a project.  The bug tracking, correction, and validation process is usually the project phase that requires input from the greatest number of these stakeholders.  This process is made more difficult by the complexity of bug reports themselves. Developers must understand how to reproduce the bug, the environment(s) in which the bug occurs, and the severity of the bug.  Quality Assurance must be aware of all current issues, to avoid filing duplicate reports, and must know when bug fixes have been deployed.  Project managers benefit from having an easy way to view the statuses, quantity, and severity of bugs. This is far from a complete list of stakeholders and their requirements for a bug tracking system. With this in mind, it is concerning that so many companies rely on tools such as email and Excel to keep track of bugs. Excel reports quickly become outdated, individuals might mistakenly update and distribute an outdated workbook, and companies often use email to distribute these files. Other companies rely solely on email for bug tracking. This leads to cumbersome email chains and does not provide a master bug list, resulting in wasted time, unnecessary frustration, and dropped bugs.

## The Solution – Bug tracking utilizing open source tools
There are a number of open source options for creating a bug tracking process. This article will discuss a solution utilizing Bugzilla, Git, and Jenkins; however, various other tools can be substituted to fit your current workflow.

## Bug Reporting
The most crucial element of this process is using a tool specifically built for bug reporting. [Bugzilla](http://www.bugzilla.org) offers a master copy of the bug list, email notification of bug status changes, and an extreme level of detail for each bug. New bugs can be assigned to individuals or teams, emails are automatically sent to the assignees, and the assignees can easily access a detailed report of the new bug. Assigners can monitor the statuses of the bugs and are notified when their attention is required for one of their reports. Project managers can easily view reports of bugs including summaries of statuses, severities, and the time-to-correction of bugs.

## Source Control Integration
Every software company should already be utilizing some form of source control. In recent years, Git has become the open source solution of choice. Plug-ins for Git, such as [GitZilla](https://github.com/gera/gitzilla), can simplify the process of updating the status of bugs when commits are made that address certain bug reports. Often developers will add comments in their commits indicating which bugs they fixed. Without using a tool chain, one would have to read through the commit logs to learn in which version of the source code a particular bug was fixed. If developers are already taking the time to write which bugs have been affected by a commit, why not integrate the commit with the bug? GitZilla is one of the available tools that do just that.

## Deployment Integration
There is a whole host of reasons to utilize a build/deployment tool, but most pertinent amongst them is the ability to complete the bug reporting chain. There are several open source tools that integrate Jenkins builds with Git and Bugzilla. These can be used to automatically trigger builds after commits ([Git plug-in for Jenkins](https://wiki.jenkins-ci.org/display/JENKINS/Git+Plugin)), create links from build descriptions to bug reports, as well as several other actions. Utilizing these tools allows for bug fixes to be deployed as early as possible and reduces the time developers spend deploying and updating bug reports.

## Conclusion
Utilizing an open source tool chain leads to reliable, maintainable, and automated tracking of bug reports that alerts specific stakeholders when their actions are required. This practice reduces the waste and confusion caused by more manual forms of bug tracking, while requiring minimal setup or additional work. Furthermore, since these products are all open source and have substantial plug-in support, they can be tailored to your specific processes and needs.

<p align="center"><em>Do you need an expert in web development? With a team of web development specialists covering a wide range of skill sets and backgrounds, <a href="https://thebhwgroup.com/services/web-application-development-company-austin-texas" target="_blank">The BHW Group</a> is prepared to help your company make the transformations needed to remain competitive in today’s high-tech marketplace.</em></p>

		]]></description>

    
        <category>Open Source</category>
    
    
        <category>Process</category>
    

	  </item>
    
  </channel>
</rss>
