Project Fast Hacks: Solr and Newscoop
Improving search in a content management system
As part of a series of shorter pieces about prototypes, experiments, and other projects, Adam Thomas tells how one content management system, Newscoop, took on the challenge of providing better search using Solr.
News in the country of Georgia is changing. Since the country gained independence in 1991, its five million people have experienced varying degrees of press freedom. Now, although Georgia ranks only 105th of 179 countries in the latest World Press Freedom Index, a new breed of innovative, regional, and open source-powered media organizations is beginning to make a difference.
One such organization is NetGazeti, which has been using Newscoop, Sourcefabric’s open-source media CMS, since 2009. But NetGazeti are more than just users. Alongside 10 other news startups in the region, they’ve been active in helping to develop the tool. So much so, in fact, that the new version of the software is named Sakartvelo, the Georgian name for the country.
In answer to a requirement from these Georgian sites for better search, Newscoop site search can now be powered by Solr, the open-source enterprise search platform used by Netflix, Instagram, SourceForge, Internet Archive, NASA, WhiteHouse.gov, Apple, and many more.
Solr is an open-source HTTP search server from the Apache Lucene project. Though developed jointly, Lucene and Solr are two different things: Lucene is a Java library, not a standalone search engine.
Solr is a search engine server built with Lucene at its core. Solr’s major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable.
New features in Solr 4.0, released last year, include a new web-based user interface, a spell checker, and support for geolocation data (excellent for news sites wanting to offer geographic search). Given the functionality, choosing Solr was an easy decision, but not one we took lightly. “It is a proven solution, powering huge platforms and excelling,” says Newscoop’s lead developer Holman Romero. “It is open source and member of the Apache group, with just the set of features you need, and more. From a technical point of view, it is fast and easy to integrate due to its architecture.”
The challenge for Holman and his team (most of whom are based in Prague) was how to build the Solr index, given that Newscoop allows custom article types and fields. To solve this, a script runs periodically (via cron), looking for new content and updating the Solr index accordingly. A listener set up in Newscoop also allows the same script to run whenever an article is removed.
The second part of the puzzle was dealing with the search requests coming from website users. Once a user submits the search form, the request is handled by Newscoop, which prepares the query in a way Solr understands. Newscoop sends the request to Solr through Solr’s API, and Solr sends back the response in JSON format. Newscoop parses the response and passes it to its template engine, which finally renders the results for the user.
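To make that round trip concrete, here is a minimal sketch of the kind of HTTP request a client can send to Solr's select handler, and which returns JSON. It is not Newscoop's actual code. It assumes Solr is at its default address (http://localhost:8983/solr) and that the index has a field called title; the function names and the optional core name are hypothetical.

```shell
# Assumption: Solr listens on its default address; override SOLR_URL if not.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Build the URL of the select handler.
# $1 = Solr base URL, $2 = optional core name (hypothetical; depends on your setup)
solr_select_url() {
  if [ -n "$2" ]; then
    printf '%s/%s/select\n' "$1" "$2"
  else
    printf '%s/select\n' "$1"
  fi
}

# Run a search and print Solr's raw JSON response.
# $1 = query in Solr syntax, $2 = number of rows (default 10), $3 = optional core name
solr_search() {
  curl -sG "$(solr_select_url "$SOLR_URL" "$3")" \
    --data-urlencode "q=$1" \
    --data-urlencode "rows=${2:-10}" \
    --data-urlencode "wt=json"
}

# Usage (with Solr running and the index populated):
#   solr_search 'title:Tbilisi' 5
```

In the JSON that comes back, the matching documents sit in the response.docs array, next to response.numFound, the total number of hits. A client such as Newscoop would read those before handing the results to its templates.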
Setting up Solr
Holman helped me with this small guide on how to install Solr alongside Newscoop. There’s more info on our wiki about requirements and set-up. Feel free to leave questions below and Holman will help as much as he can.
Firstly, you have to download Solr from Apache and then unpack it in the directory you want to run it from.
$ cp solr-4.1.0.tgz /var/www/
$ cd /var/www
$ tar xvzf solr-4.1.0.tgz
Then we copy Newscoop’s Solr configuration into Solr itself:
$ cp -a /var/www/newscoop/example/solr/* /var/www/solr-4.1.0/example/solr/
It’s important to edit example/solr/solr.xml and add
To run Solr, head to the example folder and run
$ java -jar start.jar
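Once Solr has started (it stays in the foreground, so use a second terminal), you can confirm that it is answering by asking its CoreAdmin handler which cores are loaded. This is a small sketch, assuming Solr is on its default address and that the admin path is the stock /admin/cores; the function names are only illustrative.

```shell
# Assumption: Solr listens on its default address; override SOLR_URL if not.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Print the CoreAdmin STATUS URL for a given Solr base URL.
# Assumes the stock admin path (/admin/cores) has not been changed in solr.xml.
solr_status_url() {
  printf '%s/admin/cores?action=STATUS&wt=json\n' "$1"
}

# Ask Solr which cores it has loaded.
# Returns non-zero if Solr is unreachable or answers with an HTTP error.
solr_status() {
  curl -sf "$(solr_status_url "$SOLR_URL")"
}

# Usage (with Solr running):
#   solr_status
```

If the command prints a JSON document listing your cores, Solr is up and has picked up the configuration.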
All you need to do in Newscoop is enable Solr in the application config file. Open /var/www/newscoop/application/configs/application.ini-dist with your editor of choice and uncomment this line (by removing the semicolon at the beginning of the line):
listener = search.indexer.article
Newscoop looks for Solr by default in http://localhost:8983/solr. If Solr is running on a different address/port, you can override this by changing the following line in the
search.solr_server = 'http://<url>:<port>/solr'
(Make sure to add that line to production and CLI environments.)
Now, we need to store Newscoop content in Solr so that search can start. In other words, we need to populate the Solr index with your content. Run the following command manually to get some data for testing.
$ cd /var/www/newscoop
$ php scripts/newscoop.php index:update
In production environments you’d want to set up a cron job to run the same command periodically, so that your Solr index picks up new articles and changes to existing ones in Newscoop.
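As a sketch of what that cron job could look like, the snippet below prints a crontab line that reruns the index update. It assumes the /var/www/newscoop install path used in this guide; the 15-minute interval is an arbitrary choice, so pick one that matches how often you publish.

```shell
# Assumption: Newscoop lives in the path used throughout this guide.
NEWSCOOP_DIR="/var/www/newscoop"

# Hypothetical schedule: every 15 minutes. Adjust the first field to taste.
CRON_LINE="*/15 * * * * cd $NEWSCOOP_DIR && php scripts/newscoop.php index:update"

# Print the line so it can be pasted into the crontab.
echo "$CRON_LINE"
```

Paste the printed line into the crontab (crontab -e) of the user that normally runs Newscoop's scripts, so the command has the same file permissions as your manual test run.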
Newscoop’s templates have a lot more power and customizability than many other CMS templates. If you want to display Solr search results on the front end (to list the most popular hits on the site for example) you have to add a little extra code to your templates. Fields that can be queried include title, type, webcode, authors, topics, keywords, and a custom field name. More on that over on our wiki.
So, there you have it: how to set up a Newscoop site with Solr-powered search. Thanks to Holman for help on the technical guide, Petr for leading the Solr development, and the other Newscoop team members. And, of course, to all the Georgian sites for their testing and feedback during the development process!