
Solr test dataset


For an open-source project I’m working on I need a good Solr test dataset. More info about the project will follow soon, but as a teaser I can already tell it’s Solr and PHP related 😉
The dataset needs to be of a reasonable size (not unrealistically small, but not huge either) and it should be free to use for anyone, as anyone should be able to test the project.

I’ve worked on quite a lot of Solr projects by now, and have a local environment for most of them. But obviously I cannot use these indexes for anything other than the projects they belong to, let alone redistribute the data.

For some demos and my post on complex Solr faceting I’ve used the dataset from the book ‘Solr 1.4 Enterprise Search Server’, based on MusicBrainz data. But this dataset is not so great for faceting, which is one of the more important features to me. So I decided to look for a better alternative.

The first thing that came to my mind was IMDB. But at first glance their licensing terms seem to be an issue. Then I remembered using a geocoding service based on free location data. With some searching I found it: GeoNames. It seems like a perfect fit:

  • it uses a Creative Commons Attribution 3.0 License
  • the dataset is big enough, and can easily be extended by importing more parts of the dataset
  • the data lends itself well to faceting
  • lots of special characters in the data for testing UTF-8 handling

The only downside is that this dataset has no large text fields, just names. So for text-analysis related features I would need another dataset, maybe some Wikipedia content. But for now this dataset will do just fine.

The next step was to get the data into Solr. I did a quick search and came across this blog post: Solr and Geonames.
I used some of the steps he describes as a starting point; these were the steps I took:

  1. create a new core in my solr test instance
  2. create a schema, see my settings below
  3. I also needed to alter my solrconfig.xml: enableRemoteStreaming needs to be set to "true" (see the snippet after this list). Be aware of security though; don’t do this if your Solr instance can be reached by others!
  4. reload solr to load the new core
  5. download the data file: http://download.geonames.org/export/dump/cities1000.zip
  6. unzip it
  7. I had issues with importing the file due to some quotes in the data. Since only a few of the roughly 100k records gave the error, and it doesn’t matter for my tests, I didn’t really look into it; I ‘fixed’ it by removing the quotes that caused the errors:
    sed 's/"//g' cities1000.txt > cities1000_fixed.txt
  8. start the import by calling this URL (adapt to your own environment):
    http://localhost:8983/solr/geonames/update/csv?commit=true&separator=%09&fieldnames=id,name,,alternative_names,latitude,longitude,,,countrycode,,,,,,population,elevation,,timezone,lastupdate&stream.file=/path/to/file.txt&overwrite=true&stream.contentType=text/plain;charset=utf-8
  9. after waiting a few seconds you should get a confirmation, and you should be able to see the results at:
    http://localhost:8983/solr/geonames/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on
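
For reference, here is roughly what steps 3 and 5 to 8 come down to. The solrconfig.xml change (step 3) is the enableRemoteStreaming attribute in the requestDispatcher section; the surrounding defaults may differ per Solr version, only the attribute matters:

<requestDispatcher handleSelect="true">
  <!-- allow stream.file / stream.url parameters on requests -->
  <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />
</requestDispatcher>

And steps 5 to 8 as shell commands. This is just a rough sketch; adapt the core name, paths and field list to your own environment:

# steps 5 and 6: download and unpack the GeoNames cities dump
wget http://download.geonames.org/export/dump/cities1000.zip
unzip cities1000.zip

# step 7: strip the quotes that break the CSV import
sed 's/"//g' cities1000.txt > cities1000_fixed.txt

# step 8: let Solr stream the tab-separated file into the geonames core
curl 'http://localhost:8983/solr/geonames/update/csv?commit=true&separator=%09&fieldnames=id,name,,alternative_names,latitude,longitude,,,countrycode,,,,,,population,elevation,,timezone,lastupdate&overwrite=true&stream.contentType=text/plain;charset=utf-8&stream.file=/path/to/cities1000_fixed.txt'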

If you are new to Solr these steps are probably described too briefly; just let me know and I’ll write a step-by-step tutorial on request.

These are the schema settings (I left out the standard parts, Solr comes with good examples):

<fields>
   <field name="id" type="int" indexed="true" stored="true" />
   <field name="name" type="string" indexed="true" stored="true" />
   <field name="alternative_names" type="text" indexed="true" stored="true" />
   <field name="latitude" type="string" indexed="true" stored="true" />
   <field name="longitude" type="string" indexed="true" stored="true" />
   <field name="countrycode" type="string" indexed="true" stored="true" />
   <field name="population" type="int" indexed="true" stored="true" />
   <field name="elevation" type="string" indexed="true" stored="true" />
   <field name="timezone" type="string" indexed="true" stored="true" />
   <field name="lastupdate" type="string" indexed="true" stored="true" />
</fields>

<uniqueKey>id</uniqueKey>

<defaultSearchField>name</defaultSearchField>
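
To check that the data indeed lends itself to faceting, a query along these lines (again assuming the core is called geonames as above) returns city counts per country code and per timezone:

http://localhost:8983/solr/geonames/select/?q=*%3A*&rows=0&facet=true&facet.field=countrycode&facet.field=timezone
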
  1. February 26, 2011 at 23:06

    Thanks for the article. Good to know Solr can load CSV files out of the box. Can it load XML files as well? How about other types?
    I’ve been trying to come up with a good ETL strategy to load Solr but so far nothing that works flawlessly. Talend comes close but it has its own problems.

    • February 27, 2011 at 08:38

      There are many ways of getting data into Solr. It really depends on your situation (datasource, available tools, schema etcetera) which one is the best.

      The Solr DataImportHandler has support for multiple datasources.
      – XML files (this includes remote files like RSS feeds)
      – CSV files
      – a direct JDBC database connection using queries
      – emails using an IMAP connection
      – lots of document types using Tika (see http://tika.apache.org/0.9/formats.html)

      For info about the above options see http://wiki.apache.org/solr/DataImportHandler

      You can also use Apache Nutch, a web crawler that can use Solr as an index. This way you can index a complete website just like search engines. An easy and quick way to add search to a site but limited in options.

      Finally, you can also create your own solution using the update API; this allows for the most flexibility. There are libraries available for most languages.
      A hybrid solution is also a possibility: use some of your own tools to create XML files in a suitable format and let the Solr DIH import those.
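
      As an illustration of the XML/RSS option: a minimal data-config.xml sketch for pulling items from a feed. The feed URL and field mappings below are just placeholders; adapt them to your own schema:

      <dataConfig>
        <dataSource type="URLDataSource" />
        <document>
          <!-- read each item from the feed and map a couple of fields -->
          <entity name="item"
                  processor="XPathEntityProcessor"
                  url="http://example.com/feed.xml"
                  forEach="/rss/channel/item">
            <field column="id" xpath="/rss/channel/item/link" />
            <field column="name" xpath="/rss/channel/item/title" />
          </entity>
        </document>
      </dataConfig>

      The handler itself is registered in solrconfig.xml as a /dataimport requestHandler pointing to this config file, and a full import is then triggered with /dataimport?command=full-import.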

  2. February 27, 2011 at 18:28

    Thanks for the quick response. If you want sample customer data try http://www.briandunning.com/sample-data/ – it has up to 35k records for free. I just loaded my Solr instance with that. You can get a CSV file that you can load directly into Solr. I tried loading it using Talend with the SolrJ library.

    The Data Import Handler works fine for a full load, but if you want to do frequent near-real-time loads and keep track of failures, logging, etc., I think doing the ETL in an external tool is better.

  3. February 28, 2011 at 18:04

    If you’re interested, I wrote a blog post about ETLing into Solr using Talend.

  4. December 14, 2011 at 06:41

    Hi,

    I am new to Solr integration. Can you please send me all the steps in detail? It would be very helpful for me.

    Thanks,
    Abhishek

