18: text mining tools

November 20, 2015 – 1:44 pm
ACULibrary
Gold Mining in California. Scenes of the 1849 Californian Gold Rush showing cradling, panning, washing with a 'long tom' and hydraulic mining. Coloured lithograph by Currier and Ives 1871.

Text Mining, also often referred to as Text Data Mining or Text Analytics, is a process of filtering out specific or high-quality information from (usually) a large collection of texts via the use of various statistical and/or machine-learning algorithms.

Text mining tools enable us to extract core facts and trends from a large body of data, and to process those facts into patterns and structures that help us make inferences and predictions from the data. This is a big topic with a large number of tools available, so to get started we’ll look at a few examples that are easy to learn and should help you with basic text analysis.
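As a rough illustration of the simplest kind of text mining, the sketch below counts the most frequent words in a short sample text using only the Python standard library. The sample text, the tiny stop-word list and the function name `top_terms` are all assumptions made for this example; none of the tools discussed in this post necessarily work this way internally.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list (an assumption for this sketch);
# real tools ship much longer lists.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "was", "for", "with"}


def top_terms(text, n=3):
    """Return the n most frequent non-stop-words in text as (word, count) pairs."""
    # Lower-case the text and keep runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Hypothetical sample text, made up for illustration.
sample = (
    "Gold was found in California in 1848. "
    "The gold rush drew miners to California, "
    "and the miners panned the rivers for gold."
)

print(top_terms(sample))
```

Even this small example shows why stop-word removal matters: without it, words such as "the" and "in" would crowd out the terms that actually describe the text.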

Voyant Tools

Voyant Tools (formerly known as Voyeur) is a user-friendly, web-based reading and analysis environment for digital texts. Voyant Tools lets you work with your own text collections in a variety of formats (e.g., plain text, HTML, XML, PDF, RTF, and MS Word). It also allows you to work directly with existing text collections on the Internet just by typing in the website’s URL.

Voyant Tools is probably the most powerful web-based tool for generic text analysis. It particularly excels when you’re dealing with large bodies of text, and it also allows you to develop your own scripts to extend its functionality.

Its web interface is extremely easy to use. You can perform many basic text-analysis tasks without spending too much time reading the manual. Many of its built-in functions (e.g., visualising the frequencies and trends of the selected text within a particular document) are performed automatically as soon as the file is loaded. Voyant also allows you to insert a direct URL link to any Web page and start analysing it automatically.
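To give a feel for what a "trend" within a document means, here is a minimal sketch that splits a text into consecutive segments and counts how often a chosen term appears in each. This is not Voyant's implementation; the function name `term_trend`, the segment count and the sample text are assumptions made for illustration.

```python
import re


def term_trend(text, term, segments=4):
    """Count occurrences of term in consecutive, equal-sized word segments.

    Returns one count per segment. Short texts may yield fewer
    segments than requested.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return []
    # Ceiling division so that every word lands in some segment.
    size = max(1, -(-len(words) // segments))
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    return [chunk.count(term.lower()) for chunk in chunks]


# Hypothetical eight-word sample, split into four two-word segments.
sample = "gold river gold mine river camp camp gold"
print(term_trend(sample, "gold"))
```

Plotting the returned counts as a line chart gives the kind of trend view described above: a rising line means the term becomes more common later in the document.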

There is also a wide range of tools that can be used with Voyant for additional features. Find out more here.

TAPoRware

TAPoRware is a similar suite of online tools that allows you to perform text analysis on HTML, XML and plain text files. It can also analyse websites via their URLs.

TAPoRware is written in Ruby (an open-source programming language). Its tools run in the browser, and each one can also be used as a web service via the TAPoR Portal.

The interface of each tool is clean-cut with a very minimalist feel to it.

Orange Text Mining

Orange Text Mining is an add-on for the Orange data mining software package that provides tools for analysing texts. Orange is an open-source data analysis and visualisation tool for both novices and experts, with Python scripting available. Several add-ons are available for specialised purposes such as bioinformatics or text mining.

Orange is a desktop application that must be installed locally. It offers the best performance of the three tools discussed in this post but is also perhaps the most complicated. It’s also an ‘open source’ tool, as opposed to the other two ‘closed source’ options.

Orange offers different visualisation outputs (e.g., bar charts, scatter plots, dendrograms, networks and heat maps) and also allows you to design your own data analysis steps via its visual programming environment. A Python scripting interface is also available for users to code their own algorithms as well as develop complex data analysis procedures.

Considerations

As always when using these analytical tools (especially those only available online) to analyse your data, you must consider carefully the potential privacy risks and what measures (e.g. anonymisation of personal or sensitive data) will be needed to mitigate those risks.
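As one example of such a measure, the sketch below masks email addresses and phone-like number sequences before a text is uploaded to an online tool. It is a minimal illustration only: the patterns and the `redact` function name are assumptions, and they will not catch names, street addresses or other identifying details, so real anonymisation needs a more careful review.

```python
import re

# Simple patterns (assumptions for this sketch, not an exhaustive standard).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# A digit, then at least seven digits/spaces/hyphens, then a final digit.
PHONE = re.compile(r"\b\d[\d\s-]{7,}\d\b")


def redact(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


# Hypothetical input, made up for illustration.
print(redact("Contact jane.doe@example.edu.au or 0412 345 678."))
```

Running the redaction locally, before the text leaves your computer, is the point of the design: the online tool never sees the original values.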

Table of Comparisons

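|                | Voyant Tools                       | TAPoRware                | Orange Text Mining               |
|----------------|------------------------------------|--------------------------|----------------------------------|
| Cost           | Free                               | Free                     | Free                             |
| Licence        | Closed source                      | Closed source            | GPL / GNU General Public License |
| Usability      | Easy                               | Easy                     | Easy                             |
| Tool type      | Web application                    | Web application          | Desktop application              |
| Import formats | TXT, CSV, HTML, XML, PDF, RTF, URL | TXT, CSV, HTML, XML, URL | TXT, CSV                         |
| Export formats | TXT, CSV, XML                      | TXT, CSV, HTML, XML      | CSV, TAB                         |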

And even more…

Needless to say, given the depth of this topic, this post is only able to cover a small fraction of available text mining tools. There are certainly other options that might also be worth considering depending on your specific requirements. For instance, Juxta is an open-source multi-platform desktop tool that provides a user-friendly interface and can perform many textual criticism tasks on TXT and XML files.

KNIME Analytics Platform is yet another powerful tool for analysing datasets. It’s open-source (GPL license) and offers rich features, such as data pre-processing and cleansing, data modelling, data analysis and data mining.

Question for Thing 18:

Explore a tool and blog/comment on it. Also take a look at some other participants’ experiences with this *thing*.

 

Image credit:

Gold Mining in California. Scenes of the 1849 Californian Gold Rush showing cradling, panning, washing with a ‘long tom’ and hydraulic mining. Coloured lithograph by Currier and Ives 1871. [Photography]. Retrieved from Encyclopædia Britannica ImageQuest.
http://quest.eb.com/search/300_2284074/1/300_2284074/cite

8 Responses to “18: text mining tools”

  1. I chose to explore Voyant Tools and I was surprised at how basic the tool’s functions are. Maybe because the option is free? I doubt I would ever recommend this particular tool to a researcher, particularly when ACU researchers have access to NVivo. Voyant might be okay if you were working on a small one-off piece of research requiring text mining; however, ideally one would want to set up their own parameters and limitations, which doesn’t appear to be an option with this tool.

    By Tatum on Dec 4, 2015

  2. I agree with Tatum. I had a look at the screencasts for Voyant Tools. I don’t know much about text analysis, but as Tatum said, ACU researchers do have access to NVivo, which I would assume has more powerful features. Having said that, it is good to know what is out there as it does depend on the purpose that the tool is used for. The purpose would define the tool’s suitability for use.

    By Vicki on Dec 7, 2015

  3. I first chose Orange, but downloading the software took 20 minutes. By the time it had finished downloading I’d moved on to TAPoR, and wasn’t in the mood to call the service desk to type in their administrator password to install Orange.
    I tried some of TAPoR’s tools for text files, and have come to the conclusion that TAPoR is either a work in progress or no longer being maintained. I know nothing at all about text mining, so I cannot comment on whether the tools available are what a researcher would need. However, the fact that some tools crash does not inspire confidence in TAPoR.
    I’ll leave Orange for another 23Things participant to explore…

    By Gertrud on Dec 8, 2015

  4. So perhaps just add NVivo to this thing next time?

    By ACULibrary on Dec 23, 2015

  5. I had a look at Voyant and found it pretty basic. Having zero experience with it I can’t really judge, but I do get a lot of requests for NVivo training, so I am wondering: is it worth suggesting this instead if they don’t want to arrange the NVivo download and training?

    By Nica on Jan 7, 2016

  6. I am not sure… I don’t know the capabilities of NVivo, although I do know that ACU use and support NVivo. Perhaps it would depend on the need. Regardless… there are options, always options!

    By ACULibrary on Jan 8, 2016

  7. I’d like to speak with researchers who have used text-mining. I’ve only ever seen frivolous use of text mining such as creating word clouds and Facebook’s “most used words in posts” feature – mine include my daughter’s name and Matt from Doctor Who!

    Google’s Ngram viewer is fun to play with. Wired has an interesting article about some limitations.

    By Tracy Bruce on Jan 15, 2016

  8. Does ACU offer training sessions to its HDR students on NVivo or is it one of those things that is readily available on an equivalent of Lynda for students? I looked at Voyant and also thought it was basic. Again I have to say it must be wonderful to have these products on the market where collating like responses in this way would be great. I can see it would be good for analysing open comments with any qualitative data gathered. I agree, the anonymity is a critical component of this element of data analysis. Data mining is heavily used by the Go8 who delve into large data sources using generous funding grants. It enables them to manage large scale data analysis in a more manageable way. It also enables data to be shared for future research on the same topic to see if trends change, as I understand it. But a reliable tool is critical. In our day, we paid a data analyst to process this side of things using the old fashioned software tools such as SPSS. Nonetheless, useful for ascertaining broad trends before drilling down further perhaps?

    By Helena on Jan 25, 2016
