CorpusStudio

Quick Start Guide

 

Radboud University Nijmegen,

English language department

Erwin R. Komen

Version 1.4

September 2011

1.       Introduction

This Quick Start Guide takes you through installing CorpusStudio on your computer and creating your first Corpus Research Project. Much of the information in this guide is specific to students and employees at the Radboud University Nijmegen.

2.       Installation

1)      Install CorpusStudio directly from the internet. This ensures that you always have the newest version: each time the program starts it checks for an update, provided you have an internet connection. If it finds one, you can decide whether or not to install it.

2)      Your first installation of CorpusStudio will store a Settings file under the directory reserved for you as a user on that particular computer. These are “general” settings, which apply to all your CorpusStudio work. You are advised to store these general settings on your data drive (usually this will be your D-drive, but if you are at the Radboud University, you are advised to use your U-drive).

3.       Adjusting settings

There is at least one setting that you might want to adjust: the location where your Corpus Research Project files are stored. These files hold all the information of a Corpus Research Project, such as the queries, the definition files and the order of the queries.

Adapt this setting using Tools/Settings. You are advised to pick a location on your data drive. This is the U-drive at the Radboud University (don’t use the W-drive for this). If you are working on your laptop, and it has a D-drive, then use that.

The second setting you might want to change is the location of the CorpusSearch Java file (your CS.jar, or its equivalent). If you already have one and know where it is, go to Tools/Settings, open the "Project Editor" tabpage, and set the "Executable location". If you don't have CS.jar installed on your computer, CorpusStudio has a copy of it and will suggest installing it in a convenient location (e.g. C:\FOO\CS.jar).

4.       Creating your first Corpus Research Project

1)      Choose File/New (Ctrl+N) to create a new Corpus Research Project.

a)      You are prompted to type a name for your project and to choose a query-type. Currently two types are recognized:

i)         “Penn-psd” – This uses CorpusSearch to query the Penn-Treebank psd files.

ii)        “Xquery-psdx” – This uses Saxon’s XQuery engine to query psdx files (XML files derived from the PSD files, using a derivative of the TEI-P5 standard).

b)      You are subsequently directed to the “Files” tabpage, where you should set the input, output and query directories.

i)        Input: the directory containing the *.psd or *.psdx files that you would like to use for this project.

ii)       Output: the directory where you want your HTML and XML output files to appear. The HTML file contains a table with the statistics of your queries, as well as examples of the results (up to the maximum number you have specified). The XML file contains the locations of all the results.

iii)     Query: the directory where you want this project to keep a local backup of the query files (you are, otherwise, free to import queries and definition files from anywhere).

iv)     Continue by selecting your input files on this tabpage. Best practice is to set the input file extension and check “Select all files in this directory”.

(1)   If you want to process only a subset of the input files (e.g. for testing), then use Windows Explorer to create a separate directory (e.g. "Pilot") where you store a few .psd or .psdx files. Use that directory as the input directory and save yourself time!

c)      Optionally (but good practice): go to the “General” page and provide Author, Goal and Comments information for this project.

d)      Preferences: while you are on the “General” page, fill in the preferences:

i)        Preceding context lines: e.g. 2 or 3 lines of preceding context should be sufficient for information structure purposes.

ii)       Following context lines: perhaps 1 line suffices here.

iii)     Show syntax of each result: check this if you want to have the treebank structure of your results shown.

iv)     Lock: check this if you are going to send this project (.crpx file) in for an assignment, to colleagues, or if you want to keep it "as is" for future reference (e.g. when writing a paper).

e)      Go to the “Period Editor” tabpage, and choose Period/Import period information. You should now try to locate the file called “EnglishPeriods.xml”. If it is not on your local drive, import it from the W-drive (it will subsequently be saved in your query directory or else in your working directory!)

2)      Choose File/Save (Ctrl+S) to save this file. You will be prompted for a file name. By default the file name is built from the project name and the default extension for Corpus Research Projects, which is .crpx.

3)      If you have not already done so in step 1(e), navigate to the “Period Editor” tabpage and choose Period/Import period information to load the information from an existing period definition file (e.g. EnglishPeriods.xml) into this particular Corpus Research Project. Save your project again once you've done this.

4)      Add queries (*.q) and, if you want to, definitions (*.def):

a)      If you have an existing definition file, add it to this project using Definition/Add.

i)        You can make use of the definitions file provided on the CorpusStudio website. Load it into this project using Definition/Add (provided you have already downloaded it from the website and saved it in your CorpusStudio/Query directory).

b)      Creating a new definition file is done using Definition/New.

c)      If you have existing queries that you would like to add to this project, then use Query/Add.

d)      If you would like to make a new query, choose Query/New. You will be prompted to supply some basic information. Think about your choices here! For example: do you really want to check "Remove nodes"? If you are unsure, check the online reference for CorpusSearch using Help/QueryLanguages/CorpusSearch.

e)      Don’t forget to save with Ctrl+S!

5)      Put the queries in the correct order using the “Constructor Editor”.

a)      Add lines through Constructor/Add

i)        Choose which query must be executed in this line.

ii)       Choose the input (i.e. “Source” or a combination of the previous lines).

iii)     The output:

(1)   Can be left open. In that case the system makes temporary output files, and cleans them up later.

(2)   Can be given a name. In that case, don’t add .out or .cmp to the name.

iv)     Check “Make a complement file” if you would like this line to produce a complement.

v)      Do use “Goal” and “Comments” to help you understand at a later stage what it was you wanted to do here.

vi)     The option “Open output file” can only be used when an actual output has been produced.

b)      Use Constructor/Insert to insert a line between existing ones.

6)      Check the sequence of your queries and the results they are supposed to produce using tabpage “Hierarchy” or tabpage “Tree”.

a)      Hierarchy provides a nested table view of your queries.

b)      Tree provides a treeview. Use Shift+F8 to alternate between an expanded or unexpanded tree. Click individual nodes to see where that leads you…

7)      Having checked everything, use F10 or Tools/Execute query to actually execute the queries in the order you have put them.

a)      When a query produces an error, the program will ask you whether you would like to have a look at the error file.

b)      Execution of the constructor lines is tracked in the “Output Monitor”.

c)      The “log file” shown by the Output Monitor can be saved using File/Save current log…
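
Step 1(b)(iv) above suggests creating a separate "Pilot" directory holding a few input files for testing. If you prefer not to do this by hand in Windows Explorer, the following Python sketch shows one possible way to do it. It is not part of CorpusStudio: the function name make_pilot_dir and all of its parameters are made up for this illustration, and it assumes that a pilot set consisting of the first few files in alphabetical order is good enough for a test run.

```python
from pathlib import Path
import shutil


def make_pilot_dir(input_dir, pilot_dir, extension=".psdx", max_files=3):
    """Copy the first few corpus files from input_dir into pilot_dir.

    Hypothetical helper, not a CorpusStudio function.
    Returns the list of copied file paths.
    """
    source = Path(input_dir)
    pilot = Path(pilot_dir)
    pilot.mkdir(parents=True, exist_ok=True)
    # Sort by name so that the same files are selected on every run.
    candidates = sorted(
        p for p in source.iterdir()
        if p.is_file() and p.suffix.lower() == extension.lower()
    )
    copied = []
    for path in candidates[:max_files]:
        target = pilot / path.name
        shutil.copy2(path, target)  # the original files stay where they are
        copied.append(target)
    return copied
```

A call such as make_pilot_dir(r"D:\Corpora\Input", r"D:\Corpora\Pilot", extension=".psd") (both paths are placeholders) would fill the pilot directory, which you can then select as the input directory on the “Files” tabpage. The pilot directory is deliberately kept outside the full input directory, so that the copies are not picked up a second time when the full input is processed.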

5.       Checking the results

1)      There are currently several ways to look at your results, that is, the output and complement files.

a)      Go to the “Constructor Editor” tabpage, and double click a line to see the output produced by that line (the .out file).

i)        Note: temporary output is only shown if you have checked the “Keep temporary” box in the “General” tabpage.

b)      Select a line in the constructor editor, and use “Open complement file” or “Open output” to either look at the .cmp or the .out file.

c)      Go to the “Tree” tabpage, and do the following:

i)        F8 (or View/Update):               re-calculate the tree

ii)       Shift+F8 (or View/Expand):      expand the tree

iii)     Double click on an output or complement leaf.

(1)   By the way: double clicking on a blue query file brings you straight into the query editor tabpage.
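
Besides opening the files one by one from within CorpusStudio, it can be handy to get a quick overview of which output (.out) and complement (.cmp) files exist and whether they are empty. The Python sketch below is one way to do that outside CorpusStudio. It rests on an assumption that this guide does not confirm: that the .out and .cmp files you are interested in are all located in one directory that you pass in yourself. The function name summarize_results is invented for this example.

```python
from pathlib import Path


def summarize_results(result_dir):
    """Return {name: {".out": size, ".cmp": size}} for the result files in result_dir.

    Hypothetical helper, not a CorpusStudio function.
    A size of 0 means the file exists but is empty; a missing key means
    that no file with that extension was found for that name.
    """
    summary = {}
    for path in sorted(Path(result_dir).iterdir()):
        if path.is_file() and path.suffix in (".out", ".cmp"):
            # Group files by their name without the extension.
            summary.setdefault(path.stem, {})[path.suffix] = path.stat().st_size
    return summary
```

Printing the returned dictionary shows at a glance which constructor lines produced an output or a complement file, provided the assumption about the file location holds for your setup.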

 


6.       Quick Reference

| # | Topic | Question | Solution |
|---|---|---|---|
| 1 | Working directory | I want to change the directory used as basis | Go to Tools/Settings |
| 2 | Projects | I want to use a project sent to me by a friend | Save the .crpx file, then open it in CorpusStudio with Ctrl+O |
| 3 | Projects | I want to use an existing project as the basis to build another one | (1) Open the existing project, do File/Save As, save it under a different name, and open this new project, or… (2) Make a copy of the .crpx file in Explorer, rename it, and open this copy in CorpusStudio. Give the project a new name in the General tab! |
| 4 | Output | I want to look at the intermediate output | Before executing the queries you have to go to the General tab and set “Keep temporary files” under preferences. |
| 5 | Input | I only want to process certain files | Best solution: Put all the files you want to process in one special directory, and set the input to that directory. |
| 6 | Input | I want to process all files from all periods | Put texts from each period in a separate directory. Set the input to the parent directory of them. |
| 7 | Recycle | I want to use one definition file for different projects. How do I make sure changes flow through between projects? | (1) Make sure each project points to the same definition file (in the same directory). (2) Make sure that in the “General” tab you have checked the two “Synchronize” options. Whenever you change the definitions in project A, the copy on your computer will be synchronized as soon as you exit CorpusStudio (or use Tools/Synchronize). When you open project B, it will detect the changed definition file, and load it over the definitions it had. |
| 8 | Constructor | Can I insert a query before the execution of others? | Yes. Open the Constructor Editor tab. Select the query before which you want to insert one. Then choose Constructor/Insert. Check, and if necessary adapt the input of all queries (i.e. check by selecting the “Hierarchy” tab). |
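
The solution to question 6 asks for one directory per period under a common parent directory. If you have many texts, sorting them by hand is tedious, so the Python sketch below shows one possible way to automate it. It is not part of CorpusStudio. The function name sort_into_periods is invented, and the mapping from file name to period label has to be supplied by you: this sketch does not read EnglishPeriods.xml, and the labels "OE" and "ME" in the usage note are only placeholders.

```python
from pathlib import Path
import shutil


def sort_into_periods(source_dir, parent_dir, period_of):
    """Copy each file listed in period_of into parent_dir/<period>/.

    Hypothetical helper, not a CorpusStudio function.
    period_of maps a file name to a period label that you choose yourself.
    Returns the names of the files in source_dir that have no period assigned.
    """
    source = Path(source_dir)
    parent = Path(parent_dir)
    unassigned = []
    for path in sorted(source.iterdir()):
        if not path.is_file():
            continue
        period = period_of.get(path.name)
        if period is None:
            # Leave files without a known period alone, but report them.
            unassigned.append(path.name)
            continue
        target_dir = parent / period
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target_dir / path.name)
    return unassigned
```

For instance, sort_into_periods(source, parent, {"text1.psdx": "OE", "text2.psdx": "ME"}) would create the subdirectories OE and ME under the parent directory. That parent directory is then the one to set as input, as described in the table. Files are copied rather than moved so that the original collection stays intact.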