- IMPORTANT NOTE
- UplugWeb v0.1 - Frequently Asked Questions
- Registration and User Management
- Corpus Management
- How do I create a corpus?
- How do I add documents to my corpus?
- How can I remove documents from a corpus?
- How can I remove the complete corpus?
- Can I restore data that I removed by accident?
- Can I download documents from my repository?
- Can I look at my documents?
- Can I show larger/smaller parts of the document at once?
- Can I edit my documents?
- Can I edit/modify sentence alignment files?
- Can I edit word alignment files?
- Can I word-align bitexts by hand?
- What is the info mode?
- What is preprocess?
- What is align?
- What is index?
- What is query?
- How do I use the CWB in UplugWeb?
- Task Management
- Local Installation
Uplug::Web - a web interface for Uplug
This part of Uplug is no longer maintained. It should not be considered stable and may not be entirely compatible with current versions of the software.
This is a collection of frequently asked questions and their answers related to UplugWeb - the web interface to the Uplug tools. http://uplug.sourceforge.net
UplugWeb is the web interface of the Uplug corpus tools. Registered users can use Uplug on-line with small-size corpora. UplugWeb can be used to create and manage multi-lingual parallel corpora.
The original version of UplugWeb is installed at the Department of Linguistics and Philology at Uppsala University: http://stp.ling.uu.se/cgi-bin/joerg/Uplug Other installations may exist elsewhere. Let me (email@example.com) know if you see it anywhere!
You may look at any public corpus in the collection (click on Public corpora). You have to register before you can use any of the other features (click on Register now). No personal data will be passed on to third parties!
UplugWeb is free for non-commercial usage. It is provided "as-is". No warranties or guarantees are given. Also read the License Agreement when registering to UplugWeb. This service may disappear without prior notice (hopefully not ;-))
Yes, you can! Go to http://uplug.sourceforge.net and download the uplug-package. Follow the instructions in uplug/web/INSTALL or other information that is hopefully there soon.
Links for registration and user management are collected in the second menu (User management) in the left column.
Registration is easy! Click on Register now and fill out the form. Fields marked with * are required. Your e-mail address will be used as your UplugWeb user name. Click on the send button at the bottom if everything is ok and you agree to the license agreement. Hopefully, this finishes your registration and you may now log in to your UplugWeb account by clicking on Login.
Click on Lost Password and type the e-mail address that you used for registration. The password will be sent to you by e-mail when you click on the send button.
Click on Uplug users in the User management menu. You may look at user details if you click on info. All registered users can do that. The edit function is not implemented yet.
Right now you can't! This may be added in the next version.
UplugWeb functions related to corpora and corpus management are collected in the Corpus management menu in the left column.
A corpus in UplugWeb is a collection of one or more documents. Each document may have several translations. Click on Create new corpus to create a new corpus (surprise ;-). You have to specify a name for the corpus that is unique in your repository. The name must be no longer than 10 characters and may use only ASCII letters [a-zA-Z] and '_'. You may, of course, have several corpora in your account! The corpus name will appear in the list of your corpora (My corpora). Check the private checkbox if you don't want your corpus to appear in the collection Public corpora. Public corpora cannot be changed by others, but they can be viewed and downloaded by everyone!
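The naming rule above can be checked locally before filling in the form. The following is only a minimal sketch in Python, not UplugWeb's own code: the function name is_valid_corpus_name is hypothetical, and it assumes the rule means 1 to 10 characters drawn from ASCII letters and '_'.

```python
import re

# Assumed reading of the rule: 1 to 10 characters, ASCII letters and '_' only.
CORPUS_NAME = re.compile(r"[A-Za-z_]{1,10}")


def is_valid_corpus_name(name: str) -> bool:
    """Return True if `name` satisfies the assumed corpus naming rule."""
    return CORPUS_NAME.fullmatch(name) is not None
```

Under this reading, a name such as my_corpus would pass, while a name containing digits or more than 10 characters would be rejected.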
Initially, the corpus is empty. You have to add documents using the add link in the task list behind the corpus name.
Use the add link in the task-list behind the name of your corpus! You will see this task list for each corpus in My corpora. If you click on add, a new form will be opened. Corpus documents that you submit have to be in PLAIN TEXT FORMAT! Any annotation will be ignored and interpreted as common text. Choose the correct character encoding in the Encoding option menu! The default encoding is Unicode UTF-8. All submitted data will be converted to UTF-8!
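To see why choosing the correct encoding matters, the conversion step can be pictured as decoding the uploaded bytes with the selected encoding and re-encoding them as UTF-8. This is a hedged sketch using Python's standard codecs, not the code UplugWeb actually runs; to_utf8 is a hypothetical helper and "latin-1" is just an example choice.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode `raw` using the encoding chosen by the user and re-encode as UTF-8.

    A wrong `source_encoding` either raises UnicodeDecodeError or silently
    produces wrong characters, which is why the menu choice matters.
    """
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Example: the Swedish letter 'å' is one byte in Latin-1 but two bytes in UTF-8.
latin1_bytes = b"p\xe5 svenska"
utf8_bytes = to_utf8(latin1_bytes, "latin-1")
```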
A document has to have a unique name in the corpus. It has to be shorter than 16 characters and may use only ASCII letters [a-zA-Z], digits [0-9], dots '.' and underscores '_'.
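As a rough illustration, the document naming rule can be written as a regular expression. This is a sketch under the assumption that "shorter than 16 characters" means 1 to 15 characters; is_valid_document_name is a hypothetical name and not part of UplugWeb.

```python
import re

# Assumed reading of the rule: 1 to 15 characters from letters, digits, '.' and '_'.
DOCUMENT_NAME = re.compile(r"[A-Za-z0-9._]{1,15}")


def is_valid_document_name(name: str) -> bool:
    """Return True if `name` satisfies the assumed document naming rule."""
    return DOCUMENT_NAME.fullmatch(name) is not None
```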
There may be several translations of each document in the corpus. DO NOT CHOOSE DIFFERENT NAMES FOR EACH TRANSLATION OF THE SAME DOCUMENT! Translations may (should) have the same name as the original. Choose the language of each document to distinguish them!
The local document itself is inserted in the Upload file field. Add the document to the corpus by clicking on the submit button.
The upload size is restricted. The total amount of POST-data is limited. UplugWeb is intended for small-size corpora. However, you may add as many documents as you want to each corpus.
Removing documents can be done with the remove function in the task-list that you can find for each corpus. Click on remove and the corpus manager will be in "remove-mode" (you can see the mode by checking which of the tasks is not linked anymore in the task-list). Each document is represented by a link from the language identifier (e.g. 'en' for English) behind the document name. Click on the link that corresponds to the document you would like to remove. BE CAREFUL! CLICKING ON A LINK IMMEDIATELY REMOVES THE CORRESPONDING DOCUMENT!
Click on remove in the task-list behind the corpus name. A new form should appear in your browser. Check the checkbox to confirm that you really want to remove the entire corpus and click on submit. The corpus will be deleted!
Yes you can! Click on Restore documents and click on the links in your collection of removed documents.
Not directly. But you can send documents to your e-mail address. Select the send mode in the corpus manager (in the task-list for one of your corpora) and click on the document you want to be sent to you.
Of course! The "view mode" is the default mode in the corpus manager. Otherwise you may always activate it by clicking on view in the task-list for each corpus. If you click on document links the corresponding document will be shown in your browser. The display style is different depending on the type of document you're looking at. For some document types you will have the choice between different display styles (e.g. for word alignment files). Alignment files can even be modified/revised. Check further down!
No! Not right now.
No! Not right now. This is potentially dangerous and therefore not supported (yet). For alignment files it is possible to modify the links using the edit-functions provided for these file types. See further down.
Sentence alignment is done automatically and, therefore, often includes errors. If you open a sentence alignment file (sent) from the view mode you will see linked up/down arrows around each sentence ID. Use these links to move the attached sentence up or down. You can do this using both display styles text and xml (default).
Editing alignment files link by link is not very convenient if there are many follow-up errors. Check the file first before starting to revise the alignment. Sometimes the alignment is totally out of control at some point and it is not worth doing the revision by hand in this way. Modify your original files instead and re-run the sentence aligner! (Check the section on sentence alignment in the description of the task manager)
Open the chosen word alignment file in view mode and click on edit in the list of display styles. Word alignments will now be shown as a link matrix with checkboxes for each word pair. Change the links as you wish and click on the change button if you're satisfied. Go to the next sentence pair by clicking on next.
Yes you can! Open a sentence alignment file (the ones called sent) in view mode and click on wordalign. The system will create an empty word alignment file if there is no word alignment file for this language pair already. Otherwise it will open the existing one and you may edit it.
There is some more information for each document (e.g. status information). Click on info to activate the info mode and select the document by clicking on the corresponding link (if you are in the info mode).
The preprocess task automatically adds pre-processing processes to the queue for all documents from the chosen corpus that are still in plain text format. Documents will be pre-processed with language-specific pre-processing modules if available. Otherwise, the task adds the basic pre-processing modules, which add simple XML markup and run the sentence splitter and the general tokenizer. Finished documents will be sent to you. Look at the process queue in the task manager to track queued processes.
The align task automatically adds all possible sentence alignment processes for the chosen corpus to the process queue. Each document with one or more translations will be sentence aligned. All possible alignment pairs will be considered (only in one direction). NOTE: documents have to be tokenized before they can be aligned! There is a quick-task for doing this for all documents automatically: preprocess. Otherwise, use the pre-processing functions described in the section about the Task Manager.
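The phrase "all possible alignment pairs, only in one direction" can be pictured as taking every unordered pair of languages available for a document. The sketch below is illustrative only: alignment_pairs is a hypothetical helper, and ordering the language codes alphabetically is an assumption, since nothing here states which direction UplugWeb picks.

```python
from itertools import combinations


def alignment_pairs(languages):
    """Return each unordered language pair exactly once.

    Alphabetical ordering is an assumption made for this sketch; it only
    guarantees that ('en', 'sv') is produced and ('sv', 'en') is not.
    """
    return list(combinations(sorted(set(languages)), 2))


# A document available in three languages yields three alignment jobs.
pairs = alignment_pairs(["sv", "en", "de"])
```

A document with only one language version yields no pairs, which matches the statement that only documents with one or more translations are aligned.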
The index function can be used to create CWB (http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/index.html) index files from the chosen corpus. All tokenized documents will be indexed. Sentence alignments will be included as well. Indexed corpora can be searched using CQP and the CWB Query function in the corpus management menu. Note that the entire corpus will be in one index per language. Documents will not be separated from each other!
By clicking on query you will get to the CWB query form. From here you can search your indexed corpus data.
First of all you have to index one or more corpora in your repository (use index in the corpus task-list). You can view all indexes and query each corpus by selecting CWB Query from the corpus management menu. For each indexed corpus you will have a list of indexed sub-corpora linked to their language (using 2-letter language IDs). Select the sub-corpus you are interested in by clicking on the corresponding link. Now you should get the query form you may use for searching the data. Sentence alignment can also be searched (if available). Select the languages you want to include in the column to the right. More information about queries can be found elsewhere (....).
The task manager is used to start Uplug processes and to manage running processes. New processes will be queued and executed when possible. There are several queues managed by UplugWeb:
the list of processes to be done
processes that have been taken by a server and wait for their execution
processes in progress
finished processes (only recent ones)
processes that failed at some point (can be restarted)
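One way to picture the queues listed above is as states that a process moves through. The following is a hypothetical model written for illustration: the state names and the allowed transitions are assumptions derived from the descriptions above, not UplugWeb's internal data structures.

```python
from enum import Enum


class Queue(Enum):
    TODO = "to be done"
    QUEUED = "taken by a server, waiting for execution"
    WORKING = "in progress"
    DONE = "finished"
    FAILED = "failed"


# Assumed transitions: a failed process can be restarted, i.e. put back on the to-do list.
ALLOWED_MOVES = {
    Queue.TODO: {Queue.QUEUED},
    Queue.QUEUED: {Queue.WORKING},
    Queue.WORKING: {Queue.DONE, Queue.FAILED},
    Queue.DONE: set(),
    Queue.FAILED: {Queue.TODO},
}


def move(current: Queue, target: Queue) -> Queue:
    """Return `target` if the move is allowed in this sketch, else raise ValueError."""
    if target not in ALLOWED_MOVES[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```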
The Main menu contains several tasks that can be run by the system:
All kinds of pre-processing tasks such as basic XML markup, sentence splitting, tokenization, language-specific tools (tagging, chunking)
POS tagger for several languages
Syntactic parsers/chunkers for several languages (currently only for English and Swedish)
Sentence alignment using Gale & Church's length-based alignment algorithm
Word alignment using the Clue Aligner and other tools (e.g. GIZA++)
Go to pre-processing and choose either the basic pre-processing module or one of the language-specific pre-processing modules if you have an appropriate document in your corpus. There will be a form for choosing documents from the corpus if you have appropriate ones in your corpus. Click on the add job button to send the job to the process queue. (Note: Select the correct corpus at the top of the page if you have several corpora in your repository!)
The pre-processor will overwrite the original text document and replace it with the tokenized XML version. The result will also be sent by mail when the process is finished.
You can also run the pre-processor on all documents in a corpus by clicking on the preprocess task in the corpus manager. Check the section on corpus management above!
Tokenize the document first. Then, choose the tagger from the tagger menu and select the appropriate document from the corpus. You can switch between corpora at the top of the page. The old document will be overwritten and the result will be sent to you by e-mail.
Go to the sentence aligner and select the 2 documents you want to align. They have to be tokenized first! Add the job to the queue by clicking on add job.
The sentence aligner uses "hard boundaries" (paragraph breaks and page breaks) to synchronize the alignment process. They may cause problems (follow-up errors) if they are not detected correctly. A simple solution is to remove all double empty lines from the text files before submitting them to the UplugWeb repository.
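The clean-up suggested above can be done with a short script before uploading. This sketch assumes that "double empty lines" means runs of two or more blank (or whitespace-only) lines, and it collapses each run into a single blank line; collapse_blank_lines is a hypothetical helper, and it expects Unix-style line endings.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Collapse runs of two or more blank lines into a single blank line.

    Lines containing only spaces or tabs count as blank. Single blank
    lines are left untouched.
    """
    return re.sub(r"\n(?:[ \t]*\n){2,}", "\n\n", text)


cleaned = collapse_blank_lines("First paragraph.\n\n\n\nSecond paragraph.\n")
```

Whether a remaining single blank line is still treated as a hard boundary by the aligner is not stated here, so check the result in view mode after aligning.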
You can also run the sentence aligner for all possible document pairs in a corpus by clicking on the align task in the corpus manager. Check the section on corpus management above.
Go to the word aligner and select one of the three possible settings: basic, advanced, and GIZA++. After that select one of the sentence aligned corpora in your repository (you have to sentence align first!). Click on add job to add the alignment process to the queue.
NOTE: Word alignment takes quite some time even for small corpora! Be patient! (This is a Uplug problem)
More information about word alignment will be added later on.
You can install your own UplugWeb server! Download the uplug package from http://uplug.sourceforge.net and follow the instructions in uplug/web/INSTALL and other documentation if available ....