- Can we make better use of "comparable corpora"? For example, the
SDA-G corpus and the AP corpus cover the same domain (news articles)
over the same period (1988-90). There are surely many
semantically equivalent English-German pairs lurking in there. Can we
find them and use them for training?
- Does it help to use modern term-weighting schemes? I bet it does!
- Does it help to use better phrasing and stemming? If so, how do
you do this well for non-English languages?
- How can you take advantage of the statistical information
available in the monolingual documents? Typically, there are far more
unaligned documents than aligned ones, and these monolingual documents
do contain valuable cooccurrence information that is not exploited by
our current approach.
- Can we get by using "transitive training"? For example, say
we've got some English-French pairs and some English-German pairs.
Can we create a vector space containing English, French, and German
words? If so, we could do French-German CLIR without any
French-German training documents! We already have some positive and
some negative results.
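
To make the transitive-training question concrete, here is a minimal sketch under strong simplifying assumptions. It is not the approach used in this work: instead of a learned vector space, it represents every French and German word by the English words it co-occurs with in aligned pairs, so English acts as a pivot. The toy pairs and all names (`EN_FR`, `EN_DE`, `pivot_vectors`, `rank_german`) are invented for illustration.

```python
from collections import Counter
from math import sqrt

# Toy aligned pairs, invented for illustration (not from any real corpus).
EN_FR = [
    ("dog barks", "chien aboie"),
    ("cat sleeps", "chat dort"),
]
EN_DE = [
    ("dog barks loudly", "hund bellt laut"),
    ("cat sleeps quietly", "katze schlaeft ruhig"),
]


def pivot_vectors(pairs):
    """Map each foreign word to counts of the English words it co-occurs with."""
    vectors = {}
    for english, foreign in pairs:
        en_counts = Counter(english.split())
        for word in foreign.split():
            vectors.setdefault(word, Counter()).update(en_counts)
    return vectors


def embed(text, vectors):
    """Sum the pivot vectors of the known words in a text."""
    total = Counter()
    for word in text.split():
        total.update(vectors.get(word, Counter()))
    return total


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# French and German words now share one space indexed by English words,
# although no French-German pair was ever seen.
FR = pivot_vectors(EN_FR)
DE = pivot_vectors(EN_DE)


def rank_german(french_query, german_docs):
    """Rank German documents against a French query via the English pivot."""
    query = embed(french_query, FR)
    return sorted(german_docs, key=lambda d: cosine(query, embed(d, DE)), reverse=True)


ranking = rank_german("chien", ["katze schlaeft", "hund bellt"])
print(ranking)
```

On this toy data the French query "chien" ranks the German document "hund bellt" first, because both map onto the English words "dog" and "barks". Whether anything like this holds up on real, noisy collections is exactly the open question.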