John E. Ortega

John E. Ortega
  • Doctor of Philosophy
  • Researcher at New York University

About

55
Publications
7,907
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
614
Citations
Current institution
New York University
Current position
  • Researcher

Publications

Publications (55)
Preprint
Full-text available
Recent years have seen a marked increase in research that aims to identify or predict risk, intention or ideation of suicide. The majority of new tasks, datasets, language models and other resources focus on English and on suicide in the context of Western culture. However, suicide is global issue and reducing suicide rate by 2030 is one of the key...
Preprint
Full-text available
Suicidal ideation is a serious health problem affecting millions of people worldwide. Social networks provide information about these mental health problems through users' emotional expressions. We propose a multilingual model leveraging transformer architectures like mBERT, XML-R, and mT5 to detect suicidal text across posts in six languages - Spa...
Preprint
Full-text available
How effective is peer-reviewing in identifying important papers? We treat this question as a forecasting task. Can we predict which papers will be highly cited in the future based on venue and "early returns" (citations soon after publication)? We show early returns are more predictive than venue. Finally, we end with constructive suggestions to ad...
Preprint
Full-text available
Translating between languages with drastically different grammatical conventions poses challenges, not just for human interpreters but also for machine translation systems. In this work, we specifically target the translation challenges posed by attributive nouns in Chinese, which frequently cause ambiguities in English translation. By manually ins...
Preprint
Full-text available
This article is about Semantic Role Labeling for English partitive nouns (5%/REL of the price/ARG1; The price/ARG1 rose 5 percent/REL) in the NomBank annotated corpus. Several systems are described using traditional and transformer-based machine learning, as well as ensembling. Our highest scoring system achieves an F1 of 91.74% using "gold" parses...
Preprint
Full-text available
This paper reports on the shared tasks organized by the 21st IWSLT Conference. The shared tasks address 7 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, dialect and low-resource speech translation, and Indic languages. The shared tasks attra...
Preprint
Full-text available
The typical workflow for a professional translator to translate a document from its source language (SL) to a target language (TL) is not always focused on what many language models in natural language processing (NLP) do - predict the next word in a series of words. While high-resource languages like English and French are reported to achieve near...
Preprint
Full-text available
Translation memories (TMs) are the backbone for professional translation tools called computer-aided translation (CAT) tools. In order to perform a translation using a CAT tool, a translator uses the TM to gather translations similar to the desired segment to translate (s'). Many CAT tools offer a fuzzy-match algorithm to locate segments (s) in the...
Preprint
Full-text available
Nollywood, based on the idea of Bollywood from India, is a series of outstanding movies that originate from Nigeria. Unfortunately, while the movies are in English, they are hard to understand for many native speakers due to the dialect of English that is spoken. In this article, we accomplish two goals: (1) create a phonetic sub-title model that i...
Article
This work starts with the TWISCO baseline, a benchmark of suicide-related content from Twitter. We find that hyper-parameter tuning can improve this baseline by 9%. We examined 576 combinations of hyper-parameters: learning rate, batch size, epochs and date range of training data. Reasonable settings of learning rate and batch size produce better r...
Preprint
Full-text available
Models based on bidirectional encoder representations from transformers (BERT) produce state of the art (SOTA) results on many natural language processing (NLP) tasks such as named entity recognition (NER), part-of-speech (POS) tagging etc. An interesting phenomenon occurs when classifying long documents such as those from the US supreme court wher...
Preprint
Full-text available
Large multilingual models have inspired a new class of word alignment methods, which work well for the model's pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are...
Article
Full-text available
There has been considerable work recently in the natural language community and elsewhere on Responsible AI. Much of this work focuses on fairness and biases (henceforth Risks 1.0), following the 2016 best seller: Weapons of Math Destruction . Two books published in 2022, The Chaos Machine and Like, Comment, Subscribe , raise additional risks to pu...
Article
Full-text available
Little attention has been paid to the development of human language technology for truly low-resource languages—i.e., languages with limited amounts of digitally available text data, such as Indigenous languages. However, it has been shown that pretrained multilingual models are able to perform crosslingual transfer in a zero-shot setting even for...
Conference Paper
Full-text available
Indigenous languages, including those from the Americas, have received very little attention from the machine learning (ML) and natural language processing (NLP) communities. To tackle the resulting lack of systems for these languages and the accompanying social inequalities affecting their speakers, we conduct the second AmericasNLP competition (a...
Conference Paper
Full-text available
In the effort to minimize the risk of extinction of a language, linguistic resources are fundamental. Quechua, a low-resource language from South America, is a language spoken by millions but, despite several efforts in the past, still lacks the resources necessary to build high-performance computational systems. In this article, we present WordNet...
Conference Paper
Full-text available
Proxecto Nós is an initiative aimed at providing the Galician language with openly licensed resources, tools, and demonstrators in the area of intelligent technologies. The Project has two main scientific and technological objectives: (i) to integrate the Galician language into cutting-edge AI and language technologies, thus enabling the natural us...
Conference Paper
Full-text available
The lack of resources for languages in the Americas has proven to be a problem for the creation of digital systems such as machine translation, search engines, chat bots, and more. The scarceness of digital resources for a language causes a higher impact on populations where the language is spoken by millions of people. We introduce the first offic...
Conference Paper
Full-text available
The development of language technologies (LTs) such as machine translation, text analytics, and dialogue systems is essential in the current digital society, culture and economy. These LTs, widely supported in languages in high demand worldwide, such as English, are also necessary for smaller and less economically powerful languages, as they are a...
Conference Paper
Full-text available
This work provides a survey of several networking cipher algorithms and proposes a method for integrating natural language processing (NLP) as a protective agent for them. Two main proposals are covered for the use of NLP in networking. First, NLP is considered as the weakest link in a networking encryption model; and, second, as a hefty deterrent...
Preprint
Full-text available
This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2022: low-resource and dialect speech translation. For the Tunisian Arabic-English dataset (low-resource and dialect tracks), we build an end-to-end model as our joint primary submission, and compare it aga...
Conference Paper
Full-text available
We present a neural machine translation (NMT) system for translating Spanish to Galician (ES-GL). Galician is a low-to-medium resource language found in Spain. Our NMT system is trained on a large-scale synthetic ES → P T → GL parallel corpus created by the spelling transliteration of Portuguese to Galician from a high-quality Spanish to Portuguese...
Chapter
In this article, we report the construction of a web-based Galician corpus and its language model, both made publicly available, by making use of CCNet tools and data. An in-depth analysis of the corpus is made so as to provide insights on how to achieve optimum quality through the use of heuristics to lower the perplexity.
Conference Paper
Full-text available
Low-resource languages sometimes take on similar morphological and syntactic characteristics due to their geographic nearness and shared history. Two low-resource neighboring languages found in Peru, Quechua and Ashaninka, can be considered, at first glance, two languages that are morphologically similar. In order to translate the two languages, va...
Conference Paper
Full-text available
We present the findings of the LoResMT 2021 shared task which focuses on machine translation (MT) of COVID-19 data for both low-resource spoken and sign languages. The organization of this task was conducted as part of the fourth workshop on technologies for machine translation of low resource languages (LoResMT). Parallel corpora is presented and...
Preprint
Full-text available
We present the findings of the LoResMT 2021 shared task which focuses on machine translation (MT) of COVID-19 data for both low-resource spoken and sign languages. The organization of this task was conducted as part of the fourth workshop on technologies for machine translation of low resource languages (LoResMT). Parallel corpora is presented and...
Conference Paper
Full-text available
This paper presents the results of the 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas. The shared task featured two independent tracks, and participants submitted machine translation systems for up to 10 indigenous languages. Overall, 8 teams participated with a total of 214 submissions. We provided training s...
Preprint
Pretrained multilingual models are able to perform cross-lingual transfer in a zero-shot setting, even for languages unseen during pretraining. However, prior work evaluating performance on unseen languages has largely been limited to low-level, syntactic tasks, and it remains unclear if zero-shot learning of high-level, semantic tasks is possible...
Conference Paper
Full-text available
Languages can be considered endangered for many reasons. One of the principal reasons for endangerment is the disappearance of its speakers. Another, more identifiable reason, is the lack of written resources. We present an automated sub-segmentation system called AshMorph that deals with the morphology of an Amazonian tribal language called Ashani...
Article
Full-text available
Low-resource languages (LRL) with complex morphology are known to be more difficult to translate in an automatic way. Some LRLs are particularly more difficult to translate than others due to the lack of research interest or collaboration. In this article, we experiment with a specific LRL, Quechua, that is spoken by millions of people in South Ame...
Preprint
Computer-aided translation tools based on translation memories are widely used to assist professional translators. A translation memory (TM) consists of a set of translation units (TU) made up of source- and target-language segment pairs. For the translation of a new source segment $s^{\prime }$ , these tools search the TM and retrieve the TUs $...
Cover Page
Full-text available
Post-Editing in Modern-Day Translation workshop for 2020, hosted at AMTA 2020, virtually. We are accepting research papers and demos.
Conference Paper
Full-text available
Two of the more predominant technologies that professional translators have at their disposal for improving productivity are machine translation (MT) and computer-aided translation (CAT) tools based on translation memories (TM). When translators use MT, they can use automatic post-editing (APE) systems to automate part of the post-editing work and...
Patent
Full-text available
According to some aspects , a system for automatically processing text comprising information regarding a patient encounter to assign medical codes to the text is provided . The system comprises at least one storage medium storing processor - executable instructions , and at least one processor configured to execute the processor - executable instr...
Conference Paper
Full-text available
In recent years, deep learning has shown promising results when used in the field of natural language processing (NLP). Neural networks (NNs) such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used for various NLP tasks including sentiment analysis, information retrieval, and document classification. In this...
Patent
Full-text available
A method, computer program product, and computing system for processing content concerning a plurality of patients using a CAC system to define one or more billing codes concerning a social habit status of one or more patients of the plurality of patients. The one or more billing codes concerning the social habit status of the one or more patients...
Article
Full-text available
The Termolator is an open-source high-performing terminology extraction system, available on Github. The Termolator combines several different approaches to get superior coverage and precision. The in-line term component identifies potential instances of terminology using a chunking procedure, similar to noun group chunking, but favoring chunks tha...
Conference Paper
Full-text available
While systems using the Neural Network based Machine Translation (NMT) paradigm achieve the highest scores on recent shared tasks, phrase-based (PBMT) systems, rule-based (RBMT) systems and other systems may get better results for individual examples. Therefore, combined systems should achieve the best results for MT, particularly if the system com...
Conference Paper
Full-text available
Quechua is a low-resource language spoken by nearly 9 million persons in South America (Hintz and Hintz, 2017). Yet, in recent times there are few published accounts of successful adaptations of machine translation systems for low-resource languages like Quechua. In some cases, machine translations from Quechua to Spanish are inadequate due to erro...
Conference Paper
Full-text available
Fuzzy-match repair (FMR), which combines a human-generated translation memory (TM) with the flexibility of machine translation (MT), is one way of using MT to augment resources available to translators. We evaluate rule-based, phrase-based, and neural MT systems as black-box sources of bilingual information for FMR. We show that FMR success varies...
Patent
Full-text available
Techniques for coding a medical report include identifying an acronym or abbreviation in the medical report, and a plurality of phrases not explicitly included in the medical report that are possible expanded forms of the acronym or abbreviation in the medical report. From the plurality of phrases, a most likely expanded form of the acronym or abbr...
Conference Paper
Full-text available
Computer-aided translation (CAT) tools often use a translation memory (TM) as the key resource to assist translators. A TM contains translation units (TU) which are made up of source and target language segments; translators use the target segments in the TU suggested by the CAT tool by converting them into the desired translation. Proposals from T...
Conference Paper
Full-text available
When a computer-assisted translation (CAT) tool does not find an exact match for the source segment to translate in its translation memory (TM), translators must use fuzzy matches that come from translation units in the translation memory that do not completely match the source segment. We explore the use of a fuzzy-match repair technique called pa...

Questions

Question (1)
Question
would love to help if possible

Network

Cited By