Nature, Language, Process


  A Note
The people of Israel are still reeling from the horrendous Hamas attack on October 7, 2023. I am no exception.

I am first and foremost an Israeli. My family and I were not personally harmed but we were all targets. We are still targets. This is a difficult situation to grasp but it is, nevertheless, the truth. If you are an Israeli colleague or student reading this, know that you are not alone in the community. If you are a colleague and expressed your care or support, I want to express my gratitude from here as well. If you're still looking for the words, just say anything and it would be appreciated.

We will come out of this stronger.

I'm Yuval Pinter (יובל פינטר, approximated English pronunciation /yoo-VAHL PIN-ter/), a Senior Lecturer in the Department of Computer Science at Ben-Gurion University of the Negev (BGU), heading the MeLeL lab, which works on natural language processing (NLP), and a founding member of BGU's School of Brain Sciences and Cognition. As of 2023, I am also a Visiting Academic on the Amazon Alexa Shopping science team.

I completed my Computer Science PhD at the School of Interactive Computing at the Georgia Institute of Technology, advised by Prof. Jacob Eisenstein and supported by a Bloomberg Data Science PhD Fellowship. Before moving to the States, I worked as a Research Engineer at Yahoo Labs and as a Computational Linguist at Ginger Software, and before that I got my BSc in Computer Science and Mathematics and my MA in Linguistics at Tel Aviv University, as a scholar of the Adi Lautman Interdisciplinary Program for Outstanding Students.

A full CV (from September 2021) is available here. And here is my affiliation color timeline for the 2000s.

My interests within NLP are broad. They include lexical semantics, vector representation of text, morphology, interpretability, syntactic parsing, language use in different social settings, user-generated content, machine learning approaches to NLP, application of NLP in information retrieval and code understanding, and more.

A note to prospective students (hear ye, hear ye): while I am in a CS department, my background and continued research motivation are interdisciplinary and multidisciplinary. If you want to get my attention for an advisor role, or special consideration in academic matters, it helps to demonstrate a similar mindset and/or some familiarity with the work I have been doing. In any event, cold emails containing no information that wouldn't apply equally to many other faculty members in the department are likely to remain unanswered.

I am proud to be a member of BGU's Eitan program steering committee, searching for our next generation of interdisciplinarians.


At Ben-Gurion University

Fall 5782, 5783 & 5784: 202-2-5211 Natural Language Processing (links to public material forthcoming)
Spring 5784: 202-1-2051 Principles of Programming Languages

My main area of current research is lexical semantics and word embeddings, exploring (a) the challenges presented to text representation techniques by difficult lexical forms; (b) how sub-word information can be used to create better word representations, as well as predict representations for words unseen during training; and (c) how to incorporate semantic analysis into embeddings so that they exhibit better similarity properties.
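Point (b) above can be illustrated with a fastText-style sketch: compose a vector for a word unseen during training by averaging the vectors of its character n-grams. This is a generic illustration of the subword idea, not the method of any particular paper below; the n-gram table, dimensions, and words are invented for the example.

```python
# Minimal sketch: composing a vector for an out-of-vocabulary word
# from character n-gram vectors, fastText-style. The n-gram table
# and its dimension are made up for illustration.
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    """Extract character n-grams from a word padded with boundary markers."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

def oov_vector(word, ngram_table, dim=8):
    """Average the vectors of the word's known n-grams; zeros if none are known."""
    vecs = [ngram_table[g] for g in char_ngrams(word) if g in ngram_table]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Toy n-gram table, seeded for reproducibility.
rng = np.random.default_rng(0)
table = {g: rng.normal(size=8) for g in char_ngrams("unseen") + char_ngrams("seen")}

v = oov_vector("unseenish", table)  # shares several n-grams with "unseen"
print(v.shape)  # (8,)
```

The boundary markers `<` and `>` let prefixes and suffixes get their own n-grams, which is one simple way subword models capture morphology.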

At Ben-Gurion University

Subword Segmentation and Challenges of Tokenization

2024: Khuyagbaatar Batsuren et al. Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge. Preprint.
2024: Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter. An Analysis of BPE Vocabulary Trimming in Neural Machine Translation. Preprint.
2024: Omri Uzan, Craig W. Schmidt, Chris Tanner, Yuval Pinter. Greed is All You Need: An Evaluation of Tokenizer Inference Methods. Accepted to ACL. Preprint.
2024: Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner. Tokenization Is More Than Compression. Preprint.
2024: Anaelia Ovalle et al. Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies. Findings of NAACL. PDF.
2023: Lisa Beinborn and Yuval Pinter. Analyzing Cognitive Plausibility of Subword Tokenization. EMNLP. PDF. Code.
2023: Shaked Yehezkel and Yuval Pinter. Incorporating Context into Subword Vocabularies. EACL. PDF. Code. Video.
2022: Cassandra L. Jacobs and Yuval Pinter. Lost in Space Marking. Preprint. We'd be particularly happy to hear people's thoughts on this focused work nugget.

NLP for Hebrew

2022: Elazar Gershuni and Yuval Pinter. Restoring Hebrew Diacritics Without a Dictionary. Findings of NAACL. PDF. Talk (Hebrew) at the TAU Digital Humanities seminar. Code. Demo. Chrome extension.

Resources and Evaluation

2024: Stephen Mayhew et al. Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark. NAACL. Project website. PDF. Code.
2024: Carinne Cherf and Yuval Pinter. BiVert: Bidirectional Vocabulary Evaluation using Relations for Machine Translation. LREC-COLING. PDF.

Large Language Models and their Ecosystem

2024: Re'em Harel, Yair Elboher, and Yuval Pinter. Protecting Privacy in Classifiers by Token Manipulation. Accepted to PrivateNLP. Preprint. Code.
2023: Yuval Pinter and Michael Elhadad. Emptying the Ocean with a Spoon: Should We Edit Models? Findings of EMNLP. Preprint. Between January 20 and 22, 2024, the PDF version was unavailable due to an unsanctioned, unauthorized action on part of the EMNLP General Chair. The paper is still under some kind of scrutiny, which we are monitoring closely. Expect updates.
2022: Ramit Sawhney, Ritesh Soun, Shrey Pandit, Megh Thakkar, Sarvagya Malaviya, and Yuval Pinter. CIAug: Equipping Interpolative Augmentation with Curriculum Learning. NAACL. PDF.

Applying NLP Techniques to Parallelize Code

2023: Tal Kadosh et al. Domain-Specific Code Language Models: Unraveling the Potential for HPC Codes and Tasks. Preprint.
2023: Nadav Schneider, Tal Kadosh, Niranjan Hasabnis, Timothy Mattson, Yuval Pinter, and Gal Oren. MPI-rical: Data-Driven MPI Distributed Parallelism Assistance with Transformers. Preprint. Code, data.
2022: Re'em Harel, Yuval Pinter, and Gal Oren. Learning to Parallelize in a Shared-Memory Environment with Transformers. Preprint. Code, data. Model.

In addition, I am continuing my involvement with the UniMorph project (2020: v. 3.0; 2022: v. 4.0); and with the Academy of the Hebrew Language, codifying the Universal Dependencies standards for Hebrew and maintaining the Hebrew Treebank (2023: v. 2.12). I co-organized the Stories Shared and Lessons Learned workshop at EMNLP 2022, as well as two BlackboxNLP workshops (2020, 2021). I am a member of the EU COST action on Universality, diversity and idiosyncrasy in language technology (UniDive).

At Georgia Tech

2021: Yuval Pinter. Integrating Approaches to Word Representation. Preprint. This is an edited version of my dissertation introduction.
2021: Yuval Pinter, Amanda Stent, Mark Dredze, Jacob Eisenstein. Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information. Preprint.
2020: Yuval Pinter, Cassandra L. Jacobs, Max Bittker. NYTWIT: A Dataset of Novel Words in the New York Times. COLING. PDF. Data.
2020: Yuval Pinter, Cassandra L. Jacobs, Jacob Eisenstein. Will it Unblend?. Findings of EMNLP. PDF. Video (lay audience). Handout (linguist audience). Also presented at SCiL 2021.
2019: Nicolas Garneau, Jean-Samuel Leboeuf, Yuval Pinter, Luc Lamontagne. Attending Form and Context to Generate Specialized Out-of-Vocabulary Words Representations. Preprint.
2019: Yuval Pinter, Marc Marone, Jacob Eisenstein. Character Eyes: Seeing Language through Character-Level Taggers. BlackboxNLP Workshop. PDF. Slides. Code. In June 2019 I gave a talk about this project at CUNY; between December 2019 and February 2020 I gave a (different) talk at Amazon Research, at the Tel Aviv University Machine Learning Seminar, and at AISC (video). Slides from the academic venues available upon request.
2018: Yuval Pinter and Jacob Eisenstein. Predicting Semantic Relations using Global Graph Properties. Proceedings of EMNLP. PDF. Blog post. Talk. Slides. Code.
In December 2018 I gave talks about this project in Israel, at: Technion - IIT, Ben-Gurion University, and Yahoo Research. Slides from the academic venues available upon request.

2017: Yuval Pinter, Robert Guthrie, Jacob Eisenstein. Mimicking Word Embeddings using Subword RNNs. Proceedings of EMNLP. PDF. Blog post. Talk. Slides. Code.
In December 2017 I gave talks about this project in Israel, at: Bar Ilan University, Technion - IIT, Amazon Research, and Google Research. Slides from the academic venues available upon request.

In addition to my main research projects, a critical look at work inspecting the ability of Attention mechanisms to explain model behavior got me into research on model interpretability:

2020: Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, Byron Wallace. Learning to Faithfully Rationalize by Construction. Proceedings of ACL. PDF. In May 2020 I gave a talk about this project "at" Stony Brook University's NLP seminar. Slides available upon request.

2019: Sarah Wiegreffe*, Yuval Pinter*. Attention is not not Explanation. Proceedings of EMNLP-IJCNLP. PDF. Blog posts: public facing; professional. Code. Slides. Talk (Sarah's). The December 2019 - May 2020 talks mentioned above cover this work as well.

A replication project done for the graduate seminar in Computational Social Science yielded a paper about language selection based on political stance.

2018: Ian Stewart*, Yuval Pinter*, Jacob Eisenstein. Sí o no, ¿què penses? Catalonian Independence and Linguistic Identity on Social Media. Proceedings of NAACL-HLT. PDF. Code. Slides. Live Tweets. I later presented this paper at AACL 2018.

At Yahoo

Most of my research at Yahoo involved analysis of the web search process, specifically within the domain of Community Question Answering sites (like Yahoo Answers, StackOverflow, Quora, etc.), from both an Information Retrieval (IR) perspective and a linguistic / NLP-y one.

2016: Yuval Pinter, Roi Reichart, Idan Szpektor. Syntactic Parsing of Web Queries with Question Intent. Proceedings of NAACL-HLT. PDF. Blog post. Talk. Data Release Notes. Data available to researchers via the Webscope program.
2016: Gilad Tsur, Yuval Pinter, Idan Szpektor, David Carmel. Identifying Web Queries with Question Intent. Proceedings of WWW. PDF.
2014: David Carmel, Avihai Mejer, Yuval Pinter, Idan Szpektor. Improving Term Weighting for Community Question Answering Search using Syntactic Analysis. Proceedings of CIKM. PDF. Blog post.

Our team, with collaborators from academia and government, started the TREC LiveQA Challenge, which has so far been run three times with considerable success. In short: participants write a server that must answer real questions coming into the Yahoo Answers feed, within one minute each, over one entire day. Details in the overview papers:

2017: Asma Ben Abacha, Eugene Agichtein, Yuval Pinter, Dina Demner-Fushman. Overview of the Medical Question Answering Task at TREC 2017 LiveQA. PDF.
2016: Eugene Agichtein, David Carmel, Dan Pelleg, Yuval Pinter, Donna Harman. Overview of the TREC 2016 LiveQA Track. PDF.
2015: Eugene Agichtein, David Carmel, Donna Harman, Dan Pelleg, Yuval Pinter. Overview of the TREC 2015 LiveQA Track. PDF. Blog post.
Code for creating a server.
Code for running a challenge.
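The participant side of the protocol described above can be sketched as a minimal HTTP server: receive a question, produce an answer inside the one-minute budget. The endpoint, the JSON field names ("qid", "answer"), and the placeholder answer are hypothetical stand-ins, not the track's actual API; see the linked code for the real thing.

```python
# Hypothetical sketch of a LiveQA-style participant server: accept a
# question over HTTP POST and return an answer within the time limit.
# Field names and payload shape are invented for illustration.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

ANSWER_DEADLINE_SECONDS = 60  # the track's one-minute budget

def answer_question(question):
    """Build an answer payload. A real system would retrieve and rank
    candidate answers here, staying well under ANSWER_DEADLINE_SECONDS."""
    return {"qid": question.get("qid"), "answer": "placeholder answer"}

class LiveQAHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        question = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(answer_question(question)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), LiveQAHandler).serve_forever()
```

Separating `answer_question` from the HTTP plumbing keeps the time-critical answering logic easy to test and swap out.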

Pet Projects

In 2015, I pursued a hypothesis regarding Israeli media bias: Is it possible to predict which news outlet a headline is from, based on relatively few examples and using only simple textual features? The answer, in so many words, is yes. I presented my preliminary results in:
2015: Yuval Pinter, Oren Persico, Shuki Tausig. Exploring Israeli News Website Bias using Simple Textual Analysis. ISCOL. Extended Abstract (PDF). PPTX (Hebrew).
I have not since followed up on this (but I'm still planning to). The code and data are available here.
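For flavor, here is what a "simple textual features" baseline of this kind can look like: bag-of-words counts fed to a multinomial Naive Bayes classifier, in pure standard-library Python. The headlines and outlet labels below are invented for illustration; they are not the study's data, and the actual features used there may differ.

```python
# Toy sketch: predicting a (made-up) news outlet from headline words
# with bag-of-words features and multinomial Naive Bayes.
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (label, text) pairs. Returns the model triple."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # Laplace (add-one) smoothing for unseen words.
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented headlines, not real data.
toy = [
    ("outlet_a", "shock drama scandal rocks coalition"),
    ("outlet_a", "drama tonight exclusive scandal revealed"),
    ("outlet_b", "committee reviews annual budget report"),
    ("outlet_b", "budget committee publishes policy report"),
]
model = train_nb(toy)
print(predict_nb(model, "exclusive scandal drama"))  # outlet_a
```

Even this crude a setup separates stylistically distinct sources surprisingly well, which is essentially the point the ISCOL abstract makes.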

Since 2001 I've been looking into the various suggestions to reform the Hebrew script. Here's one of my latest installments, and here's a video of a talk I gave about the subject in 2011. All in Hebrew.

I have some other things up my sleeve. Once I get them into presentable form I'll add them here - stay tuned.

Early Days of Linguistics

A paper I co-authored, about the dynamics of group and authority in Hebrew language communities on online editorial projects, is available online:
2018: Carmel Vaisman, Illan Gonen, and Yuval Pinter. Nonhuman Language Agents in Online Collaborative Communities: Comparing Hebrew Wikipedia and Facebook Translations. Discourse, Context & Media. Link.
This work began in 2011, when I contributed a chapter on the topic to the book Hebrew Online by Carmel Vaisman and Illan Gonen. It's in Hebrew, and you can buy it here from the publisher.
Dr. Vaisman and I gave a talk about this topic at the 2012 meeting of the Israeli Association for the Study of Language and Society. Video (Hebrew).

My MA thesis from 2014, about the semantics of nonveridical BEFORE, is available here (PDF). I presented parts of it at IGDAL (the first Israeli linguistics grad student conference) in 2011, with these slides that go right-to-left in each row. They're not a good summary of the completed thesis, though.

In 2010 I presented a modification for graded modality comparison. Slides. Seminar Paper this was based on.

Much smaller-scale explorations, for class projects and seminars: Hypercorrection in Hebrew conjunctions; Linguistic exploration of Text messages in Hebrew (Hebrew); Comparing usage of the Hebrew few/little (PPT, Hebrew); Can the acceptance rate of Hebrew neologisms be predicted using Optimality Theory? (tl;dr - NO.) (Hebrew).

To sum up, here's my Google Scholar page. My Erdős number is at most 4. I haven't been in any feature films, but the TV-lenient version of the Bacon number puts mine also at most 4 (via), so my Erdős-Bacon number is at most 8.


Online Presence

I have a personal blog, mainly in Hebrew, not often updated nowadays.

I contribute to a communal Linguistics Blog in Hebrew, Dagesh Kal, also not often updated, but with a "Hall of Fame" section that I feel comfortable recommending to Hebrew readers who want a decent but laid-back intro to some topics in linguistic research, à la Language Log. I also gave a short talk (Hebrew again) about peevology, based on our writings.

I'm an admin in what started as a Facebook group devoted to syntactically interesting sentences in Hebrew (garden paths, etc.) and somehow ballooned into a monstrosity with over 24,000 members (as of January 2022).

I co-run the Fake Noam Chomsky Twitter account (not very active these days).

Various Activities

I translated Jabberwocky into Hebrew.

I've run a couple of marathons. My personal best is 3:51:32 (Tiberias, 2015).

Media (Professional)

On June 12, 2023 I participated in a Knesset hearing about risks and opportunities in AI. Recording. Press release. In 2024 there was another one, at which I didn't speak.

I gave my two cents in Georgia Tech's College of Computing piece about predictions for AI in 2019 (including in the video). I was even kinda right :)

The Mimick work (2017) was covered by several professional blogs.
So was the attention paper (2019).

In January 2023 I was interviewed on Israeli radio about ChatGPT. Recording & Hebrew writeup.

In early 2020 I was interviewed on the Israeli Center for Educational Technology's podcast "Kululusha" on the topic of technological effects on human language (Hebrew).

In February 2017 I was interviewed on Israeli radio about Google Translate's algorithmic upgrade. Recording; Hebrew writeup; Google-Translated English version that's not half bad.

In May 2018 I was interviewed on the same station about Google Duplex. Recording; Hebrew writeup.

I'm in the background of this local Czech coverage of an official visit to Zlín in May 2023.

Media (Personal)

I helped design a tiny part of the set in a scene in the 2020 movie Superintelligence. My credit was limited to page 9 of the Mimick paper appearing in the background.

My 15 minutes of fame include some participation on BBC's WHYS, and a train-delay project that got me a front-page headline on Israel's main news website and an invitation to a(nother) Knesset hearing.

I've been on four Israeli trivia game shows between 2002 and 2008. Don't ask me how much I've won. Unfortunately, no videos are available online, but one day I might feel like converting the VHS tapes I have them on. More media appearances are linked to from my personal blog.

  Contact Information
Email: u then v then p, no spaces, at cs dot bgu dot ac dot il

Physical: Building 37 / Office 203
     Ben-Gurion University of the Negev
     P.O.B. 653 Beer-Sheva 8410501 Israel

Last Update: July 3, 2024