31st International Symposium on String Processing and Information Retrieval

Puerto Vallarta, Jalisco, México
September 23rd-25th, 2024

Keynote Speakers

Juliana Freire

New York University, USA

Juliana Freire is an Institute Professor at the Tandon School of Engineering and Professor of Computer Science and Engineering and Data Science at New York University. She develops methods and systems that enable a wide range of users to obtain trustworthy insights from data. This spans topics in large-scale data analysis and integration, visualization, machine learning, provenance management, and web information discovery, as well as different application areas, including urban analytics, misinformation, predictive modeling, and computational reproducibility. She is an active member of the database and Web research communities, having published over 250 technical papers (including 12 award-winning papers), several open-source systems, and 12 U.S. patents. She is an ACM Fellow, a AAAS Fellow, and the recipient of an NSF CAREER, two IBM Faculty awards, and a Google Faculty Research award. She was awarded the ACM SIGMOD Contributions Award in 2020. Her research has been funded by the National Science Foundation, DARPA, Department of Energy, National Institutes of Health, Sloan Foundation, Gordon and Betty Moore Foundation, W. M. Keck Foundation, Google, Amazon, AT&T Research, Microsoft Research, Yahoo! and IBM. She has received M.Sc. and Ph.D. degrees in computer science from the State University of New York at Stony Brook, and a B.S. degree in computer science from the Federal University of Ceara (Brazil).

Dataset Search for Data Discovery, Augmentation, and Explanation

In recent years, we have witnessed an explosion in our capacity to collect and catalog vast amounts of data about our environment, society, and populace. Moreover, with the push towards transparency and open data, scientists, governments, and organizations are increasingly making structured data available on the Web and in various repositories and data lakes. Combined with advances in analytics and machine learning, the availability of such data should, in theory, allow us to make progress on many of our most important scientific and societal questions.

However, this opportunity is often unrealized due to a central technical barrier: it remains nearly impossible for domain experts to sift through the overwhelming amount of available information to discover the datasets they need for their specific applications. While search engines have addressed the discovery problem for Web documents, supporting the discovery of structured data presents new challenges. These include crawling the Web in search of datasets, indexing datasets and supporting dataset-oriented queries, and creating new techniques to rank and display results.

In this talk, I will discuss these challenges and present our recent work in this area. Specifically, I will describe strategies for finding relevant datasets on the web and deriving metadata to be indexed. Additionally, I will introduce a new class of data-relationship queries and outline a collection of methods that efficiently support various types of relationships, demonstrating how they can be used for data explanation and augmentation. Finally, I will showcase Auctus, an open-source dataset search engine that we have developed at the NYU Visualization, Imaging, and Data Analysis (VIDA) Center. I will conclude by highlighting open problems and suggesting directions for future research.


 

Marinella Sciortino

University of Palermo, Italy

Marinella Sciortino is a full professor of Computer Science at the University of Palermo, where she currently serves as a member of the Academic Senate and Director of the CINI (National Interuniversity Consortium for Informatics) research unit. She was a member of the Scientific Committee of the GRIN (Group of Italian Professors of Informatics) from 2020 to 2023. She has held numerous positions at the University of Palermo, including Coordinator of the bachelor's and master's degree programs in Computer Science, and Member of the Research Evaluation Committee for Mathematics and Computer Science. She is a member of the editorial board of the scientific journal Theoretical Computer Science, published by Elsevier, and is included in the List of Mentors for the CiE Women in Computability Mentorship Programme. Her research interests mainly include Automata Theory and Formal Languages, Combinatorics on Words, String Algorithms, Symbolic Dynamics, and Data Compression. Some of her most cited publications concern the Burrows-Wheeler Transform (BWT), particularly an extension of the BWT to collections of sequences, boosting sequence compression based on the BWT, as well as its mathematical and combinatorial properties.

Exploring Repetitiveness in Texts: From BWT to Morphisms

The notion of repetitiveness plays a fundamental role in processing very large collections of texts. In many applications, massive and highly repetitive data need to be stored, analyzed, and queried. Therefore, having good measures capable of capturing repetitiveness implies having effective parameters to evaluate the performance of compressed indexing data structures for such types of data.

Many repetitiveness measures are defined using compression schemes. One of these measures, denoted r, is the number of maximal equal-letter runs in the output produced by the Burrows-Wheeler Transform (BWT), a transformation which permutes the characters of a text to boost the effects of run-length encoding. Besides having a crucial role in the definition of recent compressed indexing data structures, such as the r-index, the measure r has attracted attention in Combinatorics on Words because it has allowed for defining and recognizing properties of repetitive strings. A pioneering result is the characterization of finite Sturmian words as the binary strings for which r assumes its minimum value.
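As a rough illustration of the measure r described above, the following Python sketch computes a BWT and counts its maximal equal-letter runs. It assumes the rotation-based variant of the BWT without an end-of-string marker and uses naive sorting of all rotations, which is only practical for short strings; the function names bwt, count_runs, and r_measure are hypothetical and not taken from the talk.

```python
def bwt(text: str) -> str:
    """Return the last column of the sorted rotations of `text`.

    Assumption: rotation-based BWT with no end-of-string marker.
    Naive construction, suitable only for short illustrative inputs.
    """
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Count the maximal equal-letter runs in `s`."""
    if not s:
        return 0
    # A new run starts at every position whose letter differs from the previous one.
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


def r_measure(text: str) -> int:
    """Number of equal-letter runs in the BWT of `text`."""
    return count_runs(bwt(text))


if __name__ == "__main__":
    for word in ("banana", "abaababa"):
        print(word, bwt(word), r_measure(word))
```

Under these assumptions, "banana" is transformed into "nnbaaa" (three runs), while the finite Fibonacci word "abaababa" is transformed into "bbbaaaaa", reaching the minimum of two runs possible for a string over two letters.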

From a complementary perspective, morphisms are classic tools in Combinatorics on Words for generating collections of repetitive texts. Injective morphisms, known as codes, are widely used in Information Theory. Recently, morphisms, combined with copy-paste mechanisms, have been used to define new repetitiveness measures and compressors, called NU-systems.
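To make the idea of generating repetitive texts with morphisms concrete, here is a minimal Python sketch that iterates a morphism on a seed word. The Fibonacci morphism (a -> ab, b -> a) is chosen here only as a standard textbook example, not as one taken from the talk, and the names apply_morphism and iterate_morphism are hypothetical.

```python
def apply_morphism(morphism: dict[str, str], word: str) -> str:
    """Apply a morphism letter by letter: the image of uv is image(u) + image(v)."""
    return "".join(morphism[letter] for letter in word)


def iterate_morphism(morphism: dict[str, str], seed: str, steps: int) -> str:
    """Apply `morphism` to `seed` the given number of times."""
    word = seed
    for _ in range(steps):
        word = apply_morphism(morphism, word)
    return word


# Assumed example: the Fibonacci morphism, a classic generator of repetitive words.
FIBONACCI = {"a": "ab", "b": "a"}

if __name__ == "__main__":
    for n in range(6):
        print(n, iterate_morphism(FIBONACCI, "a", n))
```

Each iteration lengthens the word according to the Fibonacci numbers, and every word in the sequence is a prefix of the next, which is one simple way such collections become highly repetitive.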

In this talk, I will explore our recent results on the properties of the measure r that allow us to analyze the combinatorial characteristics of input texts. I will then show very recent findings on the identification of collections of generic highly repetitive strings using the measure r. Next, we will see recent results on the evaluation of some compression-based repetitiveness measures for collections of strings generated by morphisms. I will close with our latest research on the close correlations between morphisms and the measure r, with exciting implications for the theory of codes.


 

Gerardo Sierra

National Autonomous University of Mexico (UNAM), Mexico

Gerardo Sierra is a Civil Engineer with a Master's in Hispanic Linguistics, both from UNAM, and a doctorate in Computational Linguistics from what is now the University of Manchester. In 1999, with the support of the authorities of the UNAM Engineering Institute, he formed the Linguistic Engineering Group, a research group whose members range from undergraduate students to postdoctoral researchers and whose main characteristic is that it covers language and information technologies from various theoretical and applied perspectives.
He is currently a full-time Senior Researcher B at the Institute of Engineering and a member of the Mexican System of Researchers (level III). He has held various positions and appointments, notably: head and founder of the Linguistic Engineering Group; founder and member of the board of directors of the Mexican Association for Natural Language Processing (NLPNL); member of the Ruling Commission of the Electrical Engineering Division of the Faculty of Engineering; and member of the University Council of UNAM. He has been responsible for sponsored language technology projects, including projects for the National Security Commission and for health and mass media analysis companies. He has published three books and more than one hundred journal articles, book chapters, and refereed conference papers.

Preservation and Accessibility of Documentary Heritage

Preservation and accessibility of documentary heritage are essential for maintaining and disseminating the cultural and historical wealth of a society. These concepts encompass a set of actions and strategies aimed at conserving historical documents and ensuring their availability for future generations, fostering research and knowledge across various disciplines.

National libraries play a crucial role as the primary reservoir of a country's documentary heritage. They store and protect a vast collection of documents, both printed and digital, that reflect a nation's cultural diversity and legacy.

Printed documents include codices, manuscripts, documents in indigenous languages, and multimodal texts. Each type presents unique preservation challenges due to its fragility, rarity, and linguistic and material diversity. The preservation of printed documents faces several challenges, such as the need for specialized techniques for physical conservation, the digitization of multimodal texts, and the translation and cataloging of documents in indigenous languages. These tasks require an interdisciplinary approach and advanced technologies to ensure the integrity and accessibility of these materials.

Natural language processing (NLP) and artificial intelligence (AI) offer powerful tools to address these challenges. These technologies can support, among other tasks, metadata extraction, cataloging and classification, and summary generation.

The use of NLP and AI not only enhances preservation but also increases the accessibility of documentary heritage. These technologies enable the creation of digital access platforms, vectorized databases, and advanced search tools, which are essential for research in digital humanities, stylometry, literary studies, and more.