Keynote Speakers
New York University, USA
In recent years, we have witnessed an explosion in our capacity to collect and catalog vast amounts of data about our environment, society, and populace. Moreover, with the push towards transparency and open data, scientists, governments, and organizations are increasingly making structured data available on the Web and in various repositories and data lakes. Combined with advances in analytics and machine learning, the availability of such data should, in theory, allow us to make progress on many of our most important scientific and societal questions.
However, this opportunity is often unrealized due to a central technical barrier: it remains nearly impossible for domain experts to sift through the overwhelming amount of available information to discover the datasets they need for their specific applications. While search engines have addressed the discovery problem for Web documents, supporting the discovery of structured data presents new challenges. These include crawling the Web in search of datasets, indexing datasets and supporting dataset-oriented queries, and creating new techniques to rank and display results.
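To make the indexing and querying challenges concrete, here is a minimal Python sketch of keyword search over dataset metadata, assuming toy dataset identifiers and one-line descriptions; real dataset search engines index far richer metadata and apply dedicated ranking, so this inverted index is illustrative only.

```python
from collections import defaultdict

# Toy metadata records for hypothetical datasets (ids and
# descriptions are invented for illustration).
datasets = {
    "d1": "new york taxi trips 2019",
    "d2": "citibike station status",
    "d3": "nyc taxi zones shapefile",
}

# Inverted index: keyword -> set of dataset ids.
index = defaultdict(set)
for ds_id, description in datasets.items():
    for token in description.split():
        index[token].add(ds_id)

def search(query: str) -> set[str]:
    # Return the datasets whose metadata contains every query keyword.
    hits = [index.get(token, set()) for token in query.split()]
    return set.intersection(*hits) if hits else set()

print(search("taxi"))  # {'d1', 'd3'} (order may vary)
```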
In this talk, I will discuss these challenges and present our recent work in this area. Specifically, I will describe strategies for finding relevant datasets on the Web and deriving metadata to be indexed. Additionally, I will introduce a new class of data-relationship queries and outline a collection of methods that efficiently support various types of relationships, demonstrating how they can be used for data explanation and augmentation. Finally, I will showcase Auctus, an open-source dataset search engine that we have developed at the NYU Visualization, Imaging, and Data Analysis (VIDA) Center. I will conclude by highlighting open problems and suggesting directions for future research.
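As one illustration of a data-relationship query, the sketch below scores how well a candidate column supports a join using set containment, a standard proxy for joinability; this is not necessarily the method used in Auctus, and the zip-code columns are invented for the example.

```python
def containment(query_col: set, candidate_col: set) -> float:
    # Fraction of the query column's values found in the candidate
    # column; higher values suggest a more useful join for augmentation.
    if not query_col:
        return 0.0
    return len(query_col & candidate_col) / len(query_col)

# Hypothetical key columns from two tables.
zips_a = {"10001", "10002", "10003", "10004"}
zips_b = {"10002", "10003", "10004", "11201"}

print(containment(zips_a, zips_b))  # 0.75
```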
University of Palermo, Italy
The notion of repetitiveness plays a fundamental role in processing very large collections of texts. In many applications, massive and highly repetitive data need to be stored, analyzed, and queried. Good measures that capture repetitiveness therefore provide effective parameters for evaluating the performance of compressed indexing data structures on such data.
Many repetitiveness measures are defined using compression schemes. One of these measures, denoted r, is the number of maximal equal-letter runs in the output produced by the Burrows-Wheeler Transform (BWT), a transformation which permutes the characters of a text to boost the effects of run-length encoding. Besides having a crucial role in the definition of recent compressed indexing data structures, such as the r-index, the measure r has attracted attention in Combinatorics on Words because it has allowed for defining and recognizing properties of repetitive strings. A pioneering result is the characterization of finite Sturmian words as the binary strings for which r assumes its minimum value.
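To make the measure r concrete, the following minimal sketch computes the BWT naively by sorting all cyclic rotations (practical constructions use suffix arrays) and counts the maximal equal-letter runs of the output; the example word is a finite Fibonacci, hence Sturmian, word, for which r attains its minimum value of 2.

```python
def bwt(s: str) -> str:
    # Naive BWT: sort all cyclic rotations of s, take the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def runs(s: str) -> int:
    # Number of maximal equal-letter runs; applied to a BWT output,
    # this is the measure r.
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])

w = "abaababa"          # a finite Fibonacci (Sturmian) word
print(bwt(w))           # bbbaaaaa
print(runs(bwt(w)))     # 2
```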
From a complementary perspective, morphisms are classic tools in Combinatorics on Words for generating collections of repetitive texts. Injective morphisms, known as codes, are widely used in Information Theory. Recently, morphisms, combined with copy-paste mechanisms, have been used to define new repetitiveness measures and compressors, called NU-systems.
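As a small illustration of morphisms as generators of repetitive text, the sketch below iterates the classic Fibonacci morphism; it shows only the morphism component, not the copy-paste mechanisms that NU-systems combine with it.

```python
def apply_morphism(mu: dict[str, str], w: str) -> str:
    # Replace each letter of w by its image under the morphism mu.
    return "".join(mu[c] for c in w)

# Fibonacci morphism: a -> ab, b -> a. Iterating it from "a" yields
# ever-longer, highly repetitive (Sturmian) words.
mu = {"a": "ab", "b": "a"}
w = "a"
for _ in range(5):
    w = apply_morphism(mu, w)
print(w)  # abaababaabaab
```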
In this talk, I will explore our recent results on the properties of the measure r that allow us to analyze the combinatorial characteristics of input texts. I will then show recent findings on the identification of collections of generic highly repetitive strings using the measure r. Next, we will see recent results on the evaluation of some compression-based repetitiveness measures for collections of strings generated by morphisms. I will close with our latest research on the close correlations between morphisms and the measure r, with exciting implications for the theory of codes.
National Autonomous University of Mexico (UNAM), Mexico
Preservation and accessibility of documentary heritage are essential for maintaining and disseminating the cultural and historical wealth of a society. These concepts encompass a set of actions and strategies aimed at conserving historical documents and ensuring their availability for future generations, fostering research and knowledge across various disciplines.
National libraries play a crucial role as the primary reservoir of a country's documentary heritage. They store and protect a vast collection of documents, both printed and digital, that reflect a nation's cultural diversity and legacy.
These documents include codices, manuscripts, documents in indigenous languages, and multimodal texts. Each type presents unique preservation challenges due to its fragility, rarity, and linguistic and material diversity. Preservation efforts face several challenges, such as the need for specialized techniques for physical conservation, the digitization of multimodal texts, and the translation and cataloging of documents in indigenous languages. These tasks require an interdisciplinary approach and advanced technologies to ensure the integrity and accessibility of these materials.
Natural language processing (NLP) and artificial intelligence (AI) offer powerful tools to address these challenges. Among other tasks, these technologies can support metadata extraction, cataloging and classification, and summary generation.
The use of NLP and AI not only enhances preservation but also increases the accessibility of documentary heritage. These technologies enable the creation of digital access platforms, vectorized databases, and advanced search tools, which are essential for research in digital humanities, stylometry, literary studies, and more.
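As a hedged illustration of such search tools, the sketch below ranks hypothetical catalog entries against a query using TF-IDF vectors and cosine similarity from scikit-learn; production systems for heritage collections would typically use richer, multilingual embeddings and dedicated vector databases.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalog entries for digitized documents.
catalog = [
    "colonial manuscript on land disputes, Nahuatl and Spanish",
    "nineteenth-century newspaper, Mexico City, politics",
    "codex facsimile with pictographic annotations",
]

# Vectorize the catalog, then rank entries against a user query.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(catalog)
query_vector = vectorizer.transform(["manuscript in Nahuatl"])
scores = cosine_similarity(query_vector, doc_vectors)[0]
best = max(zip(scores, catalog))[1]
print(best)  # the colonial manuscript entry
```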