TextMachina: An open-source framework for building machine-generated text datasets

In the spirit of openness and collaboration, Genaios is proud to contribute TextMachina to the open-source community

The rise of capable Generative AI has ushered in a new era of AI-generated content, bringing with it countless positive use cases but also potential perils in the form of widespread automated misinformation and disinformation, fake reviews, reputational damage, and so forth.

This has created a pressing need to automatically detect human content, detect AI-generated content, and determine the boundaries between the two. Explainability and transparency are also needed, in the form of attribution of AI-generated text to the underlying Large Language Model (LLM).

These tasks have remained a challenge for even the most capable LLM organisations, and news abounds of the potential negative consequences of misattribution and being falsely accused of using AI.

Curating and compiling high-quality, unbiased datasets is one of the crucial steps in developing effective, fair, and ethical systems for Machine-Generated Text (MGT) tasks such as detection, attribution, and boundary detection. As researchers in this field, we saw this as the impetus to provide the community with a pipeline of tools for building such MGT datasets.

In the spirit of openness and collaboration, Genaios is proud to contribute TextMachina to the open-source community. TextMachina is a modular, extensible Python framework designed for compiling high-quality, unbiased MGT datasets to build robust detection and attribution models. TextMachina is released under the CC-NC-ND-4.0 license.

Our goal in releasing TextMachina to the community is to foster the research and development of MGT detectors, enhance their robustness and reliability, and advance the field.

TextMachina offers a simple way to build datasets for:

Detection – Is this text machine-generated?
Attribution – Which model generated this text?
Boundary detection – Where does the generated content start?
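To make the three tasks concrete, here is a minimal sketch of what a labeled example for each one might look like. It does not use TextMachina's API: the `Example` class, the helper functions, and the choice of a word-level boundary index are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Example:
    """Hypothetical record: a text plus its task-specific label."""
    text: str
    label: Union[str, int]


def detection_example(text: str, is_generated: bool) -> Example:
    # Detection: binary label, human vs. machine-generated.
    return Example(text, "generated" if is_generated else "human")


def attribution_example(text: str, source: str) -> Example:
    # Attribution: the label names the source, e.g. "human" or a model name.
    return Example(text, source)


def boundary_example(human_prefix: str, generated_suffix: str) -> Example:
    # Boundary detection: the label is the index of the first generated word
    # (word-level indexing is an assumption of this sketch).
    prefix_words = human_prefix.split()
    text = " ".join(prefix_words + generated_suffix.split())
    return Example(text, len(prefix_words))


mixed = boundary_example("The cat sat", "on the mat today")
print(mixed.text, mixed.label)
```

In this sketch, a boundary example is built by joining a human-written prefix with a generated continuation, so the label falls out of the prefix length.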

TextMachina abstracts away the inherent intricacies of building MGT datasets, such as LLM integrations, prompt templating, and bias mitigation. Our framework integrates LLMs from Anthropic, Cohere, OpenAI, Google Vertex AI, and any model from HuggingFace deployed locally or remotely, and new LLM providers can be added easily. A CLI is also provided for generating datasets and exploring them through reporting metrics and statistics.
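As a rough illustration of how prompt templating and pluggable providers can fit together, consider the sketch below. It is not TextMachina's implementation or API: the `GENERATORS` registry, the `register` decorator, the `echo` provider, and the prompt text are all hypothetical, and the provider is a stand-in that makes no real LLM call.

```python
from string import Template
from typing import Callable, Dict

# Hypothetical registry mapping a provider name to a text-generation function.
GENERATORS: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Register a generation function under a provider name."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        GENERATORS[name] = fn
        return fn
    return decorator


@register("echo")
def echo_generator(prompt: str) -> str:
    # Stand-in for a real LLM call: simply upper-cases the prompt.
    return prompt.upper()


# Hypothetical prompt template filled from fields of a human-written sample.
PROMPT = Template("Write a news article with the headline: $title")


def generate(provider: str, **fields: str) -> str:
    # Fill the template, then dispatch to the chosen provider.
    prompt = PROMPT.substitute(**fields)
    return GENERATORS[provider](prompt)


print(generate("echo", title="Local team wins"))
```

Under this design, supporting a new provider means registering one more function, while the templating and dispatch code stay unchanged.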

For more detailed information, please read the associated paper, "TextMachina: Seamless Generation of Machine-Generated Text Datasets", and find the code in our GitHub repository.

The Python PyPI package can be found here.

We value your contribution and feedback.

The Genaios Research Team.
