About
Fedineko is an indexing service for federated ActivityPub services.
It is not public yet, but it might become public... sooner or later.
I wrote the initial implementation after struggling to find posts of interest on one of the Mastodon instances.
Instances with full-text search enabled are more common now, though the cross-instance search experience is still a bit lacking.
What is indexed?
Currently, Fedineko indexes public content captured on ActivityPub relays.
The specific rules that determine whether indexing is allowed are here.
Fedineko does not index account details, though it stores information such as account ID, tags, emojis and opt-out status.
Fedineko does not store images or really any other media attached to ActivityPub documents.
Stored account details are:
- account ID, in other words the actor URL
- preferred username, usually the username part of @username@server
- display name, such as Fedi Neko
- Fediverse instance this account was created on, the server part of @username@server
- index status, such as an opt-out status used to deny or ban indexing
- avatar image URL, if any; it is used in search result output
- list of emojis, if any; they are used in search result output
- date and time when the account information was retrieved; it is used to drop account data after a week
- public key to validate signatures
Stored content details are:
- ID of content, which is a URL, e.g., https://server/user/username/status/123
- human-friendly URL for content, e.g., https://server/@username/123
- date it was published, so that week-old content can be deleted from the index
- content text, the actual text data that is indexed
- sensitive content flag, to filter content in search query results
- tags, used in search by tag
- mentions, used to exclude mentions from the index
- emojis, references to emojis to embed into text content when presenting search results
- attachments, references to images or other media "attached" to content
- languages, used to select a tokenizer when indexing
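As an illustration only, the stored content details listed above could be modelled as a record like the one below. The class name IndexedContent and all field names are hypothetical; this is a sketch of the shape of the data, not Fedineko's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IndexedContent:
    """Hypothetical record mirroring the stored content details (names are assumptions)."""

    content_id: str  # ID of content, a URL
    url: str  # human-friendly URL
    published: datetime  # used to drop week-old content
    text: str  # the text that is actually indexed
    sensitive: bool = False  # sensitive content flag
    tags: list[str] = field(default_factory=list)
    mentions: list[str] = field(default_factory=list)
    emojis: list[str] = field(default_factory=list)  # references only
    attachments: list[str] = field(default_factory=list)  # references only, no media
    languages: list[str] = field(default_factory=list)  # used to pick a tokenizer


example = IndexedContent(
    content_id="https://server/user/username/status/123",
    url="https://server/@username/123",
    published=datetime(2024, 1, 1, tzinfo=timezone.utc),
    text="Hello, Fediverse!",
    tags=["fediverse"],
    languages=["en"],
)
```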
How many documents are indexed?
More than half of the content is filtered out because folks prefer to keep it private, so it is actually not that much.
Right now there are 222355 documents in the index.
What is the data retention policy?
Data is stored in the index for up to 7 days.
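A 7-day retention rule like this one can be sketched as a simple age check. The names RETENTION and is_expired are hypothetical, and treating content that is exactly 7 days old as still retained is an assumption; the real cleanup logic may differ.

```python
from datetime import datetime, timedelta, timezone

# Retention window taken from the text above; the constant name is an assumption.
RETENTION = timedelta(days=7)


def is_expired(published: datetime, now: datetime) -> bool:
    """Return True when content is older than the retention window."""
    return now - published > RETENTION


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
old_post = datetime(2024, 1, 1, tzinfo=timezone.utc)  # 9 days old
fresh_post = datetime(2024, 1, 5, tzinfo=timezone.utc)  # 5 days old
```

Here is_expired(old_post, now) is true and is_expired(fresh_post, now) is false, so only the first document would be dropped.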
Where is it stored?
Data is stored and processed in DigitalOcean and AWS datacenters in the US, with all the pros and cons of that.
What is the crabo user-agent?
Crabo is the Fedineko component that gets site information (title, description, image) from OpenGraph meta tags.
This information is used to render link previews for content that has links.
Crabo follows robots.txt instructions and robots meta tag options such as "noindex":
<meta name="robots" content="noindex">
<meta name="fedineko-crabo" content="noindex">
<meta name="fedineko-crabo, some-other-bot" content="noindex, noarchive">
so it will not attempt to process pages that explicitly ask it not to.
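The meta tag handling described above can be sketched with Python's standard html.parser module. This is not Crabo's actual code: the class RobotsMetaParser and the function may_process are hypothetical, and the sketch assumes that a comma-separated name attribute addresses each listed bot. robots.txt handling is not shown.

```python
from html.parser import HTMLParser

BOT_NAME = "fedineko-crabo"  # name taken from the meta tag examples above


class RobotsMetaParser(HTMLParser):
    """Collects robots directives addressed to all robots or to BOT_NAME."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (e.g. <meta name>), so default to "".
        attr = {key: (value or "") for key, value in attrs}
        names = {n.strip().lower() for n in attr.get("name", "").split(",")}
        if names & {"robots", BOT_NAME}:
            content = attr.get("content", "")
            self.directives |= {
                d.strip().lower() for d in content.split(",") if d.strip()
            }


def may_process(html: str) -> bool:
    """Return False when the page asks this bot (or all robots) not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return "noindex" not in parser.directives
```

With this sketch, all three example tags above make may_process return False, while a tag addressed only to some other bot is ignored.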
What is the oceanhorse user-agent?
Oceanhorse is the Fedineko component that processes ActivityPub documents, extracts text and metadata (such as timestamp,
author, tags) and passes them to the indexing component. You are likely to see oceanhorse in the logs of an actual ActivityPub-federated instance.
What is the octofedi user-agent?
Octofedi is the Fedineko component that accepts ActivityPub documents, does basic filtering and enqueues documents
for processing by Oceanhorse. You are likely to see requests from octofedi when it requests public keys to verify
document signatures.
What is fedidig.com?
That is where the proof of concept was running; it is still used for development purposes.
I do not want content for my account to be indexed
Make sure that either the "indexable" or the "discoverable" flag is set to false.
Refer to the account settings on your ActivityPub service instance, or to your client settings, to configure it.
When/if the Fedineko search index becomes public, there will be one more way to opt out of indexing.
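A minimal sketch of the flag check described above, applied to an actor document that has already been parsed into a dictionary. The function name has_opted_out is hypothetical. The text does not say how a missing flag is treated, so this sketch assumes only an explicit false counts as an opt-out; the actual rules may be stricter.

```python
def has_opted_out(actor: dict) -> bool:
    """Return True when either flag is explicitly set to false.

    Assumption: a missing flag is not treated as an opt-out here.
    """
    return actor.get("indexable") is False or actor.get("discoverable") is False


opted_out_actor = {"indexable": False, "discoverable": True}
indexable_actor = {"indexable": True, "discoverable": True}
```

In this sketch has_opted_out(opted_out_actor) is true and has_opted_out(indexable_actor) is false.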