CRM stands for "Customer Relationship Management" but we actually use it to manage contacts and donations. It is how we send our massive newsletter once in a while.

[[TOC]]

Tutorial

Basic access

The main website is at:

https://crm.torproject.org/

It is protected by basic authentication as well as the site's own login, so you need two sets of credentials to get in.

To set up basic authentication for a new user, the following command must be executed on the CiviCRM server:

htdigest /etc/apache2/htdigest 'Tor CRM' <username>
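For reference, here is a minimal Python sketch of what one entry in that file looks like, assuming the standard Apache digest file format (user:realm:MD5 of "user:realm:password"). The function name and sample credentials are hypothetical; the htdigest command above remains the way to actually add users.

```python
import hashlib


def htdigest_line(username: str, realm: str, password: str) -> str:
    """Build one line in Apache's htdigest file format."""
    # The stored hash is the MD5 hex digest of "user:realm:password".
    ha1 = hashlib.md5(f"{username}:{realm}:{password}".encode("utf-8")).hexdigest()
    return f"{username}:{realm}:{ha1}"


# Hypothetical user and password, for illustration only.
example = htdigest_line("alice", "Tor CRM", "not-a-real-password")
```

Note that the realm ('Tor CRM') is part of the hash, so an entry created for another realm will not work here.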

Once basic authentication is in place, the Drupal/CiviCRM login page can be accessed at: https://crm.torproject.org/user/login

Howto

Updating premiums

From time to time, typically around the annual year-end campaign (YEC), the donation gift/perks offered on https://donate.torproject.org need to be updated.

The first step is to update the data in CiviCRM.

Create the perks

  • Go to: Contributions > Premiums (Thank-you Gifts)
  • Edit each product as follows:
      • Name: Name displayed for the premium
      • Description: subtext under the title, ex: "Get this year’s Tor Onions T-shirt"
      • SKU: SKU of the product, or if it’s a t-shirt with variants, the common part of the SKU for all sizes of the product (with no dash at the end)
      • Image: A PNG image can be uploaded using the "upload from my computer" option
      • Minimum contribution amount: minimum for non-recurring donations
      • Market value: not used, can be "1.00"
      • Actual cost of Product: not used, ignore
      • Financial Type: not used, ignore
      • Options: comma-delimited "SKU=label" for size selection and corresponding SKUs. For example: T22-RCF-C01=Small,T22-RCF-C02=Medium,T22-RCF-C03=Large,T22-RCF-C04=XL,T22-RCF-C05=2XL,T22-RCF-C06=3XL,T22-RCF-C07=4XL
        This field cannot be blank: at least one option is required (e.g. HAT-00=Hat).
      • Enabled?: checked (uncheck if the perk is not used anymore)
      • Subscription or Service Settings: ignore, not used
      • Minimum Recurring Amount: Enter the recurring donation amount that makes this premium available
      • Sort: decimal number that helps sort the items on the list of perks (in ascending order, i.e. a lower order/weight is displayed first)
      • Image alt text: alt text for the perk image html tag
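The "Options" format described above can be checked before pasting it into the form. The following is a minimal sketch, not part of CiviCRM or the torcrm extension: the function name is hypothetical and the validation rules only reflect what this page states (comma-delimited "SKU=label" pairs, at least one option).

```python
def parse_premium_options(options: str) -> list[tuple[str, str]]:
    """Split a comma-delimited "SKU=label" string into (sku, label) pairs."""
    pairs = []
    for item in options.split(","):
        item = item.strip()
        if not item:
            continue
        sku, sep, label = item.partition("=")
        if not sep or not sku.strip() or not label.strip():
            raise ValueError(f"expected SKU=label, got {item!r}")
        pairs.append((sku.strip(), label.strip()))
    if not pairs:
        # The field cannot be blank: at least one option is required.
        raise ValueError("at least one option is required")
    return pairs
```

For example, parse_premium_options("HAT-00=Hat") returns a single ("HAT-00", "Hat") pair, while an empty string raises a ValueError.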

For new perks, create a new premium and disable the old one instead of updating its SKU, to avoid problems with older data.

Associate with contributions

Perks must be associated with the CiviCRM "contribution page". TPA does not use these Contribution Pages directly, but that is where the settings are stored for donate-neo, such as the ThankYou message displayed on transaction receipts.

  • Go to: Contributions > Manage Contribution Pages
  • Find the "Your donation to the Tor Project" list item and, on the right side, click the "configure" link
  • On the contribution page settings form, click the "Premiums" tab

Here you can then associate the perks (premiums) created in the previous section with the page.

If the "add new" link is not displayed, it’s because all available premiums have already been added.

Export the JSON data for donate-neo

When done, export the data in JSON format using the tpa-perks-json CiviCRM page.

The next steps are detailed on the donate wiki page.

Monitoring mailings

The CiviCRM server can generate large mailings, on the order of hundreds of thousands of unique email addresses. Those can create significant load on the server if mishandled and, worse, trigger blocking at various providers if not correctly rate-limited.

For this, we have various knobs and tools:

The Grafana dashboard is based on metrics from Prometheus, which can be inspected live with the following command:

curl -s localhost:3903/metrics | grep -v -e ^go_ -e '^#' -e '^mtail' -e ^process -e _tls_; postfix-queues-sizes

Using lnav can also be useful to monitor logs in real time, as it provides per-queue ID navigation, marks warnings (deferred messages) in yellow and errors (bounces) in red.

A few commands to inspect the email queue:

  • List the queue, with more recent entries first

    postqueue -j | jq -C '.recipients[]' | tac
    
  • Find how many emails in the queue, per domain:

    postqueue -j | jq -r '.recipients[].address' | sed 's/.*@//' | sort | uniq -c | sort -n
    

Note that the qshape deferred command gives a similar (and actually better) output.
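The per-domain count can also be sketched in Python, which is easier to extend (for example to filter on a specific domain). This is a minimal sketch that assumes the postqueue -j output format of one JSON object per line, each with a "recipients" list of objects carrying an "address" field; the function name is hypothetical.

```python
import json
from collections import Counter


def count_recipient_domains(lines):
    """Count queued recipients per domain from `postqueue -j` output lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        for recipient in entry.get("recipients", []):
            # Keep only the part after the last "@", case-insensitively.
            domain = recipient["address"].rpartition("@")[2].lower()
            counts[domain] += 1
    return counts
```

To use it on a live queue, feed it the lines of postqueue -j (for example from sys.stdin) and print counts.most_common().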

In case of a major problem, you can stop the mailing in CiviCRM and put all emails on hold with:

postsuper -h ALL

Then the postfix-trickle script can be used to slowly release emails:

postfix-trickle 10 5

When an email bounces, it should go to civicrm@crm.torproject.org, which is an IMAP mailbox periodically checked by CiviCRM. CiviCRM ingests the bounces landing in that mailbox and disables the bouncing addresses for future mailings. It's also how users can unsubscribe from those mailings, so it is critical that this service runs correctly.

A lot of those notes come from the issue where we enabled CiviCRM to receive its bounces.

Handling abuse complaints

Our postmaster alias can receive emails like this:

Subject: Abuse Message [AbuseID:809C16:27]: AbuseFBL: UOL Abuse Report

Those emails usually contain enough information to figure out which email address filed a complaint. The action to take is to remove that address from the mailing. Here's a sample email:

Received: by crm-int-01.torproject.org (Postfix, from userid 33)
        id 579C510392E; Thu, 4 Feb 2021 17:30:12 +0000 (UTC)
[...]
Message-Id: <20210204173012.579C510392E@crm-int-01.torproject.org>
[...]
List-Unsubscribe: <mailto:civicrm+u.2936.7009506.26d7b951968ebe4b@crm.torproject.org>
job_id: 2936
Precedence: bulk
[...]
X-CiviMail-Bounce: civicrm+b.2936.7009506.26d7b951968ebe4b@crm.torproject.org
[...]

Your report might have only some of those headers. Possible courses of action to find the complainant's email address:

  1. Grep for the queue ID (579C510392E) in the mail logs
  2. Grep for the Message-Id (20210204173012.579C510392E@crm-int-01.torproject.org) in mail logs (with postfix-trace)
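The identifiers needed for those searches can be pulled out of the headers with a short script. This is a minimal sketch: the function names are hypothetical, and the interpretation of the CiviMail address as "action.job_id.event_queue_id.hash" is an assumption based on the sample above (where the second field matches the job_id header), not something confirmed by this page.

```python
import re
from typing import Optional

# Assumed layout: <localpart>+<action>.<job_id>.<event_queue_id>.<hash>@<domain>
VERP_RE = re.compile(
    r"[^+@<\s:]+\+(?P<action>[a-z])\.(?P<job_id>\d+)\."
    r"(?P<event_queue_id>\d+)\.(?P<hash>[0-9a-f]+)@(?P<domain>[^>\s]+)"
)
# Postfix Message-Ids look like <YYYYMMDDHHMMSS.QUEUEID@hostname>.
MESSAGE_ID_RE = re.compile(r"\d{14}\.(?P<queue_id>[0-9A-Za-z]+)@")


def parse_civimail_address(header_value: str) -> Optional[dict]:
    """Extract the fields of a List-Unsubscribe or X-CiviMail-Bounce address."""
    match = VERP_RE.search(header_value)
    if match is None:
        return None
    fields = match.groupdict()
    fields["job_id"] = int(fields["job_id"])
    fields["event_queue_id"] = int(fields["event_queue_id"])
    return fields


def postfix_queue_id(message_id: str) -> Optional[str]:
    """Extract the Postfix queue ID embedded in a Message-Id header."""
    match = MESSAGE_ID_RE.search(message_id)
    return match.group("queue_id") if match else None
```

With the sample above, the Message-Id yields the queue ID 579C510392E, which can then be searched for in the mail logs.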

Once you have the email address:

  1. Head for the CiviCRM search interface to find that user
  2. Remove them from the "Tor News" group, in the Group tab

Another option is to go in Donor record > Edit communication preferences > check do not email.

Alternatively, you can just send an email to the List-Unsubscribe address or click the "unsubscribe" links at the bottom of the email. The handle-abuse.py script in fabric-tasks.git automatically handles the CiviCRM bounces that way. Support for other bounces should be added there as we can.

Special cases should be reported to the CiviCRM admin by forwarding the email to the Giving queue in RT.

Sometimes complaints come in about Mailman lists. Those are harder to handle because they do not have individual bounce addresses...

Granting access to the CiviCRM backend

The main CiviCRM is protected by Apache-based authentication, accessible only by TPA. To add a user, on the backend server (currently crm-int-01):

htdigest /etc/apache2/htdigest 'Tor CRM' $USERNAME

A Drupal user also needs to be created for that person. If you yourself don't have access to the Drupal interface yet, you can get access to the admin user through root access to the server with:

sudo -i -u torcivicrm
cd /srv/crm.torproject.org/htdocs-prod && drush uli toradmin

Once logged in, a personal account should be created with administrator privileges to facilitate future logins.

Notes:

  • The URL produced by drush needs to be manually modified for it to lead to the right place: https should be used instead of http, and the hostname needs to be changed from default to crm.torproject.org
  • drush uli without a user will produce URLs that give out an Access Denied error since the user with uid 1 is disabled.
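The manual URL fix described in the first note can be sketched as follows. This is only an illustration: the function name is hypothetical, and the reset path in the usage note is a made-up example of what drush uli might print.

```python
from urllib.parse import urlsplit, urlunsplit


def fix_drush_login_url(url: str, hostname: str = "crm.torproject.org") -> str:
    """Force https and the real hostname on a one-time login URL from drush."""
    parts = urlsplit(url.strip())
    # Keep the path, query and fragment; replace only the scheme and host.
    return urlunsplit(("https", hostname, parts.path, parts.query, parts.fragment))
```

For example, a (hypothetical) URL such as http://default/user/reset/2/1700000000/abc/login becomes https://crm.torproject.org/user/reset/2/1700000000/abc/login.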

Rotating API tokens

See the donate site docs for this.

Pager playbook

Security breach

If there's a major security breach on the service, the first thing to do is probably to shut down the CiviCRM server completely. Halt the crm-int-01 and donate-01 machines, and cut off the attacker's access to the underlying storage.

Then API keys and secrets should probably be rotated; follow the Rotating API tokens procedure.

Job failures

If you get an alert about a "CiviCRM job failure", for example:

    The CiviCRM job send_scheduled_mailings on crm-int-01.torproject.org
    has been marked as failed for more than 4h. This could be that
    it has not run fast enough, or that it failed.

... it means a CiviCRM job (in this case send_scheduled_mailings) has either failed or has not run in its configured time frame. (Note that we currently can't distinguish those states, but hopefully will have metrics to do so soon.)

The "scheduled job failures" section will also show more information about the error:

To debug this, first find the "Scheduled Job Logs":

  1. Go to Administer > System Settings > Scheduled Jobs
  2. Find the affected job (above send_scheduled_mailings)
  3. Click "view log"

Here's a screenshot of such a log:

This will show the error that triggered the alert:

  • If it's an exception, it should be investigated in the source code.

  • If the job just hasn't run in a timely manner, the systemd timer should be investigated with systemctl status civicron@prod.timer

There's also the global CiviCRM on-disk log. It's not perfect, because on this server there are sometimes 2 different logs. It can also be rather noisy, with deprecation alerts, civirules chatter, etc.

Those are also available in "Administer > Administration Console > View Log" in the web interface and stored on disk, in:

ls -altr /srv/crm.torproject.org/htdocs-prod/sites/default/files/civicrm/ConfigAndLog/CiviCRM.1.*.log

Note that it's also possible to run the jobs by hand, but we don't have specific examples on how to do this for all jobs. See the Resque process job, below, for a more specific example.

Kill switch enabled

If the Resque Processor Job gets stuck because it failed to process an item, it will stop processing completely (assuming it's a bug, or something is wrong). It raises a "kill switch" that will show up as a red "Resque Off" message in Administer > Administration Console > System Status. Here's a screenshot of an enabled kill switch:

Note that this is a special case of the more general job failure above. It's documented explicitly and separately here because it's such an important part that it warrants its own documentation.

The "scheduled job failures" section will also show more information about the error:

To debug this, first find the "Scheduled Job Logs":

  1. Go to Administer > System Settings > Scheduled Jobs
  2. Find "TorCRM Resque Processing"
  3. Click "view log"

Here's a screenshot of such a log:

This will show the error (typically a PHP exception) that triggered the kill switch. This should be investigated in the source code.

There's also the global CiviCRM on-disk log. It's not perfect, because on this server there are sometimes 2 different logs (it's in my pipeline to debug that). It can also be rather noisy, with deprecation alerts, civirules chatter, etc.

Those are also available in "Administer > Administration Console > View Log" in the web interface and stored on disk, in:

ls -altr /srv/crm.torproject.org/htdocs-prod/sites/default/files/civicrm/ConfigAndLog/CiviCRM.1.*.log

The items in the queue can be seen by searching for "TorCRM - Resque" in the above status page, or with the Redis command: LRANGE "resque:queue:prod_web_donations" 0 -1, in the redis-cli shell.

The job can be run manually from the command line with:

sudo -i -u torcivicrm
cd /srv/crm.torproject.org/htdocs-prod/
cv api setting.create torcrm_resque_off=0
cv api Job.Torcrm_Resque_Process

You can also get a backtrace with:

cv api Job.Torcrm_Resque_Process -vvv

Once the problem is fixed, the kill switch can be reset by going to "CiviCRM > Administer > Tor CRM Settings" in the web interface. Note that there's somewhat of a double-negative in the kill switch configuration. The form is:

Resque Off Switch  [0]
Set to 0 to disable the off/kill switch. This gets set to 1 by the "Resque" Scheduled Job when an error is detected. When that happens, check the CiviCRM "ConfigAndLog" logs, or under Administer > Console > View Log

The "Resque Off Switch" is the kill switch. When it's set to zero ("0", as above), it's disabled, which means normal operation and the queue is processed. It's set to "1" when an error is raised, and should be set back to "0" when the issue is fixed.
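The double negative can be spelled out in a few lines of Python. This is only a sketch with hypothetical function names; the link to the civicrm_torcrm_resque_processor_status_up metric is inferred from the "Monitoring and metrics" section, where 1 means the kill switch is not raised.

```python
def resque_is_processing(torcrm_resque_off: int) -> bool:
    """Return True when the queue is processed, i.e. the kill switch is NOT raised."""
    # 0 = off switch disabled = normal operation; 1 = kill switch raised.
    return int(torcrm_resque_off) == 0


def expected_status_up(torcrm_resque_off: int) -> int:
    """Value the status_up metric should report for a given setting (assumption)."""
    return 1 if resque_is_processing(torcrm_resque_off) else 0
```

In other words, torcrm_resque_off=0 should correspond to a status_up metric of 1, and torcrm_resque_off=1 to a status_up metric of 0.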

See tpo/web/civicrm#144 for an example of such a kill switch debugging session.

Disaster recovery

If Redis dies, we might lose in-process donations. But otherwise, it is disposable and data should be recreated as needed.

If the entire database gets destroyed, it needs to be restored from backups, by TPA.

Reference

Installation

Full documentation on the installation of this system is somewhat out of scope for TPA: sysadmins only installed the servers and set up basic services like a VPN (using IPsec) and an Apache, PHP, MySQL stack.

The Puppet class used on the CiviCRM server is role::civicrm_int. That naming convention reflects the fact that, before donate-neo, there used to be another role named roles::civicrm_ext for the frontend, retired in tpo/tpa/team#41511.

Upgrades

As stated above, a new donation campaign involves changes to both the donate-neo site (donate.tpo) and the CiviCRM server.

Changes to the CiviCRM server and donation middleware can be deployed progressively through the test/staging/production sites, which all have their own databases. See the donate-neo docs for deployments of the frontend.

TODO: clarify the deployment workflow. They seem to have one branch per environment, but what does that include? Does it matter for us?

There's a drush script that edits the dev/stage databases to replace PII in general, and in particular change the email of everyone to dummy aliases so that emails sent by accident wouldn't end up in real people's mail boxes.

Upgrades are typically handled by the CiviCRM consultant.

See also the CiviCRM upgrade guide.

SLA

This service is critical, as it is used to host donations, and should be as highly available as possible. Unfortunately, its design has multiple single points of failure, which, in practice, makes this target difficult to fulfill at this point.

Design and architecture

CiviCRM is a relatively "classic" PHP application: it's made of a collection of .php files scattered cleverly around various directories. There's one catch: it's actually built as a drop-in module for other CMSes. Traditionally, Joomla, WordPress and Drupal are supported, and our deployment uses Drupal.

(There's actually a standalone version in development we are interested in as well, as we do not need the features from the Drupal site.)

Most code lives in a torcrm module that processes Redis messages through CiviCRM jobs.

CiviCRM is isolated from the public internet through HTTP authentication. Communication with the donation frontend happens through a Redis queue. See also the donation site architecture for more background.

Services

The CiviCRM service runs on the crm-int-01 server, with the following layers:

  • Apache: TLS decapsulation, HTTP authentication and reverse proxy
  • PHP FPM: PHP runtime which Apache connects to over FastCGI
  • Drupal: PHP entry point, loads CiviCRM code as a module
  • CiviCRM: core of the business logic
  • MariaDB (MySQL) database (Drupal and CiviCRM storage backend)
  • Redis server: communication between CiviCRM and the donate frontend
  • Dovecot: IMAP server to handle bounces

Apache answers to the following virtual hosts:

  • crm.torproject.org: production CiviCRM site
  • staging.crm.torproject.org: staging site
  • test.crm.torproject.org: testing site

The monthly newsletter is configured on CiviCRM and archived on the https://newsletter.torproject.org static site.

Storage

CiviCRM stores most of its data in a MySQL database. There are separate databases for the dev/staging/prod sites.

TODO: does CiviCRM also write to disk?

Queues

CiviCRM can hold a large queue of emails to send when a new newsletter is generated. This, in turn, can result in large Postfix email queues when CiviCRM releases those emails into the mail system.

The donate-neo frontend uses Redis to queue up transactions for CiviCRM. See the queue documentation in donate-neo. Queued jobs are de-queued by CiviCRM's Resque Scheduled Job, and crons, logs, monitoring, etc, all use standard CiviCRM tooling.

See also the kill switch enabled playbook.

Interfaces

Most operations with CiviCRM happen over a web interface, in a web browser. There is a CiviCRM API but it's rarely used by Tor's operators.

Users that are administrators can also access the Drupal admin menu, but it's not shown in the CiviCRM web interface. You can change the URL in your browser to any Drupal section (for example https://crm.torproject.org/admin/user) to get the Drupal admin menu to appear.

The torcivicrm user has a command-line CiviCRM tool called cv in its $PATH which talks to that API to perform various functions.

Drupal also has its own shell tool called drush.

Authentication

The crm-int-01 server doesn't talk to the outside internet and can be accessed only via HTTP Digest authentication. We are considering changing this to basic auth.

Users that need to access the CRM must be added to the Apache htdigest file on crm-int-01.tpo and have a CiviCRM account created for them.

To extract a list of CiviCRM accounts and their roles, the following drush command may be executed at the root of the Drupal installation:

drush uinf $(drush sqlq "SELECT GROUP_CONCAT(uid) FROM users")

The SSH server is firewalled (rules defined in Puppet, profile::civicrm). To get access to the port, ask TPA.

Implementation

CiviCRM is a PHP application licensed under the AGPLv3, supporting PHP 8.1 and later at the time of writing. As of 2024-08-28, we are running CiviCRM 5.73.4, released on May 30th, 2024. The current version can be found in /srv/crm.torproject.org/htdocs-prod/sites/all/modules/civicrm/release-notes.md on the production server (crm-int-01). See also the upstream release announcements, the GitHub tags page and the release management policy.

Upstream also has their own GitLab instance.

CiviCRM has a torcrm extension under sites/all/civicrm_extensions/torcrm which includes most of the CiviCRM customization, including the Resque Processor job. It replaces the old tor_donate Drupal module, which is being phased out.

CiviCRM only holds donor information, actual transactions are processed by the donation site, donate-neo.

Issues

Since there are many components, here's a table outlining the known projects and issue trackers for the different sites.

Site                               Project  Issues
https://crm.torproject.org         project  issues
https://donate.torproject.org      project  issues
https://newsletter.torproject.org  project  issues

Server-level issues should be filed in the TPA team issue tracker.

Upstream CiviCRM has their own StackExchange site and uses GitLab issue queues.

Maintainer

CiviCRM, the PHP application and the JavaScript component on donate-static are all maintained by the external CiviCRM contractors.

Users

Direct users of this service are mostly the fundraising team.

Upstream

Upstream is a healthy community of free software developers producing regular releases. Our consultant is part of the core team.

Monitoring and metrics

Like other TPA servers, the CRM servers are monitored by Prometheus. The Redis server (and the related IPsec tunnel) is specifically monitored, using a blackbox check, to make sure both ends can talk to each other.

There are also graphs rendered by Grafana. This includes an elaborate Postfix dashboard watching the two mail servers.

We started working on monitoring the CiviCRM health better. So far we collect metrics that look like this:

# HELP civicrm_jobs_timestamp_seconds Timestamp of the last CiviCRM jobs run
# TYPE civicrm_jobs_timestamp_seconds gauge
civicrm_jobs_timestamp_seconds{jobname="civicrm_update_check"} 1726143300
civicrm_jobs_timestamp_seconds{jobname="send_scheduled_mailings"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="fetch_bounces"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="process_inbound_emails"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="clean_up_temporary_data_and_files"} 1725821100
civicrm_jobs_timestamp_seconds{jobname="rebuild_smart_group_cache"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="process_delayed_civirule_actions"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="civirules_cron"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="delete_unscheduled_mailings"} 1726166700
civicrm_jobs_timestamp_seconds{jobname="call_sumfields_gendata_api"} 1726201800
civicrm_jobs_timestamp_seconds{jobname="update_smart_group_snapshots"} 1726166700
civicrm_jobs_timestamp_seconds{jobname="torcrm_resque_processing"} 1726203600
# HELP civicrm_jobs_status_up CiviCRM Scheduled Job status
# TYPE civicrm_jobs_status_up gauge
civicrm_jobs_status_up{jobname="civicrm_update_check"} 1
civicrm_jobs_status_up{jobname="send_scheduled_mailings"} 1
civicrm_jobs_status_up{jobname="fetch_bounces"} 1
civicrm_jobs_status_up{jobname="process_inbound_emails"} 1
civicrm_jobs_status_up{jobname="clean_up_temporary_data_and_files"} 1
civicrm_jobs_status_up{jobname="rebuild_smart_group_cache"} 1
civicrm_jobs_status_up{jobname="process_delayed_civirule_actions"} 1
civicrm_jobs_status_up{jobname="civirules_cron"} 1
civicrm_jobs_status_up{jobname="delete_unscheduled_mailings"} 1
civicrm_jobs_status_up{jobname="call_sumfields_gendata_api"} 1
civicrm_jobs_status_up{jobname="update_smart_group_snapshots"} 1
civicrm_jobs_status_up{jobname="torcrm_resque_processing"} 1
# HELP civicrm_torcrm_resque_processor_status_up Resque processor status
# TYPE civicrm_torcrm_resque_processor_status_up gauge
civicrm_torcrm_resque_processor_status_up 1

Those show the last timestamp of various jobs, the status of those jobs (1 means OK), and whether the "kill switch" has been raised (1 means OK, that is: not raised).
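As an illustration of how those metrics can be read, here is a minimal Python sketch that parses the text above and lists the jobs that are not OK. The function names are hypothetical and the parser only handles the three metric names shown in the sample; it is not the exporter's or the alerting rules' actual logic.

```python
import re

METRIC_RE = re.compile(
    r'^(?P<name>\w+)(?:\{jobname="(?P<job>[^"]*)"\})?\s+(?P<value>\S+)$'
)


def parse_civicrm_metrics(text):
    """Return (timestamps, statuses, kill_switch_ok) parsed from the metrics text."""
    timestamps, statuses, kill_switch_ok = {}, {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = METRIC_RE.match(line)
        if not match:
            continue
        name, job, value = match["name"], match["job"], float(match["value"])
        if name == "civicrm_jobs_timestamp_seconds":
            timestamps[job] = value
        elif name == "civicrm_jobs_status_up":
            statuses[job] = value == 1
        elif name == "civicrm_torcrm_resque_processor_status_up":
            kill_switch_ok = value == 1  # 1 means the kill switch is not raised
    return timestamps, statuses, kill_switch_ok


def failing_jobs(statuses):
    """List the job names whose status is not 1 (OK)."""
    return sorted(job for job, ok in statuses.items() if not ok)
```

The timestamps are Unix epoch seconds, so comparing them with time.time() gives the age of the last run of each job.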

Authentication to the CiviCRM server was particularly problematic: there's an open issue to convert the HTTP-layer authentication system to basic authentication (tpo/web/civicrm#147).

We're hoping to get more metrics from CiviCRM, like detailed status of job failures, mailing run times and other statistics, see tpo/web/civicrm#148. Other options were discussed in this comment as well.

Only the last metric above is hooked up to alerting for now, see tpo/web/donate-neo#75 for a deeper discussion.

Note that the donate front-end also exports its own metrics, see the Donate Monitoring and metrics documentation for details.

Tests

TODO: what to test on major CiviCRM upgrades, specifically in CiviCRM?

There's a test procedure in donate.torproject.org that should likely be followed when there are significant changes performed on CiviCRM.

Logs

The CRM side (crm-int-01.torproject.org) has a similar configuration and sends production environment errors via email.

The logging configuration is in: crm-int-01:/srv/crm.torproject.org/htdocs-prod/sites/all/modules/custom/tor_donation/src/Donation/ErrorHandler.php.

Resque processor logs are in the CiviCRM Scheduled Jobs logs under Administer > System Settings > Scheduled Jobs, then find the "Torcrm Resque Processing" job, then view the logs. There may also be fatal errors logged in the general CiviCRM log, under Administer > Admin Console > View Log.

Backups

Backups are done with the regular backup procedures except for the MariaDB/MySQL database, which is backed up in /var/backups/mysql/. See also the MySQL section in the backup documentation.

Other documentation

Upstream has a documentation portal where our users will find:

Discussion

This section is reserved for future large changes proposed to this infrastructure. It can also be used to perform an audit on the current implementation.

Overview

CiviCRM's deployment has simplified a bit since the launch of the new donate-neo frontend. We inherited a few of the complexities of the original design, in particular the fragility of the coupling between frontend and backend through the Redis / IPsec tunnel.

We also inherited the "two single points of failure" design from the original implementation, and actually made that worse by removing the static frontend.

The upside is that software has been updated to use more upstream, shared code, in the form of Django. We plan on using renovate to keep dependencies up to date. Our deployment workflow has improved significantly as well, by hooking up the project with containers and GitLab CI, although CiviCRM itself has failed to benefit from those changes unfortunately.

Next steps include improvements to monitoring and perhaps having proper dev/stage/prod environments, with a fully separate virtual server for production.

Original "donate-paleo" review

The CiviCRM deployment is complex and feels a bit brittle. The separation between the CiviCRM backend and the middleware API evolved from an initial strict, two-server setup into the current three-part design after the static site frontend was added around 2020. The original two-server separation was performed out of a concern for security. We were worried about exposing CiviCRM to the public, because we felt the attack surface of both Drupal and CiviCRM was too wide to be reasonably defended against a determined attacker.

The downside is, obviously, a lot of complexity, which also makes the service more fragile. The Redis monitoring, for example, was added after we discovered the ipsec tunnel would sometimes fail, which would completely break donations.

Obviously, if either the donation middleware or CiviCRM fails, donations go down as well, so we actually have two single points of failure in that design.

A security review should probably be performed to make sure React, Drupal, its modules, CiviCRM, and other dependencies are all up to date. Other components like Apache, Redis, or MariaDB are managed through Debian packages and supported by the Debian security team, so they should be fairly up to date in terms of security issues.

Note that this section refers to the old architecture, based on a custom middleware now called "donate-paleo".

Security and risk assessment

Technical debt and next steps

Proposed Solution

Goals

Must have

Nice to have

Non-Goals

Approvals required

Proposed Solution

Cost

Other alternatives