How the feeds module in Drupal 7 ended up causing MySQL to sever the connection

This is a short story on an interesting problem we were having with the Feeds module and Feeds directory fetcher module in Drupal 7.

Background on the use of Feeds

On this site, Feeds is used to ingest XML from a third-party source (Reuters). The feed ingests perhaps a couple of hundred articles per day. Existing imported articles can also be updated, but typically only on the day the article is ingested.

Feeds had been working well for a few years, and then, all of a sudden, the ingests started to fail. The failure occurred only on production; in the lower environments the ingestion worked as expected.

The bizarre error

On production we were seeing the following errors during import:

PDOStatement::execute(): MySQL server has gone away database.inc:2227 [warning]
PDOStatement::execute(): Error reading result set's header database.inc:2227 [warning]
PDOException: SQLSTATE[HY000]: General error: 2006 MySQL server has [error]

The error does not mean the database server itself is down; rather, PHP's connection to the database has been severed because it sat idle for longer than MySQL's wait_timeout value.
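
If you want to confirm what timeout the connection is actually bound by, a quick diagnostic (assuming Drupal 7's database API, run via drush php-script or similar) is to ask MySQL for the session value; the value shown in the comment is illustrative only:

<?php

// Diagnostic only: report the wait_timeout the current Drupal connection
// is subject to.
$row = db_query("SHOW SESSION VARIABLES LIKE 'wait_timeout'")->fetchAssoc();
// e.g. ['Variable_name' => 'wait_timeout', 'Value' => '600'].
print_r($row);

Checking the session wait_timeout that Drupal's connection is running with.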

The reason this occurs only on production is that, on Acquia, production typically needs to read from and write to a shared filesystem a lot. In the lower environments the filesystem is local disk (as those environments are not clustered), so access is a lot faster. On production, the public filesystem is a network file share, which is slower.
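
To get a feel for the difference between environments, you can time a batch of stat calls against the shared filesystem. This is only a rough sketch; the directory comes from the feed configuration shown later, and the loop deliberately checks files that do not exist to mimic stale entries:

<?php

// Rough benchmark: time 1,000 stat calls against the (network-mounted)
// private filesystem. The directory is illustrative; point it at wherever
// your feed files live.
$dir = 'private://reuters/pass1/topnews';
$start = microtime(TRUE);
for ($i = 0; $i < 1000; $i++) {
  // Checking files that do not exist mimics what happens once the
  // original XML files have been cleaned up from disk.
  file_exists($dir . '/no-such-file-' . $i . '.XML');
}
printf("1000 stat calls took %.2f seconds\n", microtime(TRUE) - $start);

Comparing stat performance on local disk versus the network file share.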

Going down the rabbit hole

The next question was why Feeds wanted to read and/or write so many files on the filesystem, and one thing stood out immediately: the sheer size of the config column in the feeds_source table:

mysql> SELECT id,SUM(char_length(config))/1048576 AS size FROM feeds_source GROUP BY id;
+-------------------------------------+---------+
| id                                  | size    |
+-------------------------------------+---------+
| apworldcup_article                  |  0.0001 |
| blogs_photo_import                  |  0.0003 |
| csv_infographics                    |  0.0002 |
| photo_feed                          |  0.0002 |
| po_feeds_prestige_article           |  1.5412 |
| po_feeds_prestige_gallery           |  1.5410 |
| po_feeds_prestige_photo             |  0.2279 |
| po_feeds_reuters_article            | 21.5086 |
| po_feeds_reuters_composite          | 41.9530 |
| po_feeds_reuters_photo              | 52.6076 |
| example_line_feed_article           |  0.0002 |
| example_line_feed_associate_article |  0.0001 |
| example_line_feed_blogs             |  0.0003 |
| example_line_feed_gallery           |  0.0002 |
| example_line_feed_photo             |  0.0001 |
| example_line_feed_video             |  0.0002 |
| example_line_youtube_feed           |  0.0003 |
+-------------------------------------+---------+
What 52 MB of ASCII looks like in a single cell.

Having to deserialize 52 MB of ASCII in PHP is bad enough.
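
To put a number on that, a throwaway drush php-script along these lines (a diagnostic sketch, not part of the fix) measures how long unserialize() takes on the blob and how much memory the resulting array holds:

<?php

// Diagnostic sketch: measure the cost of unserializing one config blob.
$raw = db_query('SELECT config FROM {feeds_source} WHERE id = :id', [
  ':id' => 'po_feeds_reuters_photo',
])->fetchField();

$before = memory_get_usage();
$start = microtime(TRUE);
$config = unserialize($raw);
printf("unserialize() took %.2f s and used roughly %.1f MB of memory\n",
  microtime(TRUE) - $start,
  (memory_get_usage() - $before) / 1048576);

Measuring what unserializing a single feeds_source row costs.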

The next step was dumping the value of the config column for a single row:

drush --uri=www.example.com sqlq 'SELECT config FROM feeds_source WHERE id = "po_feeds_reuters_photo"' > /tmp/po_feeds_reuters_photo.txt
Dumping the 55 MB of ASCII into a file for analysis

Then open the resulting file in vim:

"/tmp/po_feeds_reuters_photo.txt" 1L, 55163105C
Vim struggles to open any file that has 55 million characters on a single line

And sure enough, inside this config column was a reference to every single XML file ever imported, a cool ~450,000 files.

a:2:{s:31:"feeds_fetcher_directory_fetcher";a:3:{s:6:"source";s:23:"private://reuters/pass1";s:5:"reset";i:0;
s:18:"feed_files_fetched";a:457065:{
s:94:"private://reuters/pass1/topnews/2018-07-04T083557Z_1_KBN1JU0WQ_RTROPTC_0_US-CHINA-AUTOS-GM.XML";i:1530693632;
s:94:"private://reuters/pass1/topnews/2018-07-04T083557Z_1_KBN1JU0WR_RTROPTT_0_US-CHINA-AUTOS-GM.XML";i:1530693632;
s:96:"private://reuters/pass1/topnews/2018-07-04T083557Z_1_LYNXMPEE630KJ_RTROPTP_0_USA-TRADE-CHINA.XML";i:1530693632;
s:97:"private://reuters/pass1/topnews/2018-07-04T083617Z_147681_KBE99T04E_RTROPTT-LNK_0_OUSBSM-LINK.XML";i:1530693632;
s:102:"private://reuters/pass1/topnews/2018-07-04T083658Z_1_KBN1JU0X2_RTROPTT_0_JAPAN-RETAIL-ZOZOTOWN-INT.XML";i:1530693632
457,065 is the array size in feed_files_fetched

So this is the root cause of the problem: Drupal was attempting to stat() ~450,000 files that no longer exist, on a filesystem mounted as a network file share. This process took longer than MySQL's wait_timeout, so MySQL closed the idle connection, and when Drupal finally wanted to talk to the database again, the connection was gone.
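
The directory fetcher keeps feed_files_fetched so it can skip files it has already imported. Conceptually, the expensive part is a loop shaped roughly like the sketch below. This is a simplification for illustration, not the module's actual code:

<?php

// Simplified illustration of the expensive pattern: every previously
// fetched URI triggers a stat call against the shared filesystem, even
// when the file was deleted from disk long ago.
foreach ($config['feeds_fetcher_directory_fetcher']['feed_files_fetched'] as $uri => $timestamp) {
  if (!file_exists($uri)) {
    // The file is gone, but the entry lingers in the config blob, so the
    // same pointless stat call is repeated on every import.
    continue;
  }
  // Otherwise, decide whether the file needs to be (re)imported.
}

With ~450,000 entries, that is ~450,000 stat calls per import against a network share.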

Interestingly enough, the problem of the config column running out of space came up back in 2012, and "the solution" was simply to change the type of the column, so now you can store 4 GB of content in this one column. In hindsight, perhaps this was not the smartest solution.

Also from 2012, there is this comment from @valderama:

However, as feed_files_fetched saves all items which were already imported, it grows endless if you have a periodic import.

Great to see we are not the only people having this pain.

The solution

The simple way to limp by is to increase the wait_timeout value of your database connection. This gives Drupal more time to scan for the previously imported files before importing the new ones.

$databases['default']['default']['init_commands'] = [
  'wait_timeout' => "SET SESSION wait_timeout=2500",
];
Increasing MySQL's wait_timeout in Drupal's settings.php.

As you might guess, this is not a good long term solution for sites with a lot of imported content, or content that is continually being imported.

Instead, we opted for a fairly quick update hook that loops through all of the items under the feed_files_fetched key and unsets the older ones.

<?php

/**
 * @file
 * Install file.
 */

/**
 * Returns the position of the first needle found in the haystack.
 *
 * @see https://www.sitepoint.com/community/t/strpos-with-multiple-characters/2004/2
 *
 * @param string $haystack
 *   The string to search in.
 * @param array $needles
 *   The strings to search for.
 * @param int $offset
 *   The position to start searching from.
 *
 * @return bool|int
 *   The position of the first matching needle, or FALSE if none match.
 */
function multi_strpos($haystack, array $needles, $offset = 0) {
  foreach ($needles as $needle) {
    $position = strpos($haystack, $needle, $offset);
    if ($position !== FALSE) {
      return $position;
    }
  }
  return FALSE;
}

/**
 * Implements hook_update_N().
 */
function example_reuters_update_7001() {
  $feedsSource = db_select("feeds_source", "fs")
    ->fields('fs', ['config'])
    ->condition('fs.id', 'po_feeds_reuters_photo')
    ->execute()
    ->fetchObject();

  $config = unserialize($feedsSource->config);

  // We only want to keep the last week's worth of imported articles in the
  // database for content updates.
  $cutoff_date = [];
  for ($i = 0; $i < 7; $i++) {
    $cutoff_date[] = date('Y-m-d', strtotime("-$i days"));
  }

  watchdog('example_reuters', 'FeedSource records before trim: @count', [
    '@count' => count($config['feeds_fetcher_directory_fetcher']['feed_files_fetched']),
  ]);

  // We attempt to match based on the filename of the imported file. This works
  // as the files have a date in their filename.
  // e.g. '2018-07-04T083557Z_1_KBN1JU0WQ_RTROPTC_0_US-CHINA-AUTOS-GM.XML'
  foreach ($config['feeds_fetcher_directory_fetcher']['feed_files_fetched'] as $key => $source) {
    if (multi_strpos($key, $cutoff_date) === FALSE) {
      unset($config['feeds_fetcher_directory_fetcher']['feed_files_fetched'][$key]);
    }
  }

  watchdog('example_reuters', 'FeedSource records after trim: @count', [
    '@count' => count($config['feeds_fetcher_directory_fetcher']['feed_files_fetched']),
  ]);

  // Save back to the database.
  db_update('feeds_source')
    ->fields([
      'config' => serialize($config),
    ])
    ->condition('id', 'po_feeds_reuters_photo', '=')
    ->execute();
}
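
The update hook that trims feed_files_fetched down to the last week of imports.

If you adapt this approach, remember that it runs through Drupal's normal update path (update.php or drush updatedb) and rewrites the fetcher's record of which files have already been imported, so back up the feeds_source table and test against a copy of production data first.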

Before the code ran there were more than 450,000 items in the array; afterwards we were below 100. A massive decrease in database size.

More importantly, the importer now runs a lot quicker (as it is not scanning the shared filesystem for non-existent files).

What this means for Feeds in Drupal 8

Dinesh brought to my attention that this same issue is likely to affect Feeds in Drupal 8. To make things slightly more interesting, the functionality of the Drupal 7 feeds_fetcher_directory module has now been moved into the main feeds module.

An issue has been opened on Drupal.org to track this. I will update this blog post once we know the outcome.

Update - 27 September 2019

The above update hook has been run on production (where Gluster is used). Feeds used to take upwards of 30 minutes to run there (even if there were no new files to process); since the update hook ran, it is now under 1 minute. We were also able to remove the wait_timeout override. So this is a nice result.