How the feeds module in Drupal 7 ended up causing MySQL to sever the connection
This is a short story on an interesting problem we were having with the Feeds module and Feeds directory fetcher module in Drupal 7.
Background on the use of Feeds
On this site, Feeds is used to ingest XML from a third-party source (Reuters). The feed ingests perhaps a couple of hundred articles per day. Existing imported articles can receive updates as well, but typically only on the day the article is first ingested.
Feeds had been working well for several years, and then all of a sudden the ingests started to fail. The failure occurred only on production; in the lower environments the ingestion worked as expected.
The bizarre error
On production we were seeing this error during import:
PDOStatement::execute(): MySQL server has gone away database.inc:2227 [warning]
PDOStatement::execute(): Error reading result set's header database.inc:2227 [warning]
PDOException: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away [error]
The error is not so much that the database server is not alive; rather, PHP's connection to the database has been severed for exceeding MySQL's wait_timeout value.
The reason this would occur only on production is typical of Acquia when you need to read from and write to the shared filesystem a lot. In the lower environments the filesystem is local disk (as those environments are not clustered), so access is much faster. On production, the public filesystem is a network file share, which is slower.
Going down the rabbit hole
Working out why Feeds wanted to read and/or write so many files on the filesystem was the next question, and one thing immediately stood out: the sheer size of the config column in the feeds_source table:
Having to deserialize 52 MB of ASCII in PHP is bad enough.
The next step was dumping the value of the config column for a single row:
Then open the resulting file in vim:
And sure enough, inside this config column was a reference to every single XML file ever imported, a cool ~450,000 files.
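To make the data shape concrete, here is a hypothetical, miniaturised mock-up of that config value (one entry in feed_files_fetched rather than ~450,000; the exact value stored per entry is illustrative, but the array keys match those used by the update hook later in this post):

```php
<?php

// Hypothetical miniature of the feeds_source.config blob: the directory
// fetcher keeps one array entry per file it has ever imported, keyed by
// filename, and the whole structure is re-serialized on every import.
$config = [
  'feeds_fetcher_directory_fetcher' => [
    'feed_files_fetched' => [
      '2018-07-04T083557Z_1_KBN1JU0WQ_RTROPTC_0_US-CHINA-AUTOS-GM.XML' => TRUE,
      // ...the real site had ~450,000 of these, totalling ~52 MB serialized.
    ],
  ],
];

$blob = serialize($config);
echo strlen($blob) . ' bytes for '
  . count($config['feeds_fetcher_directory_fetcher']['feed_files_fetched'])
  . " tracked file(s)\n";
```

Because the whole array is unserialized, mutated, and re-serialized on every import, the blob only ever grows.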
So this is the root cause of the problem: Drupal is attempting to stat() ~450,000 files that do not exist, and those files live on a network file share. This process took longer than MySQL's wait_timeout, so MySQL closed the connection. When Drupal finally wanted to talk to the database again, the connection was gone.
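The arithmetic behind the timeout can be sketched as follows. This is an illustration of the pattern, not the module's actual code, and the mount path is made up: each tracked filename costs one stat() against the share, and at even a few milliseconds per stat, hundreds of thousands of checks take tens of minutes.

```php
<?php

// Illustration only: one file_exists() (i.e. one stat() call) per tracked
// filename. Over a network share each call can take several milliseconds,
// so ~450,000 entries means tens of minutes before any SQL is issued again.
$feed_files_fetched = [
  '2018-07-04T083557Z_1_KBN1JU0WQ_RTROPTC_0_US-CHINA-AUTOS-GM.XML' => TRUE,
  // ...~450,000 entries on the affected site, almost all long since deleted.
];

$missing = 0;
foreach (array_keys($feed_files_fetched) as $filename) {
  // '/mnt/files/example' is a hypothetical share mount point.
  if (!file_exists('/mnt/files/example/reuters/' . $filename)) {
    $missing++;
  }
}
echo $missing . " stale entry still being checked on every import\n";
```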
Interestingly enough, the problem of the config column running out of space came up back in 2012, and "the solution" was simply to change the type of the column, so you can now store 4 GB of content in this one column. In hindsight, perhaps this was not the smartest solution.
Also in 2012, you see the comment from @valderama:
However, as feed_files_fetched saves all items which were already imported, it grows endless if you have a periodic import.
Great to see we are not the only people having this pain.
The solution
The simple solution to limp by is to increase the wait_timeout value of your database connection. This gives Drupal more time to scan for the previously imported files before importing the new ones.
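In Drupal 7 this can be done per connection from settings.php. The snippet below is a sketch, assuming MySQL and that your Drupal 7 version supports the init_commands connection option; the 1200-second value is illustrative and should comfortably exceed your longest import run:

```php
<?php

// settings.php (sketch): run SET SESSION wait_timeout on every new database
// connection so that long file scans do not outlive the connection.
// The value is illustrative; tune it to your import duration.
$databases['default']['default']['init_commands'] = [
  'wait_timeout' => "SET SESSION wait_timeout = 1200",
];
```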
As you might guess, this is not a good long-term solution for sites with a lot of imported content, or content that is continually being imported.
Instead, we opted for a fairly quick update hook that loops through all of the items under the feed_files_fetched key and unsets the older ones.
<?php

/**
 * @file
 * Install file.
 */

/**
 * Finds the position of the first of several needles within a haystack.
 *
 * @see https://www.sitepoint.com/community/t/strpos-with-multiple-characters/2004/2
 *
 * @param string $haystack
 *   The string to search within.
 * @param array $needles
 *   The substrings to search for.
 * @param int $offset
 *   The position to begin searching from.
 *
 * @return bool|int
 *   The position of the first matching needle, or FALSE if none match.
 */
function multi_strpos($haystack, array $needles, $offset = 0) {
  foreach ($needles as $needle) {
    $position = strpos($haystack, $needle, $offset);
    if ($position !== FALSE) {
      return $position;
    }
  }
  return FALSE;
}

/**
 * Trims old entries from the directory fetcher's feed_files_fetched list.
 */
function example_reuters_update_7001() {
  $feedsSource = db_select('feeds_source', 'fs')
    ->fields('fs', ['config'])
    ->condition('fs.id', 'po_feeds_reuters_photo')
    ->execute()
    ->fetchObject();
  $config = unserialize($feedsSource->config);

  // We only want to keep the last week's worth of imported articles in the
  // database for content updates.
  $cutoff_date = [];
  for ($i = 0; $i < 7; $i++) {
    $cutoff_date[] = date('Y-m-d', strtotime("-$i days"));
  }

  watchdog('example_reuters', 'feed_files_fetched count before trim: @count', [
    '@count' => count($config['feeds_fetcher_directory_fetcher']['feed_files_fetched']),
  ]);

  // We attempt to match based on the filename of the imported file. This works
  // as the files have a date in their filename,
  // e.g. '2018-07-04T083557Z_1_KBN1JU0WQ_RTROPTC_0_US-CHINA-AUTOS-GM.XML'.
  foreach ($config['feeds_fetcher_directory_fetcher']['feed_files_fetched'] as $key => $source) {
    if (multi_strpos($key, $cutoff_date) === FALSE) {
      unset($config['feeds_fetcher_directory_fetcher']['feed_files_fetched'][$key]);
    }
  }

  watchdog('example_reuters', 'feed_files_fetched count after trim: @count', [
    '@count' => count($config['feeds_fetcher_directory_fetcher']['feed_files_fetched']),
  ]);

  // Save the trimmed config back to the database.
  db_update('feeds_source')
    ->fields([
      'config' => serialize($config),
    ])
    ->condition('id', 'po_feeds_reuters_photo', '=')
    ->execute();
}
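The date-window matching can be sanity-checked outside Drupal. This standalone sketch uses the same logic as the multi_strpos() helper above, anchored to a fixed date so it is reproducible; the second filename is a made-up example of a stale import:

```php
<?php

// Same logic as the helper in the update hook.
function multi_strpos($haystack, array $needles, $offset = 0) {
  foreach ($needles as $needle) {
    $position = strpos($haystack, $needle, $offset);
    if ($position !== FALSE) {
      return $position;
    }
  }
  return FALSE;
}

// Build the same 7-day window as the hook, anchored to a fixed "today".
$cutoff_date = [];
for ($i = 0; $i < 7; $i++) {
  $cutoff_date[] = date('Y-m-d', strtotime("2018-07-04 -$i days"));
}

$keys = [
  // Real example filename from the feed: inside the window, so it is kept.
  '2018-07-04T083557Z_1_KBN1JU0WQ_RTROPTC_0_US-CHINA-AUTOS-GM.XML',
  // Made-up old filename: no cutoff date matches, so it would be unset.
  '2016-01-15T120000Z_OLD_IMPORT.XML',
];
foreach ($keys as $key) {
  echo $key . ' => '
    . (multi_strpos($key, $cutoff_date) !== FALSE ? 'keep' : 'trim') . "\n";
}
```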
Before the code ran there were more than 450,000 items in the array; afterwards, we were below 100. A massive decrease in database size.
More importantly, the importer now runs a lot quicker (as it is not scanning the shared filesystem for non-existent files).
What this means for Feeds in Drupal 8
Dinesh brought to my attention that this same issue may well impact Feeds in Drupal 8. To make things slightly more interesting, the functionality of the Drupal 7 module feeds_fetcher_directory has now moved into the main feeds module.
An issue has been opened on Drupal.org to track this. I will update this blog post once we know the outcome.
Update - 27 September 2019
The above update hook has been run on production (where Gluster is used). Feeds used to take upwards of 30 minutes to run there (even if there were no new files to process); since the update hook ran, it is now under 1 minute. We were also able to remove the wait_timeout override. So this is a nice result.