They Shall Not Parse?

Using Customer Logs As Training Data

If you are familiar with Sentry, one of the leading vendors of application performance monitoring software, then you may also be aware of the quandary they recently faced when rolling out an update to their Terms of Service. Under the terms of the update, Sentry would begin to ingest users' (anonymized) event data into their machine learning system. A small number of users objected to this on various grounds. Some were unhappy at the lack of an opt-out, while others expressed concerns around issues such as PII, governance and residency. To their credit, Sentry reconsidered and decided to postpone the rollout of the new policy until these concerns were allayed.

Some users took to social media to express their discontent and discuss jumping ship to alternative vendors. At this point, however, something of a reality check kicked in. Users who were mooting the prospect of migrating to some other highly respected providers had cold water poured over their plans by the revelation that those companies already have similar clauses enshrined in their own ToS.

Everybody's Quandary

To some extent, Sentry were victims of their own transparency. Rather than being an outlier, they were simply following the lead of other vendors in the sector - perhaps their only crime was being more up-front about the policy than some other players in the IT industry. The dilemma for Sentry was that their existing terms were less expansive than those of their competitors, yet by falling into line they risked antagonising a segment of their customer base. In fact, it turned out that the number of objectors was small - but inevitably their protests were amplified in the echo chambers of social media.

Ultimately, this is not just an issue with the ToS of one vendor; it is a quandary that might be faced by any vendor. Today, some of a company's users may wish to ring-fence their data; tomorrow those same users may look enviously at ML-powered features in competitors' products. From a functional and practical point of view it seems undeniable that applying ML to logs will bring benefits to the whole market. By mining for patterns in large data sets, analytics engines will be better placed to predict potential outages and will be able to correlate across other systems to suggest potential solutions. The same technologies will also enable more powerful and sophisticated anomaly and intrusion detection.
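To make the idea concrete, the sketch below shows one way an analytics engine might flag unusual minutes in a stream of log-derived metrics. It is a minimal illustration using scikit-learn's IsolationForest on synthetic data; the feature set (request count, error count, mean latency) and the numbers are assumptions made for the example, not any vendor's actual pipeline.

# A minimal sketch, assuming scikit-learn and NumPy are installed; the
# per-minute features and synthetic values are illustrative only.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic per-minute log summaries: [request_count, error_count, mean_latency_ms]
normal_minutes = rng.normal(loc=[200, 2, 120], scale=[20, 1, 15], size=(500, 3))
outage_minutes = rng.normal(loc=[40, 60, 900], scale=[10, 10, 100], size=(5, 3))
features = np.vstack([normal_minutes, outage_minutes])

# Train an unsupervised model that isolates rare, unusual points
model = IsolationForest(contamination=0.01, random_state=0)
flags = model.fit_predict(features)  # -1 marks points flagged as anomalous

print(f"{(flags == -1).sum()} of {len(features)} minutes flagged as anomalous")

The point of the example is simply that the value extracted from the logs is a statistical model of "normal" behaviour, which is why vendors are so keen to train on aggregated customer data.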

Legitimate Concerns

At the same time, though, customers' concerns can't simply be dismissed as over-caution or excessive paranoia. So, what are these concerns, and do they stand up to scrutiny? One of the major ones is over PII. In theory, it should not make its way into logs - but, in reality, it does. Given that any such PII would already be stored in the vendor's data centre, what are the additional risks of applying ML to that data? The first is the potential for leakage as the data is transferred between endpoints or held in the storage used by the ML system: there are more potential points of egress and a larger attack surface.

The second is the possibility of regurgitation - that is, of source data somehow being reproduced in a recognisable way by the tools trained on it. On the face of it, this concern would seem to be misplaced. The purpose of training ML on data such as application or machine logs is to improve diagnostics rather than to generate new content, and the end product of the analytical process will be statistical rather than textual. Many vendors already have the capability to identify and scrub PII. Obviously, these features do have limitations: they tend to focus on easily recognisable patterns such as IP addresses or API keys and may not be able to identify personal information buried in blocks of free text. Equally, vendors may charge extra for these features.
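For illustration, the sketch below shows roughly what such pattern-based scrubbing amounts to. The regular expressions and placeholder names are assumptions made for this example and are far simpler than any vendor's real rules - which is precisely why personal data embedded in free text tends to slip through.

# A minimal sketch of pattern-based PII scrubbing; the regexes below are
# illustrative, not any vendor's actual redaction rules.
import re

SCRUB_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "api_key": re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9]{16,}\b"),
}

def scrub(line: str) -> str:
    """Replace recognisable PII patterns with placeholders."""
    for name, pattern in SCRUB_PATTERNS.items():
        line = pattern.sub(f"[{name.upper()}_REDACTED]", line)
    return line

print(scrub("user alice@example.com logged in from 203.0.113.42 with sk_abcdef1234567890XYZ"))
# -> user [EMAIL_REDACTED] logged in from [IPV4_REDACTED] with [API_KEY_REDACTED]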

Questions of Principle

The third objection is one of principle. The data belongs to the customer and is entrusted to the vendor for a particular purpose; there is a sense of impropriety in the vendor assuming rights over that data - a kind of breach of that implicit trust. This raises the question of 'fair use' of data in ML and AI systems - and the legal position on this is now being tested in the courts with the OpenAI lawsuit. In fact, if we zoom out and look at the bigger picture, ingesting log files barely registers as a misdemeanour compared to the industrial-scale dredging of personal data being carried out by some of the tech giants. For example, this upgrade for Google Messages will read and analyse your personal message history - with your data being processed in the Google cloud rather than on your device. Meanwhile, this dispute over an AI simulation of deceased US comedian George Carlin also raises complex issues of ethics and ownership. There are, of course, myriad other cases one could cite, such as Reddit signing over its user-generated content or GitHub hoovering up code committed to its public repositories.

Such is AI's hunger for real-world data that there is speculation that LLMs will soon "run out of internet". One possible solution is the use of synthetic data. This, however, brings its own challenges, since training successive generations of models on synthetic data can lead to progressive degradation of quality and the danger of model collapse.

A Level Playing Field

It seems that we are, at the moment, in uncharted territory, just beginning the journey towards establishing frameworks and guidelines. So, what might be the standards that could create a level playing field, ensuring good practice across the sector whilst also giving reassurance to customers? The following might represent a starting point:

  1. transparency - customers must be explicitly aware of all of the purposes for which their telemetry may be used
  2. integrity - rock-solid standards around the usage and retention of customer data in out-of-band datastores
  3. security - encryption at rest should be guaranteed
  4. privacy - scrubbing of PII should be provided without cost to the customer
  5. governance - vendors must respect data residency requirements
  6. clarity - vendors should explain procedures for avoiding regurgitation or other risks of data exposure
  7. disclosure - a duty to report leakages

Conclusion

In conclusion, a balance clearly needs to be struck between concerns around governance on the part of the client and the desire for innovation on the part of the vendor. At present it seems that some of the very largest corporations in the world are able to gain an advantage by mobilising legal and other resources to stretch the limits of the 'fair use' defence. A more consensual and rigorous basis for the mass ingestion, storage and analysis of customer data is clearly needed. We have already seen a large number of stakeholders come together to collaborate on the OpenTelemetry project; perhaps the relationships formed in that effort could be built upon to form an advisory body on industry standards.
